zeppelin/docs/quickstart/tutorial.md at e36d2fbfbb517bd8442f49bffe4f506df9101c0b

mirror of https://github.com/apache/zeppelin synced 2026-05-24 09:38:26 +00:00

Mina Lee 70ab1a376d [ZEPPELIN-952] Refine website style

### What is this PR for?
- update document style (font, line-spacing)
- apply same formats for documents
- fix broke document styles

### What type of PR is it?
Documentation

### What is the Jira issue?
[ZEPPELIN-952](https://issues.apache.org/jira/browse/ZEPPELIN-952)

### Screenshots (if appropriate)
**Before**
<img width="1184" alt="screen shot 2016-06-04 at 9 51 38 pm" src="https://cloud.githubusercontent.com/assets/8503346/15803667/d0dd5ac2-2a9f-11e6-9ed0-ddc369a97612.png">

**After**
<img width="1184" alt="screen shot 2016-06-04 at 9 15 08 pm" src="https://cloud.githubusercontent.com/assets/8503346/15803666/cd9212ea-2a9f-11e6-986e-17992a495ab6.png">

**Before**
<img width="1183" alt="screen shot 2016-06-04 at 10 08 53 pm" src="https://cloud.githubusercontent.com/assets/8503346/15803695/03e73126-2aa1-11e6-8675-3ca437aeb833.png">

**After**
<img width="1184" alt="screen shot 2016-06-04 at 10 08 18 pm" src="https://cloud.githubusercontent.com/assets/8503346/15803696/078ce866-2aa1-11e6-9044-4f5e16649eb4.png">

**Before**
<img width="1184" alt="screen shot 2016-06-04 at 10 10 47 pm" src="https://cloud.githubusercontent.com/assets/8503346/15803704/5787e9ba-2aa1-11e6-804c-076a8f3aa852.png">

**After**
<img width="1184" alt="screen shot 2016-06-04 at 10 11 22 pm" src="https://cloud.githubusercontent.com/assets/8503346/15803707/5afb5d0c-2aa1-11e6-98c7-7440db35bd2f.png">

**Before**
<img width="188" alt="screen shot 2016-06-04 at 10 12 36 pm" src="https://cloud.githubusercontent.com/assets/8503346/15803719/92e5cc3e-2aa1-11e6-9a9f-e12150e78733.png">

**After**
<img width="199" alt="screen shot 2016-06-04 at 10 12 55 pm" src="https://cloud.githubusercontent.com/assets/8503346/15803721/958e8c00-2aa1-11e6-8768-8350db6e7173.png">

### Questions:
* Does the licenses files need update? No
* Is there breaking changes for older versions? No
* Does this needs documentation? No

Author: Mina Lee <minalee@nflabs.com>

Closes #962 from minahlee/ZEPPELIN-952 and squashes the following commits:

f9bee91 [Mina Lee] Capitalize hawq
72481bd [Mina Lee] Update doc titles
495a074 [Mina Lee] remove old style.css
27ca869 [Mina Lee] use code block for file location in spark.md
eb821f1 [Mina Lee] Change file location and rename file
72f8ec3 [Mina Lee] change storage doc layout and fix pre block
4202208 [Mina Lee] Apply same format for rest api docs
5875066 [Mina Lee] split display page into text and html
8bc5a6e [Mina Lee] prettify document
0cb953e [Mina Lee] remove incubating tag

2016-06-08 11:44:26 -07:00

6.8 KiB

Raw Blame History

layout	title	description	group
page	Tutorial	Tutorial is valid for Spark 1.3 and higher	quickstart

Zeppelin Tutorial

This tutorial walks you through some of the fundamental Zeppelin concepts. We will assume you have already installed Zeppelin. If not, please see here first.

Current main backend processing engine of Zeppelin is Apache Spark. If you're new to this system, you might want to start by getting an idea of how it processes data to get the most out of Zeppelin.

## Tutorial with Local File

1. Data Refine

Before you start Zeppelin tutorial, you will need to download bank.zip.

First, to transform csv format data into RDD of Bank objects, run following script. This will also remove header using filter function.


val bankText = sc.textFile("yourPath/bank/bank-full.csv")

case class Bank(age:Integer, job:String, marital : String, education : String, balance : Integer)

// split each line, filter out header (starts with "age"), and map it into Bank case class
val bank = bankText.map(s=>s.split(";")).filter(s=>s(0)!="\"age\"").map(
    s=>Bank(s(0).toInt, 
            s(1).replaceAll("\"", ""),
            s(2).replaceAll("\"", ""),
            s(3).replaceAll("\"", ""),
            s(5).replaceAll("\"", "").toInt
        )
)

// convert to DataFrame and create temporal table
bank.toDF().registerTempTable("bank")

2. Data Retrieval

Suppose we want to see age distribution from bank. To do this, run:

%sql select age, count(1) from bank where age < 30 group by age order by age

You can make input box for setting age condition by replacing 30 with ${maxAge=30}.

%sql select age, count(1) from bank where age < ${maxAge=30} group by age order by age

Now we want to see age distribution with certain marital status and add combo box to select marital status. Run:

%sql select age, count(1) from bank where marital="${marital=single,single|divorced|married}" group by age order by age

## Tutorial with Streaming Data

1. Data Refine

Since this tutorial is based on Twitter's sample tweet stream, you must configure authentication with a Twitter account. To do this, take a look at Twitter Credential Setup. After you get API keys, you should fill out credential related values(apiKey, apiSecret, accessToken, accessTokenSecret) with your API keys on following script.

This will create a RDD of Tweet objects and register these stream data as a table:

import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import org.apache.spark.storage.StorageLevel
import scala.io.Source
import scala.collection.mutable.HashMap
import java.io.File
import org.apache.log4j.Logger
import org.apache.log4j.Level
import sys.process.stringSeqToProcess

/** Configures the Oauth Credentials for accessing Twitter */
def configureTwitterCredentials(apiKey: String, apiSecret: String, accessToken: String, accessTokenSecret: String) {
  val configs = new HashMap[String, String] ++= Seq(
    "apiKey" -> apiKey, "apiSecret" -> apiSecret, "accessToken" -> accessToken, "accessTokenSecret" -> accessTokenSecret)
  println("Configuring Twitter OAuth")
  configs.foreach{ case(key, value) =>
    if (value.trim.isEmpty) {
      throw new Exception("Error setting authentication - value for " + key + " not set")
    }
    val fullKey = "twitter4j.oauth." + key.replace("api", "consumer")
    System.setProperty(fullKey, value.trim)
    println("\tProperty " + fullKey + " set as [" + value.trim + "]")
  }
  println()
}

// Configure Twitter credentials
val apiKey = "xxxxxxxxxxxxxxxxxxxxxxxxx"
val apiSecret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
val accessToken = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
val accessTokenSecret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
configureTwitterCredentials(apiKey, apiSecret, accessToken, accessTokenSecret)

import org.apache.spark.streaming.twitter._
val ssc = new StreamingContext(sc, Seconds(2))
val tweets = TwitterUtils.createStream(ssc, None)
val twt = tweets.window(Seconds(60))

case class Tweet(createdAt:Long, text:String)
twt.map(status=>
  Tweet(status.getCreatedAt().getTime()/1000, status.getText())
).foreachRDD(rdd=>
  // Below line works only in spark 1.3.0.
  // For spark 1.1.x and spark 1.2.x,
  // use rdd.registerTempTable("tweets") instead.
  rdd.toDF().registerAsTable("tweets")
)

twt.print

ssc.start()

2. Data Retrieval

For each following script, every time you click run button you will see different result since it is based on real-time data.

Let's begin by extracting maximum 10 tweets which contain the word girl.

%sql select * from tweets where text like '%girl%' limit 10

This time suppose we want to see how many tweets have been created per sec during last 60 sec. To do this, run:

%sql select createdAt, count(1) from tweets group by createdAt order by createdAt

You can make user-defined function and use it in Spark SQL. Let's try it by making function named sentiment. This function will return one of the three attitudes( positive, negative, neutral ) towards the parameter.

def sentiment(s:String) : String = {
    val positive = Array("like", "love", "good", "great", "happy", "cool", "the", "one", "that")
    val negative = Array("hate", "bad", "stupid", "is")
    
    var st = 0;

    val words = s.split(" ")    
    positive.foreach(p =>
        words.foreach(w =>
            if(p==w) st = st+1
        )
    )
    
    negative.foreach(p=>
        words.foreach(w=>
            if(p==w) st = st-1
        )
    )
    if(st>0)
        "positivie"
    else if(st<0)
        "negative"
    else
        "neutral"
}

// Below line works only in spark 1.3.0.
// For spark 1.1.x and spark 1.2.x,
// use sqlc.registerFunction("sentiment", sentiment _) instead.
sqlc.udf.register("sentiment", sentiment _)

To check how people think about girls using sentiment function we've made above, run this:

%sql select sentiment(text), count(1) from tweets where text like '%girl%' group by sentiment(text)

6.8 KiB Raw Blame History

Zeppelin Tutorial

1. Data Refine

2. Data Retrieval

1. Data Refine

2. Data Retrieval

6.8 KiB

Raw Blame History