---
layout: page
title: "Tutorial"
description: "Tutorial is valid for Spark 1.3 and higher"
group: tutorial
---
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
### Zeppelin Tutorial

We will assume you have Zeppelin installed already. If that's not the case, see [Install](../install/install.html).

Zeppelin's current main backend processing engine is [Apache Spark](https://spark.apache.org). If you're new to Spark, you might want to start by getting an idea of how it processes data, so that you get the most out of Zeppelin.

<br />
### Tutorial with Local File

#### Data Refine

Before you start the Zeppelin tutorial, you will need to download [bank.zip](http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip).

First, to transform the data from CSV format into an RDD of `Bank` objects, run the following script. It also removes the header row using the `filter` function.

```scala

val bankText = sc.textFile("yourPath/bank/bank-full.csv")

case class Bank(age:Integer, job:String, marital : String, education : String, balance : Integer)

// split each line, filter out header (starts with "age"), and map it into Bank case class  
val bank = bankText.map(s=>s.split(";")).filter(s=>s(0)!="\"age\"").map(
    s=>Bank(s(0).toInt, 
            s(1).replaceAll("\"", ""),
            s(2).replaceAll("\"", ""),
            s(3).replaceAll("\"", ""),
            s(5).replaceAll("\"", "").toInt
        )
)

// convert to DataFrame and register it as a temporary table
bank.toDF().registerTempTable("bank")
```
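
To see what the `split`, `filter`, and `map` steps do without starting a Spark job, the same transformation can be tried on a plain Scala collection. The snippet below is only a sketch: the sample lines are made up in the semicolon-separated layout the script expects, and `BankRow` and `unquote` are hypothetical names chosen so they do not clash with the tutorial's `Bank` class.

```scala
// Mirrors the tutorial's Bank case class; renamed to avoid a clash (hypothetical name).
case class BankRow(age: Int, job: String, marital: String, education: String, balance: Int)

// Made-up sample lines: a header plus two records, first six columns only.
val sampleLines = Seq(
  "\"age\";\"job\";\"marital\";\"education\";\"default\";\"balance\"",
  "58;\"management\";\"married\";\"tertiary\";\"no\";2143",
  "33;\"entrepreneur\";\"single\";\"secondary\";\"no\";2"
)

// Hypothetical helper: strips the double quotes around text fields.
def unquote(s: String): String = s.replaceAll("\"", "")

val parsed = sampleLines
  .map(_.split(";"))                          // split each line into fields
  .filter(fields => fields(0) != "\"age\"")   // drop the header line
  .map(f => BankRow(f(0).toInt, unquote(f(1)), unquote(f(2)), unquote(f(3)), unquote(f(5)).toInt))

parsed.foreach(println)
```

Note that the balance is read from index 5, because index 4 holds the `default` column in this layout.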

<br />
#### Data Retrieval

Suppose we want to see the age distribution in `bank`. To do this, run:

```sql
%sql select age, count(1) from bank where age < 30 group by age order by age
```

You can add an input box for setting the age condition by replacing `30` with `${maxAge=30}`.

```sql
%sql select age, count(1) from bank where age < ${maxAge=30} group by age order by age
```
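
For readers more at home with Scala collections than SQL, the following sketch shows roughly what the query computes. It assumes a handful of made-up `(age, marital)` pairs in place of the `bank` table, and `ageDistribution` is a hypothetical helper, not something Zeppelin or Spark provides.

```scala
// Made-up (age, marital) pairs standing in for rows of the `bank` table.
val rows = Seq((25, "single"), (25, "married"), (28, "single"), (31, "married"), (45, "divorced"))

// Same shape as:
//   select age, count(1) from bank where age < maxAge group by age order by age
def ageDistribution(rows: Seq[(Int, String)], maxAge: Int): Seq[(Int, Int)] =
  rows
    .filter { case (age, _) => age < maxAge }         // where age < maxAge
    .groupBy { case (age, _) => age }                 // group by age
    .map { case (age, group) => (age, group.size) }   // count(1)
    .toSeq
    .sortBy { case (age, _) => age }                  // order by age

println(ageDistribution(rows, 30))
```

Changing the second argument plays the same role as typing a new value into the `maxAge` input box.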

Now suppose we want to see the age distribution for a given marital status, with a combo box for selecting the marital status. Run:

```sql
%sql select age, count(1) from bank where marital="${marital=single,single|divorced|married}" group by age order by age
```
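
The two placeholders used so far follow the pattern `name=default`, optionally followed by a comma and a `|`-separated list of options. The sketch below only illustrates that anatomy. It is not Zeppelin's own parser, and `FormField` and `parsePlaceholder` are hypothetical names.

```scala
// Hypothetical types and helper, for illustration only (not part of Zeppelin).
case class FormField(name: String, default: String, options: Seq[String])

// Splits the text between "${" and "}" into name, default value, and options.
def parsePlaceholder(body: String): FormField = {
  val Array(name, rest) = body.split("=", 2)
  rest.split(",", 2) match {
    case Array(default, opts) => FormField(name, default, opts.split("\\|").toSeq)
    case Array(default)       => FormField(name, default, Seq.empty)
  }
}

println(parsePlaceholder("maxAge=30"))
println(parsePlaceholder("marital=single,single|divorced|married"))
```

A placeholder without options corresponds to the input box, and one with options corresponds to the combo box.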

<br />
### Tutorial with Streaming Data 

#### Data Refine

Since this tutorial is based on Twitter's sample tweet stream, you must configure authentication with a Twitter account. To do this, take a look at [Twitter Credential Setup](https://databricks-training.s3.amazonaws.com/realtime-processing-with-spark-streaming.html#twitter-credential-setup). After you get the API keys, fill in the credential values (`apiKey`, `apiSecret`, `accessToken`, `accessTokenSecret`) in the following script.

This will create an RDD of `Tweet` objects and register the stream data as a table:

```scala
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import org.apache.spark.storage.StorageLevel
import scala.io.Source
import scala.collection.mutable.HashMap
import java.io.File
import org.apache.log4j.Logger
import org.apache.log4j.Level
import sys.process.stringSeqToProcess

/** Configures the Oauth Credentials for accessing Twitter */
def configureTwitterCredentials(apiKey: String, apiSecret: String, accessToken: String, accessTokenSecret: String) {
  val configs = new HashMap[String, String] ++= Seq(
    "apiKey" -> apiKey, "apiSecret" -> apiSecret, "accessToken" -> accessToken, "accessTokenSecret" -> accessTokenSecret)
  println("Configuring Twitter OAuth")
  configs.foreach{ case(key, value) =>
    if (value.trim.isEmpty) {
      throw new Exception("Error setting authentication - value for " + key + " not set")
    }
    val fullKey = "twitter4j.oauth." + key.replace("api", "consumer")
    System.setProperty(fullKey, value.trim)
    println("\tProperty " + fullKey + " set as [" + value.trim + "]")
  }
  println()
}

// Configure Twitter credentials
val apiKey = "xxxxxxxxxxxxxxxxxxxxxxxxx"
val apiSecret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
val accessToken = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
val accessTokenSecret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
configureTwitterCredentials(apiKey, apiSecret, accessToken, accessTokenSecret)

// Create a streaming context with a 2-second batch interval
val ssc = new StreamingContext(sc, Seconds(2))
val tweets = TwitterUtils.createStream(ssc, None)
// Keep a sliding window over the last 60 seconds of tweets
val twt = tweets.window(Seconds(60))

case class Tweet(createdAt:Long, text:String)
twt.map(status=>
  Tweet(status.getCreatedAt().getTime()/1000, status.getText())
).foreachRDD(rdd=>
  // The line below works in Spark 1.3 and higher.
  // For Spark 1.1.x and Spark 1.2.x,
  // use rdd.registerTempTable("tweets") instead.
  rdd.toDF().registerTempTable("tweets")
)

twt.print

ssc.start()
```
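
The `configureTwitterCredentials` function stores each credential as a JVM system property whose name is derived from the parameter name. As a small sketch of that renaming step alone (no Twitter account or Spark needed), the following lists the property names the function ends up setting. `propertyNames` is a name introduced here for illustration.

```scala
// The four credential names used in the script above.
val credentialKeys = Seq("apiKey", "apiSecret", "accessToken", "accessTokenSecret")

// Same renaming rule as in configureTwitterCredentials: "api" becomes "consumer".
val propertyNames = credentialKeys.map(k => "twitter4j.oauth." + k.replace("api", "consumer"))

propertyNames.foreach(println)
// twitter4j.oauth.consumerKey
// twitter4j.oauth.consumerSecret
// twitter4j.oauth.accessToken
// twitter4j.oauth.accessTokenSecret
```

If authentication fails, checking these four properties with `System.getProperty` is one way to confirm that the values were set as intended.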

<br />
#### Data Retrieval

For each of the following scripts, you will see a different result every time you click the run button, since the queries run on real-time data.

Let's begin by extracting at most 10 tweets that contain the word "girl".

```sql
%sql select * from tweets where text like '%girl%' limit 10
```

This time, suppose we want to see how many tweets were created per second during the last 60 seconds. To do this, run:

```sql
%sql select createdAt, count(1) from tweets group by createdAt order by createdAt
```
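
Grouping by `createdAt` yields per-second counts because the streaming script stores the timestamp as `getTime()/1000`, which truncates milliseconds to whole seconds. The sketch below reproduces that effect on a few made-up millisecond timestamps. `SampleTweet` is a hypothetical stand-in for the tutorial's `Tweet` class.

```scala
// Hypothetical stand-in for the tutorial's Tweet case class.
case class SampleTweet(createdAt: Long, text: String)

// Made-up creation times in milliseconds.
val createdMillis = Seq(1000L, 1400L, 1999L, 2000L, 3500L)

// Integer division by 1000 truncates to whole seconds, as in the streaming script.
val sampleTweets = createdMillis.map(ms => SampleTweet(ms / 1000, "tweet at " + ms + " ms"))

// Same shape as: select createdAt, count(1) from tweets group by createdAt order by createdAt
val perSecond = sampleTweets
  .groupBy(_.createdAt)
  .map { case (second, ts) => (second, ts.size) }
  .toSeq
  .sortBy { case (second, _) => second }

perSecond.foreach(println)
```

The three timestamps between 1000 ms and 1999 ms all fall into second 1, so they are counted together.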


You can define a user-defined function and use it in Spark SQL. Let's try it by making a function named `sentiment`. This function returns one of three attitudes (positive, negative, neutral) for the text passed as its parameter.

```scala
def sentiment(s:String) : String = {
    val positive = Array("like", "love", "good", "great", "happy", "cool", "the", "one", "that")
    val negative = Array("hate", "bad", "stupid", "is")
    
    var st = 0;

    val words = s.split(" ")    
    positive.foreach(p =>
        words.foreach(w =>
            if(p==w) st = st+1
        )
    )
    
    negative.foreach(p=>
        words.foreach(w=>
            if(p==w) st = st-1
        )
    )
    if(st>0)
        "positive"
    else if(st<0)
        "negative"
    else
        "neutral"
}

// The line below works in Spark 1.3 and higher.
// For Spark 1.1.x and Spark 1.2.x,
// use sqlc.registerFunction("sentiment", sentiment _) instead.
sqlc.udf.register("sentiment", sentiment _)

```
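
The scoring rule can be tried outside Spark SQL on a few strings. The sketch below is a more compact formulation that should score text the same way (each word found in the positive list adds one, each word found in the negative list subtracts one) and returns the labels positive, negative, or neutral. `sentimentCompact`, `positiveWords`, and `negativeWords` are names introduced here so the sketch does not replace the `sentiment` function above.

```scala
// Same word lists as in the sentiment function, held as sets for fast lookup.
val positiveWords = Set("like", "love", "good", "great", "happy", "cool", "the", "one", "that")
val negativeWords = Set("hate", "bad", "stupid", "is")

// Hypothetical compact variant: score = positive matches minus negative matches.
def sentimentCompact(s: String): String = {
  val words = s.split(" ")
  val score = words.count(positiveWords.contains) - words.count(negativeWords.contains)
  if (score > 0) "positive"
  else if (score < 0) "negative"
  else "neutral"
}

println(sentimentCompact("I love this"))   // positive
println(sentimentCompact("this is bad"))   // negative
println(sentimentCompact("hello world"))   // neutral
```

Because common words such as "the" and "is" are in the lists, everyday sentences can shift the score. Matching is also case-sensitive and splits on single spaces only.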

To check how people feel about girls using the `sentiment` function defined above, run this:

```sql
%sql select sentiment(text), count(1) from tweets where text like '%girl%' group by sentiment(text)
```