zeppelin/docs/quickstart/tutorial.md
1ambda 4b6d3e5574 [ZEPPELIN-2596] Improving documentation page
### What is this PR for?

Improving documentation page. Please check *TODO* and *Screenshots* sections for detail.
The motivation is described in [the JIRA ticket](https://issues.apache.org/jira/browse/ZEPPELIN-2583) and discussion is ongoing on the mailing list.

### What type of PR is it?
[Improvement | Documentation]

### Todos
* [x] - improved the navbar style
* [x] - improved the main page
* [x] - re-organized content structure
* [x] - added tutorial pages: `spark_with_zeppelin.md`, `python_with_zeppelin.md`, `sql_with_zeppelin.md` for overview
* [x] - added `multi_user_support.md` page to provide overview
* [x] - added the empty `interpreter_binding_mode` page. This will be handed in the different issue: [ZEPPELIN-2582](https://issues.apache.org/jira/browse/ZEPPELIN-2582)
* [x] - added the empty `trouble_shooting` page. This can be filled in the following PRs.
* [x] - added the empty `useful_developer_tools` page. This can be filled in the following PRs.

### What is the Jira issue?

[ZEPPELIN-2596](https://issues.apache.org/jira/browse/ZEPPELIN-2596)

### How should this be tested?

1. checkout
2. `cd docs`
3. `bundle install` (make sure that you have ruby 2.1.0+ and bundle)
4. `bundle exec jekyll serve --watch`
5. open `localhost:4000`

### Screenshots (if appropriate)

#### better navbar: before
![2596_before_nav](https://cloud.githubusercontent.com/assets/4968473/26542353/89004e7a-4494-11e7-89c0-28d608f5f375.gif)

#### better navbar: after

![2596_after_nav](https://cloud.githubusercontent.com/assets/4968473/26542356/8bfb7b90-4494-11e7-9979-0bcaef8ba97b.gif)

#### improved main page: before

![2596_before_main](https://cloud.githubusercontent.com/assets/4968473/26542358/8f35b0be-4494-11e7-8a6c-e74ec52fc384.gif)

#### improved main page: after

![2596_after_main](https://cloud.githubusercontent.com/assets/4968473/26542366/93b333c8-4494-11e7-981f-3f7b4545868f.gif)

#### organized content structure: before

![2596_before_content](https://cloud.githubusercontent.com/assets/4968473/26542398/ad81ac26-4494-11e7-9a17-70dff41396fb.gif)

#### organized content structure: after

![2596_after_content](https://cloud.githubusercontent.com/assets/4968473/26542403/b0a42ad2-4494-11e7-8bd3-8a5bd194c6af.gif)

### Questions:
* Does the licenses files need update? - NO
* Is there breaking changes for older versions? - NO
* Does this needs documentation? -  related with docs

Author: 1ambda <1amb4a@gmail.com>

Closes #2371 from 1ambda/updating-version-doc and squashes the following commits:

eb02fa967 [1ambda] fix: navbar focus color applies after folding
026379ed6 [1ambda] fix: Remove docs/.listen_test
a7dd4737b [1ambda] fix: sora's comment 1.2
18c5058f7 [1ambda] fix: resolve description in python_with_zeppelin.md
d3ad67c73 [1ambda] fix: sora's comment 4
d133dbbcc [1ambda] fix: resolve sora's comment 3
513c6ff2c [1ambda] fix: resolve sora's comment 1.1
4c2946928 [1ambda] fix: resovle sora's comment 2
1c3946ac6 [1ambda] fix: sora's comment 1
4d6e4267f [1ambda] fix: Resolve sola's comment 3
d0524cafe [1ambda] fix: Set less shadow for nav
5f1f998ba [1ambda] docs: Add useful_develop_tools.md
9dfd62c74 [1ambda] fix: Typo in installation.md
30f7d7e06 [1ambda] fix: Typo in helium ctrl
d6877e792 [1ambda] docs: Add python_with_zeppelin.md
7027e96c0 [1ambda] docs: Improve python conda, docker doc style
e55b50a9d [1ambda] fix: Invalid URLs
75ddeeaff [1ambda] docs: replace URIs in interpreter
5b43993a4 [1ambda] docs: Add sql_with_zeppelin
053794e84 [1ambda] docs: Add spark_with_zeppelin.md
d4d88b9c7 [1ambda] docs: Improve proxy doc
b46cdd126 [1ambda] docs: Add empty interpreter_binding_mode.md
06fcb239e [1ambda] docs: Add empty personalized_mode.md
4991cf0a7 [1ambda] docs: Update upgrading.md
53142b7a0 [1ambda] fix: Simplify install.md
8a5c1e721 [1ambda] docs: Add multi_user_support.md
34095775e [1ambda] fix: Increase font size to 15px
a03b04b33 [1ambda] fix: Remove sample text from trouble_shooting.md
199842590 [1ambda] fix: Remove docker doc link
66a2a7d26 [1ambda] docs: Improve impersonation page
0a6e3fc1d [1ambda] docs: Improve install doc
ccd999ed5 [1ambda] docs: Improve helium doc
f8d742d08 [1ambda] fix: an invalid link in navbar
b7aa5f884 [1ambda] fix: URLs in development
61a175d94 [1ambda] docs: Update install.md
4c56de5c4 [1ambda] fix: URLs in setup
0b1d63513 [1ambda] fix: URLs in quickstart
28970a4fe [1ambda] feat: Add docs/usage
735946bca [1ambda] feat: rename /quickstart
b351cf237 [1ambda] fix: Add missing links
b70770b4f [1ambda] feat: Change URLs in nav, index
94e80aef6 [1ambda] fix: doens't display navbar version in small
6e0cab110 [1ambda] feat: Update doc section names
b9ce256ff [1ambda] feat: Hide version in navbar when md
f8bab52be [1ambda] fix: Better image display in index.md
eeb37d5b5 [1ambda] fix: Add RL padding for mobile browser
ceb60b5ee [1ambda] feat: Style collapsed nav for mobile browser
4ebafb4b6 [1ambda] commit
2017-06-23 17:44:13 +09:00

6.9 KiB

layout title description group
page Apache Zeppelin Tutorial This tutorial page contains a short walk-through tutorial that uses Apache Spark backend. Please note that this tutorial is valid for Spark 1.3 and higher. quickstart

{% include JB/setup %}

Zeppelin Tutorial

This tutorial walks you through some of the fundamental Zeppelin concepts. We will assume you have already installed Zeppelin. If not, please see here first.

Current main backend processing engine of Zeppelin is Apache Spark. If you're new to this system, you might want to start by getting an idea of how it processes data to get the most out of Zeppelin.

Tutorial with Local File

Data Refine

Before you start Zeppelin tutorial, you will need to download bank.zip.

First, to transform csv format data into RDD of Bank objects, run following script. This will also remove header using filter function.


val bankText = sc.textFile("yourPath/bank/bank-full.csv")

case class Bank(age:Integer, job:String, marital : String, education : String, balance : Integer)

// split each line, filter out header (starts with "age"), and map it into Bank case class
val bank = bankText.map(s=>s.split(";")).filter(s=>s(0)!="\"age\"").map(
    s=>Bank(s(0).toInt, 
            s(1).replaceAll("\"", ""),
            s(2).replaceAll("\"", ""),
            s(3).replaceAll("\"", ""),
            s(5).replaceAll("\"", "").toInt
        )
)

// convert to DataFrame and create temporal table
bank.toDF().registerTempTable("bank")

Data Retrieval

Suppose we want to see age distribution from bank. To do this, run:

%sql select age, count(1) from bank where age < 30 group by age order by age

You can make input box for setting age condition by replacing 30 with ${maxAge=30}.

%sql select age, count(1) from bank where age < ${maxAge=30} group by age order by age

Now we want to see age distribution with certain marital status and add combo box to select marital status. Run:

%sql select age, count(1) from bank where marital="${marital=single,single|divorced|married}" group by age order by age

## Tutorial with Streaming Data

Data Refine

Since this tutorial is based on Twitter's sample tweet stream, you must configure authentication with a Twitter account. To do this, take a look at Twitter Credential Setup. After you get API keys, you should fill out credential related values(apiKey, apiSecret, accessToken, accessTokenSecret) with your API keys on following script.

This will create a RDD of Tweet objects and register these stream data as a table:

import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import org.apache.spark.storage.StorageLevel
import scala.io.Source
import scala.collection.mutable.HashMap
import java.io.File
import org.apache.log4j.Logger
import org.apache.log4j.Level
import sys.process.stringSeqToProcess

/** Configures the Oauth Credentials for accessing Twitter */
def configureTwitterCredentials(apiKey: String, apiSecret: String, accessToken: String, accessTokenSecret: String) {
  val configs = new HashMap[String, String] ++= Seq(
    "apiKey" -> apiKey, "apiSecret" -> apiSecret, "accessToken" -> accessToken, "accessTokenSecret" -> accessTokenSecret)
  println("Configuring Twitter OAuth")
  configs.foreach{ case(key, value) =>
    if (value.trim.isEmpty) {
      throw new Exception("Error setting authentication - value for " + key + " not set")
    }
    val fullKey = "twitter4j.oauth." + key.replace("api", "consumer")
    System.setProperty(fullKey, value.trim)
    println("\tProperty " + fullKey + " set as [" + value.trim + "]")
  }
  println()
}

// Configure Twitter credentials
val apiKey = "xxxxxxxxxxxxxxxxxxxxxxxxx"
val apiSecret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
val accessToken = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
val accessTokenSecret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
configureTwitterCredentials(apiKey, apiSecret, accessToken, accessTokenSecret)

import org.apache.spark.streaming.twitter._
val ssc = new StreamingContext(sc, Seconds(2))
val tweets = TwitterUtils.createStream(ssc, None)
val twt = tweets.window(Seconds(60))

case class Tweet(createdAt:Long, text:String)
twt.map(status=>
  Tweet(status.getCreatedAt().getTime()/1000, status.getText())
).foreachRDD(rdd=>
  // Below line works only in spark 1.3.0.
  // For spark 1.1.x and spark 1.2.x,
  // use rdd.registerTempTable("tweets") instead.
  rdd.toDF().registerAsTable("tweets")
)

twt.print

ssc.start()

Data Retrieval

For each following script, every time you click run button you will see different result since it is based on real-time data.

Let's begin by extracting maximum 10 tweets which contain the word girl.

%sql select * from tweets where text like '%girl%' limit 10

This time suppose we want to see how many tweets have been created per sec during last 60 sec. To do this, run:

%sql select createdAt, count(1) from tweets group by createdAt order by createdAt

You can make user-defined function and use it in Spark SQL. Let's try it by making function named sentiment. This function will return one of the three attitudes( positive, negative, neutral ) towards the parameter.

def sentiment(s:String) : String = {
    val positive = Array("like", "love", "good", "great", "happy", "cool", "the", "one", "that")
    val negative = Array("hate", "bad", "stupid", "is")
    
    var st = 0;

    val words = s.split(" ")    
    positive.foreach(p =>
        words.foreach(w =>
            if(p==w) st = st+1
        )
    )
    
    negative.foreach(p=>
        words.foreach(w=>
            if(p==w) st = st-1
        )
    )
    if(st>0)
        "positivie"
    else if(st<0)
        "negative"
    else
        "neutral"
}

// Below line works only in spark 1.3.0.
// For spark 1.1.x and spark 1.2.x,
// use sqlc.registerFunction("sentiment", sentiment _) instead.
sqlc.udf.register("sentiment", sentiment _)

To check how people think about girls using sentiment function we've made above, run this:

%sql select sentiment(text), count(1) from tweets where text like '%girl%' group by sentiment(text)