| layout | title | description | group |
|---|---|---|---|
| page | Apache Zeppelin Tutorial | This tutorial page contains a short walk-through tutorial that uses Apache Spark backend. Please note that this tutorial is valid for Spark 1.3 and higher. | quickstart |
{% include JB/setup %}
# Zeppelin Tutorial
This tutorial walks you through some of the fundamental Zeppelin concepts. It assumes you have already installed Zeppelin. If not, please install it first.
Zeppelin's main backend processing engine is currently Apache Spark. If you're new to Spark, it helps to first get an idea of how it processes data, so that you can get the most out of Zeppelin.
## Tutorial with Local File
### Data Refine
Before you start the Zeppelin tutorial, you will need to download bank.zip.
First, to transform the CSV data into an RDD of Bank objects, run the following script. It also removes the header line using the filter function.
val bankText = sc.textFile("yourPath/bank/bank-full.csv")
case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)
// Split each line, filter out the header (starts with "age"), and map each row to the Bank case class
val bank = bankText.map(s => s.split(";")).filter(s => s(0) != "\"age\"").map(
  s => Bank(s(0).toInt,
    s(1).replaceAll("\"", ""),
    s(2).replaceAll("\"", ""),
    s(3).replaceAll("\"", ""),
    s(5).replaceAll("\"", "").toInt
  )
)
// Convert to a DataFrame and register it as a temporary table
bank.toDF().registerTempTable("bank")
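The parsing step above can be tried without a Spark context. The following is a minimal plain-Scala sketch of the same logic: split on `;`, skip the header, strip the quotes, and pick columns 0, 1, 2, 3 and 5. The names `BankRow`, `unquote`, `parseBankLine` and the two sample lines are hypothetical and only assume the column order the script relies on.

```scala
// Plain-Scala sketch of the parsing step (no Spark required).
// BankRow, unquote and parseBankLine are illustrative names, not part of the tutorial.
case class BankRow(age: Int, job: String, marital: String, education: String, balance: Int)

// Remove the double quotes that wrap string fields in the CSV.
def unquote(field: String): String = field.replaceAll("\"", "")

// Returns None for the header line and Some(row) for a data line.
// Column 4 is skipped, exactly as in the tutorial script.
def parseBankLine(line: String): Option[BankRow] = {
  val cols = line.split(";")
  if (cols(0) == "\"age\"") None
  else Some(BankRow(cols(0).toInt, unquote(cols(1)), unquote(cols(2)), unquote(cols(3)), unquote(cols(5)).toInt))
}

// Assumed sample input laid out like the first six columns of bank-full.csv.
val sampleLines = Seq(
  "\"age\";\"job\";\"marital\";\"education\";\"default\";\"balance\"",
  "58;\"management\";\"married\";\"tertiary\";\"no\";2143"
)

// The header yields None, so only the data line survives the flatMap.
val parsedRows: Seq[BankRow] = sampleLines.flatMap(line => parseBankLine(line))
```

Returning an `Option` keeps header handling and row construction in one function; the Spark version does the same work with a separate `filter` and `map`.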
### Data Retrieval
Suppose we want to see the age distribution in the bank table. To do this, run:
%sql select age, count(1) from bank where age < 30 group by age order by age
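To make clear what this query computes, here is a rough plain-Scala equivalent on an in-memory list. The sample ages and the name `ageDistribution` are made up for illustration; in the tutorial the data comes from the bank table and the work is done by Spark SQL.

```scala
// Assumed sample ages; in the tutorial these come from the bank table.
val sampleAges = Seq(25, 31, 22, 25, 29, 40, 22, 25)

// Mirrors: where age < maxAge, group by age, count(1), order by age.
def ageDistribution(ages: Seq[Int], maxAge: Int): Seq[(Int, Int)] =
  ages.filter(_ < maxAge)
    .groupBy(identity)
    .map { case (age, group) => (age, group.size) }
    .toSeq
    .sortBy(_._1)

// Each pair is (age, number of people with that age).
val under30 = ageDistribution(sampleAges, 30)
```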
You can add an input box for setting the age condition by replacing 30 with ${maxAge=30}:
%sql select age, count(1) from bank where age < ${maxAge=30} group by age order by age
Now suppose we want to see the age distribution for a given marital status, with a combo box for selecting the marital status. Run:
%sql select age, count(1) from bank where marital="${marital=single,single|divorced|married}" group by age order by age
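The two placeholder forms used above, `${name=default}` and `${name=default,option1|option2}`, can be read as a simple text substitution performed before the query runs. The sketch below illustrates that idea in plain Scala. It is not Zeppelin's implementation; `formPattern` and `fillForms` are hypothetical names, and falling back to the default for a value outside the option list is an assumption made for this example.

```scala
import scala.util.matching.Regex

// Matches ${name=default} and ${name=default,opt1|opt2|...}.
// Hypothetical helper, not Zeppelin's own parser.
val formPattern: Regex = """\$\{(\w+)=([^,}]*)(?:,([^}]*))?\}""".r

// Replace each placeholder with the supplied value, or with its default
// when no value was supplied. For select forms, a value that is not in
// the option list falls back to the default (an assumption of this sketch).
def fillForms(query: String, values: Map[String, String]): String =
  formPattern.replaceAllIn(query, m => {
    val name = m.group(1)
    val default = m.group(2)
    // group(3) is null when the placeholder has no option list.
    val options = Option(m.group(3)).map(_.split('|').toSeq)
    val chosen = values.getOrElse(name, default)
    val valid = options.forall(_.contains(chosen))
    Regex.quoteReplacement(if (valid) chosen else default)
  })

// With no user input, the defaults are used.
val defaultQuery = fillForms("select age from bank where age < ${maxAge=30}", Map.empty)
```

Calling `fillForms` with `Map("maxAge" -> "40")` would produce the same query with 40 in place of 30.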
## Tutorial with Streaming Data
### Data Refine
Since this tutorial is based on Twitter's sample tweet stream, you must configure authentication with a Twitter account. To do this, take a look at Twitter Credential Setup. After you get your API keys, fill in the credential values (apiKey, apiSecret, accessToken, accessTokenSecret) in the following script.
The script creates an RDD of Tweet objects and registers the stream data as a table:
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import org.apache.spark.storage.StorageLevel
import scala.io.Source
import scala.collection.mutable.HashMap
import java.io.File
import org.apache.log4j.Logger
import org.apache.log4j.Level
import sys.process.stringSeqToProcess
/** Configures the OAuth credentials for accessing Twitter */
def configureTwitterCredentials(apiKey: String, apiSecret: String, accessToken: String, accessTokenSecret: String): Unit = {
  val configs = new HashMap[String, String] ++= Seq(
    "apiKey" -> apiKey, "apiSecret" -> apiSecret, "accessToken" -> accessToken, "accessTokenSecret" -> accessTokenSecret)
  println("Configuring Twitter OAuth")
  configs.foreach { case (key, value) =>
    if (value.trim.isEmpty) {
      throw new Exception("Error setting authentication - value for " + key + " not set")
    }
    // twitter4j uses consumerKey/consumerSecret, so "api" is renamed to "consumer"
    val fullKey = "twitter4j.oauth." + key.replace("api", "consumer")
    System.setProperty(fullKey, value.trim)
    println("\tProperty " + fullKey + " set as [" + value.trim + "]")
  }
  println()
}
// Configure Twitter credentials
val apiKey = "xxxxxxxxxxxxxxxxxxxxxxxxx"
val apiSecret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
val accessToken = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
val accessTokenSecret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
configureTwitterCredentials(apiKey, apiSecret, accessToken, accessTokenSecret)
// 2-second batches, collected into a 60-second window
val ssc = new StreamingContext(sc, Seconds(2))
val tweets = TwitterUtils.createStream(ssc, None)
val twt = tweets.window(Seconds(60))
case class Tweet(createdAt: Long, text: String)
twt.map(status =>
  Tweet(status.getCreatedAt().getTime() / 1000, status.getText())
).foreachRDD(rdd =>
  // toDF() requires Spark 1.3 or higher.
  // For Spark 1.1.x and Spark 1.2.x,
  // use rdd.registerTempTable("tweets") instead.
  rdd.toDF().registerTempTable("tweets")
)
twt.print
ssc.start()
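The script combines a 2-second batch interval with a 60-second window, so each windowed result covers roughly the last 30 batches. The sketch below models that in plain Scala, assuming the window slides by one batch at a time. `windowed` and the demo values are hypothetical; this shows the windowing idea, not Spark Streaming's implementation.

```scala
// Each inner Seq stands for one batch (2 seconds of data in the tutorial).
// windowBatches = window length / batch interval (60 s / 2 s = 30 in the tutorial).
// For every batch, return the concatenation of the most recent windowBatches batches.
def windowed[T](batches: Seq[Seq[T]], windowBatches: Int): Seq[Seq[T]] =
  batches.indices.map { i =>
    batches.slice(math.max(0, i - windowBatches + 1), i + 1).flatten
  }

// Small demo with a window of 2 batches instead of 30.
val demoBatches = Seq(Seq("a"), Seq("b", "c"), Seq("d"))
val demoWindows = windowed(demoBatches, 2)
```

This is why the "tweets" table always reflects about the most recent minute of the stream: older batches drop out of the window as new ones arrive.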
### Data Retrieval
Because the following scripts run on real-time data, you will see a different result every time you click the run button.
Let's begin by extracting at most 10 tweets that contain the word girl.
%sql select * from tweets where text like '%girl%' limit 10
This time, suppose we want to see how many tweets were created per second during the last 60 seconds. To do this, run:
%sql select createdAt, count(1) from tweets group by createdAt order by createdAt
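The per-second grouping works because the streaming script stores createdAt as whole seconds (`getTime()/1000`), so all tweets from the same second share one key. Here is a plain-Scala sketch of that bucketing, with hypothetical names (`toSecond`, `tweetsPerSecond`) and made-up timestamps:

```scala
// Convert epoch milliseconds to whole seconds, as getTime() / 1000 does.
def toSecond(millis: Long): Long = millis / 1000

// Mirrors: group by createdAt, count(1), order by createdAt.
def tweetsPerSecond(createdAtMillis: Seq[Long]): Seq[(Long, Int)] =
  createdAtMillis.map(toSecond)
    .groupBy(identity)
    .map { case (second, group) => (second, group.size) }
    .toSeq
    .sortBy(_._1)

// Assumed sample timestamps in milliseconds.
val demoMillis = Seq(1000L, 1500L, 1999L, 2000L, 4200L)

// Each pair is (second, number of tweets in that second).
val demoCounts = tweetsPerSecond(demoMillis)
```

Seconds with no tweets, such as second 3 in this sample, simply do not appear in the result.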
You can define a user-defined function and use it in Spark SQL. Let's try it by writing a function named sentiment. This function returns one of three attitudes (positive, negative, neutral) toward its parameter.
def sentiment(s: String): String = {
  val positive = Array("like", "love", "good", "great", "happy", "cool", "the", "one", "that")
  val negative = Array("hate", "bad", "stupid", "is")
  // Score: +1 for each positive word, -1 for each negative word
  var st = 0
  val words = s.split(" ")
  positive.foreach(p =>
    words.foreach(w =>
      if (p == w) st = st + 1
    )
  )
  negative.foreach(p =>
    words.foreach(w =>
      if (p == w) st = st - 1
    )
  )
  if (st > 0)
    "positive"
  else if (st < 0)
    "negative"
  else
    "neutral"
}
// The line below requires Spark 1.3 or higher.
// For Spark 1.1.x and Spark 1.2.x,
// use sqlc.registerFunction("sentiment", sentiment _) instead.
sqlc.udf.register("sentiment", sentiment _)
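The scoring rule can be checked outside Spark before registering it. The sketch below is a compact plain-Scala restatement of the same rule, using the same word lists. `sentimentLabel`, `positiveWords` and `negativeWords` are hypothetical names introduced for illustration. As in the function above, matching is exact and case-sensitive on space-separated words, so "Love" or "love!" would not count.

```scala
// Same word lists as the tutorial function, held as sets for lookup.
val positiveWords = Set("like", "love", "good", "great", "happy", "cool", "the", "one", "that")
val negativeWords = Set("hate", "bad", "stupid", "is")

// Score = (# words in positiveWords) - (# words in negativeWords).
def sentimentLabel(s: String): String = {
  val words = s.split(" ")
  val score = words.count(w => positiveWords.contains(w)) - words.count(w => negativeWords.contains(w))
  if (score > 0) "positive" else if (score < 0) "negative" else "neutral"
}

// A few sample sentences to label.
val labels = Seq("I love this great day", "this is bad", "the cat is here").map(sentimentLabel)
```

The third sentence scores zero because "the" (+1) and "is" (-1) cancel out, which shows how the common words in the lists affect the result.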
To see how people feel about girls, using the sentiment function defined above, run:
%sql select sentiment(text), count(1) from tweets where text like '%girl%' group by sentiment(text)