### What is this PR for?
As more and more document pages are added, it's really hard to find specific pages. So I added a search feature to the Zeppelin documentation site (a [jekyll](https://jekyllrb.com/) based site) using [lunr.js](http://lunrjs.com/).

- **How does it work?**
I created [`search_data.json`](6e02423f54/docs/search_data.json), which is used as the docs info template. `lunr.js` combines all of the text from all of the docs in `docs/` into `_site/search_data.json`. All the info comes from [Jekyll YAML front matter](https://jekyllrb.com/docs/frontmatter/) variables (i.e. title, group, description; that's why I rewrote all docs' title and description). [search.js](6e02423f54/docs/assets/themes/zeppelin/js/search.js) will do this job using this data!

### What type of PR is it?
Improvement & Feature

### Todos
* [x] - Keep consistency for all docs pages' `Title`
* [x] - Add some overview sentences to all docs pages' `Description` section (this will be used as the result preview)
* [x] - Add apache license header to all docs pages (some pages are missing the license header currently)
* [x] - Add LICENSE for `lunr.min.js`

### What is the Jira issue?
[ZEPPELIN-1219](https://issues.apache.org/jira/browse/ZEPPELIN-1219)

### How should this be tested?
1. Apply this patch and build `ZEPPELIN_HOME/docs` dir -> please see [docs/README.md#build-documentation](https://github.com/apache/zeppelin/tree/master/docs#build-documentation)
2. Click `search` icon in navbar and go to `search.html` page
3. Type anything you want to search in the search bar (i.e. type `python`, `spark`, `dynamic` ... )

### Screenshots (if appropriate)

### Questions:
* Do the license files need an update? Yes, for `lunr.min.js`
* Are there breaking changes for older versions? no
* Does this need documentation? no

Author: AhyoungRyu <fbdkdud93@hanmail.net>

Closes #1266 from AhyoungRyu/ZEPPELIN-1219 and squashes the following commits:

* 7ec8854 [AhyoungRyu] Modify 'no result' sentence
* 91b71a7 [AhyoungRyu] Remove Apache license header since JSON doesn't allow comment
* 34afd5d [AhyoungRyu] Add Apache license header to search_data.json
* 6784282 [AhyoungRyu] Minor search page UI update
* 0389d28 [AhyoungRyu] Make index.md not to be searched
* 9f1ba42 [AhyoungRyu] Disable enterkey press & change icon
* bd4956a [AhyoungRyu] Add docs.js & search.js to exclude list in pom.xml
* 624b051 [AhyoungRyu] Add Apache license header to search.js
* 1381152 [AhyoungRyu] Fix search result skipping issue
* 6e775f5 [AhyoungRyu] Make pleasecontribute.md not to be searched
* ee11136 [AhyoungRyu] Fix some typos
* fa01299 [AhyoungRyu] Refine 'description' in some docs as @bzz suggested
* da0cff9 [AhyoungRyu] Exclude lunr.min.js
* 36ba7f1 [AhyoungRyu] Add lunr.min.js license info
* f6a05a6 [AhyoungRyu] Apply css style for the search results
* 68eb997 [AhyoungRyu] Attach 'Apache Zeppelin ZEPPELIN_VERSION Documentation: ' to title
* d908c37 [AhyoungRyu] Add searching page
* a951fa6 [AhyoungRyu] Add search icon to navbar
* 0688a79 [AhyoungRyu] Keep consistency all docs' front matter for the right search result
* 040f532 [AhyoungRyu] Add template for storing docs info based on jekyll front matter
* 0705bd6 [AhyoungRyu] Add js files: lunr.min.js & search.js
| layout | title | description | group |
|---|---|---|---|
| page | Apache Zeppelin Tutorial | This tutorial page contains a short walk-through tutorial that uses Apache Spark backend. Please note that this tutorial is valid for Spark 1.3 and higher. | quickstart |
{% include JB/setup %}
# Zeppelin Tutorial
This tutorial walks you through some of the fundamental Zeppelin concepts. We assume you have already installed Zeppelin. If not, please install it first.

The main backend processing engine of Zeppelin is currently Apache Spark. If you are new to Spark, you may want to start by getting an idea of how it processes data, so that you can get the most out of Zeppelin.
## Tutorial with Local File
### Data Refine
Before you start the Zeppelin tutorial, you will need to download bank.zip.

First, to transform the CSV data into an RDD of `Bank` objects, run the following script. It also removes the header row using the `filter` function.

    val bankText = sc.textFile("yourPath/bank/bank-full.csv")

    case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)

    // Split each line on ";", drop the header row (its first field is "age"),
    // and map the remaining rows to the Bank case class.
    val bank = bankText.map(s => s.split(";")).filter(s => s(0) != "\"age\"").map(
      s => Bank(s(0).toInt,
        s(1).replaceAll("\"", ""),
        s(2).replaceAll("\"", ""),
        s(3).replaceAll("\"", ""),
        s(5).replaceAll("\"", "").toInt
      )
    )

    // Convert to a DataFrame and register it as a temporary table
    bank.toDF().registerTempTable("bank")
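The parsing steps in the script above can be tried without Spark on an ordinary Scala collection. The following is a minimal sketch under that assumption: the two sample lines are made up for illustration (they are not copied from `bank-full.csv`), and `sampleLines` and `parsed` are hypothetical names.

```scala
// Plain Scala, no SparkContext needed: the same split / filter / map steps
// applied to an in-memory list instead of an RDD.
case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)

// Hypothetical sample input: a header row followed by one data row.
val sampleLines = Seq(
  "\"age\";\"job\";\"marital\";\"education\";\"default\";\"balance\"",
  "58;\"management\";\"married\";\"tertiary\";\"no\";2143"
)

val parsed = sampleLines
  .map(_.split(";"))
  .filter(cols => cols(0) != "\"age\"")   // drop the header row
  .map(cols => Bank(
    cols(0).toInt,
    cols(1).replaceAll("\"", ""),         // strip the surrounding quotes
    cols(2).replaceAll("\"", ""),
    cols(3).replaceAll("\"", ""),
    cols(5).replaceAll("\"", "").toInt    // index 4 ("default") is skipped
  ))

parsed.foreach(println)
```

Note that the header check compares against the quoted string `"age"`, quotes included, because the quotes are only stripped later in the `map` step.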
### Data Retrieval
Suppose we want to see the age distribution in the `bank` table. To do this, run:

    %sql select age, count(1) from bank where age < 30 group by age order by age

You can add an input box for the age condition by replacing `30` with `${maxAge=30}`.

    %sql select age, count(1) from bank where age < ${maxAge=30} group by age order by age

Now suppose we want to see the age distribution for a given marital status, with a combo box to select that status. Run:

    %sql select age, count(1) from bank where marital="${marital=single,single|divorced|married}" group by age order by age
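As a rough mental model of the `${name=default}` and `${name=default,option1|option2}` placeholders, the sketch below substitutes either a supplied value or the default into the query text. This is not Zeppelin's implementation, which also renders the input widgets; `formPattern` and `fillForm` are hypothetical names introduced only for illustration.

```scala
import scala.util.matching.Regex

// Matches ${name=default} and ${name=default,opt1|opt2|...}.
// Group 1 is the form name, group 2 the default value.
val formPattern: Regex = """\$\{(\w+)=([^,}]*)(?:,([^}]*))?\}""".r

// Hypothetical helper: replace each placeholder with the value supplied for
// its name, falling back to the default written in the template.
def fillForm(template: String, values: Map[String, String]): String =
  formPattern.replaceAllIn(template, m =>
    Regex.quoteReplacement(values.getOrElse(m.group(1), m.group(2))))

val ageQuery =
  "select age, count(1) from bank where age < ${maxAge=30} group by age order by age"
println(fillForm(ageQuery, Map.empty))              // falls back to the default
println(fillForm(ageQuery, Map("maxAge" -> "40")))  // uses the supplied value

val maritalQuery =
  "select age, count(1) from bank where marital=\"${marital=single,single|divorced|married}\" group by age order by age"
println(fillForm(maritalQuery, Map("marital" -> "married")))
```

The point of the sketch is only that the paragraph text is a template: the same query runs again with a different literal each time the form value changes.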
## Tutorial with Streaming Data
### Data Refine
Because this tutorial is based on Twitter's sample tweet stream, you must configure authentication with a Twitter account. To do this, see Twitter Credential Setup. Once you have your API keys, fill in the credential values (`apiKey`, `apiSecret`, `accessToken`, `accessTokenSecret`) in the following script.

The script creates an RDD of `Tweet` objects and registers the stream data as a table:

    import org.apache.spark.streaming._
    import org.apache.spark.streaming.twitter._
    import org.apache.spark.storage.StorageLevel
    import scala.io.Source
    import scala.collection.mutable.HashMap
    import java.io.File
    import org.apache.log4j.Logger
    import org.apache.log4j.Level
    import sys.process.stringSeqToProcess

    /** Configures the OAuth credentials for accessing Twitter */
    def configureTwitterCredentials(apiKey: String, apiSecret: String, accessToken: String, accessTokenSecret: String): Unit = {
      val configs = new HashMap[String, String] ++= Seq(
        "apiKey" -> apiKey, "apiSecret" -> apiSecret, "accessToken" -> accessToken, "accessTokenSecret" -> accessTokenSecret)
      println("Configuring Twitter OAuth")
      configs.foreach { case (key, value) =>
        if (value.trim.isEmpty) {
          throw new Exception("Error setting authentication - value for " + key + " not set")
        }
        // The system properties use "consumer" where the parameters use "api"
        val fullKey = "twitter4j.oauth." + key.replace("api", "consumer")
        System.setProperty(fullKey, value.trim)
        println("\tProperty " + fullKey + " set as [" + value.trim + "]")
      }
      println()
    }

    // Configure Twitter credentials
    val apiKey = "xxxxxxxxxxxxxxxxxxxxxxxxx"
    val apiSecret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    val accessToken = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    val accessTokenSecret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    configureTwitterCredentials(apiKey, apiSecret, accessToken, accessTokenSecret)

    val ssc = new StreamingContext(sc, Seconds(2))    // 2-second batches
    val tweets = TwitterUtils.createStream(ssc, None)
    val twt = tweets.window(Seconds(60))              // keep the last 60 seconds of tweets

    case class Tweet(createdAt: Long, text: String)
    twt.map(status =>
      // createdAt is stored in whole seconds since the epoch
      Tweet(status.getCreatedAt().getTime() / 1000, status.getText())
    ).foreachRDD(rdd =>
      // toDF() requires Spark 1.3.0 or later.
      // For Spark 1.1.x and Spark 1.2.x,
      // use rdd.registerTempTable("tweets") instead.
      rdd.toDF().registerTempTable("tweets")
    )

    twt.print

    ssc.start()
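The key renaming inside `configureTwitterCredentials` is easy to miss, so here is a small standalone sketch of just that expression, with no Twitter or Spark dependency. `credentialNames` and `propertyNames` are hypothetical names used only here.

```scala
// The four names used as keys in the tutorial's configs map.
val credentialNames = Seq("apiKey", "apiSecret", "accessToken", "accessTokenSecret")

// Same expression as in configureTwitterCredentials: only names containing
// "api" are rewritten; the access-token names pass through unchanged.
val propertyNames = credentialNames.map { key =>
  "twitter4j.oauth." + key.replace("api", "consumer")
}

propertyNames.foreach(println)
```

This shows why the script can take `apiKey` and `apiSecret` as parameters while still setting the `twitter4j.oauth.consumerKey` and `twitter4j.oauth.consumerSecret` system properties.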
### Data Retrieval
For each of the following scripts, you will see a different result every time you click the run button, because the results are based on real-time data.

Let's begin by extracting at most 10 tweets that contain the word `girl`.

    %sql select * from tweets where text like '%girl%' limit 10

This time, suppose we want to see how many tweets were created per second during the last 60 seconds. To do this, run:

    %sql select createdAt, count(1) from tweets group by createdAt order by createdAt
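The per-second grouping works because the streaming script stores `createdAt` in whole seconds (`getTime() / 1000`). The sketch below reproduces the same bucketing and counting on a plain Scala collection; the millisecond timestamps are invented sample values, and `createdMillis`, `sampleTweets` and `perSecond` are hypothetical names.

```scala
case class Tweet(createdAt: Long, text: String)

// Hypothetical creation times in milliseconds, like java.util.Date.getTime returns.
val createdMillis = Seq(1000L, 1400L, 1999L, 2500L, 4000L)

// Integer division by 1000 truncates each timestamp to a whole second,
// so tweets created within the same second share one createdAt value.
val sampleTweets = createdMillis.map(ms => Tweet(ms / 1000, s"tweet at $ms ms"))

// Collection equivalent of:
//   select createdAt, count(1) from tweets group by createdAt order by createdAt
val perSecond = sampleTweets
  .groupBy(_.createdAt)
  .map { case (second, group) => (second, group.size) }
  .toSeq
  .sortBy(_._1)

perSecond.foreach { case (second, n) => println(s"$second -> $n") }
```

Seconds in which no tweet arrived (second 3 in this sample) simply do not appear in the result, which is also true of the SQL query.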
You can also define a user-defined function and use it in Spark SQL. Let's try it by making a function named `sentiment`. This function returns one of three attitudes (positive, negative, neutral) for the text passed to it.

    def sentiment(s: String): String = {
      val positive = Array("like", "love", "good", "great", "happy", "cool", "the", "one", "that")
      val negative = Array("hate", "bad", "stupid", "is")

      var st = 0

      val words = s.split(" ")
      // Add 1 for every word that appears in the positive list
      positive.foreach(p =>
        words.foreach(w =>
          if (p == w) st = st + 1
        )
      )

      // Subtract 1 for every word that appears in the negative list
      negative.foreach(p =>
        words.foreach(w =>
          if (p == w) st = st - 1
        )
      )
      if (st > 0)
        "positive"
      else if (st < 0)
        "negative"
      else
        "neutral"
    }

    // The line below works in Spark 1.3.0 and later.
    // For Spark 1.1.x and Spark 1.2.x,
    // use sqlc.registerFunction("sentiment", sentiment _) instead.
    sqlc.udf.register("sentiment", sentiment _)
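Before registering the function as a UDF, it can help to check the scoring rule on a few strings in plain Scala. The sketch below is a compact variant of the same rule, not the tutorial's function itself: `compactSentiment` is a hypothetical name, and it uses `Set` lookups instead of nested loops, which gives the same score because each word list has no duplicates.

```scala
val positiveWords = Set("like", "love", "good", "great", "happy", "cool", "the", "one", "that")
val negativeWords = Set("hate", "bad", "stupid", "is")

// Hypothetical compact variant: count the words found in each list,
// then compare the two counts.
def compactSentiment(s: String): String = {
  val words = s.split(" ")
  val score = words.count(positiveWords.contains) - words.count(negativeWords.contains)
  if (score > 0) "positive" else if (score < 0) "negative" else "neutral"
}

println(compactSentiment("I love that cool song"))  // three positive words
println(compactSentiment("this is bad"))            // two negative words
println(compactSentiment("I LOVE it!"))             // no match: case-sensitive
```

Matching is exact, so capitalised words and words with attached punctuation do not count. That is one reason the word lists include very common words such as `the` and `is`: they make the sample output less uniformly neutral.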
To see how people feel about girls, using the `sentiment` function defined above, run:

    %sql select sentiment(text), count(1) from tweets where text like '%girl%' group by sentiment(text)