mirror of
https://github.com/apache/zeppelin
synced 2026-05-24 09:38:26 +00:00
---
layout: page
title: "Apache Zeppelin Tutorial"
description: "This tutorial page contains a short walk-through tutorial that uses Apache Spark backend. Please note that this tutorial is valid for Spark 1.3 and higher."
group: quickstart
---
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
{% include JB/setup %}

# Zeppelin Tutorial

<div id="toc"></div>

This tutorial walks you through some of the fundamental Zeppelin concepts. It assumes that you have already installed Zeppelin; if not, please see the [installation guide](./install.html) first.

The main backend processing engine of Zeppelin is currently [Apache Spark](https://spark.apache.org). If you are new to Spark, you may want to start by getting an idea of how it processes data, so that you can get the most out of Zeppelin.

## Tutorial with Local File

### Data Refine

Before you start the tutorial, download [bank.zip](http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip) and extract it.

First, run the following script to transform the CSV data into an RDD of `Bank` objects. The script also removes the header line using the `filter` function.

```scala
val bankText = sc.textFile("yourPath/bank/bank-full.csv")

case class Bank(age: Integer, job: String, marital: String, education: String, balance: Integer)

// split each line, filter out the header (starts with "age"), and map each row into the Bank case class
val bank = bankText.map(s => s.split(";")).filter(s => s(0) != "\"age\"").map(
  s => Bank(s(0).toInt,
    s(1).replaceAll("\"", ""),
    s(2).replaceAll("\"", ""),
    s(3).replaceAll("\"", ""),
    s(5).replaceAll("\"", "").toInt
  )
)

// convert to a DataFrame and register it as a temporary table
bank.toDF().registerTempTable("bank")
```
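
The parsing logic can be tried without Spark on a few lines held in memory. The following is a minimal sketch, not part of the tutorial notebook: the `BankRow` class, the `unquote` helper, and the sample lines are hypothetical, and the samples only assume the layout that the script above relies on (`;` separators, quoted strings, balance in the sixth column).

```scala
// Plain-Scala sketch of the parsing step; no SparkContext is required.
// BankRow, unquote and sampleLines are illustrative names, not tutorial code.
case class BankRow(age: Int, job: String, marital: String, education: String, balance: Int)

// Hypothetical sample rows, truncated to the first six columns.
val sampleLines = Seq(
  "\"age\";\"job\";\"marital\";\"education\";\"default\";\"balance\"",
  "58;\"management\";\"married\";\"tertiary\";\"no\";2143",
  "44;\"technician\";\"single\";\"secondary\";\"no\";29"
)

// Strip the double quotes that wrap string fields.
def unquote(field: String): String = field.replaceAll("\"", "")

val parsedRows = sampleLines
  .map(_.split(";"))
  .filter(fields => fields(0) != "\"age\"") // drop the header row
  .map(f => BankRow(f(0).toInt, unquote(f(1)), unquote(f(2)), unquote(f(3)), unquote(f(5)).toInt))

parsedRows.foreach(println)
```

As in the original script, index 4 is skipped and the balance is read from index 5.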

### Data Retrieval

Suppose we want to see the age distribution in `bank`. To do this, run:

```sql
%sql select age, count(1) from bank where age < 30 group by age order by age
```

You can create an input box for setting the age condition by replacing `30` with `${maxAge=30}`.

```sql
%sql select age, count(1) from bank where age < ${maxAge=30} group by age order by age
```
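
Conceptually, a `${name=default}` placeholder is replaced by the value typed into the input box, or by the default when nothing has been entered. The sketch below only illustrates that substitution idea in plain Scala; it is not Zeppelin's implementation, and `formPattern` and `fillTemplate` are hypothetical names.

```scala
import scala.util.matching.Regex

// Matches ${name=default}; group 1 is the name, group 2 the default value.
val formPattern: Regex = """\$\{(\w+)=([^}]*)\}""".r

// Replace every placeholder with the supplied form value,
// falling back to the default embedded in the placeholder.
def fillTemplate(template: String, formValues: Map[String, String]): String =
  formPattern.replaceAllIn(template, m =>
    Regex.quoteReplacement(formValues.getOrElse(m.group(1), m.group(2))))

val query = "select age, count(1) from bank where age < ${maxAge=30} group by age order by age"

println(fillTemplate(query, Map.empty))             // uses the default, 30
println(fillTemplate(query, Map("maxAge" -> "45"))) // uses the entered value, 45
```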

Now suppose we want to see the age distribution for a given marital status, with a combo box for selecting that status. Run:

```sql
%sql select age, count(1) from bank where marital="${marital=single,single|divorced|married}" group by age order by age
```
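
The select-form placeholder packs three things into one string: the form name, the default value (before the comma), and the options (separated by `|`). The following sketch shows one way to read that structure in plain Scala. It is an illustration under that reading of the syntax, not Zeppelin's parser, and `SelectForm` and `parseSelectForm` are hypothetical names.

```scala
// Illustrative model of a select form: name, default value, and options.
case class SelectForm(name: String, default: String, options: List[String])

// Parse "${name=default,opt1|opt2|...}"; returns None for any other shape.
def parseSelectForm(placeholder: String): Option[SelectForm] = {
  val pattern = """\$\{(\w+)=([^,}]*),([^}]*)\}""".r
  placeholder match {
    case pattern(name, default, options) =>
      Some(SelectForm(name, default, options.split('|').toList))
    case _ => None
  }
}

println(parseSelectForm("${marital=single,single|divorced|married}"))
```

A text-input placeholder such as `${maxAge=30}` has no comma-separated option list, so this sketch returns `None` for it.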

<br />
## Tutorial with Streaming Data

### Data Refine

Since this tutorial is based on Twitter's sample tweet stream, you must configure authentication with a Twitter account. To do this, take a look at [Twitter Credential Setup](https://databricks-training.s3.amazonaws.com/realtime-processing-with-spark-streaming.html#twitter-credential-setup). After you get your API keys, fill in the credential-related values (`apiKey`, `apiSecret`, `accessToken`, `accessTokenSecret`) in the following script.

The script creates an RDD of `Tweet` objects and registers the stream data as a table:

```scala
import org.apache.spark.streaming._
import org.apache.spark.streaming.twitter._
import org.apache.spark.storage.StorageLevel
import scala.io.Source
import scala.collection.mutable.HashMap
import java.io.File
import org.apache.log4j.Logger
import org.apache.log4j.Level
import sys.process.stringSeqToProcess

/** Configures the OAuth credentials for accessing Twitter */
def configureTwitterCredentials(apiKey: String, apiSecret: String, accessToken: String, accessTokenSecret: String): Unit = {
  val configs = new HashMap[String, String] ++= Seq(
    "apiKey" -> apiKey, "apiSecret" -> apiSecret, "accessToken" -> accessToken, "accessTokenSecret" -> accessTokenSecret)
  println("Configuring Twitter OAuth")
  configs.foreach { case (key, value) =>
    if (value.trim.isEmpty) {
      throw new Exception("Error setting authentication - value for " + key + " not set")
    }
    // "apiKey" and "apiSecret" become "consumerKey" and "consumerSecret"
    val fullKey = "twitter4j.oauth." + key.replace("api", "consumer")
    System.setProperty(fullKey, value.trim)
    println("\tProperty " + fullKey + " set as [" + value.trim + "]")
  }
  println()
}

// Configure Twitter credentials
val apiKey = "xxxxxxxxxxxxxxxxxxxxxxxxx"
val apiSecret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
val accessToken = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
val accessTokenSecret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
configureTwitterCredentials(apiKey, apiSecret, accessToken, accessTokenSecret)

// 2-second batches, queried over a sliding 60-second window
val ssc = new StreamingContext(sc, Seconds(2))
val tweets = TwitterUtils.createStream(ssc, None)
val twt = tweets.window(Seconds(60))

case class Tweet(createdAt: Long, text: String)
twt.map(status =>
  Tweet(status.getCreatedAt().getTime() / 1000, status.getText())
).foreachRDD(rdd =>
  // toDF() is available from Spark 1.3.0.
  // For Spark 1.1.x and Spark 1.2.x,
  // use rdd.registerTempTable("tweets") instead.
  rdd.toDF().registerTempTable("tweets")
)

twt.print

ssc.start()
```
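
The credential step in the script above does two things: it rejects blank values, and it renames `apiKey` and `apiSecret` to the `consumerKey` and `consumerSecret` property names under the `twitter4j.oauth.` prefix. The sketch below isolates that mapping as a pure function so it can be checked without touching system properties. `toTwitter4jProperties` is a hypothetical helper, and the credential values are placeholders.

```scala
// Build the twitter4j.oauth.* property map from named credentials.
// Throws IllegalArgumentException (via require) if any value is blank.
def toTwitter4jProperties(credentials: Seq[(String, String)]): Map[String, String] =
  credentials.map { case (key, value) =>
    require(value.trim.nonEmpty, s"value for $key not set")
    ("twitter4j.oauth." + key.replace("api", "consumer")) -> value.trim
  }.toMap

// Placeholder values only; real keys come from your Twitter account.
val sampleProperties = toTwitter4jProperties(Seq(
  "apiKey" -> "key-placeholder",
  "apiSecret" -> "secret-placeholder",
  "accessToken" -> "token-placeholder",
  "accessTokenSecret" -> "token-secret-placeholder"
))

sampleProperties.keys.toSeq.sorted.foreach(println)
```

Unlike the tutorial script, this sketch does not print the credential values, which avoids leaking secrets into notebook output.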

### Data Retrieval

Because the following scripts query real-time data, you will see a different result every time you click the run button.

Let's begin by extracting at most 10 tweets that contain the word **girl**.

```sql
%sql select * from tweets where text like '%girl%' limit 10
```

Next, suppose we want to see how many tweets were created per second during the last 60 seconds. To do this, run:

```sql
%sql select createdAt, count(1) from tweets group by createdAt order by createdAt
```
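
The per-second count can be reproduced on an in-memory collection to see what the query returns. This is a rough sketch under simplifying assumptions: `SampleTweet`, `countsPerSecond`, and the sample data are hypothetical, and the 60-second window is approximated by filtering on `createdAt`, whereas the streaming job windows tweets by the batches in which they arrive.

```scala
// Illustrative stand-in for the Tweet rows registered in the "tweets" table.
case class SampleTweet(createdAt: Long, text: String)

// Count tweets per second among those created in the last `windowSeconds`
// seconds before `now` (both in epoch seconds), ordered by second.
def countsPerSecond(tweets: Seq[SampleTweet], now: Long, windowSeconds: Long = 60): Seq[(Long, Int)] =
  tweets
    .filter(t => t.createdAt > now - windowSeconds && t.createdAt <= now)
    .groupBy(_.createdAt)
    .map { case (second, group) => second -> group.size }
    .toSeq
    .sortBy(_._1)

val sampleTweets = Seq(
  SampleTweet(930L, "too old for the window"),
  SampleTweet(998L, "first"),
  SampleTweet(998L, "second"),
  SampleTweet(999L, "third")
)

countsPerSecond(sampleTweets, now = 1000L).foreach(println)
```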

You can define a user-defined function (UDF) and use it in Spark SQL. Let's try it by defining a function named `sentiment`. This function returns one of three attitudes (positive, negative, or neutral) for the text passed as its parameter.

```scala
def sentiment(s: String): String = {
  val positive = Array("like", "love", "good", "great", "happy", "cool", "the", "one", "that")
  val negative = Array("hate", "bad", "stupid", "is")

  var st = 0

  val words = s.split(" ")
  // add one point for every word that matches a positive term
  positive.foreach(p =>
    words.foreach(w =>
      if (p == w) st = st + 1
    )
  )

  // subtract one point for every word that matches a negative term
  negative.foreach(p =>
    words.foreach(w =>
      if (p == w) st = st - 1
    )
  )
  if (st > 0)
    "positive"
  else if (st < 0)
    "negative"
  else
    "neutral"
}

// The line below works in Spark 1.3.0.
// For Spark 1.1.x and Spark 1.2.x,
// use sqlc.registerFunction("sentiment", sentiment _) instead.
sqlc.udf.register("sentiment", sentiment _)
```

To check how people feel about girls using the `sentiment` function defined above, run:

```sql
%sql select sentiment(text), count(1) from tweets where text like '%girl%' group by sentiment(text)
```
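
The same filter-and-group logic can be checked on a handful of strings without a running stream. The sketch below restates the scoring rule with sets and applies it to hypothetical sample texts. `sentimentOf`, `positiveWords`, `negativeWords`, and `sampleTexts` are illustrative names, and the word lists are copied from the `sentiment` function in this tutorial.

```scala
// Word lists copied from the tutorial's sentiment function.
val positiveWords = Set("like", "love", "good", "great", "happy", "cool", "the", "one", "that")
val negativeWords = Set("hate", "bad", "stupid", "is")

// Score = positive word occurrences minus negative word occurrences.
def sentimentOf(text: String): String = {
  val words = text.split(" ")
  val score = words.count(positiveWords.contains) - words.count(negativeWords.contains)
  if (score > 0) "positive" else if (score < 0) "negative" else "neutral"
}

// Hypothetical sample texts standing in for rows of the "tweets" table.
val sampleTexts = Seq(
  "that girl is cool",     // +2 (that, cool) -1 (is) => positive
  "a girl said hello",     // no matches            => neutral
  "I hate this girl band", // -1 (hate)             => negative
  "I love that song"       // filtered out: does not contain "girl"
)

// Equivalent of: where text like '%girl%' group by sentiment(text)
val countsBySentiment = sampleTexts
  .filter(_.contains("girl"))
  .groupBy(sentimentOf)
  .map { case (label, group) => label -> group.size }

countsBySentiment.foreach(println)
```

Because the word lists contain common words such as "the", "that", and "is", the labels are only a toy heuristic and should not be read as real sentiment analysis.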