mirror of
https://github.com/apache/zeppelin
synced 2026-05-24 09:38:26 +00:00
Add Scio doc
This commit is contained in:
parent
61850d7f65
commit
0a3b49a50a
1 changed files with 151 additions and 0 deletions
151
docs/interpreter/scio.md
Normal file
@ -0,0 +1,151 @@
---
layout: page
title: "Scio Interpreter for Apache Zeppelin"
description: "Scio is a Scala DSL for Apache Beam/Google Dataflow model."
group: interpreter
---
<!--
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
{% include JB/setup %}

# Scio Interpreter for Apache Zeppelin

<div id="toc"></div>

## Overview
Scio is a Scala DSL for [Google Cloud Dataflow](https://github.com/GoogleCloudPlatform/DataflowJavaSDK) and [Apache Beam](http://beam.incubator.apache.org/), inspired by [Spark](http://spark.apache.org/) and [Scalding](https://github.com/twitter/scalding). See the current [wiki](https://github.com/spotify/scio/wiki) and [API documentation](http://spotify.github.io/scio/) for more information.

## Configuration
<table class="table-configuration">
  <tr>
    <th>Name</th>
    <th>Default Value</th>
    <th>Description</th>
  </tr>
  <tr>
    <td>argz</td>
    <td>--runner=InProcessPipelineRunner</td>
    <td>Scio Pipeline runner</td>
  </tr>
  <tr>
    <td>zeppelin.scio.maxResult</td>
    <td>1000</td>
    <td>Max number of SCollection results to display</td>
  </tr>

</table>

## Enabling the Scio Interpreter

In a notebook, to enable the **Scio** interpreter, click the **Gear** icon and select **scio**.

## Using the Scio Interpreter

In a paragraph, use `%scio` to select the **Scio** interpreter. You can use it much the same way as the vanilla Scala REPL and the [Scio REPL](https://github.com/spotify/scio/wiki/Scio-REPL). The context is shared among all *Scio* paragraphs. There is a special variable, **argz**, which holds the arguments from the Scio interpreter settings. The easiest way to proceed is to create a context via the standard `ContextAndArgs`.

```scala
%scio
val (sc, args) = ContextAndArgs(argz)
```
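
To make the role of `argz` more concrete, the paragraph below is a minimal sketch that prints the arguments the interpreter received before creating the context. It assumes `argz` is an `Array[String]` (which is what `ContextAndArgs` accepts). The Dataflow value shown in the comment is only an illustration based on Dataflow Java SDK 1.x option names, with `<project>` and `<bucket>` as placeholders.

```scala
%scio
// argz is filled from the interpreter settings. With the default setting it
// should contain --runner=InProcessPipelineRunner. A Dataflow setup might
// instead use something like (assumed, not taken from this page):
//   --runner=DataflowPipelineRunner --project=<project> --stagingLocation=gs://<bucket>/staging
println(argz.mkString(" "))

// Build the context and the parsed arguments from the same settings.
val (sc, args) = ContextAndArgs(argz)
```

Printing `argz` first can help confirm that a change in the interpreter settings has actually reached the notebook.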

Use the `sc` context the way you would in a regular pipeline or REPL.

Example:

```scala
%scio
val (sc, args) = ContextAndArgs(argz)
sc.parallelize(Seq("foo", "foo", "bar")).countByValue.closeAndDisplay()
```
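
As a slightly longer illustration of the same pattern, the following sketch counts words across a few in-memory lines. It uses only `parallelize`, `flatMap`, `filter`, and `countByValue` from Scio plus the `closeAndDisplay()` helper shown above. The input strings are made up. Because `closeAndDisplay()` closes the context, the sketch creates a fresh one at the top of the paragraph.

```scala
%scio
// A new context for this pipeline; the previous one was closed by closeAndDisplay().
val (sc, args) = ContextAndArgs(argz)

sc.parallelize(Seq("foo bar", "bar baz", "foo"))
  .flatMap(_.split(" "))   // split each line into words
  .filter(_.nonEmpty)      // drop empty tokens
  .countByValue            // (word, count) pairs
  .closeAndDisplay()       // run the pipeline and render the result
```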

Please refer to the [Scio wiki](https://github.com/spotify/scio/wiki) for more complex examples.

### Progress

Only one paragraph can run at a time. There is no notion of overall progress, so the progress indicator always shows `42`, which is just a magic number.

### SCollection display helpers

The Scio interpreter comes with display helpers to ease working with Zeppelin notebooks. Simply call `closeAndDisplay()` on an `SCollection` to close the context and display the results. The number of results is limited by `zeppelin.scio.maxResult` (1000 by default).

Supported `SCollection` types:

 * Scio's typed BigQuery
 * Scala's case classes
 * Scala's tuples
 * Google BigQuery's TableRow
 * Apache Avro
 * Any Scala `Product`-like data
 * Any Scala `AnyVal`
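
For the simpler types in the list above, no schema argument should be needed. The sketch below displays an `SCollection` of Scala tuples. The data is invented for illustration, and `reduceByKey` is the standard Scio operation on key-value collections.

```scala
%scio
val (sc, args) = ContextAndArgs(argz)

// An SCollection of (String, Int) tuples, summed per key and displayed.
sc.parallelize(Seq(("foo", 1), ("bar", 2), ("foo", 3)))
  .reduceByKey(_ + _)
  .closeAndDisplay()
```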

#### BigQuery example:

```scala
%scio
// Generate the Flights class from the query result schema.
// sum (rather than count) is used so that only delayed departures are counted.
@BigQueryType.fromQuery("""|SELECT departure_airport,sum(case when departure_delay>0 then 1 else 0 end) as no_of_delays
                           |FROM [bigquery-samples:airline_ontime_data.flights]
                           |group by departure_airport
                           |order by 2 desc
                           |limit 10""".stripMargin) class Flights

val (sc, args) = ContextAndArgs(argz)
// Untyped TableRow results are displayed using the generated schema.
sc.bigQuerySelect(Flights.query).closeAndDisplay(Flights.schema)
```

#### BigQuery typed example:

```scala
%scio
// Generate the Flights class from the query result schema.
// sum (rather than count) is used so that only delayed departures are counted.
@BigQueryType.fromQuery("""|SELECT departure_airport,sum(case when departure_delay>0 then 1 else 0 end) as no_of_delays
                           |FROM [bigquery-samples:airline_ontime_data.flights]
                           |group by departure_airport
                           |order by 2 desc
                           |limit 10""".stripMargin) class Flights

val (sc, args) = ContextAndArgs(argz)
// Typed results: average number of delays across the returned airports.
sc.typedBigQuery[Flights]().flatMap(_.no_of_delays).mean.closeAndDisplay()
```

#### Avro example:

```scala
%scio
import com.spotify.data.ExampleAvro

val (sc, args) = ContextAndArgs(argz)
sc.avroFile[ExampleAvro]("gs://<bucket>/tmp/my.avro").take(10).closeAndDisplay()
```

#### Avro example with a view schema:

```scala
%scio
import com.spotify.data.ExampleAvro
import org.apache.avro.Schema

val (sc, args) = ContextAndArgs(argz)
// View schema: display only the track and artist fields.
val view = new Schema.Parser().parse("""{"type":"record","name":"ExampleAvro","namespace":"com.spotify.data","fields":[{"name":"track","type":"string"}, {"name":"artist", "type":"string"}]}""")

sc.avroFile[ExampleAvro]("gs://<bucket>/tmp/my.avro").take(10).closeAndDisplay(view)
```

### Google credentials

The Scio interpreter tries to infer your Google Cloud credentials from its environment. It takes into account:

 * `argz` interpreter settings ([doc](https://github.com/spotify/scio/wiki#options))
 * environment variable (`GOOGLE_APPLICATION_CREDENTIALS`)
 * gcloud configuration
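
When credentials are not picked up as expected, it can help to check what the interpreter process actually sees. The paragraph below is a small sketch in plain Scala (no Scio calls) that reports whether the `GOOGLE_APPLICATION_CREDENTIALS` environment variable is visible. It assumes that `%scio` paragraphs run inside the interpreter's JVM, so `sys.env` reflects that process's environment.

```scala
%scio
// Inspect the environment of the interpreter process.
sys.env.get("GOOGLE_APPLICATION_CREDENTIALS") match {
  case Some(path) =>
    println(s"GOOGLE_APPLICATION_CREDENTIALS points to: $path")
  case None =>
    println("GOOGLE_APPLICATION_CREDENTIALS is not set; credentials would have to come from argz or the gcloud configuration.")
}
```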

#### BigQuery macro credentials

Currently, the BigQuery project for macro expansion is inferred using Google Dataflow's [DefaultProjectFactory().create()](https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/options/GcpOptions.java#L187).