### What is this PR for?
Based on #338, I refactored most of the Pig interpreter, as I don't think the approach in #338 is the best one. In #338, the script `bin/pig` is used to launch the Pig script, which makes the job difficult to control (it is hard to kill the job and to get progress and stats info). In this PR, I use the Pig API to launch the Pig script. Besides that, I implemented another interpreter type, `%pig.query`, to leverage the display system of Zeppelin. For details, see `pig.md`.

### What type of PR is it?
[Feature]

### Todos
* Syntax highlighting
* New interpreter type `%pig.udf`, so that users can write Pig UDFs in Zeppelin directly and don't need to build the UDF jar manually.

### What is the Jira issue?
* https://issues.apache.org/jira/browse/ZEPPELIN-335

### How should this be tested?
Unit tests are added, and manual testing has also been done.

### Screenshots (if appropriate)

### Questions:
* Do the license files need an update? No
* Are there breaking changes for older versions? No
* Does this need documentation? No

Author: Jeff Zhang <zjffdu@apache.org>
Author: Ali Bajwa <abajwa@hortonworks.com>
Author: AhyoungRyu <ahyoungryu@apache.org>
Author: Jeff Zhang <zjffdu@gmail.com>

Closes #1476 from zjffdu/ZEPPELIN-335 and squashes the following commits:

* 73a07f0 [Jeff Zhang] minor update
* a1b742b [Jeff Zhang] minor update on doc
* e858301 [Jeff Zhang] address comments
* c85a090 [Jeff Zhang] add license
* 58b4b2f [Jeff Zhang] minor update of docs
* 1ae7db2 [Jeff Zhang] Merge pull request #2 from AhyoungRyu/ZEPPELIN-335/docs
* fe014a7 [AhyoungRyu] Fix docs title in front matter
* df7a6db [AhyoungRyu] Add pig.md to dropdown menu
* 5e2e222 [AhyoungRyu] Minor update for pig.md
* 39f161a [Jeff Zhang] address comments
* 05a3b9b [Jeff Zhang] add pig.md
* a09a7f7 [Jeff Zhang] refactor pig Interpreter
* c28beb5 [Ali Bajwa] Updated based on comments:
  1. Documentation: added pig.md with interpreter documentation and added pig entry to index.md
  2. Added JUnit test based on the passwd file parsing example here: https://pig.apache.org/docs/r0.10.0/start.html#run
  3. Removed author tag from comment (this was copied from shell interpreter https://github.com/apache/incubator-zeppelin/blob/master/shell/src/main/java/org/apache/zeppelin/shell/ShellInterpreter.java#L42)
  4. Implemented cancel functionality
  5. Display output stream in case of error
* 2586336 [Ali Bajwa] exposed timeout and pig executable via interpreter and added comments
* 7abad20 [Ali Bajwa] initial commit of pig interpreter
| layout | title | description | group |
|---|---|---|---|
| page | Pig Interpreter for Apache Zeppelin | Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. | manual |
{% include JB/setup %}
Pig Interpreter for Apache Zeppelin
Overview
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
Supported interpreter type
- `%pig.script` (default)

  All Pig scripts can run in this type of interpreter, and the display type is plain text.
- `%pig.query`

  Almost the same as `%pig.script`. The only difference is that you don't need to add an alias in the last statement. The display type is table.
Supported runtime mode
- Local
- MapReduce
- Tez (Only Tez 0.7 is supported)
How to use
How to setup Pig
- Local Mode

  Nothing needs to be done for local mode.
- MapReduce Mode

  `HADOOP_CONF_DIR` needs to be specified in `ZEPPELIN_HOME/conf/zeppelin-env.sh`.
- Tez Mode

  `HADOOP_CONF_DIR` and `TEZ_CONF_DIR` need to be specified in `ZEPPELIN_HOME/conf/zeppelin-env.sh`.
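As a minimal sketch of the setup described above, the lines below show what the additions to `ZEPPELIN_HOME/conf/zeppelin-env.sh` might look like. The two directory paths are assumptions chosen for illustration only; substitute the locations of the Hadoop and Tez configuration directories on your own cluster.

```shell
# Sketch of additions to ZEPPELIN_HOME/conf/zeppelin-env.sh.
# Both paths are assumed example locations, not defaults shipped with Zeppelin.

# Needed for MapReduce mode and for Tez mode.
export HADOOP_CONF_DIR="/etc/hadoop/conf"

# Additionally needed for Tez mode.
export TEZ_CONF_DIR="/etc/tez/conf"
```

Local mode needs neither variable, so both lines can be left out when `zeppelin.pig.execType` is `local`.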
How to configure interpreter
In the Interpreters menu, create a new Pig interpreter. The Pig interpreter has the following properties by default.
| Property | Default | Description |
|---|---|---|
| zeppelin.pig.execType | mapreduce | Execution mode for the Pig runtime: `local`, `mapreduce`, or `tez` |
| zeppelin.pig.includeJobStats | false | Whether to display jobStats info in `%pig.script` |
| zeppelin.pig.maxResult | 1000 | Maximum number of rows displayed in `%pig.query` |
Example
pig
%pig
raw_data = load 'dataset/sf_crime/train.csv' using PigStorage(',') as (Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y);
b = group raw_data all;
c = foreach b generate COUNT($1);
dump c;
pig.query
%pig.query
b = foreach raw_data generate Category;
c = group b by Category;
foreach c generate group as category, COUNT($1) as count;
Data is shared between `%pig` and `%pig.query`, so you can do common work (such as loading `raw_data`) in `%pig` and then run different kinds of queries on that data in `%pig.query`.