update pig tutorial

Jeff Zhang 2017-02-24 10:05:09 +08:00
parent b8542de569
commit c6cb5ffa5b
2 changed files with 41 additions and 13 deletions

Binary file not shown.

@ -15,14 +15,16 @@ group: manual
[Apache Pig](https://pig.apache.org/) is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
## Supported interpreter type
- `%pig.script` (default)
- `%pig.script` (default pig interpreter, so you can use `%pig`)
Any Pig script can run in this type of interpreter, and the display type is plain text.
`%pig.script` is like the Pig grunt shell. Anything you can run in the grunt shell can be run in the `%pig.script` interpreter. It is used for running Pig scripts where you don't need to visualize the data, which makes it suitable for data munging.
- `%pig.query`
Almost the same as `%pig.script`. The only difference is that you don't need to add an alias in the last statement. And the display type is table.
`%pig.query` is a little different from `%pig.script`. It is used for exploratory data analysis via Pig Latin, where you can leverage Zeppelin's visualization ability. There are two minor differences in the last statement between `%pig.script` and `%pig.query`:
- No Pig alias in the last statement in `%pig.query` (read the examples below).
- The last statement must be on a single line in `%pig.query`.
## Supported runtime mode
- Local
- MapReduce
@ -95,24 +97,50 @@ Besides, we use paragraph title as job name if it exists, else use the last line
```
%pig
raw_data = load 'dataset/sf_crime/train.csv' using PigStorage(',') as (Dates,Category,Descript,DayOfWeek,PdDistrict,Resolution,Address,X,Y);
b = group raw_data all;
c = foreach b generate COUNT($1);
dump c;
bankText = load 'bank.csv' using PigStorage(';');
bank = foreach bankText generate $0 as age, $1 as job, $2 as marital, $3 as education, $5 as balance;
bank = filter bank by age != '"age"';
bank = foreach bank generate (int)age, REPLACE(job,'"','') as job, REPLACE(marital, '"', '') as marital, (int)(REPLACE(balance, '"', '')) as balance;
store bank into 'clean_bank.csv' using PigStorage(';'); -- this statement is optional; it just shows that %pig.script is mostly used for data munging before querying the data.
```
##### pig.query
Get the number of records for each age, where age is less than 30
```
%pig.query
b = foreach raw_data generate Category;
c = group b by Category;
foreach c generate group as category, COUNT($1) as count;
bank_data = filter bank by age < 30;
b = group bank_data by age;
foreach b generate group, COUNT($1) as cou;
```
The same as above, but using a dynamic text form so that the user can specify the variable `maxAge` in a textbox (see the screenshot below). Dynamic forms are a very useful feature of Zeppelin; refer to this [link](../manual/dynamicform.html) for details.
```
%pig.query
bank_data = filter bank by age < ${maxAge=40};
b = group bank_data by age;
foreach b generate group, COUNT($1);
```
Get the number of records for each age for a specific marital type, again using a dynamic form. The user can choose the marital type in the dropdown list (see the screenshot below).
```
%pig.query
bank_data = filter bank by marital=='${marital=single,single|divorced|married}';
b = group bank_data by age;
foreach b generate group, COUNT($1);
```
The above examples are in the Pig tutorial note in Zeppelin; you can check it for details. Here's the screenshot.
<img class="img-responsive" width="1024px" style="margin:0 auto; padding: 26px;" src="../assets/themes/zeppelin/img/pig_zeppelin_tutorial.png" />
Data is shared between `%pig` and `%pig.query`, so you can do some common work in `%pig` and then run different kinds of queries based on the data of `%pig`.
Besides, we recommend specifying aliases explicitly so that the visualization can display the column names correctly. Here, we name `COUNT($1)` as `count`.
If you don't specify an alias, the column is named by position, so `col_1` would represent `COUNT($1)`.
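As a minimal sketch of the alias recommendation, the query below names both output columns explicitly. It assumes the `bank` relation has already been defined in an earlier `%pig` paragraph; the column names `age` and `total` are illustrative choices, not names required by the interpreter.

```
%pig.query
bank_data = filter bank by age < 30;
b = group bank_data by age;
foreach b generate group as age, COUNT($1) as total;
```

With these aliases the table and charts should show `age` and `total` as column names rather than position-based names.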
There's a Pig tutorial note in Zeppelin for your reference. We also have a wiki page with more details on running Pig in Zeppelin: https://cwiki.apache.org/confluence/display/ZEPPELIN/Running+Pig+in+Apache+Zeppelin