zeppelin/docs/interpreter/python.md
Alexander Bezzubov d8b54cf76d ZEPPELIN-1115: Python - interpreter for SQL over DataFrame
### What is this PR for?
Add new interpreter to Python group: `%python.sql` for SQL over DataFrame support

### What type of PR is it?
Improvement

### TODOs
* [x] add new interpreter `%python.sql`
* [x] add test
* [x] exclude Python-dependent tests from CI
   * PythonInterpreterWithPythonInstalledTest
   * PythonPandasSqlInterpreterTest
   * run manually by `mvn -Dpython.test.exclude='' test -pl python -am`
* [x] add docs `%python.sql`
* [x] make `%python.sql` fail gracefully in case there is no Pandas or PandaSQL installed
* [x] after #747 is merged - rebase and remove `-Dpython.test.exclude=''` from both profiles

### What is the Jira issue?
[ZEPPELIN-1115](https://issues.apache.org/jira/browse/ZEPPELIN-1115)

### How should this be tested?
`mvn -Dpython.test.exclude='' test -pl python -am` should pass. Alternatively, test manually:
 - Given a DataFrame, e.g.

  ```
%python
import pandas as pd
rates = pd.read_csv("bank.csv", sep=";")
  ```
 - Query it with SQL, e.g.

  ```
%python.sql
SELECT * FROM rates LIMIT 10
  ```

### Screenshots (if appropriate)
![screen shot 2016-07-11 at 23 56 04](https://cloud.githubusercontent.com/assets/5582506/16735171/1ebb9354-47c3-11e6-9354-6364e9374a20.png)

### Questions:
* Do the license files need an update? No, no dependencies were included in the source or binary release
* Are there breaking changes for older versions? No
* Does this need documentation? Yes

Author: Alexander Bezzubov <bzz@apache.org>

Closes #1164 from bzz/ZEPPELIN-1115/python/add-sql-for-dataframes and squashes the following commits:

0f2f852 [Alexander Bezzubov] Fail SQL gracefully if no python dependencies installed
aca2bdf [Alexander Bezzubov] Fix typos in docs 
158ba6a [Alexander Bezzubov] Remove third-party dependant test from CI
5fe46fc [Alexander Bezzubov] Update Python Matplotlib notebook example
72884c8 [Alexander Bezzubov] Add docs for %python.sql feature
e931dc4 [Alexander Bezzubov] Make test for PythonPandasSqlInterpreter usable
76bbb44 [Alexander Bezzubov] Complete implementation of the PythonPandasSqlInterpreter
f6ca1eb [Alexander Bezzubov] Add %python.sql to interpreter menue
11ba490 [Alexander Bezzubov] Add draft implementation of %python.sql for DataFrames
2016-07-15 18:37:18 +09:00


---
layout: page
title: "Python Interpreter"
description: "Python Interpreter"
group: interpreter
---
{% include JB/setup %}
# Python 2 & 3 Interpreter for Apache Zeppelin
<div id="toc"></div>
## Configuration
<table class="table-configuration">
<tr>
<th>Property</th>
<th>Default</th>
<th>Description</th>
</tr>
<tr>
<td>zeppelin.python</td>
<td>python</td>
<td>Path of the already installed Python binary (could be python2 or python3).
If python is not in your $PATH, you can set an absolute path instead (example: /usr/bin/python).
</td>
</tr>
<tr>
<td>zeppelin.python.maxResult</td>
<td>1000</td>
<td>Max number of dataframe rows to display.</td>
</tr>
</table>
## Enabling Python Interpreter
In a notebook, to enable the **Python** interpreter, click on the **Gear** icon and select **Python**.
## Using the Python Interpreter
In a paragraph, use **_%python_** to select the **Python** interpreter and then input all commands.
The interpreter only works if Python is already installed (the interpreter doesn't bring its own Python binaries).
To access the help, type **help()**
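Because the interpreter relies on an existing Python installation, it can help to check which binary a bare name such as `python` resolves to before setting `zeppelin.python`. The following is a minimal sketch using only the standard library; `find_python` is a hypothetical helper name and is not part of Zeppelin.

```python
import shutil


def find_python(binary="python"):
    """Return the absolute path of `binary`, or None if it cannot be found."""
    # shutil.which searches $PATH for bare names and checks explicit paths as given
    return shutil.which(binary)


# "python" is the documented default value of zeppelin.python
path = find_python("python")
if path is None:
    print("python not found on $PATH; set zeppelin.python to an absolute path")
else:
    print("zeppelin.python could point at:", path)
```

Run it with the same user account and environment as the Zeppelin server, since `$PATH` may differ between shells and services.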
## Python modules
The interpreter can use all modules already installed (with pip, easy_install...)
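To see whether the optional modules named on this page (py4j, matplotlib, pandas, pandasql) are visible to the Python binary in use, a short check like the one below may be useful. It is a sketch built on the standard `importlib` module; `is_installed` is a hypothetical helper, not a Zeppelin API.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module named `module_name` can be located."""
    return importlib.util.find_spec(module_name) is not None


for name in ("py4j", "matplotlib", "pandas", "pandasql"):
    if is_installed(name):
        print(name, "-> installed")
    else:
        print(name, "-> missing (try: pip install %s)" % name)
```

Running this inside a `%python` paragraph checks the interpreter's own environment rather than that of your login shell.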
## Using Zeppelin Dynamic Forms
You can leverage [Zeppelin Dynamic Form]({{BASE_PATH}}/manual/dynamicform.html) inside your Python code.
**Zeppelin Dynamic Form can only be used if the py4j Python library is installed on your system. If not, you can install it with `pip install py4j`.**
Example:
```python
%python
### Input form
print(z.input("f1", "defaultValue"))
### Select form (uses its own form name so it does not clash with the input form above)
print(z.select("f2", [("o1", "1"), ("o2", "2")], "2"))
### Checkbox form
print("".join(z.checkbox("f3", [("o1", "1"), ("o2", "2")], ["1"])))
```
## Zeppelin features not fully supported by the Python Interpreter
* Interrupting a paragraph execution (`cancel()` method) is currently supported only on Linux and macOS. If the interpreter runs on another operating system (for instance MS Windows), interrupting a paragraph will close the whole interpreter. A JIRA ticket ([ZEPPELIN-893](https://issues.apache.org/jira/browse/ZEPPELIN-893)) has been opened to implement this feature in a future release of the interpreter.
* Progress bar in the web UI (`getProgress()` method) is currently not implemented.
* Code-completion is currently not implemented.
## Matplotlib integration
The Python interpreter can display matplotlib graphs with the function `z.show()`.
To use this functionality, you need to have the matplotlib module installed and an X server running.
```python
%python
import matplotlib.pyplot as plt
plt.figure()
# ... your plotting code goes here ...
z.show(plt)
plt.close()
```
The `z.show()` function accepts optional parameters to adapt the graph width and height:
```python
%python
z.show(plt, width='50px')
z.show(plt, height='150px')
```
<img class="img-responsive" src="../assets/themes/zeppelin/img/docs-img/pythonMatplotlib.png" />
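To give an intuition for how a figure and the `width`/`height` options could end up in a notebook paragraph, here is a sketch that wraps PNG bytes in a Zeppelin `%html` payload. It is an assumption about one plausible approach, not the interpreter's actual code, and `png_to_html` is a hypothetical helper.

```python
import base64


def png_to_html(png_bytes, width=None, height=None):
    """Build a %html payload that embeds a PNG image as a base64 data URI.

    Hypothetical helper for illustration only.
    """
    encoded = base64.b64encode(png_bytes).decode("ascii")
    styles = []
    if width:
        styles.append("width:{}".format(width))
    if height:
        styles.append("height:{}".format(height))
    style_attr = " style='{}'".format(";".join(styles)) if styles else ""
    return "%html <img src='data:image/png;base64,{}'{} />".format(encoded, style_attr)


# Only the PNG signature bytes; stands in for a real rendered figure
fake_png = b"\x89PNG\r\n\x1a\n"
print(png_to_html(fake_png, width="50px"))
```

With matplotlib available, real image bytes could be obtained by saving the figure into an `io.BytesIO` buffer with `plt.savefig(buf, format="png")` and passing `buf.getvalue()` to the helper.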
## Pandas integration
Apache Zeppelin [Table Display System]({{BASE_PATH}}/displaysystem/basicdisplaysystem.html#table) provides built-in data visualization capabilities. The Python interpreter leverages it to visualize Pandas DataFrames through the same `z.show()` API that is used for [Matplotlib integration](#matplotlib-integration).
Example:
```python
import pandas as pd
rates = pd.read_csv("bank.csv", sep=";")
z.show(rates)
```
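The Table Display System consumes text that starts with `%table`, with tab-separated cells and one row per line. The sketch below shows how tabular data could be turned into that format while capping the row count, in the spirit of `zeppelin.python.maxResult`. It illustrates the format under those assumptions and is not the interpreter's implementation; `to_zeppelin_table` is a hypothetical name and the sample rows are made up.

```python
def to_zeppelin_table(columns, rows, max_rows=1000):
    """Render rows as %table text: tab-separated cells, one row per line."""
    lines = ["\t".join(str(c) for c in columns)]
    # Emit at most max_rows data rows (1000 mirrors the documented default)
    for row in rows[:max_rows]:
        lines.append("\t".join(str(cell) for cell in row))
    return "%table " + "\n".join(lines)


# Made-up sample rows, for illustration only
sample_rows = [[30, "admin"], [45, "technician"]]
print(to_zeppelin_table(["age", "job"], sample_rows, max_rows=1))
```

For a DataFrame `df`, the equivalent call would be `to_zeppelin_table(df.columns, df.values.tolist())`.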
## SQL over Pandas DataFrames
The convenience `%python.sql` interpreter matches the Apache Spark experience in Zeppelin: it lets you query [Pandas DataFrames](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) with SQL and visualize the results through the built-in [Table Display System]({{BASE_PATH}}/displaysystem/basicdisplaysystem.html#table).

**Prerequisites**

- Pandas `pip install pandas`
- PandaSQL `pip install -U pandasql`

If the default bound interpreter is Python (first in the interpreter list, under the _Gear Icon_), you can use it simply as `%sql`, for example:
- first paragraph
```python
import pandas as pd
rates = pd.read_csv("bank.csv", sep=";")
```
- next paragraph
```sql
%sql
SELECT * FROM rates WHERE age < 40
```
Otherwise, refer to it as `%python.sql`.
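To illustrate what running SQL over in-memory tabular data involves, the sketch below loads a few rows into an in-memory SQLite table and queries them. PandaSQL is, to my knowledge, backed by SQLite, but this standard-library stand-in is an assumption-laden illustration and not how `%python.sql` is implemented. `query_rows` is a hypothetical helper, and the rows are made up rather than taken from `bank.csv`.

```python
import sqlite3


def query_rows(table_name, columns, rows, sql):
    """Load rows into an in-memory SQLite table and run `sql` against it."""
    conn = sqlite3.connect(":memory:")
    try:
        column_list = ", ".join(columns)
        placeholders = ", ".join("?" for _ in columns)
        # Table and column names are trusted here; never build SQL from user input like this
        conn.execute("CREATE TABLE {} ({})".format(table_name, column_list))
        conn.executemany(
            "INSERT INTO {} VALUES ({})".format(table_name, placeholders), rows
        )
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


# Made-up rows standing in for the `rates` DataFrame
rates_rows = [(30, "admin"), (45, "technician"), (38, "services")]
result = query_rows(
    "rates", ["age", "job"], rates_rows, "SELECT * FROM rates WHERE age < 40"
)
print(result)
```

With Pandas and PandaSQL installed, a comparable query in a plain `%python` paragraph could be written as `sqldf("SELECT * FROM rates WHERE age < 40", globals())` after `from pandasql import sqldf`.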
## Technical description
For in-depth technical details on the current implementation, please refer to [python/README.md](https://github.com/apache/zeppelin/blob/master/python/README.md).