# Overview
Python interpreter for Apache Zeppelin
# Architecture
The current interpreter implementation spawns a new system Python process through `ProcessBuilder` and redirects its stdin/stdout to Zeppelin.
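As a rough analogy only (the interpreter itself does this from Java with `ProcessBuilder`), the same spawn-and-redirect idea can be sketched with Python's standard `subprocess` module. The helper name `run_in_child` is hypothetical and exists only for this illustration.

```python
import subprocess
import sys


def run_in_child(code):
    """Run `code` in a fresh Python process and return what it wrote to stdout.

    Hypothetical helper: an analogy for the Java ProcessBuilder approach,
    not Zeppelin's actual implementation.
    """
    result = subprocess.run(
        [sys.executable, "-u", "-"],  # "-" tells Python to read the program from stdin
        input=code,                   # the parent feeds the child's stdin
        capture_output=True,          # the parent captures the child's stdout/stderr
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(run_in_child("print(1 + 1)"))
```

Unlike this one-shot sketch, the interpreter keeps the child process alive between paragraphs, as described in the technical overview.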
# Details
- **Unit tests**
To run the full suite of tests, including those that depend on a real Python interpreter and on external libraries being installed (such as Pandas and Pandasql), run:
```
./mvnw -Dpython.test.exclude='' test -pl python -am
```
- **Py4j support**
[Py4j](https://www.py4j.org/) enables Python programs to dynamically access Java objects in a JVM.
It is required in order to use the Zeppelin [dynamic forms](https://zeppelin.apache.org/docs/latest/manual/dynamicform.html) feature.
### Dev prerequisites
* Python 2 or 3, each with py4j (0.9.2) and matplotlib (1.31 or later) installed.
* Tests only check the interpreter logic and do not start any Python process. The Python process is mocked with a class that simply outputs its input.
* Code written in `bootstrap.py` and `bootstrap_input.py` should always be compatible with both Python 2 and 3.
* Follow the PEP 8 conventions for Python code.
### Technical overview
* When the interpreter starts, it launches a Python process through a Java `ProcessBuilder`. Python is started with the `-i` (interactive mode) and `-u` (unbuffered stdin, stdout and stderr) options, so the interpreter holds a "sleeping" Python process.
* The interpreter sends commands to Python with a Java `OutputStreamWriter` and reads the results from an `InputStreamReader`. To know when to stop reading stdout, the interpreter sends `print "*!?flush reader!?*"` after each command and reads stdout until it receives `*!?flush reader!?*` back.
* When the interpreter starts, it sends some Python code (`bootstrap.py` and `bootstrap_input.py`) to initialize default behavior and functions (`help()`, `z.input()`, ...). `bootstrap_input.py` is sent only if the py4j library is detected inside the Python process.
* The [Py4J](https://www.py4j.org/) Python and Java libraries are used to load the Zeppelin input Java class into the Python process (calling Java code from Python code). The interpreter can therefore create a Zeppelin input form directly inside the Python process, optionally using Python variables that are already defined. The JVM opens a random free port so that it is reachable from the Python process.
* Java's `ProcessBuilder` cannot send a SIGINT signal to interrupt paragraph execution. The interpreter therefore sends a `kill SIGINT PID` directly to the Python process to interrupt execution. The Python process catches the SIGINT signal with code defined in `bootstrap.py`.
* Matplotlib figures are displayed inline with the notebook automatically, using a built-in Zeppelin backend in conjunction with a post-execute hook.
* `%python.sql` support for Pandas DataFrames is optional. It relies on pandasql, which can be downloaded from [here](https://github.com/yhat/pandasql) if the user does not have it installed.
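
The "sleeping" process and the flush-marker handshake described in the list above can be sketched roughly as follows. This is a minimal illustration written with Python's `subprocess` module rather than the Java streams the interpreter uses. The class name `SleepingPython` is hypothetical, the marker is printed with Python 3 `print()` syntax, and stderr (prompts and tracebacks) is simply discarded.

```python
import subprocess
import sys

# Marker taken from the description above; everything else is illustrative.
SENTINEL = "*!?flush reader!?*"


class SleepingPython:
    """Hypothetical wrapper around a long-lived `python -i -u` child process."""

    def __init__(self):
        self.proc = subprocess.Popen(
            [sys.executable, "-i", "-u"],  # interactive and unbuffered
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,     # prompts and tracebacks are dropped in this sketch
            text=True,
        )

    def run(self, statement):
        """Send one single-line statement and return the stdout it produced."""
        self.proc.stdin.write(statement + "\n")
        # Ask the child to echo the marker so we know where the output ends.
        self.proc.stdin.write("print(%r)\n" % SENTINEL)
        self.proc.stdin.flush()
        lines = []
        while True:
            line = self.proc.stdout.readline()
            if not line:                   # child exited unexpectedly
                break
            line = line.rstrip("\n")
            if line == SENTINEL:           # marker received: stop reading
                break
            lines.append(line)
        return "\n".join(lines)

    def close(self):
        self.proc.stdin.close()            # EOF on stdin ends the interactive session
        self.proc.wait()


if __name__ == "__main__":
    py = SleepingPython()
    py.run("x = 20 + 1")
    print(py.run("print(x * 2)"))
    py.close()
```

State persists between calls because the same child process handles every statement. The sketch only handles single-line statements; multi-line blocks would need a terminating blank line in interactive mode.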
# IPython Overview
IPython interpreter for Apache Zeppelin
# IPython Requirements
You need to install the following python packages to make the IPython interpreter work.
* jupyter 5.x
* IPython
* ipykernel
* grpcio
If you have installed Anaconda, then you only need to install grpcio.
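
A quick way to check whether these packages are importable is sketched below. The module names are assumptions: `grpc` is the import name of the `grpcio` package, and `jupyter_client` is checked in place of the `jupyter` metapackage. The function name `missing_modules` is hypothetical.

```python
import importlib.util

# Import names assumed for the packages listed above.
REQUIRED_MODULES = ("IPython", "ipykernel", "jupyter_client", "grpc")


def missing_modules(names=REQUIRED_MODULES):
    """Return the names from `names` that cannot be found on this Python's path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All required modules were found.")
```

This only checks that the modules can be located; it does not verify versions such as jupyter 5.x.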
# IPython Architecture
The current interpreter delegates all of its work to the IPython kernel via `jupyter_client`. Zeppelin launches a Python process which hosts the IPython kernel.
The Zeppelin interpreter process communicates with the Python process via `grpc`. Ideally, every feature that works in IPython should work in Zeppelin as well.