zeppelin/python/README.md
Alex Goodman 438dbca686 ZEPPELIN-1345 - Create a custom matplotlib backend that natively supports inline plotting in a python interpreter cell
### What is this PR for?

This PR is the first of two major steps needed to improve matplotlib integration in Zeppelin (ZEPPELIN-1344). The second step, a plotting backend with fully interactive tools enabled, will follow in a separate PR. This PR is specifically for automatically displaying output from calls to matplotlib plotting functions inline with each paragraph. Thanks to the addition of post-execute hooks (ZEPPELIN-1423), there is no need to call any `show()` function to display an inline plot, just like in Jupyter.
### What type of PR is it?

Improvement
### Todos

The main code has been written and anyone who reads this is encouraged to test it, but there are a few minor todos:
- [x] - Add unit tests
- [x] - Add documentation
- [x] - Add screenshot showing iterative plotting with angular mode
### What is the Jira issue?

[ZEPPELIN-1345](https://issues.apache.org/jira/browse/ZEPPELIN-1345)
### How should this be tested?

In a pyspark or python paragraph, enter and run

``` python
import matplotlib.pyplot as plt
plt.plot([1, 2, 3])
```

The plot should be displayed automatically without calling any `show()` function whatsoever. A special method called `configure_mpl()` can also be used to modify the inline plotting behavior. For example,

``` python
z.configure_mpl(close=False, angular=True)
plt.plot([1, 2, 3])
```

allows for iterative updates to the plot, provided you have Py4J installed for your Python installation (which is always the case if you use pyspark). To clarify, this feature currently works only with pyspark (not python, as there are no `angularBind()` and `angularUnbind()` methods yet). Doing something like:

```
plt.plot([3, 2, 1])
```

will update the plot that was generated by the previous paragraph by leveraging Zeppelin's Angular Display System. However, with `close=False`, matplotlib no longer closes figures automatically, so it is up to the user to explicitly close each figure instance they create. There are quite a few more options for `z.configure_mpl()`, but I will save that discussion for the documentation.
### Screenshots (if appropriate)
![img](http://i.imgur.com/e1xHKnV.gif)

### Questions:
- Do the license files need an update? No
- Are there breaking changes for older versions? No
- Does this need documentation? Yes

Author: Alex Goodman <agoodm@users.noreply.github.com>

Closes #1534 from agoodm/ZEPPELIN-1345 and squashes the following commits:

9ef6ff7 [Alex Goodman] Move mpl backend files to /interpreter
24f89c6 [Alex Goodman] Catch potential NullPointerExceptions from hook registry
bdb584e [Alex Goodman] Make sure expressions are printed when no plots are shown
22b6fe4 [Alex Goodman] Remove unused variable
d3d1aa0 [Alex Goodman] Fix CI test failure
c90d204 [Alex Goodman] Update spark.md
bcf0bf3 [Alex Goodman] Update python.md for new matplotlib integration
c9b65a5 [Alex Goodman] Add iterative plotting example image
8029a05 [Alex Goodman] Update python/README.md
f2d9e86 [Alex Goodman] Exclude tests are excluded in python/pom.xml
86b1c90 [Alex Goodman] Fix tutorial notebook not loading
c37b00f [Alex Goodman] Fix legend in tutorial notebook
a321d79 [Alex Goodman] Update python.md
82350e3 [Alex Goodman] Update matplotlib tutorial notebook
9792f97 [Alex Goodman] Add unit tests
8b9b973 [Alex Goodman] Fix NullPointerExceptions in unit tests
82135ad [Alex Goodman] Removed unused variable
f9c9498 [Alex Goodman] Added support for Angular Display System
edf750a [Alex Goodman] Add new matplotlib backend for python/pyspark interpreters
2016-11-08 07:20:21 -08:00


Overview

Python interpreter for Apache Zeppelin

Architecture

The current interpreter implementation spawns a new system Python process through ProcessBuilder and redirects its stdin/stdout to Zeppelin.

Details

  • UnitTests

To run the full suite of tests, including those that depend on a real Python interpreter and on external libraries being installed (such as Pandas, Pandasql, etc.), run:

mvn -Dpython.test.exclude='' test -pl python -am

  • Py4j support

Py4j enables Python programs to dynamically access Java objects in a JVM. It is required in order to use Zeppelin dynamic forms feature.

  • bootstrap process

The interpreter environment is set up with bootstrap.py. It defines the help() and z convenience functions.
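
Since dynamic forms depend on Py4j, a bootstrap step has to find out whether the library is importable in the spawned Python process. The following is only a minimal sketch of such a check, not the code in bootstrap.py: `module_available` and `bootstrap_scripts` are hypothetical helper names, and the sketch assumes that the form-related script (bootstrap_input.py) is the part that needs Py4j.

```python
import importlib


def module_available(name):
    """Return True if the module `name` can be imported in this process."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False


def bootstrap_scripts():
    """Hypothetical helper: list the bootstrap scripts worth loading.

    Assumption: bootstrap_input.py is only useful when py4j is present.
    """
    scripts = ["bootstrap.py"]
    if module_available("py4j"):
        scripts.append("bootstrap_input.py")
    return scripts


if __name__ == "__main__":
    print(bootstrap_scripts())
```

The try/except form is used instead of `importlib.util.find_spec` so that the sketch runs on both Python 2.7 and Python 3.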

Dev prerequisites

  • Python 2 or 3, with py4j (0.9.2) and matplotlib (1.31 or later) installed on each

  • Tests only check the interpreter logic and do not start any Python process! The Python process is mocked with a class that simply outputs its input.

  • Code written in bootstrap.py and bootstrap_input.py should always be compatible with both Python 2 and 3.

  • Follow PEP 8 conventions for Python code.

Technical overview

  • When the interpreter starts, it launches a Python process through a Java ProcessBuilder. Python is started with the -i (interactive mode) and -u (unbuffered stdin, stdout and stderr) options. The interpreter thus keeps a "sleeping" Python process.

  • The interpreter sends commands to Python with a Java OutputStreamWriter and reads the output from an InputStreamReader. To know when to stop reading stdout, the interpreter sends print "*!?flush reader!?*" after each command and reads stdout until it receives *!?flush reader!?* back.

  • When the interpreter starts, it sends some Python code (bootstrap.py and bootstrap_input.py) to initialize default behavior and functions (help(), z.input()...). bootstrap_input.py is sent only if the py4j library is detected inside the Python process.

  • The Py4J Python and Java libraries are used to load the Zeppelin Input Java class into the Python process (calling Java code from Python code!). The interpreter can therefore create Zeppelin input forms directly inside the Python process (possibly using Python variables that are already defined). The JVM opens a random free port so that it is accessible from the Python process.

  • Java's ProcessBuilder can't send a SIGINT signal to interrupt paragraph execution. The interpreter therefore sends a kill SIGINT PID directly to the Python process to interrupt execution. The Python process catches the SIGINT signal with some code defined in bootstrap.py.

  • Matplotlib figures are displayed inline with the notebook automatically using a built-in backend for zeppelin in conjunction with a post-execute hook.

  • %python.sql support for Pandas DataFrames is optional and is provided by https://github.com/yhat/pandasql if the user has it installed
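
The sentinel-based read loop described above can be illustrated in a few lines. This is only a sketch written in Python rather than the interpreter's Java code: `start_python`, `run_command` and `stop_python` are hypothetical names, the parent plays the role of the Zeppelin interpreter, and only single-line commands are handled. The child is started with -i and -u as in the README, and its stderr (banner and prompts) is discarded.

```python
import subprocess
import sys

# Marker printed after each command so the reader knows when to stop.
SENTINEL = "*!?flush reader!?*"


def start_python():
    """Start an interactive, unbuffered child Python process."""
    return subprocess.Popen(
        [sys.executable, "-i", "-u"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # interactive banner and prompts go to stderr
        text=True,
    )


def run_command(proc, code):
    """Send one single-line command and return what it wrote to stdout."""
    proc.stdin.write(code + "\n")
    proc.stdin.write('print("%s")\n' % SENTINEL)
    proc.stdin.flush()
    lines = []
    while True:
        line = proc.stdout.readline()
        if line == "":  # child exited before printing the sentinel
            break
        line = line.rstrip("\n")
        if line == SENTINEL:
            break
        lines.append(line)
    return "\n".join(lines)


def stop_python(proc):
    """Close stdin so the interactive child sees EOF and exits."""
    proc.stdin.close()
    proc.wait()


if __name__ == "__main__":
    child = start_python()
    print(run_command(child, "print(1 + 2)"))
    stop_python(child)
```

Because the same child process is reused for every command, variables defined by one command remain visible to the next, which is what makes the "sleeping" process useful for notebook paragraphs.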