zeppelin/python
Alex Goodman 4ac577f711 [ZEPPELIN-1639] Add tests with external python dependencies to CI build
### What is this PR for?
Take 2 of #1618 because I had some earlier problems with rebasing. Since, then I have added some new features, namely:

- Matplotlib integration tests for pyspark
- `install_external_dependencies.sh` which conditionally installs the R or python dependencies based on the specified build profile in `.travis.yml`. This saves several minutes of time for a few of the build profiles since the R dependencies are compiled from source and therefore take quite a bit of time to install.
- The extra python unit tests which require external dependencies (`matplotlib` and `pandas`) are now relegated to two separate build profiles. This is done primarily to efficiently test both Python 2 and 3.
- Some minor bugs in the python and pyspark interpreters (mostly with respect to python 3 compatibility) were caught as a result of these tests, and are also fixed in this PR.

### What type of PR is it?
Improvement and Bugfix

### What is the Jira issue?
[ZEPPELIN-1639](https://issues.apache.org/jira/browse/ZEPPELIN-1639)

### How should this be tested?
CI tests should be green!

### Questions:
* Does the licenses files need update? No
* Is there breaking changes for older versions? No
* Does this needs documentation? No

Author: Alex Goodman <agoodm@users.noreply.github.com>

Closes #1632 from agoodm/ZEPPELIN-1639 and squashes the following commits:

01380c2 [Alex Goodman] Make sure python 3 profile uses scala 2.11
363019e [Alex Goodman] Use spark 2.0 with python 3
a4f43af [Alex Goodman] Update comments in .travis.yml
5a60181 [Alex Goodman] Isolate python tests
73663f6 [Alex Goodman] Update tests for new InterpreterContext constructor
5709c5d [Alex Goodman] Re-add pyspark to build profile
ee95d67 [Alex Goodman] Move python 3 tests to all modules with spark 2.0
3a76958 [Alex Goodman] Travis
42da31c [Alex Goodman] Shorten python version
b6b88be [Alex Goodman] Add python dependencies to .travis.yml
2016-11-29 17:00:07 +09:00
..
src [ZEPPELIN-1639] Add tests with external python dependencies to CI build 2016-11-29 17:00:07 +09:00
pom.xml ZEPPELIN-1345 - Create a custom matplotlib backend that natively supports inline plotting in a python interpreter cell 2016-11-08 07:20:21 -08:00
README.md ZEPPELIN-1345 - Create a custom matplotlib backend that natively supports inline plotting in a python interpreter cell 2016-11-08 07:20:21 -08:00

Overview

Python interpreter for Apache Zeppelin

Architecture

Current interpreter implementation spawns new system python process through ProcessBuilder and re-directs it's stdin\strout to Zeppelin

Details

  • UnitTests

To run full suit of tests, including ones that depend on real Python interpreter AND external libraries installed (like Pandas, Pandasql, etc) do

mvn -Dpython.test.exclude='' test -pl python -am
  • Py4j support

Py4j enables Python programs to dynamically access Java objects in a JVM. It is required in order to use Zeppelin dynamic forms feature.

  • bootstrap process

Interpreter environment is setup with thex bootstrap.py It defines help() and z convenience functions

Dev prerequisites

  • Python 2 or 3 installed with py4j (0.9.2) and matplotlib (1.31 or later) installed on each

  • Tests only checks the interpreter logic and starts any Python process! Python process is mocked with a class that simply output it input.

  • Code wrote in bootstrap.py and bootstrap_input.py should always be Python 2 and 3 compliant.

  • Use PEP8 convention for python code.

Technical overview

  • When interpreter is starting it launches a python process inside a Java ProcessBuilder. Python is started with -i (interactive mode) and -u (unbuffered stdin, stdout and stderr) options. Thus the interpreter has a "sleeping" python process.

  • Interpreter sends command to python with a Java outputStreamWiter and read from an InputStreamReader. To know when stop reading stdout, interpreter sends print "*!?flush reader!?*"after each command and reads stdout until he receives back the *!?flush reader!?*.

  • When interpreter is starting, it sends some Python code (bootstrap.py and bootstrap_input.py) to initialize default behavior and functions (help(), z.input()...). bootstrap_input.py is sent only if py4j library is detected inside Python process.

  • Py4J python and java libraries is used to load Input zeppelin Java class into the python process (make java code with python code !). Therefore the interpreter can directly create Zeppelin input form inside the Python process (and eventually with some python variable already defined). JVM opens a random open port to be accessible from python process.

  • JavaBuilder can't send SIGINT signal to interrupt paragraph execution. Therefore interpreter directly send a kill SIGINT PID to python process to interrupt execution. Python process catch SIGINT signal with some code defined in bootstrap.py

  • Matplotlib figures are displayed inline with the notebook automatically using a built-in backend for zeppelin in conjunction with a post-execute hook.

  • %python.sql support for Pandas DataFrames is optional and provided using https://github.com/yhat/pandasql if user have one installed