Python, R, Scala and Julia in one Notebook
Use R, Julia, Scala or Python? The answer is: yes! How do you create a versatile environment in which different languages are available and able to communicate with each other, without changing the program you work in, and in which data can be passed between the structures characteristic of each language? This post will show you how to do it with four of the most powerful languages used in data science: Python, R, Scala and Julia.
Introduction
It is hard to choose the best language for data analysis, especially if you are a beginner and do not want to go into details about the strengths and weaknesses of particular solutions. Should I choose R or Python (2 or 3)? Maybe Julia would be faster? What should I use to work with large data sets? My answer is: use what you can! Take the best you can from several languages and make it work. This post will show you how.
Use what you can! Take the best you can!
I have chosen four of the most powerful languages used in data science and big data analysis. Together they should give you the broadest span of accessible technologies and speed up development and data analysis. Current versions of the chosen software:
Language | Version
---|---
Python 2 | 2.7.13
Python 3 | 3.6.1
R | 3.4.0
Julia | 0.7.0
Scala | 2.12.1
Procedure
I will follow this procedure to prepare the work environment:
- installing the Python interpreter
- installing the R language
- installing the Julia compiler
- installing Scala and the sbt build tool
- installing additional Jupyter kernels
So my goal is to install Python, Scala, Julia and R on a working machine. Additionally, to run notebooks in languages other than Python, I need to install specific middleware called kernels.
First let’s grab the necessary tools - compilers and interpreters for each language (linked in the table above). All languages presented here are multi-platform and can be installed on Windows, Linux and Mac OS machines. Since my working system is Fedora 25, I will describe the install procedures for this OS. Debian-based Linux distributions do not differ much from the Red Hat family (except for a different package manager and repositories). Windows versions have convenient installers, and the installation process is trivial. Where necessary, I will place a link to the Windows installer in each section.
Python interpreter
In nearly all Linux distributions Python is available “out of the box”. Unfortunately, in many cases the default Python is still the 2.7 branch. It should be mentioned that 2.7 is very old and is slowly being deprecated. It is the last supported branch of the 2.x family, and its support will end in 2020 (see: PEP 373). There will be no official bug fixes after that date. Additionally, most currently used libraries already run on Python 3.5 or later. Unless you have some obscure dependency, there is no excuse not to use Python 3. The older version is included here just to demonstrate how to manage different Python versions in isolated environments.
There is no excuse not to use Python 3 anymore. Grab it! Use it!
I assume you have basic knowledge of the Python flavors available today and of their strengths and weaknesses. In this tutorial I will use Python 3 running on a Fedora 25 workstation.
Moreover, I have chosen a specific distribution of Python called Anaconda, prepared by Continuum Analytics. It is the most comprehensive free bundle of Python software dedicated to data science.
I strongly recommend using the Anaconda distribution, which will install the Python interpreter, the Jupyter Notebook, and several other packages commonly used in data science and in this tutorial. If you choose Anaconda 3, your interpreter will be version 3.6 (the current version) or higher (3.7 alpha is already available).
I will just follow the instructions from the installation page and simply execute the downloaded script:
The script contains all binaries and weighs nearly 475 MB. After executing it, you should see the Anaconda installer:
An automatic process will guide you through the installation. You don’t have to run the installer as root, unless you want Anaconda to be installed for all users. In that case you need sudo privileges to install it in some globally accessible location like `/opt/anaconda3` and to append the location of the interpreter to `/etc/profile`: `export PATH=/opt/anaconda3/bin:$PATH`. This will change Python system-wide. If you run the installer with normal user privileges, it will choose your user’s home directory and place Anaconda in `/home/<user>/anaconda3`. The installer will install a bunch of Python packages, including MKL (Math Kernel Library) optimizations, numpy, pandas, matplotlib, scikit-learn and Jupyter - just to name a few. Nearly 200 packages are grouped together to make your life easier. To get a rough count of the installed packages, execute this command: `conda list | wc -l`. Anyway, let’s continue with the installation…
Answering `yes` will make Anaconda’s Python your default interpreter. The installer may be a little bit outdated, so restart the terminal, or source `.bashrc`, and upgrade Python. You can update all installed packages with a single command, `conda update --all`, or just the interpreter itself with `conda update python`.
After that you should have a brand new Python 3.6.1 as your main interpreter. To confirm it, type `conda info` or simply try to run the python interpreter in your terminal.
The `conda info` command should display detailed information about the current Anaconda installation:
Running python should bring up the interpreter’s REPL:
Windows users just have to follow the simple next-next installer prepared by Continuum. It can be downloaded here.
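Whichever installer you used, a quick check from inside Python can confirm which interpreter is actually active. The snippet below is a minimal sketch; the helper name `describe_interpreter` is made up for this illustration, and the expectation that the path points into an `anaconda3` directory is an assumption about a default Anaconda install.

```python
import sys


def describe_interpreter():
    """Return the (major, minor, micro) version and the path of the running interpreter."""
    version = tuple(sys.version_info[:3])
    return version, sys.executable


version, executable = describe_interpreter()
print("Python %d.%d.%d" % version)
# With Anaconda as the default interpreter, this path is expected to point
# somewhere inside the anaconda3 directory (assumption, not guaranteed).
print("Executable:", executable)
```

If the printed path points to a system location such as `/usr/bin`, the Anaconda directory is most likely not first on your `PATH`.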
R language
Fedora users can install R from the standard Fedora repository using:
The R RPM is a meta package, which will install the following components:
- R-core: user RPM
- R-core-devel: developer RPM containing header files
- R-java: RPM that ensures R is configured for use with Java
- libRmath: standalone R math library
- libRmath-devel: header file for the standalone R math library
According to the CRAN manual, Fedora also requires the developer versions, which contain the header files necessary to properly install other R packages and to compile them from source. Windows users have a simple installer at their disposal.
Julia compiler
A Copr repository is provided for Fedora users. To install Julia just run:
The bleeding edge version of Julia is held in a separate repository, which can be added with this command: `sudo dnf copr enable nalimilan/julia-nightlies`. I decided to install the nightly Julia release. Adding either of the repositories mentioned above results in:
After that you can easily install a specific build of the Julia compiler:
To verify the installation, either type `julia --version` or simply try to run the Julia REPL:
Scala / sbt
Scala compiles to Java bytecode, which runs on the Java Virtual Machine. Therefore, before using it, a JRE or JDK must be installed on the system.
Install Oracle Java JDK/JRE
Download the JRE or JDK from the Oracle website. Select one of the available options (e.g. download JRE -> `jre-8u131-linux-x64.rpm` or download JDK -> `jdk-8u131-linux-x64.rpm`). What is the difference? JRE stands for Java Runtime Environment, which covers all end-user needs when it comes to running software written in Java. JDK (Java Development Kit) is the developer’s environment, containing the JRE plus additional tools that support developing and debugging Java programs. Installation must be done with administrator privileges.
… or:
Please remember to change the paths to the downloaded files. Windows users should choose the `msi` installer and follow the Java installer instructions.
This step may be skipped on “nix” operating systems, since `dnf` or `apt` will manage dependencies for you and install an appropriate OpenJDK environment during Scala installation.
Install Scala
Installing Scala is straightforward. Windows users have a typical installer. On Fedora and other “nix” systems, Scala is part of the official repository and can be installed with a single command:
When the process is finished, you can check your installation simply by trying to use the Scala REPL:
It works! Perfect!
You may notice that Scala from this repo is a little bit outdated. Don’t worry. It will be updated when used with the sbt build tool.
When you write small programs which consist of only one or two source files, it is easy enough to compile them by typing `scalac MyProgram.scala` in the terminal. `scalac` is the name of the Scala compiler program. But when a project gets bigger, with dozens or maybe even hundreds of source files, it becomes too tedious to compile all those source files manually. You start to think: “There must be a better way.”
C/C++ programmers use the `make` tool. For programs compiling to Java bytecode there is `sbt`. It is a general-purpose build tool written in Scala. In the next section I will describe its installation.
Install sbt
As I mentioned before, manual compilation of a complex project with hundreds of files is nearly impossible, not to mention managing all the dependencies those files require. Build tools such as `sbt` (Simple Build Tool) can automate this time-consuming and tedious process. sbt will manage compiling all source files and dependencies for you. This means that if you need to use libraries written by others, sbt can automatically download the right versions of those libraries and include them in your project. Moreover, you can write automatic unit tests, which can also be run by the build tool. sbt will also provide some boilerplate code to automate starting new projects in Scala.
As usual, Windows users will use the sbt installer. “Nix” users will have to add the official repository to their system. sbt binaries are published to the Bintray repository, and conveniently Bintray provides an RPM variant for Fedora. Superuser privileges are required.
Once the Bintray repository is added, you can install the latest `sbt` using the package manager:
sbt is a tool that runs from the project folder. When it starts, it will try to read the `build.sbt` file, which should contain information about the project. You can also use the sbt project creator to start a new project. To test sbt, I will create a test directory and run the project creator:
sbt created a nice basic project structure and filled it with boilerplate code. The main project folder is `hello` and contains two important pieces: the `build.sbt` file and the `src/` folder with your Scala program. If you enter the folder and run sbt from it, you should be taken to the sbt shell, which will allow you to compile, run or test your code.
The `>` sign is the sbt shell prompt. You can run Scala from it simply by typing `console`:
Type `Ctrl+D` (`Ctrl+Z` on Windows) to exit the Scala REPL and go back to the sbt shell.
It is also possible to `run` and `test` your code:
Kernels installation
Jupyter Notebook is a fantastic tool that allows my favorite programming style: prototype-driven development. The existence of a REPL (Read-Eval-Print Loop) in all installed languages can be used to instantly test our code. We can also test our tests. This makes commits cleaner and faster. After all, it is much easier to control very small chunks of code and work on them interactively. At first the IPython notebook made this possible with Python, but later the Jupyter project began to live its own life and extended the functionality by adding more languages it could “manage”. This is done by specific middleware called a “kernel”. There are nearly 100 different kernels now (see here). Let’s get the Big Four.
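To make the notion of a kernel a bit more concrete: Jupyter learns about each kernel from a small `kernel.json` file stored in a per-kernel directory. The sketch below writes such a file into a temporary directory, so nothing on your system is touched. The helper name `write_kernel_spec`, the kernel name `python2` and the interpreter path are hypothetical, and the exact `argv` that a given ipykernel version writes may differ.

```python
import json
import os
import tempfile


def write_kernel_spec(kernels_dir, name, argv, display_name, language):
    """Write <kernels_dir>/<name>/kernel.json and return the path of the file."""
    spec_dir = os.path.join(kernels_dir, name)
    os.makedirs(spec_dir, exist_ok=True)
    spec = {"argv": argv, "display_name": display_name, "language": language}
    spec_path = os.path.join(spec_dir, "kernel.json")
    with open(spec_path, "w") as handle:
        json.dump(spec, handle, indent=2)
    return spec_path


# Hypothetical interpreter path; {connection_file} is a placeholder that
# Jupyter replaces with the real connection file when it starts the kernel.
demo_dir = tempfile.mkdtemp()
spec_path = write_kernel_spec(
    demo_dir,
    "python2",
    ["/home/user/anaconda3/envs/py27/bin/python",
     "-m", "ipykernel_launcher", "-f", "{connection_file}"],
    "Python 2",
    "python",
)

with open(spec_path) as handle:
    print(handle.read())
```

In practice you rarely write this file by hand, because each kernel’s own installer takes care of it.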
Ipykernel (additional python)
In many Linux distributions Python 3.x is accessible through the `python3` command, but it is cumbersome to manage both Python versions and their dependencies by calling specific pip/pip3 or python/python3 from the system level. Managing versions, as well as third-party dependencies and environment variables, is very, very confusing, unless one uses the correct tool. Anaconda has such a tool out of the box, and it is called `conda`. It is able to manage packages and virtual environments. With minimal effort one can create and delete entire environments with a specific Python and package configuration. Another very popular environment manager is `virtualenv`, especially with the additional package `virtualenvwrapper`. I will use the manager built into conda. More about conda’s virtual environment capabilities can be found here or in the built-in help system: `conda env --help`.
So… I have Anaconda 3 with Python 3.6.1 on my Fedora machine. Let’s say I want to install an additional Python interpreter from the 2.7 branch (the last 2.x branch supported by the Python Software Foundation). In order to create an environment with a specific Python version, run this command:
This should install a bunch of packages, including the latest Python 2 version with basic tools to manage packages in the new environment. I also specified that I want the `ipykernel` package to be installed as well. Now I have to activate the new environment, following the conda help from the screen:
The prompt changed, indicating that I am now in the py27 environment with its packages. A quick `pip list` reveals that only a handful of basic packages (including ipykernel and its dependencies) were installed:
After activation I have to add this kernel to the global list of kernels managed by the Jupyter package from the main Anaconda installation. To register this kernel, I have to enter `python -m ipykernel install --user` in the terminal with this environment activated:
Done. I should have both kernels accessible when I run my Jupyter Notebook:
IRkernel with conda
Continuum Analytics did a great job making the R language available almost out of the box. Conda can access different repositories by specifying the `--channel` or `-c` flag when calling the install command. Continuum maintains a repository with the most popular R packages ported, so that they can be installed just like Python packages. To install R packages with conda, enter:
This command should install dozens of R packages and will make the R kernel available to you when you run Jupyter Notebook. For more information go to R with conda.
Simple and efficient. To run R code in a Jupyter notebook, simply choose the R kernel from the drop-down list:
The con of this method is that you have to install nearly 160 packages, taking a few gigs of space. On the other hand, I will have a nice, ready-to-go R environment next to my Python after typing just five words. Awesome!
If you want to see an alternative (and much less space-hungry) way, check the next section.
IRkernel from R
We will soon submit the IRkernel package to CRAN. Until then, you can install it via the devtools package:
The effect is identical to the previous method: R will install the requested packages.
IJulia kernel
Once you have Julia installed on your machine, run the Julia app (you will see the fancy `julia>` prompt), then type:
julia> Pkg.add("IJulia")
INFO: Initializing package repository /home/mdyzma/.julia/v0.7
INFO: Cloning METADATA from https://github.com/JuliaLang/METADATA.jl
INFO: Cloning ...
INFO: Installing ...
INFO: Building ...
The Julia package manager will take care of dependencies and download the requested software. Specifically, it will download and install a basic Python environment based on Miniconda, which will be local to Julia and accessible only by Julia. Thanks to that, you don’t really need any Python installed on your system to run a Julia notebook. In that case only one kernel will be available: Julia will use its private Python interpreter and a minimal Jupyter installation to run the notebook with its kernel. You can run it at any time by typing in the Julia REPL:
Since I already had two additional kernels installed (Python 2 and R), IJulia will be added to the collection.
Note that in the meantime I updated all packages (`conda update --all`), including Jupyter Notebook, which changed its UI a little bit. You may notice that the list of kernels is now displayed first, as opposed to the previous screenshots.
In the Julia language, `using <package name>` is an import statement, which pre-compiles the module named in the statement and gets it ready to work. The next line calls the module’s function called `notebook`. By passing arguments you can modify the notebook’s behavior. For example, with `notebook(detached=true)` Julia will run the notebook server in the background, and you will be able to use or exit the REPL without closing the notebook.
By default, the notebook “dashboard” opens in your home directory, but you can open the dashboard in a different directory with `notebook(dir="/some/path")`.
But we want to add the IJulia kernel to an existing Jupyter installation. To do that, you need to set the environment variable `JUPYTER` to the path of your current jupyter program before running `Pkg.add("IJulia")`.
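One way to find the value for that variable is to ask Python where the `jupyter` program lives. The snippet below is only a sketch: the helper name `jupyter_export_line` is made up for this illustration, and it assumes that the Jupyter you want Julia to reuse is the first one found on your `PATH`.

```python
import shutil


def jupyter_export_line(program="jupyter"):
    """Return a shell line that sets JUPYTER, or None if the program is not on PATH."""
    path = shutil.which(program)
    if path is None:
        return None
    return "export JUPYTER={}".format(path)


line = jupyter_export_line()
# Paste the printed line into the terminal (or ~/.bashrc) before starting Julia.
print(line if line else "jupyter was not found on PATH")
```

Running the printed `export` line in the same terminal session before starting Julia makes the variable visible to the package build.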
IScala kernel
We will clone one of the kernels offering Scala support from GitHub:
Simply run the `sh` script prepared by the author, and Scala should be added to the list of available kernels:
Once all kernels are installed, you can print all available kernels using a Jupyter command:
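As a rough illustration of what such a listing does under the hood, the sketch below scans a set of directories for `kernel.json` files and collects the kernel names. It is not Jupyter’s own implementation; the function name `find_kernels` is hypothetical, and the demo runs against a temporary directory with two fake kernels rather than against your real installation.

```python
import json
import os
import tempfile


def find_kernels(search_dirs):
    """Return {kernel_name: display_name} for every kernel.json found."""
    found = {}
    for base in search_dirs:
        if not os.path.isdir(base):
            continue  # silently skip locations that do not exist
        for name in sorted(os.listdir(base)):
            spec_file = os.path.join(base, name, "kernel.json")
            if os.path.isfile(spec_file):
                with open(spec_file) as handle:
                    spec = json.load(handle)
                # The first directory that defines a kernel name wins.
                found.setdefault(name, spec.get("display_name", name))
    return found


# Build a throwaway directory with two fake kernels for the demo.
demo_root = tempfile.mkdtemp()
for name, display in [("python2", "Python 2"), ("ir", "R")]:
    os.makedirs(os.path.join(demo_root, name))
    with open(os.path.join(demo_root, name, "kernel.json"), "w") as handle:
        json.dump({"display_name": display}, handle)

kernels = find_kernels([demo_root, os.path.join(demo_root, "missing")])
for name in sorted(kernels):
    print(name, "->", kernels[name])
```

On Linux, kernels registered with `--user` typically live under `~/.local/share/jupyter/kernels`, so passing that directory to the function should list the kernels installed in this post.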
Unfortunately, those kernels are not connected. Jupyter supports nearly 100 languages, but in the current configuration each notebook can use only a single kernel. Moving data directly between kernels is not possible. This is why notebooks are usually used sequentially, and changing the kernel during a computation may not be possible for all kernels. The Python kernel is able to use IPython magic functions to switch languages from cell to cell, and Julia has very useful packages, `PyCall` and `RCall`, which allow executing native code in those languages. Only the R kernel has no connections to the outside, and using it as a base to execute code from other kernels is impossible without a complicated serialization and deserialization system.
To sum up, it is possible to use other languages in Python- and Julia-based notebooks, but this functionality is limited.
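The simplest form of the serialization mentioned above is a plain file that every language can read. The sketch below is an assumption about how one might do it, not a feature of Jupyter: the function names and the file name `exchange.csv` are hypothetical. It writes a small table to CSV from Python and reads it back, the way a second notebook would.

```python
import csv
import os
import tempfile


def export_table(path, header, rows):
    """Write a header and rows to a CSV file that other languages can read."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


def import_table(path):
    """Read the CSV file back; every value comes back as a string."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        return header, [row for row in reader]


exchange_file = os.path.join(tempfile.mkdtemp(), "exchange.csv")
export_table(exchange_file, ["name", "value"], [["a", 1], ["b", 2]])

header, rows = import_table(exchange_file)
print(header)
print(rows)
```

Type information is lost along the way (the numbers come back as strings), which is part of why this kind of exchange gets complicated for anything beyond simple tables. In an R notebook the same file could be loaded with `read.csv`.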
One to bind them all
There is an initiative by Vatlab called SoS Notebook: a kernel capable of translating the data structures and functions of other kernels in order to perform truly multi-language analysis in one notebook document.
Video: https://youtu.be/xrwhNMRTBp4
Summary
Now we have a versatile, multi-language prototyping environment in the browser.
If you struggle to choose between Python, Julia, Scala or R, don’t! Use all of them, at the same time and in the same notebook, passing data structures between languages and performing analysis with the best tools they can offer. With Jupyter Notebook it is all possible. It is possible to add even more players to the game: Haskell, Lua, bash, Octave… Pick whatever you can… Currently Jupyter supports nearly 100 different kernels (check here).