Jupyter Notebook - Python-based lab book
Data analysis using Jupyter Notebook. The natural sciences rely more and more on skills related to data science. Experiments produce ever more data, and a skilled researcher has to know how to deal with a variety of data and sometimes very large datasets. The Anaconda Python Distribution offers a large set of great tools to manipulate any kind of data “out of the box”. A multitude of community packages makes it possible to read, analyze and report all kinds of data produced by science, mostly because the scientific community develops its tools largely in Python. How do you work with small and large data in Python and make Jupyter Notebook your lab book? Read on.
Install Anaconda
Data science skills are essential for every decent researcher.
The Anaconda Python Distribution, prepared by Continuum Analytics, is a comprehensive, free bundle of Python software dedicated to data science.
I strongly recommend using the Anaconda distribution, which installs the Python interpreter, the Jupyter Notebook, and several other packages commonly used in data science and in this tutorial. If you choose Anaconda 3, your interpreter will be version 3.6 (current at the time of writing) or higher (3.7 alpha is already available).
Execute the installer script and follow its instructions (your current version may differ from the one listed here):
To make sure the software is up to date, run:
Anaconda will install nearly 200 packages (182 to be exact), including the ones most important for this tutorial: Jupyter Notebook, ipython, pandas, numpy and statsmodels.
Anaconda channels
conda is the built-in Anaconda package manager, which by default uses the Python package repository maintained by Continuum Analytics. Some packages are distributed in repositories owned by groups other than the Anaconda team. These repositories are called channels. A channel is selected by passing the -c or --channel flag to the conda install command. Some channels, such as conda-forge, omnia or r, are supported by Continuum Analytics. They are full of excellent packages developed by the Anaconda community. Every time I specify a channel other than the default, conda checks that repository for available packages. It is also possible to add these channels to the .condarc file (see: here), which tells the package manager to check them every time I install something. The config file must first be created by running the conda config command. If another version of this file is placed in the Anaconda installation root directory, it will override the configuration in the user's home directory. The order of the repositories matters: when a package is available from more than one channel listed in the channels section, the channel listed first takes priority over those below it. Example file looks like this:
The last group of settings is very important for users behind a corporate proxy, which will block conda package lookup unless the correct settings are provided. Additionally, one can always use the official Python package manager pip in parallel with conda. Conda is able to detect the origin of a package and shows it in the package listing. Pip checks the PyPI (Python Package Index) repository, which stores nearly 110 000 Python packages.
Another great thing about Anaconda and Jupyter Notebook is that they are cross-platform, which means that notebook files created on one system will open on another system with a similar package configuration.
Jupyter Notebook / Jupyter Lab
Jupyter Notebook lets you create and share documents that contain live code, equations, visualizations and explanatory text. Text may be written in the Markdown markup language. Code can produce rich output such as images, videos, LaTeX, and JavaScript. Interactive widgets can be used to manipulate and visualize data in real time.
Alternatively, you may pass the path to an existing notebook file (extension .ipynb) to the jupyter notebook command; the notebook will then be opened in the location of the file. Jupyter automatically runs a local server.
This should open the notebook server in your browser (default address: http://127.0.0.1:8888):
There is a handy collection of extensions that add functionality to the Jupyter notebook. The extensions are grouped in the nbextensions package, which is not included in a fresh Anaconda installation, so I will install it using conda:
Because I used conda, it automatically registers all extensions and copies the necessary javascript and css files into the Jupyter environment for me. If I had used pip instead, I would have to run an additional command: jupyter contrib nbextension install --user
Jupyter Lab is a new project of the Jupyter team, which will eventually replace the good old notebook, but it is currently in a very early alpha release and is not recommended for any serious project. It has a built-in file manager, image browser, documentation and many other features. It also has one disadvantage: ipywidgets and nbextensions do not work yet, or their functionality must be loaded through the lab extensions system, which is not very convenient. Another handy extension is the watermark package. It timestamps notebooks and provides basic Python configuration info. I will fetch the latest version from GitHub:
Jupyter Lab is not available in the Anaconda distribution out of the box and must be installed. One can do it with the default package manager from the conda-forge channel:
The Lab environment can be started using the jupyter lab command. It should open in the default browser at this address: http://127.0.0.1:8888.
More information about this project and development plans can be found here.
A simple conda list shows all packages installed in the current environment. The main tools I will use in this tutorial include:
- numpy - array calculations
- sympy - symbolic mathematics
- matplotlib - 2D plotting
- statsmodels - statistical models
- pandas - data structures and analysis
which are part of SciPy, the Python-based scientific ecosystem. Numpy and Pandas alone have enormous documentation, which is worth checking. A huge advantage of the notebook environment is that it allows you to compute and manipulate data directly and export the entire notebook in various formats to share with others. There is also JupyterHub, which provides access to the notebook for multiple users and can be used as official project documentation in a secure location with controlled access. Check the JupyterHub documentation to learn more.
Example experiments:
To present the power of Python and Jupyter Notebook, I will create several scenarios of typical experiments conducted in labs. They will cover data acquisition, modeling, visualization and data storage.
Protein concentration
An example of a simple experiment with data from a VIS spectrophotometric measurement. Let's assume we have a protein solution colored by a protein dye. For a uniformly absorbing solution, the proportion of light passing through is called the transmittance, \(T\), and the absorbance, \(Abs = -\log_{10} T\), quantifies the light absorbed by molecules in the medium. The experiment consists of three steps:
- Determine the absorption spectra \(\lambda_{max}\)
- Calculate the extinction coefficient (\(\epsilon\)) of the standards.
- Determine the concentration of proteins in solution
Determine the absorption spectra
In order to obtain \(\lambda_{max}\), measure the absorbance of the diluted sample at 50 nm intervals between 350 and 700 nm. This gives an estimate of where the sample absorbs most (peaks) and least (valleys). I will generate the measurements using Python. The procedure requires all measurements to be within the 0-0.7 absorbance range. If the maximal values are higher, the sample should be diluted:
Data can be generated:
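The original snippet is not reproduced here, so the following is only a minimal sketch of how such data could be generated. The Gaussian band shape, its position (410 nm), width, height and the noise level are all illustrative assumptions, not measured values.

```python
import numpy as np
import pandas as pd

# Wavelengths from 350 to 700 nm in 50 nm steps, as in the procedure above.
wavelengths = np.arange(350, 701, 50)

# Hypothetical spectrum: one Gaussian band plus a little seeded noise.
# Peak position, width and height are illustrative assumptions only.
rng = np.random.default_rng(seed=0)
peak_nm, width_nm, peak_abs = 410.0, 60.0, 0.6
absorbance = peak_abs * np.exp(-((wavelengths - peak_nm) ** 2) / (2 * width_nm ** 2))
absorbance = absorbance + rng.normal(0.0, 0.005, size=wavelengths.size)

# Keep the values inside the 0-0.7 range required by the procedure.
absorbance = np.clip(absorbance, 0.0, 0.7)

spectrum = pd.DataFrame({"wavelength_nm": wavelengths, "absorbance": absorbance})
print(spectrum)
```

Seeding the random generator keeps the generated table reproducible between notebook runs.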
Or read from file:
Pandas can ingest most common text and binary data formats, provided the data fits in RAM. This way I obtained a table called a DataFrame. To get basic statistics, let's call the describe() method on the DataFrame.
Find the wavelength at which absorbance is highest (this is trivial for a few data points, but becomes harder as the amount of data grows):
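One way to do this with pandas is sketched below. The small table is hypothetical and only stands in for real measurements; the column names wavelength_nm and absorbance are assumptions made for this example.

```python
import pandas as pd

# Hypothetical absorbance readings; replace with your own measurements.
spectrum = pd.DataFrame({
    "wavelength_nm": [350, 400, 450, 500, 550, 600, 650, 700],
    "absorbance": [0.36, 0.59, 0.48, 0.19, 0.04, 0.01, 0.00, 0.00],
})

# Basic statistics (count, mean, std, quartiles) of every numeric column.
print(spectrum.describe())

# idxmax() returns the index label of the row with the largest absorbance.
row_of_max = spectrum["absorbance"].idxmax()
lambda_max = spectrum.loc[row_of_max, "wavelength_nm"]
print(f"lambda_max = {lambda_max} nm")
```

The same two calls work unchanged on a table with thousands of rows, which is where they start to pay off.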
And plot it to get a more direct feel for the data:
Calculate the extinction coefficient (\(\epsilon\)) of the standards.
The Beer-Lambert law states that absorbance is proportional to the concentration of the absorbing molecules, the length of the light path through the medium and the molar extinction coefficient:
\[Abs = \epsilon \cdot c \cdot l\]
where:
- \(Abs\) – absorbance
- \(\epsilon\) – molar extinction coefficient at the wavelength of maximum absorption \(\lambda_{max}\)
- \(c\) – substance concentration
- \(l\) – length of the light path
therefore \(\epsilon\) is equal to:
\[\epsilon = \frac{Abs_{410}}{c \cdot l}\]
where \(c\) is the concentration of the standard.
Determine the concentration of proteins in solution
Knowing \(\epsilon\), simply solve the Beer-Lambert equation for \(c\). Another approach is to construct a calibration curve from known samples, determine the function that fits the data best, and use it to calculate x (concentration) from a known y (absorbance).
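The direct Beer-Lambert route can be sketched in a few lines. All numbers below (absorbance and concentration of the standard, path length, absorbance of the unknown) are hypothetical values chosen only to illustrate the arithmetic.

```python
# Hypothetical standard: every number here is an illustrative assumption.
abs_standard = 0.50   # absorbance of the standard at lambda_max
c_standard = 25.0     # concentration of the standard, µg/mL
path_length = 1.0     # cuvette light path, cm

# Beer-Lambert solved for epsilon: epsilon = Abs / (c * l)
epsilon = abs_standard / (c_standard * path_length)

# Unknown sample measured in the same cuvette: c = Abs / (epsilon * l)
abs_unknown = 0.30
c_unknown = abs_unknown / (epsilon * path_length)

print(f"epsilon = {epsilon:.4f} mL/(µg*cm)")
print(f"c_unknown = {c_unknown:.1f} µg/mL")
```

Because concentration is given here in µg/mL, the coefficient comes out in mL/(µg·cm); a molar coefficient would require molar concentrations.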
Calibration curve
Consider the following example involving a set of eight standard points (5, 10, 25, 30, 40, 50, 60, and 70 µg/mL) with absorbances (0.106, 0.236, 0.544, 0.690, 0.791, 0.861, 0.882, 0.911). I have two columns of x and y values of the calibration curve points.
Conc. (µg/mL) | Abs.
---|---
5 | 0.106
10 | 0.236
25 | 0.544
30 | 0.690
40 | 0.791
50 | 0.861
60 | 0.882
70 | 0.911
It is quite easy to plot these points to visualize the data and get a general overview of their shape, like this:
Not bad, six lines of code and the plot is ready.
Linear fit
To fit the data to a line I will use the scipy package. A linear fit means I will try to find a function of the general form:
\[y = A x + b\]
where:
- \(A\) - is the slope
- \(b\) - is the intercept
and the specific function will be the one with the minimal error over all points, found by least-squares regression.
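A minimal sketch of such a fit is shown below. It uses scipy.stats.linregress, which is an assumption about which scipy routine to use; the variable names are chosen for this example only.

```python
import numpy as np
from scipy import stats

# Calibration standards from the table above.
conc = np.array([5, 10, 25, 30, 40, 50, 60, 70], dtype=float)   # µg/mL
absorbance = np.array([0.106, 0.236, 0.544, 0.690, 0.791, 0.861, 0.882, 0.911])

# Ordinary least-squares fit of absorbance = slope * conc + intercept.
fit = stats.linregress(conc, absorbance)
slope, intercept = fit.slope, fit.intercept

# The coefficient of determination is the squared correlation coefficient.
r_squared = fit.rvalue ** 2

print(f"slope = {slope:.4f}, intercept = {intercept:.4f}, R^2 = {r_squared:.4f}")
```

With these eight points the slope comes out near 0.0125 and the intercept near 0.176, which are the values used for interpolation later in the text.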
Now I will visualize the standard curve points and the line fitted to them:
For this set of points a linear fit seems to be a suboptimal choice: \(R^2 = 0.8754454029810919\). Let's try another type of function, a polynomial.
Polynomial fit
Polynomial fitting may better reflect the character of the points. In the data set above there is a steep increase followed by a flatter part at the top. It would be much better to fit another type of function, one able to reflect the plateau at the end. Polynomials and logarithmic functions have such properties. Let's try a second-degree polynomial, known as a quadratic function; in this case a second-degree polynomial is sufficient. The fitting function returns a list of coefficients for the least-squares fit of the data points to a polynomial described in general as \(p(x) = c_0 + c_1 x + \dots + c_n x^n\):
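A sketch of such a fit follows. It assumes numpy.polynomial.polynomial.polyfit, which returns the coefficients lowest power first (\(c_0, c_1, c_2\)), matching the general form above; whether the original notebook used this exact function is an assumption.

```python
import numpy as np
from numpy.polynomial import polynomial as P

# Calibration standards from the table above.
conc = np.array([5, 10, 25, 30, 40, 50, 60, 70], dtype=float)   # µg/mL
absorbance = np.array([0.106, 0.236, 0.544, 0.690, 0.791, 0.861, 0.882, 0.911])

# Least-squares quadratic fit; coefficients are ordered c0, c1, c2.
coefs = P.polyfit(conc, absorbance, 2)
c0, c1, c2 = coefs

# Evaluate the fitted polynomial and compute R^2 from the residuals.
fitted = P.polyval(conc, coefs)
ss_res = np.sum((absorbance - fitted) ** 2)
ss_tot = np.sum((absorbance - absorbance.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print("c0, c1, c2 =", coefs)
print(f"R^2 = {r_squared:.4f}")
```

Note that np.polyfit would return the same coefficients in the opposite order (highest power first), which is easy to mix up when writing the polynomial out by hand.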
The coefficients (\(c_0\), \(c_1\), \(c_2\)) are: -0.03969222, 0.03034985, -0.00024301. Therefore the polynomial has the form:
\[y = -0.00024x^2 + 0.0303x - 0.0397\]
You can go ahead and write a script that fits a third-degree polynomial to the data, or run the snippet prepared in the Jupyter Notebook. As I mentioned before, there is minimal difference between the quadratic and cubic fits. A comparison of both polynomials should look similar to this:
Concentration interpolation
Interpolating the unknown sample is as simple as solving one of the functions for x. In the case of linear regression the transformation is trivial (with the coefficients truncated to three significant figures):
\[x = \frac{(absorbance - intercept)}{slope} = \frac{(absorbance - 0.176)}{0.0124}\]
Let's calculate the concentration for an absorbance of 0.5:
\[x = \frac{(0.5 - 0.176)}{0.0124} = 26.129\]
Although this is very simple algebra, it can also be replaced by Python computations. The sympy module allows performing symbolic math operations like this:
What do these lines do? The first two import the sympy package and set some printing options. The next ones define the ordinary Python variables x, y, A, b as sympy Symbol() objects. Then the expression is set up and solved for x. This approach calculates the exact x for y equal to 0. If I substitute y with some value, I will have to change the expression to:
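The original snippets are not reproduced here, so the following is only a sketch of the same idea. It states the line equation with y kept as a symbol, solves it for x, and then substitutes the truncated coefficients quoted above; the variable names solution and x_value are chosen for this example.

```python
from sympy import Eq, solve, symbols

# Declare the symbols used in the line equation y = A*x + b.
x, y, A, b = symbols("x y A b")

# Solve the equation for x; solve() returns a list of solutions.
solution = solve(Eq(A * x + b, y), x)[0]
print(solution)

# Substitute the truncated slope and intercept and an absorbance of 0.5.
x_value = solution.subs({A: 0.0124, b: 0.176, y: 0.5})
print(float(x_value))
```

Keeping y symbolic means the same solved expression can be reused for any absorbance by changing only the substitution.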
For the second-degree polynomial it is a little bit more complicated. We can use the well-known formula for the roots of a quadratic equation, or ask sympy and numpy to calculate them for us.
To get the exact symbolic solution:
Which results in:
\[\left\{- \frac{a_{2}}{2 a_{1}} - \frac{1}{2 a_{1}} \sqrt{- 4 a_{1} a_{3} + a_{2}^{2}}, - \frac{a_{2}}{2 a_{1}} + \frac{1}{2 a_{1}} \sqrt{- 4 a_{1} a_{3} + a_{2}^{2}}\right\}\]
For a numeric solution I will solve the equation \(f(x) - y = 0\) using np.roots, where \(f(x)\) is our polynomial:
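A minimal sketch of this step, using the fitted coefficients quoted earlier, is shown below. The filtering of roots to the calibrated 5-70 µg/mL range is an addition made for this example, not part of the original snippet.

```python
import numpy as np

# Quadratic coefficients quoted in the text (c0 + c1*x + c2*x**2).
c0, c1, c2 = -0.03969222, 0.03034985, -0.00024301
target_abs = 0.5

# np.roots expects coefficients from the highest power down,
# so f(x) - y = 0 becomes c2*x**2 + c1*x + (c0 - y) = 0.
roots = np.roots([c2, c1, c0 - target_abs])
print(roots)

# Keep only real roots inside the range covered by the standards.
in_range = [r.real for r in roots if abs(r.imag) < 1e-12 and 5 <= r.real <= 70]
print(in_range)
```

Filtering by the calibrated range is one simple way to pick the physically meaningful root automatically instead of reading it off the plot.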
For an absorbance of 0.5 the program returned two values: 21.47501868 and 103.41541963. A quick look at the graph shows that the value we are looking for is 21.475.
But why are there two values, and how do we identify the correct one? It is easy if we go back to the fitting and check what kind of function was used to fit the data. I used a parabola (quadratic), the ascending arm of the parabola to be exact. Therefore, for each point within the x range (5-70) one can expect the quadratic function to have an additional solution on the descending arm, outside of the x range. In reality our fitting function looks like this:
That is why only the points fitted within the x data range make sense and the rest is irrelevant. There is one concern related to the maximum point, which may lie within the x data range, in which case the plateau data may suffer, giving wrong results. Possibly a logarithmic function would be perfect for this data set. However, from the Beer-Lambert law we know that only the linear growth phase from 0.05 to 0.7 absorbance is relevant, thus the accuracy of the asymptotic region can be neglected.
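The concern about the maximum can be checked numerically. The sketch below computes the vertex of the fitted parabola from the coefficients quoted earlier; the names x_vertex and abs_max are chosen for this example.

```python
# Quadratic coefficients quoted in the text (c0 + c1*x + c2*x**2).
c0, c1, c2 = -0.03969222, 0.03034985, -0.00024301

# The vertex of a parabola lies where the derivative c1 + 2*c2*x is zero.
x_vertex = -c1 / (2 * c2)

# Largest absorbance the fitted curve can reach.
abs_max = c0 + c1 * x_vertex + c2 * x_vertex ** 2

print(f"vertex at x = {x_vertex:.1f} µg/mL, maximum absorbance = {abs_max:.3f}")
print("vertex inside the 5-70 range:", 5 <= x_vertex <= 70)
```

For these coefficients the vertex falls at roughly 62 µg/mL, inside the calibrated range, so absorbances above the fitted maximum have no real solution and the highest standards sit on the descending arm. This supports restricting the interpolation to the lower, near-linear part of the curve.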
Example Jupyter notebook can be downloaded from GitHub.