{ "cells": [ { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. _tutorial/data:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false, "nbsphinx": "hidden" }, "outputs": [], "source": [ "import pandas as pd\n", "import skchem\n", "pd.options.display.max_columns = pd.options.display.max_rows = 10" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "**scikit-chem** provides a simple interface to chemical datasets, and a framework for constructing these datasets. The data module uses [fuel](http://fuel.readthedocs.io/en/latest/) to make complex out-of-memory iteration straightforward (see the fuel documentation). It also offers an abstraction for easily loading smaller datasets that fit in memory." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## In memory datasets " ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Datasets consist of **sets** and **sources**. Simply put, sets are collections of molecules in the dataset, and sources are types of data relating to these molecules.\n", "\n", "For demonstration purposes, we will use the [Bursi Ames](http://ftp.ics.uci.edu/pub/baldig/learning/Bursi/source/jm040835a.pdf) dataset.
This has 3 sets:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "('train', 'valid', 'test')" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "skchem.data.BursiAmes.available_sets()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And many sources:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "('G', 'A', 'y', 'A_cx', 'G_d', 'X_morg', 'X_cx', 'X_pc')" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "skchem.data.BursiAmes.available_sources()" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. note::\n", " \n", " Currently, the nature of the sources is not always well documented, but as a guide, **X** are molecular features, **y** are target variables, **A** are atom features, **G** are distances. When available, they will be detailed in the docstring of the dataset, accessible with ``help``." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this example, we will load the X_morg and the y **sources** for all the **sets**. These are circular fingerprints and the target labels (in this case, whether the molecule was a mutagen).\n", "\n", "We can load the data for the requested sets and sources using the in-memory API:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [], "source": [ "kws = {'sets': ('train', 'valid', 'test'), 'sources': ('X_morg', 'y')}\n", "\n", "(X_train, y_train), (X_valid, y_valid), (X_test, y_test) = skchem.data.BursiAmes.load_data(**kws)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The requested data is loaded as nested tuples, sorted first by **set**, and then by **source**, which can easily be unpacked as above."
] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "train shapes: (3007, 2048) (3007,)\n", "valid shapes: (645, 2048) (645,)\n", "test shapes: (645, 2048) (645,)\n" ] } ], "source": [ "print('train shapes:', X_train.shape, y_train.shape)\n", "print('valid shapes:', X_valid.shape, y_valid.shape)\n", "print('test shapes:', X_test.shape, y_test.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The raw data is loaded as numpy arrays:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[0, 0, 0, ..., 0, 0, 0],\n", " [0, 0, 0, ..., 0, 0, 0],\n", " [0, 0, 0, ..., 0, 0, 0],\n", " ..., \n", " [0, 0, 0, ..., 0, 0, 0],\n", " [0, 0, 0, ..., 0, 0, 0],\n", " [0, 0, 0, ..., 0, 0, 0]])" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([1, 1, 1, ..., 0, 1, 1], dtype=uint8)" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_train" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which should be ready to use as fuel for modelling!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data as pandas objects\n", "\n", "The data is originally saved as pandas objects, and can be retrieved as such using the `read_frame` class method. \n", "\n", "Features are available under the 'feats' namespace:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table>\n",
"<thead>\n",
"<tr><th>morgan_fp_idx</th><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th><th>...</th><th>2043</th><th>2044</th><th>2045</th><th>2046</th><th>2047</th></tr>\n",
"<tr><th>batch</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>1728-95-6</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>74550-97-3</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>16757-83-8</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>553-97-9</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>115-39-9</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"<tr><th>874-60-2</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>92-66-0</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>594-71-8</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>55792-21-7</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>84987-77-9</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>4297 rows × 2048 columns</p>\n",
"</div>
" ], "text/plain": [ "morgan_fp_idx 0 1 2 3 4 ... 2043 2044 2045 2046 \\\n", "batch ... \n", "1728-95-6 0 0 0 0 0 ... 0 0 0 0 \n", "74550-97-3 0 0 0 0 0 ... 0 0 0 0 \n", "16757-83-8 0 0 0 0 0 ... 0 0 0 0 \n", "553-97-9 0 0 0 0 0 ... 0 0 0 0 \n", "115-39-9 0 0 0 0 0 ... 0 0 0 0 \n", "... ... ... ... ... ... ... ... ... ... ... \n", "874-60-2 0 0 0 0 0 ... 0 0 0 0 \n", "92-66-0 0 0 0 0 0 ... 0 0 0 0 \n", "594-71-8 0 0 0 0 0 ... 0 0 0 0 \n", "55792-21-7 0 0 0 0 0 ... 0 0 0 0 \n", "84987-77-9 0 0 0 0 0 ... 0 0 0 0 \n", "\n", "morgan_fp_idx 2047 \n", "batch \n", "1728-95-6 0 \n", "74550-97-3 0 \n", "16757-83-8 0 \n", "553-97-9 0 \n", "115-39-9 0 \n", "... ... \n", "874-60-2 0 \n", "92-66-0 0 \n", "594-71-8 0 \n", "55792-21-7 0 \n", "84987-77-9 0 \n", "\n", "[4297 rows x 2048 columns]" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "skchem.data.BursiAmes.read_frame('feats/X_morg')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Target variables under 'targets':" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "batch\n", "1728-95-6 1\n", "74550-97-3 1\n", "16757-83-8 1\n", "553-97-9 0\n", "115-39-9 0\n", " ..\n", "874-60-2 1\n", "92-66-0 0\n", "594-71-8 1\n", "55792-21-7 0\n", "84987-77-9 1\n", "Name: is_mutagen, dtype: uint8" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "skchem.data.BursiAmes.read_frame('targets/y')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set membership masks under 'indices':" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "batch\n", "1728-95-6 True\n", "74550-97-3 True\n", "16757-83-8 True\n", "553-97-9 True\n", "115-39-9 True\n", " ... 
\n", "874-60-2 False\n", "92-66-0 False\n", "594-71-8 False\n", "55792-21-7 False\n", "84987-77-9 False\n", "Name: split, dtype: bool" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "skchem.data.BursiAmes.read_frame('indices/train')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, molecules are accessible via 'structure':" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "batch\n", "1728-95-6 \n", "2319-96-2 \n", " ... \n", "84-64-0