{ "cells": [ { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. _tutorial/data:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false, "nbsphinx": "hidden" }, "outputs": [], "source": [ "import pandas as pd\n", "import skchem\n", "pd.options.display.max_columns = pd.options.display.max_rows = 10" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "**scikit-chem** provides a simple interface to chemical datasets, and a framework for constructing these datasets. The data module uses [fuel](http://fuel.readthedocs.io/en/latest/) to make complex, out-of-memory iterative functionality straightforward (see the fuel documentation). It also offers an abstraction that allows easy loading of smaller datasets that fit in memory." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## In memory datasets " ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Datasets consist of **sets** and **sources**. Simply put, sets are collections of molecules in the dataset, and sources are types of data relating to these molecules.\n", "\n", "For demonstration purposes, we will use the [Bursi Ames](http://ftp.ics.uci.edu/pub/baldig/learning/Bursi/source/jm040835a.pdf) dataset. 
This has three sets:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "('train', 'valid', 'test')" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "skchem.data.BursiAmes.available_sets()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And many sources:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "('G', 'A', 'y', 'A_cx', 'G_d', 'X_morg', 'X_cx', 'X_pc')" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "skchem.data.BursiAmes.available_sources()" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. note::\n", " \n", " Currently, the nature of the sources is not always well documented, but as a guide, **X** are molecular features, **y** are target variables, **A** are atom features and **G** are distances. When available, they will be detailed in the docstring of the dataset, accessible with ``help``." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this example, we will load the `X_morg` and `y` **sources** for all the **sets**. These are circular (Morgan) fingerprints and the target labels (in this case, whether the molecule is a mutagen), respectively.\n", "\n", "We can load the data for the requested sets and sources using the in-memory API:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [], "source": [ "kws = {'sets': ('train', 'valid', 'test'), 'sources': ('X_morg', 'y')}\n", "\n", "(X_train, y_train), (X_valid, y_valid), (X_test, y_test) = skchem.data.BursiAmes.load_data(**kws)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The requested data is loaded as nested tuples, ordered first by **set** and then by **source**, so it can easily be unpacked as above."
] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "train shapes: (3007, 2048) (3007,)\n", "valid shapes: (645, 2048) (645,)\n", "test shapes: (645, 2048) (645,)\n" ] } ], "source": [ "print('train shapes:', X_train.shape, y_train.shape)\n", "print('valid shapes:', X_valid.shape, y_valid.shape)\n", "print('test shapes:', X_test.shape, y_test.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The raw data is loaded as NumPy arrays:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([[0, 0, 0, ..., 0, 0, 0],\n", " [0, 0, 0, ..., 0, 0, 0],\n", " [0, 0, 0, ..., 0, 0, 0],\n", " ..., \n", " [0, 0, 0, ..., 0, 0, 0],\n", " [0, 0, 0, ..., 0, 0, 0],\n", " [0, 0, 0, ..., 0, 0, 0]])" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([1, 1, 1, ..., 0, 1, 1], dtype=uint8)" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_train" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These arrays should be ready to use as fuel for modelling!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data as pandas objects\n", "\n", "The data is originally saved as pandas objects, and can be retrieved as such using the `read_frame` class method.\n", "\n", "Features are available under the 'feats' namespace:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
morgan_fp_idx | 0 | 1 | 2 | 3 | 4 | ... | 2043 | 2044 | 2045 | 2046 | 2047 |
---|---|---|---|---|---|---|---|---|---|---|---|
batch |  |  |  |  |  |  |  |  |  |  |  |
1728-95-6 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
74550-97-3 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
16757-83-8 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
553-97-9 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
115-39-9 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
874-60-2 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
92-66-0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
594-71-8 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
55792-21-7 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
84987-77-9 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
4297 rows × 2048 columns
\n", "