{
 "cells": [
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. _tutorial/data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": false,
    "nbsphinx": "hidden"
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "pd.options.display.max_columns = pd.options.display.max_rows = 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "**scikit-chem** provides a simple interface to chemical datasets, and a framework for constructing these datasets.  The data module uses [fuel](http://fuel.readthedocs.io/en/latest/) to make complex out of memory iterative functionality straightforward (see the fuel documentation).  It also offers an abstraction to allow easy loading of smaller datasets, that can fit in memory.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## In memory datasets "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "Datasets consist of **sets** and **sources**.  Simply put, sets are collections of molecules in the dataset, and sources are types of data relating to these molecules.\n",
    "\n",
    "For demonstration purposes, we will use the [Bursi Ames](http://ftp.ics.uci.edu/pub/baldig/learning/Bursi/source/jm040835a.pdf) dataset.  This has 3 sets:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('train', 'valid', 'test')"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "skchem.data.BursiAmes.available_sets()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And many sources:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('G', 'A', 'y', 'A_cx', 'G_d', 'X_morg', 'X_cx', 'X_pc')"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "skchem.data.BursiAmes.available_sources()"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. note::\n",
    "    \n",
    "    Currently, the nature of the sources are not alway well documented, but as a guide, **X** are moleccular features, **y** are target variables, **A** are atom features, **G** are distances.  When available, they will be detailed in the docstring of the dataset, accessible with ``help``."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this example, we will load the X_morg and the y **sources** for all the **sets**.  These are circular fingerprints, and the target labels (in this case, whether the molecule was a mutagen).\n",
    "\n",
    "We can load the data for requested sets and sources using the in memory API:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "kws = {'sets': ('train', 'valid', 'test'), 'sources':('X_morg', 'y')}\n",
    "\n",
    "(X_train, y_train), (X_valid, y_valid), (X_test, y_test) = skchem.data.BursiAmes.load_data(**kws)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The requested data is loaded as nested tuples, sorted first by **set**, and then by **source**, which can easily be unpacked as above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "train shapes: (3007, 2048) (3007,)\n",
      "valid shapes: (645, 2048) (645,)\n",
      "test shapes: (645, 2048) (645,)\n"
     ]
    }
   ],
   "source": [
    "print('train shapes:', X_train.shape, y_train.shape)\n",
    "print('valid shapes:', X_valid.shape, y_valid.shape)\n",
    "print('test shapes:', X_test.shape, y_test.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The raw data is loaded as numpy arrays:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[0, 0, 0, ..., 0, 0, 0],\n",
       "       [0, 0, 0, ..., 0, 0, 0],\n",
       "       [0, 0, 0, ..., 0, 0, 0],\n",
       "       ..., \n",
       "       [0, 0, 0, ..., 0, 0, 0],\n",
       "       [0, 0, 0, ..., 0, 0, 0],\n",
       "       [0, 0, 0, ..., 0, 0, 0]])"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_train"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 1, 1, ..., 0, 1, 1], dtype=uint8)"
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "y_train"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Which should be ready to use as fuel for modelling!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data as pandas objects\n",
    "\n",
    "The data is originally saved as pandas objects, and can be retrieved as such using the `read_frame` class method.  \n",
    "\n",
    "Features are available under the 'feats' namespace:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>morgan_fp_idx</th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>...</th>\n",
       "      <th>2043</th>\n",
       "      <th>2044</th>\n",
       "      <th>2045</th>\n",
       "      <th>2046</th>\n",
       "      <th>2047</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>batch</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1728-95-6</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>74550-97-3</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16757-83-8</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>553-97-9</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>115-39-9</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>874-60-2</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>92-66-0</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>594-71-8</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>55792-21-7</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>84987-77-9</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>4297 rows × 2048 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "morgan_fp_idx  0     1     2     3     4     ...   2043  2044  2045  2046  \\\n",
       "batch                                        ...                            \n",
       "1728-95-6         0     0     0     0     0  ...      0     0     0     0   \n",
       "74550-97-3        0     0     0     0     0  ...      0     0     0     0   \n",
       "16757-83-8        0     0     0     0     0  ...      0     0     0     0   \n",
       "553-97-9          0     0     0     0     0  ...      0     0     0     0   \n",
       "115-39-9          0     0     0     0     0  ...      0     0     0     0   \n",
       "...             ...   ...   ...   ...   ...  ...    ...   ...   ...   ...   \n",
       "874-60-2          0     0     0     0     0  ...      0     0     0     0   \n",
       "92-66-0           0     0     0     0     0  ...      0     0     0     0   \n",
       "594-71-8          0     0     0     0     0  ...      0     0     0     0   \n",
       "55792-21-7        0     0     0     0     0  ...      0     0     0     0   \n",
       "84987-77-9        0     0     0     0     0  ...      0     0     0     0   \n",
       "\n",
       "morgan_fp_idx  2047  \n",
       "batch                \n",
       "1728-95-6         0  \n",
       "74550-97-3        0  \n",
       "16757-83-8        0  \n",
       "553-97-9          0  \n",
       "115-39-9          0  \n",
       "...             ...  \n",
       "874-60-2          0  \n",
       "92-66-0           0  \n",
       "594-71-8          0  \n",
       "55792-21-7        0  \n",
       "84987-77-9        0  \n",
       "\n",
       "[4297 rows x 2048 columns]"
      ]
     },
     "execution_count": 37,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "skchem.data.BursiAmes.read_frame('feats/X_morg')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Target variables under 'targets':"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "batch\n",
       "1728-95-6     1\n",
       "74550-97-3    1\n",
       "16757-83-8    1\n",
       "553-97-9      0\n",
       "115-39-9      0\n",
       "             ..\n",
       "874-60-2      1\n",
       "92-66-0       0\n",
       "594-71-8      1\n",
       "55792-21-7    0\n",
       "84987-77-9    1\n",
       "Name: is_mutagen, dtype: uint8"
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "skchem.data.BursiAmes.read_frame('targets/y')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Set membership masks under 'indices':"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "batch\n",
       "1728-95-6      True\n",
       "74550-97-3     True\n",
       "16757-83-8     True\n",
       "553-97-9       True\n",
       "115-39-9       True\n",
       "              ...  \n",
       "874-60-2      False\n",
       "92-66-0       False\n",
       "594-71-8      False\n",
       "55792-21-7    False\n",
       "84987-77-9    False\n",
       "Name: split, dtype: bool"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "skchem.data.BursiAmes.read_frame('indices/train')"
   ]
  },
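  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because the masks appear to share the same `batch` index as the feature and target frames, they can be combined with them by boolean indexing.  The following is a minimal sketch of that idea, using made-up toy frames as stand-ins for the output of `read_frame` (the identifiers and values are illustrative assumptions, not taken from the dataset):\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Toy stand-ins for the frames returned by read_frame; all values are made up.\n",
    "index = pd.Index(['mol-a', 'mol-b', 'mol-c', 'mol-d'], name='batch')\n",
    "X = pd.DataFrame([[0, 1], [1, 0], [1, 1], [0, 0]], index=index, columns=[0, 1])\n",
    "y = pd.Series([1, 0, 1, 0], index=index, name='is_mutagen', dtype='uint8')\n",
    "train = pd.Series([True, True, False, False], index=index, name='split')\n",
    "\n",
    "# Boolean indexing keeps only the rows whose mask value is True.\n",
    "X_train, y_train = X[train], y[train]\n",
    "\n",
    "print(X_train.shape, y_train.shape)\n",
    "```\n",
    "\n",
    "With the real dataset, the same pattern should apply to the frames read from 'feats/X_morg', 'targets/y' and 'indices/train', assuming their indices align as the outputs above suggest."
   ]
  },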
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, molecules are accessible via 'structure':"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "batch\n",
       "1728-95-6      <Mol: [H]c1c([H])c([H])c(-c2nc(-c3c([H])c([H])...\n",
       "119-34-6       <Mol: [H]Oc1c([H])c([H])c(N([H])[H])c([H])c1[N...\n",
       "371-40-4           <Mol: [H]c1c([H])c(N([H])[H])c([H])c([H])c1F>\n",
       "2319-96-2      <Mol: [H]c1c([H])c([H])c2c([H])c3c(c([H])c(C([...\n",
       "1822-51-1         <Mol: [H]c1nc([H])c([H])c(C([H])([H])Cl)c1[H]>\n",
       "                                     ...                        \n",
       "84-64-0        <Mol: [H]c1c([H])c([H])c(C(=O)OC2([H])C([H])([...\n",
       "121808-62-6    <Mol: [H]OC(=O)C1([H])N(C(=O)C2([H])N([H])C(=O...\n",
       "134-20-3       <Mol: [H]c1c([H])c([H])c(N([H])[H])c(C(=O)OC([...\n",
       "6441-91-4      <Mol: [H]Oc1c([H])c(S(=O)(=O)O[H])c([H])c2c([H...\n",
       "97534-21-9     <Mol: [H]Oc1nc(=S)n([H])c(O[H])c1C(=O)N([H])c1...\n",
       "Name: structure, dtype: object"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "skchem.data.BursiAmes.read_frame('structure')"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. todo::\n",
    "\n",
    "    Example using fuel directly."
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. note::\n",
    "\n",
    "    The dataset building functionality is likely to undergo a large change in future so is not documented here.  Please look at the example datasets to understand the format required to build the datasets directly."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Edit Metadata",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.1"
  },
  "widgets": {
   "state": {},
   "version": "1.1.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}