{ "cells": [ { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false, "nbsphinx": "hidden" }, "outputs": [], "source": [ "import skchem\n", "import pandas as pd\n", "pd.options.display.max_rows = pd.options.display.max_columns = 10" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Pipelining" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**scikit-chem** extends the scikit-learn `Pipeline` object to support filtering. A pipeline is initialized with a list of Transformer objects." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pipeline = skchem.pipeline.Pipeline([\n", "    skchem.standardizers.ChemAxonStandardizer(keep_failed=True),\n", "    skchem.forcefields.UFF(),\n", "    skchem.filters.OrganicFilter(),\n", "    skchem.descriptors.MorganFeaturizer()])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The pipeline applies each object in turn to its input, using the highest-priority method that the object implements, according to the order `transform_filter` > `filter` > `transform`.\n", "\n", "For example, our pipeline can transform sodium acetate all the way to fingerprints:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "mol = skchem.Mol.from_smiles('CC(=O)[O-].[Na+]')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "morgan_fp_idx\n", "0 0\n", "1 0\n", "2 0\n", "3 0\n", "4 0\n", " ..\n", "2043 0\n", "2044 0\n", "2045 0\n", "2046 0\n", "2047 0\n", "Name: MorganFeaturizer, dtype: uint8" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipeline.transform_filter(mol)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It also works on collections of molecules:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {
"collapsed": false }, "outputs": [ { "data": { "text/plain": [ "batch\n", "ethane \n", "propane \n", "benzene \n", "sodium acetate \n", "serine \n", "Name: structure, dtype: object" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mols = skchem.read_smiles('https://archive.org/download/scikit-chem_example_files/example.smi', name_column=1); mols" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "ChemAxonStandardizer: 100% (5 of 5) |##########################################| Elapsed Time: 0:00:04 Time: 0:00:04\n", "UFF: 100% (5 of 5) |###########################################################| Elapsed Time: 0:00:00 Time: 0:00:00\n", "OrganicFilter: 100% (5 of 5) |#################################################| Elapsed Time: 0:00:00 Time: 0:00:00\n", "MorganFeaturizer: 100% (5 of 5) |##############################################| Elapsed Time: 0:00:00 Time: 0:00:00\n" ] }, { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th>morgan_fp_idx</th><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th><th>...</th><th>2043</th><th>2044</th><th>2045</th><th>2046</th><th>2047</th></tr>\n",
"    <tr><th>batch</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>ethane</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>propane</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>benzene</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>sodium acetate</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>serine</th><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>...</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>5 rows × 2048 columns</p>\n",
"</div>
" ], "text/plain": [ "morgan_fp_idx 0 1 2 3 4 ... 2043 2044 2045 2046 \\\n", "batch ... \n", "ethane 0 0 0 0 0 ... 0 0 0 0 \n", "propane 0 0 0 0 0 ... 0 0 0 0 \n", "benzene 0 0 0 0 0 ... 0 0 0 0 \n", "sodium acetate 0 0 0 0 0 ... 0 0 0 0 \n", "serine 0 0 0 0 0 ... 0 0 1 0 \n", "\n", "morgan_fp_idx 2047 \n", "batch \n", "ethane 0 \n", "propane 0 \n", "benzene 0 \n", "sodium acetate 0 \n", "serine 0 \n", "\n", "[5 rows x 2048 columns]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipeline.transform_filter(mols)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Raw Cell Format", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" }, "widgets": { "state": {}, "version": "1.1.2" } }, "nbformat": 4, "nbformat_minor": 0 }