{ "cells": [ { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false, "nbsphinx": "hidden" }, "outputs": [], "source": [ "import skchem\n", "import pandas as pd\n", "pd.options.display.max_rows = 10\n", "pd.options.display.max_columns = 10\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Transforming" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Operations on compounds are implemented as `Transformer`s in **scikit-chem**, which are analoguous to `Transformer` objects in **scikit-learn**. These objects define a 1:1 mapping between input and output objects in a collection (i.e. the length of the collection remains the same during a transform). These mappings can be very varied, but the three main types currently implemented in `scikit-chem` are `Standardizers`, `Forcefields` and `Featurizers`." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Standardizers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Chemical data curation is a difficult concept, and data may be formatted differently depending on the source, or even the habits of the curator. \n", "\n", "For example, **solvents** or **salts** might be included the representation, which might be considered an unnecessary detail to a modeller, or even irrelevant to an experimentalist, if the compound is solvated is a standard solvent during the protocol.\n", "\n", "Even the structure of molecules that would be considered the 'same', can often be drawn very differently. For example, [tautomers](https://en.wikipedia.org/wiki/Tautomers) are arguably the same molecule in different conditions, and [mesomers](https://en.wikipedia.org/wiki/Resonance) might be considered different aspects of the same molecule. \n", "\n", "Often, it is sensible to canonicalize these compounds in a process called **Standardization**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In `scikit-chem`, the [standardizers](../api/skchem.standardizers.rst) package provides this functionality. `Standardizer` objects transform `Mol` objects into other `Mol` objects, which have their representation canonicalized (or into `None` if the protocol fails). The details of the canonicalization may be configured at object initialization, or by altering properties." ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. tip::\n", "\n", " Currently, the only available `Standardizer` is a wrapper of the `ChemAxon Standardizer `_. This requires the ChemAxon `JChem software suite `_ to be installed and licensed (free academic licenses are available from the website). We hope to implement an open source `Standardizer` in future." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As an example, we will standardize the sodium acetate:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'CC(=O)[O-].[Na+]'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mol = skchem.Mol.from_smiles('CC(=O)[O-].[Na+]', name='sodium acetate'); mol.to_smiles()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A `Standardizer` object is initialized:" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "collapsed": true }, "outputs": [], "source": [ "std = skchem.standardizers.ChemAxonStandardizer()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calling transform on sodium acetate yields the conjugate 'canonical' acid, acetic acid." ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'CC(=O)O'" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mol_std = std.transform(mol); mol_std.to_smiles()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The standardization of a collection of `Mol`s can be achieved by calling `transform` on a `pandas.Series`:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "name\n", "ethane \n", "propane \n", "benzene \n", "sodium acetate \n", "serine \n", "Name: structure, dtype: object" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mols = skchem.read_smiles('https://archive.org/download/scikit-chem_example_files/example.smi', \n", " name_column=1); mols" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "ChemAxonStandardizer: 100% (5 of 5) |##########################################| Elapsed Time: 0:00:01 Time: 0:00:01\n" ] }, { "data": { "text/plain": [ "name\n", "ethane \n", "propane \n", "benzene \n", "sodium acetate \n", "serine \n", "Name: structure, dtype: object" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "std.transform(mols)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A loading bar is provided by default, although this can be disabled by lowering the verbosity:" ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "name\n", "ethane \n", "propane \n", "benzene \n", "sodium acetate \n", "serine \n", "Name: structure, dtype: object" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "std.verbose = 0\n", "std.transform(mols)" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. todo::\n", "\n", " failure cases" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Forcefields" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Often the three dimensional structure of a compound is of relevance, but many chemical formats, such as [SMILES](http://www.daylight.com/dayhtml/doc/theory/theory.smiles.html) do not encode this information (and often even in formats which serialize geometry only coordinates in two dimensions are provided).\n", "\n", "To produce a reasonable three dimensional **conformer**, a compound must be roughly embedded in three dimensions according to local geometrical constraints, and forcefields used to optimize the geometry of a compound." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In `scikit-chem`, the [forcefields](../api/skchem.forcefields.rst) package provides access to this functionality. Two forcefields, the [Universal Force Field (UFF)](http://pubs.acs.org/doi/abs/10.1021/ja00051a040) and the Merck Molecular Force Field (MMFF) are currently provided. We will use the UFF:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [], "source": [ "uff = skchem.forcefields.UFF()\n", "mol = uff.transform(mol_std)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mol.atoms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This uses the forcefield to generate a reasonable three dimensional structure. In `rdkit` (and thus `scikit-chem`, conformers are separate entities). The forcefield creates a new conformer on the object:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ,\n", " ]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mol.conformers[0].atom_positions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The molecule can be visualized by drawing it:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAQkAAAECCAYAAADtryKnAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAADCVJREFUeJzt3X9oVfUfx/HX7t3P9ou1pWttFdNpbc5wKKWrWNaakCSt\nf4rIIkKEkAjsj8RBBOX+6AsSlYT2R/tDhKAopXCKRIElSk6dQ3SpsbnW/LZfen/Me+853z/u19nU\nvZu2ee71Ph//zB2c943sPjnnfO45J811XQHAZHxeDwAgsREJACYiAcBEJACYiAQAE5EAYCISAExE\nAoCJSAAwEQkAJiIBwJTuwWtysQjgnbQb/QH2JACYiAQAE5EAYCISAExEAoCJSAAwEQkAJiIBwEQk\nAJiIBAATkQBgIhIATEQCgIlIADARCQAmIgHARCQAmIgEABORAGAiEgBMRAKAiUgAMBEJACYiAcBE\nJACYiAQAE5EAYCISAExEAoCJSAAwEQkAJiIBwEQkAJiIBAATkQBgIhIATEQCgIlIADARCQAmIgHA\nRCQAmIgEABORAGAiEgBMRAKAiUgAMBEJACYiAcBEJACYiAQAE5EAYCISAExEAoCJSAAwEQkAJiIB\nwEQkAJiIBAATkQBgIhIATEQCgIlIADARCQAmIgHARCQAmIgEABORAGAiEgBMRAKAiUgAMBEJACYi\nAcBEJACYiAQAE5EAYCISAExEAoCJSAAwEQkAJiIBwEQkAJjSvR4ACeTrr6UdO6RZs6QLF6ScHGn9\nemnOHK8ng4fSXNe91a95y18QU9DTI61cKW3fLj34oBSLSR98EI/Fhx96PR2mT9qN/gCHG4g7cECq\nqJBqaiSfT8rIkOrrpY4OryeDx4gE4sLh+Ne/71lmZUl+vzfzIGEQCcTV1EjHjkkDA1e2nTghzZ/v\n3UxICEQCcYsWSQ89JH3+uRQIxM9RfPON9OqrXk8Gj3HiElecPy9t3hwPRHq69PzzUlNT/M+4Xdzw\niUsiAam3V+rqkp5+WnKc+J5EdrY0NBT/WlDg9YSYPqxu4CacOhX/fIQUX9nIz49//fhj6aefvJ0N\nniMSkMbGpDvumLjNcaRQiL0IEAkovvyZmztxm+PEt+fleTMTEgaRQDwGOTkTt8Vi8VBkZHgzExIG\nkUA8Etc73HBdPkwFIgFdPxKxGJGAJCIB1538xCWRgIhEynNdVxf8foXuvHPidsdRzO+XywepUh6R\nSHGO66ptZESHr1rdGHYc/ScW03/Zk0h5RCLFOY6js8PDcq5a3RiLxdTtOIoSiZRHJFKc4ziKRqPK\nysq6ZrvruvInQCSi0ah+++03OY4jSQqHwzp9+rTHU6UOIpHiHMeR4zhKv+rcQywWS6hIHDt2TLFY\nTJIUDAZ1/Phxj6dKHZyVSnGXI3F1DC7vSVwdD684jqNLly7JcRxFIhF5cGFiykqM3wB4xnXdhN+T\nkKTBwUEdOHBAPp9P4XBYwWDQ65FSBpFIcZMdbkQiEfl8voSJRGFhoRYuXKj09HSNjIzo8OHDXo+U\nMohEirt8SJGRkaGxsTGFw2Hl5uZqdHRU2dnZCROJjIwMFRYWKiMj47qHR5g5RCLF5eTk6MUXX9TZ\ns2f16aefamRkRBUVFXrggQfU0NDg6TmJaDSqtLQbvkcKphmrGykuFotpz5492rBhg6qrq/XWW2+p\nuLhYmzdv1rfffqtTp055MlcoFNKePXs0MDCgzMzMCcEqKCjQ448/7slcqYjb16Ug13V18eJF/fLL\nL3rvvfc0b948vf/++yotLR3/O4FAQFu2bNEXX3yhxx57TGvWrNG8efN0x9XXeEwzx3HU39+v9vZ2\n1dTUqK6ujkOL6cU9LmGLxWLav3+/tm/frrGxMTU3N6uxsfGaD1Nd1tfXp127dmn//v3Kzc1VU1OT\nGhoaVDADd6xyXVednZ06c+aMqqurNWfOHA43ph+RwOT6+vr07rvv6ujRo3r77bfV2Ng4pTe74zga\nGhrSoUOH9Nlnn2l4eFhvvvmmnn322Wl7E4+NjWnfvn2KRqOqr69XUVERgZgZRAJXcRxpaEihH37Q\n2q1bVbF4sVpaWibdc/gnkUhEu3fv1qZNm1RUVKS1a9fq4YcfVklJyU2/qf/8808dPHhQfr9fTz75\npDIzM2/q38GUEAn8TTQqffed9OWXcvLzNbJmjfIXLJiWFYtgMKh9+/Zp9+7d6u/v1/Lly7Vy5UqV\nl5dPORaxWEzff/+92tvbtXr1atXV1cnn41z6DCMS+L+TJ6X16+OheOcdafHia+9j+S+5rqtgMKje\n3l61tbVp7969am5u1htvvKG8f7iB7tjYmD755BO1tbVpy5YtWrJkScJ8BPw2RyRSmuPEn+X59ddS\nW5u0apW0bt21d8KeAa7r6vjx42ptbVV3d7dWr16tFStWqLy8/JrDh8HBQX300Uc6cuSItm7dqpKS\nkhmfD+OIRMoaGZG++ip+eDF3rvTyy1J19S0fw3VdHT58WDt37lRXV5fuv/9+NTc3a/HixfL5fPr1\n11/V2tqquXPnat26dbr77rs5QXlrEYnbWigkrVkTf1ZnMCg9+qjU0hK/F+ULL0hlZdLGjVJ5efzx\nfB6KRCIaHBzUrl27tG3bNs2fP18NDQ3atm2b1q9fr6amJuVM8+EPpoRI3LbCYempp6SXXpJefz3+\nPIwNG+Ix2LhRam+XnnhCuslVi+n0+++/69y5c3rkkUfk8/nU2dmpTZs2qaurSy0tLXruuefYe/DO\nDf/Hc6YoWRw7Jv3xh/Taa1cemPPMM1Jra/y5nStWeDufoaioSKtWrVJxcbGysrIIRJJhvSlZnDsn\nVVZO3FOYNSthb3nf39+vI0eOqKOjQ93d3XIcR4sWLdKJEyfGb0OH5EAkkkVpqXTwoBQIXNnW358Q\nhxfXE4lEFAgEFAgEFAqF5Lquqqurdfr0aUUiEa/Hww0gEsmipka67z5pxw4pEok/YWvvXmnpUq8n\nu66KigotW7ZM9fX1qq2tld/v1+zZszU4OEgkkgyRSBb5+dLOndL27VJdnVRVJQ0PS6+84vVkUzZ7\n9mz5fD7udJ1kOHGZTO69N76K0dcnFRZKM3Al5nTIyspSfn7++AnKzMxMlZSUKCcnR7W1tWpvb9fC\nhQs9nhJTxZ5Esjl0SOrtje9ZJKjS0lLV1taOR+Kuu+7S8uXLJUmNjY1ejoabQCSSzc8/S93dUpIu\nIy5YsEDZ2dm6cOGC16NgiohEsrl4UfqHi6cSWVZWlsrKytTb2+v1KJgiIpFsgsFpv5rzVquqqlJP\nTw8P2EkSRCLJdC9ZopGqKq/H+Feqqqo0ODiocDjs9SiYAiKRZM7k5Sl0Cy79nknZ2dnKy8vT6Oio\n16NgCohEkolEIrfFzVmKioo0PDzMIUcSSP7fthQTjUZvi0jU1tYSiCTBnkSSqaysVLbH94qYDn6/\nf0IkQqGQ+vr6PJwIkyESScBxHHV0dCgQCIx/zqCnp0cnT570erSb1tfXN+HpYOfPn9ePP/7o4USY\nDJFIEr29vbp06dL490NDQxoYGPBwIqSK5D+4TSF/Xw0IBoMeTjI9zpw5M74MOjo6yjmKBEUkkkhn\nZ+f4Q3X++usv3XPPPR5P9O8UFxersrJSkjQwMMCSaIIiEklk6dKlKioqkiQdPXo06a9/KCgoUFlZ\nmdLS0hSLxbitXYIiEkkkLS1t/I3EGwq3CpGAJ0pLS1VcXDz+fUlJiZYtW+bhRJgMt9QHUssN74Ky\nBArARCQAmIgEABORAGAiEgBMRAKAiUgAMBEJACYiAcBEJACYiAQAE5EAYCISAExEAoCJSAAwEQkA\nJiIBwEQkAJiIBAATkQBgIhIATEQCgIlIADARCQAmIgHARCQAmIgEABORAGAiEgBMRAKAiUgAMBEJ\nACYiAcBEJACYiAQAE5EAYCISAExEAoCJSAAwEQkAJiIBwEQkAJiIBAATkQBgIhIATEQCgIlIADAR\nCQAmIgHARCQAmIgEABORAGAiEgBMRAKAiUgAMBEJACYiAcBEJACYiAQAE5EAYCISAExEAoCJSAAw\nEQkAJiIBwEQkAJiIBAATkQBgIhIATEQCgIlIADARCQAmIgHARCQAmIgEABORAGAiEgBMRAKAiUgA\nMBEJACYiAcBEJACY0j14zTQPXhPATWJPAoCJSAAwEQkAJiIBwEQkAJiIBAATkQBgIhIATEQCgIlI\nADARCQAmIgHARCQAmIgEABORAGAiEgBMRAKAiUgAMBEJACYiAcBEJACYiAQA0/8ALkdp5hhfnbIA\nAAAASUVORK5CYII=\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "skchem.vis.draw(mol)" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. todo::\n", "\n", " multiple conformer generation." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Featurizers (fingerprint and descriptor generators)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Chemical representation is not by itself very amenable to data analysis and mining techniques. Often, a fixed length vector representation is required. This is achieved by calculating **features** from the chemical representation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In **scikit-chem**, this is provided by the `descriptors` package. A selection of features are available:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['PhysicochemicalFeaturizer',\n", " 'AtomFeaturizer',\n", " 'AtomPairFeaturizer',\n", " 'MorganFeaturizer',\n", " 'MACCSFeaturizer',\n", " 'TopologicalTorsionFeaturizer',\n", " 'RDKFeaturizer',\n", " 'ErGFeaturizer',\n", " 'ConnectivityInvariantsFeaturizer',\n", " 'FeatureInvariantsFeaturizer',\n", " 'ChemAxonNMRPredictor',\n", " 'ChemAxonFeaturizer',\n", " 'ChemAxonAtomFeaturizer',\n", " 'GraphDistanceTransformer',\n", " 'SpacialDistanceTransformer']" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "skchem.descriptors.__all__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Circular fingerprints](http://www.ncbi.nlm.nih.gov/pubmed/16523386) (of which Morgan fingerprints are an example) are often considered the most consistently well performing descriptor across a wide variety of compounds." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "morgan_fp_idx\n", "0 0\n", "1 0\n", "2 0\n", "3 0\n", "4 0\n", " ..\n", "2043 0\n", "2044 0\n", "2045 0\n", "2046 0\n", "2047 0\n", "Name: MorganFeaturizer, dtype: uint8" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mf = skchem.descriptors.MorganFeaturizer()\n", "mf.transform(mol)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also call the standardizer on a series of `Mol`s:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\r", "MorganFeaturizer: 20% (1 of 5) |######### | Elapsed Time: 0:00:00 ETA: 0:00:00\r", "MorganFeaturizer: 40% (2 of 5) |################## | Elapsed Time: 0:00:00 ETA: 0:00:00\r", "MorganFeaturizer: 60% (3 of 5) |########################### | Elapsed Time: 0:00:00 ETA: 0:00:00\r", "MorganFeaturizer: 80% (4 of 5) |#################################### | Elapsed Time: 0:00:00 ETA: 0:00:00\r", "MorganFeaturizer: 100% (5 of 5) |##############################################| Elapsed Time: 0:00:00 Time: 0:00:00\n" ] }, { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
morgan_fp_idx01234...20432044204520462047
000000...00000
100000...00000
200000...00000
300000...00000
401000...00000
\n", "

5 rows × 2048 columns

\n", "
" ], "text/plain": [ "morgan_fp_idx 0 1 2 3 4 ... 2043 2044 2045 2046 \\\n", "0 0 0 0 0 0 ... 0 0 0 0 \n", "1 0 0 0 0 0 ... 0 0 0 0 \n", "2 0 0 0 0 0 ... 0 0 0 0 \n", "3 0 0 0 0 0 ... 0 0 0 0 \n", "4 0 1 0 0 0 ... 0 0 0 0 \n", "\n", "morgan_fp_idx 2047 \n", "0 0 \n", "1 0 \n", "2 0 \n", "3 0 \n", "4 0 \n", "\n", "[5 rows x 2048 columns]" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mf.transform(mols.structure)" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. note::\n", "\n", " Note that Morgan fingerprints are **1D**, and thus when we use a single ``Mol`` as input, we get the features in a **1D** ``pandas.Series`` . When we use a collection of ``Mol`` s, the features are returned in a ``pandas.DataFrame`` , which is one higher dimension than a ``pandas.Series``, as a collection of ``Mol`` s are a dimension higher than a ``Mol`` by itself.\n", "\n", " Some descriptors, such as the ``AtomFeaturizer`` , will yield **2D** features when used on a ``Mol``, and thus will yield the **3D** ``pandas.Panel`` when used on a collection of ``Mol`` s." ] } ], "metadata": { "celltoolbar": "Raw Cell Format", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" }, "widgets": { "state": {}, "version": "1.1.2" } }, "nbformat": 4, "nbformat_minor": 0 }