Quickstart — scikit-chem 0.0.6 documentation

Imports¶

scikit-chem imports all subpackages with the main package, so all we need to do is import the main package, skchem. We will also need pandas.

In [3]:

import skchem
import pandas as pd

Loading the data¶

We can use skchem.read_sdf to import the sdf file:

In [4]:

ms_raw = skchem.read_sdf('mols.sdf'); ms_raw

Out[4]:

batch
1728-95-6     <Mol: COc1ccc(-c2nc(-c3ccccc3)c(-c3ccccc3)[nH]...
91-08-7                            <Mol: Cc1c(N=C=O)cccc1N=C=O>
89786-04-9      <Mol: CC1(Cn2ccnn2)C(C(=O)O)N2C(=O)CC2S1(=O)=O>
2439-35-2                               <Mol: C=CC(=O)OCCN(C)C>
95-94-3                             <Mol: Clc1cc(Cl)c(Cl)cc1Cl>
                                    ...
89930-60-9    <Mol: CCCn1cc2c3c(cccc31)C1C=C(C)CN(C)C1C2.O=C...
9002-92-0             <Mol: CCCCCCCCCCCCOCCOCCOCCOCCOCCOCCOCCO>
90597-22-1               <Mol: Nc1ccn(C2C=C(CO)C(O)C2O)c(=O)n1>
924-43-6                                      <Mol: CCC(C)ON=O>
97534-21-9              <Mol: O=C1NC(=S)NC(=O)C1C(=O)Nc1ccccc1>
Name: structure, dtype: object

And pandas to import the labels.

In [5]:

y_raw = pd.read_csv('labels.csv').set_index('name').squeeze(); y_raw

Out[5]:

name
1728-95-6        mutagen
91-08-7          mutagen
89786-04-9    nonmutagen
2439-35-2     nonmutagen
95-94-3       nonmutagen
                 ...
89930-60-9       mutagen
9002-92-0     nonmutagen
90597-22-1    nonmutagen
924-43-6         mutagen
97534-21-9    nonmutagen
Name: Ames test categorisation, dtype: object

Quickly check the class balance:

In [9]:

y_raw.value_counts().plot.bar()

Out[9]:

<matplotlib.axes._subplots.AxesSubplot at 0x121204278>

And binarize them:

In [8]:

y_bin = (y_raw == 'mutagen').astype(np.uint8); y_bin

Out[8]:

name
1728-95-6     1
91-08-7       1
89786-04-9    0
2439-35-2     0
95-94-3       0
             ..
89930-60-9    1
9002-92-0     0
90597-22-1    0
924-43-6      1
97534-21-9    0
Name: Ames test categorisation, dtype: uint8

The classes are (mercifully) quite balanced.

Cleaning¶

The data is unlikely to be canonicalized, and potentially contain broken or difficult molecules, so we will now clean it.

Standardization¶

The first step is to apply a Transformer to canonicalize the representations. Specifically, we will use the ChemAxon Standardizer wrapper. Some compounds are likely to fail this procedure, however they are likely to still be valid structures, so we will use the keep_failed configuration option on the object to keep these, rather than returning a None, or raising an error.

Tip

Transformer s implement the transform method, which converts Mol s into something else. This can either be another Mol, such as in this case, or into a vector or even a number. The result will be packaged as a pandas data structure of appropriate dimensionality.

In [8]:

std = skchem.standardizers.ChemAxonStandardizer(keep_failed=True)

In [9]:

ms = std.transform(ms_raw); ms

ChemAxonStandardizer: 100% (4337 of 4337) |####################################| Elapsed Time: 0:00:32 Time: 0:00:32

Out[9]:

batch
1728-95-6     <Mol: COc1ccc(-c2nc(-c3ccccc3)c(-c3ccccc3)[nH]...
91-08-7                            <Mol: Cc1c(N=C=O)cccc1N=C=O>
89786-04-9      <Mol: CC1(Cn2ccnn2)C(C(=O)O)N2C(=O)CC2S1(=O)=O>
2439-35-2                               <Mol: C=CC(=O)OCCN(C)C>
95-94-3                             <Mol: Clc1cc(Cl)c(Cl)cc1Cl>
                                    ...
89930-60-9          <Mol: CCCn1cc2c3c(cccc31)C1C=C(C)CN(C)C1C2>
9002-92-0             <Mol: CCCCCCCCCCCCOCCOCCOCCOCCOCCOCCOCCO>
90597-22-1               <Mol: Nc1ccn(C2C=C(CO)C(O)C2O)c(=O)n1>
924-43-6                                      <Mol: CCC(C)ON=O>
97534-21-9             <Mol: O=C(Nc1ccccc1)c1c(O)nc(=S)[nH]c1O>
Name: structure, dtype: object

Tip

This pattern is the typical way to handle all operations while using scikit-chem. The available configuration options for all classes may be found in the class’s docstring, available in the documentation or using the builtin help function.

Filter undesirable molecules¶

Next, we will remove molecules that are likely to not work well with the circular descriptors that we will use. These are usually large or inorganic molecules.

To do this, we will use some Filters, which implement the filter method.

Tip

Filter s drop compounds that fail a predicicate. The results of the predicate can be found by using transform - that’s right, each Filter is also a Transformer ! Labels with similar index can be passed in as a second argument, and will also be filtered and returned as a second return value.

In [10]:

of = skchem.filters.OrganicFilter()

ms, y = of.filter(ms, y)

OrganicFilter: 100% (4337 of 4337) |###########################################| Elapsed Time: 0:00:05 Time: 0:00:05

In [11]:

mf = skchem.filters.MassFilter(above=100, below=900)

ms, y = mf.filter(ms, y)

MassFilter: 100% (4337 of 4337) |##############################################| Elapsed Time: 0:00:00 Time: 0:00:00

In [12]:

nf = skchem.filters.AtomNumberFilter(above=5, below=100, include_hydrogens=True)

ms, y = nf.filter(ms, y)

AtomNumberFilter: 100% (4068 of 4068) |########################################| Elapsed Time: 0:00:00 Time: 0:00:00

Optimize Geometry¶

We would like to calculate some features that require three dimensional coordinates, so we will next calculate three dimensional conformers using the Universal Force Field. Additionally, some compounds may be unfeasible - these should be dropped from the dataset. In order to do this, we will use the transform_filter method:

In [13]:

uff = skchem.forcefields.UFF()

ms, y = uff.transform_filter(ms, y)

/Users/rich/projects/scikit-chem/skchem/forcefields/base.py:54: UserWarning: Failed to Embed Molecule 109883-99-0
  warnings.warn(msg)
/Users/rich/projects/scikit-chem/skchem/forcefields/base.py:54: UserWarning: Failed to Embed Molecule 135768-83-1
  warnings.warn(msg)
/Users/rich/projects/scikit-chem/skchem/forcefields/base.py:54: UserWarning: Failed to Embed Molecule 13366-73-9
  warnings.warn(msg)
UFF: 100% (4046 of 4046) |#####################################################| Elapsed Time: 0:01:47 Time: 0:01:47

In [14]:

len(ms)

Out[14]:

As we can see, we get a warning that 3 molecules failed to embed, have been dropped. If we didn’t care about the warnings, we could have set the warn_on_fail property to False (or set it using a keyword argument at initialization). Conversely, if we really cared about failures, we could have set error_on_fail to True, which would raise an Error if any Mols failed to embed.

Tip

TransformFilter s implement the transform_filter method. This is a combination of transform and filter , which converts Mol s into something else and drops instances that fail the predicate. The ChemAxonStandardizer object is also a TransformFilter, as it can drop Mol s that fail to standardize.

Visualize Chemical Space¶

scikit-chem adds a custom mol accessor to pandas.Series, which provides a shorthand for calling methods on all Mols in the collection. This is analogous to the str accessor:

In [15]:

y_raw.str.get_dummies()

Out[15]:

	mutagen	nonmutagen
name
1728-95-6	1	0
91-08-7	1	0
89786-04-9	0	1
2439-35-2	0	1
95-94-3	0	1
...	...	...
89930-60-9	1	0
9002-92-0	0	1
90597-22-1	0	1
924-43-6	1	0
97534-21-9	0	1

4043 rows × 2 columns

We will use this function to binarize the labels:

In [16]:

y = y.str.get_dummies()['mutagen']

Amongst other options, it is provides access to chemical space plotting functionality. This will featurize the molecules using a passed featurizer (or a string shortcut), and a dimensionality reduction technique to reduce the feature space to two dimensions, which are then plotted. In this example, we use circular Morgan fingerprints, reduced by t-SNE to visualize structural diversity in the dataset.

In [17]:

ms.mol.visualize(fper='morgan',
                 dim_red='tsne', dim_red_kw={'method': 'exact'},
                 c=y,
                 cmap='copper')

The data appears to be reasonably separable in structural space, so we may suspect that Morgan fingerprints will be a good representation for modelling the data.

Featurizing the data¶

As previously noted, Morgan fingerprints would be a good fit for this data. To calculate them, we will use the MorganFeaturizer class, which is a Transformer.

In [18]:

mf = skchem.descriptors.MorganFeaturizer()

X, y = mf.transform(ms, y); X

MorganFeaturizer: 100% (4043 of 4043) |########################################| Elapsed Time: 0:00:01 Time: 0:00:01

Out[18]:

morgan_fp_idx	0	1	2	3	4	...	2043	2044	2045	2046	2047
batch
1728-95-6	0	0	0	0	0	...	0	0	0	0	0
91-08-7	0	0	0	0	0	...	0	0	0	0	0
89786-04-9	0	0	0	0	0	...	0	0	0	0	0
2439-35-2	0	0	0	0	0	...	0	0	0	0	0
95-94-3	0	0	0	0	0	...	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...
89930-60-9	0	0	0	0	0	...	0	0	0	0	0
9002-92-0	0	0	0	0	0	...	1	0	1	0	0
90597-22-1	0	0	0	0	0	...	0	0	0	0	0
924-43-6	0	0	0	0	0	...	0	0	0	0	0
97534-21-9	0	0	0	0	0	...	0	0	0	0	0

4043 rows × 2048 columns

Pipelining¶

If this process appeared unnecessarily laborious (as it should!), scikit-chem provides a Pipeline class that will sequentially apply objects passed to it. For this example, we could have simply performed:

In [35]:

pipeline = skchem.pipeline.Pipeline([
        skchem.standardizers.ChemAxonStandardizer(keep_failed=True),
        skchem.forcefields.UFF(),
        skchem.filters.OrganicFilter(),
        skchem.filters.MassFilter(above=100, below=1000),
        skchem.filters.AtomNumberFilter(above=5, below=100),
        skchem.descriptors.MorganFeaturizer()
])

X, y = pipeline.transform_filter(ms_raw, y_raw)

ChemAxonStandardizer: 100% (4337 of 4337) |####################################| Elapsed Time: 0:00:22 Time: 0:00:22
/Users/rich/projects/scikit-chem/skchem/forcefields/base.py:54: UserWarning: Failed to Embed Molecule 37364-66-2
  warnings.warn(msg)
/Users/rich/projects/scikit-chem/skchem/forcefields/base.py:54: UserWarning: Failed to Embed Molecule 109883-99-0
  warnings.warn(msg)
/Users/rich/projects/scikit-chem/skchem/forcefields/base.py:54: UserWarning: Failed to Embed Molecule 135768-83-1
  warnings.warn(msg)
/Users/rich/projects/scikit-chem/skchem/forcefields/base.py:54: UserWarning: Failed to Embed Molecule 57817-89-7
  warnings.warn(msg)
/Users/rich/projects/scikit-chem/skchem/forcefields/base.py:54: UserWarning: Failed to Embed Molecule 58071-32-2
  warnings.warn(msg)
/Users/rich/projects/scikit-chem/skchem/forcefields/base.py:54: UserWarning: Failed to Embed Molecule 13366-73-9
  warnings.warn(msg)
/Users/rich/projects/scikit-chem/skchem/forcefields/base.py:54: UserWarning: Failed to Embed Molecule 89213-87-6
  warnings.warn(msg)
UFF: 100% (4337 of 4337) |#####################################################| Elapsed Time: 0:01:49 Time: 0:01:49
OrganicFilter: 100% (4330 of 4330) |###########################################| Elapsed Time: 0:00:05 Time: 0:00:05
MassFilter: 100% (4330 of 4330) |##############################################| Elapsed Time: 0:00:00 Time: 0:00:00
AtomNumberFilter: 100% (4070 of 4070) |########################################| Elapsed Time: 0:00:00 Time: 0:00:00
MorganFeaturizer: 100% (4047 of 4047) |########################################| Elapsed Time: 0:00:00 Time: 0:00:00

Modelling the data¶

In this section, we will try building some basic scikit-learn models on the data.

Partitioning the data¶

To decide on the best model to use, we should perform some model selection. This will require comparing the relative performance of a selection of candidate molecules each trained on the same train set, and evaluated on a validation set.

In cheminformatics, partitioning datasets usually requires some thought, as chemical datasets usually vastly overrepresent certain scaffolds, and underrepresent others. In order to get as unbiased an estimate of performance as possible, one can either downsample compounds in a region of high density, or artifically favor splits that pool in the same split molecules that are too close in chemical space.

scikit-chem provides this functionality in the SimThresholdSplit class, which applies single link heirachical clustering to produce a large number of clusters consisting of highly similar compounds. These clusters are then randomly assigned to the desired splits, such that no split contains compounds that are more similar to compounds in any other split than the clustering threshold.

In [37]:

cv = skchem.cross_validation.SimThresholdSplit(fper=None, n_jobs=4).fit(X)
train, valid, test = cv.split((60, 20, 20))
X_train, X_valid, X_test = X[train], X[valid], X[test]
y_train, y_valid, y_test = y[train], y[valid], y[test]

Model selection¶

In [21]:

import sklearn.ensemble
import sklearn.linear_model
import sklearn.naive_bayes

In [38]:

rf = sklearn.ensemble.RandomForestClassifier(n_estimators=100)
nb = sklearn.naive_bayes.BernoulliNB()
lr = sklearn.linear_model.LogisticRegression()

In [39]:

X_train.shape, y_train.shape

Out[39]:

((2428, 2048), (2428,))

In [42]:

rf_score = rf.fit(X_train, y_train).score(X_valid, y_valid)
nb_score = nb.fit(X_train, y_train).score(X_valid, y_valid)
lr_score = lr.fit(X_train, y_train).score(X_valid, y_valid)

print(rf_score, nb_score, lr_score)

0.843016069221 0.812113720643 0.796044499382

Random Forests appear to work best (although we should have chosen hyperparameters using Random or Grid search).

Assessing the Final performance¶

In [43]:

rf.fit(X_train.append(X_valid), y_train.append(y_valid)).score(X_test, y_test)

Out[43]:

0.83580246913580247