scikit-chem provides a high-level, Pythonic interface to the rdkit library, with wrappers for other popular cheminformatics tools.
For a brief introduction to the ideas behind the package, please read the introductory notes. Installation info may be found on the installation page. To get started straight away, try the quick start guide. For a more in-depth understanding, check out the tutorial and the API reference.
To read the code, submit feature requests, report a bug or contribute to the project, please visit the project's GitHub repository.
scikit-chem is a high-level cheminformatics library built on rdkit that aims to integrate with the Scientific Python Stack by promoting interoperability with libraries such as pandas and scikit-learn, and by emulating the patterns and APIs found in those libraries.
Some notable features include:
Pythonic core API
A simple interface for chemical datasets
Structure visualization
Interactivity in Jupyter Notebooks
scikit-chem should be thought of as a simple complement to the excellent rdkit - scikit-chem objects are subclasses of rdkit objects, and as such, the two libraries can usually be used together easily when the advanced functionality of rdkit is required.
New features, improvements and bug-fixes by release.
This is a minor release in the unstable 0.0.x series, with breaking API changes.
0187d92: Improvements to the rdkit abstraction views (Mol.atoms, Mol.bonds, {Mol,Atom,Bond}.props).
This is a minor release in the unstable 0.0.x series, with breaking API changes.
Highlights include a refactor of base classes to provide a more consistent and extensible API, the construction of this documentation and incremental improvements to the continuous integration.
Objects no longer take pandas dataframes as input directly, but instead require molecules to be passed as a Series, with their data as a supplemental series or dataframe (this may be reverted in a patch).
Base classes were established for Transformer, Filter and TransformFilter.
Verbosity options were added, allowing progress bars for most objects.
Dataset support was added.
We will be working on a mutagenicity dataset, released by Kazius et al. The 4337 compounds, provided as the file mols.sdf, were subjected to the Ames test. The results are given in labels.csv. We will clean the molecules and perform a brief chemical space analysis, before finally assessing potential predictive models built on the data.
scikit-chem imports all subpackages with the main package, so all we need to do is import the main package, skchem. We will also need pandas.
In [3]:
import skchem
import pandas as pd
We can use skchem.read_sdf to import the sdf file:
In [4]:
ms_raw = skchem.read_sdf('mols.sdf'); ms_raw
Out[4]:
batch
1728-95-6 <Mol: COc1ccc(-c2nc(-c3ccccc3)c(-c3ccccc3)[nH]...
91-08-7 <Mol: Cc1c(N=C=O)cccc1N=C=O>
89786-04-9 <Mol: CC1(Cn2ccnn2)C(C(=O)O)N2C(=O)CC2S1(=O)=O>
2439-35-2 <Mol: C=CC(=O)OCCN(C)C>
95-94-3 <Mol: Clc1cc(Cl)c(Cl)cc1Cl>
...
89930-60-9 <Mol: CCCn1cc2c3c(cccc31)C1C=C(C)CN(C)C1C2.O=C...
9002-92-0 <Mol: CCCCCCCCCCCCOCCOCCOCCOCCOCCOCCOCCO>
90597-22-1 <Mol: Nc1ccn(C2C=C(CO)C(O)C2O)c(=O)n1>
924-43-6 <Mol: CCC(C)ON=O>
97534-21-9 <Mol: O=C1NC(=S)NC(=O)C1C(=O)Nc1ccccc1>
Name: structure, dtype: object
And pandas to import the labels.
In [5]:
y_raw = pd.read_csv('labels.csv').set_index('name').squeeze(); y_raw
Out[5]:
name
1728-95-6 mutagen
91-08-7 mutagen
89786-04-9 nonmutagen
2439-35-2 nonmutagen
95-94-3 nonmutagen
...
89930-60-9 mutagen
9002-92-0 nonmutagen
90597-22-1 nonmutagen
924-43-6 mutagen
97534-21-9 nonmutagen
Name: Ames test categorisation, dtype: object
Quickly check the class balance:
In [9]:
y_raw.value_counts().plot.bar()
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x121204278>
And binarize them:
In [8]:
import numpy as np  # needed for the uint8 dtype below
y_bin = (y_raw == 'mutagen').astype(np.uint8); y_bin
Out[8]:
name
1728-95-6 1
91-08-7 1
89786-04-9 0
2439-35-2 0
95-94-3 0
..
89930-60-9 1
9002-92-0 0
90597-22-1 0
924-43-6 1
97534-21-9 0
Name: Ames test categorisation, dtype: uint8
The classes are (mercifully) quite balanced.
The data is unlikely to be canonicalized, and may contain broken or difficult molecules, so we will now clean it.
The first step is to apply a Transformer to canonicalize the representations. Specifically, we will use the ChemAxon Standardizer wrapper. Some compounds are likely to fail this procedure, but they are likely to still be valid structures, so we will use the keep_failed configuration option on the object to keep these, rather than returning None or raising an error.
Tip
Transformers implement the transform method, which converts Mols into something else. This can either be another Mol, such as in this case, or a vector or even a number. The result will be packaged as a pandas data structure of appropriate dimensionality.
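The packaging rule in the tip above can be illustrated without scikit-chem. The following is only a minimal sketch: `ToyTransformer` is a hypothetical stand-in (not the skchem class), SMILES strings stand in for Mol objects, and a plain dict stands in for the pandas structure.

```python
class ToyTransformer:
    """Hypothetical stand-in for a Transformer; not the skchem implementation."""

    def _transform_one(self, smiles):
        # Per-item computation. Here it is simply the length of the SMILES string.
        return len(smiles)

    def transform(self, items):
        # A single string is treated as one molecule and gives a bare result.
        if isinstance(items, str):
            return self._transform_one(items)
        # A mapping of name -> SMILES gives a mapping of name -> result,
        # i.e. one dimension more than the single-item case.
        return {name: self._transform_one(smi) for name, smi in items.items()}


toy = ToyTransformer()
single = toy.transform('CCO')
many = toy.transform({'ethanol': 'CCO', 'benzene': 'c1ccccc1'})
print(single, many)
```

The point of the design is that callers use one method name whether they hold one molecule or a collection, and the shape of the output follows the shape of the input.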
In [8]:
std = skchem.standardizers.ChemAxonStandardizer(keep_failed=True)
In [9]:
ms = std.transform(ms_raw); ms
ChemAxonStandardizer: 100% (4337 of 4337) |####################################| Elapsed Time: 0:00:32 Time: 0:00:32
Out[9]:
batch
1728-95-6 <Mol: COc1ccc(-c2nc(-c3ccccc3)c(-c3ccccc3)[nH]...
91-08-7 <Mol: Cc1c(N=C=O)cccc1N=C=O>
89786-04-9 <Mol: CC1(Cn2ccnn2)C(C(=O)O)N2C(=O)CC2S1(=O)=O>
2439-35-2 <Mol: C=CC(=O)OCCN(C)C>
95-94-3 <Mol: Clc1cc(Cl)c(Cl)cc1Cl>
...
89930-60-9 <Mol: CCCn1cc2c3c(cccc31)C1C=C(C)CN(C)C1C2>
9002-92-0 <Mol: CCCCCCCCCCCCOCCOCCOCCOCCOCCOCCOCCO>
90597-22-1 <Mol: Nc1ccn(C2C=C(CO)C(O)C2O)c(=O)n1>
924-43-6 <Mol: CCC(C)ON=O>
97534-21-9 <Mol: O=C(Nc1ccccc1)c1c(O)nc(=S)[nH]c1O>
Name: structure, dtype: object
Tip
This pattern is the typical way to handle all operations while using scikit-chem. The available configuration options for all classes may be found in the class’s docstring, available in the documentation or using the builtin help function.
Next, we will remove molecules that are likely to not work well with the circular descriptors that we will use. These are usually large or inorganic molecules.
To do this, we will use some Filters, which implement the filter method.
Tip
Filters drop compounds that fail a predicate. The results of the predicate can be found by using transform - that’s right, each Filter is also a Transformer! Labels with a matching index can be passed in as a second argument, and will also be filtered and returned as a second return value.
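To make the relationship between the predicate, the filtered molecules and the filtered labels concrete, here is a minimal pure-Python sketch. The function `filter_with_labels` and the `contains_nitrogen` predicate are hypothetical illustrations, with dicts of SMILES strings standing in for pandas Series of Mols; the two compounds and their labels are taken from the dataset shown earlier.

```python
def filter_with_labels(mols, labels, predicate):
    """Keep entries of `mols` that pass `predicate`, plus the labels sharing their keys."""
    # The mask is what a transform call would report: one boolean per molecule.
    mask = {name: bool(predicate(mol)) for name, mol in mols.items()}
    kept = {name: mol for name, mol in mols.items() if mask[name]}
    # Labels are aligned by key, so they stay in step with the kept molecules.
    kept_labels = {name: labels[name] for name in kept}
    return kept, kept_labels, mask


def contains_nitrogen(smiles):
    # Toy predicate on the SMILES text (ignores aromatic lowercase 'n').
    return 'N' in smiles


mols = {'91-08-7': 'Cc1c(N=C=O)cccc1N=C=O', '95-94-3': 'Clc1cc(Cl)c(Cl)cc1Cl'}
labels = {'91-08-7': 'mutagen', '95-94-3': 'nonmutagen'}
kept, kept_labels, mask = filter_with_labels(mols, labels, contains_nitrogen)
print(kept, kept_labels, mask)
```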
In [10]:
of = skchem.filters.OrganicFilter()
ms, y = of.filter(ms, y_raw)
OrganicFilter: 100% (4337 of 4337) |###########################################| Elapsed Time: 0:00:05 Time: 0:00:05
In [11]:
mf = skchem.filters.MassFilter(above=100, below=900)
ms, y = mf.filter(ms, y)
MassFilter: 100% (4337 of 4337) |##############################################| Elapsed Time: 0:00:00 Time: 0:00:00
In [12]:
nf = skchem.filters.AtomNumberFilter(above=5, below=100, include_hydrogens=True)
ms, y = nf.filter(ms, y)
AtomNumberFilter: 100% (4068 of 4068) |########################################| Elapsed Time: 0:00:00 Time: 0:00:00
We would like to calculate some features that require three dimensional coordinates, so we will next calculate three dimensional conformers using the Universal Force Field. Additionally, some compounds may be unfeasible - these should be dropped from the dataset. In order to do this, we will use the transform_filter method:
In [13]:
uff = skchem.forcefields.UFF()
ms, y = uff.transform_filter(ms, y)
/Users/rich/projects/scikit-chem/skchem/forcefields/base.py:54: UserWarning: Failed to Embed Molecule 109883-99-0
warnings.warn(msg)
/Users/rich/projects/scikit-chem/skchem/forcefields/base.py:54: UserWarning: Failed to Embed Molecule 135768-83-1
warnings.warn(msg)
/Users/rich/projects/scikit-chem/skchem/forcefields/base.py:54: UserWarning: Failed to Embed Molecule 13366-73-9
warnings.warn(msg)
UFF: 100% (4046 of 4046) |#####################################################| Elapsed Time: 0:01:47 Time: 0:01:47
In [14]:
len(ms)
Out[14]:
4043
As we can see, we get a warning that 3 molecules failed to embed, and these have been dropped. If we didn’t care about the warnings, we could have set the warn_on_fail property to False (or set it using a keyword argument at initialization). Conversely, if we really cared about failures, we could have set error_on_fail to True, which would raise an error if any Mols failed to embed.
Tip
TransformFilters implement the transform_filter method. This is a combination of transform and filter, which converts Mols into something else and drops instances that fail the predicate. The ChemAxonStandardizer object is also a TransformFilter, as it can drop Mols that fail to standardize.
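The interplay of transform_filter with the warn_on_fail and error_on_fail options can be sketched in plain Python. This is an assumption-laden illustration rather than the skchem implementation: the free function `transform_filter` and the toy `parse_int` transform are hypothetical, and a failure is modelled as the transform returning None.

```python
import warnings


def transform_filter(items, func, warn_on_fail=True, error_on_fail=False):
    """Apply `func` to each value, dropping entries for which it returns None."""
    results = {}
    for name, item in items.items():
        out = func(item)
        if out is None:
            msg = 'Failed to transform {}'.format(name)
            if error_on_fail:
                # Strict mode: the first failure aborts the whole run.
                raise ValueError(msg)
            if warn_on_fail:
                warnings.warn(msg)
            continue  # the failed entry is dropped from the output
        results[name] = out
    return results


def parse_int(text):
    # Toy transform that signals failure by returning None instead of raising.
    try:
        return int(text)
    except ValueError:
        return None


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    kept = transform_filter({'a': '1', 'b': 'x', 'c': '3'}, parse_int)
print(kept, len(caught))
```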
scikit-chem adds a custom mol accessor to pandas.Series, which provides a shorthand for calling methods on all Mols in the collection. This is analogous to the str accessor:
In [15]:
y.str.get_dummies()
Out[15]:
name | mutagen | nonmutagen |
---|---|---|
1728-95-6 | 1 | 0 |
91-08-7 | 1 | 0 |
89786-04-9 | 0 | 1 |
2439-35-2 | 0 | 1 |
95-94-3 | 0 | 1 |
... | ... | ... |
89930-60-9 | 1 | 0 |
9002-92-0 | 0 | 1 |
90597-22-1 | 0 | 1 |
924-43-6 | 1 | 0 |
97534-21-9 | 0 | 1 |
4043 rows × 2 columns
We will use this function to binarize the labels:
In [16]:
y = y.str.get_dummies()['mutagen']
Amongst other options, the mol accessor provides access to chemical space plotting functionality. This featurizes the molecules using a passed featurizer (or a string shortcut), then applies a dimensionality reduction technique to reduce the feature space to two dimensions, which are then plotted. In this example, we use circular Morgan fingerprints, reduced by t-SNE, to visualize structural diversity in the dataset.
In [17]:
ms.mol.visualize(fper='morgan',
                 dim_red='tsne', dim_red_kw={'method': 'exact'},
                 c=y,
                 cmap='copper')
The data appears to be reasonably separable in structural space, so we may suspect that Morgan fingerprints will be a good representation for modelling the data.
As previously noted, Morgan fingerprints would be a good fit for this data. To calculate them, we will use the MorganFeaturizer class, which is a Transformer.
In [18]:
mf = skchem.descriptors.MorganFeaturizer()
X, y = mf.transform(ms, y); X
MorganFeaturizer: 100% (4043 of 4043) |########################################| Elapsed Time: 0:00:01 Time: 0:00:01
Out[18]:
morgan_fp_idx | 0 | 1 | 2 | 3 | 4 | ... | 2043 | 2044 | 2045 | 2046 | 2047 |
---|---|---|---|---|---|---|---|---|---|---|---|
batch | |||||||||||
1728-95-6 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
91-08-7 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
89786-04-9 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
2439-35-2 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
95-94-3 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
89930-60-9 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
9002-92-0 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 |
90597-22-1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
924-43-6 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
97534-21-9 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
4043 rows × 2048 columns
If this process appeared unnecessarily laborious (as it should!), scikit-chem provides a Pipeline class that will sequentially apply objects passed to it. For this example, we could have simply performed:
In [35]:
pipeline = skchem.pipeline.Pipeline([
    skchem.standardizers.ChemAxonStandardizer(keep_failed=True),
    skchem.forcefields.UFF(),
    skchem.filters.OrganicFilter(),
    skchem.filters.MassFilter(above=100, below=1000),
    skchem.filters.AtomNumberFilter(above=5, below=100),
    skchem.descriptors.MorganFeaturizer()
])
X, y = pipeline.transform_filter(ms_raw, y_raw)
ChemAxonStandardizer: 100% (4337 of 4337) |####################################| Elapsed Time: 0:00:22 Time: 0:00:22
/Users/rich/projects/scikit-chem/skchem/forcefields/base.py:54: UserWarning: Failed to Embed Molecule 37364-66-2
warnings.warn(msg)
/Users/rich/projects/scikit-chem/skchem/forcefields/base.py:54: UserWarning: Failed to Embed Molecule 109883-99-0
warnings.warn(msg)
/Users/rich/projects/scikit-chem/skchem/forcefields/base.py:54: UserWarning: Failed to Embed Molecule 135768-83-1
warnings.warn(msg)
/Users/rich/projects/scikit-chem/skchem/forcefields/base.py:54: UserWarning: Failed to Embed Molecule 57817-89-7
warnings.warn(msg)
/Users/rich/projects/scikit-chem/skchem/forcefields/base.py:54: UserWarning: Failed to Embed Molecule 58071-32-2
warnings.warn(msg)
/Users/rich/projects/scikit-chem/skchem/forcefields/base.py:54: UserWarning: Failed to Embed Molecule 13366-73-9
warnings.warn(msg)
/Users/rich/projects/scikit-chem/skchem/forcefields/base.py:54: UserWarning: Failed to Embed Molecule 89213-87-6
warnings.warn(msg)
UFF: 100% (4337 of 4337) |#####################################################| Elapsed Time: 0:01:49 Time: 0:01:49
OrganicFilter: 100% (4330 of 4330) |###########################################| Elapsed Time: 0:00:05 Time: 0:00:05
MassFilter: 100% (4330 of 4330) |##############################################| Elapsed Time: 0:00:00 Time: 0:00:00
AtomNumberFilter: 100% (4070 of 4070) |########################################| Elapsed Time: 0:00:00 Time: 0:00:00
MorganFeaturizer: 100% (4047 of 4047) |########################################| Elapsed Time: 0:00:00 Time: 0:00:00
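The idea of sequentially applying steps while keeping labels aligned can be sketched in a few lines. This is a hypothetical `ToyPipeline`, not skchem's Pipeline: each step is assumed to be a plain callable that returns a new value, or None to drop the entry, and dicts stand in for pandas objects.

```python
class ToyPipeline:
    """Hypothetical sketch of a sequential pipeline; not the skchem class."""

    def __init__(self, steps):
        self.steps = steps

    def transform_filter(self, items, labels):
        for step in self.steps:
            # Apply the step to every surviving entry.
            items = {name: step(value) for name, value in items.items()}
            # A None result means the step rejected (or failed on) the entry.
            items = {name: value for name, value in items.items() if value is not None}
            # Keep the labels in step with the surviving entries.
            labels = {name: labels[name] for name in items}
        return items, labels


def short_only(smiles):
    # Toy filter step: drop SMILES strings longer than 10 characters.
    return smiles if len(smiles) <= 10 else None


toy_pipeline = ToyPipeline([str.strip, short_only, len])
features, kept_labels = toy_pipeline.transform_filter(
    {'a': ' CCO ', 'b': 'CCCCCCCCCCCCOCCO', 'c': 'c1ccccc1'},
    {'a': 1, 'b': 0, 'c': 1},
)
print(features, kept_labels)
```

Because every step sees only the entries that survived the previous one, the order of the steps matters, which is consistent with the progress bar counts shrinking from step to step in the output above.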
In this section, we will try building some basic scikit-learn models on the data.
To decide on the best model to use, we should perform some model selection. This will require comparing the relative performance of a selection of candidate models, each trained on the same train set and evaluated on a validation set.
In cheminformatics, partitioning datasets usually requires some thought, as chemical datasets usually vastly overrepresent certain scaffolds, and underrepresent others. In order to get as unbiased an estimate of performance as possible, one can either downsample compounds in regions of high density, or artificially favor splits that place molecules that are too close in chemical space in the same split.
scikit-chem provides this functionality in the SimThresholdSplit class, which applies single-link hierarchical clustering to produce a large number of clusters consisting of highly similar compounds. These clusters are then randomly assigned to the desired splits, such that no split contains compounds that are more similar to compounds in any other split than the clustering threshold.
In [37]:
cv = skchem.cross_validation.SimThresholdSplit(fper=None, n_jobs=4).fit(X)
train, valid, test = cv.split((60, 20, 20))
X_train, X_valid, X_test = X[train], X[valid], X[test]
y_train, y_valid, y_test = y[train], y[valid], y[test]
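The idea behind a similarity threshold split can be sketched independently of scikit-chem. The following is a toy illustration, not the SimThresholdSplit implementation: fingerprints are assumed to be Python sets of "on" bit indices, similarity is assumed to be Tanimoto, and the helper names (`tanimoto`, `threshold_clusters`, `split_clusters`) are hypothetical. Single-link clustering at a fixed threshold is equivalent to taking the connected components of the graph that links every pair at or above the threshold, which is what the union-find below computes.

```python
import random
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def threshold_clusters(fps, threshold):
    """Single-link clusters: connected components of the similarity graph."""
    parent = list(range(len(fps)))

    def find(i):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(fps)), 2):
        if tanimoto(fps[i], fps[j]) >= threshold:
            parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i in range(len(fps)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def split_clusters(clusters, ratios, seed=0):
    """Shuffle clusters, then give each to the split furthest below its target size."""
    rng = random.Random(seed)
    clusters = clusters[:]
    rng.shuffle(clusters)
    total = sum(len(c) for c in clusters)
    targets = [r * total / sum(ratios) for r in ratios]
    splits = [[] for _ in ratios]
    for cluster in clusters:
        deficits = [t - len(s) for t, s in zip(targets, splits)]
        splits[deficits.index(max(deficits))].extend(cluster)
    return splits


fps = [{1, 2, 3}, {1, 2, 3, 4}, {2, 3, 4}, {10, 11}, {10, 11, 12}, {20}]
clusters = threshold_clusters(fps, threshold=0.6)
train, test = split_clusters(clusters, ratios=(2, 1))
print(clusters, train, test)
```

Note that items 0 and 2 have a similarity of only 0.5, yet they share a cluster because both are linked to item 1. Since whole clusters are assigned to a split, no pair of items from different splits reaches the threshold.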
In [21]:
import sklearn.ensemble
import sklearn.linear_model
import sklearn.naive_bayes
In [38]:
rf = sklearn.ensemble.RandomForestClassifier(n_estimators=100)
nb = sklearn.naive_bayes.BernoulliNB()
lr = sklearn.linear_model.LogisticRegression()
In [39]:
X_train.shape, y_train.shape
Out[39]:
((2428, 2048), (2428,))
In [42]:
rf_score = rf.fit(X_train, y_train).score(X_valid, y_valid)
nb_score = nb.fit(X_train, y_train).score(X_valid, y_valid)
lr_score = lr.fit(X_train, y_train).score(X_valid, y_valid)
print(rf_score, nb_score, lr_score)
0.843016069221 0.812113720643 0.796044499382
Random Forests appear to work best (although we should have chosen hyperparameters using Random or Grid search).
In [43]:
rf.fit(pd.concat([X_train, X_valid]), pd.concat([y_train, y_valid])).score(X_test, y_test)
Out[43]:
0.83580246913580247
scikit-chem is easy to install and configure. Detailed instructions are listed below. The quickest way to get everything installed is by using conda.
scikit-chem is tested on Python 2.7 and 3.5. It depends on rdkit, most of the core Scientific Python Stack, as well as several smaller pure Python libraries.
The full list of dependencies is:
The package and these dependencies are available through two different Python package managers, conda and pip. It is recommended to use conda.
conda is a cross-platform, Python-agnostic package and environment manager developed by Continuum Analytics. It provides packages as prebuilt binary files, allowing for straightforward installation of Python packages, even those with complex C/C++ extensions. It is installed as part of the Anaconda Scientific Python distribution, or as the lightweight miniconda.
The package and all dependencies for scikit-chem are available through the defaults or richlewis conda channel. To install:
conda install -c richlewis scikit-chem
This will install scikit-chem with all its dependencies from the author’s anaconda repository as conda packages.
Currently, scikit-chem cannot be configured in a config file. This feature is planned to be added in future releases. To request this feature as a priority, please mention it in the appropriate GitHub issue.
To use the data functionality, you will need to set up fuel. This involves configuring the .fuelrc. An example .fuelrc might be as follows:
data_path: ~/datasets
extra_downloaders:
- skchem.data.downloaders
extra_converters:
- skchem.data.converters
This adds the location for fuel datasets, and adds the scikit-chem data downloaders and converters to the fuel command line tools.
This tutorial is written as a series of Jupyter Notebooks. These may be downloaded from the documentation section of the GitHub page.
To start using scikit-chem, the package to import is skchem:
In [1]:
import skchem
The different functionalities are arranged in subpackages:
In [2]:
skchem.__all__
Out[2]:
['core',
'filters',
'data',
'descriptors',
'io',
'vis',
'cross_validation',
'standardizers',
'interact',
'pipeline']
These are all imported as soon as the base package is imported, so everything is ready to use right away:
In [3]:
skchem.core.Mol()
Out[3]:
<Mol name="None" formula="" at 0x11d01d538>
scikit-chem is first and foremost a wrapper around rdkit to make it more Pythonic, and more intuitive to a user familiar with other libraries in the Scientific Python Stack. The package implements a core Mol class, physically representing a molecule. It is a direct subclass of the rdkit.Chem.Mol class:
In [1]:
import rdkit.Chem
issubclass(skchem.Mol, rdkit.Chem.Mol)
Out[1]:
True
As such, it has all the methods that the rdkit.Chem.Mol class has, for example:
In [2]:
hasattr(skchem.Mol, 'GetAromaticAtoms')
Out[2]:
True
Constructors are provided as classmethods on the skchem.Mol object, in the same fashion as pandas objects are constructed. For example, to make a pandas.DataFrame from a dictionary, you call:
In [3]:
df = pd.DataFrame.from_dict({'a': [10, 20], 'b': [20, 40]}); df
Out[3]:
|   | a | b |
|---|---|---|
| 0 | 10 | 20 |
| 1 | 20 | 40 |
Analogously, to make a skchem.Mol from a SMILES string, you call:
In [4]:
mol = skchem.Mol.from_smiles('CC(=O)Cl'); mol
Out[4]:
<Mol name="None" formula="C2H3ClO" at 0x11dc8f490>
The available methods are:
In [5]:
[method for method in skchem.Mol.__dict__ if method.startswith('from_')]
Out[5]:
['from_tplblock',
'from_molblock',
'from_molfile',
'from_binary',
'from_tplfile',
'from_mol2block',
'from_pdbfile',
'from_pdbblock',
'from_smiles',
'from_smarts',
'from_mol2file',
'from_inchi']
When a molecule fails to parse, a ValueError is raised:
In [6]:
skchem.Mol.from_smiles('NOTSMILES')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-6-99e03ef822e7> in <module>()
----> 1 skchem.Mol.from_smiles('NOTSMILES')
/Users/rich/projects/scikit-chem/skchem/core/mol.py in constructor(_, in_arg, name, *args, **kwargs)
419 m = getattr(rdkit.Chem, 'MolFrom' + constructor_name)(in_arg, *args, **kwargs)
420 if m is None:
--> 421 raise ValueError('Failed to parse molecule, {}'.format(in_arg))
422 m = Mol.from_super(m)
423 m.name = name
ValueError: Failed to parse molecule, NOTSMILES
Atoms and bonds are accessible as properties:
In [7]:
mol.atoms
Out[7]:
<AtomView values="['C', 'C', 'O', 'Cl']" at 0x11dc9ac88>
In [8]:
mol.bonds
Out[8]:
<BondView values="['C-C', 'C=O', 'C-Cl']" at 0x11dc9abe0>
These are iterable:
In [9]:
[a for a in mol.atoms]
Out[9]:
[<Atom element="C" at 0x11dcfe8a0>,
<Atom element="C" at 0x11dcfe9e0>,
<Atom element="O" at 0x11dcfed00>,
<Atom element="Cl" at 0x11dcfedf0>]
subscriptable:
In [10]:
mol.atoms[3]
Out[10]:
<Atom element="Cl" at 0x11dcfef30>
sliceable:
In [11]:
mol.atoms[:3]
Out[11]:
[<Atom element="C" at 0x11dcfebc0>,
<Atom element="C" at 0x11de690d0>,
<Atom element="O" at 0x11de693f0>]
indexable:
In [19]:
mol.atoms[[1, 3]]
Out[19]:
[<Atom element="C" at 0x11de74760>, <Atom element="Cl" at 0x11de7fe40>]
and maskable:
In [18]:
mol.atoms[[True, False, True, False]]
Out[18]:
[<Atom element="C" at 0x11de74ad0>, <Atom element="O" at 0x11de74f30>]
Properties on the rdkit objects are accessible through the props property:
In [11]:
mol.props['is_reactive'] = 'very!'
In [12]:
mol.atoms[1].props['kind'] = 'electrophilic'
mol.atoms[3].props['leaving group'] = 1
mol.bonds[2].props['bond strength'] = 'strong'
These use the rdkit property functionality internally:
In [13]:
mol.GetProp('is_reactive')
Out[13]:
'very!'
Note
RDKit properties can only store strs, ints and floats. Any other type will be coerced to a string before storage.
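The coercion rule in the note can be written down as a small function. This is a sketch of the stated rule only, not of how scikit-chem or RDKit implement it; `coerce_prop` is a hypothetical name.

```python
def coerce_prop(value):
    """Keep str, int and float values as they are; store anything else as its string form."""
    # Note: bool is a subclass of int in Python, so it passes through unchanged here.
    if isinstance(value, (str, int, float)):
        return value
    return str(value)


stored = [coerce_prop(v) for v in ('very!', 1, 0.5, [1, 2])]
print(stored)
```

A practical consequence is that values such as lists do not round-trip: reading the property back gives the string form, not the original object.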
The properties of atoms and bonds are accessible molecule wide:
In [14]:
mol.atoms.props
Out[14]:
<MolPropertyView values="{'leaving group': [nan, nan, nan, 1.0], 'kind': [None, 'electrophilic', None, None]}" at 0x11daf8390>
In [15]:
mol.bonds.props
Out[15]:
<MolPropertyView values="{'bond strength': [None, None, 'strong']}" at 0x11daf80f0>
These can be exported as pandas objects:
In [16]:
mol.atoms.props.to_frame()
Out[16]:
atom_idx | kind | leaving group |
---|---|---|
0 | None | NaN |
1 | electrophilic | NaN |
2 | None | NaN |
3 | None | 1.0 |
Molecules are exported and/or serialized in a very similar way to how they are constructed, again taking inspiration from pandas.
In [17]:
df.to_csv()
Out[17]:
',a,b\n0,10,20\n1,20,40\n'
In [18]:
mol.to_inchi_key()
Out[18]:
'WETWJCDKMRHUPV-UHFFFAOYSA-N'
The total available formats are:
In [19]:
[method for method in skchem.Mol.__dict__ if method.startswith('to_')]
Out[19]:
['to_inchi',
'to_json',
'to_smiles',
'to_smarts',
'to_inchi_key',
'to_binary',
'to_dict',
'to_molblock',
'to_tplfile',
'to_formula',
'to_molfile',
'to_pdbblock',
'to_tplblock']
Pandas objects are the main data structures used for collections of molecules. scikit-chem provides convenience functions to load objects into pandas.DataFrames from common file formats in cheminformatics.
The scikit-chem functionality is modelled after the pandas API.
To load a csv file using pandas you would call:
In [1]:
df = pd.read_csv('https://archive.org/download/scikit-chem_example_files/iris.csv',
header=None); df
Out[1]:
|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
Analogously with scikit-chem:
In [2]:
smi = skchem.read_smiles('https://archive.org/download/scikit-chem_example_files/example.smi')
Currently available:
In [3]:
[method for method in skchem.io.__dict__ if method.startswith('read_')]
Out[3]:
['read_sdf', 'read_smiles']
scikit-chem also adds convenience methods onto pandas.DataFrame objects.
In [4]:
pd.DataFrame.from_smiles('https://archive.org/download/scikit-chem_example_files/example.smi')
Out[4]:
|   | structure | 1 |
|---|---|---|
| 0 | <Mol: CC> | ethane |
| 1 | <Mol: CCC> | propane |
| 2 | <Mol: c1ccccc1> | benzene |
| 3 | <Mol: CC(=O)[O-].[Na+]> | sodium acetate |
| 4 | <Mol: NC(CO)C(=O)O> | serine |
Note
Currently, only read_smiles can read files over a network connection. This functionality is planned to be added in future for all file types.
Again, this is analogous to pandas:
In [5]:
from io import StringIO
sio = StringIO()
df.to_csv(sio)
sio.seek(0)
print(sio.read())
,0,1,2,3,4
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
In [6]:
sio = StringIO()
smi.iloc[:2].to_sdf(sio) # don't write too many!
sio.seek(0)
print(sio.read())
0
RDKit
2 1 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0
M END
> <1> (1)
ethane
$$$$
1
RDKit
3 2 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0
2 3 1 0
M END
> <1> (2)
propane
$$$$
Operations on compounds are implemented as Transformers in scikit-chem, which are analogous to Transformer objects in scikit-learn. These objects define a 1:1 mapping between input and output objects in a collection (i.e. the length of the collection remains the same during a transform). These mappings can be very varied, but the three main types currently implemented in scikit-chem are Standardizers, Forcefields and Featurizers.
Chemical data curation is a difficult concept, and data may be formatted differently depending on the source, or even the habits of the curator.
For example, solvents or salts might be included in the representation, which might be considered an unnecessary detail to a modeller, or even irrelevant to an experimentalist, if the compound is solvated in a standard solvent during the protocol.
Even the structure of molecules that would be considered the ‘same’, can often be drawn very differently. For example, tautomers are arguably the same molecule in different conditions, and mesomers might be considered different aspects of the same molecule.
Often, it is sensible to canonicalize these compounds in a process called Standardization.
In scikit-chem, the standardizers package provides this functionality. Standardizer objects transform Mol objects into other Mol objects, which have their representation canonicalized (or into None if the protocol fails). The details of the canonicalization may be configured at object initialization, or by altering properties.
Tip
Currently, the only available Standardizer is a wrapper of the ChemAxon Standardizer. This requires the ChemAxon JChem software suite to be installed and licensed (free academic licenses are available from the website). We hope to implement an open source Standardizer in future.
As an example, we will standardize sodium acetate:
In [3]:
mol = skchem.Mol.from_smiles('CC(=O)[O-].[Na+]', name='sodium acetate'); mol.to_smiles()
Out[3]:
'CC(=O)[O-].[Na+]'
A Standardizer object is initialized:
In [43]:
std = skchem.standardizers.ChemAxonStandardizer()
Calling transform on sodium acetate yields the conjugate ‘canonical’ acid, acetic acid.
In [44]:
mol_std = std.transform(mol); mol_std.to_smiles()
Out[44]:
'CC(=O)O'
The standardization of a collection of Mols can be achieved by calling transform on a pandas.Series:
In [45]:
mols = skchem.read_smiles('https://archive.org/download/scikit-chem_example_files/example.smi',
name_column=1); mols
Out[45]:
name
ethane <Mol: CC>
propane <Mol: CCC>
benzene <Mol: c1ccccc1>
sodium acetate <Mol: CC(=O)[O-].[Na+]>
serine <Mol: NC(CO)C(=O)O>
Name: structure, dtype: object
In [46]:
std.transform(mols)
ChemAxonStandardizer: 100% (5 of 5) |##########################################| Elapsed Time: 0:00:01 Time: 0:00:01
Out[46]:
name
ethane <Mol: CC>
propane <Mol: CCC>
benzene <Mol: c1ccccc1>
sodium acetate <Mol: CC(=O)O>
serine <Mol: NC(CO)C(=O)O>
Name: structure, dtype: object
A loading bar is provided by default, although this can be disabled by lowering the verbosity:
In [47]:
std.verbose = 0
std.transform(mols)
Out[47]:
name
ethane <Mol: CC>
propane <Mol: CCC>
benzene <Mol: c1ccccc1>
sodium acetate <Mol: CC(=O)O>
serine <Mol: NC(CO)C(=O)O>
Name: structure, dtype: object
Often the three dimensional structure of a compound is of relevance, but many chemical formats, such as SMILES, do not encode this information (and even formats which serialize geometry often provide coordinates in only two dimensions).
To produce a reasonable three dimensional conformer, a compound must be roughly embedded in three dimensions according to local geometrical constraints, and forcefields used to optimize the geometry of a compound.
In scikit-chem, the forcefields package provides access to this functionality. Two forcefields, the Universal Force Field (UFF) and the Merck Molecular Force Field (MMFF) are currently provided. We will use the UFF:
In [23]:
uff = skchem.forcefields.UFF()
mol = uff.transform(mol_std)
In [25]:
mol.atoms
Out[25]:
<AtomView values="['C', 'C', 'O', 'O', 'H', 'H', 'H', 'H']" at 0x12102b6a0>
This uses the forcefield to generate a reasonable three dimensional structure. In rdkit (and thus scikit-chem), conformers are separate entities. The forcefield creates a new conformer on the object:
In [27]:
mol.conformers[0].atom_positions
Out[27]:
[<Point3D coords="(1.22, -0.48, 0.10)" at 0x1214de3d8>,
<Point3D coords="(0.00, 0.10, -0.54)" at 0x1214de098>,
<Point3D coords="(0.06, 1.22, -1.11)" at 0x1214de168>,
<Point3D coords="(-1.20, -0.60, -0.53)" at 0x1214de100>,
<Point3D coords="(1.02, -0.64, 1.18)" at 0x1214de238>,
<Point3D coords="(1.47, -1.45, -0.37)" at 0x1214de1d0>,
<Point3D coords="(2.08, 0.21, -0.00)" at 0x1214de2a0>,
<Point3D coords="(-1.27, -1.51, -0.08)" at 0x1214de308>]
The molecule can be visualized by drawing it:
In [35]:
skchem.vis.draw(mol)
Out[35]:
<matplotlib.image.AxesImage at 0x1236c6978>
Chemical representation is not by itself very amenable to data analysis and mining techniques. Often, a fixed length vector representation is required. This is achieved by calculating features from the chemical representation.
In scikit-chem, this is provided by the descriptors package. A selection of features are available:
In [11]:
skchem.descriptors.__all__
Out[11]:
['PhysicochemicalFeaturizer',
'AtomFeaturizer',
'AtomPairFeaturizer',
'MorganFeaturizer',
'MACCSFeaturizer',
'TopologicalTorsionFeaturizer',
'RDKFeaturizer',
'ErGFeaturizer',
'ConnectivityInvariantsFeaturizer',
'FeatureInvariantsFeaturizer',
'ChemAxonNMRPredictor',
'ChemAxonFeaturizer',
'ChemAxonAtomFeaturizer',
'GraphDistanceTransformer',
'SpacialDistanceTransformer']
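To illustrate what "a fixed length vector representation" means in practice, here is a deliberately simplified sketch. It is not the Morgan algorithm (which hashes atom environments on the molecular graph) and not any featurizer listed above: the hypothetical `hashed_ngram_vector` merely hashes character n-grams of a SMILES string and folds them into a fixed number of bits, so that inputs of any length yield vectors of the same length.

```python
import hashlib


def hashed_ngram_vector(text, n_bits=64, n=3):
    """Fold character n-grams of `text` into a fixed-length 0/1 vector."""
    bits = [0] * n_bits
    # Strings shorter than n still contribute a single (short) n-gram.
    for i in range(max(len(text) - n + 1, 1)):
        gram = text[i:i + n]
        digest = hashlib.sha256(gram.encode('utf-8')).digest()
        # Map the hash onto one of the n_bits positions ("folding").
        idx = int.from_bytes(digest[:4], 'big') % n_bits
        bits[idx] = 1
    return bits


acetic_acid = hashed_ngram_vector('CC(=O)O')
benzene = hashed_ngram_vector('c1ccccc1')
print(len(acetic_acid), len(benzene), sum(acetic_acid))
```

Folding trades information for a fixed size: different n-grams can collide on the same bit, which is also true of folded chemical fingerprints.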
Circular fingerprints (of which Morgan fingerprints are an example) are often considered the most consistently well performing descriptor across a wide variety of compounds.
In [12]:
mf = skchem.descriptors.MorganFeaturizer()
mf.transform(mol)
Out[12]:
morgan_fp_idx
0 0
1 0
2 0
3 0
4 0
..
2043 0
2044 0
2045 0
2046 0
2047 0
Name: MorganFeaturizer, dtype: uint8
We can also call the featurizer on a series of Mols:
In [13]:
mf.transform(mols.structure)
MorganFeaturizer: 100% (5 of 5) |##############################################| Elapsed Time: 0:00:00 Time: 0:00:00
Out[13]:
morgan_fp_idx | 0 | 1 | 2 | 3 | 4 | ... | 2043 | 2044 | 2045 | 2046 | 2047 |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
5 rows × 2048 columns
Note
Morgan fingerprints are 1D, and thus when we use a single Mol as input, we get the features in a 1D pandas.Series. When we use a collection of Mols, the features are returned in a pandas.DataFrame, which is one dimension higher than a pandas.Series, as a collection of Mols is a dimension higher than a Mol by itself.
Some descriptors, such as the AtomFeaturizer, will yield 2D features when used on a Mol, and thus will yield the 3D pandas.Panel when used on a collection of Mols.
Operations that remove compounds from a collection are implemented as Filters in scikit-chem. These are implemented in the skchem.filters package:
In [19]:
skchem.filters.__all__
Out[19]:
['ChiralFilter',
'SMARTSFilter',
'PAINSFilter',
'ElementFilter',
'OrganicFilter',
'AtomNumberFilter',
'MassFilter',
'Filter']
They are used very much like Transformers:
In [20]:
of = skchem.filters.OrganicFilter()
In [21]:
benzene = skchem.Mol.from_smiles('c1ccccc1', name='benzene')
ferrocene = skchem.Mol.from_smiles('[cH-]1cccc1.[cH-]1cccc1.[Fe+2]', name='ferrocene')
norbornane = skchem.Mol.from_smiles('C12CCC(C2)CC1', name='norbornane')
dicyclopentadiene = skchem.Mol.from_smiles('C1C=CC2C1C3CC2C=C3')
ms = [benzene, ferrocene, norbornane, dicyclopentadiene]
In [22]:
of.filter(ms)
OrganicFilter: 100% (4 of 4) |#################################################| Elapsed Time: 0:00:00 Time: 0:00:00
Out[22]:
benzene <Mol: c1ccccc1>
norbornane <Mol: C1CC2CCC1C2>
3 <Mol: C1=CC2C3C=CC(C3)C2C1>
Name: structure, dtype: object
Filters essentially use a predicate function to decide whether to keep or remove instances. The result of this function can be returned using transform:
In [23]:
of.transform(ms)
OrganicFilter: 100% (4 of 4) |#################################################| Elapsed Time: 0:00:00 Time: 0:00:00
Out[23]:
benzene True
ferrocene False
norbornane True
3 True
dtype: bool
As Filters have a transform method, they are themselves Transformers that transform a molecule into the result of the predicate!
In [24]:
issubclass(skchem.filters.Filter, skchem.base.Transformer)
Out[24]:
True
The predicate functions should return None, False or np.nan for negative results, and anything else for positive results.
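The return-value convention for predicates can be sketched in plain Python. This is an illustration that takes the sentence above literally (so `0` and empty strings count as positive); the check scikit-chem actually performs may differ. `is_negative` and `toy_filter` are hypothetical names, and molecules are represented here by simple dictionaries.

```python
import math


def is_negative(result):
    """Return True if a predicate result should count as negative.

    None, False and NaN are negative; anything else is positive.
    """
    if result is None or result is False:
        return True
    return isinstance(result, float) and math.isnan(result)


def toy_filter(items, predicate):
    """Keep the items whose predicate result is positive."""
    return [item for item in items if not is_negative(predicate(item))]


# Toy "molecules": only the name matters for this predicate.
mols = [
    {"name": "benzene", "smiles": "c1ccccc1"},
    {"name": None, "smiles": "C1C=CC2C1C3CC2C=C3"},
]

# The predicate returns the name itself, so unnamed entries yield None.
kept = toy_filter(mols, lambda m: m["name"])
print([m["name"] for m in kept])  # ['benzene']
```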
You can create your own filter by passing a predicate function to the Filter class. For example, perhaps you only want to keep compounds that have a name:
In [25]:
is_named = skchem.filters.Filter(lambda m: m.name is not None)
We carelessly did not set dicyclopentadiene’s name previously, so we want this to get filtered out:
In [26]:
is_named.filter(ms)
Filter: 100% (4 of 4) |########################################################| Elapsed Time: 0:00:00 Time: 0:00:00
Out[26]:
benzene <Mol: c1ccccc1>
ferrocene <Mol: [Fe+2].c1cc[cH-]c1.c1cc[cH-]c1>
norbornane <Mol: C1CC2CCC1C2>
Name: structure, dtype: object
It worked!
A common functionality in cheminformatics is to convert a molecule into something else, and if the conversion fails, to just remove the compound. An example of this is standardization, where one might want to throw away compounds that fail to standardize, or geometry optimization where one might throw away molecules that fail to converge.
This functionality is similar to, but crucially different from, simply filtering, as filtering returns the original compounds rather than the transformed compounds. Instead, there are special Filters, called TransformFilters, that can perform this task in a single method call. To give an example of the functionality, we will use the UFF class:
In [27]:
issubclass(skchem.forcefields.UFF, skchem.filters.base.TransformFilter)
Out[27]:
True
They are instantiated the same way as normal Transformers and Filters:
In [28]:
uff = skchem.forcefields.UFF()
An example molecule that fails is taken from the NCI DTP Diversity set III:
In [29]:
mol_that_fails = skchem.Mol.from_smiles('C[C@H](CCC(=O)O)[C@H]1CC[C@@]2(C)[C@@H]3C(=O)C[C@H]4C(C)(C)[C@@H](O)CC[C@]4(C)[C@H]3C(=O)C[C@]12C',
name='7524')
In [30]:
skchem.vis.draw(mol_that_fails)
Out[30]:
<matplotlib.image.AxesImage at 0x121561eb8>
In [31]:
ms.append(mol_that_fails)
In [32]:
res = uff.filter(ms); res
/Users/rich/projects/scikit-chem/skchem/forcefields/base.py:54: UserWarning: Failed to Embed Molecule 7524
warnings.warn(msg)
UFF: 100% (5 of 5) |###########################################################| Elapsed Time: 0:00:01 Time: 0:00:01
Out[32]:
benzene <Mol: c1ccccc1>
ferrocene <Mol: [Fe+2].c1cc[cH-]c1.c1cc[cH-]c1>
norbornane <Mol: C1CC2CCC1C2>
3 <Mol: C1=CC2C3C=CC(C3)C2C1>
Name: structure, dtype: object
Note
filter returns the original molecules, which have not been optimized:
In [33]:
skchem.vis.draw(res.ix[3])
Out[33]:
<matplotlib.image.AxesImage at 0x12174c198>
In [34]:
res = uff.transform_filter(ms); res
/Users/rich/projects/scikit-chem/skchem/forcefields/base.py:54: UserWarning: Failed to Embed Molecule 7524
warnings.warn(msg)
UFF: 100% (5 of 5) |###########################################################| Elapsed Time: 0:00:01 Time: 0:00:01
Out[34]:
benzene <Mol: [H]c1c([H])c([H])c([H])c([H])c1[H]>
ferrocene <Mol: [Fe+2].[H]c1c([H])c([H])[c-]([H])c1[H].[...
norbornane <Mol: [H]C1([H])C([H])([H])C2([H])C([H])([H])C...
3 <Mol: [H]C1=C([H])C2([H])C3([H])C([H])=C([H])C...
Name: structure, dtype: object
In [35]:
skchem.vis.draw(res.ix[3])
Out[35]:
<matplotlib.image.AxesImage at 0x121925390>
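The difference between filter and transform_filter shown above can be reduced to a few lines of plain Python. This is only a sketch of the idea, not scikit-chem code: `to_float` is a hypothetical transformation that can fail (standing in for a geometry optimization), and the two helper functions are illustrative names.

```python
def to_float(text):
    """Toy transformation that can fail: parse text as a float."""
    try:
        return float(text)
    except ValueError:
        return None  # signal failure instead of raising


def toy_filter(items, transform):
    """Keep the ORIGINAL items whose transformation succeeds."""
    return [item for item in items if transform(item) is not None]


def toy_transform_filter(items, transform):
    """Keep the TRANSFORMED items, dropping failures, in a single pass."""
    results = (transform(item) for item in items)
    return [res for res in results if res is not None]


raw = ["1.5", "2", "not a number", "4e1"]
print(toy_filter(raw, to_float))            # ['1.5', '2', '4e1']
print(toy_transform_filter(raw, to_float))  # [1.5, 2.0, 40.0]
```

Both calls drop the entry that fails, but only the second returns the converted values, which mirrors why the optimized molecules appear only in the transform_filter output.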
scikit-chem expands on the scikit-learn Pipeline object to support filtering. It is initialized using a list of Transformer objects.
In [10]:
pipeline = skchem.pipeline.Pipeline([
skchem.standardizers.ChemAxonStandardizer(keep_failed=True),
skchem.forcefields.UFF(),
skchem.filters.OrganicFilter(),
skchem.descriptors.MorganFeaturizer()])
The pipeline will apply each in turn to objects, using the highest priority function that each object implements, according to the order transform_filter > filter > transform.
For example, our pipeline can transform sodium acetate all the way to fingerprints:
In [11]:
mol = skchem.Mol.from_smiles('CC(=O)[O-].[Na+]')
In [4]:
pipeline.transform_filter(mol)
Out[4]:
morgan_fp_idx
0 0
1 0
2 0
3 0
4 0
..
2043 0
2044 0
2045 0
2046 0
2047 0
Name: MorganFeaturizer, dtype: uint8
It also works on collections of molecules:
In [12]:
mols = skchem.read_smiles('https://archive.org/download/scikit-chem_example_files/example.smi', name_column=1); mols
Out[12]:
batch
ethane <Mol: CC>
propane <Mol: CCC>
benzene <Mol: c1ccccc1>
sodium acetate <Mol: CC(=O)[O-].[Na+]>
serine <Mol: NC(CO)C(=O)O>
Name: structure, dtype: object
In [16]:
pipeline.transform_filter(mols)
ChemAxonStandardizer: 100% (5 of 5) |##########################################| Elapsed Time: 0:00:04 Time: 0:00:04
UFF: 100% (5 of 5) |###########################################################| Elapsed Time: 0:00:00 Time: 0:00:00
OrganicFilter: 100% (5 of 5) |#################################################| Elapsed Time: 0:00:00 Time: 0:00:00
MorganFeaturizer: 100% (5 of 5) |##############################################| Elapsed Time: 0:00:00 Time: 0:00:00
Out[16]:
morgan_fp_idx | 0 | 1 | 2 | 3 | 4 | ... | 2043 | 2044 | 2045 | 2046 | 2047 |
---|---|---|---|---|---|---|---|---|---|---|---|
batch | |||||||||||
ethane | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
propane | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
benzene | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
sodium acetate | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
serine | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 |
5 rows × 2048 columns
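The priority order described above (transform_filter, then filter, then transform) can be sketched as a simple method lookup. The classes and the `run_pipeline` function below are hypothetical illustrations operating on strings; they are not the skchem.pipeline.Pipeline implementation.

```python
class StripOrDrop:
    """Toy transform-filter: strips whitespace and drops blank results."""

    def transform_filter(self, items):
        stripped = (s.strip() for s in items)
        return [s for s in stripped if s]


class DropShort:
    """Toy filter: has both methods, so `filter` should win."""

    def filter(self, items):
        return [s for s in items if len(s) > 1]

    def transform(self, items):
        return [len(s) > 1 for s in items]


class Upper:
    """Toy transformer: only implements transform."""

    def transform(self, items):
        return [s.upper() for s in items]


PRIORITY = ("transform_filter", "filter", "transform")


def run_pipeline(steps, items):
    """Apply each step using the highest-priority method it implements."""
    for step in steps:
        for name in PRIORITY:
            method = getattr(step, name, None)
            if method is not None:
                items = method(items)
                break
        else:
            raise TypeError("step implements none of %s" % (PRIORITY,))
    return items


result = run_pipeline([StripOrDrop(), DropShort(), Upper()],
                      [" cc ", "   ", "c", "ccc"])
print(result)  # ['CC', 'CCC']
```

If `DropShort.transform` had been chosen instead of `filter`, the next step would have received booleans rather than strings, which is why the ordering matters.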
scikit-chem provides a simple interface to chemical datasets, and a framework for constructing these datasets. The data module uses fuel to make complex out-of-memory iterative functionality straightforward (see the fuel documentation). It also offers an abstraction that allows easy loading of smaller datasets that fit in memory.
Datasets consist of sets and sources. Simply put, sets are collections of molecules in the dataset, and sources are types of data relating to these molecules.
For demonstration purposes, we will use the Bursi Ames dataset. This has 3 sets:
In [31]:
skchem.data.BursiAmes.available_sets()
Out[31]:
('train', 'valid', 'test')
And many sources:
In [32]:
skchem.data.BursiAmes.available_sources()
Out[32]:
('G', 'A', 'y', 'A_cx', 'G_d', 'X_morg', 'X_cx', 'X_pc')
Note
Currently, the nature of the sources is not always well documented, but as a guide, X are molecular features, y are target variables, A are atom features, and G are distances. When available, they will be detailed in the docstring of the dataset, accessible with help.
For this example, we will load the X_morg and the y sources for all the sets. These are circular fingerprints, and the target labels (in this case, whether the molecule was a mutagen).
We can load the data for requested sets and sources using the in memory API:
In [33]:
kws = {'sets': ('train', 'valid', 'test'), 'sources':('X_morg', 'y')}
(X_train, y_train), (X_valid, y_valid), (X_test, y_test) = skchem.data.BursiAmes.load_data(**kws)
The requested data is loaded as nested tuples, sorted first by set, and then by source, which can easily be unpacked as above.
In [34]:
print('train shapes:', X_train.shape, y_train.shape)
print('valid shapes:', X_valid.shape, y_valid.shape)
print('test shapes:', X_test.shape, y_test.shape)
train shapes: (3007, 2048) (3007,)
valid shapes: (645, 2048) (645,)
test shapes: (645, 2048) (645,)
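The nested-tuple layout described above (outer level by set, inner level by source) can be mimicked with a small in-memory example. `TOY_DATA` and `toy_load_data` are hypothetical and use made-up numbers; they only illustrate why the result unpacks cleanly into `(X, y)` pairs per set.

```python
# Hypothetical store: TOY_DATA[source][set] holds the rows for that pair.
TOY_DATA = {
    "X": {
        "train": [[0, 1], [1, 0], [1, 1]],
        "valid": [[0, 0]],
        "test": [[1, 1]],
    },
    "y": {
        "train": [1, 0, 1],
        "valid": [0],
        "test": [1],
    },
}


def toy_load_data(sets, sources):
    """Return data as nested tuples: first by set, then by source."""
    return tuple(
        tuple(TOY_DATA[source][set_name] for source in sources)
        for set_name in sets
    )


(X_train, y_train), (X_valid, y_valid), (X_test, y_test) = toy_load_data(
    sets=("train", "valid", "test"), sources=("X", "y"))

print(len(X_train), len(y_train))  # 3 3
print(len(X_test), len(y_test))    # 1 1
```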
The raw data is loaded as numpy arrays:
In [35]:
X_train
Out[35]:
array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]])
In [36]:
y_train
Out[36]:
array([1, 1, 1, ..., 0, 1, 1], dtype=uint8)
Which should be ready to use as fuel for modelling!
The data is originally saved as pandas objects, and can be retrieved as such using the read_frame class method.
Features are available under the ‘feats’ namespace:
In [37]:
skchem.data.BursiAmes.read_frame('feats/X_morg')
Out[37]:
morgan_fp_idx | 0 | 1 | 2 | 3 | 4 | ... | 2043 | 2044 | 2045 | 2046 | 2047 |
---|---|---|---|---|---|---|---|---|---|---|---|
batch | |||||||||||
1728-95-6 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
74550-97-3 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
16757-83-8 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
553-97-9 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
115-39-9 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
874-60-2 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
92-66-0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
594-71-8 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
55792-21-7 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
84987-77-9 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
4297 rows × 2048 columns
Target variables under ‘targets’:
In [39]:
skchem.data.BursiAmes.read_frame('targets/y')
Out[39]:
batch
1728-95-6 1
74550-97-3 1
16757-83-8 1
553-97-9 0
115-39-9 0
..
874-60-2 1
92-66-0 0
594-71-8 1
55792-21-7 0
84987-77-9 1
Name: is_mutagen, dtype: uint8
Set membership masks under ‘indices’:
In [40]:
skchem.data.BursiAmes.read_frame('indices/train')
Out[40]:
batch
1728-95-6 True
74550-97-3 True
16757-83-8 True
553-97-9 True
115-39-9 True
...
874-60-2 False
92-66-0 False
594-71-8 False
55792-21-7 False
84987-77-9 False
Name: split, dtype: bool
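A boolean membership mask like the one above selects the rows belonging to a set. The sketch below shows the idea with plain lists and made-up data; how scikit-chem applies the masks internally is not documented here, so this is an assumption about the intent rather than a description of the library.

```python
from itertools import compress

# Made-up example data: one entry per molecule, in the same order.
names = ["mol_a", "mol_b", "mol_c", "mol_d"]
features = [[0, 1], [1, 1], [0, 0], [1, 0]]

# One boolean mask per set, aligned with the rows above.
train_mask = [True, True, False, False]
test_mask = [False, False, False, True]


def select(rows, mask):
    """Return the rows for which the mask is True."""
    return list(compress(rows, mask))


print(select(names, train_mask))    # ['mol_a', 'mol_b']
print(select(features, test_mask))  # [[1, 0]]
```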
Finally, molecules are accessible via ‘structure’:
In [42]:
skchem.data.BursiAmes.read_frame('structure')
Out[42]:
batch
1728-95-6 <Mol: [H]c1c([H])c([H])c(-c2nc(-c3c([H])c([H])...
119-34-6 <Mol: [H]Oc1c([H])c([H])c(N([H])[H])c([H])c1[N...
371-40-4 <Mol: [H]c1c([H])c(N([H])[H])c([H])c([H])c1F>
2319-96-2 <Mol: [H]c1c([H])c([H])c2c([H])c3c(c([H])c(C([...
1822-51-1 <Mol: [H]c1nc([H])c([H])c(C([H])([H])Cl)c1[H]>
...
84-64-0 <Mol: [H]c1c([H])c([H])c(C(=O)OC2([H])C([H])([...
121808-62-6 <Mol: [H]OC(=O)C1([H])N(C(=O)C2([H])N([H])C(=O...
134-20-3 <Mol: [H]c1c([H])c([H])c(N([H])[H])c(C(=O)OC([...
6441-91-4 <Mol: [H]Oc1c([H])c(S(=O)(=O)O[H])c([H])c2c([H...
97534-21-9 <Mol: [H]Oc1nc(=S)n([H])c(O[H])c1C(=O)N([H])c1...
Name: structure, dtype: object
Note
The dataset building functionality is likely to undergo a large change in future so is not documented here. Please look at the example datasets to understand the format required to build the datasets directly.
The API documentation, autogenerated from the docstrings.
## skchem.core.atom
Defining atoms in scikit-chem.
skchem.core.atom.
Atom
[source]¶Bases: rdkit.Chem.rdchem.Atom
, skchem.core.base.ChemicalObject
Object representing an Atom in scikit-chem.
atomic_mass
¶float – the atomic mass of the atom in u.
atomic_number
¶int – the atomic number of the atom.
bonds
¶tuple<skchem.Bonds> – the bonds to this Atom.
cahn_ingold_prelog
¶The Cahn Ingold Prelog chirality indicator.
chiral_tag
¶int – the chiral tag.
covalent_radius
¶float – the covalent radius in angstroms.
degree
¶int – the degree of the atom.
depleted_degree
¶int – the degree of the atom in the h depleted molecular graph.
electron_affinity
¶float – the first electron affinity in eV.
explicit_valence
¶int – the explicit valence.
formal_charge
¶int – the formal charge.
full_degree
¶int – the full degree of the atom in the h full molecular graph.
hexcode
¶The hexcode to use as a color for the atom.
hybridization_state
¶str – the hybridization state.
implicit_valence
¶int – the implicit valence.
index
¶int – the index of the atom.
intrinsic_state
¶float – the intrinsic state of the atom.
ionisation_energy
¶float – the first ionisation energy in eV.
is_aromatic
¶bool – whether the atom is aromatic.
is_in_ring
¶bool – whether the atom is in a ring.
is_terminal
¶bool – whether the atom is terminal.
kier_hall_alpha_contrib
¶float – the contribution of the atom to the Kier-Hall alpha.
kier_hall_electronegativity
¶float – the Hall-Kier electronegativity.
mcgowan_parameter
¶float – the mcgowan volume parameter
n_explicit_hs
¶int – the number of explicit hydrogens.
n_hs
¶int – the instanced, implicit and explicit number of hydrogens
n_implicit_hs
¶int – the number of implicit hydrogens.
n_instanced_hs
¶int – The number of instanced hydrogens.
n_lone_pairs
¶int – the number of lone pairs.
n_pi_electrons
¶int – the number of pi electrons.
n_total_hs
¶int – the total number of hydrogens (according to rdkit).
n_val_electrons
¶int – the number of valence electrons.
owner
¶skchem.Mol – the owning molecule.
Warning
This will seg fault if the atom is created manually.
pauling_electronegativity
¶float – the pauling electronegativity on Pauling scale.
polarisability
¶float – the atomic polarisability in 10^{-20} m^3.
principal_quantum_number
¶int – the principal quantum number.
props
¶PropertyView – rdkit properties of the atom.
sanderson_electronegativity
¶float – the sanderson electronegativity on Pauling scale.
symbol
¶str – the element symbol of the atom.
valence
¶int – the valence.
valence_degree
¶int – the valence degree.
$$ \delta_i^v = Z_i^v - h_i $$
Where $ Z_i^v $ is the number of valence electrons and $ h_i $ is the number of hydrogens.
van_der_waals_radius
¶float – the Van der Waals radius in angstroms.
van_der_waals_volume
¶float – the van der Waals volume in angstroms^3.
$ \frac{4}{3} \pi r_v^3 $
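The two formulas given for valence_degree (valence electrons minus hydrogens) and van_der_waals_volume (the volume of a sphere with the van der Waals radius) are simple enough to check by hand. The functions below are a sketch written from those formulas, not the library's code, and the carbon radius of 1.70 angstroms is an assumed textbook (Bondi) value rather than a number taken from scikit-chem.

```python
import math


def valence_degree(n_val_electrons, n_hs):
    """delta_i^v = Z_i^v - h_i."""
    return n_val_electrons - n_hs


def van_der_waals_volume(radius):
    """Volume of a sphere, (4/3) * pi * r_v**3, in cubic angstroms."""
    return 4.0 / 3.0 * math.pi * radius ** 3


# Heavy atoms of acetic acid, CC(=O)O, as (valence electrons, hydrogens).
acetic_acid = {
    "methyl C": (4, 3),
    "carbonyl C": (4, 0),
    "carbonyl O": (6, 0),
    "hydroxyl O": (6, 1),
}
degrees = {name: valence_degree(z, h) for name, (z, h) in acetic_acid.items()}
print(degrees)

CARBON_RADIUS = 1.70  # assumed van der Waals radius of carbon, in angstroms
print(round(van_der_waals_volume(CARBON_RADIUS), 2))  # 20.58
```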
skchem.core.atom.
AtomView
(owner)[source]¶Bases: skchem.core.base.ChemicalObjectView
adjacency_matrix
(bond_orders=False, force=True)[source]¶The vertex adjacency matrix.
Parameters: |
|
---|---|
Returns: | np.array[int] |
atomic_mass
¶np.array<float> – the atomic mass of the atoms in view
atomic_number
¶np.array<int> – the atomic number of the atoms in view
cahn_ingold_prelog
¶np.array<str> – the CIP string representation of atoms in view.
chiral_tag
¶np.array<str> – the chiral tag of the atoms in view.
covalent_radius
¶np.array<float> – the covalent radius of the atoms in the view.
degree
¶np.array<int> – the degree of the atoms in view, according to rdkit.
depleted_degree
¶np.array<int> – the degree of the atoms in the view in the h-depleted molecular graph.
distance_matrix
(bond_orders=False, force=True)[source]¶The vertex distance matrix.
Parameters: |
|
---|---|
Returns: | np.array[int] |
electron_affinity
¶np.array<float> – the electron affinity of the atoms in the view.
explicit_valence
¶np.array<int> – the explicit valence of the atoms in view.
formal_charge
¶np.array<int> – the formal charge on the atoms in view
full_degree
¶np.array<int> – the degree of the atoms in the view in the h-filled molecular graph.
hexcode
¶The hexcode to use as a color for the atoms in the view.
hybridization_state
¶np.array<str> – the hybridization state of the atoms in view.
One of ‘SP’, ‘SP2’, ‘SP3’, ‘SP3D’, ‘SP3D2’, ‘UNSPECIFIED’, ‘OTHER’
implicit_valence
¶np.array<int> – the implicit valence of the atoms in view.
index
¶pd.Index – an index for the atoms in the view.
intrinsic_state
¶np.ndarray<float> – the intrinsic state of the atoms in the view.
ionisation_energy
¶np.array<float> – the first ionisation energy of the atoms in the view.
is_aromatic
¶np.array<bool> – whether the atoms in the view are aromatic.
is_in_ring
¶np.array<bool> – whether the atoms in the view are in a ring.
is_terminal
¶np.array<bool> – whether the atoms in the view are terminal.
kier_hall_alpha_contrib
¶np.array<float> – the contribution to the kier hall alpha for each atom in the view.
kier_hall_electronegativity
¶np.array<float> – the hall kier electronegativity of the atoms in the view.
mcgowan_parameter
¶np.array<float> – the mcgowan parameter of the atoms in the view.
n_explicit_hs
¶np.array<int> – the number of explicit hydrogens bonded to atoms in view, according to rdkit.
n_hs
¶np.array<int> – the number of hydrogens bonded to atoms in view.
n_implicit_hs
¶np.array<int> – the number of implicit hydrogens bonded to atoms in view, according to rdkit.
n_instanced_hs
¶np.array<int> – the number of instanced hydrogens bonded to atoms in view.
In this case, instanced means the number of Hs explicitly initialized as atoms.
n_lone_pairs
¶np.array<int> – the number of lone pairs on atoms in view.
n_pi_electrons
¶np.array<int> – the number of pi electrons on atoms in view.
n_total_hs
¶np.array<int> – the number of total hydrogens bonded to atoms in view, according to rdkit.
n_val_electrons
¶np.array<int> – the number of valence electrons of the atoms in view.
pauling_electronegativity
¶np.array<float> – the pauling electronegativity of the atoms in the view.
polarisability
¶np.array<float> – the atomic polarisability of the atoms in the view.
principal_quantum_number
¶np.array<float> – the principal quantum number of the atoms in the view.
sanderson_electronegativity
¶np.array<float> – the sanderson electronegativity of the atoms in the view.
symbol
¶np.array<str> – the symbols of the atoms in view
valence
¶np.array<int> – the valence of the atoms in view.
valence_degree
¶np.array<int> – the valence degree of the atoms in the view.
van_der_waals_radius
¶np.array<float> – the Van der Waals radius of the atoms in the view.
van_der_waals_volume
¶np.array<float> – the Van der Waals volume of the atoms in the view.
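The adjacency_matrix and distance_matrix methods listed above return vertex matrices for the molecular graph. As a rough illustration of what such matrices contain, the sketch below builds them from a bond list using plain Python. It ignores the bond_orders and force options, uses -1 for unreachable pairs as an arbitrary choice, and is not how scikit-chem or RDKit compute them.

```python
from collections import deque


def adjacency_matrix(n_atoms, bonds):
    """Symmetric 0/1 matrix with a 1 wherever two atoms share a bond."""
    adj = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        adj[i][j] = adj[j][i] = 1
    return adj


def distance_matrix(adj):
    """Topological distances (number of bonds) via breadth-first search."""
    n = len(adj)
    dist = [[-1] * n for _ in range(n)]  # -1 marks "not reached"
    for start in range(n):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            cur = queue.popleft()
            for nxt in range(n):
                if adj[cur][nxt] and dist[start][nxt] == -1:
                    dist[start][nxt] = dist[start][cur] + 1
                    queue.append(nxt)
    return dist


# Heavy atoms of acetyl chloride, CC(=O)Cl: C0, C1, O2, Cl3.
bonds = [(0, 1), (1, 2), (1, 3)]
adj = adjacency_matrix(4, bonds)
dist = distance_matrix(adj)
for row in dist:
    print(row)
```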
## skchem.core.base
Define base classes for scikit chem objects
skchem.core.base.
ChemicalObject
[source]¶Bases: object
A mixin for each chemical object in scikit-chem.
skchem.core.base.
ChemicalObjectIterator
(view)[source]¶Bases: object
Iterator for chemical object views.
next
()¶
skchem.core.base.
ChemicalObjectView
(owner)[source]¶Bases: object
Abstract iterable view of chemical objects.
Concrete classes inheriting from it should implement __getitem__ and __len__.
props
¶Return a property view of the objects in the view.
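The requirement that concrete views implement __getitem__ and __len__ suggests that iteration is derived from those two methods. The sketch below shows one way such a base class could work; `ToyView` and `SymbolView` are hypothetical classes written for illustration and do not reproduce the ChemicalObjectView source.

```python
class ToyView:
    """Hypothetical abstract view: subclasses supply item access and length."""

    def __getitem__(self, index):
        raise NotImplementedError

    def __len__(self):
        raise NotImplementedError

    def __iter__(self):
        # Iteration is built entirely on __len__ and __getitem__.
        return (self[i] for i in range(len(self)))


class SymbolView(ToyView):
    """Concrete view over a fixed sequence of element symbols."""

    def __init__(self, symbols):
        self._symbols = list(symbols)

    def __getitem__(self, index):
        return self._symbols[index]

    def __len__(self):
        return len(self._symbols)


view = SymbolView(["C", "C", "O", "Cl"])
print(len(view))    # 4
print(list(view))   # ['C', 'C', 'O', 'Cl']
```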
skchem.core.base.
MolPropertyView
(obj_view)[source]¶Bases: skchem.core.base.View
Mol property wrapper.
This provides properties for the atom and bond views.
skchem.core.base.
PropertyView
(owner)[source]¶Bases: skchem.core.base.View
Property object wrapper.
This provides properties for rdkit objects.
## skchem.core.bond
Defining chemical bonds in scikit-chem.
skchem.core.bond.
Atom
[source]¶Bases: rdkit.Chem.rdchem.Atom
, skchem.core.base.ChemicalObject
Object representing an Atom in scikit-chem.
atomic_mass
¶float – the atomic mass of the atom in u.
atomic_number
¶int – the atomic number of the atom.
bonds
¶tuple<skchem.Bonds> – the bonds to this Atom.
cahn_ingold_prelog
¶The Cahn Ingold Prelog chirality indicator.
chiral_tag
¶int – the chiral tag.
covalent_radius
¶float – the covalent radius in angstroms.
degree
¶int – the degree of the atom.
depleted_degree
¶int – the degree of the atom in the h depleted molecular graph.
electron_affinity
¶float – the first electron affinity in eV.
explicit_valence
¶int – the explicit valence.
formal_charge
¶int – the formal charge.
full_degree
¶int – the full degree of the atom in the h full molecular graph.
hexcode
¶The hexcode to use as a color for the atom.
hybridization_state
¶str – the hybridization state.
implicit_valence
¶int – the implicit valence.
index
¶int – the index of the atom.
intrinsic_state
¶float – the intrinsic state of the atom.
ionisation_energy
¶float – the first ionisation energy in eV.
is_aromatic
¶bool – whether the atom is aromatic.
is_in_ring
¶bool – whether the atom is in a ring.
is_terminal
¶bool – whether the atom is terminal.
kier_hall_alpha_contrib
¶float – the contribution of the atom to the Kier-Hall alpha.
kier_hall_electronegativity
¶float – the Hall-Kier electronegativity.
mcgowan_parameter
¶float – the mcgowan volume parameter
n_explicit_hs
¶int – the number of explicit hydrogens.
n_hs
¶int – the instanced, implicit and explicit number of hydrogens
n_implicit_hs
¶int – the number of implicit hydrogens.
n_instanced_hs
¶int – The number of instanced hydrogens.
n_lone_pairs
¶int – the number of lone pairs.
n_pi_electrons
¶int – the number of pi electrons.
n_total_hs
¶int – the total number of hydrogens (according to rdkit).
n_val_electrons
¶int – the number of valence electrons.
owner
¶skchem.Mol – the owning molecule.
Warning
This will seg fault if the atom is created manually.
pauling_electronegativity
¶float – the pauling electronegativity on Pauling scale.
polarisability
¶float – the atomic polarisability in 10^{-20} m^3.
principal_quantum_number
¶int – the principal quantum number.
props
¶PropertyView – rdkit properties of the atom.
sanderson_electronegativity
¶float – the sanderson electronegativity on Pauling scale.
symbol
¶str – the element symbol of the atom.
valence
¶int – the valence.
valence_degree
¶int – the valence degree.
$$ \delta_i^v = Z_i^v - h_i $$
Where $ Z_i^v $ is the number of valence electrons and $ h_i $ is the number of hydrogens.
van_der_waals_radius
¶float – the Van der Waals radius in angstroms.
van_der_waals_volume
¶float – the van der Waals volume in angstroms^3.
$ \frac{4}{3} \pi r_v^3 $
## skchem.core.conformer
Defining conformers in scikit-chem.
skchem.core.conformer.
Conformer
[source]¶Bases: rdkit.Chem.rdchem.Conformer
, skchem.core.base.ChemicalObject
Class representing a Conformer in scikit-chem.
centre_of_mass
¶np.array – the centre of mass of the conformer.
centre_representation
(centre_of_mass=True)[source]¶Centre representation to the center of mass.
Parameters: | centre_of_mass (bool) – Whether to use the masses of atoms to calculate the centre of mass, or just use the mean position coordinate. |
---|---|
Returns: | Conformer |
geometric_centre
¶np.array – the geometric centre of the conformer.
id
¶The ID of the conformer.
is_3d
¶bool – whether the conformer is three dimensional.
owner
¶skchem.Mol – the owning molecule.
positions
¶np.ndarray – the atom positions in the conformer.
Note
This is a copy of the data, not the data itself. You cannot assign to a slice of this.
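The distinction between centre_of_mass and geometric_centre, and the effect of the centre_of_mass flag on centre_representation, can be shown with a two-atom toy system. The functions below are a plain-Python sketch with made-up coordinates and masses; they are not the Conformer implementation, which works on numpy arrays.

```python
def geometric_centre(positions):
    """Unweighted mean of the atom positions."""
    n = len(positions)
    return tuple(sum(p[k] for p in positions) / n for k in range(3))


def centre_of_mass(positions, masses):
    """Mass-weighted mean of the atom positions."""
    total = sum(masses)
    return tuple(
        sum(m * p[k] for m, p in zip(masses, positions)) / total
        for k in range(3)
    )


def centre_representation(positions, masses=None):
    """Shift positions so the chosen centre sits at the origin.

    With masses, the centre of mass is used; without, the geometric centre.
    """
    if masses is not None:
        centre = centre_of_mass(positions, masses)
    else:
        centre = geometric_centre(positions)
    return [tuple(p[k] - centre[k] for k in range(3)) for p in positions]


positions = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
masses = [1.0, 3.0]  # made-up masses: the second atom is three times heavier

print(geometric_centre(positions))               # (2.0, 0.0, 0.0)
print(centre_of_mass(positions, masses))         # (3.0, 0.0, 0.0)
print(centre_representation(positions, masses))  # heavier atom ends up nearer the origin
```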
skchem.core.conformer.
ConformerIterator
(view)[source]¶Bases: object
Iterator for chemical object views.
next
()¶
skchem.core.conformer.
ConformerView
(owner)[source]¶Bases: skchem.core.base.ChemicalObjectView
append_3d
(n_conformers=1, **kwargs)[source]¶Append (a) 3D conformer(s), roughly embedded but not optimized.
Parameters: |
|
---|
id
¶
is_3d
¶
positions
¶
## skchem.core.mol
Defining molecules in scikit-chem.
skchem.core.mol.
Mol
(*args, **kwargs)[source]¶Bases: rdkit.Chem.rdchem.Mol
, skchem.core.base.ChemicalObject
Class representing a Molecule in scikit-chem.
Mol objects inherit directly from rdkit Mol objects. Therefore, they contain atom and bond information, and may also include properties and atom bookmarks.
Example
Constructors are implemented as class methods with the from_ prefix.
>>> import skchem
>>> m = skchem.Mol.from_smiles('CC(=O)Cl'); m
<Mol name="None" formula="C2H3ClO" at ...>
This is an rdkit Mol:
>>> from rdkit.Chem import Mol as RDKMol
>>> isinstance(m, RDKMol)
True
A name can be given at initialization:
>>> m = skchem.Mol.from_smiles('CC(=O)Cl', name='acetyl chloride'); m
<Mol name="acetyl chloride" formula="C2H3ClO" at ...>
>>> m.name
'acetyl chloride'
Serializers are implemented as instance methods with the to_ prefix.
>>> m.to_smiles()
'CC(=O)Cl'
>>> m.to_inchi()
'InChI=1S/C2H3ClO/c1-2(3)4/h1H3'
>>> m.to_inchi_key()
'WETWJCDKMRHUPV-UHFFFAOYSA-N'
RDKit properties are accessible through the props property:
>>> m.SetProp('example_key', 'example_value') # set prop with rdkit directly
>>> m.props['example_key']
'example_value'
>>> m.SetIntProp('float_key', 42) # set int prop with rdkit directly
>>> m.props['float_key']
42
They can be set too:
>>> m.props['example_set'] = 'set_value'
>>> m.GetProp('example_set') # getting with rdkit directly
'set_value'
We can export the properties into a dict or a pandas series:
>>> m.props.to_series()
example_key example_value
example_set set_value
float_key 42
dtype: object
Atoms and bonds are provided in views:
>>> m.atoms
<AtomView values="['C', 'C', 'O', 'Cl']" at ...>
>>> m.bonds
<BondView values="['C-C', 'C=O', 'C-Cl']" at ...>
These are iterable:
>>> [a.symbol for a in m.atoms]
['C', 'C', 'O', 'Cl']
The view provides shorthands for some attributes to get these:
>>> m.atoms.symbol
array(['C', 'C', 'O', 'Cl'], dtype=...)
Atom and bond props can also be set:
>>> m.atoms[0].props['atom_key'] = 'atom_value'
>>> m.atoms[0].props['atom_key']
'atom_value'
The properties for atoms on the whole molecule can be accessed like so:
>>> m.atoms.props
<MolPropertyView values="{'atom_key': ['atom_value', None, None, None]}" at ...>
The properties can be exported as a pandas dataframe:
>>> m.atoms.props.to_frame()
            atom_key
atom_idx
0         atom_value
1               None
2               None
3               None
add_hs
(inplace=False, add_coords=True, explicit_only=False, only_on_atoms=False)[source]¶Add hydrogens to self.
Parameters: |
|
---|---|
Returns: | Mol with Hs added. |
Return type: | skchem.Mol |
atoms
¶List[skchem.Atom] – An iterable over the atoms of the molecule.
bonds
¶List[skchem.Bond] – An iterable over the bonds of the molecule.
conformers
¶List[Conformer] – conformers of the molecule.
from_binary
(binary)[source]¶Decode a molecule from a binary serialization.
Parameters: | binary – The bytes string to decode. |
---|---|
Returns: | The molecule encoded in the binary. |
Return type: | skchem.Mol |
from_inchi
(_, in_arg, name=None, *args, **kwargs)¶The constructor to be bound.
from_mol2block
(_, in_arg, name=None, *args, **kwargs)¶The constructor to be bound.
from_mol2file
(_, in_arg, name=None, *args, **kwargs)¶The constructor to be bound.
from_molblock
(_, in_arg, name=None, *args, **kwargs)¶The constructor to be bound.
from_molfile
(_, in_arg, name=None, *args, **kwargs)¶The constructor to be bound.
from_pdbblock
(_, in_arg, name=None, *args, **kwargs)¶The constructor to be bound.
from_pdbfile
(_, in_arg, name=None, *args, **kwargs)¶The constructor to be bound.
from_smarts
(_, in_arg, name=None, *args, **kwargs)¶The constructor to be bound.
from_smiles
(_, in_arg, name=None, *args, **kwargs)¶The constructor to be bound.
from_tplblock
(_, in_arg, name=None, *args, **kwargs)¶The constructor to be bound.
from_tplfile
(_, in_arg, name=None, *args, **kwargs)¶The constructor to be bound.
mass
¶float – the mass of the molecule.
name
¶str – The name of the molecule.
Raises: | KeyError |
---|
props
¶PropertyView – A dictionary of the properties of the molecule.
remove_hs
(inplace=False, sanitize=True, update_explicit=False, implicit_only=False)[source]¶Remove hydrogens from self.
Parameters: |
|
---|---|
Returns: | Mol with Hs removed. |
Return type: | skchem.Mol |
to_binary
()[source]¶Serialize the molecule to binary encoding.
Returns: | the molecule in bytes. |
---|---|
Return type: | bytes |
Notes
Due to limitations in RDKit, not all data is serialized. Notably, properties are not, so e.g. compound names are not saved.
to_dict
(kind='chemdoodle', conformer_id=-1)[source]¶A dictionary representation of the molecule.
Parameters: | kind (str) – The type of representation to use. Only chemdoodle is currently supported. |
---|---|
Returns: | dictionary representation of the molecule. |
Return type: | dict |
to_inchi
(*args, **kwargs)¶The serializer to be bound.
to_inchi_key
()[source]¶The InChI key of the molecule.
Returns: | the InChI key. |
---|---|
Return type: | str |
Raises: | RuntimeError |
to_json
(kind='chemdoodle')[source]¶Serialize a molecule using JSON.
Parameters: | kind (str) – The type of serialization to use. Only chemdoodle is currently supported. |
---|---|
Returns: | the json string. |
Return type: | str |
to_molblock
(*args, **kwargs)¶The serializer to be bound.
to_molfile
(*args, **kwargs)¶The serializer to be bound.
to_pdbblock
(*args, **kwargs)¶The serializer to be bound.
to_smarts
(*args, **kwargs)¶The serializer to be bound.
to_smiles
(*args, **kwargs)¶The serializer to be bound.
to_tplblock
(*args, **kwargs)¶The serializer to be bound.
to_tplfile
(*args, **kwargs)¶The serializer to be bound.
## skchem.core
Module defining chemical types used in scikit-chem.
skchem.core.
Atom
[source]¶Bases: rdkit.Chem.rdchem.Atom
, skchem.core.base.ChemicalObject
Object representing an Atom in scikit-chem.
atomic_mass
¶float – the atomic mass of the atom in u.
atomic_number
¶int – the atomic number of the atom.
bonds
¶tuple<skchem.Bonds> – the bonds to this Atom.
cahn_ingold_prelog
¶The Cahn Ingold Prelog chirality indicator.
chiral_tag
¶int – the chiral tag.
covalent_radius
¶float – the covalent radius in angstroms.
degree
¶int – the degree of the atom.
depleted_degree
¶int – the degree of the atom in the h depleted molecular graph.
electron_affinity
¶float – the first electron affinity in eV.
explicit_valence
¶int – the explicit valence.
formal_charge
¶int – the formal charge.
full_degree
¶int – the full degree of the atom in the h full molecular graph.
hexcode
¶The hexcode to use as a color for the atom.
hybridization_state
¶str – the hybridization state.
implicit_valence
¶int – the implicit valence.
index
¶int – the index of the atom.
intrinsic_state
¶float – the intrinsic state of the atom.
ionisation_energy
¶float – the first ionisation energy in eV.
is_aromatic
¶bool – whether the atom is aromatic.
is_in_ring
¶bool – whether the atom is in a ring.
is_terminal
¶bool – whether the atom is terminal.
kier_hall_alpha_contrib
¶float – the contribution of the atom to the Kier-Hall alpha.
kier_hall_electronegativity
¶float – the Hall-Kier electronegativity.
mcgowan_parameter
¶float – the mcgowan volume parameter
n_explicit_hs
¶int – the number of explicit hydrogens.
n_hs
¶int – the instanced, implicit and explicit number of hydrogens
n_implicit_hs
¶int – the number of implicit hydrogens.
n_instanced_hs
¶int – The number of instanced hydrogens.
n_lone_pairs
¶int – the number of lone pairs.
n_pi_electrons
¶int – the number of pi electrons.
n_total_hs
¶int – the total number of hydrogens (according to rdkit).
n_val_electrons
¶int – the number of valence electrons.
owner
¶skchem.Mol – the owning molecule.
Warning
This will seg fault if the atom is created manually.
pauling_electronegativity
¶float – the pauling electronegativity on Pauling scale.
polarisability
¶float – the atomic polarisability in 10^{-20} m^3.
principal_quantum_number
¶int – the principal quantum number.
props
¶PropertyView – rdkit properties of the atom.
sanderson_electronegativity
¶float – the sanderson electronegativity on Pauling scale.
symbol
¶str – the element symbol of the atom.
valence
¶int – the valence.
valence_degree
¶int – the valence degree.
$$ \delta_i^v = Z_i^v - h_i $$
Where $ Z_i^v $ is the number of valence electrons and $ h_i $ is the number of hydrogens.
van_der_waals_radius
¶float – the Van der Waals radius in angstroms.
van_der_waals_volume
¶float – the van der Waals volume in angstroms^3.
$ \frac{4}{3} \pi r_v^3 $
skchem.core.
Bond
[source]¶Bases: rdkit.Chem.rdchem.Bond
, skchem.core.base.ChemicalObject
Class representing a chemical bond in scikit-chem.
atom_idxs
¶tuple[int] – list of atom indexes involved in the bond.
atoms
¶tuple[Atom] – list of atoms involved in the bond.
index
¶int – the index of the bond in the molecule.
is_aromatic
¶bool – whether the bond is aromatic.
is_conjugated
¶bool – whether the bond is conjugated.
is_in_ring
¶bool – whether the bond is in a ring.
order
¶int – the order of the bond.
owner
¶skchem.Mol – the molecule this bond is a part of.
props
¶PropertyView – rdkit properties of the bond.
stereo_symbol
¶str – the stereo label of the bond (‘Z’, ‘E’, ‘ANY’, ‘NONE’)
skchem.core.
Conformer
[source]¶Bases: rdkit.Chem.rdchem.Conformer
, skchem.core.base.ChemicalObject
Class representing a Conformer in scikit-chem.
centre_of_mass
¶np.array – the centre of mass of the conformer.
centre_representation
(centre_of_mass=True)[source]¶Centre representation to the center of mass.
Parameters: | centre_of_mass (bool) – Whether to use the masses of atoms to calculate the centre of mass, or just use the mean position coordinate. |
---|---|
Returns: | Conformer |
geometric_centre
¶np.array – the geometric centre of the conformer.
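The difference between the centre of mass and the geometric centre can be sketched without rdkit. The function names and the two-atom data below are assumptions for illustration only; they are not the scikit-chem implementation, which returns numpy arrays.

```python
def geometric_centre_sketch(positions):
    """Unweighted mean of the atom positions."""
    n = len(positions)
    return tuple(sum(p[i] for p in positions) / n for i in range(3))


def centre_of_mass_sketch(positions, masses):
    """Mean of the atom positions weighted by atomic mass."""
    total = sum(masses)
    return tuple(
        sum(m * p[i] for p, m in zip(positions, masses)) / total
        for i in range(3)
    )


# Hypothetical diatomic: a heavy atom at the origin, a light atom on the x axis.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
masses = [35.0, 1.0]

gc = geometric_centre_sketch(positions)
com = centre_of_mass_sketch(positions, masses)
print(gc, com)
```

The geometric centre sits midway between the atoms, while the centre of mass is pulled towards the heavier atom. This is the distinction that the `centre_of_mass` flag of `centre_representation` switches between.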
id
¶The ID of the conformer.
is_3d
¶bool – whether the conformer is three dimensional.
owner
¶skchem.Mol – the owning molecule.
positions
¶np.ndarray – the atom positions in the conformer.
Note
This is a copy of the data, not the data itself. You cannot assign to a slice of this.
skchem.core.
Mol
(*args, **kwargs)[source]¶Bases: rdkit.Chem.rdchem.Mol
, skchem.core.base.ChemicalObject
Class representing a Molecule in scikit-chem.
Mol objects inherit directly from rdkit Mol objects. Therefore, they contain atom and bond information, and may also include properties and atom bookmarks.
Example
Constructors are implemented as class methods with the from_ prefix.
>>> import skchem
>>> m = skchem.Mol.from_smiles('CC(=O)Cl'); m
<Mol name="None" formula="C2H3ClO" at ...>
This is an rdkit Mol:
>>> from rdkit.Chem import Mol as RDKMol
>>> isinstance(m, RDKMol)
True
A name can be given at initialization:
>>> m = skchem.Mol.from_smiles('CC(=O)Cl', name='acetyl chloride'); m # doctest: +ELLIPSIS
<Mol name="acetyl chloride" formula="C2H3ClO" at ...>
>>> m.name
'acetyl chloride'
Serializers are implemented as instance methods with the to_ prefix.
>>> m.to_smiles()
'CC(=O)Cl'
>>> m.to_inchi()
'InChI=1S/C2H3ClO/c1-2(3)4/h1H3'
>>> m.to_inchi_key()
'WETWJCDKMRHUPV-UHFFFAOYSA-N'
RDKit properties are accessible through the props property:
>>> m.SetProp('example_key', 'example_value') # set prop with rdkit directly
>>> m.props['example_key']
'example_value'
>>> m.SetIntProp('float_key', 42) # set int prop with rdkit directly
>>> m.props['float_key']
42
They can be set too:
>>> m.props['example_set'] = 'set_value'
>>> m.GetProp('example_set') # getting with rdkit directly
'set_value'
We can export the properties into a dict or a pandas series:
>>> m.props.to_series()
example_key example_value
example_set set_value
float_key 42
dtype: object
Atoms and bonds are provided in views:
>>> m.atoms
<AtomView values="['C', 'C', 'O', 'Cl']" at ...>
>>> m.bonds
<BondView values="['C-C', 'C=O', 'C-Cl']" at ...>
These are iterable:
>>> [a.symbol for a in m.atoms]
['C', 'C', 'O', 'Cl']
The view provides shorthands for some attributes to get these:
>>> m.atoms.symbol
array(['C', 'C', 'O', 'Cl'], dtype=...)
Atom and bond props can also be set:
>>> m.atoms[0].props['atom_key'] = 'atom_value'
>>> m.atoms[0].props['atom_key']
'atom_value'
The properties for atoms on the whole molecule can be accessed like so:
>>> m.atoms.props
<MolPropertyView values="{'atom_key': ['atom_value', None, None, None]}" at ...>
The properties can be exported as a pandas dataframe:
>>> m.atoms.props.to_frame()
            atom_key
atom_idx
0         atom_value
1               None
2               None
3               None
add_hs
(inplace=False, add_coords=True, explicit_only=False, only_on_atoms=False)[source]¶Add hydrogens to self.
Parameters: |
|
---|---|
Returns: | Mol with Hs added. |
Return type: | skchem.Mol |
atoms
¶List[skchem.Atom] – An iterable over the atoms of the molecule.
bonds
¶List[skchem.Bond] – An iterable over the bonds of the molecule.
conformers
¶List[Conformer] – conformers of the molecule.
from_binary
(binary)[source]¶Decode a molecule from a binary serialization.
Parameters: | binary – The bytes string to decode. |
---|---|
Returns: | The molecule encoded in the binary. |
Return type: | skchem.Mol |
from_inchi
(_, in_arg, name=None, *args, **kwargs)¶The constructor to be bound.
from_mol2block
(_, in_arg, name=None, *args, **kwargs)¶The constructor to be bound.
from_mol2file
(_, in_arg, name=None, *args, **kwargs)¶The constructor to be bound.
from_molblock
(_, in_arg, name=None, *args, **kwargs)¶The constructor to be bound.
from_molfile
(_, in_arg, name=None, *args, **kwargs)¶The constructor to be bound.
from_pdbblock
(_, in_arg, name=None, *args, **kwargs)¶The constructor to be bound.
from_pdbfile
(_, in_arg, name=None, *args, **kwargs)¶The constructor to be bound.
from_smarts
(_, in_arg, name=None, *args, **kwargs)¶The constructor to be bound.
from_smiles
(_, in_arg, name=None, *args, **kwargs)¶The constructor to be bound.
from_tplblock
(_, in_arg, name=None, *args, **kwargs)¶The constructor to be bound.
from_tplfile
(_, in_arg, name=None, *args, **kwargs)¶The constructor to be bound.
mass
¶float – the mass of the molecule.
name
¶str – The name of the molecule.
Raises: | KeyError |
---|
props
¶PropertyView – A dictionary of the properties of the molecule.
remove_hs
(inplace=False, sanitize=True, update_explicit=False, implicit_only=False)[source]¶Remove hydrogens from self.
Parameters: |
|
---|---|
Returns: | Mol with Hs removed. |
Return type: | skchem.Mol |
to_binary
()[source]¶Serialize the molecule to binary encoding.
Returns: | the molecule in bytes. |
---|---|
Return type: | bytes |
Notes
Due to limitations in RDKit, not all data is serialized. Notably, properties are not, so e.g. compound names are not saved.
to_dict
(kind='chemdoodle', conformer_id=-1)[source]¶A dictionary representation of the molecule.
Parameters: | kind (str) – The type of representation to use. Only chemdoodle is currently supported. |
---|---|
Returns: | dictionary representation of the molecule. |
Return type: | dict |
to_inchi
(*args, **kwargs)¶The serializer to be bound.
to_inchi_key
()[source]¶The InChI key of the molecule.
Returns: | the InChI key. |
---|---|
Return type: | str |
Raises: | RuntimeError |
to_json
(kind='chemdoodle')[source]¶Serialize a molecule using JSON.
Parameters: | kind (str) – The type of serialization to use. Only chemdoodle is currently supported. |
---|---|
Returns: | the json string. |
Return type: | str |
to_molblock
(*args, **kwargs)¶The serializer to be bound.
to_molfile
(*args, **kwargs)¶The serializer to be bound.
to_pdbblock
(*args, **kwargs)¶The serializer to be bound.
to_smarts
(*args, **kwargs)¶The serializer to be bound.
to_smiles
(*args, **kwargs)¶The serializer to be bound.
to_tplblock
(*args, **kwargs)¶The serializer to be bound.
to_tplfile
(*args, **kwargs)¶The serializer to be bound.
## skchem.cross_validation.similarity_threshold
Similarity threshold dataset partitioning functionality.
skchem.cross_validation.similarity_threshold.
SimThresholdSplit
(min_threshold=0.45, largest_cluster_fraction=0.1, fper='morgan', similarity_metric='jaccard', memory_optimized=True, n_jobs=1, block_width=1000, verbose=False)[source]¶Bases: object
block_width
¶The width of the subsets of features.
Note
Only used when parallelized.
fit
(inp, pairs=None)[source]¶Fit the cross validator to the data.
inp may be any of:
- pd.Series of Mol instances
- pd.DataFrame with Mol instances in a structure column.
- pd.DataFrame of fingerprints if fper is None
- pd.DataFrame of sim matrix if similarity_metric is None
- np.array of sim matrix if similarity_metric is None
k_fold
(n_folds)[source]¶Returns k-fold cross-validated folds with thresholded similarity.
Parameters: | n_folds (int) – The number of folds to provide. |
---|---|
Returns: | The splits in series. |
Return type: | generator[pd.Series, pd.Series] |
n_instances_
¶The number of instances that were used to fit the object.
n_jobs
¶The number of processes to use to calculate the distance matrix.
-1 for all available.
split
(ratio)[source]¶Return splits of the data with thresholded similarity according to a specified ratio.
Parameters: | ratio (tuple[ints]) – the ratio to use. |
---|---|
Returns: | Generator of boolean split masks for the requested splits. |
Return type: | generator[pd.Series] |
Example
>>> ms = skchem.data.Diversity.read_frame('structure')
>>> st = SimThresholdSplit(fper='morgan',
... similarity_metric='jaccard')
>>> st.fit(ms)
>>> train, valid, test = st.split(ratio=(70, 15, 15))
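The general idea behind similarity-thresholded splitting can be sketched in pure Python. This is not the scikit-chem implementation: it assumes single-linkage clustering on Jaccard similarity followed by greedy assignment of whole clusters to splits, and every function name and the toy fingerprints are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of 'on' fingerprint bits."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def threshold_clusters(fingerprints, threshold):
    """Single-linkage clusters: link every pair with similarity >= threshold."""
    parent = list(range(len(fingerprints)))

    def find(i):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(fingerprints)):
        for j in range(i + 1, len(fingerprints)):
            if jaccard(fingerprints[i], fingerprints[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(fingerprints)):
        clusters.setdefault(find(i), []).append(i)
    # Largest clusters first, so they are placed while splits are still empty.
    return sorted(clusters.values(), key=len, reverse=True)


def split_by_ratio(clusters, ratio):
    """Give each cluster to the split that is furthest below its target size."""
    total = sum(len(c) for c in clusters)
    targets = [r / sum(ratio) * total for r in ratio]
    splits = [[] for _ in ratio]
    for cluster in clusters:
        deficits = [t - len(s) for t, s in zip(targets, splits)]
        splits[deficits.index(max(deficits))].extend(cluster)
    return [sorted(s) for s in splits]


# Toy fingerprints (hypothetical data).
fps = [{1, 2, 3}, {1, 2, 4}, {7, 8}, {7, 8, 9}, {20}, {30}]
clusters = threshold_clusters(fps, threshold=0.45)
train, test = split_by_ratio(clusters, ratio=(2, 1))
print(clusters, train, test)
```

Because whole clusters are assigned together, no pair of instances in different splits has a similarity at or above the threshold, which is the property a thresholded split is meant to guarantee.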
visualize_similarities
(subsample=5000, ax=None)[source]¶Plot a histogram of similarities, with the threshold plotted.
Parameters: |
|
---|---|
Returns: | matplotlib.axes |
visualize_space
(dim_reducer='tsne', dim_red_kw=None, subsample=5000, ax=None, plt_kw=None)[source]¶Plot chemical space using a transformer
Parameters: |
|
---|---|
Returns: | matplotlib.axes |
## skchem.cross_validation
Module implementing cross validation routines useful for chemical data.
skchem.cross_validation.
SimThresholdSplit
(min_threshold=0.45, largest_cluster_fraction=0.1, fper='morgan', similarity_metric='jaccard', memory_optimized=True, n_jobs=1, block_width=1000, verbose=False)[source]¶Bases: object
block_width
¶The width of the subsets of features.
Note
Only used when parallelized.
fit
(inp, pairs=None)[source]¶Fit the cross validator to the data.
inp may be any of:
- pd.Series of Mol instances
- pd.DataFrame with Mol instances in a structure column.
- pd.DataFrame of fingerprints if fper is None
- pd.DataFrame of sim matrix if similarity_metric is None
- np.array of sim matrix if similarity_metric is None
k_fold
(n_folds)[source]¶Returns k-fold cross-validated folds with thresholded similarity.
Parameters: | n_folds (int) – The number of folds to provide. |
---|---|
Returns: | The splits in series. |
Return type: | generator[pd.Series, pd.Series] |
n_instances_
¶The number of instances that were used to fit the object.
n_jobs
¶The number of processes to use to calculate the distance matrix.
-1 for all available.
split
(ratio)[source]¶Return splits of the data with thresholded similarity according to a specified ratio.
Parameters: | ratio (tuple[ints]) – the ratio to use. |
---|---|
Returns: | Generator of boolean split masks for the requested splits. |
Return type: | generator[pd.Series] |
Example
>>> ms = skchem.data.Diversity.read_frame('structure')
>>> st = SimThresholdSplit(fper='morgan',
... similarity_metric='jaccard')
>>> st.fit(ms)
>>> train, valid, test = st.split(ratio=(70, 15, 15))
visualize_similarities
(subsample=5000, ax=None)[source]¶Plot a histogram of similarities, with the threshold plotted.
Parameters: |
|
---|---|
Returns: | matplotlib.axes |
visualize_space
(dim_reducer='tsne', dim_red_kw=None, subsample=5000, ax=None, plt_kw=None)[source]¶Plot chemical space using a transformer
Parameters: |
|
---|---|
Returns: | matplotlib.axes |
# skchem.data.converters.base
Defines the base converter class.
skchem.data.converters.base.
Converter
(directory, output_directory, output_filename='default.h5')[source]¶Bases: object
Create a fuel dataset from molecules and targets.
run
(ms, y, output_path, splits=None, features=None, pytables_kws={'complib': 'bzip2', 'complevel': 9})[source]¶Args:
source_names
¶split_names
¶skchem.data.converters.base.
Feature
(fper, key, axis_names)¶Bases: tuple
axis_names
¶Alias for field number 2
fper
¶Alias for field number 0
key
¶Alias for field number 1
skchem.data.converters.base.
Split
(mask, name, converter)[source]¶Bases: object
contiguous
¶indices
¶ref
¶skchem.data.converters.base.
contiguous_order
(to_order, splits)[source]¶Determine a contiguous order from non-overlapping splits, and put data in that order.
Parameters: |
|
---|---|
Returns: | The data in contiguous order. |
Return type: | iterable<pd.Series, pd.DataFrame, pd.Panel> |
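What "contiguous order" means can be sketched with plain lists and boolean masks. The function below is a hypothetical stand-in, not the scikit-chem function, which operates on pandas objects; it assumes the splits do not overlap.

```python
def contiguous_order_sketch(to_order, splits):
    """Reorder items so that the members of each split are adjacent.

    `splits` is a list of boolean masks, one per split, assumed non-overlapping.
    """
    order = [i for mask in splits for i, flag in enumerate(mask) if flag]
    return [to_order[i] for i in order]


data = ["a", "b", "c", "d", "e"]
train = [True, False, True, False, False]
test = [False, True, False, True, True]

ordered = contiguous_order_sketch(data, [train, test])
print(ordered)
```

After reordering, each split can be described by a single start and stop index instead of a list of scattered indices.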
# skchem.data.converters.example
Formatter for the example dataset.
skchem.data.converters.nmrshiftdb2.
NMRShiftDB2Converter
(directory, output_directory, output_filename='nmrshiftdb2.h5')[source]¶Bases: skchem.data.converters.base.Converter
combine_duplicates
(data)[source]¶Collect duplicate spectra into one dictionary. All shifts are collected into lists.
process_spectra
(data)[source]¶Turn the string representations found in sdf file into a dictionary.
## skchem.data.transformers.tox21
Module defining transformation techniques for tox21.
skchem.data.converters.tox21.
Tox21Converter
(directory, output_directory, output_filename='tox21.h5')[source]¶Bases: skchem.data.converters.base.Converter
Class to build tox21 dataset.
skchem.data.converters.
DiversityConverter
(directory, output_directory, output_filename='diversity.h5')[source]¶Bases: skchem.data.converters.base.Converter
Example Converter, using the NCI DTP Diversity Set III.
skchem.data.converters.
BursiAmesConverter
(directory, output_directory, output_filename='bursi_ames.h5')[source]¶skchem.data.converters.
MullerAmesConverter
(directory, output_directory, output_filename='muller_ames.h5')[source]¶skchem.data.converters.
PhysPropConverter
(directory, output_directory, output_filename='physprop.h5')[source]¶skchem.data.converters.
BradleyOpenMPConverter
(directory, output_directory, output_filename='bradley_open_mp.h5')[source]¶skchem.data.converters.
NMRShiftDB2Converter
(directory, output_directory, output_filename='nmrshiftdb2.h5')[source]¶Bases: skchem.data.converters.base.Converter
combine_duplicates
(data)[source]¶Collect duplicate spectra into one dictionary. All shifts are collected into lists.
process_spectra
(data)[source]¶Turn the string representations found in sdf file into a dictionary.
skchem.data.converters.
Tox21Converter
(directory, output_directory, output_filename='tox21.h5')[source]¶Bases: skchem.data.converters.base.Converter
Class to build tox21 dataset.
skchem.data.converters.
ChEMBLConverter
(directory, output_directory, output_filename='chembl.h5')[source]¶Bases: skchem.data.converters.base.Converter
Converter for the ChEMBL dataset.
skchem.data.datasets.base.
Dataset
(**kwargs)[source]¶Bases: fuel.datasets.hdf5.H5PYDataset
Abstract base class providing an interface to the skchem data format.
download
(output_directory=None, download_directory=None)[source]¶Download the dataset and convert it.
Parameters: |
|
---|---|
Returns: | The path of the downloaded and processed dataset. |
Return type: | str |
load_data
(sets=(), sources=())[source]¶Load a set of sources.
Parameters: |
|
---|
Example
(X_train, y_train), (X_test, y_test) = Dataset.load_data(sets=('train', 'test'), sources=('X', 'y'))
load_set
(set_name, sources=())[source]¶Load the sources for a single set.
Parameters: |
|
---|---|
Returns: |
|
read_frame
(key, *args, **kwargs)[source]¶Load a set of features from the dataset as a pandas object.
Parameters: | key (str) – The HDF5 key for required data. Typically, this will be one of
|
---|---|
Returns: |
|
## skchem.data.datasets
Module defining skchem datasets.
skchem.data.datasets.
Diversity
(**kwargs)[source]¶Bases: skchem.data.datasets.base.Dataset
Example dataset, the NCI DTP Diversity Set III.
converter
¶alias of DiversityConverter
downloader
¶alias of DiversityDownloader
filename
= 'diversity.h5'¶skchem.data.datasets.
BursiAmes
(**kwargs)[source]¶Bases: skchem.data.datasets.base.Dataset
converter
¶alias of BursiAmesConverter
downloader
¶alias of BursiAmesDownloader
filename
= 'bursi_ames.h5'¶skchem.data.datasets.
MullerAmes
(**kwargs)[source]¶Bases: skchem.data.datasets.base.Dataset
converter
¶alias of MullerAmesConverter
downloader
¶alias of MullerAmesDownloader
filename
= 'muller_ames.h5'¶skchem.data.datasets.
PhysProp
(**kwargs)[source]¶Bases: skchem.data.datasets.base.Dataset
converter
¶alias of PhysPropConverter
downloader
¶alias of PhysPropDownloader
filename
= 'physprop.h5'¶skchem.data.datasets.
BradleyOpenMP
(**kwargs)[source]¶Bases: skchem.data.datasets.base.Dataset
converter
¶alias of BradleyOpenMPConverter
downloader
¶alias of BradleyOpenMPDownloader
filename
= 'bradley_open_mp.h5'¶skchem.data.datasets.
NMRShiftDB2
(**kwargs)[source]¶Bases: skchem.data.datasets.base.Dataset
converter
¶alias of NMRShiftDB2Converter
downloader
¶alias of NMRShiftDB2Downloader
filename
= 'nmrshiftdb2.h5'¶skchem.data.downloaders.bradley_open_mp.
BradleyOpenMPDownloader
[source]¶Bases: skchem.data.downloaders.base.Downloader
filenames
= ['bradley_melting_point_dataset.xlsx']¶urls
= ['https://ndownloader.figshare.com/files/1503990']¶skchem.data.downloaders.bursi_ames.
BursiAmesDownloader
[source]¶Bases: skchem.data.downloaders.base.Downloader
filenames
= ['cas_4337.zip']¶urls
= ['http://cheminformatics.org/datasets/bursi/cas_4337.zip']¶
skchem.data.downloaders.diversity.
DiversityDownloader
[source]¶Bases: skchem.data.downloaders.base.Downloader
filenames
= ['structures.sdf']¶urls
= ['https://wiki.nci.nih.gov/download/attachments/160989212/Div3_2DStructures_Oct2014.sdf']¶skchem.data.downloaders.muller_ames.
MullerAmesDownloader
[source]¶Bases: skchem.data.downloaders.base.Downloader
filenames
= ['ci900161g_si_001.zip']¶urls
= ['https://ndownloader.figshare.com/files/4523278']¶skchem.data.downloaders.nmrshiftdb2.
NMRShiftDB2Downloader
[source]¶Bases: skchem.data.downloaders.base.Downloader
filenames
= ['nmrshiftdb2.sdf']¶urls
= ['https://sourceforge.net/p/nmrshiftdb2/code/HEAD/tree/trunk/snapshots/nmrshiftdb2withsignals.sd?format=raw']¶skchem.data.downloaders.physprop.
PhysPropDownloader
[source]¶Bases: skchem.data.downloaders.base.Downloader
filenames
= ['phys_sdf.zip', 'phys_txt.zip']¶urls
= ['http://esc.syrres.com/interkow/Download/phys_sdf.zip', 'http://esc.syrres.com/interkow/Download/phys_txt.zip']¶skchem.data.downloaders.tox21.
Tox21Downloader
[source]¶Bases: skchem.data.downloaders.base.Downloader
filenames
= ['train.sdf.zip', 'valid.sdf.zip', 'test.sdf.zip', 'test.txt']¶urls
= ['https://tripod.nih.gov/tox21/challenge/download?id=tox21_10k_data_allsdf', 'https://tripod.nih.gov/tox21/challenge/download?id=tox21_10k_challenge_testsdf', 'https://tripod.nih.gov/tox21/challenge/download?id=tox21_10k_challenge_scoresdf', 'https://tripod.nih.gov/tox21/challenge/download?id=tox21_10k_challenge_scoretxt']¶skchem.data.downloaders.
DiversityDownloader
[source]¶Bases: skchem.data.downloaders.base.Downloader
filenames
= ['structures.sdf']¶urls
= ['https://wiki.nci.nih.gov/download/attachments/160989212/Div3_2DStructures_Oct2014.sdf']¶skchem.data.downloaders.
ChEMBLDownloader
[source]¶Bases: skchem.data.downloaders.base.Downloader
filenames
= ['chembl_raw.h5']¶urls
= []¶skchem.data.downloaders.
BursiAmesDownloader
[source]¶Bases: skchem.data.downloaders.base.Downloader
filenames
= ['cas_4337.zip']¶urls
= ['http://cheminformatics.org/datasets/bursi/cas_4337.zip']¶skchem.data.downloaders.
MullerAmesDownloader
[source]¶Bases: skchem.data.downloaders.base.Downloader
filenames
= ['ci900161g_si_001.zip']¶urls
= ['https://ndownloader.figshare.com/files/4523278']¶skchem.data.downloaders.
Tox21Downloader
[source]¶Bases: skchem.data.downloaders.base.Downloader
filenames
= ['train.sdf.zip', 'valid.sdf.zip', 'test.sdf.zip', 'test.txt']¶urls
= ['https://tripod.nih.gov/tox21/challenge/download?id=tox21_10k_data_allsdf', 'https://tripod.nih.gov/tox21/challenge/download?id=tox21_10k_challenge_testsdf', 'https://tripod.nih.gov/tox21/challenge/download?id=tox21_10k_challenge_scoresdf', 'https://tripod.nih.gov/tox21/challenge/download?id=tox21_10k_challenge_scoretxt']¶skchem.data.downloaders.
NMRShiftDB2Downloader
[source]¶Bases: skchem.data.downloaders.base.Downloader
filenames
= ['nmrshiftdb2.sdf']¶urls
= ['https://sourceforge.net/p/nmrshiftdb2/code/HEAD/tree/trunk/snapshots/nmrshiftdb2withsignals.sd?format=raw']¶skchem.data.downloaders.
PhysPropDownloader
[source]¶Bases: skchem.data.downloaders.base.Downloader
filenames
= ['phys_sdf.zip', 'phys_txt.zip']¶urls
= ['http://esc.syrres.com/interkow/Download/phys_sdf.zip', 'http://esc.syrres.com/interkow/Download/phys_txt.zip']¶skchem.data.downloaders.
BradleyOpenMPDownloader
[source]¶Bases: skchem.data.downloaders.base.Downloader
filenames
= ['bradley_melting_point_dataset.xlsx']¶urls
= ['https://ndownloader.figshare.com/files/1503990']¶skchem.data
Module for handling data. Data can be accessed using the resource function.
skchem.data.
Diversity
(**kwargs)[source]¶Bases: skchem.data.datasets.base.Dataset
Example dataset, the NCI DTP Diversity Set III.
converter
¶alias of DiversityConverter
downloader
¶alias of DiversityDownloader
filename
= 'diversity.h5'¶skchem.data.
BursiAmes
(**kwargs)[source]¶Bases: skchem.data.datasets.base.Dataset
converter
¶alias of BursiAmesConverter
downloader
¶alias of BursiAmesDownloader
filename
= 'bursi_ames.h5'¶skchem.data.
MullerAmes
(**kwargs)[source]¶Bases: skchem.data.datasets.base.Dataset
converter
¶alias of MullerAmesConverter
downloader
¶alias of MullerAmesDownloader
filename
= 'muller_ames.h5'¶skchem.data.
PhysProp
(**kwargs)[source]¶Bases: skchem.data.datasets.base.Dataset
converter
¶alias of PhysPropConverter
downloader
¶alias of PhysPropDownloader
filename
= 'physprop.h5'¶skchem.data.
BradleyOpenMP
(**kwargs)[source]¶Bases: skchem.data.datasets.base.Dataset
converter
¶alias of BradleyOpenMPConverter
downloader
¶alias of BradleyOpenMPDownloader
filename
= 'bradley_open_mp.h5'¶skchem.data.
NMRShiftDB2
(**kwargs)[source]¶Bases: skchem.data.datasets.base.Dataset
converter
¶alias of NMRShiftDB2Converter
downloader
¶alias of NMRShiftDB2Downloader
filename
= 'nmrshiftdb2.h5'¶# skchem.filters
Chemical filters are defined.
skchem.filters.base.
BaseFilter
(agg='any', **kwargs)[source]¶Bases: skchem.base.BaseTransformer
The base Filter class.
agg
¶callable – The aggregate function to use. String aliases for ‘any’, ‘not any’, ‘all’, ‘not all’ are available.
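One way the string aliases could map to callables is sketched below. The mapping and the `resolve_agg` helper are assumptions for illustration, not the scikit-chem source.

```python
# Hypothetical alias table: each alias reduces an iterable of booleans to one.
AGG_ALIASES = {
    "any": any,
    "all": all,
    "not any": lambda values: not any(values),
    "not all": lambda values: not all(values),
}


def resolve_agg(agg):
    """Return the callable for a string alias, or pass a callable through."""
    if callable(agg):
        return agg
    return AGG_ALIASES[agg]


# Per-pattern matches for a single molecule (hypothetical data).
matches = [False, True, False]
results = {name: resolve_agg(name)(matches) for name in AGG_ALIASES}
print(results)
```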
columns
¶pd.Index – The column index to use.
skchem.filters.base.
Filter
(func=None, agg='any', n_jobs=1, verbose=True)[source]¶Bases: skchem.filters.base.BaseFilter
, skchem.base.Transformer
Filter base class.
Examples
>>> import skchem
Initialize the filter with a function:
>>> is_named = skchem.filters.Filter(lambda m: m.name is not None)
Filter results can be found with transform:
>>> ethane = skchem.Mol.from_smiles('CC', name='ethane')
>>> is_named.transform(ethane)
True
>>> anonymous = skchem.Mol.from_smiles('c1ccccc1')
>>> is_named.transform(anonymous)
False
Can take a series or dataframe:
>>> import pandas as pd
>>> mols = pd.Series({'anonymous': anonymous, 'ethane': ethane})
>>> is_named.transform(mols)
anonymous    False
ethane        True
Name: Filter, dtype: bool
Using filter will drop out molecules that fail the test:
>>> is_named.filter(mols)
ethane    <Mol: CC>
dtype: object
Only failed are retained with the neg keyword argument:
>>> is_named.filter(mols, neg=True)
anonymous    <Mol: c1ccccc1>
dtype: object
skchem.filters.base.
TransformFilter
(agg='any', **kwargs)[source]¶Bases: skchem.filters.base.BaseFilter
Transform Filter object.
Implements transform_filter, which allows a transform, then a filter step returning the transformed values that are not False, None or np.nan.
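The transform-then-filter step described above can be sketched with dictionaries instead of pandas objects. All names here are hypothetical, and the retention rule simply follows the description: values that are False, None or nan are dropped.

```python
import math


def is_retained(value):
    """False, None and nan are dropped; everything else is kept."""
    if value is None or value is False:
        return False
    if isinstance(value, float) and math.isnan(value):
        return False
    return True


def transform_filter_sketch(items, transform):
    """Apply `transform` to every item, then keep only the retained results."""
    transformed = {key: transform(item) for key, item in items.items()}
    return {key: value for key, value in transformed.items() if is_retained(value)}


def carbon_fraction(symbols):
    """Fraction of atoms that are carbon; None if missing, nan if empty."""
    if symbols is None:
        return None
    if not symbols:
        return float("nan")
    return symbols.count("C") / len(symbols)


# Molecules represented as lists of element symbols (hypothetical data).
mols = {"ethanol": ["C", "C", "O"], "empty": [], "missing": None, "methane": ["C"]}
kept = transform_filter_sketch(mols, carbon_fraction)
print(kept)
```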
# skchem.filters.simple
Simple filters for compounds.
skchem.filters.simple.
AtomNumberFilter
(above=3, below=60, include_hydrogens=False, n_jobs=1, verbose=True)[source]¶Bases: skchem.filters.base.Filter
Filter whether the number of atoms in a Mol falls in a defined interval.
above <= n_atoms < below
Examples
>>> import skchem
>>> data = [
... skchem.Mol.from_smiles('CC', name='ethane'),
... skchem.Mol.from_smiles('CCCC', name='butane'),
... skchem.Mol.from_smiles('NC(C)C(=O)O', name='alanine'),
... skchem.Mol.from_smiles('C12C=CC(C=C2)C=C1', name='barrelene')
... ]
>>> af = skchem.filters.AtomNumberFilter(above=3, below=7)
>>> af.transform(data)
ethane False
butane True
alanine True
barrelene False
Name: num_atoms_in_range, dtype: bool
>>> af.filter(data)
butane <Mol: CCCC>
alanine <Mol: CC(N)C(=O)O>
Name: structure, dtype: object
>>> af = skchem.filters.AtomNumberFilter(above=5, below=15, include_hydrogens=True)
>>> af.transform(data)
ethane True
butane True
alanine True
barrelene False
Name: num_atoms_in_range, dtype: bool
columns
¶skchem.filters.simple.
ElementFilter
(elements=None, as_bits=False, agg='any', n_jobs=1, verbose=True)[source]¶Bases: skchem.filters.base.Filter
Filter by elements.
Examples
Basic usage on molecules:
>>> import skchem
>>> hal_f = skchem.filters.ElementFilter(['F', 'Cl', 'Br', 'I'])
Molecules with one of the atoms transform to True.
>>> m1 = skchem.Mol.from_smiles('ClC(Cl)Cl', name='chloroform')
>>> hal_f.transform(m1)
True
Molecules with none of the atoms transform to False.
>>> m2 = skchem.Mol.from_smiles('CC', name='ethane')
>>> hal_f.transform(m2)
False
Can see the atom breakdown by passing agg == False:
>>> hal_f.transform(m1, agg=False)
has_element
F     0
Cl    3
Br    0
I     0
Name: ElementFilter, dtype: int64
Can transform series.
>>> ms = [m1, m2]
>>> hal_f.transform(ms)
chloroform True
ethane False
dtype: bool
>>> hal_f.transform(ms, agg=False)
has_element F Cl Br I
chloroform 0 3 0 0
ethane 0 0 0 0
Can also filter series:
>>> hal_f.filter(ms)
chloroform <Mol: ClC(Cl)Cl>
Name: structure, dtype: object
>>> hal_f.filter(ms, neg=True)
ethane <Mol: CC>
Name: structure, dtype: object
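The per-element breakdown and its aggregation can be sketched with the standard library. Molecules are represented here as plain lists of element symbols, and `element_counts` is a hypothetical helper, not part of scikit-chem.

```python
from collections import Counter


def element_counts(symbols, elements):
    """Count how often each element of interest occurs in a molecule."""
    counts = Counter(symbols)
    return {element: counts[element] for element in elements}


halogens = ["F", "Cl", "Br", "I"]
chloroform = ["Cl", "C", "Cl", "Cl"]  # heavy atoms of ClC(Cl)Cl
ethane = ["C", "C"]

chloroform_counts = element_counts(chloroform, halogens)
# Aggregating with `any` reduces the breakdown to a single boolean.
chloroform_has_halogen = any(chloroform_counts.values())
ethane_has_halogen = any(element_counts(ethane, halogens).values())
print(chloroform_counts, chloroform_has_halogen, ethane_has_halogen)
```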
columns
¶elements
¶skchem.filters.simple.
MassFilter
(above=3, below=900, n_jobs=1, verbose=True)[source]¶Bases: skchem.filters.base.Filter
Filter whether the molecular weight of a molecule falls in a defined interval.
above <= mass < below
Examples
>>> import skchem
>>> data = [
... skchem.Mol.from_smiles('CC', name='ethane'),
... skchem.Mol.from_smiles('CCCC', name='butane'),
... skchem.Mol.from_smiles('NC(C)C(=O)O', name='alanine'),
... skchem.Mol.from_smiles('C12C=CC(C=C2)C=C1', name='barrelene')
... ]
>>> mf = skchem.filters.MassFilter(above=31, below=100)
>>> mf.transform(data)
ethane False
butane True
alanine True
barrelene False
Name: mass_in_range, dtype: bool
>>> mf.filter(data)
butane <Mol: CCCC>
alanine <Mol: CC(N)C(=O)O>
Name: structure, dtype: object
columns
¶skchem.filters.simple.
OrganicFilter
(n_jobs=1, verbose=True)[source]¶Bases: skchem.filters.simple.ElementFilter
Whether a molecule is organic.
For the purpose of this function, an organic molecule is defined as having atoms with elements only in the set H, B, C, N, O, F, P, S, Cl, Br, I.
Examples
Basic usage as a function on molecules:
>>> import skchem
>>> of = skchem.filters.OrganicFilter()
>>> benzene = skchem.Mol.from_smiles('c1ccccc1', name='benzene')
>>> of.transform(benzene)
True
>>> ferrocene = skchem.Mol.from_smiles('[cH-]1cccc1.[cH-]1cccc1.[Fe+2]',
... name='ferrocene')
>>> of.transform(ferrocene)
False
More useful on collections:
>>> sa = skchem.Mol.from_smiles('CC(=O)[O-].[Na+]', name='sodium acetate')
>>> norbornane = skchem.Mol.from_smiles('C12CCC(C2)CC1', name='norbornane')
>>> data = [benzene, ferrocene, norbornane, sa]
>>> of.transform(data)
benzene True
ferrocene False
norbornane True
sodium acetate False
dtype: bool
>>> of.filter(data)
benzene <Mol: c1ccccc1>
norbornane <Mol: C1CC2CCC1C2>
Name: structure, dtype: object
>>> of.filter(data, neg=True)
ferrocene <Mol: [Fe+2].c1cc[cH-]c1.c1cc[cH-]c1>
sodium acetate <Mol: CC(=O)[O-].[Na+]>
Name: structure, dtype: object
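The definition of "organic" given in the description amounts to a subset test on element symbols. A minimal sketch follows, representing molecules as lists of symbols; `is_organic` is a hypothetical helper, not the scikit-chem implementation.

```python
# The element set stated in the OrganicFilter description.
ORGANIC_ELEMENTS = frozenset(
    ["H", "B", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I"]
)


def is_organic(symbols):
    """True when every atom's element is in the organic set."""
    return set(symbols) <= ORGANIC_ELEMENTS


benzene = ["C"] * 6
ferrocene = ["C"] * 10 + ["Fe"]
sodium_acetate = ["C", "C", "O", "O", "Na"]

organic_flags = [is_organic(m) for m in (benzene, ferrocene, sodium_acetate)]
print(organic_flags)
```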
skchem.filters.simple.
mass
(mol, above=10, below=900)[source]¶Whether the molecular weight of a molecule falls in a defined interval.
above <= mass < below
Parameters: |
|
---|---|
Returns: | Whether the mass of the molecule falls in the interval. |
Return type: | bool |
Examples
Basic usage as a function on molecules:
>>> import skchem
>>> m = skchem.Mol.from_smiles('c1ccccc1') # benzene has M_r = 78.
>>> skchem.filters.mass(m, above=70)
True
>>> skchem.filters.mass(m, above=80)
False
>>> skchem.filters.mass(m, below=80)
True
>>> skchem.filters.mass(m, below=70)
False
>>> skchem.filters.mass(m, above=70, below=80)
True
skchem.filters.simple.
n_atoms
(mol, above=2, below=75, include_hydrogens=False)[source]¶Whether the number of atoms in a molecule falls in a defined interval.
above <= n_atoms < below
Parameters: |
|
---|---|
Returns: | Whether the number of atoms in the molecule falls in the interval. |
Return type: | bool |
Examples
Basic usage as a function on molecules:
>>> import skchem
>>> m = skchem.Mol.from_smiles('c1ccccc1') # benzene has 6 atoms.
Lower threshold:
>>> skchem.filters.n_atoms(m, above=3)
True
>>> skchem.filters.n_atoms(m, above=8)
False
Higher threshold:
>>> skchem.filters.n_atoms(m, below=8)
True
>>> skchem.filters.n_atoms(m, below=3)
False
Bounds work like Python slices - inclusive lower, exclusive upper:
>>> skchem.filters.n_atoms(m, above=6)
True
>>> skchem.filters.n_atoms(m, below=6)
False
Both can be used at once:
>>> skchem.filters.n_atoms(m, above=3, below=8)
True
Can include hydrogens:
>>> skchem.filters.n_atoms(m, above=3, below=8, include_hydrogens=True)
False
>>> skchem.filters.n_atoms(m, above=9, below=14, include_hydrogens=True)
True
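The bound semantics shown in these examples (inclusive lower bound, exclusive upper bound) reduce to a chained comparison. The helper below is a hypothetical sketch that takes the atom count directly rather than a molecule.

```python
def in_half_open_interval(value, above, below):
    """True when above <= value < below."""
    return above <= value < below


n_heavy_atoms = 6   # benzene without hydrogens
n_all_atoms = 12    # benzene with hydrogens

at_lower_bound = in_half_open_interval(n_heavy_atoms, above=6, below=75)
at_upper_bound = in_half_open_interval(n_heavy_atoms, above=2, below=6)
with_hydrogens = in_half_open_interval(n_all_atoms, above=9, below=14)
print(at_lower_bound, at_upper_bound, with_hydrogens)
```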
# skchem.filters.smarts
Module defines SMARTS filters.
skchem.filters.smarts.
PAINSFilter
(n_jobs=1, verbose=True)[source]¶Bases: skchem.filters.smarts.SMARTSFilter
Whether a molecule passes the Pan-Assay Interference Compounds (PAINS) filters.
These are supplied with RDKit, and were originally proposed by Baell et al.
_pains
¶pd.Series – a series of smarts template molecules.
References
[The original paper](http://dx.doi.org/10.1021/jm901137j)
Examples
Basic usage as a function on molecules:
>>> import skchem
>>> benzene = skchem.Mol.from_smiles('c1ccccc1', name='benzene')
>>> pf = skchem.filters.PAINSFilter()
>>> pf.transform(benzene)
True
>>> catechol = skchem.Mol.from_smiles('Oc1c(O)cccc1', name='catechol')
>>> pf.transform(catechol)
False
>>> res = pf.transform(catechol, agg=False)
>>> res[res]
names
catechol_A(92) True
Name: PAINSFilter, dtype: bool
More useful in combination with pandas DataFrames:
>>> data = [benzene, catechol]
>>> pf.transform(data)
benzene True
catechol False
dtype: bool
>>> pf.filter(data)
benzene <Mol: c1ccccc1>
Name: structure, dtype: object
skchem.filters.smarts.
SMARTSFilter
(smarts, agg='any', merge_hs=True, n_jobs=1, verbose=True)[source]¶Bases: skchem.filters.base.Filter
Filter a molecule based on smarts.
Examples
>>> import skchem
>>> data = [
... skchem.Mol.from_smiles('CC', name='ethane'),
... skchem.Mol.from_smiles('c1ccccc1', name='benzene'),
... skchem.Mol.from_smiles('c1ccccc1-c2c(C=O)ccnc2', name='bg')
... ]
>>> f = skchem.filters.SMARTSFilter({'benzene': 'c1ccccc1',
... 'pyridine': 'c1ccccn1',
... 'acetyl': 'C=O'})
>>> f.transform(data, agg=False)
acetyl benzene pyridine
ethane False False False
benzene False True False
bg True True True
>>> f.transform(data)
ethane False
benzene True
bg True
dtype: bool
>>> f.filter(data)
benzene <Mol: c1ccccc1>
bg <Mol: O=Cc1ccncc1-c1ccccc1>
Name: structure, dtype: object
>>> f.agg = all
>>> f.filter(data)
bg <Mol: O=Cc1ccncc1-c1ccccc1>
Name: structure, dtype: object
columns
¶# skchem.filters.stereo
Stereo filters for scikit-chem.
skchem.filters.stereo.
ChiralFilter
(check_meso=True, n_jobs=1, verbose=True)[source]¶Bases: skchem.filters.base.Filter
Filter chiral compounds.
Examples
>>> import skchem
>>> cf = skchem.filters.ChiralFilter()
>>> ms = [
... skchem.Mol.from_smiles('F[C@@H](F)[C@H](F)F', name='achiral'),
... skchem.Mol.from_smiles('F[C@@H](Br)[C@H](Br)F', name='chiral'),
... skchem.Mol.from_smiles('F[C@H](Br)[C@H](Br)F', name='meso'),
... skchem.Mol.from_smiles('FC(Br)C(Br)F', name='racemic')
... ]
>>> cf.transform(ms)
achiral False
chiral True
meso False
racemic False
Name: is_chiral, dtype: bool
columns
¶is_meso
(mol)[source]¶Determines whether the molecule is meso.
Meso compounds have chiral centres, but have a mirror plane allowing superposition.
Examples
>>> import skchem
>>> cf = skchem.filters.ChiralFilter()
>>> meso = skchem.Mol.from_smiles('F[C@H](Br)[C@H](Br)F')
>>> cf.is_meso(meso)
True
>>> non_meso = skchem.Mol.from_smiles('F[C@H](Br)[C@@H](Br)F')
>>> cf.is_meso(non_meso)
False
# skchem.filters
Molecule filters for scikit-chem.
skchem.filters.
ChiralFilter
(check_meso=True, n_jobs=1, verbose=True)[source]¶Bases: skchem.filters.base.Filter
Filter chiral compounds.
Examples
>>> import skchem
>>> cf = skchem.filters.ChiralFilter()
>>> ms = [
... skchem.Mol.from_smiles('F[C@@H](F)[C@H](F)F', name='achiral'),
... skchem.Mol.from_smiles('F[C@@H](Br)[C@H](Br)F', name='chiral'),
... skchem.Mol.from_smiles('F[C@H](Br)[C@H](Br)F', name='meso'),
... skchem.Mol.from_smiles('FC(Br)C(Br)F', name='racemic')
... ]
>>> cf.transform(ms)
achiral False
chiral True
meso False
racemic False
Name: is_chiral, dtype: bool
columns
is_meso
(mol)[source]¶Determines whether the molecule is meso.
Meso compounds have chiral centres, but have a mirror plane allowing superposition.
Examples
>>> import skchem
>>> cf = skchem.filters.ChiralFilter()
>>> meso = skchem.Mol.from_smiles('F[C@H](Br)[C@H](Br)F')
>>> cf.is_meso(meso)
True
>>> non_meso = skchem.Mol.from_smiles('F[C@H](Br)[C@@H](Br)F')
>>> cf.is_meso(non_meso)
False
skchem.filters.
SMARTSFilter
(smarts, agg='any', merge_hs=True, n_jobs=1, verbose=True)[source]¶Bases: skchem.filters.base.Filter
Filter a molecule based on smarts.
Examples
>>> import skchem
>>> data = [
... skchem.Mol.from_smiles('CC', name='ethane'),
... skchem.Mol.from_smiles('c1ccccc1', name='benzene'),
... skchem.Mol.from_smiles('c1ccccc1-c2c(C=O)ccnc2', name='bg')
... ]
>>> f = skchem.filters.SMARTSFilter({'benzene': 'c1ccccc1',
... 'pyridine': 'c1ccccn1',
... 'acetyl': 'C=O'})
>>> f.transform(data, agg=False)
acetyl benzene pyridine
ethane False False False
benzene False True False
bg True True True
>>> f.transform(data)
ethane False
benzene True
bg True
dtype: bool
>>> f.filter(data)
benzene <Mol: c1ccccc1>
bg <Mol: O=Cc1ccncc1-c1ccccc1>
Name: structure, dtype: object
>>> f.agg = all
>>> f.filter(data)
bg <Mol: O=Cc1ccncc1-c1ccccc1>
Name: structure, dtype: object
columns
skchem.filters.
PAINSFilter
(n_jobs=1, verbose=True)[source]¶Bases: skchem.filters.smarts.SMARTSFilter
Whether a molecule passes the Pan Assay INterference (PAINS) filters.
These are supplied with RDKit, and were originally proposed by Baell et al.
_pains
pd.Series – a series of smarts template molecules.
References
[The original paper](http://dx.doi.org/10.1021/jm901137j)
Examples
Basic usage as a function on molecules:
>>> import skchem
>>> benzene = skchem.Mol.from_smiles('c1ccccc1', name='benzene')
>>> pf = skchem.filters.PAINSFilter()
>>> pf.transform(benzene)
True
>>> catechol = skchem.Mol.from_smiles('Oc1c(O)cccc1', name='catechol')
>>> pf.transform(catechol)
False
>>> res = pf.transform(catechol, agg=False)
>>> res[res]
names
catechol_A(92) True
Name: PAINSFilter, dtype: bool
More useful in combination with pandas DataFrames:
>>> data = [benzene, catechol]
>>> pf.transform(data)
benzene True
catechol False
dtype: bool
>>> pf.filter(data)
benzene <Mol: c1ccccc1>
Name: structure, dtype: object
skchem.filters.
ElementFilter
(elements=None, as_bits=False, agg='any', n_jobs=1, verbose=True)[source]¶Bases: skchem.filters.base.Filter
Filter by elements.
Examples
Basic usage on molecules:
>>> import skchem
>>> hal_f = skchem.filters.ElementFilter(['F', 'Cl', 'Br', 'I'])
Molecules with one of the atoms transform to True.
>>> m1 = skchem.Mol.from_smiles('ClC(Cl)Cl', name='chloroform')
>>> hal_f.transform(m1)
True
Molecules with none of the atoms transform to False.
>>> m2 = skchem.Mol.from_smiles('CC', name='ethane')
>>> hal_f.transform(m2)
False
Can see the atom breakdown by passing agg=False:
>>> hal_f.transform(m1, agg=False)
has_element
F 0
Cl 3
Br 0
I 0
Name: ElementFilter, dtype: int64
Can transform series.
>>> ms = [m1, m2]
>>> hal_f.transform(ms)
chloroform True
ethane False
dtype: bool
>>> hal_f.transform(ms, agg=False)
has_element F Cl Br I
chloroform 0 3 0 0
ethane 0 0 0 0
Can also filter series:
>>> hal_f.filter(ms)
chloroform <Mol: ClC(Cl)Cl>
Name: structure, dtype: object
>>> hal_f.filter(ms, neg=True)
ethane <Mol: CC>
Name: structure, dtype: object
columns
elements
skchem.filters.
OrganicFilter
(n_jobs=1, verbose=True)[source]¶Bases: skchem.filters.simple.ElementFilter
Whether a molecule is organic.
For the purpose of this function, an organic molecule is defined as having atoms with elements only in the set H, B, C, N, O, F, P, S, Cl, Br, I.
Examples
Basic usage as a function on molecules:
>>> import skchem
>>> of = skchem.filters.OrganicFilter()
>>> benzene = skchem.Mol.from_smiles('c1ccccc1', name='benzene')
>>> of.transform(benzene)
True
>>> ferrocene = skchem.Mol.from_smiles('[cH-]1cccc1.[cH-]1cccc1.[Fe+2]',
... name='ferrocene')
>>> of.transform(ferrocene)
False
More useful on collections:
>>> sa = skchem.Mol.from_smiles('CC(=O)[O-].[Na+]', name='sodium acetate')
>>> norbornane = skchem.Mol.from_smiles('C12CCC(C2)CC1', name='norbornane')
>>> data = [benzene, ferrocene, norbornane, sa]
>>> of.transform(data)
benzene True
ferrocene False
norbornane True
sodium acetate False
dtype: bool
>>> of.filter(data)
benzene <Mol: c1ccccc1>
norbornane <Mol: C1CC2CCC1C2>
Name: structure, dtype: object
>>> of.filter(data, neg=True)
ferrocene <Mol: [Fe+2].c1cc[cH-]c1.c1cc[cH-]c1>
sodium acetate <Mol: CC(=O)[O-].[Na+]>
Name: structure, dtype: object
skchem.filters.
AtomNumberFilter
(above=3, below=60, include_hydrogens=False, n_jobs=1, verbose=True)[source]¶Bases: skchem.filters.base.Filter
Filter whether the number of atoms in a Mol falls in a defined interval.
above <= n_atoms < below
Examples
>>> import skchem
>>> data = [
... skchem.Mol.from_smiles('CC', name='ethane'),
... skchem.Mol.from_smiles('CCCC', name='butane'),
... skchem.Mol.from_smiles('NC(C)C(=O)O', name='alanine'),
... skchem.Mol.from_smiles('C12C=CC(C=C2)C=C1', name='barrelene')
... ]
>>> af = skchem.filters.AtomNumberFilter(above=3, below=7)
>>> af.transform(data)
ethane False
butane True
alanine True
barrelene False
Name: num_atoms_in_range, dtype: bool
>>> af.filter(data)
butane <Mol: CCCC>
alanine <Mol: CC(N)C(=O)O>
Name: structure, dtype: object
>>> af = skchem.filters.AtomNumberFilter(above=5, below=15, include_hydrogens=True)
>>> af.transform(data)
ethane True
butane True
alanine True
barrelene False
Name: num_atoms_in_range, dtype: bool
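The half-open interval above <= n_atoms < below can be checked by hand. The sketch below assumes nothing about the library internals: in_half_open_range is a hypothetical helper, and the atom counts are worked out from the molecular formulae of the example molecules.

```python
def in_half_open_range(value, above, below):
    """Return True when above <= value < below (lower bound inclusive)."""
    return above <= value < below


# Heavy-atom counts (hydrogens excluded) for the example molecules.
heavy_atoms = {"ethane": 2, "butane": 4, "alanine": 6, "barrelene": 8}
# Total atom counts including hydrogens: C2H6, C4H10, C3H7NO2, C8H8.
all_atoms = {"ethane": 8, "butane": 14, "alanine": 13, "barrelene": 16}

heavy_result = {
    name: in_half_open_range(n, above=3, below=7) for name, n in heavy_atoms.items()
}
all_result = {
    name: in_half_open_range(n, above=5, below=15) for name, n in all_atoms.items()
}
print(heavy_result)
print(all_result)
```

Both results agree with the transform outputs in the example: the lower bound is inclusive and the upper bound is exclusive.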
columns
skchem.filters.
MassFilter
(above=3, below=900, n_jobs=1, verbose=True)[source]¶Bases: skchem.filters.base.Filter
Filter whether the molecular weight of a molecule falls within a defined range.
above <= mass < below
Examples
>>> import skchem
>>> data = [
... skchem.Mol.from_smiles('CC', name='ethane'),
... skchem.Mol.from_smiles('CCCC', name='butane'),
... skchem.Mol.from_smiles('NC(C)C(=O)O', name='alanine'),
... skchem.Mol.from_smiles('C12C=CC(C=C2)C=C1', name='barrelene')
... ]
>>> mf = skchem.filters.MassFilter(above=31, below=100)
>>> mf.transform(data)
ethane False
butane True
alanine True
barrelene False
Name: mass_in_range, dtype: bool
>>> mf.filter(data)
butane <Mol: CCCC>
alanine <Mol: CC(N)C(=O)O>
Name: structure, dtype: object
columns
skchem.filters.
Filter
(func=None, agg='any', n_jobs=1, verbose=True)[source]¶Bases: skchem.filters.base.BaseFilter
, skchem.base.Transformer
Filter base class.
Examples
>>> import skchem
Initialize the filter with a function:
>>> is_named = skchem.filters.Filter(lambda m: m.name is not None)
Filter results can be found with transform:
>>> ethane = skchem.Mol.from_smiles('CC', name='ethane')
>>> is_named.transform(ethane)
True
>>> anonymous = skchem.Mol.from_smiles('c1ccccc1')
>>> is_named.transform(anonymous)
False
Can take a series or dataframe:
>>> import pandas as pd
>>> mols = pd.Series({'anonymous': anonymous, 'ethane': ethane})
>>> is_named.transform(mols)
anonymous False
ethane True
Name: Filter, dtype: bool
Using filter will drop out molecules that fail the test:
>>> is_named.filter(mols)
ethane <Mol: CC>
dtype: object
Only failed are retained with the neg keyword argument:
>>> is_named.filter(mols, neg=True)
anonymous <Mol: c1ccccc1>
dtype: object
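The relationship between transform, filter and the neg keyword can be sketched in plain Python. This is an illustrative stand-in under stated assumptions, not the skchem class: MiniFilter is a hypothetical name, molecules are replaced by a dict mapping a label to an optional name, and no RDKit objects are involved.

```python
class MiniFilter:
    """Hypothetical predicate-based filter (a stand-in, not skchem.filters.Filter)."""

    def __init__(self, func):
        self.func = func

    def transform(self, items):
        """Return a {label: bool} mapping of predicate results."""
        return {label: bool(self.func(item)) for label, item in items.items()}

    def filter(self, items, neg=False):
        """Keep items that pass the predicate, or those that fail it when neg=True."""
        results = self.transform(items)
        return {label: item for label, item in items.items() if results[label] != neg}


# Stand-in data: each value plays the role of a molecule's name attribute.
names = {"anonymous": None, "ethane": "ethane"}
is_named = MiniFilter(lambda name: name is not None)

flags = is_named.transform(names)
kept = is_named.filter(names)
dropped = is_named.filter(names, neg=True)
print(flags, kept, dropped)
```

The design point is that filter is just transform followed by a boolean selection, and neg=True inverts that selection.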
## skchem.forcefields.base
Module specifying base class for forcefields.
skchem.forcefields.base.
ForceField
(preembed=True, warn_on_fail=True, error_on_fail=False, add_hs=True, n_jobs=1, verbose=True)[source]¶Bases: skchem.base.Transformer
, skchem.filters.base.TransformFilter
Base forcefield class.
Filter drops those that fail to be optimized.
columns
## skchem.forcefields.mmff
Module specifying the Merck Molecular Force Field.
skchem.forcefields.mmff.
MMFF
(preembed=True, warn_on_fail=True, error_on_fail=False, add_hs=True, n_jobs=1, verbose=True)[source]¶Bases: skchem.forcefields.base.ForceField
Merck Molecular Force Field transformer.
## skchem.forcefields.uff
Module specifying the universal force field.
skchem.forcefields.uff.
UFF
(preembed=True, warn_on_fail=True, error_on_fail=False, add_hs=True, n_jobs=1, verbose=True)[source]¶Bases: skchem.forcefields.base.ForceField
Universal Force Field transformer.
## skchem.forcefields
Module specifying forcefields.
skchem.forcefields.
MMFF
(preembed=True, warn_on_fail=True, error_on_fail=False, add_hs=True, n_jobs=1, verbose=True)[source]¶Bases: skchem.forcefields.base.ForceField
Merck Molecular Force Field transformer.
skchem.forcefields.
UFF
(preembed=True, warn_on_fail=True, error_on_fail=False, add_hs=True, n_jobs=1, verbose=True)[source]¶Bases: skchem.forcefields.base.ForceField
Universal Force Field transformer.
# skchem.io.sdf
Defining input and output operations for sdf files.
skchem.io.sdf.
read_sdf
(sdf, error_bad_mol=False, warn_bad_mol=True, nmols=None, skipmols=None, skipfooter=None, read_props=True, mol_props=False, *args, **kwargs)[source]¶Read an sdf file into a pd.DataFrame.
The function wraps the RDKit ForwardSDMolSupplier object.
Parameters: |
|
---|---|
Returns: | The loaded data frame, with Mols supplied in the structure field. |
Return type: | pandas.DataFrame |
See also
rdkit.Chem.ForwardSDMolSupplier skchem.read_smiles
skchem.io.sdf.
write_sdf
(data, sdf, write_cols=True, index_as_name=True, mol_props=False, *args, **kwargs)[source]¶Write an sdf file from a dataframe.
Parameters: |
|
---|
# skchem.io.smiles
Defining input and output operations for smiles files.
skchem.io.smiles.
read_smiles
(smiles_file, smiles_column=0, name_column=None, delimiter='\t', title_line=False, error_bad_mol=False, warn_bad_mol=True, drop_bad_mol=True, *args, **kwargs)[source]¶Read a smiles file into a pandas dataframe.
The function wraps the pandas read_csv function.
Returns: | The loaded data frame, with Mols supplied in the structure field. |
---|---|
Return type: | pandas.DataFrame |
See also
pandas.read_csv skchem.Mol.from_smiles skchem.io.sdf
skchem.io
Module defining input and output methods in scikit-chem.
skchem.io.
read_sdf
(sdf, error_bad_mol=False, warn_bad_mol=True, nmols=None, skipmols=None, skipfooter=None, read_props=True, mol_props=False, *args, **kwargs)[source]¶Read an sdf file into a pd.DataFrame.
The function wraps the RDKit ForwardSDMolSupplier object.
Parameters: |
|
---|---|
Returns: | The loaded data frame, with Mols supplied in the structure field. |
Return type: | pandas.DataFrame |
See also
rdkit.Chem.ForwardSDMolSupplier skchem.read_smiles
skchem.io.
write_sdf
(data, sdf, write_cols=True, index_as_name=True, mol_props=False, *args, **kwargs)[source]¶Write an sdf file from a dataframe.
Parameters: |
|
---|
skchem.io.
read_smiles
(smiles_file, smiles_column=0, name_column=None, delimiter='\t', title_line=False, error_bad_mol=False, warn_bad_mol=True, drop_bad_mol=True, *args, **kwargs)[source]¶Read a smiles file into a pandas dataframe.
The function wraps the pandas read_csv function.
Returns: | The loaded data frame, with Mols supplied in the structure field. |
---|---|
Return type: | pandas.DataFrame |
See also
pandas.read_csv skchem.Mol.from_smiles skchem.io.sdf
skchem.io.
write_smiles
(data, smiles_path)[source]¶Write a dataframe to a smiles file.
Parameters: |
|
---|
skchem.io.
read_config
(conf)[source]¶Deserialize an object from a config dict.
Parameters: | conf (dict) – The config dict to deserialize. |
---|---|
Returns: | object |
Note
config is different from params, in that it specifies the class. The params dict is a subdict in config.
skchem.io.
read_yaml
(conf)[source]¶Deserialize an object from a yaml file, filename or str.
Parameters: | yaml (str or filelike) – The yaml file to deserialize. |
---|---|
Returns: | object |
# skchem.pandas_ext.structure_methods
Tools for adding a default attribute to pandas objects.
skchem.pandas_ext.structure_methods.
StructureAccessorMixin
[source]¶Bases: object
Mixin to bind chemical methods to objects.
mol
alias of StructureMethods
# skchem.pipeline.pipeline
Module implementing pipelines.
skchem.pipeline.pipeline.
Pipeline
(objects)[source]¶Bases: object
Pipeline object. Applies filters and transformers in sequence.
to_json
(target=None)[source]¶Serialize the object as JSON.
Parameters: |
|
---|
# skchem.pipeline
Package implementing pipelines.
skchem.pipeline.
Pipeline
(objects)[source]¶Bases: object
Pipeline object. Applies filters and transformers in sequence.
to_json
(target=None)[source]¶Serialize the object as JSON.
Parameters: |
|
---|
## skchem.standardizers.chemaxon
Module wrapping ChemAxon Standardizer. Must have standardizer installed and license activated.
skchem.standardizers.chemaxon.
ChemAxonStandardizer
(config_path=None, keep_failed=False, **kwargs)[source]¶Bases: skchem.base.CLIWrapper
, skchem.base.BatchTransformer
, skchem.base.Transformer
, skchem.filters.base.TransformFilter
ChemAxon Standardizer Wrapper.
Parameters: | config_path (str) – The path of the config_file. If None, use the default one. |
---|
Notes
ChemAxon Standardizer must be installed and accessible as standardize from the shell launching the program.
Warning
Must use a unique index (see #31).
Examples
>>> import skchem
>>> std = skchem.standardizers.ChemAxonStandardizer()
>>> m = skchem.Mol.from_smiles('CC.CCC')
>>> print(std.transform(m))
<Mol: CCC>
>>> data = [m, skchem.Mol.from_smiles('C=CO'), skchem.Mol.from_smiles('C[O-]')]
>>> std.transform(data)
0 <Mol: CCC>
1 <Mol: CC=O>
2 <Mol: CO>
Name: structure, dtype: object
>>> will_fail = mol = '''932-97-8
... RDKit 3D
...
... 9 9 0 0 0 0 0 0 0 0999 V2000
... -0.9646 0.0000 0.0032 C 0 0 0 0 0 0 0 0 0 0 0 0
... -0.2894 -1.2163 0.0020 C 0 0 0 0 0 0 0 0 0 0 0 0
... -0.2894 1.2163 0.0025 C 0 0 0 0 0 0 0 0 0 0 0 0
... -2.2146 0.0000 -0.0004 N 0 0 0 0 0 0 0 0 0 0 0 0
... 1.0710 -1.2610 0.0002 C 0 0 0 0 0 0 0 0 0 0 0 0
... 1.0710 1.2610 0.0007 C 0 0 0 0 0 0 0 0 0 0 0 0
... -3.3386 0.0000 -0.0037 N 0 0 0 0 0 0 0 0 0 0 0 0
... 1.8248 0.0000 -0.0005 C 0 0 0 0 0 0 0 0 0 0 0 0
... 3.0435 0.0000 -0.0026 O 0 0 0 0 0 0 0 0 0 0 0 0
... 1 2 1 0
... 1 3 1 0
... 1 4 2 3
... 2 5 2 0
... 3 6 2 0
... 4 7 2 0
... 5 8 1 0
... 8 9 2 0
... 6 8 1 0
... M CHG 2 4 1 7 -1
... M END
... '''
>>> will_fail = skchem.Mol.from_molblock(will_fail)
>>> std.transform(will_fail)
nan
>>> data = [will_fail] + data
>>> std.transform(data)
0 None
1 <Mol: CCC>
2 <Mol: CC=O>
3 <Mol: CO>
Name: structure, dtype: object
>>> std.transform_filter(data)
1 <Mol: CCC>
2 <Mol: CC=O>
3 <Mol: CO>
Name: structure, dtype: object
>>> std.keep_failed = True
>>> std.transform(data)
0 <Mol: [N-]=[N+]=C1C=CC(=O)C=C1>
1 <Mol: CCC>
2 <Mol: CC=O>
3 <Mol: CO>
Name: structure, dtype: object
DEFAULT_CONFIG
= '/home/docs/checkouts/readthedocs.org/user_builds/scikit-chem/checkouts/latest/skchem/standardizers/default_config.xml'
columns
install_hint
= ' Install ChemAxon from https://www.chemaxon.com. It requires a license,\n which can be freely obtained for academics. '
skchem.standardizers.
ChemAxonStandardizer
(config_path=None, keep_failed=False, **kwargs)[source]¶Bases: skchem.base.CLIWrapper
, skchem.base.BatchTransformer
, skchem.base.Transformer
, skchem.filters.base.TransformFilter
ChemAxon Standardizer Wrapper.
Parameters: | config_path (str) – The path of the config_file. If None, use the default one. |
---|
Notes
ChemAxon Standardizer must be installed and accessible as standardize from the shell launching the program.
Warning
Must use a unique index (see #31).
Examples
>>> import skchem
>>> std = skchem.standardizers.ChemAxonStandardizer()
>>> m = skchem.Mol.from_smiles('CC.CCC')
>>> print(std.transform(m))
<Mol: CCC>
>>> data = [m, skchem.Mol.from_smiles('C=CO'), skchem.Mol.from_smiles('C[O-]')]
>>> std.transform(data)
0 <Mol: CCC>
1 <Mol: CC=O>
2 <Mol: CO>
Name: structure, dtype: object
>>> will_fail = mol = '''932-97-8
... RDKit 3D
...
... 9 9 0 0 0 0 0 0 0 0999 V2000
... -0.9646 0.0000 0.0032 C 0 0 0 0 0 0 0 0 0 0 0 0
... -0.2894 -1.2163 0.0020 C 0 0 0 0 0 0 0 0 0 0 0 0
... -0.2894 1.2163 0.0025 C 0 0 0 0 0 0 0 0 0 0 0 0
... -2.2146 0.0000 -0.0004 N 0 0 0 0 0 0 0 0 0 0 0 0
... 1.0710 -1.2610 0.0002 C 0 0 0 0 0 0 0 0 0 0 0 0
... 1.0710 1.2610 0.0007 C 0 0 0 0 0 0 0 0 0 0 0 0
... -3.3386 0.0000 -0.0037 N 0 0 0 0 0 0 0 0 0 0 0 0
... 1.8248 0.0000 -0.0005 C 0 0 0 0 0 0 0 0 0 0 0 0
... 3.0435 0.0000 -0.0026 O 0 0 0 0 0 0 0 0 0 0 0 0
... 1 2 1 0
... 1 3 1 0
... 1 4 2 3
... 2 5 2 0
... 3 6 2 0
... 4 7 2 0
... 5 8 1 0
... 8 9 2 0
... 6 8 1 0
... M CHG 2 4 1 7 -1
... M END
... '''
>>> will_fail = skchem.Mol.from_molblock(will_fail)
>>> std.transform(will_fail)
nan
>>> data = [will_fail] + data
>>> std.transform(data)
0 None
1 <Mol: CCC>
2 <Mol: CC=O>
3 <Mol: CO>
Name: structure, dtype: object
>>> std.transform_filter(data)
1 <Mol: CCC>
2 <Mol: CC=O>
3 <Mol: CO>
Name: structure, dtype: object
>>> std.keep_failed = True
>>> std.transform(data)
0 <Mol: [N-]=[N+]=C1C=CC(=O)C=C1>
1 <Mol: CCC>
2 <Mol: CC=O>
3 <Mol: CO>
Name: structure, dtype: object
DEFAULT_CONFIG
= '/home/docs/checkouts/readthedocs.org/user_builds/scikit-chem/checkouts/latest/skchem/standardizers/default_config.xml'
columns
install_hint
= ' Install ChemAxon from https://www.chemaxon.com. It requires a license,\n which can be freely obtained for academics. '
## skchem.tests.test_cross_validation.test_similarity_threshold
Tests for similarity threshold dataset partitioning functionality.
Tests for data functions
Tests for sdf io functionality
skchem.test.test_io.test_sdf.
TestSDF
[source]¶Bases: object
Test class for sdf file parser
test_file_correct_structure
()[source]¶When opened with a file-like object, is the structure correct? Done by checking atom number (should be one, as rdkit ignores Hs by default).
Tests for smiles io functionality
skchem.utils.helpers
Module providing helper functions for scikit-chem
# skchem.utils.io
IO helper functions for skchem.
skchem.utils.io.
json_dump
(obj, target=None)[source]¶Write object as json to file or stream, or return as string.
skchem.utils.io.
line_count
(filename)[source]¶Quickly count the number of lines in a file.
Adapted from http://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python
Parameters: | filename (str) – The name of the file to count for. |
---|
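One common way to count lines cheaply is to read the file in binary chunks and count newline bytes. The sketch below illustrates that idea only; it is not necessarily how line_count is implemented, and count_lines and chunk_size are names assumed for this example.

```python
import os
import tempfile


def count_lines(filename, chunk_size=1 << 20):
    """Count newline bytes by reading the file in fixed-size binary chunks."""
    count = 0
    with open(filename, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
    return count


# Demonstration on a small temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("a\nb\nc\n")
    path = tmp.name
n_lines = count_lines(path)
os.remove(path)
print(n_lines)
```

Note that this approach counts newline characters, so a final line without a trailing newline is not included.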
skchem.utils.io.
sdf_count
(filename)[source]¶Efficiently count molecules in an sdf file.
Specifically, the function counts the number of times ‘$$$$’ occurs at the start of lines in the file.
Parameters: | filename (str) – The filename of the sdf file. |
---|---|
Returns: | the number of molecules in the file. |
Return type: | int |
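The counting rule described above (one record per line that starts with '$$$$') can be sketched in a few lines. This is an illustration of the rule, not the library source; count_sdf_records is a hypothetical name and the file contents are a toy stand-in rather than valid mol blocks.

```python
import os
import tempfile


def count_sdf_records(filename):
    """Count lines beginning with the SDF record terminator '$$$$'."""
    count = 0
    with open(filename, "rb") as handle:
        for line in handle:
            if line.startswith(b"$$$$"):
                count += 1
    return count


# Toy file with two records; a '$$$$' in the middle of a line is not counted.
content = "mol one\nM  END\n$$$$\nmol two\n> <note>\nnot a terminator $$$$\n\n$$$$\n"
with tempfile.NamedTemporaryFile("w", suffix=".sdf", delete=False) as tmp:
    tmp.write(content)
    path = tmp.name
n_records = count_sdf_records(path)
os.remove(path)
print(n_records)
```

Because no molecule is parsed, the count is fast, but it does not check that each record is valid.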
# skchem.utils.progress
Module implementing progress bars.
skchem.utils.suppress
Class for suppressing C extensions output.
skchem.utils.suppress.
Suppressor
[source]¶Bases: object
A context manager for doing a “deep suppression” of stdout and stderr.
It will suppress all print, even if the print originates in a compiled C/Fortran sub-function.
This will not suppress raised exceptions, since exceptions are printed to stderr just before a script exits, and after the context manager has exited (at least, I think that is why it lets exceptions through).
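A "deep suppression" of this kind is usually done at the file descriptor level, so that output from compiled code is silenced too. The following is a minimal sketch of that technique, assuming a POSIX-style process where descriptors 1 and 2 are stdout and stderr; DeepSuppress is a hypothetical name and this is not the Suppressor source.

```python
import os
import sys


class DeepSuppress:
    """Hypothetical context manager redirecting fds 1 and 2 to the null device."""

    def __enter__(self):
        sys.stdout.flush()
        sys.stderr.flush()
        # Open two handles on the null device and keep copies of the real fds.
        self.null_fds = [os.open(os.devnull, os.O_RDWR) for _ in range(2)]
        self.saved_fds = [os.dup(1), os.dup(2)]
        os.dup2(self.null_fds[0], 1)
        os.dup2(self.null_fds[1], 2)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        sys.stdout.flush()
        sys.stderr.flush()
        # Restore the original descriptors, then release everything we opened.
        os.dup2(self.saved_fds[0], 1)
        os.dup2(self.saved_fds[1], 2)
        for fd in self.null_fds + self.saved_fds:
            os.close(fd)
        return False  # do not swallow exceptions


with DeepSuppress():
    os.write(1, b"this write goes to the null device\n")
```

Returning False from __exit__ means exceptions raised inside the block still propagate to the caller.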
null_fds
= [4, 5]
skchem.utils
Module providing utility functions for scikit-chem
skchem.utils.
Suppressor
[source]¶Bases: object
A context manager for doing a “deep suppression” of stdout and stderr.
It will suppress all print, even if the print originates in a compiled C/Fortran sub-function.
This will not suppress raised exceptions, since exceptions are printed to stderr just before a script exits, and after the context manager has exited (at least, I think that is why it lets exceptions through).
null_fds
= [4, 5]
skchem.utils.
NamedProgressBar
(name=None, **kwargs)[source]¶Bases: progressbar.bar.ProgressBar
skchem.utils.
json_dump
(obj, target=None)[source]¶Write object as json to file or stream, or return as string.
skchem.utils.
yaml_dump
(obj, target=None)[source]¶Write object as yaml to file or stream, or return as string.
skchem.utils.
line_count
(filename)[source]¶Quickly count the number of lines in a file.
Adapted from http://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python
Parameters: | filename (str) – The name of the file to count for. |
---|
skchem.utils.
sdf_count
(filename)[source]¶Efficiently count molecules in an sdf file.
Specifically, the function counts the number of times ‘$$$$’ occurs at the start of lines in the file.
Parameters: | filename (str) – The filename of the sdf file. |
---|---|
Returns: | the number of molecules in the file. |
Return type: | int |
skchem.utils.
nanarray
(shape)[source]¶Produce an array of NaN in provided shape.
Parameters: | shape (tuple) – The shape of the nan array to produce. |
---|---|
Returns: | np.array |
## skchem.vis.atom
Module for atom contribution visualization.
skchem.vis.atom.
plot_weights
(mol, weights, quality=1, l=0.4, step=50, levels=20, contour_opacity=0.5, cmap='RdBu', ax=None, **kwargs)[source]¶Plot weights as a sum of gaussians across a structure image.
Parameters: |
|
---|---|
Returns: | The plot. |
Return type: | matplotlib.AxesSubplot |
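The idea of plotting weights "as a sum of gaussians" can be illustrated by evaluating such a field at a single point. The sketch below is an assumption-laden illustration, not the plot_weights implementation: it assumes 2D atom coordinates, treats the l parameter as a Gaussian length scale (named length_scale here), and uses unnormalised Gaussians.

```python
import math


def gaussian_field(point, centers, weights, length_scale=0.4):
    """Sum of weighted, unnormalised 2D Gaussians centred on each atom position."""
    x, y = point
    total = 0.0
    for (cx, cy), weight in zip(centers, weights):
        sq_dist = (x - cx) ** 2 + (y - cy) ** 2
        total += weight * math.exp(-sq_dist / (2 * length_scale ** 2))
    return total


# Two hypothetical atoms with opposite contributions.
centers = [(0.0, 0.0), (1.0, 0.0)]
weights = [1.0, -1.0]
at_first_atom = gaussian_field((0.0, 0.0), centers, weights)
at_midpoint = gaussian_field((0.5, 0.0), centers, weights)
print(at_first_atom, at_midpoint)
```

Evaluating this function over a grid would give the values that a contour plot could then draw over the structure image.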
## skchem.vis.mol
Module for drawing molecules.
skchem.vis.mol.
draw
(mol, quality=1, ax=None)[source]¶Draw a molecule on a matplotlib axis.
Parameters: |
|
---|---|
Returns: | A matplotlib AxesImage object with the molecule drawn. |
Return type: | plt.AxesImage |
skchem.vis.mol.
draw_3d
(m, conformer_id=-1, label_atoms=None)[source]¶Draw a molecule in three dimensions.
Parameters: |
|
---|---|
Returns: | plt.figure |
Note
This works great in the notebook with %matplotlib notebook.
## skchem.vis
Module for plotting images of molecules.
skchem.vis.
draw
(mol, quality=1, ax=None)[source]¶Draw a molecule on a matplotlib axis.
Parameters: |
|
---|---|
Returns: | A matplotlib AxesImage object with the molecule drawn. |
Return type: | plt.AxesImage |
skchem.vis.
draw_3d
(m, conformer_id=-1, label_atoms=None)[source]¶Draw a molecule in three dimensions.
Parameters: |
|
---|---|
Returns: | plt.figure |
Note
This works great in the notebook with %matplotlib notebook.
skchem.vis.
plot_weights
(mol, weights, quality=1, l=0.4, step=50, levels=20, contour_opacity=0.5, cmap='RdBu', ax=None, **kwargs)[source]¶Plot weights as a sum of gaussians across a structure image.
Parameters: |
|
---|---|
Returns: | The plot. |
Return type: | matplotlib.AxesSubplot |
# skchem.base
Base classes for scikit-chem objects.
skchem.base.
AtomTransformer
(max_atoms=100, **kwargs)[source]¶Bases: skchem.base.BaseTransformer
Transformer that will produce a Panel.
Concrete classes inheriting from this should implement _transform_atom, _transform_mol and minor_axis.
See also
Transformer
axes_names
tuple – The names of the axes.
minor_axis
pd.Index – Minor axis of transformed values.
skchem.base.
BaseTransformer
(n_jobs=1, verbose=True)[source]¶Bases: object
Transformer Base Class.
Specific Base Transformer classes inherit from this class and implement transform and axes_names.
axes_names
tuple – The names of the axes.
n_jobs
to_json
(target=None)[source]¶Serialize the object as JSON.
Parameters: |
|
---|
skchem.base.
BatchTransformer
(n_jobs=1, verbose=True)[source]¶Bases: skchem.base.BaseTransformer
Mixin for transformers for which transforming many molecules at once saves overhead.
Implement _transform_series with the transformation rather than _transform_mol. Must occur before Transformer or AtomTransformer in method resolution order.
See also
Transformer, AtomTransformer.
skchem.base.
CLIWrapper
(error_on_fail=False, warn_on_fail=True, **kwargs)[source]¶Bases: skchem.base.External
, skchem.base.BaseTransformer
CLI wrapper.
Concrete classes inheriting from this must implement _cli_args, monitor_progress, _parse_outfile, _parse_errors.
n_jobs
skchem.base.
External
(**kwargs)[source]¶Bases: object
Mixin for wrappers of external CLI tools.
Concrete classes must implement validate_install.
install_hint
str – an explanation of how to install external tool.
install_hint
= ''
validated
bool – whether the external tool is installed and active.
skchem.base.
Featurizer
[source]¶Bases: object
Base class for m -> data transforms, such as Fingerprinting etc.
Concrete subclasses should implement name, returning a string uniquely identifying the featurizer.
skchem.base.
Transformer
(n_jobs=1, verbose=True)[source]¶Bases: skchem.base.BaseTransformer
Molecular based Transformer Base class.
Concrete Transformers inherit from this class and must implement _transform_mol and _columns.
See also
AtomTransformer.
axes_names
tuple – The names of the axes.
columns
pd.Index – The column index to use.
skchem.metrics.
bedroc_score
(y_true, y_pred, decreasing=True, alpha=20.0)[source]¶BEDROC metric implemented according to Truchon and Bayly.
The Boltzmann Enhanced Discrimination of the Receiver Operating Characteristic (BEDROC) score is a modification of the Receiver Operating Characteristic (ROC) score that allows for a factor of early recognition.
References
The original paper by Truchon and Bayly is located at 10.1021/ci600426e.
Parameters: |
|
---|---|
Returns: | Value in interval [0, 1] indicating degree to which the predictive technique employed detects (early) the positive class. |
Return type: | float |
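For orientation, BEDROC is commonly written as a rescaled Robust Initial Enhancement (RIE): (RIE - RIE_min) / (RIE_max - RIE_min). The sketch below follows that form in pure Python. It is not the skchem source: bedroc_sketch is a hypothetical name, and it assumes that decreasing=True means higher y_pred values rank first.

```python
import math


def bedroc_sketch(y_true, y_pred, decreasing=True, alpha=20.0):
    """BEDROC as a rescaled RIE; y_true holds truthy values for actives."""
    order = sorted(range(len(y_pred)), key=lambda i: y_pred[i], reverse=decreasing)
    n_total = len(order)
    # 1-based ranks of the actives in the sorted list.
    ranks = [rank for rank, i in enumerate(order, start=1) if y_true[i]]
    n_active = len(ranks)
    if n_active == 0 or n_active == n_total:
        raise ValueError("need at least one active and one inactive")
    ratio = n_active / n_total

    # RIE: observed exponential-weighted sum relative to a random ranking.
    sum_exp = sum(math.exp(-alpha * rank / n_total) for rank in ranks)
    random_sum = (1 - math.exp(-alpha)) / (n_total * (math.exp(alpha / n_total) - 1))
    rie = sum_exp / n_active / random_sum

    # Bounds of RIE for the best and worst possible rankings.
    rie_max = (1 - math.exp(-alpha * ratio)) / (ratio * (1 - math.exp(-alpha)))
    rie_min = (1 - math.exp(alpha * ratio)) / (ratio * (1 - math.exp(alpha)))
    return (rie - rie_min) / (rie_max - rie_min)


labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
best = bedroc_sketch(labels, scores)
worst = bedroc_sketch(labels, scores, decreasing=False)
print(best, worst)
```

Larger alpha values concentrate the weight on the top of the ranked list, which is what gives the metric its early-recognition character.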
A cheminformatics library to integrate with the Scientific Python Stack
Development occurs on GitHub. We gladly accept pull requests!
To start developing features for the package, you will need the core runtime dependencies, shown in installing, in addition to the below:
Tests may be run locally through py.test. This can be invoked using either py.test or python setup.py test in the project root. Command line extensions are not tested by default - these can be tested also, by using the appropriate flag, such as python setup.py test --with-chemaxon.
Test coverage is assessed using coverage. This is run locally as part of the pytest command. It is set up to run as part of the CI, and can be viewed on Scrutinizer. Test coverage has suffered as features were rapidly developed in response to needs for the author’s PhD, and will be improved once the PhD is submitted!
scikit-chem conforms to PEP 8. PyLint is used to assess code quality locally, and can be run using pylint skchem from the root of the project. Scrutinizer is also set up to run as part of the CI. As with test coverage, code quality has slipped due to time demands, and will be fixed once the PhD is submitted!
This documentation is built using Sphinx and Bootstrap, with the Bootswatch Flatly theme. The documentation is hosted on GitHub Pages. To build the html documentation locally, run make html. To serve it, run make livehtml.
Warning
scikit-chem is currently in pre-alpha. The basic API may change between releases as we develop and optimise the library. Please read the what’s new page when updating to stay on top of changes.