Pipelining

scikit-chem extends the scikit-learn Pipeline object to support filtering. A pipeline is initialized with a list of Transformer objects, which are applied in the order given.

In [10]:
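import skchem

# Steps run in the order listed.
pipeline = skchem.pipeline.Pipeline([
    skchem.standardizers.ChemAxonStandardizer(keep_failed=True),  # standardize structures
    skchem.forcefields.UFF(),                                      # UFF force field step
    skchem.filters.OrganicFilter(),                                # filter on organic molecules
    skchem.descriptors.MorganFeaturizer(),                         # compute Morgan fingerprints
])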

The pipeline applies each step in turn to the objects passed in. For each step it calls the highest-priority method that the step implements, in the order transform_filter > filter > transform.

For example, our pipeline can transform sodium acetate all the way to fingerprints:

In [11]:
mol = skchem.Mol.from_smiles('CC(=O)[O-].[Na+]')
In [4]:
pipeline.transform_filter(mol)
Out[4]:
morgan_fp_idx
0       0
1       0
2       0
3       0
4       0
       ..
2043    0
2044    0
2045    0
2046    0
2047    0
Name: MorganFeaturizer, dtype: uint8
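
The priority order described above (transform_filter > filter > transform) can be illustrated with a small standalone sketch. This is not scikit-chem's implementation: SketchPipeline and the three step classes are hypothetical names invented for illustration, and plain strings stand in for molecules. It only shows the dispatch idea, assuming each step exposes one or more of the three methods.

```python
# Illustrative sketch only: every class here is hypothetical and is NOT
# part of scikit-chem. Plain strings stand in for molecules.


class StripAndDropEmpty:
    """Implements both `transform` and `transform_filter`."""

    def transform(self, items):
        # Change each object, keep all of them.
        return [s.strip() for s in items]

    def transform_filter(self, items):
        # Change each object, then drop the ones that ended up empty.
        return [s for s in self.transform(items) if s]


class ShortFilter:
    """Implements only `filter`: removes objects without changing them."""

    def filter(self, items):
        return [s for s in items if len(s) <= 6]


class UppercaseTransformer:
    """Implements only `transform`: changes objects without removing any."""

    def transform(self, items):
        return [s.upper() for s in items]


class SketchPipeline:
    # Highest priority first, mirroring the order stated in the text.
    PRIORITY = ("transform_filter", "filter", "transform")

    def __init__(self, steps):
        self.steps = list(steps)

    def transform_filter(self, items):
        for step in self.steps:
            for name in self.PRIORITY:
                method = getattr(step, name, None)
                if callable(method):
                    items = method(items)
                    break  # use only the highest-priority method found
            else:
                raise TypeError(f"{type(step).__name__} implements none of {self.PRIORITY}")
        return items


sketch = SketchPipeline([StripAndDropEmpty(), ShortFilter(), UppercaseTransformer()])
result = sketch.transform_filter([" ethane ", "   ", "cyclohexane", "serine"])
print(result)
```

In this sketch the first step implements both transform and transform_filter, so the pipeline picks transform_filter and the blank entry is dropped rather than passed along as an empty string. The remaining two steps each implement a single method, which is used as the fallback.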

It also works on collections of molecules:

In [12]:
mols = skchem.read_smiles('https://archive.org/download/scikit-chem_example_files/example.smi', name_column=1)
mols
Out[12]:
batch
ethane                          <Mol: CC>
propane                        <Mol: CCC>
benzene                   <Mol: c1ccccc1>
sodium acetate    <Mol: CC(=O)[O-].[Na+]>
serine                <Mol: NC(CO)C(=O)O>
Name: structure, dtype: object
In [16]:
pipeline.transform_filter(mols)
ChemAxonStandardizer: 100% (5 of 5) |##########################################| Elapsed Time: 0:00:04 Time: 0:00:04
UFF: 100% (5 of 5) |###########################################################| Elapsed Time: 0:00:00 Time: 0:00:00
OrganicFilter: 100% (5 of 5) |#################################################| Elapsed Time: 0:00:00 Time: 0:00:00
MorganFeaturizer: 100% (5 of 5) |##############################################| Elapsed Time: 0:00:00 Time: 0:00:00
Out[16]:
morgan_fp_idx   0  1  2  3  4  ...  2043  2044  2045  2046  2047
batch
ethane          0  0  0  0  0  ...     0     0     0     0     0
propane         0  0  0  0  0  ...     0     0     0     0     0
benzene         0  0  0  0  0  ...     0     0     0     0     0
sodium acetate  0  0  0  0  0  ...     0     0     0     0     0
serine          0  0  0  0  0  ...     0     0     1     0     0

5 rows × 2048 columns
