Molecules in scikit-chem

scikit-chem is first and foremost a wrapper around RDKit that makes it more Pythonic and more intuitive to users familiar with other libraries in the scientific Python stack. The package implements a core Mol class that represents a molecule. It is a direct subclass of the rdkit.Chem.Mol class:

In [1]:
import rdkit.Chem
import skchem
issubclass(skchem.Mol, rdkit.Chem.Mol)
Out[1]:
True

As such, it has all the methods of rdkit.Chem.Mol, for example:

In [2]:
hasattr(skchem.Mol, 'GetAromaticAtoms')
Out[2]:
True

Initializing new molecules

Constructors are provided as classmethods on the skchem.Mol class, in the same way that pandas objects are constructed. For example, to make a pandas.DataFrame from a dictionary, you call:

In [3]:
import pandas as pd
df = pd.DataFrame.from_dict({'a': [10, 20], 'b': [20, 40]}); df
Out[3]:
a b
0 10 20
1 20 40

Analogously, to make a skchem.Mol from a SMILES string, you call:

In [4]:
mol = skchem.Mol.from_smiles('CC(=O)Cl'); mol
Out[4]:
<Mol name="None" formula="C2H3ClO" at 0x11dc8f490>

The available methods are:

In [5]:
[method for method in skchem.Mol.__dict__ if method.startswith('from_')]
Out[5]:
['from_tplblock',
 'from_molblock',
 'from_molfile',
 'from_binary',
 'from_tplfile',
 'from_mol2block',
 'from_pdbfile',
 'from_pdbblock',
 'from_smiles',
 'from_smarts',
 'from_mol2file',
 'from_inchi']

When a molecule fails to parse, a ValueError is raised:

In [6]:
skchem.Mol.from_smiles('NOTSMILES')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-99e03ef822e7> in <module>()
----> 1 skchem.Mol.from_smiles('NOTSMILES')

/Users/rich/projects/scikit-chem/skchem/core/mol.py in constructor(_, in_arg, name, *args, **kwargs)
    419         m = getattr(rdkit.Chem, 'MolFrom' + constructor_name)(in_arg, *args, **kwargs)
    420         if m is None:
--> 421             raise ValueError('Failed to parse molecule, {}'.format(in_arg))
    422         m = Mol.from_super(m)
    423         m.name = name

ValueError: Failed to parse molecule, NOTSMILES
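Because a failed parse raises rather than returning None, parsing many inputs usually means catching the ValueError per item. The following is a minimal sketch of that pattern; the helper name parse_all is hypothetical and not part of scikit-chem, and it accepts any parser callable that raises ValueError on bad input.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def parse_all(parser: Callable[[str], T],
              inputs: Iterable[str]) -> Tuple[List[T], List[str]]:
    """Apply parser to each input, separating successes from failures.

    parse_all is a hypothetical helper for illustration only. Inputs for
    which the parser raises ValueError are collected rather than aborting
    the whole run.
    """
    parsed: List[T] = []
    failed: List[str] = []
    for text in inputs:
        try:
            parsed.append(parser(text))
        except ValueError:
            failed.append(text)
    return parsed, failed


# Intended use with scikit-chem (not executed here):
#   mols, bad = parse_all(skchem.Mol.from_smiles, ['CC(=O)Cl', 'NOTSMILES'])
```

With the constructor shown above, 'NOTSMILES' would be expected to end up in the list of failures while the valid SMILES string is parsed normally.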

Molecule accessors

Atoms and bonds are accessible as properties:

In [7]:
mol.atoms
Out[7]:
<AtomView values="['C', 'C', 'O', 'Cl']" at 0x11dc9ac88>
In [8]:
mol.bonds
Out[8]:
<BondView values="['C-C', 'C=O', 'C-Cl']" at 0x11dc9abe0>

These are iterable:

In [9]:
[a for a in mol.atoms]
Out[9]:
[<Atom element="C" at 0x11dcfe8a0>,
 <Atom element="C" at 0x11dcfe9e0>,
 <Atom element="O" at 0x11dcfed00>,
 <Atom element="Cl" at 0x11dcfedf0>]

subscriptable:

In [10]:
mol.atoms[3]
Out[10]:
<Atom element="Cl" at 0x11dcfef30>

sliceable:

In [11]:
mol.atoms[:3]
Out[11]:
[<Atom element="C" at 0x11dcfebc0>,
 <Atom element="C" at 0x11de690d0>,
 <Atom element="O" at 0x11de693f0>]

indexable:

In [19]:
mol.atoms[[1, 3]]
Out[19]:
[<Atom element="C" at 0x11de74760>, <Atom element="Cl" at 0x11de7fe40>]

and maskable:

In [18]:
mol.atoms[[True, False, True, False]]
Out[18]:
[<Atom element="C" at 0x11de74ad0>, <Atom element="O" at 0x11de74f30>]
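The four indexing styles above (integer, slice, list of indices, boolean mask) can all be dispatched from a single __getitem__. The sketch below is a toy illustration of that idea using plain strings in place of atoms; the class name ListView is hypothetical and this is not scikit-chem's actual implementation.

```python
from typing import Generic, Iterator, Sequence, TypeVar

T = TypeVar("T")


class ListView(Generic[T]):
    """Toy sequence view; a hypothetical stand-in for an atom or bond view."""

    def __init__(self, items: Sequence[T]) -> None:
        self._items = list(items)

    def __len__(self) -> int:
        return len(self._items)

    def __iter__(self) -> Iterator[T]:
        return iter(self._items)

    def __getitem__(self, key):
        # A single integer or a slice is handled by the underlying list.
        if isinstance(key, (int, slice)):
            return self._items[key]
        key = list(key)
        # A non-empty list made only of booleans is treated as a mask.
        if key and all(isinstance(k, bool) for k in key):
            if len(key) != len(self._items):
                raise IndexError("boolean mask must match the view length")
            return [item for item, keep in zip(self._items, key) if keep]
        # Anything else is treated as a list of integer positions.
        return [self._items[k] for k in key]


atoms = ListView(['C', 'C', 'O', 'Cl'])
```

Here atoms[3], atoms[:3], atoms[[1, 3]] and atoms[[True, False, True, False]] select the same positions as the corresponding mol.atoms examples above.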

Properties on the rdkit objects are accessible through the props property:

In [11]:
mol.props['is_reactive'] = 'very!'
In [12]:
mol.atoms[1].props['kind'] = 'electrophilic'
mol.atoms[3].props['leaving group'] = 1
mol.bonds[2].props['bond strength'] = 'strong'

These use RDKit's property functionality internally:

In [13]:
mol.GetProp('is_reactive')
Out[13]:
'very!'

Note

RDKit properties can only store str, int and float values. Any other type will be coerced to a string before storage.
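The coercion rule in this note can be sketched in a few lines. This only illustrates the rule as stated; the function name coerce_prop is hypothetical, and how scikit-chem performs the coercion internally is not shown here.

```python
def coerce_prop(value):
    """Return value unchanged if it is a str, int or float, else str(value).

    Hypothetical helper illustrating the rule in the note above. Note that
    bool is a subclass of int in Python, so True and False pass through
    this check unchanged.
    """
    if isinstance(value, (str, int, float)):
        return value
    return str(value)
```

Under this rule a list such as [1, 2] would be stored as the string '[1, 2]', so it would need to be parsed again after retrieval.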

The properties of atoms and bonds are accessible molecule-wide:

In [14]:
mol.atoms.props
Out[14]:
<MolPropertyView values="{'leaving group': [nan, nan, nan, 1.0], 'kind': [None, 'electrophilic', None, None]}" at 0x11daf8390>
In [15]:
mol.bonds.props
Out[15]:
<MolPropertyView values="{'bond strength': [None, None, 'strong']}" at 0x11daf80f0>

These can be exported as pandas objects:

In [16]:
mol.atoms.props.to_frame()
Out[16]:
kind leaving group
atom_idx
0 None NaN
1 electrophilic NaN
2 None NaN
3 None 1.0

Export and Serialization

Molecules are exported and serialized in much the same way as they are constructed, again taking inspiration from pandas.

In [17]:
df.to_csv()
Out[17]:
',a,b\n0,10,20\n1,20,40\n'
In [18]:
mol.to_inchi_key()
Out[18]:
'WETWJCDKMRHUPV-UHFFFAOYSA-N'

The available export methods are:

In [19]:
[method for method in skchem.Mol.__dict__ if method.startswith('to_')]
Out[19]:
['to_inchi',
 'to_json',
 'to_smiles',
 'to_smarts',
 'to_inchi_key',
 'to_binary',
 'to_dict',
 'to_molblock',
 'to_tplfile',
 'to_formula',
 'to_molfile',
 'to_pdbblock',
 'to_tplblock']