Data

scikit-chem provides a simple interface to chemical datasets, and a framework for constructing these datasets. The data module uses fuel to make complex out-of-memory iterative functionality straightforward (see the fuel documentation). It also offers an abstraction for easily loading smaller datasets that fit in memory.

In memory datasets

Datasets consist of sets and sources. Simply put, sets are collections of molecules in the dataset, and sources are types of data relating to these molecules.

For demonstration purposes, we will use the Bursi Ames dataset. This has 3 sets:

In [31]:

skchem.data.BursiAmes.available_sets()

Out[31]:

('train', 'valid', 'test')

And many sources:

In [32]:

skchem.data.BursiAmes.available_sources()

Out[32]:

('G', 'A', 'y', 'A_cx', 'G_d', 'X_morg', 'X_cx', 'X_pc')

Note

Currently, the nature of the sources is not always well documented, but as a guide, X are molecular features, y are target variables, A are atom features, and G are distances. When available, they will be detailed in the docstring of the dataset, accessible with help.

For this example, we will load the X_morg and the y sources for all the sets. These are circular (Morgan) fingerprints and the target labels (in this case, whether the molecule was a mutagen).

We can load the data for requested sets and sources using the in memory API:

In [33]:

kws = {'sets': ('train', 'valid', 'test'), 'sources':('X_morg', 'y')}

(X_train, y_train), (X_valid, y_valid), (X_test, y_test) = skchem.data.BursiAmes.load_data(**kws)

The requested data is loaded as nested tuples, sorted first by set, and then by source, which can easily be unpacked as above.
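To make the nesting order concrete, here is a minimal sketch that mimics the layout with plain Python. The `store` dictionary and the `load_nested` helper are hypothetical stand-ins written for illustration, not part of scikit-chem; they only show that the outer tuple follows the order of the requested sets and the inner tuples follow the order of the requested sources.

```python
# Hypothetical stand-in for a dataset: strings take the place of real arrays.
store = {
    'train': {'X_morg': 'train/X_morg', 'y': 'train/y'},
    'valid': {'X_morg': 'valid/X_morg', 'y': 'valid/y'},
    'test': {'X_morg': 'test/X_morg', 'y': 'test/y'},
}


def load_nested(sets, sources):
    """Return one inner tuple per set, each with one entry per source.

    Illustrative helper only; it is an assumption about the layout,
    not the scikit-chem implementation.
    """
    return tuple(tuple(store[s][src] for src in sources) for s in sets)


# Outer level: sets, in the order requested. Inner level: sources.
(X_train, y_train), (X_test, y_test) = load_nested(
    sets=('train', 'test'), sources=('X_morg', 'y'))

print(X_train, y_train)
print(X_test, y_test)
```

Because the structure is positional, the unpacking pattern on the left-hand side has to match the order of the sets and sources that were requested.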

In [34]:

print('train shapes:', X_train.shape, y_train.shape)
print('valid shapes:', X_valid.shape, y_valid.shape)
print('test shapes:', X_test.shape, y_test.shape)

train shapes: (3007, 2048) (3007,)
valid shapes: (645, 2048) (645,)
test shapes: (645, 2048) (645,)
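As a quick arithmetic check on the shapes printed above, the row counts can be turned into split fractions. This sketch uses only the numbers shown in the output; the variable names are illustrative.

```python
# Row counts copied from the shapes printed above.
n_rows = {'train': 3007, 'valid': 645, 'test': 645}

total = sum(n_rows.values())
fractions = {name: n / total for name, n in n_rows.items()}

print('total molecules:', total)
for name, frac in fractions.items():
    print(f'{name}: {frac:.1%}')
```

The three sets add up to 4297 molecules, which corresponds to roughly a 70/15/15 split between train, valid and test.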

The raw data is loaded as numpy arrays:

In [35]:

X_train

Out[35]:

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [36]:

y_train

Out[36]:

array([1, 1, 1, ..., 0, 1, 1], dtype=uint8)

These arrays should be ready to use as fuel for modelling!

Data as pandas objects

The data is originally saved as pandas objects, and can be retrieved as such using the read_frame class method.

Features are available under the ‘feats’ namespace:

In [37]:

skchem.data.BursiAmes.read_frame('feats/X_morg')

Out[37]:

morgan_fp_idx	0	1	2	3	4	...	2043	2044	2045	2046	2047
batch
1728-95-6	0	0	0	0	0	...	0	0	0	0	0
74550-97-3	0	0	0	0	0	...	0	0	0	0	0
16757-83-8	0	0	0	0	0	...	0	0	0	0	0
553-97-9	0	0	0	0	0	...	0	0	0	0	0
115-39-9	0	0	0	0	0	...	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...
874-60-2	0	0	0	0	0	...	0	0	0	0	0
92-66-0	0	0	0	0	0	...	0	0	0	0	0
594-71-8	0	0	0	0	0	...	0	0	0	0	0
55792-21-7	0	0	0	0	0	...	0	0	0	0	0
84987-77-9	0	0	0	0	0	...	0	0	0	0	0

4297 rows × 2048 columns

Target variables are under ‘targets’:

In [39]:

skchem.data.BursiAmes.read_frame('targets/y')

Out[39]:

batch
1728-95-6     1
74550-97-3    1
16757-83-8    1
553-97-9      0
115-39-9      0
             ..
874-60-2      1
92-66-0       0
594-71-8      1
55792-21-7    0
84987-77-9    1
Name: is_mutagen, dtype: uint8

Set membership masks are under ‘indices’:

In [40]:

skchem.data.BursiAmes.read_frame('indices/train')

Out[40]:

batch
1728-95-6      True
74550-97-3     True
16757-83-8     True
553-97-9       True
115-39-9       True
              ...
874-60-2      False
92-66-0       False
594-71-8      False
55792-21-7    False
84987-77-9    False
Name: split, dtype: bool
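A boolean mask like this can be used to select the rows of a feature frame or target series that belong to a set. The sketch below uses small made-up pandas objects in place of the frames returned by read_frame (the identifiers and values are invented for illustration), and assumes the mask shares its index with the features and targets, as the outputs above suggest.

```python
import pandas as pd

# Toy stand-ins for the frames shown above; all values are made up.
index = pd.Index(['mol-a', 'mol-b', 'mol-c', 'mol-d'], name='batch')
X = pd.DataFrame([[0, 1], [1, 0], [0, 0], [1, 1]], index=index)
y = pd.Series([1, 0, 0, 1], index=index, name='is_mutagen', dtype='uint8')
train = pd.Series([True, True, False, False], index=index, name='split')

# Boolean indexing with .loc keeps only the rows where the mask is True.
X_train = X.loc[train]
y_train = y.loc[train]

print(X_train.shape, y_train.shape)
print(list(X_train.index))
```

Calling `.values` on the selected objects gives plain numpy arrays, if those are preferred for modelling.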

Finally, molecules are accessible via ‘structure’:

In [42]:

skchem.data.BursiAmes.read_frame('structure')

Out[42]:

batch
1728-95-6      <Mol: [H]c1c([H])c([H])c(-c2nc(-c3c([H])c([H])...
119-34-6       <Mol: [H]Oc1c([H])c([H])c(N([H])[H])c([H])c1[N...
371-40-4           <Mol: [H]c1c([H])c(N([H])[H])c([H])c([H])c1F>
2319-96-2      <Mol: [H]c1c([H])c([H])c2c([H])c3c(c([H])c(C([...
1822-51-1         <Mol: [H]c1nc([H])c([H])c(C([H])([H])Cl)c1[H]>
                                     ...
84-64-0        <Mol: [H]c1c([H])c([H])c(C(=O)OC2([H])C([H])([...
121808-62-6    <Mol: [H]OC(=O)C1([H])N(C(=O)C2([H])N([H])C(=O...
134-20-3       <Mol: [H]c1c([H])c([H])c(N([H])[H])c(C(=O)OC([...
6441-91-4      <Mol: [H]Oc1c([H])c(S(=O)(=O)O[H])c([H])c2c([H...
97534-21-9     <Mol: [H]Oc1nc(=S)n([H])c(O[H])c1C(=O)N([H])c1...
Name: structure, dtype: object

Note

The dataset building functionality is likely to change substantially in the future, so it is not documented here. Please look at the example datasets to understand the format required to build datasets directly.