Data

scikit-chem provides a simple interface to chemical datasets and a framework for constructing them. The data module uses fuel to make complex, out-of-memory, iterative functionality straightforward (see the fuel documentation). It also offers an abstraction for easily loading smaller datasets that fit in memory.

In memory datasets

Datasets consist of sets and sources. Simply put, sets are collections of molecules in the dataset, and sources are types of data relating to these molecules.

For demonstration purposes, we will use the Bursi Ames dataset. This has 3 sets:

In [31]:
skchem.data.BursiAmes.available_sets()
Out[31]:
('train', 'valid', 'test')

And many sources:

In [32]:
skchem.data.BursiAmes.available_sources()
Out[32]:
('G', 'A', 'y', 'A_cx', 'G_d', 'X_morg', 'X_cx', 'X_pc')

Note

Currently, the nature of the sources is not always well documented, but as a guide, X are molecular features, y are target variables, A are atom features, and G are distances. When available, they will be detailed in the docstring of the dataset, accessible with help.
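The naming guide in this note can be written as a small lookup. This is only an illustrative sketch: describe_source and PREFIX_MEANINGS are hypothetical names and not part of scikit-chem, and the rule of reading the text before the first underscore as the prefix is an assumption based on the source names listed above.

```python
# Prefix meanings taken from the note; any other prefix is reported as unknown.
PREFIX_MEANINGS = {
    "X": "molecular features",
    "y": "target variables",
    "A": "atom features",
    "G": "distances",
}


def describe_source(name):
    """Guess what a source holds from the part of its name before the first underscore."""
    prefix = name.split("_", 1)[0]
    return PREFIX_MEANINGS.get(prefix, "unknown")


sources = ("G", "A", "y", "A_cx", "G_d", "X_morg", "X_cx", "X_pc")
for source in sources:
    print(f"{source}: {describe_source(source)}")
```

The dataset docstring, where it exists, remains the authoritative description of each source.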

For this example, we will load the X_morg and y sources for all the sets. These are circular (Morgan) fingerprints and the target labels (in this case, whether the molecule is a mutagen), respectively.

We can load the data for the requested sets and sources using the in-memory API:

In [33]:
kws = {'sets': ('train', 'valid', 'test'), 'sources':('X_morg', 'y')}

(X_train, y_train), (X_valid, y_valid), (X_test, y_test) = skchem.data.BursiAmes.load_data(**kws)

The requested data is loaded as nested tuples, sorted first by set, and then by source, which can easily be unpacked as above.
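To make the nesting concrete, here is a minimal sketch of the same "sets first, then sources" layout using a stand-in function. load_data_sketch and _STORE are hypothetical and hold placeholder strings rather than real data. The sketch also assumes the output follows the order in which sets and sources were requested, which is not confirmed here for skchem itself.

```python
# Stand-in for a dataset's storage: {set name: {source name: data}}.
# The values are placeholder strings, not real BursiAmes data.
_STORE = {
    set_name: {source: f"{source}_{set_name}" for source in ("X_morg", "y", "A")}
    for set_name in ("train", "valid", "test")
}


def load_data_sketch(sets, sources):
    """Return one tuple per requested set, each holding one item per requested source."""
    return tuple(
        tuple(_STORE[set_name][source] for source in sources)
        for set_name in sets
    )


kws = {"sets": ("train", "test"), "sources": ("X_morg", "y")}

# The outer tuple has one entry per set and each inner tuple one entry per source,
# so the unpacking pattern mirrors the request.
(X_train, y_train), (X_test, y_test) = load_data_sketch(**kws)
print(X_train, y_train, X_test, y_test)
```

Requesting two sets and two sources gives a 2 × 2 structure, so the left-hand side of the assignment needs two parenthesised pairs.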

In [34]:
print('train shapes:', X_train.shape, y_train.shape)
print('valid shapes:', X_valid.shape, y_valid.shape)
print('test shapes:', X_test.shape, y_test.shape)
train shapes: (3007, 2048) (3007,)
valid shapes: (645, 2048) (645,)
test shapes: (645, 2048) (645,)

The raw data is loaded as numpy arrays:

In [35]:
X_train
Out[35]:
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
In [36]:
y_train
Out[36]:
array([1, 1, 1, ..., 0, 1, 1], dtype=uint8)

These arrays should be ready to use as fuel for modelling!

Data as pandas objects

The data is originally saved as pandas objects, and can be retrieved as such using the read_frame class method.

Features are available under the ‘feats’ namespace:

In [37]:
skchem.data.BursiAmes.read_frame('feats/X_morg')
Out[37]:
morgan_fp_idx 0 1 2 3 4 ... 2043 2044 2045 2046 2047
batch
1728-95-6 0 0 0 0 0 ... 0 0 0 0 0
74550-97-3 0 0 0 0 0 ... 0 0 0 0 0
16757-83-8 0 0 0 0 0 ... 0 0 0 0 0
553-97-9 0 0 0 0 0 ... 0 0 0 0 0
115-39-9 0 0 0 0 0 ... 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ...
874-60-2 0 0 0 0 0 ... 0 0 0 0 0
92-66-0 0 0 0 0 0 ... 0 0 0 0 0
594-71-8 0 0 0 0 0 ... 0 0 0 0 0
55792-21-7 0 0 0 0 0 ... 0 0 0 0 0
84987-77-9 0 0 0 0 0 ... 0 0 0 0 0

4297 rows × 2048 columns

Target variables under ‘targets’:

In [39]:
skchem.data.BursiAmes.read_frame('targets/y')
Out[39]:
batch
1728-95-6     1
74550-97-3    1
16757-83-8    1
553-97-9      0
115-39-9      0
             ..
874-60-2      1
92-66-0       0
594-71-8      1
55792-21-7    0
84987-77-9    1
Name: is_mutagen, dtype: uint8

Set membership masks under ‘indices’:

In [40]:
skchem.data.BursiAmes.read_frame('indices/train')
Out[40]:
batch
1728-95-6      True
74550-97-3     True
16757-83-8     True
553-97-9       True
115-39-9       True
              ...
874-60-2      False
92-66-0       False
594-71-8      False
55792-21-7    False
84987-77-9    False
Name: split, dtype: bool
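A membership mask like this one can be used to pick out the rows that belong to a set. The sketch below shows the idea with NumPy boolean indexing on small toy arrays. The array contents are invented for illustration, and whether load_data applies the masks this way internally is an assumption.

```python
import numpy as np

# Toy stand-ins: 5 molecules, 3 feature columns, and a train membership mask.
X = np.arange(15).reshape(5, 3)
y = np.array([1, 1, 0, 0, 1], dtype=np.uint8)
train_mask = np.array([True, True, True, False, False])

# Boolean indexing keeps the rows where the mask is True.
X_train, y_train = X[train_mask], y[train_mask]

# Inverting the mask selects every molecule outside the training set.
X_rest, y_rest = X[~train_mask], y[~train_mask]

print(X_train.shape, y_train.shape)
print(X_rest.shape, y_rest.shape)
```

The same pattern should carry over to the pandas objects returned by read_frame, since a boolean Series can index a DataFrame or Series that shares its index.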

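Finally, molecules are stored under ‘structure’. In the session recorded here, reading them fails with an AttributeError while unpickling the stored molecule objects: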

In [41]:
skchem.data.BursiAmes.read_frame('structure')
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-41-5d342c123258> in <module>()
----> 1 skchem.data.BursiAmes.read_frame('structure')

/Users/rich/projects/scikit-chem/skchem/data/datasets/base.py in read_frame(cls, key, *args, **kwargs)
     95         with warnings.catch_warnings():
     96             warnings.simplefilter('ignore')
---> 97             data = pd.read_hdf(find_in_data_path(cls.filename), key, *args, **kwargs)
     98         if isinstance(data, pd.Panel):
     99             data = data.transpose(2, 1, 0)

/Users/rich/anaconda/lib/python3.5/site-packages/pandas/io/pytables.py in read_hdf(path_or_buf, key, **kwargs)
    328                                  'multiple datasets.')
    329             key = keys[0]
--> 330         return store.select(key, auto_close=auto_close, **kwargs)
    331     except:
    332         # if there is an error, close the store

/Users/rich/anaconda/lib/python3.5/site-packages/pandas/io/pytables.py in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
    678                            chunksize=chunksize, auto_close=auto_close)
    679
--> 680         return it.get_result()
    681
    682     def select_as_coordinates(

/Users/rich/anaconda/lib/python3.5/site-packages/pandas/io/pytables.py in get_result(self, coordinates)
   1362
   1363         # directly return the result
-> 1364         results = self.func(self.start, self.stop, where)
   1365         self.close()
   1366         return results

/Users/rich/anaconda/lib/python3.5/site-packages/pandas/io/pytables.py in func(_start, _stop, _where)
    671             return s.read(start=_start, stop=_stop,
    672                           where=_where,
--> 673                           columns=columns, **kwargs)
    674
    675         # create the iterator

/Users/rich/anaconda/lib/python3.5/site-packages/pandas/io/pytables.py in read(self, **kwargs)
   2637         self.validate_read(kwargs)
   2638         index = self.read_index('index')
-> 2639         values = self.read_array('values')
   2640         return Series(values, index=index, name=self.name)
   2641

/Users/rich/anaconda/lib/python3.5/site-packages/pandas/io/pytables.py in read_array(self, key)
   2325         import tables
   2326         node = getattr(self.group, key)
-> 2327         data = node[:]
   2328         attrs = node._v_attrs
   2329

/Users/rich/anaconda/lib/python3.5/site-packages/tables/vlarray.py in __getitem__(self, key)
    675             start, stop, step = self._process_range(
    676                 key.start, key.stop, key.step)
--> 677             return self.read(start, stop, step)
    678         # Try with a boolean or point selection
    679         elif type(key) in (list, tuple) or isinstance(key, numpy.ndarray):

/Users/rich/anaconda/lib/python3.5/site-packages/tables/vlarray.py in read(self, start, stop, step)
    815         atom = self.atom
    816         if not hasattr(atom, 'size'):  # it is a pseudo-atom
--> 817             outlistarr = [atom.fromarray(arr) for arr in listarr]
    818         else:
    819             # Convert the list to the right flavor

/Users/rich/anaconda/lib/python3.5/site-packages/tables/vlarray.py in <listcomp>(.0)
    815         atom = self.atom
    816         if not hasattr(atom, 'size'):  # it is a pseudo-atom
--> 817             outlistarr = [atom.fromarray(arr) for arr in listarr]
    818         else:
    819             # Convert the list to the right flavor

/Users/rich/anaconda/lib/python3.5/site-packages/tables/atom.py in fromarray(self, array)
   1179         if array.size == 0:
   1180             return None
-> 1181         return pickle.loads(array.tostring())

AttributeError: Can't get attribute 'AtomView' on <module 'skchem.core.base' from '/Users/rich/projects/scikit-chem/skchem/core/base.py'>

Note

The dataset building functionality is likely to change substantially in the future, so it is not documented here. Please look at the example datasets to understand the format required to build datasets directly.