scikit-chem provides a simple interface to chemical datasets, and a framework for constructing these datasets. The data module uses fuel to make complex out-of-memory iterative functionality straightforward (see the fuel documentation). It also offers an abstraction that allows easy loading of smaller datasets that fit in memory.
Datasets consist of sets and sources. Simply put, sets are collections of molecules in the dataset, and sources are types of data relating to these molecules.
For demonstration purposes, we will use the Bursi Ames dataset. This has 3 sets:
In [31]:
skchem.data.BursiAmes.available_sets()
Out[31]:
('train', 'valid', 'test')
And many sources:
In [32]:
skchem.data.BursiAmes.available_sources()
Out[32]:
('G', 'A', 'y', 'A_cx', 'G_d', 'X_morg', 'X_cx', 'X_pc')
Note
Currently, the nature of the sources is not always well documented, but as a guide, X are molecular features, y are target variables, A are atom features, and G are distances. When available, they will be detailed in the docstring of the dataset, accessible with help.
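The naming guide in the note can be captured in a small lookup. The sketch below is not part of scikit-chem: `SOURCE_KINDS` and `describe_source` are hypothetical names, and it assumes that the part of a source name before the first underscore identifies its kind.

```python
# Hypothetical helper, not part of scikit-chem: maps a source name to the
# rough meaning given in the note, based on the prefix before the first "_".
SOURCE_KINDS = {
    'X': 'molecular features',
    'y': 'target variables',
    'A': 'atom features',
    'G': 'distances',
}


def describe_source(name):
    """Return the kind of data a source holds, or 'unknown'."""
    prefix = name.split('_', 1)[0]
    return SOURCE_KINDS.get(prefix, 'unknown')


sources = ('G', 'A', 'y', 'A_cx', 'G_d', 'X_morg', 'X_cx', 'X_pc')
for source in sources:
    print(source, '->', describe_source(source))
```

The dataset docstring remains the authoritative description where one exists; this lookup only encodes the rule of thumb.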
For this example, we will load the X_morg and the y sources for all the sets. These are circular fingerprints, and the target labels (in this case, whether the molecule was a mutagen).
We can load the data for the requested sets and sources using the in-memory API:
In [33]:
kws = {'sets': ('train', 'valid', 'test'), 'sources':('X_morg', 'y')}
(X_train, y_train), (X_valid, y_valid), (X_test, y_test) = skchem.data.BursiAmes.load_data(**kws)
The requested data is loaded as nested tuples, sorted first by set, and then by source, which can easily be unpacked as above.
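To make the ordering concrete, here is a minimal sketch that mimics the nested-tuple layout with placeholder arrays. `fake_load_data` is a hypothetical stand-in, not the scikit-chem implementation, and the zero-filled arrays and row counts are illustrative only. The point is that the outer tuple follows `sets` and each inner tuple follows `sources`.

```python
import numpy as np

# Illustrative row counts per set and trailing shape per source (assumptions).
N_ROWS = {'train': 6, 'valid': 2, 'test': 2}
SOURCE_SHAPE = {'X_morg': (2048,), 'y': ()}


def fake_load_data(sets, sources):
    """Hypothetical stand-in: one tuple per set, one array per source."""
    return tuple(
        tuple(np.zeros((N_ROWS[s],) + SOURCE_SHAPE[src], dtype=np.uint8)
              for src in sources)
        for s in sets
    )


data = fake_load_data(sets=('train', 'valid'), sources=('X_morg', 'y'))
(X_tr, y_tr), (X_va, y_va) = data
print(X_tr.shape, y_tr.shape)  # (6, 2048) (6,)
```

In this sketch, requesting the sources in a different order changes the order within each inner tuple accordingly, so the unpacking pattern should mirror the arguments passed.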
In [34]:
print('train shapes:', X_train.shape, y_train.shape)
print('valid shapes:', X_valid.shape, y_valid.shape)
print('test shapes:', X_test.shape, y_test.shape)
train shapes: (3007, 2048) (3007,)
valid shapes: (645, 2048) (645,)
test shapes: (645, 2048) (645,)
The raw data is loaded as numpy arrays:
In [35]:
X_train
Out[35]:
array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]])
In [36]:
y_train
Out[36]:
array([1, 1, 1, ..., 0, 1, 1], dtype=uint8)
These should be ready to use as fuel for modelling!
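Before fitting a model, it can help to check class balance and fingerprint sparsity. The sketch below uses small random stand-in arrays with the same dtype and layout as `X_train` and `y_train` rather than the real Bursi Ames data, so the printed numbers are not dataset statistics. The names `X_demo` and `y_demo` are placeholders.

```python
import numpy as np

rng = np.random.RandomState(0)
# Stand-in data (assumption): 100 molecules, 2048-bit fingerprints,
# binary labels. Replace with the arrays returned by load_data.
X_demo = (rng.rand(100, 2048) < 0.02).astype(np.uint8)
y_demo = rng.randint(0, 2, size=100).astype(np.uint8)

# Number of molecules in each class (index 0: label 0, index 1: label 1).
class_counts = np.bincount(y_demo, minlength=2)

# Fraction of fingerprint bits that are set across the whole matrix.
bit_density = X_demo.mean()

print('class counts:', class_counts)
print('bit density: %.4f' % bit_density)
```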
The data is originally saved as pandas objects, and can be retrieved as such using the read_frame class method.
Features are available under the ‘feats’ namespace:
In [37]:
skchem.data.BursiAmes.read_frame('feats/X_morg')
Out[37]:
morgan_fp_idx | 0 | 1 | 2 | 3 | 4 | ... | 2043 | 2044 | 2045 | 2046 | 2047 |
---|---|---|---|---|---|---|---|---|---|---|---|
batch | | | | | | | | | | | |
1728-95-6 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
74550-97-3 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
16757-83-8 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
553-97-9 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
115-39-9 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
874-60-2 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
92-66-0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
594-71-8 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
55792-21-7 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
84987-77-9 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 |
4297 rows × 2048 columns
Target variables under ‘targets’:
In [39]:
skchem.data.BursiAmes.read_frame('targets/y')
Out[39]:
batch
1728-95-6 1
74550-97-3 1
16757-83-8 1
553-97-9 0
115-39-9 0
..
874-60-2 1
92-66-0 0
594-71-8 1
55792-21-7 0
84987-77-9 1
Name: is_mutagen, dtype: uint8
Set membership masks under ‘indices’:
In [40]:
skchem.data.BursiAmes.read_frame('indices/train')
Out[40]:
batch
1728-95-6 True
74550-97-3 True
16757-83-8 True
553-97-9 True
115-39-9 True
...
874-60-2 False
92-66-0 False
594-71-8 False
55792-21-7 False
84987-77-9 False
Name: split, dtype: bool
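These boolean masks can be combined with the feature and target frames to rebuild a split by hand. The sketch below uses small hand-made pandas objects in place of the frames returned by `read_frame`; the identifiers and values are made up for illustration, and it assumes all three objects share the same index.

```python
import pandas as pd

# Hand-made stand-ins (assumptions) for 'feats/X_morg', 'targets/y' and
# 'indices/train'; all three share the same index of molecule identifiers.
index = pd.Index(['mol-a', 'mol-b', 'mol-c', 'mol-d'], name='batch')
X = pd.DataFrame([[0, 1], [1, 0], [0, 0], [1, 1]], index=index)
y = pd.Series([1, 0, 1, 0], index=index, name='is_mutagen')
train_mask = pd.Series([True, True, False, False], index=index, name='split')

# Boolean indexing keeps only the rows where the mask is True.
X_train_frame = X[train_mask]
y_train_frame = y[train_mask]

print(X_train_frame.shape, y_train_frame.shape)  # (2, 2) (2,)
```

Working with frames rather than raw arrays keeps the molecule identifiers attached to each row, which can make it easier to trace a prediction back to its compound.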
Finally, molecules are accessible via ‘structure’:
In [41]:
skchem.data.BursiAmes.read_frame('structure')
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-41-5d342c123258> in <module>()
----> 1 skchem.data.BursiAmes.read_frame('structure')
/Users/rich/projects/scikit-chem/skchem/data/datasets/base.py in read_frame(cls, key, *args, **kwargs)
95 with warnings.catch_warnings():
96 warnings.simplefilter('ignore')
---> 97 data = pd.read_hdf(find_in_data_path(cls.filename), key, *args, **kwargs)
98 if isinstance(data, pd.Panel):
99 data = data.transpose(2, 1, 0)
/Users/rich/anaconda/lib/python3.5/site-packages/pandas/io/pytables.py in read_hdf(path_or_buf, key, **kwargs)
328 'multiple datasets.')
329 key = keys[0]
--> 330 return store.select(key, auto_close=auto_close, **kwargs)
331 except:
332 # if there is an error, close the store
/Users/rich/anaconda/lib/python3.5/site-packages/pandas/io/pytables.py in select(self, key, where, start, stop, columns, iterator, chunksize, auto_close, **kwargs)
678 chunksize=chunksize, auto_close=auto_close)
679
--> 680 return it.get_result()
681
682 def select_as_coordinates(
/Users/rich/anaconda/lib/python3.5/site-packages/pandas/io/pytables.py in get_result(self, coordinates)
1362
1363 # directly return the result
-> 1364 results = self.func(self.start, self.stop, where)
1365 self.close()
1366 return results
/Users/rich/anaconda/lib/python3.5/site-packages/pandas/io/pytables.py in func(_start, _stop, _where)
671 return s.read(start=_start, stop=_stop,
672 where=_where,
--> 673 columns=columns, **kwargs)
674
675 # create the iterator
/Users/rich/anaconda/lib/python3.5/site-packages/pandas/io/pytables.py in read(self, **kwargs)
2637 self.validate_read(kwargs)
2638 index = self.read_index('index')
-> 2639 values = self.read_array('values')
2640 return Series(values, index=index, name=self.name)
2641
/Users/rich/anaconda/lib/python3.5/site-packages/pandas/io/pytables.py in read_array(self, key)
2325 import tables
2326 node = getattr(self.group, key)
-> 2327 data = node[:]
2328 attrs = node._v_attrs
2329
/Users/rich/anaconda/lib/python3.5/site-packages/tables/vlarray.py in __getitem__(self, key)
675 start, stop, step = self._process_range(
676 key.start, key.stop, key.step)
--> 677 return self.read(start, stop, step)
678 # Try with a boolean or point selection
679 elif type(key) in (list, tuple) or isinstance(key, numpy.ndarray):
/Users/rich/anaconda/lib/python3.5/site-packages/tables/vlarray.py in read(self, start, stop, step)
815 atom = self.atom
816 if not hasattr(atom, 'size'): # it is a pseudo-atom
--> 817 outlistarr = [atom.fromarray(arr) for arr in listarr]
818 else:
819 # Convert the list to the right flavor
/Users/rich/anaconda/lib/python3.5/site-packages/tables/vlarray.py in <listcomp>(.0)
815 atom = self.atom
816 if not hasattr(atom, 'size'): # it is a pseudo-atom
--> 817 outlistarr = [atom.fromarray(arr) for arr in listarr]
818 else:
819 # Convert the list to the right flavor
/Users/rich/anaconda/lib/python3.5/site-packages/tables/atom.py in fromarray(self, array)
1179 if array.size == 0:
1180 return None
-> 1181 return pickle.loads(array.tostring())
AttributeError: Can't get attribute 'AtomView' on <module 'skchem.core.base' from '/Users/rich/projects/scikit-chem/skchem/core/base.py'>
Note
The dataset building functionality is likely to undergo a large change in future so is not documented here. Please look at the example datasets to understand the format required to build the datasets directly.