Pandas objects are the main data structures used for collections of
molecules. scikit-chem provides convenience functions to load
objects into pandas.DataFrames from common file formats in
cheminformatics.
The scikit-chem functionality is modelled after the pandas API.
To load a CSV file using pandas, you would call:
In [1]:
df = pd.read_csv('https://archive.org/download/scikit-chem_example_files/iris.csv',
header=None); df
Out[1]:
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
Analogously with scikit-chem:
In [2]:
smi = skchem.read_smiles('https://archive.org/download/scikit-chem_example_files/example.smi')
The readers currently available are:
In [3]:
[method for method in skchem.io.__dict__ if method.startswith('read_')]
Out[3]:
['read_sdf', 'read_smiles']
scikit-chem also adds convenience methods onto pandas.DataFrame objects.
In [4]:
pd.DataFrame.from_smiles('https://archive.org/download/scikit-chem_example_files/example.smi')
Out[4]:
| | structure | 1 |
|---|---|---|
| 0 | <Mol: CC> | ethane |
| 1 | <Mol: CCC> | propane |
| 2 | <Mol: c1ccccc1> | benzene |
| 3 | <Mol: CC(=O)[O-].[Na+]> | sodium acetate |
| 4 | <Mol: NC(CO)C(=O)O> | serine |
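For orientation, a SMILES file is plain text with one molecule per line: a SMILES string followed by optional extra fields such as a name. The sketch below shows how such lines could be split into the two columns seen above. It assumes the fields are tab-delimited (an assumption about example.smi, suggested by names such as "sodium acetate" containing spaces), and `parse_smiles_lines` is a hypothetical helper, not part of scikit-chem. It does not build molecule objects.

```python
def parse_smiles_lines(lines, delimiter="\t"):
    """Yield (smiles, name) pairs from the lines of a SMILES file.

    Hypothetical helper for illustration only. The tab delimiter is an
    assumption about the file layout, not something scikit-chem specifies.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the first delimiter only, so names may contain spaces.
        smiles, _, name = line.partition(delimiter)
        yield smiles.strip(), name.strip()
```

Passing an open file handle, or any list of strings, as `lines` yields one pair per molecule. Lines without a delimiter produce an empty name.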
Note
Currently, only read_smiles can read files over a network connection. Support for this is planned for all file types in the future.
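Until then, one possible workaround is to download the remote file to a local temporary file first and pass the local path to the reader. The following is a minimal sketch using only the standard library. `fetch_to_tempfile` is a hypothetical helper, the URL in the usage comment is a placeholder, and passing a local path to `read_sdf` is an assumption about its signature.

```python
import shutil
import tempfile
import urllib.request


def fetch_to_tempfile(url, suffix=""):
    """Download `url` to a named temporary file and return its local path.

    Hypothetical helper for illustration only, not part of scikit-chem.
    """
    # delete=False keeps the file on disk after the handle is closed,
    # so the caller is responsible for removing it when finished.
    with urllib.request.urlopen(url) as response, \
            tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as out:
        shutil.copyfileobj(response, out)
        return out.name


# Possible usage (placeholder URL; assumes read_sdf accepts a local path):
# path = fetch_to_tempfile("https://example.org/molecules.sdf", suffix=".sdf")
# df = skchem.read_sdf(path)
```

Copying the response in chunks with `shutil.copyfileobj` avoids holding the whole file in memory, which matters for large structure files.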
Writing files is again analogous to pandas:
In [5]:
from io import StringIO
sio = StringIO()
df.to_csv(sio)
sio.seek(0)
print(sio.read())
,0,1,2,3,4
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
In [6]:
sio = StringIO()
smi.iloc[:2].to_sdf(sio) # don't write too many!
sio.seek(0)
print(sio.read())
0
RDKit
2 1 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0
M END
> <1> (1)
ethane
$$$$
1
RDKit
3 2 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0
2 3 1 0
M END
> <1> (2)
propane
$$$$
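To help read the output above: each SDF record starts with a title line (here the row index), continues with a molecule block that ends at `M END`, lists data fields introduced by a `> <name>` header line, and is closed by a `$$$$` line. The sketch below pulls the title and data fields back out of such text. `parse_sdf_records` is a hypothetical helper, not part of scikit-chem, and it assumes well-formed records that are each terminated by `$$$$`.

```python
def _parse_record(lines):
    """Extract the title and data fields from the lines of one SDF record."""
    fields = {}
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith(">") and "<" in line:
            # Data header such as "> <1> (1)": the field name sits in <...>.
            start = line.index("<") + 1
            name = line[start:line.index(">", start)]
            i += 1
            value_lines = []
            # The value runs until a blank line or the next data header.
            while i < len(lines) and lines[i] and not lines[i].startswith(">"):
                value_lines.append(lines[i])
                i += 1
            fields[name] = "\n".join(value_lines)
        else:
            i += 1
    return {"title": lines[0].strip(), "fields": fields}


def parse_sdf_records(text):
    """Split SDF text on '$$$$' lines and parse each record.

    Hypothetical helper for illustration only. Text after the last
    '$$$$' line is ignored.
    """
    records = []
    current = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            if current:
                records.append(_parse_record(current))
            current = []
        else:
            current.append(line.rstrip())
    return records
```

Applied to the text printed above, this would give two records, each with a single data field named `1` that holds the molecule name. The atom and bond lines are skipped rather than interpreted.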