Input/Output

scikit-chem uses pandas objects as its main data structures for collections of molecules. It provides convenience functions that load common cheminformatics file formats into pandas.DataFrame objects.

Reading files

The scikit-chem functionality is modelled after the pandas API. To load a CSV file using pandas you would call:

In [1]:
import pandas as pd
df = pd.read_csv('https://archive.org/download/scikit-chem_example_files/iris.csv',
                 header=None)
df
Out[1]:
0 1 2 3 4
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa

Analogously with scikit-chem:

In [2]:
import skchem
smi = skchem.read_smiles('https://archive.org/download/scikit-chem_example_files/example.smi')

The reader functions currently available are:

In [3]:
[method for method in skchem.io.__dict__ if method.startswith('read_')]
Out[3]:
['read_sdf', 'read_smiles']

scikit-chem also adds convenience methods onto pandas.DataFrame objects.

In [4]:
pd.DataFrame.from_smiles('https://archive.org/download/scikit-chem_example_files/example.smi')
Out[4]:
structure 1
0 <Mol: CC> ethane
1 <Mol: CCC> propane
2 <Mol: c1ccccc1> benzene
3 <Mol: CC(=O)[O-].[Na+]> sodium acetate
4 <Mol: NC(CO)C(=O)O> serine
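
To make the input format concrete, here is a minimal sketch of parsing SMILES file text with only the standard library. It assumes a tab-delimited file with a SMILES string followed by a name on each line and no header row; the contents of SMI_TEXT are reconstructed from the table above and may not match the real example.smi. The parse_smi function is hypothetical and is not part of scikit-chem, which also builds molecule objects rather than keeping plain strings.

```python
import csv
from io import StringIO

# Assumed file contents, reconstructed from the table above. The real
# example.smi may use a different delimiter or include a header row.
SMI_TEXT = "CC\tethane\nCCC\tpropane\nc1ccccc1\tbenzene\n"


def parse_smi(handle, delimiter="\t"):
    """Return a list of (smiles, name) tuples read from a file-like object.

    Hypothetical helper for illustration only; not a scikit-chem function.
    """
    rows = []
    for record in csv.reader(handle, delimiter=delimiter):
        if not record:
            continue  # skip blank lines
        smiles = record[0]
        # The name column is optional in SMILES files.
        name = record[1] if len(record) > 1 else None
        rows.append((smiles, name))
    return rows


records = parse_smi(StringIO(SMI_TEXT))
```

Each tuple corresponds to one row of the DataFrame shown above, with the SMILES string still unparsed.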

Note

Currently, only read_smiles can read files over a network connection. Network support for the other file types is planned.
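
Until then, one possible workaround is to download the file first and pass the local copy to the reader. The sketch below uses only the Python standard library; fetch_to_tempfile is a hypothetical helper and not part of scikit-chem, the URL in the usage comment is a placeholder, and it is an assumption that read_sdf accepts a local file path.

```python
import os
import shutil
import tempfile
import urllib.request


def fetch_to_tempfile(url, suffix=""):
    """Download `url` to a temporary local file and return the file's path.

    Hypothetical helper for illustration; not a scikit-chem function.
    The caller is responsible for deleting the file afterwards.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    # Stream the response to disk so large files are not held in memory.
    with os.fdopen(fd, "wb") as out, urllib.request.urlopen(url) as response:
        shutil.copyfileobj(response, out)
    return path


# Usage sketch (placeholder URL; assumes read_sdf takes a local path):
# local_path = fetch_to_tempfile("https://example.org/molecules.sdf", suffix=".sdf")
# df = skchem.read_sdf(local_path)
# os.remove(local_path)
```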

Writing files

Writing is again analogous to pandas. A DataFrame is written to CSV with to_csv, here into an in-memory buffer:

In [5]:
from io import StringIO
sio = StringIO()
df.to_csv(sio)
sio.seek(0)
print(sio.read())
,0,1,2,3,4
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
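
As a small variation on the cell above, the buffer contents can also be retrieved with StringIO.getvalue(), which avoids the seek and read calls. The sketch below additionally passes index=False and header=False so that the text round-trips through read_csv with header=None. The two-row frame named sample is made up for illustration and stands in for the iris data.

```python
from io import StringIO

import pandas as pd

# Illustrative two-row frame standing in for the iris data above.
sample = pd.DataFrame([
    [5.1, 3.5, 1.4, 0.2, "Iris-setosa"],
    [4.9, 3.0, 1.4, 0.2, "Iris-setosa"],
])

# Write to an in-memory buffer, omitting the index and header rows.
buffer = StringIO()
sample.to_csv(buffer, index=False, header=False)

# getvalue() returns everything written so far, regardless of position.
text = buffer.getvalue()

# Read the text back in the same way the iris file was loaded.
round_trip = pd.read_csv(StringIO(text), header=None)
```

Dropping the index matters here: with the default index=True, the row labels would be written as an extra first column and read back as data.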

Analogously with scikit-chem, molecules are written to an SDF file with to_sdf:

In [6]:
sio = StringIO()
smi.iloc[:2].to_sdf(sio)  # write only the first two molecules to keep the output short
sio.seek(0)
print(sio.read())
0
     RDKit

  2  1  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
M  END
>  <1>  (1)
ethane

$$$$
1
     RDKit

  3  2  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0
  2  3  1  0
M  END
>  <1>  (2)
propane

$$$$