{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Input/Output"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`pandas` objects are the main data structures used for collections of molecules. **scikit-chem** provides convenience functions that load common cheminformatics file formats into `pandas.DataFrame`s."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reading files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The **scikit-chem** functionality is modelled on the `pandas` API. To load a CSV file using `pandas`, you would call:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
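"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>0</th>\n",
"      <th>1</th>\n",
"      <th>2</th>\n",
"      <th>3</th>\n",
"      <th>4</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>5.1</td>\n",
"      <td>3.5</td>\n",
"      <td>1.4</td>\n",
"      <td>0.2</td>\n",
"      <td>Iris-setosa</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>4.9</td>\n",
"      <td>3.0</td>\n",
"      <td>1.4</td>\n",
"      <td>0.2</td>\n",
"      <td>Iris-setosa</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>4.7</td>\n",
"      <td>3.2</td>\n",
"      <td>1.3</td>\n",
"      <td>0.2</td>\n",
"      <td>Iris-setosa</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>4.6</td>\n",
"      <td>3.1</td>\n",
"      <td>1.5</td>\n",
"      <td>0.2</td>\n",
"      <td>Iris-setosa</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>5.0</td>\n",
"      <td>3.6</td>\n",
"      <td>1.4</td>\n",
"      <td>0.2</td>\n",
"      <td>Iris-setosa</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"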
],
"text/plain": [
"     0    1    2    3            4\n",
"0  5.1  3.5  1.4  0.2  Iris-setosa\n",
"1  4.9  3.0  1.4  0.2  Iris-setosa\n",
"2  4.7  3.2  1.3  0.2  Iris-setosa\n",
"3  4.6  3.1  1.5  0.2  Iris-setosa\n",
"4  5.0  3.6  1.4  0.2  Iris-setosa"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"df = pd.read_csv('https://archive.org/download/scikit-chem_example_files/iris.csv',\n",
"                 header=None)  # the file has no header row\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The analogous call in **scikit-chem** reads a SMILES file:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import skchem\n",
"smi = skchem.read_smiles('https://archive.org/download/scikit-chem_example_files/example.smi')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The readers currently available are:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"['read_sdf', 'read_smiles']"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"[method for method in skchem.io.__dict__ if method.startswith('read_')]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**scikit-chem** also adds convenience methods onto `pandas.DataFrame` objects."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
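"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>structure</th>\n",
"      <th>1</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>&lt;Mol: CC&gt;</td>\n",
"      <td>ethane</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>&lt;Mol: CCC&gt;</td>\n",
"      <td>propane</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>&lt;Mol: c1ccccc1&gt;</td>\n",
"      <td>benzene</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>&lt;Mol: CC(=O)[O-].[Na+]&gt;</td>\n",
"      <td>sodium acetate</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>&lt;Mol: NC(CO)C(=O)O&gt;</td>\n",
"      <td>serine</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"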
],
"text/plain": [
"                 structure               1\n",
"0                <Mol: CC>          ethane\n",
"1               <Mol: CCC>         propane\n",
"2          <Mol: c1ccccc1>         benzene\n",
"3  <Mol: CC(=O)[O-].[Na+]>  sodium acetate\n",
"4      <Mol: NC(CO)C(=O)O>          serine"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame.from_smiles('https://archive.org/download/scikit-chem_example_files/example.smi')"
]
},
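{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because the result is an ordinary `pandas.DataFrame`, the usual `pandas` operations apply to it. The cell below is a minimal sketch that uses only `pandas`: a stand-in frame holds plain SMILES strings where **scikit-chem** would hold `Mol` objects, and the column label `name` is an illustrative choice, not something **scikit-chem** defines."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Stand-in for the frame shown above: SMILES strings replace the Mol objects.\n",
"stand_in = pd.DataFrame([['CC', 'ethane'],\n",
"                         ['CCC', 'propane'],\n",
"                         ['c1ccccc1', 'benzene']],\n",
"                        columns=['structure', 1])\n",
"\n",
"# Give the unnamed column 1 a descriptive label, then index the rows by it.\n",
"named = stand_in.rename(columns={1: 'name'}).set_index('name')\n",
"\n",
"# Look up a structure by compound name.\n",
"named.loc['benzene', 'structure']"
]
},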
{
"cell_type": "raw",
"metadata": {
"raw_mimetype": "text/restructuredtext"
},
"source": [
".. note:: \n",
"\n",
"   Currently, only ``read_smiles`` can read files over a network connection. Support for the other file types is planned."
]
},
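{
"cell_type": "markdown",
"metadata": {},
"source": [
"Until the other readers support remote files, one possible workaround is to download the file first and read the local copy. The cell below is a sketch using only the Python standard library. The helper name `download_to_tempfile` is hypothetical, the URL in the commented usage is a placeholder, and it is an assumption that `read_sdf` accepts a local path."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import shutil\n",
"import tempfile\n",
"import urllib.request\n",
"\n",
"def download_to_tempfile(url, suffix=''):\n",
"    # Hypothetical helper: copy the resource at `url` into a local\n",
"    # temporary file and return the path of that file.\n",
"    with urllib.request.urlopen(url) as response:\n",
"        with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:\n",
"            shutil.copyfileobj(response, tmp)\n",
"            return tmp.name\n",
"\n",
"# Assumed usage (placeholder URL; read_sdf taking a path is an assumption):\n",
"# path = download_to_tempfile('https://example.org/molecules.sdf', suffix='.sdf')\n",
"# sdf = skchem.read_sdf(path)"
]
},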
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Writing files"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again, this is analogous to `pandas`:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
",0,1,2,3,4\n",
"0,5.1,3.5,1.4,0.2,Iris-setosa\n",
"1,4.9,3.0,1.4,0.2,Iris-setosa\n",
"2,4.7,3.2,1.3,0.2,Iris-setosa\n",
"3,4.6,3.1,1.5,0.2,Iris-setosa\n",
"4,5.0,3.6,1.4,0.2,Iris-setosa\n",
"\n"
]
}
],
"source": [
"from io import StringIO\n",
"sio = StringIO()\n",
"df.to_csv(sio)\n",
"sio.seek(0)\n",
"print(sio.read())"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0\n",
" RDKit \n",
"\n",
" 2 1 0 0 0 0 0 0 0 0999 V2000\n",
" 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 1 2 1 0\n",
"M END\n",
"> <1> (1) \n",
"ethane\n",
"\n",
"$$$$\n",
"1\n",
" RDKit \n",
"\n",
" 3 2 0 0 0 0 0 0 0 0999 V2000\n",
" 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n",
" 1 2 1 0\n",
" 2 3 1 0\n",
"M END\n",
"> <1> (2) \n",
"propane\n",
"\n",
"$$$$\n",
"\n"
]
}
],
"source": [
"sio = StringIO()\n",
"smi.iloc[:2].to_sdf(sio) # write only the first two molecules to keep the output short\n",
"sio.seek(0)\n",
"print(sio.read())"
]
},
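{
"cell_type": "markdown",
"metadata": {},
"source": [
"The examples above write to an in-memory `StringIO` buffer. As a sketch of the filename case using `pandas` alone, the cell below writes a small frame to a temporary file and reads it back. The file name `iris_subset.csv` is arbitrary. Whether the **scikit-chem** writers such as `to_sdf` accept a path in the same way is assumed by analogy and not confirmed here."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import os\n",
"import tempfile\n",
"\n",
"import pandas as pd\n",
"\n",
"frame = pd.DataFrame([[5.1, 3.5, 'Iris-setosa'],\n",
"                      [4.9, 3.0, 'Iris-setosa']])\n",
"\n",
"# Write to a named local file instead of a buffer.\n",
"path = os.path.join(tempfile.mkdtemp(), 'iris_subset.csv')\n",
"frame.to_csv(path, header=False, index=False)\n",
"\n",
"# Read the same file back by its filename.\n",
"restored = pd.read_csv(path, header=None)\n",
"restored"
]
},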
{
"cell_type": "raw",
"metadata": {
"raw_mimetype": "text/restructuredtext"
},
"source": [
".. todo::\n",
" \n",
" Document reading and writing to local files by filenames, to file-like objects, and from remote objects by URI"
]
}
],
"metadata": {
"celltoolbar": "Raw Cell Format",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
},
"widgets": {
"state": {},
"version": "1.1.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}