{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Input/Output" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas objects are the main data structures used for collections of molecules. **scikit-chem** provides convenience functions to load objects into `pandas.DataFrame`s from common file formats in cheminformatics. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The **scikit-chem** functionality is modelled after the `pandas` API. To load an csv file using `pandas` you would call:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
01234
05.13.51.40.2Iris-setosa
14.93.01.40.2Iris-setosa
24.73.21.30.2Iris-setosa
34.63.11.50.2Iris-setosa
45.03.61.40.2Iris-setosa
\n", "
" ], "text/plain": [ " 0 1 2 3 4\n", "0 5.1 3.5 1.4 0.2 Iris-setosa\n", "1 4.9 3.0 1.4 0.2 Iris-setosa\n", "2 4.7 3.2 1.3 0.2 Iris-setosa\n", "3 4.6 3.1 1.5 0.2 Iris-setosa\n", "4 5.0 3.6 1.4 0.2 Iris-setosa" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('https://archive.org/download/scikit-chem_example_files/iris.csv', \n", " header=None); df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Analogously with **scikit-chem**:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "smi = skchem.read_smiles('https://archive.org/download/scikit-chem_example_files/example.smi')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Currently available:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['read_sdf', 'read_smiles']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[method for method in skchem.io.__dict__ if method.startswith('read_')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**scikit-chem** also adds convenience methods onto `pandas.DataFrame` objects." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
structure1
0<Mol: CC>ethane
1<Mol: CCC>propane
2<Mol: c1ccccc1>benzene
3<Mol: CC(=O)[O-].[Na+]>sodium acetate
4<Mol: NC(CO)C(=O)O>serine
\n", "
" ], "text/plain": [ " structure 1\n", "0 ethane\n", "1 propane\n", "2 benzene\n", "3 sodium acetate\n", "4 serine" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame.from_smiles('https://archive.org/download/scikit-chem_example_files/example.smi')" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. note:: \n", "\n", " Currently, only ``read_smiles`` can read files over a network connection. This functionality is planned to be added in future for all file types." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Writing files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, this is analogous to `pandas`:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ",0,1,2,3,4\n", "0,5.1,3.5,1.4,0.2,Iris-setosa\n", "1,4.9,3.0,1.4,0.2,Iris-setosa\n", "2,4.7,3.2,1.3,0.2,Iris-setosa\n", "3,4.6,3.1,1.5,0.2,Iris-setosa\n", "4,5.0,3.6,1.4,0.2,Iris-setosa\n", "\n" ] } ], "source": [ "from io import StringIO\n", "sio = StringIO()\n", "df.to_csv(sio)\n", "sio.seek(0)\n", "print(sio.read())" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", " RDKit \n", "\n", " 2 1 0 0 0 0 0 0 0 0999 V2000\n", " 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n", " 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n", " 1 2 1 0\n", "M END\n", "> <1> (1) \n", "ethane\n", "\n", "$$$$\n", "1\n", " RDKit \n", "\n", " 3 2 0 0 0 0 0 0 0 0999 V2000\n", " 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n", " 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n", " 0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0\n", " 1 2 1 0\n", " 2 3 1 0\n", "M END\n", "> <1> (2) \n", "propane\n", "\n", "$$$$\n", "\n" ] } ], "source": [ "sio = StringIO()\n", "smi.iloc[:2].to_sdf(sio) # don't write too many!\n", "sio.seek(0)\n", "print(sio.read())" ] }, { "cell_type": "raw", "metadata": { "raw_mimetype": "text/restructuredtext" }, "source": [ ".. todo::\n", " \n", " Document reading and writing to local files by filenames, to file-like objects, and from remote objects by URI" ] } ], "metadata": { "celltoolbar": "Raw Cell Format", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" }, "widgets": { "state": {}, "version": "1.1.2" } }, "nbformat": 4, "nbformat_minor": 0 }