skchem.data.converters package

Submodules

skchem.data.converters.base module

# skchem.data.converters.base

Defines the base converter class.

class skchem.data.converters.base.Converter(directory, output_directory, output_filename='default.h5')[source]

Bases: object

Create a Fuel dataset from molecules and targets.

classmethod convert(**kwargs)[source]
create_file(path)[source]
classmethod fill_subparser(subparser)[source]
run(ms, y, output_path, splits=None, features=None, pytables_kws={'complib': 'bzip2', 'complevel': 9})[source]
Parameters:
  • ms (pd.Series) – The molecules of the dataset.
  • y (pd.Series or pd.DataFrame) – The target labels of the dataset.
  • output_path (str) – The path to which the dataset should be saved.
  • splits (iterable<(name, split)>) – An iterable of (name, split) tuples. Each split is a boolean array covering the whole dataset.
  • features (list[Feature]) – The features to calculate. Defaults are used if None.
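As a rough illustration of the splits argument, the sketch below builds (name, split) tuples as boolean masks over the whole dataset. The SMILES strings, target values and the 'train'/'test' names are invented for the example, and nothing here calls skchem itself.

```python
import pandas as pd

# Hypothetical toy data: SMILES strings stand in for the molecules.
ms = pd.Series(['CCO', 'CCC', 'c1ccccc1', 'CC(=O)O'], name='structure')
y = pd.Series([0.1, 0.5, 1.2, 0.3], index=ms.index, name='target')

# Each split is a boolean mask with one entry per row of the whole dataset.
train = pd.Series([True, True, True, False], index=ms.index)
test = ~train

# An iterable of (name, split) tuples, the shape the splits argument describes.
splits = [('train', train), ('test', test)]

for name, mask in splits:
    print(name, list(ms[mask]))
```

Because every mask spans all rows, the same masks can be applied to both ms and y without any index bookkeeping.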
save_features(ms)[source]

Save all features for the dataset.

save_frame(data, name, prefix='targets')[source]

Save a frame to the data file.

save_molecules(mols)[source]

Save the molecules to the data file.

save_splits()[source]

Save the splits to the data file.

save_targets(y)[source]
source_names
split_names

class skchem.data.converters.base.Feature(fper, key, axis_names)

Bases: tuple

axis_names

Alias for field number 2

fper

Alias for field number 0

key

Alias for field number 1
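The "Alias for field number" entries are what Python's collections.namedtuple generates, so Feature presumably behaves like the sketch below. The field values (a string in place of a real featurizer object, the 'morgan' key and the axis names) are placeholders, not values taken from skchem.

```python
from collections import namedtuple

# Recreate a tuple type with the same field order as documented above.
Feature = namedtuple('Feature', ['fper', 'key', 'axis_names'])

# Placeholder values; a real fper would be a featurizer object.
feat = Feature(fper='my_fingerprinter', key='morgan',
               axis_names=['batch', 'features'])

# Fields are reachable by name or by position.
assert feat.fper == feat[0]
assert feat.key == feat[1]
assert feat.axis_names == feat[2]
```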

class skchem.data.converters.base.Split(mask, name, converter)[source]

Bases: object

contiguous
indices
ref
save()[source]
to_dict()[source]

skchem.data.converters.base.contiguous_order(to_order, splits)[source]

Determine a contiguous order from non-overlapping splits, and put data in that order.

Parameters:
  • to_order (iterable<pd.Series, pd.DataFrame, pd.Panel>) – The pandas objects to put in contiguous order.
  • splits (iterable<pd.Series>) – The non-overlapping splits, as boolean masks.
Returns:

The data in contiguous order.

Return type:

iterable<pd.Series, pd.DataFrame, pd.Panel>
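One way to picture the reordering is sketched below: take the row positions selected by each mask, lay them end to end in split order, and reindex every object with that order. The function name contiguous_order_sketch is hypothetical, it expects at least one split, and it drops rows that belong to no split, which may differ from what the library does.

```python
import numpy as np
import pandas as pd


def contiguous_order_sketch(to_order, splits):
    """Reorder pandas objects so that each split occupies a contiguous block."""
    masks = [np.asarray(s, dtype=bool) for s in splits]
    # The splits must not overlap, otherwise a row would be needed twice.
    counts = np.sum(masks, axis=0)
    if (counts > 1).any():
        raise ValueError('splits overlap')
    # Row positions of each split, laid end to end in split order.
    order = np.concatenate([np.flatnonzero(m) for m in masks])
    return [obj.iloc[order] for obj in to_order]


data = pd.Series(list('abcde'))
train = pd.Series([True, False, True, False, False])
test = pd.Series([False, True, False, False, True])

(ordered,) = contiguous_order_sketch([data], [train, test])
print(list(ordered))  # ['a', 'c', 'b', 'e']
```

After reordering, each split can be described by a start and stop position instead of a full mask.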

skchem.data.converters.base.default_features()[source]
skchem.data.converters.base.default_pipeline()[source]

Return a default pipeline to be used for general datasets.

skchem.data.converters.bradley_open_mp module

class skchem.data.converters.bradley_open_mp.BradleyOpenMPConverter(directory, output_directory, output_filename='bradley_open_mp.h5')[source]

Bases: skchem.data.converters.base.Converter

static filter_bad(data)[source]
static fix_mp(data)[source]
static parse_data(path)[source]

skchem.data.converters.bursi_ames module

class skchem.data.converters.bursi_ames.BursiAmesConverter(directory, output_directory, output_filename='bursi_ames.h5')[source]

Bases: skchem.data.converters.base.Converter

skchem.data.converters.diversity_set module

# skchem.data.converters.diversity_set

Formatter for the example dataset.

class skchem.data.converters.diversity_set.DiversityConverter(directory, output_directory, output_filename='diversity.h5')[source]

Bases: skchem.data.converters.base.Converter

Example Converter, using the NCI DTP Diversity Set III.

parse_file(path)[source]
synthetic_targets(index)[source]

skchem.data.converters.muller_ames module

class skchem.data.converters.muller_ames.MullerAmesConverter(directory, output_directory, output_filename='muller_ames.h5')[source]

Bases: skchem.data.converters.base.Converter

create_split_dict(splits, name)[source]
drop_indices(splits, indices)[source]
parse_splits(f_path)[source]
patch_data(data, patches)[source]

Patch SMILES in a DataFrame with rewritten ones that specify diazo groups in an RDKit-friendly way.
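A minimal sketch of such a patching step follows, assuming the data carries a 'smiles' column indexed by compound identifier and the patches map identifiers to replacement strings. The function name, column name, identifiers and SMILES are all hypothetical and are not taken from the Müller Ames data.

```python
import pandas as pd


def patch_smiles_sketch(data, patches, column='smiles'):
    """Return a copy of data with selected SMILES replaced."""
    patched = data.copy()
    for idx, smiles in patches.items():
        # Only touch rows that are actually present in the frame.
        if idx in patched.index:
            patched.loc[idx, column] = smiles
    return patched


data = pd.DataFrame({'smiles': ['CCO', 'C=N#N']}, index=['mol_a', 'mol_b'])
# Charge-separated diazo form in place of the pentavalent-nitrogen one.
patches = {'mol_b': 'C=[N+]=[N-]'}
patched = patch_smiles_sketch(data, patches)
```

Working on a copy keeps the raw data available for comparison after patching.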

skchem.data.converters.nmrshiftdb2 module

class skchem.data.converters.nmrshiftdb2.NMRShiftDB2Converter(directory, output_directory, output_filename='nmrshiftdb2.h5')[source]

Bases: skchem.data.converters.base.Converter

static combine_duplicates(data)[source]

Collect duplicate spectra into one dictionary. All shifts are collected into lists.

static extract_duplicates(data, kind='13c')[source]

Get all duplicates of the given kind (13C by default).

static get_spectra(data)[source]

Retrieves spectra from raw data.

static log_dists(data)[source]
log_duplicates(data)[source]
static parse_data(filepath)[source]

Reads the raw datafile.

static process_spectra(data)[source]

Turn the string representations found in the SDF file into a dictionary.

static squash_duplicates(data)[source]

Take the mean of all the duplicates. This is where we could do a bit more checking.
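The combine-then-squash idea can be sketched on plain dictionaries as below. How skchem actually lays out spectra is not shown on this page, so the mapping from atom index to shift, the function names and the numbers are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def combine_sketch(spectra):
    """Collect shifts from several spectra into lists, keyed by atom index."""
    combined = defaultdict(list)
    for spectrum in spectra:
        for atom_idx, shift in spectrum.items():
            combined[atom_idx].append(shift)
    return dict(combined)


def squash_sketch(combined):
    """Replace each list of duplicate shifts by its mean."""
    return {atom_idx: mean(shifts) for atom_idx, shifts in combined.items()}


# Two hypothetical duplicate spectra for the same molecule (shifts in ppm).
duplicates = [{0: 20.0, 1: 30.0}, {0: 22.0, 1: 30.0, 2: 128.0}]
combined = combine_sketch(duplicates)
squashed = squash_sketch(combined)
```

A stricter variant could, for instance, reject duplicates whose shifts disagree by more than some tolerance before averaging, which is the kind of extra checking the docstring alludes to.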

static to_frame(data)[source]

Convert a series of dictionaries to a dataframe.
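With pandas, this kind of conversion can be done by handing the list of dictionaries to the DataFrame constructor, as in the sketch below. The molecule labels and shift values are invented, and the real method may differ in how it names columns or handles missing keys.

```python
import pandas as pd

# Hypothetical series: one dict of atom index -> shift per molecule.
spectra = pd.Series([{0: 21.0, 1: 30.0}, {0: 55.5, 2: 128.0}],
                    index=['mol_a', 'mol_b'])

# Each dict becomes a row; keys missing from a dict become NaN.
frame = pd.DataFrame(spectra.tolist(), index=spectra.index)
```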

skchem.data.converters.physprop module

class skchem.data.converters.physprop.PhysPropConverter(directory, output_directory, output_filename='physprop.h5')[source]

Bases: skchem.data.converters.base.Converter

drop_inconsistencies(data)[source]
extract(directory)[source]
static fix_temp(s, mean_range=5)[source]
process_bp(data)[source]
process_logP(data)[source]
process_logS(data)[source]
process_mp(data)[source]
process_sdf(path)[source]
process_targets(data)[source]
process_txt(path)[source]

skchem.data.converters.tox21 module

# skchem.data.converters.tox21

Module defining transformation techniques for tox21.

class skchem.data.converters.tox21.Tox21Converter(directory, output_directory, output_filename='tox21.h5')[source]

Bases: skchem.data.converters.base.Converter

Class to build the Tox21 dataset.

extract(directory)[source]
static fix_assay_name(s)[source]
static fix_id(s)[source]
static patch_test(test)[source]
read_test(test, test_data)[source]
read_train(train)[source]
read_valid(valid)[source]

Module contents

class skchem.data.converters.DiversityConverter(directory, output_directory, output_filename='diversity.h5')[source]

Bases: skchem.data.converters.base.Converter

Example Converter, using the NCI DTP Diversity Set III.

parse_file(path)[source]
synthetic_targets(index)[source]

class skchem.data.converters.BursiAmesConverter(directory, output_directory, output_filename='bursi_ames.h5')[source]

Bases: skchem.data.converters.base.Converter

class skchem.data.converters.MullerAmesConverter(directory, output_directory, output_filename='muller_ames.h5')[source]

Bases: skchem.data.converters.base.Converter

create_split_dict(splits, name)[source]
drop_indices(splits, indices)[source]
parse_splits(f_path)[source]
patch_data(data, patches)[source]

Patch SMILES in a DataFrame with rewritten ones that specify diazo groups in an RDKit-friendly way.

class skchem.data.converters.PhysPropConverter(directory, output_directory, output_filename='physprop.h5')[source]

Bases: skchem.data.converters.base.Converter

drop_inconsistencies(data)[source]
extract(directory)[source]
static fix_temp(s, mean_range=5)[source]
process_bp(data)[source]
process_logP(data)[source]
process_logS(data)[source]
process_mp(data)[source]
process_sdf(path)[source]
process_targets(data)[source]
process_txt(path)[source]

class skchem.data.converters.BradleyOpenMPConverter(directory, output_directory, output_filename='bradley_open_mp.h5')[source]

Bases: skchem.data.converters.base.Converter

static filter_bad(data)[source]
static fix_mp(data)[source]
static parse_data(path)[source]

class skchem.data.converters.NMRShiftDB2Converter(directory, output_directory, output_filename='nmrshiftdb2.h5')[source]

Bases: skchem.data.converters.base.Converter

static combine_duplicates(data)[source]

Collect duplicate spectra into one dictionary. All shifts are collected into lists.

static extract_duplicates(data, kind='13c')[source]

Get all duplicates of the given kind (13C by default).

static get_spectra(data)[source]

Retrieves spectra from raw data.

static log_dists(data)[source]
log_duplicates(data)[source]
static parse_data(filepath)[source]

Reads the raw datafile.

static process_spectra(data)[source]

Turn the string representations found in the SDF file into a dictionary.

static squash_duplicates(data)[source]

Take the mean of all the duplicates. This is where we could do a bit more checking.

static to_frame(data)[source]

Convert a series of dictionaries to a dataframe.

class skchem.data.converters.Tox21Converter(directory, output_directory, output_filename='tox21.h5')[source]

Bases: skchem.data.converters.base.Converter

Class to build the Tox21 dataset.

extract(directory)[source]
static fix_assay_name(s)[source]
static fix_id(s)[source]
static patch_test(test)[source]
read_test(test, test_data)[source]
read_train(train)[source]
read_valid(valid)[source]

class skchem.data.converters.ChEMBLConverter(directory, output_directory, output_filename='chembl.h5')[source]

Bases: skchem.data.converters.base.Converter

Converter for the ChEMBL dataset.

parse_infile(filename)[source]