skchem.cross_validation package

Submodules

skchem.cross_validation.similarity_threshold module

## skchem.cross_validation.similarity_threshold

Similarity threshold dataset partitioning functionality.

class skchem.cross_validation.similarity_threshold.SimThresholdSplit(min_threshold=0.45, largest_cluster_fraction=0.1, fper='morgan', similarity_metric='jaccard', memory_optimized=True, n_jobs=1, block_width=1000, verbose=False)

Bases: object

block_width

The width of the subsets of features.

Note

Only used when the calculation is parallelized.

fit(inp, pairs=None)

Fit the cross validator to the data.

Parameters: inp – The data to fit, one of:

  • pd.Series of Mol instances
  • pd.DataFrame with Mol instances in a structure column
  • pd.DataFrame of fingerprints if fper is None
  • pd.DataFrame of similarity matrix if similarity_metric is None
  • np.array of similarity matrix if similarity_metric is None
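
As a rough illustration of the last two input forms, the sketch below builds a Jaccard similarity matrix from binary fingerprints with SciPy and wraps it in a pd.DataFrame. The fingerprint bits and compound names are invented for the example, and the commented-out fit call is an assumption about usage inferred from the constructor signature, not verified behaviour.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Hypothetical binary fingerprints: 4 compounds x 6 bits (made-up data).
fps = pd.DataFrame(
    [[1, 1, 0, 0, 1, 0],
     [1, 1, 0, 0, 0, 0],
     [0, 0, 1, 1, 0, 1],
     [0, 0, 1, 1, 0, 0]],
    index=["cpd_a", "cpd_b", "cpd_c", "cpd_d"],
    dtype=bool,
)

# pdist returns Jaccard *distances*; similarity is 1 - distance.
dist = squareform(pdist(fps.values, metric="jaccard"))
sim = pd.DataFrame(1.0 - dist, index=fps.index, columns=fps.index)

# Assumed usage with a precomputed matrix (not run here):
# SimThresholdSplit(fper=None, similarity_metric=None).fit(sim)
```

The resulting frame is square and symmetric, with ones on the diagonal, which is the shape a precomputed similarity matrix would be expected to have.
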
k_fold(n_folds)

Returns k-fold cross-validated folds with thresholded similarity.

Parameters: n_folds (int) – The number of folds to provide.
Returns: The splits, as pairs of series.
Return type: generator[(pd.Series, pd.Series)]
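
One way to read "folds with thresholded similarity" is that compounds more similar than min_threshold end up in the same fold. A minimal sketch of that idea follows, assuming single-linkage grouping via connected components. This is a swapped-in technique for illustration, not necessarily what SimThresholdSplit does internally; threshold_k_fold and the toy matrix are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components


def threshold_k_fold(sim, n_folds, min_threshold=0.45):
    """Yield (train, test) boolean masks; similar compounds share a fold.

    Hypothetical helper, not part of skchem.
    """
    # Link every pair whose similarity exceeds the threshold.
    adjacency = csr_matrix(sim.values > min_threshold)
    _, cluster = connected_components(adjacency, directed=False)
    # Assign whole clusters to folds, round-robin by cluster label.
    fold = cluster % n_folds
    for k in range(n_folds):
        test = pd.Series(fold == k, index=sim.index)
        yield ~test, test


# Toy similarity matrix: (a, b) and (c, d) are near neighbours.
names = ["a", "b", "c", "d"]
sim = pd.DataFrame(
    [[1.0, 0.8, 0.1, 0.0],
     [0.8, 1.0, 0.2, 0.1],
     [0.1, 0.2, 1.0, 0.7],
     [0.0, 0.1, 0.7, 1.0]],
    index=names, columns=names,
)
folds = list(threshold_k_fold(sim, n_folds=2))
```

Each element of folds is a (train, test) pair of boolean pd.Series, and no pair of compounds above the threshold is separated between train and test.
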
n_instances_

The number of instances that were used to fit the object.

n_jobs

The number of processes to use to calculate the distance matrix.

Use -1 for all available processes.

split(ratio)

Return splits of the data with thresholded similarity according to a specified ratio.

Parameters: ratio (tuple[int]) – The ratio to use.
Returns: Generator of boolean split masks for the requested splits.
Return type: generator[pd.Series]

Example

>>> import skchem
>>> from skchem.cross_validation import SimThresholdSplit
>>> ms = skchem.data.Diversity.read_frame('structure')
>>> st = SimThresholdSplit(fper='morgan',
...                        similarity_metric='jaccard')
>>> st.fit(ms)
>>> train, valid, test = st.split(ratio=(70, 15, 15))
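
To illustrate how a ratio such as (70, 15, 15) might translate into boolean masks, here is a hedged sketch that assigns precomputed clusters greedily to whichever split is furthest below its target size. The helper ratio_masks and the cluster labels are hypothetical, and the greedy rule is an assumption made for the example rather than the library's documented algorithm.

```python
import numpy as np
import pandas as pd


def ratio_masks(cluster, ratio):
    """Return one boolean mask per ratio entry, keeping clusters intact.

    Hypothetical helper, not part of skchem.
    """
    # Convert the ratio into target instance counts per split.
    targets = np.asarray(ratio, dtype=float) / sum(ratio) * len(cluster)
    sizes = cluster.value_counts()  # sorted, largest clusters first
    filled = np.zeros(len(ratio))
    assignment = {}
    for label, size in sizes.items():
        # Put the cluster in the split with the largest remaining deficit.
        split = int(np.argmax(targets - filled))
        assignment[label] = split
        filled[split] += size
    split_of = cluster.map(assignment)
    return [split_of == i for i in range(len(ratio))]


# Made-up cluster labels for 20 compounds.
cluster = pd.Series([0] * 8 + [1] * 6 + [2] * 3 + [3] * 2 + [4] * 1)
train, valid, test = ratio_masks(cluster, ratio=(70, 15, 15))
```

Masks of this kind index the original data directly, for instance ms[train] in the example above.
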
visualize_similarities(subsample=5000, ax=None)

Plot a histogram of similarities, with the threshold plotted.

Parameters:
  • subsample (int) – For a large dataset, subsample the number of compounds to consider.
  • ax (matplotlib.axes.Axes) – Axes to make the plot on.
Returns: matplotlib.axes.Axes
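
For orientation, a plot of the kind described can be approximated with plain matplotlib. The sketch below is a stand-in, not the method's implementation: the similarity values are randomly generated, and the threshold simply reuses the constructor default of 0.45.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Stand-in pairwise similarities in [0, 1]; real values would come
# from fingerprints of the fitted compounds.
sims = rng.beta(2, 5, size=5000)
min_threshold = 0.45

fig, ax = plt.subplots()
ax.hist(sims, bins=50)
# Mark the similarity threshold as a vertical line.
ax.axvline(min_threshold, color="red", linestyle="--", label="threshold")
ax.set_xlabel("similarity")
ax.set_ylabel("number of pairs")
ax.legend()
```
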

visualize_space(dim_reducer='tsne', dim_red_kw=None, subsample=5000, ax=None, plt_kw=None)

Plot chemical space using a transformer.

Parameters:
  • dim_reducer (str or sklearn object) – Technique to use to reduce fingerprint space.
  • dim_red_kw (dict) – Keyword args to pass to dim_reducer.
  • subsample (int) – For a large dataset, subsample the number of compounds to consider.
  • ax (matplotlib.axes.Axes) – Axes to make the plot on.
  • plt_kw (dict) – Keyword args to pass to the plot.
Returns: matplotlib.axes.Axes
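
The reduction step can be pictured with a scikit-learn transformer applied to a fingerprint matrix. This sketch uses PCA on made-up fingerprints; whether a PCA instance is an acceptable dim_reducer value is an assumption based on the "sklearn object" wording above, and the keyword dict only mirrors what dim_red_kw is described as carrying.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Made-up binary fingerprints: 100 compounds x 64 bits.
fps = rng.integers(0, 2, size=(100, 64))

# Keyword arguments for the reducer, analogous to dim_red_kw.
dim_red_kw = {"n_components": 2}

# Project the fingerprints down to two plotting coordinates per compound.
coords = PCA(**dim_red_kw).fit_transform(fps)
```

Each row of coords is a 2D position that could then be drawn as a scatter plot.
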

skchem.cross_validation.similarity_threshold.returns_pairs(func)

Wraps a function that returns a ((i, j), sim) list to return a dataframe.
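
A decorator with this behaviour could look roughly like the following. It is an illustrative reimplementation under stated assumptions, not the library's code: the name returns_pairs_sketch and the column names i, j and sim are invented for the example.

```python
import functools

import pandas as pd


def returns_pairs_sketch(func):
    """Convert a [((i, j), sim), ...] result into a DataFrame (illustrative)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        pairs = func(*args, **kwargs)
        # Column names are an assumption made for this sketch.
        return pd.DataFrame(
            [(i, j, sim) for (i, j), sim in pairs],
            columns=["i", "j", "sim"],
        )
    return wrapper


@returns_pairs_sketch
def similar_pairs():
    # Made-up output in the ((i, j), sim) shape described above.
    return [((0, 1), 0.8), ((2, 3), 0.7)]


df = similar_pairs()
```
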

Module contents

## skchem.cross_validation

Module implementing cross validation routines useful for chemical data.

class skchem.cross_validation.SimThresholdSplit(min_threshold=0.45, largest_cluster_fraction=0.1, fper='morgan', similarity_metric='jaccard', memory_optimized=True, n_jobs=1, block_width=1000, verbose=False)

Bases: object

block_width

The width of the subsets of features.

Note

Only used when the calculation is parallelized.

fit(inp, pairs=None)

Fit the cross validator to the data.

Parameters: inp – The data to fit, one of:

  • pd.Series of Mol instances
  • pd.DataFrame with Mol instances in a structure column
  • pd.DataFrame of fingerprints if fper is None
  • pd.DataFrame of similarity matrix if similarity_metric is None
  • np.array of similarity matrix if similarity_metric is None
k_fold(n_folds)

Returns k-fold cross-validated folds with thresholded similarity.

Parameters: n_folds (int) – The number of folds to provide.
Returns: The splits, as pairs of series.
Return type: generator[(pd.Series, pd.Series)]
n_instances_

The number of instances that were used to fit the object.

n_jobs

The number of processes to use to calculate the distance matrix.

Use -1 for all available processes.

split(ratio)

Return splits of the data with thresholded similarity according to a specified ratio.

Parameters: ratio (tuple[int]) – The ratio to use.
Returns: Generator of boolean split masks for the requested splits.
Return type: generator[pd.Series]

Example

>>> import skchem
>>> from skchem.cross_validation import SimThresholdSplit
>>> ms = skchem.data.Diversity.read_frame('structure')
>>> st = SimThresholdSplit(fper='morgan',
...                        similarity_metric='jaccard')
>>> st.fit(ms)
>>> train, valid, test = st.split(ratio=(70, 15, 15))
visualize_similarities(subsample=5000, ax=None)

Plot a histogram of similarities, with the threshold plotted.

Parameters:
  • subsample (int) – For a large dataset, subsample the number of compounds to consider.
  • ax (matplotlib.axes.Axes) – Axes to make the plot on.
Returns: matplotlib.axes.Axes

visualize_space(dim_reducer='tsne', dim_red_kw=None, subsample=5000, ax=None, plt_kw=None)

Plot chemical space using a transformer.

Parameters:
  • dim_reducer (str or sklearn object) – Technique to use to reduce fingerprint space.
  • dim_red_kw (dict) – Keyword args to pass to dim_reducer.
  • subsample (int) – For a large dataset, subsample the number of compounds to consider.
  • ax (matplotlib.axes.Axes) – Axes to make the plot on.
  • plt_kw (dict) – Keyword args to pass to the plot.
Returns: matplotlib.axes.Axes