## skchem.cross_validation.similarity_threshold
Similarity threshold dataset partitioning functionality.
skchem.cross_validation.similarity_threshold.SimThresholdSplit(min_threshold=0.45, largest_cluster_fraction=0.1, fper='morgan', similarity_metric='jaccard', memory_optimized=True, n_jobs=1, block_width=1000, verbose=False)
Bases: object
block_width
The width of the subsets of features.
Note: Only used when the computation is parallelized.
fit(inp, pairs=None)
Fit the cross validator to the data. The inp parameter accepts one of:
- pd.Series of Mol instances
- pd.DataFrame with Mol instances in a structure column
- pd.DataFrame of fingerprints if fper is None
- pd.DataFrame of a similarity matrix if similarity_metric is None
- np.array of a similarity matrix if similarity_metric is None
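To illustrate what fitting on fingerprints with a similarity threshold could involve, here is a minimal standard-library sketch. It is not skchem's implementation: it assumes fingerprints are represented as sets of 'on' bit indices and that any pair with Jaccard similarity at or above the threshold is linked into one cluster (single linkage). The helper names `jaccard_similarity` and `threshold_clusters` are hypothetical.

```python
from itertools import combinations


def jaccard_similarity(a, b):
    """Jaccard similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def threshold_clusters(fingerprints, threshold):
    """Link every pair with similarity >= threshold into one cluster.

    Uses union-find (single linkage); returns one cluster label per item.
    This is an illustrative assumption, not skchem's algorithm.
    """
    parent = list(range(len(fingerprints)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(fingerprints)), 2):
        if jaccard_similarity(fingerprints[i], fingerprints[j]) >= threshold:
            parent[find(i)] = find(j)

    return [find(i) for i in range(len(fingerprints))]


# Hypothetical toy fingerprints, written as sets of 'on' bit indices.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12}, {10, 11, 13}, {20, 21}]
labels = threshold_clusters(fps, threshold=0.45)
```

Items that share a label are too similar to be placed in different splits under this reading of the threshold.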
k_fold(n_folds)
Returns k-fold cross-validated folds with thresholded similarity.
| Parameters: | n_folds (int) – The number of folds to provide. |
|---|---|
| Returns: | The splits in series. |
| Return type: | generator[(pd.Series, pd.Series)] |
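As a rough illustration of k-fold splitting that respects similarity clusters, the sketch below assigns whole clusters to folds and yields a pair of boolean masks per fold. It assumes cluster labels are already available and uses plain lists rather than pd.Series; the function name `cluster_k_fold` and the greedy placement rule are assumptions for illustration, not the library's method.

```python
from collections import Counter


def cluster_k_fold(labels, n_folds):
    """Yield (train_mask, test_mask) pairs, keeping each cluster in one fold."""
    fold_sizes = [0] * n_folds
    fold_of = {}
    # Place the largest clusters first, each into the currently smallest fold.
    for cluster, size in Counter(labels).most_common():
        target = fold_sizes.index(min(fold_sizes))
        fold_of[cluster] = target
        fold_sizes[target] += size
    for fold in range(n_folds):
        test = [fold_of[label] == fold for label in labels]
        train = [not t for t in test]
        yield train, test


# Hypothetical cluster labels, one per instance.
labels = ["a", "a", "b", "b", "c", "d"]
folds = list(cluster_k_fold(labels, n_folds=2))
```

Each instance appears in the test mask of exactly one fold, and members of the same cluster are never separated between train and test.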
n_instances_
The number of instances that were used to fit the object.
n_jobs
The number of processes to use to calculate the distance matrix. Use -1 for all available.
split(ratio)
Return splits of the data with thresholded similarity according to a specified ratio.
| Parameters: | ratio (tuple[int]) – The ratio to use. |
|---|---|
| Returns: | Generator of boolean split masks for the requested splits. |
| Return type: | generator[pd.Series] |
Example
>>> import skchem
>>> from skchem.cross_validation import SimThresholdSplit
>>> ms = skchem.data.Diversity.read_frame('structure')
>>> st = SimThresholdSplit(fper='morgan',
...                        similarity_metric='jaccard')
>>> st.fit(ms)
>>> train, valid, test = st.split(ratio=(70, 15, 15))
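The idea of splitting by ratio while keeping similar instances together can be sketched without the library. The example below assumes cluster labels are already known and greedily assigns each cluster to the split that is furthest below its target size. `cluster_ratio_split` is a hypothetical helper, and the greedy rule is an assumption rather than the method SimThresholdSplit uses.

```python
from collections import Counter


def cluster_ratio_split(labels, ratio):
    """Return one boolean mask per ratio entry, keeping clusters intact."""
    total = len(labels)
    targets = [total * r / sum(ratio) for r in ratio]
    filled = [0] * len(ratio)
    split_of = {}
    # Largest clusters first, each into the split furthest below its target.
    for cluster, size in Counter(labels).most_common():
        deficits = [t - f for t, f in zip(targets, filled)]
        chosen = deficits.index(max(deficits))
        split_of[cluster] = chosen
        filled[chosen] += size
    return [[split_of[label] == s for label in labels]
            for s in range(len(ratio))]


# Hypothetical cluster labels, one per instance.
labels = [0, 0, 0, 1, 1, 2, 2, 3, 4, 5]
train, valid, test = cluster_ratio_split(labels, ratio=(70, 15, 15))
```

Because clusters are indivisible, the resulting split sizes only approximate the requested ratio, especially on small datasets or when one cluster is large.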
visualize_similarities(subsample=5000, ax=None)
Plot a histogram of similarities, with the threshold plotted.
| Returns: | matplotlib.axes |
|---|---|
visualize_space(dim_reducer='tsne', dim_red_kw=None, subsample=5000, ax=None, plt_kw=None)
Plot chemical space using a transformer.
| Returns: | matplotlib.axes |
|---|---|
## skchem.cross_validation
Module implementing cross validation routines useful for chemical data.
skchem.cross_validation.SimThresholdSplit(min_threshold=0.45, largest_cluster_fraction=0.1, fper='morgan', similarity_metric='jaccard', memory_optimized=True, n_jobs=1, block_width=1000, verbose=False)
Bases: object