tdc.utils#
tdc.utils.label module#
Utility functions for transforming labels
- tdc.utils.label.NegSample(df, column_names, frac, two_types)[source]#
Negative Sampling for Binary Interaction Dataset
- Parameters:
df (pandas.DataFrame) – input dataset dataframe
column_names (list) – column names in the order of [id1, x1, id2, x2]
frac (float) – the ratio of negative samples compared to positive samples
two_types (bool) – whether the two entity types are different (e.g. drug-target) or the same (e.g. drug-drug)
- Returns:
a new dataframe with negative samples (Y = 0)
- Return type:
pandas.DataFrame
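As a rough illustration of negative sampling for a binary interaction dataset, the sketch below draws (id1, id2) pairs that never occur among the positives until it has roughly frac times as many negatives as positives. It works on plain tuples rather than a pandas.DataFrame, it enumerates every candidate pair (fine for small examples only), and the name `sample_negatives` and its `seed` argument are hypothetical. It is not the library's implementation.

```python
import random

def sample_negatives(positive_pairs, frac, two_types=True, seed=0):
    """Hypothetical helper: sample unobserved pairs and label them Y = 0."""
    rng = random.Random(seed)
    positives = set(positive_pairs)
    if two_types:
        # Different entity types (e.g. drug-target): keep the two ID pools apart.
        pool1 = sorted({a for a, _ in positives})
        pool2 = sorted({b for _, b in positives})
    else:
        # Single entity type (e.g. drug-drug): both sides share one ID pool.
        pool1 = pool2 = sorted({x for pair in positives for x in pair})
    candidates = []
    for a in pool1:
        for b in pool2:
            if (a, b) in positives:
                continue
            if not two_types and (a == b or (b, a) in positives):
                continue  # skip self-pairs and reversed positives
            candidates.append((a, b))
    # Never ask for more negatives than there are unobserved pairs.
    n_target = min(int(len(positives) * frac), len(candidates))
    return [(a, b, 0) for a, b in rng.sample(candidates, n_target)]

positives = [("d1", "t1"), ("d2", "t2"), ("d3", "t1")]
negatives = sample_negatives(positives, frac=1.0)
```

Each returned triple is (id1, id2, 0), so it can be concatenated with the positive rows before training.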
- tdc.utils.label.binarize(y, threshold, order='ascending')[source]#
binarization of a label list given a pre-specified threshold
- Parameters:
- Returns:
an array of transformed labels
- Return type:
np.array
- Raises:
AttributeError – select the correct order “ascending/descending”
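The thresholding step can be sketched as follows. The direction of each order is inferred from the label_transform description on this page ("descending" marks values below the threshold as 1). The strict comparisons, the plain-list return value, and the name `binarize_sketch` are assumptions made for illustration.

```python
def binarize_sketch(y, threshold, order="ascending"):
    """Hypothetical re-implementation for illustration only."""
    if order == "ascending":
        # Larger values are treated as the positive class.
        return [1 if value > threshold else 0 for value in y]
    if order == "descending":
        # Smaller values are treated as the positive class.
        return [1 if value < threshold else 0 for value in y]
    raise AttributeError("select the correct order: 'ascending' or 'descending'")

labels_desc = binarize_sketch([2.0, 7.5, 30.0], threshold=10, order="descending")
labels_asc = binarize_sketch([2.0, 7.5, 30.0], threshold=10)
```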
- tdc.utils.label.convert_back_log(y)[source]#
conversion from log-scale helper
- Parameters:
y (list) – a list of labels in log-scale
- Returns:
an array of nM->p labels
- Return type:
np.array
- tdc.utils.label.convert_to_log(y)[source]#
log conversion helper
- Parameters:
y (list) – a list of labels
- Returns:
an array of log-transformed labels
- Return type:
np.array
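This page does not state the exact formula behind the two log helpers. The "nM->p" hint suggests the usual p-scale conversion, p = -log10(concentration in molar), so the sketch below assumes that form. The function names are hypothetical, and the library may apply additional offsets (for example to avoid taking the log of zero) that are not reproduced here.

```python
import math

def to_p_scale(values_nm):
    """Assumed nM -> p conversion: p = -log10(value expressed in molar)."""
    return [-math.log10(v * 1e-9) for v in values_nm]

def from_p_scale(values_p):
    """Inverse of to_p_scale: p -> nM."""
    return [10 ** (-p) / 1e-9 for p in values_p]

p_values = to_p_scale([1.0, 1000.0])   # 1 nM and 1 uM
round_trip = from_p_scale(p_values)    # back to nM
```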
- tdc.utils.label.label_dist(y, name=None)[source]#
plot the distribution of labels
- Parameters:
y (list) – a list of labels
name (None, optional) – dataset name
- tdc.utils.label.label_transform(y, binary, threshold, convert_to_log, verbose=True, order='descending')[source]#
label transformation helper function
- Parameters:
y (list) – a list of labels
binary (bool) – whether or not to conduct binarization
threshold (float) – the threshold for binarization
convert_to_log (bool) – whether to convert to log scale for continuous values such as Kd
verbose (bool, optional) – whether or not to print intermediate processing statements
order (str, optional) – if descending, the label is 1 for values less than the threshold, and vice versa; defaults to ‘descending’
- Returns:
an array of transformed labels
- Return type:
np.array
- Raises:
ValueError – specify the correct order from ‘descending’/‘ascending’
tdc.utils.label_name_list module#
list of dataset names
tdc.utils.load module#
wrappers for downloading various datasets
- tdc.utils.load.atom_to_one_hot(atom, allowed_atom_list)[source]#
a helper to convert atom to one-hot encoding
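A minimal sketch of one-hot encoding against an allowed atom list is shown below. The name `one_hot` is hypothetical, and returning an all-zero vector for an atom outside the list is an assumption; this page does not say how the library treats unknown atoms.

```python
def one_hot(atom, allowed_atom_list):
    """Hypothetical sketch: 1 at the position of `atom`, 0 elsewhere."""
    return [1 if atom == allowed else 0 for allowed in allowed_atom_list]

allowed = ["C", "N", "O", "S"]
encoding = one_hot("N", allowed)
unknown = one_hot("X", allowed)  # atom not in the list -> all zeros (assumption)
```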
- tdc.utils.load.bi_distribution_dataset_load(name, path, dataset_names, return_pocket=False, threshold=15, remove_protein_Hs=True, remove_ligand_Hs=True, keep_het=False)[source]#
a wrapper to download, process and load protein-ligand conditional generation task datasets. assume the downloaded file is already processed
- tdc.utils.load.bm_download_wrapper(name, path)[source]#
wrapper for downloading a benchmark group given the name and path
- tdc.utils.load.bm_group_load(name, path)[source]#
a wrapper to download, process and load benchmark group
- tdc.utils.load.dataverse_download(url, path, name, types, id=None)[source]#
dataverse download helper with progress bar
- tdc.utils.load.distribution_dataset_load(name, path, dataset_names, column_name)[source]#
a wrapper to download, process and load molecule distribution learning task datasets. assume the downloaded file is already processed
- Parameters:
- Returns:
the input list of molecules representation
- Return type:
pandas.Series
- tdc.utils.load.download_wrapper(name, path, dataset_names)[source]#
wrapper for downloading a dataset given the name and path, for csv, pkl, and tsv files
- tdc.utils.load.extract_atom_from_mol(rdmol, remove_Hs)[source]#
a helper to extract molecule atom information
- Parameters:
rdmol (rdkit.rdmol) – rdkit molecule
remove_Hs (bool) – whether to remove H atoms from ligands or not
- Returns:
coord (numpy.array) – atom coordinates
atom_type (numpy.array) – atom types
- Return type:
numpy.array (both values)
- tdc.utils.load.extract_atom_from_protein(data_frame, data_frame_het, remove_Hs, keep_het)[source]#
a helper to extract protein atom information
- Parameters:
- Returns:
coord (numpy.array) – atom coordinates
atom_type (numpy.array) – atom types
- Return type:
numpy.array (both values)
- tdc.utils.load.general_load(name, path, sep)[source]#
a wrapper to download, process and load any pandas dataframe files
- tdc.utils.load.generation_dataset_load(name, path, dataset_names)[source]#
a wrapper to download, process and load generation task datasets. assume the downloaded file is already processed
- tdc.utils.load.generation_paired_dataset_load(name, path, dataset_names, input_name, output_name)[source]#
a wrapper to download, process and load generation-paired task datasets
- Parameters:
- Returns:
two series (entity 1 representation, label)
- Return type:
pandas.Series
- tdc.utils.load.interaction_dataset_load(name, path, target, dataset_names, aux_column)[source]#
a wrapper to download, process and load two-instance prediction task datasets
- Parameters:
- Returns:
five series (entity 1 representation, entity 2 representation, entity id 1, entity id 2, label)
- Return type:
pandas.Series
- tdc.utils.load.multi_dataset_load(name, path, dataset_names)[source]#
a wrapper to download, process and load multiple(>2)-instance prediction task datasets. assume the downloaded file is already processed
- tdc.utils.load.oracle_download_wrapper(name, path, oracle_names)[source]#
wrapper for downloading an oracle model checkpoint given the name and path
- tdc.utils.load.oracle_load(name, path='./oracle', oracle_names=['drd2', 'gsk3b', 'jnk3', 'fpscores', 'cyp3a4_veith', 'drd2_current', 'gsk3b_current', 'jnk3_current', 'qed', 'logp', 'sa', 'rediscovery', 'similarity', 'median', 'isomers', 'mpo', 'hop', 'celecoxib_rediscovery', 'troglitazone_rediscovery', 'thiothixene_rediscovery', 'aripiprazole_similarity', 'albuterol_similarity', 'mestranol_similarity', 'isomers_c7h8n2o2', 'isomers_c9h10n2o2pf2cl', 'isomers_c11h24', 'osimertinib_mpo', 'fexofenadine_mpo', 'ranolazine_mpo', 'perindopril_mpo', 'amlodipine_mpo', 'sitagliptin_mpo', 'zaleplon_mpo', 'sitagliptin_mpo_prev', 'zaleplon_mpo_prev', 'median1', 'median2', 'valsartan_smarts', 'deco_hop', 'scaffold_hop', 'novelty', 'diversity', 'uniqueness', 'validity', 'fcd_distance', 'kl_divergence', 'askcos', 'ibm_rxn', 'isomer_meta', 'rediscovery_meta', 'similarity_meta', 'median_meta', 'docking_score', 'molecule_one_synthesis', 'pyscreener', 'rmsd', 'kabsch_rmsd', 'smina', '1iep_docking', '2rgp_docking', '3eml_docking', '3ny8_docking', '4rlu_docking', '4unn_docking', '5mo4_docking', '7l11_docking', 'drd3_docking', '3pbl_docking', '1iep_docking_normalize', '2rgp_docking_normalize', '3eml_docking_normalize', '3ny8_docking_normalize', '4rlu_docking_normalize', '4unn_docking_normalize', '5mo4_docking_normalize', '7l11_docking_normalize', 'drd3_docking_normalize', '3pbl_docking_normalize', '1iep_docking_vina', '2rgp_docking_vina', '3eml_docking_vina', '3ny8_docking_vina', '4rlu_docking_vina', '4unn_docking_vina', '5mo4_docking_vina', '7l11_docking_vina', 'drd3_docking_vina', '3pbl_docking_vina'])[source]#
a wrapper to download, process and load oracles.
- tdc.utils.load.pd_load(name, path)[source]#
load a pandas dataframe from local file.
- Parameters:
- Returns:
loaded dataset in dataframe
- Return type:
pandas.DataFrame
- Raises:
ValueError – the file format is not supported; currently only tab/csv/pkl/zip are supported
- tdc.utils.load.process_crossdock(path, name='crossdock', return_pocket=False, threshold=15, remove_protein_Hs=True, remove_ligand_Hs=True, keep_het=False)[source]#
a processor to process crossdock dataset
- Parameters:
name (str) – the name of the dataset
path (str) – the path to save the data file
print_stats (bool) – whether to print the basic statistics of the dataset
return_pocket (bool) – whether to return only the protein pocket or the full protein
threshold (int) – only used when return_pocket is True; if pockets are not provided in the raw data, the threshold is used as the radius of a sphere around the ligand center that defines the protein pocket
remove_protein_Hs (bool) – whether to remove H atoms from proteins or not
remove_ligand_Hs (bool) – whether to remove H atoms from ligands or not
keep_het (bool) – whether to keep het atoms (e.g. cofactors) in protein
- Returns:
protein (dict) – a dict of protein features
ligand (dict) – a dict of ligand features
- Return type:
dict
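The threshold-based pocket definition described for return_pocket can be sketched as a distance filter around the ligand center. The helper name `select_pocket`, the use of plain coordinate tuples, and taking the center as the unweighted mean of the ligand atom coordinates are all assumptions made for illustration, not the library's code.

```python
import math

def select_pocket(protein_coords, ligand_coords, threshold=15):
    """Hypothetical helper: keep protein atoms within `threshold` of the ligand center."""
    n = len(ligand_coords)
    # Ligand center taken as the unweighted mean of its atom coordinates.
    center = tuple(sum(atom[i] for atom in ligand_coords) / n for i in range(3))
    return [atom for atom in protein_coords if math.dist(atom, center) <= threshold]

ligand = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
protein = [(1.0, 3.0, 0.0), (20.0, 0.0, 0.0), (1.0, 0.0, 14.0)]
pocket = select_pocket(protein, ligand, threshold=15)
```

In this toy input the ligand center is (1, 0, 0), so the atom at distance 19 is dropped and the two atoms at distances 3 and 14 are kept.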
- tdc.utils.load.process_dude(path, name='dude', return_pocket=False, threshold=15, remove_protein_Hs=True, remove_ligand_Hs=True, keep_het=False)[source]#
a processor to process DUD-E dataset
- Parameters:
name (str) – the name of the dataset
path (str) – the path to save the data file
print_stats (bool) – whether to print the basic statistics of the dataset
return_pocket (bool) – whether to return only the protein pocket or the full protein
threshold (int) – only used when return_pocket is True; if pockets are not provided in the raw data, the threshold is used as the radius of a sphere around the ligand center that defines the protein pocket
remove_protein_Hs (bool) – whether to remove H atoms from proteins or not
remove_ligand_Hs (bool) – whether to remove H atoms from ligands or not
keep_het (bool) – whether to keep het atoms (e.g. cofactors) in protein
- Returns:
protein (dict) – a dict of protein features
ligand (dict) – a dict of ligand features
- Return type:
dict
- tdc.utils.load.process_pdbbind(path, name='pdbbind', return_pocket=False, threshold=15, remove_protein_Hs=True, remove_ligand_Hs=True, keep_het=False)[source]#
a processor to process pdbbind dataset
- Parameters:
name (str) – the name of the dataset
path (str) – the path to save the data file
print_stats (bool) – whether to print the basic statistics of the dataset
return_pocket (bool) – whether to return only the protein pocket or the full protein
threshold (int) – only used when return_pocket is True; if pockets are not provided in the raw data, the threshold is used as the radius of a sphere around the ligand center that defines the protein pocket
remove_protein_Hs (bool) – whether to remove H atoms from proteins or not
remove_ligand_Hs (bool) – whether to remove H atoms from ligands or not
keep_het (bool) – whether to keep het atoms (e.g. cofactors) in protein
- Returns:
protein (dict) – a dict of protein features
ligand (dict) – a dict of ligand features
- Return type:
dict
- tdc.utils.load.process_scpdb(path, name='scPDB', return_pocket=False, threshold=15, remove_protein_Hs=True, remove_ligand_Hs=True, keep_het=False)[source]#
a processor to process scpdb dataset
- Parameters:
name (str) – the name of the dataset
path (str) – the path to save the data file
print_stats (bool) – whether to print the basic statistics of the dataset
return_pocket (bool) – whether to return only the protein pocket or the full protein
threshold (int) – only used when return_pocket is True; if pockets are not provided in the raw data, the threshold is used as the radius of a sphere around the ligand center that defines the protein pocket
remove_protein_Hs (bool) – whether to remove H atoms from proteins or not
remove_ligand_Hs (bool) – whether to remove H atoms from ligands or not
keep_het (bool) – whether to keep het atoms (e.g. cofactors) in protein
- Returns:
protein (dict) – a dict of protein features
ligand (dict) – a dict of ligand features
- Return type:
dict
- tdc.utils.load.property_dataset_load(name, path, target, dataset_names)[source]#
a wrapper to download, process and load single-instance prediction task datasets
- Parameters:
- Returns:
three series (entity representation, label, entity id)
- Return type:
pandas.Series
- tdc.utils.load.receptor_download_wrapper(name, path)[source]#
wrapper for downloading a receptor PDB file given the name and path
- tdc.utils.load.receptor_load(name, path='./oracle')[source]#
a wrapper to download, process and load pdb file.
- tdc.utils.load.three_dim_dataset_load(name, path, dataset_names)[source]#
a wrapper to download, process and load 3d molecule task datasets
tdc.utils.misc module#
miscellaneous utility functions
- tdc.utils.misc.fuzzy_search(name, dataset_names)[source]#
fuzzy matching between the real dataset name and the input name
- Parameters:
- Returns:
the real dataset name
- Return type:
str
- Raises:
ValueError – the wrong task name, no name is matched
- tdc.utils.misc.get_closet_match(predefined_tokens, test_token, threshold=0.8)[source]#
Get the closest match by Levenshtein Distance.
- Parameters:
- Returns:
the predefined token with the highest matching probability, together with that probability (float)
- Return type:
- Raises:
ValueError – no name is matched
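The idea of picking the best-scoring predefined token and rejecting matches below a threshold can be sketched with the standard library. Note the swap: this sketch scores candidates with `difflib.SequenceMatcher.ratio()` rather than Levenshtein distance, so its scores will differ from the library's. The name `closest_match`, the lowercasing, and the example dataset names are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def closest_match(predefined_tokens, test_token, threshold=0.8):
    """Hypothetical sketch: return (best token, similarity) or raise ValueError."""
    scored = [
        (SequenceMatcher(None, token.lower(), test_token.lower()).ratio(), token)
        for token in predefined_tokens
    ]
    # Highest similarity wins; ties fall back to string order of the token.
    score, token = max(scored)
    if score < threshold:
        raise ValueError(f"no name is matched for {test_token!r}")
    return token, score

names = ["caco2_wang", "herg", "bbb_martins"]
best, similarity = closest_match(names, "caco2_wan")
```

A misspelled query still resolves to the intended name as long as its similarity clears the threshold; anything below it raises ValueError, mirroring the documented behaviour.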
- tdc.utils.misc.install(package)[source]#
install pip package
- Parameters:
package (str) – package name
tdc.utils.query module#
Utility functions for queries
- tdc.utils.query.request(identifier, namespace='cid', domain='compound', operation=None, output='JSON', searchtype=None)[source]#
Construct an API request from the parameters and return the response.
Copied from https://github.com/mcs07/PubChemPy/blob/e3c4f4a9b6120433e5cc3383464c7a79e9b2b86e/pubchempy.py#L238
Full specification: http://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST.html
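To show how the parameters map onto a PUG REST path, the sketch below assembles a URL of the form domain/namespace/identifier/operation/output without sending it. The helper name `build_request_url` is hypothetical and the sketch is simplified: it ignores searchtype and does not reproduce how the original function transmits identifiers or handles errors.

```python
from urllib.parse import quote

API_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"

def build_request_url(identifier, namespace="cid", domain="compound",
                      operation=None, output="JSON"):
    """Hypothetical helper: assemble a PUG REST URL without sending it."""
    parts = [API_BASE, domain, namespace, quote(str(identifier))]
    if operation:
        # Optional operation segment, e.g. a property lookup.
        parts.append(operation)
    parts.append(output)
    return "/".join(parts)

url = build_request_url(2244, operation="property/MolecularFormula")
by_name = build_request_url("aspirin", namespace="name")
```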
tdc.utils.retrieve module#
Utility functions for dataset/metadata retrieval
- tdc.utils.retrieve.get_label_map(name, path='./data', target=None, file_format='csv', output_format='dict', task='DDI', name_column='Map')[source]#
to retrieve the biomedical meaning of labels
- Parameters:
name (str) – the name of the dataset
path (str, optional) – the dataset path, where the data is located
target (None, optional) – the label name
file_format (str, optional) – format of the file
output_format (str, optional) – return a dictionary or a dataframe or the raw array of mapped labels
task (str, optional) – the name of the task
name_column (str, optional) – the name of the column that stores the label name
- Returns:
the label map as a dictionary, dataframe, or array, depending on whether output_format is dict/df/array
- Return type:
dict/pd.DataFrame/np.array
- Raises:
ValueError – output_format not supported.
- tdc.utils.retrieve.get_reaction_type(name, path='./data', output_format='array')[source]#
to retrieve the type of reactions for reaction dataset
- Parameters:
- Returns:
the reaction types as a dataframe or array, depending on whether output_format is df/array
- Return type:
pd.DataFrame/np.array
- Raises:
ValueError – the output format is not supported
- tdc.utils.retrieve.retrieve_all_benchmarks()[source]#
to get all available benchmark groups
- Returns:
a list of benchmark group names
- Return type:
list
- tdc.utils.retrieve.retrieve_benchmark_names(name)[source]#
to get all available benchmarks given a query benchmark group
tdc.utils.split module#
Utility functions for splitting datasets
- tdc.utils.split.create_combination_generation_split(dict1, dict2, seed, frac)[source]#
create random split
- tdc.utils.split.create_combination_split(df, seed, frac)[source]#
Function for splitting drug combination dataset such that no combinations are shared across the split
- tdc.utils.split.create_fold_setting_cold(df, fold_seed, frac, entities)[source]#
create a cold split: given one or multiple columns, it first splits the entities in those columns and then maps all associated data points to the corresponding partition
- Parameters:
- Returns:
a dictionary of split dataframes, where keys are train/valid/test and values are the corresponding dataframes
- Return type:
dict
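The cold-split idea (split the entities first, then route every row to its entity's partition) can be sketched in plain Python. The sketch handles a single entity column and lists of dicts instead of a dataframe; the name `cold_split`, the `frac` ordering (train, valid, test), and the column name `Drug_ID` are assumptions made for illustration.

```python
import random

def cold_split(rows, entity_key, frac=(0.7, 0.1, 0.2), seed=0):
    """Hypothetical sketch: no entity appears in more than one partition."""
    # 1) Split the unique entities, not the rows.
    entities = sorted({row[entity_key] for row in rows})
    random.Random(seed).shuffle(entities)
    n_train = int(len(entities) * frac[0])
    n_valid = int(len(entities) * frac[1])
    partition_of = {}
    for i, entity in enumerate(entities):
        if i < n_train:
            partition_of[entity] = "train"
        elif i < n_train + n_valid:
            partition_of[entity] = "valid"
        else:
            partition_of[entity] = "test"
    # 2) Map every row to the partition of its entity.
    split = {"train": [], "valid": [], "test": []}
    for row in rows:
        split[partition_of[row[entity_key]]].append(row)
    return split

rows = [{"Drug_ID": f"d{i}", "Y": j} for i in range(10) for j in range(2)]
split = cold_split(rows, "Drug_ID")
```

Because the fractions apply to entities rather than rows, the row counts per partition only approximate `frac` when entities have different numbers of data points.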
- tdc.utils.split.create_fold_time(df, frac, date_column)[source]#
create splits based on time
- Parameters:
- Returns:
a dictionary of split dataframes, where keys are train/valid/test and values are the corresponding dataframes
- Return type:
dict
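A time-based split can be sketched by sorting on the date column and cutting the ordered rows by fraction, so that training data precedes validation data, which precedes test data. Cutting by row counts, the name `time_split`, and the use of ISO-formatted date strings (which sort chronologically as text) are assumptions; the library may define its cutoffs differently.

```python
def time_split(rows, date_key, frac=(0.7, 0.1, 0.2)):
    """Hypothetical sketch: oldest rows go to train, newest to test."""
    ordered = sorted(rows, key=lambda row: row[date_key])
    n_train = int(len(ordered) * frac[0])
    n_valid = int(len(ordered) * frac[1])
    return {
        "train": ordered[:n_train],
        "valid": ordered[n_train:n_train + n_valid],
        "test": ordered[n_train + n_valid:],
    }

records = [{"date": f"{year}-01-01", "Y": year % 2} for year in range(2019, 2009, -1)]
by_time = time_split(records, "date")
```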
- tdc.utils.split.create_group_split(train_val, seed, holdout_frac, group_column)[source]#
split within each stratification defined by the group column for training/validation split
- Parameters:
- Returns:
a dictionary of split dataframes, where keys are train/valid/test and values are the corresponding dataframes
- Return type:
dict
- tdc.utils.split.create_scaffold_split(df, seed, frac, entity)[source]#
create scaffold split. It first generates a molecular scaffold for each molecule and then splits based on the scaffolds. Reference: https://github.com/chemprop/chemprop/blob/master/chemprop/data/scaffold.py
- Parameters:
- Returns:
a dictionary of split dataframes, where keys are train/valid/test and values are the corresponding dataframes
- Return type:
dict
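The grouping step of a scaffold split can be sketched without a chemistry toolkit by assuming the scaffold of each molecule has already been computed. The sketch below groups molecules by scaffold and greedily fills train, then valid, then test, starting with the largest groups. This greedy, deterministic assignment is a common strategy and an assumption here; it ignores the seed argument and does not reproduce the library's scaffold generation. The name `scaffold_split` and the input mapping are hypothetical.

```python
from collections import defaultdict

def scaffold_split(scaffold_of, frac=(0.7, 0.1, 0.2)):
    """scaffold_of maps molecule id -> precomputed scaffold string (hypothetical input)."""
    groups = defaultdict(list)
    for mol_id, scaffold in scaffold_of.items():
        groups[scaffold].append(mol_id)
    # Largest scaffold groups first, so big families are placed before small ones.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(scaffold_of)
    train, valid, test = [], [], []
    for group in ordered:
        # A whole group always goes to one partition, never split across them.
        if len(train) + len(group) <= frac[0] * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac[1] * n:
            valid.extend(group)
        else:
            test.extend(group)
    return {"train": train, "valid": valid, "test": test}

scaffold_of = {
    "m1": "c1ccccc1", "m2": "c1ccccc1", "m3": "c1ccccc1", "m4": "c1ccccc1",
    "m5": "c1ccncc1", "m6": "c1ccncc1", "m7": "C1CCCCC1",
    "m8": "c1ccoc1", "m9": "c1ccsc1", "m10": "C1CCNCC1",
}
by_scaffold = scaffold_split(scaffold_of)
```

Because molecules sharing a scaffold stay together, the test set contains only scaffolds the model has not seen during training.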