tdc.utils#

tdc.utils.label module#

Utility functions for transforming labels

tdc.utils.label.NegSample(df, column_names, frac, two_types)[source]#

Negative sampling for binary interaction datasets

Parameters:
  • df (pandas.DataFrame) – input dataset dataframe

  • column_names (list) – column names in the order of [id1, x1, id2, x2]

  • frac (float) – the ratio of negative samples compared to positive samples

  • two_types (bool) – whether the two entity types are different (e.g. drug-target) or the same (e.g. drug-drug)

Returns:

a new dataframe with negative samples (Y = 0)

Return type:

pandas.DataFrame
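
A minimal usage sketch (the column names below are hypothetical; any interaction dataframe with [id1, x1, id2, x2] columns and a Y column should work):

>>> from tdc.utils.label import NegSample
>>> # df holds positive drug-drug pairs with Y = 1 (column names are illustrative)
>>> df_aug = NegSample(df,
...                    column_names=['Drug1_ID', 'Drug1', 'Drug2_ID', 'Drug2'],
...                    frac=1.0,
...                    two_types=False)
>>> # negative pairs labeled Y = 0 are sampled at frac times the number of positives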

tdc.utils.label.binarize(y, threshold, order='ascending')[source]#

binarization of a label list given a pre-specified threshold

Parameters:
  • y (list) – a list of labels

  • threshold (float) – the threshold for turning label to 1 or 0

  • order (str, optional) – if order is 'ascending', labels above the threshold become 1 and labels below become 0; if 'descending', the mapping is reversed

Returns:

an array of transformed labels

Return type:

np.array

Raises:

AttributeError – raised when order is not 'ascending' or 'descending'
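
A short sketch of the thresholding behavior described above:

>>> from tdc.utils.label import binarize
>>> y_bin = binarize([0.1, 0.9, 0.4, 0.7], threshold=0.5, order='ascending')
>>> # per the rule above, values over the threshold become 1, so y_bin should be [0, 1, 0, 1];
>>> # with order='descending' the mapping flips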

tdc.utils.label.convert_back_log(y)[source]#

helper to convert labels back from log scale

Parameters:

y (list) – a list of labels in log-scale

Returns:

an array of labels converted back from log scale (p) to nM

Return type:

np.array

tdc.utils.label.convert_to_log(y)[source]#

log conversion helper

Parameters:

y (list) – a list of labels

Returns:

an array of log-transformed labels

Return type:

np.array

tdc.utils.label.convert_y_unit(y, from_, to_)[source]#

label unit conversion helper function

Parameters:
  • y (list) – a list of labels

  • from_ (str) – source unit, 'nM' or 'p'

  • to_ (str) – target unit, 'p' or 'nM'

Returns:

a numpy array of transformed labels

Return type:

np.array
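
A hedged sketch of the unit conversion; the p scale is the usual negative log10 of the molar concentration, so 10 nM corresponds to roughly p = 8:

>>> from tdc.utils.label import convert_y_unit
>>> y_p = convert_y_unit([10.0, 100.0], from_='nM', to_='p')   # roughly [8.0, 7.0]
>>> y_nM = convert_y_unit(y_p, from_='p', to_='nM')            # back to roughly [10.0, 100.0]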

tdc.utils.label.label_dist(y, name=None)[source]#

plot the distribution of labels

Parameters:
  • y (list) – a list of labels

  • name (None, optional) – dataset name

tdc.utils.label.label_transform(y, binary, threshold, convert_to_log, verbose=True, order='descending')[source]#

label transformation helper function

Parameters:
  • y (list) – a list of labels

  • binary (bool) – whether or not to conduct binarization

  • threshold (float) – the threshold for binarization

  • convert_to_log (bool) – whether to convert continuous values (e.g. Kd) to log scale

  • verbose (bool, optional) – whether or not to print intermediate processing statements

  • order (str, optional) – if 'descending', values below the threshold become 1 and values above become 0 (the reverse for 'ascending'), defaults to 'descending'

Returns:

an array of transformed labels

Return type:

np.array

Raises:

ValueError – raised when order is not 'descending' or 'ascending'
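
A minimal sketch of the two common uses (the Kd-style values and the threshold are illustrative):

>>> from tdc.utils.label import label_transform
>>> y = [5.0, 50.0, 500.0]                      # e.g. Kd values in nM
>>> y_log = label_transform(y, binary=False, threshold=30, convert_to_log=True)
>>> # binarize instead: with the default order='descending', values below the threshold become 1
>>> y_bin = label_transform(y, binary=True, threshold=30, convert_to_log=False)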

tdc.utils.label_name_list module#

list of dataset names

tdc.utils.load module#

wrappers for downloading various datasets

tdc.utils.load.atom_to_one_hot(atom, allowed_atom_list)[source]#

a helper to convert an atom to a one-hot encoding

Parameters:
  • atom (str) – the atom to convert

  • allowed_atom_list (list(str)) – the atom types allowed in the encoding

Returns:

atom one-hot encoding vector

Return type:

new_atom (numpy.array)
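
A minimal sketch (the allowed atom list is illustrative):

>>> from tdc.utils.load import atom_to_one_hot
>>> vec = atom_to_one_hot('C', allowed_atom_list=['C', 'N', 'O', 'S'])
>>> # vec is a one-hot numpy array with a 1 at the index of 'C'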

tdc.utils.load.bi_distribution_dataset_load(name, path, dataset_names, return_pocket=False, threshold=15, remove_protein_Hs=True, remove_ligand_Hs=True, keep_het=False)[source]#

a wrapper to download, process, and load protein-ligand conditional generation task datasets; assumes the downloaded file is already processed

Parameters:
  • name (str) – the rough dataset name

  • path (str) – the dataset path to save/retrieve

  • dataset_names (list) – a list of available exact dataset names

Returns:

the list of molecule representations

Return type:

pandas.Series

tdc.utils.load.bm_download_wrapper(name, path)[source]#

wrapper for downloading a benchmark group given the name and path

Parameters:
  • name (str) – the rough benchmark group query name

  • path (str) – the path to save the benchmark group

  • dataset_names (list) – the list of available benchmark group names

Returns:

the exact benchmark group query name

Return type:

str

tdc.utils.load.bm_group_load(name, path)[source]#

a wrapper to download, process and load benchmark group

Parameters:
  • name (str) – the rough benchmark group name

  • path (str) – the benchmark group path to save/retrieve

Returns:

exact benchmark group name

Return type:

str

tdc.utils.load.dataverse_download(url, path, name, types, id=None)[source]#

dataverse download helper with progress bar

Parameters:
  • url (str) – the url of the dataset

  • path (str) – the path to save the dataset

  • name (str) – the dataset name

  • types (dict) – a dictionary mapping from the dataset name to the file format

tdc.utils.load.distribution_dataset_load(name, path, dataset_names, column_name)[source]#

a wrapper to download, process, and load molecule distribution learning task datasets; assumes the downloaded file is already processed

Parameters:
  • name (str) – the rough dataset name

  • path (str) – the dataset path to save/retrieve

  • dataset_names (list) – a list of available exact dataset names

  • column_name (str) – the column that contains the molecules

Returns:

the list of molecule representations

Return type:

pandas.Series

tdc.utils.load.download_wrapper(name, path, dataset_names)[source]#

wrapper for downloading a dataset given the name and path, for csv/pkl/tsv files

Parameters:
  • name (str) – the rough dataset query name

  • path (str) – the path to save the dataset

  • dataset_names (list) – the list of available dataset names to search the query dataset

Returns:

the exact dataset query name

Return type:

str

tdc.utils.load.extract_atom_from_mol(rdmol, remove_Hs)[source]#

a helper to extract molecule atom information

Parameters:
  • rdmol (rdkit.rdmol) – rdkit molecule

  • remove_Hs (bool) – whether to remove H atoms from ligands or not

Returns:

atom coordinates and atom types

Return type:

coord (numpy.array), atom_type (numpy.array)

tdc.utils.load.extract_atom_from_protein(data_frame, data_frame_het, remove_Hs, keep_het)[source]#

a helper to extract protein atom information

Parameters:
  • data_frame (pandas.dataframe) – protein atom records

  • data_frame_het (pandas.dataframe) – protein heteroatom (HETATM) records

  • remove_Hs (bool) – whether to remove H atoms from proteins or not

  • keep_het (bool) – whether to keep het atoms (e.g. cofactors) in protein

Returns:

atom coordinates and atom types

Return type:

coord (numpy.array), atom_type (numpy.array)

tdc.utils.load.general_load(name, path, sep)[source]#

a wrapper to download, process, and load any pandas-readable dataframe file

Parameters:
  • name (str) – the dataset name

  • path (str) – the data save path

  • sep (str) – the delimiter used to parse the file

Returns:

data frame

Return type:

pandas.DataFrame

tdc.utils.load.generation_dataset_load(name, path, dataset_names)[source]#

a wrapper to download, process, and load generation task datasets; assumes the downloaded file is already processed

Parameters:
  • name (str) – the rough dataset name

  • path (str) – the dataset path to save/retrieve

  • dataset_names (list) – a list of available exact dataset names

Returns:

the data series

Return type:

pandas.Series

tdc.utils.load.generation_paired_dataset_load(name, path, dataset_names, input_name, output_name)[source]#

a wrapper to download, process and load generation-paired task datasets

Parameters:
  • name (str) – the rough dataset name

  • path (str) – the dataset path to save/retrieve

  • input_name (str) – the column name of the input entity

  • output_name (str) – the column name of the output entity

  • dataset_names (list) – a list of available exact dataset names

Returns:

two series (input entity representation, output entity representation)

Return type:

pandas.Series

tdc.utils.load.interaction_dataset_load(name, path, target, dataset_names, aux_column)[source]#

a wrapper to download, process and load two-instance prediction task datasets

Parameters:
  • name (str) – the rough dataset name

  • path (str) – the dataset path to save/retrieve

  • target (str) – for multi-label dataset, retrieve the label of interest

  • dataset_names (list) – a list of available exact dataset names

Returns:

five series (entity 1 representation, entity 2 representation, entity id 1, entity id 2, label)

Return type:

pandas.Series

tdc.utils.load.multi_dataset_load(name, path, dataset_names)[source]#

a wrapper to download, process, and load multiple (>2) instance prediction task datasets; assumes the downloaded file is already processed

Parameters:
  • name (str) – the rough dataset name

  • path (str) – the dataset path to save/retrieve

  • target (str) – for multi-label dataset, retrieve the label of interest

  • dataset_names (list) – a list of available exact dataset names

Returns:

the raw dataframe

Return type:

pandas.DataFrame

tdc.utils.load.oracle_download_wrapper(name, path, oracle_names)[source]#

wrapper for downloading an oracle model checkpoint given the name and path

Parameters:
  • name (str) – the rough oracle query name

  • path (str) – the path to save the oracle

  • oracle_names (list) – the list of available exact oracle names

Returns:

the exact oracle query name

Return type:

str

tdc.utils.load.oracle_load(name, path='./oracle', oracle_names=['drd2', 'gsk3b', 'jnk3', 'fpscores', 'cyp3a4_veith', 'drd2_current', 'gsk3b_current', 'jnk3_current', 'qed', 'logp', 'sa', 'rediscovery', 'similarity', 'median', 'isomers', 'mpo', 'hop', 'celecoxib_rediscovery', 'troglitazone_rediscovery', 'thiothixene_rediscovery', 'aripiprazole_similarity', 'albuterol_similarity', 'mestranol_similarity', 'isomers_c7h8n2o2', 'isomers_c9h10n2o2pf2cl', 'isomers_c11h24', 'osimertinib_mpo', 'fexofenadine_mpo', 'ranolazine_mpo', 'perindopril_mpo', 'amlodipine_mpo', 'sitagliptin_mpo', 'zaleplon_mpo', 'sitagliptin_mpo_prev', 'zaleplon_mpo_prev', 'median1', 'median2', 'valsartan_smarts', 'deco_hop', 'scaffold_hop', 'novelty', 'diversity', 'uniqueness', 'validity', 'fcd_distance', 'kl_divergence', 'askcos', 'ibm_rxn', 'isomer_meta', 'rediscovery_meta', 'similarity_meta', 'median_meta', 'docking_score', 'molecule_one_synthesis', 'pyscreener', 'rmsd', 'kabsch_rmsd', 'smina', '1iep_docking', '2rgp_docking', '3eml_docking', '3ny8_docking', '4rlu_docking', '4unn_docking', '5mo4_docking', '7l11_docking', 'drd3_docking', '3pbl_docking', '1iep_docking_normalize', '2rgp_docking_normalize', '3eml_docking_normalize', '3ny8_docking_normalize', '4rlu_docking_normalize', '4unn_docking_normalize', '5mo4_docking_normalize', '7l11_docking_normalize', 'drd3_docking_normalize', '3pbl_docking_normalize', '1iep_docking_vina', '2rgp_docking_vina', '3eml_docking_vina', '3ny8_docking_vina', '4rlu_docking_vina', '4unn_docking_vina', '5mo4_docking_vina', '7l11_docking_vina', 'drd3_docking_vina', '3pbl_docking_vina'])[source]#

a wrapper to download, process and load oracles.

Parameters:
  • name (str) – the rough oracle name

  • path (str) – the oracle path to save/retrieve, defaults to ‘./oracle’

  • oracle_names (list) – a list of available exact oracle names

Returns:

exact oracle name

Return type:

str

tdc.utils.load.pd_load(name, path)[source]#

load a pandas dataframe from a local file.

Parameters:
  • name (str) – dataset name

  • path (str) – the path where the dataset is saved

Returns:

loaded dataset in dataframe

Return type:

pandas.DataFrame

Raises:

ValueError – raised when the file format is not supported; currently only tab/csv/pkl/zip are supported
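
download_wrapper and pd_load are typically used together: the first resolves the fuzzy name and fetches the file, the second reads it into a dataframe. A sketch (the 'ADME' task and 'Caco2_Wang' query are illustrative, and in practice the tdc data classes call these helpers for you):

>>> from tdc.utils.load import download_wrapper, pd_load
>>> from tdc.utils.retrieve import retrieve_dataset_names
>>> exact_name = download_wrapper('Caco2_Wang', './data', retrieve_dataset_names('ADME'))
>>> df = pd_load(exact_name, './data')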

tdc.utils.load.process_crossdock(path, name='crossdock', return_pocket=False, threshold=15, remove_protein_Hs=True, remove_ligand_Hs=True, keep_het=False)[source]#

a processor for the crossdock dataset

Parameters:
  • name (str) – the name of the dataset

  • path (str) – the path to save the data file

  • print_stats (bool) – whether to print the basic statistics of the dataset

  • return_pocket (bool) – whether to return only the protein pocket or the full protein

  • threshold (int) – only used when return_pocket is True; if pockets are not provided in the raw data, the threshold is used as the radius of a sphere around the ligand center to define the protein pocket

  • remove_protein_Hs (bool) – whether to remove H atoms from proteins

  • remove_ligand_Hs (bool) – whether to remove H atoms from ligands

  • keep_het (bool) – whether to keep het atoms (e.g. cofactors) in the protein

Returns:

a dict of protein features and a dict of ligand features

Return type:

protein (dict), ligand (dict)

tdc.utils.load.process_dude(path, name='dude', return_pocket=False, threshold=15, remove_protein_Hs=True, remove_ligand_Hs=True, keep_het=False)[source]#

a processor for the DUD-E dataset

Parameters:
  • name (str) – the name of the dataset

  • path (str) – the path to save the data file

  • print_stats (bool) – whether to print the basic statistics of the dataset

  • return_pocket (bool) – whether to return only the protein pocket or the full protein

  • threshold (int) – only used when return_pocket is True; if pockets are not provided in the raw data, the threshold is used as the radius of a sphere around the ligand center to define the protein pocket

  • remove_protein_Hs (bool) – whether to remove H atoms from proteins

  • remove_ligand_Hs (bool) – whether to remove H atoms from ligands

  • keep_het (bool) – whether to keep het atoms (e.g. cofactors) in the protein

Returns:

a dict of protein features and a dict of ligand features

Return type:

protein (dict), ligand (dict)

tdc.utils.load.process_pdbbind(path, name='pdbbind', return_pocket=False, threshold=15, remove_protein_Hs=True, remove_ligand_Hs=True, keep_het=False)[source]#

a processor for the PDBBind dataset

Parameters:
  • name (str) – the name of the dataset

  • path (str) – the path to save the data file

  • print_stats (bool) – whether to print the basic statistics of the dataset

  • return_pocket (bool) – whether to return only the protein pocket or the full protein

  • threshold (int) – only used when return_pocket is True; if pockets are not provided in the raw data, the threshold is used as the radius of a sphere around the ligand center to define the protein pocket

  • remove_protein_Hs (bool) – whether to remove H atoms from proteins

  • remove_ligand_Hs (bool) – whether to remove H atoms from ligands

  • keep_het (bool) – whether to keep het atoms (e.g. cofactors) in the protein

Returns:

a dict of protein features and a dict of ligand features

Return type:

protein (dict), ligand (dict)

tdc.utils.load.process_scpdb(path, name='scPDB', return_pocket=False, threshold=15, remove_protein_Hs=True, remove_ligand_Hs=True, keep_het=False)[source]#

a processor for the scPDB dataset

Parameters:
  • name (str) – the name of the dataset

  • path (str) – the path to save the data file

  • print_stats (bool) – whether to print the basic statistics of the dataset

  • return_pocket (bool) – whether to return only the protein pocket or the full protein

  • threshold (int) – only used when return_pocket is True; if pockets are not provided in the raw data, the threshold is used as the radius of a sphere around the ligand center to define the protein pocket

  • remove_protein_Hs (bool) – whether to remove H atoms from proteins

  • remove_ligand_Hs (bool) – whether to remove H atoms from ligands

  • keep_het (bool) – whether to keep het atoms (e.g. cofactors) in the protein

Returns:

a dict of protein features and a dict of ligand features

Return type:

protein (dict), ligand (dict)

tdc.utils.load.property_dataset_load(name, path, target, dataset_names)[source]#

a wrapper to download, process and load single-instance prediction task datasets

Parameters:
  • name (str) – the rough dataset name

  • path (str) – the dataset path to save/retrieve

  • target (str) – for multi-label dataset, retrieve the label of interest

  • dataset_names (list) – a list of available exact dataset names

Returns:

three series (entity representation, label, entity id)

Return type:

pandas.Series
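
A hedged sketch; in practice these loaders are invoked indirectly through the tdc data classes (e.g. tdc.single_pred), and the dataset/task names below are illustrative:

>>> from tdc.utils.load import property_dataset_load
>>> from tdc.utils.retrieve import retrieve_dataset_names
>>> entity, y, entity_id = property_dataset_load('caco2_wang', './data', None,
...                                              retrieve_dataset_names('ADME'))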

tdc.utils.load.receptor_download_wrapper(name, path)[source]#

wrapper for downloading a receptor PDB file given the name and path

Parameters:
  • name (str) – the exact pdbid

  • path (str) – the path to save the receptor file

Returns:

the exact pdbid

Return type:

str

tdc.utils.load.receptor_load(name, path='./oracle')[source]#

a wrapper to download, process, and load a PDB file.

Parameters:
  • name (str) – the rough pdbid name

  • path (str) – the oracle path to save/retrieve, defaults to ‘./oracle’

Returns:

exact pdbid name

Return type:

str

tdc.utils.load.three_dim_dataset_load(name, path, dataset_names)[source]#

a wrapper to download, process and load 3d molecule task datasets

Parameters:
  • name (str) – the rough dataset name

  • path (str) – the dataset path to save/retrieve

  • dataset_names (list) – a list of available exact dataset names

Returns:

the dataframe holding the 3D information, the path of the dataset, and the name of the dataset

Return type:

pandas.DataFrame, str, str

tdc.utils.load.zip_data_download_wrapper(name, path, dataset_names)[source]#

wrapper for downloading a zip-archived dataset given the name and path, automatically unzipping it

Parameters:
  • name (str) – the rough dataset query name

  • path (str) – the path to save the dataset

  • dataset_names (list) – the list of available dataset names to search the query dataset

Returns:

the exact dataset query name

Return type:

str

tdc.utils.misc module#

miscellaneous utility functions

tdc.utils.misc.fuzzy_search(name, dataset_names)[source]#

fuzzy matching between the real dataset name and the input name

Parameters:
  • name (str) – input dataset name given by users

  • dataset_names (list) – the list of exact dataset names in TDC

Returns:

the real dataset name

Return type:

str

Raises:

ValueError – raised when no dataset name is matched (e.g. a wrong task name)

tdc.utils.misc.get_closet_match(predefined_tokens, test_token, threshold=0.8)[source]#

Get the closest match by Levenshtein Distance.

Parameters:
  • predefined_tokens (list) – Predefined string tokens.

  • test_token (str) – User input that needs matching to existing tokens.

  • threshold (float, optional) – the minimum match score below which an error is raised, defaults to 0.8

Returns:

the exact token with the highest matching probability, together with the probability

Return type:

str, float

Raises:

ValueError – no name is matched
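
A small sketch of the fuzzy matching (the token list is illustrative):

>>> from tdc.utils.misc import get_closet_match
>>> match = get_closet_match(['herg', 'caco2_wang', 'bbb_martins'], 'Caco2-Wang')
>>> # returns the best-matching token ('caco2_wang' here) along with its match score;
>>> # a query scoring below the threshold raises ValueError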

tdc.utils.misc.install(package)[source]#

install a pip package

Parameters:

package (str) – package name

tdc.utils.misc.load_dict(path)[source]#

load a pickled object from a path

Parameters:

path (str) – the path where the pickle file is located

Returns:

the loaded object

Return type:

object

tdc.utils.misc.print_sys(s)[source]#

system print

Parameters:

s (str) – the string to print

tdc.utils.misc.save_dict(path, obj)[source]#

save an object to a pickle file

Parameters:
  • path (str) – the path to save the pickle file

  • obj (object) – the object to save
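
save_dict and load_dict are simple pickle round-trip helpers; a sketch (assuming the path is used verbatim as the pickle file name):

>>> from tdc.utils.misc import save_dict, load_dict
>>> save_dict('./label_map.pkl', {'0': 'no interaction', '1': 'interaction'})
>>> obj = load_dict('./label_map.pkl')   # the original dictionary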

tdc.utils.misc.to_submission_format(results)[source]#

convert results to the leaderboard submission-ready format

Parameters:

results (dict) – a dictionary of metrics across five runs

Returns:

a dictionary of metrics and values with mean and std

Return type:

dict

tdc.utils.query module#

Utility functions for queries

tdc.utils.query.cid2smiles(cid)[source]#

retrieve the SMILES string for a PubChem CID

Parameters:

cid (str) – PubChem CID

Returns:

SMILES string

Return type:

str

tdc.utils.query.request(identifier, namespace='cid', domain='compound', operation=None, output='JSON', searchtype=None)[source]#

copied from https://github.com/mcs07/PubChemPy/blob/e3c4f4a9b6120433e5cc3383464c7a79e9b2b86e/pubchempy.py#L238 Construct API request from parameters and return the response. Full specification at http://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST.html

tdc.utils.query.uniprot2seq(ProteinID)[source]#

get the protein sequence from a UniProt ID

Parameters:

ProteinID (str) – the uniprot ID

Returns:

amino acid sequence

Return type:

str
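
Two hedged examples of the query helpers (network access required; PubChem CID 2244 is aspirin and UniProt P00533 is human EGFR):

>>> from tdc.utils.query import cid2smiles, uniprot2seq
>>> smiles = cid2smiles('2244')   # expected to be aspirin's SMILES, e.g. 'CC(=O)OC1=CC=CC=C1C(=O)O'
>>> seq = uniprot2seq('P00533')   # the EGFR amino acid sequence as a plain string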

tdc.utils.retrieve module#

Utility functions for dataset/metadata retrieval

tdc.utils.retrieve.get_label_map(name, path='./data', target=None, file_format='csv', output_format='dict', task='DDI', name_column='Map')[source]#

to retrieve the biomedical meaning of labels

Parameters:
  • name (str) – the name of the dataset

  • path (str, optional) – the dataset path, where the data is located

  • target (None, optional) – the label name

  • file_format (str, optional) – format of the file

  • output_format (str, optional) – return a dictionary or a dataframe or the raw array of mapped labels

  • task (str, optional) – the name of the task

  • name_column (str, optional) – the name of the column that stores the label name

Returns:

the label map, as a dict, dataframe, or array depending on output_format

Return type:

dict/pd.DataFrame/np.array

Raises:

ValueError – output_format not supported.
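
A sketch of the documented usage pattern for label maps (assuming the DrugBank DDI dataset has already been downloaded to ./data):

>>> from tdc.utils.retrieve import get_label_map
>>> label_map = get_label_map(name='DrugBank', path='./data', task='DDI')
>>> # label_map is a dict from encoded DDI label to its biomedical description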

tdc.utils.retrieve.get_reaction_type(name, path='./data', output_format='array')[source]#

to retrieve the reaction types for a reaction dataset

Parameters:
  • name (str) – dataset name

  • path (str, optional) – dataset path

  • output_format (str, optional) – output format in dataframe or in raw array format

Returns:

the reaction types, as a dataframe or array depending on output_format

Return type:

pd.DataFrame/np.array

Raises:

ValueError – the output format is not supported

tdc.utils.retrieve.retrieve_all_benchmarks()[source]#

to get all available benchmark groups

Returns:

a list of benchmark group names

Return type:

list

tdc.utils.retrieve.retrieve_benchmark_names(name)[source]#

to get all available benchmarks given a query benchmark group

Parameters:

name (str) – the name of the benchmark group

Returns:

a list of benchmarks

Return type:

list

tdc.utils.retrieve.retrieve_dataset_names(name)[source]#

to get all available dataset names given a task

Parameters:

name (str) – the name of query task

Returns:

a list of available datasets

Return type:

list

tdc.utils.retrieve.retrieve_label_name_list(name)[source]#

get the set of available labels for a query dataset

Parameters:

name (str) – rough dataset name

Returns:

a list of available labels

Return type:

list
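
A sketch of the retrieval helpers (the Tox21 and ADME queries are illustrative):

>>> from tdc.utils.retrieve import retrieve_label_name_list, retrieve_dataset_names
>>> retrieve_label_name_list('Tox21')   # e.g. ['NR-AR', 'NR-AR-LBD', ...]
>>> retrieve_dataset_names('ADME')      # all exact dataset names under the ADME task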

tdc.utils.split module#

Utility functions for splitting datasets

tdc.utils.split.create_combination_generation_split(dict1, dict2, seed, frac)[source]#

create random split

Parameters:
  • dict1 (dict) – the first data dictionary

  • dict2 (dict) – the second data dictionary

  • seed (int) – the random seed

  • frac (list) – a list of train/valid/test fractions

Returns:

a dictionary of split dataframes, where keys are train/valid/test and values correspond to each dataframe

Return type:

dict

tdc.utils.split.create_combination_split(df, seed, frac)[source]#

Function for splitting a drug combination dataset such that no combination is shared across the splits

Parameters:
  • df (pd.DataFrame) – dataset to split

  • seed (int) – random seed

  • frac (list) – split fraction as a list

Returns:

a dictionary of split dataframes, where keys are train/valid/test and values correspond to each dataframe

Return type:

dict

tdc.utils.split.create_fold(df, fold_seed, frac)[source]#

create random split

Parameters:
  • df (pd.DataFrame) – dataset dataframe

  • fold_seed (int) – the random seed

  • frac (list) – a list of train/valid/test fractions

Returns:

a dictionary of split dataframes, where keys are train/valid/test and values correspond to each dataframe

Return type:

dict
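
A minimal sketch of a random split (the fractions are illustrative):

>>> from tdc.utils.split import create_fold
>>> splits = create_fold(df, fold_seed=42, frac=[0.7, 0.1, 0.2])
>>> train, valid, test = splits['train'], splits['valid'], splits['test']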

tdc.utils.split.create_fold_setting_cold(df, fold_seed, frac, entities)[source]#

create a cold split: given one or more columns, it first splits the entities in those columns and then maps all associated data points to the corresponding partition

Parameters:
  • df (pd.DataFrame) – dataset dataframe

  • fold_seed (int) – the random seed

  • frac (list) – a list of train/valid/test fractions

  • entities (Union[str, List[str]]) – either a single “cold” entity or a list of “cold” entities on which the split is done

Returns:

a dictionary of split dataframes, where keys are train/valid/test and values correspond to each dataframe

Return type:

dict
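
A sketch of a cold split on a single entity column (the 'Drug_ID' column name is illustrative):

>>> from tdc.utils.split import create_fold_setting_cold
>>> splits = create_fold_setting_cold(df, fold_seed=42, frac=[0.7, 0.1, 0.2],
...                                   entities='Drug_ID')
>>> # no Drug_ID value appears in more than one of train/valid/test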

tdc.utils.split.create_fold_time(df, frac, date_column)[source]#

create splits based on time

Parameters:
  • df (pd.DataFrame) – the dataset dataframe

  • frac (list) – list of train/valid/test fractions

  • date_column (str) – the name of the column that contains the time info

Returns:

a dictionary of split dataframes, where keys are train/valid/test and values correspond to each dataframe

Return type:

dict
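
A sketch of a temporal split (the 'Year' column name is illustrative):

>>> from tdc.utils.split import create_fold_time
>>> splits = create_fold_time(df, frac=[0.7, 0.1, 0.2], date_column='Year')
>>> # records are partitioned chronologically, with the most recent fraction forming the test set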

tdc.utils.split.create_group_split(train_val, seed, holdout_frac, group_column)[source]#

create a training/validation split within each stratum defined by the group column

Parameters:
  • train_val (pd.DataFrame) – the train+valid dataframe to split on

  • seed (int) – the random seed

  • holdout_frac (float) – the fraction of data held out for validation

  • group_column (str) – the name of the group column

Returns:

a dictionary of split dataframes, where keys are train/valid/test and values correspond to each dataframe

Return type:

dict

tdc.utils.split.create_scaffold_split(df, seed, frac, entity)[source]#

create a scaffold split. It first generates a molecular scaffold for each molecule and then splits based on scaffolds. Reference: https://github.com/chemprop/chemprop/blob/master/chemprop/data/scaffold.py

Parameters:
  • df (pd.DataFrame) – dataset dataframe

  • seed (int) – the random seed

  • frac (list) – a list of train/valid/test fractions

  • entity (str) – the column name where the molecules are stored

Returns:

a dictionary of split dataframes, where keys are train/valid/test and values correspond to each dataframe

Return type:

dict
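
A sketch of a scaffold split (the 'Drug' column name is illustrative; scaffold generation requires RDKit):

>>> from tdc.utils.split import create_scaffold_split
>>> splits = create_scaffold_split(df, seed=42, frac=[0.7, 0.1, 0.2], entity='Drug')
>>> # molecules that share a Bemis-Murcko scaffold are kept within the same partition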