tdc.utils

tdc.utils.label module

Utility functions for transforming labels

tdc.utils.label.NegSample(df, column_names, frac, two_types)[source]

Negative Sampling for Binary Interaction Dataset

Parameters
  • df (pandas.DataFrame) – input dataset dataframe

  • column_names (list) – column names in the order of [id1, x1, id2, x2]

  • frac (float) – the ratio of negative samples compared to positive samples

  • two_types (bool) – whether the two entities are of different types (e.g. drug-target) or of a single type (e.g. drug-drug)

Returns

a new dataframe with negative samples (Y = 0)

Return type

pandas.DataFrame
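
The idea behind negative sampling can be sketched without pandas. The snippet below is a minimal illustration, not the library's implementation: `negative_sample_sketch`, its `seed` argument, and the use of plain `(id1, id2)` tuples instead of a dataframe are all assumptions made for the example.

```python
import random


def negative_sample_sketch(positives, frac, two_types, seed=0):
    """Sample unobserved (id1, id2) pairs as negatives (hypothetical helper).

    positives: list of (id1, id2) tuples for known interactions.
    frac: number of negatives to draw per positive.
    two_types: True if id1 and id2 belong to different entity types.
    """
    rng = random.Random(seed)
    left = sorted({a for a, _ in positives})
    right = sorted({b for _, b in positives})
    if not two_types:
        # Single entity type: both sides draw from the same pool.
        left = right = sorted(set(left) | set(right))

    def key(a, b):
        # For a single entity type, (a, b) and (b, a) are the same pair.
        return (a, b) if two_types else tuple(sorted((a, b)))

    seen = {key(a, b) for a, b in positives}
    target = int(len(positives) * frac)
    negatives = []
    attempts = 0
    max_attempts = 100 * max(target, 1)  # guard against an exhausted pair space
    while len(negatives) < target and attempts < max_attempts:
        attempts += 1
        a, b = rng.choice(left), rng.choice(right)
        if (not two_types and a == b) or key(a, b) in seen:
            continue
        seen.add(key(a, b))
        negatives.append((a, b, 0))  # Y = 0 marks a sampled negative
    return negatives
```

With `frac=1.0` the sketch draws as many negatives as there are positives, provided enough unobserved pairs exist.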

tdc.utils.label.binarize(y, threshold, order='ascending')[source]

binarization of a label list given a pre-specified threshold

Parameters
  • y (list) – a list of labels

  • threshold (float) – the threshold for turning label to 1 or 0

  • order (str, optional) – if order is ascending, labels above the threshold become 1 and labels below become 0; if descending, vice versa

Returns

an array of transformed labels

Return type

np.array

Raises

AttributeError – select the correct order “ascending/descending”
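
The thresholding rule described above can be illustrated in a few lines. This is a sketch under stated assumptions rather than the library code: `binarize_sketch` is a hypothetical name, it returns a plain list instead of a NumPy array, and mapping values exactly equal to the threshold to 0 is a choice made here because the documentation does not specify it.

```python
def binarize_sketch(y, threshold, order="ascending"):
    """Turn continuous labels into 0/1 labels around a threshold."""
    if order == "ascending":
        # Values strictly above the threshold become 1.
        return [1 if value > threshold else 0 for value in y]
    if order == "descending":
        # Values strictly below the threshold become 1.
        return [1 if value < threshold else 0 for value in y]
    # Mirrors the documented exception type for an invalid order.
    raise AttributeError("order must be 'ascending' or 'descending'")
```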

tdc.utils.label.convert_back_log(y)[source]

conversion from log-scale helper

Parameters

y (list) – a list of labels in log-scale

Returns

an array of labels converted back from log-scale (p) to nM

Return type

np.array

tdc.utils.label.convert_to_log(y)[source]

log conversion helper

Parameters

y (list) – a list of labels

Returns

an array of log-transformed labels

Return type

np.array

tdc.utils.label.convert_y_unit(y, from_, to_)[source]

label unit conversion helper function

Parameters
  • y (list) – a list of labels

  • from_ (str) – source units, ‘nM’/‘p’

  • to_ (str) – target units, ‘p’/‘nM’

Returns

a numpy array of transformed labels

Return type

np.array
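
For orientation, the nM/p conversion can be written out as below. This is a sketch that assumes the conventional definition p = -log10(concentration in molar), with 1 nM = 1e-9 M; the function names are hypothetical, and the library may differ in details such as adding a small constant for numerical stability.

```python
import math


def nm_to_p(values_nm):
    """Convert nanomolar values to the negative log10 molar scale (e.g. Kd -> pKd)."""
    return [-math.log10(value * 1e-9) for value in values_nm]


def p_to_nm(values_p):
    """Convert negative log10 molar values back to nanomolar."""
    return [10 ** (-value) / 1e-9 for value in values_p]
```

Under this definition, 1 nM corresponds to p = 9 and 1000 nM (1 µM) to p = 6, so larger p values indicate stronger binding.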

tdc.utils.label.label_dist(y, name=None)[source]

plot the distribution of label

Parameters
  • y (list) – a list of labels

  • name (None, optional) – dataset name

tdc.utils.label.label_transform(y, binary, threshold, convert_to_log, verbose=True, order='descending')[source]

label transformation helper function

Parameters
  • y (list) – a list of labels

  • binary (bool) – whether or not to conduct binarization

  • threshold (float) – the threshold for binarization

  • convert_to_log (bool) – whether to convert continuous values such as Kd to log-scale

  • verbose (bool, optional) – whether or not to print intermediate processing statements

  • order (str, optional) – if descending, the label is 1 for values less than the threshold, and vice versa; defaults to ‘descending’

Returns

an array of transformed labels

Return type

np.array

Raises

ValueError – specify the correct order from ‘descending’/‘ascending’

tdc.utils.label_name_list module

list of dataset names

tdc.utils.load module

wrappers for downloading various datasets

tdc.utils.load.bm_download_wrapper(name, path)[source]

wrapper for downloading a benchmark group given the name and path

Parameters
  • name (str) – the rough benchmark group query name

  • path (str) – the path to save the benchmark group

  • dataset_names (list) – the list of available benchmark group names

Returns

the exact benchmark group query name

Return type

str

tdc.utils.load.bm_group_load(name, path)[source]

a wrapper to download, process and load benchmark group

Parameters
  • name (str) – the rough benchmark group name

  • path (str) – the benchmark group path to save/retrieve

Returns

exact benchmark group name

Return type

str

tdc.utils.load.dataverse_download(url, path, name, types)[source]

dataverse download helper with progress bar

Parameters
  • url (str) – the url of the dataset

  • path (str) – the path to save the dataset

  • name (str) – the dataset name

  • types (dict) – a dictionary mapping from the dataset name to the file format

tdc.utils.load.distribution_dataset_load(name, path, dataset_names, column_name)[source]

a wrapper to download, process and load molecule distribution learning task datasets. assume the downloaded file is already processed

Parameters
  • name (str) – the rough dataset name

  • path (str) – the dataset path to save/retrieve

  • dataset_names (list) – a list of available exact dataset names

  • column_name (str) – the name of the column that contains the molecules

Returns

the molecule representations

Return type

pandas.Series

tdc.utils.load.download_wrapper(name, path, dataset_names)[source]

wrapper for downloading a dataset given the name and path, for csv,pkl,tsv files

Parameters
  • name (str) – the rough dataset query name

  • path (str) – the path to save the dataset

  • dataset_names (list) – the list of available dataset names to search the query dataset

Returns

the exact dataset query name

Return type

str

tdc.utils.load.generation_dataset_load(name, path, dataset_names)[source]

a wrapper to download, process and load generation task datasets. assume the downloaded file is already processed

Parameters
  • name (str) – the rough dataset name

  • path (str) – the dataset path to save/retrieve

  • dataset_names (list) – a list of available exact dataset names

Returns

the data series

Return type

pandas.Series

tdc.utils.load.generation_paired_dataset_load(name, path, dataset_names, input_name, output_name)[source]

a wrapper to download, process and load generation-paired task datasets

Parameters
  • name (str) – the rough dataset name

  • path (str) – the dataset path to save/retrieve

  • target (str) – for multi-label dataset, retrieve the label of interest

  • dataset_names (list) – a list of available exact dataset names

Returns

two series (entity 1 representation, label)

Return type

pandas.Series

tdc.utils.load.interaction_dataset_load(name, path, target, dataset_names, aux_column)[source]

a wrapper to download, process and load two-instance prediction task datasets

Parameters
  • name (str) – the rough dataset name

  • path (str) – the dataset path to save/retrieve

  • target (str) – for multi-label dataset, retrieve the label of interest

  • dataset_names (list) – a list of available exact dataset names

Returns

five series (entity 1 representation, entity 2 representation, entity id 1, entity id 2, label)

Return type

pandas.Series

tdc.utils.load.multi_dataset_load(name, path, dataset_names)[source]

a wrapper to download, process and load multiple(>2)-instance prediction task datasets. assume the downloaded file is already processed

Parameters
  • name (str) – the rough dataset name

  • path (str) – the dataset path to save/retrieve

  • target (str) – for multi-label dataset, retrieve the label of interest

  • dataset_names (list) – a list of available exact dataset names

Returns

the raw dataframe

Return type

pandas.DataFrame

tdc.utils.load.oracle_download_wrapper(name, path, oracle_names)[source]

wrapper for downloading an oracle model checkpoint given the name and path

Parameters
  • name (str) – the rough oracle query name

  • path (str) – the path to save the oracle

  • oracle_names (list) – the list of available exact oracle names

Returns

the exact oracle query name

Return type

str

tdc.utils.load.oracle_load(name, path='./oracle', oracle_names=['drd2', 'gsk3b', 'jnk3', 'fpscores', 'cyp3a4_veith', 'qed', 'logp', 'sa', 'rediscovery', 'similarity', 'median', 'isomers', 'mpo', 'hop', 'celecoxib_rediscovery', 'troglitazone_rediscovery', 'thiothixene_rediscovery', 'aripiprazole_similarity', 'albuterol_similarity', 'mestranol_similarity', 'isomers_c7h8n2o2', 'isomers_c9h10n2o2pf2cl', 'osimertinib_mpo', 'fexofenadine_mpo', 'ranolazine_mpo', 'perindopril_mpo', 'amlodipine_mpo', 'sitagliptin_mpo', 'zaleplon_mpo', 'median1', 'median2', 'valsartan_smarts', 'deco_hop', 'scaffold_hop', 'novelty', 'diversity', 'uniqueness', 'validity', 'fcd_distance', 'kl_divergence', 'askcos', 'ibm_rxn', 'isomer_meta', 'rediscovery_meta', 'similarity_meta', 'median_meta', 'docking_score', 'molecule_one_synthesis', '1iep_docking', '2rgp_docking', '3eml_docking', '3ny8_docking', '4rlu_docking', '4unn_docking', '5mo4_docking', '7l11_docking', 'drd3_docking', '3pbl_docking'])[source]

a wrapper to download, process and load oracles.

Parameters
  • name (str) – the rough oracle name

  • path (str) – the oracle path to save/retrieve, defaults to ‘./oracle’

  • oracle_names (list) – a list of available exact oracle names

Returns

exact oracle name

Return type

str

tdc.utils.load.pd_load(name, path)[source]

load a pandas dataframe from local file.

Parameters
  • name (str) – dataset name

  • path (str) – the path where the dataset is saved

Returns

loaded dataset in dataframe

Return type

pandas.DataFrame

Raises

ValueError – the file format is not supported; currently only tab/csv/pkl/zip are supported

tdc.utils.load.property_dataset_load(name, path, target, dataset_names)[source]

a wrapper to download, process and load single-instance prediction task datasets

Parameters
  • name (str) – the rough dataset name

  • path (str) – the dataset path to save/retrieve

  • target (str) – for multi-label dataset, retrieve the label of interest

  • dataset_names (list) – a list of available exact dataset names

Returns

three series (entity representation, label, entity id)

Return type

pandas.Series

tdc.utils.load.receptor_download_wrapper(name, path)[source]

wrapper for downloading a receptor pdb file given the name and path

Parameters
  • name (str) – the exact pdbid

  • path (str) – the path to save the oracle

Returns

the exact pdbid

Return type

str

tdc.utils.load.receptor_load(name, path='./oracle')[source]

a wrapper to download, process and load pdb file.

Parameters
  • name (str) – the rough pdbid name

  • path (str) – the oracle path to save/retrieve, defaults to ‘./oracle’

Returns

exact pdbid name

Return type

str

tdc.utils.load.three_dim_dataset_load(name, path, dataset_names)[source]

a wrapper to download, process and load 3d molecule task datasets

Parameters
  • name (str) – the rough dataset name

  • path (str) – the dataset path to save/retrieve

  • dataset_names (list) – a list of available exact dataset names

Returns

the dataframe that holds the 3d information, the path of the dataset (str), and the name of the dataset (str)

Return type

pandas.DataFrame

tdc.utils.load.zip_data_download_wrapper(name, path, dataset_names)[source]

wrapper for downloading a dataset given the name and path - zip file, automatically unzipping

Parameters
  • name (str) – the rough dataset query name

  • path (str) – the path to save the dataset

  • dataset_names (list) – the list of available dataset names to search the query dataset

Returns

the exact dataset query name

Return type

str

tdc.utils.misc module

miscellaneous utility functions

fuzzy matching between the real dataset name and the input name

Parameters
  • name (str) – input dataset name given by users

  • dataset_names (str) – the exact dataset name in TDC

Returns

the real dataset name

Return type

str

Raises

ValueError – the wrong task name, no name is matched

tdc.utils.misc.get_closet_match(predefined_tokens, test_token, threshold=0.8)[source]

Get the closest match by Levenshtein Distance.

Parameters
  • predefined_tokens (list) – Predefined string tokens.

  • test_token (str) – User input that needs matching to existing tokens.

  • threshold (float, optional) – The lowest match score to raise errors, defaults to 0.8

Returns

the exact token with the highest matching probability, together with that probability (float)

Return type

str

Raises

ValueError – no name is matched
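
The matching logic can be illustrated with the standard library. Note the substitution: the function above is documented as using Levenshtein distance, whereas this sketch uses `difflib.SequenceMatcher` as a stand-in similarity score so that it runs without extra dependencies. The name `closest_match_sketch` and the case-insensitive comparison are assumptions for the example.

```python
from difflib import SequenceMatcher


def closest_match_sketch(predefined_tokens, test_token, threshold=0.8):
    """Return (best_token, score), or raise ValueError if nothing is close enough."""
    best_token, best_score = None, 0.0
    for token in predefined_tokens:
        # ratio() is a similarity in [0, 1]; it is not a Levenshtein distance.
        score = SequenceMatcher(None, token.lower(), test_token.lower()).ratio()
        if score > best_score:
            best_token, best_score = token, score
    if best_token is None or best_score < threshold:
        raise ValueError(f"no token matches {test_token!r} with score >= {threshold}")
    return best_token, best_score
```

A slightly misspelled query such as `"Caco2_Wan"` would resolve to `"caco2_wang"`, while an unrelated string raises `ValueError`.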

tdc.utils.misc.install(package)[source]

install pip package

Parameters

package (str) – package name

tdc.utils.misc.load_dict(path)[source]

load an object from a path

Parameters

path (str) – the path where the pickle file is located

Returns

loaded pickle file

Return type

object

tdc.utils.misc.print_sys(s)[source]

system print

Parameters

s (str) – the string to print

tdc.utils.misc.save_dict(path, obj)[source]

save an object to a pickle file

Parameters
  • path (str) – the path to save the pickle file

  • obj (object) – the object to save
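
A save/load round trip with pickle, as described for `load_dict` and `save_dict`, might look like the following. The helper names `save_obj` and `load_obj` are hypothetical stand-ins so the example runs without the library installed.

```python
import os
import pickle
import tempfile


def save_obj(path, obj):
    """Serialize obj to a pickle file at path."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)


def load_obj(path):
    """Load and return the object stored in the pickle file at path."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Round trip through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "results.pkl")
    save_obj(path, {"mae": [0.41, 0.39]})
    restored = load_obj(path)
```

Unpickling can execute arbitrary code, so pickle files should only be loaded from trusted sources.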

tdc.utils.misc.to_submission_format(results)[source]

convert the results to submission-ready format in leaderboard

Parameters

results (dict) – a dictionary of metrics across five runs

Returns

a dictionary of metrics and values with mean and std

Return type

dict
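
Aggregating metrics over repeated runs into mean and standard deviation can be sketched as below. The input layout (metric name mapped to a list of per-run values), the rounding to three decimals, and the use of the population standard deviation are assumptions for illustration; the library's expected input and exact statistics may differ.

```python
from statistics import mean, pstdev


def summarize_runs(results):
    """Map each metric name to [mean, std] over its per-run values."""
    return {
        metric: [round(mean(values), 3), round(pstdev(values), 3)]
        for metric, values in results.items()
    }
```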

tdc.utils.query module

Utility functions for queries

tdc.utils.query.cid2smiles(cid)[source]

SMILES string from PubChem CID

Parameters

cid (str) – PubChem CID

Returns

SMILES string

Return type

str

tdc.utils.query.request(identifier, namespace='cid', domain='compound', operation=None, output='JSON', searchtype=None)[source]

copied from PubChemPy (https://github.com/mcs07/PubChemPy/blob/e3c4f4a9b6120433e5cc3383464c7a79e9b2b86e/pubchempy.py#L238). Constructs an API request from the parameters and returns the response. Full specification: http://pubchem.ncbi.nlm.nih.gov/pug_rest/PUG_REST.html

tdc.utils.query.uniprot2seq(ProteinID)[source]

Get protein sequence from Uniprot ID

Parameters

ProteinID (str) – the uniprot ID

Returns

amino acid sequence

Return type

str

tdc.utils.retrieve module

Utility functions for dataset/metadata retrieval

tdc.utils.retrieve.get_label_map(name, path='./data', target=None, file_format='csv', output_format='dict', task='DDI', name_column='Map')[source]

to retrieve the biomedical meaning of labels

Parameters
  • name (str) – the name of the dataset

  • path (str, optional) – the dataset path, where the data is located

  • target (None, optional) – the label name

  • file_format (str, optional) – format of the file

  • output_format (str, optional) – return a dictionary or a dataframe or the raw array of mapped labels

  • task (str, optional) – the name of the task

  • name_column (str, optional) – the name of the column that stores the label name

Returns

the label mapping as a dictionary, a dataframe, or an array, depending on whether output_format is dict/df/array

Return type

dict/pd.DataFrame/np.array

Raises

ValueError – output_format not supported.

tdc.utils.retrieve.get_reaction_type(name, path='./data', output_format='array')[source]

to retrieve the type of reactions for reaction dataset

Parameters
  • name (str) – dataset name

  • path (str, optional) – dataset path

  • output_format (str, optional) – output format in dataframe or in raw array format

Returns

the reaction types as a dataframe or an array, depending on whether output_format is df/array

Return type

pd.DataFrame/np.array

Raises

ValueError – the output format is not supported

tdc.utils.retrieve.retrieve_all_benchmarks()[source]

to get all available benchmark groups

Returns

a list of benchmark group names

Return type

list

tdc.utils.retrieve.retrieve_benchmark_names(name)[source]

to get all available benchmarks given a query benchmark group

Parameters

name (str) – the name of the benchmark group

Returns

a list of benchmarks

Return type

list

tdc.utils.retrieve.retrieve_dataset_names(name)[source]

to get all available dataset names given a task

Parameters

name (str) – the name of query task

Returns

a list of available datasets

Return type

list

tdc.utils.retrieve.retrieve_label_name_list(name)[source]

get the set of available labels for query dataset

Parameters

name (str) – rough dataset name

Returns

a list of available labels

Return type

list

tdc.utils.split module

Utility functions for splitting datasets

tdc.utils.split.create_combination_split(df, seed, frac)[source]

Function for splitting drug combination dataset such that no combinations are shared across the split

Parameters
  • df (pd.DataFrame) – dataset to split

  • seed (int) – random seed

  • frac (list) – split fraction as a list

Returns

a dictionary of split dataframes, where keys are train/valid/test and values correspond to each dataframe

Return type

dict

tdc.utils.split.create_fold(df, fold_seed, frac)[source]

create random split

Parameters
  • df (pd.DataFrame) – dataset dataframe

  • fold_seed (int) – the random seed

  • frac (list) – a list of train/valid/test fractions

Returns

a dictionary of split dataframes, where keys are train/valid/test and values correspond to each dataframe

Return type

dict
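
A seeded random split into train/valid/test can be sketched as follows. To stay dependency-free, the example works on a plain sequence of rows rather than a dataframe; `random_split_sketch` and its rounding of partition sizes are assumptions, not the library's code.

```python
import random


def random_split_sketch(rows, seed, frac):
    """Shuffle rows with a fixed seed and cut them into train/valid/test."""
    train_frac, valid_frac, _ = frac
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # same seed gives the same split
    n_train = round(train_frac * len(shuffled))
    n_valid = round(valid_frac * len(shuffled))
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],  # remainder goes to test
    }
```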

tdc.utils.split.create_fold_setting_cold(df, fold_seed, frac, entities)[source]

create a cold split: given one or more columns, it first splits the entities in those columns and then maps all associated data points to the corresponding partition

Parameters
  • df (pd.DataFrame) – dataset dataframe

  • fold_seed (int) – the random seed

  • frac (list) – a list of train/valid/test fractions

  • entities (Union[str, List[str]]) – either a single “cold” entity or a list of “cold” entities on which the split is done

Returns

a dictionary of split dataframes, where keys are train/valid/test and values correspond to each dataframe

Return type

dict
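
The two-step cold split described above (partition the entities, then route each row by its entity) can be sketched for a single entity column. Rows are plain dicts here instead of a dataframe, and `cold_split_sketch` is a hypothetical name; handling several cold columns at once is left out of this illustration.

```python
import random


def cold_split_sketch(rows, seed, frac, entity):
    """Split so that no value of rows[i][entity] appears in two partitions."""
    train_frac, valid_frac, _ = frac
    # Step 1: shuffle and partition the unique entities, not the rows.
    entities = sorted({row[entity] for row in rows})
    random.Random(seed).shuffle(entities)
    n_train = round(train_frac * len(entities))
    n_valid = round(valid_frac * len(entities))
    assignment = {}
    for index, value in enumerate(entities):
        if index < n_train:
            assignment[value] = "train"
        elif index < n_train + n_valid:
            assignment[value] = "valid"
        else:
            assignment[value] = "test"
    # Step 2: map every row to the partition of its entity.
    split = {"train": [], "valid": [], "test": []}
    for row in rows:
        split[assignment[row[entity]]].append(row)
    return split
```

Because the fractions apply to entities rather than rows, the resulting row counts only approximate the requested fractions.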

tdc.utils.split.create_fold_time(df, frac, date_column)[source]

create splits based on time

Parameters
  • df (pd.DataFrame) – the dataset dataframe

  • frac (list) – list of train/valid/test fractions

  • date_column (str) – the name of the column that contains the time info

Returns

a dictionary of split dataframes, where keys are train/valid/test and values correspond to each dataframe

Return type

dict
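
A time-based split keeps the oldest records for training and the newest for testing. The sketch below cuts by row count after sorting, which is a simplifying assumption; the library may instead cut at specific dates. `time_split_sketch` is a hypothetical name, rows are plain dicts, and the date values are assumed to be directly comparable (for example ISO date strings or integers).

```python
def time_split_sketch(rows, frac, date_column):
    """Sort rows chronologically and cut them into train/valid/test."""
    train_frac, valid_frac, _ = frac
    ordered = sorted(rows, key=lambda row: row[date_column])
    n_train = round(train_frac * len(ordered))
    n_valid = round(valid_frac * len(ordered))
    return {
        "train": ordered[:n_train],                    # oldest records
        "valid": ordered[n_train:n_train + n_valid],
        "test": ordered[n_train + n_valid:],           # newest records
    }
```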

tdc.utils.split.create_group_split(train_val, seed, holdout_frac, group_column)[source]

split within each stratification defined by the group column for training/validation split

Parameters
  • train_val (pd.DataFrame) – the train+valid dataframe to split on

  • seed (int) – the random seed

  • holdout_frac (float) – the fraction of validation

  • group_column (str) – the name of the group column

Returns

a dictionary of split dataframes, where keys are train/valid/test and values correspond to each dataframe

Return type

dict
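
Splitting within each group keeps every group represented on both sides of the train/validation boundary. The following sketch makes several assumptions: rows are plain dicts, `group_split_sketch` is a hypothetical name, the holdout size per group is rounded, and only train and valid partitions are returned because the input is described as train+valid data.

```python
import random
from collections import defaultdict


def group_split_sketch(rows, seed, holdout_frac, group_column):
    """Hold out a fraction of each group for validation."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_column]].append(row)
    split = {"train": [], "valid": []}
    for key in sorted(groups):  # fixed order keeps the split reproducible
        members = groups[key]
        rng.shuffle(members)
        n_valid = round(holdout_frac * len(members))
        split["valid"].extend(members[:n_valid])
        split["train"].extend(members[n_valid:])
    return split
```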

tdc.utils.split.create_scaffold_split(df, seed, frac, entity)[source]

create scaffold split. It first generates a molecular scaffold for each molecule and then splits based on the scaffolds. Reference: https://github.com/chemprop/chemprop/blob/master/chemprop/data/scaffold.py

Parameters
  • df (pd.DataFrame) – dataset dataframe

  • seed (int) – the random seed

  • frac (list) – a list of train/valid/test fractions

  • entity (str) – the name of the column that stores the molecules

Returns

a dictionary of split dataframes, where keys are train/valid/test and values correspond to each dataframe

Return type

dict
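
The grouping step of a scaffold split can be illustrated without a chemistry toolkit. In the sketch below, `scaffold_of` is a hypothetical callable standing in for a real scaffold generator (for example Murcko scaffolds computed with RDKit), and the greedy largest-group-first assignment is one common strategy, not necessarily what this function does. The sketch is deterministic and omits the seed.

```python
from collections import defaultdict


def scaffold_split_sketch(molecules, frac, scaffold_of):
    """Assign whole scaffold groups to train/valid/test, largest groups first."""
    train_frac, valid_frac, _ = frac
    groups = defaultdict(list)
    for molecule in molecules:
        groups[scaffold_of(molecule)].append(molecule)
    # Largest groups first; ties are broken by scaffold key for determinism.
    ordered = sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))
    train_cap = train_frac * len(molecules)
    valid_cap = valid_frac * len(molecules)
    split = {"train": [], "valid": [], "test": []}
    for _, members in ordered:
        if len(split["train"]) + len(members) <= train_cap:
            split["train"].extend(members)
        elif len(split["valid"]) + len(members) <= valid_cap:
            split["valid"].extend(members)
        else:
            split["test"].extend(members)
    return split
```

Since every molecule sharing a scaffold lands in the same partition, the test set contains scaffolds the model has not seen during training.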