tdc.base_dataset

This module contains a base data loader class that task-specific data loaders can inherit from.

class tdc.base_dataset.DataLoader

Bases: object

Base data loader class that contains functions shared by almost all data loader classes.

balanced(oversample=False, seed=42)

Balance the ratio of negative to positive labels.

Parameters:

oversample (bool, optional) – whether to oversample the minority class (True) or subsample the majority class (False) to match the ratio
seed (int, optional) – random seed

Returns:

the updated dataframe with balanced classes

Return type:

pd.DataFrame

Raises:

AttributeError – raised when the labels are continuous; binarize the data first, since continuous values cannot be balanced
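The two balancing strategies described above can be sketched with plain pandas. This is a minimal illustration, not the library's implementation: the helper name `balance_labels`, the label column name `Y`, and the toy data are all assumptions made for the example.

```python
import pandas as pd


def balance_labels(df: pd.DataFrame, oversample: bool = False, seed: int = 42) -> pd.DataFrame:
    """Hypothetical helper: equalize the two classes in the binary column 'Y'."""
    counts = df["Y"].value_counts()  # sorted by count, largest class first
    if len(counts) != 2:
        # Mirrors the documented AttributeError for labels that are not binary.
        raise AttributeError("Binarize the labels first; balancing needs exactly two classes.")
    major_class, minor_class = counts.index[0], counts.index[1]
    major = df[df["Y"] == major_class]
    minor = df[df["Y"] == minor_class]
    if oversample:
        # Draw minority rows with replacement until they match the majority.
        minor = minor.sample(n=len(major), replace=True, random_state=seed)
    else:
        # Draw a majority subset without replacement to match the minority.
        major = major.sample(n=len(minor), replace=False, random_state=seed)
    # Shuffle so the two classes are interleaved in the result.
    return pd.concat([major, minor]).sample(frac=1, random_state=seed).reset_index(drop=True)


toy = pd.DataFrame({"Drug": list("abcdefgh"), "Y": [1, 1, 0, 0, 0, 0, 0, 0]})
down = balance_labels(toy)                 # subsample: 2 positives, 2 negatives
up = balance_labels(toy, oversample=True)  # oversample: 6 positives, 6 negatives
```

Subsampling discards majority rows, while oversampling repeats minority rows, so the second option keeps all the data at the cost of duplicates.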

binarize(threshold=None, order='descending')

Binarize the labels.

Parameters:

threshold (float, optional) – the threshold used to binarize the labels
order (str, optional) – the order of binarization: if ascending, larger values are labeled 1, and vice versa for descending

Returns:

the data loader object with updated labels

Return type:

DataLoader

Raises:

AttributeError – no threshold specified for binarization
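A small NumPy sketch of the thresholding described above. The helper name `binarize_labels`, the strict inequalities, and the example values are assumptions for illustration; the library's handling of values exactly equal to the threshold may differ.

```python
import numpy as np


def binarize_labels(y, threshold=None, order="descending"):
    """Hypothetical helper mirroring the documented parameters."""
    if threshold is None:
        # Mirrors the documented AttributeError for a missing threshold.
        raise AttributeError("Please specify a threshold for binarization.")
    y = np.asarray(y, dtype=float)
    if order == "ascending":
        # Larger values become the positive class.
        return (y > threshold).astype(int)
    if order == "descending":
        # Smaller values become the positive class.
        return (y < threshold).astype(int)
    raise ValueError("order must be 'ascending' or 'descending'")


values = [5.0, 30.0, 120.0, 800.0]
asc = binarize_labels(values, threshold=100, order="ascending")    # [0, 0, 1, 1]
desc = binarize_labels(values, threshold=100, order="descending")  # [1, 1, 0, 0]
```

The descending order suits measurements where a smaller number means a stronger effect, so low values end up as the positive class.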

convert_from_log(form='standard')

Convert labels from log scale.

Parameters:

form (str, optional) – standard log-transformation or binding nM <-> p transformation.

convert_to_log(form='standard')

Convert labels to log scale.

Parameters:

form (str, optional) – standard log-transformation or binding nM <-> p transformation.
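For the nM <-> p transformation, the usual convention for binding data is p = -log10(concentration in molar), which for nanomolar inputs equals 9 - log10(nM). The sketch below shows that conversion and its inverse. The function names are hypothetical, and the library may differ in details such as adding a small constant to avoid log(0).

```python
import numpy as np


def nm_to_p(y_nm):
    """Convert nanomolar values to the p-scale: p = -log10(nM * 1e-9)."""
    return -np.log10(np.asarray(y_nm, dtype=float) * 1e-9)


def p_to_nm(y_p):
    """Invert nm_to_p: nM = 10**(-p) / 1e-9."""
    return 10.0 ** (-np.asarray(y_p, dtype=float)) / 1e-9


binding_nm = np.array([1.0, 10.0, 1000.0])
p_values = nm_to_p(binding_nm)   # approximately [9, 8, 6]
round_trip = p_to_nm(p_values)   # approximately the original nM values
```

On the p-scale a larger number means tighter binding, which reverses the ordering of the raw nM values.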

get_data(format='df')

get_label_meaning(output_format='dict')

Get the biomedical meaning of the labels.

Parameters:

output_format (str, optional) – format of the returned label meaning: dict, df, or array

Returns:

the label meaning, in the format selected by output_format

Return type:

dict/pd.DataFrame/np.array

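get_split(method='random', seed=42, frac=[0.7, 0.1, 0.2])

Split function; overridden by single_pred/multi_pred/generation for more specific splits.

Parameters:

method (str, optional) – split method
seed (int, optional) – random seed
frac (list, optional) – train/valid/test split fractions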

Returns:

a dictionary of train/valid/test dataframes

Return type:

dict

Raises:

AttributeError – split method not supported
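A random split with the default fractions can be sketched as follows. This is an illustrative helper, not the library's code: the name `random_split`, the rounding of split sizes, and the toy dataframe are assumptions.

```python
import pandas as pd


def random_split(df, frac=(0.7, 0.1, 0.2), seed=42):
    """Hypothetical helper: shuffle rows and cut them into train/valid/test."""
    if abs(sum(frac) - 1.0) > 1e-8:
        raise ValueError("frac must sum to 1")
    # Shuffle once with a fixed seed so the split is reproducible.
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    n_train = round(len(shuffled) * frac[0])
    n_valid = round(len(shuffled) * frac[1])
    # The test set takes whatever remains after train and valid.
    return {
        "train": shuffled.iloc[:n_train].reset_index(drop=True),
        "valid": shuffled.iloc[n_train:n_train + n_valid].reset_index(drop=True),
        "test": shuffled.iloc[n_train + n_valid:].reset_index(drop=True),
    }


toy = pd.DataFrame({"Drug": [f"mol_{i}" for i in range(20)], "Y": range(20)})
splits = random_split(toy)
sizes = {name: len(part) for name, part in splits.items()}
```

With 20 rows and the default fractions, the three parts hold 14, 2, and 4 rows, and every row lands in exactly one part.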