tdc.base_dataset

This file contains a base data loader object that specific data loaders can inherit from.

class tdc.base_dataset.DataLoader

Bases: object

base data loader class that contains functions shared by almost all data loader classes.

balanced(oversample=False, seed=42)

balance the ratio of negative to positive labels

Parameters:
  • oversample (bool, optional) – if True, oversample the minority class; if False, subsample the majority class, so that both classes have the same size

  • seed (int, optional) – random seed

Returns:

the balanced dataset as a dataframe

Return type:

pd.DataFrame

Raises:

AttributeError – the labels are continuous; binarize the data first, since continuous values cannot be balanced
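
The two balancing strategies can be illustrated with a minimal standalone sketch. It works on plain Python lists rather than the dataframe that balanced operates on, and the helper name balance_labels is hypothetical and not part of TDC. It assumes binary labels and only shows the difference between subsampling the majority class and oversampling the minority class.

```python
import random
from collections import Counter


def balance_labels(rows, oversample=False, seed=42):
    """Balance (x, y) rows with binary labels; hypothetical helper, not TDC code."""
    counts = Counter(y for _, y in rows)
    if len(counts) != 2:
        # Mirrors the documented behavior: non-binary labels must be binarized first.
        raise AttributeError("binarize the labels before balancing")
    (major, n_major), (minor, n_minor) = counts.most_common(2)
    major_rows = [r for r in rows if r[1] == major]
    minor_rows = [r for r in rows if r[1] == minor]
    rng = random.Random(seed)
    if oversample:
        # Draw minority rows with replacement until they match the majority count.
        minor_rows = [rng.choice(minor_rows) for _ in range(n_major)]
    else:
        # Draw majority rows without replacement down to the minority count.
        major_rows = rng.sample(major_rows, n_minor)
    balanced = major_rows + minor_rows
    rng.shuffle(balanced)
    return balanced


rows = [(i, 1) for i in range(3)] + [(i, 0) for i in range(3, 12)]
sub = balance_labels(rows)                    # 3 positives, 3 negatives
over = balance_labels(rows, oversample=True)  # 9 positives, 9 negatives
```

Subsampling discards data but keeps every row unique, while oversampling keeps all majority rows at the cost of duplicated minority rows.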

binarize(threshold=None, order='descending')

binarize the labels

Parameters:
  • threshold (float, optional) – the threshold to binarize the label.

  • order (str, optional) – the order of binarization; if ascending, larger values are labeled 1, and vice versa for descending

Returns:

the data loader object with updated labels

Return type:

DataLoader

Raises:

AttributeError – no threshold specified for binarization
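
The effect of threshold and order can be sketched as follows. The function binarize_labels is a hypothetical stand-in written for illustration, not TDC's implementation. The strict comparisons, and therefore the label given to values exactly equal to the threshold, are assumptions.

```python
def binarize_labels(y, threshold=None, order="descending"):
    """Hypothetical helper mirroring the documented threshold/order semantics."""
    if threshold is None:
        raise AttributeError("no threshold specified for binarization")
    if order == "ascending":
        # Larger values become the positive class.
        return [1 if v > threshold else 0 for v in y]
    if order == "descending":
        # Smaller values become the positive class.
        return [1 if v < threshold else 0 for v in y]
    raise ValueError(f"unknown order: {order!r}")


labels = [0.5, 2.0, 7.5, 10.0]
asc = binarize_labels(labels, threshold=5.0, order="ascending")
desc = binarize_labels(labels, threshold=5.0, order="descending")
print(asc)   # [0, 0, 1, 1]
print(desc)  # [1, 1, 0, 0]
```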

convert_from_log(form='standard')

convert labels from log-scale

Parameters:

form (str, optional) – standard for a standard log-transformation, or binding for the binding nM <-> p transformation.

convert_to_log(form='standard')

convert labels to log-scale

Parameters:

form (str, optional) – standard for a standard log-transformation, or binding for the binding nM <-> p transformation.
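
As a rough illustration of the binding nM <-> p transformation, the sketch below assumes that p denotes the negative base-10 logarithm of the molar concentration, as in pIC50 or pKd. The function names nm_to_p and p_to_nm are hypothetical, and TDC's own conversion may differ in details such as added numerical offsets.

```python
import math


def nm_to_p(nm):
    """Convert a concentration in nM to the p-scale: p = -log10(molar)."""
    return -math.log10(nm * 1e-9)


def p_to_nm(p):
    """Invert nm_to_p: recover the concentration in nM."""
    return 10 ** (-p) / 1e-9


p = nm_to_p(100.0)   # 100 nM corresponds to about p = 7
back = p_to_nm(p)    # round-trips to about 100 nM
```

Under this assumption a tenfold drop in concentration raises p by one unit, so stronger binders get larger p values.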

get_data(format='df')

get the dataset in the requested format

Parameters:

format (str, optional) – the dataset format

Returns:

the dataset as a pd.DataFrame, dict, or np.array when format is df, dict, or DeepPurpose, respectively

Return type:

pd.DataFrame/dict/np.array

Raises:

AttributeError – format not supported

get_label_meaning(output_format='dict')

get the biomedical meaning of the labels

Parameters:

output_format (str, optional) – the output format of the label meaning: dict, df, or array

Returns:

the label meaning as a dict, pd.DataFrame, or np.array when output_format is dict, df, or array, respectively

Return type:

dict/pd.DataFrame/np.array

get_split(method='random', seed=42, frac=[0.7, 0.1, 0.2])

split function, overridden by single_pred/multi_pred/generation for more specific splits

Parameters:
  • method (str, optional) – the splitting scheme

  • seed (int, optional) – random seed

  • frac (list, optional) – train/valid/test split fractions

Returns:

a dictionary of train/valid/test dataframes

Return type:

dict

Raises:

AttributeError – split method not supported
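
A seeded random split by fractions can be sketched as below. This is a minimal illustration under stated assumptions, not TDC's code: the function random_split is hypothetical, it operates on a plain sequence instead of a dataframe, and it covers only the random scheme.

```python
import math
import random


def random_split(rows, seed=42, frac=(0.7, 0.1, 0.2)):
    """Hypothetical random splitter returning train/valid/test lists."""
    if not math.isclose(sum(frac), 1.0):
        raise ValueError("fractions must sum to 1")
    shuffled = list(rows)
    # A dedicated, seeded generator makes the shuffle reproducible.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * frac[0])
    n_valid = int(len(shuffled) * frac[1])
    # The test set receives the remainder, so every row lands in exactly one split.
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],
    }


split = random_split(range(100))
sizes = {name: len(part) for name, part in split.items()}
```

Calling the function again with the same seed returns the same partition, which is the purpose of the seed parameter.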

label_distribution()

visualize distribution of labels

print_stats()

print statistics