tdc.base_dataset

This file contains a base data loader object that specific data loaders can inherit from.

class tdc.base_dataset.DataLoader

Bases: object

base data loader class that contains functions shared by almost all data loader classes.

balanced(oversample=False, seed=42)

balance the ratio of negative to positive labels

Parameters
  • oversample (bool, optional) – if True, oversample the minority class; if False, subsample the majority class to match the ratio

  • seed (int, optional) – random seed

Returns

the updated dataframe containing the balanced dataset

Return type

pd.DataFrame

Raises

AttributeError – the labels are continuous; binarize the data first, since continuous values cannot be balanced
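The two balancing strategies can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper name balance_labels and the (item, label) row layout are assumptions made for the example, and a list of tuples stands in for the dataframe.

```python
import random


def balance_labels(rows, oversample=False, seed=42):
    """Balance binary-labeled rows (hypothetical helper, not part of tdc).

    Each row is an (item, label) pair whose label must be 0 or 1.
    """
    rng = random.Random(seed)
    pos = [r for r in rows if r[1] == 1]
    neg = [r for r in rows if r[1] == 0]
    if len(pos) + len(neg) != len(rows):
        # Mirrors the documented failure mode: continuous labels
        # have to be binarized before they can be balanced.
        raise AttributeError("labels are not binary; binarize them first")
    minority, majority = sorted([pos, neg], key=len)
    if oversample:
        # Resample the minority class with replacement up to the majority size.
        minority = [rng.choice(minority) for _ in range(len(majority))]
    else:
        # Sample the majority class without replacement down to the minority size.
        majority = rng.sample(majority, len(minority))
    balanced = minority + majority
    rng.shuffle(balanced)
    return balanced


# Toy data: 2 positives and 8 negatives.
rows = [(f"mol{i}", 1) for i in range(2)] + [(f"mol{i}", 0) for i in range(2, 10)]
down = balance_labels(rows)                  # subsample the majority class
up = balance_labels(rows, oversample=True)   # oversample the minority class
```

Subsampling discards data but keeps every row unique; oversampling keeps all majority rows at the cost of repeated minority rows.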

binarize(threshold=None, order='descending')

binarize the labels

Parameters
  • threshold (float, optional) – the threshold to binarize the label.

  • order (str, optional) – the order of binarization; if ascending, larger values are labeled 1, and vice versa for descending

Returns

data loader class with updated label

Return type

DataLoader

Raises

AttributeError – no threshold specified for binarization
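A minimal sketch of threshold binarization under the two orders described above. The helper name binarize_labels is hypothetical, and the treatment of values exactly equal to the threshold (labeled 0 here) is an assumption, since the documentation does not specify it.

```python
def binarize_labels(values, threshold=None, order="descending"):
    """Turn continuous labels into 0/1 labels (hypothetical helper, not part of tdc)."""
    if threshold is None:
        # Mirrors the documented error when no threshold is given.
        raise AttributeError("no threshold specified for binarization")
    if order == "ascending":
        # Larger values are labeled 1.
        return [1 if v > threshold else 0 for v in values]
    if order == "descending":
        # Smaller values are labeled 1.
        return [1 if v < threshold else 0 for v in values]
    raise ValueError(f"unknown order: {order!r}")


values = [0.2, 1.5, 3.0, 4.8]
desc = binarize_labels(values, threshold=2.0)                     # [1, 1, 0, 0]
asc = binarize_labels(values, threshold=2.0, order="ascending")   # [0, 0, 1, 1]
```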

convert_from_log(form='standard')

convert labels from log-scale

Parameters

form (str, optional) – standard log-transformation or binding nM <-> p transformation.

convert_to_log(form='standard')

convert labels to log-scale

Parameters

form (str, optional) – standard log-transformation or binding nM <-> p transformation.
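The binding nM <-> p transformation follows the usual p-scale convention, p = -log10(concentration in mol/L), which for a value in nM is 9 - log10(value). The sketch below covers only that form; the function names nm_to_p and p_to_nm are hypothetical, and any numerical offset the library may add for stability is not modeled.

```python
import math


def nm_to_p(value_nm):
    """Convert a concentration in nM to the p-scale (hypothetical helper)."""
    # 1 nM = 1e-9 mol/L, so -log10(value_nm * 1e-9) = 9 - log10(value_nm).
    return 9.0 - math.log10(value_nm)


def p_to_nm(p_value):
    """Invert nm_to_p: recover the concentration in nM from a p-scale value."""
    return 10.0 ** (9.0 - p_value)


# 1000 nM is 1 micromolar, which sits at p = 6 on the p-scale.
p = nm_to_p(1000.0)
round_trip = p_to_nm(nm_to_p(25.0))
```

Note that the p-scale reverses the ordering: a smaller concentration gives a larger p value.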

get_data(format='df')

Parameters

format (str, optional) – the dataset format

Returns

the dataset as a pd.DataFrame, dict, or np.array when format is df, dict, or DeepPurpose, respectively

Return type

pd.DataFrame/dict/np.array

Raises

AttributeError – format not supported

get_label_meaning(output_format='dict')[source]

get the biomedical meaning of label

Parameters

output_format (str, optional) – the format of the returned label meaning: dict, df, or array

Returns

the label meaning as a dict, pd.DataFrame, or np.array when output_format is dict, df, or array, respectively

Return type

dict/pd.DataFrame/np.array

get_split(method='random', seed=42, frac=[0.7, 0.1, 0.2])

split function, overridden by single_pred/multi_pred/generation for more specific splits

Parameters
  • method (str, optional) – the splitting scheme

  • seed (int, optional) – random seed

  • frac (list, optional) – train/valid/test split fractions

Returns

a dictionary of train/valid/test dataframes

Return type

dict

Raises

AttributeError – split method not supported
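A minimal sketch of a seeded random split that returns a train/valid/test dictionary. The function name split_rows is hypothetical, a plain list stands in for the dataframe, and the rounding of split sizes is an assumption; the library's own splitters may differ.

```python
import math
import random


def split_rows(rows, method="random", seed=42, frac=(0.7, 0.1, 0.2)):
    """Split rows into train/valid/test lists (hypothetical helper, not part of tdc)."""
    if method != "random":
        # Mirrors the documented error for unsupported split methods.
        raise AttributeError(f"split method not supported: {method!r}")
    if not math.isclose(sum(frac), 1.0):
        raise ValueError("frac must sum to 1")
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * frac[0])
    n_valid = round(len(shuffled) * frac[1])
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        # The test set takes the remainder, so no row is dropped by rounding.
        "test": shuffled[n_train + n_valid:],
    }


rows = list(range(100))
split = split_rows(rows)
```

Fixing the seed makes the split reproducible, which matters when results from different runs are compared.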

label_distribution()

visualize distribution of labels

print_stats()

print statistics