tdc.chem_utils

tdc.chem_utils.featurize module

tdc.chem_utils.featurize.molconvert submodule

class tdc.chem_utils.featurize.molconvert.MolConvert(src='SMILES', dst='Graph2D', radius=2, nBits=1024)[source]#

Bases: object

MolConvert: convert a molecule from the src format to the dst format.

Example

    convert = MolConvert(src = 'SMILES', dst = 'Graph2D')
    g = convert('Clc1ccccc1C2C(=C(/N/C(=C2/C(=O)OCC)COCCN)C)C(=O)OC')
    # g: graph with edge, node features
    g = convert(['Clc1ccccc1C2C(=C(/N/C(=C2/C(=O)OCC)COCCN)C)C(=O)OC',
                 'CCCOc1cc2ncnc(Nc3ccc4ncsc4c3)c2cc1S(=O)(=O)C(C)(C)C'])
    # g: a list of graphs with edge, node features

If src is 2D, dst can only be a 2D output. If src is 3D, dst can be either a 2D or a 3D output.

src:

  • 2D - [SMILES, SELFIES]

  • 3D - [SDF file, XYZ file]

dst:

  • 2D - [2D Graph (+ PyG, DGL format), Canonical SMILES, SELFIES, Fingerprints]

  • 3D - [3D graphs (adj matrix entry is (distance, bond type)), Coulomb Matrix]

static eligible_format(src=None)[source]#

Given a src format, output all the dst formats available for that src format.

Example

    MolConvert.eligible_format('SMILES')
    ## ['Graph', 'SMARTS', …]

class tdc.chem_utils.featurize.molconvert.MoleculeFingerprint(fp='ECFP4')[source]#

Bases: object

Example

    MolFP = MoleculeFingerprint(fp = 'ECFP6')
    out = MolFP('Clc1ccccc1C2C(=C(/N/C(=C2/C(=O)OCC)COCCN)C)C(=O)OC')
    # np.array([1, 0, 1, …..])
    out = MolFP(['Clc1ccccc1C2C(=C(/N/C(=C2/C(=O)OCC)COCCN)C)C(=O)OC',
                 'CCCOc1cc2ncnc(Nc3ccc4ncsc4c3)c2cc1S(=O)(=O)C(C)(C)C'])
    # np.array([[1, 0, 1, …..],
    #           [0, 0, 1, …..]])

Supported fingerprints: Basic_Descriptors (atoms, chirality, ….), ECFP2, ECFP4, ECFP6, MACCS, Daylight-type, RDKit2D, Morgan, PubChem

tdc.chem_utils.featurize.molconvert.atom2onehot(atom)[source]#

convert atom to one-hot feature vector

Parameters:

'C'

Returns:

[1, 0, 0, 0, 0, ..]
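A minimal sketch of one-hot encoding for an atom symbol. The vocabulary `ATOM_VOCAB` and the helper name `atom_to_onehot` are hypothetical; the actual vocabulary and ordering used by `atom2onehot` are not stated on this page.

```python
# Hypothetical vocabulary; the list used by the library may differ.
ATOM_VOCAB = ["C", "N", "O", "F", "S"]


def atom_to_onehot(atom, vocab=ATOM_VOCAB):
    """Return a list with a single 1 at the position of `atom` in `vocab`."""
    if atom not in vocab:
        raise ValueError(f"unknown atom symbol: {atom!r}")
    return [1 if symbol == atom else 0 for symbol in vocab]


print(atom_to_onehot("C"))  # [1, 0, 0, 0, 0]
```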

tdc.chem_utils.featurize.molconvert.atomstring2atomfeature(atom_string_list)[source]#
tdc.chem_utils.featurize.molconvert.bondtype2idx(bond_type)[source]#
tdc.chem_utils.featurize.molconvert.canonicalize(smiles)[source]#
tdc.chem_utils.featurize.molconvert.distance3d(coordinate_1, coordinate_2)[source]#
tdc.chem_utils.featurize.molconvert.get_atom_features(atom)[source]#
tdc.chem_utils.featurize.molconvert.get_mol(smiles)[source]#
tdc.chem_utils.featurize.molconvert.mol2file2smiles(molfile)[source]#

convert mol2file into SMILES string

Parameters:

molfile – str, a MOL2 file.

Returns:

str, a SMILES string

Return type:

smiles

tdc.chem_utils.featurize.molconvert.mol2smiles(mol)[source]#
tdc.chem_utils.featurize.molconvert.mol_conformer2graph3d(mol_conformer_lst)[source]#

Convert a list of (molecule, conformer) tuples into a list of 3D graphs.

Parameters:

mol_conformer_lst – list of tuple (molecule, conformer)

Returns:

a list of 3D graphs.

Each graph has (i) idx2atom (dict); (ii) distance_adj_matrix (np.array); (iii) bondtype_adj_matrix (np.array)

Return type:

graph3d_lst
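The three components of a 3D graph can be illustrated with a small pure-Python sketch. The helper `build_graph3d`, its input layout, and the bond-type indices are assumptions made for illustration; nested lists stand in for the np.array matrices named above.

```python
import math


def build_graph3d(atoms, coords, bonds):
    """Build (idx2atom, distance_adj_matrix, bondtype_adj_matrix).

    atoms:  list of element symbols, length N
    coords: list of (x, y, z) tuples, length N
    bonds:  dict mapping (i, j) index pairs to a bond-type index (hypothetical encoding)
    """
    n = len(atoms)
    idx2atom = {i: symbol for i, symbol in enumerate(atoms)}
    # Pairwise Euclidean distances between atom positions.
    distance_adj_matrix = [
        [math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)
    ]
    # Symmetric bond-type matrix; 0 means "no bond" in this sketch.
    bondtype_adj_matrix = [[0] * n for _ in range(n)]
    for (i, j), bond_type in bonds.items():
        bondtype_adj_matrix[i][j] = bond_type
        bondtype_adj_matrix[j][i] = bond_type
    return idx2atom, distance_adj_matrix, bondtype_adj_matrix


# Toy three-atom example with made-up coordinates.
idx2atom, dist, bond = build_graph3d(
    ["O", "H", "H"],
    [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)],
    {(0, 1): 1, (0, 2): 1},
)
```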

tdc.chem_utils.featurize.molconvert.molfile2PyG(molfile)[source]#
tdc.chem_utils.featurize.molconvert.molfile2smiles(molfile)[source]#

convert molfile into SMILES string

Parameters:

molfile – str, a file.

Returns:

str, a SMILES string

Return type:

smiles

tdc.chem_utils.featurize.molconvert.onek_encoding_unk(x, allowable_set)[source]#
tdc.chem_utils.featurize.molconvert.raw3D2pyg(raw3d_feature)[source]#

convert raw3d feature to pyg (torch-geometric) feature

Parameters:

raw3d_feature – (atom_string_list, positions, y) - atom_string_list: list, each element is an atom, length is N - positions: np.array, shape: (N,3) - y: float

Returns:

data = Data(x=x, pos=pos, y=y)

tdc.chem_utils.featurize.molconvert.sdffile2coulomb(sdf)[source]#

Convert an SDF file into a list of Coulomb features.

Parameters:

sdf – str, an SDF file

Returns:

np.array

Return type:

coulomb feature
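For orientation, here is a sketch of the commonly used Coulomb matrix definition: diagonal entries are 0.5 · Z_i^2.4 and off-diagonal entries are Z_i · Z_j / |R_i − R_j|. That the library uses exactly this definition, and whether it sorts or pads the matrix, is an assumption not confirmed on this page; `coulomb_matrix` is a hypothetical helper.

```python
import math


def coulomb_matrix(charges, coords):
    """Coulomb matrix from nuclear charges Z_i and positions R_i (commonly used definition)."""
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal: 0.5 * Z_i ** 2.4
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                # Off-diagonal: pairwise nuclear repulsion Z_i * Z_j / distance
                matrix[i][j] = charges[i] * charges[j] / math.dist(coords[i], coords[j])
    return matrix


# Two hydrogen nuclei 1.4 length units apart (toy input).
m = coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0)])
```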

tdc.chem_utils.featurize.molconvert.sdffile2graph3d_lst(sdffile)[source]#

Convert an SDF file into a list of 3D graphs.

Parameters:

sdffile – SDF file

Returns:

a list of 3D graphs.

Each graph has (i) idx2atom (dict); (ii) distance_adj_matrix (np.array); (iii) bondtype_adj_matrix (np.array)

Return type:

graph3d_lst

tdc.chem_utils.featurize.molconvert.sdffile2mol_conformer(sdffile)[source]#

convert sdffile into a list of molecule conformers.

Parameters:

sdffile – str, file

Returns:

a list of molecule conformers.

Return type:

smiles_lst

tdc.chem_utils.featurize.molconvert.sdffile2selfies_lst(sdf)[source]#

convert sdffile into a list of SELFIES strings.

Parameters:

sdf – str, an SDF file

Returns:

a list of SELFIES strings.

Return type:

selfies_lst

tdc.chem_utils.featurize.molconvert.sdffile2smiles_lst(sdffile)[source]#

Convert an SDF file into a list of SMILES strings.

Parameters:

sdffile – str, file

Returns:

a list of SMILES strings.

Return type:

smiles_lst

tdc.chem_utils.featurize.molconvert.selfies2smiles(selfies)[source]#

Convert selfies into smiles.

Parameters:

selfies – str, a SELFIES string.

Returns:

str, a SMILES string

Return type:

smiles

tdc.chem_utils.featurize.molconvert.smiles2DGL(smiles)[source]#

convert SMILES string into dgl.DGLGraph

Parameters:

smiles – str, a SMILES string

Returns:

dgl.DGLGraph()

Return type:

g

tdc.chem_utils.featurize.molconvert.smiles2ECFP2(smiles)[source]#

Convert smiles into ECFP2 Morgan Fingerprint.

Parameters:

smiles – str

Returns:

rdkit.DataStructs.cDataStructs.UIntSparseIntVect

Return type:

fp

tdc.chem_utils.featurize.molconvert.smiles2ECFP4(smiles)[source]#

Convert smiles into ECFP4 Morgan Fingerprint.

Parameters:

smiles – str

Returns:

rdkit.DataStructs.cDataStructs.UIntSparseIntVect

Return type:

fp

tdc.chem_utils.featurize.molconvert.smiles2ECFP6(smiles)[source]#

Convert smiles into ECFP6 Morgan Fingerprint.

Parameters:

smiles – str, a SMILES string

Returns:

rdkit.DataStructs.cDataStructs.UIntSparseIntVect

Return type:

fp

Reference: https://github.com/rdkit/benchmarking_platform/blob/master/scoring/fingerprint_lib.py

tdc.chem_utils.featurize.molconvert.smiles2PyG(smiles)[source]#

convert SMILES string into torch_geometric.data.Data

Parameters:

smiles – str, a SMILES string

Returns:

data, torch_geometric.data.Data

tdc.chem_utils.featurize.molconvert.smiles2daylight(s)[source]#

Convert smiles into 2048-dim Daylight feature.

Parameters:

s – str, a SMILES string

Returns:

numpy.array

Return type:

fp

tdc.chem_utils.featurize.molconvert.smiles2graph2D(smiles)[source]#

convert SMILES string into two-dimensional molecular graph feature

Parameters:

smiles – str, a SMILES string

Returns:

  • idx2atom – dict, map from index to atom's symbol, e.g., {0:'C', 1:'N', …}

  • adj_matrix – np.array

tdc.chem_utils.featurize.molconvert.smiles2maccs(s)[source]#

Convert smiles into maccs feature.

Parameters:

s – str, a SMILES string

Returns:

numpy.array

Return type:

fp

tdc.chem_utils.featurize.molconvert.smiles2mol(smiles)[source]#

Convert SMILES string into rdkit.Chem.rdchem.Mol.

Parameters:

smiles – str, a SMILES string.

Returns:

rdkit.Chem.rdchem.Mol

Return type:

mol

tdc.chem_utils.featurize.molconvert.smiles2morgan(s, radius=2, nBits=1024)[source]#

Convert smiles into Morgan Fingerprint.

Parameters:
  • s – str, a SMILES string

  • radius – int (default: 2)

  • nBits – int (default: 1024)

Returns:

numpy.array

Return type:

fp

tdc.chem_utils.featurize.molconvert.smiles2rdkit2d(s)[source]#

Convert smiles into 200-dim Normalized RDKit 2D vector.

Parameters:

s – str, a SMILES string

Returns:

numpy.array

Return type:

fp

tdc.chem_utils.featurize.molconvert.smiles2selfies(smiles)[source]#

Convert smiles into selfies.

Parameters:

smiles – str, a SMILES string

Returns:

str, a SELFIES string.

Return type:

selfies

tdc.chem_utils.featurize.molconvert.smiles_lst2coulomb(smiles_lst)[source]#

convert a list of SMILES strings into coulomb format.

Parameters:

smiles_lst – a list of SMILES strings.

Returns:

np.array

Return type:

features

tdc.chem_utils.featurize.molconvert.upper_atom(atomsymbol)[source]#
tdc.chem_utils.featurize.molconvert.xyzfile2coulomb(xyzfile)[source]#
tdc.chem_utils.featurize.molconvert.xyzfile2graph3d(xyzfile)[source]#
tdc.chem_utils.featurize.molconvert.xyzfile2selfies(xyzfile)[source]#

convert xyzfile into SELFIES string.

Parameters:

xyzfile – str, file

Returns:

str, a SELFIES string.

Return type:

selfies

tdc.chem_utils.featurize.molconvert.xyzfile2smiles(xyzfile)[source]#

convert xyzfile into smiles string.

Parameters:

xyzfile – str, file

Returns:

str, a SMILES string

Return type:

smiles

tdc.chem_utils.oracle module

tdc.chem_utils.oracle.filter submodule

class tdc.chem_utils.oracle.filter.MolFilter(filters='all', property_filters_flag=True, HBA=[0, 10], HBD=[0, 5], LogP=[-5, 5], MW=[0, 500], Rot=[0, 10], TPSA=[0, 200])[source]#

Bases: object

Molecule filter: filter molecules based on user-specified conditions.

Parameters:
  • filters

  • property_filters_flag – bool

  • HBA – [lower_bound, upper_bound]

  • HBD – [lower_bound, upper_bound]

  • LogP – [lower_bound, upper_bound]

  • MW – [lower_bound, upper_bound], molecular weight

  • Rot – [lower_bound, upper_bound]

  • TPSA – [lower_bound, upper_bound]

Returns:

list of SMILES strings that pass the filter.
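The [lower_bound, upper_bound] parameters can be read as range checks on precomputed molecular properties. The sketch below uses the default bounds from the signature above; the helper `passes_property_filters`, the property dictionary layout, and the inclusive comparison are assumptions for illustration, not the library's implementation.

```python
# Default bounds copied from the MolFilter signature.
DEFAULT_BOUNDS = {
    "HBA": (0, 10),
    "HBD": (0, 5),
    "LogP": (-5, 5),
    "MW": (0, 500),
    "Rot": (0, 10),
    "TPSA": (0, 200),
}


def passes_property_filters(properties, bounds=DEFAULT_BOUNDS):
    """Return True if every property lies within its (inclusive) range."""
    return all(low <= properties[name] <= high for name, (low, high) in bounds.items())


# Made-up property values for two hypothetical molecules.
light = {"HBA": 4, "HBD": 2, "LogP": 2.1, "MW": 320.0, "Rot": 5, "TPSA": 75.0}
heavy = dict(light, MW=650.0)
print(passes_property_filters(light))  # True
print(passes_property_filters(heavy))  # False
```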

tdc.chem_utils.oracle.oracle submodule

class tdc.chem_utils.oracle.oracle.AbsoluteScoreModifier(target_value: float)[source]#

Bases: ScoreModifier

Score modifier that has a maximum at a given target value, and decreases linearly with increasing distance from the target value.
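A minimal sketch of such a modifier as a plain function. The peak value of 1 and the unit slope are assumptions; the description above only fixes the linear shape.

```python
def absolute_score(x, target_value):
    """Peak of 1.0 at target_value, decreasing linearly with |target_value - x|."""
    return 1.0 - abs(target_value - x)


print(absolute_score(3.0, target_value=3.0))  # 1.0
print(absolute_score(2.5, target_value=3.0))  # 0.5
```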

class tdc.chem_utils.oracle.oracle.AtomCounter(element)[source]#

Bases: object

class tdc.chem_utils.oracle.oracle.ChainedModifier(modifiers: List[ScoreModifier])[source]#

Bases: ScoreModifier

Calls several modifiers one after the other, for instance:

score = modifier3(modifier2(modifier1(raw_score)))
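The chaining order can be sketched with plain callables. The helper `chain` and the two toy modifiers are hypothetical stand-ins for ScoreModifier instances.

```python
from functools import reduce


def chain(modifiers):
    """Return a callable that applies each modifier in list order."""
    def chained(raw_score):
        return reduce(lambda score, modifier: modifier(score), modifiers, raw_score)
    return chained


def double(x):
    return 2.0 * x


def cap_at_one(x):
    return min(x, 1.0)


score = chain([double, cap_at_one])  # same as cap_at_one(double(raw_score))
print(score(0.3))  # 0.6
print(score(0.8))  # 1.0
```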

class tdc.chem_utils.oracle.oracle.ClippedScoreModifier(upper_x: float, lower_x=0.0, high_score=1.0, low_score=0.0)[source]#

Bases: ScoreModifier

Clips a score between specified low and high scores, and does a linear interpolation in between.

This class works as follows: First the input is mapped onto a linear interpolation between both specified points. Then the generated values are clipped between low and high scores.
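The two steps described above (linear interpolation through (lower_x, low_score) and (upper_x, high_score), then clipping) can be sketched as a standalone function. This is an illustration of the description, not the class itself.

```python
def clipped_score(x, upper_x, lower_x=0.0, high_score=1.0, low_score=0.0):
    """Linear interpolation between the two anchor points, clipped to [low_score, high_score]."""
    slope = (high_score - low_score) / (upper_x - lower_x)
    intercept = high_score - slope * upper_x
    y = slope * x + intercept
    return max(low_score, min(high_score, y))


print(clipped_score(-3.0, upper_x=10.0))  # 0.0
print(clipped_score(20.0, upper_x=10.0))  # 1.0
```

Choosing upper_x smaller than lower_x yields a decreasing ramp, since the slope becomes negative.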

class tdc.chem_utils.oracle.oracle.GaussianModifier(mu: float, sigma: float)[source]#

Bases: ScoreModifier

Score modifier that reproduces a Gaussian bell shape.
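A sketch of the bell shape, assuming an unnormalized Gaussian whose peak value is 1 at x = mu (the normalization is not stated above).

```python
import math


def gaussian_score(x, mu, sigma):
    """Unnormalized Gaussian: 1.0 at x == mu, falling off with width sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)


print(gaussian_score(2.0, mu=2.0, sigma=1.0))  # 1.0
```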

class tdc.chem_utils.oracle.oracle.Isomer_scoring(target_smiles, means='geometric')[source]#

Bases: object

class tdc.chem_utils.oracle.oracle.Isomer_scoring_prev(target_smiles, means='geometric')[source]#

Bases: object

class tdc.chem_utils.oracle.oracle.LinearModifier(slope=1.0)[source]#

Bases: ScoreModifier

Score modifier that multiplies the score by a scalar (default: 1, i.e. do nothing).

class tdc.chem_utils.oracle.oracle.MPO_meta(means)[source]#

Bases: object

class tdc.chem_utils.oracle.oracle.MinMaxGaussianModifier(mu: float, sigma: float, minimize=False)[source]#

Bases: ScoreModifier

Score modifier that reproduces a half Gaussian bell shape. For minimize==True, the function is 1.0 for x <= mu and decreases to zero for x > mu. For minimize==False, the function is 1.0 for x >= mu and decreases to zero for x < mu.
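One way to obtain the half-bell shape is to clamp x to mu on the flat side before evaluating a Gaussian. The sketch below follows that idea and assumes an unnormalized Gaussian with peak 1.

```python
import math


def min_max_gaussian_score(x, mu, sigma, minimize=False):
    """Half Gaussian: flat at 1.0 on one side of mu, Gaussian decay on the other."""
    # Clamping makes the flat side evaluate to exactly exp(0) == 1.0.
    clamped = max(x, mu) if minimize else min(x, mu)
    return math.exp(-0.5 * ((clamped - mu) / sigma) ** 2)


print(min_max_gaussian_score(1.0, mu=2.0, sigma=1.0, minimize=True))   # 1.0
print(min_max_gaussian_score(3.0, mu=2.0, sigma=1.0, minimize=False))  # 1.0
```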

class tdc.chem_utils.oracle.oracle.PyScreener_meta(receptor_pdb_file, box_center, box_size, software_class='vina', ncpu=4, **kwargs)[source]#

Bases: object

Evaluate docking score

tdc.chem_utils.oracle.oracle.SA(s)[source]#

Evaluate SA score of a SMILES string

Parameters:

s – str, a SMILES string

Returns:

float

Return type:

SAscore

class tdc.chem_utils.oracle.oracle.SMARTS_scoring(target_smarts, inverse)[source]#

Bases: object

class tdc.chem_utils.oracle.oracle.ScoreModifier[source]#

Bases: object

Interface for score modifiers.

class tdc.chem_utils.oracle.oracle.Score_3d(receptor_pdbqt_file, center, box_size, scorefunction='vina')[source]#

Bases: object

Evaluate Vina score (force field) for a conformer binding to a receptor

class tdc.chem_utils.oracle.oracle.SmoothClippedScoreModifier(upper_x: float, lower_x=0.0, high_score=1.0, low_score=0.0)[source]#

Bases: ScoreModifier

Smooth variant of ClippedScoreModifier.

Implemented as a logistic function that has the same steepness as ClippedScoreModifier in the center of the logistic function.
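The steepness condition determines the logistic growth rate: a logistic of height L and rate k has slope L·k/4 at its center, while the clipped ramp has slope L/(upper_x − lower_x), so k = 4/(upper_x − lower_x). A sketch under that derivation (function name hypothetical):

```python
import math


def smooth_clipped_score(x, upper_x, lower_x=0.0, high_score=1.0, low_score=0.0):
    """Logistic curve matching the clipped ramp's slope at the midpoint."""
    k = 4.0 / (upper_x - lower_x)          # growth rate from the slope-matching condition
    middle_x = (upper_x + lower_x) / 2.0   # center of the logistic
    height = high_score - low_score
    return low_score + height / (1.0 + math.exp(-k * (x - middle_x)))


print(smooth_clipped_score(5.0, upper_x=10.0))  # 0.5
```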

class tdc.chem_utils.oracle.oracle.SquaredModifier(target_value: float, coefficient=1.0)[source]#

Bases: ScoreModifier

Score modifier that has a maximum at a given target value, and decreases quadratically with increasing distance from the target value.
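A sketch of the quadratic shape, assuming a peak value of 1 at the target and `coefficient` scaling the squared distance.

```python
def squared_score(x, target_value, coefficient=1.0):
    """Peak of 1.0 at target_value, decreasing quadratically with distance."""
    return 1.0 - coefficient * (target_value - x) ** 2


print(squared_score(2.5, target_value=3.0))  # 0.75
```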

class tdc.chem_utils.oracle.oracle.ThresholdedLinearModifier(threshold: float)[source]#

Bases: ScoreModifier

Returns a value of min(input, threshold)/threshold.
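The formula above as a standalone function (name hypothetical):

```python
def thresholded_linear_score(x, threshold):
    """min(x, threshold) / threshold: linear up to the threshold, then flat at 1.0."""
    return min(x, threshold) / threshold


print(thresholded_linear_score(2.0, threshold=4.0))   # 0.5
print(thresholded_linear_score(10.0, threshold=4.0))  # 1.0
```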

class tdc.chem_utils.oracle.oracle.Vina_3d(receptor_pdbqt_file, center, box_size, scorefunction='vina')[source]#

Bases: object

Perform docking search from a conformer.

class tdc.chem_utils.oracle.oracle.Vina_smiles(receptor_pdbqt_file, center, box_size, scorefunction='vina')[source]#

Bases: object

Perform docking search from a conformer.

tdc.chem_utils.oracle.oracle.amlodipine_mpo(test_smiles)[source]#
tdc.chem_utils.oracle.oracle.askcos(smiles, host_ip, output='plausibility', save_json=False, file_name='tree_builder_result.json', num_trials=5, max_depth=9, max_branching=25, expansion_time=60, max_ppg=100, template_count=1000, max_cum_prob=0.999, chemical_property_logic='none', max_chemprop_c=0, max_chemprop_n=0, max_chemprop_o=0, max_chemprop_h=0, chemical_popularity_logic='none', min_chempop_reactants=5, min_chempop_products=5, filter_threshold=0.1, return_first='true')[source]#

The ASKCOS retrosynthetic analysis oracle function. Please refer to https://github.com/connorcoley/ASKCOS for how to run ASKCOS with Docker on a server that receives the requests.

tdc.chem_utils.oracle.oracle.calculateScore(m)[source]#
tdc.chem_utils.oracle.oracle.canonicalize(smiles: str, include_stereocenters=True)[source]#

Canonicalize the SMILES strings with RDKit.

The algorithm is detailed under https://pubs.acs.org/doi/full/10.1021/acs.jcim.5b00543

Parameters:
  • smiles – SMILES string to canonicalize

  • include_stereocenters – whether to keep the stereochemical information in the canonical SMILES string

Returns:

Canonicalized SMILES string, None if the molecule is invalid.

tdc.chem_utils.oracle.oracle.cyp3a4_veith(smiles)[source]#
tdc.chem_utils.oracle.oracle.deco_hop(test_smiles)[source]#
tdc.chem_utils.oracle.oracle.drd2(smile)[source]#

Evaluate DRD2 score of a SMILES string

Parameters:

smile – str, a SMILES string

Returns:

float

Return type:

drd_score

tdc.chem_utils.oracle.oracle.fexofenadine_mpo(test_smiles)[source]#
tdc.chem_utils.oracle.oracle.fingerprints_from_mol(mol)[source]#
tdc.chem_utils.oracle.oracle.get_PHCO_fingerprint(mol)[source]#
tdc.chem_utils.oracle.oracle.gsk3b(smiles)[source]#

Evaluate GSK3B score of a SMILES string

Parameters:

smiles – str

Returns:

float, between 0 and 1.

Return type:

gsk3_score

tdc.chem_utils.oracle.oracle.ibm_rxn(smiles, api_key, output='confidence', sleep_time=30)[source]#

This function is modified from Dr. Jan Jensen’s code

tdc.chem_utils.oracle.oracle.isomer_meta(target_smiles, means='geometric')[source]#
tdc.chem_utils.oracle.oracle.isomer_meta_prev(target_smiles, means='geometric')[source]#
class tdc.chem_utils.oracle.oracle.jnk3[source]#

Bases: object

Evaluate JNK3 score of a SMILES string

Parameters:

smiles – str

Returns:

float, between 0 and 1.

Return type:

jnk3_score

tdc.chem_utils.oracle.oracle.load_cyp3a4_veith()[source]#
tdc.chem_utils.oracle.oracle.load_drd2_model()[source]#
tdc.chem_utils.oracle.oracle.load_gsk3b_model()[source]#
tdc.chem_utils.oracle.oracle.load_pickled_model(name: str)[source]#

Loading a pretrained model serialized with pickle. Usually for sklearn models.

Parameters:

name – Name of the model to load.

Returns:

The model.

class tdc.chem_utils.oracle.oracle.median_meta(target_smiles_1, target_smiles_2, fp1='ECFP6', fp2='ECFP6', modifier_func1=None, modifier_func2=None, means='geometric')[source]#

Bases: object

class tdc.chem_utils.oracle.oracle.molecule_one_retro(api_token)[source]#

Bases: object

tdc.chem_utils.oracle.oracle.numBridgeheadsAndSpiro(mol, ri=None)[source]#
tdc.chem_utils.oracle.oracle.osimertinib_mpo(test_smiles)[source]#
tdc.chem_utils.oracle.oracle.parse_molecular_formula(formula)[source]#

Parse a molecular formula to get the element types and counts.

Parameters:

formula – molecular formula, e.g. "C8H3F3Br"

Returns:

A list of tuples containing element types and number of occurrences.
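A regular-expression sketch of this kind of parsing. The helper `parse_formula` is hypothetical and handles only flat formulas (no parentheses or charges); an element without a number counts as 1.

```python
import re


def parse_formula(formula):
    """Return [(element, count), ...]; a missing count means 1."""
    # An element is an uppercase letter plus optional lowercase letters, then optional digits.
    matches = re.findall(r"([A-Z][a-z]*)(\d*)", formula)
    return [(element, int(count) if count else 1) for element, count in matches]


print(parse_formula("C8H3F3Br"))  # [('C', 8), ('H', 3), ('F', 3), ('Br', 1)]
```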

tdc.chem_utils.oracle.oracle.penalized_logp(s)[source]#

Evaluate the penalized LogP score of a SMILES string

Parameters:

s – str, a SMILES string

Returns:

float, between -infinity and +infinity

Return type:

logp_score

tdc.chem_utils.oracle.oracle.perindopril_mpo(test_smiles)[source]#
tdc.chem_utils.oracle.oracle.qed(smiles)[source]#

Evaluate QED score of a SMILES string

Parameters:

smiles – str

Returns:

float, between 0 and 1.

Return type:

qed_score

tdc.chem_utils.oracle.oracle.ranolazine_mpo(test_smiles)[source]#
tdc.chem_utils.oracle.oracle.readFragmentScores(name='fpscores')[source]#
class tdc.chem_utils.oracle.oracle.rediscovery_meta(target_smiles, fp='ECFP4')[source]#

Bases: object

tdc.chem_utils.oracle.oracle.scaffold_hop(test_smiles)[source]#
tdc.chem_utils.oracle.oracle.similarity(smiles_a, smiles_b)[source]#

Evaluate Tanimoto similarity between 2 SMILES strings

Parameters:
  • smiles_a – str, SMILES string

  • smiles_b – str, SMILES string

Returns:

float, between 0 and 1.

Return type:

similarity score
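Tanimoto similarity is the size of the intersection of two fingerprints divided by the size of their union. The sketch below represents each fingerprint as a set of on-bit indices; how the library fingerprints the two SMILES strings is not shown here, and the value returned for two empty fingerprints is a convention chosen for this sketch.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / union


print(tanimoto({1, 2, 3}, {2, 3, 4}))  # 0.5
```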

class tdc.chem_utils.oracle.oracle.similarity_meta(target_smiles, fp='FCFP4', modifier_func=None)[source]#

Bases: object

tdc.chem_utils.oracle.oracle.sitagliptin_mpo(test_smiles)[source]#
tdc.chem_utils.oracle.oracle.sitagliptin_mpo_prev(test_smiles)[source]#
tdc.chem_utils.oracle.oracle.smiles2formula(smiles)[source]#
tdc.chem_utils.oracle.oracle.smiles_2_fingerprint_AP(smiles)[source]#

Convert smiles into Atom Pair Fingerprint.

Parameters:

smiles – str, SMILES string.

Returns:

rdkit.DataStructs.cDataStructs.IntSparseIntVect

Return type:

fp

tdc.chem_utils.oracle.oracle.smiles_2_fingerprint_ECFP4(smiles)[source]#

Convert smiles into ECFP4 Morgan Fingerprint.

Parameters:

smiles – str, SMILES string.

Returns:

rdkit.DataStructs.cDataStructs.UIntSparseIntVect

Return type:

fp

tdc.chem_utils.oracle.oracle.smiles_2_fingerprint_ECFP6(smiles)[source]#

Convert smiles into ECFP6 Fingerprint.

Parameters:

smiles – str, SMILES string.

Returns:

rdkit.DataStructs.cDataStructs.UIntSparseIntVect

Return type:

fp

tdc.chem_utils.oracle.oracle.smiles_2_fingerprint_FCFP4(smiles)[source]#

Convert smiles into FCFP4 Morgan Fingerprint.

Parameters:

smiles – str, SMILES string.

Returns:

rdkit.DataStructs.cDataStructs.UIntSparseIntVect

Return type:

fp

tdc.chem_utils.oracle.oracle.smiles_to_rdkit_mol(smiles)[source]#

Convert smiles into rdkit’s mol (molecule) format.

Parameters:

smiles – str, SMILES string.

Returns:

rdkit.Chem.rdchem.Mol

Return type:

mol

tdc.chem_utils.oracle.oracle.smina(ligand, protein, score_only=False, raw_input=False)[source]#

Smina is a docking algorithm that docks a ligand to a protein pocket.

Koes, D.R., Baumgartner, M.P. and Camacho, C.J., 2013. Lessons learned in empirical scoring with smina from the CSAR 2011 benchmarking exercise. Journal of chemical information and modeling, 53(8), pp.1893-1904.

Parameters:
  • ligand (array) – (N_1,3) matrix, where N_1 is ligand size.

  • protein (array) – (N_2,3) matrix, where N_2 is protein size.

  • score_only (boolean) – whether to only return docking score.

  • raw_input (boolean) – whether to input raw ML input or sdf file input

Returns:

docking_info – docking result

Return type:

str or float

tdc.chem_utils.oracle.oracle.tree_analysis(current)[source]#

Analyze the result of the tree builder. Calculates:

  1. Number of steps

  2. Pi plausibility

  3. Whether a path was found

In case of a Celery error, all values are -1.

Returns:

  • num_path – number of paths found

  • status – same as implemented in ASKCOS

  • num_step – number of steps

  • p_score – Pi plausibility

  • synthesizability – binary code

  • price – price for synthesizing the query compound

tdc.chem_utils.oracle.oracle.valsartan_smarts(test_smiles)[source]#
tdc.chem_utils.oracle.oracle.zaleplon_mpo(test_smiles)[source]#
tdc.chem_utils.oracle.oracle.zaleplon_mpo_prev(test_smiles)[source]#

tdc.chem_utils.evaluator module

tdc.chem_utils.evaluator.calculate_internal_pairwise_similarities(smiles_list)[source]#

Computes the pairwise similarities of the provided list of smiles against itself.

Parameters:

smiles_list – list of str

Returns:

Symmetric matrix of pairwise similarities. Diagonal is set to zero.
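The shape of the result can be sketched with fingerprints given as sets of on-bit indices. Tanimoto similarity is assumed as the pairwise measure, and the helper name is hypothetical; the fingerprinting step from SMILES is omitted.

```python
def pairwise_similarities(fingerprints):
    """Symmetric matrix of pairwise Tanimoto similarities with a zero diagonal."""
    n = len(fingerprints)
    sims = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):  # fill below the diagonal, mirror above it
            a, b = fingerprints[i], fingerprints[j]
            union = len(a | b)
            value = len(a & b) / union if union else 0.0
            sims[i][j] = sims[j][i] = value
    return sims


sims = pairwise_similarities([{1, 2}, {2, 3}, {1, 2}])
```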

tdc.chem_utils.evaluator.calculate_pc_descriptors(smiles, pc_descriptors)[source]#

Calculate physicochemical descriptors of a list of molecules.

Parameters:
  • smiles – list of SMILES strings

  • pc_descriptors – list of strings, names of descriptors to calculate

Returns:

list of float

Return type:

descriptors

tdc.chem_utils.evaluator.canonicalize(smiles)[source]#

Convert SMILES into canonical form.

Parameters:

smiles – str, SMILES string

Returns:

str, canonical SMILES string.

Return type:

smiles

tdc.chem_utils.evaluator.continuous_kldiv(X_baseline: array, X_sampled: array) → float[source]#

Calculate KL divergence for two numpy arrays, continuous version.

Parameters:
  • X_baseline – numpy array

  • X_sampled – numpy array

Returns:

float

Return type:

KL divergence

tdc.chem_utils.evaluator.discrete_kldiv(X_baseline: array, X_sampled: array) → float[source]#

calculate KL divergence for two numpy arrays, discrete version.

Parameters:
  • X_baseline – numpy array

  • X_sampled – numpy array

Returns:

float

Return type:

KL divergence
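For two discrete distributions P and Q, the KL divergence is KL(P ‖ Q) = Σ p_i · log(p_i / q_i). The sketch below computes it for two lists of non-negative weights. The smoothing constant `eps`, the normalization step, and the helper name are assumptions; how the library bins the input arrays and which argument plays the role of P is not stated on this page.

```python
import math


def kl_divergence_discrete(p, q, eps=1e-10):
    """KL(P || Q) for two equally long lists of non-negative weights."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    # Smooth to avoid log(0), then normalize each list to sum to 1.
    p = [value + eps for value in p]
    q = [value + eps for value in q]
    p_total, q_total = sum(p), sum(q)
    return sum(
        (pi / p_total) * math.log((pi / p_total) / (qi / q_total))
        for pi, qi in zip(p, q)
    )


forward = kl_divergence_discrete([0.5, 0.5], [0.9, 0.1])
backward = kl_divergence_discrete([0.9, 0.1], [0.5, 0.5])
```

The two directions differ, which is why the argument order matters when comparing a generated set against a baseline.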

tdc.chem_utils.evaluator.diversity(list_of_smiles)[source]#

Evaluate the internal diversity of a set of molecules. The internal diversity is defined as the average pairwise Tanimoto distance between the Morgan fingerprints.

Parameters:

list_of_smiles – list of SMILES strings

Returns:

float

Return type:

div
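A sketch of the definition above, with fingerprints given as sets of on-bit indices and Tanimoto distance taken as 1 minus Tanimoto similarity. The helper name is hypothetical, and any deduplication or fingerprinting the library performs beforehand is omitted.

```python
from itertools import combinations


def internal_diversity(fingerprints):
    """Average pairwise Tanimoto distance over all unordered pairs."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        raise ValueError("need at least two fingerprints")
    distances = []
    for a, b in pairs:
        union = len(a | b)
        similarity = len(a & b) / union if union else 0.0
        distances.append(1.0 - similarity)
    return sum(distances) / len(distances)


div = internal_diversity([{1, 2}, {2, 3}, {1, 2}])
```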

tdc.chem_utils.evaluator.fcd_distance(generated_smiles_lst, training_smiles_lst)[source]#

Evaluate FCD distance between generated smiles set and training smiles set.

Parameters:
  • generated_smiles_lst – list (of SMILES string), which are generated.

  • training_smiles_lst – list (of SMILES string), which are used for training.

Returns:

float

Return type:

fcd_distance

tdc.chem_utils.evaluator.fcd_distance_tf(generated_smiles_lst, training_smiles_lst)[source]#

Evaluate FCD distance between generated smiles set and training smiles set using tensorflow.

Parameters:
  • generated_smiles_lst – list (of SMILES string), which are generated.

  • training_smiles_lst – list (of SMILES string), which are used for training.

Returns:

float

Return type:

fcd_distance

tdc.chem_utils.evaluator.fcd_distance_torch(generated_smiles_lst, training_smiles_lst)[source]#

Evaluate FCD distance between generated smiles set and training smiles set using PyTorch.

Parameters:
  • generated_smiles_lst – list (of SMILES string), which are generated.

  • training_smiles_lst – list (of SMILES string), which are used for training.

Returns:

float

Return type:

fcd_distance

tdc.chem_utils.evaluator.get_fingerprints(mols, radius=2, length=4096)[source]#

Converts molecules to ECFP bitvectors.

Parameters:
  • mols – RDKit molecules

  • radius – ECFP fingerprint radius

  • length – number of bits

Returns: a list of fingerprints

tdc.chem_utils.evaluator.get_mols(smiles_list)[source]#

Convert SMILES strings to RDKit RDMol objects.

Parameters:

smiles_list – list of SMILES strings

Returns:

list of RDKit RDMol objects

Return type:

mols

tdc.chem_utils.evaluator.kl_divergence(generated_smiles_lst, training_smiles_lst)[source]#

Evaluate the KL divergence of a set of generated SMILES strings using a list of training SMILES strings as reference. KL divergence is defined as the averaged KL divergence of a set of physicochemical descriptors between a set of generated molecules and a set of training molecules.

Parameters:
  • generated_smiles_lst – list (of SMILES string), which are generated.

  • training_smiles_lst – list (of SMILES string), which are used for training.

Returns:

float

Return type:

KL divergence

tdc.chem_utils.evaluator.novelty(generated_smiles_lst, training_smiles_lst)[source]#

Evaluate the novelty of a set of generated SMILES strings using a list of training SMILES strings as reference. Novelty is defined as the fraction of generated molecules that do not appear in the training set.

Parameters:
  • generated_smiles_lst – list (of SMILES string), which are generated.

  • training_smiles_lst – list (of SMILES string), which are used for training.

Returns:

float

Return type:

novelty
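A sketch of the definition, assuming the SMILES strings are already canonical so that string equality means molecule equality. Deduplicating the generated list first is also an assumption of this sketch, and the helper name is hypothetical.

```python
def novelty_fraction(generated, training):
    """Fraction of distinct generated SMILES that do not occur in the training list."""
    generated_unique = set(generated)
    if not generated_unique:
        raise ValueError("generated list is empty")
    novel = generated_unique - set(training)
    return len(novel) / len(generated_unique)


value = novelty_fraction(["CCO", "CCN", "CCO", "c1ccccc1"], ["CCO"])
```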

tdc.chem_utils.evaluator.single_molecule_validity(smiles)[source]#

Evaluate the chemical validity of a single molecule given as a SMILES string

Parameters:

smiles – str, SMILES string.

Returns:

whether the SMILES string encodes a valid molecule

Return type:

Boolean

tdc.chem_utils.evaluator.unique_lst_of_smiles(list_of_smiles)[source]#
tdc.chem_utils.evaluator.uniqueness(list_of_smiles)[source]#

Evaluate the uniqueness of a list of SMILES strings, i.e., the fraction of unique molecules among a given list.

Parameters:

list_of_smiles – list (of SMILES strings)

Returns:

float

Return type:

uniqueness
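A sketch of the definition, assuming the SMILES strings are already canonical so that duplicates can be detected by string comparison (helper name hypothetical).

```python
def uniqueness_fraction(smiles_list):
    """Number of distinct SMILES divided by the length of the list."""
    if not smiles_list:
        raise ValueError("list is empty")
    return len(set(smiles_list)) / len(smiles_list)


print(uniqueness_fraction(["CCO", "CCO", "CCN", "CCC"]))  # 0.75
```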

tdc.chem_utils.evaluator.validity(list_of_smiles)[source]#