tdc.chem_utils#

tdc.chem_utils.featurize module#

tdc.chem_utils.featurize.molconvert submodule#

class tdc.chem_utils.featurize.molconvert.MolConvert(src='SMILES', dst='Graph2D', radius=2, nBits=1024)[source]#

Bases: object

MolConvert: convert a molecule from the src format to the dst format.

Example

    convert = MolConvert(src='SMILES', dst='Graph2D')
    g = convert('Clc1ccccc1C2C(=C(/N/C(=C2/C(=O)OCC)COCCN)C)C(=O)OC')
    # g: graph with edge, node features
    g = convert(['Clc1ccccc1C2C(=C(/N/C(=C2/C(=O)OCC)COCCN)C)C(=O)OC',
                 'CCCOc1cc2ncnc(Nc3ccc4ncsc4c3)c2cc1S(=O)(=O)C(C)(C)C'])
    # g: a list of graphs with edge, node features

If src is 2D, dst can only be a 2D output. If src is 3D, dst can be either a 2D or a 3D output.

src:

  • 2D – SMILES, SELFIES

  • 3D – SDF file, XYZ file

dst:

  • 2D – 2D graph (+ PyG, DGL format), canonical SMILES, SELFIES, fingerprints

  • 3D – 3D graphs (adjacency matrix entry is (distance, bond type)), Coulomb matrix

static eligible_format(src=None)[source]#

Given a src format, return all dst formats available for it.

Example

    MolConvert.eligible_format('SMILES')
    # ['Graph', 'SMARTS', …]

class tdc.chem_utils.featurize.molconvert.MoleculeFingerprint(fp='ECFP4')[source]#

Bases: object

Example

    MolFP = MoleculeFingerprint(fp='ECFP6')
    out = MolFP('Clc1ccccc1C2C(=C(/N/C(=C2/C(=O)OCC)COCCN)C)C(=O)OC')
    # np.array([1, 0, 1, …])
    out = MolFP(['Clc1ccccc1C2C(=C(/N/C(=C2/C(=O)OCC)COCCN)C)C(=O)OC',
                 'CCCOc1cc2ncnc(Nc3ccc4ncsc4c3)c2cc1S(=O)(=O)C(C)(C)C'])
    # np.array([[1, 0, 1, …],
    #           [0, 0, 1, …]])

Supported fingerprints: Basic_Descriptors (atoms, chirality, …), ECFP2, ECFP4, ECFP6, MACCS, Daylight-type, RDKit2D, Morgan, PubChem

tdc.chem_utils.featurize.molconvert.atom2onehot(atom)[source]#

Convert an atom symbol to a one-hot feature vector.

Parameters

atom – str, an atom symbol, e.g. 'C'

Returns

one-hot feature vector, e.g. [1, 0, 0, 0, 0, ..]
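The one-hot idea can be illustrated with a minimal sketch. The vocabulary `ATOM_VOCAB` and the helper name `atom_to_onehot` are hypothetical; the atom list and ordering that TDC actually uses are not given on this page and may differ.

```python
# Hypothetical vocabulary for illustration only; TDC's own list may differ.
ATOM_VOCAB = ["C", "N", "O", "S", "F", "Cl", "Br", "I"]


def atom_to_onehot(atom, vocab=ATOM_VOCAB):
    """Return a list with a single 1 at the position of `atom` in `vocab`."""
    if atom not in vocab:
        raise ValueError(f"unknown atom symbol: {atom!r}")
    return [1 if symbol == atom else 0 for symbol in vocab]


print(atom_to_onehot("C"))
```

With this vocabulary, carbon sits at index 0, so the vector starts with a 1 followed by zeros, as in the Returns example.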

tdc.chem_utils.featurize.molconvert.atomstring2atomfeature(atom_string_list)[source]#
tdc.chem_utils.featurize.molconvert.bondtype2idx(bond_type)[source]#
tdc.chem_utils.featurize.molconvert.canonicalize(smiles)[source]#
tdc.chem_utils.featurize.molconvert.distance3d(coordinate_1, coordinate_2)[source]#
tdc.chem_utils.featurize.molconvert.get_atom_features(atom)[source]#
tdc.chem_utils.featurize.molconvert.get_mol(smiles)[source]#
tdc.chem_utils.featurize.molconvert.mol2file2smiles(molfile)[source]#

Convert a MOL2 file into a SMILES string.

Parameters

molfile – str, a MOL2 file.

Returns

smiles – a SMILES string

Return type

str

tdc.chem_utils.featurize.molconvert.mol2smiles(mol)[source]#
tdc.chem_utils.featurize.molconvert.mol_conformer2graph3d(mol_conformer_lst)[source]#

Convert a list of (molecule, conformer) tuples into a list of 3D graphs.

Parameters

mol_conformer_lst – list of (molecule, conformer) tuples

Returns

graph3d_lst – a list of 3D graphs. Each graph has (i) idx2atom (dict); (ii) distance_adj_matrix (np.array); (iii) bondtype_adj_matrix (np.array)

Return type

list

tdc.chem_utils.featurize.molconvert.molfile2PyG(molfile)[source]#
tdc.chem_utils.featurize.molconvert.molfile2smiles(molfile)[source]#

Convert a molfile into a SMILES string.

Parameters

molfile – str, a file.

Returns

smiles – a SMILES string

Return type

str

tdc.chem_utils.featurize.molconvert.onek_encoding_unk(x, allowable_set)[source]#
tdc.chem_utils.featurize.molconvert.raw3D2pyg(raw3d_feature)[source]#

Convert a raw 3D feature into a PyG (torch-geometric) feature.

Parameters

raw3d_feature – tuple (atom_string_list, positions, y)

  • atom_string_list – list, each element is an atom; length is N

  • positions – np.array, shape (N, 3)

  • y – float

Returns

data = Data(x=x, pos=pos, y=y)

tdc.chem_utils.featurize.molconvert.sdffile2coulomb(sdf)[source]#

Convert an SDF file into a list of Coulomb features.

Parameters

sdf – str, an SDF file

Returns

coulomb feature

Return type

np.array

tdc.chem_utils.featurize.molconvert.sdffile2graph3d_lst(sdffile)[source]#

Convert an SDF file into a list of 3D graphs.

Parameters

sdffile – str, an SDF file

Returns

graph3d_lst – a list of 3D graphs. Each graph has (i) idx2atom (dict); (ii) distance_adj_matrix (np.array); (iii) bondtype_adj_matrix (np.array)

Return type

list

tdc.chem_utils.featurize.molconvert.sdffile2mol_conformer(sdffile)[source]#

Convert an SDF file into a list of molecule conformers.

Parameters

sdffile – str, an SDF file

Returns

a list of molecule conformers.

Return type

list

tdc.chem_utils.featurize.molconvert.sdffile2selfies_lst(sdf)[source]#

Convert an SDF file into a list of SELFIES strings.

Parameters

sdf – str, an SDF file

Returns

selfies_lst – a list of SELFIES strings.

Return type

list

tdc.chem_utils.featurize.molconvert.sdffile2smiles_lst(sdffile)[source]#

Convert an SDF file into a list of SMILES strings.

Parameters

sdffile – str, an SDF file

Returns

smiles_lst – a list of SMILES strings.

Return type

list

tdc.chem_utils.featurize.molconvert.selfies2smiles(selfies)[source]#

Convert selfies into smiles.

Parameters

selfies – str, a SELFIES string.

Returns

smiles – a SMILES string

Return type

str

tdc.chem_utils.featurize.molconvert.smiles2DGL(smiles)[source]#

Convert a SMILES string into a dgl.DGLGraph.

Parameters

smiles – str, a SMILES string

Returns

g – the molecular graph

Return type

dgl.DGLGraph

tdc.chem_utils.featurize.molconvert.smiles2ECFP2(smiles)[source]#

Convert smiles into ECFP2 Morgan Fingerprint.

Parameters

smiles – str

Returns

fp – ECFP2 Morgan fingerprint

Return type

rdkit.DataStructs.cDataStructs.UIntSparseIntVect

tdc.chem_utils.featurize.molconvert.smiles2ECFP4(smiles)[source]#

Convert smiles into ECFP4 Morgan Fingerprint.

Parameters

smiles – str

Returns

fp – ECFP4 Morgan fingerprint

Return type

rdkit.DataStructs.cDataStructs.UIntSparseIntVect

tdc.chem_utils.featurize.molconvert.smiles2ECFP6(smiles)[source]#

Convert smiles into ECFP6 Morgan Fingerprint.

Parameters

smiles – str, a SMILES string

Returns

fp – ECFP6 Morgan fingerprint

Return type

rdkit.DataStructs.cDataStructs.UIntSparseIntVect

Reference: https://github.com/rdkit/benchmarking_platform/blob/master/scoring/fingerprint_lib.py

tdc.chem_utils.featurize.molconvert.smiles2PyG(smiles)[source]#

Convert a SMILES string into a torch_geometric.data.Data object.

Parameters

smiles – str, a SMILES string

Returns

data – torch_geometric.data.Data

tdc.chem_utils.featurize.molconvert.smiles2daylight(s)[source]#

Convert a SMILES string into a 2048-dim Daylight feature.

Parameters

s – str, a SMILES string

Returns

fp – 2048-dim Daylight feature

Return type

numpy.array

tdc.chem_utils.featurize.molconvert.smiles2graph2D(smiles)[source]#

Convert a SMILES string into a two-dimensional molecular graph feature.

Parameters

smiles – str, a SMILES string

Returns

  • idx2atom – dict, map from index to atom symbol, e.g., {0: 'C', 1: 'N', …}

  • adj_matrix – np.array

tdc.chem_utils.featurize.molconvert.smiles2maccs(s)[source]#

Convert a SMILES string into a MACCS feature.

Parameters

s – str, a SMILES string

Returns

fp – MACCS feature

Return type

numpy.array

tdc.chem_utils.featurize.molconvert.smiles2mol(smiles)[source]#

Convert SMILES string into rdkit.Chem.rdchem.Mol.

Parameters

smiles – str, a SMILES string.

Returns

mol – the RDKit molecule

Return type

rdkit.Chem.rdchem.Mol

tdc.chem_utils.featurize.molconvert.smiles2morgan(s, radius=2, nBits=1024)[source]#

Convert a SMILES string into a Morgan fingerprint.

Parameters
  • s – str, a SMILES string

  • radius – int (default: 2)

  • nBits – int (default: 1024)

Returns

fp – Morgan fingerprint

Return type

numpy.array

tdc.chem_utils.featurize.molconvert.smiles2rdkit2d(s)[source]#

Convert a SMILES string into a 200-dim normalized RDKit 2D vector.

Parameters

s – str, a SMILES string

Returns

fp – 200-dim normalized RDKit 2D vector

Return type

numpy.array

tdc.chem_utils.featurize.molconvert.smiles2selfies(smiles)[source]#

Convert smiles into selfies.

Parameters

smiles – str, a SMILES string

Returns

selfies – a SELFIES string

Return type

str

tdc.chem_utils.featurize.molconvert.smiles_lst2coulomb(smiles_lst)[source]#

Convert a list of SMILES strings into Coulomb format.

Parameters

smiles_lst – a list of SMILES strings.

Returns

features – Coulomb features

Return type

np.array

tdc.chem_utils.featurize.molconvert.upper_atom(atomsymbol)[source]#
tdc.chem_utils.featurize.molconvert.xyzfile2coulomb(xyzfile)[source]#
tdc.chem_utils.featurize.molconvert.xyzfile2graph3d(xyzfile)[source]#
tdc.chem_utils.featurize.molconvert.xyzfile2selfies(xyzfile)[source]#

Convert an XYZ file into a SELFIES string.

Parameters

xyzfile – str, an XYZ file

Returns

selfies – a SELFIES string

Return type

str

tdc.chem_utils.featurize.molconvert.xyzfile2smiles(xyzfile)[source]#

Convert an XYZ file into a SMILES string.

Parameters

xyzfile – str, an XYZ file

Returns

smiles – a SMILES string

Return type

str

tdc.chem_utils.oracle module#

tdc.chem_utils.oracle.filter submodule#

class tdc.chem_utils.oracle.filter.MolFilter(filters='all', property_filters_flag=True, HBA=[0, 10], HBD=[0, 5], LogP=[-5, 5], MW=[0, 500], Rot=[0, 10], TPSA=[0, 200])[source]#

Bases: object

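Molecule filter: filters molecules based on user-specified conditions.

Parameters
  • filters – which filters to apply (default: 'all')

  • property_filters_flag – bool (default: True)

  • HBA – [lower_bound, upper_bound], number of hydrogen bond acceptors

  • HBD – [lower_bound, upper_bound], number of hydrogen bond donors

  • LogP – [lower_bound, upper_bound]

  • MW – [lower_bound, upper_bound], molecular weight

  • Rot – [lower_bound, upper_bound], number of rotatable bonds

  • TPSA – [lower_bound, upper_bound], topological polar surface area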

Returns

list of SMILES strings that pass the filter.
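The property bounds can be read as a per-molecule range check. The sketch below is not MolFilter's implementation: it assumes the property values have already been computed (for example with RDKit) and that both bounds are inclusive, and the helper name `passes_property_filters` is hypothetical.

```python
# Default bounds copied from the MolFilter signature.
DEFAULT_BOUNDS = {
    "HBA": (0, 10),
    "HBD": (0, 5),
    "LogP": (-5, 5),
    "MW": (0, 500),
    "Rot": (0, 10),
    "TPSA": (0, 200),
}


def passes_property_filters(props, bounds=DEFAULT_BOUNDS):
    """Return True if every property lies within its [lower, upper] range.

    Assumption: bounds are inclusive on both ends.
    """
    return all(low <= props[name] <= high for name, (low, high) in bounds.items())


# Made-up property values for two hypothetical molecules.
small = {"HBA": 3, "HBD": 1, "LogP": 2.1, "MW": 310.4, "Rot": 4, "TPSA": 75.0}
heavy = dict(small, MW=812.0)
print(passes_property_filters(small), passes_property_filters(heavy))
```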

tdc.chem_utils.oracle.oracle submodule#

class tdc.chem_utils.oracle.oracle.AbsoluteScoreModifier(target_value: float)[source]#

Bases: ScoreModifier

Score modifier that has a maximum at a given target value, and decreases linearly with increasing distance from the target value.
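A minimal sketch of this shape, assuming a peak value of 1.0 at the target and a slope of magnitude 1 on either side (neither constant is stated on this page); `absolute_score` is a hypothetical standalone function, not the class itself.

```python
def absolute_score(x, target_value):
    """Peak of 1.0 at target_value, falling linearly with |x - target_value|."""
    # Assumption: unit slope and a maximum of 1.0.
    return 1.0 - abs(target_value - x)


print(absolute_score(3.0, target_value=3.0))
print(absolute_score(2.5, target_value=3.0))
```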

class tdc.chem_utils.oracle.oracle.AtomCounter(element)[source]#

Bases: object

class tdc.chem_utils.oracle.oracle.ChainedModifier(modifiers: List[ScoreModifier])[source]#

Bases: ScoreModifier

Calls several modifiers one after the other, for instance:

score = modifier3(modifier2(modifier1(raw_score)))
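The chaining pattern can be sketched with plain callables. The helper `chain` is hypothetical and stands in for the class; it only assumes that each modifier is callable on a score.

```python
from functools import reduce


def chain(modifiers):
    """Return a function that applies each modifier in order to a raw score."""
    def chained(raw_score):
        return reduce(lambda score, modifier: modifier(score), modifiers, raw_score)
    return chained


def add_one(x):
    return x + 1


def double(x):
    return x * 2


# Equivalent to double(add_one(3)).
print(chain([add_one, double])(3))
```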

class tdc.chem_utils.oracle.oracle.ClippedScoreModifier(upper_x: float, lower_x=0.0, high_score=1.0, low_score=0.0)[source]#

Bases: ScoreModifier

Clips a score between specified low and high scores, and does a linear interpolation in between.

This class works as follows: First the input is mapped onto a linear interpolation between both specified points. Then the generated values are clipped between low and high scores.
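The two steps described above (linear interpolation, then clipping) might look like the following sketch. It assumes the interpolation line passes through (lower_x, low_score) and (upper_x, high_score) and that upper_x differs from lower_x; `clipped_score` is a hypothetical standalone function.

```python
def clipped_score(x, upper_x, lower_x=0.0, high_score=1.0, low_score=0.0):
    """Linearly interpolate between two points, then clip to [low_score, high_score]."""
    # Step 1: line through (lower_x, low_score) and (upper_x, high_score).
    slope = (high_score - low_score) / (upper_x - lower_x)
    intercept = high_score - slope * upper_x
    y = slope * x + intercept
    # Step 2: clip the interpolated value.
    return max(low_score, min(high_score, y))


print(clipped_score(5.0, upper_x=10.0))
print(clipped_score(20.0, upper_x=10.0))
```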

class tdc.chem_utils.oracle.oracle.GaussianModifier(mu: float, sigma: float)[source]#

Bases: ScoreModifier

Score modifier that reproduces a Gaussian bell shape.
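As a sketch, the bell shape can be written as an unnormalized Gaussian with a peak of 1.0 at mu. The peak height is an assumption, and `gaussian_score` is a hypothetical standalone function rather than the class.

```python
import math


def gaussian_score(x, mu, sigma):
    """Unnormalized Gaussian: 1.0 at x == mu, decaying with distance from mu."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)


print(gaussian_score(2.0, mu=2.0, sigma=1.0))
print(gaussian_score(3.0, mu=2.0, sigma=1.0))
```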

class tdc.chem_utils.oracle.oracle.Isomer_scoring(target_smiles, means='geometric')[source]#

Bases: object

class tdc.chem_utils.oracle.oracle.Isomer_scoring_prev(target_smiles, means='geometric')[source]#

Bases: object

class tdc.chem_utils.oracle.oracle.LinearModifier(slope=1.0)[source]#

Bases: ScoreModifier

Score modifier that multiplies the score by a scalar (default: 1, i.e. do nothing).

class tdc.chem_utils.oracle.oracle.MPO_meta(means)[source]#

Bases: object

class tdc.chem_utils.oracle.oracle.MinMaxGaussianModifier(mu: float, sigma: float, minimize=False)[source]#

Bases: ScoreModifier

Score modifier that reproduces a half Gaussian bell shape. For minimize==True, the function is 1.0 for x <= mu and decreases to zero for x > mu. For minimize==False, the function is 1.0 for x >= mu and decreases to zero for x < mu.
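The half-bell behaviour described above can be sketched as follows, assuming the decaying side is an unnormalized Gaussian with standard deviation sigma; `min_max_gaussian_score` is a hypothetical standalone function.

```python
import math


def min_max_gaussian_score(x, mu, sigma, minimize=False):
    """Return 1.0 on the flat side of mu and a Gaussian decay on the other side."""
    on_flat_side = (x <= mu) if minimize else (x >= mu)
    if on_flat_side:
        return 1.0
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)


# minimize=True rewards small values, minimize=False rewards large ones.
print(min_max_gaussian_score(1.0, mu=2.0, sigma=1.0, minimize=True))
print(min_max_gaussian_score(1.0, mu=2.0, sigma=1.0, minimize=False))
```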

class tdc.chem_utils.oracle.oracle.PyScreener_meta(receptor_pdb_file, box_center, box_size, software_class='vina', ncpu=4, **kwargs)[source]#

Bases: object

Evaluate docking score.

tdc.chem_utils.oracle.oracle.SA(s)[source]#

Evaluate the SA score of a SMILES string.

Parameters

s – str, a SMILES string

Returns

SAscore – the SA score

Return type

float

class tdc.chem_utils.oracle.oracle.SMARTS_scoring(target_smarts, inverse)[source]#

Bases: object

class tdc.chem_utils.oracle.oracle.ScoreModifier[source]#

Bases: object

Interface for score modifiers.

class tdc.chem_utils.oracle.oracle.Score_3d(receptor_pdbqt_file, center, box_size, scorefunction='vina')[source]#

Bases: object

Evaluate Vina score (force field) for a conformer binding to a receptor

class tdc.chem_utils.oracle.oracle.SmoothClippedScoreModifier(upper_x: float, lower_x=0.0, high_score=1.0, low_score=0.0)[source]#

Bases: ScoreModifier

Smooth variant of ClippedScoreModifier.

Implemented as a logistic function that has the same steepness as ClippedScoreModifier in the center of the logistic function.
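One way to match the steepness, shown as a sketch: a logistic curve with growth rate k has slope L*k/4 at its midpoint (L being its height), so choosing k = 4 / (upper_x - lower_x) gives the same central slope as the straight line between the two points. The midpoint placement and the function name `smooth_clipped_score` are assumptions.

```python
import math


def smooth_clipped_score(x, upper_x, lower_x=0.0, high_score=1.0, low_score=0.0):
    """Logistic curve from low_score to high_score, centred between lower_x and upper_x."""
    # Slope of a logistic at its centre is height * k / 4, so this k matches
    # the linear slope height / (upper_x - lower_x).
    k = 4.0 / (upper_x - lower_x)
    middle_x = (upper_x + lower_x) / 2.0
    height = high_score - low_score
    return low_score + height / (1.0 + math.exp(-k * (x - middle_x)))


print(smooth_clipped_score(5.0, upper_x=10.0))
```

Unlike the clipped variant, this curve only approaches low_score and high_score asymptotically.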

class tdc.chem_utils.oracle.oracle.SquaredModifier(target_value: float, coefficient=1.0)[source]#

Bases: ScoreModifier

Score modifier that has a maximum at a given target value, and decreases quadratically with increasing distance from the target value.

class tdc.chem_utils.oracle.oracle.ThresholdedLinearModifier(threshold: float)[source]#

Bases: ScoreModifier

Returns a value of min(input, threshold)/threshold.
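The formula above translates directly into code. This sketch uses a hypothetical standalone function and assumes a positive threshold.

```python
def thresholded_linear_score(x, threshold):
    """min(x, threshold) / threshold: linear up to the threshold, then flat at 1.0."""
    return min(x, threshold) / threshold


print(thresholded_linear_score(5.0, threshold=10.0))
print(thresholded_linear_score(20.0, threshold=10.0))
```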

class tdc.chem_utils.oracle.oracle.Vina_3d(receptor_pdbqt_file, center, box_size, scorefunction='vina')[source]#

Bases: object

Perform docking search from a conformer.

class tdc.chem_utils.oracle.oracle.Vina_smiles(receptor_pdbqt_file, center, box_size, scorefunction='vina')[source]#

Bases: object

Perform docking search from a conformer.

tdc.chem_utils.oracle.oracle.amlodipine_mpo(test_smiles)[source]#
tdc.chem_utils.oracle.oracle.askcos(smiles, host_ip, output='plausibility', save_json=False, file_name='tree_builder_result.json', num_trials=5, max_depth=9, max_branching=25, expansion_time=60, max_ppg=100, template_count=1000, max_cum_prob=0.999, chemical_property_logic='none', max_chemprop_c=0, max_chemprop_n=0, max_chemprop_o=0, max_chemprop_h=0, chemical_popularity_logic='none', min_chempop_reactants=5, min_chempop_products=5, filter_threshold=0.1, return_first='true')[source]#

The ASKCOS retrosynthetic analysis oracle function. Please refer to https://github.com/connorcoley/ASKCOS to run ASKCOS with Docker on a server that receives requests.

tdc.chem_utils.oracle.oracle.calculateScore(m)[source]#
tdc.chem_utils.oracle.oracle.canonicalize(smiles: str, include_stereocenters=True)[source]#

Canonicalize the SMILES strings with RDKit.

The algorithm is detailed under https://pubs.acs.org/doi/full/10.1021/acs.jcim.5b00543

Parameters
  • smiles – SMILES string to canonicalize

  • include_stereocenters – whether to keep the stereochemical information in the canonical SMILES string

Returns

Canonicalized SMILES string, None if the molecule is invalid.

tdc.chem_utils.oracle.oracle.centroid(X)[source]#

The centroid is the mean position of all the points in all of the coordinate directions, computed from a vector set X: C = sum(X)/len(X). See https://en.wikipedia.org/wiki/Centroid

Parameters

X (array) – (N,D) matrix, where N is points and D is dimension.

Returns

C – centroid, a vector of length D

Return type

array
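The formula C = sum(X)/len(X) can be sketched in plain Python, with nested lists standing in for the (N,D) array the function takes.

```python
def centroid(points):
    """Mean position of N points in D dimensions, computed per coordinate."""
    n = len(points)
    # zip(*points) groups the values of each coordinate across all points.
    return [sum(coords) / n for coords in zip(*points)]


print(centroid([[0.0, 0.0, 0.0], [2.0, 4.0, 6.0]]))
```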

tdc.chem_utils.oracle.oracle.cyp3a4_veith(smiles)[source]#
tdc.chem_utils.oracle.oracle.deco_hop(test_smiles)[source]#
tdc.chem_utils.oracle.oracle.drd2(smile)[source]#

Evaluate the DRD2 score of a SMILES string.

Parameters

smile – str, a SMILES string

Returns

drd_score – the DRD2 score

Return type

float

tdc.chem_utils.oracle.oracle.fexofenadine_mpo(test_smiles)[source]#
tdc.chem_utils.oracle.oracle.fingerprints_from_mol(mol)[source]#
tdc.chem_utils.oracle.oracle.get_PHCO_fingerprint(mol)[source]#
tdc.chem_utils.oracle.oracle.gsk3b(smiles)[source]#

Evaluate GSK3B score of a SMILES string

Parameters

smiles – str

Returns

gsk3_score – the GSK3B score, between 0 and 1.

Return type

float

tdc.chem_utils.oracle.oracle.ibm_rxn(smiles, api_key, output='confidence', sleep_time=30)[source]#

This function is modified from Dr. Jan Jensen’s code

tdc.chem_utils.oracle.oracle.isomer_meta(target_smiles, means='geometric')[source]#
tdc.chem_utils.oracle.oracle.isomer_meta_prev(target_smiles, means='geometric')[source]#
class tdc.chem_utils.oracle.oracle.jnk3[source]#

Bases: object

Evaluate the JNK3 score of a SMILES string.

Parameters

smiles – str

Returns

jnk3_score – the JNK3 score, between 0 and 1.

Return type

float

tdc.chem_utils.oracle.oracle.kabsch(P, Q)[source]#

Apply the Kabsch algorithm to two sets of paired points P and Q, centered around the centroid. Each vector set is represented as an NxD matrix, where D is the dimension of the space. The algorithm works in three steps:

  • a centroid translation of P and Q (assumed done before this function call)

  • the computation of a covariance matrix C

  • computation of the optimal rotation matrix U

For more info see http://en.wikipedia.org/wiki/Kabsch_algorithm

Parameters
  • P (array) – (N,D) matrix, where N is points and D is dimension.

  • Q (array) – (N,D) matrix, where N is points and D is dimension.

Returns

U – Rotation matrix (D,D)

Return type

matrix

tdc.chem_utils.oracle.oracle.kabsch_rmsd(P, Q, W=None, translate=False)[source]#

Rotate matrix P onto Q using the Kabsch algorithm and calculate the RMSD. An optional vector of weights W may be provided.

Parameters
  • P (array) – (N,D) matrix, where N is points and D is dimension.

  • Q (array) – (N,D) matrix, where N is points and D is dimension.

  • W – (N) vector, where N is points.

  • translate (bool) – Use centroids to translate vector P and Q onto each other.

Returns

rmsd – root-mean-square deviation

Return type

float

tdc.chem_utils.oracle.oracle.kabsch_rotate(P, Q)[source]#

Rotate matrix P onto matrix Q using the Kabsch algorithm.

Parameters
  • P (array) – (N,D) matrix, where N is points and D is dimension.

  • Q (array) – (N,D) matrix, where N is points and D is dimension.

Returns

P – (N,D) matrix, where N is points and D is dimension, rotated

Return type

array

tdc.chem_utils.oracle.oracle.kabsch_weighted(P, Q, W=None)[source]#

Apply the Kabsch algorithm to two sets of paired points P and Q. Each vector set is represented as an NxD matrix, where D is the dimension of the space. An optional vector of weights W may be provided. Note that this algorithm does not require that P and Q have already been overlaid by a centroid translation. The function returns the rotation matrix U, translation vector V, and RMS deviation between Q and P', where P' is:

P' = P * U + V

For more info see http://en.wikipedia.org/wiki/Kabsch_algorithm

Parameters
  • P (array) – (N,D) matrix, where N is points and D is dimension.

  • Q (array) – (N,D) matrix, where N is points and D is dimension.

  • W – (N) vector, where N is points.

Returns

  • U (matrix) – Rotation matrix (D,D)

  • V (vector) – Translation vector (D)

  • RMSD (float) – Root mean squared deviation between P and Q

tdc.chem_utils.oracle.oracle.kabsch_weighted_rmsd(P, Q, W=None)[source]#

Calculate the RMSD between P and Q with optional weights W.

Parameters
  • P (array) – (N,D) matrix, where N is points and D is dimension.

  • Q (array) – (N,D) matrix, where N is points and D is dimension.

  • W – (N) vector, where N is points.

Returns

RMSD

Return type

float

tdc.chem_utils.oracle.oracle.load_cyp3a4_veith()[source]#
tdc.chem_utils.oracle.oracle.load_drd2_model()[source]#
tdc.chem_utils.oracle.oracle.load_gsk3b_model()[source]#
tdc.chem_utils.oracle.oracle.load_pickled_model(name: str)[source]#

Loading a pretrained model serialized with pickle. Usually for sklearn models.

Parameters

name – Name of the model to load.

Returns

The model.

class tdc.chem_utils.oracle.oracle.median_meta(target_smiles_1, target_smiles_2, fp1='ECFP6', fp2='ECFP6', modifier_func1=None, modifier_func2=None, means='geometric')[source]#

Bases: object

class tdc.chem_utils.oracle.oracle.molecule_one_retro(api_token)[source]#

Bases: object

tdc.chem_utils.oracle.oracle.numBridgeheadsAndSpiro(mol, ri=None)[source]#
tdc.chem_utils.oracle.oracle.osimertinib_mpo(test_smiles)[source]#
tdc.chem_utils.oracle.oracle.parse_molecular_formula(formula)[source]#

Parse a molecular formula to get the element types and counts.

Parameters

formula – molecular formula, e.g. "C8H3F3Br"

Returns

A list of tuples containing element types and number of occurrences.
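A minimal sketch of this kind of parsing, using a regular expression rather than whatever the library does internally. It assumes a flat formula (no parentheses or charges) and treats a missing count as 1; `parse_formula` is a hypothetical name.

```python
import re


def parse_formula(formula):
    """Split a flat molecular formula into (element, count) tuples."""
    # An element is one uppercase letter plus optional lowercase letters,
    # followed by an optional integer count.
    matches = re.findall(r"([A-Z][a-z]*)(\d*)", formula)
    return [(element, int(count) if count else 1) for element, count in matches]


print(parse_formula("C8H3F3Br"))
```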

tdc.chem_utils.oracle.oracle.penalized_logp(s)[source]#

Evaluate the penalized LogP score of a SMILES string.

Parameters

s – str, a SMILES string

Returns

logp_score – the score, between -infinity and +infinity

Return type

float

tdc.chem_utils.oracle.oracle.perindopril_mpo(test_smiles)[source]#
tdc.chem_utils.oracle.oracle.qed(smiles)[source]#

Evaluate QED score of a SMILES string

Parameters

smiles – str

Returns

qed_score – the QED score, between 0 and 1.

Return type

float

tdc.chem_utils.oracle.oracle.ranolazine_mpo(test_smiles)[source]#
tdc.chem_utils.oracle.oracle.readFragmentScores(name='fpscores')[source]#
class tdc.chem_utils.oracle.oracle.rediscovery_meta(target_smiles, fp='ECFP4')[source]#

Bases: object

tdc.chem_utils.oracle.oracle.rmsd(V, W)[source]#

Calculate the root-mean-square deviation from two sets of vectors V and W.

Parameters
  • V (array) – (N,D) matrix, where N is points and D is dimension.

  • W (array) – (N,D) matrix, where N is points and D is dimension.

Returns

rmsd – Root-mean-square deviation between the two vectors

Return type

float
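As a plain-Python sketch of the definition (nested lists instead of arrays): sum the squared coordinate differences over all points, divide by the number of points N, and take the square root. Dividing by N rather than N*D is the usual convention for this quantity and is assumed here.

```python
import math


def rmsd(V, W):
    """Root-mean-square deviation between two paired point sets of shape (N, D)."""
    n = len(V)
    squared_error = sum(
        (a - b) ** 2
        for v, w in zip(V, W)
        for a, b in zip(v, w)
    )
    return math.sqrt(squared_error / n)


print(rmsd([[0.0, 0.0, 0.0]], [[3.0, 4.0, 0.0]]))
```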

tdc.chem_utils.oracle.oracle.scaffold_hop(test_smiles)[source]#
tdc.chem_utils.oracle.oracle.similarity(smiles_a, smiles_b)[source]#

Evaluate Tanimoto similarity between 2 SMILES strings

Parameters
  • smiles_a – str, SMILES string

  • smiles_b – str, SMILES string

Returns

similarity score, between 0 and 1.

Return type

float
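Tanimoto similarity is the size of the intersection divided by the size of the union of two fingerprints. The sketch below works on sets of on-bit indices, which is an assumption made to keep it dependency-free; which fingerprint the library computes from the SMILES strings is not stated on this page.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints score 0.0.
        return 0.0
    return len(a & b) / len(union)


print(tanimoto({1, 2, 3}, {2, 3, 4}))
```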

class tdc.chem_utils.oracle.oracle.similarity_meta(target_smiles, fp='FCFP4', modifier_func=None)[source]#

Bases: object

tdc.chem_utils.oracle.oracle.sitagliptin_mpo(test_smiles)[source]#
tdc.chem_utils.oracle.oracle.sitagliptin_mpo_prev(test_smiles)[source]#
tdc.chem_utils.oracle.oracle.smiles2formula(smiles)[source]#
tdc.chem_utils.oracle.oracle.smiles_2_fingerprint_AP(smiles)[source]#

Convert smiles into Atom Pair Fingerprint.

Parameters

smiles – str, SMILES string.

Returns

fp – Atom Pair fingerprint

Return type

rdkit.DataStructs.cDataStructs.IntSparseIntVect

tdc.chem_utils.oracle.oracle.smiles_2_fingerprint_ECFP4(smiles)[source]#

Convert smiles into ECFP4 Morgan Fingerprint.

Parameters

smiles – str, SMILES string.

Returns

fp – ECFP4 Morgan fingerprint

Return type

rdkit.DataStructs.cDataStructs.UIntSparseIntVect

tdc.chem_utils.oracle.oracle.smiles_2_fingerprint_ECFP6(smiles)[source]#

Convert smiles into ECFP6 Fingerprint.

Parameters

smiles – str, SMILES string.

Returns

fp – ECFP6 fingerprint

Return type

rdkit.DataStructs.cDataStructs.UIntSparseIntVect

tdc.chem_utils.oracle.oracle.smiles_2_fingerprint_FCFP4(smiles)[source]#

Convert smiles into FCFP4 Morgan Fingerprint.

Parameters

smiles – str, SMILES string.

Returns

fp – FCFP4 Morgan fingerprint

Return type

rdkit.DataStructs.cDataStructs.UIntSparseIntVect

tdc.chem_utils.oracle.oracle.smiles_to_rdkit_mol(smiles)[source]#

Convert smiles into rdkit’s mol (molecule) format.

Parameters

smiles – str, SMILES string.

Returns

mol – the RDKit molecule

Return type

rdkit.Chem.rdchem.Mol

tdc.chem_utils.oracle.oracle.smina(ligand, protein, score_only=False, raw_input=False)[source]#

Smina is a docking algorithm that docks a ligand to a protein pocket.

Koes, D.R., Baumgartner, M.P. and Camacho, C.J., 2013. Lessons learned in empirical scoring with smina from the CSAR 2011 benchmarking exercise. Journal of chemical information and modeling, 53(8), pp.1893-1904.

Parameters
  • ligand (array) – (N_1,3) matrix, where N_1 is ligand size.

  • protein (array) – (N_2,3) matrix, where N_2 is protein size.

  • score_only (boolean) – whether to only return docking score.

  • raw_input (boolean) – whether to input raw ML input or sdf file input

Returns

docking_info – docking result

Return type

str or float

tdc.chem_utils.oracle.oracle.tree_analysis(current)[source]#

Analyze the result of the tree builder. Calculates: 1. the number of steps; 2. the Pi plausibility; 3. whether a path was found. In case of a celery error, all values are -1.

Returns

  • num_path – number of paths found

  • status – same as the one implemented in ASKCOS

  • num_step – number of steps

  • p_score – Pi plausibility

  • synthesizability – binary code

  • price – price to synthesize the query compound

tdc.chem_utils.oracle.oracle.valsartan_smarts(test_smiles)[source]#
tdc.chem_utils.oracle.oracle.zaleplon_mpo(test_smiles)[source]#
tdc.chem_utils.oracle.oracle.zaleplon_mpo_prev(test_smiles)[source]#

tdc.chem_utils.evaluator module#

tdc.chem_utils.evaluator.calculate_internal_pairwise_similarities(smiles_list)[source]#

Computes the pairwise similarities of the provided list of smiles against itself.

Parameters

smiles_list – list of str

Returns

Symmetric matrix of pairwise similarities. Diagonal is set to zero.

tdc.chem_utils.evaluator.calculate_pc_descriptors(smiles, pc_descriptors)[source]#

Calculate physicochemical descriptors of a list of molecules.

Parameters
  • smiles – list of SMILES strings

  • pc_descriptors – list of strings, names of descriptors to calculate

Returns

descriptors – list of float

Return type

list

tdc.chem_utils.evaluator.canonicalize(smiles)[source]#

Convert SMILES into canonical form.

Parameters

smiles – str, SMILES string

Returns

smiles – canonical SMILES string.

Return type

str

tdc.chem_utils.evaluator.continuous_kldiv(X_baseline: array, X_sampled: array) → float[source]#

Calculate the KL divergence between two numpy arrays, continuous version.

Parameters
  • X_baseline – numpy array

  • X_sampled – numpy array

Returns

KL divergence

Return type

float

tdc.chem_utils.evaluator.discrete_kldiv(X_baseline: array, X_sampled: array) → float[source]#

Calculate the KL divergence between two numpy arrays, discrete version.

Parameters
  • X_baseline – numpy array

  • X_sampled – numpy array

Returns

KL divergence

Return type

float
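For intuition, the discrete KL divergence between two distributions P and Q is the sum over bins of P(i) * log(P(i) / Q(i)). The sketch below works on already-binned counts; how the library bins its raw input arrays is not stated on this page, and the smoothing constant `eps` and the name `discrete_kl_divergence` are assumptions.

```python
import math


def discrete_kl_divergence(p_counts, q_counts, eps=1e-10):
    """KL(P || Q) for two histograms with the same bins."""
    # Smooth to avoid log(0) and division by zero, then normalize to probabilities.
    p_smoothed = [value + eps for value in p_counts]
    q_smoothed = [value + eps for value in q_counts]
    p_total = sum(p_smoothed)
    q_total = sum(q_smoothed)
    return sum(
        (p / p_total) * math.log((p / p_total) / (q / q_total))
        for p, q in zip(p_smoothed, q_smoothed)
    )


print(discrete_kl_divergence([9, 1], [1, 9]))
```

The divergence is zero when the two histograms have the same shape and grows as they differ.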

tdc.chem_utils.evaluator.diversity(list_of_smiles)[source]#

Evaluate the internal diversity of a set of molecules. The internal diversity is defined as the average pairwise Tanimoto distance between the Morgan fingerprints.

Parameters

list_of_smiles – list of SMILES strings

Returns

div – internal diversity

Return type

float
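The definition can be sketched without a cheminformatics library by representing each fingerprint as a set of on-bit indices (an assumption; the function itself takes SMILES strings and computes Morgan fingerprints). Tanimoto distance is taken as 1 minus Tanimoto similarity, and `internal_diversity` is a hypothetical name.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto similarity of two sets of on-bit indices."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def internal_diversity(fingerprints):
    """Average pairwise Tanimoto distance over all unordered pairs."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0
    distances = [1.0 - tanimoto(a, b) for a, b in pairs]
    return sum(distances) / len(distances)


# Two identical fingerprints and one disjoint fingerprint.
print(internal_diversity([{1, 2}, {1, 2}, {3, 4}]))
```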

tdc.chem_utils.evaluator.fcd_distance(generated_smiles_lst, training_smiles_lst)[source]#

Evaluate FCD distance between generated smiles set and training smiles set.

Parameters
  • generated_smiles_lst – list (of SMILES string), which are generated.

  • training_smiles_lst – list (of SMILES string), which are used for training.

Returns

fcd_distance – the FCD distance

Return type

float

tdc.chem_utils.evaluator.fcd_distance_tf(generated_smiles_lst, training_smiles_lst)[source]#

Evaluate FCD distance between generated smiles set and training smiles set using tensorflow.

Parameters
  • generated_smiles_lst – list (of SMILES string), which are generated.

  • training_smiles_lst – list (of SMILES string), which are used for training.

Returns

fcd_distance – the FCD distance

Return type

float

tdc.chem_utils.evaluator.fcd_distance_torch(generated_smiles_lst, training_smiles_lst)[source]#

Evaluate FCD distance between generated smiles set and training smiles set using PyTorch.

Parameters
  • generated_smiles_lst – list (of SMILES string), which are generated.

  • training_smiles_lst – list (of SMILES string), which are used for training.

Returns

fcd_distance – the FCD distance

Return type

float

tdc.chem_utils.evaluator.get_fingerprints(mols, radius=2, length=4096)[source]#

Converts molecules to ECFP bitvectors.

Parameters
  • mols – RDKit molecules

  • radius – ECFP fingerprint radius

  • length – number of bits

Returns

a list of fingerprints

tdc.chem_utils.evaluator.get_mols(smiles_list)[source]#

Convert SMILES strings to RDKit RDMol objects.

Parameters

smiles_list – list of SMILES strings

Returns

mols – list of RDKit RDMol objects

Return type

list

tdc.chem_utils.evaluator.kl_divergence(generated_smiles_lst, training_smiles_lst)[source]#

Evaluate the KL divergence of set of generated smiles using list of training smiles as reference. KL divergence is defined as the averaged KL divergence of a set of physical chemical descriptors between a set of generated molecules and a set of training molecules.

Parameters
  • generated_smiles_lst – list (of SMILES string), which are generated.

  • training_smiles_lst – list (of SMILES string), which are used for training.

Returns

KL divergence

Return type

float

tdc.chem_utils.evaluator.novelty(generated_smiles_lst, training_smiles_lst)[source]#

Evaluate the novelty of a set of generated SMILES, using a list of training SMILES as reference. Novelty is defined as the fraction of generated molecules that do not appear in the training set.

Parameters
  • generated_smiles_lst – list (of SMILES string), which are generated.

  • training_smiles_lst – list (of SMILES string), which are used for training.

Returns

novelty – fraction of generated molecules not in the training set

Return type

float
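A minimal sketch of the definition, comparing SMILES by exact string equality. Whether the library canonicalizes or deduplicates first is not stated on this page; this sketch deduplicates the generated list and does no canonicalization, both of which are assumptions.

```python
def novelty(generated_smiles, training_smiles):
    """Fraction of distinct generated SMILES that are absent from the training set."""
    generated = set(generated_smiles)
    if not generated:
        return 0.0
    training = set(training_smiles)
    novel = [smiles for smiles in generated if smiles not in training]
    return len(novel) / len(generated)


print(novelty(["CCO", "CCN", "c1ccccc1"], ["CCO"]))
```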

tdc.chem_utils.evaluator.single_molecule_validity(smiles)[source]#

Evaluate the chemical validity of a single molecule in terms of SMILES string

Parameters

smiles – str, SMILES string.

Returns

whether the SMILES string is a valid molecule

Return type

bool

tdc.chem_utils.evaluator.unique_lst_of_smiles(list_of_smiles)[source]#
tdc.chem_utils.evaluator.uniqueness(list_of_smiles)[source]#

Evaluate the uniqueness of a list of SMILES string, i.e., the fraction of unique molecules among a given list.

Parameters

list_of_smiles – list (of SMILES string)

Returns

uniqueness – fraction of unique molecules

Return type

float
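The fraction of unique molecules can be sketched as the number of distinct strings divided by the list length. This version compares raw strings, so two different SMILES spellings of the same molecule count as distinct; whether the library canonicalizes first is not stated on this page.

```python
def uniqueness(list_of_smiles):
    """Fraction of distinct entries in a list of SMILES strings."""
    if not list_of_smiles:
        return 0.0
    return len(set(list_of_smiles)) / len(list_of_smiles)


print(uniqueness(["CCO", "CCO", "CCN", "C"]))
```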

tdc.chem_utils.evaluator.validity(list_of_smiles)[source]#