tdc.chem_utils#
tdc.chem_utils.featurize module#
tdc.chem_utils.featurize.molconvert submodule#
- class tdc.chem_utils.featurize.molconvert.MolConvert(src='SMILES', dst='Graph2D', radius=2, nBits=1024)[source]#
Bases:
object
MolConvert: convert a molecule from the src format to the dst format.
Example
convert = MolConvert(src = 'SMILES', dst = 'Graph2D')
g = convert('Clc1ccccc1C2C(=C(/N/C(=C2/C(=O)OCC)COCCN)C)C(=O)OC')
# g: graph with edge, node features
g = convert(['Clc1ccccc1C2C(=C(/N/C(=C2/C(=O)OCC)COCCN)C)C(=O)OC',
'CCCOc1cc2ncnc(Nc3ccc4ncsc4c3)c2cc1S(=O)(=O)C(C)(C)C'])
# g: a list of graphs with edge, node features
If src is 2D, dst can only be a 2D output.
If src is 3D, dst can be either a 2D or a 3D output.
- src: 2D - [SMILES, SELFIES]
3D - [SDF file, XYZ file]
- dst: 2D - [2D Graph (+ PyG, DGL format), Canonical SMILES, SELFIES, Fingerprints]
3D - [3D graphs (adj matrix entry is (distance, bond type)), Coulomb Matrix]
- class tdc.chem_utils.featurize.molconvert.MoleculeFingerprint(fp='ECFP4')[source]#
Bases:
object
Example:
MolFP = MoleculeFingerprint(fp = 'ECFP6')
out = MolFP('Clc1ccccc1C2C(=C(/N/C(=C2/C(=O)OCC)COCCN)C)C(=O)OC')
# np.array([1, 0, 1, …])
out = MolFP(['Clc1ccccc1C2C(=C(/N/C(=C2/C(=O)OCC)COCCN)C)C(=O)OC',
'CCCOc1cc2ncnc(Nc3ccc4ncsc4c3)c2cc1S(=O)(=O)C(C)(C)C'])
# np.array([[1, 0, 1, …],
# [0, 0, 1, …]])
Supported FPs: Basic_Descriptors (atoms, chirality, …), ECFP2, ECFP4, ECFP6, MACCS, Daylight-type, RDKit2D, Morgan, PubChem
- tdc.chem_utils.featurize.molconvert.atom2onehot(atom)[source]#
convert atom to one-hot feature vector
- Parameters
atom – str, an atom symbol, e.g. 'C'
- Returns
[1, 0, 0, 0, 0, ..]
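A minimal sketch of one-hot encoding for atom symbols. The vocabulary `ATOM_VOCAB` and the helper name `atom_to_onehot` are assumptions made for illustration; the vocabulary and ordering that `atom2onehot` actually uses are not shown in this reference.

```python
# Hypothetical atom vocabulary; the library's real vocabulary may differ.
ATOM_VOCAB = ["C", "N", "O", "S", "F", "Cl", "Br"]


def atom_to_onehot(atom, vocab=ATOM_VOCAB):
    """Return a list with a single 1 at the position of `atom` in `vocab`."""
    if atom not in vocab:
        raise ValueError(f"unknown atom symbol: {atom!r}")
    return [1 if symbol == atom else 0 for symbol in vocab]


print(atom_to_onehot("C"))
```

Raising on unknown symbols is a design choice of this sketch; an alternative is to reserve one extra slot for "other" atoms.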
- tdc.chem_utils.featurize.molconvert.mol2file2smiles(molfile)[source]#
convert mol2file into SMILES string
- Parameters
molfile – str, path to a mol2 file.
- Returns
str, a SMILES string
- Return type
smiles
- tdc.chem_utils.featurize.molconvert.mol_conformer2graph3d(mol_conformer_lst)[source]#
convert a list of (molecule, conformer) tuples into a list of 3D graphs.
- Parameters
mol_conformer_lst – list of tuple (molecule, conformer)
- Returns
a list of 3D graphs.
each graph has (i) idx2atom (dict); (ii) distance_adj_matrix (np.array); (iii) bondtype_adj_matrix (np.array)
- Return type
graph3d_lst
- tdc.chem_utils.featurize.molconvert.molfile2smiles(molfile)[source]#
convert molfile into SMILES string
- Parameters
molfile – str, a file.
- Returns
str, a SMILES string
- Return type
smiles
- tdc.chem_utils.featurize.molconvert.raw3D2pyg(raw3d_feature)[source]#
convert raw3d feature to pyg (torch-geometric) feature
- Parameters
raw3d_feature – (atom_string_list, positions, y) - atom_string_list: list, each element is an atom, length is N - positions: np.array, shape: (N,3) - y: float
- Returns
data = Data(x=x, pos=pos, y=y)
- tdc.chem_utils.featurize.molconvert.sdffile2coulomb(sdf)[source]#
convert an SDF file into a list of Coulomb features.
- Parameters
sdf – str, path to an SDF file
- Returns
np.array
- Return type
coulomb feature
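As an illustration of what a Coulomb feature encodes, the sketch below builds a Coulomb matrix using the commonly cited definition: diagonal entries 0.5 * Z_i^2.4 and off-diagonal entries Z_i * Z_j / |R_i - R_j|. It assumes atomic numbers and 3D coordinates are already available; `coulomb_matrix` is a hypothetical helper, and whether `sdffile2coulomb` uses exactly this form, or applies sorting or padding, is not stated in this reference.

```python
import math


def coulomb_matrix(atomic_numbers, positions):
    """Build a Coulomb matrix from nuclear charges and 3D coordinates."""
    n = len(atomic_numbers)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal term: 0.5 * Z_i ** 2.4
                matrix[i][j] = 0.5 * atomic_numbers[i] ** 2.4
            else:
                # Off-diagonal term: Z_i * Z_j / distance between nuclei
                distance = math.dist(positions[i], positions[j])
                matrix[i][j] = atomic_numbers[i] * atomic_numbers[j] / distance
    return matrix


# Two atoms (Z = 1 and Z = 8) placed 2.0 length units apart.
cm = coulomb_matrix([1, 8], [(0.0, 0.0, 0.0), (0.0, 0.0, 2.0)])
print(cm[0][1])
```

The coordinate units are left to the caller in this sketch; the matrix entries scale inversely with whatever length unit is used.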
- tdc.chem_utils.featurize.molconvert.sdffile2graph3d_lst(sdffile)[source]#
convert an SDF file into a list of 3D graphs.
- Parameters
sdffile – SDF file
- Returns
a list of 3D graphs.
each graph has (i) idx2atom (dict); (ii) distance_adj_matrix (np.array); (iii) bondtype_adj_matrix (np.array)
- Return type
graph3d_lst
- tdc.chem_utils.featurize.molconvert.sdffile2mol_conformer(sdffile)[source]#
convert sdffile into a list of molecule conformers.
- Parameters
sdffile – str, file
- Returns
a list of molecule conformers.
- Return type
smiles_lst
- tdc.chem_utils.featurize.molconvert.sdffile2selfies_lst(sdf)[source]#
convert an SDF file into a list of SELFIES strings.
- Parameters
sdf – str, path to an SDF file
- Returns
a list of SELFIES strings.
- Return type
selfies_lst
- tdc.chem_utils.featurize.molconvert.sdffile2smiles_lst(sdffile)[source]#
convert an SDF file into a list of SMILES strings.
- Parameters
sdffile – str, file
- Returns
a list of SMILES strings.
- Return type
smiles_lst
- tdc.chem_utils.featurize.molconvert.selfies2smiles(selfies)[source]#
Convert selfies into smiles.
- Parameters
selfies – str, a SELFIES string.
- Returns
str, a SMILES string
- Return type
smiles
- tdc.chem_utils.featurize.molconvert.smiles2DGL(smiles)[source]#
convert SMILES string into dgl.DGLGraph
- Parameters
smiles – str, a SMILES string
- Returns
dgl.DGLGraph()
- Return type
g
- tdc.chem_utils.featurize.molconvert.smiles2ECFP2(smiles)[source]#
Convert smiles into ECFP2 Morgan Fingerprint.
- Parameters
smiles – str
- Returns
rdkit.DataStructs.cDataStructs.UIntSparseIntVect
- Return type
fp
- tdc.chem_utils.featurize.molconvert.smiles2ECFP4(smiles)[source]#
Convert smiles into ECFP4 Morgan Fingerprint.
- Parameters
smiles – str
- Returns
rdkit.DataStructs.cDataStructs.UIntSparseIntVect
- Return type
fp
- tdc.chem_utils.featurize.molconvert.smiles2ECFP6(smiles)[source]#
Convert smiles into ECFP6 Morgan Fingerprint.
- Parameters
smiles – str, a SMILES string
- Returns
rdkit.DataStructs.cDataStructs.UIntSparseIntVect
- Return type
fp
refer: https://github.com/rdkit/benchmarking_platform/blob/master/scoring/fingerprint_lib.py
- tdc.chem_utils.featurize.molconvert.smiles2PyG(smiles)[source]#
convert SMILES string into torch_geometric.data.Data
- Parameters
smiles – str, a SMILES string
- Returns
data, torch_geometric.data.Data
- tdc.chem_utils.featurize.molconvert.smiles2daylight(s)[source]#
Convert smiles into 2048-dim Daylight feature.
- Parameters
s – str, a SMILES string
- Returns
numpy.array
- Return type
fp
- tdc.chem_utils.featurize.molconvert.smiles2graph2D(smiles)[source]#
convert SMILES string into two-dimensional molecular graph feature
- Parameters
smiles – str, a SMILES string
- Returns
idx2atom: dict, map from index to atom's symbol, e.g., {0: 'C', 1: 'N', …}
adj_matrix: np.array
- tdc.chem_utils.featurize.molconvert.smiles2maccs(s)[source]#
Convert smiles into maccs feature.
- Parameters
s – str, a SMILES string
- Returns
numpy.array
- Return type
fp
- tdc.chem_utils.featurize.molconvert.smiles2mol(smiles)[source]#
Convert SMILES string into rdkit.Chem.rdchem.Mol.
- Parameters
smiles – str, a SMILES string.
- Returns
rdkit.Chem.rdchem.Mol
- Return type
mol
- tdc.chem_utils.featurize.molconvert.smiles2morgan(s, radius=2, nBits=1024)[source]#
Convert smiles into Morgan Fingerprint.
- Parameters
s – str, a SMILES string
radius – int (default: 2)
nBits – int (default: 1024)
- Returns
numpy.array
- Return type
fp
- tdc.chem_utils.featurize.molconvert.smiles2rdkit2d(s)[source]#
Convert smiles into 200-dim Normalized RDKit 2D vector.
- Parameters
s – str, a SMILES string
- Returns
numpy.array
- Return type
fp
- tdc.chem_utils.featurize.molconvert.smiles2selfies(smiles)[source]#
Convert smiles into selfies.
- Parameters
smiles – str, a SMILES string
- Returns
str, a SELFIES string.
- Return type
selfies
- tdc.chem_utils.featurize.molconvert.smiles_lst2coulomb(smiles_lst)[source]#
convert a list of SMILES strings into coulomb format.
- Parameters
smiles_lst – a list of SMILES strings.
- Returns
np.array
- Return type
features
tdc.chem_utils.oracle module#
tdc.chem_utils.oracle.filter submodule#
- class tdc.chem_utils.oracle.filter.MolFilter(filters='all', property_filters_flag=True, HBA=[0, 10], HBD=[0, 5], LogP=[-5, 5], MW=[0, 500], Rot=[0, 10], TPSA=[0, 200])[source]#
Bases:
object
Molecule filter: filters molecules based on user-specified conditions.
- Parameters
filters –
property_filters_flag – bool
HBA – [lower_bound, upper_bound]
HBD – [lower_bound, upper_bound]
LogP – [lower_bound, upper_bound]
MW – [lower_bound, upper_bound], molecular weight
Rot – [lower_bound, upper_bound]
TPSA – [lower_bound, upper_bound]
- Returns
list of SMILES strings that pass the filter.
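The property bounds in the signature can be read as simple range checks. The sketch below assumes the properties have already been computed for a molecule and that both bounds are inclusive; `DEFAULT_BOUNDS` mirrors the defaults in the signature above, while `passes_property_filters` and the example values are hypothetical and are not part of the library.

```python
# Default [lower_bound, upper_bound] pairs copied from the MolFilter signature.
DEFAULT_BOUNDS = {
    "HBA": (0, 10),
    "HBD": (0, 5),
    "LogP": (-5, 5),
    "MW": (0, 500),
    "Rot": (0, 10),
    "TPSA": (0, 200),
}


def passes_property_filters(properties, bounds=DEFAULT_BOUNDS):
    """Return True if every property lies within its inclusive bounds."""
    return all(
        lower <= properties[name] <= upper
        for name, (lower, upper) in bounds.items()
    )


# Made-up property values, for illustration only.
example = {"HBA": 4, "HBD": 1, "LogP": 2.3, "MW": 320.0, "Rot": 5, "TPSA": 75.0}
print(passes_property_filters(example))
```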
tdc.chem_utils.oracle.oracle submodule#
- class tdc.chem_utils.oracle.oracle.AbsoluteScoreModifier(target_value: float)[source]#
Bases:
ScoreModifier
Score modifier that has a maximum at a given target value, and decreases linearly with increasing distance from the target value.
- class tdc.chem_utils.oracle.oracle.ChainedModifier(modifiers: List[ScoreModifier])[source]#
Bases:
ScoreModifier
- Calls several modifiers one after the other, for instance:
score = modifier3(modifier2(modifier1(raw_score)))
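The chaining described above is ordinary function composition. The following sketch uses plain functions in place of `ScoreModifier` objects; `chain`, `double`, and `cap_at_one` are hypothetical names introduced only for this illustration.

```python
from functools import reduce


def chain(modifiers):
    """Return a function that applies `modifiers` in list order."""
    def chained(raw_score):
        return reduce(lambda score, modifier: modifier(score), modifiers, raw_score)
    return chained


def double(score):
    return 2.0 * score


def cap_at_one(score):
    return min(score, 1.0)


# Equivalent to cap_at_one(double(raw_score)).
modifier = chain([double, cap_at_one])
print(modifier(0.8))
```

Order matters: `chain([cap_at_one, double])` caps first and then doubles, which gives a different result for the same input.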
- class tdc.chem_utils.oracle.oracle.ClippedScoreModifier(upper_x: float, lower_x=0.0, high_score=1.0, low_score=0.0)[source]#
Bases:
ScoreModifier
Clips a score between specified low and high scores, and does a linear interpolation in between.
This class works as follows: First the input is mapped onto a linear interpolation between both specified points. Then the generated values are clipped between low and high scores.
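The two steps described above (interpolate, then clip) can be sketched as a single function. This is an illustration under the assumptions that `low_score <= high_score` and `upper_x != lower_x`; `clipped_score` is a hypothetical helper, not the class itself.

```python
def clipped_score(x, upper_x, lower_x=0.0, high_score=1.0, low_score=0.0):
    """Linear interpolation through (lower_x, low_score) and (upper_x, high_score), then clipping."""
    slope = (high_score - low_score) / (upper_x - lower_x)
    intercept = high_score - slope * upper_x
    value = slope * x + intercept
    # Clip the interpolated value into [low_score, high_score].
    return max(low_score, min(high_score, value))


print(clipped_score(5.0, upper_x=10.0))
```

Because only the line through the two points is used, the same function also handles a decreasing mapping when `upper_x` is smaller than `lower_x`.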
- class tdc.chem_utils.oracle.oracle.GaussianModifier(mu: float, sigma: float)[source]#
Bases:
ScoreModifier
Score modifier that reproduces a Gaussian bell shape.
- class tdc.chem_utils.oracle.oracle.Isomer_scoring(target_smiles, means='geometric')[source]#
Bases:
object
- class tdc.chem_utils.oracle.oracle.Isomer_scoring_prev(target_smiles, means='geometric')[source]#
Bases:
object
- class tdc.chem_utils.oracle.oracle.LinearModifier(slope=1.0)[source]#
Bases:
ScoreModifier
Score modifier that multiplies the score by a scalar (default: 1, i.e. do nothing).
- class tdc.chem_utils.oracle.oracle.MinMaxGaussianModifier(mu: float, sigma: float, minimize=False)[source]#
Bases:
ScoreModifier
Score modifier that reproduces a half Gaussian bell shape. For minimize==True, the function is 1.0 for x <= mu and decreases to zero for x > mu. For minimize==False, the function is 1.0 for x >= mu and decreases to zero for x < mu.
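A minimal sketch of the half-Gaussian behaviour described above, assuming an unnormalized Gaussian with peak value 1.0. The function names are hypothetical and stand in for the modifier classes.

```python
import math


def gaussian(x, mu, sigma):
    """Unnormalized Gaussian bell with value 1.0 at x == mu."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)


def min_max_gaussian(x, mu, sigma, minimize=False):
    """Half Gaussian: flat at 1.0 on one side of mu, Gaussian decay on the other."""
    if minimize:
        return 1.0 if x <= mu else gaussian(x, mu, sigma)
    return 1.0 if x >= mu else gaussian(x, mu, sigma)


print(min_max_gaussian(1.0, mu=2.0, sigma=1.0))
```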
- class tdc.chem_utils.oracle.oracle.PyScreener_meta(receptor_pdb_file, box_center, box_size, software_class='vina', ncpu=4, **kwargs)[source]#
Bases:
object
Evaluate docking score
Args:
Return:
- tdc.chem_utils.oracle.oracle.SA(s)[source]#
Evaluate SA score of a SMILES string
- Parameters
s – str, a SMILES string
- Returns
float
- Return type
SAscore
- class tdc.chem_utils.oracle.oracle.ScoreModifier[source]#
Bases:
object
Interface for score modifiers.
- class tdc.chem_utils.oracle.oracle.Score_3d(receptor_pdbqt_file, center, box_size, scorefunction='vina')[source]#
Bases:
object
Evaluate Vina score (force field) for a conformer binding to a receptor
- class tdc.chem_utils.oracle.oracle.SmoothClippedScoreModifier(upper_x: float, lower_x=0.0, high_score=1.0, low_score=0.0)[source]#
Bases:
ScoreModifier
Smooth variant of ClippedScoreModifier.
Implemented as a logistic function that has the same steepness as ClippedScoreModifier in the center of the logistic function.
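One way to match the steepness is to note that a logistic curve of height L and rate k has slope L * k / 4 at its center, while the clipped linear version has slope L / (upper_x - lower_x). Setting the two equal gives k = 4 / (upper_x - lower_x). The sketch below follows that derivation; it is an assumption about how the matching could be done, and `smooth_clipped_score` is a hypothetical helper.

```python
import math


def smooth_clipped_score(x, upper_x, lower_x=0.0, high_score=1.0, low_score=0.0):
    """Logistic curve whose central slope equals that of the clipped linear map."""
    k = 4.0 / (upper_x - lower_x)            # rate chosen to match the linear slope
    middle_x = (upper_x + lower_x) / 2.0     # center of the logistic curve
    height = high_score - low_score
    return low_score + height / (1.0 + math.exp(-k * (x - middle_x)))


print(smooth_clipped_score(5.0, upper_x=10.0))
```

Unlike the clipped version, this curve only approaches `low_score` and `high_score` asymptotically.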
- class tdc.chem_utils.oracle.oracle.SquaredModifier(target_value: float, coefficient=1.0)[source]#
Bases:
ScoreModifier
Score modifier that has a maximum at a given target value, and decreases quadratically with increasing distance from the target value.
- class tdc.chem_utils.oracle.oracle.ThresholdedLinearModifier(threshold: float)[source]#
Bases:
ScoreModifier
Returns a value of min(input, threshold)/threshold.
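The formula above translates directly into code. The sketch assumes a positive threshold; `thresholded_linear` is a hypothetical stand-in for the class.

```python
def thresholded_linear(x, threshold):
    """Scale x linearly up to `threshold`, then saturate at 1.0."""
    return min(x, threshold) / threshold


print(thresholded_linear(2.0, threshold=4.0))
```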
- class tdc.chem_utils.oracle.oracle.Vina_3d(receptor_pdbqt_file, center, box_size, scorefunction='vina')[source]#
Bases:
object
Perform docking search from a conformer.
- class tdc.chem_utils.oracle.oracle.Vina_smiles(receptor_pdbqt_file, center, box_size, scorefunction='vina')[source]#
Bases:
object
Perform docking search from a conformer.
- tdc.chem_utils.oracle.oracle.askcos(smiles, host_ip, output='plausibility', save_json=False, file_name='tree_builder_result.json', num_trials=5, max_depth=9, max_branching=25, expansion_time=60, max_ppg=100, template_count=1000, max_cum_prob=0.999, chemical_property_logic='none', max_chemprop_c=0, max_chemprop_n=0, max_chemprop_o=0, max_chemprop_h=0, chemical_popularity_logic='none', min_chempop_reactants=5, min_chempop_products=5, filter_threshold=0.1, return_first='true')[source]#
The ASKCOS retrosynthetic analysis oracle function. Please refer to https://github.com/connorcoley/ASKCOS for how to run ASKCOS with Docker on a server that receives requests.
- tdc.chem_utils.oracle.oracle.canonicalize(smiles: str, include_stereocenters=True)[source]#
Canonicalize the SMILES strings with RDKit.
The algorithm is detailed under https://pubs.acs.org/doi/full/10.1021/acs.jcim.5b00543
- Parameters
smiles – SMILES string to canonicalize
include_stereocenters – whether to keep the stereochemical information in the canonical SMILES string
- Returns
Canonicalized SMILES string, None if the molecule is invalid.
- tdc.chem_utils.oracle.oracle.drd2(smile)[source]#
Evaluate DRD2 score of a SMILES string
- Parameters
smile – str, a SMILES string
- Returns
float
- Return type
drd_score
- tdc.chem_utils.oracle.oracle.gsk3b(smiles)[source]#
Evaluate GSK3B score of a SMILES string
- Parameters
smiles – str
- Returns
float, between 0 and 1.
- Return type
gsk3_score
- tdc.chem_utils.oracle.oracle.ibm_rxn(smiles, api_key, output='confidence', sleep_time=30)[source]#
This function is modified from Dr. Jan Jensen’s code
- class tdc.chem_utils.oracle.oracle.jnk3[source]#
Bases:
object
Evaluate JNK3 score of a SMILES string
- Parameters
smiles – str
- Returns
float, between 0 and 1.
- Return type
jnk3_score
- tdc.chem_utils.oracle.oracle.load_pickled_model(name: str)[source]#
Loading a pretrained model serialized with pickle. Usually for sklearn models.
- Parameters
name – Name of the model to load.
- Returns
The model.
- class tdc.chem_utils.oracle.oracle.median_meta(target_smiles_1, target_smiles_2, fp1='ECFP6', fp2='ECFP6', modifier_func1=None, modifier_func2=None, means='geometric')[source]#
Bases:
object
- tdc.chem_utils.oracle.oracle.parse_molecular_formula(formula)[source]#
Parse a molecular formula to get the element types and counts.
- Parameters
formula – molecular formula, e.g. "C8H3F3Br"
- Returns
A list of tuples containing element types and number of occurrences.
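A minimal sketch of this kind of parsing with a regular expression, assuming a flat formula without parentheses or charges and treating a missing count as 1. `parse_formula` is a hypothetical name; the library's own implementation may differ.

```python
import re


def parse_formula(formula):
    """Split a flat molecular formula into (element, count) tuples."""
    # An element is an uppercase letter followed by optional lowercase letters;
    # the count is an optional run of digits.
    matches = re.findall(r"([A-Z][a-z]*)(\d*)", formula)
    return [(element, int(count) if count else 1) for element, count in matches]


print(parse_formula("C8H3F3Br"))
```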
- tdc.chem_utils.oracle.oracle.penalized_logp(s)[source]#
Evaluate penalized LogP score of a SMILES string
- Parameters
s – str, a SMILES string
- Returns
float, between -infinity and +infinity
- Return type
logp_score
- tdc.chem_utils.oracle.oracle.qed(smiles)[source]#
Evaluate QED score of a SMILES string
- Parameters
smiles – str
- Returns
float, between 0 and 1.
- Return type
qed_score
- class tdc.chem_utils.oracle.oracle.rediscovery_meta(target_smiles, fp='ECFP4')[source]#
Bases:
object
- tdc.chem_utils.oracle.oracle.similarity(smiles_a, smiles_b)[source]#
Evaluate Tanimoto similarity between 2 SMILES strings
- Parameters
smiles_a – str, SMILES string
smiles_b – str, SMILES string
- Returns
float, between 0 and 1.
- Return type
similarity score
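Tanimoto similarity is the size of the intersection of two fingerprints divided by the size of their union. The sketch below applies it to sets of "on" bit indices; it assumes the fingerprints have already been computed, and both the helper name and the convention for two empty fingerprints are choices made for this illustration.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two collections of 'on' bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Convention chosen for this sketch: two empty fingerprints score 0.0.
        return 0.0
    return len(a & b) / union


print(tanimoto({1, 2, 3}, {2, 3, 4}))
```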
- class tdc.chem_utils.oracle.oracle.similarity_meta(target_smiles, fp='FCFP4', modifier_func=None)[source]#
Bases:
object
- tdc.chem_utils.oracle.oracle.smiles_2_fingerprint_AP(smiles)[source]#
Convert smiles into Atom Pair Fingerprint.
- Parameters
smiles – str, SMILES string.
- Returns
rdkit.DataStructs.cDataStructs.IntSparseIntVect
- Return type
fp
- tdc.chem_utils.oracle.oracle.smiles_2_fingerprint_ECFP4(smiles)[source]#
Convert smiles into ECFP4 Morgan Fingerprint.
- Parameters
smiles – str, SMILES string.
- Returns
rdkit.DataStructs.cDataStructs.UIntSparseIntVect
- Return type
fp
- tdc.chem_utils.oracle.oracle.smiles_2_fingerprint_ECFP6(smiles)[source]#
Convert smiles into ECFP6 Fingerprint.
- Parameters
smiles – str, SMILES string.
- Returns
rdkit.DataStructs.cDataStructs.UIntSparseIntVect
- Return type
fp
- tdc.chem_utils.oracle.oracle.smiles_2_fingerprint_FCFP4(smiles)[source]#
Convert smiles into FCFP4 Morgan Fingerprint.
- Parameters
smiles – str, SMILES string.
- Returns
rdkit.DataStructs.cDataStructs.UIntSparseIntVect
- Return type
fp
- tdc.chem_utils.oracle.oracle.smiles_to_rdkit_mol(smiles)[source]#
Convert smiles into rdkit’s mol (molecule) format.
- Parameters
smiles – str, SMILES string.
- Returns
rdkit.Chem.rdchem.Mol
- Return type
mol
- tdc.chem_utils.oracle.oracle.smina(ligand, protein, score_only=False, raw_input=False)[source]#
Smina is a docking algorithm that docks a ligand to a protein pocket.
Koes, D.R., Baumgartner, M.P. and Camacho, C.J., 2013. Lessons learned in empirical scoring with smina from the CSAR 2011 benchmarking exercise. Journal of chemical information and modeling, 53(8), pp.1893-1904.
- Parameters
ligand (array) – (N_1,3) matrix, where N_1 is ligand size.
protein (array) – (N_2,3) matrix, where N_2 is protein size.
score_only (boolean) – whether to only return docking score.
raw_input (boolean) – whether to input raw ML input or sdf file input
- Returns
docking_info – docking result
- Return type
- tdc.chem_utils.oracle.oracle.tree_analysis(current)[source]#
Analyze the result of the tree builder. Calculates:
1. Number of steps
2. Pi plausibility
3. Whether a path was found
In case of a celery error, all values are -1.
- Returns
num_path: number of paths found
status: same as implemented in ASKCOS
num_step: number of steps
p_score: Pi plausibility
synthesizability: binary code
price: price to synthesize the query compound
tdc.chem_utils.evaluator module#
- tdc.chem_utils.evaluator.calculate_internal_pairwise_similarities(smiles_list)[source]#
Computes the pairwise similarities of the provided list of smiles against itself.
- Parameters
smiles_list – list of str
- Returns
Symmetric matrix of pairwise similarities. Diagonal is set to zero.
- tdc.chem_utils.evaluator.calculate_pc_descriptors(smiles, pc_descriptors)[source]#
Calculate physicochemical descriptors of a list of molecules.
- Parameters
smiles – list of SMILES strings
pc_descriptors – list of strings, names of descriptors to calculate
- Returns
list of float
- Return type
descriptors
- tdc.chem_utils.evaluator.canonicalize(smiles)[source]#
Convert SMILES into canonical form.
- Parameters
smiles – str, SMILES string
- Returns
str, canonical SMILES string.
- Return type
smiles
- tdc.chem_utils.evaluator.continuous_kldiv(X_baseline: array, X_sampled: array) → float [source]#
calculate KL divergence for two numpy arrays, continuous version.
- Parameters
X_baseline – numpy array
X_sampled – numpy array
- Returns
float
- Return type
KL divergence
- tdc.chem_utils.evaluator.discrete_kldiv(X_baseline: array, X_sampled: array) → float [source]#
calculate KL divergence for two numpy arrays, discrete version.
- Parameters
X_baseline – numpy array
X_sampled – numpy array
- Returns
float
- Return type
KL divergence
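To make the quantity concrete, the sketch below computes KL(P || Q) between the empirical distributions of two samples of discrete values. The direction (baseline as P, sampled as Q), the small `epsilon` used to avoid division by zero, and the helper name `discrete_kl` are all assumptions of this illustration; the library function may bin values into a histogram or smooth differently.

```python
import math
from collections import Counter


def discrete_kl(baseline, sampled, epsilon=1e-10):
    """KL(P || Q) between empirical distributions of two discrete samples."""
    support = sorted(set(baseline) | set(sampled))
    p_counts, q_counts = Counter(baseline), Counter(sampled)
    # Empirical frequencies, smoothed so that no probability is exactly zero.
    p = [p_counts[value] / len(baseline) + epsilon for value in support]
    q = [q_counts[value] / len(sampled) + epsilon for value in support]
    p_total, q_total = sum(p), sum(q)
    return sum(
        (pi / p_total) * math.log((pi / p_total) / (qi / q_total))
        for pi, qi in zip(p, q)
    )


print(discrete_kl([0, 0, 0, 1], [0, 1, 1, 1]))
```

KL divergence is not symmetric, so swapping the two arguments generally gives a different value.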
- tdc.chem_utils.evaluator.diversity(list_of_smiles)[source]#
Evaluate the internal diversity of a set of molecules. The internal diversity is defined as the average pairwise Tanimoto distance between the Morgan fingerprints.
- Parameters
list_of_smiles – list of SMILES strings
- Returns
float
- Return type
div
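The definition above can be sketched on precomputed fingerprints represented as sets of "on" bit indices, with Tanimoto distance taken as 1 minus Tanimoto similarity. Averaging over unordered pairs of distinct molecules is an assumption of this sketch, and `internal_diversity` is a hypothetical helper; the library computes Morgan fingerprints from SMILES first.

```python
from itertools import combinations


def tanimoto_distance(bits_a, bits_b):
    """1 - |A & B| / |A | B| for two sets of 'on' bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 1.0  # convention for this sketch: empty fingerprints are maximally distant
    return 1.0 - len(bits_a & bits_b) / union


def internal_diversity(fingerprints):
    """Average Tanimoto distance over all unordered pairs of fingerprints."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0
    return sum(tanimoto_distance(a, b) for a, b in pairs) / len(pairs)


print(internal_diversity([{1, 2}, {2, 3}, {1, 2}]))
```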
- tdc.chem_utils.evaluator.fcd_distance(generated_smiles_lst, training_smiles_lst)[source]#
Evaluate FCD distance between generated smiles set and training smiles set.
- Parameters
generated_smiles_lst – list (of SMILES string), which are generated.
training_smiles_lst – list (of SMILES string), which are used for training.
- Returns
float
- Return type
fcd_distance
- tdc.chem_utils.evaluator.fcd_distance_tf(generated_smiles_lst, training_smiles_lst)[source]#
Evaluate FCD distance between generated smiles set and training smiles set using tensorflow.
- Parameters
generated_smiles_lst – list (of SMILES string), which are generated.
training_smiles_lst – list (of SMILES string), which are used for training.
- Returns
float
- Return type
fcd_distance
- tdc.chem_utils.evaluator.fcd_distance_torch(generated_smiles_lst, training_smiles_lst)[source]#
Evaluate FCD distance between generated smiles set and training smiles set using PyTorch.
- Parameters
generated_smiles_lst – list (of SMILES string), which are generated.
training_smiles_lst – list (of SMILES string), which are used for training.
- Returns
float
- Return type
fcd_distance
- tdc.chem_utils.evaluator.get_fingerprints(mols, radius=2, length=4096)[source]#
Converts molecules to ECFP bitvectors.
- Parameters
mols – RDKit molecules
radius – ECFP fingerprint radius
length – number of bits
Returns: a list of fingerprints
- tdc.chem_utils.evaluator.get_mols(smiles_list)[source]#
Convert SMILES strings to RDKit RDMol objects.
- Parameters
smiles_list – list of SMILES strings
- Returns
list of RDKit RDMol objects
- Return type
mols
- tdc.chem_utils.evaluator.kl_divergence(generated_smiles_lst, training_smiles_lst)[source]#
Evaluate the KL divergence of a set of generated SMILES using a list of training SMILES as reference. KL divergence is defined as the averaged KL divergence of a set of physicochemical descriptors between a set of generated molecules and a set of training molecules.
- Parameters
generated_smiles_lst – list (of SMILES string), which are generated.
training_smiles_lst – list (of SMILES string), which are used for training.
- Returns
float
- Return type
KL divergence
- tdc.chem_utils.evaluator.novelty(generated_smiles_lst, training_smiles_lst)[source]#
Evaluate the novelty of a set of generated SMILES using a list of training SMILES as reference. Novelty is defined as the fraction of generated molecules that do not appear in the training set.
- Parameters
generated_smiles_lst – list (of SMILES string), which are generated.
training_smiles_lst – list (of SMILES string), which are used for training.
- Returns
float
- Return type
novelty
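A minimal sketch of the novelty definition above, comparing SMILES by exact string match. It assumes the inputs are already canonical and that duplicates among the generated molecules are counted once; both are assumptions of this illustration, and `novelty_fraction` is a hypothetical name.

```python
def novelty_fraction(generated_smiles, training_smiles):
    """Fraction of unique generated SMILES that are absent from the training set."""
    generated_unique = set(generated_smiles)
    if not generated_unique:
        return 0.0
    training_set = set(training_smiles)
    novel = generated_unique - training_set
    return len(novel) / len(generated_unique)


print(novelty_fraction(["CCO", "CCN", "CC"], ["CCO"]))
```

Exact string comparison only works when both lists use the same canonical form, since one molecule can be written as many different SMILES strings.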
- tdc.chem_utils.evaluator.single_molecule_validity(smiles)[source]#
Evaluate the chemical validity of a single molecule given as a SMILES string
- Parameters
smiles – str, SMILES string.
- Returns
whether the SMILES string is a valid molecule
- Return type
Boolean