tdc.chem_utils#
tdc.chem_utils.featurize module#
tdc.chem_utils.featurize.molconvert submodule#
- class tdc.chem_utils.featurize.molconvert.MolConvert(src='SMILES', dst='Graph2D', radius=2, nBits=1024)[source]#
Bases:
object
MolConvert: convert a molecule from the src format to the dst format.
Example
convert = MolConvert(src = 'SMILES', dst = 'Graph2D')
g = convert('Clc1ccccc1C2C(=C(/N/C(=C2/C(=O)OCC)COCCN)C)C(=O)OC')
# g: graph with edge, node features
g = convert(['Clc1ccccc1C2C(=C(/N/C(=C2/C(=O)OCC)COCCN)C)C(=O)OC',
'CCCOc1cc2ncnc(Nc3ccc4ncsc4c3)c2cc1S(=O)(=O)C(C)(C)C'])
# g: a list of graphs with edge, node features
If src is 2D, dst can only be a 2D output. If src is 3D, dst can be either a 2D or a 3D output.
- src: 2D - [SMILES, SELFIES]
3D - [SDF file, XYZ file]
- dst: 2D - [2D Graph (+ PyG, DGL format), Canonical SMILES, SELFIES, Fingerprints]
3D - [3D graphs (adj matrix entry is (distance, bond type)), Coulomb Matrix]
- class tdc.chem_utils.featurize.molconvert.MoleculeFingerprint(fp='ECFP4')[source]#
Bases:
object
Example:
MolFP = MoleculeFingerprint(fp = 'ECFP6')
out = MolFP('Clc1ccccc1C2C(=C(/N/C(=C2/C(=O)OCC)COCCN)C)C(=O)OC')
# np.array([1, 0, 1, …..])
out = MolFP(['Clc1ccccc1C2C(=C(/N/C(=C2/C(=O)OCC)COCCN)C)C(=O)OC',
'CCCOc1cc2ncnc(Nc3ccc4ncsc4c3)c2cc1S(=O)(=O)C(C)(C)C'])
# np.array([[1, 0, 1, …..],
# [0, 0, 1, …..]])
Supported fingerprints: Basic_Descriptors(atoms, chirality, ….), ECFP2, ECFP4, ECFP6, MACCS, Daylight-type, RDKit2D, Morgan, PubChem
- tdc.chem_utils.featurize.molconvert.atom2onehot(atom)[source]#
Convert an atom symbol to a one-hot feature vector.
- Parameters:
atom – str, an atom symbol, e.g. 'C'
- Returns:
one-hot feature vector, e.g. [1, 0, 0, 0, 0, ..]
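To illustrate the idea of a one-hot atom encoding, here is a minimal sketch. The vocabulary `ATOM_VOCAB` and the helper name `atom_to_onehot` are hypothetical; the atom list and ordering actually used by `atom2onehot` may differ.

```python
# Hypothetical vocabulary; the library's actual atom list and ordering may differ.
ATOM_VOCAB = ["C", "N", "O", "S", "F", "Cl", "Br"]


def atom_to_onehot(atom, vocab=ATOM_VOCAB):
    """Return a list with a single 1 at the position of `atom` in `vocab`."""
    if atom not in vocab:
        raise ValueError(f"unknown atom symbol: {atom!r}")
    return [1 if symbol == atom else 0 for symbol in vocab]


carbon = atom_to_onehot("C")
chlorine = atom_to_onehot("Cl")
```

Each vector has exactly one non-zero entry, so its length equals the vocabulary size.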
- tdc.chem_utils.featurize.molconvert.mol2file2smiles(molfile)[source]#
Convert a mol2 file into a SMILES string.
- Parameters:
molfile – str, a mol2 file.
- Returns:
str, a SMILES string
- Return type:
smiles
- tdc.chem_utils.featurize.molconvert.mol_conformer2graph3d(mol_conformer_lst)[source]#
Convert a list of (molecule, conformer) tuples into a list of 3D graphs.
- Parameters:
mol_conformer_lst – list of tuple (molecule, conformer)
- Returns:
- a list of 3D graphs.
Each graph has (i) idx2atom (dict); (ii) distance_adj_matrix (np.array); (iii) bondtype_adj_matrix (np.array)
- Return type:
graph3d_lst
- tdc.chem_utils.featurize.molconvert.molfile2smiles(molfile)[source]#
Convert a molfile into a SMILES string.
- Parameters:
molfile – str, a file.
- Returns:
str, a SMILES string
- Return type:
smiles
- tdc.chem_utils.featurize.molconvert.raw3D2pyg(raw3d_feature)[source]#
convert raw3d feature to pyg (torch-geometric) feature
- Parameters:
raw3d_feature – (atom_string_list, positions, y)
- atom_string_list: list, each element is an atom, length is N
- positions: np.array, shape: (N,3)
- y: float
- Returns:
data = Data(x=x, pos=pos, y=y)
- tdc.chem_utils.featurize.molconvert.sdffile2coulomb(sdf)[source]#
Convert an SDF file into a list of Coulomb features.
- Parameters:
sdf – str, an SDF file
- Returns:
np.array
- Return type:
coulomb feature
- tdc.chem_utils.featurize.molconvert.sdffile2graph3d_lst(sdffile)[source]#
Convert an SDF file into a list of 3D graphs.
- Parameters:
sdffile – SDF file
- Returns:
- a list of 3D graphs.
Each graph has (i) idx2atom (dict); (ii) distance_adj_matrix (np.array); (iii) bondtype_adj_matrix (np.array)
- Return type:
graph3d_lst
- tdc.chem_utils.featurize.molconvert.sdffile2mol_conformer(sdffile)[source]#
convert sdffile into a list of molecule conformers.
- Parameters:
sdffile – str, file
- Returns:
a list of molecule conformers.
- Return type:
smiles_lst
- tdc.chem_utils.featurize.molconvert.sdffile2selfies_lst(sdf)[source]#
convert sdffile into a list of SELFIES strings.
- Parameters:
sdffile – str, file
- Returns:
a list of SELFIES strings.
- Return type:
selfies_lst
- tdc.chem_utils.featurize.molconvert.sdffile2smiles_lst(sdffile)[source]#
Convert an SDF file into a list of SMILES strings.
- Parameters:
sdffile – str, file
- Returns:
a list of SMILES strings.
- Return type:
smiles_lst
- tdc.chem_utils.featurize.molconvert.selfies2smiles(selfies)[source]#
Convert selfies into smiles.
- Parameters:
selfies – str, a SELFIES string.
- Returns:
str, a SMILES string
- Return type:
smiles
- tdc.chem_utils.featurize.molconvert.smiles2DGL(smiles)[source]#
convert SMILES string into dgl.DGLGraph
- Parameters:
smiles – str, a SMILES string
- Returns:
dgl.DGLGraph()
- Return type:
g
- tdc.chem_utils.featurize.molconvert.smiles2ECFP2(smiles)[source]#
Convert smiles into ECFP2 Morgan Fingerprint.
- Parameters:
smiles – str
- Returns:
rdkit.DataStructs.cDataStructs.UIntSparseIntVect
- Return type:
fp
- tdc.chem_utils.featurize.molconvert.smiles2ECFP4(smiles)[source]#
Convert smiles into ECFP4 Morgan Fingerprint.
- Parameters:
smiles – str
- Returns:
rdkit.DataStructs.cDataStructs.UIntSparseIntVect
- Return type:
fp
- tdc.chem_utils.featurize.molconvert.smiles2ECFP6(smiles)[source]#
Convert smiles into ECFP6 Morgan Fingerprint.
- Parameters:
smiles – str, a SMILES string
- Returns:
rdkit.DataStructs.cDataStructs.UIntSparseIntVect
- Return type:
fp
refer: https://github.com/rdkit/benchmarking_platform/blob/master/scoring/fingerprint_lib.py
- tdc.chem_utils.featurize.molconvert.smiles2PyG(smiles)[source]#
convert SMILES string into torch_geometric.data.Data
- Parameters:
smiles – str, a SMILES string
- Returns:
data, torch_geometric.data.Data
- tdc.chem_utils.featurize.molconvert.smiles2daylight(s)[source]#
Convert smiles into 2048-dim Daylight feature.
- Parameters:
smiles – str
- Returns:
numpy.array
- Return type:
fp
- tdc.chem_utils.featurize.molconvert.smiles2graph2D(smiles)[source]#
convert SMILES string into two-dimensional molecular graph feature
- Parameters:
smiles – str, a SMILES string
- Returns:
idx2atom: dict, map from index to atom's symbol, e.g., {0:'C', 1:'N', …}
adj_matrix: np.array
- tdc.chem_utils.featurize.molconvert.smiles2maccs(s)[source]#
Convert smiles into maccs feature.
- Parameters:
smiles – str
- Returns:
numpy.array
- Return type:
fp
- tdc.chem_utils.featurize.molconvert.smiles2mol(smiles)[source]#
Convert SMILES string into rdkit.Chem.rdchem.Mol.
- Parameters:
smiles – str, a SMILES string.
- Returns:
rdkit.Chem.rdchem.Mol
- Return type:
mol
- tdc.chem_utils.featurize.molconvert.smiles2morgan(s, radius=2, nBits=1024)[source]#
Convert smiles into Morgan Fingerprint.
- Parameters:
smiles – str
radius – int (default: 2)
nBits – int (default: 1024)
- Returns:
numpy.array
- Return type:
fp
- tdc.chem_utils.featurize.molconvert.smiles2rdkit2d(s)[source]#
Convert smiles into 200-dim Normalized RDKit 2D vector.
- Parameters:
smiles – str
- Returns:
numpy.array
- Return type:
fp
- tdc.chem_utils.featurize.molconvert.smiles2selfies(smiles)[source]#
Convert smiles into selfies.
- Parameters:
smiles – str, a SMILES string
- Returns:
str, a SELFIES string.
- Return type:
selfies
- tdc.chem_utils.featurize.molconvert.smiles_lst2coulomb(smiles_lst)[source]#
convert a list of SMILES strings into coulomb format.
- Parameters:
smiles_lst – a list of SMILES strings.
- Returns:
np.array
- Return type:
features
tdc.chem_utils.oracle module#
tdc.chem_utils.oracle.filter submodule#
- class tdc.chem_utils.oracle.filter.MolFilter(filters='all', property_filters_flag=True, HBA=[0, 10], HBD=[0, 5], LogP=[-5, 5], MW=[0, 500], Rot=[0, 10], TPSA=[0, 200])[source]#
Bases:
object
Molecule Filter: filter molecules based on user-specified conditions.
- Parameters:
filters –
property_filters_flag – bool
HBA – [lower_bound, upper_bound], number of hydrogen bond acceptors
HBD – [lower_bound, upper_bound], number of hydrogen bond donors
LogP – [lower_bound, upper_bound]
MW – [lower_bound, upper_bound], molecular weight
Rot – [lower_bound, upper_bound], number of rotatable bonds
TPSA – [lower_bound, upper_bound], topological polar surface area
- Returns:
list of SMILES strings that pass the filter.
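The range-based part of such a filter can be sketched as follows. This is not the library's implementation: the helper `passes_property_filter`, the hand-written property values, and the choice of inclusive bounds are all assumptions made for illustration. Only the default bounds are taken from the `MolFilter` signature above.

```python
# Bounds copied from the MolFilter signature defaults above.
DEFAULT_BOUNDS = {
    "HBA": (0, 10), "HBD": (0, 5), "LogP": (-5, 5),
    "MW": (0, 500), "Rot": (0, 10), "TPSA": (0, 200),
}


def passes_property_filter(props, bounds=DEFAULT_BOUNDS):
    """True if every property lies within [lower_bound, upper_bound] (inclusive, by assumption)."""
    return all(lo <= props[name] <= hi for name, (lo, hi) in bounds.items())


# Hypothetical, hand-written property values for two molecules.
candidates = {
    "mol_a": {"HBA": 4, "HBD": 1, "LogP": 2.1, "MW": 310.0, "Rot": 5, "TPSA": 75.0},
    "mol_b": {"HBA": 12, "HBD": 6, "LogP": 6.3, "MW": 720.0, "Rot": 14, "TPSA": 230.0},
}
kept = [name for name, props in candidates.items() if passes_property_filter(props)]
```

In practice the property values would be computed from each SMILES string rather than written by hand.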
tdc.chem_utils.oracle.oracle submodule#
- class tdc.chem_utils.oracle.oracle.AbsoluteScoreModifier(target_value: float)[source]#
Bases:
ScoreModifier
Score modifier that has a maximum at a given target value, and decreases linearly with increasing distance from the target value.
- class tdc.chem_utils.oracle.oracle.ChainedModifier(modifiers: List[ScoreModifier])[source]#
Bases:
ScoreModifier
- Calls several modifiers one after the other, for instance:
score = modifier3(modifier2(modifier1(raw_score)))
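The chaining order described above can be sketched with plain callables. The helper `chain` and the two stand-in modifiers `halve` and `cap_at_one` are hypothetical and only illustrate that the first modifier in the list is applied first.

```python
from functools import reduce


def chain(modifiers):
    """Compose callables so that the first one in the list is applied first."""
    def chained(raw_score):
        return reduce(lambda score, modifier: modifier(score), modifiers, raw_score)
    return chained


# Hypothetical stand-in modifiers.
def halve(x):
    return x / 2.0


def cap_at_one(x):
    return min(x, 1.0)


modifier = chain([halve, cap_at_one])
score = modifier(3.0)  # equivalent to cap_at_one(halve(3.0))
```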
- class tdc.chem_utils.oracle.oracle.ClippedScoreModifier(upper_x: float, lower_x=0.0, high_score=1.0, low_score=0.0)[source]#
Bases:
ScoreModifier
Clips a score between specified low and high scores, and does a linear interpolation in between.
This class works as follows: First the input is mapped onto a linear interpolation between both specified points. Then the generated values are clipped between low and high scores.
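The two steps described above (linear interpolation, then clipping) might look like the sketch below. It assumes `low_score <= high_score` and `upper_x != lower_x`; the function name `clipped_score` is hypothetical and the library's class may handle other cases differently.

```python
def clipped_score(x, upper_x, lower_x=0.0, high_score=1.0, low_score=0.0):
    """Interpolate linearly through (lower_x, low_score) and (upper_x, high_score), then clip."""
    slope = (high_score - low_score) / (upper_x - lower_x)
    intercept = high_score - slope * upper_x
    y = slope * x + intercept
    # Clip the interpolated value to the [low_score, high_score] range.
    return min(max(y, low_score), high_score)


inside = clipped_score(5.0, upper_x=10.0)
below = clipped_score(-3.0, upper_x=10.0)
above = clipped_score(20.0, upper_x=10.0)
```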
- class tdc.chem_utils.oracle.oracle.GaussianModifier(mu: float, sigma: float)[source]#
Bases:
ScoreModifier
Score modifier that reproduces a Gaussian bell shape.
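A Gaussian bell shape that peaks at 1.0 for `x == mu` can be sketched as below. The unnormalized form (maximum of 1.0 rather than unit area) is an assumption, and `gaussian_score` is a hypothetical name.

```python
import math


def gaussian_score(x, mu, sigma):
    """Unnormalized Gaussian: 1.0 at x == mu, decaying symmetrically on both sides."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)


peak = gaussian_score(2.0, mu=2.0, sigma=0.5)
one_sigma_right = gaussian_score(2.5, mu=2.0, sigma=0.5)
one_sigma_left = gaussian_score(1.5, mu=2.0, sigma=0.5)
```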
- class tdc.chem_utils.oracle.oracle.Isomer_scoring(target_smiles, means='geometric')[source]#
Bases:
object
- class tdc.chem_utils.oracle.oracle.Isomer_scoring_prev(target_smiles, means='geometric')[source]#
Bases:
object
- class tdc.chem_utils.oracle.oracle.LinearModifier(slope=1.0)[source]#
Bases:
ScoreModifier
Score modifier that multiplies the score by a scalar (default: 1, i.e. do nothing).
- class tdc.chem_utils.oracle.oracle.MinMaxGaussianModifier(mu: float, sigma: float, minimize=False)[source]#
Bases:
ScoreModifier
Score modifier that reproduces a half Gaussian bell shape. For minimize==True, the function is 1.0 for x <= mu and decreases to zero for x > mu. For minimize==False, the function is 1.0 for x >= mu and decreases to zero for x < mu.
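The half-bell behaviour described above can be sketched as follows. The function name `min_max_gaussian` is hypothetical, and the decaying side is assumed to follow an unnormalized Gaussian.

```python
import math


def min_max_gaussian(x, mu, sigma, minimize=False):
    """Return 1.0 on the favourable side of mu and a Gaussian decay on the other side."""
    on_plateau = (x <= mu) if minimize else (x >= mu)
    if on_plateau:
        return 1.0
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)


maximize_above = min_max_gaussian(5.0, mu=3.0, sigma=1.0)
maximize_below = min_max_gaussian(2.0, mu=3.0, sigma=1.0)
minimize_below = min_max_gaussian(2.0, mu=3.0, sigma=1.0, minimize=True)
minimize_above = min_max_gaussian(4.0, mu=3.0, sigma=1.0, minimize=True)
```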
- class tdc.chem_utils.oracle.oracle.PyScreener_meta(receptor_pdb_file, box_center, box_size, software_class='vina', ncpu=4, **kwargs)[source]#
Bases:
object
Evaluate docking score
- tdc.chem_utils.oracle.oracle.SA(s)[source]#
Evaluate SA score of a SMILES string
- Parameters:
smiles – str
- Returns:
float
- Return type:
SAscore
- class tdc.chem_utils.oracle.oracle.ScoreModifier[source]#
Bases:
object
Interface for score modifiers.
- class tdc.chem_utils.oracle.oracle.Score_3d(receptor_pdbqt_file, center, box_size, scorefunction='vina')[source]#
Bases:
object
Evaluate Vina score (force field) for a conformer binding to a receptor
- class tdc.chem_utils.oracle.oracle.SmoothClippedScoreModifier(upper_x: float, lower_x=0.0, high_score=1.0, low_score=0.0)[source]#
Bases:
ScoreModifier
Smooth variant of ClippedScoreModifier.
Implemented as a logistic function that has the same steepness as ClippedScoreModifier in the center of the logistic function.
- class tdc.chem_utils.oracle.oracle.SquaredModifier(target_value: float, coefficient=1.0)[source]#
Bases:
ScoreModifier
Score modifier that has a maximum at a given target value, and decreases quadratically with increasing distance from the target value.
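One function with these properties is sketched below: it equals 1.0 at the target and falls off with the squared distance, scaled by `coefficient`. The exact formula used by the library is an assumption here, and `squared_score` is a hypothetical name.

```python
def squared_score(x, target_value, coefficient=1.0):
    """1.0 at the target, decreasing quadratically with distance from it."""
    return 1.0 - coefficient * (target_value - x) ** 2


at_target = squared_score(3.0, target_value=3.0)
far_away = squared_score(5.0, target_value=3.0)
softened = squared_score(5.0, target_value=3.0, coefficient=0.25)
```

Note that this form is unbounded below, so scores far from the target become negative.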
- class tdc.chem_utils.oracle.oracle.ThresholdedLinearModifier(threshold: float)[source]#
Bases:
ScoreModifier
Returns a value of min(input, threshold)/threshold.
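The formula above translates directly into code. The function name `thresholded_linear` is hypothetical.

```python
def thresholded_linear(x, threshold):
    """Scale x linearly up to the threshold, then saturate at 1.0."""
    return min(x, threshold) / threshold


half = thresholded_linear(2.0, threshold=4.0)
saturated = thresholded_linear(10.0, threshold=4.0)
```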
- class tdc.chem_utils.oracle.oracle.Vina_3d(receptor_pdbqt_file, center, box_size, scorefunction='vina')[source]#
Bases:
object
Perform docking search from a conformer.
- class tdc.chem_utils.oracle.oracle.Vina_smiles(receptor_pdbqt_file, center, box_size, scorefunction='vina')[source]#
Bases:
object
Perform docking search from a conformer.
- tdc.chem_utils.oracle.oracle.askcos(smiles, host_ip, output='plausibility', save_json=False, file_name='tree_builder_result.json', num_trials=5, max_depth=9, max_branching=25, expansion_time=60, max_ppg=100, template_count=1000, max_cum_prob=0.999, chemical_property_logic='none', max_chemprop_c=0, max_chemprop_n=0, max_chemprop_o=0, max_chemprop_h=0, chemical_popularity_logic='none', min_chempop_reactants=5, min_chempop_products=5, filter_threshold=0.1, return_first='true')[source]#
The ASKCOS retrosynthetic analysis oracle function. Please refer to https://github.com/connorcoley/ASKCOS to run ASKCOS with Docker on a server that receives requests.
- tdc.chem_utils.oracle.oracle.canonicalize(smiles: str, include_stereocenters=True)[source]#
Canonicalize the SMILES strings with RDKit.
The algorithm is detailed under https://pubs.acs.org/doi/full/10.1021/acs.jcim.5b00543
- Parameters:
smiles – SMILES string to canonicalize
include_stereocenters – whether to keep the stereochemical information in the canonical SMILES string
- Returns:
Canonicalized SMILES string, None if the molecule is invalid.
- tdc.chem_utils.oracle.oracle.drd2(smile)[source]#
Evaluate DRD2 score of a SMILES string
- Parameters:
smiles – str
- Returns:
float
- Return type:
drd_score
- tdc.chem_utils.oracle.oracle.gsk3b(smiles)[source]#
Evaluate GSK3B score of a SMILES string
- Parameters:
smiles – str
- Returns:
float, between 0 and 1.
- Return type:
gsk3_score
- tdc.chem_utils.oracle.oracle.ibm_rxn(smiles, api_key, output='confidence', sleep_time=30)[source]#
This function is modified from Dr. Jan Jensen’s code
- class tdc.chem_utils.oracle.oracle.jnk3[source]#
Bases:
object
Evaluate JNK3 score of a SMILES string
- Parameters:
smiles – str
- Returns:
float, between 0 and 1.
- Return type:
jnk3_score
- tdc.chem_utils.oracle.oracle.load_pickled_model(name: str)[source]#
Loading a pretrained model serialized with pickle. Usually for sklearn models.
- Parameters:
name – Name of the model to load.
- Returns:
The model.
- class tdc.chem_utils.oracle.oracle.median_meta(target_smiles_1, target_smiles_2, fp1='ECFP6', fp2='ECFP6', modifier_func1=None, modifier_func2=None, means='geometric')[source]#
Bases:
object
- tdc.chem_utils.oracle.oracle.parse_molecular_formula(formula)[source]#
Parse a molecular formula to get the element types and counts.
- Parameters:
formula – molecular formula, e.g. “C8H3F3Br”
- Returns:
A list of tuples containing element types and number of occurrences.
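A regular expression is one way to do this kind of parsing. The sketch below is an assumption about the approach, not the library's code; `parse_formula` is a hypothetical name, and an element without an explicit count is taken to occur once.

```python
import re


def parse_formula(formula):
    """Split a formula such as 'C8H3F3Br' into (element, count) tuples."""
    # An element is an uppercase letter followed by optional lowercase letters;
    # the count is an optional run of digits.
    matches = re.findall(r"([A-Z][a-z]*)(\d*)", formula)
    return [(element, int(count) if count else 1) for element, count in matches]


parsed = parse_formula("C8H3F3Br")
```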
- tdc.chem_utils.oracle.oracle.penalized_logp(s)[source]#
Evaluate the penalized LogP score of a SMILES string
- Parameters:
smiles – str
- Returns:
float, between -infinity and +infinity
- Return type:
logp_score
- tdc.chem_utils.oracle.oracle.qed(smiles)[source]#
Evaluate QED score of a SMILES string
- Parameters:
smiles – str
- Returns:
float, between 0 and 1.
- Return type:
qed_score
- class tdc.chem_utils.oracle.oracle.rediscovery_meta(target_smiles, fp='ECFP4')[source]#
Bases:
object
- tdc.chem_utils.oracle.oracle.similarity(smiles_a, smiles_b)[source]#
Evaluate Tanimoto similarity between 2 SMILES strings
- Parameters:
smiles_a – str, SMILES string
smiles_b – str, SMILES string
- Returns:
float, between 0 and 1.
- Return type:
similarity score
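Tanimoto similarity is the size of the intersection of two fingerprints divided by the size of their union. The sketch below represents each fingerprint as a set of "on" bit indices, which is a simplification; the fingerprint type used by `similarity` is not specified here, and `tanimoto` is a hypothetical name.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen for this sketch when both fingerprints are empty
    return len(a & b) / union


partial = tanimoto({1, 2, 3}, {2, 3, 4})
identical = tanimoto({1, 2, 3}, {1, 2, 3})
disjoint = tanimoto({1, 2}, {7, 8})
```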
- class tdc.chem_utils.oracle.oracle.similarity_meta(target_smiles, fp='FCFP4', modifier_func=None)[source]#
Bases:
object
- tdc.chem_utils.oracle.oracle.smiles_2_fingerprint_AP(smiles)[source]#
Convert smiles into Atom Pair Fingerprint.
- Parameters:
smiles – str, SMILES string.
- Returns:
rdkit.DataStructs.cDataStructs.IntSparseIntVect
- Return type:
fp
- tdc.chem_utils.oracle.oracle.smiles_2_fingerprint_ECFP4(smiles)[source]#
Convert smiles into ECFP4 Morgan Fingerprint.
- Parameters:
smiles – str, SMILES string.
- Returns:
rdkit.DataStructs.cDataStructs.UIntSparseIntVect
- Return type:
fp
- tdc.chem_utils.oracle.oracle.smiles_2_fingerprint_ECFP6(smiles)[source]#
Convert smiles into ECFP6 Fingerprint.
- Parameters:
smiles – str, SMILES string.
- Returns:
rdkit.DataStructs.cDataStructs.UIntSparseIntVect
- Return type:
fp
- tdc.chem_utils.oracle.oracle.smiles_2_fingerprint_FCFP4(smiles)[source]#
Convert smiles into FCFP4 Morgan Fingerprint.
- Parameters:
smiles – str, SMILES string.
- Returns:
rdkit.DataStructs.cDataStructs.UIntSparseIntVect
- Return type:
fp
- tdc.chem_utils.oracle.oracle.smiles_to_rdkit_mol(smiles)[source]#
Convert smiles into rdkit’s mol (molecule) format.
- Parameters:
smiles – str, SMILES string.
- Returns:
rdkit.Chem.rdchem.Mol
- Return type:
mol
- tdc.chem_utils.oracle.oracle.smina(ligand, protein, score_only=False, raw_input=False)[source]#
Smina is a docking algorithm that docks a ligand to a protein pocket.
Koes, D.R., Baumgartner, M.P. and Camacho, C.J., 2013. Lessons learned in empirical scoring with smina from the CSAR 2011 benchmarking exercise. Journal of chemical information and modeling, 53(8), pp.1893-1904.
- Parameters:
ligand (array) – (N_1,3) matrix, where N_1 is ligand size.
protein (array) – (N_2,3) matrix, where N_2 is protein size.
score_only (boolean) – whether to only return docking score.
raw_input (boolean) – whether to input raw ML input or sdf file input
- Returns:
docking_info – docking result
- Return type:
- tdc.chem_utils.oracle.oracle.tree_analysis(current)[source]#
Analyze the result of the tree builder.
Calculate:
1. Number of steps
2. Pi plausibility
3. Whether a path is found
In case of a celery error, all values are -1.
- Returns:
num_path: number of paths found
status: same as implemented in ASKCOS
num_step: number of steps
p_score: Pi plausibility
synthesizability: binary code
price: price to synthesize the query compound
tdc.chem_utils.evaluator module#
- tdc.chem_utils.evaluator.calculate_internal_pairwise_similarities(smiles_list)[source]#
Computes the pairwise similarities of the provided list of smiles against itself.
- Parameters:
smiles_list – list of str
- Returns:
Symmetric matrix of pairwise similarities. Diagonal is set to zero.
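The shape of the result (symmetric, zero diagonal) can be sketched as follows. Fingerprints are represented as sets of on-bit indices and compared with Tanimoto similarity; both choices, and the names `tanimoto` and `pairwise_similarities`, are assumptions made for illustration.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between two sets of on-bit indices."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 0.0


def pairwise_similarities(fingerprints):
    """Symmetric similarity matrix with the diagonal left at zero."""
    n = len(fingerprints)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim = tanimoto(fingerprints[i], fingerprints[j])
            matrix[i][j] = sim
            matrix[j][i] = sim
    return matrix


# Hypothetical fingerprints for three molecules.
matrix = pairwise_similarities([{1, 2, 3}, {2, 3, 4}, {7, 8}])
```

Leaving the diagonal at zero keeps each molecule's trivial self-similarity of 1.0 out of later statistics such as a row-wise maximum.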
- tdc.chem_utils.evaluator.calculate_pc_descriptors(smiles, pc_descriptors)[source]#
Calculate physicochemical descriptors of a list of molecules.
- Parameters:
list_of_smiles – list of SMILES strings
pc_descriptors – list of strings, names of descriptors to calculate
- Returns:
list of float
- Return type:
descriptors
- tdc.chem_utils.evaluator.canonicalize(smiles)[source]#
Convert SMILES into canonical form.
- Parameters:
smiles – str, SMILES string
- Returns:
str, canonical SMILES string.
- Return type:
smiles
- tdc.chem_utils.evaluator.continuous_kldiv(X_baseline: array, X_sampled: array) float [source]#
Calculate KL divergence for two numpy arrays, continuous version.
- Parameters:
X_baseline – numpy array
X_sampled – numpy array
- Returns:
float
- Return type:
KL divergence
- tdc.chem_utils.evaluator.discrete_kldiv(X_baseline: array, X_sampled: array) float [source]#
calculate KL divergence for two numpy arrays, discrete version.
- Parameters:
X_baseline – numpy array
X_sampled – numpy array
- Returns:
float
- Return type:
KL divergence
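A discrete KL divergence between two samples can be estimated by binning both into a shared histogram. The sketch below is one such estimator under stated assumptions: 10 equal-width bins, a small `eps` added to avoid division by zero, and the hypothetical name `discrete_kl`. The library's binning and smoothing may differ.

```python
import math


def discrete_kl(baseline, sampled, n_bins=10, eps=1e-10):
    """Estimate KL(P_baseline || P_sampled) from histograms over a shared range."""
    lo = min(min(baseline), min(sampled))
    hi = max(max(baseline), max(sampled))
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal

    def histogram(values):
        counts = [0] * n_bins
        for v in values:
            idx = min(int((v - lo) / width), n_bins - 1)
            counts[idx] += 1
        smoothed = [c / len(values) + eps for c in counts]
        total = sum(smoothed)
        return [s / total for s in smoothed]  # renormalize after smoothing

    p = histogram(baseline)
    q = histogram(sampled)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


same = discrete_kl([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])
shifted = discrete_kl([1.0, 2.0, 3.0, 4.0], [4.0, 4.0, 4.0, 4.0])
```

The divergence is zero for identical samples and grows as the sampled distribution places little mass where the baseline has a lot.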
- tdc.chem_utils.evaluator.diversity(list_of_smiles)[source]#
Evaluate the internal diversity of a set of molecules. The internal diversity is defined as the average pairwise Tanimoto distance between the Morgan fingerprints.
- Parameters:
list_of_smiles – list of SMILES strings
- Returns:
float
- Return type:
div
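The definition above (average pairwise Tanimoto distance) can be sketched as below. Fingerprints are represented as sets of on-bit indices instead of real Morgan fingerprints, and the names `tanimoto_distance` and `internal_diversity` are hypothetical.

```python
from itertools import combinations


def tanimoto_distance(bits_a, bits_b):
    """1 minus the Tanimoto similarity of two sets of on-bit indices."""
    union = len(bits_a | bits_b)
    similarity = len(bits_a & bits_b) / union if union else 0.0
    return 1.0 - similarity


def internal_diversity(fingerprints):
    """Average Tanimoto distance over all unordered pairs of fingerprints."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0  # convention chosen for this sketch when fewer than two molecules are given
    return sum(tanimoto_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical fingerprints for three molecules.
div = internal_diversity([{1, 2, 3}, {2, 3, 4}, {7, 8}])
```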
- tdc.chem_utils.evaluator.fcd_distance(generated_smiles_lst, training_smiles_lst)[source]#
Evaluate FCD distance between generated smiles set and training smiles set.
- Parameters:
generated_smiles_lst – list (of SMILES string), which are generated.
training_smiles_lst – list (of SMILES string), which are used for training.
- Returns:
float
- Return type:
fcd_distance
- tdc.chem_utils.evaluator.fcd_distance_tf(generated_smiles_lst, training_smiles_lst)[source]#
Evaluate FCD distance between generated smiles set and training smiles set using tensorflow.
- Parameters:
generated_smiles_lst – list (of SMILES string), which are generated.
training_smiles_lst – list (of SMILES string), which are used for training.
- Returns:
float
- Return type:
fcd_distance
- tdc.chem_utils.evaluator.fcd_distance_torch(generated_smiles_lst, training_smiles_lst)[source]#
Evaluate FCD distance between generated smiles set and training smiles set using PyTorch.
- Parameters:
generated_smiles_lst – list (of SMILES string), which are generated.
training_smiles_lst – list (of SMILES string), which are used for training.
- Returns:
float
- Return type:
fcd_distance
- tdc.chem_utils.evaluator.get_fingerprints(mols, radius=2, length=4096)[source]#
Converts molecules to ECFP bitvectors.
- Parameters:
mols – RDKit molecules
radius – ECFP fingerprint radius
length – number of bits
- Returns:
a list of fingerprints
- tdc.chem_utils.evaluator.get_mols(smiles_list)[source]#
Convert SMILES strings to RDKit RDMol objects.
- Parameters:
list_of_smiles – list of SMILES strings
- Returns:
list of RDKit RDMol objects
- Return type:
mols
- tdc.chem_utils.evaluator.kl_divergence(generated_smiles_lst, training_smiles_lst)[source]#
Evaluate the KL divergence of a set of generated SMILES, using a list of training SMILES as reference. KL divergence is defined as the averaged KL divergence of a set of physicochemical descriptors between a set of generated molecules and a set of training molecules.
- Parameters:
generated_smiles_lst – list (of SMILES string), which are generated.
training_smiles_lst – list (of SMILES string), which are used for training.
- Returns:
float
- Return type:
KL divergence
- tdc.chem_utils.evaluator.novelty(generated_smiles_lst, training_smiles_lst)[source]#
Evaluate the novelty of a set of generated SMILES, using a list of training SMILES as reference. Novelty is defined as the fraction of generated molecules that do not appear in the training set.
- Parameters:
generated_smiles_lst – list (of SMILES string), which are generated.
training_smiles_lst – list (of SMILES string), which are used for training.
- Returns:
float
- Return type:
novelty
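The fraction defined above can be sketched with set operations. This sketch assumes the SMILES strings are already canonical, so equal molecules compare equal as strings, and that duplicate generated molecules are counted once. Both are assumptions, and `novelty_score` is a hypothetical name.

```python
def novelty_score(generated, training):
    """Fraction of unique generated SMILES that do not occur in the training set."""
    unique_generated = set(generated)
    if not unique_generated:
        return 0.0  # convention chosen for this sketch when nothing was generated
    novel = unique_generated - set(training)
    return len(novel) / len(unique_generated)


score = novelty_score(
    generated=["CCO", "CCN", "c1ccccc1", "CCO"],
    training=["CCO", "CCC"],
)
```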
- tdc.chem_utils.evaluator.single_molecule_validity(smiles)[source]#
Evaluate the chemical validity of a single molecule given as a SMILES string
- Parameters:
smiles – str, SMILES string.
- Returns:
whether the SMILES string represents a valid molecule
- Return type:
Boolean