Data Science Asked by Chiffa on June 17, 2021
I’m trying to build a model that predicts the Caco-2 coefficient of a molecule from its SMILES string representation.
My solution is based on this example.
Since I need to predict a real value, I use a RandomForestRegressor.
With some molecules added to the code manually, everything works (although the predictions themselves are wildly wrong):
from rdkit import Chem, DataStructs #all the nice chemical stuff, ConvertToNumpyArray
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor #our regressor
from sklearn.model_selection import train_test_split
import numpy as np
# generate molecules
m1 = Chem.MolFromSmiles('Cc1ccc(NNC(=O)c2ccc(CN3C(=O)CCC3=O)cc2)cc1Cl')
m2 = Chem.MolFromSmiles('Nc1ccc(C(=O)N2CCN(c3cc[nH+]cc3)CC2)cc1[N+](=O)[O-]')
m3 = Chem.MolFromSmiles('CN(Cc1[nH+]ccn1C)C(=O)CCc1ccsc1')
m4 = Chem.MolFromSmiles('COc1ccc([N+](=O)[O-])cc1C(=O)NCCC[NH+]1CCCC1')
m5 = Chem.MolFromSmiles('C[NH+]1CCN(S(=O)(=O)c2ccc(NC(=O)Cc3ccc([N+](=O)[O-])cc3)cc2)CC1')
m6 = Chem.MolFromSmiles('CCc1ccc(S(=O)(=O)Nc2ccc(NC(C)=O)cc2)cc1')
m7 = Chem.MolFromSmiles('O=C(COC(=O)c1ccc(S(=O)(=O)N2CCCCC2)cc1)c1ccc(F)cc1')
m8 = Chem.MolFromSmiles('COC(=O)c1ccc(S(=O)(=O)NCc2csc3ccc(Cl)cc23)n1C')
m9 = Chem.MolFromSmiles('CCC(C)N1C(=O)C(=CNc2ccccc2C(=O)[O-])C(=O)NC1=S')
m10 = Chem.MolFromSmiles('Cn1c(CNC(=O)C(=O)Nc2cccc(Cl)c2Cl)nc2ccccc21')
mols = [m1, m2, m3, m4, m5, m6, m7, m8, m9, m10]
# generate fingerprints: Morgan fingerprint with radius 2
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2) for m in mols]
# convert the RDKit explicit vectors into numpy arrays
np_fps = []
for fp in fps:
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    np_fps.append(arr)
# get a random forest regressor with 100 trees
rndf_rgsr = RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1, warm_start=False)
#train the random forest
#ys are the caco-2 coefficients we wish to predict
ys_fit = [379.724, 101.644, 3154.167, 97.437, 21.152, 569.981, 150.55, 690.843, 78.866, 984.371]
rndf_rgsr.fit(np_fps, ys_fit)
#use the random forest to predict a new molecule
m_new = Chem.MolFromSmiles('Cc1n[nH]c(Cc2ccc(-n3cnnc3)cc2)n1') #actual caco2 is 410.037
fp = np.zeros((1,))
DataStructs.ConvertToNumpyArray(AllChem.GetMorganFingerprintAsBitVect(m_new, 2), fp)
print(rndf_rgsr.predict((fp,)))
But when I try to work with many molecules imported from a file, whose lines look like Cc1ccc(NNC(=O)c2ccc(CN3C(=O)CCC3=O)cc2)cc1Cl,379.724, using the following code:
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor #our regressors
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
from pandas import DataFrame, read_csv
#import our data from file
df = pd.read_csv('test_db.csv', delimiter=',' ) #a pandas DataFrame
#get the values of variables and targets
X = df["smiles"].values
y = df["Caco2"].values
#split our data set into two parts
x_train, x_eval, y_train, y_eval = train_test_split(X, y, test_size = 0.2, random_state = 42)
#convert our smiles string into actual molecular graphs
mols_ready_train = [Chem.MolFromSmiles(x_train[i]) for i in range(len(x_train))]
mols_ready_eval = [Chem.MolFromSmiles(x_eval[i]) for i in range(len(x_eval))]
# generate fingerprints: Morgan fingerprint with radius 2
fing_prints_train = [AllChem.GetMorganFingerprintAsBitVect(m, 2) for m in mols_ready_train]
fing_prints_eval = [AllChem.GetMorganFingerprintAsBitVect(m, 2) for m in mols_ready_eval]
# convert the RDKit explicit vectors into numpy arrays
np_fps_train = []
for fp in fing_prints_train:
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    np_fps_train.append(arr)
np_fps_eval = []
for fp in fing_prints_eval:
    arr = np.zeros((1,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    np_fps_eval.append(arr)
# get a random forest regressor with 1000 trees
rndf_rgsr = RandomForestRegressor(n_estimators=1000, random_state=42, n_jobs=-1, warm_start=False)
#train our random forest regressor
rndf_rgsr.fit(np_fps_train, y_train)
# use the random forest to predict a new molecule
m_new = Chem.MolFromSmiles('Cc1n[nH]c(Cc2ccc(-n3cnnc3)cc2)n1')
fp = np.zeros((1,))  # numpy is imported as np above
DataStructs.ConvertToNumpyArray(AllChem.GetMorganFingerprintAsBitVect(m_new, 2), fp)
print(rndf_rgsr.predict((fp,)))
it crashes with the following error:
File "/home/me/predictor.py", line 55, in <module>
    rndf_rgsr.fit(np_fps_train, y_train)
File "/usr/local/lib/python2.7/dist-packages/sklearn/ensemble/forest.py", line 248, in fit
    y = check_array(y, accept_sparse='csc', ensure_2d=False, dtype=None)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 407, in check_array
    _assert_all_finite(array)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 58, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
I’ve checked that no vectors I use contain NaNs or infs. The fingerprints used here are 2048 bits long, but I doubt they’re the source of the problem.
Something is going wrong with validation, but I can’t really see what.
Could you provide any hints?
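The traceback above fails inside `y = check_array(y, ...)`, which suggests that the array being rejected is the target vector rather than the fingerprints. A minimal sketch of a check that covers both arrays follows; `report_non_finite` is a hypothetical helper, and the arrays are made-up stand-ins for np_fps_train and y_train:

```python
import numpy as np

def report_non_finite(name, values):
    """Print a summary and return the indices of samples containing NaN or inf."""
    arr = np.asarray(values, dtype=float)
    # Collapse every axis except the first so each sample gets a single flag.
    finite_per_sample = np.isfinite(arr).reshape(arr.shape[0], -1).all(axis=1)
    bad_idx = np.flatnonzero(~finite_per_sample)
    print("{}: {} of {} samples contain NaN or inf".format(name, bad_idx.size, arr.shape[0]))
    return bad_idx

# Toy stand-ins (made-up values) for np_fps_train and y_train.
features = [np.array([0.0, 1.0, 1.0]), np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])]
targets = np.array([10.0, np.nan, 20.0])

bad_x = report_non_finite("features", features)
bad_y = report_non_finite("targets", targets)
```

Checking the targets separately from the features makes it clear which of the two arguments passed to fit() is at fault.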
Ok, I've been wrong in my assumptions.
A simple:
for i in range(len(X)):
    if np.isnan(y[i]):
        print("Here it is: ", i, X[i], y[i])
has revealed about 200 "bad" lines in my dataset. Cleaning those up solved that particular problem.
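The same cleanup can also be done when the file is loaded, before any fingerprints are computed. This is only a minimal sketch, assuming the file has the `smiles` and `Caco2` columns used in the question; the inline CSV and its values are made up for illustration and stand in for test_db.csv:

```python
import io
import numpy as np
import pandas as pd

# Toy CSV standing in for test_db.csv (illustrative rows, not real measurements).
csv_text = """smiles,Caco2
CCO,10.0
CCN,
CCC,not_measured
CCCl,20.0
"""
df = pd.read_csv(io.StringIO(csv_text))

# Coerce the target column to numbers; anything unparsable becomes NaN.
df["Caco2"] = pd.to_numeric(df["Caco2"], errors="coerce")

# Keep only the rows whose target is a finite number.
finite = np.isfinite(df["Caco2"])
print("dropping {} bad rows".format(int((~finite).sum())))
clean = df[finite].reset_index(drop=True)

X = clean["smiles"].values
y = clean["Caco2"].values
```

Filtering the DataFrame rather than the arrays keeps each SMILES string aligned with its target, so the train/test split that follows works on consistent rows.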
Answered by Chiffa on June 17, 2021