Data Science Asked on August 4, 2021
I am trying to port this little piece of R code to python:
rf <- randomForest(features, proximity = T, oob.prox = T, ntree = 2000)
dists <- as.dist(1 - rf$proximity)
with parameters
oob.prox: Should proximity be calculated only on “out-of-bag” data?
proximity: if proximity=TRUE when randomForest is called, a matrix of proximity measures among the input (based on the frequency that pairs of data points are in the same terminal nodes).
I am currently trying to use sklearn.ensemble.RandomTreesEmbedding for this task; however, it has no functionality for the proximity matrix. I did find the following developer comment, though:
We don’t implement proximity matrix in Scikit-Learn (yet).
However, this could be done by relying on the apply function provided
in our implementation of decision trees. That is, for all pairs of
samples in your dataset, iterate over the decision trees in the forest
(through forest.estimators_) and count the number of times they fall
in the same leaf, i.e., the number of times apply give the same node
id for both samples in the pair.
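For reference, the procedure described in that comment can be sketched roughly as follows. This is only a minimal illustration, not scikit-learn functionality: the helper name proximity_matrix and the synthetic data are made up for the example, and the result is computed on all samples, so it does not reproduce R's oob.prox = T behaviour.

```python
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding


def proximity_matrix(forest, X):
    """Fraction of trees in which each pair of samples lands in the same leaf.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    # forest.apply returns the leaf index of every sample in every tree,
    # as an array of shape (n_samples, n_trees)
    leaves = forest.apply(X)
    n_samples, n_trees = leaves.shape
    prox = np.zeros((n_samples, n_samples))
    for t in range(n_trees):
        col = leaves[:, t]
        # boolean (n_samples, n_samples) matrix: True where leaf ids match
        prox += col[:, None] == col[None, :]
    # normalize the counts by the number of trees so values lie in [0, 1]
    return prox / n_trees


# synthetic data, used only to make the sketch runnable
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
forest = RandomTreesEmbedding(n_estimators=100, max_depth=5, random_state=0).fit(X)
prox = proximity_matrix(forest, X)
```

Calling apply once on the whole dataset avoids re-running the forest for every pair of samples, which is what makes a pairwise callback slow.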
And so I tried, using scipy’s pdist() function along with my custom distance (or in this case, proximity) measure. I still have several problems:
as.dist(1 - rf$proximity): I think I need to normalize my count matrix, subtract it from 1, and then compute the Euclidean distances between its rows!?
My code as of now looks like this:
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn import ensemble

# grow a random forest from points
# (xdata is my data array of shape (n_samples, n_features))
rf = ensemble.RandomTreesEmbedding(n_estimators=200,
                                   random_state=0,
                                   max_depth=5)
rfdata = rf.fit_transform(xdata)

# define an affinity measure function to use with scipy's pdist
def treeprox(u, v):
    # apply() expects 2-D input, so single samples need reshaping
    u = u.reshape(1, -1)
    v = v.reshape(1, -1)
    a = rf.apply(u)
    b = rf.apply(v)
    # count the number of trees in which both samples fall in the same leaf
    # (the comparison is element-wise over the trees)
    return np.sum(a == b)

# pairwise same-leaf counts; note these are proximities, not yet distances
distm = pdist(xdata, treeprox)
distm = squareform(distm)
There must be a better way, I guess, since this functionality is readily provided by the R package randomForest.
Any suggestions?
tia
I have written some code for this. It can be found here. In answer to your specific questions:
Answered by Keith on August 4, 2021
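On the as.dist(1 - rf$proximity) step from the question: in R, as.dist only reinterprets an existing square matrix as a dist object and does not compute Euclidean distances between rows. A rough Python analogue, assuming a proximity matrix that is already normalized to [0, 1], might look like the sketch below. The small prox matrix here is hypothetical and only stands in for a real forest proximity matrix.

```python
import numpy as np
from scipy.spatial.distance import squareform

# hypothetical normalized proximity matrix for three samples
prox = np.array([[1.0, 0.8, 0.1],
                 [0.8, 1.0, 0.3],
                 [0.1, 0.3, 1.0]])

# dissimilarity: 1 - proximity, which gives zeros on the diagonal
dist_square = 1.0 - prox

# condensed upper-triangle vector, the layout scipy's clustering routines
# expect; checks=False skips the symmetry and zero-diagonal validation
dist_condensed = squareform(dist_square, checks=False)
```

The condensed vector lists the pairs in the order (0, 1), (0, 2), (1, 2), which is the same layout pdist returns.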