
Output value of a gradient boosting decision tree node that has just a single example in it

Asked on Data Science, September 5, 2021

The general gradient boosting algorithm for tree-based classifiers is as follows:

Input: training set $\{(x_i, y_i)\}_{i=1}^{n}$, a differentiable loss function $L(y, F(x))$, number of iterations $M$.

Algorithm:

  1. Initialize the model with a constant value:
    $F_0(x) = \underset{\gamma}{\arg\min} \sum_{i=1}^{n} L(y_i, \gamma)$
  2. For m = 1 to M:
  • (a) Compute the so-called pseudo-residuals:
    $r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F(x)=F_{m-1}(x)} \quad \text{for } i = 1, \ldots, n.$
  • (b) Fit a base learner (or weak learner, e.g. a tree) $h_m(x)$ to the pseudo-residuals, i.e. train it on the training set $\{(x_i, r_{im})\}_{i=1}^{n}$.
  • (c) Compute the multipliers and solve for the output value of each leaf ($\gamma$ here is the output value of a leaf): $F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm} \mathbf{1}_{R_{jm}}(x), \quad \gamma_{jm} = \underset{\gamma}{\arg\min} \sum_{x_i \in R_{jm}} L\left(y_i, F_{m-1}(x_i) + \gamma\right).$
  3. Output $F_M(x)$.
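
To make step 2 concrete for myself, here is a minimal sketch of that loop for binary log-loss (my own code, not from the video). It assumes scikit-learn's `DecisionTreeRegressor` as the weak learner $h_m$, and the leaf values use the second-order update derived below; the `learning_rate` shrinkage factor is not in the quoted pseudocode, I only added it because most implementations have one.

```python
# Minimal sketch of the boosting loop above for binary log-loss.
# Assumes X, y are NumPy arrays with y in {0, 1}, and uses scikit-learn's
# DecisionTreeRegressor as the weak learner h_m. Leaf outputs gamma_jm use
# the Newton-style update derived below:
#   gamma = sum(residuals in leaf) / sum(p * (1 - p) in leaf)
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm_logloss(X, y, M=10, learning_rate=1.0, max_depth=2):
    # Step 1: constant initial log-odds that minimizes the log-loss.
    p = np.clip(y.mean(), 1e-12, 1 - 1e-12)
    F = np.full(len(y), np.log(p / (1 - p)))
    trees, leaf_values = [], []
    for m in range(M):                                   # Step 2: for m = 1..M
        prob = 1.0 / (1.0 + np.exp(-F))
        residuals = y - prob                             # (a) pseudo-residuals for log-loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)  # (b)
        leaves = tree.apply(X)                           # which region R_jm each sample falls in
        gamma = {}                                       # (c) one output value per leaf
        for j in np.unique(leaves):
            idx = leaves == j
            gamma[j] = residuals[idx].sum() / max((prob[idx] * (1 - prob[idx])).sum(), 1e-12)
        F = F + learning_rate * np.array([gamma[j] for j in leaves])  # update F_m(x_i)
        trees.append(tree)
        leaf_values.append(gamma)
    return trees, leaf_values                            # Step 3: F_M is defined by these
```

With a single sample in a terminal node, the leaf value reduces to $\text{residual}/(p(1-p))$, which is exactly the expression derived next.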

Now, the loss function that they use is the log-loss: $\sum_i -\left(y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right)$, where $i$ runs over all the samples in the training set. The output of a tree leaf in the algorithm is taken to be the log-odds of the sample belonging to the positive class.
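
Just to fix notation for what follows: since the model output $F(x_i)$ is a log-odds, the predicted probability that enters the log-loss is its sigmoid,

$$p_i = \frac{1}{1 + e^{-F(x_i)}}, \qquad F(x_i) = \log\frac{p_i}{1 - p_i}.$$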

In practice, the algorithm is executed as follows: $\gamma_{jm}$ is computed by expanding the loss function in a Taylor series of degree 2 and then differentiating that with respect to $\gamma$.

I will write out the formulation for the case where there is just one sample ($x_1$) in a terminal node.

$$L(y_1, F_{m-1}(x_1) + \gamma) \approx L(y_1, F_{m-1}(x_1)) + \frac{\partial L(y_1, F_{m-1}(x_1))}{\partial F_{m-1}}\gamma + \frac{1}{2}\frac{\partial^2 L(y_1, F_{m-1}(x_1))}{\partial F_{m-1}^2}\gamma^2$$

Now, this expression is differentiated with respect to $\gamma$ and set equal to $0$ to get the optimal value of $\gamma$:

$$\gamma = -\frac{\frac{\partial L(y_1, F_{m-1}(x_1))}{\partial F_{m-1}}}{\frac{\partial^2 L(y_1, F_{m-1}(x_1))}{\partial F_{m-1}^2}}$$

This works out to $$\frac{\text{Residual}_1}{P_0(1 - P_0)} \tag{1}$$ where $P_0$ is the most recent predicted probability.
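
For completeness, here is how (1) comes out when the derivatives of the log-loss are taken with respect to the log-odds (writing $p = \frac{1}{1 + e^{-F_{m-1}(x_1)}} = P_0$):

$$\frac{\partial L(y_1, F_{m-1}(x_1))}{\partial F_{m-1}} = p - y_1, \qquad \frac{\partial^2 L(y_1, F_{m-1}(x_1))}{\partial F_{m-1}^2} = p(1 - p),$$

so that

$$\gamma = -\frac{p - y_1}{p(1 - p)} = \frac{y_1 - P_0}{P_0(1 - P_0)} = \frac{\text{Residual}_1}{P_0(1 - P_0)}.$$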

The example below is taken from this YouTube video, linked to the exact time 16:43.

For an example case where $F_0(x_1) = 0.69$ (this is a log-odds value):

Converting the log-odds to a probability gives $P_0 = 0.67$, and $\text{Residual}_1 = y_1 - P_0 = 0.33$.

Now they compute $\gamma$ (the output value) for the leaf containing $x_1$ by plugging these values into equation (1), and get $$\gamma = \frac{0.33}{(0.67)(0.33)} = 1.49$$
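
As a quick numeric sanity check of that step (my own sketch, not from the video):

```python
import numpy as np

F0 = 0.69                                 # current log-odds F_0(x_1)
p0 = 1.0 / (1.0 + np.exp(-F0))            # ~0.666, i.e. the P_0 = 0.67 above
residual = 1.0 - p0                       # y_1 = 1, so Residual_1 ~ 0.334
gamma = residual / (p0 * (1.0 - p0))      # equation (1)
print(round(p0, 2), round(residual, 2), round(gamma, 2))
# 0.67 0.33 1.5 -- the video's 1.49 comes from plugging in the rounded 0.67 and 0.33
```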

Questions:

  1. How can we approximate the loss function $L$ by a degree-2 Taylor series for each $\gamma$? I think we would need to know the radius of convergence around the expansion point, and the $\gamma$ computed at the end should then lie within that radius of convergence. Is that the case?
  2. The log-loss (binary cross-entropy) is monotonic in the predicted probability: it increases with $P$ if $y = 1$ and decreases if $y = 0$, and it is therefore also monotonic in the log-odds, since log-odds and probability are monotonic in each other. In that light, why does $\gamma$ not push the predicted probability as close to $1$ as numerical stability permits when $y = 1$, and as close to $0$ when $y = 0$? (See the sketch below.)
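
Here is the sketch I mean in question 2: it compares the true one-sample log-loss $L(1, F_0(x_1) + \gamma)$ with its degree-2 Taylor approximation around $\gamma = 0$ (my own illustration, using $y_1 = 1$ and $F_0(x_1) = 0.69$ as above). The true loss keeps decreasing as $\gamma$ grows, while the quadratic has its minimum near $\gamma \approx 1.5$.

```python
import numpy as np

def logloss(F, y=1.0):
    # binary cross-entropy as a function of the log-odds F
    p = 1.0 / (1.0 + np.exp(-F))
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

F0 = 0.69
p0 = 1.0 / (1.0 + np.exp(-F0))
g, h = p0 - 1.0, p0 * (1.0 - p0)          # 1st/2nd derivative at gamma = 0 (y_1 = 1)

for gamma in np.linspace(-1.0, 5.0, 7):
    true = logloss(F0 + gamma)
    quad = logloss(F0) + g * gamma + 0.5 * h * gamma**2
    print(f"gamma={gamma:+.1f}  true={true:.3f}  quadratic={quad:.3f}")
```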
