
strings as features in decision tree/random forest

Data Science Asked on February 17, 2021

I am working on some problems applying decision trees/random forests. I am trying to fit a problem that has both numbers and strings (such as country names) as features. The library I am using, scikit-learn, accepts only numbers as inputs, but I want to include the strings as well since they carry a significant amount of knowledge.

How do I handle such a scenario?

I can convert a string to numbers by some mechanism such as hashing in Python. But I would like to know the best practice on how strings are handled in decision tree problems.

6 Answers

In most well-established machine learning systems, categorical variables are handled naturally. For example, in R you would use factors, and in WEKA you would use nominal variables. This is not the case in scikit-learn. The decision trees implemented in scikit-learn use only numerical features, and these features are always interpreted as continuous numeric variables.

Thus, simply replacing the strings with a hash code should be avoided: because the result is treated as a continuous numerical feature, any coding you use will induce an order that simply does not exist in your data.

For example, coding ['red','green','blue'] as [1,2,3] produces weird artifacts: 'red' is lower than 'blue', and averaging a 'red' and a 'blue' gives you a 'green'. A more subtle case arises when you code ['low', 'medium', 'high'] as [1,2,3]. Here the ordering may actually make sense, but subtle inconsistencies can still appear when 'medium' is not exactly in the middle of 'low' and 'high'.

Finally, the answer to your question lies in coding the categorical feature into multiple binary features. For example, you might code ['red','green','blue'] with 3 columns, one for each category, having 1 when the category matches and 0 otherwise. This is called one-hot encoding, binary encoding, one-of-k encoding or whatever. You can check the scikit-learn documentation on encoding categorical features and on feature extraction (hashing and dicts). Obviously one-hot encoding will expand your space requirements, and sometimes it hurts performance as well.
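A minimal sketch of that encoding with scikit-learn's OneHotEncoder (the color data below is made up purely for illustration):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical column of colors (one column, four rows).
colors = np.array([['red'], ['green'], ['blue'], ['green']])

encoder = OneHotEncoder()
encoded = encoder.fit_transform(colors).toarray()  # one 0/1 column per category

print(encoder.categories_)  # [array(['blue', 'green', 'red'], dtype=object)]
print(encoded)              # e.g. the first row ('red') becomes [0. 0. 1.]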

Correct answer by rapaio on February 17, 2021

You need to encode your strings as numeric features that scikit-learn can use for the ML algorithms. This functionality is handled in the preprocessing module (e.g., see sklearn.preprocessing.LabelEncoder for an example).
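A minimal sketch of LabelEncoder on a made-up list of country strings (note it assigns an arbitrary integer per string and is intended mainly for target labels, so the ordering caveats from the accepted answer apply if you use it on features):

from sklearn.preprocessing import LabelEncoder

countries = ['US', 'UK', 'India', 'US', 'India']

le = LabelEncoder()
codes = le.fit_transform(countries)

print(codes)        # [2 1 0 2 0] -- integers assigned in alphabetical order
print(le.classes_)  # ['India' 'UK' 'US']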

Answered by Kyle. on February 17, 2021

You can use dummy variables in such scenarios. With pandas' pandas.get_dummies you can create dummy variables for the string columns you want to feed into a decision tree or random forest.

Example:

import pandas as pd

# A numeric column and a string column.
d = {'one': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd']),
     'two': pd.Series(['Paul', 'John', 'Micheal', 'George'], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

# Replace column 'two' with one 0/1 indicator column per name (two_George, two_John, ...).
df_with_dummies = pd.get_dummies(df, columns=["two"], drop_first=False)
df_with_dummies

Answered by ozn on February 17, 2021

You should usually one-hot encode categorical variables for scikit-learn models, including random forest. Random forest will often work ok without one-hot encoding but usually performs better if you do one-hot encode. One-hot encoding and "dummying" variables mean the same thing in this context. Scikit-learn has sklearn.preprocessing.OneHotEncoder and Pandas has pandas.get_dummies to accomplish this.
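A minimal sketch of wiring this into a scikit-learn pipeline (the column names and data are made up for illustration):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one numeric and one categorical feature.
X = pd.DataFrame({'gdp': [1.2, 3.4, 0.7, 2.1],
                  'country': ['US', 'UK', 'India', 'US']})
y = [0, 1, 0, 1]

# One-hot encode 'country', pass 'gdp' through unchanged, then fit the forest.
preprocess = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), ['country'])],
    remainder='passthrough')

model = Pipeline([('prep', preprocess),
                  ('rf', RandomForestClassifier(n_estimators=100, random_state=0))])
model.fit(X, y)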

However, there are alternatives. The article "Beyond One-Hot" at KDnuggets does a great job of explaining why you need to encode categorical variables and alternatives to one-hot encoding.

There are alternative implementations of random forest that do not require one-hot encoding, such as those in R or H2O. The implementation in R is computationally expensive and will not work if your features have many categories. H2O will work with large numbers of categories. Continuum has made H2O available in Anaconda Python.

There is an ongoing effort to make scikit-learn handle categorical features directly.

This article has an explanation of the algorithm used in H2O. It references the academic paper A Streaming Parallel Decision Tree Algorithm and a longer version of the same paper.

Answered by denson on February 17, 2021

Turn them into numbers; for example, assign each unique country a unique number (like 1, 2, 3, ...).

Also, you don't need to use one-hot encoding (aka dummy variables) when working with random forests, because trees don't work like other algorithms (such as linear/logistic regression): they don't rely on distances, they work by finding good splits for your features, so there is NO NEED for one-hot encoding.
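A minimal sketch of this integer-coding approach with pandas and scikit-learn (made-up data; the ordering caveats discussed in the accepted answer still apply):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame({'country': ['US', 'UK', 'India', 'US'],
                   'gdp': [1.2, 3.4, 0.7, 2.1]})
y = [0, 1, 0, 1]

# Assign each unique country an integer code in order of appearance (US=0, UK=1, India=2).
df['country'] = pd.factorize(df['country'])[0]

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(df[['country', 'gdp']], y)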

Answered by Arash Jamshidi on February 17, 2021

2018 Update!

You can create an embedding (dense vector) space for your categorical variables. Many of you are familiar with word2vec and fastText, which embed words in a meaningful dense vector space. The same idea applies here: your categorical variables will map to a vector with some meaning.

From the Guo/Berkhahn paper:

Entity embedding not only reduces memory usage and speeds up neural networks compared with one-hot encoding, but more importantly by mapping similar values close to each other in the embedding space it reveals the intrinsic properties of the categorical variables. We applied it successfully in a recent Kaggle competition and were able to reach the third position with relative simple features.

The authors found that representing categorical variables this way improved the effectiveness of all machine learning algorithms tested, including random forest.

The best example might be Pinterest's application of the technique to group related Pins.

The folks at fastai have implemented categorical embeddings and created a very nice blog post with a companion demo notebook.

Additional Details and Explanation

A neural net is used to create the embeddings, i.e., to assign a vector to each categorical value. Once you have the vectors, you may use them in any model which accepts numerical values. Each component of the vector becomes an input variable. For example, if you used 3-D vectors to embed your categorical list of colors, you might get something like: red=(0, 1.5, -2.3), blue=(1, 1, 0), etc. You would use three input variables in your random forest corresponding to the three components. For red things, c1=0, c2=1.5, and c3=-2.3. For blue things, c1=1, c2=1, and c3=0.
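A minimal sketch of that last step, treating the made-up color vectors above as random forest inputs (the 'green' vector is also invented for illustration):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical embedding vectors: one 3-D vector per category.
embedding = {'red':   [0.0, 1.5, -2.3],
             'blue':  [1.0, 1.0,  0.0],
             'green': [0.5, -0.2, 1.1]}  # made-up vector for illustration

colors = ['red', 'blue', 'green', 'red']
y = [1, 0, 0, 1]

# Each vector component (c1, c2, c3) becomes one input variable for the forest.
X = np.array([embedding[c] for c in colors])

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)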

You don't actually need to use a neural network to create embeddings (although I don't recommend shying away from the technique). You're free to create your own embeddings by hand or other means, when possible. Some examples:

  1. Map colors to RGB vectors.
  2. Map locations to lat/long vectors.
  3. In a U.S. political model, map cities to some vector components representing left/right alignment, tax burden, etc.

Answered by Pete on February 17, 2021
