Data Science Asked on July 16, 2021
A colleague of mine is facing an interesting situation: he has a categorical feature with quite a large set of possible values (roughly 300 distinct values).
The usual data science approach would be to perform a One-Hot Encoding.
However, wouldn't it be a bit extreme to perform One-Hot Encoding with such a large dictionary (roughly 300 values)? Is there any best practice on when to choose embedding vectors over One-Hot Encoding?
Additional information: how would you handle the previous case if new values can be added to the dictionary? Re-training seems to be the only solution; however, with One-Hot Encoding the data dimension grows at the same time, which may cause additional trouble, whereas embedding vectors can keep the same dimension even when new values appear.
How would you handle such a case? Embedding vectors clearly seem more appropriate to me; however, I would like to validate my opinion and check whether there is another solution that could be more appropriate.
One-Hot Encoding is a general method that can vectorize any categorical feature. It is simple and fast to create and update the vectorization: just add a new entry in the vector, with a one for each new category. However, that speed and simplicity also lead to the "curse of dimensionality", since a new dimension is created for each category.
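As a minimal sketch of what that looks like in practice, assuming scikit-learn and an invented categorical column (the names and sizes below are illustrative, not from the original question), handle_unknown="ignore" simply maps categories unseen at fit time to an all-zero row:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical feature with ~300 distinct values
categories = np.array([f"cat_{i}" for i in range(300)])
X_train = np.random.choice(categories, size=(10_000, 1))

# handle_unknown="ignore" maps categories unseen at fit time
# to an all-zero row instead of raising an error
encoder = OneHotEncoder(handle_unknown="ignore")
X_encoded = encoder.fit_transform(X_train)

print(X_encoded.shape)  # (10000, number of observed categories), up to 300 columns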
Embedding is a method that requires a large amount of data, both in total volume and in repeated occurrences of individual exemplars, as well as a long training time. The result is a dense vector with a fixed, arbitrary number of dimensions.
They also differ at the prediction stage: a One-Hot Encoding tells you nothing of the semantics of the items; each vectorization is an orthogonal representation in another dimension. Embeddings will group commonly co-occurring items together in the representation space.
If you have enough training data, enough training time, and the ability to apply the more complex training algorithm (e.g., word2vec or GloVe), go with Embeddings. Otherwise, fall back to One-Hot Encoding.
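For comparison, a rough sketch of the embedding route, assuming a Keras model (the vocabulary size of 300 and the embedding dimension of 16 are illustrative choices, not prescriptions):

import numpy as np
import tensorflow as tf

vocab_size = 300     # number of distinct categories in the dictionary
embedding_dim = 16   # fixed, arbitrary size of the dense representation

# Categories are assumed to be pre-mapped to integer ids in [0, vocab_size)
category_ids = np.random.randint(0, vocab_size, size=(32,))

# Each id is looked up in a trainable table of dense vectors;
# the table is learned jointly with the rest of the model
embedding = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim)
vectors = embedding(category_ids)

print(vectors.shape)  # (32, 16)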
Correct answer by Brian Spiering on July 16, 2021
It seems that embedding vectors are the best solution here.
However, you may consider a variant of one-hot encoding called the "one-hot hashing trick". In this variant, when the number of unique words is too large to assign each a unique index in a dictionary, one may hash words into a vector of fixed size.
One advantage in your use case is that you can perform online encoding: even if you have not yet encountered every vocabulary word, you can still assign it a hash, and new words can be added to the vocabulary later. One pitfall, though, is "hash collisions": there is a probability that two different words end up with the same hash.
I found an example of this hashing trick in the excellent "Deep Learning with Python" by François Chollet.
import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
dimensionality = 1000
max_length = 10

results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        # Hash the word into an index between 0 and dimensionality - 1
        index = abs(hash(word)) % dimensionality
        results[i, j, index] = 1.0
The resulting array:
results.shape
(2, 10, 1000)
You can observe that words common to the two sentences are given the same index: "The", at position 0 of both sentences, hashes to index 23 (entries 0 and 6 of the third array returned by np.where below).
np.where(results > 0)
(array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=int64),
array([0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4], dtype=int64),
array([ 23, 58, 467, 442, 83, 77, 23, 798, 618, 301, 942], dtype=int64))
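To get a feel for the collision risk with roughly 300 categories hashed into 1000 slots, here is a quick sketch (the category names are invented; a stable hash from hashlib is used because Python's built-in hash for strings is salted per process unless PYTHONHASHSEED is fixed):

import hashlib

dimensionality = 1000
categories = [f"category_{i}" for i in range(300)]  # hypothetical category names

def stable_index(word, dim):
    # Deterministic across runs, unlike the built-in hash() for strings
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % dim

indices = [stable_index(c, dimensionality) for c in categories]
# Count categories that landed in an already-occupied slot
n_collisions = len(indices) - len(set(indices))
print(f"{n_collisions} hash collisions among {len(categories)} categories")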
Answered by michaelg on July 16, 2021