How to generate consistent encoding for words in Keras using tf.keras.preprocessing.text.one_hot

Data Science Asked by Sociopath on March 19, 2021

I am using Keras (TensorFlow) to convert text into encodings using tensorflow.keras.preprocessing.text.one_hot.

I have used it for the training dataset as below:

from tensorflow.keras.preprocessing.text import one_hot

corpus = ['nice app']
onehot_repr = [one_hot(words, 10000) for words in corpus]

print(onehot_repr)
# [[5779, 2969]]

It's fine up to this point.

But when I use one_hot on my test set, it generates different encodings.

I have created a Flask API for testing, so how can I use the same encoding for both the train and test sets?

The result from the API is:

[[5129, 4965]] for the same text ['nice app']

One Answer

Keras' one_hot function has many limitations. The biggest issue is that it does not actually perform one-hot encoding; it applies the hashing trick. By default it hashes words with Python's built-in hash function, which is salted per process, so the same word can map to different indices in different runs, which is exactly what you see between your training script and your Flask API.
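
For illustration (this demo is not from the original answer, and it assumes TensorFlow is installed), running the same one_hot call in two separate Python processes usually prints different indices, because each process gets its own hash salt unless PYTHONHASHSEED is set:

import subprocess
import sys

# Run the same one_hot call in two fresh interpreter processes.
snippet = (
    "from tensorflow.keras.preprocessing.text import one_hot; "
    "print(one_hot('nice app', 10000))"
)

for _ in range(2):
    # Each subprocess salts str hashing differently, so the indices usually differ.
    subprocess.run([sys.executable, "-c", snippet])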

One possible fix is to use Keras' hashing_trick function, which allows the hashing function to be specified. If you pick a stable hashing function such as 'md5', the values will be consistent across runs.

Here is an example:

from tensorflow.keras.preprocessing.text import hashing_trick

corpus = ['nice app']
text_hashed = [hashing_trick(text=words, n=10_000, hash_function='md5') for words in corpus]
assert text_hashed == [[9146, 6067]]
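
As a rough sketch of the serving side (the route name, payload shape, and VOCAB_SIZE constant below are assumptions, not from the question), the Flask API can call hashing_trick with 'md5' and get the same indices as the training script for the same text:

from flask import Flask, jsonify, request
from tensorflow.keras.preprocessing.text import hashing_trick

app = Flask(__name__)
VOCAB_SIZE = 10_000  # must match the value used when encoding the training data

@app.route("/encode", methods=["POST"])
def encode():
    text = request.get_json()["text"]  # e.g. "nice app"
    # 'md5' is deterministic across processes, so this matches the training encoding.
    encoded = hashing_trick(text, n=VOCAB_SIZE, hash_function="md5")
    return jsonify(encoded)  # e.g. [9146, 6067]

Keep in mind that the hashing trick can assign different words to the same index (collisions), so n should be comfortably larger than the expected vocabulary.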

Correct answer by Brian Spiering on March 19, 2021
