How to generate consistent encoding for words in Keras using tf.keras.preprocessing.text.one_hot

Data Science Asked by Sociopath on March 19, 2021

I am using Keras (TensorFlow) to convert text into encodings using tensorflow.keras.preprocessing.text.one_hot.

I have used it for the training dataset as below:

from tensorflow.keras.preprocessing.text import one_hot

corpus = ['nice app']
onehot_repr = [one_hot(words, 10000) for words in corpus]

print(onehot_repr)
# [[5779, 2969]]

It's fine up to this point.

But when I use one_hot on my test set, it generates different encodings.

I have created a Flask API for testing, so how can I use the same encoding for both the train and test sets?

The result from the API is:

[[5129, 4965]] for the same text ['nice app']

One Answer

Keras' one_hot function has many limitations. The biggest issue is that it does not actually perform one-hot encoding; it applies the hashing trick. By default it hashes words with Python's built-in hash function, which is salted per process, so the same word can map to different indices in different runs, which is exactly what you see between your training script and your Flask API.
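
For illustration (this demo is not from the original answer, and it assumes TensorFlow is installed), running the same one_hot call in two separate Python processes usually prints different indices, because each process gets its own hash salt unless PYTHONHASHSEED is set:

import subprocess
import sys

# Run the same one_hot call in two fresh interpreter processes.
snippet = (
    "from tensorflow.keras.preprocessing.text import one_hot; "
    "print(one_hot('nice app', 10000))"
)

for _ in range(2):
    # Each subprocess salts str hashing differently, so the indices usually differ.
    subprocess.run([sys.executable, "-c", snippet])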

One possible fix is to use Keras' hashing_trick function, which allows the hashing function to be specified. If you pick a stable hashing function such as 'md5', the values will be consistent across runs.

Here is an example:

from tensorflow.keras.preprocessing.text import hashing_trick

corpus = ['nice app']
text_hashed = [hashing_trick(text=words, n=10_000, hash_function='md5') for words in corpus]
assert text_hashed == [[9146, 6067]]
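
As a rough sketch of the serving side (the route name, payload shape, and VOCAB_SIZE constant below are assumptions, not from the question), the Flask API can call hashing_trick with 'md5' and get the same indices as the training script for the same text:

from flask import Flask, jsonify, request
from tensorflow.keras.preprocessing.text import hashing_trick

app = Flask(__name__)
VOCAB_SIZE = 10_000  # must match the value used when encoding the training data

@app.route("/encode", methods=["POST"])
def encode():
    text = request.get_json()["text"]  # e.g. "nice app"
    # 'md5' is deterministic across processes, so this matches the training encoding.
    encoded = hashing_trick(text, n=VOCAB_SIZE, hash_function="md5")
    return jsonify(encoded)  # e.g. [9146, 6067]

Keep in mind that the hashing trick can assign different words to the same index (collisions), so n should be comfortably larger than the expected vocabulary.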

Correct answer by Brian Spiering on March 19, 2021
