Data Science Asked by dokondr on May 19, 2021
Looking for some guidelines on choosing the dimension of a Keras word embedding layer. For example, in a simplified movie review classification model:
import tensorflow as tf

# NN layer params
MAX_LEN = 100          # Max length of a review text (in tokens)
VOCAB_SIZE = 10000     # Number of words in the vocabulary
EMBEDDING_DIMS = 50    # Number of components in each word embedding vector

text_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=VOCAB_SIZE,
                              output_dim=EMBEDDING_DIMS,
                              input_length=MAX_LEN),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
text_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
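The post does not show the data pipeline, so the following sketch of training and evaluation uses Keras' built-in IMDB review dataset as a stand-in (the 17,500 / 5,625 split quoted below suggests a different partition was actually used); classification_report comes from scikit-learn:

from sklearn.metrics import classification_report

# Stand-in data: Keras' built-in IMDB reviews, already tokenized to word indices.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.imdb.load_data(num_words=VOCAB_SIZE)

# Pad/truncate every review to MAX_LEN tokens so the Embedding layer sees fixed-length input.
x_train = tf.keras.preprocessing.sequence.pad_sequences(x_train, maxlen=MAX_LEN)
x_test = tf.keras.preprocessing.sequence.pad_sequences(x_test, maxlen=MAX_LEN)

text_model.fit(x_train, y_train, epochs=5, validation_split=0.1, verbose=0)

# Threshold the sigmoid output at 0.5 to get hard 0/1 predictions for the report.
y_pred = (text_model.predict(x_test) > 0.5).astype(int).ravel()
print(classification_report(y_test, y_pred))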
The embedding vector has 50 components in this example. Trained on 17,500 reviews and tested on 5,625, this model reports:
              precision    recall  f1-score   support

           0       0.87      0.86      0.87      2802
           1       0.87      0.88      0.87      2823

    accuracy                           0.87      5625
   macro avg       0.87      0.87      0.87      5625
weighted avg       0.87      0.87      0.87      5625
With 10 and even 2 dimensions I get similar values in the classification report!
So what guiding principle actually works when choosing the word embedding dimension? When should I select 10, 50, 100, 200, etc. dimensions?
There is no "right" answer to this question, but you should keep the following guidelines in mind:
The embedding layer is a compression of the input: when the layer is smaller, you compress more and lose more information; when the layer is bigger, you compress less but potentially overfit the training data to this layer, making it useless.
The larger your vocabulary, the better a representation of it you need - make the layer larger.
If your documents are very sparse relative to the vocabulary, you want to "get rid" of unnecessary and noisy words - compress more, i.e. make the embedding smaller.
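A practical way to apply these guidelines, in line with the experiment in the question, is to treat the embedding dimension as a hyperparameter: compare validation accuracy across a few candidate sizes and keep the smallest dimension whose score is within noise of the best one. A minimal sketch (the build_model helper and the candidate sizes are illustrative; x_train / y_train are the padded reviews from the sketch above):

def build_model(embedding_dims):
    # Same architecture as in the question, parameterized by the embedding size.
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(input_dim=VOCAB_SIZE,
                                  output_dim=embedding_dims,
                                  input_length=MAX_LEN),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(6, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# Sweep candidate embedding sizes and record the best validation accuracy of each.
for dims in (2, 10, 50, 100, 200):
    model = build_model(dims)
    history = model.fit(x_train, y_train, epochs=5, validation_split=0.2, verbose=0)
    print(dims, max(history.history['val_accuracy']))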
Answered by Tolik on May 19, 2021