
How to choose dimension of Keras embedding layer?

Data Science. Asked by dokondr on May 19, 2021

I'm looking for guidelines on how to choose the dimension of a Keras word embedding layer. For example, in a simplified movie review classification model:

import tensorflow as tf

# NN layer params
MAX_LEN = 100        # Max length of a review text
VOCAB_SIZE = 10000   # Number of words in the vocabulary
EMBEDDING_DIMS = 50  # Embedding dimension - number of components in a word embedding vector

text_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=VOCAB_SIZE,
                              output_dim=EMBEDDING_DIMS,
                              input_length=MAX_LEN),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
text_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

The embedding vector has 50 components in this example. Trained on 17,500 reviews and tested on 5,625, the model reports:

              precision    recall  f1-score   support

           0       0.87      0.86      0.87      2802
           1       0.87      0.88      0.87      2823

    accuracy                           0.87      5625
   macro avg       0.87      0.87      0.87      5625
weighted avg       0.87      0.87      0.87      5625

With 10 and even 2 dimensions I get similar values in the classification report!

So what guiding principle actually works when choosing the word embedding dimension? When should one pick 10, 50, 100, 200, etc. dimensions?
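
For reference, such a comparison can be run with a simple sweep over embedding sizes (a minimal sketch; x_train, y_train, x_val, y_val are hypothetical names for the padded integer sequences and binary labels, and VOCAB_SIZE and MAX_LEN come from the snippet above):

import tensorflow as tf

def build_model(embedding_dims):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(input_dim=VOCAB_SIZE,
                                  output_dim=embedding_dims,
                                  input_length=MAX_LEN),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(6, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    model.compile(loss='binary_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    return model

# Train an identical model for several embedding sizes and
# compare the best held-out accuracy of each run
for dims in [2, 10, 50, 100, 200]:
    history = build_model(dims).fit(
        x_train, y_train, epochs=5, batch_size=32,
        validation_data=(x_val, y_val), verbose=0)
    print(dims, max(history.history['val_accuracy']))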

One Answer

There is no "right" answer to this question, but you should keep the following guidelines in mind:

  1. The embedding layer is a compression of the input: the smaller the layer, the more you compress and the more information you lose; the larger the layer, the less you compress, but you risk overfitting the embedding to your training set, making it useless on new data.

  2. The larger your vocabulary, the richer the representation it needs, so make the layer larger (a rough starting-point heuristic is sketched after this list).

  3. If your documents are very sparse relative to the vocabulary, you want to "get rid" of unnecessary, noisy words, so compress more and make the embedding smaller.
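
A commonly cited starting point (a general rule of thumb, not something stated in the answer above) is to set the embedding dimension near the fourth root of the vocabulary size and tune empirically from there. A minimal sketch:

# Rule-of-thumb sketch (an assumption, not part of the original answer):
# start near vocab_size ** 0.25, then tune up or down empirically.
def suggest_embedding_dims(vocab_size):
    return max(2, round(vocab_size ** 0.25))

print(suggest_embedding_dims(10000))  # -> 10

For VOCAB_SIZE = 10000 this suggests about 10 dimensions, which is consistent with the questioner's observation that 10 dimensions already matches the 50-dimensional result.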

Answered by Tolik on May 19, 2021
