TransWikia.com

Why does English ELMo model give embeddings for non-English words?

Data Science Asked by Gokul NC on September 4, 2021

Here’s the code from my notebook:

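# Colab magic: select TensorFlow 1.x (the code below uses the TF1 API)
%tensorflow_version 1.x
import tensorflow as tf
import tensorflow_hub as hub

# Load the pre-trained ELMo module from TF Hub
elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)
tf.logging.set_verbosity(tf.logging.ERROR)

def elmo_vectors(x):
    # x is a list of strings; "elmo" is the 1024-dimensional output per token
    embeddings = elmo(x, signature="default", as_dict=True)["elmo"]
    with tf.Session() as sess:
        # Initialize the module's variables and lookup tables before running
        sess.run(tf.global_variables_initializer())
        sess.run(tf.tables_initializer())
        return sess.run(embeddings)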

Output for a non-English word (Hindi in this example):

words = ['गोकुल']
v = elmo_vectors(words)
print(v.shape) # (1,1,1024)
print(v[0][0])
# Output: [ 0.3731584   0.5700774  -0.48072845 ... -0.1241736   0.5961436 -0.6986947 ]

The documentation of the pre-trained ELMo model on TensorFlow Hub says that it was trained only on English.
That is, its training set, the 1 Billion Word Benchmark, is monolingual English data. (Source)

So, how/why am I getting embeddings for non-English vocabulary words from ELMo using the TF Hub model?

One Answer

While ELMo was trained on English data, it does not know whether the data you give it as input is English or not.

ELMo receives its input at the character level. The 1B Word data may have contained some Hindi characters mixed in, in which case your characters would be encoded as they are. More probably, your characters are encoded as unknown characters (just like the unknown token <unk> in word-level NLP, but for characters).
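To make the character-level point concrete, here is a minimal sketch. It assumes the word is converted to UTF-8 bytes and each byte is used as a character id, which is how I understand the reference ELMo implementation to work; I have not checked this against the TF Hub module itself. The function name `word_to_char_ids`, the special id values and the maximum length of 50 are illustrative assumptions, not the module's API.

```python
# Illustrative values (assumptions, not taken from the TF Hub module).
MAX_WORD_LEN = 50               # fixed number of character slots per word
BOW, EOW, PAD = 258, 259, 260   # hypothetical begin-of-word / end-of-word / padding ids

def word_to_char_ids(word, max_len=MAX_WORD_LEN):
    """Map a word to a fixed-length list of integer ids, one per UTF-8 byte."""
    raw = word.encode("utf-8")[: max_len - 2]  # leave room for BOW and EOW
    ids = [BOW] + list(raw) + [EOW]
    return ids + [PAD] * (max_len - len(ids))

hindi = word_to_char_ids("गोकुल")
english = word_to_char_ids("hello")
print(len(hindi), len(english))  # both lists have the same fixed length
print(hindi[:8])
```

Under this assumption a Hindi word is no different from an English one: both become a list of small integers, and every integer has a row in the character embedding table, so nothing in the model can reject the input.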

ELMo is just a sequence of mathematical operations, so it applies them to whatever it receives. It first looks up the character embeddings for the characters you pass to it, then applies a char-CNN followed by two highway layers, and finally a bidirectional LSTM.
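The first two stages of that pipeline can be imitated in a few lines. This is a toy sketch, not ELMo: the weights are random, the sizes (`EMB_DIM`, `OUT_DIM`, `WIDTH`) are made up, the highway layers and the bidirectional LSTM are left out, and `toy_word_vector` is a hypothetical name. It only illustrates that embedding lookup plus convolution and max-pooling produces a fixed-size vector for any string.

```python
import random

EMB_DIM = 4   # toy sizes; the real model is far larger
OUT_DIM = 3
WIDTH = 2     # convolution window, in characters

rng = random.Random(0)
# One embedding row per possible byte value, so every input has a row.
char_emb = [[rng.uniform(-1, 1) for _ in range(EMB_DIM)] for _ in range(256)]
# One filter per output dimension, spanning WIDTH consecutive embeddings.
filters = [[rng.uniform(-1, 1) for _ in range(WIDTH * EMB_DIM)]
           for _ in range(OUT_DIM)]

def toy_word_vector(word):
    """Embed bytes, convolve over windows, max-pool over positions."""
    rows = [char_emb[b] for b in word.encode("utf-8")]
    while len(rows) < WIDTH:  # pad very short words with zero rows
        rows.append([0.0] * EMB_DIM)
    # Concatenate each run of WIDTH consecutive embeddings into one window.
    windows = [sum(rows[i:i + WIDTH], [])
               for i in range(len(rows) - WIDTH + 1)]
    # Each filter's response is its maximum dot product over all windows.
    return [max(sum(w * x for w, x in zip(f, win)) for win in windows)
            for f in filters]

print(toy_word_vector("hello"))
print(toy_word_vector("गोकुल"))
```

Both calls return a vector of length `OUT_DIM`. The numbers for the Hindi word are as well defined as those for the English word; whether they are meaningful depends entirely on what the weights were trained on.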

Answered by noe on September 4, 2021
