Can word embedding be used for text classification on a mix of English and non-English text?

Data Science: Asked by George Liu on April 4, 2021

I'm doing text classification on text messages generated by consumers, and I just realized that even though most of the replies are in English, some are in French. I've used a Keras word embedding layer, Conv1D, and max pooling to learn the structure in the text, and I didn't use any other text preprocessing techniques such as stop-word removal.

In this case, I think it should be fine to use word embeddings on both languages, since a word embedding learns the meaning of individual words regardless of language... Is this reasonable? Or do I need to separate the languages and build a different model for each?
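For reference, here is a minimal sketch of the setup I described, with a single tokenizer fit on a toy mixed English/French corpus. The texts, labels, and hyperparameters are made-up placeholders, and I use global max pooling as one variant of max pooling:

    import numpy as np
    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

    # Toy mixed-language corpus; texts and labels are made up for illustration.
    texts = [
        "thanks for the quick delivery",    # English
        "merci pour la livraison rapide",   # French
        "the product arrived broken",       # English
        "le produit est arrive casse",      # French
    ]
    labels = np.array([1, 1, 0, 0])

    # A single tokenizer fit on BOTH languages: English and French words
    # simply become different rows of the same embedding matrix.
    tokenizer = Tokenizer(num_words=5000)
    tokenizer.fit_on_texts(texts)
    X = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=20)

    model = Sequential([
        Embedding(input_dim=5000, output_dim=32),
        Conv1D(filters=64, kernel_size=3, activation="relu"),
        GlobalMaxPooling1D(),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, labels, epochs=3)  # far too little data to learn anything real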

2 Answers

"In this case, I think it should be fine to use word embeddings on both languages, since a word embedding learns the meaning of individual words regardless of language... Is this reasonable? Or do I need to separate the languages and build a different model for each?"

Thinking about it logically, you are correct. A word embedding is essentially a lookup table over tokens, where each token's vector is learned from the nearby words in a sentence. So if you have sufficient raw data (a mix of both languages), I think it is good to go, though the results will tell you more. :)

However, it would be interesting to see how such models behave when the data mixes left-to-right (LTR) and right-to-left (RTL) languages.
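To illustrate the point above, a single tokenizer fit on a mixed corpus simply assigns every word, English or French, its own index, so an embedding layer treats them as different rows of the same matrix (the corpus and printed indices here are only illustrative):

    from tensorflow.keras.preprocessing.text import Tokenizer

    tokenizer = Tokenizer()
    tokenizer.fit_on_texts([
        "thanks for your help",
        "merci pour votre aide",
    ])

    # "thanks" and "merci" are just two different tokens; each would get
    # its own trainable vector, shaped by the words that surround it,
    # regardless of which language it belongs to.
    print(tokenizer.word_index)
    # {'thanks': 1, 'for': 2, 'your': 3, 'help': 4,
    #  'merci': 5, 'pour': 6, 'votre': 7, 'aide': 8}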

Correct answer by vipin bansal on April 4, 2021

It depends on the data distribution. Some consumers might mix languages within a conversation, and if very few of them do, your word embedding might not learn anything useful from those examples. It might be helpful to add a feature that indicates, at least approximately, whether a conversation is mixed, pure English, or pure French; a rough sketch follows.
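For example, a crude stopword-overlap heuristic could produce such a feature. The stopword lists, the thresholds, and the name language_feature below are illustrative assumptions, not a tested detector; a dedicated library such as langdetect would be more reliable in practice:

    # Tiny illustrative stopword lists; real ones should be much larger.
    EN_STOPWORDS = {"the", "and", "is", "to", "of", "it", "for", "you"}
    FR_STOPWORDS = {"le", "la", "et", "est", "de", "il", "pour", "vous"}

    def language_feature(text: str) -> str:
        """Return 'en', 'fr', or 'mixed' based on stopword overlap."""
        words = set(text.lower().split())
        en_hits = len(words & EN_STOPWORDS)
        fr_hits = len(words & FR_STOPWORDS)
        if en_hits and fr_hits:
            return "mixed"
        return "fr" if fr_hits > en_hits else "en"

    print(language_feature("merci pour la commande"))  # fr
    print(language_feature("thanks for the order"))    # en
    print(language_feature("merci pour the order"))    # mixed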

Answered by bonez001 on April 4, 2021
