Can word embedding be used for text classification on a mix of English and non-English text?

Data Science: Asked by George Liu on April 4, 2021

I'm doing text classification on text messages generated by consumers, and I just realized that even though most of the replies are in English, some are in French. I've used a Keras word embedding layer, Conv1D, and max pooling to learn the structure in the text, and I didn't use any other text preprocessing techniques such as stop-word removal.

In this case, I think it should be fine to use word embeddings on both languages, since a word embedding learns the meaning of individual words regardless of language... Is this reasonable? Or do I need to separate the languages and build a different model for each?
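For reference, here is a minimal sketch of the setup I described, with a single tokenizer fit on a toy mixed English/French corpus. The texts, labels, and hyperparameters are made-up placeholders, and I use global max pooling as one variant of max pooling:

    import numpy as np
    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense

    # Toy mixed-language corpus; texts and labels are made up for illustration.
    texts = [
        "thanks for the quick delivery",    # English
        "merci pour la livraison rapide",   # French
        "the product arrived broken",       # English
        "le produit est arrive casse",      # French
    ]
    labels = np.array([1, 1, 0, 0])

    # A single tokenizer fit on BOTH languages: English and French words
    # simply become different rows of the same embedding matrix.
    tokenizer = Tokenizer(num_words=5000)
    tokenizer.fit_on_texts(texts)
    X = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=20)

    model = Sequential([
        Embedding(input_dim=5000, output_dim=32),
        Conv1D(filters=64, kernel_size=3, activation="relu"),
        GlobalMaxPooling1D(),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, labels, epochs=3)  # far too little data to learn anything real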

2 Answers

"In this case, I think it should be fine to use word embeddings on both languages, since a word embedding learns the meaning of individual words regardless of language... Is this reasonable? Or do I need to separate the languages and build a different model for each?"

Thinking about it logically, you are correct. A word embedding is essentially a lookup table over tokens, where each token's vector is learned from the nearby words in a sentence. So if you have sufficient raw data (a mix of both languages), I think it is good to go, though the results will tell you more. :)

However, it would be interesting to see how such models behave when the data mixes left-to-right (LTR) and right-to-left (RTL) languages.
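To illustrate the point above, a single tokenizer fit on a mixed corpus simply assigns every word, English or French, its own index, so an embedding layer treats them as different rows of the same matrix (the corpus and printed indices here are only illustrative):

    from tensorflow.keras.preprocessing.text import Tokenizer

    tokenizer = Tokenizer()
    tokenizer.fit_on_texts([
        "thanks for your help",
        "merci pour votre aide",
    ])

    # "thanks" and "merci" are just two different tokens; each would get
    # its own trainable vector, shaped by the words that surround it,
    # regardless of which language it belongs to.
    print(tokenizer.word_index)
    # {'thanks': 1, 'for': 2, 'your': 3, 'help': 4,
    #  'merci': 5, 'pour': 6, 'votre': 7, 'aide': 8}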

Correct answer by vipin bansal on April 4, 2021

It depends on the data distribution. Some consumers might mix languages within a conversation, and if very few of them do, your word embedding might not learn anything useful from those examples. It might be helpful to add a feature that indicates, at least approximately, whether a conversation is mixed, pure English, or pure French; a rough sketch follows.
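For example, a crude stopword-overlap heuristic could produce such a feature. The stopword lists, the thresholds, and the name language_feature below are illustrative assumptions, not a tested detector; a dedicated library such as langdetect would be more reliable in practice:

    # Tiny illustrative stopword lists; real ones should be much larger.
    EN_STOPWORDS = {"the", "and", "is", "to", "of", "it", "for", "you"}
    FR_STOPWORDS = {"le", "la", "et", "est", "de", "il", "pour", "vous"}

    def language_feature(text: str) -> str:
        """Return 'en', 'fr', or 'mixed' based on stopword overlap."""
        words = set(text.lower().split())
        en_hits = len(words & EN_STOPWORDS)
        fr_hits = len(words & FR_STOPWORDS)
        if en_hits and fr_hits:
            return "mixed"
        return "fr" if fr_hits > en_hits else "en"

    print(language_feature("merci pour la commande"))  # fr
    print(language_feature("thanks for the order"))    # en
    print(language_feature("merci pour the order"))    # mixed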

Answered by bonez001 on April 4, 2021
