Data Science question, asked by Gleb Shigin on August 5, 2021
I have the names of companies (in Russian). A name can contain abbreviations, all-uppercase words, all-lowercase words, and mixed-case words. The model is trained on the following principle: the input is the name in upper case, and the output is the "correct" version. For example (adapted):
"SIGMA" LIMITED LIABILITY COMPANY -> "SIGMA" Limited liability company
"SIGMA" LLT -> "SIGMA" LLT
PJSC "GAZPROM"-> PJSC "Gazprom"
STATE BUDGETARY EDUCATIONAL INSTITUTION OF THE CITY OF MOSCOW "LYCEUM NO. 1568" -> State budgetary Educational institution of the city of Moscow "Lyceum No. 1568"
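Assuming the all-uppercase inputs are simply the uppercased targets (as the examples above suggest), building training pairs is a one-liner; the name `make_pair` is just an illustration:

```python
# Build an (input, target) pair for the model: the input is the
# all-uppercase form, the target is the "correct" casing from the dataset.
def make_pair(correct_name):
    return correct_name.upper(), correct_name

print(make_pair('PJSC "Gazprom"'))
# -> ('PJSC "GAZPROM"', 'PJSC "Gazprom"')
```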
I then have to predict sentences for which I don't know the answers. My dataset is about 5 million rows, or 600 MB.
I tried a character-based seq2seq model with attention and a bidirectional GRU (based on the PyTorch tutorials), but with all the hyperparameters I tried it seems to underfit. It generates the beginning of a phrase quite well, but breaks down towards the end.
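For reference, this is roughly the kind of encoder I mean; a minimal sketch in the spirit of the PyTorch seq2seq tutorial, where the class name `CharEncoder` and the sizes are my own choices, not taken from any particular guide:

```python
import torch.nn as nn

class CharEncoder(nn.Module):
    """Character-level bidirectional GRU encoder (rough sketch)."""
    def __init__(self, vocab_size, emb_size=64, hidden_size=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.gru = nn.GRU(emb_size, hidden_size,
                          batch_first=True, bidirectional=True)

    def forward(self, char_ids):
        # char_ids: (batch, seq_len) tensor of character indices
        embedded = self.embedding(char_ids)
        # outputs: (batch, seq_len, 2 * hidden_size) -- attended over by the decoder
        # hidden:  (2, batch, hidden_size)           -- used to initialize the decoder
        outputs, hidden = self.gru(embedded)
        return outputs, hidden
```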
Now I think it is better to work with word tokens, but I don't know whether there are methods for classifying the words of a single text in context. (I need to know where a word is in the sentence and which words are next to it in order to predict its case.)
I want to tokenize a sentence and assign a property to each word: whether it is uppercase, lowercase, or starts with a capital letter. I also need to do something with mixed-case words like "МосГосПаравоз" (or "McDonald's", which not only starts with a capital letter but also contains one inside).
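To make the idea concrete, here is a rough sketch of how such per-word case labels could be derived from the "correct" targets and applied back to an all-uppercase input; the label names (UPPER, LOWER, TITLE, MIXED) and the helper functions are illustrative assumptions, not an existing API:

```python
UPPER, LOWER, TITLE, MIXED = "UPPER", "LOWER", "TITLE", "MIXED"

def case_label(word):
    """Classify the casing pattern of one word (assumed label scheme)."""
    if word.isupper():
        return UPPER
    if word.islower():
        return LOWER
    if word[:1].isupper() and word[1:].islower():
        return TITLE
    return MIXED  # e.g. "МосГосПаравоз", "McDonald's"

def make_labels(correct_sentence):
    """Build (uppercased word, label) training pairs from a target sentence."""
    return [(w.upper(), case_label(w)) for w in correct_sentence.split()]

def apply_label(word, label):
    """Restore the casing of an all-uppercase word from a predicted label."""
    if label == UPPER:
        return word
    if label == LOWER:
        return word.lower()
    if label == TITLE:
        return word[:1] + word[1:].lower()
    return word  # MIXED would need a separate (e.g. character-level) step

# Naive whitespace split keeps punctuation attached,
# so quoted words fall into MIXED:
print(make_labels('PJSC "Gazprom"'))
# -> [('PJSC', 'UPPER'), ('"GAZPROM"', 'MIXED')]
```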
Perhaps I need a completely different approach. I'll be happy to accept your help.