Data Science Asked on May 1, 2021
Chinese text uses a character set containing tens of thousands of characters. Words in Chinese are most commonly made up of 1, 2 or 3 characters. There are no spaces or other markers between words in Chinese text since native speakers can easily segment words from text. In order to assist language learning tools, it’s helpful to have an automatic way to segment words within blocks of text.
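Some approaches use dictionary-based greedy matching, but these are prone to failure for the usual reason greedy algorithms fail: the locally best choice (for example, always taking the longest dictionary word) can force a wrong segmentation later in the sentence.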
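To make the failure mode concrete, here is a minimal sketch of one greedy scheme, forward maximum matching. Picking this particular scheme is an assumption made for illustration, and the function name, toy dictionary and sample sentence are made up for the example.

```python
def forward_max_match(text, dictionary, max_word_len=3):
    """Greedy segmentation: at each position take the longest dictionary word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted as a fallback.
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


# Toy dictionary, invented for this example.
toy_dictionary = {"研究", "研究生", "生命", "起源"}

greedy = forward_max_match("研究生命起源", toy_dictionary)
print(greedy)  # ['研究生', '命', '起源']
```

The intended reading is 研究 / 生命 / 起源 ("research / life / origin"), but because the greedy step grabs 研究生 ("graduate student") first, the remaining 命 is left stranded.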
I want to try using neural networks, but my question is: how would one go about encoding the characters for the input neurons of the network?
I am not talking about OCR. The characters are already known and encoded in Unicode, but how would I present the characters to the inputs of the network?
One method I can imagine is to have the network look at the text in portions of, say, 100 characters at a time, and have one neuron for each character. But how would I represent the character as a number to the network? Using the Unicode integer value of the character doesn’t seem like a good idea.
So, the main question is about how to represent the Chinese characters in your Chinese word segmentation task.
Since these characters are effectively non-ordinal categorical variables, we would represent them as one-hot encodings (https://medium.com/@michaeldelsole/what-is-one-hot-encoding-and-how-to-do-it-f0ae272f1179) of dimensionality n = number of unique Chinese characters in your dataset.
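As a rough sketch of what that looks like, assuming a tiny toy string stands in for the dataset (the helper names build_index and one_hot are made up for illustration):

```python
def build_index(text):
    """Map each distinct character to an integer index (sorted for determinism)."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}


def one_hot(ch, char_to_index):
    """Return a vector of length n with a single 1 at the character's index."""
    vec = [0] * len(char_to_index)
    vec[char_to_index[ch]] = 1
    return vec


sample = "我爱我家"  # toy "dataset" with n = 3 unique characters
char_to_index = build_index(sample)
encoded = [one_hot(ch, char_to_index) for ch in sample]
```

Each character becomes a vector of length n with exactly one non-zero entry, and the same character always gets the same vector, so no artificial ordering between characters is implied.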
Sometimes n can be very large. In that case, we typically add an Embedding layer, which maps each character to a dense, lower-dimensional vector and so reduces both the sparsity and the number of input dimensions. (https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding)
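Conceptually, an embedding is just a lookup table with one row per character, and looking up row i gives the same result as multiplying the one-hot vector for i by that table. The sketch below shows this in plain Python. The table is filled with random numbers only for illustration, whereas in a real model it would be learned during training (for example by the Keras Embedding layer linked above), and the helper names are hypothetical.

```python
import random


def make_embedding_table(vocab_size, dim, seed=0):
    """Random table standing in for learned weights: one row per character."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]


def embed(index, table):
    """Embedding lookup: simply select the row for this character index."""
    return table[index]


def embed_via_one_hot(index, table):
    """Same result computed the long way: one-hot vector times the table."""
    vocab_size, dim = len(table), len(table[0])
    one_hot = [1.0 if i == index else 0.0 for i in range(vocab_size)]
    return [sum(one_hot[i] * table[i][j] for i in range(vocab_size)) for j in range(dim)]


table = make_embedding_table(vocab_size=5, dim=4)
dense = embed(3, table)  # a 4-dimensional vector instead of a 5-dimensional one-hot
```

With thousands of characters and an embedding size of, say, a few hundred, the network sees short dense vectors rather than very long sparse ones.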
Answered by shepan6 on May 1, 2021
I think there are three questions hidden here:
Let's address them one by one:
The typical way to encode discrete elements like in this case is to use a closed dictionary. In this case the elements to be encoded are characters, so we should:
Take our training data and extract all possible characters present there (potentially Chinese characters, roman numerals, and other foreign alphabet letters that were present in the data, like Latin or Cyrillic letters)
Create a list with the $N$ most frequent ones, and map every character outside that list to a single "unknown" entry. When choosing $N$, keep in mind that very large vocabularies (beyond roughly 50K elements) become increasingly costly in memory and computation for current neural network architectures.
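The two steps above could be sketched as follows. This is only an illustration: the special tokens `<pad>` and `<unk>`, the function names, and the two-line toy corpus are assumptions of mine, not something fixed by the approach.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # hypothetical special entries


def build_vocab(corpus, max_size):
    """Keep the max_size most frequent characters, after the special entries."""
    counts = Counter(ch for line in corpus for ch in line)
    most_common = [ch for ch, _ in counts.most_common(max_size)]
    return {ch: i for i, ch in enumerate([PAD, UNK] + most_common)}


def encode(text, vocab):
    """Replace each character by its index, or by the UNK index if unseen."""
    unk = vocab[UNK]
    return [vocab.get(ch, unk) for ch in text]


corpus = ["我爱我家", "我爱北京"]  # toy training data
vocab = build_vocab(corpus, max_size=2)  # N = 2 keeps only 我 and 爱
ids = encode("我爱家", vocab)
print(ids)  # [2, 3, 1] -> 家 falls outside the list and maps to <unk>
```

These integer indices are what gets fed to the network's first layer.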
Given that this problem can be formulated similarly to a language modeling task, I would say that the most appropriate architectures would be LSTM/GRU, 1D convolutional networks, and the Transformer (or one of its variants). Since we may benefit from an architecture that can handle unbounded context, I would say that the choice is between LSTMs and Transformer-XL. I think that word segmentation should not require very heavy processing, so I would go for an LSTM, which is very light at inference time.
As the characters are discrete, the first layer of the network would be an Embedding layer to encode the discrete characters as continuous vectors. The size of these vectors is a hyperparameter whose value we should decide.
The output of the network could be a 1 at the character positions that start a word and 0 at the other positions.
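As a small sketch of that labeling scheme (the function names and the example words are made up for illustration), the training targets can be derived from already-segmented text, and a predicted 0/1 sequence can be turned back into words:

```python
def words_to_labels(words):
    """From a list of words, build the raw text and a 0/1 label per character."""
    chars, labels = [], []
    for word in words:
        for pos, ch in enumerate(word):
            chars.append(ch)
            labels.append(1 if pos == 0 else 0)  # 1 marks the start of a word
    return "".join(chars), labels


def labels_to_words(text, labels):
    """Inverse step: cut the text before every position labeled 1."""
    words = []
    for ch, label in zip(text, labels):
        if label == 1 or not words:
            words.append(ch)  # start a new word
        else:
            words[-1] += ch   # extend the current word
    return words


text, labels = words_to_labels(["研究", "生命", "起源"])
print(text, labels)  # 研究生命起源 [1, 0, 1, 0, 1, 0]
```

At inference time the network would output a score per character, which is thresholded into such a 0/1 sequence before calling the decoding step.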
Since we proposed a neural architecture that can handle unbounded context, the input data preparation should support it as well. If we choose LSTMs, this means using Truncated Back-Propagation Through Time (TBPTT) in the style of normal language models: we prepare the minibatches so that each row of a batch continues the text of the same row in the previous batch, which lets us take the last hidden state of one batch and use it to initialize the next one.
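A minimal sketch of that batch layout, assuming the whole corpus has already been turned into one long list of character indices (the function name and the toy numbers are made up for illustration):

```python
def tbptt_batches(ids, batch_size, seq_len):
    """Yield batches in which row r continues row r of the previous batch."""
    stream_len = len(ids) // batch_size
    # Cut the corpus into batch_size contiguous streams (leftover items are dropped).
    streams = [ids[r * stream_len:(r + 1) * stream_len] for r in range(batch_size)]
    # Walk along all streams in parallel, seq_len steps at a time.
    # The final batch may be shorter if stream_len is not a multiple of seq_len.
    for start in range(0, stream_len, seq_len):
        yield [stream[start:start + seq_len] for stream in streams]


ids = list(range(12))  # stand-in for 12 encoded characters
batches = list(tbptt_batches(ids, batch_size=2, seq_len=3))
print(batches[0])  # [[0, 1, 2], [6, 7, 8]]
print(batches[1])  # [[3, 4, 5], [9, 10, 11]]
```

Row 0 of the second batch picks up exactly where row 0 of the first batch stopped, so the hidden state for that row can be carried over, while gradients are only propagated within each chunk of seq_len steps. The 0/1 label sequence would be cut the same way so that inputs and targets stay aligned.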
Answered by noe on May 1, 2021