
What is the difference between fasttext and DANs in document classification?

Data Science Asked by user1043144 on February 28, 2021

I came across two interesting papers that describe promising approaches for document classification using word embeddings.

1. The fasttext algorithm

Described in the paper Bag of Tricks for Efficient Text Classification here.

(With further explanation here).

2. DANs

Described in the paper Deep Unordered Composition Rivals Syntactic Methods for Text Classification here.

Question:

What is the difference between both approaches?

Are they essentially the same, since they both seem to average word embeddings and pass them through an MLP, or am I missing something crucial?

One Answer

The most important difference is that when using fasttext you train your own embedding vectors as part of the model, while DAN is an architecture (not an embedding model) that requires either randomly initialized embedding layers (which are then trained along with the other layers) or pre-trained embeddings such as GloVe (or even fasttext vectors!).
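
To make the contrast concrete, here is a minimal sketch of the fasttext side, assuming the official fasttext Python bindings; the file name train.txt and its labels are hypothetical:

    # Minimal sketch of supervised fasttext classification, assuming the official
    # `fasttext` Python bindings. 'train.txt' is a hypothetical file with one
    # example per line, e.g. "__label__sports the match ended in a draw".
    import fasttext

    # fasttext learns its own (sub)word embeddings jointly with the classifier
    model = fasttext.train_supervised(input="train.txt", epoch=5, wordNgrams=2)

    # predict the label of a new document
    labels, probs = model.predict("the match ended in a draw")
    print(labels, probs)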

DAN is something that has become popular in some sense (even though I had never seen this paper before now). Averaging the embedding vectors of individual words before feeding them to a dense layer is common practice when you need to perform a task at the paragraph or document level.
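
For the DAN side, here is a minimal PyTorch-style sketch of the "average, then MLP" idea (an assumption on my part: the actual DAN paper also uses word dropout and several hidden layers):

    import torch
    import torch.nn as nn

    class DAN(nn.Module):
        def __init__(self, vocab_size, embed_dim=100, hidden_dim=100, num_classes=2):
            super().__init__()
            # randomly initialized here, but could be loaded from pre-trained
            # GloVe or fasttext vectors via nn.Embedding.from_pretrained
            self.embedding = nn.Embedding(vocab_size, embed_dim)
            self.mlp = nn.Sequential(
                nn.Linear(embed_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, num_classes),
            )

        def forward(self, token_ids):
            # token_ids: (batch, seq_len); average the word vectors, ignoring order
            averaged = self.embedding(token_ids).mean(dim=1)
            return self.mlp(averaged)

    # hypothetical usage: a batch of two documents, each padded to 5 token ids
    logits = DAN(vocab_size=10_000)(torch.randint(0, 10_000, (2, 5)))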

To add some peculiarities of fasttext embeddings: they are trained not for single words but for character n-grams. During preprocessing of the corpus from which the embeddings are learned, words are split into several character chunks. For example:

'matter' (with n = 3) would become [<ma, mat, att, tte, ter, er>], where '<' and '>' mark the word boundaries,

and a unique embedding is then learned for each chunk, such as '<ma' or 'mat'. Training follows the same logic as word2vec, meaning the model uses these chunk vectors to predict context words. The advantage of learning embeddings for each chunk lies in the ability of these vectors to capture specific morphological features that classic token-level embeddings usually miss.
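
A small sketch of this splitting (my own illustration, with n = 3 and the '<'/'>' boundary markers fasttext uses):

    # fasttext-style character n-gram extraction
    def char_ngrams(word, n=3):
        padded = f"<{word}>"
        return [padded[i:i + n] for i in range(len(padded) - n + 1)]

    print(char_ngrams("matter"))
    # ['<ma', 'mat', 'att', 'tte', 'ter', 'er>']
    # (fasttext also keeps the whole word '<matter>' as an extra unit)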

If it helps, for a good survey on word embeddings I suggest taking a look at this.

Answered by Edoardo Guerriero on February 28, 2021
