TransWikia.com

How do I visualize data for a natural language processing project?

Data Science Asked on June 30, 2021

I am using a question-and-answer dataset. My neural network takes a question and an article content, and outputs where an answer starts (as an integer). To visualize my data, how should I process it and what plot(s) should I use?

I’m considering:

A word/n-gram frequency histogram for the questions, and another one for the answers.

Plots mapping word/n-gram frequencies to output features.

Plots mapping word/n-gram frequencies to Shannon entropy values.

On that note, maybe I could use a smaller machine learning model, such as a decision tree, and graph the resulting probabilities.

What is the best plot for a project like mine?

One Answer

I'm not an expert but let me try to think with you. What's your vocabulary size?

I think starting with a small machine learning model is certainly a good idea, but a decision tree would quickly struggle with even a medium-sized vocabulary. With one feature per word or n-gram, you would need a huge tree to do anything useful. So I would start with pretrained word embeddings and use a small neural net to predict the starting point. This helps because words that are close in meaning have similar vectors, and a decision tree wouldn't be able to use that kind of information.
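To make the "similar vectors" point concrete, here is a minimal sketch of cosine similarity, the usual way to compare embedding vectors. The vectors in `toy_vectors` are made up for illustration and are not taken from any pretrained model; real embeddings have far more dimensions.

```python
import math

# Hypothetical 3-dimensional vectors, invented for this example only.
# Real pretrained embeddings have tens to hundreds of dimensions.
toy_vectors = {
    "car": [0.9, 0.1, 0.0],
    "truck": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Related words should score closer to 1 than unrelated ones.
print(cosine_similarity(toy_vectors["car"], toy_vectors["truck"]))
print(cosine_similarity(toy_vectors["car"], toy_vectors["banana"]))
```

With real embeddings the same comparison works unchanged; a model that takes the vectors as input gets this notion of closeness for free, whereas a model over raw word counts treats "car" and "truck" as unrelated features.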

Your suggestions for histograms don't seem bad, but you would have a histogram that is as wide as your vocabulary, which seems like it defeats the purpose of visualizing it... If you went with word embeddings, how about using a technique like UMAP to plot the questions and articles in 2D?
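If you still want a frequency histogram, one way around the width problem is to plot only the most frequent n-grams. The sketch below uses only the standard library; the tokenizer regex, the helper names (`ngrams`, `top_ngrams`, `text_histogram`) and the sample questions are assumptions for illustration, not part of your dataset or of any library.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (joined by spaces) in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(texts, n=1, k=20):
    """Count n-grams over all texts and keep only the k most frequent."""
    counts = Counter()
    for text in texts:
        # Crude tokenizer (an assumption): lowercase alphanumeric runs.
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        counts.update(ngrams(tokens, n))
    return counts.most_common(k)


def text_histogram(pairs, width=40):
    """Render (ngram, count) pairs, sorted by count, as a text bar chart."""
    if not pairs:
        return ""
    largest = pairs[0][1]
    lines = []
    for gram, count in pairs:
        bar = "#" * max(1, round(width * count / largest))
        lines.append(f"{gram:<20} {bar} {count}")
    return "\n".join(lines)


# Hypothetical sample questions, standing in for the real dataset.
questions = [
    "Where does the answer start?",
    "Where is the answer in the article?",
    "What does the article say?",
]
print(text_histogram(top_ngrams(questions, n=1, k=5)))
```

The `(ngram, count)` pairs returned by `top_ngrams` can be passed to any plotting library as a bar chart instead of being printed. Running it separately on the questions and the answers gives the two histograms from the question, each limited to `k` bars.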

Answered by Paul on June 30, 2021
