Deep learning to detect reference boundaries in text (or the number of references in a text)

Data Science Asked on June 14, 2021

I have several documents, each of which may or may not contain some number X of references. I would like to build a model that can detect how many references, if any, appear in a text.

For training, I’ve been thinking of generating a bunch of random text and inserting references in a variety of citation styles, as they would appear in different articles. Generating this dataset is fairly straightforward.
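To make the idea concrete, here is a minimal sketch of what such a generator could look like. The filler vocabulary, author names, and the two reference templates are made up for illustration, and the BIO tagging scheme (B-REF for the first token of a reference, I-REF for the rest, O for everything else) is one assumed way to record where each reference sits. Tokens are split on whitespace so punctuation stays attached.

```python
import random

# Hypothetical filler vocabulary and reference templates, for illustration only.
FILLER = "the model data results show that method analysis we propose using".split()
AUTHORS = ["Smith", "Lee", "Garcia", "Chen"]
TEMPLATES = [
    "{a}, J. ({y}). {t}. Journal of Examples, {v}, {p}.",
    "[{n}] {a} J. {t}. J. Examples {y};{v}:{p}.",
]

def random_sentence(rng, n_words=8):
    """Build one sentence of random filler words."""
    return " ".join(rng.choice(FILLER) for _ in range(n_words)) + "."

def random_reference(rng, n):
    """Fill a randomly chosen template with random field values."""
    return rng.choice(TEMPLATES).format(
        a=rng.choice(AUTHORS),
        y=rng.randint(1990, 2020),
        t=" ".join(rng.choice(FILLER) for _ in range(4)).capitalize(),
        v=rng.randint(1, 40),
        p=f"{rng.randint(1, 200)}-{rng.randint(201, 400)}",
        n=n,
    )

def make_document(rng, n_refs):
    """Return (tokens, labels): body text first, then n_refs references."""
    tokens, labels = [], []
    for _ in range(rng.randint(2, 4)):
        words = random_sentence(rng).split()
        tokens += words
        labels += ["O"] * len(words)
    for i in range(n_refs):
        words = random_reference(rng, i + 1).split()
        tokens += words
        labels += ["B-REF"] + ["I-REF"] * (len(words) - 1)
    return tokens, labels

rng = random.Random(0)  # fixed seed so the dataset is reproducible
tokens, labels = make_document(rng, n_refs=3)
empty_tokens, empty_labels = make_document(rng, n_refs=0)
```

Each document comes back as two parallel lists, so the number of references is simply the number of B-REF labels, and documents with no references are covered by passing `n_refs=0`.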

I am not sure how to craft the data for a CNN. Word2vec does not seem like a good idea, since punctuation is part of what makes references different from regular text. I could just use tf-idf vectors, but then I am not sure what to use as my Y. Should I use the boundary (index position, start and end) of each reference? What loss function do I use when Y is a vector? Most guides only show numeric, binary and multiclass targets. Any advice or resources are much appreciated.
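One possible framing, offered as an assumption rather than a settled answer: instead of regressing start and end indices, Y can be one class label per token (a BIO scheme with O, B-REF, I-REF). The vector target then reduces to the familiar multiclass case applied at every position, so the loss is ordinary categorical cross-entropy averaged over tokens. The sketch below uses plain Python to show the label encoding, how boundaries and the reference count are recovered from it, and the loss itself. All function names are hypothetical, and the logits stand in for whatever a CNN would output per token.

```python
import math

LABELS = ["O", "B-REF", "I-REF"]  # assumed tag set, one class per token

def count_references(labels):
    """Each reference starts with exactly one B-REF tag."""
    return sum(1 for label in labels if label == "B-REF")

def reference_spans(labels):
    """Return (start, end) token index pairs, end exclusive."""
    spans, start = [], None
    for i, label in enumerate(labels):
        if label == "B-REF":
            if start is not None:
                spans.append((start, i))
            start = i
        elif label == "O":
            if start is not None:
                spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(labels)))
    return spans

def softmax(scores):
    """Numerically stable softmax over one token's class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def token_cross_entropy(logits, labels):
    """Mean categorical cross-entropy over all token positions."""
    loss = 0.0
    for row, label in zip(logits, labels):
        probs = softmax(row)
        loss -= math.log(probs[LABELS.index(label)])
    return loss / len(labels)

y = ["O", "O", "B-REF", "I-REF", "I-REF", "B-REF", "I-REF"]
uniform_logits = [[0.0, 0.0, 0.0] for _ in y]  # an untrained model's guess
loss = token_cross_entropy(uniform_logits, y)
```

With this encoding, the count and the boundaries both fall out of the predicted tag sequence, so a single model can serve either goal. Deep learning frameworks provide the same per-position cross-entropy as a built-in loss.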

convolutional neural network, information extraction, text mining
