TransWikia.com

Information Extraction/Semantic Search for long, unstructured documents

Data Science Asked by XsLiar on May 15, 2021

I am stuck on a particular information-extraction task. I have a few hundred long (5–35 page) PDF, DOC, and DOCX project documents from which I seek to extract specific information and store it in a structured database.

The ultimate goal is to extract and store information in a way that lets us query those documents, and any new incoming documents, quickly and reliably. For instance, I want to query a combination of entities from the knowledge base and get back the n most relevant paragraphs/sentences from the documents. Since some entities, like "World Bank", are extracted dozens of times in some documents, I need a way to query each entity in context. Otherwise I just end up with a database that contains the names of specific entities without any way to map them back to where they occur.
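One lightweight way to keep that context is to index each entity mention together with the sentence it appears in, keyed by document. The sketch below is a minimal stdlib-only illustration: the `entity_patterns` regexes and the `FUNDER` label are hypothetical stand-ins for matches that would really come from an NER model, and the sentence splitter is deliberately naive (a real pipeline would use spaCy's sentencizer).

```python
import re
from collections import defaultdict

def index_entities(doc_id, text, entity_patterns):
    """Map each entity label to (doc_id, sentence) pairs so every hit
    keeps its surrounding sentence as queryable context.

    entity_patterns: hypothetical dict of label -> regex; in practice
    these matches would come from a trained NER model instead.
    """
    # naive sentence split on terminal punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text)
    index = defaultdict(list)
    for sent in sentences:
        for label, pattern in entity_patterns.items():
            if re.search(pattern, sent):
                index[label].append((doc_id, sent))
    return index

idx = index_entities(
    "proj_001",
    "The World Bank funded the dam. Construction began in 2019. "
    "The World Bank later audited the project.",
    {"FUNDER": r"\bWorld Bank\b"},
)
for doc_id, sent in idx["FUNDER"]:
    print(doc_id, "->", sent)
```

With an index like this, a query for "World Bank" returns the distinct sentences it occurs in rather than a bare entity string, which is exactly the mapping-back problem described above.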

NER usually seems like a good solution for this; however, the documents all have idiosyncratic structures that vary from document to document. For instance, a lot of relevant information is stored in tables, but also in long paragraphs.

As far as I understand, NER uses the surrounding words to identify entities, so loading whole documents as raw text and manually tagging terms inside the tables will probably not produce good training data.

For now I have built a function that extracts raw text from PDF, DOC, and DOCX documents and extracts entities with spaCy's NER model; however, I need to define my own entity types, and the domain is too scientific for spaCy's pretrained model to deliver good results. I have also obtained a Prodigy license for annotation.

Also, it seems I have to find a way to distinguish "relevant" text and table entries from "junk": footnotes, annexes, titles, subtitles, etc.

Any input is highly appreciated!

Thanks!

One Answer

If you think that filtering words by POS tag, or dropping "bad" topics found via topic modelling, is not enough to distinguish relevant text from junk, you can implement a custom filter. Try regex rules first, and use your Prodigy licence to annotate junk examples and fit a classifier based on TF-IDF or another text feature extractor.
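As a starting point before training any classifier, the regex rules can be a plain line filter. The sketch below is a stdlib-only illustration with made-up heuristics (the patterns and the three-word threshold are assumptions to tune against your own corpus, not established rules):

```python
import re

# Illustrative junk heuristics -- tune these against your own documents.
JUNK_PATTERNS = [
    re.compile(r"^\s*\d+\s*$"),                  # bare page numbers
    re.compile(r"^\s*(annex|appendix)\b", re.I),  # annex/appendix headings
]

def is_junk(line: str) -> bool:
    """Flag lines that look like page numbers, headings, or other junk."""
    # very short lines are usually titles, numbers, or footer fragments
    if len(line.split()) < 3:
        return True
    return any(p.search(line) for p in JUNK_PATTERNS)

lines = [
    "12",
    "Annex B: Procurement tables",
    "The project was co-financed by the World Bank and the national government.",
]
kept = [line for line in lines if not is_junk(line)]
```

Lines that survive the filter can then be fed to the TF-IDF classifier; the regex pass cheaply removes the obvious cases so the annotated training set can focus on the ambiguous ones.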

But don't forget: there is no preprocessing framework that fits all needs.

Answered by Adelson Araújo on May 15, 2021

