What is a good way to extract blocks of text from documents in various formats?

Data Science Asked by n.mathfreak on June 24, 2021

I have a lot of documentation in PDF, PPT, and XLS files, which contain tables with text, pictures, headlines, etc. The goal is to extract blocks of text where the information is continuous, so to speak.

The lazy method is to copy and paste, or to convert the documents to plain text (.txt). I suppose I could also write a Python script that semi-automates some parts. I also found online tools that detect tables, but they are better suited to tables of numbers and values.

What would be a good, faster way to do it?
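To make the semi-automated route concrete, here is a minimal sketch of the post-processing step only. It assumes the documents have already been converted to plain text by some other tool, and that blank lines separate blocks in the converted output. The function name `extract_text_blocks` and the `min_words` threshold are hypothetical, chosen just for illustration; a real pipeline would likely need format-specific handling for tables and slides.

```python
import re


def extract_text_blocks(text, min_words=8):
    """Split converted plain text into blocks and keep the longer ones.

    Assumption: blank lines mark block boundaries, and short blocks
    (headlines, stray table cells, page numbers) are not wanted.
    """
    # Normalise line endings so the split below behaves consistently.
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    blocks = []
    # One or more blank (or whitespace-only) lines separate candidate blocks.
    for raw in re.split(r"\n\s*\n", text):
        # Re-join hard-wrapped lines into one string with single spaces.
        block = " ".join(raw.split())
        # Keep only blocks long enough to count as continuous text.
        if len(block.split()) >= min_words:
            blocks.append(block)
    return blocks


if __name__ == "__main__":
    sample = (
        "Heading\n\n"
        "This is a longer paragraph that spans\n"
        "two lines and has enough words.\n\n"
        "12 34\n"
    )
    for block in extract_text_blocks(sample, min_words=5):
        print(block)
```

The word-count threshold is a crude heuristic: it drops headlines and isolated cell values, but it would also drop short legitimate paragraphs, so it probably needs tuning per document set.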

data python text
