Extracting text into columns

Question

I have a set of customer reports, each in an MS Word file. They all follow a similar pattern; for example, they start with Name: --, Age: --, Date: --, etc.

Is there a way to extract particular strings from each file to form a data set?

In Orange, I was able to compile the Word documents into a corpus, which I can display as one column (each report is in one cell). Does Orange have a way to extract strings into columns (for example, the text between "age:" and "gender")?

K3---rnc · Answer

Maybe you could use the Preprocess Text widget from the Orange3-Text add-on, with Tokenization > Regexp. The source code indicates it's a Python regex, so you might be able to use a regular expression pattern such as:

(?ix)         # ignore case, ignore comments and whitespace in this RE
(?<=age:\s)   # preceded by 'age: '
.+            # characters you wish to match
(?=gender:)   # followed by 'gender:'
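
If the widget route doesn't work out, the same idea can be tried in plain Python with the standard re module. This is only a rough sketch and not an Orange feature: the extract_fields helper, the label list, and the two sample reports are all made up for illustration, and it assumes every field looks like "Label: value" and runs until the next known label.

```python
import re

def extract_fields(report, labels):
    """Return {label: value} for each 'Label: value' pair found in report."""
    # Build an alternation of the known labels, e.g. Name|Age|Gender|Date.
    alternation = "|".join(re.escape(label) for label in labels)
    # A field is a label, a colon, then the shortest text that runs up to
    # the next known label (or to the end of the report).
    pattern = re.compile(
        rf"\b({alternation}):\s*(.*?)\s*(?=\b(?:{alternation}):|\Z)",
        re.IGNORECASE | re.DOTALL,
    )
    return {m.group(1).lower(): m.group(2) for m in pattern.finditer(report)}

# Hypothetical sample reports; real ones would come from the Word files.
reports = [
    "Name: Jane Doe Age: 42 Gender: F Date: 2020-01-15",
    "Name: John Roe\nAge: 37\nGender: M\nDate: 2020-02-03",
]
labels = ["Name", "Age", "Gender", "Date"]

# One dict per report; each key becomes a column in the final data set.
rows = [extract_fields(report, labels) for report in reports]
for row in rows:
    print(row)
```

Each dict in rows is one report, so the list can be written out as CSV or loaded into a table, with one column per label.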
