Text vectorizer that captures feature offsets in the text?

Question

I'm using sklearn's TfidfVectorizer to extract features from text for text classification.
I believe the information I need tends to be at the beginning of the document, so I would like to somehow capture the offset of each feature per document (either the offset of its first appearance, or the mean offset over all appearances).
Is there a vectorizer that can do that? Or some other method of extracting this information efficiently?

Thank you!

Brian Spiering · Answer

One approach is to create a second matrix that stores this information. Scikit-learn stores text features in a document-by-token matrix. The second matrix would have the same shape, but each cell would hold the token's index (position) in the document instead of a count or weight. This matrix could then be used as features during modeling.

It would require writing a custom vectorizer, which would be similar to scikit-learn's CountVectorizer implementation.
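As a rough illustration of the idea, here is a minimal sketch using only the standard library. It is not a scikit-learn class: `simple_analyzer`, `build_vocabulary` and `offset_matrix` are hypothetical helper names, and the choice of 1-based positions with 0 meaning "token absent" is an assumption made so that missing tokens stay distinguishable from a token at the very start. Positions are counted after tokenization, so tokens the analyzer drops (such as single-character words) do not occupy a position.

```python
import re
from collections import defaultdict

# Same default token pattern as scikit-learn's CountVectorizer:
# words of two or more alphanumeric characters.
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")


def simple_analyzer(text):
    """Lowercase and tokenize; a stand-in for vectorizer.build_analyzer()."""
    return TOKEN_RE.findall(text.lower())


def build_vocabulary(docs, analyzer=simple_analyzer):
    """Map each token to a column index (tokens sorted alphabetically)."""
    tokens = sorted({tok for doc in docs for tok in analyzer(doc)})
    return {tok: col for col, tok in enumerate(tokens)}


def offset_matrix(docs, vocabulary, analyzer=simple_analyzer, mode="first"):
    """Return a document-by-token matrix (list of rows) of token positions.

    mode="first": 1-based position of the first appearance.
    mode="mean":  mean 1-based position over all appearances.
    A cell is 0.0 when the token does not occur in the document.
    """
    if mode not in ("first", "mean"):
        raise ValueError("mode must be 'first' or 'mean'")
    rows = []
    for doc in docs:
        # Collect every position at which each known token appears.
        positions = defaultdict(list)
        for pos, tok in enumerate(analyzer(doc), start=1):
            if tok in vocabulary:
                positions[tok].append(pos)
        row = [0.0] * len(vocabulary)
        for tok, pos_list in positions.items():
            if mode == "first":
                value = pos_list[0]
            else:
                value = sum(pos_list) / len(pos_list)
            row[vocabulary[tok]] = float(value)
        rows.append(row)
    return rows


docs = ["the cat sat on the mat", "a dog chased the cat"]
vocab = build_vocabulary(docs)
first = offset_matrix(docs, vocab, mode="first")
mean = offset_matrix(docs, vocab, mode="mean")
```

With a fitted TfidfVectorizer, the same function should work by passing `vectorizer.build_analyzer()` as the analyzer and `vectorizer.vocabulary_` as the vocabulary, so the columns line up with the TF-IDF matrix. The two matrices could then be combined, for example with `scipy.sparse.hstack`. Dividing each position by the document length may be worth trying if documents vary a lot in size.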
