Data Science Asked by siya m on August 19, 2020
I have a dataset with the following columns: document id, page numbers, and label.
documentid   pagenumbers          label
document1    1 23 26 45 48 76     fiction
document2    22 34 56 67          mystery
document3    61 78 82 99          science
document4    12 32                mystery
Explanation of the table:
Row 1 holds the data for document 1: it runs from page 1 to 23, resumes from page 26 to 45, then continues from page 48 to 76, where it ends. Document 1 contains stories belonging to the class fiction.
Similarly, row 2 holds the data for document 2: it runs from page 22 to 34 and then resumes from page 56 to 67. Document 2 contains stories belonging to the class mystery, and so on. The aim is to develop a classifier that assigns a document to a category (fiction, mystery, science) based on its page numbers.
I am looking for advice on what kinds of classifiers could be used to classify a series of page numbers. This isn't a time series, so I am unsure whether I need complex models such as RNNs or LSTMs. Are there simpler models that can use a series such as page numbers as features?
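To make the "simpler models" idea concrete, here is a minimal sketch of one possible approach: reading each row as (start, end) page pairs and summarising it into a fixed set of numbers that any standard tabular classifier could take as input. The helper names `parse_ranges` and `range_features` and the choice of features are assumptions made for illustration, not something given in the question, and whether these features carry any signal about the label would have to be checked on the real data.

```python
from statistics import mean


def parse_ranges(page_numbers):
    """Turn '1 23 26 45' into [(1, 23), (26, 45)] (start, end) pairs."""
    values = [int(tok) for tok in page_numbers.split()]
    if not values or len(values) % 2 != 0:
        raise ValueError("expected a non-empty, even count of page numbers")
    return list(zip(values[0::2], values[1::2]))


def range_features(page_numbers):
    """Summarise a variable-length page series as fixed-length features."""
    ranges = parse_ranges(page_numbers)
    # Pages covered by each segment, counting both endpoints.
    lengths = [end - start + 1 for start, end in ranges]
    # Pages skipped between the end of one segment and the start of the next.
    gaps = [
        next_start - prev_end - 1
        for (_, prev_end), (next_start, _) in zip(ranges, ranges[1:])
    ]
    return {
        "n_segments": len(ranges),
        "total_pages": sum(lengths),
        "first_page": ranges[0][0],
        "last_page": ranges[-1][1],
        "mean_segment_length": mean(lengths),
        "mean_gap": mean(gaps) if gaps else 0,
    }


# Rows copied from the table in the question.
rows = {
    "document1": ("1 23 26 45 48 76", "fiction"),
    "document2": ("22 34 56 67", "mystery"),
    "document3": ("61 78 82 99", "science"),
    "document4": ("12 32", "mystery"),
}

features = {doc: range_features(pages) for doc, (pages, _) in rows.items()}
labels = {doc: label for doc, (_, label) in rows.items()}
```

Because every document ends up with the same six features regardless of how many segments it has, the resulting table could be passed to an ordinary classifier without any sequence model.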
I am also considering padding the page-number sequences so that they all have equal length, since I was looking at sequence classifiers. Is this required?
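For reference, the padding described above could look like the following sketch. The helper name `pad_page_numbers` is hypothetical, and using 0 as the pad value is an assumption that relies on real page numbers starting at 1, as they do in the example table. Padding of this kind matters only for models that expect equal-length inputs; fixed-length summary features would not need it.

```python
def pad_page_numbers(sequences, pad_value=0):
    """Right-pad every sequence with pad_value up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]


# Page-number series copied from the table in the question.
page_sequences = [
    [1, 23, 26, 45, 48, 76],
    [22, 34, 56, 67],
    [61, 78, 82, 99],
    [12, 32],
]

padded = pad_page_numbers(page_sequences)
for row in padded:
    print(row)
```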
Are there any ML algorithms based on graph traversal that could be used? NetworkX provides features such as PageRank and centrality measures, and it would be interesting to explore such graph features. Any input would be helpful.
I am looking for tips on sequence classifier libraries, or on NetworkX features that might be useful for my problem.