What are the different ways to feature engineer webpage data for input into a webpage classification model?

Data Science Asked by mkerrig on June 20, 2021

I’m looking for resources on the different ways to turn webpage data into features for a neural net.

I’m aware of a service called Diffbot that claims to use a computer-vision-based method to "look" at the website while staying aware of the content embedded in the page.

What I’m struggling with is figuring out how to embed the webpage content so that a model can consume it.
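To make the question concrete, here is a minimal sketch of one baseline I can imagine: strip the markup, tokenize the visible text, and hash tokens into a fixed-length vector (the "hashing trick"). The names `VisibleText` and `page_to_vector` and the default `dim=1024` are my own placeholders, not from any library, and this ignores images and layout entirely.

```python
import hashlib
import math
import re
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text that sits outside <script> and <style> blocks."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def page_to_vector(html, dim=1024):
    """Hashed bag-of-words over the visible text, L2-normalized."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    tokens = re.findall(r"[a-z0-9]+", " ".join(parser.parts).lower())
    vec = [0.0] * dim
    for tok in tokens:
        # md5 gives a hash that is stable across runs, unlike built-in hash().
        bucket = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec
```

A vector like this has a fixed size regardless of page length, so it could feed a dense input layer directly, but it throws away word order and page structure, which is the part I am unsure how to keep.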

Is there a standard way of identifying the text and image elements on a page and their positions relative to one another?
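The closest thing I can sketch myself is a DOM-based proxy: walk the HTML and record each text node and image with its document order, nesting depth, and tag path. This is an assumption-laden stand-in, since true on-screen coordinates would need the page rendered in a browser; `ElementCollector` and `collect_elements` are hypothetical names of my own.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag, so they must not go on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}
SKIP = {"script", "style"}


class ElementCollector(HTMLParser):
    """Records text and image elements with their position in the DOM."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags, outermost first
        self.elements = []   # one dict per text node or image

    def _record(self, kind, value):
        self.elements.append({
            "kind": kind,
            "value": value,
            "order": len(self.elements),   # document order
            "depth": len(self.stack),      # nesting depth
            "path": "/".join(self.stack),  # tag path from the root
        })

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self._record("image", dict(attrs).get("src") or "")
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not SKIP.intersection(self.stack):
            self._record("text", text)


def collect_elements(html):
    collector = ElementCollector()
    collector.feed(html)
    collector.close()
    return collector.elements
```

Calling `collect_elements(html)` returns a flat list of dicts, so two elements that share a long `path` prefix and have nearby `order` values are probably close together on the page, though CSS can easily break that assumption.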

If there are any particular modeling tricks that have proven successful in this field, pointers to those would also be appreciated.

Context:

I have a set of about 12 million sites that I plan to scrape. It would save money to know up front what data the model needs for training, so I can limit which resources get downloaded, stored, and processed.

The goal of the model is to classify websites by what kinds of services the site offers.

Tags: computer vision, feature engineering, feature extraction, nlp, web scraping
