Data Science Asked by HelloGoodbye on March 23, 2021
I’m going to scrape the HTML code from a large number of URLs and store it on my computer for machine learning purposes (basically, I’m going to use Python and PyTorch to train a neural network on this data). What is the best way to store the HTML code for all the web pages?
I want to be able to see which URLs I have already scraped, so that I don’t scrape them again. For each piece of HTML code (one piece = all the HTML extracted from one URL), I may also want to see which URL it came from (though this may turn out to be an unnecessary requirement). I also want to be able to see the timestamp of when each page was created, so I can read the pages in chronological order (I will be able to extract the timestamp when I download the web pages), and possibly other metadata. I imagine that the total size of the HTML code can reach many GB, if not TB, and speed (for both reading and scraping) is a high priority.
In an ideal world, I would be able to just use the URLs as file names, have one file for each piece of HTML code, and store all files in one folder. But I’m not sure this is such a good idea, or even possible, for several reasons. For example:
URLs contain characters that are not allowed in file names (such as ‘/’). It may be possible to hash each URL and let the hash be the name of the file, but then I can’t tell which URL the HTML code came from by looking at the file name.

Simply put, you can create an "index database" in the following format:
ID | URL | Timestamp | link_to_file | other_metadata
With most databases, you could even store the actual file instead of just a link to it.
However, the simplest approach might be the best here. This "index db" would be created automatically during the scraping process and would serve both as navigation and as a check to avoid scraping the same URL twice.
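As a minimal sketch of this idea in Python, using SQLite as the index database (the table name, column names, and paths below are illustrative assumptions, not a fixed recipe):

```python
# Sketch of an "index database" for a scraping run, using SQLite.
# Table/column names and paths are hypothetical; adapt them to your setup.
import hashlib
import sqlite3
import time
from pathlib import Path

DB_PATH = "scrape_index.sqlite"   # index database file (assumed location)
HTML_DIR = Path("html_pages")     # folder where the raw HTML files are stored


def init_index(db_path: str = DB_PATH) -> sqlite3.Connection:
    """Open the index database and create the table if it does not exist yet."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS pages (
               id             INTEGER PRIMARY KEY AUTOINCREMENT,
               url            TEXT UNIQUE NOT NULL,  -- UNIQUE guards against double scraping
               page_timestamp TEXT,                  -- timestamp extracted from the page
               scraped_at     REAL,                  -- when we scraped it
               link_to_file   TEXT,                  -- path to the stored HTML file
               other_metadata TEXT                   -- e.g. JSON with anything else extracted
           )"""
    )
    conn.commit()
    return conn


def already_scraped(conn: sqlite3.Connection, url: str) -> bool:
    """Check the index so the same URL is not scraped twice."""
    row = conn.execute("SELECT 1 FROM pages WHERE url = ?", (url,)).fetchone()
    return row is not None


def store_page(conn, url: str, html: str, page_timestamp: str, other_metadata: str = None) -> Path:
    """Write the HTML to disk under a hashed file name and record it in the index."""
    HTML_DIR.mkdir(exist_ok=True)
    filename = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    path = HTML_DIR / filename
    path.write_text(html, encoding="utf-8")
    conn.execute(
        "INSERT INTO pages (url, page_timestamp, scraped_at, link_to_file, other_metadata) "
        "VALUES (?, ?, ?, ?, ?)",
        (url, page_timestamp, time.time(), str(path), other_metadata),
    )
    conn.commit()
    return path
```

With something like this, reading the pages back in chronological order is just a query over the index (e.g. `SELECT link_to_file FROM pages ORDER BY page_timestamp`), and the hashed file names avoid the problem of URLs containing characters that are not valid in file names while the index still records which URL each file came from.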
Answered by Fnguyen on March 23, 2021