Data Science Asked by nainometer on January 12, 2021
I have a chunk of data (~30k) consisting of HTML pages and PNGs of websites, saved in folders. The folders are named with randomly generated hashes. My supervisor wants me to crunch through this data, extract some attributes from each HTML page, and store them in a DB for future use. The attributes to extract are the page title and the copyright section of the HTML.
As I understand it, this data is unstructured because there is no relation, per se, between the folders for now. There is a somewhat inherent structure, namely that of HTML, but each page is essentially disjoint from the rest, which I think qualifies it as unstructured. Please correct me if I am wrong here.
My manager wishes to have the data stored in an ELK stack. What exactly should be stored is still quite unclear to him at this point, but so far he wants the whole HTML file, the title, and the copyright extracted and stored for every single HTML file. Here comes my first concern which I need help with.
I haven’t worked with the ELK stack, and I thought it would be a good learning opportunity. While going through online tutorials, I learned that it is essentially meant for parsing logs from different application servers and for storing and visualizing them in a presentable and searchable manner.
So far the end objective is to crunch through this data, store the attributes, and later search through them and use them as needed. For example, if a specific copyright text comes up very frequently, I would retrieve that text and use it to classify certain patterns, which takes me to my third and last question.
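The extraction step described above could be sketched roughly as follows. This is a minimal sketch using only the Python standard library, not a definitive implementation. The names `PageAttributes`, `extract_attributes`, and `scan_folder` are made up for illustration, and the rule for spotting the copyright section (any visible text containing "©", "(c)", or the word "copyright") is an assumption that will need tuning against the real pages.

```python
import re
from html.parser import HTMLParser
from pathlib import Path

# Assumption: a copyright notice is any visible text chunk that contains
# the (c) sign, "(c)", or the word "copyright". Real pages may need more.
COPYRIGHT_RE = re.compile(r"©|\(c\)|copyright", re.IGNORECASE)


class PageAttributes(HTMLParser):
    """Collects the <title> text and visible text chunks mentioning copyright."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._in_title = False
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>
        self.title_parts = []
        self.copyright_lines = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if not text:
            return
        if self._in_title:
            self.title_parts.append(text)
        elif not self._skip_depth and COPYRIGHT_RE.search(text):
            self.copyright_lines.append(text)


def extract_attributes(html_text):
    """Return the title and candidate copyright lines of one HTML page."""
    parser = PageAttributes()
    parser.feed(html_text)
    parser.close()
    return {
        "title": " ".join(parser.title_parts),
        "copyright": parser.copyright_lines,
    }


def scan_folder(root):
    """Yield one record per *.html file under root; the parent folder name
    is assumed to be the hash that identifies the website."""
    for path in sorted(Path(root).rglob("*.html")):
        html_text = path.read_text(encoding="utf-8", errors="replace")
        record = extract_attributes(html_text)
        record["folder"] = path.parent.name
        record["file"] = path.name
        yield record
```

Calling `list(scan_folder("data"))`, where `data` is a placeholder for the real folder, would give one dictionary per page that can then be written to whichever store is chosen. Badly broken HTML may call for a more forgiving parser than the standard library one.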
The terms "structured data" or "unstructured data" are not defined in such a way that a given dataset is always either one or the other. There are gray areas and I think this is one example. Since you cannot rely on the structure in your data, I would categorize this as unstructured.
To decide whether it's a good idea to store the whole HTML in the DB (and the same question applies to the PNGs), you need to weigh the pros and cons. The argument for storing everything in the DB is simplicity: you don't have separate places where data is stored, so if you take a snapshot of the DB at some point in time and restore it, you restore the entire state as it was at that time. You do not need to worry separately about restoring your disk storage to a given state. The argument against storing everything in the DB is the amount of data. Can the DB handle it, or does performance suffer too much? Think about retrieving the data, searching/querying it, storing data, and making backups. This will depend on your choice of database.
The same question applies to ELK versus MySQL: what are the pros and cons of each? MySQL is simpler to install, and that's always good. MySQL gives you a relational data model (tables can be related using foreign keys). Is that an advantage? MySQL gives you transactions. Is that helpful? ELK mainly gives you scalability, meaning that it would probably allow everything to be stored in the DB and still meet your performance needs.
If you can't store everything in MySQL (HTML and PNGs), then before choosing to store part of the data somewhere else, my first option would be to change DB technology to something that can store everything, rather than to start storing things in different places. So in that case ELK might be a good option, but store the PNGs there, too.
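To make the "store everything in one place" option concrete, here is a rough sketch of what one document per page might look like when prepared for the Elasticsearch bulk API, which takes newline-delimited JSON with an action line followed by the document itself. The field names, the index name `pages`, and the helper names `build_document` and `to_bulk_lines` are assumptions for illustration. Using the folder hash as `_id` assumes one HTML page per folder. The PNG is base64-encoded, which is the form Elasticsearch's `binary` field type accepts.

```python
import base64
import json


def build_document(folder_hash, html_text, title, copyright_lines, png_bytes=None):
    """Bundle everything known about one page into a single JSON-ready dict.
    All field names here are hypothetical."""
    doc = {
        "folder": folder_hash,
        "title": title,
        "copyright": copyright_lines,
        "html": html_text,
    }
    if png_bytes is not None:
        # Raw bytes are not valid JSON, so the image is stored as base64 text.
        doc["screenshot_png"] = base64.b64encode(png_bytes).decode("ascii")
    return doc


def to_bulk_lines(index_name, docs):
    """Serialize documents as a bulk request body: one action line and one
    source line per document, terminated by a final newline."""
    lines = []
    for doc in docs:
        action = {"index": {"_index": index_name, "_id": doc["folder"]}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc, ensure_ascii=False))
    return "\n".join(lines) + "\n"
```

Base64 inflates the image size by about a third, and the images are not searchable in this form, so this is exactly the kind of size and performance trade-off worth measuring on a sample before committing to it.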
Answered by Paul on January 12, 2021