Data Science Asked by blue-dino on March 29, 2021
I have a large dataset of 9 million JSON objects at roughly 300 bytes each. They are posts from a link aggregator: basically links (a URL, title, and author ID) and comments (text and author ID), plus metadata.
They could very well be relational records in a table, except for the fact that they have one array field with IDs pointing to child records.
Which implementation looks more solid?
I want to maximize performance in joins, so I can massage the data and explore it until I find interesting analyses, at which point I think it will be better to transform the data into a form specific to each analysis.
For data loads, PostgreSQL outperforms MongoDB. MongoDB is almost always faster when returning query counts, while PostgreSQL is almost always faster for queries that use indexes.
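As a rough illustration of the two query shapes being compared, here is a minimal sketch against a hypothetical posts(post_id, url, author_id) table with an index on author_id; the table and column names are assumptions, not taken from any benchmark:

-- A plain count, the kind of query where MongoDB tends to do well:
select count(*) from posts;

-- An index-backed lookup, the kind of query where PostgreSQL tends to do well:
select post_id, url
from posts
where author_id = 42;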
Check out this website and this one too for more info. They have very detailed explanations.
Answered by untitledprogrammer on March 29, 2021
You may benefit more from the schemaless design of MongoDB, which makes it very easy to modify data structures on the fly.
There is no such thing as a join in MongoDB, so how you think about data and how you use it needs to change to fit a document-based, schemaless database environment.
Maybe speed becomes less important as perspective and priorities change.
I hope that helps.
-Todd
Answered by Todd Canedy on March 29, 2021
For the numbers you mention, I think any of the alternatives should work (read: you'll be able to finish your analysis in a reasonable time). Still, I recommend a design that can lead to significantly faster results.
As answered before, PostgreSQL is in general faster than MongoDB, sometimes more than 4 times faster.
See for example this.
You said that you are interested in improving join performance. I assume that you want to calculate similarities among the entities (e.g., post, author), so you'll mainly join a table with itself (e.g., by post or author) and aggregate.
Add to that the fact that after the initial load your database will be read-only, which makes the problem very well suited to index usage. You won't pay for index updates since there won't be any, and I guess you have the extra storage for the indexes.
I would use Postgres and store the data in two tables:
create table posts (
    post_id   integer,
    url       varchar(255),
    author_id integer
);
-- Load the data first and only then create the indices.
-- That will lead to a faster load and better indices.
alter table posts add constraint posts_pk primary key (post_id);
create index post_author on posts (author_id);
create table comments (
    comment_id integer,
    post_id    integer,
    author_id  integer,
    comment    varchar(255)
);
alter table comments add constraint comments_pk primary key (comment_id);
create index comment_author on comments (author_id);
create index comment_post on comments (post_id);
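The comments above recommend loading the data before building the indices; here is a minimal bulk-load sketch, assuming the JSON dump has first been flattened into two CSV files (the file paths are hypothetical):

-- Bulk-load flattened CSV exports of the JSON dump (paths are hypothetical).
copy posts (post_id, url, author_id)
from '/data/posts.csv' with (format csv, header true);

copy comments (comment_id, post_id, author_id, comment)
from '/data/comments.csv' with (format csv, header true);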
Then you can compute author similarity based on comments with queries like:

select m.author_id as m_author_id,
       a.author_id as a_author_id,
       count(distinct m.post_id) as posts
from comments as m
join comments as a using (post_id)
group by m.author_id, a.author_id;
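Note that, as written, the query also pairs each author with themselves and counts every pair in both orders. If only distinct pairs are wanted, one possible refinement (a sketch, not part of the original query) is to restrict the join:

select m.author_id as m_author_id,
       a.author_id as a_author_id,
       count(distinct m.post_id) as shared_posts
from comments as m
join comments as a using (post_id)
where m.author_id < a.author_id  -- drop self-pairs and mirrored duplicates
group by m.author_id, a.author_id;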
In case you are interested in tokenizing the words in the comments for NLP, add another table for that, but remember that it will increase the volume of your data significantly. Usually it is better not to represent the entire tokenization in the database.
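If you do go that route, a possible shape for such a table is sketched below; the names and sizes are assumptions:

create table comment_tokens (
    comment_id     integer,       -- references comments(comment_id)
    token_position integer,       -- position of the token within the comment
    token          varchar(100)
);
create index token_comment on comment_tokens (comment_id);
create index token_value   on comment_tokens (token);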
Answered by DaL on March 29, 2021