
Which is faster: PostgreSQL vs MongoDB on large JSON datasets?

Asked on Data Science by blue-dino on March 29, 2021

I have a large dataset with 9m JSON objects at ~300 bytes each. They are posts from a link aggregator: basically links (a URL, title and author id) and comments (text and author ID) + metadata.

They could very well be relational records in a table, except for the fact that they have one array field with IDs pointing to child records.

What implementation looks more solid?

  1. JSON objects in a PostgreSQL database (just one large table with a single column holding the JSON object)
  2. JSON objects in a MongoDB database
  3. Explode the JSON objects into columns and use arrays on PostgreSQL (options 1 and 3 are sketched below)
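
For concreteness, here is a minimal sketch of what options 1 and 3 could look like as PostgreSQL DDL. The table and column names (raw_posts, doc, comment_ids, etc.) are just examples based on the fields described above, not a prescribed schema:

-- Option 1: a single JSONB column holding each raw object
create table raw_posts (
    doc jsonb
);

-- Option 3: exploded columns, with the child IDs kept in an integer array
create table posts_exploded (
    post_id     integer primary key,
    url         text,
    title       text,
    author_id   integer,
    comment_ids integer[]   -- the array field pointing to child records
);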

I want to maximize performance in joins, so I can massage the data and explore it until I find interesting analyses, at which point I think it will be better to transform the data into a form specific to each analysis.

3 Answers

For data loading, PostgreSQL outperforms MongoDB. MongoDB is almost always faster when returning query counts. PostgreSQL is almost always faster for queries using indexes.

Check out this website and this one too for more info. They have very detailed explanations.
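
As a rough illustration of the point about indexed queries, assuming the single-jsonb-column layout sketched in the question (the table name raw_posts and column name doc are just examples), a GIN index lets containment queries avoid a full scan:

-- GIN index over the jsonb documents
create index raw_posts_doc_gin on raw_posts using gin (doc);

-- The @> containment operator can use the GIN index,
-- e.g. to find all objects with a given author id
select doc
from raw_posts
where doc @> '{"author_id": 42}';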

Answered by untitledprogrammer on March 29, 2021

You may benefit more from the schemaless design of MongoDB, which makes it very easy to modify data structures on the fly.

There is no such thing as a join in MongoDB, so how one thinks about the data and how to use it needs to be adjusted for a document-based, schemaless database environment.
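
As a sketch of that document-based thinking, one common approach is to embed the comments inside the post document instead of joining at query time. The field names below are hypothetical, based on the fields described in the question:

{
  "post_id": 1,
  "url": "https://example.com/article",
  "title": "Example post",
  "author_id": 42,
  "comments": [
    { "author_id": 7,  "text": "First comment" },
    { "author_id": 13, "text": "Second comment" }
  ]
}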

Maybe speed becomes less important as perspective and priorities change.

I hope that helps.

-Todd

Answered by Todd Canedy on March 29, 2021

For the numbers you mention, I think all of the alternatives should work (read: you'll be able to finish your analysis in reasonable time). Still, I recommend a design that can lead to significantly faster results.

As answered before, PostgreSQL is generally faster than MongoDB, sometimes more than 4 times faster. See, for example, this.

You said that you are interested in improving join performance. I assume that you want to calculate similarities among the entities (e.g., post, author), so you'll mainly join a table with itself (e.g., by post or author) and aggregate.

Add to that the fact that after the initial load your database will be read-only, which makes the problem very well suited to index usage. You won't pay for index updates since there won't be any, and I guess you have the extra storage for the indices.

I would use Postgres and store the data in two tables:

create table posts (
    post_id   integer,
    url       varchar(255),
    author_id integer
);

-- Load the data and then create the indices.
-- That will lead to a faster load and better indices.
alter table posts add constraint posts_pk primary key (post_id);
create index post_author on posts (author_id);
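
For the loading step itself, a minimal sketch using PostgreSQL's COPY (the file path and column order are assumptions; you would first extract the fields from the JSON into a CSV):

-- Bulk-load a pre-extracted CSV; COPY is much faster than row-by-row INSERTs
copy posts (post_id, url, author_id)
from '/path/to/posts.csv'
with (format csv, header true);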

create table comments (
    comment_id integer,
    post_id    integer,
    author_id  integer,
    comment    varchar(255)
);

alter table comments add constraint comments_pk primary key (comment_id);
create index comment_author on comments (author_id);
create index comment_post on comments (post_id);

Then you can compute author similarity based on comments with queries like:

select m.author_id as m_author_id,
       a.author_id as a_author_id,
       count(distinct m.post_id) as posts
from comments as m
join comments as a using (post_id)
group by m.author_id, a.author_id;

In case you are interested in tokenizing the words in the comments for NLP, add another table for that, but remember that it will increase the volume of your data significantly. Usually it is better not to represent the entire tokenization in the database.
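
If you do go that route, a minimal sketch of such a table might look like this (the table and column names are hypothetical):

-- One row per (comment, position) pair; grows roughly with the total word count
create table comment_tokens (
    comment_id integer,
    position   integer,
    token      varchar(255)
);
create index token_comment on comment_tokens (comment_id);
create index token_value   on comment_tokens (token);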

Answered by DaL on March 29, 2021
