TransWikia.com

Is pandas now faster than data.table?

Data Science Asked on April 4, 2021

Here is the GitHub link to the most recent data.table benchmark.

The data.table benchmarks have not been updated since 2014. I heard somewhere that pandas is now faster than data.table. Is this true? Has anyone done any benchmarks? I have never used Python before, but I would consider switching if pandas can beat data.table.

4 Answers

A colleague and I have conducted some preliminary studies on the performance differences between pandas and data.table. You can find the study (which was split into two parts) on our blog (you can find part two here).

We found that there are some tasks where pandas clearly outperforms data.table, but also cases in which data.table is much faster. You can check it out yourself and let us know what you think of the results.

EDIT:
If you don't want to read the blogs in detail, here is a short summary of our setup and our findings:

Setup

We compared pandas and data.table on 12 different simulated data sets for the following operations (so far), which we call scenarios.

  • Data retrieval with a select-like operation
  • Data filtering with a conditional select operation
  • Data sort operations
  • Data aggregation operations
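For reference, the four scenarios map onto pandas operations roughly as follows. This is only a sketch on a toy frame with made-up column names, not the actual benchmark code:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "group": ["a", "b", "a", "b"],
    "value": [10.0, 20.0, 30.0, 40.0],
})

selected = df[["id", "value"]]                      # select-like data retrieval
filtered = df[df["value"] > 15]                     # filtering with a conditional select
ordered = df.sort_values("value", ascending=False)  # sort operation
aggregated = df.groupby("group")["value"].sum()     # aggregation operation
```

The data.table equivalents would be the corresponding `DT[, .(id, value)]`, `DT[value > 15]`, `setorder`, and `DT[, sum(value), by = group]` forms.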

The computations were performed on a machine with an Intel i7 2.2 GHz with 4 physical cores, 16 GB RAM and an SSD hard drive. The software versions were OS X 10.13.3, Python 3.6.4 and R 3.4.2. The respective library versions used were 0.22 for pandas and 1.10.4-3 for data.table.

Results in a nutshell

  • data.table seems to be faster when selecting columns (pandas on average takes 50% more time)
  • pandas is faster at filtering rows (roughly 50% faster on average)
  • data.table seems to be considerably faster at sorting (pandas was sometimes 100 times slower)
  • adding a new column appears faster with pandas
  • results for aggregation are completely mixed

Please note that I tried to simplify the results as much as possible so as not to bore you to death. For a more complete visualization, read the studies. If you cannot access our webpage, please send me a message and I will forward you our content. You can find the code for the complete study on GitHub. If you have ideas for how to improve our study, please shoot us an e-mail. You can find our contact details on GitHub.
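Timings of this kind can be reproduced with a minimal harness; a sketch (with arbitrary data sizes and repetition counts, not the blog's actual setup):

```python
import timeit

import numpy as np
import pandas as pd

# Simulated data: 1M rows, an integer grouping key and a float value column
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "key": rng.integers(0, 100, 1_000_000),
    "val": rng.random(1_000_000),
})

# Time a sort scenario and a filter scenario, 3 repetitions each
t_sort = timeit.timeit(lambda: df.sort_values("key"), number=3)
t_filter = timeit.timeit(lambda: df[df["val"] > 0.5], number=3)
```

The same scenarios would be timed on the R side with e.g. `microbenchmark` against data.table, and the ratios compared.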

Answered by Tobias Krabel on April 4, 2021

Has anyone done any benchmarks?

Yes, the 2014 benchmark in question became the foundation of the db-benchmark project. The initial step was to reproduce the 2014 benchmark on recent versions of the software, then to turn it into a continuous benchmark, so that it runs routinely and automatically upgrades the software before each run. Over time many things have been added. Below is a high-level diff of the 2014 benchmark compared to the db-benchmark project.

New:

  • continuous benchmark: runs routinely, upgrades software, re-runs the benchmarking script
  • more software solutions: spark, python datatable, cuda dataframes, julia dataframes, clickhouse, dask
  • more data cases
    • two new smaller data sizes: 0.5GB (1e7 rows) and 5GB (1e8 rows)
    • two more cardinality factors: unbalanced, heavily unbalanced
    • sortedness
    • data having NA
  • advanced groupby questions
    • median, sd
    • range v1-v2: max(v1)-min(v2)
    • top 2 rows: order(.); head(.,2)
    • regression: cor(v1, v2)^2
    • count and grouping by 6 columns
  • benchmark task: join
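For orientation, the advanced groupby questions listed above look roughly like this in pandas (a sketch using the benchmark's v1/v2-style column naming on a toy frame, not the benchmark's own code):

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["a", "a", "b", "b"],
    "v1": [1.0, 3.0, 2.0, 4.0],
    "v2": [2.0, 1.0, 5.0, 3.0],
})

g = df.groupby("id")

med_sd = g["v1"].agg(["median", "std"])                              # median, sd
range_v = g.apply(lambda x: x["v1"].max() - x["v2"].min())           # range v1-v2: max(v1)-min(v2)
top2 = df.sort_values("v1", ascending=False).groupby("id").head(2)   # top 2 rows per group
r2 = g.apply(lambda x: x["v1"].corr(x["v2"]) ** 2)                   # regression: cor(v1, v2)^2
counts = g.size()                                                    # count per group
```

The data.table originals are the `DT[, .(median(v1), sd(v1)), by=id]`-style expressions shown in the list.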

Changes (see the groupby2014 task for a fully 2014-compliant benchmark script):

We are planning to add even more software solutions and benchmark tasks in the future. Feedback is very welcome; feel invited to our issue tracker at https://github.com/h2oai/db-benchmark/issues.


Is pandas now faster than data.table?

According to our results, pandas is not faster than data.table.

Below is the medium-size data (5GB, 1e8 rows) groupby benchmark plot, taken from the report at h2oai.github.io/db-benchmark as of 2021-03-12. Consult h2oai.github.io/db-benchmark#explore-more-data-cases for other data sizes (1e7, 1e9), data cases (cardinality, NAs, sorted), question groups (advanced), or tasks (join).

For up-to-date timings please visit https://h2oai.github.io/db-benchmark.

[Figure: groupby benchmark plot, groupby_1e8_1e2_0_0_basic]

Answered by jangorecki on April 4, 2021

I know this is an older post, but figured it may be worth mentioning: using feather (in R and in Python) allows operating on data frames / data tables and sharing those results between the two languages.

See feather's GitHub page.

Answered by DonQuixote on April 4, 2021

Nope. In fact, if the dataset is so large that pandas crashes, you are basically stuck with dask, which sucks and with which you can't even do a simple groupby-sum. dplyr may not be fast, but it doesn't mess up.

I'm currently working on a little 2GB dataset, and a simple print(df.groupby(['INCLEVEL1'])["r"].sum()) crashes dask.

I didn't experience this error with dplyr.

So, if pandas can handle the dataset, I use pandas; if not, I stick to R's data.table.

And yes, you can convert a dask dataframe back to a pandas dataframe with a simple df.compute(), but it takes a fairly long time, so you might as well just wait patiently for pandas to load, or for data.table to read the data.

Answered by Chenying Gao on April 4, 2021

