Data Science Asked on April 4, 2021
Here is the GitHub link to the most recent data.table benchmark.
The data.table benchmarks have not been updated since 2014. I heard somewhere that pandas is now faster than data.table. Is this true? Has anyone done any benchmarks? I have never used Python before, but would consider switching if pandas can beat data.table.
A colleague and I have conducted some preliminary studies on the performance differences between pandas and data.table. You can find the study (which was split into two parts) on our blog (you can find part two here).
We found that there are some tasks where pandas clearly outperforms data.table, but also cases in which data.table is much faster. You can check it out yourself and let us know what you think of the results.
EDIT:
If you don't want to read the blogs in detail, here is a short summary of our setup and our findings:
Setup
We compared pandas and data.table on 12 different simulated data sets on the following operations (so far), which we called scenarios.
The computations were performed on a machine with an Intel i7 2.2 GHz CPU with 4 physical cores, 16 GB RAM and an SSD. The software versions were OS X 10.13.3, Python 3.6.4 and R 3.4.2. The respective library versions used were 0.22 for pandas and 1.10.4-3 for data.table.
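As a rough illustration of what such scenarios look like on the pandas side (this is not the actual study code; the column names and data size below are made up), a minimal timing sketch:

```python
# Minimal sketch of three benchmark scenarios on the pandas side:
# column selection, row filtering, and sorting on a simulated data set.
# Column names and sizes here are illustrative, not the study's.
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1_000_000
df = pd.DataFrame({
    "a": rng.integers(0, 100, n),
    "b": rng.random(n),
    "c": rng.random(n),
})

def bench(label, fn):
    # Time a single run of the given operation and print the elapsed seconds.
    start = time.perf_counter()
    result = fn()
    print(f"{label}: {time.perf_counter() - start:.4f}s")
    return result

selected = bench("select columns", lambda: df[["a", "b"]])
filtered = bench("filter rows", lambda: df[df["a"] > 50])
sorted_df = bench("sort", lambda: df.sort_values("b"))
```

A real benchmark would repeat each operation many times and report medians, as the studies do; a single run like this only conveys the shape of the comparison.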
Results in a nutshell

- data.table seems to be faster when selecting columns (pandas on average takes 50% more time)
- pandas is faster at filtering rows (roughly 50% on average)
- data.table seems to be considerably faster at sorting (pandas was sometimes 100 times slower)
Please note that I tried to simplify the results as much as possible so as not to bore you to death. For a more complete visualization, read the studies. If you cannot access our webpage, please send me a message and I will forward you our content. You can find the code for the complete study on GitHub. If you have ideas on how to improve our study, please shoot us an e-mail. You can find our contacts on GitHub.
Answered by Tobias Krabel on April 4, 2021
Has anyone done any benchmarks?
Yes, the 2014 benchmark in question became the foundation of the db-benchmark project. The initial step was to reproduce the 2014 benchmark on recent versions of the software; the next was to turn it into a continuous benchmark, so that it runs routinely and automatically upgrades software before each run. Many things have been added over time. Below is a high-level diff of the 2014 benchmark compared to the db-benchmark project.
New:

- groupby questions: median, sd, max(v1)-min(v2), order(.); head(.,2), cor(v1, v2)^2, count, and grouping by 6 columns
- join task

Changes (see the groupby2014 task for a fully 2014-compliant benchmark script):

- v2 and v3 measures increased
- dim()/.shape is included in timings to force lazy evaluation

We are planning to add even more software solutions and benchmark tasks in the future. Feedback is very welcome; feel invited to our issue tracker at https://github.com/h2oai/db-benchmark/issues.
Is pandas now faster than data.table?
According to our results, pandas is not faster than data.table.
I am pasting the medium-size data (5 GB, 1e8 rows) groupby benchmark plot taken from the report at h2oai.github.io/db-benchmark as of 2021-03-12. Consult h2oai.github.io/db-benchmark#explore-more-data-cases for other data sizes (1e7, 1e9 rows), data cases (cardinality, NAs, sorted data), question groups (advanced), or tasks (join).
For up-to-date timings please visit https://h2oai.github.io/db-benchmark.
Answered by jangorecki on April 4, 2021
I know this is an older post, but figured it may be worth mentioning: using feather (in R and in Python) allows operating on data frames / data tables in either language and sharing those results through feather files.
Answered by DonQuixote on April 4, 2021
Nope. In fact, if the dataset is so large that pandas crashes, you are basically stuck with dask, which sucks and you can't even do a simple groupby-sum. dplyr may not be fast, but it doesn't mess up.
I'm currently working on a little 2 GB dataset, and a simple print(df.groupby(['INCLEVEL1'])["r"].sum())
crashes dask.
I didn't experience this error with dplyr.
So, if pandas can handle the dataset, I use pandas; if not, I stick to R data.table.
And yes, you can convert a dask DataFrame back to a pandas DataFrame with a simple df.compute().
But it takes a fairly long time, so you might as well just wait patiently for pandas to load or data.table to read.
Answered by Chenying Gao on April 4, 2021