TransWikia.com

Is pandas now faster than data.table?

Data Science Asked on April 4, 2021

Here is the GitHub link to the most recent data.table benchmark.

The data.table benchmarks have not been updated since 2014. I heard somewhere that pandas is now faster than data.table. Is this true? Has anyone done any benchmarks? I have never used Python before, but I would consider switching if pandas can beat data.table.

4 Answers

A colleague and I have conducted some preliminary studies on the performance differences between pandas and data.table. You can find the study (which was split into two parts) on our blog (you can find part two here).

We found that there are some tasks where pandas clearly outperforms data.table, but also cases in which data.table is much faster. You can check it out yourself and let us know what you think of the results.

EDIT:
If you don't want to read the blogs in detail, here is a short summary of our setup and our findings:

Setup

We compared pandas and data.table on 12 different simulated data sets for the following operations (so far), which we call scenarios.

  • Data retrieval with a select-like operation
  • Data filtering with a conditional select operation
  • Data sort operations
  • Data aggregation operations
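For reference, the four scenarios map onto pandas operations roughly as follows. This is only a sketch on a toy frame with made-up column names, not the actual benchmark code:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "group": ["a", "b", "a", "b"],
    "value": [10.0, 20.0, 30.0, 40.0],
})

selected = df[["id", "value"]]                      # select-like data retrieval
filtered = df[df["value"] > 15]                     # filtering with a conditional select
ordered = df.sort_values("value", ascending=False)  # sort operation
aggregated = df.groupby("group")["value"].sum()     # aggregation operation
```

The data.table equivalents would be the corresponding `DT[, .(id, value)]`, `DT[value > 15]`, `setorder`, and `DT[, sum(value), by = group]` forms.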

The computations were performed on a machine with an Intel i7 2.2 GHz with 4 physical cores, 16 GB RAM and an SSD hard drive. The software versions were OS X 10.13.3, Python 3.6.4 and R 3.4.2. The respective library versions used were 0.22 for pandas and 1.10.4-3 for data.table.

Results in a nutshell

  • data.table seems to be faster when selecting columns (pandas on average takes 50% more time)
  • pandas is faster at filtering rows (roughly 50% faster on average)
  • data.table seems to be considerably faster at sorting (pandas was sometimes 100 times slower)
  • adding a new column appears faster with pandas
  • results for aggregation are completely mixed

Please note that I tried to simplify the results as much as possible so as not to bore you to death. For a more complete visualization, read the studies. If you cannot access our webpage, please send me a message and I will forward you our content. You can find the code for the complete study on GitHub. If you have ideas for how to improve our study, please shoot us an e-mail. You can find our contact details on GitHub.
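Timings of this kind can be reproduced with a minimal harness; a sketch (with arbitrary data sizes and repetition counts, not the blog's actual setup):

```python
import timeit

import numpy as np
import pandas as pd

# Simulated data: 1M rows, an integer grouping key and a float value column
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "key": rng.integers(0, 100, 1_000_000),
    "val": rng.random(1_000_000),
})

# Time a sort scenario and a filter scenario, 3 repetitions each
t_sort = timeit.timeit(lambda: df.sort_values("key"), number=3)
t_filter = timeit.timeit(lambda: df[df["val"] > 0.5], number=3)
```

The same scenarios would be timed on the R side with e.g. `microbenchmark` against data.table, and the ratios compared.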

Answered by Tobias Krabel on April 4, 2021

Has anyone done any benchmarks?

Yes, the 2014 benchmark in question became the foundation of the db-benchmark project. The initial step was to reproduce the 2014 benchmark on recent versions of the software, then to turn it into a continuous benchmark, so that it runs routinely and automatically upgrades the software before each run. Over time many things have been added. Below is a high-level diff of the 2014 benchmark compared to the db-benchmark project.

New:

  • continuous benchmark: runs routinely, upgrades software, re-runs the benchmarking script
  • more software solutions: spark, python datatable, cuda dataframes, julia dataframes, clickhouse, dask
  • more data cases
    • two new smaller data sizes: 0.5GB (1e7 rows) and 5GB (1e8 rows)
    • two more cardinality factors: unbalanced, heavily unbalanced
    • sortedness
    • data having NA
  • advanced groupby questions
    • median, sd
    • range v1-v2: max(v1)-min(v2)
    • top 2 rows: order(.); head(.,2)
    • regression: cor(v1, v2)^2
    • count and grouping by 6 columns
  • benchmark task: join
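For orientation, the advanced groupby questions listed above look roughly like this in pandas (a sketch using the benchmark's v1/v2-style column naming on a toy frame, not the benchmark's own code):

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["a", "a", "b", "b"],
    "v1": [1.0, 3.0, 2.0, 4.0],
    "v2": [2.0, 1.0, 5.0, 3.0],
})

g = df.groupby("id")

med_sd = g["v1"].agg(["median", "std"])                              # median, sd
range_v = g.apply(lambda x: x["v1"].max() - x["v2"].min())           # range v1-v2: max(v1)-min(v2)
top2 = df.sort_values("v1", ascending=False).groupby("id").head(2)   # top 2 rows per group
r2 = g.apply(lambda x: x["v1"].corr(x["v2"]) ** 2)                   # regression: cor(v1, v2)^2
counts = g.size()                                                    # count per group
```

The data.table originals are the `DT[, .(median(v1), sd(v1)), by=id]`-style expressions shown in the list.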

Changes (see the groupby2014 task for a fully 2014-compliant benchmark script):

We are planning to add even more software solutions and benchmark tasks in the future. Feedback is very welcome; feel invited to our issue tracker at https://github.com/h2oai/db-benchmark/issues.


Is pandas now faster than data.table?

According to our results, pandas is not faster than data.table.

Below is the medium-size data (5GB, 1e8 rows) groupby benchmark plot, taken from the report at h2oai.github.io/db-benchmark as of 2021-03-12. Consult h2oai.github.io/db-benchmark#explore-more-data-cases for other data sizes (1e7, 1e9), data cases (cardinality, NAs, sorted), question groups (advanced), or tasks (join).

For up-to-date timings please visit https://h2oai.github.io/db-benchmark.

[Figure: groupby benchmark plot, groupby_1e8_1e2_0_0_basic]

Answered by jangorecki on April 4, 2021

I know this is an older post, but figured it may be worth mentioning: using feather (in R and in Python) allows operating on data frames / data tables and sharing those results between the two languages.

See feather's GitHub page.

Answered by DonQuixote on April 4, 2021

Nope. In fact, if the dataset is so large that pandas crashes, you are basically stuck with dask, which sucks and with which you can't even do a simple groupby-sum. dplyr may not be fast, but it doesn't mess up.

I'm currently working on a little 2GB dataset, and a simple print(df.groupby(['INCLEVEL1'])["r"].sum()) crashes dask.

I didn't experience this error with dplyr.

So, if pandas can handle the dataset, I use pandas; if not, I stick to R's data.table.

And yes, you can convert a dask dataframe back to a pandas dataframe with a simple df.compute(), but it takes a fairly long time, so you might as well just wait patiently for pandas to load, or for data.table to read the data.

Answered by Chenying Gao on April 4, 2021

