Data Science Asked by stackoverflower on December 14, 2020
I like Python, and I like Spark, but they don't fit very well together. In particular, PySpark is not native Python: it drives the JVM-based Spark core through an adapter layer (Py4J), with the translation overhead that implies.
So I wonder: are there any alternatives to PySpark that support Python natively instead of via an adapter layer?
Try checking out Dask. It's a distributed computing library that is native to Python and builds on pandas and NumPy, so it is like using pandas with a wrapper for distributed computation.
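For example, a minimal sketch of the Dask dataframe API (the file glob and column names here are made up for illustration):

```python
import dask.dataframe as dd

# Lazily read many CSVs as one logical dataframe; nothing is loaded yet.
df = dd.read_csv("data/2020-*.csv")

# Familiar pandas-style operations build a task graph...
result = df.groupby("user_id")["amount"].mean()

# ...which only executes (in parallel, across cores or a cluster) on compute().
print(result.compute())
```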
Correct answer by avinash raghuthu on December 14, 2020
Try Parallel Python. https://www.parallelpython.com/
I use it for my bespoke data integrations, which can scale out to multiple machines.
With a bespoke option, you have the flexibility to process data with whatever tools you like.
E.g., algorithmic processing with dataframes takes a very long time, but if you use OpenCL or another GPU abstraction library, you can cut your processing time in half if you are willing to refactor and vectorise your algorithms (a rough illustration follows).
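To illustrate the vectorisation point, a small sketch using NumPy in place of an OpenCL library (the dataframe and column names are made up for the example; the same array-style form is what ports naturally to GPU libraries):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": np.random.rand(1_000_000),
                   "qty": np.random.randint(1, 10, 1_000_000)})

# Row-by-row Python loop: slow, interpreted per element.
slow = sum(row.price * row.qty for row in df.itertuples())

# Vectorised: the same arithmetic dispatched once to compiled array code.
fast = (df["price"].to_numpy() * df["qty"].to_numpy()).sum()
```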
It takes a while to build an "Integration Template" with Parallel Python, but it is worth it once you have it.
You will be able to build many integrations. Whether you are distributing your data-pulling task, your data-pushing task, or your data-processing task, a bespoke strategy gives you options and flexibility, whereas an off-the-shelf integration framework tightly couples you to its product.
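For reference, a minimal sketch of the Parallel Python job pattern (the original pp package targets Python 2; on Python 3 the ppft fork provides the same API, and partial_sum here is a made-up stand-in workload):

```python
import pp

def partial_sum(lo, hi):
    """Sum the integers in [lo, hi) -- a stand-in for a real workload."""
    return sum(range(lo, hi))

# A job server over the local cores; ppservers=() could instead list
# remote nodes to scale the same code across machines.
job_server = pp.Server(ppservers=())

# Submit independent chunks; each submit() returns a callable that
# blocks until that job's result is ready.
jobs = [job_server.submit(partial_sum, (i, i + 1_000_000))
        for i in range(0, 4_000_000, 1_000_000)]
print(sum(job() for job in jobs))
```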
Answered by user40285 on December 14, 2020