Data Science Asked by stackoverflower on December 14, 2020
I like Python, and I like Spark, but they don't fit very well together. In particular, PySpark is not native Python: it drives the JVM-based Spark core through an adapter layer (Py4J), with the translation overhead that implies.
So I wonder: are there any alternatives to PySpark that support Python natively instead of via an adapter layer?
Try checking out Dask. It's a distributed computing library that is native to Python and builds on pandas and NumPy, so it is like using pandas with a wrapper for distributed computation.
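For example, a minimal sketch of the Dask dataframe API (the file glob and column names here are made up for illustration):

```python
import dask.dataframe as dd

# Lazily read many CSVs as one logical dataframe; nothing is loaded yet.
df = dd.read_csv("data/2020-*.csv")

# Familiar pandas-style operations build a task graph...
result = df.groupby("user_id")["amount"].mean()

# ...which only executes (in parallel, across cores or a cluster) on compute().
print(result.compute())
```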
Correct answer by avinash raghuthu on December 14, 2020
Try Parallel Python. https://www.parallelpython.com/
I use it for my bespoke data integrations, which can scale out to multiple machines.
With a bespoke option, you have the flexibility to process data with whatever tools you like.
E.g., algorithmic processing with dataframes takes a very long time, but if you use OpenCL or another GPU abstraction library, you can cut your processing time in half if you are willing to refactor and vectorise your algorithms (a rough illustration follows).
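To illustrate the vectorisation point, a small sketch using NumPy in place of an OpenCL library (the dataframe and column names are made up for the example; the same array-style form is what ports naturally to GPU libraries):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"price": np.random.rand(1_000_000),
                   "qty": np.random.randint(1, 10, 1_000_000)})

# Row-by-row Python loop: slow, interpreted per element.
slow = sum(row.price * row.qty for row in df.itertuples())

# Vectorised: the same arithmetic dispatched once to compiled array code.
fast = (df["price"].to_numpy() * df["qty"].to_numpy()).sum()
```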
It takes a while to build an "Integration Template" with Parallel Python, but it is worth it once you have it.
You will be able to build many integrations. Whether you are distributing your data-pulling task, your data-pushing task, or your data-processing task, a bespoke strategy gives you options and flexibility, whereas an off-the-shelf integration framework tightly couples you to its product.
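For reference, a minimal sketch of the Parallel Python job pattern (the original pp package targets Python 2; on Python 3 the ppft fork provides the same API, and partial_sum here is a made-up stand-in workload):

```python
import pp

def partial_sum(lo, hi):
    """Sum the integers in [lo, hi) -- a stand-in for a real workload."""
    return sum(range(lo, hi))

# A job server over the local cores; ppservers=() could instead list
# remote nodes to scale the same code across machines.
job_server = pp.Server(ppservers=())

# Submit independent chunks; each submit() returns a callable that
# blocks until that job's result is ready.
jobs = [job_server.submit(partial_sum, (i, i + 1_000_000))
        for i in range(0, 4_000_000, 1_000_000)]
print(sum(job() for job in jobs))
```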
Answered by user40285 on December 14, 2020