Stack Overflow question, asked by cPhoenix on December 25, 2021
I have a script that parallelizes a web-scraping job, and we want to use joblib and Dask for it. I created a cluster on EMR, but I don't know how to work with Dask on EMR.
When I scale it, the client reports only about 4 GB of memory, but each instance actually has 16 GB. I want to scale it to use all of the instances and their full capacity. How can I do that, and how can I make the Dask client scale itself programmatically to the maximum available capacity? (For example, today I am using 3 instances, but when we start the cluster it may be 10, 20, 30, etc.)
EMR Cluster:
1 Master – m5.xlarge 4 vCore, 16 GiB memory
2 Core – m5.xlarge 4 vCore, 16 GiB memory
from dask_yarn import YarnCluster
from dask.distributed import Client

# Start a Dask cluster on YARN with the default worker size
cluster = YarnCluster()
client = Client(cluster)
cluster.scale(2)  # scaling is done on the cluster object, not the client
print(client)
[OUT] <Client: 'tcp://172.31.27.208:33307' processes=2 threads=2, memory=4.29 GB>
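
From the dask-yarn docs, the defaults appear to be 1 vCore and 2 GiB per worker container, which would explain the 4.29 GB above: two default workers regardless of the 16 GiB on each node. I am guessing I need to request bigger containers, something like the sketch below; the 12 GiB figure is just a guess that leaves headroom for YARN and the OS, and the real ceiling depends on the EMR YARN configuration.

from dask_yarn import YarnCluster
from dask.distributed import Client

# Request larger worker containers instead of the 2 GiB default.
# worker_vcores / worker_memory are per container; 12 GiB is a guess
# that leaves room for YARN and the OS on each 16 GiB node.
cluster = YarnCluster(worker_vcores=4, worker_memory="12GiB")
client = Client(cluster)
cluster.scale(2)  # one large worker per core node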
from typing import Any, Dict, List
import joblib
from joblib import Parallel, delayed

with joblib.parallel_backend('dask'):
    parse_results: List[Dict[str, Any]] = Parallel()(delayed(parse)(i) for i in range(min_id, max_id))
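
For the part about sometimes having 10, 20, or 30 instances, would adaptive scaling be the right approach? Something like the sketch below, reusing the cluster object from above (the minimum/maximum bounds are only examples):

# Let the scheduler request and release YARN containers based on load,
# bounded so it never asks for more than `maximum` workers.
cluster.adapt(minimum=2, maximum=30)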