Stack Overflow question, asked by cPhoenix on December 25, 2021
I have a script that parallelizes a web-scraping job, and we want to use joblib and Dask for it. I created a cluster on EMR, but I don't know how to work with Dask on EMR.
When I scale it, the client reports only about 4 GB of memory, but each instance actually has 16 GB. I want to scale it to use all of the instances and their full capacity. How can I do that, and how can I make the Dask client scale itself programmatically to the maximum available capacity? (For example, today I am using 3 instances, but when we start the cluster it may be 10, 20, 30, etc.)
EMR Cluster:
1 Master – m5.xlarge 4 vCore, 16 GiB memory
2 Core – m5.xlarge 4 vCore, 16 GiB memory
from dask_yarn import YarnCluster
from dask.distributed import Client

# Start a Dask cluster on YARN with the default worker size
cluster = YarnCluster()
client = Client(cluster)
cluster.scale(2)  # scaling is done on the cluster object, not the client
print(client)
[OUT] <Client: 'tcp://172.31.27.208:33307' processes=2 threads=2, memory=4.29 GB>
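
From the dask-yarn docs, the defaults appear to be 1 vCore and 2 GiB per worker container, which would explain the 4.29 GB above: two default workers regardless of the 16 GiB on each node. I am guessing I need to request bigger containers, something like the sketch below; the 12 GiB figure is just a guess that leaves headroom for YARN and the OS, and the real ceiling depends on the EMR YARN configuration.

from dask_yarn import YarnCluster
from dask.distributed import Client

# Request larger worker containers instead of the 2 GiB default.
# worker_vcores / worker_memory are per container; 12 GiB is a guess
# that leaves room for YARN and the OS on each 16 GiB node.
cluster = YarnCluster(worker_vcores=4, worker_memory="12GiB")
client = Client(cluster)
cluster.scale(2)  # one large worker per core node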
from typing import Any, Dict, List
import joblib
from joblib import Parallel, delayed

with joblib.parallel_backend('dask'):
    parse_results: List[Dict[str, Any]] = Parallel()(delayed(parse)(i) for i in range(min_id, max_id))
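
For the part about sometimes having 10, 20, or 30 instances, would adaptive scaling be the right approach? Something like the sketch below, reusing the cluster object from above (the minimum/maximum bounds are only examples):

# Let the scheduler request and release YARN containers based on load,
# bounded so it never asks for more than `maximum` workers.
cluster.adapt(minimum=2, maximum=30)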