
Distributed DL model with Tensorflow

Data Science Asked by steam_engine on May 2, 2021

Suppose I want to develop and train a big end-to-end deep learning model using TensorFlow (1.15, for legacy reasons). The input objects are complex, with many types of features that can be extracted: fixed-length vectors of numeric features, sequences, unordered sets, etc. Thus, the model will include many submodules to handle the various feature types.

I have access to a server with several GPUs, so I want to distribute the model across them. What is the best way to do so? So far I'm thinking about placing the submodules on separate GPUs, but this raises some questions:

  1. How costly is the transfer of intermediate results between GPUs? TensorFlow handles those copies automatically, right?
  2. How costly would gradient computation and the descent step be, given that the variables live on different GPUs? Would the gradients also be computed on the same GPUs as their corresponding variables? (See the sketch below.)
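A minimal TF 1.x sketch of the placement scheme described above is shown here; the feature shapes, layer sizes, and two-branch structure are only illustrative, not from the question. `tf.device` pins each submodule to a GPU and TensorFlow inserts the cross-device copies of intermediate tensors itself, while `colocate_gradients_with_ops=True` asks the optimizer to build each gradient op on the same device as the forward op it differentiates, which keeps gradients next to each branch's variables:

```python
import tensorflow as tf  # TensorFlow 1.15

# Hypothetical inputs for two feature types (shapes are illustrative only).
vec_input = tf.placeholder(tf.float32, [None, 128], name="vector_features")
seq_input = tf.placeholder(tf.float32, [None, 50, 64], name="sequence_features")
labels = tf.placeholder(tf.float32, [None, 1], name="labels")

# Place each submodule on its own GPU; TensorFlow inserts the
# cross-device copies of intermediate tensors automatically.
with tf.device("/gpu:0"):
    vec_branch = tf.layers.dense(vec_input, 256, activation=tf.nn.relu)

with tf.device("/gpu:1"):
    cell = tf.nn.rnn_cell.LSTMCell(256)
    _, seq_state = tf.nn.dynamic_rnn(cell, seq_input, dtype=tf.float32)
    seq_branch = seq_state.h

with tf.device("/gpu:0"):
    merged = tf.concat([vec_branch, seq_branch], axis=1)
    logits = tf.layers.dense(merged, 1)
    loss = tf.losses.sigmoid_cross_entropy(labels, logits)

optimizer = tf.train.AdamOptimizer(1e-3)
# Compute each gradient on the same device as its forward op, so the
# gradients for a branch stay on the GPU that holds that branch's variables.
train_op = optimizer.minimize(loss, colocate_gradients_with_ops=True)

config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
```

Setting `log_device_placement=True` prints where each op actually ran, which is a quick way to verify the placement and spot where tensors cross devices.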

One Answer

I invite you to look at the Horovod project on GitHub. It is currently the most efficient way to run distributed training with TensorFlow. They have tutorials and benchmark results available.
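Since the question targets TF 1.15, a minimal sketch of Horovod's TensorFlow 1.x API follows; the toy model, shapes, and hyperparameters are placeholders. Note that Horovod is data-parallel: each process drives one GPU with a full replica of the model, and `hvd.DistributedOptimizer` averages gradients across GPUs with all-reduce:

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each worker process to one local GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy model standing in for the real one (illustrative only).
x = tf.placeholder(tf.float32, [None, 128])
y = tf.placeholder(tf.float32, [None, 1])
logits = tf.layers.dense(x, 1)
loss = tf.losses.sigmoid_cross_entropy(y, logits)

# Scale the learning rate by the number of workers and wrap the optimizer:
# DistributedOptimizer averages gradients across GPUs with all-reduce.
opt = tf.train.AdamOptimizer(1e-3 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
global_step = tf.train.get_or_create_global_step()
train_op = opt.minimize(loss, global_step=global_step)

hooks = [
    # Broadcast rank 0's initial weights so all workers start identically.
    hvd.BroadcastGlobalVariablesHook(0),
    # Divide the total step budget among the workers.
    tf.train.StopAtStepHook(last_step=1000 // hvd.size()),
]

with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    while not sess.should_stop():
        batch_x = np.random.rand(32, 128).astype(np.float32)
        batch_y = np.random.randint(0, 2, (32, 1)).astype(np.float32)
        sess.run(train_op, feed_dict={x: batch_x, y: batch_y})
```

Each GPU is driven by its own process, launched for example with `horovodrun -np 4 python train.py` for four GPUs on one machine.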

Answered by Jonathan DEKHTIAR on May 2, 2021
