Mathematica: Asked by kirma on January 28, 2021
I have been experimenting with GPU NetTrain on AWS, now that Mathematica 12.2 supports remote batch jobs (I don't have an Nvidia GPU to try these things out otherwise). In particular, I'm puzzled by the teacherForcingNet variant of the example in the Mathematica documentation tutorial Sequence Learning and NLP with Neural Networks, section Language Modeling. After evaluating the prerequisites, the example runs training like this:
result = NetTrain[teacherForcingNet, <|"Input" -> Keys[trainingData]|>,
All, BatchSize -> 64, MaxTrainingRounds -> 5,
TargetDevice -> "CPU", ValidationSet -> Scaled[0.1]]
A CPU-based run, even when executed on AWS, results in a network with an error rate of around 40%.
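Since the third argument All makes NetTrain return a NetTrainResultsObject, that figure comes from inspecting the result roughly like this (property names as I read them in the NetTrain documentation):

trainedNet = result["TrainedNet"];   (* the trained network itself *)
result["ErrorRateEvolutionPlot"]     (* per-round error rate curve *)
result["ValidationMeasurements"]     (* measurements on the held-out 10% *)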
I have minimally modified the NetTrain step to perform the training on a GPU and to run it on AWS:
job = RemoteBatchSubmit[env,
NetTrain[teacherForcingNet, <|"Input" -> Keys[trainingData]|>, All,
BatchSize -> 64, MaxTrainingRounds -> 5, TargetDevice -> "GPU",
ValidationSet -> Scaled[0.1]],
TimeConstraint -> Quantity[30, "Minutes"],
RemoteProviderSettings -> <|"GPUCount" -> 1|>]
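Here env is a remote batch submission environment for AWS Batch, created roughly as follows (the queue, job definition and bucket names below are placeholders for my own resources):

env = RemoteBatchSubmissionEnvironment["AWSBatch", <|
   "JobQueue" -> "arn:aws:batch:us-east-1:123456789012:job-queue/WolframJobQueue",
   "JobDefinition" -> "arn:aws:batch:us-east-1:123456789012:job-definition/WolframJobDefinition:1",
   "IOBucket" -> "my-wolfram-batch-bucket"|>]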
When the training job completes, the resulting training object is available as job["EvaluationResult"] (and progress can actually be observed at runtime through job["JobLog"]). The problem is that while the CPU-based training reaches an error rate of around 41%, the GPU-based run gets stuck at about 82%, effectively without learning anything.
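Concretely, checking on the job and pulling the result looks like this (gpuResult is just a name I use here for the returned NetTrainResultsObject):

job["JobLog"]                         (* follow NetTrain progress while the job runs *)
gpuResult = job["EvaluationResult"];  (* NetTrainResultsObject once the job has finished *)
gpuResult["ErrorRateEvolutionPlot"]   (* this is where the ~82% plateau shows up *)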
What gives? Is this common behaviour for some networks (LeNet on the MNIST dataset works just fine on GPU, for instance), a bug that needs fixing, and/or is a workaround available? Changing the Method or WorkingPrecision options makes no difference in the results.
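For example, variations of the following kind (submitted through RemoteBatchSubmit exactly as above) behaved no differently; the particular option values here are illustrative, not an exhaustive list of what I tried:

NetTrain[teacherForcingNet, <|"Input" -> Keys[trainingData]|>, All,
 BatchSize -> 64, MaxTrainingRounds -> 5, TargetDevice -> "GPU",
 ValidationSet -> Scaled[0.1],
 Method -> "SGD",               (* instead of the default "ADAM" *)
 WorkingPrecision -> "Real64"]  (* instead of the default "Real32" *)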