Data Science: Asked by djvaroli on June 18, 2021
I am currently experimenting with the TensorFlow Model Optimization library, with the goal of reducing the size of the models we run in our production TensorFlow Serving containers. The project was originally inspired by our CTO reading a paper (or perhaps a report) stating that, after a model is trained, a large fraction of its weights can be removed without significantly affecting model performance.
I have been following the example shown here, but so far I've had mixed results and wanted to ask for help, because the resources I've found online haven't answered some of my questions (perhaps because the answers are obvious and I'm just missing something).
My goal is to end up with a model in SavedModel format (a .pb file) that I can load into a TF Serving container. I already have a trained model (an .h5 weights file) on which I want to perform post-training pruning. When I do that and then compare the zipped pruned and un-pruned models, the file sizes are drastically different (the pruned one is roughly 5x smaller). Awesome! The pruning step I'm describing is sketched below.
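For concreteness, here is a minimal sketch of that pruning step, assuming a trained Keras classifier loaded from my .h5 file and a small fine-tuning set `x_train`/`y_train` (the file names, loss, and 80% sparsity target are placeholders, not my exact setup):

```python
import os
import tempfile
import zipfile

import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Load the already-trained model from its .h5 file (placeholder file name)
model = tf.keras.models.load_model("trained_model.h5")

# Wrap the layers with pruning, targeting a fixed sparsity (80% here, arbitrary)
pruning_params = {
    "pruning_schedule": tfmot.sparsity.keras.ConstantSparsity(
        target_sparsity=0.8, begin_step=0
    )
}
model_for_pruning = tfmot.sparsity.keras.prune_low_magnitude(model, **pruning_params)

model_for_pruning.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# A short fine-tuning pass with UpdatePruningStep actually applies the pruning masks
# (x_train / y_train are placeholders for a small fine-tuning dataset)
model_for_pruning.fit(
    x_train,
    y_train,
    epochs=1,
    callbacks=[tfmot.sparsity.keras.UpdatePruningStep()],
)

# Remove the pruning wrappers before saving, then compare gzipped sizes
model_for_export = tfmot.sparsity.keras.strip_pruning(model_for_pruning)
model_for_export.save("pruned_model.h5")


def gzipped_size(path: str) -> int:
    """Return the size in bytes of a zip archive containing the file at `path`."""
    _, zipped = tempfile.mkstemp(".zip")
    with zipfile.ZipFile(zipped, "w", compression=zipfile.ZIP_DEFLATED) as z:
        z.write(path)
    return os.path.getsize(zipped)


print("zipped un-pruned:", gzipped_size("trained_model.h5"))
print("zipped pruned:   ", gzipped_size("pruned_model.h5"))
```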
However, when I convert the pruned model to SavedModel format (a .pb file again), the final size of the saved model is the same as in the un-pruned case. I realize that pruning weights doesn't actually remove entries from the weight matrices but sets them to 0 (that's my understanding, at least), so when you compress the model, the compression algorithm can exploit those zeros to produce a smaller file. When I convert the model to SavedModel format for serving, does a similar optimization not take place? Should I expect a .pb file of the same size, or is something strange going on in my case? (I am applying tfmot.sparsity.keras.strip_pruning after the pruning is complete.) Maybe there are additional steps I need to take to actually get the benefits of pruned weights, or perhaps I should be doing something different altogether? The export step I'm using is sketched below.
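The export step is essentially this (the export path is a placeholder):

```python
import tensorflow as tf

# Export the stripped, pruned model to SavedModel format for TF Serving
# (produces saved_model.pb plus a variables/ directory)
tf.keras.models.save_model(
    model_for_export,
    "pruned_saved_model/1",  # placeholder export path with a version directory
    include_optimizer=False,
    save_format="tf",
)
```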
My thought is that maybe I am misunderstanding the intended use case and shouldn't expect a smaller servable model, but that leaves me with the question: what exactly is the purpose of pruning weights? Does this method improve inference time, or is it more geared towards making models smaller and easier to store? That is a bit confusing to me, since if you wanted a very small model (say, for an IoT device) you would go with TFLite and model quantization (as I understood from the resources I've looked at), roughly along the lines of the sketch below.
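For reference, the TFLite-plus-quantization route I have in mind for the small-model case would look roughly like this (post-training dynamic-range quantization; again just a sketch of what I understand, not something I've deployed):

```python
import tensorflow as tf

# Convert the stripped, pruned Keras model to TFLite with post-training
# dynamic-range quantization (as I understand it, the usual route for
# very small on-device models)
converter = tf.lite.TFLiteConverter.from_keras_model(model_for_export)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("pruned_quantized.tflite", "wb") as f:  # placeholder output name
    f.write(tflite_model)
```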