Many recent deep architectures use "batch normalization" during training.
What is "batch normalization"? What does it do mathematically? In what way does it help the training process?
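To make the "mathematically" part of my question concrete: my current understanding (based on Ioffe & Szegedy, 2015) is that for a mini-batch $\mathcal{B} = \{x_1, \dots, x_m\}$ the layer computes

$$
\mu_{\mathcal{B}} = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_{\mathcal{B}}^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_{\mathcal{B}}\right)^2, \qquad
\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}, \qquad
y_i = \gamma \hat{x}_i + \beta
$$

where $\gamma$ and $\beta$ are learned parameters and $\epsilon$ is a small constant for numerical stability. Is this correct, and is this normalization the reason training becomes easier?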
How is batch normalization used during training? Is it a special layer inserted into the model? Do I need to normalize before each layer, or only once?
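For instance, is the following the intended usage? This is only a sketch of what I mean by "a special layer inserted into the model" (written in PyTorch purely for illustration; my question is not tied to any particular framework):

```python
import torch
import torch.nn as nn

# A batch-norm layer inserted between a linear layer and its activation.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),  # normalizes the 256 activations across the mini-batch
    nn.ReLU(),
    nn.Linear(256, 10),
)

x = torch.randn(32, 784)  # a mini-batch of 32 examples
y = model(x)
```

Is one such layer per weight layer the usual pattern, or is a single normalization somewhere in the net enough?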
Suppose I used batch normalization for training. Does this affect my test-time model? Should I replace the batch normalization with some other/equivalent layer/operation in my "deploy" network?
This question about batch normalization covers only part of what I am asking; I was hoping for a more detailed answer. More specifically, I would like to know how training with batch normalization affects test-time prediction, i.e., the "deploy" network and the TEST phase of the net.
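To pin down what I mean by "affects test-time prediction": my understanding is that at test time the per-batch statistics are replaced by running (population) estimates accumulated during training, roughly as in this NumPy sketch (the function name, parameter names, and the momentum update are my own illustration, not any framework's actual API):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      training, momentum=0.9, eps=1e-5):
    """Forward pass of batch normalization over a (batch, features) array."""
    if training:
        # TRAIN phase: normalize with the statistics of the current mini-batch.
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        # Accumulate running estimates for later use at test time.
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        # TEST / "deploy" phase: use the accumulated population statistics,
        # so the output no longer depends on the other examples in the batch.
        mu, var = running_mean, running_var
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, running_mean, running_var
```

If that picture is right, then at test time $\mu$, $\sigma^2$, $\gamma$, and $\beta$ are all fixed, so the layer reduces to a fixed affine transform. Does that mean I can (or should) fold it into the preceding layer's weights in my "deploy" network, or should I keep the batch-normalization layer and simply switch it to use the stored statistics?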