Dealing with NaNs

Having a model yield NaNs or Infs is quite common if some of the tiny components in your model are not set properly. NaNs are hard to deal with because sometimes they are caused by a bug or error in the code, sometimes by the numerical stability of your computational environment (library versions, etc.), and sometimes by the algorithm itself. Here we try to outline the common issues that cause a model to yield NaNs, as well as provide the nails and hammers to diagnose them.

Check Hyperparameters and Weight Initialization

Most frequently, the cause is that some of the hyperparameters, especially the learning rate, are set incorrectly. A high learning rate can blow up your whole model into NaN outputs even within one epoch of training. So the first and easiest solution is to try lowering it: keep halving your learning rate until you start to get reasonable output values.
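
A minimal sketch of such a halving loop, assuming a hypothetical train_one_epoch() helper that returns the epoch's loss:

    import numpy as np

    lr = 0.1
    while lr > 1e-8:
        loss = train_one_epoch(learning_rate=lr)  # hypothetical helper
        if np.isfinite(loss):
            print('learning rate %g gives a finite loss' % lr)
            break
        lr /= 2.0  # NaN/Inf seen: halve the learning rate and retry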

Other hyperparameters may also play a role. For example, does your training algorithm involve regularization terms? If so, are their corresponding penalties set reasonably? Search a wider hyperparameter space with a few (one or two) training epochs each to see if the NaNs disappear.
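
A small search sketch, again using hypothetical build_model() and train_one_epoch() helpers, could run a couple of epochs per setting and record which combinations stay finite:

    import numpy as np

    for lr in 10.0 ** -np.arange(1, 6):               # 1e-1 .. 1e-5
        for l2_penalty in [0.0, 1e-4, 1e-2]:
            model = build_model(l2_penalty=l2_penalty)  # hypothetical
            losses = [train_one_epoch(model, lr) for _ in range(2)]
            status = 'finite' if np.all(np.isfinite(losses)) else 'NaN/Inf'
            print('lr=%g, l2=%g -> %s' % (lr, l2_penalty, status))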

Some models can be very sensitive to the initialization of their weight vectors. If those weights are not initialized in a proper range, then it is not surprising that the model ends up yielding NaNs.
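
For instance, one common choice (among many; the layer sizes here are illustrative) is to draw the initial weights uniformly from a small range that depends on the fan-in and fan-out:

    import numpy as np
    import theano

    n_in, n_out = 784, 256
    bound = np.sqrt(6.0 / (n_in + n_out))  # Glorot/Xavier-style bound
    W = theano.shared(
        np.random.uniform(-bound, bound, size=(n_in, n_out))
          .astype(theano.config.floatX),
        name='W')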

Run in NanGuardMode, DebugMode, or MonitorMode

If adjusting hyperparameters doesn't work for you, you can still get help from Theano's NanGuardMode. Change the mode of your Theano function to NanGuardMode and run it again. NanGuardMode will monitor all input/output variables in each node and raise an error as soon as NaNs are detected. For how to use NanGuardMode, please refer to nanguardmode.
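
A minimal example of compiling a function in NanGuardMode (the computation itself is just a placeholder):

    import numpy as np
    import theano
    import theano.tensor as T
    from theano.compile.nanguardmode import NanGuardMode

    x = T.matrix('x')
    w = theano.shared(np.random.randn(5, 7).astype(theano.config.floatX))
    y = T.dot(x, w)

    # Every node's inputs and outputs are checked at run time; an error is
    # raised as soon as a NaN, Inf, or very large value shows up.
    f = theano.function([x], y,
                        mode=NanGuardMode(nan_is_error=True,
                                          inf_is_error=True,
                                          big_is_error=True))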

DebugMode can also help. Run your code in DebugMode with the flag mode=DebugMode,DebugMode.check_py=False. This will give you a clue about which op is causing the problem, and you can then inspect that op in more detail. For details on using DebugMode, please refer to debugmode.
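
The mode can also be requested per function; a sketch, assuming DebugMode's check_py_code option (which skips the slow Python re-evaluation of each op):

    import theano
    import theano.tensor as T
    from theano.compile.debugmode import DebugMode

    x = T.dscalar('x')
    f = theano.function([x], 10 * x, mode=DebugMode(check_py_code=False))

    # Equivalently, for a whole script, set the flags from the shell:
    #   THEANO_FLAGS='mode=DebugMode,DebugMode.check_py=False' python script.py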

Theano's MonitorMode provides another helping hand. It can be used to step through the execution of a function: you can inspect the inputs and outputs of each node as the function is called. For how to use it, please check "How do I Step through a Compiled Function?".
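
The sketch below, adapted from that recipe, installs a post-execution hook that prints the offending node as soon as a NaN appears:

    import numpy as np
    import theano
    import theano.tensor as T
    from theano.compile.monitormode import MonitorMode

    def detect_nan(i, node, fn):
        # Called after each node runs; report the node that produced a NaN.
        for output in fn.outputs:
            if np.isnan(output[0]).any():
                print('*** NaN detected ***')
                theano.printing.debugprint(node)
                print('Inputs : %s' % [inp[0] for inp in fn.inputs])
                print('Outputs: %s' % [out[0] for out in fn.outputs])
                break

    x = T.dscalar('x')
    f = theano.function([x], [T.log(x) * x],
                        mode=MonitorMode(post_func=detect_nan))
    f(0.)  # log(0) * 0 -> NaN, which triggers the report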

Numerical Stability

After you have located the op which causes the problem, it may turn out that the NaNs yielded by that op are related to numerical issues. For example, 1 / log(p(x) + 1) may blow up for nodes that have learned to yield a very low probability p(x) for some input x: as p(x) approaches 0, log(p(x) + 1) approaches 0 as well, so the division produces Infs, and NaNs once these propagate through the gradients.
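
A common workaround (one option among several; the epsilon value here is illustrative) is to keep the probability away from the dangerous region before taking the log:

    import theano.tensor as T

    p_x = T.vector('p_x')                  # model probabilities (illustrative)
    eps = 1e-7
    p_safe = T.clip(p_x, eps, 1.0 - eps)   # keep p(x) away from 0 and 1
    score = 1.0 / T.log(p_safe + 1.0)      # now bounded by roughly 1 / eps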

Algorithm Related

In the most difficult situations, you may go through all of the above steps and find nothing wrong. If so, there is a good chance that something is wrong with your algorithm itself. Go back to the mathematics and check whether everything is derived correctly.
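
One concrete check worth running at this stage is comparing your symbolic gradient against finite differences with theano.gradient.verify_grad; the cost below is just a stand-in for your own expression:

    import numpy as np
    import theano
    import theano.tensor as T

    def cost(x):  # stand-in for your own cost expression
        return T.sum(T.log(x ** 2 + 1.0))

    pt = [np.random.rand(5).astype(theano.config.floatX)]
    theano.gradient.verify_grad(cost, pt, rng=np.random.RandomState(42))
    print('gradient check passed')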

CUDA-Specific Option

The Theano flag nvcc.fastmath=True can generate NaNs. Don't set this flag while debugging NaNs.
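
If you are unsure whether it is active, a quick sanity check (a sketch, assuming the config value is readable as theano.config.nvcc.fastmath) is to inspect it from Python; the flag itself is normally set through THEANO_FLAGS or .theanorc:

    # e.g. THEANO_FLAGS='nvcc.fastmath=False' python script.py
    import theano
    assert not theano.config.nvcc.fastmath, \
        'disable nvcc.fastmath while debugging NaNs'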