Notebook

In [ ]:

%reload_ext autoreload
%autoreload 2

FP16¶

Intro¶

In this notebook we are going to implement mixed precision floating points.

By default, all computations are done in single-precision which means that all the floats (inputs, activations and weights) are 32-bit floats. If we could use 16-bit floats for each of these values, we would save half the space in RAM and this would enable us to double the size of our model and double the batch size (the first helping to get better results and the second to train quicker).

However, half-precision floating points might lead to not-as-accurate results. Specifically, half-precision floating points can only represent 1, 1+2e-10, 1+2*2e-10 ... which in standard notation are 1, 1.0009765625, 1.001953125 ... (for more information on the limitations of the encoding of half-precision floating point numbers click here). There are some specific calculations where this lack of acccuracy will impact our results. These are:

When updating the weights we basically do w=w-lr*w.grad for each weight and usually lr*w.grad is several orders of magnitude below w. When this happens (e.g. w=1 and lr*w.grad=0.0001) the update will make no effect.
Your gradients may be replaced by 0 because they are too low (underflow).
Activations or loss may hit nan/infinity (overflow) and training might more easily diverge.

To address these problems we will use a combination of different strategies.

To take care of 1 and 3, we will use sigle-precision floating points for some parameters in the training.

For 1, it’s okay if w and grad are both half floats, but when we do the operation w = w - lr * grad, we need to compute it in FP32. To achieve this, we will keep a copy of the weights in FP32 precision (from now on, master model) where we will update and then copy over to the original model. When we copy the weights into the original model we will lose precision, but the updated weight will be kept in FP32 in the master model so that, when the updates add up to a value that can be represented in FP16, the original model can tell the difference (i.e. if the update is +0.0001, the new weight value is updated it will be 1.0001 and the original model will not be able to tell the difference but if it is updated five times the new weight value will be 1.0005 and the original model will incorporate it as 1.0005).

For 3, we will simply keep our batchnorms in single-precision (so our activations are in single precision) and our loss in single-precision (done by converting the last output of the model to single precision before passing it to the loss).

For 2, we will take a different approach, called gradient scaling. We multiply the loss by a scale factor to place the values in a scale that FP16 can handle with more precision. We will then calculate the gradients by backpropagation and, before updating the weights, we will rescale the gradients to the original scale by dividing by scale (remember that, because of the solution proposed in 1, we will update the weights in FP32 in the master model).

Set up¶

In [ ]:

#export
from nb_004a import *

In [ ]:

DATA_PATH = Path('data')
PATH = DATA_PATH/'cifar10'

data_mean,data_std = map(tensor, ([0.491, 0.482, 0.447], [0.247, 0.243, 0.261]))
cifar_norm,cifar_denorm = normalize_funcs(data_mean, data_std)

train_tfms = [flip_lr(p=0.5),
              pad(padding=4),
              crop(size=32, row_pct=(0,1.), col_pct=(0,1.))]
valid_tfms = []

bs = 64

In [ ]:

#export
def to_half(b:Collection[Tensor])->Collection[Tensor]:  
    "[x,y] -> [x.half(),y] (half precision)"
    return [b[0].half(), b[1]]

def compose(*funcs:Callable)->Callable:
    "Compose list of funcs"
    def compose_(funcs, x, *args, **kwargs):
        for f in listify(funcs): x = f(x, *args, **kwargs)
        return x
    return partial(compose_, funcs)

In [ ]:

train_ds = ImageDataset.from_folder(PATH/'train', classes=['airplane','dog'])
valid_ds = ImageDataset.from_folder(PATH/'test', classes=['airplane','dog'])
data = DataBunch.create(train_ds, valid_ds, bs=bs, num_workers=0, 
                        train_tfm=train_tfms, valid_tfm=valid_tfms, dl_tfms=cifar_norm)
len(data.train_dl), len(data.valid_dl)

In [ ]:

metrics = [accuracy]

In [ ]:

model = Darknet([1, 2, 2, 2, 2], num_classes=2, nf=16)
learn = Learner(data, model, metrics=metrics)
sched = OneCycleScheduler(learn, 0.1, 5)

FP16¶

Utils¶

In [ ]:

#export
def bn2float(module:nn.Module)->nn.Module:
    "If a module is batchnorm don't use half precision"
    if isinstance(module, torch.nn.modules.batchnorm._BatchNorm): module.float()
    for child in module.children(): bn2float(child)
    return module

def model2half(model:nn.Module)->nn.Module:
    "Converts the model to half precision except the batchnorm layers"
    return bn2float(model.half())

Helper function to save the master model in FP32 with flat tensors (apparently it helps with performance)

In [ ]:

#export
from torch._utils import _unflatten_dense_tensors
from torch.nn.utils import parameters_to_vector

Now we will implement the three changes we noted above. A summary of the steps we will follow is:

1. Compute the output with the FP16 model, then the loss
2. Back-propagate the gradients in half-precision
3. Copy the gradients in FP32 precision
4. Do the update on the master model (in FP32 precision)
5. Copy the master model in the FP16 model

In [ ]:

#export
def get_master(layer_groups:ModuleList, flat_master:bool=False) -> Tuple[List[List[Tensor]], List[List[Tensor]]]:
    "Returns two lists, one for the model parameters in FP16 and one for the master parameters in FP32"
    split_groups = split_bn_bias(layer_groups)
    model_params = [[param for param in lg.parameters() if param.requires_grad] for lg in split_groups]
    if flat_master:
        master_params = []
        for lg in model_params:
            if len(lg) !=0 :
                mp = parameters_to_vector([param.data.float() for param in lg])
                mp = torch.nn.Parameter(mp, requires_grad=True)
                if mp.grad is None: mp.grad = mp.new(*mp.size())
                master_params.append([mp])
            else: master_params.append([])
        return model_params, master_params
    else:
        master_params = [[param.clone().float().detach() for param in lg] for lg in model_params]
        for mp in master_params:
            for param in mp: param.requires_grad = True
        return model_params, master_params

def model_g2master_g(model_params:Sequence[Tensor], master_params:Sequence[Tensor], flat_master:bool=False)->None:
    "Copies the model gradients to the master parameters for the optimizer step"
    if flat_master:
        for model_group,master_group in zip(model_params,master_params):
            if len(master_group) != 0:
                master_group[0].grad.data.copy_(parameters_to_vector([p.grad.data.float() for p in model_group]))
    else:
        for model_group,master_group in zip(model_params,master_params):
            for model, master in zip(model_group, master_group):
                if model.grad is not None:
                    if master.grad is None: master.grad = master.data.new(*master.data.size())
                    master.grad.data.copy_(model.grad.data)
                else: master.grad = None

def master2model(model_params:Sequence[Tensor], master_params:Sequence[Tensor], flat_master:bool=False)->None:
    "Copy master parameters to model parameters"
    if flat_master:
        for model_group,master_group in zip(model_params,master_params):
            if len(model_group) != 0:
                for model, master in zip(model_group, _unflatten_dense_tensors(master_group[0].data, model_group)):
                    model.data.copy_(master)
    else:
        for model_group,master_group in zip(model_params,master_params):
            for model, master in zip(model_group, master_group): model.data.copy_(master.data)

MixedPrecision¶

In [ ]:

#export
from torch._utils import _unflatten_dense_tensors
from torch.nn.utils import parameters_to_vector

@dataclass
class MixedPrecision(Callback):
    "Callback that handles mixed-precision training"
    learn:Learner
    loss_scale:float=512.
    flat_master:bool=False
    def __post_init__(self): assert torch.backends.cudnn.enabled, "Mixed precision training requires cudnn." 
    
    def on_train_begin(self, **kwargs:Any)->None:
        "Ensures everything is in half precision mode"
#         self.learn.data.train_dl.half = True
        self.learn.data.train_dl.add_tfm(to_half)
        if hasattr(self.learn.data, 'valid_dl') and self.learn.data.valid_dl is not None:
#             self.learn.data.valid_dl.half = True
            self.learn.data.valid_dl.add_tfm(to_half)
        #Get a copy of the model params in FP32
        self.model_params, self.master_params = get_master(self.learn.layer_groups, self.flat_master)
        #Changes the optimizer so that the optimization step is done in FP32.
        opt = self.learn.opt
        mom,wd,beta = opt.mom,opt.wd,opt.beta
        lrs = [lr for lr in self.learn.opt._lr for _ in range(2)]
        opt_params = [{'params': mp, 'lr': lr} for mp,lr in zip(self.master_params, lrs)]
        self.learn.opt.opt = self.learn.opt_fn(opt_params)
        opt.mom,opt.wd,opt.beta = mom,wd,beta
        
    def on_train_end(self, **kwargs:Any)->None:
        "Removes half precision transforms added at `on_train_begin`"
        self.learn.data.train_dl.remove_tfm(to_half)
        if hasattr(self.learn.data, 'valid_dl') and self.learn.data.valid_dl is not None:
            self.learn.data.valid_dl.remove_tfm(to_half)
    
    def on_loss_begin(self, last_output:Tensor, **kwargs:Any) -> Tensor:
        "Converts half precision output to FP32 to avoid reduction overflow."
        return last_output.float()
    
    def on_backward_begin(self, last_loss:Rank0Tensor, **kwargs:Any) -> Rank0Tensor:
        "Scale gradients up by `loss_scale` to prevent underflow"
        #To avoid gradient underflow, we scale the gradients
        return last_loss * self.loss_scale
    
    def on_backward_end(self, **kwargs:Any ):
        "Convert the gradients back to FP32 and divide them by the scale."
        model_g2master_g(self.model_params, self.master_params, self.flat_master)
        for group in self.master_params:
            for param in group: param.grad.div_(self.loss_scale)
    
    def on_step_end(self, **kwargs:Any)->None:
        "Update the params from master to model and zero grad"
        #Zeros the gradients of the model since the optimizer is disconnected.
        self.learn.model.zero_grad()
        #Update the params from master to model.
        master2model(self.model_params, self.master_params, self.flat_master)

In [ ]:

def mixed_precision(loss_scale:float=512., flat_master:bool=False, **kwargs:Any)->MixedPrecision:
    return partial(MixedPrecision, loss_scale=loss_scale, flat_master=flat_master, **kwargs)

In [ ]:

cbs = [one_cycle_scheduler(0.1)]

In [ ]:

model = Darknet([1, 2, 2, 2, 2], num_classes=2, nf=16)
model = model2half(model)
learn = Learner(data, model, metrics=accuracy, callback_fns=cbs)
mp_cb = MixedPrecision(learn, flat_master=True)

In [ ]:

learn.fit(1, 1e-2, callbacks=mp_cb)

In [ ]:

learn.model.layers[0][0].weight.type()

In [ ]:

mp_cb.master_params[0][0].size(),mp_cb.master_params[0][0].type()

to_fp16¶

In [ ]:

#export
def to_fp16(learn:Learner, loss_scale:float=512., flat_master:bool=False)->Learner:
    "Transforms the learner in FP16 precision"
    learn.model = model2half(learn.model)
    learn.mp_cb = MixedPrecision(learn, loss_scale=loss_scale, flat_master=flat_master)
    learn.callbacks.append(learn.mp_cb)
    return learn

Learner.to_fp16 = to_fp16

In [ ]:

model = Darknet([1, 2, 2, 2, 2], num_classes=2, nf=16)
learn = Learner(data, model, metrics=accuracy, callback_fns=cbs)
learn.to_fp16(flat_master=True)

In [ ]:

learn.fit(1, 1e-2)

In [ ]:

learn.mp_cb.master_params[0][0].size()

Test with discriminative lrs¶

In [ ]:

model = Darknet([1, 2, 2, 2, 2], num_classes=2, nf=16)
model = model2half(model)
learn = Learner(data, model, metrics=accuracy)

learn.split(lambda m: split_model(m,[m.layers[9],m.layers[15]]))
cbs = [MixedPrecision(learn, flat_master=True), OneCycleScheduler(learn, 0.1)]

In [ ]:

learn.fit(1, 1e-2, callbacks=cbs)

In [ ]:

learn.model.layers[0][0].weight.type()

In [ ]:

for master in cbs[0].master_params:
    print(master[0].size(),master[0].type())

In [ ]: