Makes problems "simpler": stops values from exploding and keeps the coefficients in an acceptable range. Mostly used to keep certain terms from blowing up to the point where small rounding errors become fatal. It also discourages the model from learning overly complex functions that fit the noise in the training data.
These regularization terms are added as a penalty to the loss function.
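Roughly, the training objective then becomes something like this (the names here are just illustrative, with the lambda factor controlling the regularization strength):
total_loss = data_loss + reg_lambda * penalty(weights)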
Lasso (L1) regularization
Adds a penalty based on the absolute value of the coefficients. Some coefficients can become exactly zero, so it is also used for feature selection.
In PyTorch this needs to be implemented manually, which tells me it is rarely used:
l1_penalty = sum(p.abs().sum() for p in model.parameters())  # sum of |w| over all parameters
loss += l1_lambda * l1_penalty  # add the scaled penalty to the data loss before backprop
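A minimal sketch of where this sits in a training step (the model, data and l1_lambda value below are placeholders, only there to make the snippet self-contained):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                          # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()
l1_lambda = 0.001                                 # assumed regularization strength
x, y = torch.randn(32, 10), torch.randn(32, 1)    # dummy batch

optimizer.zero_grad()
loss = criterion(model(x), y)                     # data loss
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = loss + l1_lambda * l1_penalty              # add the L1 term before backprop
loss.backward()
optimizer.step()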
Ridge (L2) regularization
Adds a penalty based on the square of the coefficients (the squared L2 norm). PyTorch supports this with the {python} weight_decay term in the optimizer:
{python} optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=0.01)
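For plain SGD this should be equivalent to adding the squared-weight penalty to the loss by hand; the sketch below is my own mapping (weight_decay adds wd * w to each gradient, which corresponds to a wd/2 * sum-of-squared-weights term in the loss):

l2_lambda = 0.01                                  # matches weight_decay above
l2_penalty = sum(p.pow(2).sum() for p in model.parameters())
loss = loss + 0.5 * l2_lambda * l2_penalty        # manual L2 term instead of weight_decay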