Makes problems "simpler": stops values from exploding and keeps coefficients in an acceptable range. Mostly used to prevent individual terms from blowing up to the point where small rounding errors become fatal. It also discourages the model from learning overly complex fits that capture the noise in the training data.

These regularization terms are added as a penalty to the loss function.
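As a rough sketch of the general pattern (the tensors and lam below are placeholders for illustration, not taken from any library):

import torch

data_loss = torch.tensor(1.0)               # stands in for whatever loss the model produced
weights = torch.tensor([0.5, -2.0, 0.1])    # stands in for the model's coefficients
lam = 0.01                                  # regularization strength

l1_total = data_loss + lam * weights.abs().sum()    # L1: penalize absolute values
l2_total = data_loss + lam * weights.pow(2).sum()   # L2: penalize squared values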

Lasso (L1) regularization

Adds a penalty based on the absolute value of the coefficients. Some coefficients can become exactly zero, which is why L1 is also used for feature selection.

In PyTorch this needs to be implemented manually, which tells me it is rarely used in practice.

# l1_lambda is the regularization strength (a hyperparameter)
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = loss + l1_lambda * l1_penalty
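
Putting this into a full training step, here is a minimal sketch (the toy model, dummy batch, and the l1_lambda value are assumptions for illustration):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                          # toy model, stands in for any nn.Module
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()
l1_lambda = 0.001                                 # regularization strength, chosen arbitrarily

x, y = torch.randn(32, 10), torch.randn(32, 1)    # dummy batch

optimizer.zero_grad()
loss = criterion(model(x), y)
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = loss + l1_lambda * l1_penalty              # add the L1 term before backprop
loss.backward()
optimizer.step()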

Ridge (L2) regularization

Adds a penalty based on the square of the coefficients (the squared term, ², is why it's called L2). Coefficients won't be set to zero, but they will get very small. This helps with multicollinearity and model stability: correlated features are shrunk evenly instead of one being kept and the others dropped. It also discourages the model from relying too heavily on a single feature, because the sum of squares penalizes one dominating coefficient more than several moderately sized ones.

PyTorch supports this with the {python} weight_decay argument of its optimizers:

{python} optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=0.01)
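
For comparison, L2 can also be added manually in the same way as the L1 snippet above (a sketch that reuses model and loss from the training-step sketch; note that for plain SGD, weight_decay=wd corresponds to the gradient of a penalty of wd/2 · Σw², since the derivative of w² is 2w):

l2_lambda = 0.01                                  # assumed hyperparameter
l2_penalty = sum(p.pow(2).sum() for p in model.parameters())
loss = loss + l2_lambda * l2_penalty              # same pattern as the manual L1 penalty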