It is not obvious when to add the augmented data: during initial training or afterwards. Here are a few rules of thumb for deciding:
- If possible, add the data at the beginning. This exposes the model to a mix of diversity and realism from the start.
- If the augmented data adds a lot of complexity, add it afterwards. Otherwise, add it during initial training.
- Basic augmentations help the model learn invariant features. Add them at the beginning.
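As a rough illustration of these rules, the sketch below decides per epoch which augmentations are active: basic ones from the start, complex ones only later. It reads "afterwards" as "after a number of warm-up epochs", which is an assumption. The names `BASIC`, `COMPLEX`, `warmup_epochs`, and the augmentation labels are hypothetical placeholders, not part of any library.

```python
# Hypothetical augmentation labels, for illustration only.
BASIC = ["horizontal_flip", "small_rotation"]
COMPLEX = ["synthetic_samples"]


def augmentations_for_epoch(epoch: int, warmup_epochs: int = 5) -> list[str]:
    """Return the augmentation names to use in a given (0-based) epoch."""
    # Basic augmentations are active from the very first epoch.
    active = list(BASIC)
    # Complex augmentations join only once the warm-up phase is over.
    if epoch >= warmup_epochs:
        active += COMPLEX
    return active
```

The right length for the warm-up depends on the model and the data, so treat `warmup_epochs` as something to tune rather than a recommended value.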
On a side note: data augmentations belong in the data loading pipeline. Finish data acquisition, generation, and augmentation completely there, and don't add augmentations at a later stage. This keeps a clean separation of concerns.
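One way to picture this is a dataset object that applies the augmentation at the moment a sample is loaded, so the training loop never has to know about it. This is a minimal sketch in plain Python; `AugmentedDataset` and its `transform` argument are hypothetical names chosen for illustration, loosely following the interface that dataset classes in common training frameworks expose.

```python
from typing import Any, Callable, Optional, Sequence


class AugmentedDataset:
    """Hypothetical minimal dataset: augmentation happens when a sample is loaded."""

    def __init__(
        self,
        samples: Sequence[Any],
        transform: Optional[Callable[[Any], Any]] = None,
    ):
        self.samples = samples
        self.transform = transform

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, index: int) -> Any:
        sample = self.samples[index]
        # Augment here, so the training loop only ever sees ready-to-use samples.
        if self.transform is not None:
            sample = self.transform(sample)
        return sample


# Toy usage: the "samples" are lists of numbers and the "augmentation" reverses them.
dataset = AugmentedDataset([[1, 2, 3], [4, 5, 6]], transform=lambda s: s[::-1])
first = dataset[0]
```

Swapping or disabling augmentations then only requires passing a different `transform`; the code that consumes the dataset stays unchanged.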
Images
One of the most widely used libraries for data augmentation is Albumentations (https://albumentations.ai/). It supports more than just images, but images are what I will use it for.
import albumentations as A
Let's define an augmentation pipeline:
import cv2

geometric_transformations = A.Compose(
    [
        # Rotate by a random angle in [-45, 45] degrees.
        # border_mode=cv2.BORDER_REFLECT fills the corners by mirroring the image, so no fill colour is needed.
        A.Rotate(limit=45, border_mode=cv2.BORDER_REFLECT, p=1.0),
        # p is the probability that the transformation is applied.
        A.HorizontalFlip(p=0.25),
        A.VerticalFlip(p=0.25),
    ],
    # We can define additional targets if we want a second image to receive the exact same transformations.
    additional_targets={'opencv_image': 'image'},
)