Designing models is mostly a trial-and-error process. However, there are ways of thinking that help. Furthermore, you can often guarantee that certain things will not work.

The key seems to be to find some basic principles and build the model architecture using those principles.

An example is interpolation: a very basic concept, but crucial for machine learning. Geometric deep learning, for example, asks (very simplified) which methods/layers will make interpolation easier.

Geometric deep learning

Real-world data does not exactly follow our ideas of equivariances. Images will sometimes depict the exact same thing but, due to noise or other disturbances, be distorted, as opposed to exact mathematical transformations like shifts, rotations, etc.

Following the principles below is a good idea, but making them work on real-life data requires a certain amount of robustness, which means bending those rules a little in the architecture.

Apply the Erlangen Programme mindset to the domain of deep learning

https://arxiv.org/pdf/2104.13478
https://www.youtube.com/watch?v=PtA0lg_e5nA&list=PLn2-dEmQeTfQ8YVuHBOvAhUlnIPYxkeu3

Base theory

The goal is to somehow get around the curse of dimensionality:

This is just one aspect of the curse of dimensionality; there are others, such as similarity-based approaches breaking down in high-dimensional space.

As the number of features/dimensions grows, the amount of data we need to generalize accurately grows exponentially.

One way to think about the curse of dimensionality in this context is interpolation. You have some points and you are trying to find a function that most likely passes through all those points. If the points become more "sparse", as is the case in high dimensions (with a finite number of training points), interpolation becomes much harder.
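A minimal sketch of this sparsity effect (assuming numpy; the numbers are only illustrative): with a fixed number of points sampled uniformly in the unit cube, the average distance to the nearest neighbour grows with the dimension, so there is less and less local information to interpolate from.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 1000                                    # fixed number of "training" points

for dim in [1, 2, 10, 100]:
    x = rng.random((n_points, dim))                # uniform samples in the unit cube
    sq = (x ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * x @ x.T   # squared pairwise distances
    np.fill_diagonal(d2, np.inf)                   # ignore each point's distance to itself
    nn_dist = np.sqrt(np.clip(d2, 0, None).min(axis=1))
    print(dim, nn_dist.mean())                     # average nearest-neighbour distance grows with dim
```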

Hughes/Peaking phenomenon:

With a fixed number of training samples, the average (expected) predictive power of a classifier or regressor first increases as the number of dimensions or features used is increased, but beyond a certain dimensionality the accuracy starts deteriorating.

Pasted image 20250205142859.png

The higher the dimension, the more "sparse" the data becomes.

We get around this by exploiting invariances: aspects of the input (entire "dimensions" of variation) that do not matter for the prediction.

Example: convolutional layers. These layers do not care about the location of the "object" we are trying to detect -> translation equivariance (which, combined with pooling, gives approximate translation invariance).
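A minimal sketch of what that buys us (assuming PyTorch): shifting the input of a convolutional layer shifts its output in the same way, so the layer does not have to re-learn the same pattern at every location.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# circular padding makes the shift-equivariance exact; with zero padding it
# only holds up to border effects
conv = nn.Conv2d(1, 4, kernel_size=3, padding=1, padding_mode="circular", bias=False)

x = torch.randn(1, 1, 16, 16)                     # a random "image"
x_shifted = torch.roll(x, shifts=3, dims=-1)      # the same image, shifted 3 pixels

out_of_shifted = conv(x_shifted)                  # convolve the shifted image
shifted_out = torch.roll(conv(x), shifts=3, dims=-1)  # shift the convolved image

print(torch.allclose(out_of_shifted, shifted_out, atol=1e-6))  # True: equivariance
```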

Here is an overview of the different invariances achieved by basic model layers.
Pasted image 20250205153604.png

We can decompose the model error into three different terms, each caused by a different source of error.

Total error = Approximation error + Statistical error + Optimization error

| Error Type | Source |
| --- | --- |
| Approximation error | The chosen model might not be complex enough to perfectly capture the underlying function or geometry of the data. |
| Statistical error | Caused by the limited amount of data and the inherent randomness (noise) present in the data sampling process. |
| Optimization error | The algorithm used to train the model fails to find the best set of parameters (gradient descent gets stuck in a local minimum, for example). |

Closely related to Sources of Uncertainty, but from a different point of view.

Statistical error will be reduced if we exploit invariances in the input space.
Neural networks for processing geometric data should respect (exploit) the structure of the domain.

Invariances from the label:

Pasted image 20250206152912.png

Invariant to the rotation.

Equivariance

A mathematical property where a transformation of the input leads to a predictable transformation of the output of a function: f(T(x)) = T'(f(x)). Invariance is the special case where T' is the identity, i.e. the output does not change at all.

Equivariant Networks: Networks designed to respect and leverage the property of equivariance in their architecture.

Examples:

It sounds like those models would not require augmentations for these invariances. However, this is, in my opinion, wrong, because the models are never fully invariant. Pooling layers, for example, are not fully translation invariant.
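A small sketch of the pooling caveat (assuming PyTorch): strided max-pooling over a slightly shifted input does not give the same output, so the invariance is only approximate and augmentations still help.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
pool = nn.MaxPool2d(kernel_size=2, stride=2)

x = torch.randn(1, 1, 16, 16)
x_shifted = torch.roll(x, shifts=1, dims=-1)      # shift the input by one pixel

# if pooling were truly translation invariant, both outputs would be identical
print(torch.allclose(pool(x), pool(x_shifted)))   # usually False
```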

Learning under Invariance

| Effect of Invariance | Impact on Training |
| --- | --- |
| Reduces hypothesis space | Makes training more efficient, needs fewer samples |
| Improves generalization | Helps recognize patterns even in unseen transformations |
| Changes optimization | Shared parameters speed up learning and improve gradient flow |
| Reduces computational cost | Needs fewer parameters to learn the same function |
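A quick sanity check of the "fewer parameters" row (a minimal sketch assuming PyTorch; the layer sizes are made up for illustration): a convolutional layer with shared weights needs orders of magnitude fewer parameters than a fully connected layer mapping between feature maps of the same size.

```python
import torch.nn as nn

# map a 1-channel 32x32 input to an 8-channel 32x32 output
conv = nn.Conv2d(1, 8, kernel_size=3, padding=1)   # weights shared across positions
dense = nn.Linear(1 * 32 * 32, 8 * 32 * 32)        # one weight per (input, output) pair

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(conv))    # 8 * 1 * 3 * 3 + 8     = 80
print(count(dense))   # 1024 * 8192 + 8192    = 8,396,800
```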

Scale separation

Part of How to think about model architecture

The idea of scale separation, in an ML context, is to recognise and exploit that different aspects/patterns in data appear at different scales.

Example:

CNN for image classification

  1. The early layers focus on low-level details: small-scale features.
  2. The middle layers combine the low-level features to form more complex patterns.
  3. The later layers combine the middle-layer features into high-level semantic expressions. The scale is big because the neurons of the later layers have a large receptive field.

Graph Neural Networks

Graph neural networks

  1. The early layers focus on a node's immediate surroundings (or just the node itself)
  2. The middle layers focus on a larger neighbourhood
  3. The later layers focus on an even larger neighbourhood
Notice how, exactly like in CNNs, the receptive field expands the more {python}GCNConv layers we stack. These layers, even though they work differently, share a similar name because the same abstraction, scale separation, is present in both.
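A minimal sketch of this (assuming PyTorch Geometric, since {python}GCNConv comes from there): stacking three layers means each node's output depends on its 3-hop neighbourhood, mirroring the growing receptive field of a CNN.

```python
import torch
from torch_geometric.nn import GCNConv

class ThreeHopGNN(torch.nn.Module):
    def __init__(self, in_dim, hidden, out_dim):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)   # each node sees its 1-hop neighbourhood
        self.conv2 = GCNConv(hidden, hidden)   # now its 2-hop neighbourhood
        self.conv3 = GCNConv(hidden, out_dim)  # now its 3-hop neighbourhood

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        x = self.conv2(x, edge_index).relu()
        return self.conv3(x, edge_index)       # one output per node
```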

Consequences for How to think about model architecture

Note that the choice of layers, and whether they are equivariant or not, is not part of scale separation itself.

One consequence of scale separation is to design "blocks" that analyse data at a specific scale. These blocks always end with a pooling layer that increases the receptive field (because there should be fewer high-level features than low-level ones); see the sketch after the images below.

Pasted image 20250206200557.png

Pasted image 20250206200611.png
Pasted image 20250206200622.png
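A minimal sketch of such blocks (assuming PyTorch; the channel sizes are made up for illustration): each block analyses one scale and ends with a pooling layer that moves to the next, coarser scale.

```python
import torch.nn as nn

def block(in_ch, out_ch):
    # analyse one scale, then pool to move to the next (coarser) scale
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(2),   # halves the resolution, increasing the receptive field of later layers
    )

model = nn.Sequential(
    block(3, 32),    # small-scale features (edges, textures)
    block(32, 64),   # mid-scale patterns
    block(64, 128),  # large-scale, more semantic features
)
```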

Symmetry for Sets

Sets are a very simple version of graphs (essentially graphs without edges). We want the output of the model to be equivariant to the order in which the set is stacked into the input tensor.

Example:

We have 5 objects:

0, 20, 50, 98, 17

These are then moved into a tensor and given to the input layer of our model. This set does not have an inherent order; the input tensor, however, does, due to the way data is stored in tensors.

This order is arbitrary, meaning it appears randomly without adding any information to the original (unordered) set whatsoever. Therefore, our model should be equivariant to any permutation of the input tensor.

You might notice that I used the term equivariance here, not invariance. If, for example, we make a prediction for each node, we can map each node to its prediction no matter the order of the input. Therefore, while technically the term equivariance is correct, it amounts to the same thing and is not a contradiction to the image below.

f being our model, y our prediction.
Pasted image 20250311160950.png
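A minimal sketch of the per-element case (assuming PyTorch; the tiny linear layer just stands in for a real model): applying the same function to every element means permuting the input simply permutes the predictions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
phi = nn.Linear(1, 1)                                    # the same function applied to every element

x = torch.tensor([[0.], [20.], [50.], [98.], [17.]])     # our set, stacked into a tensor
perm = torch.randperm(x.size(0))

y = phi(x)                                               # one prediction per element
y_perm = phi(x[perm])                                    # predictions for the permuted input

print(torch.allclose(y[perm], y_perm))                   # True: permutation equivariance
```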

COMPLETE ONCE DEEPSETS HAVE BEEN UNDERSTOOD

Symmetry of graph predictions

Let's assume a simple form of graph prediction, where we are trying to classify each node.

Symmetry of graph predictions means that the model produces one prediction per node. The mapping Input node -> Class needs to be equivariant: if node 42 is mapped to class X, and the input graph is then permuted so that node 42 is moved a few places down in the input tensor, it still gets mapped to class X, since the graph is still exactly the same. It just got permuted, which is (at least we want it to be) an inconsequential action on graphs.

Pasted image 20250311173309.png

The Phi operation is applied to each node and produces one prediction h per node.
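A small check of this equivariance (a sketch assuming PyTorch Geometric; the tiny path graph is made up for illustration): relabelling the nodes, and the edge list accordingly, permutes the per-node predictions in exactly the same way.

```python
import torch
from torch_geometric.nn import GCNConv

torch.manual_seed(0)
conv = GCNConv(4, 2)                          # maps node features to 2 "class" scores

x = torch.randn(5, 4)                         # 5 nodes, 4 features each
edge_index = torch.tensor([[0, 1, 2, 3],      # a small path graph 0-1-2-3-4
                           [1, 2, 3, 4]])

perm = torch.randperm(5)                      # new node i is old node perm[i]
inv = torch.argsort(perm)                     # maps old node ids to new positions
x_perm = x[perm]
edge_index_perm = inv[edge_index]             # relabel the edge list consistently

out = conv(x, edge_index)
out_perm = conv(x_perm, edge_index_perm)

print(torch.allclose(out[perm], out_perm, atol=1e-6))   # True: equivariance
```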

More predictions on graphs

Note that we need to be careful and might have to introduce some sort of pooling operation on the intermediate output(s) to ensure permutation invariance in the case of predictions on entire graphs.

Pasted image 20250311173414.png
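A minimal sketch of such a pooling step (assuming PyTorch Geometric; {python}global_mean_pool is one such operation): averaging the per-node outputs makes the graph-level prediction invariant to the node order.

```python
import torch
from torch_geometric.nn import GCNConv, global_mean_pool

torch.manual_seed(0)
conv = GCNConv(4, 2)

x = torch.randn(5, 4)
edge_index = torch.tensor([[0, 1, 2, 3],
                           [1, 2, 3, 4]])
batch = torch.zeros(5, dtype=torch.long)       # all 5 nodes belong to graph 0

perm = torch.randperm(5)
inv = torch.argsort(perm)

h = conv(x, edge_index)                        # equivariant per-node features
h_perm = conv(x[perm], inv[edge_index])        # same graph, nodes relabelled

graph_pred = global_mean_pool(h, batch)        # pooling discards the node order
graph_pred_perm = global_mean_pool(h_perm, batch)

print(torch.allclose(graph_pred, graph_pred_perm, atol=1e-6))   # True: invariance
```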

Miscellaneous

Physics informed neural networks

Dietmar shared this link: https://en.wikipedia.org/wiki/Physics-informed_neural_networks

Those are networks that aim to incorporate physical laws into their loss/training. While it is theoretically possible to also include them in the architecture, it seems hard and I haven't found an example that actually did it.
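A hedged, minimal sketch of that loss idea for a toy ODE du/dt = -u (assuming PyTorch; the function and setup are made up for illustration): the physical law enters as an extra residual term in the loss, not as an architectural constraint.

```python
import torch
import torch.nn as nn

# small network approximating u(t)
net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))

def pinn_loss(t_data, u_data, t_colloc):
    # data term: fit the few measurements we have
    data_loss = ((net(t_data) - u_data) ** 2).mean()

    # physics term: the network should satisfy du/dt = -u at collocation points
    t = t_colloc.requires_grad_(True)
    u = net(t)
    du_dt = torch.autograd.grad(u.sum(), t, create_graph=True)[0]
    physics_loss = ((du_dt + u) ** 2).mean()

    return data_loss + physics_loss
```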

The base idea is the same though: include what you already know in the model's architecture. I don't think this is easily applicable to general problems.