Caution: Contains Error and possibly losses

This is gonna be rather theoretical so just hold onto yourselves....

Neural networks are trained using an optimization process that requires a loss function to calculate the model error. As part of the optimization algorithm, the error for the current state of the model must be estimated repeatedly. This requires the choice of an error function, conventionally called a loss function, that can be used to estimate the loss of the model so that the weights can be updated to reduce the loss on the next evaluation.


In most learning networks, the error is calculated as the difference between the actual output and the predicted output. For accurate predictions, one needs to minimize the calculated error. In a neural network, this is done with the help of backpropagation, which works out how much each weight contributed to the error.

The purpose of the algorithm is to train the network such that the error is minimized between the network output and the desired output. The error function is as defined by the equation below and is the same for weights as well as bias terms.

This is the function that is used to determine the loss in a neural network. Different loss functions give different values for the same prediction, so they have a considerable effect on the result of the learning network. One of the most widely used error functions is Mean Square Error (MSE). Different loss functions are used to deal with different types of tasks.

The error function J(w) is a function of the internal parameters of the network. For accurate predictions, the error has to be minimized. The current error is backpropagated through the layers, where the weights and biases are modified in an attempt to minimize it. The rule used to modify the weights is called the optimization function (or optimizer).
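To make the idea of "modify the weights to reduce J(w)" concrete, here is a minimal sketch of plain gradient descent on a single linear unit with an MSE loss. It is not full backpropagation through layers, and the toy data, the learning rate and the function names are all made up for illustration.

```python
def mse(w, b, xs, ys):
    """Mean squared error J(w, b) of the line y = w*x + b on the data."""
    n = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n


def gradient_step(w, b, xs, ys, lr=0.05):
    """One gradient-descent update of w and b on the MSE."""
    n = len(xs)
    # Partial derivatives of J with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    # Move against the gradient to reduce the error.
    return w - lr * grad_w, b - lr * grad_b


# Toy data generated from y = 2x + 1 (hypothetical, for illustration only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0
initial_loss = mse(w, b, xs, ys)
for _ in range(2000):
    w, b = gradient_step(w, b, xs, ys)
final_loss = mse(w, b, xs, ys)

print(initial_loss, final_loss, w, b)
```

Each step nudges w and b against the gradient, so the loss shrinks and the parameters drift towards the line that generated the data.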


Regression loss functions

They are used for regression problems, that is, when the target variable is continuous. The most widely used regression loss function is Mean Square Error (L2), also called the Euclidean loss function. Using MSE amounts to assuming that the errors around the true values follow a normal distribution (a bell-shaped curve).

MSE is a good choice of cost function when we are doing Linear Regression (i.e., fitting a line through data for extrapolation).

MSE is non-convex for binary classification. In simple terms, if a binary classification model is trained with the MSE cost function, gradient descent is not guaranteed to reach the global minimum of the cost function. This is because the model's output passes through the Sigmoid/Logistic function, which squashes real-valued inputs in the range (-∞, ∞) into probabilities in the range (0, 1), and the squared error of that squashed output is no longer a convex function of the weights.
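A quick way to see the non-convexity is to check one pair of points. The sketch below treats the loss as a function of the pre-sigmoid input z for a single example with label 1 and tests whether the loss at the midpoint of two inputs lies above the straight line joining them (which a convex function never does). The helper names and the chosen points (-6 and -2) are assumptions made for this illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mse_on_sigmoid(z, y=1.0):
    """Squared error between the sigmoid output and the label y."""
    return (sigmoid(z) - y) ** 2


def log_loss_on_sigmoid(z, y=1.0):
    """Binary cross entropy between the sigmoid output and the label y."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def violates_midpoint_convexity(f, a, b):
    """True if f at the midpoint lies above the chord from a to b."""
    return f((a + b) / 2) > (f(a) + f(b)) / 2


mse_not_convex = violates_midpoint_convexity(mse_on_sigmoid, -6.0, -2.0)
log_loss_not_convex = violates_midpoint_convexity(log_loss_on_sigmoid, -6.0, -2.0)

print(mse_not_convex, log_loss_not_convex)
```

On this pair of points the squared error fails the check while the log loss passes it, which is one reason cross entropy is usually preferred with a sigmoid output.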

Other loss functions are:

1. Mean absolute error (MAE) (L1) — measures the mean absolute value of the element-wise difference between predicted value and its actual value.

2. Smooth Absolute Error (Pseudo Huber Loss) - A variant of the Huber loss function that combines the best parts of MSE (L2) and MAE (L1). It behaves like the squared error when the error is small (magnitude below 1), which keeps the loss smooth around zero, and like the absolute error otherwise, which keeps large errors (outliers) from dominating.

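Pseudo-Huber Loss

L_δ(a) = δ² (√(1 + (a/δ)²) − 1)

Where a = (Y − Y_pred) and δ is a positive parameter that sets where the loss switches from roughly quadratic to roughly linear.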

3. Kullback-Leibler (KL) divergence
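The losses in this list can be sketched in a few lines of plain Python. This is only a rough illustration: the function names, the default delta = 1 and the sample numbers are assumptions, and real projects would normally use the versions shipped with their framework.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error (L1)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error (L2), included for comparison."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def smooth_l1(y_true, y_pred):
    """Quadratic for |a| < 1, linear otherwise (Huber loss with delta = 1)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        a = abs(t - p)
        total += 0.5 * a * a if a < 1.0 else a - 0.5
    return total / len(y_true)


def pseudo_huber(y_true, y_pred, delta=1.0):
    """Smooth approximation of the Huber loss."""
    return sum(
        delta ** 2 * (math.sqrt(1.0 + ((t - p) / delta) ** 2) - 1.0)
        for t, p in zip(y_true, y_pred)
    ) / len(y_true)


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical targets and predictions; the last prediction is an outlier.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 3.0, 8.0]

print(mae(y_true, y_pred), mse(y_true, y_pred))
print(smooth_l1(y_true, y_pred), pseudo_huber(y_true, y_pred))
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))
```

Notice how the single outlier inflates MSE far more than MAE or the two Huber-style losses.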

What is L1 (MAE) and L2 (MSE)?

Looking at it purely mathematically, the L1 norm is calculated as the sum of the absolute values of a vector's entries, and the L2 norm as the square root of the sum of their squares. The same two norms are also used as regularization penalties on a model's weights. If both L1 and L2 regularization work well, you might be wondering why we need both. It turns out they have different but equally useful properties.
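As a tiny sketch of the two norms (the function names and the example vector are made up for illustration):

```python
import math


def l1_norm(v):
    """Sum of the absolute values of the entries."""
    return sum(abs(x) for x in v)


def l2_norm(v):
    """Square root of the sum of the squared entries."""
    return math.sqrt(sum(x * x for x in v))


v = [3.0, -4.0]
print(l1_norm(v), l2_norm(v))
```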

From a practical standpoint, L1 tends to shrink coefficients to zero whereas L2 tends to shrink coefficients evenly. L1 is therefore useful for feature selection, as we can drop any variables whose coefficients go down to zero without disturbing the other features. L2, on the other hand, is useful when you have collinear/codependent features, where a change in one feature is likely to affect other features.

Imagine codependent features like gender and pregnancy. We know that males cannot be pregnant, at least not with the current technology :), so a change in gender would certainly affect our pregnancy feature and vice versa.
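The "shrink to zero" versus "shrink evenly" behaviour can be sketched with a single shrinkage step on some weights. This uses the standard proximal steps for the two penalties (soft-thresholding for L1, rescaling for L2) rather than a full training loop; the weights and the penalty strength lam are hypothetical values chosen for the example.

```python
def l1_shrink(w, lam):
    """Soft-thresholding: the proximal step for the penalty lam * |w|."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0  # small weights are set exactly to zero


def l2_shrink(w, lam):
    """Proximal step for the penalty (lam / 2) * w**2: uniform rescaling."""
    return w / (1.0 + lam)


weights = [2.0, 0.3, -0.1, -1.5]
lam = 0.5

after_l1 = [l1_shrink(w, lam) for w in weights]
after_l2 = [l2_shrink(w, lam) for w in weights]

print(after_l1)
print(after_l2)
```

With L1 the two small weights become exactly zero, so those features can be dropped, while L2 scales every weight by the same factor and leaves none of them at zero.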

Classification loss functions

The output variable in a classification problem is usually a probability value f(x), called the score for the input x. Generally, the magnitude of the score represents the confidence of our prediction. The target variable y is a binary variable: 1 for true and -1 for false.

For an example (x, y), the margin is defined as y * f(x). The margin is a measure of how correct we are: it is positive when the sign of the score agrees with the label and negative when it does not. Most classification losses mainly aim to maximize the margin. Some classification losses are:

1. Binary Cross Entropy

2. Negative Log Likelihood

3. Margin Classifier

4. Soft Margin Classifier
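The margin y * f(x) described above can be sketched as follows, assuming labels in {-1, +1} and real-valued scores. The example pairs are made up for illustration.

```python
def margin(y, score):
    """Margin y * f(x) for a label y in {-1, +1} and a real-valued score."""
    return y * score


# Hypothetical (label, score) pairs.
examples = [(+1, 2.5), (-1, -0.7), (+1, -1.2), (-1, 0.4)]

margins = [margin(y, s) for y, s in examples]
# A positive margin means the sign of the score matches the label.
correct = [m > 0 for m in margins]

print(margins)
print(correct)
```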

Let’s briefly visit an important error measurement function, the Cross-Entropy loss.

Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1. Cross-entropy loss increases as the predicted probability diverges from the actual label, and it increases rapidly as the probability predicted for the true class approaches 0. Cross entropy is a good loss function for classification problems because minimizing it reduces the dissimilarity between two probability distributions, the predicted one and the actual one. The categorical cross entropy loss measures the dissimilarity between the true label distribution y and the predicted label distribution ŷ, and is defined as H(y, ŷ) = −Σ_i y_i log(ŷ_i).
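Here is a minimal sketch of both the binary and the categorical form. Note that it assumes labels encoded as 0/1 (or one-hot vectors) rather than the -1/+1 convention used for margins; the function names and the small eps used to avoid log(0) are choices made for this illustration.

```python
import math


def binary_cross_entropy(y, p, eps=1e-12):
    """Log loss for one example: y is 0 or 1, p is the predicted P(y = 1)."""
    p = min(max(p, eps), 1.0 - eps)  # keep p away from 0 and 1
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross entropy between a one-hot y_true and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


confident_and_right = binary_cross_entropy(1, 0.9)
confident_and_wrong = binary_cross_entropy(1, 0.1)
three_class = categorical_cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])

print(confident_and_right, confident_and_wrong, three_class)
```

The loss for the confidently wrong prediction is many times larger than for the confidently right one, which is the rapid growth described above.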

Embedding loss functions

They deal with problems where we have to measure whether two inputs are similar or dissimilar. Some examples are:

1. Hinge Error - The hinge loss is a loss function used for training classifiers. It is used for "maximum-margin" classification, most notably for Support Vector Machines (SVMs).

2. Cosine Error - The cosine distance between two inputs.
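Both of these can be sketched in a few lines. The hinge loss below uses the usual form max(0, 1 - y * f(x)) with labels in {-1, +1}, and the cosine distance is taken as 1 minus the cosine similarity; the function names and inputs are assumptions made for the example.

```python
import math


def hinge_loss(y, score):
    """Zero once the margin y * score reaches 1, linear below that."""
    return max(0.0, 1.0 - y * score)


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


print(hinge_loss(+1, 2.0), hinge_loss(+1, 0.5), hinge_loss(-1, 0.5))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # perpendicular vectors
print(cosine_distance([1.0, 2.0], [2.0, 4.0]))  # same direction
```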

Also, special mention to:

Sigmoid cross entropy loss, in which the input is passed through a sigmoid function before the cross entropy loss is calculated.

Weighted cross entropy loss, in which the positive-class term of the loss is multiplied by a weight of our choosing, so that errors on positive examples can be penalized more (or less) heavily.

Softmax cross entropy loss, which is used when each example belongs to exactly one target category instead of several. Because of this, it first applies a softmax function to transform the outputs into a probability distribution whose values sum to 1, and then computes the cross entropy between that distribution and the true probability distribution.
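These three variants can be sketched directly from raw model outputs (logits). This is a rough illustration assuming 0/1 labels and a single example at a time; the function names and the pos_weight parameter are hypothetical, and the softplus helper is just a numerically stable way of writing log(1 + exp(x)).

```python
import math


def softplus(x):
    """Stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid_cross_entropy(logit, y):
    """Sigmoid followed by binary cross entropy, for a label y in {0, 1}."""
    # -log(sigmoid(z)) = softplus(-z) and -log(1 - sigmoid(z)) = softplus(z)
    return y * softplus(-logit) + (1 - y) * softplus(logit)


def weighted_sigmoid_cross_entropy(logit, y, pos_weight):
    """Same as above, but the positive-class term is scaled by pos_weight."""
    return pos_weight * y * softplus(-logit) + (1 - y) * softplus(logit)


def softmax(logits):
    """Turn logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_cross_entropy(logits, target_index):
    """Cross entropy between softmax(logits) and a single true class."""
    return -math.log(softmax(logits)[target_index])


print(sigmoid_cross_entropy(0.0, 1))
print(weighted_sigmoid_cross_entropy(0.0, 1, pos_weight=2.0))
print(softmax([1.0, 2.0, 3.0]))
print(softmax_cross_entropy([0.0, 0.0, 0.0], 0))
```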

All this, but for what??

This is going to be useful when choosing a loss function for training your model. Your model needs an apt loss function for better results, and the characteristics of each loss together with your requirements will help you pick one from an ocean of functions. An appropriate loss function is necessary for a good model because the loss directly determines what the model is optimized to output.

Be smart with what you choose and you’ll be better than the lot.

See ya soon again.

Stay Home. Stay Safe.
