In this case, the activation function does not depend on the scores of classes in \(C\) other than \(C_1 = C_i\). So the gradient with respect to each score \(s_i\) in \(s\) will only depend on the loss given by its binary problem.
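A minimal sketch of this independence, assuming sigmoid activations and a binary cross-entropy loss summed over classes. The helper names `sigmoid` and `bce_gradient` are illustrative and not taken from any framework. Under these assumptions the gradient with respect to \(s_i\) works out to \(f(s_i) - t_i\), so changing another class's score should leave it untouched:

```python
import math

def sigmoid(x):
    """Logistic function f(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def bce_gradient(scores, targets):
    """Gradient of the summed binary cross-entropy w.r.t. each raw score.

    Each entry is f(s_i) - t_i, so it only involves that class's own score.
    """
    return [sigmoid(s) - t for s, t in zip(scores, targets)]

scores = [2.0, -1.0, 0.5]
targets = [1, 0, 1]
grad_before = bce_gradient(scores, targets)

# Perturb the score of class 1 only; classes 0 and 2 keep their gradients.
scores[1] = 3.0
grad_after = bce_gradient(scores, targets)

print(grad_before)
print(grad_after)
```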
- Caffe: Sigmoid Cross-Entropy Loss Layer
- PyTorch: BCEWithLogitsLoss
- TensorFlow: sigmoid_cross_entropy.
Focal Loss was introduced by researchers from Facebook in this paper. They claim to improve one-stage object detectors by using Focal Loss to train a detector they name RetinaNet. Focal Loss is a Cross-Entropy Loss that weighs the contribution of each sample to the loss based on the classification error. The idea is that, if a sample is already classified correctly by the CNN, its contribution to the loss decreases. With this strategy, they claim to solve the problem of class imbalance by making the loss implicitly focus on the problematic classes. Moreover, they also weight the contribution of each class to the loss for a more explicit class balancing. They use Sigmoid activations, so Focal Loss can also be considered a Binary Cross-Entropy Loss. We define it for each binary problem as:
\[FL = -\sum_{i=1}^{C=2} (1 - s_i)^{\gamma} t_i \log(s_i)\]
Where \((1 - s_i)^{\gamma}\), with the focusing parameter \(\gamma \geq 0\), is a modulating factor that reduces the influence of correctly classified samples on the loss. With \(\gamma = 0\), Focal Loss is equivalent to Binary Cross-Entropy Loss.
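To make the effect of the modulating factor concrete, here is a small sketch for a single binary problem. It assumes `p` is already the sigmoid activation and that the negative case mirrors the positive one with `p` and `1 - p` swapped. The function names are illustrative only.

```python
import math

def binary_cross_entropy(p, t):
    """Binary cross-entropy for activation p in (0, 1) and label t in {0, 1}."""
    return -math.log(p) if t == 1 else -math.log(1.0 - p)

def binary_focal_loss(p, t, gamma=2.0):
    """Focal loss for one binary problem: cross-entropy scaled by the modulating factor."""
    if t == 1:
        return -((1.0 - p) ** gamma) * math.log(p)
    return -(p ** gamma) * math.log(1.0 - p)

# The better a positive sample is classified, the more its loss is scaled down.
for p in (0.6, 0.9, 0.99):
    ce = binary_cross_entropy(p, 1)
    fl = binary_focal_loss(p, 1, gamma=2.0)
    print(f"p={p}: CE={ce:.4f}  FL={fl:.6f}  FL/CE={fl / ce:.4f}")
```

With `gamma=0.0` both functions return the same value, which matches the equivalence with Binary Cross-Entropy Loss stated above.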
Where we have separated the formulation for when the class \(C_i = C_1\) is positive or negative (and therefore, the class \(C_2\) is positive). As before, we have \(s_2 = 1 - s_1\) and \(t_2 = 1 - t_1\).
The gradient gets a bit more complex due to the inclusion of the modulating factor \((1 - s_i)^{\gamma}\) in the loss formulation, but it can be deduced using the Binary Cross-Entropy gradient expression.
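Where \(f()\) is the sigmoid function. To get the gradient expression for a negative \(C_i\) \((t_i = 0)\), we replace \(f(s_i)\) with \((1 - f(s_i))\) in the expression above and flip the sign of the result, since \(1 - f(s_i) = f(-s_i)\) while the derivative is still taken with respect to \(s_i\).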
Notice that, if the focusing parameter \(\gamma = 0\), the loss is equivalent to the CE Loss, and we end up with the same gradient expression.
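As a worked check, assume the positive-class loss has the form \(FL = -(1 - f(s_i))^{\gamma} \log(f(s_i))\) and use the sigmoid derivative \(f'(s_i) = f(s_i)(1 - f(s_i))\). The chain rule then gives the following, and setting \(\gamma = 0\) recovers the Binary Cross-Entropy gradient for a positive class:

```latex
\[
\begin{aligned}
\frac{\partial FL}{\partial s_i}
  &= \gamma (1 - f(s_i))^{\gamma - 1} f'(s_i) \log(f(s_i))
     - (1 - f(s_i))^{\gamma} \frac{f'(s_i)}{f(s_i)} \\
  &= (1 - f(s_i))^{\gamma} \left( \gamma f(s_i) \log(f(s_i)) + f(s_i) - 1 \right) \\
\gamma = 0 \;\Rightarrow\; \frac{\partial FL}{\partial s_i} &= f(s_i) - 1
\end{aligned}
\]
```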
Forward pass: Loss computation
Where logprobs[r] stores, for each element of the batch, the sum of the binary cross-entropy for each class. The focusing_parameter is \(\gamma\), which by default is 2 and should be defined as a layer parameter in the net prototxt. The class_balances can be used to introduce different loss contributions per class, as they do in the Facebook paper.
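The forward pass described above can be sketched in plain Python, without Caffe. This is not the layer itself: the function name `focal_loss_forward`, the `(negative, positive)` ordering of `class_balances`, and the averaging over the batch are all assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def focal_loss_forward(scores, labels, focusing_parameter=2.0, class_balances=(1.0, 1.0)):
    """Mean focal loss over a batch.

    scores: list of rows of raw (pre-sigmoid) scores, one row per batch element.
    labels: same shape, 0 for negative classes and non-zero for positive ones.
    class_balances: (negative_weight, positive_weight); this ordering is an assumption.
    """
    logprobs = [0.0] * len(scores)
    for r, (row_scores, row_labels) in enumerate(zip(scores, labels)):
        for s, t in zip(row_scores, row_labels):
            p = sigmoid(s)
            if t == 0:  # loss form for negative classes
                logprobs[r] += class_balances[0] * -math.log(1.0 - p) * p ** focusing_parameter
            else:       # loss form for positive classes
                logprobs[r] += class_balances[1] * -math.log(p) * (1.0 - p) ** focusing_parameter
    # Average the per-element sums to get the batch loss.
    return sum(logprobs) / len(scores), logprobs

scores = [[2.0, -1.0], [0.3, 1.5]]
labels = [[1, 0], [0, 1]]
loss, logprobs = focal_loss_forward(scores, labels)
print(loss, logprobs)
```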
Backward pass: Gradients computation
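A plain-Python sketch of the gradient computation, under the same assumptions as a batch-averaged focal loss with sigmoid activations. The name `focal_loss_backward` and the `(negative, positive)` ordering of `class_balances` are hypothetical. The negative-label branch is written out explicitly and carries the sign that comes from differentiating \(1 - f(s_i)\) with respect to \(s_i\).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def focal_loss_backward(scores, labels, focusing_parameter=2.0, class_balances=(1.0, 1.0)):
    """Gradient of the batch-mean focal loss w.r.t. each raw score.

    scores / labels: lists of rows, one row per batch element.
    class_balances: (negative_weight, positive_weight); this ordering is an assumption.
    """
    g = focusing_parameter
    n = len(scores)
    delta = []
    for row_scores, row_labels in zip(scores, labels):
        row = []
        for s, t in zip(row_scores, row_labels):
            p = sigmoid(s)
            if t == 0:
                # d/ds of -p^g * log(1 - p)
                grad = class_balances[0] * (p ** g) * (p - g * (1.0 - p) * math.log(1.0 - p))
            else:
                # d/ds of -(1 - p)^g * log(p)
                grad = class_balances[1] * ((1.0 - p) ** g) * (g * p * math.log(p) + p - 1.0)
            row.append(grad / n)  # divide by batch size to match a mean loss
        delta.append(row)
    return delta

scores = [[2.0, -1.0], [0.3, 1.5]]
labels = [[1, 0], [0, 1]]
for row in focal_loss_backward(scores, labels):
    print([round(d, 6) for d in row])
```

With `focusing_parameter=0.0` the two branches reduce to `p - 1` and `p`, the Binary Cross-Entropy gradients for positive and negative labels.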
In the specific (and usual) case of Multi-Class classification the labels are one-hot, so only the positive class \(C_p\) keeps its term in the loss. There is only one element of the target vector \(t\) which is not zero, \(t_i = t_p\). So, discarding the elements of the summation which are zero due to the target labels, we can write:
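A short numerical illustration of that reduction, assuming a softmax activation \(f(s)\) over the raw scores (the activation is an assumption here, and the function names are illustrative). With one-hot targets the full summation and the single-term form \(-\log(f(s)_p)\) should give the same value:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, targets):
    """Full summation over all classes: -sum_i t_i * log(f(s)_i)."""
    probs = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(targets, probs))

def cross_entropy_one_hot(scores, positive_index):
    """Reduced form for one-hot targets: only the positive class term survives."""
    return -math.log(softmax(scores)[positive_index])

scores = [1.0, 3.0, 0.2]
targets = [0, 1, 0]  # one-hot: class 1 is the positive class C_p
print(cross_entropy(scores, targets), cross_entropy_one_hot(scores, 1))
```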
This would be the pipeline for each one of the \(C\) classes. We set \(C\) independent binary classification problems \((C' = 2)\). Then we sum up the loss over the different binary problems: we sum up the gradients of every binary problem to backpropagate, and the losses to monitor the global loss. \(s_1\) and \(t_1\) are the score and the groundtruth label for the class \(C_1\), which is also the class \(C_i\) in \(C\). \(s_2 = 1 - s_1\) and \(t_2 = 1 - t_1\) are the score and the groundtruth label of the class \(C_2\), which is not a “class” in our original problem with \(C\) classes, but a class we create to set up the binary problem with \(C_1 = C_i\). We can understand it as a background class.
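The setup above can be sketched as follows. It assumes \(s_1\) is taken after the sigmoid, so that \(s_2 = 1 - s_1\) is the activation of the background class. The function names are illustrative only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def binary_problem_loss(score, t1):
    """Cross-entropy of one binary problem (C' = 2) built around class C_1 = C_i."""
    s1 = sigmoid(score)  # activation for C_1
    s2 = 1.0 - s1        # activation for the background class C_2
    t2 = 1 - t1          # groundtruth label for C_2
    return -(t1 * math.log(s1) + t2 * math.log(s2))

def multilabel_loss(scores, targets):
    """Global loss: the sum of the C independent binary losses."""
    return sum(binary_problem_loss(s, t) for s, t in zip(scores, targets))

scores = [2.0, -1.0, 0.5]  # one raw score per class in C
targets = [1, 0, 1]        # multi-label groundtruth
per_class = [binary_problem_loss(s, t) for s, t in zip(scores, targets)]
print(per_class, multilabel_loss(scores, targets))
```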