Adversarial Machine Learning
Where W_l and b_l are the parameters of the l'th layer and f_l is the l'th layer's activation function.
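To make this layer-by-layer computation concrete, here is a minimal sketch of a forward pass, assuming a small two-layer fully connected network with ReLU activations (the layer sizes, weights, and names below are made up for illustration, not taken from this talk):

```python
import numpy as np

def relu(x):
    # An example activation function f_l
    return np.maximum(0, x)

# Hypothetical 2-layer network: 4 inputs -> 8 hidden units -> 2 classes
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)   # parameters of layer 1
W2, b2 = rng.normal(size=(2, 8)), np.zeros(2)   # parameters of layer 2

def forward(x):
    # Each layer applies its activation f_l to the affine map W_l x + b_l
    h1 = relu(W1 @ x + b1)
    scores = W2 @ h1 + b2
    return scores

x = rng.normal(size=4)   # a toy input
print(forward(x))        # raw scores for the two classes
```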
In order to train a neural network, we need labels and a well-defined loss function to approximate these labels. Labels are basically class names for a classification task (imagine a binary classification with 2 labels: cat and dog), which are expressed as one-hot vectors ([1, 0] for cat, [0, 1] for dog). The output of the neural network will be an array of scalar numbers corresponding to the probability of each class (for our cat/dog example, e.g. [0.9, 0.1] => probability of cat and dog for the given image, respectively). Computing the loss is rather easy: we need to define the loss function such that if our predictions are close to the original label, the loss is small; otherwise, it is high. There are a number of different loss functions, the most popular of which is the cross-entropy loss, defined as:

Loss = -sum_c y_c * log(y_hat_c)
where c is the class index, y is the one-hot label, and y_hat is the vector of prediction probabilities produced by our neural network. If the given label corresponds to class c, then we want the prediction for class c to be close to one (1 means the neural net thinks the given image is 100% from that class), so that the loss decreases. Observe that the negative sign in front of the loss is there to make the loss positive, because the prediction probabilities are between 0 and 1, which makes the log output negative. On the other hand, if the entry of y corresponding to class c is zero, we don't care about the prediction for that class at all.
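As a quick sanity check on this behaviour, here is a small sketch of the cross-entropy computation for the cat/dog example above (the prediction arrays are just illustrative values):

```python
import numpy as np

def cross_entropy(y, y_hat):
    # Loss = -sum_c y_c * log(y_hat_c); eps avoids log(0)
    eps = 1e-12
    return -np.sum(y * np.log(y_hat + eps))

y    = np.array([1.0, 0.0])   # one-hot label: cat
good = np.array([0.9, 0.1])   # prediction close to the label
bad  = np.array([0.2, 0.8])   # prediction far from the label

print(cross_entropy(y, good))  # ~0.105 (small loss)
print(cross_entropy(y, bad))   # ~1.609 (large loss)
```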
Now we need to train our neural network so that we can decrease the loss and predict close to the original labels. How can we do that? The answer is simple, and it comes from calculus: all we need to do is take the derivative of the loss with respect to the parameters of the neural network and subtract it from the parameters to decrease the loss function. This operation is called gradient descent. The first question that comes to mind after all these explanations is how one can compute the derivatives through all the serially connected layers, especially for the intermediate layers. Don't worry about this, though, because the back-propagation algorithm does the magic for us and calculates all the derivatives (gradients). We will not discuss the details of the back-propagation algorithm since it is beyond the scope of this talk. With the help of back-propagation and gradient descent, we train our network on the training dataset to decrease the loss.
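A minimal training-loop sketch of this idea, using PyTorch's autograd as the back-propagation engine (the model, learning rate, and data below are placeholders chosen for illustration, not a specific setup from this talk):

```python
import torch
import torch.nn as nn

# Toy model and data (shapes and values are illustrative only)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
loss_fn = nn.CrossEntropyLoss()
x = torch.randn(16, 4)            # 16 toy inputs
y = torch.randint(0, 2, (16,))    # 16 labels (0 = cat, 1 = dog)
lr = 0.1                          # learning rate

for step in range(100):
    loss = loss_fn(model(x), y)   # forward pass + loss
    model.zero_grad()
    loss.backward()               # back-propagation: compute all gradients
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad      # gradient descent: subtract the gradient
```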