LIS640
Workflow
- Data Collection
- Data Processing: Normalization. For images, calculate the mean and standard deviation.
- Model Architecture Selection: e.g., a Convolutional Neural Network or a ResNet
- Defining a Loss Function
- Optimization Algorithm: Common optimizers include Stochastic Gradient Descent (SGD), Adam, and RMSprop
- Training the Model
- Evaluation and Validation
- Hyperparameter Tuning
- Regularization and Overfitting Prevention
Details
Two types of Methods
K-Nearest Neighbor (KNN): non-parametric, doesn't learn a model. Nearest Neighbor (NN): uses a distance metric to compare images.
Nearest Neighbor Classifier (NNC): memorize the training set and predict the label of an unknown image by finding its nearest neighbor in the training set and returning that neighbor's label.
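A minimal sketch of the memorize-and-look-up idea (assuming NumPy arrays and an L1 distance; both are illustrative choices, not requirements):
import numpy as np

class NearestNeighborClassifier:
    def train(self, X_train, y_train):
        # "Training" is just memorizing the data
        self.X_train = X_train
        self.y_train = y_train

    def predict(self, X_test):
        # For each test image, return the label of the closest training image (L1 distance)
        y_pred = np.empty(X_test.shape[0], dtype=self.y_train.dtype)
        for i, x in enumerate(X_test):
            distances = np.abs(self.X_train - x).sum(axis=1)
            y_pred[i] = self.y_train[np.argmin(distances)]
        return y_pred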
Deep Neural Network (DNN):
Parametric: learn a model by optimizing the weights of the network. Basic: Linear Classifier.
Testing: Can we add more layers to deal with non-linear data? Ans: No. A stack of linear layers is still a linear function (a product of weight matrices is just another matrix), so we need a nonlinearity between layers.
Activation Function
Testing: Know the different activation functions, choose an appropriate one, and be able to calculate the gradient of each activation function.
- Sigmoid: $\sigma(x) = \frac{1}{1 + e^{-x}}$; its derivative is $\frac{d\sigma}{dx} = \sigma(x)\,(1 - \sigma(x))$
import torch
import torch.autograd as autograd

class SigmoidFunction(autograd.Function):
    @staticmethod
    def forward(ctx, input):
        # Save the input so the backward pass can recompute sigmoid(input)
        ctx.save_for_backward(input)
        return 1 / (1 + torch.exp(-input))

    @staticmethod
    def backward(ctx, grad_output):
        input, = ctx.saved_tensors
        # d(sigmoid)/dx = sigmoid(x) * (1 - sigmoid(x))
        sigmoid_output = 1 / (1 + torch.exp(-input))
        grad_input = grad_output * sigmoid_output * (1 - sigmoid_output)
        return grad_input
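A quick sanity check of the custom function with torch.autograd.gradcheck (gradcheck needs double precision; the shape is arbitrary):
x = torch.randn(5, dtype=torch.double, requires_grad=True)
# gradcheck compares the analytic backward() against numerical gradients
print(torch.autograd.gradcheck(SigmoidFunction.apply, (x,)))  # should print True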
- Tanh: $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$; its derivative is $\frac{d}{dx}\tanh(x) = 1 - \tanh^{2}(x)$
- ReLU: $\mathrm{ReLU}(x) = \max(0, x)$; its derivative is $1$ for $x > 0$ and $0$ for $x < 0$
- Testing: Leaky ReLU. If a ReLU "dies", its gradient is 0, so we use Leaky ReLU $f(x) = \max(\alpha x, x)$, where $\alpha$ is a hyperparameter; its derivative is $1$ for $x > 0$ and $\alpha$ for $x < 0$
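A minimal sketch comparing the two gradients numerically (the negative slope 0.01 is just an assumed value for $\alpha$):
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, 3.0], requires_grad=True)
F.relu(x).sum().backward()
print(x.grad)                      # tensor([0., 1.]): gradient is 0 for negative inputs

x.grad = None
F.leaky_relu(x, negative_slope=0.01).sum().backward()
print(x.grad)                      # tensor([0.0100, 1.0000]): the negative side keeps a small gradient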
Advanced: use a Deep Neural Network to handle non-linear data. Neural Network: before, a linear classifier computes $f = Wx$; now, a two-layer network computes $f = W_2 \max(0, W_1 x)$.
Testing: What if we don't have enough training data to train the network? Ans: Then the learned function will still behave roughly like a linear classifier.
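A minimal sketch of the two-layer form $f = W_2\max(0, W_1 x)$ (the sizes are arbitrary, chosen only for illustration):
import torch

x = torch.randn(32, 100)               # 32 samples, 100 features
W1 = torch.randn(100, 50)
W2 = torch.randn(50, 10)
f = torch.clamp(x @ W1, min=0) @ W2    # max(0, .) is the ReLU nonlinearity
print(f.shape)                         # torch.Size([32, 10])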
Convolutional Neural Network (CNN)
Fully Connected Layer: it does not take the 3D structure of the data into account; it only operates on 1D (flattened) vectors.
Convolutional Layer: convolve the filter with the image, i.e., slide the filter over the image spatially and compute dot products at each location (e.g., a $32\times32\times3$ image with a $5\times5\times3$ filter gives a $28\times28$ activation map).
In general:
- Input: $W$
- Filter: $K$
- Output: $W - K + 1$
Testing: What if we stack two convolutional layers? Ans: They are still linear. Therefore, we need to add an activation function after each convolutional layer.
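A minimal sketch of stacking convolutional layers with a nonlinearity after each one (the channel counts and kernel sizes are assumed for illustration):
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),                      # nonlinearity after the first conv layer
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),                      # without these, the two convs collapse into one linear map
)
print(model(torch.randn(1, 3, 32, 32)).shape)   # torch.Size([1, 32, 32, 32])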
Add Padding
- Input: $W$
- Filter: $K$
- Padding: $P$
- Output: $W - K + 1 + 2P$
- where $P = \frac{K - 1}{2}$ preserves the input size ("same" padding)
Receptive Field: the receptive field is the region of the input that a filter sees (for a single layer, the filter size). Strided Convolution: move the filter more than one pixel at a time across the image. In general:
- Input: $W$
- Filter: $K$
- Stride: $S$
- Padding: $P$
- Output: $\frac{W - K + 2P}{S} + 1$
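A small helper to sanity-check the output-size formula (the sample numbers are just an illustration):
def conv_output_size(W, K, S=1, P=0):
    # Output spatial size: (W - K + 2P) / S + 1
    return (W - K + 2 * P) // S + 1

print(conv_output_size(32, 5))            # 28: 32x32 input, 5x5 filter, no padding
print(conv_output_size(32, 3, S=1, P=1))  # 32: "same" padding preserves the size
print(conv_output_size(32, 3, S=2, P=1))  # 16: stride 2 halves the spatial size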
- Pooling Layer. Hyperparameters: kernel size, stride, padding. Max Pooling with $K = 2$: take the maximum value in each $2\times2$ window, forming a new, smaller map.
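A quick sketch of 2x2 max pooling with PyTorch's nn.MaxPool2d (the input values are made up for illustration):
import torch
import torch.nn as nn

x = torch.tensor([[[[ 1.,  2.,  5.,  6.],
                    [ 3.,  4.,  7.,  8.],
                    [ 9., 10., 13., 14.],
                    [11., 12., 15., 16.]]]])
pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x))   # each 2x2 window is replaced by its maximum: [[4., 8.], [12., 16.]]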
- Batch Normalization. Learnable scale and shift parameters: $\gamma$ and $\beta$.
Idea: "Normalize" the outputs of a layer so they have zero mean and unit variance.
- Mean: $\mu_j = \frac{1}{N}\sum_{i=1}^{N} x_{i,j}$
- Standard Deviation: $\sigma_j^2 = \frac{1}{N}\sum_{i=1}^{N} (x_{i,j} - \mu_j)^2$
- Normalized Data: $\hat{x}_{i,j} = \frac{x_{i,j} - \mu_j}{\sqrt{\sigma_j^2 + \epsilon}}$, then output $y_{i,j} = \gamma_j\,\hat{x}_{i,j} + \beta_j$.
Why do we need BN?
- Makes deep networks much easier to train!
- Allows higher learning rates and faster convergence
- Networks become more robust to initialization
- Acts as regularization during training
- Zero overhead at test time: can be fused with the preceding conv!
Example of Running Mean: initialize $\hat{\mu} = 0$; then for each training batch, update $\hat{\mu} \leftarrow \rho\,\hat{\mu} + (1 - \rho)\,\mu_{\text{batch}}$, and at test time use the running statistics instead of the batch statistics.
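A minimal from-scratch sketch of the computation above, for a 2D batch of features (the momentum $\rho = 0.9$ and $\epsilon = 10^{-5}$ are assumed values for illustration, not the course's implementation):
import torch

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      training, rho=0.9, eps=1e-5):
    if training:
        mu = x.mean(dim=0)                  # per-feature batch mean
        var = x.var(dim=0, unbiased=False)  # per-feature batch variance
        # Update running statistics for use at test time
        running_mean.mul_(rho).add_((1 - rho) * mu)
        running_var.mul_(rho).add_((1 - rho) * var)
    else:
        mu, var = running_mean, running_var
    x_hat = (x - mu) / torch.sqrt(var + eps)
    return gamma * x_hat + beta             # learnable scale and shift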
Batch Normalization is usually inserted after Fully Connected (FC) or Convolutional layers, and before the nonlinearity.
graph LR id1(FC) --> id2(BN) --> id3(tanh) --> id4(FC) --> id5(BN) style id2 fill:#f9f style id5 fill:#f9f
Testing: Convolution Summary
Testing: AlexNet!!!
- Loss Function. A loss function tells us how good our model is.
- Cross-Entropy Loss (Multinomial Logistic Regression): use the softmax function to turn scores into probabilities, then compare them with the label.
- Multi-Class SVM Loss
Testing: Compare the loss functions and choose the best one.
import numpy as np

# Binary cross-entropy on raw scores: apply a sigmoid, then compare with the 0/1 target
def cross_entropy_loss(y_pred, y_target):
    p = 1 / (1 + np.exp(-y_pred))   # sigmoid turns scores into probabilities
    loss = np.mean(-y_target * np.log(p) - (1 - y_target) * np.log(1 - p))
    return loss

# Hinge (SVM) loss: targets are mapped from {0, 1} to {-1, +1}
def svm_loss(y_pred, y_target):
    y_target = 2 * y_target - 1
    loss = np.clip(1 - y_target * y_pred, a_min=0, a_max=None).mean()
    return loss
Then compare the two loss functions:
W_1, b_1 = np.array([-1.9, 1.1]), np.array([0.1])
Linear_Classifier.param_init(W_1, b_1)
y_pred_1 = Linear_Classifier.predict(x_test)
ce_loss_1 = cross_entropy_loss(y_pred_1, y_test)
hinge_loss_1 = svm_loss(y_pred_1, y_test)
print('Cross Entropy and SVM loss for the Group 1: {}, {}'.format(ce_loss_1, hinge_loss_1))
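The Cross-Entropy bullet above describes the multinomial (softmax) case, while the code above is the binary version; here is a minimal softmax cross-entropy sketch (the class scores and label are made-up values):
def softmax_cross_entropy(scores, label):
    # Softmax turns class scores into probabilities, then take -log of the true class
    exp_scores = np.exp(scores - np.max(scores))   # subtract max for numerical stability
    probs = exp_scores / exp_scores.sum()
    return -np.log(probs[label])

print(softmax_cross_entropy(np.array([2.0, 1.0, 0.1]), label=0))  # small loss: class 0 has the highest score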
- Optimization: minimize the loss function.
There are two ways to do the optimization:
- Gradient descent on the whole batch: $L(W) = \frac{1}{N}\sum_{i=1}^{N} L_i(x_i, y_i, W)$, with update $w \leftarrow w - \alpha \nabla_W L(W)$
- Stochastic Gradient Descent (SGD): the same update, but the gradient is computed on a minibatch $B$, where $B$ is a small random subset of the training set
Testing: Why are we not using the whole batch? Ans: if $N$ is large, computing the full gradient for every update is too expensive; a minibatch gives a cheap, approximately unbiased estimate.
Problems of SGD:
- If you reach a local minimum (or saddle point), the gradient is zero and SGD gets stuck.
- To fix this, add momentum.
SGD: $w_{t+1} = w_t - \alpha \nabla L(w_t)$, to SGD+Momentum: $v_{t+1} = \rho v_t + \nabla L(w_t)$ and $w_{t+1} = w_t - \alpha v_{t+1}$, where $\rho$ is the momentum coefficient (typically 0.9).
Thus our code can be:
# SGD
for t in range(num_steps):
    dw = compute_gradient(w)
    w -= learning_rate * dw
to
# SGD + Momentum
v = 0
for t in range(num_steps):
    dw = compute_gradient(w)
    v = rho * v + dw          # accumulate a velocity (running sum of gradients)
    w -= learning_rate * v
- Use AdaGrad, which scales the step in each direction by the accumulated squared gradients, so that progress is balanced across steep and flat directions
grad_squared = 0
for t in range(num_steps):
    dw = compute_gradient(w)
    grad_squared += dw * dw   # accumulate squared gradients per element
    w -= learning_rate * dw / (grad_squared.sqrt() + 1e-7)
Testing: What happens with AdaGrad? Ans: Progress along "steep" directions is damped; progress along "flat" directions is accelerated.
- A new problem with AdaGrad is that the squared gradients accumulate forever, so the step size decays toward zero; you need a "leaky" version
So, we can use RMSProp: "Leaky Adagrad"
grad_squared = 0
for t in range(num_steps):
    dw = compute_gradient(w)
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dw * dw
    w -= learning_rate * dw / (grad_squared.sqrt() + 1e-7)
- Now we can apply the same leaky running average to dw itself as well as to dw * dw, which gives Adam: RMSProp + Momentum
moment1 = 0
moment2 = 0
for t in range(1, num_steps + 1):  # Start at t = 1
    dw = compute_gradient(w)
    moment1 = beta1 * moment1 + (1 - beta1) * dw        # momentum (first moment)
    moment2 = beta2 * moment2 + (1 - beta2) * dw * dw   # RMSProp (second moment)
    w -= learning_rate * moment1 / (moment2.sqrt() + 1e-7)
- Bias correction for the fact that first and second moment estimates start at zero
moment1 = 0
moment2 = 0
for t in range(1, num_steps + 1):  # Start at t = 1
    dw = compute_gradient(w)
    moment1 = beta1 * moment1 + (1 - beta1) * dw
    moment2 = beta2 * moment2 + (1 - beta2) * dw * dw
    moment1_unbias = moment1 / (1 - beta1 ** t)   # bias correction
    moment2_unbias = moment2 / (1 - beta2 ** t)
    w -= learning_rate * moment1_unbias / (moment2_unbias.sqrt() + 1e-7)
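In practice you would usually call PyTorch's built-in optimizer rather than writing the loop by hand; a minimal sketch with torch.optim.Adam (the learning rate, betas, and the stand-in loss are assumed here only for illustration):
import torch

num_steps = 100
w = torch.randn(10, requires_grad=True)
optimizer = torch.optim.Adam([w], lr=1e-3, betas=(0.9, 0.999))  # common default betas

for t in range(num_steps):
    loss = (w ** 2).sum()       # stand-in loss, just for illustration
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()            # applies the Adam update, with bias correction built in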
Overview:
flowchart TB %% Define node styles classDef default fill:#f0f0f0,stroke:#333,stroke-width:2px classDef highlight fill:#f9f,stroke:#333,stroke-width:2px classDef subgraphStyle fill:#fff,stroke:#ddd,stroke-width:2px subgraph First["First Steps"] direction TB B[Whole Batch] -->|no| M[Momentum] S[SGD] -->|zero grad| M end subgraph Second["Adaptive Methods"] direction TB A[AdaGrad] -->|accumulate| R[RMSProp] end subgraph Last["Modern Optimizers"] direction TB AD[Adam] -->|bias correction| AB[Adam+Bias] end %% Connections between subgraphs M -->|steep| A R -->|improve| AD %% Apply styles class S highlight class R highlight class AB highlight class First,Second,Last subgraphStyle
Practice
Let's define a new operation, the sigmoid function:
def sigmoid(x):
    return 1.0 / (1.0 + (-x).exp())
Then initialize the parameters:
import torch

N, D_in, H, D_out = 64, 1000, 100, 10
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
w1 = torch.randn(D_in, H, requires_grad=True)
w2 = torch.randn(H, D_out, requires_grad=True)
learning_rate = 1e-6

for t in range(500):
    y_pred = sigmoid(x.mm(w1)).mm(w2)
    loss = (y_pred - y).pow(2).sum()
    loss.backward()
    if t % 50 == 0:
        print(t, loss.item())
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        w1.grad.zero_()
        w2.grad.zero_()
Then we need to define a new autograd operator:
class Sigmoid(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        y = 1.0 / (1.0 + (-x).exp())
        ctx.save_for_backward(y)
        return y

    @staticmethod
    def backward(ctx, grad_y):
        y, = ctx.saved_tensors
        grad_x = grad_y * y * (1.0 - y)
        return grad_x

def sigmoid(x):
    return Sigmoid.apply(x)
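With this definition, sigmoid(x.mm(w1)) in the training loop above now routes through our custom forward/backward. A quick sanity check against PyTorch's built-in torch.sigmoid (shapes chosen arbitrarily):
x = torch.randn(4, 3, requires_grad=True)
y = sigmoid(x)
y.sum().backward()
print(torch.allclose(y, torch.sigmoid(x)))                                 # True: same forward values
print(torch.allclose(x.grad, torch.sigmoid(x) * (1 - torch.sigmoid(x))))   # True: same gradient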