LIS640

Workflow

  1. Data Collection
  2. Data Processing & Normalization: for images, calculate the mean and std (see the sketch after this list).
  3. Model Architecture Selection: e.g., a Convolutional Neural Network or ResNet
  4. Defining a Loss Function
  5. Optimization Algorithm: Common optimizers include Stochastic Gradient Descent (SGD), Adam, and RMSprop
  6. Training the Model
  7. Evaluation and Validation
  8. Hyperparameter Tuning
  9. Regularization and Overfitting Prevention
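
A minimal sketch of step 2 (normalization), assuming the data is a batch of RGB images stored as an (N, C, H, W) tensor; the random batch below is a stand-in for a real dataset.

import torch

images = torch.rand(128, 3, 32, 32)                   # stand-in batch of RGB images
mean = images.mean(dim=(0, 2, 3))                     # per-channel mean, shape (3,)
std = images.std(dim=(0, 2, 3))                       # per-channel std, shape (3,)
normalized = (images - mean[None, :, None, None]) / std[None, :, None, None]
print(mean, std)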

Details

Two Types of Methods

K-Nearest Neighbor (KNN): non-parametric; it does not learn a model. Nearest Neighbor (NN) uses a distance metric to compare images:

$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

Nearest Neighbor Classifier (NNC): memorize the training set and predict the label of an unknown image by finding its nearest neighbor in the training set and returning that neighbor's label.
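
A minimal sketch of such a classifier, assuming flattened image vectors and integer labels; the random data below is only for illustration.

import numpy as np

class NearestNeighbor:
    def train(self, X, y):
        # "Training" is just memorizing the whole training set.
        self.X_train, self.y_train = X, y

    def predict(self, X):
        preds = np.zeros(X.shape[0], dtype=self.y_train.dtype)
        for i in range(X.shape[0]):
            # L2 distance from the i-th test image to every training image
            dists = np.sqrt(np.sum((self.X_train - X[i]) ** 2, axis=1))
            preds[i] = self.y_train[np.argmin(dists)]  # label of the nearest neighbor
        return preds

X_train = np.random.rand(500, 3 * 32 * 32)             # 500 flattened "images"
y_train = np.random.randint(0, 10, size=500)           # 10 fake class labels
clf = NearestNeighbor()
clf.train(X_train, y_train)
print(clf.predict(np.random.rand(3, 3 * 32 * 32)))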

Deep Neural Network(DNN):

Parametric: it learns a model by optimizing the weights of the network. Basic: Linear Classifier $f(x, W) = Wx$, where $f$ is (class,), $W$ is (class, pixel), and $x$ is (pixel,). Testing: find the best weights $W$ for $f$, and note that $f(cx, W) = W(cx) = c \times f(x, W)$. Limitation: it can only classify linearly separable data.
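
A small shape check of the linear classifier, assuming 10 classes and 3×32×32 images flattened to pixel vectors; the random weights are illustrative only.

import numpy as np

num_classes, num_pixels = 10, 3 * 32 * 32
W = np.random.randn(num_classes, num_pixels) * 0.01   # (class, pixel)
x = np.random.randn(num_pixels)                       # (pixel,) -- one flattened image
scores = W @ x                                        # f(x, W) = Wx, shape (class,)
print(scores.shape)                                   # (10,)

c = 2.0
assert np.allclose(W @ (c * x), c * (W @ x))          # f(cx, W) = c * f(x, W)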

Testing Can we add more layers to deal with non-linear data? Ans: No. $y = W_1 W_2 x$ is still linear.
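
A quick numerical check of that answer: two stacked linear layers collapse into a single linear layer $W = W_1 W_2$ (the sizes below are arbitrary).

import numpy as np

W1, W2 = np.random.randn(10, 50), np.random.randn(50, 100)
x = np.random.randn(100)
assert np.allclose(W1 @ (W2 @ x), (W1 @ W2) @ x)      # same map as one linear layer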

Activation Function

Testing Know the different activation functions, be able to choose the best one, and be able to calculate the gradient of each activation function.

import torch 
import torch.autograd as autograd

class SigmoidFunction(autograd.Function):
	@staticmethod
	def forward(ctx, input):
		ctx.save_for_backward(input)          # stash the input for the backward pass
		return 1 / (1 + torch.exp(-input))

	@staticmethod
	def backward(ctx, grad_output):
		input, = ctx.saved_tensors
		sigmoid_output = 1 / (1 + torch.exp(-input))
		# d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)), scaled by the incoming gradient
		grad_input = grad_output * sigmoid_output * (1 - sigmoid_output)
		return grad_input
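
A short usage sketch: call the custom op through .apply and check the backward pass numerically (torch.autograd.gradcheck expects double-precision inputs).

x = torch.randn(5, dtype=torch.double, requires_grad=True)
y = SigmoidFunction.apply(x)
y.sum().backward()
print(x.grad)                                                  # sigmoid'(x) for each element
print(torch.autograd.gradcheck(SigmoidFunction.apply, (x,)))   # expected: True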

Advanced: use a Deep Neural Network to handle non-linear data. Neural Network: before, $f(x) = Wx + b$; now, $f(x) = W_2 \max(0, W_1 x + b_1) + b_2$ (two layers).
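
A minimal sketch of that two-layer form; the sizes (3072 inputs, 100 hidden units, 10 classes) are assumptions for illustration.

import torch

D_in, H, D_out = 3 * 32 * 32, 100, 10
W1, b1 = torch.randn(H, D_in), torch.randn(H)
W2, b2 = torch.randn(D_out, H), torch.randn(D_out)

x = torch.randn(D_in)                      # one flattened image
h = torch.clamp(W1 @ x + b1, min=0)        # ReLU: max(0, W1 x + b1)
f = W2 @ h + b2                            # class scores, shape (D_out,)
print(f.shape)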

Testing What if we stack two layers without an activation function in between? Ans: Then the function is still linear: $f(x) = W_1(W_2 x + b_2) + b_1 = W_1 W_2 x + W_1 b_2 + b_1$.

Convolutional Neural Network (CNN)

Fully Connected Layer: it does not preserve the 3D structure of the data; it only operates on flattened 1D vectors.

Convolutional Layer: convolve a filter with the image (e.g. 3×32×32). Testing Choose a filter (e.g. 3×5×5), slide it over the image taking a dot product at each location, and map it to a new activation layer (e.g. 1×28×28).
In general (stride 1, no padding), the output size is $W - k + 1$.

Testing What if we stack two convolutional layers? Ans: Their composition is still linear. Therefore, we need to add an activation function after each convolutional layer.

Add Padding

Receptive Field: the receptive field is the size of the filter. Strided convolution moves the filter by more than one pixel at a time. In general:

  - Input: $W$
  - Filter: $k$
  - Stride: $s$
  - Padding: $p$
  - Output: $(W - k + 2p)/s + 1$

Refer to the Convolution Summary.
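
A small check of this formula against PyTorch, using the example sizes from above (a 3×32×32 image and a 5×5 filter).

import torch
import torch.nn as nn

W, k, s, p = 32, 5, 1, 0
print((W - k + 2 * p) // s + 1)            # 28

conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=k, stride=s, padding=p)
x = torch.randn(1, 3, W, W)                # one 3x32x32 image with a batch dimension
print(conv(x).shape)                       # torch.Size([1, 1, 28, 28])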

  1. Mean: $\mu_j = \frac{1}{N}\sum_{i=1}^{N} x_{i,j}$

  2. Variance: $\sigma_j^2 = \frac{1}{N}\sum_{i=1}^{N} (x_{i,j} - \mu_j)^2$

  3. Normalized Data: $\hat{x}_{i,j} = \frac{x_{i,j} - \mu_j}{\sqrt{\sigma_j^2 + \epsilon}}$
    Why do we need BN?
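
A minimal sketch of these three formulas for a batch of feature vectors of shape (N, D); eps plays the role of $\epsilon$ above.

import torch

def batch_norm(x, eps=1e-5):
    mu = x.mean(dim=0)                     # per-feature mean, shape (D,)
    var = x.var(dim=0, unbiased=False)     # per-feature variance, shape (D,)
    return (x - mu) / torch.sqrt(var + eps)

x = torch.randn(64, 100)
x_hat = batch_norm(x)
print(x_hat.mean(dim=0).abs().max().item(), x_hat.std(dim=0).mean().item())  # ~0 and ~1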

Batch Normalization is usually inserted after Fully Connected (FC) or Convolutional layers, and before the nonlinearity.

graph LR

id1(FC) --> id2(BN) --> id3(tanh) --> id4(FC) --> id5(BN)
style id2 fill:#f9f
style id5 fill:#f9f

Testing Convolution Summary

Testing AlexNet !!!

  1. Loss Function: a loss function tells us how good our model is.

Testing Compare the loss functions and choose the best one.

#cross entropy (binary, with a sigmoid applied to the raw score y_pred)
import numpy as np

def cross_entropy_loss(y_pred, y_target):
    p = 1 / (1 + np.exp(-y_pred))                      # predicted probability of class 1
    loss = np.mean(-y_target * np.log(p) - (1 - y_target) * np.log(1 - p))
    return loss
#svm (hinge loss)
def svm_loss(y_pred, y_target):
    y_target = 2 * y_target - 1                        # map labels {0, 1} -> {-1, +1}
    loss = np.clip(1 - y_target * y_pred, a_min=0, a_max=None).mean()
    return loss

Then compare the two loss functions:

W_1, b_1 = np.array([-1.9, 1.1]), np.array([0.1])
Linear_Classifier.param_init(W_1, b_1)
y_pred_1 = Linear_Classifier.predict(x_test)
ce_loss_1 = cross_entropy_loss(y_pred_1, y_test)
hinge_loss_1 = svm_loss(y_pred_1, y_test)
print('Cross Entropy and SVM loss for the Group 1: {}, {}'.format(ce_loss_1, hinge_loss_1))
  1. Optimization: minimize the loss function.
    There are two ways to do the optimization: full-batch Gradient Descent and mini-batch Stochastic Gradient Descent (SGD).

Testing Why are we not using the whole batch? Ans: if N is large, the full sum over all training examples is expensive to compute at every step.
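
A runnable sketch of mini-batch SGD on a toy least-squares problem, showing how a mini-batch is sampled instead of summing over all N examples; the sizes and learning rate are assumptions.

import numpy as np

np.random.seed(0)
N, D = 10_000, 20
X = np.random.randn(N, D)
w_true = np.random.randn(D)
y = X @ w_true + 0.1 * np.random.randn(N)

w = np.zeros(D)
learning_rate, batch_size, num_steps = 0.1, 32, 500
for t in range(num_steps):
    idx = np.random.choice(N, batch_size, replace=False)   # a mini-batch, not the whole dataset
    Xb, yb = X[idx], y[idx]
    dw = 2 * Xb.T @ (Xb @ w - yb) / batch_size              # gradient of the mean squared error on the batch
    w -= learning_rate * dw
print(np.linalg.norm(w - w_true))                           # roughly recovers w_true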

Problems of SGD: the updates are noisy, and progress zig-zags on poorly conditioned objectives. Start from plain SGD:

#SGD
for t in range(num_steps):
	dw = compute_gradient(w)
	w -= learning_rate * dw

and improve it by adding momentum:

#SGD+Momentum
v = 0
for t in range(num_steps):
	dw = compute_gradient(w)
	v = rho * v + dw                    # running "velocity" of past gradients
	w -= learning_rate * v

#AdaGrad
grad_squared = 0
for t in range(num_steps):
	dw = compute_gradient(w)
	grad_squared += dw * dw             # accumulate squared gradients
	w -= learning_rate * dw / (grad_squared.sqrt() + 1e-7)

Testing What happens with AdaGrad? Ans: Progress along "steep" directions is damped; progress along "flat" directions is accelerated

#RMSProp
grad_squared = 0
for t in range(num_steps):
	dw = compute_gradient(w)
	grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dw * dw   # leaky accumulation
	w -= learning_rate * dw / (grad_squared.sqrt() + 1e-7)

#Adam (almost: no bias correction)
moment1 = 0
moment2 = 0
for t in range(1, num_steps + 1):  # Start at t = 1
	dw = compute_gradient(w)
	moment1 = beta1 * moment1 + (1 - beta1) * dw             # momentum-like first moment
	moment2 = beta2 * moment2 + (1 - beta2) * dw * dw        # RMSProp-like second moment
	w -= learning_rate * moment1 / (moment2.sqrt() + 1e-7)

#Adam (full: with bias correction)
moment1 = 0
moment2 = 0
for t in range(1, num_steps + 1):  # Start at t = 1
	dw = compute_gradient(w)
	moment1 = beta1 * moment1 + (1 - beta1) * dw
	moment2 = beta2 * moment2 + (1 - beta2) * dw * dw
	moment1_unbias = moment1 / (1 - beta1 ** t)              # correct the bias toward zero at early steps
	moment2_unbias = moment2 / (1 - beta2 ** t)
	w -= learning_rate * moment1_unbias / (moment2_unbias.sqrt() + 1e-7)
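
In practice these update rules are available through torch.optim; a short sketch (the linear model and random data here are placeholders, not the course's setup):

import torch

model = torch.nn.Linear(100, 10)
opt_sgd = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)            # SGD + Momentum
opt_rms = torch.optim.RMSprop(model.parameters(), lr=1e-3, alpha=0.99)          # RMSProp
opt_adam = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))    # Adam (with bias correction)

x, y = torch.randn(64, 100), torch.randn(64, 10)
loss = torch.nn.functional.mse_loss(model(x), y)
opt_adam.zero_grad()
loss.backward()
opt_adam.step()                                                                  # one Adam update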

Overview:

flowchart TB
    %% Define node styles
    classDef default fill:#f0f0f0,stroke:#333,stroke-width:2px
    classDef highlight fill:#f9f,stroke:#333,stroke-width:2px
    classDef subgraphStyle fill:#fff,stroke:#ddd,stroke-width:2px

    subgraph First["First Steps"]
        direction TB
        B[Whole Batch] -->|no| M[Momentum]
        S[SGD] -->|zero grad| M
    end

    subgraph Second["Adaptive Methods"]
        direction TB
        A[AdaGrad] -->|accumulate| R[RMSProp]
    end

    subgraph Last["Modern Optimizers"]
        direction TB
        AD[Adam] -->|bias correction| AB[Adam+Bias]
    end

    %% Connections between subgraphs
    M -->|steep| A
    R -->|improve| AD

    %% Apply styles
    class S highlight
    class R highlight
    class AB highlight
    class First,Second,Last subgraphStyle

Practice

Let's define a new operation, the sigmoid function:

def sigmoid(x):
	return 1.0 / (1.0 + (-x).exp())

Then initialize the parameters and run the training loop:

import torch
N, D_in, H, D_out = 64, 1000, 100, 10

x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
w1 = torch.randn(D_in, H, requires_grad=True) 
w2 = torch.randn(H, D_out, requires_grad=True)

learning_rate = 1e-6 
for t in range(500):
	y_pred = sigmoid(x.mm(w1)).mm(w2)
	loss = (y_pred - y).pow(2).sum()
	
	loss.backward() 
	if t % 50 == 0:
		print(t, loss.item())
	
	with torch.no_grad():
		w1 -= learning_rate * w1.grad 
		w2 -= learning_rate * w2.grad 
		w1.grad.zero_()
		w2.grad.zero_()

Then we define a new autograd operator:

class Sigmoid(torch.autograd.Function):
	@staticmethod
	def forward(ctx, x):
		y = 1.0 / (1.0 + (-x).exp())
		ctx.save_for_backward(y)
		return y

	@staticmethod
	def backward(ctx, grad_y):
		y, = ctx.saved_tensors             # the saved forward output, so sigmoid isn't recomputed
		grad_x = grad_y * y * (1.0 - y)    # sigmoid'(x) = y * (1 - y)
		return grad_x

def sigmoid(x):
	return Sigmoid.apply(x)
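
A short usage check of the wrapper (it assumes the imports and definitions above); in the earlier training loop, y_pred = sigmoid(x.mm(w1)).mm(w2) would now route its gradient through Sigmoid.backward.

x = torch.randn(4, 3, requires_grad=True)
y = sigmoid(x)
y.sum().backward()
print(x.grad)                              # elementwise sigmoid'(x)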