LIS640

Workflow

  1. Data Collection
  2. Data Processing & Normalization: for images, calculate the mean and std (see the sketch after this list).
  3. Model Architecture Selection: e.g., a Convolutional Neural Network or ResNet
  4. Defining a Loss Function
  5. Optimization Algorithm: Common optimizers include Stochastic Gradient Descent (SGD), Adam, and RMSprop
  6. Training the Model
  7. Evaluation and Validation
  8. Hyperparameter Tuning
  9. Regularization and Overfitting Prevention
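
A minimal sketch of step 2 (normalization), assuming the data is a batch of RGB images stored as an (N, C, H, W) tensor; the random batch below is a stand-in for a real dataset.

import torch

images = torch.rand(128, 3, 32, 32)                   # stand-in batch of RGB images
mean = images.mean(dim=(0, 2, 3))                     # per-channel mean, shape (3,)
std = images.std(dim=(0, 2, 3))                       # per-channel std, shape (3,)
normalized = (images - mean[None, :, None, None]) / std[None, :, None, None]
print(mean, std)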

Details

Two Types of Methods

K-Nearest Neighbor (KNN): non-parametric; it does not learn a model. Nearest Neighbor (NN) uses a distance metric to compare images:

$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

Nearest Neighbor Classifier (NNC): memorize the training set and predict the label of an unknown image by finding its nearest neighbor in the training set and returning that neighbor's label.
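
A minimal sketch of such a classifier, assuming flattened image vectors and integer labels; the random data below is only for illustration.

import numpy as np

class NearestNeighbor:
    def train(self, X, y):
        # "Training" is just memorizing the whole training set.
        self.X_train, self.y_train = X, y

    def predict(self, X):
        preds = np.zeros(X.shape[0], dtype=self.y_train.dtype)
        for i in range(X.shape[0]):
            # L2 distance from the i-th test image to every training image
            dists = np.sqrt(np.sum((self.X_train - X[i]) ** 2, axis=1))
            preds[i] = self.y_train[np.argmin(dists)]  # label of the nearest neighbor
        return preds

X_train = np.random.rand(500, 3 * 32 * 32)             # 500 flattened "images"
y_train = np.random.randint(0, 10, size=500)           # 10 fake class labels
clf = NearestNeighbor()
clf.train(X_train, y_train)
print(clf.predict(np.random.rand(3, 3 * 32 * 32)))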

Deep Neural Network(DNN):

Parametric: it learns a model by optimizing the weights of the network. Basic: Linear Classifier $f(x, W) = Wx$, where $f$ is (class,), $W$ is (class, pixel), and $x$ is (pixel,). Testing: find the best weights $W$ for $f$, and note that $f(cx, W) = W(cx) = c \times f(x, W)$. Limitation: it can only classify linearly separable data.
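
A small shape check of the linear classifier, assuming 10 classes and 3×32×32 images flattened to pixel vectors; the random weights are illustrative only.

import numpy as np

num_classes, num_pixels = 10, 3 * 32 * 32
W = np.random.randn(num_classes, num_pixels) * 0.01   # (class, pixel)
x = np.random.randn(num_pixels)                       # (pixel,) -- one flattened image
scores = W @ x                                        # f(x, W) = Wx, shape (class,)
print(scores.shape)                                   # (10,)

c = 2.0
assert np.allclose(W @ (c * x), c * (W @ x))          # f(cx, W) = c * f(x, W)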

Testing Can we add more layers to deal with non-linear data? Ans: No. $y = W_1 W_2 x$ is still linear.
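
A quick numerical check of that answer: two stacked linear layers collapse into a single linear layer $W = W_1 W_2$ (the sizes below are arbitrary).

import numpy as np

W1, W2 = np.random.randn(10, 50), np.random.randn(50, 100)
x = np.random.randn(100)
assert np.allclose(W1 @ (W2 @ x), (W1 @ W2) @ x)      # same map as one linear layer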

Activation Function

Testing Know the different activation functions, be able to choose the best one, and be able to calculate the gradient of each activation function.

import torch 
import torch.autograd as autograd

class SigmoidFunction(autograd.Function):
	@staticmethod
	def forward(ctx, input):
		ctx.save_for_backward(input)          # stash the input for the backward pass
		return 1 / (1 + torch.exp(-input))

	@staticmethod
	def backward(ctx, grad_output):
		input, = ctx.saved_tensors
		sigmoid_output = 1 / (1 + torch.exp(-input))
		# d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)), scaled by the incoming gradient
		grad_input = grad_output * sigmoid_output * (1 - sigmoid_output)
		return grad_input
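
A short usage sketch: call the custom op through .apply and check the backward pass numerically (torch.autograd.gradcheck expects double-precision inputs).

x = torch.randn(5, dtype=torch.double, requires_grad=True)
y = SigmoidFunction.apply(x)
y.sum().backward()
print(x.grad)                                                  # sigmoid'(x) for each element
print(torch.autograd.gradcheck(SigmoidFunction.apply, (x,)))   # expected: True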

Advanced: use a Deep Neural Network to handle non-linear data. Neural Network: before, $f(x) = Wx + b$; now, $f(x) = W_2 \max(0, W_1 x + b_1) + b_2$ (two layers).
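
A minimal sketch of that two-layer form; the sizes (3072 inputs, 100 hidden units, 10 classes) are assumptions for illustration.

import torch

D_in, H, D_out = 3 * 32 * 32, 100, 10
W1, b1 = torch.randn(H, D_in), torch.randn(H)
W2, b2 = torch.randn(D_out, H), torch.randn(D_out)

x = torch.randn(D_in)                      # one flattened image
h = torch.clamp(W1 @ x + b1, min=0)        # ReLU: max(0, W1 x + b1)
f = W2 @ h + b2                            # class scores, shape (D_out,)
print(f.shape)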

Testing What if we stack two layers without an activation function in between? Ans: Then the function is still linear: $f(x) = W_1(W_2 x + b_2) + b_1 = W_1 W_2 x + W_1 b_2 + b_1$.

Convolutional Neural Network (CNN)

Fully Connected Layer: it does not preserve the 3D structure of the data; it only operates on flattened 1D vectors.

Convolutional Layer: convolve a filter with the image (e.g. 3×32×32). Testing Choose a filter (e.g. 3×5×5), slide it over the image taking a dot product at each location, and map it to a new activation layer (e.g. 1×28×28).
In general (stride 1, no padding), the output size is $W - k + 1$.

Testing What if we stack two convolutional layers? Ans: Their composition is still linear. Therefore, we need to add an activation function after each convolutional layer.

Add Padding

Receptive Field: the receptive field is the size of the filter. Strided convolution moves the filter by more than one pixel at a time. In general:

  - Input: $W$
  - Filter: $k$
  - Stride: $s$
  - Padding: $p$
  - Output: $(W - k + 2p)/s + 1$

Refer to the Convolution Summary.
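
A small check of this formula against PyTorch, using the example sizes from above (a 3×32×32 image and a 5×5 filter).

import torch
import torch.nn as nn

W, k, s, p = 32, 5, 1, 0
print((W - k + 2 * p) // s + 1)            # 28

conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=k, stride=s, padding=p)
x = torch.randn(1, 3, W, W)                # one 3x32x32 image with a batch dimension
print(conv(x).shape)                       # torch.Size([1, 1, 28, 28])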

  1. Mean: $\mu_j = \frac{1}{N}\sum_{i=1}^{N} x_{i,j}$

  2. Variance: $\sigma_j^2 = \frac{1}{N}\sum_{i=1}^{N} (x_{i,j} - \mu_j)^2$

  3. Normalized Data: $\hat{x}_{i,j} = \frac{x_{i,j} - \mu_j}{\sqrt{\sigma_j^2 + \epsilon}}$
    Why do we need BN?
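
A minimal sketch of these three formulas for a batch of feature vectors of shape (N, D); eps plays the role of $\epsilon$ above.

import torch

def batch_norm(x, eps=1e-5):
    mu = x.mean(dim=0)                     # per-feature mean, shape (D,)
    var = x.var(dim=0, unbiased=False)     # per-feature variance, shape (D,)
    return (x - mu) / torch.sqrt(var + eps)

x = torch.randn(64, 100)
x_hat = batch_norm(x)
print(x_hat.mean(dim=0).abs().max().item(), x_hat.std(dim=0).mean().item())  # ~0 and ~1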

Batch Normalization is usually inserted after Fully Connected (FC) or Convolutional layers, and before the nonlinearity.

graph LR

id1(FC) --> id2(BN) --> id3(tanh) --> id4(FC) --> id5(BN)
style id2 fill:#f9f
style id5 fill:#f9f

Testing Convolution Summary

Testing AlexNet !!!

  1. Loss Function: a loss function tells us how good our model is.

Testing Compare the loss functions and choose the best one.

#cross entropy (binary, with a sigmoid applied to the raw score y_pred)
import numpy as np

def cross_entropy_loss(y_pred, y_target):
    p = 1 / (1 + np.exp(-y_pred))                      # predicted probability of class 1
    loss = np.mean(-y_target * np.log(p) - (1 - y_target) * np.log(1 - p))
    return loss
#svm (hinge loss)
def svm_loss(y_pred, y_target):
    y_target = 2 * y_target - 1                        # map labels {0, 1} -> {-1, +1}
    loss = np.clip(1 - y_target * y_pred, a_min=0, a_max=None).mean()
    return loss

Then compare the two loss functions:

W_1, b_1 = np.array([-1.9, 1.1]), np.array([0.1])
Linear_Classifier.param_init(W_1, b_1)
y_pred_1 = Linear_Classifier.predict(x_test)
ce_loss_1 = cross_entropy_loss(y_pred_1, y_test)
hinge_loss_1 = svm_loss(y_pred_1, y_test)
print('Cross Entropy and SVM loss for the Group 1: {}, {}'.format(ce_loss_1, hinge_loss_1))
  1. Optimization: minimize the loss function.
    There are two ways to do the optimization: full-batch Gradient Descent and mini-batch Stochastic Gradient Descent (SGD).

Testing Why are we not using the whole batch? Ans: if N is large, the full sum over all training examples is expensive to compute at every step.
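
A runnable sketch of mini-batch SGD on a toy least-squares problem, showing how a mini-batch is sampled instead of summing over all N examples; the sizes and learning rate are assumptions.

import numpy as np

np.random.seed(0)
N, D = 10_000, 20
X = np.random.randn(N, D)
w_true = np.random.randn(D)
y = X @ w_true + 0.1 * np.random.randn(N)

w = np.zeros(D)
learning_rate, batch_size, num_steps = 0.1, 32, 500
for t in range(num_steps):
    idx = np.random.choice(N, batch_size, replace=False)   # a mini-batch, not the whole dataset
    Xb, yb = X[idx], y[idx]
    dw = 2 * Xb.T @ (Xb @ w - yb) / batch_size              # gradient of the mean squared error on the batch
    w -= learning_rate * dw
print(np.linalg.norm(w - w_true))                           # roughly recovers w_true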

Problems of SGD: the updates are noisy, and progress zig-zags on poorly conditioned objectives. Start from plain SGD:

#SGD
for t in range(num_steps):
	dw = compute_gradient(w)
	w -= learning_rate * dw

and improve it by adding momentum:

#SGD+Momentum
v = 0
for t in range(num_steps):
	dw = compute_gradient(w)
	v = rho * v + dw                    # running "velocity" of past gradients
	w -= learning_rate * v

#AdaGrad
grad_squared = 0
for t in range(num_steps):
	dw = compute_gradient(w)
	grad_squared += dw * dw             # accumulate squared gradients
	w -= learning_rate * dw / (grad_squared.sqrt() + 1e-7)

Testing What happens with AdaGrad? Ans: Progress along "steep" directions is damped; progress along "flat" directions is accelerated

#RMSProp
grad_squared = 0
for t in range(num_steps):
	dw = compute_gradient(w)
	grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dw * dw   # leaky accumulation
	w -= learning_rate * dw / (grad_squared.sqrt() + 1e-7)

#Adam (almost: no bias correction)
moment1 = 0
moment2 = 0
for t in range(1, num_steps + 1):  # Start at t = 1
	dw = compute_gradient(w)
	moment1 = beta1 * moment1 + (1 - beta1) * dw             # momentum-like first moment
	moment2 = beta2 * moment2 + (1 - beta2) * dw * dw        # RMSProp-like second moment
	w -= learning_rate * moment1 / (moment2.sqrt() + 1e-7)

#Adam (full: with bias correction)
moment1 = 0
moment2 = 0
for t in range(1, num_steps + 1):  # Start at t = 1
	dw = compute_gradient(w)
	moment1 = beta1 * moment1 + (1 - beta1) * dw
	moment2 = beta2 * moment2 + (1 - beta2) * dw * dw
	moment1_unbias = moment1 / (1 - beta1 ** t)              # correct the bias toward zero at early steps
	moment2_unbias = moment2 / (1 - beta2 ** t)
	w -= learning_rate * moment1_unbias / (moment2_unbias.sqrt() + 1e-7)
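
In practice these update rules are available through torch.optim; a short sketch (the linear model and random data here are placeholders, not the course's setup):

import torch

model = torch.nn.Linear(100, 10)
opt_sgd = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)            # SGD + Momentum
opt_rms = torch.optim.RMSprop(model.parameters(), lr=1e-3, alpha=0.99)          # RMSProp
opt_adam = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))    # Adam (with bias correction)

x, y = torch.randn(64, 100), torch.randn(64, 10)
loss = torch.nn.functional.mse_loss(model(x), y)
opt_adam.zero_grad()
loss.backward()
opt_adam.step()                                                                  # one Adam update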

Overview:

flowchart TB
    %% Define node styles
    classDef default fill:#f0f0f0,stroke:#333,stroke-width:2px
    classDef highlight fill:#f9f,stroke:#333,stroke-width:2px
    classDef subgraphStyle fill:#fff,stroke:#ddd,stroke-width:2px

    subgraph First["First Steps"]
        direction TB
        B[Whole Batch] -->|no| M[Momentum]
        S[SGD] -->|zero grad| M
    end

    subgraph Second["Adaptive Methods"]
        direction TB
        A[AdaGrad] -->|accumulate| R[RMSProp]
    end

    subgraph Last["Modern Optimizers"]
        direction TB
        AD[Adam] -->|bias correction| AB[Adam+Bias]
    end

    %% Connections between subgraphs
    M -->|steep| A
    R -->|improve| AD

    %% Apply styles
    class S highlight
    class R highlight
    class AB highlight
    class First,Second,Last subgraphStyle

Practice

Let's define a new operation, the sigmoid function:

def sigmoid(x):
	return 1.0 / (1.0 + (-x).exp())

Then initialize the parameters and run the training loop:

import torch
N, D_in, H, D_out = 64, 1000, 100, 10

x = torch.randn(N, D_in)
y = torch.randn(N, D_out)
w1 = torch.randn(D_in, H, requires_grad=True) 
w2 = torch.randn(H, D_out, requires_grad=True)

learning_rate = 1e-6 
for t in range(500):
	y_pred = sigmoid(x.mm(w1)).mm(w2)
	loss = (y_pred - y).pow(2).sum()
	
	loss.backward() 
	if t % 50 == 0:
		print(t, loss.item())
	
	with torch.no_grad():
		w1 -= learning_rate * w1.grad 
		w2 -= learning_rate * w2.grad 
		w1.grad.zero_()
		w2.grad.zero_()

Then we define a new autograd operator:

class Sigmoid(torch.autograd.Function):
	@staticmethod
	def forward(ctx, x):
		y = 1.0 / (1.0 + (-x).exp())
		ctx.save_for_backward(y)
		return y

	@staticmethod
	def backward(ctx, grad_y):
		y, = ctx.saved_tensors             # the saved forward output, so sigmoid isn't recomputed
		grad_x = grad_y * y * (1.0 - y)    # sigmoid'(x) = y * (1 - y)
		return grad_x

def sigmoid(x):
	return Sigmoid.apply(x)
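
A short usage check of the wrapper (it assumes the imports and definitions above); in the earlier training loop, y_pred = sigmoid(x.mm(w1)).mm(w2) would now route its gradient through Sigmoid.backward.

x = torch.randn(4, 3, requires_grad=True)
y = sigmoid(x)
y.sum().backward()
print(x.grad)                              # elementwise sigmoid'(x)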