Stats 340

Discrete Distribution

$$E[X] = 0\,p_1 + 1\,p_2 + \cdots + n\,p_{n+1}$$

and

$$\mathrm{Var}[X] = \sum_x (x - \mu)^2 P(X = x)$$
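
These sums can be computed directly in R. A minimal sketch with a made-up pmf (the outcomes and probabilities below are illustrative, not from the notes):

# Hypothetical pmf: outcomes 0, 1, 2, 3 with their probabilities
x <- c(0, 1, 2, 3)
p <- c(0.1, 0.2, 0.3, 0.4)

# Expectation: sum of x * P(X = x)
EX <- sum(x * p)

# Variance: sum of (x - mu)^2 * P(X = x)
VarX <- sum((x - EX)^2 * p)

cat("E[X] =", EX, " Var[X] =", VarX, "\n")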

Uniform Distribution

For interval $[a, b]$, we have $E[X] = \frac{a+b}{2}$.
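
A small simulation sketch (the endpoints 2 and 8 are made up) comparing the sample mean of uniform draws to $(a+b)/2$:

# Simulate from Uniform(a = 2, b = 8); theoretical mean is (2 + 8)/2 = 5
a <- 2; b <- 8
x <- runif(10000, min = a, max = b)
cat("Sample mean:", mean(x), " Theoretical mean:", (a + b)/2, "\n")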

Bernoulli Distribution

Definition

A Bernoulli distribution is a discrete probability distribution for a random variable which takes the value 1 with success probability of $p$ and the value 0 with failure probability of $q = 1 - p$. Therefore, it can be considered as a simple case of the binomial distribution where a single experiment/trial is conducted. It is also a special case of the two-point distribution, for which the possible outcomes need not be 0 and 1.

Probability Mass Function

The probability mass function (pmf) of a Bernoulli random variable X is defined as:

$$P(X = k) = \begin{cases} p & \text{if } k = 1, \\ 1 - p & \text{if } k = 0. \end{cases}$$

Expectation and Variance

The expectation (mean) of a Bernoulli random variable $X$ is $p$, and the variance is $p(1-p)$.

Example

Let's consider a simple example of flipping a coin. If we define success as the coin landing heads up, and if the coin is fair, then the success probability $ p $ is 0.5. The Bernoulli distribution can model this experiment.

R Code

# Define the probability of success
p <- 0.5

# Simulate one Bernoulli trial
trial <- rbinom(1, size = 1, prob = p)

# Print the result
cat("The result of the Bernoulli trial is:", trial, "\n")

Binomial Distribution

Definition

The Binomial Distribution is a discrete probability distribution that describes the number of successes in a fixed number of independent trials, with each trial having only two possible outcomes (commonly referred to as "success" and "failure"). The parameters of a binomial distribution are n and p, where n represents the number of trials, and p is the probability of success on any given trial.

General Probability Equation

The probability of getting exactly k successes in n trials in a binomial distribution is given by the formula:

$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}$$

where $\binom{n}{k}$ is the binomial coefficient, $n$ is the number of trials, $k$ is the number of successes, and $p$ is the probability of success on each trial.

Example

Suppose we want to find the probability of getting exactly 3 heads in 5 tosses of a fair coin.

Here, the number of trials (n) is 5 (since the coin is tossed 5 times), and the probability of success (p), which in this case is getting a head, is 0.5 (since the coin is fair).

R Code

# Define the parameters
n <- 5       # Number of trials
p <- 0.5     # Probability of success on each trial
x <- 3       # Number of successes

# Calculate the probability
probability <- dbinom(x, size = n, prob = p)

# Print the result
cat("The probability of getting exactly 3 heads in 5 tosses of a fair coin is:", probability, "\n")

In R, the rbinom function is used to generate random variates from a binomial distribution. This can be useful for simulation purposes or to understand the distribution of outcomes under specified conditions (number of trials and probability of success).

Example

Let's simulate the outcome of 10 experiments, each consisting of 5 tosses of a fair coin, and count the number of heads (successes) in each experiment.

R Code

# Define the parameters
n <- 5       # Number of trials in each experiment
p <- 0.5     # Probability of success on each trial
experiments <- 10  # Number of experiments

# Generate random variates
random_variates <- rbinom(experiments, size = n, prob = p)

# Print the results
cat("Random variates from 10 experiments of 5 coin tosses each:", random_variates, "\n")

Geometric Distribution

Definition

The Geometric Distribution is a discrete probability distribution that models the number of trials needed to achieve the first success in a series of independent and identically distributed Bernoulli trials, where each trial has only two possible outcomes (success or failure). The parameter of a geometric distribution is p, the probability of success on each trial.

General Probability Equation

The probability that the first success occurs on the kth trial in a geometric distribution is given by the formula:

$$P(X = k) = (1 - p)^{k-1} p$$

where $k$ is the trial on which the first success occurs and $p$ is the probability of success on each trial.

Example

Suppose we want to find the probability that the first success (e.g., first head) occurs on the 4th toss of a fair coin.
Here, the probability of success (p), which is getting a head, is 0.5 (since the coin is fair).

R Code

# Define the parameters
p <- 0.5     # Probability of success on each trial
x <- 4       # The trial on which the first success occurs

# Calculate the probability
probability <- dgeom(x - 1, prob = p) # x - 1 because dgeom counts the number of failures before the first success

# Print the result
cat("The probability of the first head occurring on the 4th toss is:", probability, "\n")

The rgeom function in R is used to generate random variates from a geometric distribution. This function can simulate the number of trials required to obtain the first success in a series of independent trials, each with the same probability of success.

Example

Let's simulate 10 scenarios to find out how many trials are needed to achieve the first success (e.g., first head) when tossing a fair coin.
In this simulation, the probability of success (getting a head) on each trial is 0.5.

R Code

# Define the parameters
p <- 0.5     # Probability of success on each trial
scenarios <- 10  # Number of scenarios to simulate

# Generate random variates
random_variates <- rgeom(scenarios, prob = p) + 1 # +1 because rgeom returns the number of failures before the first success

# Print the results
cat("Number of trials needed for the first success in 10 scenarios:", random_variates, "\n")

Poisson Distribution

Definition

The Poisson Distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space, assuming that these events occur with a known constant mean rate and independently of the time since the last event. The parameter of the Poisson distribution is λ (lambda), which represents the average rate (mean) of occurrences in a fixed interval.

General Probability Equation

The probability of observing k events in a fixed interval is given by the formula:

$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$

where $k$ is the number of events observed in the interval, $\lambda$ is the average number of events per interval, and $e$ is Euler's number.

Example

Suppose we want to find the probability of receiving 2 calls in a 1-hour period at a call center, given that the call center receives an average of 5 calls per hour.

Here, λ=5 calls per hour, and we are interested in k=2 calls.

R Code for the Example

# Define the parameters
lambda <- 5  # Average rate (mean) of occurrences
k <- 2       # Number of events

# Calculate the probability
probability <- dpois(k, lambda)

# Print the result
cat("The probability of receiving 2 calls in a 1-hour period is:", probability, "\n")

Example for rpois

Let's simulate the number of calls received in 10 different 1-hour periods at the call center, with an average rate of 5 calls per hour.

R Code for rpois

# Define the parameters
lambda <- 5  # Average rate (mean) of occurrences
periods <- 10  # Number of periods to simulate

# Generate random variates
random_variates <- rpois(periods, lambda)

# Print the results
cat("Number of calls received in 10 different 1-hour periods:", random_variates, "\n")

Long-run Averages

More formally, if X is a discrete random variable, we define its expectation to be $$ \mathbb{E}X = \sum_{k} k \Pr[X = k], $$ where the sum is over all k such that Pr[X=k]>0.

Independence

Definition

In probability theory, two events are said to be independent if the occurrence of one event does not affect the probability of occurrence of the other event. In other words, events A and B are independent if and only if:

$$P(A \cap B) = P(A)P(B)$$

This can also be extended to random variables. Two random variables X and Y are independent if the occurrence of a particular value of X does not affect the probability distribution of Y, and vice versa.

Conditional Probability

Definition

Conditional probability is a measure of the probability of an event occurring given that another event has already occurred. If the event of interest is A and the event B has occurred, the conditional probability of A given B is written as P(A|B), and is calculated by the formula:

$$P(A|B) = \frac{P(A \cap B)}{P(B)},$$

provided that P(B)>0.

Bayes' Rule

Definition

Bayes' Theorem is a way of finding a probability when we know certain other probabilities. The formula is:

$$P(A|B) = \frac{P(B|A)P(A)}{P(B)} = \frac{P(B|A)P(A)}{P(B|A)P(A) + P(B|A^c)P(A^c)}$$

where the denominator comes from the law of total probability:

$$P(B) = P(B|A)P(A) + P(B|A^c)P(A^c)$$

This theorem allows us to update our prior beliefs (the probability of A) with new evidence (the probability of B given A) to obtain a revised probability (the probability of A given B).
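
A numeric sketch of this update in R. The prevalence and test accuracies below are made-up numbers chosen only to illustrate the calculation:

# Hypothetical disease-testing example (all numbers are illustrative)
p_A      <- 0.01   # P(A): prior probability of having the disease
p_B_A    <- 0.95   # P(B|A): probability of a positive test given disease
p_B_notA <- 0.05   # P(B|A^c): probability of a positive test given no disease

# Law of total probability for P(B)
p_B <- p_B_A * p_A + p_B_notA * (1 - p_A)

# Bayes' rule: P(A|B)
p_A_B <- p_B_A * p_A / p_B
cat("P(A|B) =", p_A_B, "\n")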

Expectation

If X is a random variable, then

E(aX+b)=aEX+b.

and

E(X+Y)=EX+EY.

Variance

$$\mathrm{Var}(X) = E(X^2) - (EX)^2$$

Expanding $\mathrm{Var}(X+Y) = E[(X+Y) - E(X+Y)]^2 = E(X - EX)^2 + 2E[(X - EX)(Y - EY)] + E(Y - EY)^2$, the first and last terms are the variances of X and Y:

$$\mathrm{Var}\,X = E(X - EX)^2, \qquad \mathrm{Var}\,Y = E(Y - EY)^2.$$

So

$$\mathrm{Var}(X+Y) = \mathrm{Var}\,X + 2E[(X - EX)(Y - EY)] + \mathrm{Var}\,Y.$$

The middle term might be familiar to you: it is (two times) the covariance of X and Y, often written

$$\mathrm{Cov}(X,Y) = E[(X - EX)(Y - EY)].$$

Now, if Cov(X,Y)=0, then

$$\mathrm{Var}(X \pm Y) = \mathrm{Var}\,X + \mathrm{Var}\,Y.$$

Correlation

$$\rho_{XY} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y}$$
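
A short simulation sketch (the variables below are simulated just for illustration) checking the covariance, the correlation formula, and the $\mathrm{Var}(X+Y)$ identity:

set.seed(1)
x <- rnorm(10000)
y <- 0.5 * x + rnorm(10000)   # y is correlated with x by construction

# Covariance and correlation
cov_xy <- cov(x, y)
rho    <- cov_xy / (sd(x) * sd(y))   # same as cor(x, y)

# Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)
cat("cor(x, y):", cor(x, y), "vs manual:", rho, "\n")
cat("var(x + y):", var(x + y), "vs identity:", var(x) + var(y) + 2 * cov_xy, "\n")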

Type I and Type II Errors

|           | H0 true                      | H0 false                      |
|-----------|------------------------------|-------------------------------|
| Accept H0 | True Negative ✅             | False Negative ❌ (Type II) β |
| Reject H0 | False Positive ❌ (Type I) α | True Positive ✅              |

Example

|                      | H0 true (Negative) | H0 false (Positive) |
|----------------------|--------------------|---------------------|
| Accept H0 (Negative) | 41                 | 8                   |
| Reject H0 (Positive) | 19                 | 32                  |

Type I Error: $P(\text{reject } H_0 \mid H_0 \text{ true}) = \frac{19}{60}$, Type II Error: $P(\text{accept } H_0 \mid H_0 \text{ false}) = \frac{8}{40}$
Power: $P(\text{reject } H_0 \mid H_A) = \frac{32}{40}$
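
These rates can be computed from the counts in the table above (a small sketch reusing the same numbers):

# Counts from the table above
TN <- 41; FN <- 8    # accept H0: H0 true / H0 false
FP <- 19; TP <- 32   # reject H0: H0 true / H0 false

type_I  <- FP / (TN + FP)   # P(reject H0 | H0 true)  = 19/60
type_II <- FN / (FN + TP)   # P(accept H0 | H0 false) =  8/40
power   <- TP / (FN + TP)   # P(reject H0 | H0 false) = 32/40

cat("Type I:", type_I, " Type II:", type_II, " Power:", power, "\n")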

Relations

  1. Increasing α (making the test more liberal in rejecting the null hypothesis) can decrease β (reducing the risk of a Type II error), thus increasing the power of the test.
  2. Decreasing α (being more conservative about rejecting the null hypothesis) increases β (higher risk of missing an effect), thus reducing the power of the test.
  3. To increase power (reduce β) without changing α, increase the sample size or reduce the variability of the measurements.

Monte Carlo Method

To (approximately) compute $\Pr[E]$, we repeat the experiment many times and count the fraction of repetitions in which $E$ occurs.

Example

Note that we have specified the standard deviation to be $\sqrt{3}$: the variance is $\sigma^2 = 3$, so the standard deviation is $\sigma = \sqrt{3}$.
Our event of interest is $E = \{0 \le X \le 3\}$. Monte Carlo says that to estimate $\Pr[E]$, we repeat our experiment lots of times and count what fraction of the time the event $E$ happens.
So we should generate lots of copies of $X \sim N(\mu = 1, \sigma^2 = 3)$ and count how often $0 \le X \le 3$. Let's do just that.

# A function that checks whether or not the event happened
event_E_happened <- function( x ) {
  if( 0 <= x & x <= 3 ) {
    return( TRUE ) # The event happened
  } else {
    return( FALSE ) # The event DIDN'T happen
  }
}

# Now MC says that we should generate lots of copies of X...
NMC <- 1000; # 1000 seems like "a lot".
results <- rep( 0, NMC ); # We're going to record outcomes here.
for( i in 1:NMC) {
  # Generate a draw from the normal, and then...
  X <- rnorm( 1, mean=1, sd=sqrt(3) );
  # ...record whether or not our event of interest happened.
  results[i] <- event_E_happened(X);
}
# Now, compute what fraction of our trials were "successes" (i.e., E happened)
sum(results)/NMC

As the number of replications increases, the standard error decreases, which can lead to a smaller p-value and a higher probability of rejecting the null hypothesis when it is false, thus improving the power of the test.

Power of Hypothesis Test

The power of a hypothesis test is the probability of rejecting the null hypothesis when it is false.
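
For instance, base R's power.t.test can compute the power of a two-sample t-test analytically (a sketch; the sample size, effect size, and sd below are made up):

# Power of a two-sample t-test with n = 30 per group,
# true mean difference 0.5, sd 1, and alpha = 0.05
power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = 0.05,
             type = "two.sample", alternative = "two.sided")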

CDF, PDF, Inverse CDF, Probability

Probability Density Function (PDF)

![[4_continuous_probability_density_functions.png|300]]
For a continuous random variable there is no probability mass at any single point: $P(X = a) = 0$.

Cumulative Distribution Function (CDF)

[Figure: CDF curves for λ = 0.5, 1.0, and 1.5]

Inverse CDF

Also known as the quantile function. For example, if a CDF is $F(x) = x^2$ (on $[0, 1]$), then the inverse CDF at 0.5, $q_X(0.5)$, solves $F(x) = x^2 = 0.5$, so $q_X(0.5) = \sqrt{0.5}$.
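
In R the inverse CDF is the q* family of functions (a brief sketch using a few built-in distributions):

# Inverse CDF (quantile function) examples
qnorm(0.975, mean = 0, sd = 1)   # ~1.96: the z value with 97.5% of mass below it
qunif(0.5, min = 0, max = 1)     # 0.5: median of Uniform(0, 1)
qexp(0.5, rate = 1)              # ~0.693: median of Exponential(rate = 1)

# Check: applying the CDF to the quantile returns the original probability
pnorm(qnorm(0.975))              # 0.975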

Law of Large Numbers

According to the LLN, as the number of trials or sample size increases, the sample mean (average of the sample outcomes) will converge to the expected value (true mean) of the population from which the samples are drawn, provided that the expected value exists.

Convergence of Sample Mean to True Mean

The Law of Large Numbers states that as the sample size n grows, the sample mean X¯n gets closer to the population mean μ, under the assumption that the population mean is finite.

$$\lim_{n \to \infty} P\left(|\bar{X}_n - \mu| > \epsilon\right) = 0$$

This means that for a sufficiently large n, the sample mean X¯n is a good approximation of the population mean μ.
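
A quick simulation sketch of the LLN (the Exponential population and true mean of 2 are arbitrary choices): the running sample mean settles down near the true mean as $n$ grows.

set.seed(340)
mu <- 2                          # true mean of the population
x  <- rexp(10000, rate = 1/mu)   # Exponential draws with mean 2
running_mean <- cumsum(x) / seq_along(x)

# Compare the running mean at a few sample sizes to the true mean mu = 2
running_mean[c(10, 100, 1000, 10000)]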

Scaling of the Variance of the Sample Mean with Sample Size

The variance of the sample mean decreases as the sample size increases. Specifically, if σ2 is the variance of the population from which the samples are drawn, the variance of the sample mean σX¯2 is given by:

$$\sigma_{\bar{X}}^2 = \frac{\sigma^2}{n}$$

As the sample size increases, the variance of the sample mean decreases, making the sample mean a more precise estimator of the population mean.
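
A small simulation sketch checking the $\sigma^2/n$ scaling (the normal population and sample sizes are chosen only for illustration):

set.seed(1)
sigma2 <- 4   # population variance (sd = 2)

# Empirical variance of the sample mean for two different sample sizes
var_n25  <- var(replicate(5000, mean(rnorm(25,  mean = 0, sd = 2))))
var_n100 <- var(replicate(5000, mean(rnorm(100, mean = 0, sd = 2))))

cat("n = 25:  empirical", var_n25,  "vs sigma^2/n =", sigma2/25,  "\n")
cat("n = 100: empirical", var_n100, "vs sigma^2/n =", sigma2/100, "\n")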

Sample Mean and Variance

Sample mean $\bar{X}$ and variance $s^2$ formulas:

$$\bar{X} = \frac{1}{n}\sum X_i \qquad \text{and} \qquad s^2 = \frac{1}{n-1}\sum (X_i - \bar{X})^2$$

And $E(\bar{X}) = \mu$, $E(s^2) = \sigma^2$, which means that as the sample size $n$ increases, $\bar{X}$ converges to $\mu$ ($\bar{X} \to \mu$ as $n \to \infty$), and likewise $s^2 \to \sigma^2$ as $n \to \infty$.
Note that as $n$ increases, the sample variance $s^2$ does not systematically shrink (it keeps estimating $\sigma^2$), but the variance of the sample mean, $\sigma^2/n$, does decrease with $n$.

Odds

The odds of an event are defined as the ratio of the probability of the event happening to the probability of the event not happening. Mathematically, if $p$ is the probability of the event, the odds $O$ are given by:

$$O = \frac{p}{1-p}$$

Estimation

Estimators

The estimator in this case is the sample mean, which is a method used to estimate the population mean. The formula for the sample mean x¯ as an estimator is:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

where $x_i$ are the observed values in the sample and $n$ is the sample size.

Estimate

If the heights of the 50 students are measured and the resulting values are 160 cm, 165 cm, 170 cm, etc., you would plug these values into the formula to compute the sample mean. This computed value is the estimate of the population mean.

Statistic

In this scenario, the sample mean x¯ is a statistic—it is a value calculated from the sample data.

Expectation and Variance

Since $E\bar{X} = p$, we say $\bar{X}$ is an unbiased estimator of $p$. Recall $\mathrm{Var}\,Z = E(Z - EZ)^2$; the variance of the estimator $\hat{p}$ based on a sample of size $n$ is

$$\mathrm{Var}\,\hat{p} = \frac{p(1-p)}{n}$$
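
A simulation sketch (the values of $p$ and $n$ are made up) checking that the variance of $\hat{p}$ is close to $p(1-p)/n$:

set.seed(1)
p <- 0.3; n <- 50

# Each replicate: sample proportion from n Bernoulli(p) trials
p_hat <- replicate(10000, mean(rbinom(n, size = 1, prob = p)))

cat("Empirical Var(p_hat):", var(p_hat),
    " Theoretical p(1-p)/n:", p * (1 - p) / n, "\n")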

The Central Limit Theorem

It explains why many distributions in nature tend to approximate the normal distribution, even if the underlying distribution itself is not normal, provided that the sample size is large enough.

Key Elements of the CLT

The observations should be independent and identically distributed with finite mean and variance, and the normal approximation improves as the sample size grows.

Mathematical Formulation

If $X_1, X_2, \ldots, X_n$ are i.i.d. variables with mean $\mu$ and finite variance $\sigma^2$, then the sample mean:

$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$$

will have a distribution that approaches a normal distribution with mean $\mu$ and variance $\sigma^2/n$ as $n$ increases. Mathematically, this can be written as:

$$\sqrt{n}\,(\bar{X}_n - \mu) \xrightarrow{d} N(0, \sigma^2)$$

where $\xrightarrow{d}$ denotes convergence in distribution. The normalized form is $\frac{\bar{X} - \mu}{\sqrt{\sigma^2/n}}$.
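
A brief simulation sketch of the CLT using a deliberately non-normal population (Exponential with rate 1, chosen for illustration): the standardized sample means look approximately standard normal.

set.seed(340)
n <- 50; mu <- 1; sigma2 <- 1   # Exponential(rate = 1): mean 1, variance 1

# Standardized sample means from a skewed population
z <- replicate(10000, (mean(rexp(n, rate = 1)) - mu) / sqrt(sigma2 / n))

# Should be close to 0, 1, and the N(0,1) quantile 1.96, respectively
c(mean(z), sd(z), quantile(z, 0.975))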

Fact

A higher confidence level demands a wider interval, since the critical value $z$ is larger for a higher confidence level.

Residual Sum of Squares (RSS) or (SSR)

The RSS is calculated by taking the sum of the squares of the residuals (the differences between the observed values and the predicted values). Mathematically, it can be expressed as:

$$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Why RSS

The RSS measures the variation left unexplained by the model, thus providing a direct measure of the model's fit: lower RSS values indicate a model that fits the data more closely. Because the residuals are squared, larger errors are penalized heavily, making RSS sensitive to large errors.

Example

Consider a dataset where we want to predict house prices based on the size of the house (in square meters).

# Fit linear model
model <- lm(house_price ~ house_size, data = data)
# Get summary of the model
summary(model)

Then calculate the RSS

# Calculate RSS
rss <- sum(residuals(model)^2)
print(rss)

Linear Regression

Estimate of β0 and β1

To minimize the loss to get the estimates:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \qquad \text{and} \qquad \hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
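
These formulas can be checked against lm() directly (a sketch using the built-in mtcars data, with mpg regressed on hp chosen only for illustration):

# Manual least-squares estimates vs lm(), using mtcars (mpg ~ hp)
x <- mtcars$hp; y <- mtcars$mpg

beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

c(beta0_hat, beta1_hat)
coef(lm(mpg ~ hp, data = mtcars))   # should match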

Variance of estimates

After fitting, we can find our predicted $\hat{y}_i$, i.e. the $y$ values on the line,

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i,$$

and the model residuals $\hat{\epsilon}_i = y_i - \hat{y}_i$.
Mean Squared Error:

$$\hat{\sigma}^2 = \frac{\mathrm{SSE}}{n-2} = \frac{1}{n-2}\sum_i (y_i - \hat{y}_i)^2$$

The formula for the 95% confidence interval is: $\text{Estimate} \pm (1.96 \times \text{Standard Error})$

Standard Error (SE(β^)) and Residual Standard Error(RSE)

Formula: $\mathrm{SE}(\hat{\beta}_1) = \sqrt{\sigma^2 / \sum_i (X_i - \bar{X})^2}$; for a point estimate of a mean, $\mathrm{SE} = s/\sqrt{n}$; and $\mathrm{RSE} = \sqrt{\sum_i (Y_i - \hat{Y}_i)^2 / (n-2)}$.
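
A sketch verifying these formulas and the 1.96-based interval against R's own output (again using mtcars with mpg ~ hp as an illustrative model):

model <- lm(mpg ~ hp, data = mtcars)
x <- mtcars$hp

# Residual standard error: sqrt( sum(residuals^2) / (n - 2) )
rse <- sqrt(sum(residuals(model)^2) / (nrow(mtcars) - 2))
rse; summary(model)$sigma              # should match

# Standard error of the slope: sqrt( sigma^2 / sum((x - xbar)^2) )
se_beta1 <- sqrt(rse^2 / sum((x - mean(x))^2))
se_beta1; summary(model)$coefficients["hp", "Std. Error"]   # should match

# Approximate 95% CI: estimate +/- 1.96 * SE (confint uses the exact t quantile)
coef(model)["hp"] + c(-1, 1) * 1.96 * se_beta1
confint(model, "hp")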

Linear Regression Assumption

  1. The response variable has a linear relationship with the predictor variables;
  2. The errors (residuals) follow a normal distribution with mean zero;
  3. The errors have constant variance.

Interaction Term

# here (Type * Treatment) is the interaction
model <- lm(response ~ Type * Treatment + conc, data = data)

The Estimate is the fitted coefficient for each term, shown here with its standard error in the second column. For example:

| Term | Estimate | Std. Error |
|------|----------|------------|
| Type | -9.38 | 1.62 |
| Treatment | -3.58 | 1.85 |
| conc | 0.0177 | 0.0022 |
| Type:Treatment | -6.55 | 2.61 |

Proportion of the change

For the proportion of the variation in the response that is explained by the predictors, check the R-squared.

  1. Multiple R-squared: 0.7072
  2. Adjusted R-squared: 0.6923
    Thus, 70.72% of the variation is explained (use the Multiple R-squared here).

Which predictors are the most significant in this model

Check the p-values; the predictor with the smallest p-value is the most significant.

Multiple Regression

Multiple regression is an extension of simple linear regression that allows you to predict an outcome based on multiple predictors.

Multiple Regression Model

In a multiple regression model, the formula looks like this:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n + \epsilon$$

Example

# Fit multiple regression model
model <- lm(house_price ~ house_size + num_bedrooms + age_of_house, data = data)
# Get summary of the model
summary(model)

Model Fit with RSS

We define the residual sum of squares in terms of the fitted coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$:

$$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \left(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)\right)^2$$

R2: R-squared

$$R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$$

or

$$R^2 = 1 - \frac{\sum_i (Y_i - \hat{Y}_i)^2}{\sum_i (Y_i - \bar{Y})^2}$$

where, as a reminder,

$$\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2 \qquad \text{and} \qquad \mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where, in the intercept-only model used for TSS, $\hat{\beta}_0 = \bar{y}$.

MSS

The sum of squares between our model and the “dumbest” (intercept-only) model:

$$\mathrm{MSS} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$

This quantity is often called the model sum of squares (MSS) or the explained sum of squares (ESS).
Since $\mathrm{TSS} = \mathrm{RSS} + \mathrm{MSS}$, we have $R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = \frac{\mathrm{MSS}}{\mathrm{TSS}}$.
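
A sketch computing $R^2$ by hand from TSS, RSS, and MSS and comparing it to summary()'s value (mtcars with mpg ~ hp, used purely for illustration):

model <- lm(mpg ~ hp, data = mtcars)
y     <- mtcars$mpg
y_hat <- fitted(model)

TSS <- sum((y - mean(y))^2)
RSS <- sum((y - y_hat)^2)
MSS <- sum((y_hat - mean(y))^2)

c(TSS = TSS, RSS_plus_MSS = RSS + MSS)          # TSS = RSS + MSS
c(manual = 1 - RSS / TSS, MSS_over_TSS = MSS / TSS,
  from_summary = summary(model)$r.squared)      # all three should agree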

Quadratic Term (Nonlinear transformations)

let’s look in particular at mpg (miles per gallon) and hp (horsepower).

mtc_lm <- lm( mpg ~ 1 + hp, data=mtcars );

and you can get quadratic term

mtc_lm <- lm( mpg ~ 1 + hp + I(hp^2), data=mtcars );

Notice that to get our non-linear term hp^2 into the model, we had to write our formula as mpg ~ 1 + hp + I(hp^2). If we just wrote mpg ~ 1 + hp + hp^2, R would parse hp^2 as just hp. I(...) prevents this and ensures that R doesn’t clobber the expression inside.

Logistic Regression

We do that using the logistic (sigmoid) function:

$$\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{1 + e^z}.$$

Logistic regression in R

pima_logistic <- glm( diabetes ~ 1 + glu, data=Pima.te, family=binomial );
summary(pima_logistic)

Model Coefficients

The odds of an event $E$ with probability $p$ are given by $\mathrm{Odds}(E) = \frac{p}{1-p}$.
So, let’s suppose that we have a logistic regression model with a single predictor, that takes predictor x and outputs a probability

$$p(x) = \frac{1}{1 + \exp\{-(\beta_0 + \beta_1 x)\}}$$

The odds associated with this probability are

$$\mathrm{Odds}(x) = \frac{p(x)}{1 - p(x)} = \frac{1/\left(1 + \exp\{-(\beta_0 + \beta_1 x)\}\right)}{\exp\{-(\beta_0 + \beta_1 x)\}/\left(1 + \exp\{-(\beta_0 + \beta_1 x)\}\right)} = \exp\{\beta_0 + \beta_1 x\}$$

If we look at the log odds associated with our logistic regression model,

$$\text{Log-Odds}(x) = \log \mathrm{Odds}(x) = \log \exp\{\beta_0 + \beta_1 x\} = \beta_0 + \beta_1 x$$

How to predict the log-odds:
Predicted log-odds = Intercept + Pclass2 * 0 + Pclass3 * 0 + Sexmale * 1 + Age * 20,
where the question specifies class = 1 (so the Pclass2 and Pclass3 dummies are 0), male = 1, and age = 20.
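
A sketch of this calculation with made-up coefficient values (the actual Intercept, Pclass, Sexmale, and Age estimates would come from your fitted model's summary); plogis converts the log-odds back to a probability:

# Hypothetical coefficients from a fitted logistic regression (NOT real output)
intercept <- 3.0; b_pclass2 <- -1.0; b_pclass3 <- -2.0
b_sexmale <- -2.5; b_age <- -0.03

# First-class passenger (Pclass2 = Pclass3 = 0), male, age 20
log_odds <- intercept + b_pclass2 * 0 + b_pclass3 * 0 + b_sexmale * 1 + b_age * 20
prob     <- plogis(log_odds)   # 1 / (1 + exp(-log_odds))
c(log_odds = log_odds, probability = prob)

# With a fitted glm object, the same numbers come from:
# predict(model, newdata, type = "link")      # log-odds
# predict(model, newdata, type = "response")  # probability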

Validation Sets

The Mean Squared Error (MSE):

$$E(\hat{Y}_{n+1} - Y_{n+1})^2$$
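
One common way to estimate this quantity is with a random train/validation split; here is a sketch (mtcars with mpg ~ hp and a 70/30 split, all chosen for illustration):

set.seed(340)
n <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.7 * n))   # 70% train, 30% validation

train <- mtcars[train_idx, ]
valid <- mtcars[-train_idx, ]

model <- lm(mpg ~ hp, data = train)
pred  <- predict(model, newdata = valid)

# Validation-set mean squared error
mean((valid$mpg - pred)^2)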

Model Selection

RIDGE Regression

The short answer is that ridge regression (and other shrinkage methods) prevents over-fitting: the penalty $\lambda$ makes it more expensive to simply choose whatever coefficients we please, which in turn prevents us from over-fitting to the data. To select the optimal value of $\lambda$, you can use techniques like cross-validation. By varying $\lambda$, you can find the value that minimizes the cross-validation error or maximizes a performance metric like R-squared.
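
One way to do this in practice is with the glmnet package, where alpha = 0 selects the ridge penalty and cv.glmnet chooses λ by cross-validation (a sketch; the mtcars predictors are illustrative and glmnet must be installed):

library(glmnet)   # install.packages("glmnet") if needed

x <- as.matrix(mtcars[, c("hp", "wt", "disp", "qsec")])
y <- mtcars$mpg

# Ridge regression: alpha = 0; cross-validation selects lambda
cv_ridge <- cv.glmnet(x, y, alpha = 0)
cv_ridge$lambda.min                   # lambda that minimizes the CV error
coef(cv_ridge, s = "lambda.min")      # shrunken (but nonzero) coefficients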

LASSO Regression

Lasso Regression aims to minimize the sum of squared residuals plus a penalty term that is proportional to the absolute value of the coefficients (L1 norm). Lasso Regression is beneficial when you have a large number of features and suspect that only a few of them are relevant. It automatically performs feature selection and produces a more interpretable model.
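
The same glmnet workflow with alpha = 1 gives the lasso; coefficients driven exactly to zero correspond to dropped features (a sketch, reusing the illustrative mtcars predictors from above):

library(glmnet)

x <- as.matrix(mtcars[, c("hp", "wt", "disp", "qsec")])
y <- mtcars$mpg

cv_lasso <- cv.glmnet(x, y, alpha = 1)   # alpha = 1: L1 (lasso) penalty
coef(cv_lasso, s = "lambda.min")         # zero entries indicate dropped features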

AIC (Akaike Information Criterion)

AIC is used for comparing and selecting models based on their relative quality, considering both the goodness of fit and the complexity of the model. It is particularly useful when you have multiple models with different numbers of predictors and want to choose the best model among them.
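
A sketch comparing two candidate models with the built-in AIC() function (the models and data are illustrative; lower AIC is better):

m1 <- lm(mpg ~ hp, data = mtcars)
m2 <- lm(mpg ~ hp + wt, data = mtcars)

AIC(m1, m2)   # the model with the lower AIC is preferred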

R-squared (R2)

R-squared measures the proportion of variance in the response variable that is explained by the predictors in the model. It is commonly used to compare models with the same number of predictors.

Adjusted R-squared

Adjusted R-squared is a modified version of R-squared that adjusts for the number of predictors in the model. Adjusted R-squared is useful when you want to compare models with different numbers of predictors.