Stats 340
Discrete Distributions
Uniform Distribution
For an interval of equally likely values $\{a, a+1, \dots, b\}$, the discrete uniform distribution assigns each value probability $\frac{1}{b - a + 1}$.
Bernoulli Distribution
Definition
A Bernoulli distribution is a discrete probability distribution for a random variable which takes the value 1 with success probability $p$ and the value 0 with failure probability $1 - p$.
Probability Mass Function
The probability mass function (pmf) of a Bernoulli random variable $X$ is
$$P(X = x) = p^x (1 - p)^{1 - x}, \qquad x \in \{0, 1\}.$$
Expectation and Variance
The expectation (mean) of a Bernoulli random variable is $E[X] = p$, and its variance is $\operatorname{Var}(X) = p(1 - p)$.
Example
Let's consider a simple example of flipping a coin. If we define success as the coin landing heads up, and if the coin is fair, then the success probability $ p $ is 0.5. The Bernoulli distribution can model this experiment.
R Code
# Define the probability of success
p <- 0.5
# Simulate one Bernoulli trial
trial <- rbinom(1, size = 1, prob = p)
# Print the result
cat("The result of the Bernoulli trial is:", trial, "\n")
Binomial Distribution
Definition
The Binomial Distribution is a discrete probability distribution that describes the number of successes in a fixed number of independent trials, with each trial having only two possible outcomes (commonly referred to as "success" and "failure"). The parameters of a binomial distribution are $n$, the number of trials, and $p$, the probability of success on each trial.
General Probability Equation
The probability of getting exactly $x$ successes in $n$ trials is
$$P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}$$
where:
- $P(X = x)$ is the probability of getting exactly $x$ successes,
- $\binom{n}{x}$ is the binomial coefficient, which calculates the number of ways to choose $x$ successes from $n$ trials,
- $p$ is the probability of success on a single trial,
- $1 - p$ is the probability of failure on a single trial.
The mean is $E[X] = np$ and the variance is $\operatorname{Var}(X) = np(1 - p)$.
Example
Suppose we want to find the probability of getting exactly 3 heads in 5 tosses of a fair coin.
Here, the number of trials ($n$) is 5, the probability of success on each trial ($p$) is 0.5, and the number of successes ($x$) is 3.
R Code
# Define the parameters
n <- 5 # Number of trials
p <- 0.5 # Probability of success on each trial
x <- 3 # Number of successes
# Calculate the probability
probability <- dbinom(x, size = n, prob = p)
# Print the result
cat("The probability of getting exactly 3 heads in 5 tosses of a fair coin is:", probability, "\n")
In R, the rbinom function is used to generate random variates from a binomial distribution. This can be useful for simulation purposes or to understand the distribution of outcomes under specified conditions (number of trials and probability of success).
Example
Let's simulate the outcome of 10 experiments, each consisting of 5 tosses of a fair coin, and count the number of heads (successes) in each experiment.
R Code
# Define the parameters
n <- 5 # Number of trials in each experiment
p <- 0.5 # Probability of success on each trial
experiments <- 10 # Number of experiments
# Generate random variates
random_variates <- rbinom(experiments, size = n, prob = p)
# Print the results
cat("Random variates from 10 experiments of 5 coin tosses each:", random_variates, "\n")
Geometric Distribution
Definition
The Geometric Distribution is a discrete probability distribution that models the number of trials needed to achieve the first success in a series of independent and identically distributed Bernoulli trials, where each trial has only two possible outcomes (success or failure). The parameter of a geometric distribution is $p$, the probability of success on each trial.
General Probability Equation
The probability that the first success occurs on the $x$th trial is
$$P(X = x) = (1 - p)^{x - 1} p$$
where:
- $P(X = x)$ is the probability that the first success occurs on the $x$th trial,
- $p$ is the probability of success on each trial,
- $1 - p$ is the probability of failure on each trial,
- $x$ is the trial number of the first success.
The mean is $E[X] = \frac{1}{p}$ and the variance is $\operatorname{Var}(X) = \frac{1 - p}{p^2}$.
Example
Suppose we want to find the probability that the first success (e.g., first head) occurs on the 4th toss of a fair coin.
Here, the probability of success ($p$) is 0.5 and the trial on which the first success occurs ($x$) is 4.
R Code
# Define the parameters
p <- 0.5 # Probability of success on each trial
x <- 4 # The trial on which the first success occurs
# Calculate the probability
probability <- dgeom(x - 1, prob = p) # x - 1 because dgeom counts the number of failures before the first success
# Print the result
cat("The probability of the first head occurring on the 4th toss is:", probability, "\n")
The rgeom function in R is used to generate random variates from a geometric distribution. This function can simulate the number of trials required to obtain the first success in a series of independent trials, each with the same probability of success.
Example
Let's simulate 10 scenarios to find out how many trials are needed to achieve the first success (e.g., first head) when tossing a fair coin.
In this simulation, the probability of success (getting a head) on each trial is 0.5.
R Code
# Define the parameters
p <- 0.5 # Probability of success on each trial
scenarios <- 10 # Number of scenarios to simulate
# Generate random variates
random_variates <- rgeom(scenarios, prob = p) + 1 # +1 because rgeom returns the number of failures before the first success
# Print the results
cat("Number of trials needed for the first success in 10 scenarios:", random_variates, "\n")
Poisson Distribution
Definition
The Poisson Distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space, assuming that these events occur with a known constant mean rate and independently of the time since the last event. The parameter of the Poisson distribution is $\lambda$, the average number of events per interval.
General Probability Equation
The probability of observing exactly $k$ events in an interval is
$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$
where:
- $P(X = k)$ is the probability of observing exactly $k$ events,
- $\lambda$ is the average number of events in an interval,
- $e$ is the base of the natural logarithm (approximately 2.71828),
- $k!$ is the factorial of $k$.
The mean is $\lambda$ and the variance is also $\lambda$.
Example
Suppose we want to find the probability of receiving 2 calls in a 1-hour period at a call center, given that the call center receives an average of 5 calls per hour.
Here, the average rate is $\lambda = 5$ and the number of events is $k = 2$.
R Code for the Example
# Define the parameters
lambda <- 5 # Average rate (mean) of occurrences
k <- 2 # Number of events
# Calculate the probability
probability <- dpois(k, lambda)
# Print the result
cat("The probability of receiving 2 calls in a 1-hour period is:", probability, "\n")
Example for rpois
Let's simulate the number of calls received in 10 different 1-hour periods at the call center, with an average rate of 5 calls per hour.
R Code for rpois
# Define the parameters
lambda <- 5 # Average rate (mean) of occurrences
periods <- 10 # Number of periods to simulate
# Generate random variates
random_variates <- rpois(periods, lambda)
# Print the results
cat("Number of calls received in 10 different 1-hour periods:", random_variates, "\n")
Long-run Averages
More formally, if $X$ is a discrete random variable taking values in a set $S$, then its expectation is
$$E[X] = \sum_{x \in S} x \Pr[X = x].$$
- Note that this set could be finite or infinite.
- If the set is infinite, the sum might not converge, in which case we say that the expectation is either infinite or doesn't exist. But that won't be an issue this semester.
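As an illustrative sketch of the long-run-average interpretation (assuming rolls of a fair six-sided die, so $E[X] = 3.5$), the running average of simulated rolls settles near the expectation as the number of rolls grows:
# Simulate rolls of a fair six-sided die (E[X] = 3.5)
rolls <- sample(1:6, size = 10000, replace = TRUE)
# Running average after each roll
running_avg <- cumsum(rolls) / seq_along(rolls)
cat("Average after 10 rolls:", running_avg[10], "\n")
cat("Average after 10000 rolls:", running_avg[10000], "\n")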
Independence
Definition
In probability theory, two events are said to be independent if the occurrence of one event does not affect the probability of occurrence of the other event. In other words, events A and B are independent if and only if:
$$P(A \cap B) = P(A)\,P(B).$$
This can also be extended to random variables. Two random variables X and Y are independent if the occurrence of a particular value of X does not affect the probability distribution of Y, and vice versa.
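A small simulation sketch (assuming two independent fair coin flips, where A = "first flip is heads" and B = "second flip is heads") illustrates the product rule:
set.seed(1)
n <- 100000
flip1 <- rbinom(n, size = 1, prob = 0.5)  # event A: first flip is heads
flip2 <- rbinom(n, size = 1, prob = 0.5)  # event B: second flip is heads
# Empirical P(A and B) should be close to P(A) * P(B) = 0.25
cat("P(A and B) ~", mean(flip1 == 1 & flip2 == 1), "\n")
cat("P(A) * P(B) ~", mean(flip1 == 1) * mean(flip2 == 1), "\n")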
Conditional Probability
Definition
Conditional probability is a measure of the probability of an event occurring given that another event has already occurred. If the event of interest is $A$ and event $B$ has occurred, the conditional probability of $A$ given $B$ is
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)},$$
provided that $P(B) > 0$.
Bayes' Rule
Definition
Bayes' Theorem is a way of finding a probability when we know certain other probabilities. The formula is:
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
where $P(A)$ is the prior probability of $A$, $P(B \mid A)$ is the probability of $B$ given $A$, and $P(B)$ is the total probability of $B$.
This theorem allows us to update our prior beliefs (the probability of $A$) in light of new evidence $B$, producing the posterior probability $P(A \mid B)$.
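A worked numeric sketch (with hypothetical numbers: a condition with 1% prevalence, a test with 95% sensitivity and a 10% false-positive rate) shows how Bayes' rule updates the prior:
# Hypothetical numbers for illustration
p_disease <- 0.01             # prior P(A): prevalence
p_pos_given_disease <- 0.95   # likelihood P(B | A): sensitivity
p_pos_given_healthy <- 0.10   # false-positive rate
# Total probability of a positive test, P(B)
p_pos <- p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
# Posterior P(A | B) by Bayes' rule
p_disease_given_pos <- p_pos_given_disease * p_disease / p_pos
cat("P(disease | positive test) =", p_disease_given_pos, "\n")  # about 0.088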
Expectation
If $X$ and $Y$ are random variables and $a$ and $b$ are constants, then expectation is linear:
$$E[aX + bY] = aE[X] + bE[Y],$$
and in particular $E[X + Y] = E[X] + E[Y]$.
Variance
Expanding $\operatorname{Var}(X + Y) = E\left[(X + Y - E[X + Y])^2\right]$ gives three terms:
$$E\left[(X - E[X])^2\right] + 2\,E\left[(X - E[X])(Y - E[Y])\right] + E\left[(Y - E[Y])^2\right].$$
Now, the first and last terms there are the variances of $X$ and $Y$.
So
$$\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y) + 2\,E\left[(X - E[X])(Y - E[Y])\right].$$
This term might be familiar to you: it is (two times) the covariance of $X$ and $Y$, $\operatorname{Cov}(X, Y) = E\left[(X - E[X])(Y - E[Y])\right]$.
Now, if $X$ and $Y$ are independent, then $\operatorname{Cov}(X, Y) = 0$, so $\operatorname{Var}(X + Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$.
Correlation
The correlation rescales the covariance to lie between $-1$ and $1$:
$$\rho_{X,Y} = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\,\operatorname{Var}(Y)}}.$$
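A quick simulation sketch (assuming two independent standard normal samples) illustrating that variances add under independence and that cor() estimates the correlation:
set.seed(340)
x <- rnorm(100000)   # independent standard normals
y <- rnorm(100000)
# Variances add when X and Y are independent (Cov = 0)
cat("Var(X + Y):", var(x + y), " Var(X) + Var(Y):", var(x) + var(y), "\n")
# Sample correlation should be near 0 for independent variables
cat("cor(X, Y):", cor(x, y), "\n")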
Type I and Type II Errors
|  | $H_0$ true | $H_0$ false |
| --- | --- | --- |
| Accept $H_0$ | True Negative ✅ | False Negative ❌ (Type II) |
| Reject $H_0$ | False Positive ❌ (Type I) | True Positive ✅ |
- A Type I error corresponds to rejecting the null hypothesis when it is in fact true. That is, type I errors correspond to “false alarms”.
- A Type II error corresponds to accepting the null hypothesis when it is not true. That is, type II errors correspond to “misses”.
Example
|  | $H_0$ true | $H_0$ false |
| --- | --- | --- |
| Accept | 41 | 8 |
| Reject | 19 | 32 |

Type I Error: $P(rej | H_0 true) = \frac{19}{60}$
Type II Error: $P(acpt | H_0 false) = \frac{8}{40}$
Power: $P(rej | H_A) = \frac{32}{40}$
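A small sketch computing these rates directly from the counts in the table above (the 2x2 counts are the assumed inputs):
# Counts from the table: rows = decision, columns = truth
accept <- c(h0_true = 41, h0_false = 8)
reject <- c(h0_true = 19, h0_false = 32)
type_I  <- reject["h0_true"]  / (accept["h0_true"]  + reject["h0_true"])   # 19/60
type_II <- accept["h0_false"] / (accept["h0_false"] + reject["h0_false"])  # 8/40
power   <- 1 - type_II                                                     # 32/40
cat("Type I:", type_I, " Type II:", type_II, " Power:", power, "\n")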
Relations
- Increasing $\alpha$ (making the test more liberal in rejecting the null hypothesis) can decrease $\beta$ (reducing the risk of a Type II error), thus increasing the power of the test.
- Decreasing $\alpha$ (being more conservative about rejecting the null hypothesis) increases $\beta$ (higher risk of missing an effect), thus reducing the power of the test.
- To increase power (reduce $\beta$) without changing $\alpha$, increase the sample size or reduce the variability in the data.
Monte Carlo Method
To (approximately) compute $\Pr[E]$, the probability of an event $E$ involving a random variable $X$:
- Generate lots of copies of $X$, say $X_1, X_2, \dots, X_{N_{MC}}$, for some number $N_{MC}$ of Monte Carlo replicates.
- Count how many of these replicates correspond to event $E$ occurring. That is, count how many indices $i$ there are such that $X_i \in E$.
- Estimate $\Pr[E]$ as (number of replicates where $E$ occurred) $/\, N_{MC}$.
Example
Suppose $X \sim \mathcal{N}(\text{mean} = 1, \text{variance} = 3)$. Note that we have specified the standard deviation to be $\sqrt{3}$ in the code below, because rnorm takes a standard deviation rather than a variance.
Our event of interest is $E = \{0 \le X \le 3\}$.
So we should generate lots of copies of $X$ and check, for each copy, whether or not $E$ happened.
# Function to check whether the event E = {0 <= x <= 3} happened
event_E_happened <- function( x ) {
if( 0 <= x & x <= 3 ) {
return( TRUE ) # The event happened
} else {
return( FALSE ) # The event DIDN'T happen
}
}
# Now MC says that we should generate lots of copies of X...
NMC <- 1000; # 1000 seems like "a lot".
results <- rep( 0, NMC ); # We're going to record outcomes here.
for( i in 1:NMC) {
# Generate a draw from the normal, and then...
X <- rnorm( 1, mean=1, sd=sqrt(3) );
# ...record whether or not our event of interest happened.
results[i] <- event_E_happened(X);
}
# Now, compute what fraction of our trials were "successes" (i.e., E happened)
sum(results)/NMC
As the number of Monte Carlo replications increases, the standard error of the estimate decreases; a more precise estimate can lead to a smaller p-value and a higher probability of rejecting the null hypothesis when it is false, thus improving the power of the test.
Power of Hypothesis Test
The power of a hypothesis test is the probability of rejecting the null hypothesis when the null hypothesis is false (i.e., when the alternative is true).
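Power can itself be estimated by Monte Carlo. A sketch under assumed conditions (a one-sample t-test of $H_0: \mu = 0$ at level 0.05, with data drawn from a normal distribution whose true mean is 0.5):
set.seed(42)
NMC <- 2000           # Monte Carlo replicates
n <- 30               # sample size per simulated experiment
true_mean <- 0.5      # the alternative is true: mu = 0.5
reject <- rep(FALSE, NMC)
for (i in 1:NMC) {
  x <- rnorm(n, mean = true_mean, sd = 1)
  # Reject H0: mu = 0 if the t-test p-value is below 0.05
  reject[i] <- t.test(x, mu = 0)$p.value < 0.05
}
cat("Estimated power:", mean(reject), "\n")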
CDF, PDF, Inverse CDF, Probability
Probability Density Function (PDF)
For a continuous random variable there is no probability at a single point ($P(X = x) = 0$ for every $x$); probabilities are obtained by integrating the density $f(x)$ over an interval.
Cumulative Distribution Function (CDF)
$F_X(x) = P(X \le x)$, the probability that $X$ takes a value at most $x$. The CDF is non-decreasing and runs from 0 to 1.
Inverse CDF
Also known as the quantile function: for a CDF $F_X$, the inverse CDF $q_X(p)$ is the value $x$ such that $F_X(x) = p$. For example, if $F_X(x) = x^2$ on $[0, 1]$, then the inverse CDF at 0.5 is $q_X(0.5) = \sqrt{0.5}$.
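In R, the d/p/q/r prefixes give the PDF, CDF, inverse CDF, and random draws for a distribution; a quick sketch with the standard normal:
x <- 1.96
dnorm(x)        # PDF: density of the standard normal at x
pnorm(x)        # CDF: P(X <= 1.96), about 0.975
qnorm(0.975)    # Inverse CDF (quantile function): about 1.96
rnorm(3)        # Three random draws from the standard normal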
Law of Large Numbers
According to the LLN, as the number of trials or sample size increases, the sample mean (average of the sample outcomes) will converge to the expected value (true mean) of the population from which the samples are drawn, provided that the expected value exists.
Convergence of Sample Mean to True Mean
The Law of Large Numbers states that as the sample size $n$ increases, the sample mean $\bar{X}_n$ converges to the true mean $\mu$.
This means that for a sufficiently large $n$, the sample mean will be close to $\mu$ with high probability.
Scaling of the Variance of the Sample Mean with Sample Size
The variance of the sample mean decreases as the sample size increases. Specifically, if $X_1, \dots, X_n$ are i.i.d. with variance $\sigma^2$, then $\operatorname{Var}(\bar{X}_n) = \sigma^2 / n$.
As the sample size increases, the variance of the sample mean decreases, making the sample mean a more precise estimator of the population mean.
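A brief simulation sketch (assuming i.i.d. Exponential(1) draws, which have true mean 1 and variance 1) showing both the convergence of the sample mean and the $\sigma^2/n$ scaling of its variance:
set.seed(7)
# Sample means computed from repeated samples of increasing size (true mean = 1)
for (n in c(10, 100, 10000)) {
  xbar <- replicate(1000, mean(rexp(n, rate = 1)))
  cat("n =", n, " mean of sample means:", round(mean(xbar), 3),
      " variance of sample mean:", signif(var(xbar), 3), "\n")
}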
Sample Mean and Variance
Sample Mean
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$$
And the sample variance is
$$S^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})^2.$$
If $X_1, \dots, X_n$ are i.i.d. with mean $\mu$ and variance $\sigma^2$, then $E[\bar{X}] = \mu$ and $\operatorname{Var}(\bar{X}) = \sigma^2 / n$.
Odds
The odds of an event is defined as the ratio of the probability of the event happening to the probability of the event not happening. Mathematically, if $p$ is the probability of the event, the odds are $\frac{p}{1 - p}$.
Estimation
Estimators
The estimator in this case is the sample mean, which is a method used to estimate the population mean. The formula for the sample mean is
$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
where $n$ is the sample size and the $x_i$ are the individual observed values.
Estimate
If the heights of the 50 students are measured and the resulting values are 160 cm, 165 cm, 170 cm, etc., you would plug these values into the formula to compute the sample mean. This computed value is the estimate of the population mean.
Statistic
In this scenario, the sample mean $\bar{X}$ computed from the 50 measured heights is the statistic: a quantity calculated from the sample data.
Expectation and Variance
The Central Limit Theorem
It explains why many distributions in nature tend to approximate the normal distribution, even if the underlying distribution itself is not normal, provided that the sample size is large enough.
Key Elements of the CLT
- Independent and identically distributed (i.i.d.): The observations in the sample must be independent of each other and come from the same distribution.
- Sufficiently large sample size: the normal approximation improves as the sample size $n$ grows (a common rule of thumb is $n \ge 30$).
Mathematical Formulation
If $X_1, X_2, \dots, X_n$ are i.i.d. random variables with mean $\mu$ and finite variance $\sigma^2$, then the sample mean $\bar{X}_n$
will have a distribution that approaches a normal distribution with mean $\mu$ and variance $\sigma^2 / n$ as $n$ grows. Equivalently,
$$\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \;\longrightarrow\; \mathcal{N}(0, 1),$$
where $\mu$ is the population mean, $\sigma^2$ is the population variance, and $n$ is the sample size.
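A simulation sketch (assuming heavily skewed Exponential(1) data, which has mean 1 and standard deviation 1) showing that standardized sample means look approximately standard normal for moderately large $n$:
set.seed(11)
n <- 50               # sample size (moderately large)
mu <- 1; sigma <- 1   # mean and sd of the Exponential(1) distribution
# Standardized sample means from 5000 repeated samples
z <- replicate(5000, (mean(rexp(n, rate = 1)) - mu) / (sigma / sqrt(n)))
cat("Mean of z:", round(mean(z), 3), " SD of z:", round(sd(z), 3), "\n")
hist(z, breaks = 40, main = "Standardized sample means (approx. N(0,1))")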
Fact
A higher confidence level demands a wider interval, since capturing the true parameter with higher probability requires a larger critical value (for example, $z \approx 2.58$ for 99% versus $z \approx 1.96$ for 95%).
Residual Sum of Squares (RSS) or (SSR)
The RSS is calculated by taking the sum of the squares of the residuals (the differences between the observed values and the predicted values). Mathematically, it can be expressed as:
$$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.$$
Why RSS
The RSS measures the variance left unexplained by the model, thus providing a direct measure of the model’s predictive power. Lower RSS values indicate a model that closely fits the data. Sensitivity to Large Errors: Squaring the residuals has the effect of heavily penalizing larger errors.
Example
Consider a dataset where we want to predict house prices based on the size of the house (in square meters).
# Fit linear model
model <- lm(house_price ~ house_size, data = data)
# Get summary of the model
summary(model)
Then calculate the RSS
# Calculate RSS
rss <- sum(residuals(model)^2)
print(rss)
Suppose your output from R looks like this:
- RSS: 12000
- R-squared: 0.85
This would mean:
- The model has a fairly low RSS, suggesting it fits the data reasonably well, but there might still be some variability left unexplained.
- An R-squared of 0.85 is quite high, indicating a good predictive model for house prices based on house size.
Linear Regression
Estimates of $\beta_0$ and $\beta_1$
To get the estimates, minimize the squared-error loss $\sum_i (y_i - \beta_0 - \beta_1 x_i)^2$, which yields
$$\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.$$
Variance of estimates
After fitting, we can find our predicted values $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$,
the model residuals $e_i = y_i - \hat{y}_i$,
and the Mean Squared Error: $MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$.
The formula for the 95% confidence interval is: Estimate $\pm\ 1.96 \times$ Standard Error.
Standard Error (SE) and Residual Standard Error (RSE)
Formula: $RSE = \sqrt{\frac{RSS}{n - 2}}$ for simple linear regression (more generally, $\sqrt{RSS / (n - p - 1)}$ with $p$ predictors).
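A short sketch using the built-in mtcars data (an assumed example; any numeric predictor and response would do) confirming that the closed-form estimates match lm():
# Closed-form least-squares estimates vs. lm(), using mtcars (mpg ~ hp)
x <- mtcars$hp; y <- mtcars$mpg
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
fit <- lm(mpg ~ hp, data = mtcars)
cat("By hand: ", b0, b1, "\n")
cat("From lm():", coef(fit)[1], coef(fit)[2], "\n")
# Residual standard error: sqrt(RSS / (n - 2))
cat("RSE:", sqrt(sum(residuals(fit)^2) / (nrow(mtcars) - 2)), "\n")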
Linear Regression Assumptions
- The response variable has a linear relationship with the predictor variables;
- the errors (residuals) follow a normal distribution with mean zero;
- the errors have constant variance.
Interaction Term
# here (Type * Treatment) is the interaction
model <- lm(response ~ Type * Treatment + conc, data = data)
The Estimate column gives the fitted coefficient for each term; the second number shown below is that coefficient's standard error. For example:

| Term | Estimate | Std. Error |
| --- | --- | --- |
| Type | -9.38 | 1.62 |
| Treatment | -3.58 | 1.85 |
| conc | 0.0177 | 0.0022 |
| Type:Treatment | -6.55 | 2.61 |
- When Type = 0 and Treatment = 0, the interaction term has no effect on the response variable; the expected response is just the intercept (27.620528).
- When Type = 1 and Treatment = 0, the effect on the response variable is the coefficient of Type (-9.380952).
- When Type = 0 and Treatment = 1, the effect on the response variable is the coefficient of Treatment (-3.580952).
- When Type = 1 and Treatment = 1, the effect on the response variable is the Intercept + TypeMississippi + Treatmentchilled + TypeMississippi:Treatmentchilled = 27.620528 + (-9.380952) + (-3.580952) + (-6.557143) = 8.101481
Expected response = Intercept + TypeMississippi * 1 + Treatmentchilled * 0 + conc * 500 + TypeMississippi:Treatmentchilled * (1 * 0)
Expected response = 27.620528 + (-9.380952) * 1 + (-3.580952) * 0 + 0.017731 * 500 + (-6.557143) * (1 * 0)
Fact: The intercept, also known as the constant term, is the expected value of the response variable when all predictor variables are equal to zero.
Proportion of the change
For the proportion of the change in the response that is explained by the change in the predictors, check the R-squared.
- Multiple R-squared: 0.7072
- Adjusted R-squared: 0.6923
Thus, 70.72% (the Multiple R-squared) is the proportion of the variation explained.
Which predictors are the most significant in this model?
Check the p-values: the predictor with the smallest p-value is the most significant.
Multiple Regression
Multiple regression is an extension of simple linear regression that allows you to predict an outcome based on multiple predictors.
Multiple Regression Model
In a multiple regression model, the formula looks like this:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \epsilon.$$
Example
# Fit multiple regression model
model <- lm(house_price ~ house_size + num_bedrooms + age_of_house, data = data)
# Get summary of the model
summary(model)
Model Fit with RSS
We define the residual sum of squares. Let $\hat{y}_i$ denote the model's prediction for observation $i$. Then
$$RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.$$
$R$-squared:
$$R^2 = 1 - \frac{RSS}{TSS} \quad \text{or, equivalently,} \quad R^2 = \frac{TSS - RSS}{TSS},$$
where, as a reminder, the total sum of squares is
$$TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2,$$
where $\bar{y}$ is the mean of the observed responses.
MSS
The model sum of squares: the sum of squares between our model and the “dumbest” model (which predicts $\bar{y}$ for every observation):
$$MSS = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2.$$
This quantity is often called the model sum of squares (MSS) or the explained sum of squares (ESS).
Since $TSS = MSS + RSS$, we can also write $R^2 = \frac{MSS}{TSS}$.
Quadratic Term (Nonlinear transformations)
let's look in particular at mpg (miles per gallon) and hp (horsepower).
mtc_lm <- lm( mpg ~ 1 + hp, data=mtcars );
and you can add a quadratic term:
mtc_lm <- lm( mpg ~ 1 + hp + I(hp^2), data=mtcars );
Notice that to get our non-linear term hp^2 into the model, we had to write our formula as mpg ~ 1 + hp + I(hp^2). If we just wrote mpg ~ 1 + hp + hp^2, R would parse hp^2 as just hp. I(...) prevents this and ensures that R doesn't clobber the expression inside.
Logistic Regression
We do that using the logistic (sigmoid) function, $\sigma(z) = \frac{1}{1 + e^{-z}}$, which maps any real number to a probability in $(0, 1)$.
Logistic regression in R
pima_logistic <- glm( diabetes ~ 1 + glu, data=Pima.te, family=binomial );
summary(pima_logistic)
Model Coefficients
The odds of an event E with probability p are given by $\frac{p}{1 - p}$.
So, let's suppose that we have a logistic regression model with a single predictor, that takes predictor $x$ to the probability
$$p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}.$$
The odds associated with this probability are
$$\frac{p(x)}{1 - p(x)} = e^{\beta_0 + \beta_1 x}.$$
If we look at the log odds associated with our logistic regression model, we get something linear in the predictor:
$$\log \frac{p(x)}{1 - p(x)} = \beta_0 + \beta_1 x.$$
How to predict the log-odds:
Predicted log-odds = Intercept + Pclass2 * 0 + Pclass3 * 0 + Sexmale * 1 + Age * 20
where the question sets class = 1, male = 1, and age = 20.
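A sketch of turning coefficients into a prediction (the coefficient values below are placeholders for illustration, not fitted values):
# Hypothetical coefficients for illustration only (not fitted values)
intercept <- 3.0; b_pclass2 <- -1.0; b_pclass3 <- -2.0; b_male <- -2.5; b_age <- -0.03
# Predicted log-odds for class = 1 (both class dummies are 0), male = 1, age = 20
log_odds <- intercept + b_pclass2 * 0 + b_pclass3 * 0 + b_male * 1 + b_age * 20
# Convert log-odds to a probability with the logistic function
prob <- 1 / (1 + exp(-log_odds))
cat("Predicted log-odds:", log_odds, " probability:", prob, "\n")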
Validation Sets
The Mean Squared Error (MSE), $MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, computed on the held-out validation set, is used to compare candidate models.
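A minimal sketch of a validation-set split, using mtcars as an assumed example dataset and validation MSE as the comparison criterion:
set.seed(1)
# Randomly split mtcars into training and validation sets
train_idx <- sample(nrow(mtcars), size = floor(0.7 * nrow(mtcars)))
train <- mtcars[train_idx, ]; valid <- mtcars[-train_idx, ]
# Fit two candidate models on the training set
m1 <- lm(mpg ~ hp, data = train)
m2 <- lm(mpg ~ hp + I(hp^2), data = train)
# Compare validation-set MSE; smaller is better
mse <- function(model) mean((valid$mpg - predict(model, newdata = valid))^2)
cat("Validation MSE, linear:", mse(m1), " quadratic:", mse(m2), "\n")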
Model Selection
RIDGE Regression
The short answer is that ridge regression (and other shrinkage methods) prevents over-fitting by shrinking the coefficient estimates toward zero, penalizing the sum of squared coefficients (the L2 norm).
LASSO Regression
Lasso Regression aims to minimize the sum of squared residuals plus a penalty term that is proportional to the absolute value of the coefficients (L1 norm). Lasso Regression is beneficial when you have a large number of features and suspect that only a few of them are relevant. It automatically performs feature selection and produces a more interpretable model.
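A sketch of both penalties using the glmnet package (assumed to be installed; alpha = 0 gives ridge, alpha = 1 gives lasso), with mtcars as an assumed example dataset:
library(glmnet)
# Predictor matrix and response (example data)
x <- as.matrix(mtcars[, c("hp", "wt", "disp", "qsec")])
y <- mtcars$mpg
# Ridge (alpha = 0) and lasso (alpha = 1), with lambda chosen by cross-validation
ridge_cv <- cv.glmnet(x, y, alpha = 0)
lasso_cv <- cv.glmnet(x, y, alpha = 1)
coef(ridge_cv, s = "lambda.min")   # all coefficients shrunk, none exactly zero
coef(lasso_cv, s = "lambda.min")   # some coefficients may be exactly zero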
AIC (Akaike Information Criterion)
AIC is used for comparing and selecting models based on their relative quality, considering both the goodness of fit and the complexity of the model. It is particularly useful when you have multiple models with different numbers of predictors and want to choose the best model among them.
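A brief sketch comparing two candidate models with base R's AIC() (lower is better), again using mtcars as an assumed example:
m1 <- lm(mpg ~ hp, data = mtcars)
m2 <- lm(mpg ~ hp + wt, data = mtcars)
AIC(m1, m2)   # lower AIC indicates the preferred model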
R-squared ($R^2$)
R-squared measures the proportion of variance in the response variable that is explained by the predictors in the model. It is commonly used to compare models with the same number of predictors.
Adjusted R-squared
Adjusted R-squared is a modified version of R-squared that adjusts for the number of predictors in the model. Adjusted R-squared is useful when you want to compare models with different numbers of predictors.