SRaI

@LMU

S. B.

Flashcard set details

Summary	This flashcard set covers advanced statistical methods at the university level, focusing on topics like ANOVA, missing data analysis, and copula models. It delves into key concepts such as variance, data imputation, and the trade-offs between sample size and data quality. Researchers and students in statistics or data science will find this set particularly useful for understanding complex analytical techniques and their practical applications.
Cards	125
Learners	1
Language	English
Category	Computer science
Level	University
Created / Updated	04.10.2019 / 11.10.2019
Weblink	https://card2brain.ch/cards/20191004_srai

Card list

State Bayes' rule.

\(P(A|B) = \frac{P(B|A) ~\cdot~ P(A)}{P(B)}\)
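A small numerical sketch of the rule, assuming P(B) is obtained from the law of total probability. The function name and the numbers (1% prior, 95% hit rate, 5% false-positive rate) are hypothetical and only serve as illustration.

```python
def bayes_posterior(p_a: float, p_b_given_a: float, p_b_given_not_a: float) -> float:
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    # P(B) via the law of total probability over A and not-A
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1.0 - p_a)
    return p_b_given_a * p_a / p_b


# Hypothetical numbers: P(A) = 0.01, P(B|A) = 0.95, P(B|not A) = 0.05
posterior = bayes_posterior(0.01, 0.95, 0.05)
print(round(posterior, 3))
```

Even with a high P(B|A), the posterior stays small here because the prior P(A) is small.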

What is a random variable?

a random variable Y maps from the event space Omega to the real numbers (Y: Omega -> R)

Define the expected value and variance

\(E(Y) = \int_{-\infty}^{\infty}u ~f(u)~du = \mu\)

\(Var(Y) = \int_{-\infty}^{\infty}(y - \mu)^2~f(y)~dy = E((Y - \mu)^2) = \sigma^2 = E(Y^2) - \mu^2\)

Define the exponential family

\(f(y,\theta) = \exp(t^T(y)~\theta - K(\theta))~h(y)\)

with t(y) = statistics = function of data
theta = parameter (vector)
K(theta) = normalisation constant s.t. integral (f(y,theta)) = 1
h(y) >= 0, does not depend on theta (unimportant for inference about theta)
and \(\frac{\partial K(\theta)}{\partial\theta} = E(t(Y))\)

What is the t-distribution good for?

statistical test for the mean of normally distributed variables
when variance is unknown (estimated from data)

Define covariance for Y1, Y2.

What about independence? What does this imply?

\(Cov(Y_1, Y_2) = E((Y_1 - E(Y_1))(Y_2 - E(Y_2))) = E(Y_1~Y_2)-E(Y_1)E(Y_2)\)

Cov(Yj, Yk) = 0 if Yj, Yk are independent

f(yj, yk) = f(yj) * f(yk)
E(Yj * Yk) = E(Yj) * E(Yk)

Define correlation.

Corr(Yj, Yk) = Cov(Yj, Yk) / sqrt(Var(Yj) * Var(Yk))

State the law of iterated expectation.

\(E(Y) = E_X(E(Y|X))\\Var(Y) = E_X(Var(Y|X)) + Var_X(E(Y|X))\)
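Both identities can be checked exactly on a small discrete example. The joint probability table below is made up for illustration, and the helper names are not from the cards.

```python
# Hypothetical joint pmf of (X, Y): keys are (x, y), values are probabilities
joint = {(0, 1): 0.2, (0, 3): 0.2, (1, 2): 0.1, (1, 6): 0.5}


def mean(pmf):
    return sum(value * prob for value, prob in pmf.items())


def var(pmf):
    m = mean(pmf)
    return sum((value - m) ** 2 * prob for value, prob in pmf.items())


# Marginal distributions of X and Y
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p
    p_y[y] = p_y.get(y, 0.0) + p

# Conditional distribution of Y given X = x, with its mean and variance
cond = {x: {y: p / p_x[x] for (xx, y), p in joint.items() if xx == x} for x in p_x}
cond_mean = {x: mean(cond[x]) for x in p_x}
cond_var = {x: var(cond[x]) for x in p_x}

mean_direct = mean(p_y)                                    # E(Y)
mean_iterated = sum(p_x[x] * cond_mean[x] for x in p_x)    # E_X(E(Y|X))
var_direct = var(p_y)                                      # Var(Y)
expected_cond_var = sum(p_x[x] * cond_var[x] for x in p_x)                       # E_X(Var(Y|X))
var_of_cond_mean = sum(p_x[x] * (cond_mean[x] - mean_iterated) ** 2 for x in p_x)  # Var_X(E(Y|X))

print(mean_direct, mean_iterated)
print(var_direct, expected_cond_var + var_of_cond_mean)
```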

What is the idea of the central limit theorem?

What is required? (3)

Give an example.

the (standardised) sum of i.i.d. random variables from any distribution converges to a normal distribution
- for n -> infinity (asymptotically)
conditions
- i.i.d.
- mean exists (finite)
- finite variance
example: random walk Y_n = sum(x_i) with E(x_i) = 0, Var(x_i) = sigma^2
- Z_n = Y_n / sqrt(n) ~ N(0, sigma^2) (approx., for large n)
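The random-walk example can be simulated as a rough sanity check. This is a sketch under assumptions: the steps are +1 or -1 with equal probability (mean 0, variance 1), and the seed, n and number of replications are arbitrary choices.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is reproducible


def scaled_walk(n: int) -> float:
    """Z_n = Y_n / sqrt(n), where Y_n is the sum of n i.i.d. steps of +1 or -1."""
    y_n = sum(random.choice((-1, 1)) for _ in range(n))
    return y_n / math.sqrt(n)


# 2000 independent realisations of Z_n with n = 200
z = [scaled_walk(200) for _ in range(2000)]
z_mean = statistics.fmean(z)
z_var = statistics.pvariance(z)
print(z_mean, z_var)  # should be close to 0 and sigma^2 = 1
```

A histogram of `z` would look approximately bell-shaped, although each single step only takes two values.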

What is the moment generating function? What about its k-th derivative?

What is the cumulant generating function? What is special about the first and 2nd derivative? What are restrictions?

\(M_Y(t) = E(e^{t Y})\)
\(\frac{\partial^kM_Y(t)}{(\partial t)^k} = E(Y^k) \) evaluated at t = 0
\(K_Y(t) = log(M_Y(t))\)
- first derivative w.r.t. t at t = 0: E(Y) = first moment
- 2nd derivative w.r.t. t at t = 0: Var(Y) = 2nd central moment
- as long as moments are finite
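As an illustration, the derivatives of a cumulant generating function can be approximated numerically. The sketch assumes a Poisson(lambda) variable, whose cumulant generating function is K(t) = lambda * (exp(t) - 1); lambda = 2.5 and the step size h are arbitrary choices.

```python
import math


def cgf_poisson(t: float, lam: float = 2.5) -> float:
    """Cumulant generating function K(t) = log E(exp(tY)) of a Poisson(lam) variable."""
    return lam * math.expm1(t)


def first_derivative(f, t: float, h: float = 1e-3) -> float:
    # central difference approximation of f'(t)
    return (f(t + h) - f(t - h)) / (2 * h)


def second_derivative(f, t: float, h: float = 1e-3) -> float:
    # central difference approximation of f''(t)
    return (f(t + h) - 2 * f(t) + f(t - h)) / h ** 2


mean_from_cgf = first_derivative(cgf_poisson, 0.0)   # approximates E(Y) = lam
var_from_cgf = second_derivative(cgf_poisson, 0.0)   # approximates Var(Y) = lam
print(mean_from_cgf, var_from_cgf)
```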

What is the general idea of parametric statistical models? (2) Define Posterior and its components.

y (data) is realisation of Y with Y~F(y; theta) (model)
theta unknown
\(f_\theta (\vartheta|y) = \frac{(\prod f(y_i; ~\vartheta))f_\theta(\vartheta)}{\int{(\prod f(y_i;~\vartheta))f_\theta(\vartheta)d\vartheta}}\)
- posterior = all information about theta
- likelihood: information in data
- prior: knowledge about theta before observing the data
- denominator: normalisation constant (f(y), independent of theta)

Define likelihood and loglikelihood.

Name characteristics. (2)

\(L(\theta;y) = \prod f(y_i; \theta)\)
\(l(\theta;y) = \sum log~f(y_i; \theta)\)

Characteristics

plausibility of parameter-values theta, given data y
posterior proportional to Likelihood * prior

How can theta be estimated (theta_hat)? (3, 2 each). Relate them.

posterior mean estimate
- theta_hat = E(posterior)
- requires (numerical) integration (E(.))
posterior mode estimate
- theta_hat = argmax over theta of the posterior
- first derivative required
ML-estimation
- theta_hat = argmax over theta of L(theta; y) (set first derivative w.r.t. theta of loglikelihood to zero)
- corresponds to a flat/constant/non-informative prior
posterior mode estimate = ML-estimation if prior is flat
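The three estimates can be compared in a conjugate example. This is a sketch assuming Bernoulli data with k successes in n trials and a Beta(a, b) prior, so that the posterior is Beta(a + k, b + n - k); the numbers are hypothetical.

```python
def estimates(k: int, n: int, a: float = 1.0, b: float = 1.0) -> dict:
    """Posterior mean, posterior mode and ML estimate for a Bernoulli parameter.

    Assumes a Beta(a, b) prior; the mode formula requires both posterior
    parameters to be larger than 1.
    """
    post_a, post_b = a + k, b + n - k
    return {
        "posterior_mean": post_a / (post_a + post_b),
        "posterior_mode": (post_a - 1) / (post_a + post_b - 2),
        "ml": k / n,
    }


flat = estimates(7, 20)                      # Beta(1, 1) is the flat prior
informative = estimates(7, 20, a=5.0, b=5.0)  # prior centred at 0.5
print(flat)
print(informative)
```

With the flat prior the posterior mode coincides with the ML estimate k / n, while the informative prior pulls both Bayesian estimates towards 0.5.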

Explain the invariance property of the ML-estimate

gamma = g(theta), theta = g^-1(gamma)
gamma_ML = g(theta_ML) >> no reestimation required after transformation
g(.) bijective transformation function
the variance transforms accordingly (delta method): gamma_hat - gamma ~ N(0, (derivative of gamma wrt theta) * inv(I(theta)) * (derivative of gamma wrt theta)) (approx.)

Define loss, squared loss. (3)

L: (set of estimates t) x (parameter space) -> R+
L(t, theta) = (t-theta)^2 >> min
L(theta, theta) = 0

What is the problem with loss? (2)

Explain solutions. (2), (3)

t(y) = theta_hat depends on the sample
- risk = expected loss: R(t, theta) = E(L(t(Y), theta))
- estimate theta_hat s.t. R(.) is minimised
theta unknown (risk and loss depend on theta)
- minimax approach: choose t(y) s.t. the maximal risk over all theta is minimised
- Bayes risk: minimise the expectation of the risk w.r.t. theta (using the prior)
- posterior Bayes risk: minimise the expectation of the loss given the data

Define MSE. How does it decompose?

R(t, theta) = E((t(y) - theta)^2) = expectation of squared loss
- = Var(t(y)) + Bias^2(t, theta)
- = stochastic error (variability of t) + systematic error
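The decomposition can be verified exactly for a discrete model. The sketch assumes k ~ Binomial(n, theta) and compares the ML estimate k / n with a hypothetical shrinkage estimate (k + 1) / (n + 2); n = 10 and theta = 0.3 are arbitrary.

```python
from math import comb


def bias_var_mse(estimator, n: int, theta: float):
    """Exact bias, variance and MSE of estimator(k, n) for k ~ Binomial(n, theta)."""
    pmf = [comb(n, k) * theta ** k * (1 - theta) ** (n - k) for k in range(n + 1)]
    values = [estimator(k, n) for k in range(n + 1)]
    mean = sum(p * v for p, v in zip(pmf, values))
    var = sum(p * (v - mean) ** 2 for p, v in zip(pmf, values))
    mse = sum(p * (v - theta) ** 2 for p, v in zip(pmf, values))
    return mean - theta, var, mse


def ml(k, n):
    return k / n


def shrunk(k, n):
    return (k + 1) / (n + 2)


bias_ml, var_ml, mse_ml = bias_var_mse(ml, 10, 0.3)
bias_sh, var_sh, mse_sh = bias_var_mse(shrunk, 10, 0.3)
print(bias_ml, var_ml, mse_ml)
print(bias_sh, var_sh, mse_sh)
```

At this theta the biased estimate has the smaller MSE: it trades a small systematic error for a lower variance.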

Define Bias. What is the goal in estimation?

Bias(t, theta) = E(t(y)) - theta
aim: (asymptotically) unbiased estimate, i.e. Bias = 0 for the whole parameter space

What is the general idea of Kullback-Leibler divergence? (4)

Define KL(t, theta).

What is the result, if KL is used as a loss function?

compares distributions
not symmetric
based on log(f(y; theta)) - log(f(y; t)) = log(f(y; theta) / f(y; t))
- theta -> true distribution
- t -> estimated distribution
KL >= 0, with KL = 0 if t = theta
KL(t, theta) = \(\int log\frac{f(y; \theta)}{f(y; t)} f(y; \theta)dy\)
KL risk: R_KL(t, theta) = E(KL(t(Y), theta)), approx. 0.5 * integral((t(y) - theta)^2 * I(theta) * prod(f(y_i; theta)) dy) (2nd-order Taylor expansion)
- Fisher information I(theta) is independent of the data
- >> min(KL-risk) approximates min(MSE)
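For a concrete feel, the divergence has a closed form for Bernoulli distributions. The sketch below assumes f(y; theta) is Bernoulli(theta) and compares the exact KL with the second-order Taylor approximation 0.5 * I(theta) * (t - theta)^2, where I(theta) = 1 / (theta * (1 - theta)); all numbers are illustrative.

```python
import math


def kl_bernoulli(t: float, theta: float) -> float:
    """KL(t, theta): expectation under the true theta of log(f(y; theta) / f(y; t))."""
    return (theta * math.log(theta / t)
            + (1 - theta) * math.log((1 - theta) / (1 - t)))


def quadratic_approx(t: float, theta: float) -> float:
    """Second-order Taylor approximation 0.5 * I(theta) * (t - theta)^2."""
    fisher_info = 1.0 / (theta * (1 - theta))
    return 0.5 * fisher_info * (t - theta) ** 2


kl_same = kl_bernoulli(0.3, 0.3)       # identical distributions
kl_forward = kl_bernoulli(0.5, 0.3)    # true theta = 0.3, estimate t = 0.5
kl_backward = kl_bernoulli(0.3, 0.5)   # roles swapped
kl_close = kl_bernoulli(0.31, 0.3)     # estimate close to the truth
approx_close = quadratic_approx(0.31, 0.3)
print(kl_same, kl_forward, kl_backward)
print(kl_close, approx_close)
```

The two middle values differ, which shows the lack of symmetry, and close to the truth the divergence behaves like a squared error weighted by the Fisher information.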

What is sufficiency?

What is a sufficient statistic? (3)

quality of estimate t(y)

all information about theta in t()
- idea: replace data y by t(y)
t(y) is sufficient if f(y | t(y); theta) is independent of theta
problem: hard to show

What is Neyman-factorisation? What is the problem?

t(y) is sufficient if f(y; theta) = h(y) * g(t(y); theta)
- h() and g() non-negative
- >> h() independent of theta
- >> g() depends on y only through t()
weak statement, as the full data y is itself always sufficient

Define minimal sufficient statistics.

t() is sufficient AND
for every other sufficient statistic t*() there exists a function h s.t. t() = h(t*(.))

Define consistency. (2)

quantify information in data
theta_hat is consistent if MSE(theta_hat, theta) -> 0 for n -> inf
- >> theta_hat -> theta (asymptotically)

Define the Cramér-Rao bound (2).

Name properties (2)

lower limit of MSE for given n
holds only for Fisher-regular distributions
for unbiased theta_hat: MSE() = Var(theta_hat) >= inv(I(theta))
- Var(theta_hat) >= inverse of Fisher information (= in best case)
for biased theta_hat: MSE() >= Bias()^2 + (1 + derivative of Bias wrt theta)^2 / I(theta)
- the factor (1 + derivative of Bias wrt theta) is 1 if the bias is 0
- smaller variance possible, but larger bias then

What is the idea of confidence intervals? (2)

quantify uncertainty of theta_hat
interval estimate instead of point estimate
- P(theta in CI) >= 1-alpha (ideally with equality)

What is a pivotal statistic? Give an example.

g(y; theta) whose distribution is independent of theta
e.g. x = (theta_hat - theta) / sqrt(Var(theta_hat)) ~ N(0,1)
- (approximately) pivotal statistic by using the CLT
- P(z_alpha/2 <= x <= z_1-alpha/2) = 1-alpha
- CI = [theta_hat +- z_1-alpha/2 * sqrt(Var(theta_hat))]
  - problem: Var(theta_hat) depends on theta >> circular >> plug in an estimate of Var(theta_hat)
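A minimal sketch of such an interval, assuming Bernoulli data with k successes in n trials, so that Var(theta_hat) is estimated by theta_hat * (1 - theta_hat) / n. The function name and numbers are hypothetical.

```python
import math
from statistics import NormalDist


def wald_ci(k: int, n: int, alpha: float = 0.05):
    """Approximate (1 - alpha) confidence interval for a Bernoulli parameter."""
    theta_hat = k / n
    # plug-in estimate of sqrt(Var(theta_hat)), since the true theta is unknown
    se = math.sqrt(theta_hat * (1 - theta_hat) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{1 - alpha/2}
    return theta_hat - z * se, theta_hat + z * se


lower, upper = wald_ci(40, 100)
print(round(lower, 3), round(upper, 3))
```

The interval relies on the normal approximation and can be poor for small n or for theta_hat close to 0 or 1.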

What are credibility intervals? (2)

What is the highest density interval? (2)

Bayesian approach using the posterior
P(theta in CI | y) = integral of the posterior from left bound to right bound >= 1-alpha
- integral of posterior from -inf to left bound = integral of posterior from right bound to inf = alpha/2
- >> cut left and right probability mass of alpha/2

HDI(y) = {theta; posterior(theta | y) >= c}
c s.t. integral over posterior with all thetas of HDI = 1-alpha
- >> cut from top (posterior of left and right bound same)

What is the difference between confidence and credibility?
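Both intervals can be approximated on a grid. The sketch assumes a unimodal Beta(3, 9) posterior, chosen only for illustration, and a simple midpoint grid; the function names are hypothetical and the results are accurate only up to the grid width.

```python
def grid_posterior(a: float, b: float, m: int = 2000):
    """Midpoint grid on (0, 1) with the Beta(a, b) probability mass per cell."""
    grid = [(i + 0.5) / m for i in range(m)]
    dens = [t ** (a - 1) * (1 - t) ** (b - 1) for t in grid]
    total = sum(dens)
    return grid, [d / total for d in dens]


def equal_tailed(grid, mass, alpha: float = 0.05):
    """Cut probability mass alpha/2 on the left and on the right."""
    cum, lower, upper = 0.0, None, None
    for t, p in zip(grid, mass):
        cum += p
        if lower is None and cum >= alpha / 2:
            lower = t
        if upper is None and cum >= 1 - alpha / 2:
            upper = t
    return lower, upper


def hdi(grid, mass, alpha: float = 0.05):
    """Collect cells from the top of the posterior until mass 1 - alpha is reached."""
    order = sorted(range(len(grid)), key=lambda i: mass[i], reverse=True)
    cum, chosen = 0.0, []
    for i in order:
        chosen.append(grid[i])
        cum += mass[i]
        if cum >= 1 - alpha:
            break
    # for a unimodal posterior the chosen cells form one interval
    return min(chosen), max(chosen)


grid, mass = grid_posterior(3, 9)
equal_tailed_interval = equal_tailed(grid, mass)
hdi_interval = hdi(grid, mass)
print(equal_tailed_interval, hdi_interval)
```

For this right-skewed posterior the HDI is shifted towards the mode and is no wider than the equal-tailed interval.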

different approaches/reasoning, similar results

Define fisher information.

What is the meaning of it? (1 + 4)

Expectation(- 2nd derivative of log-likelihood w.r.t. theta)
- Expectation of observed information
measures the amount of information that y carries about theta (increases with n)
- variance of score
- reciprocal variance of estimate (lowest possible variance of unbiased estimator)
- >> central role in asymptotic theory of ML-estimation
- >> can be used for tests (e.g. Wald-test)

What is fisher regularity? (4)

support of y is independent of theta
parameter-space of theta is open
f(y; theta) is twice differentiable wrt theta
integration and differentiation are exchangeable

What are requirements for ML-inference? (2)

fisher regularity
Y_i ~ f(y; theta) iid

Define the score. (3)

first derivative of l(theta; y) w.r.t. theta = s(theta; y)
is used for ML-estimation: s(theta_hat) = 0 (max of l(theta;y))
s(theta; y) ~ N(0, I(theta))
- approx.
- proof
  - 1st Bartlett identity: differentiate 1 = integral(f(y; theta) dy) w.r.t. theta >> E(s) = 0
  - 2nd Bartlett identity: differentiate 0 = E(s) w.r.t. theta >> E(s^2) = I(theta) = Var(s)

Explain why the ML-estimate is random, what is the aim to know and how to deal with it?

l(theta; y) depends on sample (maxima around theta)
- aim: quantify expectation and uncertainty of theta_hat
>> asymptotic normality of theta_hat
- theta_hat ~ N(theta, inv(I(theta))) (approx.)
- proof:
  - define s_n, I_n etc.
  - Taylor series expansion of s_n around theta, using s_n(theta_hat) = 0
  - >> theta_hat - theta = - s_n / s'_n = inv(I_n) * s_n (approx.)
  - theta_hat - theta ~ N(0, inv(I_n)) as s_n ~ N(0, I_n)

Explain the numerical calculation of ML estimate. (Fisher Scoring) (3)

Name Problem and solutions (2)

theta_t+1 = theta_t - s / s' = theta_t + s / I (Newton step; Fisher scoring replaces -s' by its expectation I)
0) theta_t = theta_zero
1) theta_t+1 = theta_t + s(theta_t) / I(theta_t)
2) stop if ||theta_t+1 - theta_t|| < d
3) theta_ML = theta_t+1
problem: can end up in a local optimum
- try different theta_zero
- adapt step-size
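The iteration can be sketched for a model where score and Fisher information are simple. The example assumes Poisson data parametrised by theta = log(lambda), so that s(theta) = sum(y) - n * exp(theta) and I(theta) = n * exp(theta); the data, tolerance d and start value are hypothetical.

```python
import math


def fisher_scoring(y, theta_zero: float = 0.0, d: float = 1e-10, max_iter: int = 100) -> float:
    """ML estimate of theta = log(lambda) for i.i.d. Poisson data via Fisher scoring."""
    n, total = len(y), sum(y)
    theta = theta_zero                          # step 0: start value
    for _ in range(max_iter):
        score = total - n * math.exp(theta)     # s(theta_t)
        info = n * math.exp(theta)              # I(theta_t)
        theta_next = theta + score / info       # step 1: scoring update
        if abs(theta_next - theta) < d:         # step 2: stopping rule
            return theta_next                   # step 3: theta_ML
        theta = theta_next
    raise RuntimeError("Fisher scoring did not converge")


y = [2, 4, 3, 5, 1, 3]
theta_ml = fisher_scoring(y)
print(theta_ml, math.log(sum(y) / len(y)))
```

In this model the ML estimate is known in closed form, log(y_bar), which makes it easy to check that the iteration ends up in the right place.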

What is the idea of testing?

Define type 1 and type 2 error. What is the problem with the errors?

idea: answer questions based on data
type 1 error: "H1" | H0, reject H0 even though it is true
type 2 error: "H0" | H1, accept H0 even though it is false
problem: the errors trade off against each other, e.g. always deciding "H0" gives no type 1 error but the maximal type 2 error

What is a significance test at level alpha? Give an example.

P("H1" | H0) <= alpha (bound for type one error)
P(y_bar > c | mu <= mu_zero) <= alpha
- >> pivotal statistic >> c = mu_zero + z_1-alpha * sigma / sqrt(n)
  - with z_1-alpha = 1-alpha quantile of N(0, 1)
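A sketch of this one-sided test with known sigma, using the critical value c from above. The data, mu_zero = 5 and sigma = 0.5 are made-up numbers, and the function name is hypothetical.

```python
import math
from statistics import NormalDist, fmean


def one_sided_z_test(y, mu_zero: float, sigma: float, alpha: float = 0.05):
    """Decide "H1" (mu > mu_zero) if y_bar exceeds c = mu_zero + z_{1-alpha} * sigma / sqrt(n)."""
    n = len(y)
    z = NormalDist().inv_cdf(1 - alpha)      # (1 - alpha)-quantile of N(0, 1)
    c = mu_zero + z * sigma / math.sqrt(n)
    return fmean(y) > c, c


y = [5.1, 4.9, 5.6, 5.8, 5.3, 5.5, 5.2, 5.7, 5.4]
reject_h0, c = one_sided_z_test(y, mu_zero=5.0, sigma=0.5)
print(reject_h0, round(c, 4))
```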

What general kind of tests exist? (3)

What is the general problem for these tests and how to solve it?

one sided test: H0: mu <= mu_zero, H1: mu > mu_zero
- "H1" <> y_bar > c
two-sided test: H0: mu = mu_zero, H1: mu != mu_zero
- "H1" <> |y_bar - mu_zero| > c
testing theta: H0: theta in theta-zero-space, H1: theta not in theta-zero-space
problem: for calculation of c, sigma is required (mostly unknown)
- >> estimate sigma_hat >> (y_bar - mu_zero) / (sigma_hat / sqrt(n)) ~ t_n-1 distributed

Explain Wald-, score- and LR-test and compare them.

all tests: H0: theta = theta_zero, H1: theta != theta_zero
Wald-test
- "H1" <> |theta_hat - theta_zero| > c (a large deviation of theta_hat from theta_zero speaks against H0)
- c = z_1-alpha/2 * sqrt(inv(I(theta_zero))); in practice I(theta_hat) is used instead, as theta_hat is close to theta_zero under H0
- >> estimation of theta_hat and variance of estimate required
Score-test
- "H1" <> |s(theta_zero; y)| > c (large score speaks against H0)
- c = z_1-alpha/2 * sqrt(I(theta_zero))
- >> score of theta_zero is enough, no theta_hat required
likelihood-ratio test
- "H1" <> 2 * lr(theta_zero, theta_hat) > c (large lr speaks against H0)
- 2 * lr() = 2(l(theta_hat) - l(theta_zero)) ~ X^2_p (approx., under H0)
  - lr(theta_zero, theta_hat) = l(theta_hat) - l(theta_zero) >= 0
  - l(theta_zero) = Taylor series expansion at theta_hat = l(theta_hat) - 0.5 s^2 / I (approx., with s = s(theta_zero; y))
  - lr(,) = 0.5 s^2/I (approx.)
- c = X^2_p,1-alpha
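The three statistics can be computed side by side for a one-parameter model. The sketch assumes i.i.d. Poisson(lambda) data with l(lambda) = sum(y) * log(lambda) - n * lambda (up to a constant), s(lambda) = sum(y) / lambda - n and I(lambda) = n / lambda. It uses the squared forms of the Wald and score statistics, which are compared with the chi-squared quantile with one degree of freedom (equal to the squared z_1-alpha/2). Data and lambda_zero are hypothetical.

```python
import math
from statistics import NormalDist


def poisson_tests(y, lam_zero: float):
    """Squared Wald, squared score and 2 * lr statistic for H0: lambda = lam_zero."""
    n, total = len(y), sum(y)
    lam_hat = total / n                                   # ML estimate

    def loglik(lam):
        return total * math.log(lam) - n * lam            # up to a constant

    wald = (lam_hat - lam_zero) ** 2 * (n / lam_hat)      # information at lam_hat
    score = (total / lam_zero - n) ** 2 / (n / lam_zero)  # only lam_zero needed
    lr2 = 2 * (loglik(lam_hat) - loglik(lam_zero))
    return wald, score, lr2


y = [2, 4, 3, 5, 1, 3]
wald, score, lr2 = poisson_tests(y, lam_zero=2.0)
critical = NormalDist().inv_cdf(0.975) ** 2   # chi-squared(1) quantile for alpha = 0.05
print(wald, score, lr2, critical)
```

The three statistics differ in finite samples but are asymptotically equivalent under H0; here none of them exceeds the critical value.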

What is the idea of power of a test?

Define the power. Give an example

What is the aim?

no statement about type 2 error yet
power = P("H1" | H1) = 1 - P(type2 error)
- = 1 - PHI(z_1-alpha + (mu_zero - mu) / (sigma / sqrt(n)))
- increasing n increases the power, decreases type2 error
aim: maximal power while maintaining alpha
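The power formula for the one-sided test with known sigma can be evaluated directly. The values of mu, mu_zero, sigma and n below are made up for illustration.

```python
import math
from statistics import NormalDist


def power(mu: float, mu_zero: float, sigma: float, n: int, alpha: float = 0.05) -> float:
    """P("H1" | mu) = 1 - PHI(z_{1-alpha} + (mu_zero - mu) / (sigma / sqrt(n)))."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1 - alpha)
    return 1 - std_normal.cdf(z + (mu_zero - mu) / (sigma / math.sqrt(n)))


power_at_h0 = power(5.0, 5.0, 0.5, 9)      # at the boundary of H0 the power equals alpha
power_small_n = power(5.3, 5.0, 0.5, 9)
power_large_n = power(5.3, 5.0, 0.5, 36)
print(power_at_h0, power_small_n, power_large_n)
```

For a fixed alternative mu the power grows with n, which is the basis of sample size planning.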

What is the Neyman-Pearson-Lemma: aim, rules, idea of proof.

aim: construct an optimal test: level-alpha significance test with higher power than any other level-alpha test PSI(y)
"H1" <> l(theta_zero) - l(theta_one) <= c
- a = f(y; theta_zero) / f(y; theta_one) <= exp(c) = k
- PHI(y) = 1 if a <= k, else 0
proof:
- P(PSI(y) = 1; theta_one) <= P(PHI(y) = 1; theta_one) >> PHI has larger power
- three cases (a < k, a > k, a = k)
- always: power(PHI) - power(PSI) >= (1/k) * (P(PHI(y) = 1 | H0) - P(PSI(y) = 1 | H0)) >= 0
