SRaI
@LMU
Flashcard set details
| Summary | This flashcard set covers advanced statistical methods at the university level, focusing on topics like ANOVA, missing data analysis, and copula models. It delves into key concepts such as variance, data imputation, and the trade-offs between sample size and data quality. Researchers and students in statistics or data science will find this set particularly useful for understanding complex analytical techniques and their practical applications. |
|---|---|
| Cards | 125 |
| Learners | 1 |
| Language | English |
| Category | Computer Science |
| Level | University |
| Created / Updated | 04.10.2019 / 11.10.2019 |
| Weblink | https://card2brain.ch/cards/20191004_srai |
State Bayes' rule.
\(P(A|B) = \frac{P(B|A) ~\cdot~ P(A)}{P(B)}\)
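A small numeric sketch of the rule, assuming P(B) is expanded by the law of total probability. The function name and the diagnostic-test numbers are hypothetical and chosen only for illustration.

```python
def bayes_posterior(prior_a: float, p_b_given_a: float, p_b_given_not_a: float) -> float:
    """Return P(A|B) via Bayes' rule, with P(B) from the law of total probability."""
    p_b = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / p_b


# Hypothetical numbers: A = "condition present", B = "test positive".
posterior = bayes_posterior(prior_a=0.01, p_b_given_a=0.95, p_b_given_not_a=0.05)
print(round(posterior, 3))  # roughly 0.161
```

Even with a sensitive test, the small prior P(A) keeps the posterior low.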
What is a random variable?
- a random variable Y maps from the event space Omega to real values
Define the expected value and variance
\(E(Y) = \int_{-\infty}^{\infty}u ~f(u)~du = \mu\)
\(Var(Y) = \int_{-\infty}^{\infty}(y - \mu)^2~f(y)~dy = E((Y - \mu)^2) = \sigma^2 = E(Y^2) - \mu^2\)
Define the exponential family
\(f(y;\theta) = \exp(t^T(y)~\theta - K(\theta))~h(y)\)
- with t(y) = statistic = function of the data
- theta = parameter (vector)
- K(theta) = normalisation constant s.t. integral(f(y; theta)) = 1
- h(y) >= 0, does not depend on theta
- and \(\frac{\partial K(\theta)}{\partial\theta} = E(t(Y))\)
What is the t-distribution good for?
- statistical test for the mean of normally distributed variables
- when the variance is unknown (estimated from the data)
Define covariance for Y1, Y2.
What about independence? What does this imply?
\(Cov(Y_1, Y_2) = E((Y_1 - E(Y_1))(Y_2 - E(Y_2))) = E(Y_1~Y_2)-E(Y_1)E(Y_2)\)
Cov(Yj, Yk) = 0 if Yj, Yk are independent
- f(yj, yk) = f(yj) * f(yk)
- E(Yj * Yk) = E(Yj) * E(Yk)
Define correlation.
- Corr(Yj, Yk) = Cov(Yj, Yk) / sqrt(Var(Yj) * Var(Yk))
State the law of iterated expectation.
\(E(Y) = E_X(E(Y|X))\\Var(Y) = E_X(Var(Y|X)) + Var_X(E(Y|X))\)
What is the idea of the central limit theorem?
What is required? (3)
Give an example.
- the standardised sum of random variables with any distribution converges to a normal distribution
- for n -> infinity (asymptotically)
- conditions
- i.i.d.
- mean exists (finite)
- finite variance
- example: random walk with zero-mean steps: Y_n = sum(x_i)
- Z_n = Y_n / sqrt(n) ~ N(0, sigma^2) (approx.)
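The random-walk example can be checked by simulation. This is a minimal sketch assuming steps x_i in {-1, +1} with equal probability (mean 0, variance 1), so Z_n should be close to N(0, 1). The seed, n and the number of replications are arbitrary choices.

```python
import math
import random
import statistics


def standardized_sum(n: int, rng: random.Random) -> float:
    """Z_n = Y_n / sqrt(n) for a random walk with steps x_i in {-1, +1}."""
    y_n = sum(rng.choice((-1, 1)) for _ in range(n))
    return y_n / math.sqrt(n)


rng = random.Random(0)  # fixed seed, only for reproducibility
draws = [standardized_sum(100, rng) for _ in range(4000)]

# Under the CLT the draws should have mean near 0 and variance near 1,
# and roughly 95% of them should fall within +-1.96.
sample_mean = statistics.fmean(draws)
sample_var = statistics.pvariance(draws)
share_within = sum(abs(z) <= 1.96 for z in draws) / len(draws)
print(sample_mean, sample_var, share_within)
```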
What is the moment generating function? What about its k-th derivative?
What is the cumulant generating function? What is special about the first and 2nd derivative? What are restrictions?
- \(M_Y(t) = E(e^{t Y})\)
- \(\frac{\partial^kM_Y(t)}{(\partial t)^k} = E(Y^k) \) evaluated at t = 0
- \(K_Y(t) = log(M_Y(t))\)
- first derivative w.r.t. t at t = 0: E(Y) = first moment
- 2nd derivative w.r.t. t at t = 0: Var(Y) = 2nd central moment
- as long as moments are finite
What is the general idea of parametric statistical models? (2) Define Posterior and its components.
- y (data) is a realisation of Y with Y ~ F(y; theta) (model)
- theta unknown
- \(f_\theta (\vartheta|y) = \frac{(\prod f(y_i; ~\vartheta))~f_\theta(\vartheta)}{\int{\prod f(y_i;~\vartheta)~f_\theta(\vartheta)~d\vartheta}}\)
- posterior = all information about theta
- likelihood: information in data
- prior: knowledge about theta before observing the data
- denominator: normalisation constant (f(y), independent of theta)
Define likelihood and loglikelihood.
Name characteristics. (2)
- \(L(\theta;y) = \prod f(y_i; \theta)\)
- \(l(\theta;y) = \sum log~f(y_i; \theta)\)
Characteristics
- plausibility of parameter-values theta, given data y
- posterior proportional to Likelihood * prior
How can theta be estimated (theta_hat)? (3, 2 each). Relate them.
- posterior mean estimate
- theta_hat = E(theta | y)
- requires (numerical) integration (E(.))
- posterior mode estimate
- theta_hat = argmax(posterior)
- first derivative required
- ML-estimation
- theta_hat = argmax(L(theta; y)) (set first derivative of the log-likelihood w.r.t. theta to zero)
- assumes flat/constant/non-informative prior
- posterior mode estimate = ML-estimate if prior is flat
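The relation between the three estimates can be seen in a conjugate example. This sketch assumes Bernoulli data with a Beta(a, b) prior, which is not part of the card, so that the posterior is Beta(a + k, b + n - k) and all three estimates have closed forms. The function name and the data are hypothetical.

```python
def bernoulli_estimates(k: int, n: int, a: float = 1.0, b: float = 1.0) -> dict:
    """ML, posterior mode and posterior mean for k successes in n Bernoulli trials.

    Assumes a Beta(a, b) prior; the mode formula needs a + k > 1 and b + n - k > 1.
    """
    post_a, post_b = a + k, b + n - k  # parameters of the Beta posterior
    return {
        "ml": k / n,
        "posterior_mode": (post_a - 1) / (post_a + post_b - 2),
        "posterior_mean": post_a / (post_a + post_b),
    }


flat = bernoulli_estimates(k=7, n=10)                    # flat Beta(1, 1) prior
informative = bernoulli_estimates(k=7, n=10, a=5, b=5)   # prior centred at 0.5
print(flat)
print(informative)
```

With the flat prior the posterior mode coincides with the ML estimate, while the informative prior pulls both Bayesian estimates towards 0.5.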
Explain the invariance property of the ML-estimate
- gamma = g(theta), theta = g^-1(gamma)
- gamma_ML = g(theta_ML) >> no reestimation required after transformation
- g(.) bijective transformation function
- also holds for the variance (delta method): gamma_hat - gamma ~ N(0, derivative of gamma wrt theta * inv(I(theta)) * derivative of gamma wrt theta) (approx.)
Define loss, squared loss. (3)
- L: T x parameter-space -> R+ (T = set of possible values of t)
- squared loss: L(t, theta) = (t-theta)^2 >> min
- L(theta, theta) = 0
What is the problem with loss? (2)
Explain solutions. (2), (3)
- t(y) = theta_hat depends on the sample
- risk = expected loss (R(t, theta) = E(L(.)))
- estimate theta_hat s.t. R(.) is minimised
- theta unknown (risk and loss depend on theta)
- minimax approach: choose t(y) s.t. the maximal risk over theta is minimised
- Bayes risk: minimise expectation of risk w.r.t. theta
- posterior Bayes risk: minimise expectation of loss given the data
Define MSE. How does it decompose?
- R(t, theta) = E((t(y) - theta)^2) = expectation of squared loss
- = Var(t(y)) + Bias^2(t, theta)
- = stochastic error (variability of t) + systematic error
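The decomposition can be checked by simulation. The sketch below assumes normal data and a deliberately biased estimator t(y) = 0.8 * mean(y); the estimator and all numbers are hypothetical choices. For the simulated estimates, the empirical MSE equals empirical variance plus squared empirical bias up to floating-point error.

```python
import random
import statistics


def shrunken_mean(sample, shrink=0.8):
    """Deliberately biased estimator t(y) = shrink * mean(y) (hypothetical choice)."""
    return shrink * statistics.fmean(sample)


rng = random.Random(1)  # fixed seed, only for reproducibility
theta, sigma, n, reps = 2.0, 1.0, 20, 4000
estimates = [
    shrunken_mean([rng.gauss(theta, sigma) for _ in range(n)])
    for _ in range(reps)
]

mse = statistics.fmean([(t - theta) ** 2 for t in estimates])  # expected squared loss
variance = statistics.pvariance(estimates)                     # stochastic error
bias = statistics.fmean(estimates) - theta                     # systematic error
print(mse, variance + bias ** 2)
```

In theory the bias here is (0.8 - 1) * theta = -0.4 and the variance is 0.64 * sigma^2 / n = 0.032.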
Define Bias. What is the goal at estimation?
- Bias(t, theta) = E(t(y)) - theta
- aim: asymptotically unbiased estimate s.t. Bias = 0 for whole parameter space
What is the general idea of Kullback-Leibler divergence? (4)
Define KL(t, theta).
What is the result, if KL is used as a loss function?
- compares distributions
- not symmetric
- log(f(y; theta)) - log(f(y; t)) = log(f(y; theta) / f(y; t))
- theta -> true distribution
- t -> estimated distribution
- >= 0, with 0 if t = theta
- KL(t, theta) = \(\int log\frac{f(y; \theta)}{f(y; t)} f(y; \theta)dy\)
- E(KL) = R(t, theta) = integral(0.5 * I(theta) * (t(y) - theta)^2 * prod(f(y_i; theta)) dy) (approx.)
- Fisher information independent of data
- >> min(KL-risk) approximates min(MSE)
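A numeric sketch of these properties, assuming a single Bernoulli observation (not part of the card), for which the Fisher information is 1 / (theta * (1 - theta)). It shows the asymmetry and compares KL with the second-order Taylor approximation 0.5 * I(theta) * (t - theta)^2, which is what links the KL-risk to the MSE.

```python
import math


def kl_bernoulli(theta: float, t: float) -> float:
    """KL divergence with Bernoulli(theta) as true and Bernoulli(t) as estimated distribution."""
    return theta * math.log(theta / t) + (1 - theta) * math.log((1 - theta) / (1 - t))


theta, t = 0.3, 0.35                      # hypothetical true and estimated parameter
kl = kl_bernoulli(theta, t)
kl_swapped = kl_bernoulli(t, theta)       # roles exchanged: a different value
fisher_info = 1.0 / (theta * (1 - theta))
quadratic = 0.5 * fisher_info * (t - theta) ** 2
print(kl, kl_swapped, quadratic)
```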
What is sufficiency?
What is a sufficient statistic? (3)
- quality of estimate t(y)
- all information about theta in t()
- idea: replace data y by t(y)
- t(y) is sufficient if f(y | t(y); theta) is independent of theta
- problem: hard to show
What is the Neyman factorisation? What is the problem?
- t(y) is sufficient if f(y; theta) = h(y) * g(t(y); theta)
- h() and g() non-negative
- >> h() independent of theta
- >> g() depends on y only through t()
- weak statement as y is already sufficient
Define minimal sufficient statistics.
- t() is sufficient AND
- for every other sufficient statistic t*() there exists h() s.t. t() = h(t*(.))
Define consistency. (2)
- quantify information in data
- theta_hat is consistent if MSE(theta_hat, theta) -> 0 for n -> inf
- >> theta_hat -> theta (asymptotically)
Define the Cramér-Rao bound (2).
Name properties (2)
- lower limit of MSE for given n
- holds only for Fisher-regular distributions
- for unbiased theta_hat: MSE() >= inv(I(theta))
- Var(theta_hat) >= inverse of Fisher information (= in best case)
- for biased theta_hat: MSE() >= Bias()^2 + (1 + derivative of Bias wrt theta)^2 / I(theta)
- smaller variance possible, but larger bias then
What is the idea of confidence intervals? (2)
- quantify uncertainty of theta_hat
- interval estimate instead of point estimate
- P(theta in CI) >= 1-alpha (ideally with equality)
What is a pivotal statistic? Give an example.
- g(y; theta) whose distribution is independent of theta
- e.g. x = (theta_hat - theta) / sqrt(Var(theta_hat)) ~ N(0,1)
- pivotal statistic by using CLT
- P(z_alpha/2 <= x <= z_1-alpha/2) = 1-alpha
- CI = [theta_hat +- z_1-alpha/2 * sqrt(Var(theta_hat))]
- problem: Var(theta_hat) depends on theta >> circular >> plug in an estimate of Var(theta_hat)
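As an illustration of the plug-in step, here is a sketch of the resulting interval for a Bernoulli probability. The model choice, the function name and the data are assumptions. Var(theta_hat) = theta * (1 - theta) / n depends on theta, so theta_hat is plugged in.

```python
import math
from statistics import NormalDist


def wald_interval(successes: int, n: int, alpha: float = 0.05):
    """Approximate (1 - alpha) confidence interval for a Bernoulli probability."""
    theta_hat = successes / n
    # Var(theta_hat) depends on the unknown theta, so it is estimated with theta_hat.
    se = math.sqrt(theta_hat * (1 - theta_hat) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_1-alpha/2 quantile of N(0, 1)
    return theta_hat - z * se, theta_hat + z * se


lower, upper = wald_interval(successes=40, n=100)  # hypothetical data
print(lower, upper)
```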
What are credibility intervals? (2)
What is the highest density interval? (2)
- Bayesian approach using the posterior
- P(theta in CI | y) = integral from left bound to right bound over posterior >= 1-alpha
- integral of posterior from -inf to left bound = integral of posterior from right bound to inf = alpha/2
- >> cut left and right probability mass of alpha/2
- HDI(y) = {theta; posterior >= c}
- c s.t. integral over posterior with all thetas of HDI = 1-alpha
- >> cut from top (posterior of left and right bound same)
What is the difference between confidence and credibility?
- different approaches/reasoning, similar results
Define fisher information.
What is the meaning of it? (1 + 4)
- Expectation(- 2nd derivative of log-likelihood w.r.t. theta)
- Expectation of observed information
- measures the amount of information that y carries about theta (increases with n)
- variance of score
- reciprocal variance of estimate (lowest possible variance of unbiased estimator)
- >> central role in asymptotic theory of ML-estimation
- >> can be used for tests (e.g. Wald-test)
What is fisher regularity? (4)
- support of y is independent of theta
- parameter-space of theta is open
- f(y; theta) is twice differentiable wrt theta
- integration and differentiation are exchangeable
What are requirements for ML-inference? (2)
- fisher regularity
- Y_i ~ f(y; theta) iid
Define the score. (3)
- first derivative of l(theta; y) w.r.t. theta = s(theta; y)
- is used for ML-estimation: s(theta_hat) = 0 (max of l(theta;y))
- s(theta; y) ~ N(0, I(theta))
- approx.
- proof
- 1st Bartlett identity: 1 = integral(f(y; theta)) -> diff() -> E(s) = 0
- 2nd Bartlett identity: 0 = E(s) -> diff -> E(s^2) = I(theta) = Var(s)
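Both identities can be verified exactly for a single Bernoulli observation, a model chosen here only for illustration, by summing over the two possible outcomes. The score is s(theta; y) = y / theta - (1 - y) / (1 - theta) and the Fisher information is 1 / (theta * (1 - theta)).

```python
def score(theta: float, y: int) -> float:
    """Score of one Bernoulli observation: derivative of log f(y; theta) w.r.t. theta."""
    return y / theta - (1 - y) / (1 - theta)


theta = 0.3                                # hypothetical parameter value
pmf = {1: theta, 0: 1 - theta}             # P(Y = 1), P(Y = 0)

expected_score = sum(p * score(theta, y) for y, p in pmf.items())       # E(s)
score_variance = sum(p * score(theta, y) ** 2 for y, p in pmf.items())  # E(s^2)
fisher_info = 1 / (theta * (1 - theta))
print(expected_score, score_variance, fisher_info)
```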
Explain why the ML-estimate is random, what is the aim to know and how to deal with it?
- l(theta; y) depends on sample (maxima around theta)
- aim: quantify expectation and uncertainty of theta_hat
- >> asymptotic normality of theta_hat
- theta_hat ~ N(theta, inv(I(theta))) (approx.)
- proof:
- define s_n, I_n etc.
- Taylor series expansion of s_n around theta, using s_n(theta_hat) = 0
- >> theta_hat - theta = - s_n / s'_n = inv(I_n) * s_n (approx.)
- theta_hat - theta ~ N(0, inv(I_n)) as s_n ~ N(0, I_n)
Explain the numerical calculation of ML estimate. (Fisher Scoring) (3)
Name Problem and solutions (2)
- theta_t+1 = theta_t - s / s' = theta_t + s / I
- 0) theta_t = theta_zero
- 1) theta_t+1 = theta_t + s(theta_t) / I(theta_t)
- 2) stop if ||theta_t+1 - theta_t|| < d
- 3) theta_ML = theta_t+1
- problem: can end up in a local optimum
- >> try different theta_zero
- >> adapt step-size
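The iteration above can be sketched for a concrete model. The example assumes i.i.d. Poisson counts with theta = log(lambda), a choice not made in the card, so that s(theta) = sum(y) - n * exp(theta) and I(theta) = n * exp(theta). The data and the function name are hypothetical.

```python
import math


def fisher_scoring_poisson(y, theta_zero=0.0, tol=1e-10, max_iter=100):
    """Fisher scoring for theta = log(lambda) of i.i.d. Poisson counts y."""
    n, total = len(y), sum(y)
    theta = theta_zero                           # step 0: starting value
    for _ in range(max_iter):
        score = total - n * math.exp(theta)      # s(theta_t)
        info = n * math.exp(theta)               # I(theta_t)
        theta_next = theta + score / info        # step 1: update
        if abs(theta_next - theta) < tol:        # step 2: stopping rule
            return theta_next                    # step 3: theta_ML
        theta = theta_next
    raise RuntimeError("Fisher scoring did not converge")


counts = [2, 4, 3, 5, 1, 3]                      # hypothetical data, mean 3
theta_ml = fisher_scoring_poisson(counts)
print(theta_ml, math.exp(theta_ml))
```

For this model the log-likelihood is concave in theta, so there is a single optimum at log(mean(y)); the local-optimum problem only arises for less well-behaved likelihoods.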
What is the idea of testing?
Define type-one and type-two error. What is the problem with the errors?
- idea: answer questions based on data
- type-one error: "H1" | H0, reject H0 even though it is true
- type-two error: "H0" | H1, accept H0 even though it is false
- problem: the errors trade off against each other: e.g. always deciding "H0" gives no type-one error but the maximal type-two error
What is a significance test at level alpha? Give an example.
- P("H1" | H0) <= alpha (bound for type one error)
- P(y_bar > c | mu <= mu_zero) <= alpha
- >> pivotal statistic >> c = mu_zero + z_1-alpha * sigma / sqrt(n)
- with z_1-alpha = 1-alpha quantile of N(0, 1)
What general kind of tests exist? (3)
What is the general problem for these tests and how to solve it?
- one sided test: H0: mu <= mu_zero, H1: mu > mu_zero
- "H1" <> y_bar > c
- two-sided test: H0: mu = mu_zero, H1: mu != mu_zero
- "H1" <> |y_bar - mu_zero| > c
- testing theta: H0: theta in theta-zero-space, H1: theta not in theta-zero-space
- problem: for calculation of c, sigma is required (mostly unknown)
- >> estimate sigma_hat >> (y_bar - mu_zero) / (sigma_hat / sqrt(n)) ~ t_n-1 distributed
Explain Wald-, score- and LR-test and compare them.
- all tests: H0: theta = theta_zero, H1: theta != theta_zero
- Wald-test
- "H1" <> |theta_hat - theta_zero| > c (large deviations of theta_hat from theta_zero speak against H0)
- c = z_1-alpha/2 * sqrt(inv(I(theta_zero))) >> in practice I(theta_hat) is used instead of I(theta_zero), as theta_hat ~ theta_zero under H0
- >> estimation of theta_hat and variance of estimate required
- Score-test
- "H1" <> |s(theta_zero; y)| > c (large score speaks against H0)
- c = z_1-alpha/2 * sqrt(I(theta_zero))
- >> score of theta_zero is enough, no theta_hat required
- likelihood-ratio test
- "H1" <> lr(theta_zero, theta_hat) > c (large lr speaks against H0)
- lr(theta, theta_hat) = 2(l(theta_hat) - l(theta)) ~ X^2_p (approx.)
- lr(theta, theta_hat) >= 0
- l(theta) = TSE at theta_hat = l(theta_hat) - 0.5 s^2 / I (approx., s and I evaluated at theta)
- >> lr(theta, theta_hat) = s^2 / I (approx.)
- c = X^2_p,1-alpha
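To compare the three tests on one data set, this sketch assumes i.i.d. Poisson counts with rate lambda and H0: lambda = lambda_zero, a model chosen only for illustration. All statistics are written in squared form, so each is compared with the same chi-squared quantile with 1 degree of freedom (the square of z_1-alpha/2). For a scalar parameter this is equivalent to the absolute-value form above.

```python
import math
from statistics import NormalDist


def poisson_tests(y, lambda_zero):
    """Wald, score and LR statistics for H0: lambda = lambda_zero (i.i.d. Poisson)."""
    n, total = len(y), sum(y)
    lambda_hat = total / n                                   # ML estimate
    wald = (lambda_hat - lambda_zero) ** 2 * n / lambda_hat  # uses I(lambda_hat) = n / lambda_hat
    score = total / lambda_zero - n                          # s(lambda_zero; y)
    score_stat = score ** 2 / (n / lambda_zero)              # s^2 / I(lambda_zero)
    lr = 2 * (total * math.log(lambda_hat / lambda_zero) - n * (lambda_hat - lambda_zero))
    return wald, score_stat, lr


counts = [2, 4, 3, 5, 1, 3]                                  # hypothetical data
wald, score_stat, lr = poisson_tests(counts, lambda_zero=2.0)
critical = NormalDist().inv_cdf(0.975) ** 2                  # chi-squared(1) 0.95 quantile
print(wald, score_stat, lr, critical)
```

The three statistics differ in finite samples but are asymptotically equivalent. Note that only the score statistic can be computed without theta_hat.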
What is the idea of power of a test?
Define the power. Give an example
What is the aim?
- idea: a significance test only bounds the type-one error, no statement about the type-two error yet
- power = P("H1" | H1) = 1 - P(type2 error)
- = 1 - PHI(z_1-alpha + (mu_zero - mu) / (sigma / sqrt(n)))
- increasing n increases the power, decreases type2 error
- aim: maximal power while maintaining alpha
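A sketch of the power formula for the one-sided test of a normal mean with known sigma, where H0 is rejected if y_bar > c. The function names and the numbers are hypothetical.

```python
import math
from statistics import NormalDist


def critical_value(mu_zero, sigma, n, alpha=0.05):
    """c = mu_zero + z_1-alpha * sigma / sqrt(n); reject H0 if y_bar > c."""
    return mu_zero + NormalDist().inv_cdf(1 - alpha) * sigma / math.sqrt(n)


def power(mu, mu_zero, sigma, n, alpha=0.05):
    """P("H1" | true mean mu) = 1 - PHI(z_1-alpha + (mu_zero - mu) / (sigma / sqrt(n)))."""
    z = NormalDist().inv_cdf(1 - alpha)
    return 1 - NormalDist().cdf(z + (mu_zero - mu) / (sigma / math.sqrt(n)))


power_at_null = power(mu=0.0, mu_zero=0.0, sigma=1.0, n=25)   # equals alpha
power_small_n = power(mu=0.5, mu_zero=0.0, sigma=1.0, n=25)
power_large_n = power(mu=0.5, mu_zero=0.0, sigma=1.0, n=100)
print(power_at_null, power_small_n, power_large_n)
```

At mu = mu_zero the power equals alpha, and for a fixed alternative it grows with n.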
What is the Neyman-Pearson-Lemma: aim, rules, idea of proof.
- aim: construct optimal test: significance test at level alpha + at least the power of any other level-alpha test PSI(y)
- "H1" <> l(theta_zero) - l(theta_1) <= c
- a = f(y; theta_zero) / f(y; theta_1) <= exp(c) = k
- PHI(y) = 1 if a <= k, else 0
- proof:
- P(PSI(y) = 1; theta_one) <= P(PHI(y) = 1; theta_one) >> larger power
- three cases (a < k, a > k, a = k)
- always: power(PHI) - power(PSI) >= (1/k) * (P(PHI(y) = 1 | H0) - P(PSI(y) = 1 | H0)) >= 0