SRaI
@LMU
Set of flashcards Details

| Flashcards | 125 |
|---|---|
| Language | English |
| Category | Computer Science |
| Level | University |
| Created / Updated | 04.10.2019 / 11.10.2019 |
| Weblink | https://card2brain.ch/box/20191004_srai |
Name Bayes' rule.
\(P(A|B) = \frac{P(B|A) ~\cdot~ P(A)}{P(B)}\)
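A small numeric illustration (numbers chosen only for illustration, not part of the original card): with prevalence \(P(D) = 0.01\), sensitivity \(P(+|D) = 0.99\) and false-positive rate \(P(+|\bar D) = 0.05\),
\(P(D|+) = \frac{0.99 \cdot 0.01}{0.99 \cdot 0.01 + 0.05 \cdot 0.99} \approx 0.17\)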
What is a random variable?
- a random variable Y maps from the event space \(\Omega\) to real values
Define the expected value and variance
\(E(Y) = \int_{-\infty}^{\infty}u ~f(u)~du = \mu\)
\(Var(Y) = \int_{-\infty}^{\infty}(y - \mu)^2~f(y)~dy = E((Y - \mu)^2) = \sigma^2 = E(Y^2) - \mu^2\)
Define the exponential family
\(f(y;\theta) = \exp\big(t^T(y)\,\theta - K(\theta)\big)\,h(y)\)
- with t(y) = statistic = function of the data
- theta = parameter (vector)
- K(theta) = normalisation constant s.t. \(\int f(y;\theta)\,dy = 1\)
- h(y) >= 0, unimportant for inference
- and \(\frac{\partial K(\theta)}{\partial\theta} = E(t(Y))\) (see the Poisson example below)
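Worked example (standard result, added for illustration): the Poisson distribution in exponential-family form is
\(f(y;\lambda) = \frac{\lambda^y e^{-\lambda}}{y!} = \exp\big(y \log\lambda - \lambda\big)\,\frac{1}{y!}\)
with \(t(y) = y\), \(\theta = \log\lambda\), \(K(\theta) = e^\theta\), \(h(y) = 1/y!\), and indeed \(\frac{\partial K(\theta)}{\partial\theta} = e^\theta = \lambda = E(Y)\).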
What is the t-distribution good for?
- statistical test for the mean of normally distributed variables
- when the variance is unknown (and has to be estimated from the data)
Define covariance for Y1, Y2.
What about independence? What does this imply?
\(Cov(Y_1, Y_2) = E\big((Y_1 - E(Y_1))(Y_2 - E(Y_2))\big) = E(Y_1~Y_2) - E(Y_1)E(Y_2)\)
Cov(Yj, Yk) = 0 if Yj, Yk are independent (the converse does not hold in general)
- f(yj, yk) = f(yj) * f(yk)
- E(Yj * Yk) = E(Yj) * E(Yk)
Define correlation.
- Corr(Yj, Yk) = Cov(Yj, Yk) / sqrt(Var(Yj) * Var(Yk))
Name iterated expectation.
\(E(Y) = E_X(E(Y|X))\)
\(Var(Y) = E_X(Var(Y|X)) + Var_X(E(Y|X))\)
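A quick illustration (standard hierarchical example, not part of the original card): if \(Y|X \sim N(X, \sigma^2)\) and \(X \sim N(\mu, \tau^2)\), then
\(E(Y) = E_X(X) = \mu\)
\(Var(Y) = E_X(\sigma^2) + Var_X(X) = \sigma^2 + \tau^2\)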
What is the idea of central limit theorem?
What is required? (3)
Give an example.
- the (standardised) sum of arbitrarily distributed random variables converges to a normal distribution
- for n -> infinity (asymptotically)
- conditions
- i.i.d.
- mean exists
- finite variance
- example: random walk \(Y_n = \sum_i x_i\) with \(E(x_i) = 0\)
- \(Z_n = Y_n / \sqrt{n} \sim N(0, \sigma^2)\) asymptotically
What is the moment generating function? What about its k-th derivative?
What is the cumulant generating function? What is special about the first and 2nd derivative? What are restrictions?
- \(M_Y(t) = E(e^{t Y})\)
- \(\frac{\partial^kM_Y(t)}{(\partial t)^k} = E(Y^k) \) evaluated at t = 0
- \(K_Y(t) = log(M_Y(t))\)
- first derivative w.r.t. t at t = 0: E(Y) = first cumulant (= first moment)
- 2nd derivative w.r.t. t at t = 0: Var(Y) = 2nd cumulant
- restriction: only valid as long as the moments (and the MGF) are finite (see the normal example below)
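Worked example (standard result, added for illustration): for \(Y \sim N(\mu, \sigma^2)\),
\(M_Y(t) = \exp(\mu t + \tfrac{1}{2}\sigma^2 t^2)\)
\(K_Y(t) = \mu t + \tfrac{1}{2}\sigma^2 t^2\)
so \(K_Y'(0) = \mu = E(Y)\) and \(K_Y''(0) = \sigma^2 = Var(Y)\).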
What is the general idea of parametric statistical models? (2) Define Posterior and its components.
- the data y are a realisation of Y with Y ~ F(y; theta) (the model)
- theta unknown
- \(f_\theta (\vartheta|y) = \frac{\big(\prod_i f(y_i;\,\vartheta)\big)\,f_\theta(\vartheta)}{\int \prod_i f(y_i;\,\vartheta)\,f_\theta(\vartheta)\,d\vartheta}\)
- posterior = all information about theta
- likelihood: information in the data
- prior: knowledge about theta before observing the data
- denominator: normalisation constant (f(y), independent of theta); see the conjugate example below
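A standard conjugate example (added for illustration): for \(y_i | \vartheta \sim Bernoulli(\vartheta)\) i.i.d. with prior \(\vartheta \sim Beta(a, b)\),
\(f_\theta(\vartheta|y) \propto \vartheta^{\sum y_i}(1-\vartheta)^{n - \sum y_i} \cdot \vartheta^{a-1}(1-\vartheta)^{b-1}\)
so \(\vartheta | y \sim Beta\big(a + \sum y_i,\; b + n - \sum y_i\big)\); the normalisation constant in the denominator never has to be computed explicitly.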
Define likelihood and loglikelihood.
Name characteristics. (2)
- \(L(\theta;y) = \prod f(y_i; \theta)\)
- \(l(\theta;y) = \sum log~f(y_i; \theta)\)
Characteristics
- plausibility of parameter-values theta, given data y
- posterior proportional to Likelihood * prior
How can theta be estimated (theta_hat)? (3, 2 each) Relate them.
- posterior mean estimate
- theta_hat = E(posterior)
- requires (numerical) integration (E(.))
- posterior mode estimate
- theta_hat = argmax(posterior)
- first derivative required
- ML-estimation
- theta_hat = argmax L(theta; y) (set the first derivative of the loglikelihood w.r.t. theta to zero)
- assumes a flat/constant/non-informative prior
- posterior mode estimate = ML-estimate if the prior is flat (see the Bernoulli example below)
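A short worked example (standard result, added for illustration): for \(y_i \sim Bernoulli(\theta)\) i.i.d.,
\(l(\theta; y) = \sum_i \big(y_i \log\theta + (1-y_i)\log(1-\theta)\big)\)
\(\frac{\partial l}{\partial\theta} = \frac{\sum y_i}{\theta} - \frac{n - \sum y_i}{1-\theta} = 0 \;\Rightarrow\; \hat\theta_{ML} = \bar y\)
which also equals the posterior mode under a flat Beta(1, 1) prior.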
Explain invariance-property of ML-estimate
- gamma = g(theta), theta = g^-1(gamma)
- gamma_ML = g(theta_ML) >> no reestimation required after transformation
- g(.) bijective transformation function
- also carries over to the variance (delta method): \(\hat\gamma - \gamma \sim N\big(0,\; \tfrac{\partial g(\theta)}{\partial\theta}\, I(\theta)^{-1}\, \tfrac{\partial g(\theta)}{\partial\theta}^T\big)\) approximately (see the example below)
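A small illustration (standard example, not part of the original card): for an i.i.d. Poisson(\(\lambda\)) sample, \(\hat\lambda_{ML} = \bar y\); for \(\gamma = g(\lambda) = e^{-\lambda} = P(Y = 0)\) the ML estimate is directly \(\hat\gamma_{ML} = e^{-\bar y}\), with approximate variance \(e^{-2\lambda}\,\lambda/n\) since \(I(\lambda)^{-1} = \lambda/n\).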
Define loss, squared loss. (3)
- \(L: T \times \Theta \rightarrow \mathbb{R}^+\) (estimate-space x parameter-space)
- L(t, theta) = (t-theta)^2 >> min
- L(theta, theta) = 0
What is the problem with loss? (2)
Explain solutions. (2), (3)
- t(y) = theta_hat depends on the sample
- risk = expected loss: \(R(t, \theta) = E\big(L(t(Y), \theta)\big)\)
- estimate theta_hat s.t. R(.) is minimised
- theta unknown (risk and loss depend on theta)
- minimax approach: choose the t(y) that minimises the maximal risk over theta
- bayes risk: minimise the expectation of the risk w.r.t. theta (prior)
- posterior bayes risk: minimise the expectation of the loss given the data
Define MSE. How does it decompose?
- R(t, theta) = E((t(y) - theta)^2) = expectation of squared loss
- = Var(t(y)) + Bias^2(t, theta)
- = stochastic error (variability of t) + systematic error
Define Bias. What is the goal at estimation?
- Bias(t, theta) = E(t(y)) - theta
- aim: (asymptotically) unbiased estimate, i.e. Bias = 0 over the whole parameter space
What is the general idea of the Kullback-Leibler divergence? (4)
Define KL(t, theta).
What is the result, if KL is used as a loss function?
- compares distributions
- not symmetric
- based on \(\log f(y; \theta) - \log f(y; t) = \log\frac{f(y; \theta)}{f(y; t)}\)
- theta -> true distribution
- t -> estimated distribution
- = 0 if t = theta, else > 0
- KL(t, theta) = \(\int \log\frac{f(y; \theta)}{f(y; t)}\, f(y; \theta)\,dy\)
- KL-risk: \(E(KL) = R(t, \theta) \approx \int I(\theta)\,(t(y) - \theta)^2 \prod_i f(y_i; \theta)\,dy\)
- fisher-information independent of the data
- >> minimising the KL-risk approximately corresponds to minimising the (Fisher-weighted) MSE (see the normal example below)
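A concrete case (standard computation, added for illustration): for a single observation from \(N(\theta, \sigma^2)\) with estimate t,
\(KL(t, \theta) = \int \log\frac{f(y;\theta)}{f(y;t)}\, f(y;\theta)\,dy = \frac{(t - \theta)^2}{2\sigma^2} = \tfrac{1}{2}\, I(\theta)\,(t - \theta)^2\)
i.e. the KL divergence is proportional to the Fisher-weighted squared loss, which is why minimising the KL-risk approximately minimises the MSE.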
What is sufficiency?
What is a sufficient statistic? (3)
- quality criterion for a statistic t(y)
- all information about theta in t()
- idea: replace data y by t(y)
- if f(y | t(y) ; theta) is independent of theta
- problem: hard to show
What is the Neyman factorisation? What is the problem?
- sufficient if f(y; theta) = h(y) * g(t(y); theta)
- h() and g() non-negative
- >> h() independent of theta
- >> g() depends on y only through t()
- problem: weak statement, since t(y) = y itself is always sufficient (see the Bernoulli example below)
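A standard example (added for illustration): for \(y_i \sim Bernoulli(\theta)\) i.i.d.,
\(f(y; \theta) = \prod_i \theta^{y_i}(1-\theta)^{1-y_i} = \underbrace{1}_{h(y)} \cdot \underbrace{\theta^{\sum y_i}(1-\theta)^{n - \sum y_i}}_{g(t(y);\,\theta)}\)
so \(t(y) = \sum_i y_i\) is sufficient for \(\theta\).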
Define minimal sufficient statistics.
- t() is sufficient AND
- for every sufficient statistic t*() there exists a function h() s.t. t() = h(t*(.))
Define consistency. (2)
- quantify information in data
- theta_hat is consistent if MSE(theta_hat, theta) -> 0 for n -> inf
- >> theta_hat -> theta (asymptotically)
Define the Cramér-Rao bound. (2)
Name properties (2)
- lower limit of the MSE for given n
- holds only for Fisher-regular distributions
- for unbiased theta_hat: \(MSE \ge I(\theta)^{-1}\)
- \(Var(\hat\theta) \ge\) inverse Fisher information (= in the best case)
- for biased theta_hat: \(MSE \ge Bias^2 + \frac{\big(1 + \frac{\partial Bias}{\partial\theta}\big)^2}{I(\theta)}\) (the numerator equals 1 if the bias is 0)
- smaller variance is possible, but at the price of a larger bias (see the Poisson example below)
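A standard example where the bound is attained (added for illustration): for an i.i.d. Poisson(\(\lambda\)) sample, \(\bar y\) is unbiased with
\(Var(\bar y) = \frac{\lambda}{n} = I(\lambda)^{-1}\) (since \(I(\lambda) = n/\lambda\))
so the Cramér-Rao bound holds with equality.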
What is the idea of confidence intervals? (2)
- quantify uncertainty of theta_hat
- interval estimate instead of point estimate
- \(P(\theta \in CI) \ge 1-\alpha\) (ideally with equality)
What is a pivotal statistic? Give an example.
- the distribution of g(y; theta) is independent of theta
- e.g. \(x = (\hat\theta - \theta) / \sqrt{Var(\hat\theta)} \sim N(0,1)\)
- pivotal statistic by using the CLT
- \(P(z_{\alpha/2} \le x \le z_{1-\alpha/2}) = 1-\alpha\)
- \(CI = [\hat\theta \pm z_{1-\alpha/2} \sqrt{Var(\hat\theta)}]\)
- problem: Var(theta_hat) depends on theta (circular) >> plug in an estimate of Var(theta_hat)
What are credibility intervals? (2)
What is the highest density interval? (2)
- bayesian approach using the posterior
- \(P(\theta \in CI | y)\) = integral of the posterior from the left bound to the right bound \(\ge 1-\alpha\)
- integral of the posterior from -inf to the left bound = integral from the right bound to +inf = alpha/2
- >> cut probability mass of alpha/2 on the left and on the right
- HDI(y) = {theta; posterior >= c}
- c s.t. the integral of the posterior over all thetas in the HDI = 1-alpha
- >> cut from the top (posterior values at the left and right bound are equal)
What is the difference between confidence and credibility intervals?
- different approaches/reasoning (frequentist vs. bayesian), similar results
Define fisher information.
What is the meaning of it? (1 + 4)
- Expectation(- 2nd derivative of log-likelihood w.r.t. theta)
- Expectation of observed information
- measures the amount of information that y carries about theta (increases with n)
- variance of score
- reciprocal variance of estimate (lowest possible variance of unbiased estimator)
- >> central role in the asymptotic theory of ML-estimation
- >> can be used for tests (e.g. Wald-test); see the Bernoulli example below
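Worked example (standard result, added for illustration): for \(y_i \sim Bernoulli(\theta)\) i.i.d.,
\(l''(\theta) = -\frac{\sum y_i}{\theta^2} - \frac{n - \sum y_i}{(1-\theta)^2}\)
\(I(\theta) = E(-l''(\theta)) = \frac{n\theta}{\theta^2} + \frac{n(1-\theta)}{(1-\theta)^2} = \frac{n}{\theta(1-\theta)}\)
so the information grows linearly in n and is largest where the variance \(\theta(1-\theta)\) is smallest.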
What is fisher regularity? (4)
- support of y is independent of theta
- parameter-space of theta is open
- f(y; theta) is twice differentiable wrt theta
- integration and differentiation are exchangeable
What are requirements for ML-inference? (2)
- fisher regularity
- Y_i ~ f(y; theta) iid
Define the score. (3)
- first derivative of l(theta; y) w.r.t. theta = s(theta; y)
- is used for ML-estimation: s(theta_hat) = 0 (max of l(theta;y))
- s(theta; y) ~ N(0, I(theta))
- approx.
- proof (sketch)
- 1st Bartlett identity: differentiate \(1 = \int f(y; \theta)\,dy\) w.r.t. theta >> E(s) = 0
- 2nd Bartlett identity: differentiate E(s) = 0 w.r.t. theta >> \(E(s^2) = I(\theta) = Var(s)\) (see the example below)
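Worked example (standard case, added for illustration): for \(y_i \sim N(\mu, \sigma^2)\) i.i.d. with known \(\sigma^2\),
\(s(\mu; y) = \frac{\sum_i (y_i - \mu)}{\sigma^2}\)
so \(E(s) = 0\) and \(Var(s) = \frac{n}{\sigma^2} = I(\mu)\); here the normality of the score even holds exactly, not only asymptotically.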
Explain why the ML-estimate is random, what we want to know about it, and how to deal with this.
- l(theta; y) depends on the sample, so its maximiser varies around theta
- aim: quantify expectation and uncertainty of theta_hat
- >> asymptotic normality of theta_hat
- theta_hat ~ N(theta, inv(I(theta)) (approx.)
- proof (sketch):
- define s_n, I_n etc.
- Taylor expansion of s_n around theta, evaluated at theta_hat (where s_n(theta_hat) = 0)
- >> theta_hat - theta = - s_n / s'_n = inv(I_n) * s_n (approx., since s'_n is approx. -I_n)
- theta_hat - theta ~ N(0, inv(I_n)) as s_n ~ N(0, I_n)
Explain the numerical calculation of the ML estimate (Fisher scoring). (3)
Name the problem and solutions. (2)
- \(\theta_{t+1} = \theta_t - s/s' = \theta_t + s/I\)
- 0) initialise \(\theta_t = \theta_0\)
- 1) \(\theta_{t+1} = \theta_t + s(\theta_t)/I(\theta_t)\)
- 2) stop if \(\|\theta_{t+1} - \theta_t\| < \delta\)
- 3) \(\hat\theta_{ML} = \theta_{t+1}\)
- >> can end up in a local optimum
- try different start values \(\theta_0\)
- adapt the step size (see the numerical sketch below)
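A minimal numerical sketch of Fisher scoring (illustrative code, not from the original cards; assumes an i.i.d. Poisson(\(\lambda\)) sample, where \(s(\lambda) = \sum y_i/\lambda - n\) and \(I(\lambda) = n/\lambda\)):

```python
# Fisher scoring for the Poisson rate lambda (illustrative sketch).
import numpy as np

def fisher_scoring_poisson(y, lam0=1.0, tol=1e-8, max_iter=100):
    y = np.asarray(y, dtype=float)
    n, sum_y = len(y), y.sum()
    lam = lam0                         # 0) start value theta_0
    for _ in range(max_iter):
        score = sum_y / lam - n        # s(theta_t)
        info = n / lam                 # I(theta_t)
        lam_new = lam + score / info   # 1) theta_{t+1} = theta_t + s / I
        if abs(lam_new - lam) < tol:   # 2) stop if the step is small enough
            return lam_new             # 3) theta_ML
        lam = lam_new
    return lam

# For the Poisson model the ML estimate is the sample mean,
# so the iteration converges to y.mean() (here: 2.0).
print(fisher_scoring_poisson([2, 3, 1, 4, 0, 2]))
```

For multimodal likelihoods one would restart from several start values lam0, as noted above.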
What is the idea of testing?
Define type-1 and type-2 error. What is the problem with the errors?
- idea: answer questions based on data
- type-one error: "H1" | H0, reject H0 even though it is true
- type-two error: "H0" | H1, accept H0 even though it is false
- problem: the errors work against each other, e.g. always deciding "H0" produces no type-1 error (but a maximal type-2 error)
What is a significance test at level alpha? Give an example.
- P("H1" | H0) <= alpha (bound for type one error)
- P(y_bar > c | mu <= mu_zero) <= alpha
- >> pivotal statistic >> c = mu_zero + z_1-alpha * sigma / sqrt(n)
- with z_1-alpha = 1-alpha quantile of N(0, 1)
What general kind of tests exist? (3)
What is the general problem for these tests and how to solve it?
- one sided test: H0: mu <= mu_zero, H1: mu > mu_zero
- "H1" <> y_bar > c
- two-sided test: H0: mu = mu_zero, H1: mu != mu_zero
- "H1" <> |y_bar - mu_zero| > c
- testing theta: H0: theta in theta-zero-space, H1: theta not in theta-zero-space
- problem: calculating c requires sigma, which is usually unknown
- >> estimate \(\hat\sigma\) >> \((\bar y - \mu_0) / (\hat\sigma / \sqrt{n}) \sim t_{n-1}\) distributed
Explain Wald-, score- and LR-test and compare them.
- all test: H0: theta = theta_zero, H1: theta != theta_zero
- Wald-test
- "H1" <> |theta_hat - theta_zero| > c (large deviation of thetas speak against H0)
- c = z_1-alpha/2 * sqrt(inv(I(theta_zero))) >> theta_hat instead of theta_zero used as theta_hat ~ theta_zero under H0
- >> estimation of theta_hat and variance of estimate required
- Score-test
- "H1" <> |s(theta_zero; y)| > c (large score speaks against H0)
- c = z_1-alpha/2 * sqrt(I(theta_zero))
- >> score of theta_zero is enough, no theta_hat required
- likelihood-ratio test
- "H1" <> lr(theta_zero, theta_hat) > c (a large likelihood ratio speaks against H0)
- \(lr(\theta_0, \hat\theta) = 2\big(l(\hat\theta) - l(\theta_0)\big) \ge 0\), asymptotically \(\sim \chi^2_p\)
- Taylor expansion of l at theta_hat: \(l(\theta_0) \approx l(\hat\theta) - \tfrac{1}{2}\, s(\theta_0)^2 / I\)
- >> \(lr \approx s(\theta_0)^2 / I\)
- \(c = \chi^2_{p,\,1-\alpha}\) (see the comparison below)
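A standard comparison case (added for illustration): for the normal mean with known \(\sigma^2\),
\(l(\mu) = -\frac{n(\bar y - \mu)^2}{2\sigma^2} + const\), \(s(\mu) = \frac{n(\bar y - \mu)}{\sigma^2}\), \(I(\mu) = \frac{n}{\sigma^2}\)
Wald: \((\hat\mu - \mu_0)^2 I = n(\bar y - \mu_0)^2/\sigma^2\); score: \(s(\mu_0)^2 / I = n(\bar y - \mu_0)^2/\sigma^2\); LR: \(2(l(\hat\mu) - l(\mu_0)) = n(\bar y - \mu_0)^2/\sigma^2\)
so in this model all three tests use exactly the same \(\chi^2_1\) statistic; in general they only agree asymptotically.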
What is the idea of power of a test?
Define the power. Give an example
What is the aim?
- no statement about type 2 error yet
- power = P("H1" | H1) = 1 - P(type2 error)
- = 1 - PHI(z_1-alpha + (mu_zero - mu) / (sigma / sqrt(n)))
- increasing n increases the power, decreases type2 error
- aim: maximal power while maintaining the level alpha (see the sample-size example below)
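A typical use (standard sample-size calculation, added for illustration): to detect a shift \(\Delta = \mu - \mu_0 > 0\) with power \(1-\beta\) in the one-sided Gauss test, the power formula gives
\(\frac{\Delta\sqrt{n}}{\sigma} \ge z_{1-\alpha} + z_{1-\beta} \;\Rightarrow\; n \ge \Big(\frac{(z_{1-\alpha} + z_{1-\beta})\,\sigma}{\Delta}\Big)^2\)
e.g. \(\alpha = 0.05\), \(1-\beta = 0.8\), \(\Delta = 0.5\sigma\) gives \(n \ge (1.645 + 0.842)^2 / 0.25 \approx 24.7\), so n = 25.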
What is the Neyman-Pearson-Lemma: aim, rules, idea of proof.
- aim: construct the optimal test: a level-alpha test with higher power than any other level-alpha test PSI(y)
- "H1" <> l(theta_zero) - l(theta_1) <= c
- likelihood ratio \(a = f(y; \theta_0) / f(y; \theta_1) \le e^c =: k\)
- PHI(y) = 1 if a <= k, else 0
- proof:
- P(PSI(y) = 1; theta_one) <= P(PHI(y) = 1; theta_one) >> larger power
- three cases (a < k, a > k, a = k)
- in all cases: \(power(\Phi) - power(\Psi) \ge \tfrac{1}{k}\big(P(\Phi(y) = 1 | H_0) - P(\Psi(y) = 1 | H_0)\big) \ge 0\)