SRaI

@LMU


Name Bayes' rule.

\(P(A|B) = \frac{P(B|A) ~\cdot~ P(A)}{P(B)}\)
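
A minimal numerical sketch of the rule (Python), with made-up diagnostic-test probabilities:

    # Bayes' rule with hypothetical numbers: A = disease, B = positive test
    p_A = 0.01             # prior P(A)
    p_B_given_A = 0.95     # P(B|A), sensitivity
    p_B_given_notA = 0.05  # P(B|not A), false-positive rate

    # law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
    p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

    p_A_given_B = p_B_given_A * p_A / p_B   # Bayes' rule
    print(p_A_given_B)                      # ~0.161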

What is a random variable?

  • a random variable Y maps from the event space Ω (omega) to the real numbers

Define the expected value and variance

\(E(Y) = \int_{-\infty}^{\infty}u ~f(u)~du = \mu\)

\(Var(Y) = \int_{-\infty}^{\infty}(y - \mu)^2~f(y)~dy = E((Y - \mu)^2) = \sigma^2 = E(Y^2) - \mu^2\)

Define the exponential family

\(f(y;\theta) = exp(t^T(y)~\theta - K(\theta))~h(y)\)

  • with t(y) = statistics = function of data
  • theta = parameter (vector)
  • K(theta) = normalisation constant s.t. integral (f(y,theta)) = 1
  • h(y) >= 0, unimportant
  • and \(\frac{\partial K(\theta)}{\partial\theta} = E(t(Y))\)
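
A worked example (not on the card): the Poisson distribution written in this form,

\(f(y; \lambda) = \frac{\lambda^y e^{-\lambda}}{y!} = exp(y~log~\lambda - \lambda)~\frac{1}{y!}\)

so \(t(y) = y\), \(\theta = log~\lambda\), \(K(\theta) = e^\theta = \lambda\), \(h(y) = 1/y!\); and indeed \(\frac{\partial K(\theta)}{\partial\theta} = e^\theta = \lambda = E(Y)\).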

What is the t-distribution good for?

  • statistical test for the mean of normally distributed variables
  • when variance is unknown (estimated from data)

Define covariance for Y1, Y2.

What about independence? What does this imply?

\(Cov(Y_1, Y_2) = E((Y_1 - E(Y_1))(Y_2 - E(Y_2))) = E(Y_1~Y_2)-E(Y_1)E(Y_2)\)

Cov(Yj, Yk) = 0 if Yj, Yk are independent

  • f(yj, yk) = f(yj) * f(yk)
  • E(Yj * Yk) = E(Yj) * E(Yk)

Define correlation.

  • Corr(Yj, Yk) = Cov(Yj, Yk) / sqrt(Var(Yj) * Var(Yk))
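
A short simulation sketch (arbitrary numbers) checking the definition against numpy's built-in:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=10_000)
    y = 0.6 * x + rng.normal(size=10_000)            # y is correlated with x

    cov = ((x - x.mean()) * (y - y.mean())).mean()   # Cov(X, Y) by definition
    corr = cov / np.sqrt(x.var() * y.var())          # Corr(X, Y) by definition
    print(corr, np.corrcoef(x, y)[0, 1])             # both ~0.51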

Name iterated expectation.

\(E(Y) = E_X(E(Y|X))\\Var(Y) = E_X(Var(Y|X)) + Var_X(E(Y|X))\)

What is the idea of central limit theorem?

What is required? (3)

Give an example.

  • the sum of arbitrarily distributed i.i.d. random variables converges to a normal distribution
    • for n -> infinity (asymptotically)
  • conditions
    • i.i.d.
    • mean given
    • finite variance
  • example: random walk: Y_n = sum(x_i) with E(x_i) = 0
    • Z_n = Y_n / sqrt(n) ~ N(0, sigma^2), sigma^2 = Var(x_i)
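
A minimal simulation of the random-walk example, assuming uniform steps on [-1, 1] (mean 0, variance 1/3):

    import numpy as np

    rng = np.random.default_rng(1)
    n, reps = 1_000, 5_000
    steps = rng.uniform(-1, 1, size=(reps, n))   # clearly non-normal steps
    z = steps.sum(axis=1) / np.sqrt(n)           # Z_n = Y_n / sqrt(n)

    # CLT: Z_n is approximately N(0, sigma^2) with sigma^2 = Var(step) = 1/3
    print(z.mean(), z.var())                     # ~0 and ~0.333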

What is the moment generating function? What about its k-th derivative?

What is the cumulant generating function? What is special about the first and 2nd derivative? What are restrictions?

  • \(M_Y(t) = E(e^{t Y})\)
  • \(\frac{\partial^kM_Y(t)}{(\partial t)^k} = E(Y^k) \) evaluated at t = 0
  • \(K_Y(t) = log(M_Y(t))\)
    • first derivative w.r.t. t at t = 0: E(Y) = first cumulant (= first moment)
    • 2nd derivative w.r.t. t at t = 0: Var(Y) = 2nd cumulant (= 2nd central moment)
    • as long as moments are finite
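
A sketch with sympy, using the Poisson MGF \(M_Y(t) = exp(\lambda(e^t - 1))\) as the example:

    import sympy as sp

    t, lam = sp.symbols('t lambda', positive=True)
    M = sp.exp(lam * (sp.exp(t) - 1))    # Poisson MGF
    K = sp.log(M)                        # cumulant generating function

    print(sp.diff(K, t).subs(t, 0))      # 1st derivative at 0: E(Y) = lambda
    print(sp.diff(K, t, 2).subs(t, 0))   # 2nd derivative at 0: Var(Y) = lambda
    print(sp.simplify(sp.diff(M, t, 2).subs(t, 0)))  # E(Y^2) = lambda**2 + lambda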

What is the general idea of parametric statistical models? (2) Define Posterior and its components.

  • y (data) is realisation of Y with Y~F(y; theta) (model)
  • theta unknown
  • \(f_\theta (\vartheta|y) = \frac{(\prod f(y_i; ~\vartheta))~f_\theta(\vartheta)}{\int{\prod f(y_i;~\vartheta)~f_\theta(\vartheta)~d\vartheta}}\)
    • posterior = all information about theta
    • likelihood: information in data
    • prior: knowledge about theta before observing the data
    • denominator: normalisation constant (f(y), independent of theta) 
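
A minimal sketch of these components on a grid, for hypothetical binomial data with a Beta(2, 2) prior:

    import numpy as np
    from scipy import stats

    y, n = 7, 10                              # hypothetical: 7 successes in 10 trials
    theta = np.linspace(0.001, 0.999, 999)    # grid over the parameter space
    dtheta = theta[1] - theta[0]

    prior = stats.beta.pdf(theta, 2, 2)       # knowledge before seeing the data
    lik = theta**y * (1 - theta)**(n - y)     # information in the data
    unnorm = lik * prior
    post = unnorm / (unnorm.sum() * dtheta)   # denominator: normalisation constant
    print((post * dtheta).sum())              # ~1, a proper density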

Define likelihood and loglikelihood.

Name characteristics. (2)

  • \(L(\theta;y) = \prod f(y_i; \theta)\)
  • \(l(\theta;y) = \sum log~f(y_i; \theta)\)

Characteristics

  • plausibility of parameter-values theta, given data y
  • posterior proportional to Likelihood * prior

How can theta be estimated (theta_hat)? (3, 2 each) Relate them.

  • posterior mean estimate
    • theta_hat = E(posterior)
    • requires (numerical) integration (E(.))
  • posterior mode estimate
    • theta_hat = max(posterior)
    • first derivative required
  • ML-estimation
    • theta_hat = max(L(theta; y)) (set first derivative w.r.t. theta of loglikelihood to zero)
    • assumes flat/constant/non-informative prior
  • posterior mode estimate = ML-estimation if prior is flat
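
The three estimates side by side for a Beta-Binomial sketch (hypothetical numbers; the conjugate posterior is Beta(a + y, b + n - y)):

    # hypothetical: y = 7 successes in n = 10 trials, Beta(a, b) prior
    y, n, a, b = 7, 10, 2, 2
    A, B = a + y, b + n - y              # conjugate Beta posterior parameters

    post_mean = A / (A + B)              # posterior mean estimate
    post_mode = (A - 1) / (A + B - 2)    # posterior mode estimate
    mle = y / n                          # ML estimate

    print(post_mean, post_mode, mle)     # 0.643, 0.667, 0.7
    # with a flat Beta(1, 1) prior the posterior mode equals the ML estimate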

Explain the invariance property of the ML estimate.

  • gamma = g(theta), theta = g^-1(gamma)
  • gamma_ML = g(theta_ML) >> no reestimation required after transformation
  • g(.) bijective transformation function
  • also holds for the variance s.t. gamma_hat - gamma ~ N(0, derivative of gamma wrt theta * inv(I(theta)) * derivative of gamma wrt theta)

Define loss, squared loss. (3)

  • L: (set of estimates t) x (parameter space) -> R+
  • squared loss: L(t, theta) = (t - theta)^2 >> min
  • L(theta, theta) = 0

What is the problem with loss? (2)

Explain solutions. (2), (3)

  • t(y) = theta_hat depends on the sample
    • risk = expected loss (R(t, theta) = E(L(.)))
    • estimate theta_hat s.t. R(.) is minimised
  • theta unknown (risk and loss depend on theta)
    • minimax approach: choose theta_hat = t(y) that minimises the maximal risk over theta
    • bayes risk: minimise Expectation of risk w.r.t. theta
    • posterior bayes risk: minimise Expectation of Loss given the data

Define MSE. How does it decompose?

  • R(t, theta) = E((t(y) - theta)^2) = expectation of squared loss
    • = Var(t(y)) + Bias^2(t, theta)
    • = stochastic error (variability of t) + systematic error
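
A simulation sketch of the decomposition, using a deliberately biased estimator (a shrunken sample mean) of a normal mean:

    import numpy as np

    rng = np.random.default_rng(2)
    theta, n, reps = 2.0, 20, 100_000
    y = rng.normal(theta, 1.0, size=(reps, n))

    t = 0.9 * y.mean(axis=1)          # biased: shrinks towards 0

    mse = ((t - theta) ** 2).mean()
    var, bias = t.var(), t.mean() - theta
    print(mse, var + bias**2)         # the two agree: MSE = Var + Bias^2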

Define Bias. What is the goal at estimation?

  • Bias(t, theta) = E(t(y)) - theta
  • aim: asymptotically unbiased estimate s.t. Bias = 0 for whole parameter space

What is the general idea of Kullback-Leibler divergence? (4)

Define KL(t, theta).

What is the result, if KL is used as a loss function?

  • compares distributions
  • not symmetric
  • log f(y; theta) - log f(y; t) = log(f(y; theta) / f(y; t))
    • theta -> true distribution
    • t -> estimated distribution
  • 0 if t = theta else >= 0
  • KL(t, theta) = \(\int log\frac{f(y; \theta)}{f(y; t)} f(y; \theta)dy\)
  • E(KL) = R(t, theta) ≈ E(0.5 * I(theta) * (t - theta)^2) (expectation w.r.t. the data, i.e. against prod f(y; theta))
    • fisher-information independent of data
    • >> min(KL-risk) approximates min(MSE)
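
A numerical sketch of KL(t, theta) for two normal densities (true model N(0, 1), estimated model N(0.5, 1)); for equal unit variances the closed form is (0.5)^2 / 2 = 0.125:

    import numpy as np
    from scipy import stats

    f_theta = stats.norm(0.0, 1.0)        # true distribution (theta)
    f_t = stats.norm(0.5, 1.0)            # estimated distribution (t)

    y = np.linspace(-10, 10, 20_001)
    integrand = (f_theta.logpdf(y) - f_t.logpdf(y)) * f_theta.pdf(y)
    kl = integrand.sum() * (y[1] - y[0])  # numerical integral
    print(kl)                             # ~0.125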

What is sufficiency?

What is a sufficient statistic? (3)

  • quality of an estimate t(y)

  • all information about theta is in t(y)
    • idea: replace data y by t(y)
  • t(y) is sufficient if f(y | t(y); theta) is independent of theta
  • problem: hard to show directly

What is the Neyman factorisation? What is the problem?

  • sufficient if f(y; theta) = h(y) * g(t(y); theta)
    • h() and g() non-negative
    • >> h() independent of theta
    • >> g() depends on y only through t()
  • weak statement as y is already sufficient

Define minimal sufficient statistics.

  • t() is sufficient AND
  • for every sufficient statistic t*() there exists h() s.t. t() = h(t*())

Define consistency. (2)

  • quantify information in data
  • theta_hat is consistent if MSE(theta_hat, theta) -> 0 for n -> inf
    • >> theta_hat -> theta (asymptotically)

Define the Cramér-Rao bound. (2)

Name properties (2)

  • lower limit of MSE for given n
  • holds only for fisher regular distributions
  • for unbiased theta_hat: MSE() >= inv(I(theta))
    • Var(theta_hat) >= inverse of fisher-information (= in best case)
  • for biased theta_hat: MSE() >= Bias()^2 + (1 + derivative of Bias wrt theta)^2 / I(theta) (numerator = 1 if the bias is 0)
    • smaller variance possible, but larger bias then

What is the idea of confidence intervals? (2)

  • quantify uncertainty of theta_hat
  • interval estimate instead of point estimate
    • P(theta in CI) >= 1-alpha (-> =)

What is a pivotal statistic? Give an example.

  • distribution of g(y; theta) is independent of theta
  • e.g. x = (theta_hat - theta) / sqrt(Var(theta_hat)) ~ N(0,1)
    • pivotal statistics by using CLT
    • P(z_alpha/2 <= x <= z_1-alpha/2) = 1-alpha
    • CI = [theta_hat +- z_1-alpha/2 * sqrt(Var(theta_hat))]
      • problem: Var(theta_hat) depends on theta >> circular >> plug in an estimate of Var(theta_hat)
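
A minimal sketch of this CI for a normal mean, with simulated data and the variance estimated from the sample:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    y = rng.normal(5.0, 2.0, size=100)      # hypothetical sample, true mean 5.0

    alpha = 0.05
    theta_hat = y.mean()
    se = y.std(ddof=1) / np.sqrt(len(y))    # estimated sqrt(Var(theta_hat))
    z = stats.norm.ppf(1 - alpha / 2)       # z_{1-alpha/2}

    print(theta_hat - z * se, theta_hat + z * se)  # covers 5.0 in ~95% of samples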

What are credibility intervals? (2)

What is the highest density interval? (2)

  • bayesian approach using the posterior
  • P(theta in CI | y) = integral from left bound to right bound over the posterior >= 1-alpha
    • integral of posterior from -inf to left bound = integral of posterior from right bound to inf = alpha/2
    • >> cut left and right probability mass of alpha/2 

 

  • HDI(y) = {theta : posterior(theta | y) >= c}
  • c s.t. the integral of the posterior over all thetas in the HDI = 1-alpha
    • >> cut from the top (the posterior is equal at the left and right bounds)
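
A grid sketch of the HDI for a Beta(9, 5) posterior (the Beta-Binomial example above): keep the highest-density thetas until they hold 1 - alpha of the mass:

    import numpy as np
    from scipy import stats

    post = stats.beta(9, 5)
    theta = np.linspace(0, 1, 10_001)
    dens = post.pdf(theta)
    dx = theta[1] - theta[0]

    order = np.argsort(dens)[::-1]        # thetas by density, highest first
    mass = np.cumsum(dens[order]) * dx    # accumulated posterior mass
    inside = order[: np.searchsorted(mass, 0.95) + 1]
    print(theta[inside].min(), theta[inside].max())   # left and right HDI bounds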

What is the difference between confidence and credibility intervals?

  • different approaches/reasoning, similar results

Define fisher information.

What is the meaning of it? (1 + 4)

  • Expectation(- 2nd derivative of log-likelihood w.r.t. theta)
    • Expectation of observed information
  • measures the amount of information that y carries about theta (increases with n)
    • variance of score
    • reciprocal variance of estimate (lowest possible variance of unbiased estimator)
    • >> central role in the asymptotic theory of ML-estimation
    • >> can be used for tests (i.e. Wald-test)
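
A worked example (not on the card): the Fisher information of a single Bernoulli(p) observation,

\(l(p; y) = y~log~p + (1-y)~log(1-p)\)

\(\frac{\partial^2 l}{\partial p^2} = -\frac{y}{p^2} - \frac{1-y}{(1-p)^2}\)

\(I(p) = E(-\frac{\partial^2 l}{\partial p^2}) = \frac{1}{p} + \frac{1}{1-p} = \frac{1}{p(1-p)}\)

For n i.i.d. observations the information is n times this, matching "increases with n" above.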

What is fisher regularity? (4)

  • support of y is independent of theta
  • parameter-space of theta is open
  • f(y; theta) is twice differentiable wrt theta
  • integration and differentiation are exchangeable

What are requirements for ML-inference? (2)

  • fisher regularity
  • Y_i ~ f(y; theta) iid

Define the score. (3)

  • first derivative of l(theta; y) w.r.t. theta = s(theta; y)
  • is used for ML-estimation: s(theta_hat) = 0 (max of l(theta;y))
  • s(theta; y) ~ N(0, I(theta))
    • approx.
    • proof
      • 1st Bartlett identity: 1 = integral(f(y; theta)) -> diff() -> E(s) = 0
      • 2nd Bartlett identity: 0 = E(s) -> diff() -> E(s^2) = I(theta) = Var(s)

Explain why the ML estimate is random. What do we aim to know, and how do we deal with it?

  • l(theta; y) depends on sample (maxima around theta)
    • aim: quantify expectation and uncertainty of theta_hat
  • >> asymptotic normality of theta_hat
    • theta_hat ~ N(theta, inv(I(theta))) (approx.)
    • proof:
      • define s_n, I_n etc.
      • Taylor expansion of s_n around theta, evaluated at theta_hat (where s_n(theta_hat) = 0)
      • >> theta_hat - theta = - s_n / s'_n = inv(I_n) * s_n
      • theta_hat - theta ~ N(0, inv(I_n)) as s_n ~ N(0, I_n)

Explain the numerical calculation of ML estimate. (Fisher Scoring) (3)

Name Problem and solutions (2)

  • theta_t+1 = theta_t - s / s' = theta_t + s / I
  • 0) theta_t = theta_zero
  • 1) theta_t+1 = theta_t + s(theta_t) / I(theta_t)
  • 2) stop if ||theta_t+1 - theta_t|| < d
  • 3) theta_ML = theta_t+1
  • >> can end up in local optimum
    • different theta_zero
    • adapt step-size
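
A sketch of the algorithm for a Poisson rate, where s(lambda) = sum(y)/lambda - n and I(lambda) = n/lambda (so the ML estimate is the sample mean):

    import numpy as np

    rng = np.random.default_rng(4)
    y = rng.poisson(3.0, size=200)           # hypothetical Poisson sample
    n = len(y)

    score = lambda lam: y.sum() / lam - n    # s(lambda)
    info = lambda lam: n / lam               # I(lambda)

    lam = 1.0                                # 0) starting value theta_zero
    for _ in range(50):
        prev = lam
        lam = lam + score(lam) / info(lam)   # 1) theta_{t+1} = theta_t + s/I
        if abs(lam - prev) < 1e-10:          # 2) stop if update below tolerance
            break
    print(lam, y.mean())                     # 3) theta_ML = sample mean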

What is the idea of testing? 

Define type I and type II errors. What is the problem with the errors?

  • idea: answer questions based on data
  • type I error: "H1" | H0, rejecting H0 even though it is true
  • type II error: "H0" | H1, accepting H0 even though it is false
  • problem: the errors trade off against each other: e.g. always deciding "H0" gives no type I error but a maximal type II error

What is a significance test at level alpha? Give an example.

  • P("H1" | H0) <= alpha (bound for type one error)
  • P(y_bar > c | mu <= mu_zero) <= alpha
    • >> pivotal statistic >> c = mu_zero + z_1-alpha * sigma / sqrt(n)
      • with z_1-alpha = 1-alpha quantile of N(0, 1)

What general kind of tests exist? (3)

What is the general problem for these tests and how to solve it?

  • one sided test: H0: mu <= mu_zero, H1: mu > mu_zero
    • "H1" <> y_bar > c
  • two-sided test: H0: mu = mu_zero, H1: mu != mu_zero
    • "H1" <> |y_bar - mu_zero| > c
  • testing theta: H0: theta in theta-zero-space, H1: theta not in theta-zero-space
  • problem: for calculation of c, sigma is required (mostly unknown)
    • >> estimate sigma_hat >> (y_bar - mu_zero) / (sigma_hat / sqrt(n)) ~ t_n-1 distributed

Explain Wald-, score- and LR-test and compare them.

  • all test: H0: theta = theta_zero, H1: theta != theta_zero
  • Wald-test
    • "H1" <> |theta_hat - theta_zero| > c (large deviation of thetas speak against H0)
    • c = z_1-alpha/2 * sqrt(inv(I(theta_zero))) >> in practice theta_hat is used instead of theta_zero, as theta_hat ≈ theta_zero under H0
    • >> estimation of theta_hat and variance of estimate required
  • Score-test
    • "H1" <> |s(theta_zero; y)| > c (large score speaks against H0)
    • c = z_1-alpha/2 * sqrt(I(theta_zero))
    • >> score of theta_zero is enough, no theta_hat required
  • likelihood-ratio test
    • "H1" <> lr(theta, theta_hat) > c (large lr speaks against H0)
    • lr(theta, theta_hat) = 2(l(theta_hat) - l(theta)) ~ chi^2_p
      • l(theta_hat) - l(theta) >= 0
      • Taylor expansion around theta_hat: l(theta) ≈ l(theta_hat) - 0.5 s^2 / I
      • >> lr(theta, theta_hat) ≈ s^2 / I
    • c = chi^2_p,1-alpha
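
A side-by-side sketch for a Poisson sample with H0: lambda = 3; the squared statistics are compared to the chi^2_1 critical value, which is equivalent to the |.| > c forms above:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(5)
    y = rng.poisson(3.3, size=100)             # hypothetical data
    n, lam0 = len(y), 3.0                      # H0: lambda = lam0

    lam_hat = y.mean()                         # ML estimate
    info = lambda lam: n / lam                 # Fisher information I_n(lambda)
    loglik = lambda lam: np.sum(y * np.log(lam) - lam)  # up to a constant

    wald = (lam_hat - lam0) ** 2 * info(lam_hat)
    score = (y.sum() / lam0 - n) ** 2 / info(lam0)      # s(lam0)^2 / I(lam0)
    lr = 2 * (loglik(lam_hat) - loglik(lam0))

    print(wald, score, lr)                     # similar values for large n
    print(stats.chi2.ppf(0.95, df=1))          # common critical value c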

What is the idea of power of a test?

Define the power. Give an example

What is the aim?

  • no statement about type 2 error yet
  • power = P("H1" | H1) = 1 - P(type2 error)
    • = 1 - PHI(z_1-alpha + (mu_zero - mu) / (sigma / sqrt(n)))
    • increasing n increases the power, decreases type2 error
  • aim: maximal power while maintaining alpha
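
The power formula above, evaluated numerically for a one-sided test of a normal mean (hypothetical mu_zero = 0, sigma = 1):

    import numpy as np
    from scipy import stats

    alpha, mu0, sigma = 0.05, 0.0, 1.0
    z = stats.norm.ppf(1 - alpha)      # z_{1-alpha}

    def power(mu, n):
        # 1 - Phi(z_{1-alpha} + (mu0 - mu) / (sigma / sqrt(n)))
        return 1 - stats.norm.cdf(z + (mu0 - mu) / (sigma / np.sqrt(n)))

    print(power(0.0, 25))     # = alpha at the boundary mu = mu0
    print(power(0.5, 25))     # ~0.80
    print(power(0.5, 100))    # ~0.9996: larger n -> higher power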

What is the Neyman-Pearson-Lemma: aim, rules, idea of proof.

  • aim: construct an optimal test: a level-alpha significance test with higher power than any other level-alpha test PSI(y)
  • "H1" <> l(theta_zero) - l(theta_1) <= c
    • a = f(y; theta_zero) / f(y; theta_1) <= exp(c) =: k
    • PHI(y) = 1 if a <= k, else 0
  • proof:
    • show: P(PSI(y) = 1; theta_1) <= P(PHI(y) = 1; theta_1), i.e. PHI has at least the power of PSI
    • three cases (a < k, a > k, a = k)
    • always: power(PHI) - power(PSI) >= k * (P(PHI(y) = 1 | H0) - P(PSI(y) = 1 | H0)) >= 0