SRaI

@LMU


Set of flashcards Details

Flashcards 125
Language English
Category Computer Science
Level University
Created / Updated 04.10.2019 / 11.10.2019
Weblink
https://card2brain.ch/box/20191004_srai

What is the core idea of a goodness-of-fit test? (1)

Name two examples. (1:3, 1)

  • idea: test whole distribution model, not only parameter
    • H0: Y ~ f(y; theta)
  • Chi-squared goodness of fit
    • contingency-table: observed (n_k) vs. expected (l_k) # of elements in cells
      • expected by assumed distribution
    • n_k far from l_k speaks against H0
    • X^2 = ..., "H1" <> if X^2 >= Chi-squared_K-1-p,1-a
  • Kolmogorov-Smirnov test
    • compares empirical and hypothetical distribution
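
The chi-squared comparison of observed and expected cell counts can be sketched as follows. This is a minimal illustration, assuming the usual Pearson form of the statistic (sum of (n_k - l_k)^2 / l_k); the die-roll counts are hypothetical and the critical value is the tabulated 0.95 quantile for 5 degrees of freedom.

```python
def chi_squared_statistic(observed, expected):
    """Pearson statistic: sum over cells of (n_k - l_k)^2 / l_k."""
    return sum((n - l) ** 2 / l for n, l in zip(observed, expected))


# Hypothetical data: 60 rolls of a die, H0: all six faces equally likely.
observed = [8, 9, 12, 11, 6, 14]
expected = [60 / 6] * 6

x2 = chi_squared_statistic(observed, expected)

# K - 1 - p = 6 - 1 - 0 = 5 degrees of freedom (no parameter estimated);
# 11.07 is the tabulated 0.95 quantile of that chi-squared distribution.
critical_value = 11.07
reject_h0 = x2 >= critical_value
```

Here the statistic stays well below the critical value, so the data do not speak against H0.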

What is the definition of the p-value? (2) Name different values. (4)

Relate p-value and sample size.

  • probability, under H0, to observe data (y_bar) that contradicts H0 at least as much as the observed data (y_bar_obs)
    • P(y_bar >= y_bar_obs | H0)
  • evidence of contradiction of H0
  • values
    • p <= 0.1: weak evidence against H0
    • p <= 0.05: increased evidence against H0
    • p <= 0.01: strong evidence against H0
    • >> "H1" <> p <= alpha 
  • as sample size increases, the uncertainty decreases, hence the p-value decreases (whenever H0 does not hold exactly)
    • in big data everything becomes significant
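
The relation between p-value and sample size can be made concrete with a small sketch. It assumes a one-sided z-test with known sigma and a fixed, hypothetical observed mean of 0.1 under H0: mu = 0; the function name is illustrative only.

```python
import math


def one_sided_p_value(y_bar, mu0, sigma, n):
    """P(Y_bar >= y_bar | H0) for a z-test with known sigma."""
    z = (y_bar - mu0) / (sigma / math.sqrt(n))
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of N(0, 1)


# Same small deviation from H0, growing sample size.
p_values_by_n = {
    n: one_sided_p_value(y_bar=0.1, mu0=0.0, sigma=1.0, n=n)
    for n in (10, 100, 1000, 10000)
}
```

The deviation stays at 0.1 throughout, yet the p-value drops from clearly non-significant to practically zero as n grows.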

Relate CI and testing. (1)

  • "H1" <> theta not in CI
    • P(theta in CI) >= 1-alpha

What is the main problem of multiple testing?

What is FWER?

What is Bonferroni adjustment? What is the problem here?

  • running m independent tests: P(reject at least one H0_j | all H0_j true) lies between alpha and m*alpha (= probability of at least one type 1 error)
    • probability to reject at least one hypothesis tends to 1 for large m (e.g. alpha = 0.05, m = 20: 1 - 0.95^20 = 0.64)
  • Family-wise error rate = probability to reject at least one true null hypothesis (to be controlled)
  • Bonferroni adjustment: alpha_adj = alpha / m 
    • problem: alpha_adj gets really small for large m, so H0 is rarely rejected (low power)
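
A short sketch of the numbers behind this card, assuming m independent tests that all have a true null hypothesis; the function names are illustrative.

```python
def fwer_independent(alpha, m):
    """P(at least one rejection) for m independent tests with true H0."""
    return 1 - (1 - alpha) ** m


def bonferroni_level(alpha, m):
    """Per-test level after Bonferroni adjustment."""
    return alpha / m


fwer_unadjusted = fwer_independent(0.05, 20)
fwer_adjusted = fwer_independent(bonferroni_level(0.05, 20), 20)
```

Without adjustment the family-wise error rate is about 0.64 for m = 20; with alpha / m per test it stays below 0.05, at the cost of a per-test level of only 0.0025.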

What is the idea of Holm's procedure? (2) Name steps. What does this imply?

  • order p-values ascending
  • limit FWER to alpha in each step
  • if p_1 > alpha/m: accept all, stop
    • else reject H0_1, go on
  • if p_2 > alpha/(m-1): accept remaining, stop
    • else reject H0_2, go on
  • ...
  • if p_m > alpha: accept H0_m, stop
    • else reject H0_m
  • implication: alpha_adj increases each iteration (but p-value also) >> more rejections
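
The stepwise procedure above can be sketched as follows. The p-values are hypothetical and chosen so that Holm rejects more hypotheses than the plain Bonferroni adjustment.

```python
def holm(p_values, alpha=0.05):
    """Return a list of booleans: True where H0_j is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending p-values
    reject = [False] * m
    for step, idx in enumerate(order):
        # thresholds: alpha/m, alpha/(m-1), ..., alpha
        if p_values[idx] > alpha / (m - step):
            break  # accept this and all remaining hypotheses
        reject[idx] = True
    return reject


p_values = [0.01, 0.04, 0.02, 0.005]
holm_reject = holm(p_values)
bonferroni_reject = [p <= 0.05 / len(p_values) for p in p_values]
```

With these values Holm rejects all four hypotheses, while Bonferroni (threshold 0.0125 throughout) rejects only two.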

What is the general idea of Bayesian reasoning? Define the posterior, characterise the denominator. (2)

Relate posterior to likelihood.

  • idea
    • theta is a random variable with prior probability \(f_\theta(\vartheta)\)
    • express uncertainty about theta as probability (posterior)
  • \(f_\theta (\vartheta|y) = \frac{f(y;~\vartheta)f_\theta(\vartheta)}{\int{f(y;~\vartheta)f_\theta(\vartheta)d\vartheta}}\)
    • normalisation constant is hard to derive analytically
    • normalisation constant = marginal density, uncertainty about theta integrated out
  • posterior proportional to likelihood * prior

What is a conjugate prior? (3)

What is a problem with it?

Give an example. (3) Why is this outcome interesting? (2)

  • data from a family of distribution: F_y = { f(y; theta), theta from its space }
  • prior from a family of distribution F_theta = { f(theta; gamma), gamma from its space }
  • F_theta is conjugate to F_y (likelihood) if posterior from family of prior.
  • problem: hard to find besides classical cases
  • example
    • mu ~ N(gamma, tau^2)
    • mu|y ~ N( , ) >> posterior again normal
    • mu | y --a--> N(y_bar (= mu_hat), sigma^2/n)
      • Bayes: asymptotic normality of theta around theta_hat
      • ML: asymptotic normality of theta_hat around theta
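
The normal example can be sketched numerically. The update below is the standard normal-normal result under the assumption that y_i ~ N(mu, sigma^2) with known sigma^2; the card itself leaves the posterior parameters open, so the formula and the function name are additions for illustration.

```python
def normal_posterior(y, sigma2, gamma, tau2):
    """Posterior mean and variance of mu for y_i ~ N(mu, sigma2), mu ~ N(gamma, tau2)."""
    n = len(y)
    y_bar = sum(y) / n
    precision = 1 / tau2 + n / sigma2          # prior precision + data precision
    mean = (gamma / tau2 + n * y_bar / sigma2) / precision
    return mean, 1 / precision


# Small sample: the prior pulls the posterior mean away from y_bar = 2.
small_mean, small_var = normal_posterior([1.0, 2.0, 3.0], sigma2=1.0, gamma=0.0, tau2=1.0)

# Large sample: posterior is close to N(y_bar, sigma2 / n).
large_mean, large_var = normal_posterior([2.0] * 1000, sigma2=1.0, gamma=0.0, tau2=1.0)
```

For n = 1000 the posterior mean is almost y_bar and the variance almost sigma^2/n, which matches the asymptotic statement on the card.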

Selecting the prior: Explain flat (2) and Jeffreys prior. (5)

  • flat prior
    • prior = constant
    • problem: transformation of flat prior results in non-constant prior (knowledge implicit)
      • f_gamma(gamma) not flat with gamma = g(theta) (transformation rule)
      • e.g. f_pi(pi) = 1, gamma = log(pi/(1-pi)) >> f_gamma(gamma) = exp(gamma)/(1+exp(gamma))^2
  • Jeffreys prior
    • prior proportional to sqrt(I_theta(theta))
    • transformation invariant: f_gamma(gamma) proportional to sqrt(I_gamma(gamma))
    • not flat
    • distance between prior and posterior is maximized (maximal information from the data)
    • requires Fisher regularity

What are general problems with priors? (2 + 1)

  • no prior expressing ignorance exists
  • prior influences results of analysis
  • >> how to express no prior knowledge?

What is the core idea of empirical bayes? (2 + 1)

  • prior depends on hyperparameter gamma \(f_\theta(\vartheta; \gamma)\)
  • estimate gamma by maximising the likelihood L(y; gamma) -> \(f_\theta(\vartheta; \hat\gamma_{ML})\)
  • >> prior depends on gamma_hat, hence on data
    • contradicts Bayesian idea, but useful in practice

What is the core idea of hierarchical bayes? (1) Discuss it (1 + 1)

  • shift uncertainty of prior to higher level
    • y|theta ~ f_y(y; theta) >> theta | gamma ~ f_theta(theta; gamma) >> gamma |. ~ f_gamma(gamma; . ) 
  • + flexible distribution of parameter of interest
  • - costly computation of prior

Why are numerical methods for the posterior required?

What are solutions? (4) Briefly discuss them.

  • the normalisation constant f(y) is rarely available analytically >> numerical integration (assuming prior is given)
  • solutions
    • approximation of f(y) (2)
      • numerical approximation (low-dimensional only)
      • Laplace approximation (works also high-dimensional, easy to estimate)
    • sampling from the denominator (Monte Carlo approx.) (3 + 1)
      • sampling from prior (problem: F_theta(theta) not given, integration again)
      • rejection sampling (use f*_theta(.) with given F*_theta(.))
      • importance sampling
      • >> problematic for high-dimensional theta
    • sampling from the posterior (Monte Carlo approx.)
      • = Markov Chain Monte Carlo
        • sample from posterior to estimate expectation, variance etc.
      • Metropolis-Hastings (low acceptance for high-dimensional theta)
      • Gibbs sampling (one component of theta sampled in each step, works high-dimensional)
    • approximation of f(y) or posterior
      • variational bayes (high-dimensional, component-wise)

What is the core idea of Monte Carlo approximation?

  • numerical problem solving based on simulation and probability theory

Briefly describe Laplace approximation. (4)

Discuss (2)

  • assumes n iid samples
  • CLT for bayes
  • f(y) = integral exp(log-likelihood + log(prior)) dtheta
    • log-likelihood grows with n
    • log-likelihood + log(prior) = l_p,n(theta)
      • >> s_p,n(theta), I_p,n(theta)
  • Taylor series expansion of the exponent l_p,n(theta) around theta_hat_p
    • \(f(y) \approx f(y; \theta_p) f_\theta(\theta_p)\sqrt{2\pi}J_{p,n}(\theta_p)^{-1/2}\)
  • + works also high-dimensional (small adaptation of formula)
  • + easy calculation of theta_p (ML)
  • + good approximation for n -> inf

What is the core idea when sampling from the denominator of posterior?

What is required for it? What is the problem there?

  • denominator = E(f(y; theta)) (expectation w.r.t. the prior) >> sample theta from the prior and average f(y; theta) to get estimate of E(.) 
    • as E_hat() converges to E(.) for n -> inf
  • therefore sampling from prior is required
    • theta*_j  = F^-1(u*_j)
    • >> for sampling from prior, F_theta(theta) (cdf) is required >> requires numerical integration again

Explain rejection sampling. Idea (1). How to produce a sample? (3)

What is special about it? What is the problem?

  • idea: sample from proposal density f*_theta(.) (umbrella, see formula) with analytically given F*_theta(.)
    • f(theta) <= a * f*(theta)
  • produce sample from f(theta)
    • 1) draw u*
    • 2) draw theta* from f*
    • 3) if u* <= f(theta*) / (a * f*(theta*)) accept theta* as sample
      • else reject theta*
    • proof: P(theta* <= theta | theta* accepted) = P(theta* <= theta, theta* accepted) / P(theta* accepted) = (F_theta(theta)/a) / (1/a) = F_theta(theta)
  • produces sample from prior without drawing from it
  • problem: for large a, acceptance probability (1/a) is low >> few samples
    • >> f*(theta) must be close to f(theta) (low a)
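
The three steps can be sketched as follows. Assumptions for illustration: the target density is f(theta) = 6 theta (1 - theta) on [0, 1], the proposal f* is Uniform(0, 1), and a = 1.5 is the smallest constant with f <= a * f*.

```python
import random


def target_pdf(theta):
    """Hypothetical target density on [0, 1] (maximum 1.5 at theta = 0.5)."""
    return 6 * theta * (1 - theta)


def rejection_sample(size, a=1.5, seed=0):
    rng = random.Random(seed)
    samples, proposals = [], 0
    while len(samples) < size:
        u = rng.random()        # 1) draw u*
        theta = rng.random()    # 2) draw theta* from f* = Uniform(0, 1), f*(theta) = 1
        proposals += 1
        if u <= target_pdf(theta) / (a * 1.0):  # 3) accept or reject
            samples.append(theta)
    return samples, size / proposals


samples, acceptance_rate = rejection_sample(size=5000)
sample_mean = sum(samples) / len(samples)
```

The observed acceptance rate is close to 1/a = 2/3; a looser umbrella (larger a) would waste correspondingly more proposals.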

What is the general idea of MCMC? (2)

  • theta*_j and theta*_j+1 are correlated (>> chain, not iid)
  • distribution of chain converges to stationary distribution = posterior

What is the idea of Metropolis Hastings? (2)

Name the steps. (3)

What is a general problem here?

  • f(y) is unknown, but posterior is proportional to likelihood * prior
  • ratio of posteriors for theta, theta~ is known as f(y) cancels out
  • steps
    • 1) draw theta* from proposal distribution q(.|theta*_t)
    • 2) accept theta* as new sample with probability alpha
      • \(\alpha(\theta^*_t|\theta^*) = \min\{1, \frac{f(\theta^*|y)}{f(\theta^*_t|y)} \frac{q(\theta^*_t|\theta^*)}{q(\theta^*|\theta^*_t)}\}\)
      • alpha = 1 if posterior of proposed theta is larger than previous
      • alpha < 1 if posterior of proposed theta is smaller than previous (acceptance = fraction)
      • if q is symmetric: Metropolis algorithm
    • 3) draw u*, accept theta* if alpha() >= u*
      • else don't accept it
  • problem: for high-dimensional theta: alpha -> 0 (e.g. 0.8 acceptance each dimension >> 0.8^p)
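
A minimal sketch of the steps with a symmetric Gaussian random-walk proposal (so the q-ratio is 1, i.e. the Metropolis special case). The target is a toy stand-in for an unnormalised posterior, here proportional to exp(-theta^2/2); start value, step size and burn-in length are arbitrary choices.

```python
import math
import random


def log_unnorm_posterior(theta):
    """Toy target: log of likelihood * prior up to a constant (standard normal)."""
    return -0.5 * theta * theta


def metropolis(start, step, n_iter, seed=0):
    rng = random.Random(seed)
    theta, chain, accepted = start, [], 0
    for _ in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)              # 1) draw from q(.|theta_t)
        log_ratio = log_unnorm_posterior(proposal) - log_unnorm_posterior(theta)
        alpha = 1.0 if log_ratio >= 0 else math.exp(log_ratio)  # 2) f(y) cancels in the ratio
        if rng.random() <= alpha:                            # 3) accept if alpha >= u*
            theta = proposal
            accepted += 1
        chain.append(theta)                                  # on rejection the old value repeats
    return chain, accepted / n_iter


chain, acceptance_rate = metropolis(start=3.0, step=2.4, n_iter=20000)
kept = chain[1000:]                                          # drop burn-in
post_mean = sum(kept) / len(kept)
post_var = sum((t - post_mean) ** 2 for t in kept) / len(kept)
```

Only the ratio of unnormalised posteriors is ever evaluated, so the unknown f(y) is never needed. Changing `step` trades acceptance rate against movement through the parameter space.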

What is the influence of q at Metropolis-hastings?

  • narrow q:
    • similar theta*
    • high acceptance as posteriors will be mostly the same >> E(alpha) close to 1
    • >> not much movement in parameter-space
  • wide q:
    • different theta*
    • acceptance will vary a lot (0 to 1) as posteriors will differ >> E(alpha) << 1
    • >> jumping, exploring
  • >> balance between acceptance and exploring

How to use MCMC in practice? (3)

  • use different starting values
  • define and delete burn-in phase
  • thinning out: only keep samples far enough apart (lag where autocorrelation -> 0) to get approximately independent samples from posterior

Briefly explain the idea of Gibbs sampling.

  • sample only one single component of theta-vector in each step (avoid problem of acceptance)
  • steps
    • draw theta*_k from f(theta_k | y, theta*_-k,t) (full conditional)
      • set theta*_k,t+1 = theta*_k (draw is always accepted, no rejection step)
      • use new theta*_k,t+1 during drawing of other components
  • proof
    • \(f_{\theta_k}(\theta_k|y, \theta_{-k}) \propto f_{\theta_k}(\theta_k, \theta_{-k}|y) = f_\theta(\theta|y)\)
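
A small sketch of component-wise sampling. As a stand-in for a posterior it assumes a bivariate normal with zero means, unit variances and correlation rho, whose full conditionals are theta_1 | theta_2 ~ N(rho * theta_2, 1 - rho^2) and vice versa.

```python
import math
import random


def gibbs_bivariate_normal(rho, n_iter, seed=0):
    rng = random.Random(seed)
    sd = math.sqrt(1 - rho ** 2)
    t1, t2, chain = 0.0, 0.0, []
    for _ in range(n_iter):
        t1 = rng.gauss(rho * t2, sd)  # draw component 1 given current component 2
        t2 = rng.gauss(rho * t1, sd)  # draw component 2 given the NEW component 1
        chain.append((t1, t2))
    return chain


chain = gibbs_bivariate_normal(rho=0.6, n_iter=20000)[1000:]  # drop burn-in
m1 = sum(a for a, _ in chain) / len(chain)
m2 = sum(b for _, b in chain) / len(chain)
cov = sum((a - m1) * (b - m2) for a, b in chain) / len(chain)
var1 = sum((a - m1) ** 2 for a, _ in chain) / len(chain)
var2 = sum((b - m2) ** 2 for _, b in chain) / len(chain)
corr = cov / math.sqrt(var1 * var2)
```

Every draw is kept; the joint dependence between the components is recovered although only one-dimensional conditionals are sampled.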

Briefly explain variational bayes.

Name the steps. (2)

  • approximate f(y) or posterior by replacing it by q(.)
  • by minimizing the KL-divergence between e.g. the posterior f(theta|y) (unknown) and q()
    • min KL(q(), f())
    • with componentwise independence s.t. \(q_\theta(\theta) = \prod_{k = 1}^{p} q_k(\theta_k)\)
    • separate numerator and denominator into k-th component and rest
    • ... \(= \int \log\frac{q_k(\theta_k)}{f^*_k(\theta_k|y)} q_k(\theta_k)d\theta_k = KL(q_k(.), f^*_k(.|.))\)
    • is minimized if \(q_k(\theta_k) \propto f^*_k(\theta_k|y)\)
      • \(q_k(\theta_k)=f^*_k(\theta_k|y) / \int f^*_k\)
  • steps
    • 0) initialise q_k(0)
    • for each dimension
      • q_k,t+1 = f*_k,t(.|.) with all other q_j,t fixed (j != k) 

What is the core idea of the Bayes factor? (3)

Give a definition of it.

Explain values. (4)

  • bayesian p-value
  • compares two models based on posterior
    • P(M_j|y) = f(y|M_j) * P(M_j) / f(y)
  • ratio can be easily analysed (f(y) cancels out)
  • \(\frac{f(y | M_1)}{f(y|M_0)}\) with \(f(y|M_j) = \int f(y|\vartheta) f_\theta(\vartheta|M_j)d\vartheta\)
    • y depends on theta, theta depends on M_j
  • values
    • 1-3 no evidence for M1
    • 3-20 evidence for M1
    • 20-150 strong evidence for M1
    • >150 very strong evidence for M1
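
A worked sketch of the ratio f(y|M_1) / f(y|M_0), under assumptions chosen so that both marginals are explicit: binomial data, M_0 fixes theta = 0.5, M_1 puts a Uniform(0, 1) prior on theta (marginal 1/(n+1)).

```python
import math


def bayes_factor_binomial(y, n):
    """Bayes factor of M1 (theta ~ Uniform(0, 1)) against M0 (theta = 0.5)."""
    marginal_m1 = 1 / (n + 1)                 # binomial likelihood integrated over the flat prior
    marginal_m0 = math.comb(n, y) * 0.5 ** n  # point model, nothing to integrate
    return marginal_m1 / marginal_m0


bf_unbalanced = bayes_factor_binomial(y=70, n=100)  # data far from theta = 0.5
bf_balanced = bayes_factor_binomial(y=50, n=100)    # data agree with theta = 0.5
```

With 70 successes out of 100 the factor exceeds 150 (very strong evidence for M1); with 50 out of 100 it drops below 1, so the data favour the simpler M0.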

What is the aim of bootstrapping? (1)

What is the general idea?

What is a problem here?

What general assumption is done?

Name the main steps. (5)

  • aim: quantify/approximate uncertainty of estimates (e.g. variance, bias, CI, tests)
  • idea: sample from data instead of true distribution as f(.), F(.) are unknown
    • "mimic true variation"
  • problem: only binom(2n-1, n) different bootstrap samples exist >> number of distinct samples is finite
  • assumption: Y_i ~ F(.) iid
  • steps
    • 1) calculate t(y)
    • 2) sample y*_i (i = 1...n) with replacement from y >> y* = bootstrap-sample
    • 3) calculate t(y*)
    • 4) repeat 2) 3) B times
    • 5) estimate i.e. Var(t(y))_hat based on t(y*_b) (b = 1...B)

Explain the plug-in principle.

Compare real world and bootstrap world. (1)

Give examples for replacement. (3)

  • F(.) replaced by F^_n(y) (theoretical distribution-function by empirical)
    • draw from F^_n(y) instead of F(.) with replacement
    • >> simulation instead of analytical solution
  • real world: y ~ F(.) iid >> t(y), bootstrap world: y* ~ F^_n(y) iid >> t(y*)
  • replacements
    • mu by y_bar
    • y_bar by y*_bar
    • F(y) by F^_n(y*)

Give examples how to apply bootstrapping. (2)

  • bias
    • \(bias(\hat\xi, \xi) = E(\hat\xi) - \xi \Rightarrow \hat{bias}(\hat\xi, \xi) = \hat{\bar\xi}^*-\hat\xi\)
    • expectation under F(.) replaced by mean over bootstrap replicates drawn from F^_n(y)
    • correction: \(\hat{\hat\xi} = \hat\xi - \hat{bias}(\hat\xi, \xi) = 2\hat\xi-\hat{\bar\xi}^*\)
      • zero bias leads to increased variance...
  • standard-error
    • \(\xi = \mu_y - \mu_z \Rightarrow \hat{\xi} = \bar{y} - \bar{z} \Rightarrow \hat{\xi}^{*b} = \bar{y}^{*b} - \bar{z}^{*b}\)

What is the idea of parametric bootstrapping?

Discuss it (1+ 1,1)

  • idea: Y_i ~ F(., theta) iid >> Y*_i ~ F(. | theta_hat) iid
  • y*_i can differ from y_i 
    • + useful for small samples (or extreme cases)
    • - parametric assumptions (+ estimation of those)

Explain how bootstrapping can be used in regression. (4) Explain general character, how it works and pros and cons.

  • residual-based
    • fit a model on x, y >> beta_hat, y_hat >> epsilon_hat >> epsilon* = samples from residuals
    • >> y* = Xbeta_hat + epsilon*
    • beta*_hat = (XTX)^-1 XTy*
  • model-based
    • fit a model on x, y, estimate variance of residuals (sigma_hat^2)
    • y* = Xbeta_hat + epsilon* (epsilon* ~ N(0, sigma_hat^2))
    • beta*_hat = (XTX)^-1 XTy*
  • >> problem: induces variance homogeneity (as residuals are independent of x) >> variance unchanged, nothing gained by bootstrapping
    • Var(beta*_hat) = (XTX)^-1 sigma_hat^2
  • pairwise/case resampling
    • sample whole rows (y_i, x_i) with replacement >> (y*, X*)
    • beta*_hat = (X*T X*)^-1 X*T y*
    • - contradicts regression idea: model y given X
  • wild bootstrap
    • fit a model on y, X >> epsilon_hat
    • \(\hat\epsilon^*_i = V^*_i\hat\epsilon_i\) 
      • with V*_i drawn from a two-point distribution with E(V) = 0, Var(V) = 1
    • y* = Xbeta_hat + epsilon*_hat
    • mimics empirical estimates
    • ++ all original samples considered
  • >> variance heterogeneity can be modeled as epsilon*_hat_i depends on x_i

What is the main idea of regression? (4)

How is the solution obtained? Explain an alternative.

  • relates input X with output y
  • y = beta_zero + Xbeta_x + epsilon
  • error epsilon can't be explained by model (epsilon ~ N(0, sigma^2))
    • error independent of x >> variance homogeneity
  • E(y|X) = beta_zero + Xbeta_x = y_hat
    • >> modeling the mean y|X ~ N(beta_zero + Xbeta_x, sigma^2)
  • estimation by:
    • beta_0_hat, beta_x_hat = min sum((y_i - y_hat_i)^2) 
    • or: ML-estimation: normal-distribution with mu = (beta_zero + Xbeta_x)

Explain the matrix-notation of regression.

  • design-matrix X (with first column = 1 for beta_zero), n x p+1
  • epsilon-vector: n x 1
  • response-vector: n x 1
  • beta-vector: (beta_zero, beta_x)T: p+1 x 1
  • >> y = Xbeta + epsilon

Name ML-estimates in regression (I(beta_hat), beta_hat, sigma_ML^2)

  • I(beta_hat) = XTX/sigma^2, I^-1(beta_hat) = sigma^2 (XTX)^-1 = Var(beta_hat)
  • beta_hat = (XTX)^-1 * XTy
  • sigma_ML^2 = (y-Xbeta_hat)T(y-Xbeta_hat) / n (biased)
  • >> beta_hat ~ N(beta, Var(beta_hat))
    • (E(beta_hat), Var(beta_hat))

How to obtain uncorrelated beta_0 and beta_x? (2)

  • XTX implies correlated estimates (usually useful)
  • solution:
    • \(x^*_i = x_i - \bar x \Rightarrow x_i = x^*_i+\bar x\)
    • \(y_i = \beta^*_0 + x^*_i\beta_x+\epsilon_i \Rightarrow \hat \beta^*=(X^{*T}X^*)^{-1}X^{*T}y\)

How to model non-linear effects in regression? (3) Why is this possible?

  • quadratic effect: \(x^2\beta_{xx}\)
  • binary covariate: \(x_2\beta_2\) with \(x_2 \in \{0, 1\}\)
  • categorical variable: \(1_{\{edu = 2\}}\beta_3\)
  • linear regression = linear in parameter theta, not in X

Name the hat-matrix, where does it come from?

How can it be used?

  • \(H = X(X^TX)^{-1}X^T\)
    • symmetric and idempotent, s.t. HT = H, HH = H, same for (I-H)
  • from \(y-\hat y = y-X\hat\beta = y-X(X^TX)^{-1}X^Ty = y-Hy=(I-H)y\)
  • used for: \(E((y-X\hat\beta)^T(y-X\hat\beta)) = E(y^T(I-H)(I-H)y)=...=\sigma^2(n-p)\)
    • with \(E(YY^T)=Var(Y)+E(Y)E(Y)^T\)

Bayesian regression. Name properties. (2 + 1)

  • beta is unknown
  • \(\beta, \sigma^2\sim f(\beta, \sigma^2|y)\) = posterior with flat prior
    • with mu = Xbeta
    • \((y-X\beta)=(y-X\hat\beta)-(X\beta-X\hat\beta)\)
    • \((y-X\beta)^T(y-X\beta)=...=\hat\sigma^2(n-p)+(\beta-\hat\beta)^TX^TX(\beta-\hat\beta)\)
    • \(f(\beta, \sigma^2|y)\propto~...\times \exp(-\frac{1}{2\sigma^2}(\beta-\hat\beta)^TX^TX(\beta-\hat\beta))\)
      • posterior again normal
  • >> \(\beta|\sigma^2, y\sim N(\hat\beta, (X^TX)^{-1}\sigma^2)\)
    • ML: beta_hat | beta, sigma^2 ~ N(beta, (XTX)^-1 sigma^2)
    • >> same results, different reasoning

What is the core idea of GLM (name cases, h(n) and interpretation) (3).

Explain transformation of response. (1)

Name requirements. (2)

  • response can be 
    • binary (logistic regression): h(n) = exp(n)/(1+exp(n)), or PHI(n)
      • effects: odds/log-odds
    • count (poisson regression): h(n) = exp(n)
      • multiplicative effects
    • categorical (cumulative regression)
  • transformation:
    • E(y|X) = h(n) with n = linear-predictor = beta_0 + Xbeta_x
      • n = h^-1(E(y|X)) = g(mu)
  • required
    • y|X ~ exp-family
    • link-function: E(y|X) = h(n)
      • natural/canonical link: theta = n
      • remember: \(\partial\kappa(\theta)/\partial(\theta) = E(t(y)) = \mu\)

 

How does prediction work in regression? (2)

  • y_hat = Xbeta_hat + epsilon
  • E(y_hat) = Xbeta_hat

How is the variance of prediction in regression defined? What does this imply?

  • \(Var(\hat y) = Var(X\hat\beta+\epsilon)=\sigma^2(X(X^TX)^{-1}X^T+1)\)
  • >> high variance in regions of low mass of data
    • as (XTX)^-1 -> 0 for n -> inf; -> inf for n -> 0

When to use weighted regression? (2)

1: 2 + 4

2: 4

  • modeling variance heterogeneity (Var(y_i) = sigma^2(x_i))
    • sigma^2(x_i) = a_i * sigma^2
    • >> y ~ N(Xbeta, sigma^2 W^-1) with W = diag(1/a_i), i = 1...n
      • Var(y) = sigma^2 W^-1
      • in loglikelihood: ...(y-Xbeta)TW(y-Xbeta)
      • in beta_hat = (XTWX)^-1 XTWy
      • Var(beta_hat) = sigma^2(XTWX)^-1, rest cancels out as Var(y) = sigma^2 W^-1
  • survey weighting
    • biased data in surveys >> over- or underrepresentation 
    • >> introduce weights for samples (W)
    • beta_hat see above
    • Var(beta_hat) = ... nothing cancels out as Var(y) = sigma^2 (W not included)

Briefly describe quantile-regression (3)

  • estimation of quantile(s), not expectation
  • modeling of variance heterogeneity implicit (analysing slopes of quantiles)
  • squared error replaced by check-function (>> linear programming as not solvable analytically)
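
The role of the check function can be sketched in the simplest case, an intercept-only model, where minimising the summed check loss returns a sample quantile. The data are hypothetical and include one outlier; the search over observed values stands in for the linear programme used in real quantile regression.

```python
def check_loss(u, tau):
    """Check function: tau * u for u >= 0, (tau - 1) * u for u < 0."""
    return u * (tau - (1 if u < 0 else 0))


def best_constant(y, tau):
    """Observed value q that minimises the summed check loss of y - q."""
    return min(y, key=lambda q: sum(check_loss(v - q, tau) for v in y))


y = [1, 2, 3, 4, 5, 6, 7, 8, 100]
median_fit = best_constant(y, tau=0.5)
lower_fit = best_constant(y, tau=0.25)
mean_fit = sum(y) / len(y)
```

For tau = 0.5 the minimiser is the median 5, unaffected by the outlier, whereas the mean is pulled above 15; other values of tau give the corresponding quantiles.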