Set of flashcards SRaI (Page 2 of 4)

Flashcards	125
Language	English
Category	Computer Science
Level	University
Created / Updated	04.10.2019 / 11.10.2019
Weblink	https://card2brain.ch/box/20191004_srai

What is the core idea of a goodness-of-fit test? (1)

Name two examples. (1:3, 1)

idea: test whole distribution model, not only parameter
- H0: Y ~ f(y; theta)
Chi-squared goodness of fit
- contingency-table: observed (n_k) vs. expected (l_k) # of elements in cells
  - expected by assumed distribution
- n_k far from l_k speaks against H0
- X^2 = ..., "H1" <> if X^2 >= Chi-squared_K-1-p,1-a
Kolmogorov-Smirnov test
- compares empirical and hypothetical distribution
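
A minimal sketch of the chi-squared statistic, assuming the usual form X^2 = sum_k (n_k - l_k)^2 / l_k (the card elides the formula). The die counts are hypothetical, and 11.07 is the tabulated 0.95 quantile of the chi-squared distribution with K-1-p = 5 degrees of freedom.

```python
def chi_squared_statistic(observed, expected):
    """Return X^2 = sum_k (n_k - l_k)^2 / l_k over all cells."""
    return sum((n_k - l_k) ** 2 / l_k for n_k, l_k in zip(observed, expected))

# Hypothetical counts from 120 rolls of a die; H0: the die is fair.
observed = [18, 22, 20, 25, 15, 20]
expected = [sum(observed) / len(observed)] * len(observed)  # l_k = 20 per cell

x2 = chi_squared_statistic(observed, expected)
critical_value = 11.07  # table value, chi-squared with 5 df, level 0.05
reject_h0 = x2 >= critical_value
```

Here the observed counts stay close to the expected ones, so X^2 is small and H0 is kept.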

What is the definition of the p-value? (2) Name different values. (4)

Relate p-value and sample size.

probability, under H0, to observe data (y_bar) that contradict H0 at least as much as the observed data (y_bar_obs)
- P(y_bar >= y_bar_obs | H0)
evidence of contradiction of H0
values
- p <= 0.1: weak evidence against H0
- p <= 0.05 increased
- p <= 0.01 strong
- >> "H1" <> p <= alpha
as sample size increases, the uncertainty decreases, hence (if H0 does not hold exactly) the p-value decreases
- in big data everything becomes significant
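
A small sketch of the sample-size effect, assuming a one-sided test for a normal mean with known sigma (the function name and the numbers are illustrative only). The observed deviation from H0 is held fixed at 0.1 while n grows.

```python
import math

def one_sided_p_value(y_bar_obs, mu0, sigma, n):
    """P(Y_bar >= y_bar_obs | H0) for a normal mean with known sigma."""
    z = (y_bar_obs - mu0) / (sigma / math.sqrt(n))
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of N(0, 1)

# Same observed mean, increasing sample size.
p_values = {n: one_sided_p_value(0.1, 0.0, 1.0, n) for n in (10, 100, 1000)}
```

The same small deviation is not significant for n = 10 but clearly significant for n = 1000.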

Relate CI and testing. (1)

"H1" <> theta not in CI
- P(theta in CI) >= 1-alpha

What is the main problem of multiple testing?

What is FWER?

What is Bonferroni adjustment? What is the problem here?

running m independent tests at level alpha: P(reject at least one H0_j | all H0_j true) lies between alpha and m*alpha (= probability of at least one type 1 error)
- this probability tends to 1 for large m (e.g. alpha = 0.05, m = 20: 1 - 0.95^20 ≈ 0.64)
Family-wise error rate (FWER) = probability to falsely reject at least one true null hypothesis (to be controlled)
BA: alpha_adj = alpha / m
- problem: alpha_adj gets very small for large m, so H0 is rarely rejected (low power)
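
The numbers above can be checked with a short sketch, assuming m independent tests that each reject a true H0 with probability alpha (the function name is illustrative).

```python
def fwer_independent(alpha, m):
    """P(at least one false rejection) for m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m

alpha, m = 0.05, 20
unadjusted = fwer_independent(alpha, m)      # each test at level alpha
bonferroni = fwer_independent(alpha / m, m)  # each test at level alpha / m
```

With the Bonferroni-adjusted level the family-wise error rate stays below alpha, at the price of a very strict threshold per test.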

What is the idea of Holm's procedure? (2) Name steps. What does this imply?

order p-values ascending
limit FWER to alpha in each step
if p_1 > alpha/m: accept all, stop
- else reject H0_1, go on
if p_2 > alpha/(m-1): accept remaining, stop
- else reject H0_2, go on
...
if p_m > alpha: accept H0_m, stop
- else reject H0_m
implication: alpha_adj increases each iteration (but p-value also) >> more rejections
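
The steps can be sketched as follows. The p-values are hypothetical, and "accept" is represented by False in the returned list.

```python
def holm(p_values, alpha=0.05):
    """Return a list of booleans: True where H0_j is rejected by Holm's procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda j: p_values[j])  # indices by ascending p
    reject = [False] * m
    for step, j in enumerate(order):
        if p_values[j] > alpha / (m - step):
            break  # accept this and all remaining hypotheses
        reject[j] = True
    return reject

p = [0.04, 0.001, 0.02, 0.012]  # hypothetical p-values
holm_reject = holm(p)
bonferroni_reject = [p_j <= 0.05 / len(p) for p_j in p]
```

In this example Holm rejects all four hypotheses while Bonferroni rejects only two, which illustrates the "more rejections" implication.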

What is the general idea of bayes reasoning? Define the posterior, characterise the denominator. (2)

Relate posterior to likelihood.

idea
- theta is a random variable with prior probability \(f_\theta(\vartheta)\)
- express uncertainty about theta as probability (posterior)
\(f_\theta (\vartheta|y) = \frac{f(y;~\vartheta)f_\theta(\vartheta)}{\int{f(y;~\vartheta)f_\theta(\vartheta)d\vartheta}}\)
- normalisation constant is hard to derive analytically
- normalisation constant = marginal density, uncertainty about theta integrated out
posterior proportional to likelihood * prior

What is a conjugate prior? (3)

What is a problem with it?

Give an example. (3) Why is this outcome interesting? (2)

data from a family of distribution: F_y = { f(y; theta), theta from its space }
prior from a family of distribution F_theta = { f(theta; gamma), gamma from its space }
F_theta is conjugate to F_y (likelihood) if posterior from family of prior.
problem: hard to find besides classical cases
example
- mu ~ N(gamma, tau^2)
- mu|y ~ N( , ) >> posterior again normal
- mu | y --a--> N(y_bar (= mu_hat), sigma^2/n)
  - Bayes: asymptotic normality of theta around theta_hat
  - ML: asymptotic normality of theta_hat around theta
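
A minimal sketch of the normal example, assuming y_i ~ N(mu, sigma^2) iid with known sigma^2 (an assumption not stated on the card) and the standard precision-weighted update. The data and hyperparameters are hypothetical.

```python
def normal_posterior(y, sigma2, gamma, tau2):
    """Posterior mean and variance of mu for y_i ~ N(mu, sigma2), prior mu ~ N(gamma, tau2)."""
    n = len(y)
    y_bar = sum(y) / n
    precision = 1 / tau2 + n / sigma2            # prior precision + data precision
    mean = (gamma / tau2 + n * y_bar / sigma2) / precision
    return mean, 1 / precision

y = [4.8, 5.3, 5.1, 4.9, 5.4]  # hypothetical observations
post_mean, post_var = normal_posterior(y, sigma2=1.0, gamma=0.0, tau2=100.0)
```

With a wide prior the posterior mean is close to y_bar and the posterior variance close to sigma^2/n, in line with the asymptotic statement above.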

Selecting the prior: Explain flat (2) and Jeffreys' prior. (5)

flat prior
- prior = constant
- problem: transformation of flat prior results in not constant prior (knowledge implicit)
  - f_gamma(gamma) not flat with gamma = g(theta) (transformation rule)
  - e.g. f_pi(pi) = 1 with gamma = log(pi/(1-pi)): f_gamma(gamma) = exp(gamma)/(1+exp(gamma))^2
Jeffreys' prior
- prior proportional to sqrt(I_theta(theta))
- transformation invariant: f_gamma(gamma) proportional to sqrt(I_gamma(gamma))
- not flat
- distance between prior and posterior is maximized (maximal information from the data)
- requires Fisher regularity

What are general problems with priors? (2 + 1)

no prior expressing ignorance exists
prior influences results of analysis
>> how to express no prior knowledge?

What is the core idea of empirical bayes? (2 + 1)

prior depends on hyperparameter gamma \(f_\theta(\vartheta; \gamma)\)
estimate gamma by maximising L(y; gamma) -> \(f_\theta(\vartheta; \gamma_{ML})\)
>> prior depends on gamma_hat, hence on data
- contradicts bayesian idea, but useful in practice

What is the core idea of hierarchical bayes? (1) Discuss it (1 + 1)

shift uncertainty of prior to higher level
- y|theta ~ f_y(y; theta) >> theta | gamma ~ f_theta(theta; gamma) >> gamma |. ~ f_gamma(gamma; . )
+ flexible distribution of parameter of interest
- costly computation of prior

Why are numerical methods for the posterior required?

What are solutions? (4) Briefly discuss them.

numerical integration of normalisation constant (assuming prior is given)
solutions
- approximation of f(y) (2)
  - numerical approximation (low-dimensional only)
  - Laplace approximation (works also high-dimensional, easy to estimate)
- sampling from the denominator (Monte Carlo approx.) (3 + 1)
  - sampling from prior (problem: F_theta(theta) not given, integration again)
  - rejection sampling (use f*_theta(.) with given F*_theta(.))
  - importance sampling
  - >> problematic for high-dimensional theta
- sampling from the posterior (Monte Carlo approx.)
  - = Markov Chain Monte Carlo
    - sample from posterior to estimate expectation, variance etc.
  - Metropolis-Hastings (low acceptance for high-dimensional theta)
  - Gibbs sampling (one component of theta sampled in each step, works high-dimensional)
- approximation of f(y) or posterior
  - variational bayes (high-dimensional, component-wise)

What is the core idea of monte carlo approximation?

numerical problem solving based on simulation and probability theory

Briefly describe Laplace approximation. (4)

Discuss (2)

assumes n iid samples
CLT for bayes
f(y) = integral(exp(log-likelihood + log(prior)) dtheta)
- log-likelihood grows with n
- log-likelihood + log(prior) = l_p,n(theta)
  - >> s_p,n(theta), I_p,n(theta)
Taylor series expansion (TSE) of the exponent l_p,n(theta) around theta_hat_p gives f(y) ≈
- \(f(y; \theta_p) f_\theta(\theta_p)\sqrt{2\pi}J_{p,n}(\theta_p)^{-1/2}\)
+ works also high-dimensional (small adaptation of formula)
+ easy calculation of theta_p (ML)
+ good approximation for n -> inf

What is the core idea when sampling from the denominator of posterior?

What is required for it? What is the problem there?

denominator = E(f(y; theta)) >> sampling from it to get estimate of E(.)
- as E_hat() converges to E(.) for n -> inf
therefore sampling from prior is required
- theta*_j = F^-1(u*_j)
- >> for sampling from prior, F_theta(theta) (cdf) is required >> requires numerical integration again
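
A sketch of this idea in a case where the inversion is trivial: a Uniform(0, 1) prior, so F^-1(u) = u, combined with a binomial likelihood. This setting is an assumption chosen because f(y) = 1/(n+1) is known exactly.

```python
import math
import random

def marginal_by_prior_sampling(k, n, draws, rng):
    """Estimate f(y) = E_prior[f(y; theta)] by averaging the likelihood over prior draws."""
    total = 0.0
    for _ in range(draws):
        theta = rng.random()  # theta*_j = F^-1(u*_j) = u*_j for the uniform prior
        total += math.comb(n, k) * theta**k * (1 - theta) ** (n - k)
    return total / draws

rng = random.Random(0)
estimate = marginal_by_prior_sampling(k=3, n=10, draws=50_000, rng=rng)
exact = 1 / 11
```

For priors without an analytical cdf the line drawing theta is exactly where the problem named above appears.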

Explain rejection sampling. Idea (1). How to produce a sample? (3)

What is special about it? What is the problem?

idea: sample from proposal density f*_theta(.) (umbrella, see formula) with analytically given F*_theta(.)
- f(theta) <= a * f*(theta)
produce sample from f(theta)
- 1) draw u*
- 2) draw theta* from f*
- 3) if u* <= f(theta*) / (a * f*(theta*)) accept theta* as sample
  - else reject theta*
- proof: P(theta* <= theta | theta* accepted) = P(theta* <= theta, theta* accepted) / P(theta* accepted) = (F_theta(theta)/a) / (1/a) = F_theta(theta)
produces sample from prior without drawing from it
problem: for large a, acceptance of theta* is low >> no sample
- >> f*(theta) must be close to f(theta) (low a)
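
The three steps can be sketched as follows, assuming a Beta(2, 2) target f(theta) = 6 theta (1 - theta) and a Uniform(0, 1) proposal with a = 1.5 (both chosen only for illustration).

```python
import random

def target_density(theta):
    """Beta(2, 2) density f(theta) = 6 * theta * (1 - theta) on [0, 1]."""
    return 6 * theta * (1 - theta)

def rejection_sample(size, rng, a=1.5):
    """Draw from f using a Uniform(0, 1) proposal f* = 1 with f <= a * f*."""
    samples, proposals = [], 0
    while len(samples) < size:
        u = rng.random()       # 1) draw u*
        theta = rng.random()   # 2) draw theta* from f*
        proposals += 1
        if u <= target_density(theta) / (a * 1.0):  # 3) accept, else reject
            samples.append(theta)
    return samples, size / proposals

rng = random.Random(1)
samples, acceptance_rate = rejection_sample(20_000, rng)
sample_mean = sum(samples) / len(samples)
```

The observed acceptance rate is close to 1/a, which shows why a large a makes the method inefficient.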

What is the general idea of MCMC? (2)

theta*j and theta*j+1 are correlated (>> chain, no iid)
distribution of chain converges to stationary distribution = posterior

What is the idea of Metropolis Hastings? (2)

Name the steps. (3)

What is a general problem here?

f(y) is unknown, but posterior is proportional to likelihood * prior
ratio of posteriors for theta, theta~ is known as f(y) cancels out
steps
- 1) draw theta* from proposal distribution q(.|theta*_t)
- 2) accept theta* as new sample with probability alpha
  - \(\alpha(\theta^*_t|\theta^*) = min\{1, \frac{f(\theta^*|y)}{f(\theta^*_t|y)} \frac{q(\theta^*_t|\theta^*)}{q(\theta^*|\theta^*_t)}\}\)
  - alpha = 1 if (for symmetric q) the posterior of the proposed theta is larger than that of the previous one
  - alpha < 1 if the posterior of the proposed theta is smaller than that of the previous one (acceptance = fraction)
  - if q is symmetric: Metropolis algorithm
- 3) draw u*, accept theta* if alpha() >= u*
  - else don't accept it
problem: for high-dimensional theta: alpha -> 0 (e.g. 0.8 acceptance each dimension >> 0.8^p)
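
A minimal random-walk Metropolis sketch of the steps. The unnormalised log-posterior is a stand-in with the shape of N(2, 1), and the normal proposal is symmetric, so the q-ratio drops out. All names and tuning values are assumptions.

```python
import math
import random

def log_unnormalised_posterior(theta):
    """Log of likelihood * prior up to a constant; here a stand-in N(2, 1) shape."""
    return -0.5 * (theta - 2.0) ** 2

def metropolis(steps, rng, start=0.0, proposal_sd=1.0):
    """Random-walk Metropolis with a symmetric normal proposal q."""
    chain, current = [], start
    for _ in range(steps):
        proposal = rng.gauss(current, proposal_sd)   # 1) draw theta*
        log_ratio = log_unnormalised_posterior(proposal) - log_unnormalised_posterior(current)
        alpha = math.exp(min(0.0, log_ratio))        # 2) acceptance probability
        if rng.random() <= alpha:                    # 3) accept if u* <= alpha
            current = proposal
        chain.append(current)                        # on rejection the old value repeats
    return chain

rng = random.Random(2)
chain = metropolis(20_000, rng)
kept = chain[2_000:]  # drop a burn-in phase
posterior_mean = sum(kept) / len(kept)
posterior_var = sum((t - posterior_mean) ** 2 for t in kept) / len(kept)
```

Only the ratio of unnormalised posteriors is used, so f(y) is never needed.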

What is the influence of q in Metropolis-Hastings?

narrow q:
- similar theta*
- high acceptance as posteriors will be mostly the same >> E(alpha) close to 1
- >> not much movement in parameter-space
wide q:
- different theta*
- acceptance will vary a lot (0 to 1) as posteriors will differ >> E(alpha) << 1
- >> jumping, exploring
>> balance between acceptance and exploring

How to use MCMC in practice? (3)

use different starting values
define and delete burn in phase
thinning out: keep only every k-th sample (lag at which autocorrelation -> 0) to get approximately iid samples from posterior

Briefly explain the idea of gibbs sampling.

sample only one single component of theta-vector in each step (avoid problem of acceptance)
steps
- draw theta*_k from f(theta_k* | y, theta_-k,t)
  - set theta*k,t+1 = theta*_k (always accepted, no rejection step)
  - use new theta*k,t+1 during drawing of other components
proof
- \(f_{\theta_k}(\theta_k|y, \theta_{-k}) \propto f_{\theta_k}(\theta_k, \theta_{-k}|y) = f_\theta(\theta|y)\)
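
A sketch of the component-wise updates, assuming a target where the full conditionals are known in closed form: a standard bivariate normal with correlation rho, for which theta_1 | theta_2 ~ N(rho * theta_2, 1 - rho^2). The target is chosen for illustration only.

```python
import math
import random

def gibbs_bivariate_normal(steps, rho, rng):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    cond_sd = math.sqrt(1 - rho**2)
    theta1, theta2 = 0.0, 0.0
    chain = []
    for _ in range(steps):
        theta1 = rng.gauss(rho * theta2, cond_sd)  # draw theta_1 given current theta_2
        theta2 = rng.gauss(rho * theta1, cond_sd)  # draw theta_2 given the new theta_1
        chain.append((theta1, theta2))
    return chain

rng = random.Random(3)
chain = gibbs_bivariate_normal(20_000, rho=0.6, rng=rng)[1_000:]  # drop burn-in
mean1 = sum(t1 for t1, _ in chain) / len(chain)
mean2 = sum(t2 for _, t2 in chain) / len(chain)
cov12 = sum((t1 - mean1) * (t2 - mean2) for t1, t2 in chain) / len(chain)
```

Every draw is kept, and the empirical covariance of the chain recovers rho.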

Briefly explain variational bayes.

Name the steps. (2)

approximate f(y) or posterior by replacing it by q(.)
by minimizing the KL-divergence between e.g. f(theta|y) (unknown) and q()
- min KL(q(), f())
- with componentwise independence s.t. \(q_\theta(\theta) = \prod_{k = 1}^{p} q_k(\theta_k)\)
- separate upper and lower part by k-th components and rest
- ... \(= \int log\frac{q_k(\theta_k)}{f^*_k(\theta_k|y)} q_k(\theta_k)d\theta_k = KL(q_k(.), f^*_k(.|.))\)
- is minimized if \(q_k(\theta_k) \propto f^*_k(\theta_k|y)\)
  - \(q_k(\theta_k)=f^*_k(\theta_k|y) / \int f^*_k\)
steps
- 0) initialise q_k(0)
- for each dimension
  - q_k,t+1 = f*_k,t(.|.) with all other q_j,t fixed (j != k)

What is the core idea of bayes factor? (3)

Give a definition of it.

Explain values. (4)

bayesian p-value
compares two models based on posterior
- P(M_j|y) = f(y|M_j) * P(M_j) / f(y)
ratio can be easily analysed (f(y) cancels out)
\(\frac{f(y | M_1)}{f(y|M_0)}\) with \(f(y|M_j) = \int f(y|\vartheta) f_\theta(\vartheta|M_j)d\vartheta\)
- y depends on theta, theta depends on M_j
values
- 1-3 no evidence for M1
- 3-20 evidence for M1
- 20-150 strong evidence for M1
- >150 very strong evidence for M1
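
A small worked sketch, assuming binomial data with M_0: theta = 0.5 (point prior) and M_1: theta ~ Uniform(0, 1). Both models are chosen for illustration, because their marginal likelihoods have closed forms.

```python
import math

def bayes_factor_binomial(k, n):
    """f(y | M_1) / f(y | M_0) with M_0: theta = 0.5 and M_1: theta ~ Uniform(0, 1)."""
    marginal_m1 = 1 / (n + 1)               # binomial likelihood integrated over theta
    marginal_m0 = math.comb(n, k) * 0.5**n  # likelihood at the fixed value theta = 0.5
    return marginal_m1 / marginal_m0

bf_balanced = bayes_factor_binomial(k=50, n=100)  # data agree with theta = 0.5
bf_skewed = bayes_factor_binomial(k=70, n=100)    # data far from theta = 0.5
```

For 50 successes out of 100 the factor is below 1 (no evidence for M1), while 70 out of 100 falls in the "very strong evidence" range.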

What is the aim of bootstrapping? (1)

What is the general idea?

What is a problem here?

What general assumption is done?

Name the main steps. (5)

aim: quantify/approximate uncertainty of estimates (e.g. variance, bias, CI, tests)
idea: sample from data instead of true distribution as f(.), F(.) are unknown
- "mimic true variation"
problem: only \(\binom{2n-1}{n}\) different B-samples exist >> B is finite
assumption: Y_i ~ F(.) iid
steps
- 1) calculate t(y)
- 2) sample y*_i (i = 1...n) with replacement from y >> y* = bootstrap-sample
- 3) calculate t(y*)
- 4) repeat 2) 3) B times
- 5) estimate e.g. Var(t(y))_hat based on t(y*_b) (b = 1...B)
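
The five steps can be sketched as follows, taking t(y) to be the sample mean so that the bootstrap standard error can be compared with the textbook formula s/sqrt(n). The data are simulated and hypothetical.

```python
import random
import statistics

def bootstrap_statistics(y, statistic, b, rng):
    """Return t(y*_1), ..., t(y*_B) for B bootstrap samples drawn with replacement."""
    n = len(y)
    return [statistic(rng.choices(y, k=n)) for _ in range(b)]

rng = random.Random(4)
y = [rng.gauss(10.0, 2.0) for _ in range(50)]     # hypothetical data
t_obs = statistics.mean(y)                         # 1) t(y)
t_star = bootstrap_statistics(y, statistics.mean, b=2_000, rng=rng)  # 2) to 4)
se_boot = statistics.stdev(t_star)                 # 5) bootstrap standard error
se_formula = statistics.stdev(y) / len(y) ** 0.5   # analytical comparison
```

For the mean the simulation is not needed; the same function applies to statistics such as the median, where no simple formula exists.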

Explain the plug-in principle.

Compare real world and bootstrap world. (1)

Give examples for replacement. (3)

F(.) replaced by F^_n(y) (theoretical distribution-function by empirical)
- draw from F^_n(y) instead of F(.) with replacement
- >> simulation instead of analytical solution
real world: y ~ F(.) iid >> t(y), bootstrap world: y* ~ F^_n(y) iid >> t(y*)
replacements
- mu by y_bar
- y_bar by y*_bar
- F(y) by F^_n(y*)

Give examples of how to apply bootstrapping. (2)

bias
- \(bias(\hat\xi, \xi) = E(\hat\xi) - \xi \Rightarrow \hat{bias}(\hat\xi, \xi) = \hat{\bar\xi}^*-\hat\xi\)
- drawn from F^_n(y) and F(.)
- correction: \(\hat{\hat\xi} = \hat\xi - \hat{bias}(\hat\xi, \xi) = 2\hat\xi-\hat{\bar\xi}^*\)
  - zero bias leads to increased variance...
standard-error
- \(\xi = \mu_y - \mu_z \Rightarrow \hat{\xi} = \bar{y} - \bar{z} \Rightarrow \hat{\xi}^{*b} = \bar{y}^{*b} - \bar{z}^{*b}\)

What is the idea of parametric bootstrapping?

Discuss it (1+ 1,1)

idea: Y_i ~ F(., theta) iid >> Y*_i ~ F(. | theta_hat) iid
y*_i can differ from y_i
- + useful for small samples (or extreme cases)
- - parametric assumptions (+ estimation of those)

Explain how bootstrapping can be used in regression. (4) Explain general character, how it works and pros and cons.

residual-based
- fit a model on x, y >> beta_hat, y_hat >> epsilon_hat >> epsilon* = samples from residuals
- >> y* = Xbeta_hat + epsilon*
- beta*_hat = (X^TX)^-1 X^Ty*
model-based
- fit a model on x, y, estimate variance of residuals (sigma_hat^2)
- y* = Xbeta_hat + epsilon* (epsilon* ~ N(0, sigma_hat^2))
- beta*_hat = (X^TX)^-1 X^Ty*
>> problem: induces variance homogeneity (as residuals are independent of x) >> variance unchanged, nothing gained by bootstrapping
- Var(beta*_hat) = (X^TX)^-1 sigma_hat^2
pairwise/case resampling
- y* = X*beta + epsilon* (sampling whole rows (x_i, y_i) with replacement)
- beta*_hat = (X*^T X*)^-1 X*^T y*
- - contradicts regression idea: model y given X
wild bootstrap
- fit a model on y, X >> epsilon_hat
- \(\hat\epsilon^*_i = V^*_i\hat\epsilon_i\)
  - with V*_i drawn from a two-point distribution with E(V) = 0, Var(V) = 1 (e.g. V = -1 or +1 with probability 1/2 each)
- y* = Xbeta_hat + epsilon*_hat
- mimics empirical estimates
- ++ all original samples considered
>> variance heterogeneity can be modeled as epsilon*_hat_i depends on x_i
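
A sketch of the wild bootstrap for a single covariate, assuming V* takes the values -1 and +1 with probability 1/2 each (one distribution with E(V) = 0, Var(V) = 1). The simulated data with error variance growing in x and the helper names are hypothetical.

```python
import random
import statistics

def ols(x, y):
    """Least-squares intercept and slope for a single covariate."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

def wild_bootstrap_slopes(x, y, b, rng):
    """Refit the slope on y* = X beta_hat + V* * epsilon_hat, B times."""
    b0, b1 = ols(x, y)
    fitted = [b0 + b1 * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]
    slopes = []
    for _ in range(b):
        # Each residual stays attached to its own x_i; only its sign is randomised.
        y_star = [fi + rng.choice((-1.0, 1.0)) * ei for fi, ei in zip(fitted, residuals)]
        slopes.append(ols(x, y_star)[1])
    return slopes

rng = random.Random(5)
x = [i / 10 for i in range(1, 101)]
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 0.5 * xi) for xi in x]  # error sd grows with x
slope_hat = ols(x, y)[1]
slopes = wild_bootstrap_slopes(x, y, b=1_000, rng=rng)
slope_se = statistics.stdev(slopes)
```

Because epsilon*_i is built from epsilon_hat_i at the same x_i, the heterogeneous spread of the residuals carries over into the bootstrap samples.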

What is the main idea of regression? (4)

How is the solution obtained? Explain an alternative.

relates input X with output y
y = beta_zero + Xbeta_x + epsilon
error epsilon can't be explained by model (epsilon ~ N(0, sigma^2))
- error independent of x >> variance homogeneity
E(y|X) = beta_zero + Xbeta_x = y_hat
- >> modeling the mean y|X ~ N(beta_zero + Xbeta_x, sigma^2)
estimation by:
- beta_0_hat, beta_x_hat = argmin sum((y_i - y_hat_i)^2)
- or: ML-estimation: normal-distribution with mu = (beta_zero + Xbeta_x)

Explain the matrix-notation of regression.

design-matrix X (with first column = 1 for beta_zero), n x p+1
epsilon-vector: n x 1
response-vector: n x 1
beta-vector: (beta_zero, beta_x)^T: p+1 x 1
>> y = Xbeta + epsilon

Name ML-estimates in regression (I(beta_hat), beta_hat, sigma_ML^2)

I(beta_hat) = X^TX/sigma^2, I^-1(beta_hat) = sigma^2 (X^TX)^-1 = Var(beta_hat)
beta_hat = (X^TX)^-1 * X^Ty
sigma_ML^2 = (y-Xbeta_hat)^T(y-Xbeta_hat) / n (biased)
>> beta_hat ~ N(beta, Var(beta_hat))
- (E(beta_hat), Var(beta_hat))

How to obtain uncorrelated beta_0 and beta_x? (2)

X^TX implies correlated estimates (usually useful)
solution:
- \(x^*_i = x_i - \bar x \Rightarrow x_i = x^*_i+\bar x\)
- \(y_i = \beta^*_0 + x^*_i\beta_x+\epsilon_i \Rightarrow \hat \beta^*=(X^{*T}X^*)^{-1}X^{*T}y\)

How to model non-linear effects in regression? (3) Why is this possible?

quadratic effect: \(x^2\beta_{xx}\)
binary covariate: \(2\beta_2\)
categorical variable: \(1_{\{edu = 2\}}\beta_3\)
linear regression = linear in parameter theta, not in X

Name the hat-matrix, where does it come from?

How can it be used?

\(H = X(X^TX)^{-1}X^T\)
- symmetric and idempotent: H^T = H, HH = H, same for (I-H)
from \(y-\hat y = y-X\hat\beta = y-X(X^TX)^{-1}X^Ty = y-Hy=(I-H)y\)
used for: \(E((y-X\hat\beta)^T(y-X\hat\beta)) = E(y^T(I-H)(I-H)y)=...=\sigma^2(n-p)\)
- with \(E(yy^T)=Var(y)+E(y)E(y)^T\)

Bayesian regression. Name properties. (2 + 1)

beta is unknown
\(\beta, \sigma^2\sim f(\beta, \sigma^2|y)\) = posterior with flat prior
- with mu = Xbeta
- \((y-X\beta)=(y-X\hat\beta)-(X\beta-X\hat\beta)\)
- \((y-X\beta)^T(y-X\beta)=...=\hat\sigma^2(n-p)+(\beta-\hat\beta)^TX^TX(\beta-\hat\beta)\)
- \(f(\beta, \sigma^2|y)\propto~...\times exp(-1/(2\sigma^2)(\beta-\hat\beta)^TX^TX(\beta-\hat\beta))\)
  - posterior again normal
>> \(\beta|\sigma^2, y\sim N(\hat\beta, (X^TX)^{-1}\sigma^2)\)
- ML: beta_hat | beta, sigma^2 ~ N(beta, (X^TX)^-1 sigma^2)
- >> same results, different reasoning

What is the core idea of GLM (name cases, h(n) and interpretation) (3).

Explain transformation of response. (1)

Name requirements. (2)

response can be
- binary (logistic regression): h(n) = exp(n)/(1+exp(n)), or PHI(n)
  - effects: odds/log-odds
- count (poisson regression): h(n) = exp(n)
  - multiplicative effects
- categorical (cumulative regression)
transformation:
- E(y|X) = h(n) with n = linear-predictor = beta_0 + Xbeta_x
  - n = h^-1(E(y|X)) = g(mu)
required
- y|X ~ exp-family
- link-function: E(y|X) = h(n)
  - natural/canonical link: theta = n
  - remember: \(\partial\kappa(\theta)/\partial(\theta) = E(t(y)) = \mu\)
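
A short sketch of the response functions and their interpretation, writing the linear predictor as eta (n on the cards). The coefficients are hypothetical.

```python
import math

def logistic_response(eta):
    """h(eta) = exp(eta) / (1 + exp(eta)) maps the linear predictor to a probability."""
    return math.exp(eta) / (1 + math.exp(eta))

def logit_link(mu):
    """g(mu) = h^-1(mu) = log(mu / (1 - mu)), the log-odds."""
    return math.log(mu / (1 - mu))

def odds(mu):
    """Odds mu / (1 - mu) of a probability mu."""
    return mu / (1 - mu)

beta_0, beta_x = -1.0, 0.7  # hypothetical coefficients
x = 2.0
p_x = logistic_response(beta_0 + beta_x * x)
p_x_plus_1 = logistic_response(beta_0 + beta_x * (x + 1))
odds_ratio = odds(p_x_plus_1) / odds(p_x)  # logistic: odds change by exp(beta_x)
# Poisson regression, h(eta) = exp(eta): the mean changes by the same factor.
count_ratio = math.exp(beta_0 + beta_x * (x + 1)) / math.exp(beta_0 + beta_x * x)
```

In both models a one-unit increase in x acts multiplicatively with factor exp(beta_x), on the odds in the logistic case and on the expected count in the Poisson case.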

How does prediction work in regression? (2)

y_hat = Xbeta_hat + epsilon
E(y_hat) = Xbeta_hat

How is the variance of prediction in regression defined? What does this imply?

\(Var(\hat y) = Var(X\hat\beta+\epsilon)=\sigma^2(X(X^TX)^{-1}X^T+I)\)
>> high variance in regions of low mass of data
- as (X^TX)^-1 -> 0 for n -> inf; -> inf for n -> 0

When to use weighted regression? (2)

1: 2 + 4

2: 4

modeling variance heterogeneity (Var(y_i) = sigma^2(x_i))
- sigma^2(x_i) = a_i * sigma^2
- >> y ~ N(Xbeta, sigma^2 W^-1) with W = diag(1/a_i), i = 1...n
  - Var(y) = sigma^2 W^-1
  - in loglikelihood: ...(y-Xbeta)^TW(y-Xbeta)
  - in beta_hat = (X^TWX)^-1 X^TWy
  - Var(beta_hat) = sigma^2(X^TWX)^-1, rest cancels out as Var(y) = sigma^2 W^-1
survey weighting
- biased data in surveys >> over- or underrepresentation
- >> introduce weights for samples (W)
- beta_hat see above
- Var(beta_hat) = ... nothing cancels out as Var(y) = sigma^2 (W not included)

Briefly describe quantile-regression (3)

estimation of quantile(s), not expectation
modeling of variance heterogeneity implicit (analysing slopes of quantiles)
squared error replaced by check-function (>> linear programming as not solvable analytically)
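
A sketch of the check function, assuming the standard form rho_tau(r) = r * (tau - 1{r < 0}) and an intercept-only model, where minimising the summed check loss returns an empirical quantile. The brute-force search over the data points stands in for the linear program and is only suitable for illustration.

```python
def check_loss(residual, tau):
    """Check function rho_tau(r) = r * (tau - 1{r < 0})."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))

def empirical_quantile_by_check_loss(y, tau):
    """Return the data point q that minimises sum_i rho_tau(y_i - q)."""
    return min(y, key=lambda q: sum(check_loss(yi - q, tau) for yi in y))

y = [2.0, 7.0, 1.0, 9.0, 4.0, 3.0, 8.0, 5.0, 6.0]  # hypothetical sample
median = empirical_quantile_by_check_loss(y, tau=0.5)
upper = empirical_quantile_by_check_loss(y, tau=0.9)
```

With tau = 0.5 the loss is proportional to the absolute error and yields the median; other values of tau weight positive and negative residuals asymmetrically.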
