SRaI
@LMU
Set of flashcards Details
Flashcards | 125 |
---|---|
Language | English |
Category | Computer Science |
Level | University |
Created / Updated | 04.10.2019 / 11.10.2019 |
Weblink | https://card2brain.ch/box/20191004_srai |
What is the core idea of a goodness-of-fit test? (1)
Name two examples. (1:3, 1)
- idea: test whole distribution model, not only parameter
- H0: Y ~ f(y; theta)
- Chi-squared goodness of fit
- contingency-table: observed (n_k) vs. expected (l_k) # of elements in cells
- expected by assumed distribution
- n_k far from l_k speaks against H0
- X^2 = ..., "H1" <> if X^2 >= Chi-squared_K-1-p,1-a
- Kolmogorov-Smirnov test
- compares empirical and hypothetical distribution
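A minimal sketch of the chi-squared goodness-of-fit test, assuming the elided statistic is the usual Pearson form X^2 = sum_k (n_k - l_k)^2 / l_k. The die-roll counts are hypothetical, and the critical value is taken from a chi-squared table rather than computed.

```python
# Observed counts n_k for K = 6 cells (hypothetical die-roll data).
observed = [18, 22, 16, 25, 19, 20]
n = sum(observed)

# Expected counts l_k under H0: fair die, probability 1/6 per cell.
expected = [n / 6] * 6

# Pearson statistic X^2 = sum_k (n_k - l_k)^2 / l_k (assumed form).
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# No parameter is estimated (p = 0), so df = K - 1 - p = 5.
# 0.95-quantile of the chi-squared distribution with 5 df (table value).
critical = 11.070

reject_h0 = x2 >= critical
print(f"X^2 = {x2:.3f}, reject H0: {reject_h0}")
```

Here the observed counts stay close to the expected ones, so H0 is not rejected.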
What is the definition of the p-value? (2) Name different values. (4)
Relate p-value and sample size.
- probability under H0 to observe data (y_bar) that contradict H0 at least as much as the observed data (y_bar_obs)
- P(y_bar >= y_bar_obs | H0)
- evidence of contradiction of H0
- values
- p <= 0.1: weak evidence against H0
- p <= 0.05 increased
- p <= 0.01 strong
- >> "H1" <> p <= alpha
- if H0 does not hold exactly: as sample size increases, the uncertainty decreases, hence the p-value decreases
- in big data everything becomes significant
relate CI and testing. (1)
- "H1" <> theta not in CI
- P(theta in CI) >= 1-alpha
What is the main problem of multiple testing?
What is FWER?
What is the Bonferroni adjustment? What is the problem here?
- running m independent tests at level alpha: P(reject at least one H0_j | all H0_j true) = 1-(1-alpha)^m, which lies between alpha and m*alpha (inflated type 1 error)
- probability to falsely reject at least one hypothesis tends to 1 for large m (e.g. alpha = 0.05, m = 20: 1-0.95^20 ≈ 0.64)
- Family-wise error rate = probability to reject at least one true null hypothesis (to be controlled)
- BA: alpha_adj = alpha / m
- problem: alpha_adj gets very small for large m, so H0 is rarely rejected (low power)
What is the idea of Holm's procedure? (2) Name steps. What does this imply?
- order p-values ascending
- limit FWER to alpha in each step
- if p_1 > alpha/m: accept all, stop
- else reject H0_1, go on
- if p_2 > alpha/(m-1): accept remaining, stop
- else reject H0_2, go on
- ...
- if p_m > alpha: accept H0_m, stop
- else reject H0_m
- implication: alpha_adj increases each iteration (but p-value also) >> more rejections
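The steps above can be sketched as follows. This is a minimal illustration, not a reference implementation; the function names `holm` and `bonferroni` and the p-values are hypothetical.

```python
def holm(p_values, alpha=0.05):
    """Return a list of booleans: True where H0_j is rejected."""
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda j: p_values[j])
    reject = [False] * m
    for step, j in enumerate(order):
        # Step k (0-based) compares the (k+1)-th smallest p-value
        # with alpha / (m - k).
        if p_values[j] > alpha / (m - step):
            break  # accept this and all remaining hypotheses
        reject[j] = True
    return reject


def bonferroni(p_values, alpha=0.05):
    """Reject H0_j if p_j <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


p = [0.015, 0.040, 0.020, 0.005]  # hypothetical p-values
print(holm(p))
print(bonferroni(p))
```

With these values Holm rejects all four hypotheses, while Bonferroni rejects only the one with the smallest p-value.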
What is the general idea of bayes reasoning? Define the posterior, characterise the denominator. (2)
Relate posterior to likelihood.
- idea
- theta is a random variable with prior probability \(f_\theta(\vartheta)\)
- express uncertainty about theta as probability (posterior)
- \(f_\theta (\vartheta|y) = \frac{f(y;~\vartheta)f_\theta(\vartheta)}{\int{f(y;~\vartheta)f_\theta(\vartheta)d\vartheta}}\)
- normalisation constant is hard to derive analytically
- normalisation constant = marginal density, uncertainty about theta integrated out
- posterior proportional to likelihood * prior
What is a conjugate prior? (3)
What is a problem with it?
Give an example. (3) Why is this outcome interesting? (2)
- data from a family of distribution: F_y = { f(y; theta), theta from its space }
- prior from a family of distribution F_theta = { f(theta; gamma), gamma from its space }
- F_theta is conjugate to F_y (likelihood) if posterior from family of prior.
- problem: hard to find besides classical cases
- example
- mu ~ N(gamma, tau^2)
- mu|y ~ N( , ) >> posterior again normal
- mu | y --a--> N(y_bar (= mu_hat), sigma^2/n)
- Bayes: asymptotic normality of theta around theta_hat
- ML: asymptotic normality of theta_hat around theta
Selecting the prior: explain the flat (2) and Jeffreys prior. (5)
- flat prior
- prior = constant
- problem: a flat prior is no longer constant after a transformation (a flat prior implicitly carries knowledge)
- f_gamma(gamma) is not flat for gamma = g(theta) (transformation rule for densities)
- e.g. f_pi(pi) = 1 and gamma = log(pi/(1-pi)) give f_gamma(gamma) = exp(gamma)/(1+exp(gamma))^2
- Jeffreys prior
- prior proportional to sqrt(I_theta(theta))
- transformation invariant: f_gamma(gamma) proportional to sqrt(I_gamma(gamma))
- not flat
- distance between prior and posterior is maximized (maximal information from the data)
- requires Fisher regularity
What are general problems with priors? (2 + 1)
- no prior expressing ignorance exists
- prior influences results of analysis
- >> how to express no prior knowledge?
What is the core idea of empirical bayes? (2 + 1)
- prior depends on hyperparameter gamma \(f_\theta(\vartheta; \gamma)\)
- estimate gamma by maximizing the marginal likelihood L(y; gamma) -> \(f_\theta(\vartheta; \gamma_{ML})\)
- >> prior depends on gamma_hat, hence on data
- contradicts Bayesian idea, but useful in practice
What is the core idea of hierarchical bayes? (1) Discuss it (1 + 1)
- shift uncertainty of prior to higher level
- y|theta ~ f_y(y; theta) >> theta | gamma ~ f_theta(theta; gamma) >> gamma |. ~ f_gamma(gamma; . )
- + flexible distribution of parameter of interest
- - costly computation of prior
Why are numerical methods required for the posterior?
What are solutions? (4) Briefly discuss them.
- numerical integration of normalisation constant (assuming prior is given)
- solutions
- approximation of f(y) (2)
- numerical approximation (low-dimensional only)
- Laplace approximation (works also high-dimensional, easy to estimate)
- sampling from the denominator (Monte Carlo approx.) (3 + 1)
- sampling from prior (problem: F_theta(theta) not given, integration again)
- rejection sampling (use f*_theta(.) with given F*_theta(.))
- importance sampling
- >> problematic for high-dimensional theta
- sampling from the posterior (Monte Carlo approx.)
- = Markov Chain Monte Carlo
- sample from posterior to estimate expectation, variance etc.
- Metropolis-Hastings (low acceptance for high-dimensional theta)
- Gibbs sampling (one component of theta sampled in each step, works high-dimensional)
- approximation of f(y) or posterior
- variational Bayes (high-dimensional, component-wise)
What is the core idea of monte carlo approximation?
- numerical problem solving based on simulation and probability theory
Briefly describe Laplace approximation. (4)
Discuss (2)
- assumes n iid samples
- CLT for bayes
- f(y) = integral(exp(log-likelihood + log(prior))) dtheta
- log-likelihood grows with n
- log-likelihood + log(prior) = l_p,n(theta)
- >> s_p,n(theta), I_p,n(theta)
- Taylor series expansion of l_p,n(theta) around the posterior mode theta_hat_p
- \(f(y) \approx f(y; \hat\theta_p) f_\theta(\hat\theta_p)\sqrt{2\pi}J_{p,n}(\hat\theta_p)^{-1/2}\)
- + works also high-dimensional (small adaptation of the formula)
- + easy calculation of theta_hat_p (analogous to ML)
- + good approximation for n -> inf
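A minimal sketch of the one-dimensional Laplace approximation of f(y). The Beta-Binomial model, the data and all variable names are assumptions chosen for illustration, because the exact marginal density is known for this model and can serve as a check.

```python
import math

# Hypothetical data: y successes in n Bernoulli trials, Beta(a, b) prior.
n, y = 200, 60
a, b = 2.0, 2.0


def log_beta(u, v):
    """Logarithm of the Beta function B(u, v)."""
    return math.lgamma(u) + math.lgamma(v) - math.lgamma(u + v)


def log_binom(n_trials, k):
    """Logarithm of the binomial coefficient."""
    return (math.lgamma(n_trials + 1) - math.lgamma(k + 1)
            - math.lgamma(n_trials - k + 1))


def l_p(theta):
    """Log-likelihood plus log-prior, l_{p,n}(theta)."""
    log_lik = (log_binom(n, y) + y * math.log(theta)
               + (n - y) * math.log(1 - theta))
    log_prior = ((a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)
                 - log_beta(a, b))
    return log_lik + log_prior


# Posterior mode theta_hat_p (closed form for this model).
theta_p = (y + a - 1) / (n + a + b - 2)

# Observed information: negative second derivative of l_p at the mode.
J = (y + a - 1) / theta_p**2 + (n - y + b - 1) / (1 - theta_p) ** 2

# Laplace approximation: f(y) ~ exp(l_p(theta_p)) * sqrt(2*pi) * J^(-1/2).
f_y_laplace = math.exp(l_p(theta_p)) * math.sqrt(2 * math.pi) * J ** -0.5

# Exact marginal density of the Beta-Binomial model for comparison.
f_y_exact = math.exp(log_binom(n, y) + log_beta(y + a, n - y + b)
                     - log_beta(a, b))

print(f_y_laplace, f_y_exact)
```

For n = 200 the two values should agree closely, in line with the approximation improving as n grows.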
What is the core idea when sampling from the denominator of posterior?
What is required for it? What is the problem there?
- denominator = E(f(y; theta)) >> sampling from it to get estimate of E(.)
- as E_hat() converges to E(.) for n -> inf
- therefore sampling from the prior is required
- theta*_j = F^-1(u*_j) with u*_j ~ U(0, 1)
- >> for sampling from prior, F_theta(theta) (cdf) is required >> requires numerical integration again
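A minimal sketch of estimating the denominator f(y) = E(f(y; theta)) by sampling theta from the prior via the inverse cdf. It assumes a binomial likelihood and a Uniform(0, 1) prior, a case where F^-1 is trivially available; the data and names are hypothetical.

```python
import math
import random

random.seed(1)

n, y = 20, 6  # hypothetical binomial data


def likelihood(theta):
    """Binomial likelihood f(y; theta)."""
    return math.comb(n, y) * theta**y * (1 - theta) ** (n - y)


def prior_inverse_cdf(u):
    """Uniform(0, 1) prior: F(theta) = theta, hence F^-1(u) = u."""
    return u


N = 100_000
# theta*_j = F^-1(u*_j) with u*_j ~ U(0, 1)
draws = (prior_inverse_cdf(random.random()) for _ in range(N))

# Monte Carlo estimate of E(f(y; theta)) under the prior.
f_y_mc = sum(likelihood(theta) for theta in draws) / N

# For a uniform prior the integral is known to equal 1 / (n + 1).
f_y_exact = 1 / (n + 1)

print(f_y_mc, f_y_exact)
```

For a general prior the inverse cdf is not available in closed form, which is exactly the difficulty named above.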
Explain rejection sampling. Idea (1). How to produce a sample? (3)
What is special about it? What is the problem?
- idea: sample from proposal density f*_theta(.) (umbrella, see formula) with analytical given F*_theta(.)
- f(theta) <= a * f*(theta)
- produce sample from f(theta)
- 1) draw u* ~ U(0, 1)
- 2) draw theta* from f*
- 3) if u* <= f(theta*) / (a * f*(theta*)) accept theta* as sample
- else reject theta*
- proof: P(theta* <= theta | theta* accepted) = P(theta* <= theta, theta* accepted) / P(theta* accepted) = (F_theta(theta)/a) / (1/a) = F_theta(theta)
- produces a sample from f(theta) (e.g. the prior) without drawing from it directly
- problem: for large a, the acceptance probability 1/a of theta* is low >> few samples
- >> f*(theta) must be close to f(theta) (low a)
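The three steps can be sketched as follows. The Beta(2, 2) target, the uniform proposal and the function names are assumptions for illustration only.

```python
import random

random.seed(0)


def f(theta):
    """Target density: Beta(2, 2), f(theta) = 6 * theta * (1 - theta)."""
    return 6.0 * theta * (1.0 - theta)


def f_star(theta):
    """Proposal ("umbrella") density: Uniform(0, 1)."""
    return 1.0


a = 1.5  # max of f is f(0.5) = 1.5, so f(theta) <= a * f_star(theta)


def rejection_sample(size):
    """Return `size` accepted draws and the empirical acceptance rate."""
    samples, proposals = [], 0
    while len(samples) < size:
        u = random.random()        # 1) draw u* ~ U(0, 1)
        theta = random.random()    # 2) draw theta* from f*
        proposals += 1
        if u <= f(theta) / (a * f_star(theta)):  # 3) accept or reject
            samples.append(theta)
    return samples, len(samples) / proposals


samples, acceptance_rate = rejection_sample(20_000)
mean = sum(samples) / len(samples)
print(mean, acceptance_rate)
```

The acceptance rate should be close to 1/a, which illustrates why a proposal far from the target (large a) wastes most draws.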
What is the general idea of MCMC? (2)
- theta*j and theta*j+1 are correlated (>> chain, no iid)
- distribution of chain converges to stationary distribution = posterior
What is the idea of Metropolis Hastings? (2)
Name the steps. (3)
What is a general problem here?
- f(y) is unknown, but posterior is proportional to likelihood * prior
- ratio of posteriors for theta, theta~ is known as f(y) cancels out
- steps
- 1) draw theta* from proposal distribution q(.|theta*_t)
- 2) accept theta* as new sample with probability alpha
- \(\alpha(\theta^*_t|\theta^*) = \min\{1, \frac{f(\theta^*|y)}{f(\theta^*_t|y)} \frac{q(\theta^*_t|\theta^*)}{q(\theta^*|\theta^*_t)}\}\)
- alpha = 1 if posterior of proposed theta is larger than previous (for symmetric q)
- alpha < 1 if posterior of proposed theta is smaller than previous (acceptance = fraction)
- if q is symmetric: Metropolis algorithm
- 3) draw u* ~ U(0, 1), accept theta* if u* <= alpha
- else keep theta*_t as the next sample
- problem: for high-dimensional theta: alpha -> 0 (e.g. 0.8 acceptance in each dimension >> 0.8^p)
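A minimal sketch of the steps with a symmetric random-walk proposal (so this is the Metropolis special case). The model y_i ~ N(mu, 1) with a flat prior, the data, the step size and the burn-in length are assumptions for illustration.

```python
import math
import random

random.seed(42)

# Hypothetical data: y_i ~ N(mu, 1) with a flat prior on mu.
y = [1.2, 0.8, 1.9, 1.4, 0.7, 1.6, 1.1, 1.3]


def log_unnorm_posterior(mu):
    """log(likelihood * prior) up to a constant; f(y) is never needed."""
    return -0.5 * sum((yi - mu) ** 2 for yi in y)


def metropolis(n_iter, start, step):
    """Random-walk Metropolis; returns the chain and the acceptance rate."""
    chain, current, accepted = [], start, 0
    for _ in range(n_iter):
        # 1) draw a proposal from a symmetric q(. | current)
        proposal = random.gauss(current, step)
        # 2) acceptance probability; the q-ratio cancels for symmetric q
        log_ratio = (log_unnorm_posterior(proposal)
                     - log_unnorm_posterior(current))
        alpha = math.exp(min(0.0, log_ratio))
        # 3) accept if u* <= alpha, else keep the current value
        if random.random() <= alpha:
            current = proposal
            accepted += 1
        chain.append(current)
    return chain, accepted / n_iter


chain, acc_rate = metropolis(n_iter=20_000, start=0.0, step=0.8)
kept = chain[2_000:]  # drop the burn-in phase
post_mean = sum(kept) / len(kept)
y_bar = sum(y) / len(y)
print(post_mean, y_bar, acc_rate)
```

Under these assumptions the posterior is N(y_bar, 1/n), so the chain mean should settle near y_bar.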
What is the influence of q in Metropolis-Hastings?
- narrow q:
- similar theta*
- high acceptance as posteriors will be mostly the same >> E(alpha) close to 1
- >> not much movement in parameter-space
- wide q:
- different theta*
- acceptance will vary a lot (0 to 1) as posteriors will differ >> E(alpha) << 1
- >> jumping, exploring
- >> balance between acceptance and exploring
How to use MCMC in practice? (3)
- use different starting values
- define and delete burn in phase
- thinning out: keep only every k-th sample, with k chosen so that the autocorrelation is close to 0, to get approximately iid samples from posterior
Briefly explain the idea of gibbs sampling.
- sample only one single component of theta-vector in each step (avoid problem of acceptance)
- steps
- draw theta*_k from f(theta_k | y, theta_-k,t)
- set theta_k,t+1 = theta*_k (always accepted, there is no rejection step)
- use the new theta_k,t+1 when drawing the other components
- proof
- \(f_{\theta_k}(\theta_k|y, \theta_{-k}) \propto f_{\theta_k}(\theta_k, \theta_{-k}|y) = f_\theta(\theta|y)\)
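A minimal sketch of Gibbs sampling for a two-component theta. As an assumption for illustration, the target is a standard bivariate normal with correlation rho (standing in for a posterior), because its full conditionals are known normals.

```python
import math
import random

random.seed(7)

rho = 0.8  # hypothetical correlation of the bivariate normal target


def gibbs(n_iter, start=(0.0, 0.0)):
    """Sample (theta_1, theta_2) by alternating full-conditional draws."""
    theta1, theta2 = start
    cond_sd = math.sqrt(1.0 - rho**2)
    chain = []
    for _ in range(n_iter):
        # Full conditional: theta_1 | theta_2 ~ N(rho * theta_2, 1 - rho^2).
        theta1 = random.gauss(rho * theta2, cond_sd)
        # The updated theta_1 is used immediately when drawing theta_2.
        theta2 = random.gauss(rho * theta1, cond_sd)
        chain.append((theta1, theta2))
    return chain


chain = gibbs(30_000)[1_000:]  # drop burn-in
m1 = sum(t[0] for t in chain) / len(chain)
m2 = sum(t[1] for t in chain) / len(chain)
v1 = sum((t[0] - m1) ** 2 for t in chain) / len(chain)
v2 = sum((t[1] - m2) ** 2 for t in chain) / len(chain)
cov = sum((t[0] - m1) * (t[1] - m2) for t in chain) / len(chain)
corr = cov / math.sqrt(v1 * v2)
print(corr)
```

Every draw is kept, which is the point of the method: no proposal is ever rejected.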
Briefly explain variational bayes.
Name the steps. (2)
- approximate f(y) or posterior by replacing it by q(.)
- by minimizing the KL-divergence between e.g. f(theta|y) (unknown) and q()
- min KL(q(), f())
- with componentwise independence s.t. \(q_\theta(\theta) = \prod_{k = 1}^{p} q_k(\theta_k)\)
- separate upper and lower part by k-th components and rest
- ... \(= \int \log\frac{q_k(\theta_k)}{f^*_k(\theta_k|y)} q_k(\theta_k)d\theta_k = KL(q_k(.), f^*_k(.|.))\)
- is minimized if \(q_k(\theta_k) \propto f^*_k(\theta_k|y)\)
- \(q_k(\theta_k)=f^*_k(\theta_k|y) / \int f^*_k\)
- steps
- 0) initialise q_k(0)
- for each dimension
- q_k,t+1 = f*_k,t(.|.) with all other q_j,t fixed (j != k)
What is the core idea of bayes factor? (3)
Give a definition of it.
Explain values. (4)
- bayesian p-value
- compares two models based on posterior
- P(M_j|y) = f(y|M_j) * P(M_j) / f(y)
- ratio can be easily analysed (f(y) cancels out)
- \(\frac{f(y | M_1)}{f(y|M_0)}\) with \(f(y|M_j) = \int f(y|\vartheta) f_\theta(\vartheta|M_j)d\vartheta\)
- y depends on theta, theta depends on M_j
- values
- 1-3 no evidence for M1
- 3-20 evidence for M1
- 20-150 strong evidence for M1
- >150 very strong evidence for M1
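A minimal worked example of a Bayes factor. The two models are assumptions chosen so that both marginal densities are available in closed form: M0 fixes theta = 0.5, M1 puts a Uniform(0, 1) prior on theta; the binomial data are hypothetical.

```python
import math

n, y = 20, 15  # hypothetical data: 15 successes in 20 trials

# M0: theta fixed at 0.5 (point prior), so f(y | M0) is the binomial pmf.
f_y_m0 = math.comb(n, y) * 0.5**n

# M1: theta ~ Uniform(0, 1); integrating theta out gives 1 / (n + 1).
f_y_m1 = 1 / (n + 1)

# Bayes factor f(y | M1) / f(y | M0); f(y) is not needed.
bayes_factor = f_y_m1 / f_y_m0
print(round(bayes_factor, 2))
```

The result is slightly above 3, so on the scale above it sits at the lower end of the "evidence for M1" range.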
What is the aim of bootstrapping? (1)
What is the general idea?
What is a problem here?
What general assumption is done?
Name the main steps. (5)
- aim: quantify/approximate uncertainty of estimates (i.e. variance, bias, CI, tests)
- idea: sample from data instead of true distribution as f(.), F(.) are unknown
- "mimic true variation"
- problem: only \(\binom{2n-1}{n}\) distinct bootstrap samples exist >> the number of useful replications B is finite
- assumption: Y_i ~ F(.) iid
- steps
- 1) calculate t(y)
- 2) sample y*_i (i = 1...n) with replacement from y >> y* = bootstrap-sample
- 3) calculate t(y*)
- 4) repeat 2) 3) B times
- 5) estimate e.g. Var(t(y))_hat based on t(y*_b) (b = 1...B)
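The five steps can be sketched as follows. The data are simulated and the statistic t is the sample mean, chosen as an assumption because its standard error has a known formula for comparison; any other statistic could be plugged into `t`.

```python
import math
import random
import statistics

random.seed(3)

# Hypothetical sample y_1, ..., y_n.
y = [random.gauss(10.0, 2.0) for _ in range(50)]


def t(sample):
    """Statistic of interest; here the sample mean."""
    return statistics.fmean(sample)


t_obs = t(y)                                  # 1) calculate t(y)
B = 2_000
t_star = []
for _ in range(B):                            # 4) repeat 2) and 3) B times
    y_star = random.choices(y, k=len(y))      # 2) resample with replacement
    t_star.append(t(y_star))                  # 3) calculate t(y*)

se_boot = statistics.stdev(t_star)            # 5) estimate of sd(t(y))

# Textbook standard error of the mean, for comparison only.
se_formula = statistics.stdev(y) / math.sqrt(len(y))
print(se_boot, se_formula)
```

The two standard errors should be close; the bootstrap version needs no formula and works the same way for statistics without one.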
Explain the plug-in principle.
Compare real world and bootstrap world. (1)
Give examples for replacement. (3)
- F(.) replaced by F^_n(y) (theoretical distribution-function by empirical)
- draw from F^_n(y) instead of F(.) with replacement
- >> simulation instead of analytical solution
- real world: y ~ F(.) iid >> t(y); bootstrap world: y* ~ F^_n(y) iid >> t(y*)
- replacements
- mu by y_bar
- y_bar by y*_bar
- F(y) by F^_n(y*)
Give examples of how to apply bootstrapping. (2)
- bias
- \(bias(\hat\xi, \xi) = E(\hat\xi) - \xi \Rightarrow \hat{bias}(\hat\xi, \xi) = \hat{\bar\xi}^*-\hat\xi\)
- drawn from F^_n(y) and F(.)
- correction: \(\hat{\hat\xi} = \hat\xi - \hat{bias}(\hat\xi, \xi) = 2\hat\xi-\hat{\bar\xi}^*\)
- zero bias leads to increased variance...
- standard-error
- \(\xi = \mu_y - \mu_z \Rightarrow \hat{\xi} = \bar{y} - \bar{z} \Rightarrow \hat{\xi}^{*b} = \bar{y}^{*b} - \bar{z}^{*b}\)
What is the idea of parametric bootstrapping?
Discuss it (1+ 1,1)
- idea: Y_i ~ F(., theta) iid >> Y*_i ~ F(. | theta_hat) iid
- y*_i can differ from y_i
- + useful for small samples (or extreme cases)
- - parametric assumptions (+ estimation of those)
Explain how bootstrapping can be used in regression. (4) Explain general character, how it works and pros and cons.
- residual-based
- fit a model on x, y >> beta_hat, y_hat >> epsilon_hat >> epsilon* = samples from residuals
- >> y* = Xbeta_hat + epsilon*
- beta*_hat = (XTX)^-1 XTy*
- model-based
- fit a model on x, y, estimate variance of residuals (sigma_hat^2)
- y* = Xbeta_hat + epsilon* with epsilon* ~ N(0, sigma_hat^2)
- beta*_hat = (XTX)^-1 XTy*
- >> problem: induces variance homogeneity (as residuals are independent of x) >> variance unchanged, nothing gained by bootstrapping
- Var(beta*_hat) = (XTX)^-1 sigma_hat^2
- pairwise/case resampling
- resample whole rows (y_i, x_i) with replacement >> (y*, X*)
- beta*_hat = (X*T X*)^-1 X*T y*
- - contradicts regression idea: model y given X
- wild bootstrap
- fit a model on y, X >> epsilon_hat
- \(\hat\epsilon^*_i = V^*_i\hat\epsilon_i\)
- with V*_i drawn from a two-point distribution with E(V) = 0, Var(V) = 1
- y* = Xbeta_hat + epsilon*_hat
- mimics empirical estimates
- ++ all original samples considered
- >> variance heterogeneity can be modeled as epsilon*_hat_i depends on x_i
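A minimal sketch of the wild bootstrap. It assumes simulated heteroscedastic data and Rademacher weights (V = -1 or +1 with probability 1/2 each) as one possible two-point distribution with E(V) = 0 and Var(V) = 1; the helper `ols` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical heteroscedastic data: the error sd grows with x.
n = 200
x = rng.uniform(0.0, 10.0, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5 + 0.3 * x)


def ols(X, y):
    """Least-squares estimate (X^T X)^-1 X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)


beta_hat = ols(X, y)
resid = y - X @ beta_hat          # epsilon_hat

B = 1_000
beta_star = np.empty((B, 2))
for b in range(B):
    # V*_i from a Rademacher distribution: E(V) = 0, Var(V) = 1.
    v = rng.choice([-1.0, 1.0], size=n)
    # Each residual stays attached to its own x_i; only the sign changes.
    y_star = X @ beta_hat + v * resid
    beta_star[b] = ols(X, y_star)

se_wild = beta_star.std(axis=0, ddof=1)
print(beta_hat, se_wild)
```

Because residual i is never moved to another observation, the x-dependent spread of the errors carries over into every bootstrap sample.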
What is the main idea of regression? (4)
How is the solution obtained? Explain an alternative.
- relates input X with output y
- y = beta_zero + Xbeta_x + epsilon
- error epsilon can't be explained by model (epsilon ~ N(0, sigma^2))
- error independent of x >> variance homogeneity
- E(y|X) = beta_zero + Xbeta_x = y_hat
- >> modeling the mean y|X ~ N(beta_zero + Xbeta_x, sigma^2)
- estimation by:
- least squares: beta_0_hat, beta_x_hat = argmin sum((y_i - y_hat_i)^2)
- or: ML-estimation: normal-distribution with mu = (beta_zero + Xbeta_x)
Explain the matrix-notation of regression.
- design-matrix X (with first column = 1 for beta_zero), n x p+1
- epsilon-vector: n x 1
- response-vector: n x 1
- beta-vector: (beta_zero, beta_x)T: p+1 x 1
- >> y = Xbeta + epsilon
Name ML-estimates in regression (I(beta_hat), beta_hat, sigma_ML^2)
- I(beta_hat) = XTX/sigma^2, I^-1(beta_hat) = sigma^2 (XTX)^-1 = Var(beta_hat)
- beta_hat = (XTX)^-1 XTy
- sigma_ML^2 = (y-Xbeta_hat)T(y-Xbeta_hat) / n (biased)
- >> beta_hat ~ N(beta, Var(beta_hat))
- (E(beta_hat), Var(beta_hat))
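A minimal numerical sketch of these estimates on simulated data. The true coefficients, sigma and sample size are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data with two covariates.
n, sigma = 500, 1.5
x = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), x])       # first column = 1 for beta_0
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(0.0, sigma, n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y               # (X^T X)^-1 X^T y
resid = y - X @ beta_hat

sigma2_ml = resid @ resid / n              # ML estimate (biased)
# Dividing by n minus the number of columns of X removes the bias.
sigma2_unbiased = resid @ resid / (n - X.shape[1])

# Estimated Var(beta_hat) = sigma^2 (X^T X)^-1 with sigma^2 plugged in.
var_beta_hat = sigma2_unbiased * XtX_inv
print(beta_hat, sigma2_ml, sigma2_unbiased)
```

The ML variance estimate is always the smaller of the two, which is the bias mentioned above.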
How to obtain uncorrelated beta_0 and beta_x? (2)
- XTX implies correlated estimates (usually useful)
- solution:
- \(x^*_i = x_i - \bar x \Rightarrow x_i = x^*_i+\bar x\)
- \(y_i = \beta^*_0 + x^*_i\beta_x+\epsilon_i \Rightarrow \hat \beta^*=(X^{*T}X^*)^{-1}X^{*T}y\)
How to model non-linear effects in regression? (3) Why is this possible?
- quadratic effect: \(x^2\beta_{xx}\)
- binary covariate: \(x_2\beta_2\) with \(x_2 \in \{0, 1\}\)
- categorical variable: \(1_{\{edu = 2\}}\beta_3\)
- linear regression = linear in parameter theta, not in X
Name the hat-matrix, where does it come from?
How can it be used?
- \(H = X(X^TX)^{-1}X^T\)
- symmetric and idempotent, i.e. H^T = H, HH = H, same for (I-H)
- from \(y-\hat y = y-X\hat\beta = y-X(X^TX)^{-1}X^Ty = y-Hy=(I-H)y\)
- used for: \(E((y-X\hat\beta)^T(y-X\hat\beta)) = E(y^T(I-H)(I-H)y)=...=\sigma^2(n-p)\)
- with \(E(YY^T)=Var(Y)+E(Y)E(Y)^T\)
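The stated properties of H can be checked numerically. This is a minimal sketch on a random design matrix; the dimensions are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical design: intercept plus two covariates, n = 30.
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
M = np.eye(n) - H                      # residual maker I - H

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(H, H.T))                   # symmetric
print(np.allclose(H @ H, H))                 # idempotent
print(np.allclose(M @ M, M))                 # I - H is idempotent too
print(np.allclose(M @ y, y - X @ beta_hat))  # residuals = (I - H) y
print(np.trace(M))                           # n minus number of columns of X
```

The trace of I - H gives the degrees of freedom that appear as the factor of sigma^2 in the expected residual sum of squares.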
Bayesian regression. Name properties. (2 + 1)
- beta is unknown
- \(\beta, \sigma^2\sim f(\beta, \sigma^2|y)\) = posterior with flat prior
- with mu = Xbeta
- \((y-X\beta)=(y-X\hat\beta)-(X\beta-X\hat\beta)\)
- \((y-X\beta)^T(y-X\beta)=...=\hat\sigma^2(n-p)+(\beta-\hat\beta)^TX^TX(\beta-\hat\beta)\)
- \(f(\beta, \sigma^2|y)\propto~...\times \exp(-1/(2\sigma^2)(\beta-\hat\beta)^TX^TX(\beta-\hat\beta))\)
- posterior again normal
- >> \(\beta, \sigma^2|y\sim N(\hat\beta, (X^TX)^{-1}\sigma^2)\)
- ML: beta_hat, sigma^2|y ~ N(beta, (XTX)^-1 sigma^2)
- >> same results, different reasoning
What is the core idea of GLM (name cases, h(n) and interpretation) (3).
Explain transformation of response. (1)
Name requirements. (2)
- response can be
- binary (logistic regression): h(n) = exp(n)/(1+exp(n)), or PHI(n)
- effects: odds/log-odds
- count (poisson regression): h(n) = exp(n)
- multiplicative effects
- categorical (cumulative regression)
- transformation:
- E(y|X) = h(n) with n = linear-predictor = beta_0 + Xbeta_x
- n = h^-1(E(y|X)) = g(mu)
- required
- y|X ~ exp-family
- link-function: E(y|X) = h(n)
- natural/canonical link: theta = n
- remember: \(\partial\kappa(\theta)/\partial(\theta) = E(t(y)) = \mu\)
How does prediction work in regression? (2)
- y_hat = Xbeta_hat + epsilon
- E(y_hat) = Xbeta_hat
How is the variance of prediction in regression defined? What does this imply?
- \(Var(\hat y) = Var(X\hat\beta+\epsilon)=\sigma^2(X(X^TX)^{-1}X^T+1)\)
- >> high variance in regions of low mass of data
- as (XTX)^-1 -> 0 for n -> inf; -> inf for n -> 0
When to use weighted regression? (2)
1: 2 + 4
2: 4
- modeling variance heterogeneity (Var(y_i) = sigma^2(x_i))
- sigma^2(x_i) = a_i * sigma^2
- >> y ~ N(Xbeta, sigma^2 W^-1) with W = diag(1/a_i), i = 1...n
- Var(y) = sigma^2 W^-1
- in loglikelihood: ...(y-Xbeta)TW(y-Xbeta)
- in beta_hat = (XTWX)^-1 XTWy
- Var(beta_hat) = sigma^2(XTWX)^-1, rest cancels out as Var(y) = sigma^2 W^-1
- survey weighting
- biased data in surveys >> over- or underrepresentation
- >> introduce weights for samples (W)
- beta_hat see above
- Var(beta_hat) = ... nothing cancels out as Var(y) = sigma^2 (W not included)
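A minimal sketch of weighted least squares for the variance-heterogeneity case. It assumes simulated data with known factors a_i = x_i^2 and a known sigma, both chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data with Var(y_i) = a_i * sigma^2.
n, sigma = 400, 0.5
x = rng.uniform(1.0, 5.0, n)
X = np.column_stack([np.ones(n), x])
a = x**2
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, sigma * np.sqrt(a))

w = 1.0 / a                                    # diagonal of W
XtWX = X.T @ (w[:, None] * X)                  # X^T W X
beta_wls = np.linalg.solve(XtWX, X.T @ (w * y))  # (X^T W X)^-1 X^T W y
var_beta_wls = sigma**2 * np.linalg.inv(XtWX)  # sigma^2 (X^T W X)^-1

# Equivalent view: ordinary least squares after scaling rows by sqrt(w_i).
sw = np.sqrt(w)
beta_check = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
print(beta_wls, beta_check)
```

The row-scaling view shows why the variance formula simplifies here: after scaling, the errors have constant variance sigma^2.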
Briefly describe quantile-regression (3)
- estimation of quantile(s), not expectation
- modeling of variance heterogeneity implicit (analysing slopes of quantiles)
- squared error replaced by check-function (>> linear programming as not solvable analytically)
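A minimal sketch of the check function in the simplest case, a model with an intercept only, where minimising the summed check loss recovers the empirical tau-quantile. The simulated data and the brute-force search over data points are assumptions for illustration; real quantile regression would use linear programming as noted above.

```python
import random

random.seed(5)


def check_loss(u, tau):
    """Check function rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))


# Hypothetical sample, sorted so order statistics are easy to inspect.
y = sorted(random.gauss(0.0, 1.0) for _ in range(501))
tau = 0.9


def total_loss(q):
    """Summed check loss of the residuals y_i - q."""
    return sum(check_loss(yi - q, tau) for yi in y)


# The total loss is piecewise linear in q, so a minimiser lies at a data point.
q_hat = min(y, key=total_loss)
print(q_hat)
```

Residuals above q are weighted by tau and residuals below by 1 - tau, which pushes the minimiser towards the upper tail for tau = 0.9.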