SRaI
@LMU
File details
| Summary | This flashcard set covers advanced statistical methods at the university level, focusing on topics like ANOVA, missing data analysis, and copula models. It delves into key concepts such as variance, data imputation, and the trade-offs between sample size and data quality. Researchers and students in statistics or data science will find this set particularly useful for understanding complex analytical techniques and their practical applications. |
|---|---|
| Flashcards | 125 |
| Users | 1 |
| Language | English |
| Category | Computer science |
| Level | University |
| Created / Updated | 04.10.2019 / 11.10.2019 |
| Web link | https://card2brain.ch/cards/20191004_srai?max=40&offset=40 |
What is the core idea of a goodness-of-fit test? (1)
Name two examples. (1:3, 1)
- idea: test whole distribution model, not only parameter
- H0: Y ~ f(y; theta)
- Chi-squared goodness of fit
- contingency-table: observed (n_k) vs. expected (l_k) # of elements in cells
- expected by assumed distribution
- n_k far from l_k speaks against H0
- X^2 = ..., "H1" <> if X^2 >= Chi-squared_K-1-p,1-alpha
- Kolmogorov-Smirnov test
- compares empirical and hypothetical distribution
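The chi-squared comparison of observed and expected cell counts can be sketched as follows. This is a minimal illustration, assuming the standard Pearson form of X^2 (the card elides the formula), made-up die-roll counts, and no estimated parameters (p = 0); the critical value 11.07 is the 0.95-quantile of the chi-squared distribution with 5 degrees of freedom.

```python
# Hypothetical example: 120 rolls of a die, H0: all six faces equally likely.
observed = [18, 22, 16, 25, 19, 20]      # n_k, made-up counts
n = sum(observed)
expected = [n / 6] * 6                   # l_k under H0 (no estimated parameters, p = 0)

# Pearson statistic (assumed form): sum over cells of (n_k - l_k)^2 / l_k
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# 0.95-quantile of chi-squared with K - 1 - p = 5 degrees of freedom
CRITICAL_5DF_95 = 11.07
reject_h0 = x2 >= CRITICAL_5DF_95
print(f"X^2 = {x2:.2f}, reject H0: {reject_h0}")
```

Here the counts stay close to the expected 20 per cell, so the statistic is far below the critical value and H0 is kept.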
What is the definition of the p-value? (2) Name different values. (4)
Relate p-value and sample size.
- probability, under H0, of observing data (y_bar) that contradict H0 at least as strongly as the observed data (y_bar_obs)
- P(y_bar >= y_bar_obs | H0)
- evidence of contradiction of H0
- values
- p <= 0.1: weak evidence against H0
- p <= 0.05: increased evidence against H0
- p <= 0.01: strong evidence against H0
- >> "H1" <> p <= alpha
- as sample size increases, the uncertainty decreases, hence (for a fixed true deviation from H0) the p-value decreases
- in big data even tiny deviations from H0 become significant
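The link between p-value and sample size can be illustrated with a one-sided test for a normal mean. This is a sketch under assumptions not in the cards: known sigma = 1, H0: mu = 0, and the same small observed deviation y_bar_obs = 0.05 at every sample size; the function name is hypothetical.

```python
import math

def one_sided_p_value(y_bar_obs, mu0, sigma, n):
    """P(Y_bar >= y_bar_obs | H0: mu = mu0) for a normal mean with known sigma."""
    z = (y_bar_obs - mu0) / (sigma / math.sqrt(n))
    return 0.5 * math.erfc(z / math.sqrt(2))   # upper tail of N(0, 1)

# Same tiny observed deviation (0.05), growing sample size
p_values = {n: one_sided_p_value(0.05, 0.0, 1.0, n) for n in (10, 100, 1000, 10000)}
for n, p in p_values.items():
    print(n, round(p, 4))
```

The deviation is not significant for small n but becomes highly significant at n = 10000, which is the "everything becomes significant" effect.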
Relate CI and testing. (1)
- "H1" <> theta_0 (value of theta under H0) not in CI
- P(theta in CI) >= 1-alpha
What is the main problem of multiple testing?
What is FWER?
What is Bonferroni adjustment? What is the problem here?
- running m tests at level alpha: P(reject at least one H0_j | all H0_j true) lies between alpha and m*alpha = P(at least one type 1 error)
- for independent tests this probability is 1 - (1 - alpha)^m, which tends to 1 for large m (e.g. alpha = 0.05, m = 20 gives about 0.64)
- Family-wise error rate (FWER) = probability to reject at least one true null hypothesis (to be controlled)
- BA: alpha_adj = alpha / m
- problem: alpha_adj gets really small for large m, so H0 is rarely rejected (low power)
What is the idea of Holm's procedure? (2) Name steps. What does this imply?
- order p-values ascending
- limit FWER to alpha in each step
- if p_1 > alpha/m: accept all, stop
- else reject H0_1, go on
- if p_2 > alpha/(m-1): accept remaining, stop
- else reject H0_2, go on
- ...
- if p_m > alpha: accept H0_m, stop
- else reject H0_m
- implication: alpha_adj increases each iteration (but p-value also) >> more rejections
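The step-down rule can be written in a few lines. The sketch below compares it with the Bonferroni adjustment on made-up p-values; the function names are hypothetical.

```python
def holm(p_values, alpha=0.05):
    """True where H0_j is rejected by Holm's step-down procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda j: p_values[j])   # indices by ascending p-value
    reject = [False] * m
    for step, j in enumerate(order):
        if p_values[j] > alpha / (m - step):   # thresholds alpha/m, alpha/(m-1), ..., alpha
            break                              # keep this and all remaining H0
        reject[j] = True
    return reject

def bonferroni(p_values, alpha=0.05):
    """True where H0_j is rejected at the adjusted level alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

p = [0.010, 0.040, 0.020, 0.005]   # made-up p-values
print(bonferroni(p))
print(holm(p))
```

With these values Bonferroni rejects only the two smallest p-values, while Holm rejects all four because the threshold grows at each step.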
What is the general idea of Bayesian reasoning? Define the posterior, characterise the denominator. (2)
Relate posterior to likelihood.
- idea
- theta is a random variable with prior probability \(f_\theta(\vartheta)\)
- express uncertainty about theta as probability (posterior)
- \(f_\theta (\vartheta|y) = \frac{f(y;~\vartheta)f_\theta(\vartheta)}{\int{f(y;~\vartheta)f_\theta(\vartheta)d\vartheta}}\)
- normalisation constant is hard to derive analytically
- normalisation constant = marginal density, uncertainty about theta integrated out
- posterior proportional to likelihood * prior
What is a conjugate prior? (3)
What is a problem with it?
Give an example. (3) Why is this outcome interesting? (2)
- data from a family of distribution: F_y = { f(y; theta), theta from its space }
- prior from a family of distribution F_theta = { f(theta; gamma), gamma from its space }
- F_theta is conjugate to F_y (likelihood) if posterior from family of prior.
- problem: hard to find besides classical cases
- example
- mu ~ N(gamma, tau^2)
- mu|y ~ N( , ) >> posterior again normal
- mu | y --a--> N(y_bar (= mu_hat), sigma^2/n)
- Bayes: asymptotic normality of theta around theta_hat
- ML: asymptotic normality of theta_hat around theta
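The normal example can be made concrete with the textbook result for a normal mean with known variance. The cards leave the posterior parameters blank, so the formulas below are an assumption (the standard precision-weighted form), and the data and function name are made up.

```python
def normal_posterior(y, sigma2, gamma, tau2):
    """Posterior mean and variance of mu for y_i ~ N(mu, sigma2) iid, mu ~ N(gamma, tau2)."""
    n = len(y)
    y_bar = sum(y) / n
    precision = 1.0 / tau2 + n / sigma2                  # prior precision + data precision
    mean = (gamma / tau2 + n * y_bar / sigma2) / precision
    return mean, 1.0 / precision

y = [1.0, 2.0, 3.0, 4.0]   # made-up data, y_bar = 2.5

# Informative prior: posterior mean is pulled from y_bar towards gamma = 0
post_mean, post_var = normal_posterior(y, sigma2=1.0, gamma=0.0, tau2=1.0)

# Very vague prior: posterior approaches N(y_bar, sigma2 / n)
vague_mean, vague_var = normal_posterior(y, sigma2=1.0, gamma=0.0, tau2=1e6)
print(post_mean, post_var, vague_mean, vague_var)
```

The vague-prior case mirrors the asymptotic statement: the data dominate and the posterior is centred at y_bar with variance sigma^2/n.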
Selecting the prior: Explain flat (2) and Jeffreys prior. (5)
- flat prior
- prior = constant
- problem: transformation of flat prior results in not constant prior (knowledge implicit)
- f_gamma(gamma) not flat with gamma = g(theta) (transformation rule)
- e.g. f_pi(pi) = 1 and gamma = log(pi/(1-pi)) give f_gamma(gamma) = exp(gamma)/(1+exp(gamma))^2
- Jeffreys prior
- prior proportional to sqrt(I_theta(theta))
- transformation invariant: f_gamma(gamma) proportional to sqrt(I_gamma(gamma))
- not flat
- distance between prior and posterior is maximized (maximal information from the data)
- requires Fisher regularity
What are general problems with priors? (2 + 1)
- no prior expressing ignorance exists
- prior influences results of analysis
- >> how to express no prior knowledge?
What is the core idea of empirical bayes? (2 + 1)
- prior depends on hyperparameter gamma \(f_\theta(\vartheta; \gamma)\)
- estimate gamma by maximizing the likelihood L(y; gamma) -> \(f_\theta(\vartheta; \gamma_{ML})\)
- >> prior depends on gamma_hat, hence on data
- contradicts Bayesian idea, but useful in practice
What is the core idea of hierarchical bayes? (1) Discuss it (1 + 1)
- shift uncertainty of prior to higher level
- y|theta ~ f_y(y; theta) >> theta | gamma ~ f_theta(theta; gamma) >> gamma |. ~ f_gamma(gamma; . )
- + flexible distribution of parameter of interest
- - costly computation of prior
Why are numerical methods required for the posterior?
What are solutions? (4) Briefly discuss them.
- numerical integration of normalisation constant (assuming prior is given)
- solutions
- approximation of f(y) (2)
- numerical approximation (low-dimensional only)
- Laplace approximation (works also high-dimensional, easy to estimate)
- sampling from the denominator (Monte Carlo approx.) (3 + 1)
- sampling from prior (problem: F_theta(theta) not given, integration again)
- rejection sampling (use f*_theta(.) with given F*_theta(.))
- importance sampling
- >> problematic for high-dimensional theta
- sampling from the posterior (Monte Carlo approx.)
- = Markov Chain Monte Carlo
- sample from posterior to estimate expectation, variance etc.
- Metropolis-Hastings (low acceptance for high-dimensional theta)
- Gibbs sampling (one component of theta sampled in each step, works high-dimensional)
- approximation of f(y) or posterior
- variational bayes (high-dimensional, component-wise)
What is the core idea of monte carlo approximation?
- numerical problem solving based on simulation and probability theory
Briefly describe Laplace approximation. (4)
Discuss (2)
- assumes n iid samples
- CLT for bayes
- f(y) = integral exp(log-likelihood + log(prior)) d theta
- log-likelihood grows with n
- log-likelihood + log(prior) = l_p,n(theta)
- >> s_p,n(theta), I_p,n(theta)
- TSE (Taylor series expansion) of l_p,n(theta) around the posterior mode theta_hat_p
- \(f(y) \approx f(y; \hat\theta_p) f_\theta(\hat\theta_p)\sqrt{2\pi}J_{p,n}(\hat\theta_p)^{-1/2}\)
- + works also high-dimensional (small adaptation of formula)
- + easy calculation of theta_hat_p (as in ML)
- + good approximation for n -> inf
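A one-dimensional check of the approximation, as a sketch: binomial data with a flat prior on pi are assumed here because the exact marginal density is then known in closed form, f(y) = 1/(n+1). The data and helper names are made up, and J is taken as the negative second derivative of l_p at the mode.

```python
import math

n, y = 100, 30     # made-up data: y successes in n Bernoulli trials, flat prior on pi

def log_post_unnorm(pi):
    """l_p(pi) = log-likelihood + log(prior); the flat prior contributes log(1) = 0."""
    return math.log(math.comb(n, y)) + y * math.log(pi) + (n - y) * math.log(1 - pi)

pi_hat = y / n                                        # posterior mode (equals the ML estimate here)
J = y / pi_hat**2 + (n - y) / (1 - pi_hat)**2         # negative second derivative of l_p at the mode

# Laplace approximation: f(y; pi_hat) * prior(pi_hat) * sqrt(2 pi) * J^(-1/2)
laplace = math.exp(log_post_unnorm(pi_hat)) * math.sqrt(2 * math.pi) * J ** -0.5
exact = 1 / (n + 1)                                   # closed form of the integral for a flat prior
print(laplace, exact)
```

For n = 100 the two values agree to within about one percent, in line with the approximation improving as n grows.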
What is the core idea when sampling from the denominator of posterior?
What is required for it? What is the problem there?
- denominator = E(f(y; theta)), expectation taken over the prior >> sampling from it to get estimate of E(.)
- as E_hat() converges to E(.) as the number of draws -> inf
- therefore sampling from prior is required
- theta*_j = F^-1(u*_j)
- >> for sampling from prior, F_theta(theta) (cdf) is required >> requires numerical integration again
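The idea can be sketched for a case where the inverse cdf of the prior is trivial. Assumptions not in the cards: binomial data, a U(0, 1) prior (so F^-1(u) = u), and hypothetical helper names; the exact value 1/(n+1) is available for comparison only because of this convenient choice.

```python
import math
import random

n, y = 10, 3                       # made-up binomial data
rng = random.Random(0)

def likelihood(theta):
    """f(y; theta) for y successes in n trials."""
    return math.comb(n, y) * theta**y * (1 - theta)**(n - y)

def prior_inverse_cdf(u):
    """F^-1 for a U(0, 1) prior is the identity; other priors need their own inverse cdf."""
    return u

draws = [prior_inverse_cdf(rng.random()) for _ in range(20000)]   # theta*_j = F^-1(u*_j)
f_y_hat = sum(likelihood(t) for t in draws) / len(draws)          # Monte Carlo mean of f(y; theta*_j)
exact = 1 / (n + 1)
print(f_y_hat, exact)
```

For priors without an analytical cdf, `prior_inverse_cdf` is exactly the piece that is missing, which is the problem stated above.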
Explain rejection sampling. Idea (1). How to produce a sample? (3)
What is special about it? What is the problem?
- idea: sample from proposal density f*_theta(.) (umbrella, see formula) with analytical given F*_theta(.)
- f(theta) <= a * f*(theta)
- produce sample from f(theta)
- 1) draw u* ~ U(0, 1)
- 2) draw theta* from f*
- 3) if u* <= f(theta*) / (a * f*(theta*)) accept theta* as sample
- else reject theta*
- proof: P(theta* <= theta | theta* accept) = P(theta* <= theta, theta* accept) / P(theta* accept) = (F_theta(theta)/a) / (1/a) = F_theta(theta)
- produces sample from f(theta) without drawing from it directly
- problem: for large a, acceptance of theta* is low (P(accept) = 1/a) >> few samples
- >> f*(theta) must be close to f(theta) (low a)
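The three steps translate directly into code. In this sketch the target is assumed to be a Beta(2, 2) density and the proposal f* is U(0, 1), so the umbrella constant is a = 1.5 (the maximum of the target); these choices and the function names are illustrative only.

```python
import random

rng = random.Random(1)

def target(theta):
    """Density to sample from: Beta(2, 2), f(theta) = 6 * theta * (1 - theta) on [0, 1]."""
    return 6 * theta * (1 - theta)

A = 1.5   # umbrella constant: f(theta) <= A * f*(theta) with f* = 1 on [0, 1]

def rejection_sample(size):
    accepted, proposed = [], 0
    while len(accepted) < size:
        u = rng.random()                     # 1) draw u* from U(0, 1)
        theta = rng.random()                 # 2) draw theta* from the proposal f*
        proposed += 1
        if u <= target(theta) / (A * 1.0):   # 3) accept with probability f / (a * f*)
            accepted.append(theta)
    return accepted, size / proposed

samples, acceptance_rate = rejection_sample(5000)
print(sum(samples) / len(samples), acceptance_rate)
```

The observed acceptance rate is close to 1/a, which shows why a loose umbrella (large a) wastes most proposals.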
What is the general idea of MCMC? (2)
- theta*j and theta*j+1 are correlated (>> chain, no iid)
- distribution of chain converges to stationary distribution = posterior
What is the idea of Metropolis Hastings? (2)
Name the steps. (3)
What is a general problem here?
- f(y) is unknown, but posterior is proportional to likelihood * prior
- ratio of posteriors for theta, theta~ is known as f(y) cancels out
- steps
- 1) draw theta* from proposal distribution q(.|theta*_t)
- 2) accept theta* as new sample with probability alpha
- \(\alpha(\theta^*_t|\theta^*) = \min\{1, \frac{f(\theta^*|y)}{f(\theta^*_t|y)} \frac{q(\theta^*_t|\theta^*)}{q(\theta^*|\theta^*_t)}\}\)
- alpha = 1 if posterior of proposed theta is larger than previous
- alpha < 1 if posterior of proposed theta is smaller than previous (acceptance = fraction)
- if q is symmetric: Metropolis algorithm
- 3) draw u*, accept theta* if alpha() >= u*
- else don't accept it
- problem: for high-dimensional theta: alpha -> 0 (e.g. 0.8 acceptance each dimension >> 0.8^p)
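A minimal random-walk Metropolis sketch (symmetric q) for a one-dimensional posterior. Assumptions not in the cards: made-up data y_i ~ N(mu, 1) with a flat prior on mu, so the exact posterior N(y_bar, 1/n) is available as a check; step size, chain length and burn-in are arbitrary choices.

```python
import math
import random

rng = random.Random(2)
y = [1.2, 0.8, 1.9, 1.4, 0.6, 1.1, 1.7, 0.9, 1.3, 1.5]   # made-up data, y_i ~ N(mu, 1)

def log_post_unnorm(mu):
    """log(likelihood * flat prior) up to a constant; f(y) is never needed."""
    return -0.5 * sum((yi - mu) ** 2 for yi in y)

def metropolis(n_iter, start, step=0.5):
    chain, current = [], start
    for _ in range(n_iter):
        proposal = current + rng.gauss(0.0, step)        # 1) symmetric q, so q cancels in alpha
        log_alpha = min(0.0, log_post_unnorm(proposal) - log_post_unnorm(current))  # 2)
        if math.log(1.0 - rng.random()) <= log_alpha:    # 3) accept if u* <= alpha
            current = proposal
        chain.append(current)                            # on rejection the old value is repeated
    return chain

chain = metropolis(20000, start=5.0)
kept = chain[2000:]                                      # drop an (arbitrary) burn-in phase
post_mean = sum(kept) / len(kept)
post_var = sum((v - post_mean) ** 2 for v in kept) / len(kept)
print(post_mean, post_var)
```

Working on the log scale avoids numerical underflow of the posterior ratio; the chain mean and variance should land near y_bar = 1.24 and 1/n = 0.1.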
What is the influence of q in Metropolis-Hastings?
- narrow q:
- similar theta*
- high acceptance as posteriors will be mostly the same >> E(alpha) close to 1
- >> not much movement in parameter-space
- wide q:
- different theta*
- acceptance will vary a lot (0 to 1) as posteriors will differ >> E(alpha) << 1
- >> jumping, exploring
- >> balance between acceptance and exploring
How to use MCMC in practice? (3)
- use different starting values
- define and delete burn in phase
- thinning out: keep only every k-th draw, with k large enough that the autocorrelation -> 0, to get approximately iid samples from posterior
Briefly explain the idea of gibbs sampling.
- sample only one single component of theta-vector in each step (avoid problem of acceptance)
- steps
- draw theta*_k from f(theta_k* | y, theta_-k,t)
- set theta*k,t+1 = theta*_k (always accepted, no rejection step)
- use new theta*k,t+1 during drawing of other components
- proof
- \(f_{\theta_k}(\theta_k|y, \theta_{-k}) \propto f_{\theta}(\theta_k, \theta_{-k}|y) = f_\theta(\theta|y)\)
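A small Gibbs sampler, as a sketch. The target is assumed to be a standard bivariate normal with correlation rho = 0.8 (standing in for a two-component posterior), because its full conditionals are known: theta_1 | theta_2 ~ N(rho * theta_2, 1 - rho^2) and vice versa. All names and settings are illustrative.

```python
import math
import random

rng = random.Random(3)
RHO = 0.8   # made-up correlation of the bivariate normal target

def gibbs(n_iter, start=(0.0, 0.0)):
    theta1, theta2 = start
    cond_sd = math.sqrt(1 - RHO**2)
    chain = []
    for _ in range(n_iter):
        theta1 = rng.gauss(RHO * theta2, cond_sd)   # draw theta_1 given current theta_2
        theta2 = rng.gauss(RHO * theta1, cond_sd)   # draw theta_2 given the NEW theta_1
        chain.append((theta1, theta2))
    return chain

chain = gibbs(20000)[1000:]                          # drop an (arbitrary) burn-in phase
n = len(chain)
m1 = sum(a for a, _ in chain) / n
m2 = sum(b for _, b in chain) / n
cov = sum((a - m1) * (b - m2) for a, b in chain) / n
v1 = sum((a - m1) ** 2 for a, _ in chain) / n
v2 = sum((b - m2) ** 2 for _, b in chain) / n
corr = cov / math.sqrt(v1 * v2)
print(corr)
```

Every draw is kept, since each component is sampled from its exact full conditional; the sample correlation of the chain recovers rho.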
Briefly explain variational bayes.
Name the steps. (2)
- approximate f(y) or posterior by replacing it by q(.)
- by minimizing the KL-divergence between e.g. f(theta|y) (unknown) and q()
- min KL(q(), f())
- with componentwise independence s.t. \(q_\theta(\theta) = \prod_{k = 1}^{p} q_k(\theta_k)\)
- split numerator and denominator into the k-th component and the rest
- ... \(= \int \log\frac{q_k(\theta_k)}{f^*_k(\theta_k|y)} q_k(\theta_k)d\theta_k = KL(q_k(.), f^*_k(.|.))\)
- is minimized if \(q_k(\theta_k) \propto f^*_k(\theta_k|y)\)
- \(q_k(\theta_k)=f^*_k(\theta_k|y) / \int f^*_k\)
- steps
- 0) initialise q_k(0)
- for each dimension
- q_k,t+1 = f*_k,t(.|.) with all other q_j,t fixed (j != k)
What is the core idea of bayes factor? (3)
Give a definition of it.
Explain values. (4)
- bayesian p-value
- compares two models based on posterior
- P(M_j|y) = f(y|M_j) * P(M_j) / f(y)
- ratio can be easily analysed (f(y) cancels out)
- \(\frac{f(y | M_1)}{f(y|M_0)}\) with \(f(y|M_j) = \int f(y|\vartheta) f_\theta(\vartheta|M_j)d\vartheta\)
- y depends on theta, theta depends on M_j
- values
- 1-3 no evidence for M1
- 3-20 evidence for M1
- 20-150 strong evidence for M1
- >150 very strong evidence for M1
What is the aim of bootstrapping? (1)
What is the general idea?
What is a problem here?
What general assumption is done?
Name the main steps. (5)
- aim: quantify/approximate uncertainty of estimates (e.g. variance, bias, CI, tests)
- idea: sample from data instead of true distribution as f(.), F(.) are unknown
- "mimic true variation"
- problem: only \(\binom{2n-1}{n}\) different B-samples exist >> the number of distinct B-samples is finite
- assumption: Y_i ~ F(.) iid
- steps
- 1) calculate t(y)
- 2) sample y*_i (i = 1...n) with replacement from y >> y* = bootstrap-sample
- 3) calculate t(y*)
- 4) repeat 2) 3) B times
- 5) estimate e.g. Var(t(y))_hat based on t(y*_b) (b = 1...B)
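The five steps can be sketched for the standard error of a sample mean, where the analytical value s/sqrt(n) is available as a sanity check. The data, B = 2000 and the function name are made up.

```python
import random
import statistics

rng = random.Random(4)
y = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.9, 4.4, 5.6, 4.9,
     3.5, 6.2, 5.0, 4.6, 5.4, 4.2, 5.8, 4.8, 5.2, 4.5]    # made-up sample

def bootstrap_se(data, statistic, B=2000):
    replicates = []
    for _ in range(B):                                    # 4) repeat B times
        resample = rng.choices(data, k=len(data))         # 2) draw y* with replacement
        replicates.append(statistic(resample))            # 3) t(y*)
    return statistics.stdev(replicates)                   # 5) spread of the t(y*_b)

t_obs = statistics.mean(y)                                # 1) t(y)
se_boot = bootstrap_se(y, statistics.mean)
se_analytic = statistics.stdev(y) / len(y) ** 0.5
print(t_obs, se_boot, se_analytic)
```

For the mean the bootstrap adds little, but the same function works unchanged for statistics without a closed-form standard error, e.g. `bootstrap_se(y, statistics.median)`.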
Explain the plug-in principle.
Compare real world and bootstrap world. (1)
Give examples for replacement. (3)
- F(.) replaced by F^_n(y) (theoretical distribution-function by empirical)
- draw from F^_n(y) instead of F(.) with replacement
- >> simulation instead of analytical solution
- real world: y ~ F(.) iid >> t(y), bootstrap world: y* ~ F^_n(y) iid >> t(y*)
- replacements
- mu by y_bar
- y_bar by y*_bar
- F(y) by F^_n(y*)
Give examples how to apply bootstrapping. (2)
- bias
- \(bias(\hat\xi, \xi) = E(\hat\xi) - \xi \Rightarrow \hat{bias}(\hat\xi, \xi) = \hat{\bar\xi}^*-\hat\xi\)
- drawn from F^_n(y) and F(.)
- correction: \(\hat{\hat\xi} = \hat\xi - \hat{bias}(\hat\xi, \xi) = 2\hat\xi-\hat{\bar\xi}^*\)
- zero bias leads to increased variance...
- standard-error
- \(\xi = \mu_y - \mu_z \Rightarrow \hat{\xi} = \bar{y} - \bar{z} \Rightarrow \hat{\xi}^{*b} = \bar{y}^{*b} - \bar{z}^{*b}\)
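The bias estimate and its correction can be tried on an estimator that is known to be biased. The sketch below assumes the plug-in variance (divisor n) as the statistic, made-up data and B = 4000 replicates.

```python
import random

rng = random.Random(5)
y = [2.1, 3.4, 1.8, 4.0, 2.9, 3.7, 2.2, 3.1, 4.4, 2.6, 1.5, 3.9]   # made-up sample

def plug_in_variance(data):
    """Variance with divisor n, which is biased downwards."""
    m = sum(data) / len(data)
    return sum((v - m) ** 2 for v in data) / len(data)

xi_hat = plug_in_variance(y)
replicates = [plug_in_variance(rng.choices(y, k=len(y))) for _ in range(4000)]
xi_star_bar = sum(replicates) / len(replicates)    # mean of the bootstrap estimates

bias_hat = xi_star_bar - xi_hat                    # bootstrap estimate of the bias
xi_corrected = xi_hat - bias_hat                   # equals 2 * xi_hat - xi_star_bar
print(xi_hat, bias_hat, xi_corrected)
```

The estimated bias is negative, so the corrected value moves upwards, in the direction of the unbiased estimator with divisor n - 1.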
What is the idea of parametric bootstrapping?
Discuss it (1+ 1,1)
- idea: Y_i ~ F(., theta) iid >> Y*_i ~ F(. | theta_hat) iid
- y*_i can differ from y_i
- + useful for small samples (or extreme cases)
- - parametric assumptions (+ estimation of those)
Explain how bootstrapping can be used in regression. (4) Explain general character, how it works and pros and cons.
- residual-based
- fit a model on x, y >> beta_hat, y_hat >> epsilon_hat >> epsilon* = samples from residuals
- >> y* = Xbeta_hat + epsilon*
- beta*_hat = (XTX)^-1 XTy*
- model-based
- fit a model on x, y, estimate variance of residuals (sigma_hat^2)
- y* = Xbeta_hat + epsilon* with epsilon* ~ N(0, sigma_hat^2)
- beta*_hat = (XTX)^-1 XTy*
- >> problem: induces variance homogeneity (as residuals are independent of x) >> variance unchanged, nothing gained by bootstrapping
- Var(beta*_hat) = (XTX)^-1 sigma_hat^2
- pairwise/case resampling
- y*, X*: whole rows (y_i, x_i) sampled with replacement
- beta*_hat = (X*T X*)^-1 X*T y*
- - contradicts regression idea: model y given X
- wild bootstrap
- fit a model on y, X >> epsilon_hat
- \(\hat\epsilon^*_i = V^*_i\hat\epsilon_i\)
- with V*_i drawn from point distribution with E(V) = 0, Var(V) = 1
- y* = Xbeta_hat + epsilon*_hat
- mimics empirical estimates
- ++ all original samples considered
- >> variance heterogeneity can be modeled as epsilon*_hat_i depends on x_i
What is the main idea of regression? (4)
How is the solution obtained? Explain an alternative.
- relates input X with output y
- y = beta_zero + Xbeta_x + epsilon
- error epsilon can't be explained by model (epsilon ~ N(0, sigma^2))
- error independent of x >> variance homogeneity
- E(y|X) = beta_zero + Xbeta_x = y_hat
- >> modeling the mean y|X ~ N(beta_zero + Xbeta_x, sigma^2)
- estimation by:
- beta_0_hat, beta_x_hat = argmin sum((y_i - y_hat_i)^2) (least squares)
- or: ML-estimation: normal-distribution with mu = (beta_zero + Xbeta_x)
Explain the matrix-notation of regression.
- design-matrix X (with first column = 1 for beta_zero), n x p+1
- epsilon-vector: n x 1
- response-vector: n x 1
- beta-vector: (beta_zero, beta_x)T: p+1 x 1
- >> y = Xbeta + epsilon
Name ML-estimates in regression (I(beta_hat), beta_hat, sigma_ML^2)
- I(beta_hat) = XTX/sigma^2, I^-1(beta_hat) = sigma^2 (XTX)^-1 = Var(beta_hat)
- beta_hat = (XTX)^-1 * XTy
- sigma_ML^2 = (y-Xbeta_hat)T(y-Xbeta_hat) / n (biased)
- >> beta_hat ~ N(beta, Var(beta_hat))
- (E(beta_hat), Var(beta_hat))
How to obtain uncorrelated beta_0 and beta_x? (2)
- XTX implies correlated estimates (usually useful)
- solution:
- \(x^*_i = x_i - \bar x \Rightarrow x_i = x^*_i+\bar x\)
- \(y_i = \beta^*_0 + x^*_i\beta_x+\epsilon_i \Rightarrow \hat \beta^*=(X^{*T}X^*)^{-1}X^{*T}y\)
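The effect of centering can be checked numerically for a single covariate. In this sketch (made-up data, hypothetical helper name) the off-diagonal entry of X^T X for X = [1, x] is sum(x); it vanishes after centering, which makes (X^T X)^-1 diagonal and the two estimates uncorrelated.

```python
x = [1.0, 2.0, 4.0, 5.0, 8.0]     # made-up covariate
y = [1.1, 1.9, 4.2, 4.8, 8.1]     # made-up response

def simple_ols(x_values, y_values):
    """Least-squares intercept and slope for y = beta_0 + x * beta_x + epsilon."""
    n = len(x_values)
    x_bar, y_bar = sum(x_values) / n, sum(y_values) / n
    sxx = sum((a - x_bar) ** 2 for a in x_values)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x_values, y_values))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

x_bar = sum(x) / len(x)
x_centered = [a - x_bar for a in x]

# Off-diagonal entry of X^T X for X = [1, x] is the sum of the x column
off_diag_raw = sum(x)
off_diag_centered = sum(x_centered)

b0, bx = simple_ols(x, y)
b0_star, bx_star = simple_ols(x_centered, y)
print(off_diag_raw, off_diag_centered, (b0, bx), (b0_star, bx_star))
```

The slope is unchanged by centering, while the new intercept beta*_0 equals y_bar, the fitted value at the mean of x.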
How to model non-linear effects in regression? (3) Why is this possible?
- quadratic effect: \(x^2\beta_{xx}\)
- binary covariate: \(2\beta_2\)
- categorical variable: \(1_{\{edu = 2\}}\beta_3\)
- linear regression = linear in parameter theta, not in X
Name the hat-matrix, where does it come from?
How can it be used?
- \(H = X(X^TX)^{-1}X^T\)
- symmetric and idempotent, s.t. HT = H, HH = H, same for (I-H)
- from \(y-\hat y = y-X\hat\beta = y-X(X^TX)^{-1}X^Ty = y-Hy=(I-H)y\)
- used for: \(E((y-X\hat\beta)^T(y-X\hat\beta)) = E(y^T(I-H)(I-H)y)=...=\sigma^2(n-p)\)
- with \(E(YY^T)=Var(Y)+E(Y)E(Y)^T\)
Bayesian regression. Name properties. (2 + 1)
- beta is unknown
- \(\beta, \sigma^2\sim f(\beta, \sigma^2|y)\) = posterior with flat prior
- with mu = Xbeta
- \((y-X\beta)=(y-X\hat\beta)-(X\beta-X\hat\beta)\)
- \((y-X\beta)^T(y-X\beta)=...=\hat\sigma^2(n-p)+(\beta-\hat\beta)^TX^TX(\beta-\hat\beta)\)
- \(f(\beta, \sigma^2|y)\propto~...\times \exp(-1/(2\sigma^2)(\beta-\hat\beta)^TX^TX(\beta-\hat\beta))\)
- posterior again normal
- >> \(\beta|\sigma^2, y\sim N(\hat\beta, (X^TX)^{-1}\sigma^2)\)
- ML: beta_hat | beta, sigma^2 ~ N(beta, (XTX)^-1 sigma^2)
- >> same results, different reasoning
What is the core idea of GLM (name cases, h(n) and interpretation) (3).
Explain transformation of response. (1)
Name requirements. (2)
- response can be
- binary (logistic regression): h(n) = exp(n)/(1+exp(n)), or PHI(n)
- effects: odds/log-odds
- count (poisson regression): h(n) = exp(n)
- multiplicative effects
- categorical (cumulative regression)
- transformation:
- E(y|X) = h(n) with n = linear-predictor = beta_0 + Xbeta_x
- n = h^-1(E(y|X)) = g(mu)
- required
- y|X ~ exp-family
- link-function: E(y|X) = h(n)
- natural/canonical link: theta = n
- remember: \(\partial\kappa(\theta)/\partial(\theta) = E(t(y)) = \mu\)
How does prediction work in regression? (2)
- y_hat = Xbeta_hat + epsilon
- E(y_hat) = Xbeta_hat
How is the variance of prediction in regression defined? What does this imply?
- \(Var(\hat y) = Var(X\hat\beta+\epsilon)=\sigma^2(X(X^TX)^{-1}X^T+I)\)
- >> high variance in regions of low mass of data
- as (XTX)^-1 -> 0 for n -> inf; -> inf for n -> 0
When to use weighted regression? (2)
1: 2 + 4
2: 4
- modeling variance heterogeneity (Var(y_i) = sigma^2(x_i))
- sigma^2(x_i) = a_i * sigma^2
- >> y ~ N(Xbeta, sigma^2 W^-1) with W = diag(1/a_j), j = 1...n
- Var(y) = sigma^2 W^-1
- in loglikelihood: ...(y-Xbeta)TW(y-Xbeta)
- in beta_hat = (XTWX)^-1 XTWy
- Var(beta_hat) = sigma^2(XTWX)^-1, rest cancels out as Var(y) = sigma^2 W^-1
- survey weighting
- biased data in surveys >> over- or underrepresentation
- >> introduce weights for samples (W)
- beta_hat see above
- Var(beta_hat) = ... nothing cancels out as Var(y) = sigma^2 (W not included)
Briefly describe quantile-regression (3)
- estimation of quantile(s), not expectation
- modeling of variance heterogeneity implicit (analysing slopes of quantiles)
- squared error replaced by check-function (>> linear programming as not solvable analytically)
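The check function can be illustrated in the simplest case, an intercept-only model, where minimising it returns a sample quantile. This sketch assumes the usual definition rho_tau(u) = u * (tau - 1{u < 0}) and uses a brute-force search over the data points instead of linear programming; data and names are made up.

```python
def check_loss(u, tau):
    """Check function rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def best_constant(y, tau):
    """Constant q minimising the summed check loss; a minimiser lies among the data points."""
    return min(y, key=lambda q: sum(check_loss(v - q, tau) for v in y))

y = [1.0, 2.0, 3.0, 4.0, 100.0]      # made-up data with one outlier
median = best_constant(y, 0.5)       # tau = 0.5 gives the median
q90 = best_constant(y, 0.9)          # tau = 0.9 gives an upper quantile
print(median, q90)
```

Unlike the mean (22.0 here), the tau = 0.5 solution is not affected by the outlier, which reflects replacing the squared error with the check function.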