SRaI (@LMU)

Characterise non-parametric models:

idea (1)

Penalised splines (1)

  • idea
    • more flexible model type: \(y = m(x) + \epsilon\), where m(x) is any smooth function
  • splines
    • m(x) = linear combination of known basis functions
    • \(m(x) = B(x)\theta = \sum_k B_k(x)\theta_k\)

Explain B-splines. idea (1)

How to find K?

 

  • knots on the x-axis, piecewise linear/quadratic polynomials between the knots
  • K (number of basis functions) needs to be chosen well: too wiggly vs. too smooth
    • penalty on theta: second-order differences \(\theta_j - 2\theta_{j-1} + \theta_{j-2}\) should be small
      • defines the L-matrix
    • \(\hat\theta = (X^TX + \lambda LL^T)^{-1}X^Ty\) (see the numeric sketch after this card)
      • from the likelihood with penalty term: \(l_p(\theta, \sigma^2, \lambda) = l(\theta, \sigma^2) - p(\theta, \lambda) = l(\theta, \sigma^2) - \frac{\lambda}{2\sigma^2}\,\theta^TLL^T\theta\)
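A minimal numeric sketch of the closed-form estimator above, assuming a quadratic B-spline basis built with scipy and a second-order difference penalty; the matrix D below plays the role of \(L^T\) (so \(D^TD = LL^T\)), and the helper name `bspline_basis` is made up for illustration:

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(x, knots, degree=2):
    """Evaluate all B-spline basis functions B_k(x) for the given knots."""
    t = np.r_[[knots[0]] * degree, knots, [knots[-1]] * degree]   # augmented knot vector
    n_basis = len(t) - degree - 1
    return np.column_stack([BSpline(t, np.eye(n_basis)[k], degree)(x)
                            for k in range(n_basis)])

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 150))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)    # hypothetical data

X = bspline_basis(x, knots=np.linspace(0, 1, 15))                 # design matrix of basis functions
D = np.diff(np.eye(X.shape[1]), n=2, axis=0)                      # second-order differences, D^T D ~ L L^T
lam = 5.0                                                          # smoothing parameter lambda

theta_hat = np.linalg.solve(X.T @ X + lam * D.T @ D, X.T @ y)      # (X'X + lambda L L')^{-1} X'y
m_hat = X @ theta_hat                                              # fitted smooth function m(x)
print(m_hat[:5])
```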

How to find lambda in the case of B-splines? (2 limiting cases first, then how to choose it (2))

  • if lambda = 0: unpenalised
    • better fit (fit(lambda) minimal), higher dim(lambda)
  • if lambda -> inf: strongly penalised
    • worse fit (fit(lambda) increases), lower dim(lambda)
  • 1) AIC = fit(lambda) + 2 dim(lambda)
    • >> tradeoff: fit vs. complexity: minimise AIC (see the sketch after this list)
  • 2) Bayesian approach: prior on theta (involves a generalised inverse of the singular penalty matrix)
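A sketch of the AIC tradeoff for choosing lambda, assuming fit(lambda) = n·log(RSS/n) (Gaussian log-likelihood up to constants) and dim(lambda) = trace of the hat matrix (effective degrees of freedom); the Gaussian-bump basis below is only a stand-in for the B-spline basis, and all names are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 150))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

# stand-in basis: Gaussian bumps on an equidistant grid (the penalty works the same way)
centers = np.linspace(0, 1, 20)
X = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / 0.08) ** 2)
D = np.diff(np.eye(X.shape[1]), n=2, axis=0)             # second-order difference penalty

best = None
for lam in 10.0 ** np.arange(-4, 5):
    A = np.linalg.solve(X.T @ X + lam * D.T @ D, X.T)    # (X'X + lam D'D)^{-1} X'
    y_hat = X @ (A @ y)
    rss = np.sum((y - y_hat) ** 2)                        # fit(lambda)
    edf = np.trace(X @ A)                                 # dim(lambda) = tr(hat matrix)
    aic = x.size * np.log(rss / x.size) + 2 * edf
    if best is None or aic < best[0]:
        best = (aic, lam, edf)

print("chosen lambda:", best[1], "edf:", round(best[2], 1))
```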

When does bootstrapping fail? (3)

  • non-iid data (drawing from \(\hat F_n(\cdot)\) doesn't mimic F(·) well)
  • too small sample size (\(\hat F_n(\cdot)\) hasn't converged to F(·))
  • special cases:
    • drawing from a uniform distribution U(0, theta)
    • \(\hat\theta_{ML} = \max(y_i)\)
    • probability that the sample maximum is contained in a bootstrap sample: \(1-(1-1/n)^n \to 1-1/e \approx 63\%\) (for n -> inf)
    • >> the bootstrap distribution of \(\hat\theta_{ML}\) does not converge to its sampling distribution (simulation sketch below)
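A small simulation of the failure case above (uniform maximum); the bootstrap distribution of \(\hat\theta_{ML}\) puts a point mass of roughly 0.632 on the observed maximum, which the true sampling distribution does not:

```python
import numpy as np

rng = np.random.default_rng(1)
n, theta, B = 50, 1.0, 2000
y = rng.uniform(0, theta, size=n)
theta_hat = y.max()                      # ML estimate for U(0, theta)

# bootstrap distribution of the maximum
boot_max = np.array([rng.choice(y, size=n, replace=True).max() for _ in range(B)])

# point mass at the observed maximum -- this is where the bootstrap fails
print("P*(theta* == theta_hat) ~", np.mean(boot_max == theta_hat))
print("theoretical limit 1 - 1/e =", 1 - 1 / np.e)
```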

 

How can bootstrapping be extended? (3)

  • time series, longitudinal data: non-iid >> block bootstrap (resample blocks of consecutive observations)
  • Bayesian bootstrap: draw random (Dirichlet) weights for the observations instead of resampling
  • subsample bootstrap (m out of n bootstrap, m < n)
    • special case m = n - 1 (leave-one-out): jackknife

Compare cross-validation and bootstrapping.

What is the aim of CV? Define MSEP, how to estimate it?

Aim of bootstrapping? How can it be used to predict error?

  • CV
    • quantify the prediction error
    • mean squared error of prediction: \(MSEP = E((Y-\hat Y)^2) = \dots = Var(Y) + MSE(m(x), \hat m(x))\)
    • intuitive estimate of MSEP: average of \((y_i-\hat m(x_i))^2\) on the training data
      • too optimistic, as \(\hat m(x)\) was built by minimising exactly this quantity
      • solution: \(\widehat{MSEP} = \frac1n\sum_i\big(y_i-\hat y_{-k(i)}\big)^2\) (see the sketch after this list)
        • prediction from the model fitted without the k-th fold's samples
  • bootstrapping
    • quantify the uncertainty of an estimate
    • quantifying the prediction error is also possible:
      • \(E((y-\hat m^*(x))^2)\)
        • \(\hat m^*\) = model fitted on a bootstrap sample
        • error evaluated on all original samples
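A minimal K-fold CV sketch of \(\widehat{MSEP}\) for a hypothetical cubic-polynomial model; data, fold count and names are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 200, 5
x = rng.uniform(-2, 2, n)
y = np.sin(x) + rng.normal(scale=0.3, size=n)           # hypothetical data
X = np.column_stack([np.ones(n), x, x**2, x**3])        # cubic polynomial model

folds = rng.permutation(n) % K                           # random fold assignment
sq_err = np.empty(n)
for k in range(K):
    train, test = folds != k, folds == k
    beta = np.linalg.lstsq(X[train], y[train], rcond=None)[0]   # fit without fold k
    sq_err[test] = (y[test] - X[test] @ beta) ** 2              # predict the held-out fold

msep_cv = sq_err.mean()
msep_train = ((y - X @ np.linalg.lstsq(X, y, rcond=None)[0]) ** 2).mean()
print(msep_cv, msep_train)   # training error is the smaller, too optimistic one
```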

How can bootstrapping be used to define CIs? What is the problem here?

  • use quantiles of the bootstrap estimates \(\hat\theta^*_1,\dots,\hat\theta^*_B\) to define the bounds (percentile interval; see sketch below)
  • problem: B must be sufficiently large, as the \(\alpha/2\)-quantiles rely on only \(\alpha/2 \cdot B\) samples (B = 200, \(\alpha\) = 0.05: only 5 samples)
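A minimal percentile-interval sketch, assuming a hypothetical sample and the mean as the estimate of interest:

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.exponential(scale=2.0, size=100)     # hypothetical sample, estimate the mean
B, alpha = 2000, 0.05

theta_boot = np.array([rng.choice(y, size=y.size, replace=True).mean() for _ in range(B)])
lower, upper = np.quantile(theta_boot, [alpha / 2, 1 - alpha / 2])   # percentile CI
print(f"estimate {y.mean():.3f}, {100*(1-alpha):.0f}% percentile CI [{lower:.3f}, {upper:.3f}]")
```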

Explain how bootstrapping can be used for testing? Give two examples (4, 2).

What is a problem here?

  • simulate the distribution of the test statistic under H0
  • examples
    • 1) H0: both samples come from the same distribution
      • \(t(x) = |\bar y - \bar z|\)
      • resample from the pooled data and randomly assign to y and z
      • \(t(x^*) = |\bar y^* - \bar z^*|\)
      • p-value = \(E\big(1\{t(x^*) > t(x)\}\big)\) (see the sketch after this list)
    • 2) H0: the distribution is symmetric around d
      • \(t(x) = \bar x - x_{med}\) (≈ 0 under H0)
      • extend the dataset by mirroring every sample at d (\(y' = 2d-y\)) and resample from it
  • problem: the power can't be calculated, as no information about H1 is given
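A sketch of example 1 (same-distribution H0) with pooled resampling; data and names are made up:

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(0.0, 1.0, 40)          # hypothetical group 1
z = rng.normal(0.4, 1.0, 35)          # hypothetical group 2
t_obs = abs(y.mean() - z.mean())

pooled, B = np.concatenate([y, z]), 5000
t_star = np.empty(B)
for b in range(B):
    x_star = rng.choice(pooled, size=pooled.size, replace=True)   # resample under H0: one common distribution
    t_star[b] = abs(x_star[:y.size].mean() - x_star[y.size:].mean())

p_value = np.mean(t_star >= t_obs)    # share of bootstrap statistics at least as extreme
print(p_value)
```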

What is a problem when sampling on big data? How to tackle it? (1 each)

  • problem: analysis of all data sometimes not possible (restrictions of hardware)
  • solution: draw only a random sample of size n (how to choose n?)
    • tradeoff: computational effort (grows at least linearly in n) vs. accuracy (improves only at rate \(\sqrt n\))

What is the main idea of misspecified models? (3)

  • so far: the likelihood assumes a model f(·), i.e. we assume \(Y_i \sim f(y; \theta)\)
  • now: \(Y_i \sim g(y)\) with unknown g and f(·) != g(·) >> the assumed model is (almost always) wrong
  • >> ML estimation is still useful even without knowing g(·) (to be shown)

Derive and name properties of misspecified models (f() != g())

idea: (3), properties (4)

  • minimise: \(KL(g,f) = \int \log\frac{g(y)}{f(y)}\,g(y)\,dy=\int \log(g(y))\,g(y)\,dy -\int \log(f(y))\,g(y)\,dy\)
    • first term = constant w.r.t. f (depends only on g)
    • >> maximise the last term >> first-order condition \(\int s(\theta;y)\,g(y)\,dy =E_g(s(\theta;y))=0 \Rightarrow \theta_0\) >> expectation of the score taken w.r.t. g(·), not f(·); its root is the pseudo-true \(\theta_0\)
  • it follows (asymptotically) \((\hat\theta-\theta_0)\sim N\big(0, I^{-1}(\theta_0)\, V(\theta_0)\,I^{-1}(\theta_0)\big)\) (sandwich form; see the sketch below)
    • with \(I(\theta_0) = E_g\big(-\partial s(\theta; y)/\partial\theta\,|_{\theta_0}\big)\), again w.r.t. g(·), not f(·)
    • and \(V(\theta_0)=Var_g(s(\theta_0;y))\)
    • \(I^{-1}(\theta_0)V(\theta_0)=1\) (identity; information equality) if f(·) = g(·), recovering the usual ML variance
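A minimal numeric sketch of the sandwich variance \(I^{-1}(\theta_0)V(\theta_0)I^{-1}(\theta_0)\), assuming the simplest possible working model N(theta, 1), whose score is \(s_i(\theta)=y_i-\theta\), fitted to data that actually come from an exponential distribution; everything here is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.exponential(scale=2.0, size=500)    # true g(): exponential

# working (misspecified) model f(): N(theta, 1); score s_i(theta) = y_i - theta
theta_hat = y.mean()                         # root of the summed score
n = y.size

score = y - theta_hat
I_hat = 1.0                                  # -ds/dtheta = 1 under the working model
V_hat = np.mean(score ** 2)                  # Var_g of the score

var_naive = 1.0 / (n * I_hat)                # usual ML variance (wrong under misspecification)
var_sandwich = V_hat / (n * I_hat ** 2)      # I^-1 V I^-1 / n, robust to misspecification
print(var_naive, var_sandwich, y.var(ddof=0) / n)
```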

Sketch how to derive the AIC.

Main problem (1).

sketch of steps (3-4)

Definition of AIC (3)

  • calculation of \(\int \log(f(y;\hat\theta))\,g(y)\,dy\) is not possible as g(·) is unknown
  • \(\hat\theta\) depends on the sample >> KL depends on the sample >> take the expectation over samples
    • \(E(KL) = \text{const} - a\) with \(a = E_{y_1, \dots, y_n}\big(\int \log f(y;\hat\theta)\,g(y)\,dy\big)\) >> Taylor expansion (TLS)
    • compare with the expected in-sample term \(b = E\big(\tfrac1n\sum_i \log f(y_i;\hat\theta)\big)\) >> Taylor expansion (TLS)
    • for n -> inf: \(a \approx b - p/n\)
      • >> the in-sample log-likelihood is too optimistic by the bias correction p/n (p = dim(theta))
      • >> maximise \(\sum_i \log f(y_i;\hat\theta) - p\)
  • \(AIC = -2\sum_i \log f(y_i;\hat\theta) + 2p\)
    • approximates (up to scaling) \(E_{y_1,\dots, y_n}\big(KL(g, f(\cdot;\hat\theta))\big)-\int \log(g(y))\,g(y)\,dy\)
    • the latter is an unknown constant (g(·) unknown) >> absolute AIC values are not interpretable, but comparing models is still possible!
    • to be minimised

Name modifications of AIC (3)

  • bias-corrected AIC
    • adapted penalty
  • deviance information criterion
    • bayesian version of AIC
    • simulation based
  • \(BIC = -2\,l(\hat\theta) + p\log(n)\) >> minimise
    • for n -> inf, minimising BIC maximises the posterior model probability \(P(M_k|y) = \frac{f(y|M_k)~f(M_k)}{f(y)}\)
    • >> BIC penalises complex models more strongly (useful in practice)
      • penalty: AIC = 2p, BIC = log(n)·p (larger whenever n > e² ≈ 7.4)

AIC and CV. Define each, what is the main difference? (2)

  • AIC: \(E_{Y_1,\dots,Y_n}\big(E_Y(\log g(Y))-E_Y(\log f(Y;\hat\theta))\big)\)
  • CV: \(E_{Y_1,\dots,Y_n}\big(E_Y\big((Y-\hat\mu)^2\big)\big)\)
  • comparable, but:
    • AIC uses the KL loss instead of the squared error \((y-\hat\mu)^2\)
    • AIC is analytical (bias correction p), CV is simulation/resampling-based

What is the idea of model averaging?

How does it work in general? (2)

  • idea: make use of multiple equally good models
  • averaging
    • normalise the AIC values: \(\Delta AIC_k = AIC_k-\min_l AIC_l\), best model = 0
    • weight the models based on AIC: \(P(M_k|y) \approx w_k = \frac{\exp(-\frac12\Delta AIC_k)}{\sum_l \exp(-\frac12\Delta AIC_l)}\) (softmax of \(-\Delta AIC/2\); see sketch below)
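A sketch of the AIC-based weights; the softmax form \(w_k \propto \exp(-\Delta AIC_k/2)\) is the common Akaike-weight convention and assumed here:

```python
import numpy as np

def akaike_weights(aic):
    """Model-averaging weights from a vector of AIC values (softmax of -dAIC/2)."""
    aic = np.asarray(aic, dtype=float)
    d = aic - aic.min()                 # Delta AIC_k, best model = 0
    w = np.exp(-0.5 * d)
    return w / w.sum()

print(akaike_weights([102.3, 103.1, 110.0]))   # hypothetical AIC values for three models
```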

Explain the problem of inference after model selection:

what is the ad-hoc idea? (2) What is the problem then? (2)

What is the general problem? (1) Name two solutions.

  • selected model: \(\hat k = \arg\min_k AIC_k\)
  • ad-hoc idea: treat \(\hat k\) as fixed: \(\hat\theta_{\hat k}-\theta_{k0} \sim N(0, ...)\)
    • "quiet scandal": the data are used twice (model selection and variance estimation)
    • >> the variance estimate is biased
  • >> naively combining model selection and estimation of CIs is not valid; solutions:
    • bootstrap (repeat the selection step within every bootstrap sample)
    • estimate the variance on separate test data

Explain the main concept of LASSO (least absolute shrinkage and selection operator):

what is the main idea (1)

give technical details. (2 + 2)

  • idea
    • combine model selection (theta_j = 0) AND parameter estimation in one step
  • details
    • define the parameters which might be set to 0 (index set I)
    • penalised likelihood: \(l_p(\theta, \lambda) = l(\theta)-\lambda\sum_{j\in I}|\theta_j|\) >> max (see sketch below)
      • equivalently: max \(l(\theta)\) s.t. \(\sum_{j\in I}|\theta_j| \le c\) (quadratic programming, linear constraints)
      • c = inf <> lambda = 0 >> ordinary likelihood
      • c = 0 <> lambda = inf >> all theta_j shrunk to 0
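A minimal LASSO sketch using scikit-learn, whose Lasso objective (penalised least squares) matches the penalised Gaussian likelihood above up to scaling; alpha plays the role of lambda, and the data are made up:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(5)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.5, 0, 0, 1.0, 0, 0, 0, 0, 0])   # sparse truth
y = X @ beta_true + rng.normal(scale=1.0, size=n)

# L1-penalised least squares; larger alpha (= lambda) sets more coefficients exactly to 0
fit = Lasso(alpha=0.2).fit(X, y)
print(np.round(fit.coef_, 2))   # selection (zeros) and estimation in one step
```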

Briefly sketch the bayesian view on LASSO. (2)

Compare it to regular LASSO. (2)

  • Laplace (double-exponential) prior on the \(\theta_j\)
  • log posterior (without normalisation constant): \(l(\theta)-\sum_j\big(\log(\sigma)+|\theta_j|/\sigma\big) = l_p(\theta;\sigma)\)
  • comparison
    • hyperparameter: sigma (not lambda)
    • prior knowledge included (influences results)

Name the multivariate normal distribution. Name properties (3)

  • \(f(y;\theta) = \frac{1}{(2\pi)^{q/2}} ~ |\Sigma|^{-1/2}~\exp\big(-\tfrac12(y-\mu)^T\Sigma^{-1}(y-\mu)\big)\)
    • >> \(Y \sim N(\mu, \Sigma)\)
    • Y can be partitioned into \((Y_a, Y_b)\) >> \(\mu_a, \mu_b, \Sigma_{aa}\), etc.
    • Sigma: symmetric, positive definite

Name three approaches how to simplify Sigma. (3)

Why is this useful?

  • independence of variables
  • conditional independence
  • PCA
  • >> otherwise \(\binom{q}{2}\) covariance parameters need to be estimated (to be reduced)

How to simplify Sigma by independence of variables? (2)

  • test: H0: \(\Sigma_{jk} = 0\)
  • compare j-th and k-th variables

Relate Sigma and conditional independence.

  • \(\Omega = \Sigma^{-1}\): \(\Omega_{j, k} = 0\) expresses conditional independence
    • if \(\Omega_{j, k} = 0\) >> j and k are independent conditioning on the rest (see the sketch below)
    • as \(f(y; \mu = 0, \Sigma) \propto \exp(-\tfrac12 y^T\Sigma^{-1}y)=\exp(-\tfrac12\sum_l\sum_m y_ly_m\Omega_{l,m})\): the cross term \(y_jy_k\Omega_{j,k}\) vanishes, so the density factorises in \(y_j\) and \(y_k\) given the rest >> conditionally independent
  • useful, as the inverse of Sigma appears in the multivariate normal density
    • - but \(\Omega_{j,k}=0\) gives no direct information about \(\Sigma_{j,k}\)
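A small numeric sketch: a hypothetical 3-dimensional precision matrix with \(\Omega_{1,3}=0\) implies conditional independence of variables 1 and 3 given variable 2, even though they are marginally dependent:

```python
import numpy as np

# hypothetical precision matrix: variables 1 and 3 conditionally independent given variable 2
Omega = np.array([[ 2.0, -1.0,  0.0],
                  [-1.0,  2.0, -1.0],
                  [ 0.0, -1.0,  2.0]])
Sigma = np.linalg.inv(Omega)

print(np.round(Sigma, 3))                    # Sigma_13 != 0: marginally dependent ...
print(np.round(np.linalg.inv(Sigma), 3))     # ... but Omega_13 = 0: conditionally independent
```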

What is the aim of graphical models? (2)

Why is it useful? Name a drawback.

  • describing conditional independence as graph (V = set of nodes, E = set of edges)
  • if \((j,k)\notin E\) >> j, k are independent conditioning on the rest (not connected = cond. indep.)
  • + many entries of Sigma^-1 are 0 >> imposes low-dimensional dependence-structure
  • - no information about Sigma

Briefly sketch PCA. (4)

  • eigendecompose \(\Sigma = V\Lambda V^T\)
  • simplify Sigma: \(\Sigma^*=V^*\Lambda^* V^{*T}\) (keep only the first k eigenvalues/-vectors)
  • compute scores based on the first k eigenvectors (rotation of the data; see the sketch below)
  • >> useful for high-dimensional data
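A minimal PCA sketch via the eigendecomposition of the estimated covariance matrix; data and the choice k = 2 are made up:

```python
import numpy as np

rng = np.random.default_rng(6)
Y = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))    # hypothetical correlated data
Yc = Y - Y.mean(axis=0)

Sigma_hat = np.cov(Yc, rowvar=False)
eigval, eigvec = np.linalg.eigh(Sigma_hat)                  # Sigma = V Lambda V^T
order = np.argsort(eigval)[::-1]                            # sort eigenvalues decreasingly
eigval, eigvec = eigval[order], eigvec[:, order]

k = 2
scores = Yc @ eigvec[:, :k]                                 # rotate data onto the first k eigenvectors
explained = eigval[:k].sum() / eigval.sum()
print(scores.shape, round(explained, 3))
```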

What is the idea of copulas? (2)

Name properties of copulas.(3)

  • derive multivariate distribution beyond m.v. normal distribution (non-linear dependence structure)
  • combines marginal distributions: C: [0,1]^q -> [0,1]
    • Sklar's theorem: \(F(y_1, \dots,y_q)=C(F_1(y_1),\dots,F_q(y_q))=P(Y_1\le y_1,\dots,Y_q \le y_q)\)
  • properties
    • monotonically increasing (a regular cdf)
    • distribution function on \([0,1]^q\)
    • has univariate, uniform margins: \(C(F_1(y_1), F_2(\infty), \dots, F_q(\infty))=F_1(y_1)\)

Name the steps of the modelling strategy for copulas.

What happens if the copula depends on a parameter? How to find its estimate?

  • 1) model the univariate margins: \(\hat F_j(\cdot)\), \(\hat u_{ij}=\hat F_j(y_{ij})\)
  • 2) fit a copula to the \(\hat u_{ij}\) (they live in the q-dimensional unit cube) to model the dependence structure (see the sketch after this list)
    • copula density: \(c(u_1, \dots, u_q) = \frac{\partial^q C(u_1, \dots, u_q)}{\partial u_1\cdots\partial u_q}\)
  • hence: \(f(y_1,\dots,y_q)=c(F_1(y_1),\dots,F_q(y_q))~\prod_j f_j(y_j)\)
  • parametrised copula: \(c(u_1,\dots,u_q \mid \theta)\)
    • >> \(l(\theta) = \sum_i \log c(\hat u_{i1},\dots,\hat u_{iq}; \theta)\)
      • the likelihood depends on the estimates \(\hat u\), not on the data directly >> an empirical distribution function in step 1 is useful
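A minimal sketch of the two-step strategy for a Gaussian copula: empirical margins give pseudo-observations \(\hat u_{ij}\), and the copula parameter (here a correlation matrix) is estimated from the normal scores; data and names are made up:

```python
import numpy as np
from scipy.stats import norm, rankdata

rng = np.random.default_rng(7)
# hypothetical data with skewed margins but a common latent (normal) dependence
latent = rng.multivariate_normal([0, 0], [[1, 0.7], [0.7, 1]], size=1000)
y = np.column_stack([np.exp(latent[:, 0]),
                     np.exp(latent[:, 1]) ** 0.5])

# step 1: model the margins empirically -> pseudo-observations u_ij in (0, 1)
u = np.column_stack([rankdata(col) / (len(col) + 1) for col in y.T])

# step 2: fit a Gaussian copula: z = Phi^{-1}(u), dependence captured by corr(z) = R
z = norm.ppf(u)
R_hat = np.corrcoef(z, rowvar=False)
print(np.round(R_hat, 2))     # recovers the latent dependence (~0.7) despite the skewed margins
```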

Name common copulas. (4) How to choose the copula and its parameter?

  • gaussian copula
    • with R = correlation-matrix >> similar to m.v. normal
  • archimedian copula
    • Clayton copula (mass concentrated towards the lower left: lower-tail dependence)
    • Frank copula (similar to the m.v. normal)
    • Gumbel copula (mass concentrated towards the upper right: upper-tail dependence)
  • >> the choice is data-driven, e.g. via ML / AIC

Discuss copulas. 2+1

  • ++ just one/few parameters (compared with the m.v. normal)
  • ++ flexible, non-linear modelling
  • -- restricted to few parameters (problematic in large dimensions)

What is the idea of pair copulas? (1)

Sketch an example for q = 3. (3)

What can it be used for? What is the main advantage?

 

  • more flexible approach: build the joint distribution from pairs of (bivariate) copulas
  • \(f(y_1, y_2, y_3)= f_1(y_1)f_2(y_2)f_3(y_3)~c_{12}\,c_{13}\,c_{23|1}\)
    • \(c_{23|1}\) would depend on \(y_1\) >> simplifying assumption: it doesn't
    • for \(c_{23|1}\), estimates of \(F(y_2|y_1)\) and \(F(y_3|y_1)\) are required
      • these depend again on pairwise copulas (here \(c_{12}\) and \(c_{13}\)), which need to be fitted first
  • >> nested structure of pairwise copulas (q(q-1)/2 pairwise copulas required) >> flexible joint distributions
    • each pair copula can be estimated separately and can be of a different type

Missing data: what is item non-response and unit non-response?

Why is missing data analysis important?

  • item: single values missing
  • unit: whole sample-data missing
  • >> missingness should always (!) be considered before the analysis to allow correct conclusions

What means complete case analysis?

What are the problems here? (2)

  • only consider observations without NAs
    • e.g. \(\hat{\bar y}_j = \sum_i R_{ij}y_{ij} / \sum_i R_{ij}\) with \(R_{ij} = 1\) if \(y_{ij}\) is observed, 0 otherwise
  • problems
    • throws away most of the data (for large q, P(complete case), e.g. \(0.99^q\), gets small)
    • the remaining data may depend on the missingness pattern >> bias!

Name the missing-data patterns. (3) Explain each briefly (1 each).

  • Missing completely at random (MCAR)
    • probability of missingness is independent of data
  • Missing at random (MAR)
    • probability of missingness depends on observed data
  • Missing not at random (MNAR)
    • probability of missingness depends on missing data (not in observed data)

What is special about MCAR?

  • \(P(y_i|R_i = 1) = P(y_i)\)
  • >> missingness doesn't influence the estimation

Is complete case analysis applicable for MAR?

Explain the two cases.

  • complete case analysis not (always) applicable
    • as \(P(y_i|R_i =1) = \frac{P(R_i = 1|y_{iO})}{P(R_i = 1)}P(y_i)\) and the first factor is usually != 1
  • cases
    • missing target variable: complete case analysis applicable
      • as \(P(y|x,z,R_{yi}=1) = P(y|x,z)\) (under MAR, missingness of y depends only on the observed covariates)
    • missing covariates: correction required
      • as \(P(y|x,z,R_{xi}=1) = \frac{P(R_{xi}=1|y,z)}{P(R_{xi}=1|z)}P(y|x,z) \neq P(y|x,z)\) >> biased >> to be corrected

How to correct the bias in case of MAR and complete case analysis:

What is the problem? How to tackle this? (1:2) What is the result?

How can it be summarised?

  • problem: the complete-case score \(s_{cc}(\theta)=\sum_i R_i~s_i(\theta)\) has \(E(s_{cc}(\theta)) \neq 0\)
  • correction:
    • model \(P(R_i=1|y_{iO}) = \pi_i \Rightarrow \hat\pi_i\) (e.g. via logistic regression)
    • corrected score: \(s_{cc\text{-}corrected}(\theta)=\sum_i\frac{R_i}{\hat\pi_i}s_i(\theta)\)
      • if \(\hat\pi_i = 0.5\), the contribution \(s_i\) is weighted with 2 in case \(R_i = 1\)
  • >> \(E(s_{cc\text{-}corrected}(\theta)) = 0\)
  • >> weighted complete case analysis (weights = inverse probability of being observed; see the sketch after this list)
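A minimal sketch of the weighted (inverse-probability) correction for the simple case where the parameter is a mean, so that \(s_i(\theta)=y_i-\theta\); missingness of y is assumed to depend only on an observed covariate x, and \(\hat\pi_i\) is modelled with a logistic regression (all names and data are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(8)
n = 5000
x = rng.normal(size=n)
y = 2.0 + 1.0 * x + rng.normal(size=n)

# MAR: the probability of observing y depends only on the (always observed) x
pi_true = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))
R = rng.uniform(size=n) < pi_true

mean_cc = y[R].mean()                  # complete-case mean: biased under MAR

# model P(R = 1 | x) and reweight the observed contributions with 1 / pi_hat
pi_hat = LogisticRegression().fit(x.reshape(-1, 1), R).predict_proba(x.reshape(-1, 1))[:, 1]
mean_ipw = np.sum(R * y / pi_hat) / np.sum(R / pi_hat)

print(round(mean_cc, 3), round(mean_ipw, 3), "true mean = 2.0")
```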

What is special about MNAR? (2) Give an example.

  • the results of the analysis will be biased
  • no general way to correct for this from the data alone
  • example: very high and very low incomes are often not reported (NA), i.e. missingness depends on the missing value itself

 

EM:

What is the general problem about the likelihood in case of using all data? (1) What is the first approach? What is the problem?

Name and describe the steps of EM (3).

 

  • use all data (not only complete cases) >> the likelihood then also depends on unobserved data
    • >> observed likelihood: \(l_O(\theta) = \sum_i \log f(y_{iO};\theta)\)
    • problem: \(f(y_{iO}; \theta)\) requires integrating out \(y_{iM}\) and may be complicated
  • steps (see the sketch after this list)
    • E-step (expectation)
      • \(Q(\theta;\theta_t)=\sum_i\int l_i(\theta)~f(y_{iM}|y_{iO};\theta_t)\,dy_{iM}\)
      • >> replace the missing-data part of the log-likelihood by its conditional expectation given the observed values (at the current \(\theta_t\))
    • M-step (maximisation)
      • \(\frac{\partial Q(\theta;\theta_t)}{\partial \theta} = 0 \Rightarrow \theta_{t+1}\)
      • find the new theta that maximises Q
    • iterate until convergence (small changes in the likelihood)
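A minimal EM sketch, assuming a bivariate normal with values missing (completely at random) in the second coordinate; the E-step fills in conditional first and second moments, the M-step updates mu and Sigma from the expected sufficient statistics:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 400
mu_true = np.array([1.0, -1.0])
Sigma_true = np.array([[1.0, 0.6], [0.6, 1.5]])
Y = rng.multivariate_normal(mu_true, Sigma_true, size=n)
miss = rng.uniform(size=n) < 0.3            # 30% of the second coordinate missing
Y[miss, 1] = np.nan

# start values from the observed data
mu = np.array([np.nanmean(Y[:, 0]), np.nanmean(Y[:, 1])])
Sigma = np.diag([np.nanvar(Y[:, 0]), np.nanvar(Y[:, 1])])

for _ in range(100):
    # E-step: conditional mean/variance of y2 given y1 for the missing entries
    beta = Sigma[0, 1] / Sigma[0, 0]
    cond_mean = mu[1] + beta * (Y[miss, 0] - mu[0])
    cond_var = Sigma[1, 1] - beta * Sigma[0, 1]

    Yf = Y.copy()
    Yf[miss, 1] = cond_mean                  # expected first moments
    S = Yf.T @ Yf
    S[1, 1] += miss.sum() * cond_var         # add conditional variance to E(y2^2)

    # M-step: update mu and Sigma from the expected sufficient statistics
    mu_new = Yf.mean(axis=0)
    Sigma_new = S / n - np.outer(mu_new, mu_new)
    if np.abs(mu_new - mu).max() < 1e-8:     # convergence check
        break
    mu, Sigma = mu_new, Sigma_new

print(np.round(mu, 2), np.round(Sigma, 2))   # close to mu_true and Sigma_true
```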

Discuss EM. (3, 2)

  • ++ stable, robust, easy to apply
  • -- slow; inference is difficult (variance estimation: \(Var(\hat\theta) \approx I^{-1}(\hat\theta)\) is not directly available because of the unobserved \(y_{iM}\))
    • >> simulation-based methods required

Sketch why the likelihood keeps increasing during EM (3).

  • \(f(y_i;\theta) = f(y_{iM}, y_{iO};\theta)=f(y_{iO};\theta)\,f(y_{iM}|y_{iO};\theta)\)
  • \(\Rightarrow l_O(\theta) =l(\theta)-\sum_i \log f(y_{iM}|y_{iO};\theta)\); take the conditional expectation given the observed data at \(\theta_t\)
  • \(= Q(\theta, \theta_t)-H(\theta, \theta_t)\)
    • Q increases when moving to \(\theta_{t+1}\), and H does not increase, as \(H(\theta_t, \theta_t) \ge H(\theta, \theta_t)\) for all theta
      • because \(KL = H(\theta_t, \theta_t) - H(\theta, \theta_t) \ge 0\)
    • >> \(l_O(\theta_{t+1}) \ge l_O(\theta_t)\)

Multiple Imputation.

Difference to EM (1)

Name the main steps of MI (4)

  • EM: \(E(Y_{iM} | Y_{iO}, \theta_t)\) >> models only the conditional expectation
    • MI: \(Y_{iM} | Y_{iO} \sim f(Y_{iM} | Y_{iO}; \theta)\) >> models the full conditional distribution of \(Y_{iM}\) given \(Y_{iO}\) (any model) and draws from it
  • steps
    • 1) generate \(Y^*_{iM}|Y_{iO} \sim f(Y_{iM}|Y_{iO};\theta)\) and impute the NAs (predict \(Y_{iM}\) given \(Y_{iO}\))
    • 2) create K completed datasets with the \(Y^*_{iM}\)
    • 3) compute the estimate \(\hat\theta_k\) for each k = 1, ..., K
    • 4) combine the \(\hat\theta_k\) (see the sketch after this list)
      • \(\hat\theta_{MI}\) = mean of the \(\hat\theta_k\)
      • variance estimation for \(\hat\theta_{MI}\): Rubin's rules
        • \(\hat{Var}(\hat\theta_{MI})= \bar V+(1+1/K)B\)
        • with \(\bar V\) = average of the within-imputation variances \(\hat V_k=I^{-1}(\hat\theta_k)\)
        • and \(B = \frac{1}{K-1}\sum_k(\hat\theta_k-\hat\theta_{MI})^2\) = between-imputation variance
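A minimal sketch of step 4 (Rubin's rules) for a scalar parameter, given hypothetical estimates and within-imputation variances from K completed datasets:

```python
import numpy as np

def pool_rubin(theta_hats, variances):
    """Combine K completed-data estimates via Rubin's rules (scalar parameter)."""
    theta_hats = np.asarray(theta_hats, dtype=float)
    variances = np.asarray(variances, dtype=float)
    K = theta_hats.size
    theta_mi = theta_hats.mean()                              # MI point estimate
    V_bar = variances.mean()                                  # within-imputation variance
    B = np.sum((theta_hats - theta_mi) ** 2) / (K - 1)        # between-imputation variance
    var_mi = V_bar + (1 + 1 / K) * B                          # Rubin's total variance
    return theta_mi, var_mi

# hypothetical estimates and variances from K = 5 imputed datasets
print(pool_rubin([2.1, 2.3, 1.9, 2.2, 2.0], [0.04, 0.05, 0.04, 0.05, 0.04]))
```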