SRaI
@LMU
File Details
Flashcards | 125 |
---|---|
Language | English |
Category | Computer Science |
Level | University |
Created / Updated | 04.10.2019 / 11.10.2019 |
Web link | https://card2brain.ch/box/20191004_srai |
What does complete case analysis mean?
What are the problems here? (2)
- only consider observations without NA
- \(\hat y_i = \sum R_{ij}y_{ij} / \sum R_{ij}\) with \(R_{ij} = 1\) if \(y_{ij}\) is observed, 0 otherwise
- problems
- throw away most of the data (for many covariates \(q\): \(P(\text{complete case}) = 0.99^q\) gets small)
- remaining data may depend on the missingness pattern >> bias!
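A quick Python sketch of both problems (toy numbers of my own; only the 0.99 per-covariate observation rate comes from the card; assumes numpy):

```python
import numpy as np

y = np.array([1.2, np.nan, 0.7, 2.1, np.nan, 1.5])
R = ~np.isnan(y)                                   # R_i = 1 if y_i is observed, 0 else
print(np.sum(R * np.nan_to_num(y)) / np.sum(R))    # complete-case mean = sum(R*y) / sum(R)

for q in (10, 100, 300):                           # share of complete cases with q covariates
    print(q, 0.99 ** q)                            # P(complete case) = 0.99^q, only ~5% for q = 300
```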
Name the missing-data patterns (3). Explain each briefly (1 each).
- Missing completely at random (MCAR)
- probability of missingness is independent of data
- Missing at random (MAR)
- probability of missingness depends on observed data
- Missing not at random (MNAR)
- probability of missingness depends on the missing data itself (not only on the observed data)
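A small simulation sketch of the three patterns (my own toy setup: y is an income-like variable, x a fully observed covariate; assumes numpy):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
x = rng.normal(size=n)                       # fully observed covariate (e.g. age)
y = 1.0 + 0.5 * x + rng.normal(size=n)       # variable that can be missing (e.g. income)

R_mcar = rng.uniform(size=n) > 0.3                      # P(missing) independent of all data
R_mar  = rng.uniform(size=n) > 1 / (1 + np.exp(-x))     # P(missing) depends on observed x only
R_mnar = rng.uniform(size=n) > 1 / (1 + np.exp(-y))     # P(missing) depends on y itself

print(y.mean(), y[R_mcar].mean(), y[R_mar].mean(), y[R_mnar].mean())
# only under MCAR is the complete-case mean (roughly) unbiased
```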
What is special about MCAR?
- \(P(y_i|R_i = 1) = P(y_i)\)
- >> missingness doesn't influence the estimation
Is complete case analysis applicable for MAR?
Explain the two cases.
- complete case analysis not (always) applicable
- as \(P(y_i|R_i =1) = \frac{P(R_i = 1|y_{iO})}{P(R_i = 1)}P(y_i)\) and the first factor is usually \(\neq 1\)
- cases
- missing target variable: complete case analysis applicable
- as \(P(y|x,z,R_{yi}=1) = P(y|x,z)\) (missingness of \(y\) depends only on the observed \(x, z\))
- missing covariates: correction required
- as \(P(y|x,z,R_{xi}=1) = \frac{P(R_{xi}=1|y,z)}{P(R_{xi}=1|z)}P(y|x,z) \neq P(y|x,z)\) >> biased >> to be corrected
How can the bias be corrected in the case of MAR and complete case analysis?
What is the problem (1)? How to tackle this (2)? What is the result?
How can it be summarised?
- problem: \(s_{cc}(\theta)=\sum R_i~s_i(\theta) \Rightarrow E(s_{cc}(\theta)) \neq 0\)
- correction:
- model \(P(R=1|y_{io}) = \pi_i \Rightarrow \hat\pi_i\)
- correct score: \(s_{cc-corrected}=\sum\frac{R_i}{\hat\pi_i}s_i(\theta)\)
- if \(\hat\pi_i = 0.5\) >> \(s_i(\theta)\) is weighted with 2 in case \(R_i = 1\)
- >> \(E(s_{cc\text{-}corrected}(\theta)) = 0\)
- >> weighted complete case analysis (with P(data is missing))
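A minimal sketch of the weighted complete case idea under MAR (my own toy data; assumes numpy and scikit-learn for the missingness model \(\hat\pi_i\)):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
n = 50_000
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)
R = (rng.uniform(size=n) > 1 / (1 + np.exp(-x))).astype(int)   # 1 = y observed; MAR via x

# model P(R = 1 | x) to obtain pi_i_hat
pi_hat = LogisticRegression().fit(x.reshape(-1, 1), R).predict_proba(x.reshape(-1, 1))[:, 1]

cc_mean  = y[R == 1].mean()                                    # plain complete-case mean (biased)
ipw_mean = np.sum((R / pi_hat) * y) / np.sum(R / pi_hat)       # weight observed cases with 1/pi_hat
print(y.mean(), cc_mean, ipw_mean)                             # the weighted estimate is close to the truth
```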
What is special about MNAR? (2) Give an example.
- results of analysis will be wrong
- no solution how to tackle this
- example: income; values for very high and very low incomes are often NA
EM:
What is the general problem with the likelihood when using all data? (1) What is the first approach? What is the problem?
Name and describe the steps of EM (3).
- use all data (not only complete cases) >> likelihood also depends on unobserved data
- >> observed likelihood: \(l_O(\theta) = \sum \log f(y_{iO};\theta)\)
- problem: \(f(y_{iO}; \theta) = \int f(y_{iO}, y_{iM};\theta)\,dy_{iM}\) may have a complex form
- steps
- E-step (expectation)
- \(Q(\theta;\theta_t)=\sum\int l_i(\theta)~f(y_{iM}|y_{iO};\theta_t)\,dy_{iM}\)
- >> replace missing values by expectation (based on observed values)
- M-step (maximization)
- \(\frac{\partial Q(\theta;\theta_t)}{\partial \theta} = 0 \Rightarrow \theta_{t+1}\)
- find new theta with maximal Q
- iterate until convergence (small changes in the likelihood)
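A minimal EM sketch (my own toy example, a two-component Gaussian mixture, where the unobserved component labels play the role of \(y_{iM}\)): the E-step computes their expected values given the observed data and \(\theta_t\), the M-step maximises \(Q\) in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])   # observed data

def normal_pdf(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

pi, mu1, mu2, s1, s2 = 0.5, -1.0, 1.0, 1.0, 1.0     # starting values theta_0
for _ in range(100):
    # E-step: expected component membership given observed data and theta_t
    w1 = pi * normal_pdf(y, mu1, s1)
    w2 = (1 - pi) * normal_pdf(y, mu2, s2)
    r = w1 / (w1 + w2)
    # M-step: maximise Q(theta; theta_t) -> closed-form updates -> theta_{t+1}
    pi = r.mean()
    mu1, mu2 = np.sum(r * y) / r.sum(), np.sum((1 - r) * y) / (1 - r).sum()
    s1 = np.sqrt(np.sum(r * (y - mu1) ** 2) / r.sum())
    s2 = np.sqrt(np.sum((1 - r) * (y - mu2) ** 2) / (1 - r).sum())

print(pi, mu1, mu2, s1, s2)   # approaches 0.3, -2, 3, 1, 1
```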
Discuss EM (3, 2)
- ++ stable, robust, easy to apply
- -- slow; inference is difficult (estimation of the variance, as \(Var(\hat\theta) = I^{-1}(\theta)\) is not directly available due to \(y_{iM}\))
- >> simulation-based methods required
Sketch why the likelihood keeps increasing during EM (3).
- \(f(y_i;\theta) = f(y_{iM}, y_{iO};\theta)=f(y_{iO};\theta)\,f(y_{iM}|y_{iO};\theta)\)
- \(\Rightarrow l_O(\theta) = l(\theta)-\sum \log f(y_{iM}|y_{iO};\theta)\); taking the expectation w.r.t. \(f(y_{iM}|y_{iO};\theta_t)\):
- \(l_O(\theta) = Q(\theta, \theta_t)-H(\theta, \theta_t)\)
- \(Q\) increases at \(\theta_{t+1}\) (M-step), \(H\) does not increase, as \(H(\theta_t, \theta_t) \ge H(\theta, \theta_t)\) for all \(\theta\)
- because \(KL = H(\theta_t, \theta_t) - H(\theta, \theta_t) \ge 0\)
Multiple Imputation.
Difference to EM (1)
Name the main steps of MI (4)
- EM: \(E(Y_{iM} | Y_{iO}, \theta_t)\) >> modelling the expectation
- MI: \(Y_{iM} | Y_{iO} \sim f(Y_{iM} | Y_{iO}; \theta)\) >> modelling \(Y_{iM}\) given \(Y_{iO}\) (any model)
- steps
- 1) generate \(Y^*_{iM}|Y_{iO} \sim f(Y_{iM}|Y_{iO};\theta)\) and impute NAs (predict Y_iM given Y_iO)
- 2) create K completed datasets with Y*_iM
- 3) compute the estimate \(\hat\theta_k\) for each \(k = 1,\dots,K\)
- 4) combine \(\hat\theta_k\)
- \(\hat\theta_{MI} = \frac{1}{K}\sum_k \hat\theta_k\)
- variance estimation of \(\hat\theta_{MI}\): Rubin's rule
- \(\hat{Var}(\hat\theta_{MI})= \bar V+(1+1/K)B\)
- with \(\bar V = \frac{1}{K}\sum_k \hat V_k\) and \(\hat V_k=I^{-1}(\hat\theta_k)\)
- and \(B = \frac{1}{K-1}\sum_k(\hat\theta_k - \hat\theta_{MI})^2\) = between-imputation variance
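A minimal MI sketch (my own toy setup: y is MAR given a fully observed x, NAs are drawn from a normal regression model for \(Y_{iM}|Y_{iO}\), the K mean estimates are pooled with Rubin's rule; parameter uncertainty of the imputation model is ignored for brevity; assumes numpy):

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 2_000, 20
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)
obs = rng.uniform(size=n) > 1 / (1 + np.exp(-x))         # True = y observed (MAR via x)

X = np.column_stack([np.ones(n), x])
theta_k, var_k = [], []
for _ in range(K):
    b, *_ = np.linalg.lstsq(X[obs], y[obs], rcond=None)  # fit y | x on the observed cases
    sigma = (y[obs] - X[obs] @ b).std(ddof=2)
    y_imp = y.copy()
    y_imp[~obs] = X[~obs] @ b + rng.normal(0, sigma, (~obs).sum())  # 1) draw Y*_iM | Y_iO
    theta_k.append(y_imp.mean())                         # 3) estimate per completed dataset
    var_k.append(y_imp.var(ddof=1) / n)                  #    and its within-imputation variance
theta_MI = np.mean(theta_k)                              # 4) combine
B = np.var(theta_k, ddof=1)                              # between-imputation variance
var_MI = np.mean(var_k) + (1 + 1 / K) * B                # Rubin's rule
print(theta_MI, np.sqrt(var_MI), y.mean())
```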
Explain the tradeoff between quality and quantity.
What is given? (4)
State the final formula.
Give an example.
What can we conclude?
- given
- finite population y1...yN
- g(y) = quantity of interest
- \(\mu_g=E(g(y))\)
- \(\hat\mu_g=\frac{E(R~g(y))}{E(R)}\)
- \((\hat\mu_g - \mu_g) = \rho_{Rg}~\sigma_g~\sqrt{(N-n)/n}\)
- error = data quality (correlation between availability of data and quantity of interest) × variation (of the quantity of interest) × data quantity (\(\sqrt{(N-n)/n}\))
- example: 90% of 1,000,000 observations with a 5% correlation are equivalent to about 3,600 random samples without correlation
- >> quantity does not compensate for quality (important in times of big data)
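Quick numeric check of the example (numbers from the card; the error of a simple random sample of size \(n_{eff}\) is taken as roughly \(\sigma/\sqrt{n_{eff}}\), ignoring the finite-population correction):

```python
import numpy as np

N, n, rho = 1_000_000, 900_000, 0.05
error_factor = rho * np.sqrt((N - n) / n)   # error = rho * sigma * sqrt((N-n)/n)
n_eff = 1 / error_factor ** 2               # SRS size with the same error (sigma cancels)
print(n_eff)                                # 3600 equivalent random samples
```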
Why is the assumption X (e.g. own price) -> Y (sales) often wrong? Explain.
How to tackle this? (3)
- Y depends on X but also on Z (observable, but not available) and U (unobservable)
- \(\int f_{Y|X,Z,U}\,f_{Z,U}~dZ\,dU \neq f_{Y|X}\) (the effect of X alone differs from the observed conditional)
- solutions
- extend the dataset by Z, U (e.g. competitor's price)
- make X independent of Z, U (e.g. own price independent of competitor's price)
- find an instrumental variable which influences X (own price) but is independent of Z, U (e.g. of the competitor's price)
Explain the effect of making X independent of Z,U.
How can this be done?
- \(\int\frac{f_{Y,X,Z,U}}{f_{X|Z,U}}dZdU = \int\frac{f_{Y,X,Z,U}}{f_{X}}dZdU = f_{Y|X}\)
- >> experiment: set X randomly, independently of other influencing effects
What is the main idea of ANOVA? (1)
Name statistics and idea of it (3)
State the F-statistic. (3)
- ANOVA = analysis of variance
- test in an experimental setting (e.g. comparing two versions >> A/B test)
- \(RSS_X\) = residual sum of squares = sum of squared deviations of \(y_{jk}\) from the group mean \(\bar y_k\)
- \(RSS_0\) = sum of squared deviations from the overall mean \(\bar y\)
- idea: if \(RSS_X\) and \(RSS_0\) are close, the grouping explains little variation >> no significant difference/effect
- \(F=\frac{\frac{RSS_0-RSS_X}{K-1}}{\frac{RSS_X}{n-K}}\)
- numerator: explained difference \(RSS_0 - RSS_X\), normed by the number of groups (df \(K-1\))
- denominator: variance within the groups (\(RSS_X\)), df-corrected (\(n-K\))
- >> relates the difference between overall \(RSS_0\) and group \(RSS_X\) to the variance within the groups
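A minimal one-way ANOVA sketch with invented data (assumes numpy and scipy), computing \(RSS_0\), \(RSS_X\) and F by hand and cross-checking with scipy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
groups = [rng.normal(mu, 1.0, 50) for mu in (0.0, 0.2, 0.8)]   # K = 3 groups, 50 obs each
y = np.concatenate(groups)
n, K = len(y), len(groups)

rss_x = sum(((g - g.mean()) ** 2).sum() for g in groups)   # deviations from the group means
rss_0 = ((y - y.mean()) ** 2).sum()                        # deviations from the overall mean
F = ((rss_0 - rss_x) / (K - 1)) / (rss_x / (n - K))
print(F, stats.f.sf(F, K - 1, n - K))                      # F and its p-value
print(stats.f_oneway(*groups))                             # cross-check with scipy
```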
Name elements of the ANOVA table. (6)
- source of error (X or residuals)
- Sum of squares (RSS_0 - RSS_X and RSS_X)
- df (K-1, n-K)
- MSE
- F-statistic
- p-value
Name Bayes' rule.
\(P(A|B) = \frac{P(B|A) ~\cdot~ P(A)}{P(B)}\)
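Tiny numeric illustration (my own numbers, the classic diagnostic-test example):

```python
p_A = 0.01                                    # P(A): prior, e.g. disease prevalence
p_B_given_A = 0.99                            # P(B|A): positive test given disease
p_B = p_B_given_A * p_A + 0.05 * (1 - p_A)    # P(B) via total probability (5% false positives)
print(p_B_given_A * p_A / p_B)                # P(A|B) is only about 0.17
```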
What is a random variable?
- a random variable \(Y\) maps from the event space \(\Omega\) to the real numbers
Define the expected value and variance
\(E(Y) = \int_{-\infty}^{\infty}u ~f(u)~du = \mu\)
\(Var(Y) = \int_{-\infty}^{\infty}(y - \mu)^2~f(y)~dy = E((Y - \mu)^2) = \sigma^2 = E(Y^2) - \mu^2\)
Define the exponential family
\(f(y;\theta) = \exp(t^T(y)~\theta - K(\theta))~h(y)\)
- with t(y) = statistics = function of data
- theta = parameter (vector)
- \(K(\theta)\) = normalisation constant s.t. \(\int f(y;\theta)\,dy = 1\)
- h(y) >= 0, unimportant
- and \(\frac{\partial K(\theta)}{\partial\theta} = E(t(Y))\)
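Worked example (not on the card): the Poisson distribution written in this form.

```latex
f(y;\lambda) = \frac{\lambda^y e^{-\lambda}}{y!}
             = \exp\big(y \log\lambda - \lambda\big)\,\frac{1}{y!},
\quad t(y) = y,\; \theta = \log\lambda,\; K(\theta) = e^{\theta},\; h(y) = \frac{1}{y!},
\quad \frac{\partial K(\theta)}{\partial\theta} = e^{\theta} = \lambda = E(t(Y)).
```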
What is the t-distribution good for?
- statistical test for mean of normal distributed variables
- when variance is unknown (estimated from data)
Define covariance for Y1, Y2.
What about independence? What does this imply?
\(Cov(Y_1, Y_2) = E((Y_1 - E(Y_1))(Y_2 - E(Y_2))) = E(Y_1~Y_2)-E(Y_1)E(Y_2)\)
Cov(Yj, Yk) = 0 if Yj, Yk are independent
- f(yj, yk) = f(yj) * f(yk)
- \(E(Y_j\,Y_k) = E(Y_j)\,E(Y_k)\)
Define correlation.
- Corr(Yj, Yk) = Cov(Yj, Yk) / sqrt(Var(Yj) * Var(Yk))
Name iterated expectation.
\(E(Y) = E_X(E(Y|X))\)
\(Var(Y) = E_X(Var(Y|X)) + Var_X(E(Y|X))\)
What is the idea of central limit theorem?
What is required? (3)
Give an example.
- the (suitably scaled) sum of arbitrarily distributed random variables converges to a normal distribution
- for n -> infinity (asymptotically)
- conditions
- i.i.d.
- mean given
- finite variance
- example: random walk \(Y_n = \sum_i x_i\) with centred steps \(x_i\)
- \(Z_n = Y_n / \sqrt{n} \sim N(0, \sigma^2)\) asymptotically
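A small simulation sketch of the random-walk example (my own parameters; assumes numpy):

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 1_000, 5_000
x = rng.uniform(-1, 1, size=(reps, n))   # i.i.d. steps, mean 0, finite variance 1/3
z = x.sum(axis=1) / np.sqrt(n)           # Z_n = Y_n / sqrt(n)
print(z.mean(), z.var())                 # approx. 0 and sigma^2 = 1/3
```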
What is the moment generating function? What about its k-th derivative?
What is the cumulant generating function? What is special about the first and 2nd derivative? What are restrictions?
- \(M_Y(t) = E(e^{t Y})\)
- \(\frac{\partial^kM_Y(t)}{(\partial t)^k} = E(Y^k) \) evaluated at t = 0
- \(K_Y(t) = log(M_Y(t))\)
- first derivative w.r.t. t at \(t=0\): \(E(Y)\) (first cumulant = mean)
- second derivative w.r.t. t at \(t=0\): \(Var(Y)\) (second cumulant = variance)
- as long as moments are finite
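Worked example (not on the card): for \(Y \sim N(\mu, \sigma^2)\),

```latex
M_Y(t) = \exp\big(\mu t + \tfrac{1}{2}\sigma^2 t^2\big), \qquad
K_Y(t) = \log M_Y(t) = \mu t + \tfrac{1}{2}\sigma^2 t^2, \qquad
K_Y'(0) = \mu = E(Y), \quad K_Y''(0) = \sigma^2 = Var(Y).
```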
What is the general idea of parametric statistical models? (2) Define Posterior and its components.
- y (data) is realisation of Y with Y~F(y; theta) (model)
- theta unknown
- \(f_\theta (\vartheta|y) = \frac{\left(\prod f(y_i; \vartheta)\right)f_\theta(\vartheta)}{\int \left(\prod f(y_i; \vartheta)\right)f_\theta(\vartheta)\,d\vartheta}\)
- posterior = all information about theta
- likelihood: information in data
- prior: knowledge about theta before observing the data
- denominator: normalisation constant (f(y), independent of theta)
Define likelihood and loglikelihood.
Name characteristics. (2)
- \(L(\theta;y) = \prod f(y_i; \theta)\)
- \(l(\theta;y) = \sum log~f(y_i; \theta)\)
Characteristics
- plausibility of parameter-values theta, given data y
- posterior proportional to Likelihood * prior
How can theta be estimated (\(\hat\theta\))? (3, 2 each) Relate them.
- posterior mean estimate
- theta_hat = E(posterior)
- requires (numerical) integration (E(.))
- posterior mode estimate
- \(\hat\theta = \arg\max\) of the posterior
- first derivative required
- ML-estimation
- \(\hat\theta = \arg\max L(\theta; y)\) (set the first derivative of the loglikelihood w.r.t. theta to zero)
- assumes flat/constant/non-informative prior
- posterior mode estimate = ML-estimation if prior is flat
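A minimal sketch comparing the three estimators on a Beta-Binomial toy model (my own numbers):

```python
n, y = 20, 6              # data: 6 successes in 20 Bernoulli trials
a, b = 2, 2               # Beta(a, b) prior on theta

post_a, post_b = a + y, b + n - y                  # posterior is Beta(post_a, post_b)
post_mean = post_a / (post_a + post_b)             # posterior mean estimate
post_mode = (post_a - 1) / (post_a + post_b - 2)   # posterior mode estimate
mle = y / n                                        # ML estimate
print(post_mean, post_mode, mle)
# with a flat prior Beta(1, 1) the posterior mode equals the ML estimate y/n
```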
Explain invariance-property of ML-estimate
- gamma = g(theta), theta = g^-1(gamma)
- gamma_ML = g(theta_ML) >> no reestimation required after transformation
- g(.) bijective transformation function
- also holds for the variance (delta method): \(\hat\gamma - \gamma \sim N\!\left(0,\; \frac{\partial g(\theta)}{\partial\theta}\, I^{-1}(\theta)\, \frac{\partial g(\theta)}{\partial\theta}^T\right)\) asymptotically
Define loss, squared loss. (3)
- \(L\): set of estimates \(t\) × parameter space \(\rightarrow \mathbb{R}^+\)
- squared loss: \(L(t, \theta) = (t-\theta)^2\) >> to be minimised
- L(theta, theta) = 0