SRaI
@LMU
File Details
Flashcards | 125 |
---|---|
Language | English |
Category | Computer Science |
Level | University |
Created / Updated | 04.10.2019 / 11.10.2019 |
Web link | https://card2brain.ch/box/20191004_srai |
What does complete case analysis mean?
What are the problems here? (2)
- only consider observations without NA
- \(\hat y_i = \sum R_{ij}y_{ij} / \sum R_{ij}\) with \(R_{ij} = 1\) if \(y_{ij}\) is observed, 0 otherwise
- problems
- throw away most of the data (for many covariates \(q\): \(P(\text{complete case}) = 0.99^q\) gets small)
- remaining data may depend on the missingness pattern >> bias!
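A quick Python sketch of both problems (toy numbers of my own; only the 0.99 per-covariate observation rate comes from the card; assumes numpy):

```python
import numpy as np

y = np.array([1.2, np.nan, 0.7, 2.1, np.nan, 1.5])
R = ~np.isnan(y)                                   # R_i = 1 if y_i is observed, 0 else
print(np.sum(R * np.nan_to_num(y)) / np.sum(R))    # complete-case mean = sum(R*y) / sum(R)

for q in (10, 100, 300):                           # share of complete cases with q covariates
    print(q, 0.99 ** q)                            # P(complete case) = 0.99^q, only ~5% for q = 300
```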
Name the missing-data patterns (3). Explain each briefly (1 each).
- Missing completely at random (MCAR)
- probability of missingness is independent of data
- Missing at random (MAR)
- probability of missingness depends on observed data
- Missing not at random (MNAR)
- probability of missingness depends on the missing data itself (not only on the observed data)
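A small simulation sketch of the three patterns (my own toy setup: y is an income-like variable, x a fully observed covariate; assumes numpy):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
x = rng.normal(size=n)                       # fully observed covariate (e.g. age)
y = 1.0 + 0.5 * x + rng.normal(size=n)       # variable that can be missing (e.g. income)

R_mcar = rng.uniform(size=n) > 0.3                      # P(missing) independent of all data
R_mar  = rng.uniform(size=n) > 1 / (1 + np.exp(-x))     # P(missing) depends on observed x only
R_mnar = rng.uniform(size=n) > 1 / (1 + np.exp(-y))     # P(missing) depends on y itself

print(y.mean(), y[R_mcar].mean(), y[R_mar].mean(), y[R_mnar].mean())
# only under MCAR is the complete-case mean (roughly) unbiased
```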
What is special about MCAR?
- \(P(y_i|R_i = 1) = P(y_i)\)
- >> missingness doesn't influence the estimation
Is complete case analysis applicable for MAR?
Explain the two cases.
- complete case analysis not (always) applicable
- as \(P(y_i|R_i =1) = \frac{P(R_i = 1|y_{iO})}{P(R_i = 1)}P(y_i)\) and the first factor is usually \(\neq 1\)
- cases
- missing target variable: complete case analysis applicable
- as \(P(y|x,z,R_{yi}=1) = P(y|x,z)\) (missingness of \(y\) depends only on the observed \(x, z\))
- missing covariates: correction required
- as \(P(y|x,z,R_{xi}=1) = \frac{P(R_{xi}=1|y,z)}{P(R_{xi}=1|z)}P(y|x,z) \neq P(y|x,z)\) >> biased >> to be corrected
How can the bias be corrected in the case of MAR and complete case analysis?
What is the problem (1)? How to tackle this (2)? What is the result?
How can it be summarised?
- problem: \(s_{cc}(\theta)=\sum R_i~s_i(\theta) \Rightarrow E(s_{cc}(\theta)) \neq 0\)
- correction:
- model \(P(R=1|y_{io}) = \pi_i \Rightarrow \hat\pi_i\)
- correct score: \(s_{cc-corrected}=\sum\frac{R_i}{\hat\pi_i}s_i(\theta)\)
- if \(\hat\pi_i = 0.5\) >> \(s_i(\theta)\) is weighted with 2 in case \(R_i = 1\)
- >> \(E(s_{cc\text{-}corrected}(\theta)) = 0\)
- >> weighted complete case analysis (with P(data is missing))
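A minimal sketch of the weighted complete case idea under MAR (my own toy data; assumes numpy and scikit-learn for the missingness model \(\hat\pi_i\)):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
n = 50_000
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)
R = (rng.uniform(size=n) > 1 / (1 + np.exp(-x))).astype(int)   # 1 = y observed; MAR via x

# model P(R = 1 | x) to obtain pi_i_hat
pi_hat = LogisticRegression().fit(x.reshape(-1, 1), R).predict_proba(x.reshape(-1, 1))[:, 1]

cc_mean  = y[R == 1].mean()                                    # plain complete-case mean (biased)
ipw_mean = np.sum((R / pi_hat) * y) / np.sum(R / pi_hat)       # weight observed cases with 1/pi_hat
print(y.mean(), cc_mean, ipw_mean)                             # the weighted estimate is close to the truth
```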
What is special about MNAR? (2) Give an example.
- results of analysis will be wrong
- no solution how to tackle this
- example: income; values for very high and very low incomes are often NA
EM:
What is the general problem with the likelihood when using all data? (1) What is the first approach? What is the problem?
Name and describe the steps of EM (3).
- use all data (not only complete cases) >> likelihood also depends on unobserved data
- >> observed likelihood: \(l_O(\theta) = \sum \log f(y_{iO};\theta)\)
- problem: \(f(y_{iO}; \theta) = \int f(y_{iO}, y_{iM};\theta)\,dy_{iM}\) may have a complex form
- steps
- E-step (expectation)
- \(Q(\theta;\theta_t)=\sum\int l_i(\theta)~f(y_{iM}|y_{iO};\theta_t)\,dy_{iM}\)
- >> replace missing values by expectation (based on observed values)
- M-step (maximization)
- \(\frac{\partial Q(\theta;\theta_t)}{\partial \theta} = 0 \Rightarrow \theta_{t+1}\)
- find new theta with maximal Q
- iterate until convergence (small changes in the likelihood)
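A minimal EM sketch (my own toy example, a two-component Gaussian mixture, where the unobserved component labels play the role of \(y_{iM}\)): the E-step computes their expected values given the observed data and \(\theta_t\), the M-step maximises \(Q\) in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])   # observed data

def normal_pdf(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

pi, mu1, mu2, s1, s2 = 0.5, -1.0, 1.0, 1.0, 1.0     # starting values theta_0
for _ in range(100):
    # E-step: expected component membership given observed data and theta_t
    w1 = pi * normal_pdf(y, mu1, s1)
    w2 = (1 - pi) * normal_pdf(y, mu2, s2)
    r = w1 / (w1 + w2)
    # M-step: maximise Q(theta; theta_t) -> closed-form updates -> theta_{t+1}
    pi = r.mean()
    mu1, mu2 = np.sum(r * y) / r.sum(), np.sum((1 - r) * y) / (1 - r).sum()
    s1 = np.sqrt(np.sum(r * (y - mu1) ** 2) / r.sum())
    s2 = np.sqrt(np.sum((1 - r) * (y - mu2) ** 2) / (1 - r).sum())

print(pi, mu1, mu2, s1, s2)   # approaches 0.3, -2, 3, 1, 1
```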
Discuss EM (3, 2)
- ++ stable, robust, easy to apply
- -- slow; inference is difficult (estimation of the variance, as \(Var(\hat\theta) = I^{-1}(\theta)\) is not directly available due to \(y_{iM}\))
- >> simulation-based methods required
Sketch why the likelihood keeps increasing during EM (3).
- \(f(y_i;\theta) = f(y_{iM}, y_{iO};\theta)=f(y_{iO};\theta)\,f(y_{iM}|y_{iO};\theta)\)
- \(\Rightarrow l_O(\theta) = l(\theta)-\sum \log f(y_{iM}|y_{iO};\theta)\); taking the expectation w.r.t. \(f(y_{iM}|y_{iO};\theta_t)\):
- \(l_O(\theta) = Q(\theta, \theta_t)-H(\theta, \theta_t)\)
- \(Q\) increases at \(\theta_{t+1}\) (M-step), \(H\) does not increase, as \(H(\theta_t, \theta_t) \ge H(\theta, \theta_t)\) for all \(\theta\)
- because \(KL = H(\theta_t, \theta_t) - H(\theta, \theta_t) \ge 0\)
Multiple Imputation.
Difference to EM (1)
Name the main steps of MI (4)
- EM: \(E(Y_{iM} | Y_{iO}, \theta_t)\) >> modelling the expectation
- MI: \(Y_{iM} | Y_{iO} \sim f(Y_{iM} | Y_{iO}; \theta)\) >> modelling \(Y_{iM}\) given \(Y_{iO}\) (any model)
- steps
- 1) generate \(Y^*_{iM}|Y_{iO} \sim f(Y_{iM}|Y_{iO};\theta)\) and impute NAs (predict Y_iM given Y_iO)
- 2) create K completed datasets with Y*_iM
- 3) compute the estimate \(\hat\theta_k\) for each \(k = 1,\dots,K\)
- 4) combine \(\hat\theta_k\)
- \(\hat\theta_{MI} = \frac{1}{K}\sum_k \hat\theta_k\)
- variance estimation of \(\hat\theta_{MI}\): Rubin's rule
- \(\hat{Var}(\hat\theta_{MI})= \bar V+(1+1/K)B\)
- with \(\bar V = \frac{1}{K}\sum_k \hat V_k\) and \(\hat V_k=I^{-1}(\hat\theta_k)\)
- and \(B = \frac{1}{K-1}\sum_k(\hat\theta_k - \hat\theta_{MI})^2\) = between-imputation variance
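A minimal MI sketch (my own toy setup: y is MAR given a fully observed x, NAs are drawn from a normal regression model for \(Y_{iM}|Y_{iO}\), the K mean estimates are pooled with Rubin's rule; parameter uncertainty of the imputation model is ignored for brevity; assumes numpy):

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 2_000, 20
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)
obs = rng.uniform(size=n) > 1 / (1 + np.exp(-x))         # True = y observed (MAR via x)

X = np.column_stack([np.ones(n), x])
theta_k, var_k = [], []
for _ in range(K):
    b, *_ = np.linalg.lstsq(X[obs], y[obs], rcond=None)  # fit y | x on the observed cases
    sigma = (y[obs] - X[obs] @ b).std(ddof=2)
    y_imp = y.copy()
    y_imp[~obs] = X[~obs] @ b + rng.normal(0, sigma, (~obs).sum())  # 1) draw Y*_iM | Y_iO
    theta_k.append(y_imp.mean())                         # 3) estimate per completed dataset
    var_k.append(y_imp.var(ddof=1) / n)                  #    and its within-imputation variance
theta_MI = np.mean(theta_k)                              # 4) combine
B = np.var(theta_k, ddof=1)                              # between-imputation variance
var_MI = np.mean(var_k) + (1 + 1 / K) * B                # Rubin's rule
print(theta_MI, np.sqrt(var_MI), y.mean())
```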
Explain the tradeoff between quality and quantity.
What is given? (4)
State the final formula.
Give an example.
What can we conclude?
- given
- finite population y1...yN
- g(y) = quantity of interest
- \(\mu_g=E(g(y))\)
- \(\hat\mu_g=\frac{E(R~g(y))}{E(R)}\)
- \((\hat\mu_g - \mu_g) = \rho_{Rg}~\sigma_g~\sqrt{(N-n)/n}\)
- error = data quality (correlation between availability of data and quantity of interest) × variation (of the quantity of interest) × data quantity (\(\sqrt{(N-n)/n}\))
- example: 90% of 1,000,000 observations with a 5% correlation are equivalent to about 3,600 random samples without correlation
- >> quantity does not compensate for quality (important in times of big data)
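Quick numeric check of the example (numbers from the card; the error of a simple random sample of size \(n_{eff}\) is taken as roughly \(\sigma/\sqrt{n_{eff}}\), ignoring the finite-population correction):

```python
import numpy as np

N, n, rho = 1_000_000, 900_000, 0.05
error_factor = rho * np.sqrt((N - n) / n)   # error = rho * sigma * sqrt((N-n)/n)
n_eff = 1 / error_factor ** 2               # SRS size with the same error (sigma cancels)
print(n_eff)                                # 3600 equivalent random samples
```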
Why is the assumption X (e.g. own price) -> Y (sales) often wrong? Explain.
How to tackle this? (3)
- Y depends on X but also on Z (observable, but not available) and U (unobservable)
- \(\int f_{Y|X,Z,U}\,f_{Z,U}~dZ\,dU \neq f_{Y|X}\) (the effect of X alone differs from the observed conditional)
- solutions
- extend the dataset by Z, U (e.g. competitor's price)
- make X independent of Z, U (e.g. own price independent of competitor's price)
- find an instrumental variable which influences X (own price) but is independent of Z, U (e.g. of the competitor's price)
Explain the effect of making X independent of Z,U.
How can this be done?
- \(\int\frac{f_{Y,X,Z,U}}{f_{X|Z,U}}dZdU = \int\frac{f_{Y,X,Z,U}}{f_{X}}dZdU = f_{Y|X}\)
- >> experiment: set X randomly, independently of other influencing effects
What is the main idea of ANOVA? (1)
Name statistics and idea of it (3)
State the F-statistic. (3)
- ANOVA = analysis of variance
- test in an experimental setting (e.g. comparing two versions >> A/B test)
- \(RSS_X\) = residual sum of squares = sum of squared deviations of \(y_{jk}\) from the group mean \(\bar y_k\)
- \(RSS_0\) = sum of squared deviations from the overall mean \(\bar y\)
- idea: if \(RSS_X\) and \(RSS_0\) are close, the grouping explains little variation >> no significant difference/effect
- \(F=\frac{\frac{RSS_0-RSS_X}{K-1}}{\frac{RSS_X}{n-K}}\)
- numerator: explained difference \(RSS_0 - RSS_X\), normed by the number of groups (df \(K-1\))
- denominator: variance within the groups (\(RSS_X\)), df-corrected (\(n-K\))
- >> relates the difference between overall \(RSS_0\) and group \(RSS_X\) to the variance within the groups
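A minimal one-way ANOVA sketch with invented data (assumes numpy and scipy), computing \(RSS_0\), \(RSS_X\) and F by hand and cross-checking with scipy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
groups = [rng.normal(mu, 1.0, 50) for mu in (0.0, 0.2, 0.8)]   # K = 3 groups, 50 obs each
y = np.concatenate(groups)
n, K = len(y), len(groups)

rss_x = sum(((g - g.mean()) ** 2).sum() for g in groups)   # deviations from the group means
rss_0 = ((y - y.mean()) ** 2).sum()                        # deviations from the overall mean
F = ((rss_0 - rss_x) / (K - 1)) / (rss_x / (n - K))
print(F, stats.f.sf(F, K - 1, n - K))                      # F and its p-value
print(stats.f_oneway(*groups))                             # cross-check with scipy
```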
Name elements of the ANOVA table. (6)
- source of error (X or residuals)
- Sum of squares (RSS_0 - RSS_X and RSS_X)
- df (K-1, n-K)
- MSE
- F-statistic
- p-value
Name Bayes' rule.
\(P(A|B) = \frac{P(B|A) ~\cdot~ P(A)}{P(B)}\)
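Tiny numeric illustration (my own numbers, the classic diagnostic-test example):

```python
p_A = 0.01                                    # P(A): prior, e.g. disease prevalence
p_B_given_A = 0.99                            # P(B|A): positive test given disease
p_B = p_B_given_A * p_A + 0.05 * (1 - p_A)    # P(B) via total probability (5% false positives)
print(p_B_given_A * p_A / p_B)                # P(A|B) is only about 0.17
```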
What is a random variable?
- a random variable \(Y\) maps from the event space \(\Omega\) to the real numbers
Define the expected value and variance
\(E(Y) = \int_{-\infty}^{\infty}u ~f(u)~du = \mu\)
\(Var(Y) = \int_{-\infty}^{\infty}(y - \mu)^2~f(y)~dy = E((Y - \mu)^2) = \sigma^2 = E(Y^2) - \mu^2\)
Define the exponential family
\(f(y;\theta) = \exp(t^T(y)~\theta - K(\theta))~h(y)\)
- with t(y) = statistics = function of data
- theta = parameter (vector)
- \(K(\theta)\) = normalisation constant s.t. \(\int f(y;\theta)\,dy = 1\)
- h(y) >= 0, unimportant
- and \(\frac{\partial K(\theta)}{\partial\theta} = E(t(Y))\)
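Worked example (not on the card): the Poisson distribution written in this form.

```latex
f(y;\lambda) = \frac{\lambda^y e^{-\lambda}}{y!}
             = \exp\big(y \log\lambda - \lambda\big)\,\frac{1}{y!},
\quad t(y) = y,\; \theta = \log\lambda,\; K(\theta) = e^{\theta},\; h(y) = \frac{1}{y!},
\quad \frac{\partial K(\theta)}{\partial\theta} = e^{\theta} = \lambda = E(t(Y)).
```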
What is the t-distribution good for?
- statistical test for mean of normal distributed variables
- when variance is unknown (estimated from data)
Define covariance for Y1, Y2.
What about independence? What does this imply?
\(Cov(Y_1, Y_2) = E((Y_1 - E(Y_1))(Y_2 - E(Y_2))) = E(Y_1~Y_2)-E(Y_1)E(Y_2)\)
Cov(Yj, Yk) = 0 if Yj, Yk are independent
- f(yj, yk) = f(yj) * f(yk)
- \(E(Y_j\,Y_k) = E(Y_j)\,E(Y_k)\)
Define correlation.
- Corr(Yj, Yk) = Cov(Yj, Yk) / sqrt(Var(Yj) * Var(Yk))
Name iterated expectation.
\(E(Y) = E_X(E(Y|X))\)
\(Var(Y) = E_X(Var(Y|X)) + Var_X(E(Y|X))\)
What is the idea of central limit theorem?
What is required? (3)
Give an example.
- the (suitably scaled) sum of arbitrarily distributed random variables converges to a normal distribution
- for n -> infinity (asymptotically)
- conditions
- i.i.d.
- mean given
- finite variance
- example: random walk \(Y_n = \sum_i x_i\) with centred steps \(x_i\)
- \(Z_n = Y_n / \sqrt{n} \sim N(0, \sigma^2)\) asymptotically
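A small simulation sketch of the random-walk example (my own parameters; assumes numpy):

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 1_000, 5_000
x = rng.uniform(-1, 1, size=(reps, n))   # i.i.d. steps, mean 0, finite variance 1/3
z = x.sum(axis=1) / np.sqrt(n)           # Z_n = Y_n / sqrt(n)
print(z.mean(), z.var())                 # approx. 0 and sigma^2 = 1/3
```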
What is the moment generating function? What about its k-th derivative?
What is the cumulant generating function? What is special about the first and 2nd derivative? What are restrictions?
- \(M_Y(t) = E(e^{t Y})\)
- \(\frac{\partial^kM_Y(t)}{(\partial t)^k} = E(Y^k) \) evaluated at t = 0
- \(K_Y(t) = log(M_Y(t))\)
- first derivative w.r.t. t at \(t=0\): \(E(Y)\) (first cumulant = mean)
- second derivative w.r.t. t at \(t=0\): \(Var(Y)\) (second cumulant = variance)
- as long as moments are finite
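Worked example (not on the card): for \(Y \sim N(\mu, \sigma^2)\),

```latex
M_Y(t) = \exp\big(\mu t + \tfrac{1}{2}\sigma^2 t^2\big), \qquad
K_Y(t) = \log M_Y(t) = \mu t + \tfrac{1}{2}\sigma^2 t^2, \qquad
K_Y'(0) = \mu = E(Y), \quad K_Y''(0) = \sigma^2 = Var(Y).
```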
What is the general idea of parametric statistical models? (2) Define Posterior and its components.
- y (data) is realisation of Y with Y~F(y; theta) (model)
- theta unknown
- \(f_\theta (\vartheta|y) = \frac{\left(\prod f(y_i; \vartheta)\right)f_\theta(\vartheta)}{\int \left(\prod f(y_i; \vartheta)\right)f_\theta(\vartheta)\,d\vartheta}\)
- posterior = all information about theta
- likelihood: information in data
- prior: knowledge about theta before observing the data
- denominator: normalisation constant (f(y), independent of theta)
Define likelihood and loglikelihood.
Name characteristics. (2)
- \(L(\theta;y) = \prod f(y_i; \theta)\)
- \(l(\theta;y) = \sum log~f(y_i; \theta)\)
Characteristics
- plausibility of parameter-values theta, given data y
- posterior proportional to Likelihood * prior
How can theta be estimated (\(\hat\theta\))? (3, 2 each) Relate them.
- posterior mean estimate
- theta_hat = E(posterior)
- requires (numerical) integration (E(.))
- posterior mode estimate
- \(\hat\theta = \arg\max\) of the posterior
- first derivative required
- ML-estimation
- \(\hat\theta = \arg\max L(\theta; y)\) (set the first derivative of the loglikelihood w.r.t. theta to zero)
- assumes flat/constant/non-informative prior
- posterior mode estimate = ML-estimation if prior is flat
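A minimal sketch comparing the three estimators on a Beta-Binomial toy model (my own numbers):

```python
n, y = 20, 6              # data: 6 successes in 20 Bernoulli trials
a, b = 2, 2               # Beta(a, b) prior on theta

post_a, post_b = a + y, b + n - y                  # posterior is Beta(post_a, post_b)
post_mean = post_a / (post_a + post_b)             # posterior mean estimate
post_mode = (post_a - 1) / (post_a + post_b - 2)   # posterior mode estimate
mle = y / n                                        # ML estimate
print(post_mean, post_mode, mle)
# with a flat prior Beta(1, 1) the posterior mode equals the ML estimate y/n
```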
Explain invariance-property of ML-estimate
- gamma = g(theta), theta = g^-1(gamma)
- gamma_ML = g(theta_ML) >> no reestimation required after transformation
- g(.) bijective transformation function
- also holds for the variance (delta method): \(\hat\gamma - \gamma \sim N\!\left(0,\; \frac{\partial g(\theta)}{\partial\theta}\, I^{-1}(\theta)\, \frac{\partial g(\theta)}{\partial\theta}^T\right)\) asymptotically
Define loss, squared loss. (3)
- \(L\): set of estimates \(t\) × parameter space \(\rightarrow \mathbb{R}^+\)
- squared loss: \(L(t, \theta) = (t-\theta)^2\) >> to be minimised
- L(theta, theta) = 0