Applied Biostatistics
M Tanadini, ETHZ
Basics: Introduction to linear models in R
- Explain the following terms:
- response variable
- model
- error
- parameter
- regression coefficients
- residuals
- What are the linear model's functions?
- what does it represent?
- how can it be used in practice?
Linear models
Structure
- response variable (y) = model + error
- model
- intercept term
- value of y when x = 0
- slope term
- slope (coefficient)
- predictor variable (x)
- continuous predictors usually represented by beta
- categorical predictors usually represented by alpha, beta, gamma...
- multiple predictors can be used in a LM --> all terms (slope x predictor) are summed, never multiplied, divided or exponentiated!
- error
- represents the portion of the response variable that the model fails to explain
- parameters
- intercept (β0) as well as slope (β1) are the model's parameters
- regression coefficients are estimated from the data using the least squares method
- estimated regression coefficients are denoted with a hat (\(\hat{\beta}_1\))
- residuals are the differences between observed and predicted values (res = y - \(\hat{y}\))
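A minimal R sketch of these pieces, assuming a hypothetical data frame d with columns y and x:

```r
# fit a simple linear model: y = beta0 + beta1 * x + error
fit <- lm(y ~ x, data = d)

coef(fit)       # estimated parameters: intercept (beta0-hat) and slope (beta1-hat)
fitted(fit)     # predicted values y-hat
residuals(fit)  # residuals: observed y minus fitted y-hat
```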
Function
- model function
- simplified mathematical representation of reality, somewhat like a geographic map --> useful to predict and analyse reality
- used to extract relevant information from data
- the model is fitted to observed data to:
- a) estimate, adjust and improve the model parameter(s)
- b) quantify uncertainty of the model
- in practice, models can be used to
- test hypotheses: does smoking increase risk of lung cancer?
- estimate effects: by how much does it increase the risk of having lung cancer?
- make predictions (predictive models)
Basics: Linear Regression
- the sum of errors (distances from the observations to the regression line, RL) is always = 0 for infinitely many lines --> squaring creates only positive values, and the minimum of the SSE represents the best fit
- variation:
- explained variation (explained sum of squares, ESS): difference between the mean value (\(\bar{y}\)) and the predicted value on the RL (\(\hat{y}\))
- unexplained variation (residual sum of squares, RSS): difference between the observation and the predicted value on the RL
- SST: the total of ESS and RSS
- R2: measure of the strength of the relationship between observations and model
- proportion of explained vs total variation --> driven by the explained variation: if it is high, R2 is high, otherwise low
- affected by df: if many useless explanatory variables are introduced, the model does not become better but the df are reduced (df = n - k - 1). In this case, R2 will still increase, which is misleading --> adjusted R2
- Adj-R2
- if the added variables are useless, adj. R2 will decrease because R2 barely changes; otherwise it will increase. The formula 1 - (1 - R2)(n - 1)/(n - k - 1) is therefore sensitive to increases in useful k, which also increase R2
- has no lower bound (R2 goes from 0 to 1) --> less intuitive, but allows comparing models with more/fewer variables
- Error term
- e: error between observation and model fit (the fitted regression line); the residuals sum to zero, so they are squared to obtain only positive values
- ε: true error between observation and real relationship between observations. Since this real relationship is the one trying to be estimated, this error cannot be calculated. The true relation is unknown.
- degrees of freedom
- the RL needs at least 3 observations to be determined (with 2 points, R2 is always = 1 because the model has no degrees of freedom) --> each observation is an "anchor" the model is fitted by. Beyond the two observations required to draw a line, every additional observation adds one degree of freedom
- depends on the number of explanatory variables (k) used --> for each k, the model gains an additional dimension. For freedom, the model requires at least two observations in one of its dimensions (otherwise R2 = 1) --> df = n - k - 1
https://www.youtube.com/watch?v=aq8VU5KLmkY&t=30s
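A short sketch of where these quantities appear in R, assuming a hypothetical data frame d with response y and predictors x1, x2:

```r
fit <- lm(y ~ x1 + x2, data = d)

s <- summary(fit)
s$r.squared       # R2: proportion of explained variation
s$adj.r.squared   # adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)
df.residual(fit)  # residual degrees of freedom: n - k - 1
```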
Basics: R Output, Variables
Variables section
Coefficients
- estimates of the coefficients in the population formula
- constant / b0: the value of y when all predictors are 0 --> intercept
- some coefficients will be much greater than others and therefore appear more impactful, but this depends on the range of the variable (0-10 vs -1000 to 1000)
St. Error
- the average expected error of the estimated coefficient
t-value
- in contrast to the St. Err., the t-value is standardized and therefore comparable
- coefficient divided by its std. error
- the higher the absolute t-value, the more significant the variable
- can also be negative; in that case the correlation is negative
- P > |t|: p-value for the coefficient
- H0 is always that bn = 0 --> the p-value assesses how likely an estimate this far from 0 would be by chance if bn really were 0
- essentially indicates whether a variable has a sign. impact in the model
95% CI
- the true coefficient lies somewhere within the 95% CI (with 95% confidence)
- if the CI crosses 0, the variable potentially has no effect
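A sketch of how to read these quantities off an R fit (hypothetical data frame d):

```r
fit <- lm(y ~ x, data = d)

summary(fit)$coefficients  # columns: Estimate, Std. Error, t value, Pr(>|t|)
confint(fit, level = 0.95) # 95% CI per coefficient; crossing 0 suggests no clear effect
```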
Course 1
Interpretation of LRM
- Give mathematical and biological definition of:
- intercept
- slope
- p-value
Interpretation of LRM
- intercept:
- mathematical: if X is 0, Y will be = intercept
- biologically:
- value cannot be negative, should be 0 or positive to make sense
- if negative, the biological interpretation becomes nonsensical
- for a body weight of 0, the "starting" heart weight should be 0 too
- improving/correcting a nonsensical intercept by reparametrisation (see the sketch after this card)
- subtract a constant value from the predictor, e.g. the mean body weight from all body weights
- use a log-transformation on the response variable
- in R:
- intercept and slope are "estimates" of the real data
- given by lm(formula = response ~ predictor, data = d.cats), e.g. lm(Hwt ~ Bwt, data = d.cats)
- slope:
- mathematical: for each unit on X, Y will increase by slope value
- biological:
- in R: intercept and slope are "estimates" given by lm(formula = response ~ predictor, data = d.cats)
- SE of estimate: quantifies the uncertainty (precision) of the estimated coefficient
- residual SE: describes the unexplained variation of the model
- multiple R-squared: how much of the variation in the data is explained by the model --> 0-100% (0 - 1)
- adjusted R-squared: takes the complexity of (different) models into account to make them more comparable
- p values
- dichotomous thinking is bad practice! Values around 0.05, e.g. 0.049, are weaker evidence than 0.001 --> significance is a grey scale
- smarter to display CI
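A sketch of both reparametrisations using the cats data from the MASS package (Bwt = body weight, Hwt = heart weight):

```r
library(MASS)
d.cats <- cats  # columns: Sex, Bwt, Hwt

# centre the predictor: the intercept becomes the expected Hwt at the MEAN body weight
d.cats$Bwt.c <- d.cats$Bwt - mean(d.cats$Bwt)
lm(Hwt ~ Bwt.c, data = d.cats)

# log-transform the response: back-transformed fitted values can no longer be negative
lm(log(Hwt) ~ Bwt, data = d.cats)
```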
Course 1
- What steps important before running regression models?
- which possibilities exist
- which code is used in such cases?
Prior to any regression model: Visualize data
- three options to visualize data with more than one predictor (see the sketch after this card)
- different shapes for the levels of the additional predictor
- different colours for the levels of the additional predictor
- panelling --> one panel per level instead of superimposed plots/lines
- scatterplots
- can be used with different colors or shapes for the additional predictor sex
- panelling: the facets argument in ggplot2 allows creating a graph with two panels
- facet_grid(. ~ Sex)
- boxplots
- can be used to inspect the distribution across factor levels via the IQR
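A sketch of the three visualization options with the MASS cats data (Sex as the additional predictor):

```r
library(ggplot2)
library(MASS)

# colours/shapes for the additional predictor Sex
ggplot(cats, aes(x = Bwt, y = Hwt, colour = Sex, shape = Sex)) +
  geom_point()

# panelling: one panel per sex
ggplot(cats, aes(x = Bwt, y = Hwt)) +
  geom_point() +
  facet_grid(. ~ Sex)

# boxplots to inspect the distribution per factor level
ggplot(cats, aes(x = Sex, y = Hwt)) +
  geom_boxplot()
```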
Basics: LRM basics
LRM basics
- what are the intercept and slope and how are they determined ?
Definitions
- X-axis
- independent variable
- predictor
- Y-Axis
- dependent variable
- observations: data points = actual values
- linear regression line
- the slope can be + or -
- fits the observations as closely as possible --> least squares method: the error between the estimate on the line and the observation is as small as possible over all observations
- has an intercept (b0) as well as a slope (b1): y = b0 + b1 * x
- slope b1
- the line passes through the mean of all x and all y values, (\(\bar{x}\), \(\bar{y}\))
- the slope is determined by the least squares method using \(\hat{\beta}_1 = \sum(x_i-\bar{x})(y_i-\bar{y}) / \sum(x_i-\bar{x})^2\): for each observation, the deviation of x from its mean is multiplied by the deviation of y from its mean and summed; this sum is then divided by the sum of squared deviations of x
- intercept b0
- can be determined after b1 by solving the formula for b0: \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\)
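The least squares formulas can be checked by hand in R; a sketch using the MASS cats data:

```r
library(MASS)
x <- cats$Bwt
y <- cats$Hwt

b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)  # intercept: line passes through (x-bar, y-bar)

c(b0, b1)
coef(lm(y ~ x))  # should match the manual calculation
```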
Course 1
Interactions:
- which possibilities exist?
Interactions, 2 possibilities:
- either sex as additional predictor: lm(Hwt ~ Bwt + Sex, data = d.cats)
- regression lines are parallel because the model without an interaction term fits a single common slope
- intercept differs
- or sex as a sex-weight interaction: lm(Hwt ~ Bwt * Sex, data = d.cats)
- both slope and intercept differ between the sexes
- if plotted separately and models fitted, the differences in slope and intercept are clearly visible (qplot, data inspection prior to the LM)
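A sketch of the two models and a formal comparison (MASS cats data):

```r
library(MASS)

lm.add <- lm(Hwt ~ Bwt + Sex, data = cats)  # parallel lines: common slope, two intercepts
lm.int <- lm(Hwt ~ Bwt * Sex, data = cats)  # sex-specific intercepts AND slopes

anova(lm.add, lm.int)  # does allowing different slopes significantly improve the fit?
```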
Course 1
Treatment contrasts
- what does it mean?
- how can it be changed?
Treatment contrasts
- Setting correct treatment contrast important when comparing new to old, or when the reference is a gold-standard
- Historically, the term “treatment contrasts” comes from clinical studies where different treatments are compared
- R uses treatment contrasts with alpha-numerical ordering (the first level is the reference)
- relevel(factor, ref = "M") to set the level "M" as the reference
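A sketch of changing the reference level with the MASS cats data:

```r
library(MASS)
d.cats <- cats

levels(d.cats$Sex)                           # "F" "M": F is the alpha-numerical reference
d.cats$Sex <- relevel(d.cats$Sex, ref = "M") # make M the reference level
coef(lm(Hwt ~ Sex, data = d.cats))           # intercept = mean Hwt of the M group
```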
Course 2: Testing the effect of categorical variable
- What first step needs to be done?
- How can the results of a LM with categorical variables be interpreted
- what does the intercept represent?
- what needs to be kept in mind regarding R?
Testing the effect of categorical variable
- First do boxplots to visualize data
- indicates that growth rates differ
- shows that the data are more or less symmetric --> mean and median will be close
- IQR: the box contains 50% of the data, the black line is the median, the whiskers extend up to 1.5 times the IQR beyond the box, the rest are outliers
- interpretation of coefficients of LM
- intercept
- called "Estimate" in the summary; due to treatment contrasts, Fagus is chosen as the reference, and the intercept represents its mean growth rate
- the other estimates are the differences from Fagus; all values are negative --> all species grow slower than Fagus
- the species means can be calculated manually with the aggregate function; use this to double-check that coef() is correct
- p-values
- not meaningful for Fagus because the intercept is always compared to 0 --> this only says that the Fagus growth rate is stat. sign. NOT 0 (which makes total sense in biology: trees do grow)
- all species have growth rates differing from Fagus --> this tells nothing about other comparisons, e.g. Quercus vs Picea --> multiple testing
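A sketch, assuming a hypothetical data frame d.trees with columns growth and species (Fagus alphabetically first):

```r
lm.trees <- lm(growth ~ species, data = d.trees)
coef(lm.trees)  # intercept = mean growth of Fagus; other estimates = differences from it

# double-check against the group means computed directly
aggregate(growth ~ species, data = d.trees, FUN = mean)
```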
Course 2: Testing the effect of a categorical variable
- How can the effect of a categorical variable be tested?
- which statistical tests are to be used in this case?
- what is tested against what?
- what is the last step of the analysis?
- how can entire groups be tested against each other?
Testing different categories by ANOVA
- need to address the initial question: do species differ in growth rates?
- necessary to set up model that does not take species into account, so model with species can be compared to the former using an ANOVA
- model with species explains data with less unexplained variance (36 vs 43) at the cost of only 3 additional parameters (the one with no species has 1 degree of freedom, the other model 4, the difference being 3)
- F-test: shows that the addition of 3 parameters reduces the unexplained variability with statistical significance (if the drop from 43 to 36 was achieved by 100 add. parameters, the F-test would have been n.s.)
- ANOVA indicates significant difference between species --> usually enough to say that growth rates differ significantly
- see the contrasts card for subsequent steps on the species factor
- correcting for multiple testing as the last step of the analysis: either Tukey or Bonferroni
- Bonferroni: quite restrictive/conservative because it divides the significance level by the number of tests performed
- Tukey: more complex way of setting the significance level; not to be messed around with if unfamiliar with it
- which one to use: follow your research group's or the target journal's guidelines; most importantly, be fair --> define the comparisons to check before running the test, do not use it as a discovery tool
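A sketch of the model comparison and the two corrections, with the hypothetical d.trees data:

```r
lm.null    <- lm(growth ~ 1, data = d.trees)       # model ignoring species
lm.species <- lm(growth ~ species, data = d.trees) # model with species

anova(lm.null, lm.species)  # F-test: does adding species reduce unexplained variability?

# pairwise comparisons with multiple-testing correction
pairwise.t.test(d.trees$growth, d.trees$species, p.adjust.method = "bonferroni")
TukeyHSD(aov(growth ~ species, data = d.trees))  # Tukey's honest significant differences
```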
Testing groups against each other
- create a vector grouping the different factor levels: the idea is to take the mean growth rate of one group and compare it to the mean growth rate of the other --> done by creating a vector with 1/n (e.g. 1/2) for each level of one group and -1/m for each level of the other group --> the vector sums to 1 - 1 = 0
- similar to weighting
Course 2: Tree Growth
Testing several variables
testing categorical variables against each other
- instead of testing one model against another, several variables within one model may also be of interest, since this allows finding the variables that explain the most (if not, they can still be useful to correct the model, e.g. correcting for experience when testing a gender effect on salary)
- the p-value will indicate whether a variable has a significant impact in the model
- the RSS column shows the drop in unexplained variability
- the p-values will change when a coding error is corrected, because all variables are correlated with each other; if they did not change, the variables would be totally independent
testing categorical and continuous variables
- add the continuous variable to the lm and then check with either the t-test of summary() or drop1()
- the Df for a continuous variable is 1
- in the case of age, the effect is small because p >> 0.05
R code
- using update() to modify linear model
- + to add new predictor variables (or - to drop them)
- . to keep the same terms as in the model passed to update()
- using drop1() to compare models with 1 variable less than the model containing all variables
- repeats anova() for model comparison automatically, always dropping one variable at a time
- watch out for degrees of freedom --> the model output is susceptible to coding errors if a variable is coded as numbers but is not continuous; SiteID is categorical and must be coded accordingly (factor()) and corrected in the model (update(model, . ~ . + SiteID.fac - SiteID))
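A sketch of update() and drop1() with the hypothetical d.trees data (columns growth, species, age, SiteID assumed):

```r
lm.trees.1 <- lm(growth ~ species, data = d.trees)

# add a predictor while keeping everything else ('.') the same
lm.trees.2 <- update(lm.trees.1, . ~ . + age)

drop1(lm.trees.2, test = "F")  # drops each variable in turn and compares models by F-test

# recode a numeric ID as a factor before using it as a predictor
d.trees$SiteID.fac <- factor(d.trees$SiteID)
lm.trees.3 <- update(lm.trees.2, . ~ . + SiteID.fac)
```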
Course 2: Tree Growth
Testing interactions
- the effect of a given variable may only become visible when interactions are allowed --> age
- with lm.trees.4 <- update(lm.trees.3, . ~ . + age) and drop1, one slope for age is created that is the mean slope over all species --> this slope is flat and therefore not significant
- interaction could also be non-linear (often in biology) so this must be kept in mind too
- with lm.trees.5 <- update(lm.trees.4, . ~ . + age:species) and drop1, one slope is created for each species, revealing differences that are reflected by a sig. p-value
- the scatter plots still look non-linear
- R code
- always use drop1 function to check impact of different variables
- using anova() instead can be misleading because it tests variables sequentially, so the order of the variables becomes important, which is nonsensical here
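A sketch of testing the interaction, continuing with the hypothetical lm.trees models from above:

```r
lm.trees.4 <- update(lm.trees.3, . ~ . + age)          # one common slope for age
lm.trees.5 <- update(lm.trees.4, . ~ . + age:species)  # a separate age slope per species

drop1(lm.trees.5, test = "F")  # order-independent, unlike sequential anova()
```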
unfinished: questions
Linear Regression: Output Interpretation
ANOVA section
SS: sum of squares
- each observation's y-value minus mean value, squared and then summed
- indicator of how much variation is in the y variable
- SS low --> most observations lie close to the mean
- SS of Model: explains how much variation the model is covering
- SS of Residuals: uncovered variation the model is not explaining
- MS: mean squares
- SS of residuals or of model divided by the respective df
- df: number of variables used to explain the data
- MSE: mean squared error (uses the SS of the residuals): generally used to estimate how much each observation misses the predicted value on average; if high, the model is bad
- its square root is also called the SER: standard error of the regression
R2: R squared
- indicates how much/well the model explains the data
- proportion (0-1): SS Model / SS Total
F-test
- used to assess whether the independent variables used to fit the model do better than 0 explanatory variables (H0: b1 = b2 = ... = bn = 0)
- uses the mean squares (MS) of the model divided by the MS of the residuals --> comparable in spirit to R2, but based on MS
- if the p-value of the F-test is < 0.05, H0 can be rejected and the model explains at least some of the data
- does not tell you which one of the variables is the most powerful in explaining the data, because it considers all of them at once
- Prob > F: p-value of the F-test
- if < 0.05, reject H0 (b1 = b2 = ... = bn = 0) --> the explanatory variables do explain some of the data
Course 3 (unfinished; continue at 2.5 in the script)
- What are the first steps to take before analyzing data ?
- why is this step important, what can it be used for?
Graphical Analysis:
- Allows to spot mistakes in data
- Allows to understand relation between response and predictor variables
- a smoother can be used to visualize the potential relationship between predictor and response
- Gives hints on which interactions may be relevant --> look at them step by step
- R: use ggplot with one predictor (e.g. site diversity) vs response variable (growth rate)
- often, linearity is assumed when biological relationships are actually non-linear --> use smoothers
- linear models can nevertheless model non-linear relationships --> quadratic effects or polynomials are possible
- depending on how the smoother looks, a linear or non-linear model must be chosen --> this appears to be experience-related
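A sketch of the graphical check, assuming a hypothetical d.trees data frame with a continuous predictor diversity:

```r
library(ggplot2)

ggplot(d.trees, aes(x = diversity, y = growth)) +
  geom_point() +
  geom_smooth()  # the default smoother hints at the shape of the relationship

# a quadratic term lets a *linear* model capture a non-linear relationship
lm(growth ~ diversity + I(diversity^2), data = d.trees)
```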
Course 3
Variables
Response variable
- variable on y-axis
- the variable the model tries to explain
- also called dependent variable
- in y = f(x) y is the dependent or response variable
Control variables
- known to affect the response variable, e.g. species on growth rate or sex on muscle strength
- control variables are kept in model but usually p-values not assessed
Predictor of interest
- is the variable to be assessed for its relationship with the response variable
- p-value is of interest to see if predictor has an impact on response variable
- p-value less important if goal is simply to make a prediction
- on the x-axis
Design variables
- exist because of study design --> measurements at different sites
- must be included in analysis
- should never be tested for significance
Course 8: Poisson Models
- GLM and linear models: how do they differ?
- Poisson models
- which three conditions must be fulfilled for model?
- which mathematical strategies are used to adjust model?
- draw poisson distribution
- how is this done in R?
- visualization
- function to do poisson models
- which three conditions must be fulfilled for model?
GLM
- linear models
- response variable is continuous
- observations follow normal distribution --> inappropriate for count data because
- values must be non-negative integers --> -1 children or 0.83 push-ups are not possible
- generalized linear models (GLM)
- response variable can be count data --> number of children, of tumors, push-ups...
- very much "real life" --> usually events per time or per area
- observations can follow different distributions: poisson, binomial, normal, gamma
- for count data, poisson distribution is assumed
Poisson model
- Conditions for model
- must avoid negative fitted values
- exponentiate the linear predictor to get positive values only
- \(\hat{y} = \exp(\hat{\beta}_0 + \hat{\beta}_1 \cdot x_1)\)
- simulated values must be integers
- counts cannot be negative or decimal --> transformation of the response variable via the natural logarithm (link function)
- variance depends on the mean
- the higher the mean, the greater the expected variance (heavy smoker more variance than non-smoker) --> variance increases linearly with mean
- natural logarithm is used as a link function between response variable y and poisson distribution
- in R
- can be specified in the glm() by family = "poisson"
- display data with plot()
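A minimal sketch with the built-in InsectSprays data (columns count and spray):

```r
glm.insects <- glm(count ~ spray, data = InsectSprays, family = "poisson")
summary(glm.insects)

plot(count ~ spray, data = InsectSprays)  # factor predictor: gives boxplots per spray
```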
Course 8: Poisson Models
Interpreting R output (exercises)
- what are the coefficients
- how to get coefficients (code, fct)
- what are the estimates?
- what do the estimates mean, e.g. 1.06?
- where can overdispersion be identified?
- how does overdispersion affect models?
- significance of predictor variable
- which test to use, and how?
- how to check levels of the predictor?
Interpreting Poisson model in R
- boxplots are used to visualize the data, because the predictor variable is categorical --> groups are independent of each other
- boxplot(count ~ spray, data = InsectSprays)
- or use ggplot: geom_boxplot
- the boxplot usually shows that lower counts have lower variability and higher counts higher variability
- coefficients
- coefficients are on the log scale; exp(intercept) = mean count of the reference level
- coef(glm.insects) --> coefficients of all categories (log scale)
- exp(coef(glm.insects)["(Intercept)"]) --> mean count of the reference (treatment contrast) level
- estimates
- exp(coef(glm.insects)["sprayB"]) # or any other category --> 1.06
- spray B has 1.06 times as many insects as the reference A, i.e. 6% more
- lower values are considered decrease, so 0.4 means -60% compared to reference
- overdispersion
- check the Residual deviance (not the Null deviance!) and compare it to the df --> usually Residual deviance >> df and the model is therefore overdispersed
- when using quasipoisson, the coefficients of quasipoisson and poisson are identical; however, the significance in the summary changes (as does the dispersion parameter)
- testing if predictor variables have a significant effect (in other words: do the different spray types work better/worse?)
- ANOVA with test = "F" (with a quasi model the dispersion is estimated, so an F-test is used rather than a Chi-squared test)
- use the quasi model
- remove the predictor variable and compare to a model with the predictor
- use drop1() if there are multiple variables
- testing levels of the predictor after ANOVA yield sign. results for the predictor
- set up matrix of contrasts to group the sprays of interest
- old vs new; organic vs conventional; ...
- the weights within one group sum to +1 and within the other group to -1, so the whole contrast vector sums to 0
- if a group has 2 levels, each level gets 1/2; if 3 levels, each level gets 1/3, etc.
- glht(): general linear hypothesis testing (from the multcomp package)
- use the quasipoisson model for this function (generally always use quasi)
- see figure
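A sketch of a custom contrast with glht() from the multcomp package; the grouping of sprays A+B vs C+D is purely hypothetical:

```r
library(multcomp)  # provides glht(): general linear hypothesis testing

glm.q <- glm(count ~ spray, data = InsectSprays, family = "quasipoisson")

# weights: +1/2 for each of A and B, -1/2 for each of C and D; the vector sums to 0
K <- rbind("A+B vs C+D" = c(1/2, 1/2, -1/2, -1/2, 0, 0))
summary(glht(glm.q, linfct = mcp(spray = K)))
```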
Course 8: Poisson Models
Overdispersion
- how does variance behave in poisson model?
- how does this compare to the real world?
Overdispersion
- Poisson distribution assumes a variance that is equal to the mean
- variance increases linearly with the mean
- e.g. a mean of two complaints implies a variance of 2; a mean of 10 complaints, a variance of 10
- in R
- "dispersion parameter taken to be 1"
- variance is way different in real-life --> most times greater than assumed in the Poisson Model
- real-life count data very often overdispersed, in very rare cases underdispersed (variance lower than mean)
- overdispersion is the rule, not the exception!
- use quasi-poisson instead of poisson
- Checking for overdispersion (or underdispersion)
- if the distribution were Poisson, the residual deviance and the degrees of freedom in the summary output would be approximately equal
- in R
- use glm(..., family = "quasipoisson") to set up alternative to the poisson model
- negative binomial as other option
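A sketch of the overdispersion check and the quasipoisson alternative (InsectSprays data):

```r
glm.pois <- glm(count ~ spray, data = InsectSprays, family = "poisson")
summary(glm.pois)  # compare 'Residual deviance' to its degrees of freedom

deviance(glm.pois) / df.residual(glm.pois)  # ratio >> 1 suggests overdispersion

glm.quasi <- glm(count ~ spray, data = InsectSprays, family = "quasipoisson")
summary(glm.quasi)  # same coefficients, but standard errors scaled by the dispersion
```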
Course 9: Binomial models
- Binary vs binomial distribution: explain differences
- binary: what types of variable can the predictor variable be?
- binomial: what type of variable are the predictor and response variable?
Binomial vs Binary
- both based on count data --> non-normal distribution
- binary
- usually thought of as yes/no --> the predictor variable can also be discrete
- food intake of an animal on one occasion is discrete and binary: out of 100g, how much was eaten?
- client clicks on advertisement or not
- can be translated to binomial data --> binomial is aggregated binary data
- survival of one patient after treatment (y/n, n=1) --> survival of treatment A (90%, n=15) vs of B (50%, n=15)
- binomial
- data expressed as a proportion: successes/trials
- compliance is binomial
- more than one measurement
- aggregation of patients in different groups
- intake of food on day A, B, C...
- predictor variable is discrete data: countable and finite
- increasing the precision may make the statistics look very powerful (small SD), but this is not to be confused with the number of measurements taken, which is more important!
- the response variable is an integer count of successes out of trials
- can be expressed as a probability --> 0 to 1
Course 9: Binomial models
- Binomial distributions
- how does it behave?
- what problems arise when fitting a LM to binomial data?
- how can they be solved?
Binomial distributions
- variability is highest for mid-range proportions and lowest at the boundaries
- distribution is sigmoidal
- bound to lower and upper bound
- Problems when fitting LM to binomial distribution
- boundaries set by binomial distribution (natural upper limit: number of passports one can have)
- fitted values as well as predictions can lie outside of the discrete range
- distribution issues: sigmoidal vs normal
- the normal distribution is unbounded, whereas the binomial is bounded due to the natural limits set by count data
- link function for binomial data necessary
- logit link function: log(p/(1-p)); its inverse maps fitted values back into the range 0 to 1
- the heterogeneous variability is taken into account --> through the transformation it is "normalized" and a LM can be fitted
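A minimal sketch of a binomial GLM on aggregated (success/trials) data; the data frame d.bin and its numbers are hypothetical:

```r
d.bin <- data.frame(success   = c(12, 7),
                    trials    = c(15, 15),
                    treatment = c("A", "B"))

# binomial GLM with the default logit link
glm.bin <- glm(cbind(success, trials - success) ~ treatment,
               data = d.bin, family = "binomial")

plogis(coef(glm.bin)["(Intercept)"])  # inverse logit: fitted probability for group A
```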
Course 9: binomial models (unfinished)
continue from p. 8
Interpreting binomial models in R