Applied Biostatistics
M Tanadini, ETHZ
Basics: Introduction to linear models in R
- Explain the following terms:
- response variable
- model
- error
- parameter
- regression coefficients
- residuals
- What are the linear model's functions?
- what does it represent?
- how can it be used in practice?
Linear models
Structure
- response variable (y) = model + error
- model
- intercept term
- value of y when x = 0
- slope term
- slope (coefficient)
- predictor variable (x)
- continuous predictors usually represented by beta
- categorical predictors usually represented by alpha, beta, gamma...
- multiple predictors can be used in a LM --> all terms (slope x predictor) are summed, never multiplied, divided or exponentiated!
- error
- represents the portion of the response variable that the model fails to explain
- parameters
- intercept (β0) as well as slope (β1) are the model's parameters
- regression coefficients are estimated from the data using the least squares method
- estimated regression coefficients are denoted with a hat (\(\hat{\beta}_1\))
- residuals are the differences between observed and predicted values (res = y - \(\hat{y}\))
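A minimal R sketch of these pieces, assuming a hypothetical data frame d with columns y and x:

```r
# fit a simple linear model: y = beta0 + beta1 * x + error
fit <- lm(y ~ x, data = d)

coef(fit)       # estimated parameters: intercept (beta0-hat) and slope (beta1-hat)
fitted(fit)     # predicted values y-hat
residuals(fit)  # residuals: observed y minus fitted y-hat
```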
Function
- model function
- simplified mathematical representation of reality, somewhat like a geographic map --> useful to predict and analyse reality
- used to extract relevant information from data
- the model is fitted to observed data to:
- a) estimate, adjust and improve the model parameter(s)
- b) quantify uncertainty of the model
- in practice, models can be used to
- test hypotheses: does smoking increase risk of lung cancer?
- estimate effects: by how much does it increase the risk of having lung cancer?
- make predictions (predictive models)
Basics: Linear Regression
- the sum of errors (distances from the observations to the regression line, RL) is always = 0 for infinitely many lines --> squaring creates only positive values, and the minimum of the SSE represents the best fit
- variation:
- explained variation (explained sum of squares, ESS): difference between the mean value (\(\bar{y}\)) and the predicted value on the RL (\(\hat{y}\))
- unexplained variation (residual sum of squares, RSS): difference between the observation and the predicted value on the RL
- SST: the total of ESS and RSS
- R2: measure of the strength of the relationship between observations and model
- proportion of explained vs total variation --> driven by the explained variation: if it is high, R2 is high, otherwise low
- affected by df: if many useless explanatory variables are introduced, the model does not become better but the df are reduced (df = n - k - 1). In this case, R2 will still increase, which is misleading --> adjusted R2
- Adj-R2
- if the added variables are useless, adj. R2 will decrease because R2 barely changes; otherwise it will increase. The formula 1 - (1 - R2)(n - 1)/(n - k - 1) is therefore sensitive to increases in useful k, which also increase R2
- has no lower bound (R2 goes from 0 to 1) --> less intuitive, but allows comparing models with more/fewer variables
- Error term
- e: error between observation and model fit (the fitted regression line); the residuals sum to zero, so they are squared to obtain only positive values
- ε: true error between observation and real relationship between observations. Since this real relationship is the one trying to be estimated, this error cannot be calculated. The true relation is unknown.
- degrees of freedom
- the RL needs at least 3 observations to be determined (with 2 points, R2 is always = 1 because the model has no degrees of freedom) --> each observation is an "anchor" the model is fitted by. Beyond the two observations required to draw a line, every additional observation adds one degree of freedom
- depends on the number of explanatory variables (k) used --> for each k, the model gains an additional dimension. For freedom, the model requires at least two observations in one of its dimensions (otherwise R2 = 1) --> df = n - k - 1
https://www.youtube.com/watch?v=aq8VU5KLmkY&t=30s
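A short sketch of where these quantities appear in R, assuming a hypothetical data frame d with response y and predictors x1, x2:

```r
fit <- lm(y ~ x1 + x2, data = d)

s <- summary(fit)
s$r.squared       # R2: proportion of explained variation
s$adj.r.squared   # adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)
df.residual(fit)  # residual degrees of freedom: n - k - 1
```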
Basics: R Output, Variables
Variables section
Coefficients
- estimates of the coefficients in the population formula
- constant / b0: the value of y when all predictors are 0 --> intercept
- some coefficients will be much greater than others and therefore appear more impactful, but this depends on the range of the variable (0-10 vs -1000 to 1000)
St. Error
- the average expected error of the estimated coefficient
t-value
- in contrast to the St. Err., the t-value is standardized and therefore comparable
- coefficient divided by its std. error
- the higher the absolute t-value, the more significant the variable
- can also be negative; in that case the correlation is negative
- P > |t|: p-value for the coefficient
- H0 is always that bn = 0 --> the p-value assesses how likely an estimate this far from 0 would be by chance if bn really were 0
- essentially indicates whether a variable has a sign. impact in the model
95% CI
- the true coefficient lies somewhere within the 95% CI (with 95% confidence)
- if the CI crosses 0, the variable potentially has no effect
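A sketch of how to read these quantities off an R fit (hypothetical data frame d):

```r
fit <- lm(y ~ x, data = d)

summary(fit)$coefficients  # columns: Estimate, Std. Error, t value, Pr(>|t|)
confint(fit, level = 0.95) # 95% CI per coefficient; crossing 0 suggests no clear effect
```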
Course 1
Interpretation of LRM
- Give mathematical and biological definition of:
- intercept
- slope
- p-value
Interpretation of LRM
- intercept:
- mathematical: if X is 0, Y will be = intercept
- biologically:
- value cannot be negative, should be 0 or positive to make sense
- if negative, the biological interpretation becomes nonsensical
- for a body weight of 0, the "starting" heart weight should be 0 too
- improving/correcting a nonsensical intercept by reparametrisation (see the sketch after this card)
- subtract a constant value from the predictor, e.g. the mean body weight from all body weights
- use a log-transformation on the response variable
- in R:
- intercept and slope are "estimates" of the real data
- given by lm(formula = response ~ predictor, data = d.cats), e.g. lm(Hwt ~ Bwt, data = d.cats)
- slope:
- mathematical: for each unit on X, Y will increase by slope value
- biological:
- in R: intercept and slope are "estimates" given by lm(formula = response ~ predictor, data = d.cats)
- SE of estimate: quantifies the uncertainty (precision) of the estimated coefficient
- residual SE: describes the unexplained variation of the model
- multiple R-squared: how much of the variation in the data is explained by the model --> 0-100% (0 - 1)
- adjusted R-squared: takes the complexity of (different) models into account to make them more comparable
- p values
- dichotomous thinking is bad practice! Values around 0.05, e.g. 0.049, are weaker evidence than 0.001 --> significance is a grey scale
- smarter to display CI
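A sketch of both reparametrisations using the cats data from the MASS package (Bwt = body weight, Hwt = heart weight):

```r
library(MASS)
d.cats <- cats  # columns: Sex, Bwt, Hwt

# centre the predictor: the intercept becomes the expected Hwt at the MEAN body weight
d.cats$Bwt.c <- d.cats$Bwt - mean(d.cats$Bwt)
lm(Hwt ~ Bwt.c, data = d.cats)

# log-transform the response: back-transformed fitted values can no longer be negative
lm(log(Hwt) ~ Bwt, data = d.cats)
```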
Course 1
- What steps important before running regression models?
- which possibilities exist
- which code is used in such cases?
Prior to any regression model: Visualize data
- three options to visualize data with more than one predictor (see the sketch after this card)
- different shapes for the levels of the additional predictor
- different colours for the levels of the additional predictor
- panelling --> one panel per level instead of superimposed plots/lines
- scatterplots
- can be used with different colors or shapes for the additional predictor sex
- panelling: the facets argument in ggplot2 allows creating a graph with two panels
- facet_grid(. ~ Sex)
- boxplots
- can be used to inspect the distribution across factor levels via the IQR
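A sketch of the three visualization options with the MASS cats data (Sex as the additional predictor):

```r
library(ggplot2)
library(MASS)

# colours/shapes for the additional predictor Sex
ggplot(cats, aes(x = Bwt, y = Hwt, colour = Sex, shape = Sex)) +
  geom_point()

# panelling: one panel per sex
ggplot(cats, aes(x = Bwt, y = Hwt)) +
  geom_point() +
  facet_grid(. ~ Sex)

# boxplots to inspect the distribution per factor level
ggplot(cats, aes(x = Sex, y = Hwt)) +
  geom_boxplot()
```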
Basics: LRM basics
LRM basics
- what are the intercept and slope and how are they determined ?
Definitions
- X-axis
- independent variable
- predictor
- Y-Axis
- dependent variable
- observations: data points = actual values
- linear regression line
- the slope can be + or -
- fits the observations as closely as possible --> least squares method: the error between the estimate on the line and the observation is as small as possible over all observations
- has an intercept (b0) as well as a slope (b1): y = b0 + b1 * x
- slope b1
- the line passes through the mean of all x and all y values, (\(\bar{x}\), \(\bar{y}\))
- the slope is determined by the least squares method using \(\hat{\beta}_1 = \sum(x_i-\bar{x})(y_i-\bar{y}) / \sum(x_i-\bar{x})^2\): for each observation, the deviation of x from its mean is multiplied by the deviation of y from its mean and summed; this sum is then divided by the sum of squared deviations of x
- intercept b0
- can be determined after b1 by solving the formula for b0: \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\)
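The least squares formulas can be checked by hand in R; a sketch using the MASS cats data:

```r
library(MASS)
x <- cats$Bwt
y <- cats$Hwt

b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)  # intercept: line passes through (x-bar, y-bar)

c(b0, b1)
coef(lm(y ~ x))  # should match the manual calculation
```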
Course 1
Interactions:
- which possibilities exist?
Interactions, 2 possibilities:
- either sex as additional predictor: lm(Hwt ~ Bwt + Sex, data = d.cats)
- regression lines are parallel because the model without an interaction term fits a single common slope
- intercept differs
- or sex as a sex-weight interaction: lm(Hwt ~ Bwt * Sex, data = d.cats)
- both slope and intercept differ between the sexes
- if plotted separately and models fitted, the differences in slope and intercept are clearly visible (qplot, data inspection prior to the LM)
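A sketch of the two models and a formal comparison (MASS cats data):

```r
library(MASS)

lm.add <- lm(Hwt ~ Bwt + Sex, data = cats)  # parallel lines: common slope, two intercepts
lm.int <- lm(Hwt ~ Bwt * Sex, data = cats)  # sex-specific intercepts AND slopes

anova(lm.add, lm.int)  # does allowing different slopes significantly improve the fit?
```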
Course 1
Treatment contrasts
- what does it mean?
- how can it be changed?
Treatment contrasts
- Setting correct treatment contrast important when comparing new to old, or when the reference is a gold-standard
- Historically, the term “treatment contrasts” comes from clinical studies where different treatments are compared
- R uses treatment contrasts with alpha-numerical ordering (the first level is the reference)
- relevel(factor, ref = "M") to set the level "M" as the reference
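A sketch of changing the reference level with the MASS cats data:

```r
library(MASS)
d.cats <- cats

levels(d.cats$Sex)                           # "F" "M": F is the alpha-numerical reference
d.cats$Sex <- relevel(d.cats$Sex, ref = "M") # make M the reference level
coef(lm(Hwt ~ Sex, data = d.cats))           # intercept = mean Hwt of the M group
```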
Course 2: Testing the effect of categorical variable
- What first step needs to be done?
- How can the results of a LM with categorical variables be interpreted
- what does the intercept represent?
- what needs to be kept in mind regarding R?
Testing the effect of categorical variable
- First do boxplots to visualize data
- indicates that growth rates differ
- shows that the data are more or less symmetric --> mean and median will be close
- IQR: the box contains 50% of the data, the black line is the median, the whiskers extend up to 1.5 times the IQR beyond the box, the rest are outliers
- interpretation of coefficients of LM
- intercept
- called "Estimate" in the summary; due to treatment contrasts, Fagus is chosen as the reference, and the intercept represents its mean growth rate
- the other estimates are the differences from Fagus; all values are negative --> all species grow slower than Fagus
- the species means can be calculated manually with the aggregate function; use this to double-check that coef() is correct
- p-values
- not meaningful for Fagus because the intercept is always compared to 0 --> this only says that the Fagus growth rate is stat. sign. NOT 0 (which makes total sense in biology: trees do grow)
- all species have growth rates differing from Fagus --> this tells nothing about other comparisons, e.g. Quercus vs Picea --> multiple testing
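A sketch, assuming a hypothetical data frame d.trees with columns growth and species (Fagus alphabetically first):

```r
lm.trees <- lm(growth ~ species, data = d.trees)
coef(lm.trees)  # intercept = mean growth of Fagus; other estimates = differences from it

# double-check against the group means computed directly
aggregate(growth ~ species, data = d.trees, FUN = mean)
```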
Course 2: Testing the effect of a categorical variable
- How can the effect of a categorical variable be tested?
- which statistical tests are to be used in this case?
- what is tested against what?
- what is the last step of the analysis?
- how can entire groups be tested against each other?
Testing different categories by ANOVA
- need to address the initial question: do species differ in growth rates?
- necessary to set up model that does not take species into account, so model with species can be compared to the former using an ANOVA
- model with species explains data with less unexplained variance (36 vs 43) at the cost of only 3 additional parameters (the one with no species has 1 degree of freedom, the other model 4, the difference being 3)
- F-test: shows that the addition of 3 parameters reduces the unexplained variability with statistical significance (if the drop from 43 to 36 was achieved by 100 add. parameters, the F-test would have been n.s.)
- ANOVA indicates significant difference between species --> usually enough to say that growth rates differ significantly
- see the contrasts card for subsequent steps on the species factor
- correcting for multiple testing as the last step of the analysis: either Tukey or Bonferroni
- Bonferroni: quite restrictive/conservative because it divides the significance level by the number of tests performed
- Tukey: more complex way of setting the significance level; not to be messed around with if unfamiliar with it
- which one to use: follow your research group's or the target journal's guidelines; most importantly, be fair --> define the comparisons to check before running the test, do not use it as a discovery tool
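A sketch of the model comparison and the two corrections, with the hypothetical d.trees data:

```r
lm.null    <- lm(growth ~ 1, data = d.trees)       # model ignoring species
lm.species <- lm(growth ~ species, data = d.trees) # model with species

anova(lm.null, lm.species)  # F-test: does adding species reduce unexplained variability?

# pairwise comparisons with multiple-testing correction
pairwise.t.test(d.trees$growth, d.trees$species, p.adjust.method = "bonferroni")
TukeyHSD(aov(growth ~ species, data = d.trees))  # Tukey's honest significant differences
```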
Testing groups against each other
- create a vector grouping the different factor levels: the idea is to take the mean growth rate of one group and compare it to the mean growth rate of the other --> done by creating a vector with 1/n (e.g. 1/2) for each level of one group and -1/m for each level of the other group --> the vector sums to 1 - 1 = 0
- similar to weighting
Course 2: Tree Growth
Testing several variables
testing categorical variables against each other
- instead of testing one model against another, several variables within one model may also be of interest, since this allows finding the variables that explain the most (if not, they can still be useful to correct the model, e.g. correcting for experience when testing a gender effect on salary)
- the p-value will indicate whether a variable has a significant impact in the model
- the RSS column shows the drop in unexplained variability
- the p-values will change when a coding error is corrected, because all variables are correlated with each other; if they did not change, the variables would be totally independent
testing categorical and continuous variables
- add the continuous variable to the lm and then check with either the t-test of summary() or drop1()
- the Df for a continuous variable is 1
- in the case of age, the effect is small because p >> 0.05
R code
- using update() to modify linear model
- + to add new predictor variables (or - to drop them)
- . to keep the same terms as in the model passed to update()
- using drop1() to compare models with 1 variable less than the model containing all variables
- repeats anova() for model comparison automatically, always dropping one variable at a time
- watch out for degrees of freedom --> the model output is susceptible to coding errors if a variable is coded as numbers but is not continuous; SiteID is categorical and must be coded accordingly (factor()) and corrected in the model (update(model, . ~ . + SiteID.fac - SiteID))
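A sketch of update() and drop1() with the hypothetical d.trees data (columns growth, species, age, SiteID assumed):

```r
lm.trees.1 <- lm(growth ~ species, data = d.trees)

# add a predictor while keeping everything else ('.') the same
lm.trees.2 <- update(lm.trees.1, . ~ . + age)

drop1(lm.trees.2, test = "F")  # drops each variable in turn and compares models by F-test

# recode a numeric ID as a factor before using it as a predictor
d.trees$SiteID.fac <- factor(d.trees$SiteID)
lm.trees.3 <- update(lm.trees.2, . ~ . + SiteID.fac)
```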
Course 2: Tree Growth
Testing interactions
- the effect of a given variable may only become visible when interactions are allowed --> age
- with lm.trees.4 <- update(lm.trees.3, . ~ . + age) and drop1, one slope for age is created that is the mean slope over all species --> this slope is flat and therefore not significant
- interaction could also be non-linear (often in biology) so this must be kept in mind too
- with lm.trees.5 <- update(lm.trees.4, . ~ . + age:species) and drop1, one slope is created for each species, revealing differences that are reflected by a sig. p-value
- the scatter plots still look non-linear
- R code
- always use drop1 function to check impact of different variables
- using anova() instead can be misleading because it tests variables sequentially, so the order of the variables becomes important, which is nonsensical here
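A sketch of testing the interaction, continuing with the hypothetical lm.trees models from above:

```r
lm.trees.4 <- update(lm.trees.3, . ~ . + age)          # one common slope for age
lm.trees.5 <- update(lm.trees.4, . ~ . + age:species)  # a separate age slope per species

drop1(lm.trees.5, test = "F")  # order-independent, unlike sequential anova()
```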
unfinished: questions
Linear Regression: Output Interpretation
ANOVA section
SS: sum of squares
- each observation's y-value minus mean value, squared and then summed
- indicator of how much variation is in the y variable
- SS low --> most observations lie close to the mean
- SS of Model: explains how much variation the model is covering
- SS of Residuals: uncovered variation the model is not explaining
- MS: mean squares
- SS of residuals or of model divided by the respective df
- df: number of variables used to explain the data
- MSE: mean squared error (uses the SS of the residuals): generally used to estimate how much each observation misses the predicted value on average; if high, the model is bad
- its square root is also called the SER: standard error of the regression
R2: R squared
- indicates how much/well the model explains the data
- proportion (0-1): SS Model / SS Total
F-test
- used to assess whether the independent variables used to fit the model do better than 0 explanatory variables (H0: b1 = b2 = ... = bn = 0)
- uses the mean squares (MS) of the model divided by the MS of the residuals --> comparable in spirit to R2, but based on MS
- if the p-value of the F-test is < 0.05, H0 can be rejected and the model explains at least some of the data
- does not tell you which one of the variables is the most powerful in explaining the data, because it considers all of them at once
- Prob > F: p-value of the F-test
- if < 0.05, reject H0 (b1 = b2 = ... = bn = 0) --> the explanatory variables do explain some of the data
Course 3 (unfinished; continue at 2.5 in the script)
- What are the first steps to take before analyzing data ?
- why is this step important, what can it be used for?
Graphical Analysis:
- Allows to spot mistakes in data
- Allows to understand relation between response and predictor variables
- a smoother can be used to visualize the potential relationship between predictor and response
- Gives hints on which interactions may be relevant --> look at them step by step
- R: use ggplot with one predictor (e.g. site diversity) vs response variable (growth rate)
- often, linearity is assumed when biological relationships are actually non-linear --> use smoothers
- linear models can nevertheless model non-linear relationships --> quadratic effects or polynomials are possible
- depending on how the smoother looks, a linear or non-linear model must be chosen --> this appears to be experience-related
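A sketch of the graphical check, assuming a hypothetical d.trees data frame with a continuous predictor diversity:

```r
library(ggplot2)

ggplot(d.trees, aes(x = diversity, y = growth)) +
  geom_point() +
  geom_smooth()  # the default smoother hints at the shape of the relationship

# a quadratic term lets a *linear* model capture a non-linear relationship
lm(growth ~ diversity + I(diversity^2), data = d.trees)
```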
Course 3
Variables
Response variable
- variable on y-axis
- the variable the model tries to explain
- also called dependent variable
- in y = f(x) y is the dependent or response variable
Control variables
- known to affect the response variable, e.g. species on growth rate or sex on muscle strength
- control variables are kept in model but usually p-values not assessed
Predictor of interest
- is the variable to be assessed for its relationship with the response variable
- p-value is of interest to see if predictor has an impact on response variable
- p-value less important if goal is simply to make a prediction
- on the x-axis
Design variables
- exist because of study design --> measurements at different sites
- must be included in analysis
- should never be tested for significance
Course 8: Poisson Models
- GLM and linear models: how do they differ?
- Poisson models
- which three conditions must be fulfilled for model?
- which mathematical strategies are used to adjust model?
- draw poisson distribution
- how is this done in R?
- visualization
- function to do poisson models
- which three conditions must be fulfilled for model?
GLM
- linear models
- response variable is continuous
- observations follow normal distribution --> inappropriate for count data because
- values must be non-negative integers --> -1 children or 0.83 push-ups are not possible
- generalized linear models (GLM)
- response variable can be count data --> number of children, of tumors, push-ups...
- very much "real life" --> usually events per time or per area
- observations can follow different distributions: poisson, binomial, normal, gamma
- for count data, poisson distribution is assumed
Poisson model
- Conditions for model
- must avoid negative fitted values
- exponentiate the linear predictor to get positive values only
- \(\hat{y} = \exp(\hat{\beta}_0 + \hat{\beta}_1 \cdot x_1)\)
- simulated values must be integers
- counts cannot be negative or decimal --> transformation of the response variable via the natural logarithm (link function)
- variance depends on the mean
- the higher the mean, the greater the expected variance (heavy smoker more variance than non-smoker) --> variance increases linearly with mean
- natural logarithm is used as a link function between response variable y and poisson distribution
- in R
- can be specified in the glm() by family = "poisson"
- display data with plot()
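A minimal sketch with the built-in InsectSprays data (columns count and spray):

```r
glm.insects <- glm(count ~ spray, data = InsectSprays, family = "poisson")
summary(glm.insects)

plot(count ~ spray, data = InsectSprays)  # factor predictor: gives boxplots per spray
```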
Course 8: Poisson Models
Interpreting R output (exercises)
- what are the coefficients
- how to get coefficients (code, fct)
- what are the estimates?
- what do the estimates mean, e.g. 1.06?
- where can overdispersion be identified?
- how does overdispersion affect models?
- significance of predictor variable
- which test to use, and how?
- how to check levels of the predictor?
Interpreting Poisson model in R
- boxplots are used to visualize the data, because the predictor variable is categorical --> groups are independent of each other
- boxplot(count ~ spray, data = InsectSprays)
- or use ggplot: geom_boxplot
- the boxplot usually shows that lower counts have lower variability and higher counts higher variability
- coefficients
- coefficients are on the log scale; exp(intercept) = mean count of the reference level
- coef(glm.insects) --> coefficients of all categories (log scale)
- exp(coef(glm.insects)["(Intercept)"]) --> mean count of the reference (treatment contrast) level
- estimates
- exp(coef(glm.insects)["sprayB"]) # or any other category --> 1.06
- spray B has 1.06 times as many insects as the reference A, i.e. 6% more
- lower values are considered decrease, so 0.4 means -60% compared to reference
- overdispersion
- check the Residual deviance (not the Null deviance!) and compare it to the df --> usually Residual deviance >> df and the model is therefore overdispersed
- when using quasipoisson, the coefficients of quasipoisson and poisson are identical; however, the significance in the summary changes (as does the dispersion parameter)
- testing if predictor variables have a significant effect (in other words: do the different spray types work better/worse?)
- ANOVA with test = "F" (with a quasi model the dispersion is estimated, so an F-test is used rather than a Chi-squared test)
- use the quasi model
- remove the predictor variable and compare to a model with the predictor
- use drop1() if there are multiple variables
- testing levels of the predictor after ANOVA yield sign. results for the predictor
- set up matrix of contrasts to group the sprays of interest
- old vs new; organic vs conventional; ...
- the weights within one group sum to +1 and within the other group to -1, so the whole contrast vector sums to 0
- if a group has 2 levels, each level gets 1/2; if 3 levels, each level gets 1/3, etc.
- glht(): general linear hypothesis testing (from the multcomp package)
- use the quasipoisson model for this function (generally always use quasi)
- see figure
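A sketch of a custom contrast with glht() from the multcomp package; the grouping of sprays A+B vs C+D is purely hypothetical:

```r
library(multcomp)  # provides glht(): general linear hypothesis testing

glm.q <- glm(count ~ spray, data = InsectSprays, family = "quasipoisson")

# weights: +1/2 for each of A and B, -1/2 for each of C and D; the vector sums to 0
K <- rbind("A+B vs C+D" = c(1/2, 1/2, -1/2, -1/2, 0, 0))
summary(glht(glm.q, linfct = mcp(spray = K)))
```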
Course 8: Poisson Models
Overdispersion
- how does variance behave in poisson model?
- how does this compare to the real world?
Overdispersion
- Poisson distribution assumes a variance that is equal to the mean
- variance increases linearly with the mean
- e.g. a mean of two complaints implies a variance of 2; a mean of 10 complaints, a variance of 10
- in R
- "dispersion parameter taken to be 1"
- variance is way different in real-life --> most times greater than assumed in the Poisson Model
- real-life count data very often overdispersed, in very rare cases underdispersed (variance lower than mean)
- overdispersion is the rule, not the exception!
- use quasi-poisson instead of poisson
- Checking for overdispersion (or underdispersion)
- if the distribution were Poisson, the residual deviance and the degrees of freedom in the summary output would be approximately equal
- in R
- use glm(..., family = "quasipoisson") to set up alternative to the poisson model
- negative binomial as other option
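A sketch of the overdispersion check and the quasipoisson alternative (InsectSprays data):

```r
glm.pois <- glm(count ~ spray, data = InsectSprays, family = "poisson")
summary(glm.pois)  # compare 'Residual deviance' to its degrees of freedom

deviance(glm.pois) / df.residual(glm.pois)  # ratio >> 1 suggests overdispersion

glm.quasi <- glm(count ~ spray, data = InsectSprays, family = "quasipoisson")
summary(glm.quasi)  # same coefficients, but standard errors scaled by the dispersion
```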
Course 9: Binomial models
- Binary vs binomial distribution: explain differences
- binary: what types of variable can the predictor variable be?
- binomial: what type of variable are the predictor and response variable?
Binomial vs Binary
- both based on count data --> non-normal distribution
- binary
- usually thought of as yes/no --> the predictor variable can also be discrete
- food intake of an animal on one occasion is discrete and binary: out of 100g, how much was eaten?
- client clicks on advertisement or not
- can be translated to binomial data --> binomial is aggregated binary data
- survival of one patient after treatment (y/n, n=1) --> survival of treatment A (90%, n=15) vs of B (50%, n=15)
- binomial
- data expressed as a proportion: successes/trials
- compliance is binomial
- more than one measurement
- aggregation of patients in different groups
- intake of food on day A, B, C...
- predictor variable is discrete data: countable and finite
- increasing the precision may make the statistics look very powerful (small SD), but this is not to be confused with the number of measurements taken, which is more important!
- the response variable is an integer count of successes out of trials
- can be expressed as a probability --> 0 to 1
Course 9: Binomial models
- Binomial distributions
- how does it behave?
- what problems arise when fitting a LM to binomial data?
- how can they be solved?
Binomial distributions
- variability is highest for mid-range proportions and lowest at the boundaries
- distribution is sigmoidal
- bound to lower and upper bound
- Problems when fitting LM to binomial distribution
- boundaries set by binomial distribution (natural upper limit: number of passports one can have)
- fitted values as well as predictions can lie outside of the discrete range
- distribution issues: sigmoidal vs normal
- the normal distribution is unbounded, whereas the binomial is bounded due to the natural limits set by count data
- link function for binomial data necessary
- logit link function: log(p/(1-p)); its inverse maps fitted values back into the range 0 to 1
- the heterogeneous variability is taken into account --> through the transformation it is "normalized" and a LM can be fitted
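A minimal sketch of a binomial GLM on aggregated (success/trials) data; the data frame d.bin and its numbers are hypothetical:

```r
d.bin <- data.frame(success   = c(12, 7),
                    trials    = c(15, 15),
                    treatment = c("A", "B"))

# binomial GLM with the default logit link
glm.bin <- glm(cbind(success, trials - success) ~ treatment,
               data = d.bin, family = "binomial")

plogis(coef(glm.bin)["(Intercept)"])  # inverse logit: fitted probability for group A
```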
Course 9: binomial models (unfinished)
continue from p. 8
Interpreting binomial models in R