Applied Biostatistics

M Tanadini, ETHZ


Card set details

Cards 21
Language English
Category Mathematics
Level University
Created / Updated 23.09.2021 / 26.04.2022
Licensing Not specified
Web link
https://card2brain.ch/box/20210923_applied_biostatistics

Basics: Introduction to linear models in R

  • Explain the following terms:
    • response variable
    • model
    • error 
    • parameter
    • regression coefficients
    • residuals
  • What is the function of a linear model?
    • what does it represent?
    • how can it be used in practice?

Linear models

Structure

  • response variable (y) = model + error
  • model 
    • intercept term
      • value of y when x = 0
    • slope term
      • slope
      • predictor variable (x)
        • continuous: coefficient usually denoted by beta
        • categorical: coefficients usually denoted by alpha, beta, gamma...
      • multiple predictors can be used in an LM --> all terms (= slope × predictor) are summed, not multiplied, divided or exponentiated!
  • error
    • represents the portion of the response variable that the model fails to explain
  • parameters
    • the intercept (β0) and the slope (β1) are the model's parameters
  • regression coefficients are estimated from the data using the least squares method
    • estimated regression coefficients are denoted with a hat (\(\hat{\beta}_1\))
  • residuals are the differences between observed and predicted values (\(e = y - \hat{y}\))
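
The structure above (response = model + error, with estimated coefficients and residuals) can be sketched in R as follows. The data are simulated for illustration and are an assumption, not the course data; the names x, y, fit and res are hypothetical.

```r
# Simulated data (assumption: stands in for real observations)
set.seed(1)
x <- seq(1, 10, length.out = 50)          # continuous predictor
y <- 2 + 0.5 * x + rnorm(50, sd = 0.3)    # response = model + error

fit <- lm(y ~ x)                          # least squares fit

coef(fit)                                 # estimated intercept and slope
res <- residuals(fit)                     # observed minus predicted values

# residuals are y minus the fitted values
all.equal(unname(res), y - unname(fitted(fit)))
```

The estimated coefficients should land close to the values used in the simulation (2 and 0.5), but not exactly on them, because of the error term.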

Function

  • model function
    • simplified mathematical representation of reality, somewhat like a geographic map --> useful to predict and analyse reality
    • used to extract relevant information from data
    • the model is fitted to observed data to:
      • a) estimate, adjust and improve the model parameter(s)
      • b) quantify uncertainty of the model
    • in practice, models can be used to 
      • test hypotheses: does smoking increase risk of lung cancer?
      • estimate effects: by how much does it increase the risk of having lung cancer?
  • predictor models
    •  

Basics: Linear Regression

  • the sum of the raw errors (signed distances from the observations to the regression line, RL) is 0 for every line through the point of means (\(\bar{x}\), \(\bar{y}\)), i.e. for an infinite number of lines --> squaring makes all values positive, and the minimum of the sum of squared errors (SSE) identifies the best fit
  • variation:
    • explained variation (explained sum of squares, ESS): sum of squared differences between the predicted values on the RL (\(\hat{Y}\)) and the mean (\(\bar{Y}\))
    • unexplained variation (residual sum of squares, RSS): sum of squared differences between the observations (Y) and the predicted values on the RL (\(\hat{Y}\))
    • SST (total sum of squares): ESS + RSS; also written SSR + SSE, where SSR is the regression (explained) and SSE the error (residual) sum of squares
  • R2: measure of how well the model fits the observations
    • proportion of explained to total variation (R2 = ESS/SST = 1 - RSS/SST) --> the larger the explained variation, the higher R2
    • affected by df: if many useless explanatory variables are added, the model does not become better but df is reduced (df = n-k-1). R2 still never decreases, which is misleading --> adjusted R2
  • Adj-R2
    • computed as \(1-(1-R^2)\frac{n-1}{n-k-1}\): an added variable increases adj R2 only if it raises R2 enough to offset the loss of one df. A useless variable barely changes R2, so adj R2 decreases
    • is not bounded between 0 and 1 like R2 (it can be negative) --> less intuitive, but allows comparing models with more/fewer variables
  • Error term
    • e: residual, i.e. the error between an observation and the model fit (RL). The residuals sum to zero when the model includes an intercept, which is why they are squared
    • ε: true error between an observation and the true relationship. Since this true relationship is what the model tries to estimate and is unknown, ε cannot be calculated.
  • degrees of freedom
    • the RL needs at least 3 observations to be estimated meaningfully (with 2 points, R2 is always 1 because the model has no degrees of freedom) --> each observation is an "anchor" for fitting the model. Beyond the two observations required to draw a line, every additional observation adds one degree of freedom
    • depends on the number of explanatory variables (k) --> with each additional variable, the model gains a dimension. To have any degrees of freedom, the model needs more than k+1 observations (otherwise R2 = 1) --> df = n-k-1
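
The sums of squares, R2 and adjusted R2 described above can be checked by hand against what summary() reports. This is a minimal sketch on simulated data (an assumption); the variable names (ess, rss, sst, noise) are hypothetical.

```r
# Simulated data (assumption: for illustration only)
set.seed(2)
n <- 30
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n, sd = 2)

fit <- lm(y ~ x)
k <- 1                                   # number of explanatory variables
y_hat <- fitted(fit)

ess <- sum((y_hat - mean(y))^2)          # explained sum of squares
rss <- sum((y - y_hat)^2)                # residual sum of squares
sst <- sum((y - mean(y))^2)              # total sum of squares

r2     <- ess / sst
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

all.equal(sst, ess + rss)                          # SST = ESS + RSS
all.equal(r2, summary(fit)$r.squared)              # matches R's multiple R-squared
all.equal(adj_r2, summary(fit)$adj.r.squared)      # matches R's adjusted R-squared
df.residual(fit)                                   # n - k - 1

# Adding a pure-noise predictor cannot lower R2, but it costs one df
noise <- rnorm(n)
fit2 <- lm(y ~ x + noise)
c(r2 = summary(fit2)$r.squared, adj_r2 = summary(fit2)$adj.r.squared)
```

Comparing the last line with r2 and adj_r2 shows why adjusted R2 is the safer number when models differ in the number of predictors.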

https://www.youtube.com/watch?v=aq8VU5KLmkY&t=30s

Basics: R Output, Variables

Variables section

Coefficients

  • estimates of the coefficients of the population formula
  • constant/b0: predicted value of the response when all predictors are 0 --> intercept
  • some coefficients are much greater than others and therefore appear more impactful, but their size depends on the range of the variable (0 to 10 vs -1000 to 1000)

St. Error

  • estimated standard deviation of the coefficient estimate, i.e. how much the estimate would typically vary from sample to sample

t-value

  • in contrast to the St. Error, the t-value is standardized and therefore comparable across coefficients
    • coefficient divided by its std. error
  • the larger the absolute t-value, the more significant the variable
    • can also be negative, in which case the relationship with the response is negative
  • P > |t|: p-value for the coefficient
    • H0 is always that bn = 0 --> the p-value is the probability of obtaining an estimate at least this far from 0 by chance if H0 were true
    • essentially indicates whether a variable has a significant impact in the model

95% CI

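  • range of plausible values for the true coefficient: 95% of intervals constructed this way contain it
  • if the interval crosses 0, the variable can potentially have no effect (not significant at the 5% level)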
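
The columns described above can be reproduced by hand in R. This is a sketch on simulated data (an assumption). In R's summary() the p-value column is labelled Pr(>|t|), and the confidence intervals come from confint(); the names tab, t_by_hand and p_by_hand are hypothetical.

```r
# Simulated data (assumption: for illustration only)
set.seed(3)
x <- rnorm(40)
y <- 1 + 0.8 * x + rnorm(40)
fit <- lm(y ~ x)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
tab <- coef(summary(fit))
tab

# t-value = coefficient divided by its standard error
t_by_hand <- tab[, "Estimate"] / tab[, "Std. Error"]
all.equal(t_by_hand, tab[, "t value"])

# two-sided p-value from the t distribution with n - k - 1 df
p_by_hand <- 2 * pt(abs(t_by_hand), df = df.residual(fit), lower.tail = FALSE)
all.equal(p_by_hand, tab[, "Pr(>|t|)"])

# 95% confidence intervals for intercept and slope
ci <- confint(fit, level = 0.95)
ci
```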

Course 1

Interpretation of LRM

  • Give mathematical and biological definition of:
    • intercept
    • slope
    • p-value

Interpretation of LRM

  • intercept:
    • mathematical: if X is 0, Y will be = intercept
    • biologically:
      • in this example the value cannot be negative; it should be 0 or positive to make sense
        • if negative, the biological interpretation becomes nonsensical
      • for a body weight of 0, the "starting" heart weight is (should be) 0 too
      • improving/correcting a nonsensical intercept by reparametrisation
        • subtract a constant value from the predictor, e.g. the mean body weight from all body weights
        • use a log-transformation on the response variable
    • in R:
      • intercept and slope are "estimates" of the true (population) parameters
      • given by lm(formula = response var ~ predictor var, data = d.cats)
  • slope:
    • mathematical: for each one-unit increase in X, Y changes by the slope value
    • biological:
    • in R: intercept and slope are "estimates" given by lm(formula = response var ~ predictor var, data = d.cats)
      • SE of estimate: quantifies the uncertainty of the estimate
      • residual SE: typical size of the residuals, i.e. the variation the model leaves unexplained
      • multiple R-squared: proportion of the variation in the response that is explained by the model --> 0-100% (0 - 1)
        • adjusted R-squared: takes the complexity of (different) models into account to make them more comparable
  • p values
    • dichotomous thinking is bad practice! p-values around 0.05, e.g. 0.049, are weaker evidence than 0.001 --> significance is a grey-scale
    • better to report CIs
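
Centering the predictor, as suggested for a nonsensical intercept, can be sketched as follows. The data are simulated stand-ins for d.cats (an assumption); the column names Bwt and Hwt follow the formulas used in this deck, while Bwt_c, fit_raw and fit_c are hypothetical.

```r
# Simulated stand-in for d.cats (assumption: not the real data)
set.seed(4)
d <- data.frame(Bwt = runif(60, 2, 4))              # body weight
d$Hwt <- -0.4 + 4 * d$Bwt + rnorm(60, sd = 1)       # heart weight

fit_raw <- lm(Hwt ~ Bwt, data = d)                  # intercept = Hwt at Bwt = 0

# Reparametrise: subtract the mean body weight from all body weights
d$Bwt_c <- d$Bwt - mean(d$Bwt)
fit_c <- lm(Hwt ~ Bwt_c, data = d)                  # intercept = Hwt at mean Bwt

coef(fit_raw)
coef(fit_c)

# The slope is unchanged; the new intercept is the mean heart weight
all.equal(unname(coef(fit_raw)["Bwt"]), unname(coef(fit_c)["Bwt_c"]))
all.equal(unname(coef(fit_c)["(Intercept)"]), mean(d$Hwt))
```

After centering, the intercept refers to an animal of average body weight, which has a direct biological reading.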

Course 1

  • What steps are important before running regression models?
    • which possibilities exist?
    • which code is used in such cases?

Prior to any regression model: Visualize data

  • three options to visualize data with more than one predictor
    • different shapes for the levels of the additional predictor
    • different colors for the levels of the additional predictor
    • panelling --> one panel per level, shown side by side
  • scatterplots
    • can be used with different colors or shapes for the additional predictor Sex
    • panelling: the facets argument in ggplot2 allows creating a graph with two panels
      • facet_grid(. ~ Sex)
  • boxplots
    • can be used to compare distributions across factor levels via median and IQR
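
The three visual options can be combined in one plot. This sketch uses simulated data in place of d.cats (an assumption) and only draws the ggplot2 figure if that package is installed; the boxplot uses base R.

```r
# Simulated stand-in for d.cats (assumption: not the real data)
set.seed(5)
d <- data.frame(
  Bwt = runif(40, 2, 4),
  Sex = factor(rep(c("F", "M"), each = 20))
)
d$Hwt <- 4 * d$Bwt + rnorm(40)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # colour and shape distinguish the sexes; facet_grid gives one panel per sex
  p <- ggplot(d, aes(x = Bwt, y = Hwt, colour = Sex, shape = Sex)) +
    geom_point() +
    facet_grid(. ~ Sex)
  print(p)
}

# Boxplot of heart weight by sex (base R)
boxplot(Hwt ~ Sex, data = d)
```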

Basics: LRM basics 

LRM basics

  • what are the intercept and slope, and how are they determined?

Definitions

  • X-axis
    • independent variable
    • predictor
  • Y-axis
    • dependent variable
    • response
  • observations: data points = actual values
  • linear regression line
    • slope can be + or -
    • fits the observations as closely as possible --> least squares method: the sum of squared errors between the estimates on the line and the observations is as small as possible
    • has an intercept (b0) as well as a slope (b1): y = b0 + b1 * x
      • slope b1
        • the line passes through the means of all x and all y values (\(\bar{x}\), \(\bar{y}\))
        • slope determined by the least squares method as \(b_1 = \sum(x-\bar{x})(y-\bar{y})/\sum(x-\bar{x})^2\): for each observation, the deviation of x from the mean of x is multiplied by the deviation of y from the mean of y, and the products are summed. This sum is then divided by the sum of squared deviations of x from its mean.
      • intercept b0
        • can be determined after b1 by solving the formula for b0: \(b_0 = \bar{y} - b_1\bar{x}\)
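
The least squares formulas for slope and intercept can be computed by hand and compared with lm(). The five data points are made up for illustration (an assumption).

```r
# Small made-up data set (assumption: for illustration only)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# slope: sum of products of deviations / sum of squared x deviations
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# intercept: solve y_mean = b0 + b1 * x_mean for b0
b0 <- mean(y) - b1 * mean(x)

fit <- lm(y ~ x)
all.equal(c(b0, b1), unname(coef(fit)))   # hand calculation matches lm()

# the fitted line passes through the point of means
all.equal(b0 + b1 * mean(x), mean(y))
```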

Course 1

Interactions:

  • which possibilities exist?

Interactions, 2 possibilities:

  1. either sex as an additional predictor: lm(Hwt ~ Bwt + Sex, data = d.cats)
    • regression lines are parallel because the model contains a single slope for Bwt that is shared by both sexes
    • only the intercept differs
  2. or sex in a sex-weight interaction: lm(Hwt ~ Bwt * Sex, data = d.cats)
    • both slope and intercept differ between the sexes
    • if the sexes are plotted separately with a fitted line each, differences in slope and intercept are clearly visible (qplot, data inspection prior to GLM)
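
The difference between the two model formulas shows up in the coefficients they estimate. This sketch uses simulated data in place of d.cats (an assumption); the object names fit_add, fit_int and fit_M are hypothetical.

```r
# Simulated stand-in for d.cats (assumption: not the real data)
set.seed(7)
n <- 50
d <- data.frame(
  Bwt = runif(2 * n, 2, 4),
  Sex = factor(rep(c("F", "M"), each = n))
)
d$Hwt <- ifelse(d$Sex == "F", 3 + 2.5 * d$Bwt, -1 + 4.3 * d$Bwt) +
  rnorm(2 * n, sd = 1)

fit_add <- lm(Hwt ~ Bwt + Sex, data = d)   # common slope, sex-specific intercept
fit_int <- lm(Hwt ~ Bwt * Sex, data = d)   # sex-specific slope and intercept

names(coef(fit_add))   # "(Intercept)" "Bwt" "SexM"
names(coef(fit_int))   # "(Intercept)" "Bwt" "SexM" "Bwt:SexM"

# Slopes per sex from the interaction model
slope_F <- coef(fit_int)["Bwt"]
slope_M <- coef(fit_int)["Bwt"] + coef(fit_int)["Bwt:SexM"]

# Same slope as fitting the males on their own
fit_M <- lm(Hwt ~ Bwt, data = subset(d, Sex == "M"))
all.equal(unname(slope_M), unname(coef(fit_M)["Bwt"]))
```

The additive model has only one Bwt coefficient, which is why its two lines must be parallel; the interaction term Bwt:SexM is what lets the slope for males differ.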

Course 1

Treatment contrasts

  • what does it mean?
  • how can it be changed?

Treatment contrasts

  • Setting the correct treatment contrast is important when comparing new to old, or when the reference is a gold standard
    • Historically, the term “treatment contrasts” comes from clinical studies where different treatments are compared
  • R uses treatment contrasts with the first factor level in alphanumerical order as the reference
    • d.cats$Sex <- relevel(d.cats$Sex, ref = "M") sets level "M" as the reference
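
A short sketch of how changing the reference level affects the coefficients. The values are made up for illustration (an assumption); Sex_m is a hypothetical name for the releveled factor.

```r
# Made-up data (assumption: for illustration only)
Sex <- factor(c("F", "M", "M", "F", "M"))
Hwt <- c(9.1, 11.5, 12.0, 9.4, 11.2)

levels(Sex)                        # "F" "M": "F" is the default reference

Sex_m <- relevel(Sex, ref = "M")   # make "M" the reference level
levels(Sex_m)                      # "M" "F"

fit_F <- lm(Hwt ~ Sex)             # coefficient SexM: M compared with F
fit_M <- lm(Hwt ~ Sex_m)           # coefficient Sex_mF: F compared with M

coef(fit_F)
coef(fit_M)

# Same difference between the groups, opposite sign
all.equal(unname(coef(fit_F)[2]), -unname(coef(fit_M)[2]))
```

The fitted values are identical in both parametrisations; only the group that serves as the baseline for the comparison changes.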