Applied Biostatistics
M Tanadini, ETHZ
Card set details
Cards | 21 |
---|---|
Language | English |
Category | Mathematics |
Level | University |
Created / Updated | 23.09.2021 / 26.04.2022 |
License | Not specified |
Basics: Introduction to linear models in R
- Explain the following terms:
- response variable
- model
- error
- parameter
- regression coefficients
- residuals
- What is the function of a linear model?
- what does it represent?
- how can it be used in practice?
Linear models
Structure
- response variable (y) = model + error
- model
- intercept term
- value of y when x = 0
- slope term
- slope
- predictor variable (x)
- for a continuous predictor, the parameter is usually denoted by beta
- for a categorical predictor, the parameters are usually denoted by alpha, beta, gamma...
- multiple predictors can be used in an LM --> all terms (slope × predictor) are summed, not multiplied, divided or exponentiated!
- error
- represents the portion of the response variable that the model fails to explain
- parameters
- the intercept (β0) and the slope (β1) are the model's parameters
- regression coefficients are estimated from the data using the least squares method
- estimated regression coefficients are denoted with a hat (\(\hat{\beta}_1\))
- residuals are the differences between observed and predicted values (\(e = y - \hat{y}\))
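The structure above can be sketched in R. This is a minimal illustration that assumes d.cats corresponds to the cats data set shipped with the MASS package (columns Bwt, Hwt, Sex); the cards do not say where d.cats comes from.

```r
# Assumption: d.cats is the cats data from the MASS package
data(cats, package = "MASS")
d.cats <- cats

# response (Hwt) = intercept + slope * predictor (Bwt) + error
fit <- lm(Hwt ~ Bwt, data = d.cats)

# Estimated regression coefficients (the "hat" values)
coef(fit)

# Residuals: observed minus predicted values
y_hat <- fitted(fit)
res <- d.cats$Hwt - y_hat

# The hand-computed residuals match the ones stored in the model object
all.equal(unname(res), unname(residuals(fit)))
```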
Function
- model function
- simplified mathematical representation of reality, somewhat like a geographic map --> useful to predict and analyse reality
- used to extract relevant information from data
- the model is fitted to observed data to:
- a) estimate, adjust and improve the model parameter(s)
- b) quantify uncertainty of the model
- in practice, models can be used to
- test hypotheses: does smoking increase risk of lung cancer?
- estimate effects: by how much does it increase the risk of having lung cancer?
- make predictions
Basics: Linear Regression
- the sum of the raw errors (signed distances from the observations to the regression line, RL) equals 0 for every line through the point of means (\(\bar{x}, \bar{y}\)), i.e. for an infinite number of lines --> squaring makes all errors positive, and the line with the minimum sum of squared errors (SSE) is the best fit
- variation:
- explained variation (explained sum of squares, ESS): sum of squared differences between the predicted values on the RL (\(\hat{y}\)) and the mean (\(\bar{y}\))
- unexplained variation (residual sum of squares, RSS): sum of squared differences between the observations (y) and the predicted values on the RL (\(\hat{y}\))
- SST (total sum of squares): ESS + RSS, also written SSR + SSE (regression and error sums of squares)
- R2: measure of the strength of the relationship between observations and model
- proportion of explained to total variation (R2 = ESS/SST = 1 - RSS/SST) --> the higher the explained variation, the higher R2
- affected by the number of predictors: if many useless explanatory variables are added, the model does not become better and df is reduced (df = n-k-1), yet R2 never decreases and usually increases slightly, which is misleading --> adjusted R2
- Adj-R2
- formula: \(1-(1-R^2)\frac{n-1}{n-k-1}\)
- if the added variables are useless, adj R2 decreases because R2 barely changes while k grows; if they are useful, R2 rises enough to outweigh the penalty and adj R2 increases
- can become negative (R2 itself lies between 0 and 1) --> less intuitive, but allows comparing models with more or fewer variables
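The sums of squares, R2 and adjusted R2 can be computed by hand and compared with what summary() reports. The sketch assumes the cats data from the MASS package; the column noise is a made-up, purely random predictor added only to show that R2 does not drop when a useless variable enters the model.

```r
# Assumption: cats data from the MASS package
data(cats, package = "MASS")
fit <- lm(Hwt ~ Bwt, data = cats)

y <- cats$Hwt
y_hat <- fitted(fit)

ESS <- sum((y_hat - mean(y))^2)  # explained variation
RSS <- sum((y - y_hat)^2)        # unexplained variation
SST <- sum((y - mean(y))^2)      # total variation

n <- length(y)
k <- 1                           # number of predictors
R2 <- ESS / SST
adjR2 <- 1 - (1 - R2) * (n - 1) / (n - k - 1)

# Hypothetical useless predictor: pure random noise
set.seed(1)
cats$noise <- rnorm(n)
fit_noise <- lm(Hwt ~ Bwt + noise, data = cats)

# Compare R2 and adjusted R2 of both models
c(R2 = R2, adjR2 = adjR2)
c(R2 = summary(fit_noise)$r.squared, adjR2 = summary(fit_noise)$adj.r.squared)
```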
- Error term
- e: residual, the difference between an observation and the model fit (LR); the residuals sum to zero, and least squares minimises their squared sum
- ε: true error between an observation and the real relationship. Since this real relationship is what the model tries to estimate, it is unknown and this error cannot be calculated.
- degrees of freedom
- the RL needs at least 3 observations to be estimated meaningfully (with 2 points, R2 is always 1 because the model has no degrees of freedom) --> each observation is an "anchor" the model is fitted to. Beyond the two observations required to draw a line, every additional observation adds one degree of freedom
- depends on the number of explanatory variables (k) used --> with each k, the model gains a dimension and needs one more observation before any freedom is left (otherwise R2 = 1) --> df = n-k-1
https://www.youtube.com/watch?v=aq8VU5KLmkY&t=30s
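The degrees-of-freedom rule df = n-k-1 can be checked in R. Both data sets in this sketch are invented for illustration: two hand-picked points, and a small simulated sample with two predictors.

```r
# Two observations, one predictor: n - k - 1 = 2 - 1 - 1 = 0 degrees of freedom
two <- data.frame(x = c(1, 2), y = c(3, 5))
fit_two <- lm(y ~ x, data = two)
df_two <- df.residual(fit_two)

# R2 computed by hand as 1 - RSS/SST: the line passes through both points
r2_two <- 1 - sum(residuals(fit_two)^2) / sum((two$y - mean(two$y))^2)

# Simulated data (hypothetical): n = 10 observations, k = 2 predictors
set.seed(42)
n <- 10
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 + rnorm(n)
fit_sim <- lm(y ~ x1 + x2, data = sim)
df_sim <- df.residual(fit_sim)

c(df_two = df_two, r2_two = r2_two, df_sim = df_sim)
```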
Basics: R Output, Variables
Variables section
Coefficients
- estimated coefficients of the population regression equation
- constant / b0: predicted value of the response when all predictors are 0 --> intercept
- some will be much greater than others and therefore appear more impactful, but their size depends on the range of the variable (0 to 10 vs -1000 to 1000)
St. Error
- estimated standard deviation of the coefficient estimate, i.e. how much the estimate is expected to vary from sample to sample
t-value
- in contrast to the St. Err., the t-value is standardized and therefore comparable across variables
- coefficient divided by its std. error
- the larger the absolute t-value, the more significant the variable
- can also be negative, which indicates a negative association
- P > |t|: p-value for coefficient
- H0 is always that bn = 0 --> the p-value is the probability of obtaining an estimate at least as far from 0 as the observed one by chance if the true bn were 0
- essentially indicates whether a variable has a significant effect in the model
95% CI
- with 95% confidence, the true coefficient lies within this interval
- if the interval crosses 0, the variable can potentially have no effect
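The columns of the coefficient table can be reproduced by hand. The sketch assumes the cats data from the MASS package as example data; the two-sided p-value is computed from the t distribution with the residual degrees of freedom.

```r
# Assumption: cats data from the MASS package
data(cats, package = "MASS")
fit <- lm(Hwt ~ Bwt, data = cats)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
tab <- summary(fit)$coefficients

# t-value = estimate divided by its standard error
t_manual <- tab[, "Estimate"] / tab[, "Std. Error"]

# Two-sided p-value under H0: coefficient = 0
p_manual <- 2 * pt(abs(t_manual), df = df.residual(fit), lower.tail = FALSE)

# 95% confidence intervals for intercept and slope
ci <- confint(fit, level = 0.95)

tab
ci
```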
Course 1
Interpretation of LRM
- Give mathematical and biological definition of:
- intercept
- slope
- p-value
Interpretation of LRM
- intercept:
- mathematical: if X is 0, Y will be equal to the intercept
- biologically:
- value cannot be negative, should be 0 or positive to make sense
- if negative, the biological interpretation becomes nonsensical
- for a body weight of 0, the "starting" heart weight is (should be) 0 too
- improving/correcting a nonsensical intercept by reparametrisation
- subtract a constant value from the predictor, e.g. the mean body weight from all body weights
- use a log-transformation on the response variable
- in R:
- intercept and slope are "estimates" of the true, unknown parameters
- given by lm(formula = response var ~ predictor var, data = d.cats)
- slope:
- mathematical: for each one-unit increase in X, Y changes by the slope value
- biological:
- in R: intercept and slope are "estimates" given by lm(formula = response var ~ predictor var, data = d.cats)
- SE of estimate: quantifies the uncertainty of the estimate
- residual SE: describes the unexplained variation of the model (typical size of the residuals)
- multiple R-squared: proportion of the variation in the response that is explained by the model --> 0-100% (0 - 1)
- adjusted R-squared: takes into account the complexity of (different) models to make them more comparable
- p-values
- dichotomous thinking is bad practice! A value just below 0.05, e.g. 0.049, is much weaker evidence than 0.001 --> significance is a grey-scale
- smarter to display the CI
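Centring the predictor, as suggested for a nonsensical intercept, can be sketched as follows. The example assumes d.cats is the cats data from the MASS package; Bwt_c is a helper column introduced here for illustration.

```r
# Assumption: d.cats is the cats data from the MASS package
data(cats, package = "MASS")
d.cats <- cats

# Raw model: the intercept is the predicted heart weight at body weight 0
fit_raw <- lm(Hwt ~ Bwt, data = d.cats)

# Centred model: subtract the mean body weight from all body weights
d.cats$Bwt_c <- d.cats$Bwt - mean(d.cats$Bwt)
fit_centred <- lm(Hwt ~ Bwt_c, data = d.cats)

# The slope is unchanged; the intercept is now the predicted heart weight
# of a cat with average body weight
coef(fit_raw)
coef(fit_centred)
```

With a centred predictor the intercept equals the mean heart weight, which has a direct biological reading.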
Course 1
- What steps important before running regression models?
- which possibilities exist
- which code is used in such cases?
Prior to any regression model: Visualize data
- three options to visualize data with more than one predictor
- different shapes for the levels of the additional predictor
- different colors for the levels of the additional predictor
- panelling --> one panel per level, shown side by side
- scatterplots
- can be used with different colors or shapes for the additional predictor Sex
- panelling: the facets argument in ggplot2 allows creating a graph with two panels
- facet_grid(. ~ Sex)
- boxplots
- can be used to compare the distribution of a variable between factor levels via median and IQR
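The three plot types can be sketched with ggplot2. The example assumes the ggplot2 package is installed and that d.cats is the cats data from the MASS package; the object names p_colour, p_panels and p_box are chosen here for illustration.

```r
library(ggplot2)

# Assumption: d.cats is the cats data from the MASS package
data(cats, package = "MASS")
d.cats <- cats

# Scatterplot with colour and shape for the additional predictor Sex
p_colour <- ggplot(d.cats, aes(x = Bwt, y = Hwt, colour = Sex, shape = Sex)) +
  geom_point()

# Panelling: one panel per sex
p_panels <- ggplot(d.cats, aes(x = Bwt, y = Hwt)) +
  geom_point() +
  facet_grid(. ~ Sex)

# Boxplot: distribution of heart weight per sex
p_box <- ggplot(d.cats, aes(x = Sex, y = Hwt)) +
  geom_boxplot()

# Use print(p_colour), print(p_panels) or print(p_box) to draw a plot
```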
Basics: LRM basics
LRM basics
- what are the intercept and slope and how are they determined ?
Definitions
- X-axis
- independent variable
- predictor
- Y-Axis
- dependent variable
- observations: data points = actual values
- linear regression line
- can be + or -
- fits the observations as closely as possible --> least squares method: the sum of squared errors between the estimates on the line and the observations is as small as possible
- has an intercept (b0) as well as a slope (b1): y = b0 + b1 * x
- crosses the point given by the means of all x and all y values
- slope b1
- determined by the least squares method using \(b_1 = \sum(x-\bar{x})(y-\bar{y})/\sum(x-\bar{x})^2\): for each observation, the deviation of x from the x mean is multiplied by the deviation of y from the y mean; these products are summed and then divided by the sum of the squared deviations of x from its mean
- intercept b0
- can be determined after b1 by solving the line equation at the means for b0: \(b_0 = \bar{y} - b_1\bar{x}\)
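The least squares formulas for slope and intercept can be checked against lm(). The sketch assumes the cats data from the MASS package, with body weight as x and heart weight as y.

```r
# Assumption: cats data from the MASS package
data(cats, package = "MASS")
x <- cats$Bwt
y <- cats$Hwt

# Slope: summed products of deviations divided by summed squared x deviations
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

# Intercept: the line passes through the point of means
b0 <- mean(y) - b1 * mean(x)

# Same estimates from lm()
fit <- lm(y ~ x)
c(b0 = b0, b1 = b1)
coef(fit)
```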
Course 1
Interactions:
- which possibilities exist?
Interactions, 2 possibilities:
- either sex as additional predictor: lm(Hwt ~ Bwt + Sex, data = d.cats)
- regression lines are parallel because the model contains a single slope for Bwt that is shared by both sexes
- intercept differs
- or sex with a sex-weight interaction: lm(Hwt ~ Bwt * Sex, data = d.cats)
- slope and intercept differ between the sexes
- if plotted separately and a model fitted to each, differences in slope and intercept are well visible (qplot, data inspection prior to GLM)
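The two model formulas can be compared directly. The sketch assumes d.cats is the cats data from the MASS package, where Sex has the levels "F" and "M"; slope_F and slope_M are names introduced here.

```r
# Assumption: d.cats is the cats data from the MASS package
data(cats, package = "MASS")
d.cats <- cats

# Additive model: one common slope, separate intercepts (parallel lines)
fit_add <- lm(Hwt ~ Bwt + Sex, data = d.cats)

# Interaction model: separate slopes and separate intercepts
fit_int <- lm(Hwt ~ Bwt * Sex, data = d.cats)

names(coef(fit_add))
names(coef(fit_int))

# Slopes per sex from the interaction model
slope_F <- coef(fit_int)["Bwt"]
slope_M <- coef(fit_int)["Bwt"] + coef(fit_int)["Bwt:SexM"]

# The male slope equals the slope of a model fitted to the males only
fit_M <- lm(Hwt ~ Bwt, data = subset(d.cats, Sex == "M"))
c(female = unname(slope_F), male = unname(slope_M), male_only_fit = unname(coef(fit_M)[2]))
```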
Course 1
Treatment contrasts
- what does it mean?
- how can it be changed?
Treatment contrasts
- Setting the correct reference level is important when comparing new to old, or when the reference is a gold standard
- Historically, the term “treatment contrasts” comes from clinical studies where different treatments are compared
- R uses treatment contrasts by default: each factor level is compared with a reference level, which is the first level in alphanumerical order
- d.cats$Sex <- relevel(d.cats$Sex, ref = "M") to set level "M" as reference
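Changing the reference level can be sketched as follows, assuming d.cats is the cats data from the MASS package, where "F" comes first alphabetically and is therefore the default reference.

```r
# Assumption: d.cats is the cats data from the MASS package
data(cats, package = "MASS")
d.cats <- cats

# Default order of levels: the first one is the reference
levels_before <- levels(d.cats$Sex)

# relevel() takes a factor and moves the chosen level to the front
d.cats$Sex <- relevel(d.cats$Sex, ref = "M")
levels_after <- levels(d.cats$Sex)

# The Sex coefficient now compares females with the male reference
fit <- lm(Hwt ~ Bwt + Sex, data = d.cats)
names(coef(fit))
```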