Data Analytics Wissen
Exam
Card Set Details

| Cards | 64 |
|---|---|
| Language | English |
| Category | Finance |
| Level | University |
| Created / Updated | 08.02.2025 / 08.02.2025 |
| Weblink | https://card2brain.ch/box/20250208_data_analytics_wissen |
What is 10-fold cross-validation?
The data set is split into 10 equally sized blocks (folds). The model is fitted on 9 blocks and evaluated on the held-out block, rotating until every block has served once as test data. The mean performance over the 10 test blocks is then compared across tuning parameters.
Try different tuning parameters, evaluate the performance of each, and compare. Performance is commonly estimated using 10-fold cross-validation.
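A minimal pure-Python sketch of the procedure described above; the helper names `k_fold_indices` and `cross_validate` are illustrative, not from the cards:

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle row indices and split them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(rows, fit, score, k=10):
    """Fit on k-1 folds, score on the held-out fold, return the mean score."""
    folds = k_fold_indices(len(rows), k)
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        test = [rows[j] for j in test_idx]
        train = [rows[j] for j in range(len(rows)) if j not in held_out]
        model = fit(train)
        scores.append(score(model, test))
    return sum(scores) / k
```

To tune a parameter, run `cross_validate` once per candidate value and keep the value with the best mean score.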
Ridge: If most variables in the dataset are useful. Lasso: If most variables in the dataset are useless. If uncertain: Use Elastic Net Regression.
Because the ridge constraint region is a circle (it has no corners), the RSS ellipse touches it at a point where no coefficient is exactly zero. Ridge therefore shrinks coefficients toward zero but never sets them exactly to zero.
A combination of RIDGE and LASSO.
We minimize the sum of squared residuals + a shrinkage penalty with an absolute-value term (power of one) to find the βs.
We minimize the sum of squared residuals + a shrinkage penalty with a squared term (power of two) to find the βs.
We minimize the sum of squared residuals to find the βs.
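The three objectives above (plus the Elastic Net combination of ridge and lasso) can be written out for a single-predictor model; this is an illustrative sketch with hypothetical function names, not a fitting routine:

```python
def sse(y, x, beta):
    """Sum of squared residuals for a one-predictor model y ~ beta * x."""
    return sum((yi - beta * xi) ** 2 for xi, yi in zip(x, y))

def ols_loss(y, x, beta):
    # OLS: squared residuals only
    return sse(y, x, beta)

def ridge_loss(y, x, beta, lam):
    # ridge: squared (power-of-two) shrinkage penalty
    return sse(y, x, beta) + lam * beta ** 2

def lasso_loss(y, x, beta, lam):
    # lasso: absolute-value (power-of-one) shrinkage penalty
    return sse(y, x, beta) + lam * abs(beta)

def elastic_net_loss(y, x, beta, lam1, lam2):
    # elastic net: combination of the lasso and ridge penalties
    return sse(y, x, beta) + lam1 * abs(beta) + lam2 * beta ** 2
```

Minimizing each loss over beta yields the corresponding coefficient estimate; larger lambda values shrink beta more strongly.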
Ridge has a larger bias than the OLS, but a lower variance. This reflects the Bias-Variance Trade-Off.
They can be used for prediction or classification when we have large data sets.
False Negative
False Positive
When the target variable y is categorical (e.g., color). We only deal with binary outcomes (yes (1), no (0)).
The RMSE of the training data is lower in the multiple regression compared to the simple one.
log y = β log x + ε. If x increases by 1 percent, then y changes by β percent.
y = β log x + ε. If x increases by 100 percent, then y changes by β units.
log y = βx + ε. If x increases by one unit, then y changes by β · 100 percent.
y = βx + ε. If x increases by one unit, then y changes by β units.
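A quick numerical check of the log-log and log-linear interpretations above (illustrative values; the approximations hold for small changes and small coefficients):

```python
import math

# log-log model: log y = beta * log x  =>  y = x ** beta
beta = 0.5
x0 = 100.0
pct_y = ((x0 * 1.01) ** beta / x0 ** beta - 1) * 100  # x increases by 1 percent
# pct_y is close to beta percent (here about 0.5 percent)

# log-linear model: log y = b * x  =>  y = exp(b * x)
b = 0.05
pct_y2 = (math.exp(b * (x0 + 1)) / math.exp(b * x0) - 1) * 100  # x up one unit
# pct_y2 is close to b * 100 percent (here about 5 percent)
```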
Compare the root mean squared error and mean absolute error of training and validation data.
Gives an idea of systematic over- or underprediction.
Gives an idea of the magnitude of errors.
Taking the 10% of observations most likely classified as 1s by the model yields almost eight times as many 1s as a random selection of 10% of cases.
1. Order the observations by predicted probability, from highest to lowest. 2. Group them, usually into deciles. 3. Calculate the proportion of 1s in each decile. 4. Divide by the overall proportion of 1s in the data set. 5. This ratio is the lift value.
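The decile-lift steps above can be sketched in pure Python; `decile_lift` is a hypothetical helper name:

```python
def decile_lift(probs, actuals):
    """Lift per decile: proportion of 1s in each decile divided by the
    overall proportion of 1s in the data set."""
    # 1-2: order by predicted probability (descending), split into deciles
    pairs = sorted(zip(probs, actuals), key=lambda p: p[0], reverse=True)
    n = len(pairs)
    overall = sum(actuals) / n
    lifts = []
    for d in range(10):
        chunk = pairs[d * n // 10:(d + 1) * n // 10]
        # 3: proportion of 1s in this decile
        rate = sum(a for _, a in chunk) / len(chunk)
        # 4-5: divide by the overall proportion to get the lift
        lifts.append(rate / overall)
    return lifts
```

If the model concentrates all the 1s in the top-scored decile, that decile's lift approaches 1 / (overall proportion of 1s).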
A higher area under the curve means better performance:
- 0.5 indicates no better than random assignment
- 1.0 indicates perfect separation of classes
Receiver Operating Characteristic. It illustrates sensitivity and specificity when the cutoff value decreases from 1 to 0. Better performance is shown by an ROC curve closer to the upper left corner.
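A sketch of how ROC points arise as the cutoff decreases (pure Python; `roc_points` is an illustrative name, and each point is plotted as (1 − specificity, sensitivity)):

```python
def roc_points(probs, actuals, cutoffs):
    """(1 - specificity, sensitivity) pairs as the cutoff decreases from 1 to 0."""
    pos = sum(actuals)
    neg = len(actuals) - pos
    points = []
    for c in cutoffs:
        tp = sum(1 for p, a in zip(probs, actuals) if p >= c and a == 1)
        fp = sum(1 for p, a in zip(probs, actuals) if p >= c and a == 0)
        points.append((fp / neg, tp / pos))
    return points
```

For a perfectly separating model the curve passes through the upper-left corner (0, 1), matching the AUC = 1.0 case above.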
True Negative / (True Negative + False Positive)
True Positive / (True Positive + False Negative)
1 - Error Rate
For example: Class 1 (acceptance of credit) vs. Class 0 (rejection of credit). Calculate the probability of belonging to Class 1. If it is lower than 0.5 (threshold), classify as Class 0; otherwise, classify as Class 1.
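The cutoff rule and the metrics from the previous cards (sensitivity, specificity, accuracy) can be sketched together; the function names are hypothetical:

```python
def classify(prob, cutoff=0.5):
    """Assign Class 1 when the predicted probability reaches the cutoff."""
    return 1 if prob >= cutoff else 0

def confusion_metrics(preds, actuals):
    """Sensitivity, specificity, and accuracy from predicted vs. actual classes."""
    tp = sum(1 for p, a in zip(preds, actuals) if p == 1 and a == 1)
    tn = sum(1 for p, a in zip(preds, actuals) if p == 0 and a == 0)
    fp = sum(1 for p, a in zip(preds, actuals) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(preds, actuals) if p == 0 and a == 1)
    sensitivity = tp / (tp + fn)          # TP / (TP + FN)
    specificity = tn / (tn + fp)          # TN / (TN + FP)
    accuracy = (tp + tn) / len(actuals)   # 1 - error rate
    return sensitivity, specificity, accuracy
```

Lowering the cutoff classifies more observations as Class 1, which raises sensitivity at the expense of specificity.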
High separation: Predictor variables lead to a low error. Low separation: Predictor variables do not significantly improve the naive rule.
Classify all observations as belonging to the most frequent class (benchmark).
Proportion of misclassified observations among all observations in the validation data.
Classification of an observation as belonging to one class, although it belongs to another.
In the simplest version, a bar chart shows only the frequency in each category.
Visualization of the distribution of a continuous variable.
Grouped boxplots allow comparison between categories of a potential predictor.
The boxplot is very useful to get an overview of the overall distribution of a continuous variable.
The Scatter Plot Matrix offers a combination of bivariate scatter plots and distribution plots.
Displays the relationship between two numerical variables.
Study the relationship of the outcome variable to:
- categorical predictors, using bar charts with the outcome variable on the y-axis;
- pairs of numerical predictors, via color-coded scatter plots;
- numerical predictors, via side-by-side boxplots.