Data Analytics Wissen
Exam
Deck details
Cards | 64 |
---|---|
Language | English |
Category | Finance |
Level | University |
Created / Updated | 08.02.2025 / 08.02.2025 |
Web link | https://card2brain.ch/box/20250208_data_analytics_wissen |
Plot the outcome variable on the y-axis of boxplots, bar charts, and scatter plots. Study the relationship of the outcome variable with categorical predictors via side-by-side plots and multiple panels. Study the relationship between the outcome variable and numerical predictors via scatter plots.
Test data is used to perform a further, final validation step on the chosen model.
Validation data is used to evaluate the performance of the model. If we compare several models on the validation data, overfitting to that data could occur again, which is why separate test data is used.
Training data is used for model development and model training, including trying out different models; it is typically the largest part of the data.
Causes of overfitting: too many predictors, too many parameters in the model (a model that is too complex for too few observations), too many different models tried. Solution: partition the data into training data, validation data, and (sometimes) test data.
If we simply use the existing data to find our model, there is a danger of overfitting. A complex model could fit the existing data excellently but perform worse on new data.
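The partitioning into training, validation, and test data mentioned above can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the function name `partition` and the 60/20/20 split are assumptions, not taken from the cards.

```python
import random

def partition(records, train_frac=0.6, valid_frac=0.2, seed=1):
    """Randomly split records into training, validation, and test lists.

    The fractions are illustrative assumptions; the remainder becomes test data.
    """
    shuffled = list(records)               # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # reproducible random order
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]    # whatever is left over
    return train, valid, test

# Hypothetical data: 100 observation ids
train, valid, test = partition(range(100))
print(len(train), len(valid), len(test))
```

Models are fitted on `train`, compared on `valid`, and the chosen model is assessed once on `test`.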
$z_i = (x_i - \bar{x}) / s$. Subtract the arithmetic mean $\bar{x}$ and divide by the standard deviation $s$. Alternative: Rescale - subtract the minimum and divide by the range (maximum minus minimum).
Standardisation puts all variables on the same scale.
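Both transformations can be tried on a small list of numbers. The sketch below uses only the Python standard library; the function names and the sample values are hypothetical, and it assumes that s is the sample standard deviation (the cards do not say which version is meant).

```python
from statistics import mean, stdev

def standardise(values):
    """z-score: subtract the mean, divide by the (sample) standard deviation."""
    m = mean(values)
    s = stdev(values)  # assumption: sample standard deviation (n - 1)
    return [(x - m) / s for x in values]

def rescale(values):
    """Min-max rescaling: subtract the minimum, divide by the range."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

bedrooms = [1, 2, 3, 4, 5]  # hypothetical values
z = standardise(bedrooms)
r = rescale(bedrooms)
print(z)
print(r)
```

After standardising, the values have mean 0 and standard deviation 1; after rescaling, they lie between 0 and 1.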
housing_df["BEDROOMS"].sort_values()
Solution 1: Discard the affected records - only practical if the number of records with missing values is small. Solution 2: Imputation - replace the missing values with meaningful substitute values (mean, median). Advantage: we keep the observation's non-missing information.
A missing value occurs when no data value is stored for a variable in an observation; it is denoted NaN.
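A short pandas sketch of both ways of handling missing values. The data is made up for illustration; only the column name follows the `housing_df["BEDROOMS"]` example from the cards.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing values (NaN)
housing_df = pd.DataFrame({"BEDROOMS": [2.0, 3.0, np.nan, 4.0, 3.0, np.nan]})

# Count the missing values in the column
n_missing = housing_df["BEDROOMS"].isna().sum()

# Solution 1: discard the records with a missing value
dropped_df = housing_df.dropna(subset=["BEDROOMS"])

# Solution 2: impute the median (median() skips NaN by default)
median_bedrooms = housing_df["BEDROOMS"].median()
imputed_df = housing_df.fillna({"BEDROOMS": median_bedrooms})

print(n_missing, len(dropped_df), median_bedrooms)
```

Dropping loses two rows here, while imputation keeps all six and fills the gaps with the median.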
Outliers can be detected graphically, by sorting the values of a variable, or by inspecting the minimum and maximum values.
We need expertise in the data to determine whether it is an error or a true extreme. Sometimes it is possible to correct the error. If the number of outliers is small and we recognize it as an error: treat it as a missing value.
An outlier is an observation that is extreme (far away) compared to the rest of the data.
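As a rough sketch, sorting and checking the minimum/maximum might look like this in pandas. The data and the cut-off of 10 bedrooms are assumptions for illustration; in practice, domain knowledge decides whether a value is an error.

```python
import pandas as pd

# Hypothetical data; 33 stands in for a data-entry error
housing_df = pd.DataFrame({"BEDROOMS": [3, 2, 4, 3, 33, 2]})

# Sorting puts extreme values at the ends of the series
sorted_bedrooms = housing_df["BEDROOMS"].sort_values()
print(sorted_bedrooms.tail(3))

# Minimum and maximum give a quick plausibility check
print(housing_df["BEDROOMS"].min(), housing_df["BEDROOMS"].max())

# If the value is recognised as an error, treat it as missing (NaN).
# The threshold 10 is an assumed, illustrative cut-off.
cleaned = housing_df["BEDROOMS"].mask(housing_df["BEDROOMS"] > 10)
```

`mask` returns a new series in which the flagged value is replaced by NaN, so it can then be handled like any other missing value.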
Number of dummy variables needed = number of categories - 1. Including a dummy for every category adds redundant information, which can cause algorithms to fail.
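One way to create one fewer dummy variable than there are categories is `pd.get_dummies` with `drop_first=True`. The column name `COLOUR` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical nominal variable with three categories
df = pd.DataFrame({"COLOUR": ["blue", "yellow", "red", "blue"]})

# drop_first=True drops the first category (alphabetically "blue"),
# leaving 3 - 1 = 2 dummy columns
dummies = pd.get_dummies(df["COLOUR"], prefix="COLOUR", drop_first=True)
print(dummies)
```

A row in which all dummy columns are false/0 belongs to the dropped category, so no information is lost.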
Classification: the goal is the prediction of a categorical outcome variable.
Prediction: the goal is the prediction of a numerical outcome variable.
Supervised learning: classification and prediction; unsupervised learning: association rules, data reduction, data exploration, visualisation.
In supervised learning, a target value is known in the training data, i.e. the data on which the algorithm is trained; in unsupervised learning, there is no target variable for prediction or classification.
Supervised learning's goal is to predict a target or outcome variable; the goal of unsupervised learning is to identify patterns and divide data into meaningful groups.
Name the 9 steps of the Data Mining Process
Define/understand the purpose of the analysis;
Obtaining data (possibly including sampling);
Data analysis, cleaning, preparation;
Reduce the data dimension if necessary; for supervised data mining, partition the data;
Specify the analysis goal (classification, prediction, etc.);
Selection of techniques (e.g. regression, logit);
Iterative implementation and tuning;
Evaluation of the results;
Roll-out and widespread use of the best model
What are the core ideas of Data Mining?
Data Analysis, Visualisation, Prediction, Classification, Data Reduction, Association Rules, Recommendation Systems
What are categorical variables?
Ordinal: values can be ordered logically (e.g. good - ok - bad);
nominal: values cannot be ordered logically (e.g. blue, yellow, red)
What are numeric variables?
Continuous: can take any value within a range, i.e. infinitely many possible values (e.g. size, time, age);
integer: whole-number values only (e.g. number of cars, number of cities)