Data Analytics Wissen
Klausur
File details
Flashcards | 64 |
---|---|
Language | English |
Category | Finance |
Level | University |
Created / Updated | 08.02.2025 / 08.02.2025 |
Web link | https://card2brain.ch/box/20250208_data_analytics_wissen |
Solution 1: Discard the affected records - only practical if the number of missing records is small. Solution 2: Imputation - replace the missing values with meaningful substitute values (e.g. mean, median) -> advantage: we keep the observation's non-missing information.
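The two solutions can be sketched with pandas; the toy DataFrame and its column names are illustrative, not from the course material:

```python
import pandas as pd

# Hypothetical toy data with missing values (NaN).
df = pd.DataFrame({"income": [40_000, None, 55_000, 48_000],
                   "age": [25, 31, None, 52]})

# Solution 1: discard rows that contain any missing value.
dropped = df.dropna()

# Solution 2: impute with a meaningful substitute (here the column mean),
# keeping each observation's non-missing information.
imputed = df.fillna(df.mean())

print(imputed)
```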
A missing value occurs when no data value is stored for a variable in an observation; commonly denoted NaN.
Graphically (e.g. with box plots or scatter plots), by sorting the variable's values, or by inspecting the minimum/maximum values.
We need expertise in the data to determine whether it is an error or a true extreme. Sometimes it is possible to correct the error. If the number of outliers is small and we recognize it as an error: treat it as a missing value.
Observation that is extreme (far away) compared to the rest of the data.
Number of categories - 1 dummy variables -> including a dummy for every category adds redundant (perfectly collinear) information, which leads to the failure of some algorithms.
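A minimal sketch of the "number of categories - 1" dummy-coding rule with pandas `get_dummies`; the color data is illustrative:

```python
import pandas as pd

# Illustrative nominal variable with 3 categories.
colors = pd.DataFrame({"color": ["blue", "yellow", "red", "blue"]})

# drop_first=True keeps (number of categories - 1) = 2 dummies; the dropped
# category is implied when all dummies are 0, so keeping all 3 dummies
# would add redundant (perfectly collinear) information.
dummies = pd.get_dummies(colors["color"], drop_first=True)
print(dummies.columns.tolist())
```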
The goal is the prediction of a categorical outcome variable.
The goal is the prediction of a numerical outcome variable.
Supervised learning: classification and prediction; unsupervised learning: association rules, data reduction, data exploration, visualisation.
In supervised learning, the target value is known in the training data (the data on which the algorithm is trained); in unsupervised learning there is no target variable for prediction or classification.
Supervised learning's goal is to predict a target or outcome variable; the goal of unsupervised learning is to identify patterns and divide data into meaningful groups.
Name the 9 steps of the Data Mining Process
1. Define/understand the purpose of the analysis;
2. Obtain the data (possibly including sampling);
3. Explore, clean, and prepare the data;
4. Reduce the data dimension (for supervised data mining, partition it);
5. Specify the analysis goal (classification, prediction, etc.);
6. Select the techniques (e.g. regression, logit);
7. Implement and tune iteratively;
8. Evaluate the results;
9. Roll out and widely use the best model
What are the core ideas of Data Mining?
Data Analysis, Visualisation, Prediction, Classification, Data Reduction, Association Rules, Recommendation Systems
What are categorical variables?
Ordinal -> values can be ordered logically (e.g. good - ok - bad) and
Nominal -> values cannot be ordered logically (e.g. blue, yellow, red)
What are numeric variables?
Continuous -> can take any value within a range (e.g. size, time, age) and
Integer -> whole numbers only (e.g. number of cars, number of cities)
What is the 10-fold cross validation?
The data are split into 10 equally sized blocks (folds). The model is trained on 9 blocks and evaluated on the held-out block; this is repeated so that each block serves once as test data, and the 10 performance values are averaged. For tuning, compare the mean performance across candidate tuning-parameter values.
Try different tuning parameters, evaluate their performance, and compare. Performance is commonly checked using 10-fold cross-validation.
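A sketch of tuning via 10-fold cross-validation with scikit-learn; the synthetic data, the ridge model, and the alpha grid are illustrative assumptions, not from the course material:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data stands in for a real data set.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Try different tuning parameters and compare their mean 10-fold CV score.
results = {}
for alpha in [0.01, 1.0, 100.0]:
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=10)  # 10 folds
    results[alpha] = scores.mean()

best_alpha = max(results, key=results.get)
print(results)
```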
Ridge: If most variables in the dataset are useful. Lasso: If most variables in the dataset are useless. If uncertain: Use Elastic Net Regression.
Because the ridge constraint region forms a circle (it has no corners), the RSS ellipse touches it at points where the coefficients are small but almost never exactly zero; lasso's diamond-shaped constraint has corners on the axes, so its coefficients can become exactly zero.
A combination of RIDGE and LASSO.
We minimize the sum of squared residuals + a shrinkage penalty λ Σ|βj| (absolute value term, power of one) to find the βs.
We minimize the sum of squared residuals + a shrinkage penalty λ Σ βj² (squared term, power of two) to find the βs.
We minimize the sum of squared residuals to find the βs.
Ridge has a larger bias than OLS, but a lower variance. This reflects the bias-variance trade-off.
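The ridge/lasso contrast above can be illustrated with scikit-learn; the synthetic data (only 3 of 20 predictors are informative) and the alpha values are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data where most predictors are useless.
X, y = make_regression(n_samples=100, n_features=20, n_informative=3,
                       noise=5.0, random_state=1)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

# Ridge shrinks coefficients toward zero but does not zero them out;
# lasso sets many of the useless coefficients exactly to zero.
print("ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))
print("lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))
```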
They can be used for prediction or classification when we have large data sets.
False Negative
False Positive
When the target variable y is categorical (e.g. color). Here we only deal with binary outcomes: yes (1) or no (0).
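A minimal binary-classification sketch with logistic regression in scikit-learn; the simulated data and decision rule are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulated binary outcome: y in {0, 1} driven by two predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Fit the classifier and predict class labels (0 or 1).
clf = LogisticRegression().fit(X, y)
pred = clf.predict(X)
accuracy = (pred == y).mean()
print(accuracy)
```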
The RMSE of the training data is lower in the multiple regression compared to the simple one.
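A quick numeric check of this fact, using an illustrative simulation and an OLS fit via NumPy least squares (adding a predictor can never increase training RMSE):

```python
import numpy as np

# Illustrative data: y depends on two predictors x1 and x2.
rng = np.random.default_rng(42)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 * x1 + 1.0 * x2 + rng.normal(scale=0.5, size=n)

def train_rmse(X, y):
    # OLS coefficients via least squares; RMSE on the same training data.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return np.sqrt(np.mean(resid ** 2))

ones = np.ones((n, 1))
simple = train_rmse(np.column_stack([ones, x1]), y)        # y ~ x1
multiple = train_rmse(np.column_stack([ones, x1, x2]), y)  # y ~ x1 + x2

print(simple, multiple)
```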
log y = β log x + ε. If x increases by 1 percent, then y changes by β percent.
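A numeric check of the elasticity interpretation, assuming an illustrative β = 0.8 and omitting the error term:

```python
import numpy as np

beta = 0.8                        # illustrative elasticity
x = 100.0
y = x ** beta                     # equivalent to: log y = beta * log x

x_new = x * 1.01                  # x increases by 1 percent
y_new = x_new ** beta
pct_change_y = (y_new / y - 1) * 100

print(round(pct_change_y, 3))     # approximately beta percent
```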