Exam


Set of flashcards Details

Flashcards 64
Language English
Category Finance
Level University
Created / Updated 08.02.2025 / 08.02.2025
Weblink
https://card2brain.ch/box/20250208_data_analytics_wissen
What are the major visualizations for prediction?

Plot the outcome variable on the y-axis of boxplots, bar charts, and scatter plots. Study the relationship of the outcome variable with categorical predictors via side-by-side plots and multiple panels. Study the relationship between the outcome variable and numerical predictors via scatter plots.

What is test data?

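Test data is held back and used only at the end to assess the final chosen model on data that played no part in training or in model selection. It acts as a second validation step.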

What is validation data?

Validation data is used to evaluate the performance of a model on data it was not trained on. If we compare several models on the validation data and pick the best one, overfitting to the validation data can occur, which is why separate test data is used.

What is training data?

Model development and model training; trying out different models; typically the largest part of the data.

What 3 causes are there for overfitting? What is a solution?

Causes: too many predictors, too many parameters in the model (a model that is too complex for too few observations), too many different models tried. Solution: partition the data into training data, validation data, and sometimes test data.
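The partitioning described above can be sketched as follows. This is a minimal illustration, not a prescribed method: the function name `partition`, the 60/20/20 split, and the fixed seed are assumptions chosen for the example.

```python
import random

def partition(records, train_frac=0.6, valid_frac=0.2, seed=1):
    """Shuffle the records and split them into training, validation, and test sets."""
    shuffled = list(records)                 # copy, so the caller's data stays untouched
    random.Random(seed).shuffle(shuffled)    # fixed seed makes the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]      # whatever remains after the first two parts
    return train, valid, test

# Hypothetical data set: 100 record ids.
train, valid, test = partition(range(100))
print(len(train), len(valid), len(test))
```

Each record lands in exactly one partition, so performance measured on the validation or test set is not inflated by records the model has already seen.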

What is overfitting and what is the problem with it?

If we simply use the existing data to find our model, there is a danger of overfitting. A complex model could fit the existing data excellently but perform worse on new data.
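A toy sketch of this effect, using made-up numbers: the outcome is roughly 10 and does not depend on x. The two models and the data are hypothetical and serve only to show the pattern of a perfect fit on existing data and a worse fit on new data.

```python
from statistics import mean

# Hypothetical toy data: the outcome is 10 plus noise and does not depend on x.
train = [(1, 9), (2, 11), (3, 10), (4, 10)]
valid = [(1, 11), (2, 9), (3, 10), (4, 10)]

def mae(predict, data):
    """Mean absolute error of a prediction function on (x, y) pairs."""
    return mean(abs(predict(x) - y) for x, y in data)

train_mean = mean(y for _, y in train)
lookup = dict(train)

def simple_model(x):
    # One parameter: always predict the training mean.
    return train_mean

def memorising_model(x):
    # One parameter per training record: reproduce the training outcome exactly.
    return lookup.get(x, train_mean)

print("memorising:", mae(memorising_model, train), mae(memorising_model, valid))
print("simple:    ", mae(simple_model, train), mae(simple_model, valid))
```

The memorising model has zero error on the training data but a larger error than the simple model on the new data, because it has fitted the noise.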

What is the formula for standardisation?

z_i = (x_i - x̄) / s. Subtract the arithmetic mean x̄ and divide by the standard deviation s. Alternative: rescaling, where we subtract the minimum and divide by the range (max - min).
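Both transformations can be sketched in a few lines. The function names and sample values are illustrative, and using the sample standard deviation (dividing by n - 1) is an assumption, since the card does not say which variant is meant.

```python
from statistics import mean, stdev

def standardise(values):
    """z-score: subtract the mean, then divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(x - m) / s for x in values]

def rescale(values):
    """Min-max rescaling: subtract the minimum, then divide by the range."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

sizes = [2.0, 4.0, 6.0, 8.0]   # hypothetical measurements
z = standardise(sizes)
r = rescale(sizes)
print(z)
print(r)
```

After standardisation the values have mean 0 and standard deviation 1. After rescaling they lie between 0 and 1.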

What does standardisation of variables mean?

Standardisation puts all variables on the same scale.

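How can you detect missing values in Python?

housing_df["BEDROOMS"].isna().sum() counts the missing values. housing_df["BEDROOMS"].sort_values() also reveals them, because NaN entries are sorted to the end by default.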
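A short sketch of a few common pandas checks for missing values. The data frame below is a hypothetical stand-in for the housing data; only the column name BEDROOMS comes from the card.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the housing data; only the column name is from the card.
housing_df = pd.DataFrame({"BEDROOMS": [3, np.nan, 2, 4, np.nan]})

n_missing = housing_df["BEDROOMS"].isna().sum()            # number of NaN entries
n_present = housing_df["BEDROOMS"].count()                 # count() skips NaN
missing_rows = housing_df[housing_df["BEDROOMS"].isna()]   # the affected records

print(n_missing, n_present)
print(missing_rows)
```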

What can you do to fix missing values?

Solution 1: Discard - only practical if the number of missing records is small. Solution 2: Imputation - replace the missing values with meaningful substitute values (mean, median) -> advantage: we can keep the observation’s non-missing information.
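Both solutions can be sketched with pandas. The data frame and the ROOMS column are hypothetical, and the median is just one of the substitute values mentioned above.

```python
import numpy as np
import pandas as pd

# Hypothetical data; ROOMS is an invented second column with no missing values.
housing_df = pd.DataFrame({
    "BEDROOMS": [3, np.nan, 2, 4, np.nan],
    "ROOMS": [6, 7, 5, 9, 8],
})

# Solution 1: discard every record that has a missing value.
dropped = housing_df.dropna()

# Solution 2: impute with the median of the observed values (NaN is skipped by default).
median_bedrooms = housing_df["BEDROOMS"].median()
imputed = housing_df.fillna({"BEDROOMS": median_bedrooms})

print(len(dropped), len(imputed))
```

Discarding loses the ROOMS information of two records, while imputation keeps all five records.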

What are missing values? How are they denoted in Python?

A value is missing if no data value is stored for a variable in an observation. In Python (pandas), missing values are denoted by NaN.

How can we recognise outliers?

Graphically (e.g. with boxplots or histograms), by sorting the values of a variable, or by inspecting its minimum and maximum values.
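The sorting and minimum/maximum checks can be sketched with plain Python. The values are made up, with 41 standing in for a likely data-entry error.

```python
# Hypothetical bedroom counts; 41 is a plausible data-entry error.
bedrooms = [3, 2, 4, 3, 41, 2, 3]

ordered = sorted(bedrooms)
print(ordered[:3], ordered[-3:])       # smallest and largest values side by side
print(min(bedrooms), max(bedrooms))    # extremes at a glance
```

An extreme value stands out at one end of the sorted list, far from its neighbours.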

What happens when we detect outliers?

Domain expertise is needed to determine whether an outlier is an error or a true extreme value. Sometimes the error can be corrected. If the number of outliers is small and they are recognised as errors, treat them as missing values.

What is an outlier?

Observation that is extreme (far away) compared to the rest of the data.

How many dummy variables are needed for a categorical variable?

Number of categories - 1. A dummy for every category would carry redundant information, which causes some algorithms to fail.
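A minimal pandas sketch of the "categories - 1" rule. The COLOUR column is hypothetical, and `drop_first=True` is one way to omit the redundant dummy.

```python
import pandas as pd

# Hypothetical categorical variable with three categories.
df = pd.DataFrame({"COLOUR": ["blue", "yellow", "red", "blue"]})

# drop_first=True omits the dummy of the first category (here "blue").
dummies = pd.get_dummies(df, columns=["COLOUR"], drop_first=True)
print(list(dummies.columns))
```

Three categories yield two dummy columns. A record that is zero (or False) in both columns belongs to the omitted category.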

What is the goal of the method classification?

The goal is the prediction of a categorical outcome variable.

What is the goal of the method prediction?

The goal is the prediction of a numerical outcome variable.

What methods are used in supervised vs. unsupervised learning?

Supervised learning: classification and prediction; unsupervised learning: association rules, data reduction, data exploration, visualisation.

Is there a target value in supervised vs. unsupervised learning?

In supervised learning, the target value is known in the training data, which is the data the algorithm is trained on. In unsupervised learning, there is no target variable for prediction or classification.

What is the goal of supervised vs. unsupervised learning?

Supervised learning's goal is to predict a target or outcome variable; the goal of unsupervised learning is to identify patterns and divide data into meaningful groups.

Name the 9 steps of the Data Mining Process

Define/understand the purpose of the analysis;

Obtaining data (possibly including sampling);

Data analysis, cleaning, preparation;

Reduce the data dimension (and, for supervised data mining, partition the data);

Specify the analysis goal (classification, prediction, etc.);

Selection of techniques (e.g. regression, logit);

Iterative implementation and tuning;

Evaluation of the results;

Roll-out and widespread use of the best model

What are the core ideas of Data Mining?

Data Analysis, Visualisation, Prediction, Classification, Data Reduction, Association Rules, Recommendation Systems

What are categorical variables?

Ordinal -> values can be ordered logically (e.g. good - ok - bad) and

nominal -> values cannot be ordered logically (e.g. blue, yellow, red)
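In pandas, the difference between ordinal and nominal can be sketched with an ordered and an unordered categorical. The example values are taken from the card. The variable names are illustrative.

```python
import pandas as pd

# Ordinal: the categories have a logical order, declared from worst to best.
rating = pd.Categorical(
    ["good", "bad", "ok"],
    categories=["bad", "ok", "good"],
    ordered=True,
)
sorted_rating = pd.Series(rating).sort_values().tolist()
best = rating.max()

# Nominal: no order is declared, so only equality comparisons are meaningful.
colour = pd.Categorical(["blue", "yellow", "red"])

print(sorted_rating, best, colour.ordered)
```

Sorting the ordinal variable follows the declared order rather than alphabetical order.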

What are numeric variables?

Continuous -> can take any real value within a range (e.g. size, time, age) and

Integer -> whole numbers only (e.g. number of cars, number of cities)