Klausur


Set of flashcards Details

Flashcards 129
Language English
Category Finance
Level University
Created / Updated 24.11.2024 / 08.02.2025
Weblink
https://card2brain.ch/box/20241124_data_analytics
Embed
<iframe src="https://card2brain.ch/box/20241124_data_analytics/embed" width="780" height="150" scrolling="no" frameborder="0"></iframe>

What are numeric variables?

Continuous -> infinite numbers (e.g. size, time, age) and

Integer -> integer (e.g. number of cars, number of cities)

What are categorical variables?

Ordinal -> values can be ordered logically (e.g. good - ok - bad) and

nominal -> values cannot be ordered logically (e.g. blue, yellow, red)

What are the core ideas of Data Mining?

Data Analysis, Visualisation, Prediction, Classification, Data Reduction, Association Rules, Recommendation Systems

Name the 9 steps of the Data Mining Process

Define/ Understand the purpose of the analysis;

Obtaining data (possibly including sampling);

Data analysis, cleaning, preparation;

Reduce the data (dimension) (for supervised data mining, partition it);

Specify the analysis goal (classification, prediction, etc.);

Selection of techniques (e.g. regression, logit);

Iterative implementation and tuning;

Evaluation of the results;

Roll-out and widespread use of the best model

What is the goal of supervised vs. unsupervised learning?

Supervised learning's goal is to predict a target or outcome variable; the goal of unsupervised learning is to identify patterns and divide data into meaningful groups.

Is there a training value in supervised vs. unsupervised learning?

In supervised learning a target value is known in the training data, it is the data on which the algorithm is trained; in unsupervised learning there is no target variable for prediction or classification.

What methods are used in supervised vs. unsupervised learning?

Supervised learning: classification and prediction; unsupervised learning: association rules, data reduction, data exploration, visualisation.

What is the goal of the method prediction?

The goal is the prediction of a numerical outcome variable.

What is the goal of the method classification?

The goal is the prediction of a categorical outcome variable.

What is the code for descriptive analysis of the dataset housing?

housing_df.describe()

What is the code to show the dimension of the dataset?

housing_df.shape

What is the code to show the first 5 lines of the dataset?

housing_df.head()

How many numbers of dummies?

Number of categories - 1 -> redundant information leads to the failure of algorithms.

What is an outlier?

Observation that is extreme (far away) compared to the rest of the data.

What happens when we detect outliers?

We need expertise in the data to determine whether it is an error or a true extreme. Sometimes it is possible to correct the error. If the number of outliers is small and we recognize it as an error: treat it as a missing value.

How can we recognise outliers?

Graphically, ordering the variables, using the minimum/maximum values.

What are missing values? How are they denoted in Python?

If no data value is stored for a variable for an observation; denoted with NaN.

What can you do to fix missing values?

Solution 1: Discard - only practical if the number of missing records is small. Solution 2: Imputation - replace the missing values with meaningful substitute values (mean, median) -> advantage: we can keep the observation’s non-missing information.

How can you detect the missing values in Python?

housing_df["BEDROOMS"].sort_values()

What does standardisation of variables mean?

Standardisation puts all variables on the same scale.

What is the formula of Standardisation?

zi = (xi - x)/s. Subtract the arithmetic mean and divide by the standard deviation. Alternative: Rescale - subtract the minimum and divide by the range of max and min.

What is the code of standardisation in Python?

norm_df = (housing_df - housing_df.mean()) / housing_df.std() norm_df.describe()

What is the code of rescale in Python?

res_bed = (housing_df["BEDROOMS"] - housing_df["BEDROOMS"].min()) / (housing_df["BEDROOMS"].max() - housing_df["BEDROOMS"].min()) res_bed.describe()

What is overfitting and what is the problem with it?

If we simply use the existing data to find our model, there is a danger of overfitting. A complex model could fit the existing data excellently but perform worse on new data.

What 3 causes are there for overfitting? What is a solution?

Causes: too many predictors, too many parameters in the model (too complex model with too few observations), too many different models tried. Solution: Partition the data into training data, validation data, and test data (sometimes).

What is training data?

Model development and model training; trying out different models; typically the largest part of the data.

What is validation data?

With this data, we evaluate the performance of the model. If we compare several models with the validation data, overfitting could occur again, so test data is used.

What is test data?

Test data is used to perform another validation step.

What are the major visualizations for prediction?

Plot the outcome variable on the y-axis of boxplots, bar charts, and scatter plots. Study the relationship of the outcome variable with categorical predictors via side-by-side plots and multiple panels. Study the relationship between the outcome variable and numerical predictors via scatter plots.

What are the major visualizations for classification?

Study the relationship of the outcome variable to categorical predictors using bar charts with the outcome variable on the y-axis. Study the relationship of the outcome variable to pairs of numerical predictors via color-coded scatter plots. Study the relationship between the outcome variable and numerical predictors via side-by-side boxplots.

What shows the visualization the scatterplot?

Displays the relationship between two numerical variables.

What is the code for the scatterplot in Python using pandas?

housing_df.plot.scatter(x='LSTAT', y='MEDV')

What is the code for the scatterplot in Python using seaborn?

sns.scatterplot(x='LSTAT', y='MEDV', data=housing_df)

What is a Scatter Plot Matrix?

The Scatter Plot Matrix offers a combination of bivariate scatter plots and distribution plots.

What is the Python code for the scatter plot matrix in Python using seaborn?

sns.pairplot(housing_df[['CRIM', 'INDUS', 'LSTAT', 'MEDV']])

What is the Python code for the correlation matrix?

housing_df.corr().round(2)

What is the Python code for boxplots using seaborn?

sns.boxplot(y=housing_df["MEDV"], whis=[0,100])

What is the boxplot useful for?

The boxplot is very useful to get an overview of the overall distribution of a continuous variable.

What are grouped boxplots?

Grouped boxplots allow comparison between categories of a potential predictor.

What is the Python code for grouped boxplots using seaborn?

sns.boxplot(y=housing_df["MEDV"], x=housing_df["CHAS"], whis=[0,100])