Data Analytics
Klausur
Set of flashcards Details
Flashcards | 129
Language | English
Category | Finance
Level | University
Created / Updated | 24.11.2024 / 08.02.2025
Weblink | https://card2brain.ch/box/20241124_data_analytics
What are numeric variables?
Continuous -> can take infinitely many values in a range (e.g. size, time, age) and
Integer -> whole numbers only (e.g. number of cars, number of cities)
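As a minimal sketch (using a small hypothetical DataFrame, not the course dataset), the two kinds of numeric variables show up as different pandas dtypes:

```python
import pandas as pd

# Hypothetical data: 'size' is continuous, 'n_cars' is integer-valued
df = pd.DataFrame({"size": [120.5, 98.2, 140.0],
                   "n_cars": [1, 2, 0]})

print(df.dtypes)  # size -> float64, n_cars -> int64
```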
What are categorical variables?
Ordinal -> values can be ordered logically (e.g. good - ok - bad) and
Nominal -> values cannot be ordered logically (e.g. blue, yellow, red)
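A brief sketch of the distinction in pandas (the example values are made up for illustration): an ordinal variable can be stored as an ordered categorical, so comparisons and min/max are meaningful, while a nominal variable stays unordered:

```python
import pandas as pd

# Ordinal: values have a logical order (bad < ok < good)
quality = pd.Series(pd.Categorical(["good", "bad", "ok", "good"],
                                   categories=["bad", "ok", "good"],
                                   ordered=True))
print(quality.min())  # ordering makes min/max meaningful -> 'bad'

# Nominal: values have no logical order
colour = pd.Series(["blue", "yellow", "red"], dtype="category")
```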
What are the core ideas of Data Mining?
Data Analysis, Visualisation, Prediction, Classification, Data Reduction, Association Rules, Recommendation Systems
Name the 9 steps of the Data Mining Process
Define/understand the purpose of the analysis;
Obtain the data (possibly including sampling);
Analyse, clean, and prepare the data;
Reduce the data dimension (for supervised data mining, partition it);
Specify the analysis goal (classification, prediction, etc.);
Select techniques (e.g. regression, logit);
Implement and tune iteratively;
Evaluate the results;
Roll out and deploy the best model
Supervised learning's goal is to predict a target or outcome variable; the goal of unsupervised learning is to identify patterns and divide data into meaningful groups.
In supervised learning the target value is known in the training data, i.e. the data on which the algorithm is trained; in unsupervised learning there is no target variable for prediction or classification.
Supervised learning: classification and prediction; unsupervised learning: association rules, data reduction, data exploration, visualisation.
The goal is the prediction of a numerical outcome variable.
The goal is the prediction of a categorical outcome variable.
What is the code for descriptive analysis of the dataset housing?
housing_df.describe()
What is the code to show the dimension of the dataset?
housing_df.shape
What is the code to show the first 5 lines of the dataset?
housing_df.head()
Number of categories - 1; a full set of dummy variables contains redundant information (perfect multicollinearity), which can cause algorithms to fail.
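The "number of categories - 1" rule can be sketched with pandas (the colour data below is hypothetical; pd.get_dummies is one common way to create dummy variables):

```python
import pandas as pd

# Hypothetical nominal variable with 3 categories
df = pd.DataFrame({"colour": ["blue", "yellow", "red", "blue"]})

# drop_first=True keeps (number of categories - 1) dummies and drops
# the redundant column that can make algorithms fail
dummies = pd.get_dummies(df["colour"], drop_first=True)
print(list(dummies.columns))  # 2 dummies for 3 categories
```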
Observation that is extreme (far away) compared to the rest of the data.
We need expertise in the data to determine whether it is an error or a true extreme value. Sometimes it is possible to correct the error. If the number of outliers is small and we recognise it as an error: treat it as a missing value.
Graphically, ordering the variables, using the minimum/maximum values.
If no data value is stored for a variable for an observation; denoted with NaN.
Solution 1: Discard - only practical if the number of missing records is small. Solution 2: Imputation - replace the missing values with meaningful substitute values (mean, median) -> advantage: we can keep the observation’s non-missing information.
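Both solutions can be sketched in pandas (the series below is hypothetical, standing in for a column such as BEDROOMS):

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value (NaN)
bedrooms = pd.Series([2.0, 3.0, np.nan, 4.0])

# Solution 1: discard observations with missing values
dropped = bedrooms.dropna()

# Solution 2: imputation - replace NaN with a substitute such as the median
imputed = bedrooms.fillna(bedrooms.median())
print(imputed.tolist())  # [2.0, 3.0, 3.0, 4.0]
```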
housing_df["BEDROOMS"].sort_values()
Standardisation puts all variables on the same scale.
z_i = (x_i - x̄) / s: subtract the arithmetic mean and divide by the standard deviation. Alternative: rescaling - subtract the minimum and divide by the range (max - min).
What is the code of standardisation in Python?
norm_df = (housing_df - housing_df.mean()) / housing_df.std()
norm_df.describe()
What is the code of rescale in Python?
res_bed = (housing_df["BEDROOMS"] - housing_df["BEDROOMS"].min()) / (housing_df["BEDROOMS"].max() - housing_df["BEDROOMS"].min())
res_bed.describe()
If we simply use the existing data to find our model, there is a danger of overfitting. A complex model could fit the existing data excellently but perform worse on new data.
Causes: too many predictors, too many parameters in the model (too complex model with too few observations), too many different models tried. Solution: Partition the data into training data, validation data, and test data (sometimes).
Model development and model training; trying out different models; typically the largest part of the data.
With this data, we evaluate the performance of the model. If we compare several models with the validation data, overfitting could occur again, so test data is used.
Test data is used to perform another validation step.
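The three-way partition can be sketched with scikit-learn's train_test_split (assuming scikit-learn is available; the 60/20/20 split below is one common choice, not the only one):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for the real dataset
df = pd.DataFrame({"x": range(100), "y": range(100)})

# First split off 40%, then halve it into validation and test data
train, temp = train_test_split(df, test_size=0.4, random_state=1)
valid, test = train_test_split(temp, test_size=0.5, random_state=1)
print(len(train), len(valid), len(test))  # 60 20 20
```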
Plot the outcome variable on the y-axis of boxplots, bar charts, and scatter plots. Study the relationship of the outcome variable with categorical predictors via side-by-side plots and multiple panels. Study the relationship between the outcome variable and numerical predictors via scatter plots.
Study the relationship of the outcome variable to categorical predictors using bar charts with the outcome variable on the y-axis. Study the relationship of the outcome variable to pairs of numerical predictors via color-coded scatter plots. Study the relationship between the outcome variable and numerical predictors via side-by-side boxplots.
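A colour-coded scatter plot can be sketched with seaborn's hue parameter (the small DataFrame below is hypothetical, standing in for housing_df):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for this sketch
import pandas as pd
import seaborn as sns

# Hypothetical data: MEDV is the outcome, CHAS a categorical predictor
df = pd.DataFrame({"LSTAT": [5, 10, 15, 20],
                   "MEDV": [40, 30, 20, 15],
                   "CHAS": [0, 1, 0, 1]})

# hue colour-codes the points by the categorical variable
ax = sns.scatterplot(x="LSTAT", y="MEDV", hue="CHAS", data=df)
```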
Displays the relationship between two numerical variables.
What is the code for the scatterplot in Python using pandas?
housing_df.plot.scatter(x='LSTAT', y='MEDV')
What is the code for the scatterplot in Python using seaborn?
sns.scatterplot(x='LSTAT', y='MEDV', data=housing_df)
The Scatter Plot Matrix offers a combination of bivariate scatter plots and distribution plots.
What is the Python code for the scatter plot matrix in Python using seaborn?
sns.pairplot(housing_df[['CRIM', 'INDUS', 'LSTAT', 'MEDV']])
What is the Python code for the correlation matrix?
housing_df.corr().round(2)
What is the Python code for boxplots using seaborn?
sns.boxplot(y=housing_df["MEDV"], whis=[0,100])
The boxplot is very useful to get an overview of the overall distribution of a continuous variable.
Grouped boxplots allow comparison between categories of a potential predictor.
What is the Python code for grouped boxplots using seaborn?
sns.boxplot(y=housing_df["MEDV"], x=housing_df["CHAS"], whis=[0,100])