Data Science

lesson01-lesson15

File details

Flashcards 118
Language English
Category Computer Science
Level University
Created / Updated 15.06.2020 / 28.12.2022
Web link
https://card2brain.ch/box/20200615_data_science

Which trends are driving the data science "revolution"?

Mainly Big Data and Machine Learning.

Give a definition of Data science:

1.
Data science is about the extraction of useful information and knowledge from large volumes of data, in order to improve business decision-making.

2.
Data science is an interdisciplinary subject with 3 key areas:
- Statistics
- Computer Science
- Domain expertise

Why is Data Science important?

In the past, data analysis was typically slow: Needed teams of statisticians, analysts etc. to explore data manually.

Today, volume, velocity and variety make manual analysis impossible, but fast computers and good algorithms allow much deeper analyses than before.

--> data-driven decision making
--> base decisions on analysis of data, not intuition

Draw the Data Science performing process:

- Iterative process
- Non-sequential
- Early termination
- Established processes, e.g. CRISP-DM

Name the approximate decade of invention of Machine Learning, Deep Learning and Artificial Intelligence:

  • AI 1950's
    Creation of first "intelligent" algorithms and programs
  • ML 1980's
    Statistical models and algorithms that can learn from data
  • DL 2010's
    Statistical models and algorithms inspired by neurons that can learn from data

Name the 3 main branches of ML and some of its applications:

  • Supervised Learning
    • Classification
      • Diagnostics
      • Customer Retention (Kundenbindung)
      • Image Classification
    • Regression
      • Estimating life expectancy
      • Population Growth Prediction
      • Market Forecasting
  • Unsupervised Learning
    • Clustering
      • Recommender System
      • Customer Segmentation
      • Targeted marketing
    • Dimensionality Reduction
      • Big data Visualisation
      • Structure Discovery
  • Reinforcement Learning
    • Game AI
    • Robot Navigation
    • Real-time decisions

Explain supervised learning:

In supervised learning the training data consists of input/output pairs and we train a function to map the inputs to the outputs. The predicted variable is thereby either a continuous variable like Price / Cost / Weight (regression problems) or a categorical variable like A, B or C / Dogs or Cats (classification problems).

Explain unsupervised learning:

In unsupervised learning there are no labels available; insights are gained without prior knowledge.

Anomaly / outlier detection is the task of finding samples in a dataset that raise suspicion.
The problem thereby is that you usually do not know what you are looking for.
The solution is to use statistics and characteristics of the dataset to find outliers, as sketched below.
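
A minimal sketch of one such statistical approach, the 1.5 * IQR rule (assuming a hypothetical DataFrame df with a numeric column 'value'):

    q1 = df['value'].quantile(0.25)
    q3 = df['value'].quantile(0.75)
    iqr = q3 - q1
    # flag rows lying far outside the interquartile range as potential outliers
    outliers = df[(df['value'] < q1 - 1.5 * iqr) | (df['value'] > q3 + 1.5 * iqr)]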

What did mainly drive the deep learning focus?

1. A lot of data

2. Computing power

What is new in deep learning compared to earlier "machine learning"?

What is new (among other things) is a learning algorithm called backpropagation, which makes it possible to train deep neural nets.

State-of-the-art networks can have over 200 layers.

What are main differences between classical ML vs. Deep Learning?

Classical ML methods don't handle high dimensionality well.

--> dimensionality reduction / feature selection

Deep neural nets learn compact representations of data even in a high dimensionality / sparse setting - no feature engineering required!

Why do we not just apply Deep Learning to every problem?

Classical ML can perform better than DL, or DL should be avoided, in cases where

- the necessary data is not available
- the computational power is not available
- interpretability matters: DL results are harder to interpret
- robustness matters: deep networks can be fooled [image of a panda overlaid with a noise filter, for which the prediction becomes something like a gibbon]

How do you define any object in python (3 ways)?

In general, for any object to be defined, it has to be accessible within the current scope, namely:

1. It belongs to Python's default environment. These are the built-in functions and containers like str, print, list, etc.
2. It has been defined in the current program, e.g. when you create a custom function with the def keyword.
3. It exists in a separate library and you imported the library with a suitable import statement.
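
A minimal sketch illustrating the three cases (the function name double is made up for illustration):

    len('abc')        # 1. built-in, always in scope

    def double(x):    # 2. defined in the current program with def
        return 2 * x

    import math       # 3. imported from a separate library
    math.sqrt(16)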

Make an example of creating a pandas DataFrame out of a csv and assigning it to a variable:

e.g.

housing_data = pd.read_csv(DATA/'housing.csv')

where DATA is a variable in which the path was stored.
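
A minimal self-contained version of this (the data folder name is an assumption; pathlib is one common way to hold the path):

    import pandas as pd
    from pathlib import Path

    DATA = Path('data')                               # folder assumed to contain housing.csv
    housing_data = pd.read_csv(DATA / 'housing.csv')  # the / operator joins path segments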

Which are the first recommended commands after importing a pandas DataFrame?

Looking at the head and tail of the dataset, because one often finds metadata or aggregations at the end of Excel files:
df.head() and df.tail()

Also look at some random samples with df.sample(5, random_state=42)  # random_state ensures reproducibility.

Get a short description of the df with the feature names, number of non-null values and dtypes with df.info().

Get a summary of the numerical attributes via df.describe() (descriptive statistics).
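
Put together, a typical first look might be (a sketch, assuming the DataFrame df has already been loaded):

    df.head()                      # first 5 rows
    df.tail()                      # last 5 rows, watch out for metadata/aggregation rows
    df.sample(5, random_state=42)  # 5 random rows, reproducible thanks to random_state
    df.info()                      # column names, non-null counts, dtypes
    df.describe()                  # descriptive statistics of the numerical columns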

Which Python libraries are most often used for visualisations? And especially which kinds of plots give a quick overview?

Seaborn and matplotlib

  • Histograms: Shows the number of instances (on the vertical axis) that have a given value range (on the horizontal axis). Useful for understanding the shape of a single variable
  • Correlation matrix heatmap: Shows how much each column correlates with each other column with a color gradient. Useful for quickly seeing which variables correlate most strongly with the variable of interest.
  • Scatter plots: Shows a collection of points, each having the value of one column determining the position on the horizontal axis and the value of the other column determining the position on the vertical axis. Useful for visually looking for correlations.

The named plots are all created with the seaborn library!
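
A minimal sketch of the three plot types with seaborn (the column names 'median_income' and 'median_house_value' are assumptions for illustration):

    import seaborn as sns
    import matplotlib.pyplot as plt

    # histogram of a single variable
    sns.histplot(data=housing_data, x='median_house_value')
    plt.show()

    # correlation matrix heatmap (numeric columns only)
    sns.heatmap(housing_data.select_dtypes('number').corr(), cmap='coolwarm')
    plt.show()

    # scatter plot of two variables
    sns.scatterplot(data=housing_data, x='median_income', y='median_house_value')
    plt.show()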

Explain the meaning of correlation regarding the correlation matrix heatmap:

Most people referring to correlation usually mean the standard correlation coefficient ρ(X, Y) (also called Pearson's r) between a pair of random variables X and Y. The coefficient ranges from -1 to 1; when it is close to 1 (-1), there is a strong positive (negative) correlation. When the coefficient is close to 0, there is no linear correlation.
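
For reference, the coefficient can be written as (population definition on the left, the sample estimate Pearson's r on the right):

    \rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}, \qquad
    r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}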

What is the solution for scatterplots which are too dense to interpret?

Plotting a hexagonal bin plot. It bins the spatial area of the chart, and the intensity of a hexagon's colour can be interpreted as points being more concentrated in that area.
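
A minimal sketch using pandas' built-in hexbin plot (the column names are again assumptions):

    import matplotlib.pyplot as plt

    housing_data.plot(kind='hexbin', x='median_income', y='median_house_value', gridsize=30)
    plt.show()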

What are the goals of exploratory data analysis (EDA)?

  • Suggest hypotheses about the phenomena of interest
  • Check if necessary data is available to test these hypotheses
  • Make a selection of appropriate methods and models to achieve the goal
  • Suggest what data should be gathered for further investigation

The exploratory phase lays out the path for the rest of a data science project and is therefore a crucial part of the process.

With which function would you combine two different pandas.DataFrames together and how would it look like?

With the pd.merge() function, e.g.:

housing_merged = pd.merge(
    housing_data, housing_addresses, how='left', on='latitude_longitude'
)

How does the groupby mechanic work?
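
groupby follows the split-apply-combine pattern: the rows are split into groups according to the values of one or more key columns, a function (typically an aggregation such as mean, sum or count) is applied to each group, and the results are combined into a new Series or DataFrame indexed by the group keys.

A minimal sketch (the column names are assumptions for illustration):

    housing_merged.groupby('city')['median_house_value'].mean()
    # one row per city, containing the mean house value of that group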

What is data cleaning?

When you receive a new dataset at the beginning of a project, the first task usually involves some form of data cleaning.

To solve the task at hand, you might need data from multiple sources which you need to combine into one unified table. However, this is usually a tricky task; the different data sources might have different naming conventions, some of them might be human-generated, while others are automatic system reports. A list of things you usually have to go through is the following:

  • Merge multiple sources into one table
  • Remove duplicate entries
  • Clean corrupted entries
  • Handle missing data

How much time do data scientists spend preparing datasets for machine learning algorithms, according to a study by CrowdFlower?

About 60-80%

What would a function look like which converts strings to categories?

e.g:

from pandas.api.types import is_object_dtype

def convert_strings_to_categories(df):
    # convert every object (string) column to pandas' category dtype
    for col in df.columns:
        if is_object_dtype(df[col]):
            df[col] = df[col].astype('category')
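
A possible usage, reusing the merged DataFrame from above:

    convert_strings_to_categories(housing_merged)
    housing_merged.dtypes  # object columns now show up as 'category'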

What is the special dtype of pandas to treat object (string) features with?

The category dtype, which holds data using an integer-based categorical representation or encoding. A categorical object has categories and codes.
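
A minimal sketch of categories and codes (the column name is an assumption):

    s = housing_merged['ocean_proximity'].astype('category')
    s.cat.categories  # the unique category labels
    s.cat.codes       # the integer code assigned to each row's category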

Why do we have to deal with missing values in our datasets?

In general, machine learning algorithms will fail to work with missing data.

What are the three general options to handle missing values?

  1. Get rid of the corresponding rows
  2. Get rid of the whole feature or column
  3. Replace the missing values with some value like zero, or the mean or median of the column (see the sketch below)
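
In pandas the three options look roughly like this (a sketch; the column name 'total_bedrooms' is an assumption):

    df.dropna(subset=['total_bedrooms'])                        # 1. drop the affected rows
    df.drop(columns=['total_bedrooms'])                         # 2. drop the whole column
    df['total_bedrooms'].fillna(df['total_bedrooms'].median())  # 3. fill with e.g. the median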

How would a utility function to fill missing values with median look like?

import pandas as pd
from pandas.api.types import is_numeric_dtype

def fill_missing_values_with_median(df):
    # replace missing values in every numeric column by that column's median
    for column in df.columns:
        if is_numeric_dtype(df[column]):
            if pd.isnull(df[column]).sum():
                column_median = df[column].median()
                df[column] = df[column].fillna(column_median)

What is a potential problem by converting categories to numbers all in one column?

The machine learning algorithms will treat two values that are numerically close to each other as being similar. Thus an alternative approach is to apply a technique known as one-hot encoding, where we create a binary feature per category. In pandas we can do this by simply running e.g.

housing_data_encoded = pd.get_dummies(housing_data)
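
On a small example the effect looks like this:

    pd.get_dummies(pd.Series(['red', 'green', 'red']))
    # returns one binary indicator column per category ('green' and 'red'),
    # with the column matching each row's original value set to 1/True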

What is the key idea of decision trees?

To learn a hierarchy of if/else questions which in the end leads to a decision, i.e. the target.

What are Random Forests and what is the key idea behind them?

A powerful algorithm built from an ensemble (a collection/group of predictive models) of decision trees.

Key idea: Aggregate predictions from a group of predictors or "estimators" (classifiers/regressors). In the end the majority vote (sometimes called "wisdom of the crowd") among the estimators is the final prediction. --> Better predictions than the single best estimator.
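
A minimal scikit-learn sketch (the feature matrix X and labels y are assumed to exist already):

    from sklearn.ensemble import RandomForestClassifier

    forest = RandomForestClassifier(n_estimators=100, random_state=42)  # an ensemble of 100 decision trees
    forest.fit(X, y)
    forest.predict(X[:5])        # majority vote of the trees
    forest.predict_proba(X[:5])  # fraction of trees voting for each class (probabilistic classification)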

What are the advantages and disadvantages of Random Forests?

Advantages:

  • Training/prediction is fast because it is built on decision trees (from a computer science perspective, just if/else statements)
  • Multiple trees allow for probabilistic classification (estimating the probability that an instance belongs to a class)
  • Flexible, less prone to under- or overfitting the data

Disadvantages

  • Black-box model --> hard to explain in simple terms why a prediction is made, in contrast to decision trees, which provide nice classification rules (white-box model)

What makes a good machine learning model?

Suppose we are trying to solve a supervised learning task (i.e. regression or classification) and we have trained a set of models M1 (linear regression), M2 (random forest), ..., Mn (neural network). To choose between them we need:

  • train/valid/test splits
  • performance metrics
  • visualisation tools

Using these 3 points, you can evaluate which of your models works best for the given task.

What is the technique to assess how a model will generalise to new data (e.g. in production)?

The key idea is to split the dataset into training and validation sets, as a rule of thumb an 80:20 split. If the data permits, add a test set! Evaluate the errors on the validation set. If you have a test set, use it only at the very end.
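
A minimal sketch with scikit-learn (X and y are assumed to exist already):

    from sklearn.model_selection import train_test_split

    # 80:20 split; random_state makes the split reproducible
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)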

What is the key idea of evaluating regressors and name some ways to measure it?

Measure how far the data is from predictions --> "error" of the model

Several ways to measure the error:

  • RMSE (Root Mean Square Error)
    Measures the standard deviation of the errors the system makes in its predictions.
    Pro: same units as y, differentiable
    Con: sensitive to outliers
  • MAE (Mean Absolute Error)
    Useful when there are many outliers in the data.
    Pro: intuitive, shows how far off the predictions are on average, same units as y, robust to outliers
    Con: not "smoothly differentiable" --> less preferred in ML algorithms
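
Both can be computed in a few lines (a sketch with numpy, assuming arrays y_true and y_pred):

    import numpy as np

    errors = y_pred - y_true
    rmse = np.sqrt(np.mean(errors ** 2))  # Root Mean Square Error
    mae = np.mean(np.abs(errors))         # Mean Absolute Error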

What is a machine learning model?

Tom Mitchell, one of the pioneers of machine learning, proposed this definition:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.

What should be selected before any machine learning model is trained?

The performance measure

How is one of the best known Python libraries for machine learning called?

scikit-learn

It provides efficient implementations of a large number of common algorithms. It has a uniform Estimator API as well as excellent online documentation. The main benefit of its API is that once you understand the basic use and syntax of scikit-learn for one type of model, switching to a new model or algorithm is very easy.

What are the most common steps one takes when building a model in scikit-learn?

  1. Choose a class of model by importing the appropriate estimator class from scikit-learn.
  2. Choose model hyperparameters by instantiating this class with the desired values.
  3. Arrange data into a feature matrix and target vector.
  4. Fit the model to your data by calling the fit() method.
  5. Evaluate the predictions of the model:
    • For supervised learning we typically predict labels for new data using the predict() method.
    • For unsupervised learning, we often transform or infer properties of the data using the transform() or predict() methods.
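
Putting the five steps together for the housing example (a sketch; the target column name 'median_house_value' is an assumption):

    from sklearn.ensemble import RandomForestRegressor                # 1. choose a model class
    model = RandomForestRegressor(n_estimators=100, random_state=42)  # 2. choose hyperparameters

    X = housing_data_encoded.drop(columns=['median_house_value'])     # 3. feature matrix ...
    y = housing_data_encoded['median_house_value']                    #    ... and target vector

    model.fit(X, y)                                                   # 4. fit the model
    predictions = model.predict(X)                                    # 5. predict (here on the training data for brevity)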

What is the convention of scikit-learn for arranging data into a feature matrix and target vector?

  • The feature matrix is often stored in a variable called X. This matrix is typically two-dimensional with shape [n_samples, n_features], where n_samples refers to the number of rows (i.e. housing districts in our example) and n_features refers to all columns except the target.
  • The target or label array is usually denoted by y.
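
For the housing example this corresponds to (a sketch, reusing X and y from the previous card):

    X.shape  # (n_samples, n_features)
    y.shape  # (n_samples,)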