Data Science

lesson01-lesson15


Why is evaluating our model's predictions on the same data it was trained on a recipe for disaster?

The problem is that the model may memorise the structure of the data it sees and fail to provide good predictions when shown new data.

Why are model errors often visualised, and what are the two common ways to diagnose regression models?

Because visualising the errors often yields deeper insights.

  1. Prediction errors
    To get a sense of how often our model is predicting values that are close to the expected values, we plot the actual value labels from the test dataset against the predictions generated by our final model. --> we want all the points to lie on the plotted line.
  2. Residual plots
    A residual is the difference between the labeled value and the predicted value for each instance in our dataset. We can plot residuals to visualize the extent to which our model has captured the behavior of the data. By plotting the residuals for a series of instances, we can check whether they're consistent with random error; we should not be able to predict the error for any given instance. If the data points appear to be evenly (randomly) dispersed around the plotted line, our model is performing well. In some sense, the resulting plot is a rotated version of our prediction error one. --> What we're looking for is a mostly symmetrical distribution with points that tend to cluster toward the middle of the plot, ideally around smaller numbers on the y-axis. If we observe some kind of structure that does not coincide with the plotted line, we have failed to capture the behavior of the data and should consider either some feature engineering, selecting a new model, or an exploration of the hyperparameters. (Both plots are sketched in code below.)
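A minimal sketch of both diagnostic plots with matplotlib; the synthetic dataset and the RandomForestRegressor are illustrative choices, not part of the original material:

    import matplotlib.pyplot as plt
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    # Illustrative data and model
    X, y = make_regression(n_samples=500, n_features=8, noise=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
    y_pred = model.predict(X_test)

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    # Prediction error plot: actual vs. predicted; ideally all points sit on the diagonal
    ax1.scatter(y_test, y_pred, alpha=0.3)
    ax1.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], "k--")
    ax1.set_xlabel("actual value")
    ax1.set_ylabel("predicted value")

    # Residual plot: residuals vs. predicted; we want random scatter around zero
    residuals = y_test - y_pred
    ax2.scatter(y_pred, residuals, alpha=0.3)
    ax2.axhline(0, color="k", linestyle="--")
    ax2.set_xlabel("predicted value")
    ax2.set_ylabel("residual")
    plt.show()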

Regarding the performance metrics: what values do we aim for when evaluating the RMSE and R2, respectively?

RMSE = Lower is better

R2 = closer to 1 is better
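A quick sketch of computing both metrics with scikit-learn, assuming the y_test and y_pred arrays from the plotting sketch above:

    import numpy as np
    from sklearn.metrics import mean_squared_error, r2_score

    rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # lower is better
    r2 = r2_score(y_test, y_pred)                       # closer to 1 is better
    print(f"RMSE: {rmse:.2f}, R2: {r2:.2f}")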

What is the fundamental concept of Random Forests?

Ensembling many different Decision Trees

Name some hyperparameters of RandomForestRegressors:

  • n_estimators (=10): number of decision trees in the forest 
  • max_depth (=3): the number of "levels" in the tree
  • bootstrap (=False): This setting ensures we use the whole dataset to build the tree (set to False)
  • min_samples_leaf (=3): controls whether or not the tree should continue splitting a given node based on the number of samples in that node.
  • max_features (=0.5): controls what random number or fraction of columns we consider when making a single split at a tree node. The motivation is that we might have situations where a few columns in our data are highly predictive, so each tree would be biased towards picking the same splits and thus reduce the generalisation power of our ensemble. Values like 1.0, 0.5, log2 or sqrt. (These settings are sketched in code after this list.)
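A sketch of setting these hyperparameters explicitly; the values simply mirror the ones quoted above, and the X_train/y_train arrays are assumed from an earlier sketch:

    from sklearn.ensemble import RandomForestRegressor

    model = RandomForestRegressor(
        n_estimators=10,     # number of decision trees in the forest
        max_depth=3,         # number of "levels" in each tree
        bootstrap=False,     # use the whole dataset to build each tree
        min_samples_leaf=3,  # a split must leave at least this many samples per leaf
        max_features=0.5,    # fraction of columns considered at each split
        random_state=0,
    )
    model.fit(X_train, y_train)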

What is bagging used for?

Bagging is a technique that can be used to improve the ability of models to generalise to new data.

The basic idea behind bagging is to consider training several models, each of which is only partially predictive, but crucially, uncorrelated. Since these models are effectively gaining different insights into the data, by averaging their predictions we can create an ensemble that is more predictive.

Bagging is a two-step process:

  1. Bootstrapping, i.e. sampling the training set
  2. Aggregation, i.e. averaging predictions from multiple models

This gives us the acronym Bootstrap AGGregatING, or bagging for short.

The key for this to work is to ensure the errors of each model are uncorrelated, so the way we do that with trees is to sample with replacement from the data: this produces a set of independent samples upon which we can train our models for the ensemble.

To use this method just set bootstrap=True
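In the Random Forest setting above, enabling bagging is just that one flag; a minimal sketch, again assuming the X_train/y_train arrays from an earlier sketch:

    from sklearn.ensemble import RandomForestRegressor

    bagged_model = RandomForestRegressor(
        n_estimators=100,
        bootstrap=True,  # each tree is trained on a bootstrap sample of the data
        random_state=0,
    )
    bagged_model.fit(X_train, y_train)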

What is the solution if you cannot create a validation set because you would not have enough data left to build a good model?

RandomForests have a nice feature called Out-Of-Bag (OOB) error which is designed for just this case.

The key idea is to observe that the first tree of our ensemble was trained on a bagged sample of the full dataset, so if we evaluate this model on the remaining samples we have effectively created a validation set per tree. To generate OOB predictions, we can then average over all the trees and calculate RMSE, R2 or whatever metric we are interested in.

To toggle this behavior in scikit-learn, one makes use of the oob_score flag (defined along with the hyperparameters as oob_score=True), which adds an oob_score_ attribute to our model that we can print out.
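A sketch of enabling the OOB score (note that it requires bootstrap=True); X_train/y_train are assumed from an earlier sketch:

    from sklearn.ensemble import RandomForestRegressor

    model = RandomForestRegressor(n_estimators=100, bootstrap=True,
                                  oob_score=True, random_state=0)
    model.fit(X_train, y_train)
    print(model.oob_score_)  # R2 computed on the out-of-bag samples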

What is model interpretability?

A nice explanation for what it means to interpret a model's predictions is given in the Beware Default Random Forest Importances article:

Training a model that accurately predicts outcomes is great, but most of the time you don't just need predictions, you want to be able to interpret your model. For example, if you build a model of house prices, knowing which features are most predictive of price tells us which features people are willing to pay for.

How can we estimate our confidence in the predictions?

One way to do this is to calculate the standard deviation of the predictions of the trees. Conceptually, the idea is that if the standard deviation is high, each tree is generating very different predictions and may indicate the model has not learnt the most important features of the data.
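A sketch of the per-sample spread across the trees of a fitted forest, assuming the model and X_test objects from the earlier sketches:

    import numpy as np

    # One row per tree: that tree's predictions for the whole test set
    tree_preds = np.stack([tree.predict(X_test) for tree in model.estimators_])

    mean_pred = tree_preds.mean(axis=0)  # the forest's averaged prediction
    std_pred = tree_preds.std(axis=0)    # high values mean the trees disagree, i.e. low confidence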

Where do we in general expect our models to perform best?

On categories that are most frequent in the data. One way to validate this hypothesis is by calculating the ratio of the standard deviation of the predictions to the predictions themselves.

Which two main purposes do confidence intervals serve in general?

  • We can identify which categories the model is less confident about and investigate further
  • We can identify which rows in the data the model is not confident about. This is particularly important when deploying models to production, where e.g. we need to decide how to evaluate the model's predictions for a single housing district.

What is a drawback with the confidence interval analysis?

That we need to drill down into each feature to see where the model is making mistakes.

How can we get a global view of the feature importance in practice?

By ranking each feature in terms of its importance to the model's predictions. In scikit-learn, the Random Forest model has an attribute called feature_importances_ that we can use to rank each feature.
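A sketch of ranking the features, assuming a fitted model and a pandas DataFrame X_train whose columns carry the feature names (unlike the NumPy arrays in the earlier sketches):

    import pandas as pd

    importances = (
        pd.Series(model.feature_importances_, index=X_train.columns)
        .sort_values(ascending=False)
    )
    print(importances.head(10))  # the handful of columns that matter most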

What does the feature importance look like in nearly every real-world dataset?

A handful of columns are very important, while most are not. The powerful aspect of this approach is that it focuses our attention on which features we should investigate further and which ones we can safely ignore.

What is a dendrogram used for?

A dendrogram is produced from a technique called hierarchical clustering and tells us which pairs of features are similar.

From  the plot we see that quantities like latitude and postal_code are grouped together and similar. Note that we used Spearman's rank correlation coefficient to calculate notions of similarity - this is useful for finding non-linear correlations that may be missed by Pearson's correlation coefficient.
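A sketch of producing such a dendrogram, assuming a pandas DataFrame df of numeric features; it follows the approach described above (Spearman correlations turned into distances and clustered hierarchically):

    import matplotlib.pyplot as plt
    from scipy.cluster import hierarchy as hc
    from scipy.spatial.distance import squareform
    from scipy.stats import spearmanr

    corr, _ = spearmanr(df)      # pairwise Spearman rank correlations
    dist = squareform(1 - corr)  # turn similarity into a condensed distance matrix
    linkage = hc.linkage(dist, method="average")
    hc.dendrogram(linkage, labels=df.columns.tolist(), orientation="left")
    plt.show()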

Recall the four main steps that need to be performed to bring a DataFrame to a form suitable for training a RandomForest on:

  • Convert strings to categorical data type
  • Handle missing values
  • Numericalise the DataFrame and create a feature matrix X and target vector y
  • Create train and validation sets

Which way of evaluating classifiers is not the preferred performance measure, and why not?

Accuracy

When you are dealing with skewed datasets (i.e. when some classes are much more frequent than others). This is because the formula for accuracy is (Number of correct decisions made / Total number of decisions made). This means that if you have a ratio of 3:1 (churn:no churn), a model that always predicts churn already gets an accuracy of 75%.

What is the preferred way to evaluate the performance of a classifier?

Look at the confusion matrix. Recall that a confusion matrix for a problem involving n classes is an n*n matrix with the rows labelled by the actual classes and the columns by the predicted classes.

If you denote the true classes as p(ositive) and n(egative), and the classes predicted by the model as Y(es) and N(o) then the confusion matrix has the form like in the picture.

The main diagonal contains the counts of correct decisions. The errors of the classifier are the false negatives (positives classified as negative) and false positives (negatives classified as positive).
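A minimal sketch with scikit-learn on a synthetic, imbalanced binary "churn"-style problem (dataset and classifier are illustrative):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, weights=[0.75, 0.25], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    # Rows = actual classes, columns = predicted classes
    print(confusion_matrix(y_test, y_pred))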

What is a good summary statistic (performance metric)  of the predictiveness of a binary classifier?

The Area Under the ROC Curve (AUC). It varies from 0 to 1. A value of 0.5 corresponds to randomness (the classifier cannot distinguish at all between "churn" and "no churn") and a value of 1.0 means that it is perfect.

The "ROC" refers to the Receiver Operating Characteristic (ROC) curve which plots the true positive rate against the false positive rate (FPR), where the FPR is the ratio of negative instances that are incorrectly classified as positive (Formula in the picture). In general there is a tradeoff between these two quantities: the higher the TPR, the more false positives (FPR) the classifier produces. A good classifier stays as close to the top-left corner of a ROC curve plot as possible.

What is the main difference in how a node is split in a DecisionTree classifier compared to a regressor?

The main difference is that the splitting criterion is no longer the mean squared error, but instead is something known as the Gini index. For classification tasks, the goal is to minimise the Gini index across each split, which amounts to finding which segments are most "pure".

How can the manual hyperparameter tuning part be automated?

With scikit-learn's GridSearchCV, to search for the best combination of hyperparameter values.
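A minimal sketch of such a grid search over a few Random Forest hyperparameters; the parameter grid is illustrative and X_train/y_train are assumed from an earlier sketch:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    param_grid = {
        "n_estimators": [10, 50, 100],
        "max_depth": [3, 5, None],
        "max_features": ["sqrt", 0.5],
    }

    search = GridSearchCV(RandomForestClassifier(random_state=0),
                          param_grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)

    print(search.best_params_)  # best combination found
    print(search.best_score_)   # its cross-validated score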

Why is overfitting bad?

When a model overfits to the training data it does not generalise well to unseen data, which leads to poor performance.

How do we avoid overfitting?

  1. Model Complexity: Choose the right model complexity with hyperparameters (see fitting graph)
  2. More Data: It's harder to overfit on large datasets -> get more data
  3. Regularisation: Punish models for using complex parameters

What is overfitting?

When we fit a model to data we always have to be careful not to overfit. If we overfit the model, this means that the model learned specific aspects of the training data and does not generalise to new, unseen data. Instead of learning useful relations between the input features and the target, the model has memorised the training samples. If this happens the model will perform very poorly on new data, and therefore we want to make sure this does not happen.

Tools against overfitting:

  • Splitting the data into two sets. Measuring the performance difference between the training and validation set already helps identify when we are overfitting.
  • An even more systematic way of splitting the data is using cross-validation.

What is a fitting graph and what can be observed in it?

It is used to visualize the under-/overfitting of a model. To do this, we calculate the training and validation error for each degree and plot them in a single graph.

The following points can be observed:

  1. The training error always decreases when adding more degrees.
  2. There is a region where the validation error is stable and low.

Ideally, we would choose the model parameters such that we have the best model performance. However, we want to make sure that we really have the best validation performance. When we do 'train_test_split' we randomly split the data into two parts. What could happen is that we got lucky and split the data such that it favours the validation error. This is especially dangerous if we are dealing with small datasets. One way to check if that's the case is to run the experiment several times for different, random splits. However, there is an even more systematic way of doing this: cross-validation.
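A runnable sketch of a fitting graph for polynomial regression, where the degree plays the role of model complexity (the data generation is illustrative):

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    degrees = range(1, 15)
    train_errors, val_errors = [], []
    for degree in degrees:
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        model.fit(X_train, y_train)
        train_errors.append(mean_squared_error(y_train, model.predict(X_train)))
        val_errors.append(mean_squared_error(y_val, model.predict(X_val)))

    plt.plot(degrees, train_errors, label="training error")
    plt.plot(degrees, val_errors, label="validation error")
    plt.xlabel("polynomial degree (model complexity)")
    plt.ylabel("mean squared error")
    plt.legend()
    plt.show()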

What is cross validation and what is the idea behind it?

The idea behind cross-validation is to split the data into k equally sized parts, called folds. Each fold gets to be the validation set once while the other folds play the training set part. That means we run k experiments and aggregate the training and validation metrics by averaging them. This is a more robust approach to monitoring overfitting and thanks to scikit-learn we only have to adjust one line by adding the cross_validate function.
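A sketch of 5-fold cross-validation with scikit-learn's cross_validate, reusing the X, y data from the fitting-graph sketch and an illustrative degree-3 pipeline:

    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_validate
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
    scores = cross_validate(model, X, y, cv=5,
                            scoring="neg_mean_squared_error",
                            return_train_score=True)

    # Average the per-fold metrics (scores are negated MSE, hence the minus sign)
    print("training error:", -scores["train_score"].mean())
    print("validation error:", -scores["test_score"].mean())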

What is the challenge in fitting models in machine learning?

The challenge in fitting models in machine learning is to find the good fit. A model that is too simple will not be able to capture the complexity of the data and leads to underfitting. A model that is too complex has the capacity to 'memorize' aspects of the data and causes overfitting. If we are overfitting, our model will not predict unseen data well; we say it does not generalise. The goal is to find a model that has just the right complexity to fit the data. The fitting graph is a tool to identify the sweet spot of model complexity.

What is meant by the complexity of a machine learning model?

Model complexity comes in different forms and shapes. In our polynomial example the complexity is controlled by the degree parameter. For a Random Forest the complexity is given by several parameters such as max_depth and n_estimators.

What would be the simplest way of avoiding overfitting?

Get more data!

What is the reason why you want to have a train, validation and a test set?

  • The train set is used to train a model
  • With the validation set the model is evaluated. With this information we tune the parameters.
  • We only evaluate the final, tuned model on the test set. We do not use it to tune the model parameters.

The reason we make the distinction between validation and test set is that by tuning the parameters on the validation performance we might start to overfit the validation data. The test set gives a final sanity check that we actually have a performant model.

In cross-validation the concepts of train and validation are merged and all training data is also validation data at some point.

Explain k-nearest neighbour classification and k-means clustering:

  • k-nearest neighbour classification falls in the category of supervised algorithms, which we already encountered with Decision Trees and Random Forests.
  • k-means clustering on the other hand is an unsupervised method and therefore forms a new class of algorithms.

Explain what is meant by similarity and distance measures and what they are used for:

Similarity and distance measures are fundamental tools in machine learning. Often we want to know how far apart or how similar two datapoints are. Some examples:

  • How similar are two customers?
  • How close is a search query to a webpage?
  • How similar are two pictures?
  • How far away is the closest restaurant?

How we measure distance and similarity influences the results. If the direct, shortest path takes us over a cliff it might not actually be the shortest path time-wise. Also the dimensionality of the data has an impact on the usefulness of the result. When working with high-dimensional data one has to keep in mind the curse of dimensionality, which impacts the quality of some metrics.

Name a few distance/similarity measures:

These functions calculate the distance/similarity between two vectors x and y (a short sketch follows the list):

  • Euclidean distance
  • cosine similarity
  • Manhattan distance
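A short sketch of the three measures for two vectors x and y, using NumPy and SciPy:

    import numpy as np
    from scipy.spatial.distance import cityblock, cosine, euclidean

    x = np.array([1.0, 2.0, 3.0])
    y = np.array([2.0, 4.0, 6.0])

    print("Euclidean distance:", euclidean(x, y))
    print("Manhattan distance:", cityblock(x, y))
    # scipy's `cosine` returns the cosine *distance*; similarity = 1 - distance
    print("cosine similarity:", 1 - cosine(x, y))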

Explain how the k-nearest neighbours classifier works and give some tips & tricks:

The k-nearest neighbours classifier uses the neighbours of a sample to classify it. Given a new point, it searches the k samples in the training set that are closest to the new point. Then the majority class of the neighbours is used to classify the new point.

Tips & tricks:

  • Make sure all features are scaled properly (e.g. see sklearn.preprocessing.StandardScaler)
  • Use an odd number for k to avoid ties
  • Voting can be weighted by distance (see the sketch below)
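A minimal sketch of the classifier with the tips applied (features scaled, odd k, distance-weighted voting); the dataset is illustrative:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(n_samples=500, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    knn = make_pipeline(
        StandardScaler(),                          # scale the features first
        KNeighborsClassifier(n_neighbors=5,        # odd k to avoid ties
                             weights="distance"),  # closer neighbours count more
    )
    knn.fit(X_train, y_train)
    print(knn.score(X_test, y_test))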

Explain k-means clustering and give an idea of its interface:

It is an unsupervised approach. The goal is to automatically identify clusters in the data without having access to the labels. We will see that even without knowledge about the data we will be able to make statements about the shape of it.

The interface of k-means provided in scikit-learn is very similar to that of a classifier.

  • Initialize: We define the number of clusters we want to look for.
    kmeans = KMeans(n_clusters=k)
  • Fit: We fit the model to the data.
    kmeans.fit(X)
  • Predict: We make predictions as to which cluster each datapoint belongs.
    y = kmeans.predict(X) 

Note: In contrast to the classifiers, the k-means algorithm does not need any labels when the model is fitted to the data.

Explain the additional features beyond the standard functions of k-means:

 

  • On the one hand, the calculated cluster centers (sometimes called centroids) can be accessed:
    kmeans.cluster_centers_
  • Furthermore, we can get the inertia, which is the sum of the squared distances of each datapoint to its closest cluster center:
    kmeans.inertia_
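A short, runnable sketch tying these pieces together on synthetic blob data (the dataset and k=3 are illustrative):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

    kmeans = KMeans(n_clusters=3, random_state=0)
    kmeans.fit(X)                    # no labels needed
    labels = kmeans.predict(X)       # cluster assignment for each datapoint

    print(kmeans.cluster_centers_)   # coordinates of the centroids
    print(kmeans.inertia_)           # sum of squared distances to the closest centroid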

 

What is the elbow rule?

Looking at the visualizations of the clusters for the given k-values, it seems that the hardest part of k-means is selecting the right number k, which can be any positive integer. How can we find a good value for k?

There are several approaches, one of which is the so-called elbow rule: Plot the inertia for different values of k. This yields an asymptotic curve that at first moves quickly towards zero and then slows down. Imagine the curve is an arm. The spot where the elbow of that arm would be is usually a good value for k.
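A sketch of the elbow rule, reusing the blob data X from the k-means sketch above:

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    ks = range(1, 11)
    inertias = [KMeans(n_clusters=k, random_state=0).fit(X).inertia_ for k in ks]

    plt.plot(ks, inertias, marker="o")   # look for the "elbow" in this curve
    plt.xlabel("number of clusters k")
    plt.ylabel("inertia")
    plt.show()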

Name some applications of machine learning in natural language processing:

  • Text classification
  • Question/answering systems
  • Dialogue systems
  • Named entity recognition
  • Summarization
  • Text generation

Why is natural text different from other data sources such as numerical tables or images?

One way to look at text is to consider each word to be a feature. Since most languages have on the order of 100k words in their vocabulary, plus many variations, this leads to an enormous feature space. At the same time most words in the vocabulary do not appear in a given small text. This leads to extreme sparsity. These properties call for a different approach to NLP than the methods we encountered and used for tabular data.

What does regex mean in NLP?

regular expressions
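A tiny sketch of using regular expressions for simple text processing in Python; the patterns are illustrative:

    import re

    text = "Data Science is fun! Visit https://card2brain.ch for flashcards."

    tokens = re.findall(r"[A-Za-z]+", text)      # extract alphabetic tokens
    no_urls = re.sub(r"https?://\S+", "", text)  # strip URLs

    print(tokens)
    print(no_urls)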