Data Science
lesson01-lesson15
Set of flashcards Details

| Flashcards | 118 |
|---|---|
| Language | English |
| Category | Computer Science |
| Level | University |
| Created / Updated | 15.06.2020 / 28.12.2022 |
| Weblink | https://card2brain.ch/box/20200615_data_science |
What does the following code do?
!tar -zcf {data_path/dataset_name}.tar.gz {data_path/dataset_name}
It packs the dataset directory into a gzip-compressed tar archive (c = create, z = gzip, f = archive filename). The leading ! runs the command in the shell from a Jupyter notebook, and {...} interpolates Python variables.
What is this code for?
from pathlib import Path  # imports needed for the snippet to run
import tarfile

path = Path('')
tarfile.open(path/'cats_vs_dogs.tar.gz', 'r:gz').extractall(path)
It extracts the compressed image dataset archive into the working directory so the images can then be loaded with the fastai library.
We apply a few tricks when we load the data:
- A trick often used in image classification is data augmentation. This means one creates more data by manipulating existing data: images can be rotated, flipped, cropped etc. This generally improves the performance of the classifier and is set up with the get_transforms function.
- We split the data 80/20 into train and validation data.
- We crop the images to a size of 224 pixels.
tfms = get_transforms()
data = ImageDataBunch.from_folder(path, ds_tfms=tfms, valid_pct=0.2, size=224)
Peek at the data:
data.show_batch(rows=3, figsize=(8, 8))
What does training a learner for image classification involve?
Hint: the same steps as with ULMFiT (see the sketch after the list)
1. Load a pretrained model (e.g. resnet34, resnet50, etc.)
2. Find the optimal learning rate
3. Fit the head of the network
4. Unfreeze all layers and fine-tune
5. Evaluate the results
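A minimal sketch of these steps with the fastai v1 API used in the other cards; the data variable is assumed to be the ImageDataBunch from above, and the epoch counts and learning rates are illustrative assumptions, not prescribed values:

learn = cnn_learner(data, models.resnet34, metrics=accuracy)  # 1. load a pretrained model
learn.lr_find()                                               # 2. search for a good learning rate
learn.recorder.plot()                                         #    inspect the loss-vs-lr plot
learn.fit_one_cycle(4, max_lr=1e-3)                           # 3. fit the head (body stays frozen)
learn.unfreeze()                                              # 4. unfreeze all layers...
learn.fit_one_cycle(2, max_lr=slice(1e-5, 1e-4))              #    ...and fine-tune with lower rates
interp = ClassificationInterpretation.from_learner(learn)     # 5. evaluate the results
interp.plot_confusion_matrix()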
Why is the ResNet34 model often chosen for image classification tasks?
It is a convolutional neural network with 34 layers. There are also larger variants with up to 152 layers, but ResNet34 usually takes the least effort to train while still giving good results.
What's the following code for?
learn = cnn_learner(data, models.resnet34, metrics=accuracy)
It loads a pretrained image model (ResNet34) as a fastai learner.
How do you evaluate an image classifier?
With the confusion matrix.
What's the following code for?
ds, idxs = DatasetFormatter().from_toplosses(learn)  # collect the samples with the highest loss
ImageCleaner(ds, idxs, path)  # widget for relabelling or deleting those samples
data = ImageDataBunch.from_csv(path, csv_labels='cleaned.csv', ds_tfms=tfms, valid_pct=0.2, size=224)
data.show_batch(rows=3, figsize=(8, 8))
fast.ai's ImageCleaner for data cleaning: it surfaces the top-loss images so mislabelled ones can be corrected or removed, and the cleaned labels are then reloaded from cleaned.csv.
Give the two formulas (for evaluating classifiers with the confusion matrix) for
- accuracy of positive predictions (precision)
- true positive rate (recall)
- precision = TP / (TP + FP)
- recall = TP / (TP + FN)
Note: it is very hard (usually impossible) to get a model with perfect precision AND recall.
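A tiny worked example with made-up confusion-matrix counts:

TP, FP, FN, TN = 80, 10, 20, 90  # made-up counts for a binary classifier
precision = TP / (TP + FP)       # accuracy of positive predictions: 80/90 = 0.89
recall = TP / (TP + FN)          # true positive rate: 80/100 = 0.80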
What does PCA stand for?
Principal Component Analysis.
PCA aims at finding new axes that contain most of the variance in the dataset.
Note: PCA is sensitive to the scaling of the features --> larger scale <--> larger variance
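A minimal scikit-learn sketch (the synthetic data and component count are assumptions); note the standardization step, since PCA is scale-sensitive:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(42).normal(size=(100, 5))  # synthetic data, 5 features
X_scaled = StandardScaler().fit_transform(X)         # standardize: PCA is scale-sensitive
pca = PCA(n_components=2)                            # keep the 2 axes with most variance
X_2d = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)                 # share of variance per new axis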
What is meant by topology?
The study of shapes.
One way to study shapes is to look at their holes: the exact form of a figure does not matter; as long as two figures have the same holes, they count as the same shape.
What exactly are count-encodings in NLP?
Count-encodings are the sum of the (one-hot) word encodings of a document, i.e. a vector of how often each vocabulary word occurs.
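A minimal sketch with scikit-learn's CountVectorizer (the example sentences are made up):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat sat on the cat"]  # made-up documents
vec = CountVectorizer()
counts = vec.fit_transform(docs)
print(vec.get_feature_names_out())  # vocabulary: ['cat' 'on' 'sat' 'the']
print(counts.toarray())             # per-document word counts: [[1 0 1 1], [2 1 1 2]]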
What are the advantages and disadvantages of n-grams in NLP?
Advantage: some information about the word sequence is preserved.
Disadvantage: the vocabulary can get very large.
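A small sketch showing both points at once (the example sentence is made up): bigrams keep some word order, but the vocabulary grows quickly:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat"]  # made-up document
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
bigrams = CountVectorizer(ngram_range=(1, 2)).fit(docs)
print(len(unigrams.vocabulary_))  # 5 distinct unigrams
print(len(bigrams.vocabulary_))   # 10 uni- + bigrams: the vocabulary has already doubled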
How does deep learning work (simple explanation)?
In deep learning many neuron layers are stacked on top of each other: the output y of one layer becomes the input X of the next layer.
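A minimal NumPy sketch of the stacking idea (layer sizes and weights are made up): the output of layer 1 is fed as input to layer 2:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # batch of 4 samples with 3 features
W1 = rng.normal(size=(3, 8))  # weights of layer 1 (sizes are made up)
W2 = rng.normal(size=(8, 2))  # weights of layer 2

h = np.maximum(0, X @ W1)     # layer 1 output y = relu(X W1) ...
y = h @ W2                    # ... becomes the input X of layer 2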
What does the Backpropagation algorithm do?
Backpropagation uses the chain rule to calculate the gradients over the whole network.
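A tiny worked example of the chain rule on a one-neuron "network" (all values are illustrative):

x, w, b = 2.0, 3.0, 1.0

z = w * x + b          # forward pass: z = 7
y = z ** 2             # "loss": y = 49

dy_dz = 2 * z          # local gradient: dy/dz = 14
dz_dw = x              # local gradient: dz/dw = 2
dy_dw = dy_dz * dz_dw  # chain rule: dy/dw = 14 * 2 = 28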
Name 3 different kinds of neural networks:
- Feedforward networks: Information flows forward only
- Recurrent neural network: Neuron has a feedback loop
- Convolutional neural network: Network "scans" through input instead of consuming all input at once
Which trends are driving the data science "revolution"?
Mainly Big Data and Machine Learning.
Give a definition of Data science:
1. Data science is about the extraction of useful information and knowledge from large volumes of data, in order to improve business decision-making.
2. Data science is an interdisciplinary subject with 3 key areas:
- Statistics
- Computer Science
- Domain expertise
Why is Data Science important?
In the past, data analysis was typically slow: it needed teams of statisticians, analysts etc. to explore data manually.
Today volume, velocity and variety make manual analysis impossible, but fast computers and good algorithms allow much deeper analyses than before.
--> data-driven decision making
--> base decisions on analysis of data, not intuition
Name the approximate year of invention of Artificial Intelligence, Machine Learning and Deep Learning:
- AI, 1950s: creation of the first "intelligent" algorithms and programs
- ML, 1980s: statistical models and algorithms that can learn from data
- DL, 2010s: statistical models and algorithms, inspired by neurons, that can learn from data
Name the 3 main branches of ML and some of their applications:
- Supervised Learning
  - Classification: diagnostics, customer retention, image classification
  - Regression: estimating life expectancy, population growth prediction, market forecasting
- Unsupervised Learning
  - Clustering: recommender systems, customer segmentation, targeted marketing
  - Dimensionality Reduction: big data visualisation, structure discovery
- Reinforcement Learning: game AI, robot navigation, real-time decisions
Explain supervised learning:
In supervised learning the training data consists of input/output pairs and we train a function to map the inputs to the outputs. The predicted variable is thereby either a continuous variable like price / cost / weight (regression problems) or a categorical variable like A, B or C / dogs or cats (classification problems).
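A minimal scikit-learn sketch of the regression case (the input/output pairs are made up):

from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4]]  # inputs (made-up)
y = [2.1, 3.9, 6.2, 8.1]  # continuous outputs (made-up)

model = LinearRegression().fit(X, y)  # learn the mapping from inputs to outputs
print(model.predict([[5]]))           # predict the output for an unseen input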
Explain unsupervised learning:
In unsupervised learning there are no labels available; insights are gained without prior knowledge.
For anomaly / outlier detection the task is finding samples in a dataset that raise suspicion.
The problem thereby is that you usually do not know what you are looking for.
The solution is to use statistics and characteristics of the dataset to find outliers, as sketched below.
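A minimal sketch of this idea using z-scores (the data and the threshold are assumptions):

import numpy as np

data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 25.0])  # made-up samples

z = (data - data.mean()) / data.std()  # use the dataset's own statistics
print(data[np.abs(z) > 2])             # flag samples far from the mean: [25.]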
What mainly drove the focus on deep learning?
1. A lot of data
2. Computing power
What is new in deep learning compared to earlier "machine learning"?
What is new (among other things) is a learning algorithm called backpropagation, which makes it possible to train deep neural nets.
State-of-the-art networks can have over 200 layers.
What are main differences between classical ML vs. Deep Learning?
Classical ML methods don't handle high dimensionality well.
--> dimensionality reduction / feature selection
Deep neural nets learn compact representations of data even in a high dimensionality / sparse setting - no feature engineering required!
Why do we not just apply Deep Learning to every problem?
Classical ML can perform better than DL in cases where
- the necessary (large amounts of) data are not available
- the computational power is not available
- interpretable results are required (DL results are harder to interpret)
- robustness is required: deep networks can be fooled (e.g. an image of a panda overlaid with a noise filter is predicted as a gibbon)
How do you define any object in Python (3 ways)?
In general, for any object to be defined, it has to be accessible within the current scope, namely:
1. It belongs to Python's default environment. These are the built-in functions and containers like str, print, list, etc.
2. It has been defined in the current program, e.g. when you create a custom function with the def keyword.
3. It exists in a separate library and you imported the library with a suitable import statement.
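A short snippet illustrating the three ways (the function and argument names are made up):

print(len("data"))       # 1. len and print are built-ins

def greet(name):         # 2. defined in the current program with def
    return f"Hello, {name}!"
print(greet("world"))

import math              # 3. imported from a separate library
print(math.sqrt(16))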
Make an example of creating a pandas DataFrame from a CSV file and assigning it to a variable:
e.g.
housing_data = pd.read_csv(DATA/'housing.csv')
where DATA is a variable (e.g. a pathlib.Path) in which the data directory is stored.
Which are the first recommended commands after importing a pandas DataFrame?
Look at the head and tail of the dataset, because one often finds metadata or aggregations at the end of Excel files:
df.head() and df.tail()
Look also at some random samples with df.sample(5, random_state=42) # random_state ensures reproducibility
Get a short description of the df with the feature names, number of values and dtypes with df.info().
Get a summary of the numerical attributes via df.describe() (descriptive statistics).
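The commands from this card collected into one runnable sketch (the CSV path is an assumption):

import pandas as pd

df = pd.read_csv("housing.csv")  # path is an assumption

df.head()                        # first rows
df.tail()                        # last rows: watch out for metadata / aggregations
df.sample(5, random_state=42)    # random rows; random_state ensures reproducibility
df.info()                        # feature names, number of values, dtypes
df.describe()                    # descriptive statistics of the numeric columns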