Data Science
lesson01-lesson15
Set of flashcards Details

| Flashcards | 118 |
|---|---|
| Language | English |
| Category | Computer Science |
| Level | University |
| Created / Updated | 15.06.2020 / 28.12.2022 |
| Weblink | https://card2brain.ch/box/20200615_data_science |
What does the following code do?
!tar -zcf {data_path/dataset_name}.tar.gz {data_path/dataset_name}
It packs the dataset directory into a gzip-compressed tar archive (c = create, z = gzip, f = archive filename). The leading ! runs the command in the shell from a Jupyter notebook, and {...} interpolates Python variables.
What is this code for?
from pathlib import Path  # imports needed for the snippet to run
import tarfile

path = Path('')
tarfile.open(path/'cats_vs_dogs.tar.gz', 'r:gz').extractall(path)
It extracts the compressed image dataset archive into the working directory so the images can then be loaded with the fastai library.
We apply a few tricks when we load the data:
- A trick often used in image classification is data augmentation. This means one creates more data by manipulating existing data: images can be rotated, flipped, cropped etc. This generally improves the performance of the classifier and is set up with the get_transforms function.
- We split the data 80/20 into train and validation data.
- We crop the images to a size of 224 pixels.
tfms = get_transforms()
data = ImageDataBunch.from_folder(path, ds_tfms=tfms, valid_pct=0.2, size=224)
Peek at the data:
data.show_batch(rows=3, figsize=(8, 8))
What does training a learner for image classification involve?
Hint: the same steps as with ULMFiT (see the sketch after the list)
1. Load a pretrained model (e.g. resnet34, resnet50, etc.)
2. Find the optimal learning rate
3. Fit the head of the network
4. Unfreeze all layers and fine-tune
5. Evaluate the results
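A minimal sketch of these steps with the fastai v1 API used in the other cards; the data variable is assumed to be the ImageDataBunch from above, and the epoch counts and learning rates are illustrative assumptions, not prescribed values:

learn = cnn_learner(data, models.resnet34, metrics=accuracy)  # 1. load a pretrained model
learn.lr_find()                                               # 2. search for a good learning rate
learn.recorder.plot()                                         #    inspect the loss-vs-lr plot
learn.fit_one_cycle(4, max_lr=1e-3)                           # 3. fit the head (body stays frozen)
learn.unfreeze()                                              # 4. unfreeze all layers...
learn.fit_one_cycle(2, max_lr=slice(1e-5, 1e-4))              #    ...and fine-tune with lower rates
interp = ClassificationInterpretation.from_learner(learn)     # 5. evaluate the results
interp.plot_confusion_matrix()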
Why is the ResNet34 model often chosen for image classification tasks?
It is a convolutional neural network with 34 layers. There are also larger variants with up to 152 layers, but ResNet34 usually takes the least effort to train while still giving good results.
What's the following code for?
learn = cnn_learner(data, models.resnet34, metrics=accuracy)
It loads a pretrained image model (ResNet34) as a fastai learner.
How do you evaluate an image classifier?
With the confusion matrix.
What's the following code for?
ds, idxs = DatasetFormatter().from_toplosses(learn)  # collect the samples with the highest loss
ImageCleaner(ds, idxs, path)  # widget for relabelling or deleting those samples
data = ImageDataBunch.from_csv(path, csv_labels='cleaned.csv', ds_tfms=tfms, valid_pct=0.2, size=224)
data.show_batch(rows=3, figsize=(8, 8))
fast.ai's ImageCleaner for data cleaning: it surfaces the top-loss images so mislabelled ones can be corrected or removed, and the cleaned labels are then reloaded from cleaned.csv.
Give the two formulas (for evaluating classifiers with the confusion matrix) for
- accuracy of positive predictions (precision)
- true positive rate (recall)
- precision = TP / (TP + FP)
- recall = TP / (TP + FN)
Note: it is very hard (usually impossible) to get a model with perfect precision AND recall.
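A tiny worked example with made-up confusion-matrix counts:

TP, FP, FN, TN = 80, 10, 20, 90  # made-up counts for a binary classifier
precision = TP / (TP + FP)       # accuracy of positive predictions: 80/90 = 0.89
recall = TP / (TP + FN)          # true positive rate: 80/100 = 0.80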
What does PCA stand for?
Principal Component Analysis.
PCA aims at finding new axes that contain most of the variance in the dataset.
Note: PCA is sensitive to the scaling of the features --> larger scale <--> larger variance
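A minimal scikit-learn sketch (the synthetic data and component count are assumptions); note the standardization step, since PCA is scale-sensitive:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(42).normal(size=(100, 5))  # synthetic data, 5 features
X_scaled = StandardScaler().fit_transform(X)         # standardize: PCA is scale-sensitive
pca = PCA(n_components=2)                            # keep the 2 axes with most variance
X_2d = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)                 # share of variance per new axis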
What is meant by topology?
The study of shapes.
One way to study shapes is to look at their holes: the exact form of a figure does not matter; as long as two figures have the same holes, they count as the same shape.
What exactly are count-encodings in NLP?
Count-encodings are the sum of the (one-hot) word encodings of a document, i.e. a vector of how often each vocabulary word occurs.
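A minimal sketch with scikit-learn's CountVectorizer (the example sentences are made up):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat sat on the cat"]  # made-up documents
vec = CountVectorizer()
counts = vec.fit_transform(docs)
print(vec.get_feature_names_out())  # vocabulary: ['cat' 'on' 'sat' 'the']
print(counts.toarray())             # per-document word counts: [[1 0 1 1], [2 1 1 2]]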
What are the advantages and disadvantages of n-grams in NLP?
Advantage: some information about the word sequence is preserved.
Disadvantage: the vocabulary can get very large.
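A small sketch showing both points at once (the example sentence is made up): bigrams keep some word order, but the vocabulary grows quickly:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat"]  # made-up document
unigrams = CountVectorizer(ngram_range=(1, 1)).fit(docs)
bigrams = CountVectorizer(ngram_range=(1, 2)).fit(docs)
print(len(unigrams.vocabulary_))  # 5 distinct unigrams
print(len(bigrams.vocabulary_))   # 10 uni- + bigrams: the vocabulary has already doubled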
How does deep learning work (simple explanation)?
In deep learning many neuron layers are stacked on top of each other: the output y of one layer becomes the input X of the next layer.
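A minimal NumPy sketch of the stacking idea (layer sizes and weights are made up): the output of layer 1 is fed as input to layer 2:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # batch of 4 samples with 3 features
W1 = rng.normal(size=(3, 8))  # weights of layer 1 (sizes are made up)
W2 = rng.normal(size=(8, 2))  # weights of layer 2

h = np.maximum(0, X @ W1)     # layer 1 output y = relu(X W1) ...
y = h @ W2                    # ... becomes the input X of layer 2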
What does the Backpropagation algorithm do?
Backpropagation uses the chain rule to calculate the gradients over the whole network.
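A tiny worked example of the chain rule on a one-neuron "network" (all values are illustrative):

x, w, b = 2.0, 3.0, 1.0

z = w * x + b          # forward pass: z = 7
y = z ** 2             # "loss": y = 49

dy_dz = 2 * z          # local gradient: dy/dz = 14
dz_dw = x              # local gradient: dz/dw = 2
dy_dw = dy_dz * dz_dw  # chain rule: dy/dw = 14 * 2 = 28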
Name 3 different kinds of neural networks:
- Feedforward networks: Information flows forward only
- Recurrent neural network: Neuron has a feedback loop
- Convolutional neural network: Network "scans" through input instead of consuming all input at once
Which trends are driving the data science "revolution"?
Mainly Big Data and Machine Learning.
Give a definition of Data science:
1. Data science is about the extraction of useful information and knowledge from large volumes of data, in order to improve business decision-making.
2. Data science is an interdisciplinary subject with 3 key areas:
- Statistics
- Computer Science
- Domain expertise
Why is Data Science important?
In the past, data analysis was typically slow: it needed teams of statisticians, analysts etc. to explore data manually.
Today volume, velocity and variety make manual analysis impossible, but fast computers and good algorithms allow much deeper analyses than before.
--> data-driven decision making
--> base decisions on analysis of data, not intuition
Name the approximate year of invention of Artificial Intelligence, Machine Learning and Deep Learning:
- AI, 1950s: creation of the first "intelligent" algorithms and programs
- ML, 1980s: statistical models and algorithms that can learn from data
- DL, 2010s: statistical models and algorithms, inspired by neurons, that can learn from data
Name the 3 main branches of ML and some of their applications:
- Supervised Learning
  - Classification: diagnostics, customer retention, image classification
  - Regression: estimating life expectancy, population growth prediction, market forecasting
- Unsupervised Learning
  - Clustering: recommender systems, customer segmentation, targeted marketing
  - Dimensionality Reduction: big data visualisation, structure discovery
- Reinforcement Learning: game AI, robot navigation, real-time decisions
Explain supervised learning:
In supervised learning the training data consists of input/output pairs and we train a function to map the inputs to the outputs. The predicted variable is thereby either a continuous variable like price / cost / weight (regression problems) or a categorical variable like A, B or C / dogs or cats (classification problems).
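A minimal scikit-learn sketch of the regression case (the input/output pairs are made up):

from sklearn.linear_model import LinearRegression

X = [[1], [2], [3], [4]]  # inputs (made-up)
y = [2.1, 3.9, 6.2, 8.1]  # continuous outputs (made-up)

model = LinearRegression().fit(X, y)  # learn the mapping from inputs to outputs
print(model.predict([[5]]))           # predict the output for an unseen input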
Explain unsupervised learning:
In unsupervised learning there are no labels available; insights are gained without prior knowledge.
For anomaly / outlier detection the task is finding samples in a dataset that raise suspicion.
The problem thereby is that you usually do not know what you are looking for.
The solution is to use statistics and characteristics of the dataset to find outliers, as sketched below.
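A minimal sketch of this idea using z-scores (the data and the threshold are assumptions):

import numpy as np

data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 25.0])  # made-up samples

z = (data - data.mean()) / data.std()  # use the dataset's own statistics
print(data[np.abs(z) > 2])             # flag samples far from the mean: [25.]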
What mainly drove the focus on deep learning?
1. A lot of data
2. Computing power
What is new in deep learning compared to earlier "machine learning"?
What is new (among other things) is a learning algorithm called backpropagation, which makes it possible to train deep neural nets.
State-of-the-art networks can have over 200 layers.
What are main differences between classical ML vs. Deep Learning?
Classical ML methods don't handle high dimensionality well.
--> dimensionality reduction / feature selection
Deep neural nets learn compact representations of data even in a high dimensionality / sparse setting - no feature engineering required!
Why do we not just apply Deep Learning to every problem?
Classical ML can perform better than DL in cases where
- the necessary (large amounts of) data are not available
- the computational power is not available
- interpretable results are required (DL results are harder to interpret)
- robustness is required: deep networks can be fooled (e.g. an image of a panda overlaid with a noise filter is predicted as a gibbon)
How do you define any object in Python (3 ways)?
In general, for any object to be defined, it has to be accessible within the current scope, namely:
1. It belongs to Python's default environment. These are the built-in functions and containers like str, print, list, etc.
2. It has been defined in the current program, e.g. when you create a custom function with the def keyword.
3. It exists in a separate library and you imported the library with a suitable import statement.
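A short snippet illustrating the three ways (the function and argument names are made up):

print(len("data"))       # 1. len and print are built-ins

def greet(name):         # 2. defined in the current program with def
    return f"Hello, {name}!"
print(greet("world"))

import math              # 3. imported from a separate library
print(math.sqrt(16))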
Make an example of creating a pandas DataFrame from a CSV file and assigning it to a variable:
e.g.
housing_data = pd.read_csv(DATA/'housing.csv')
where DATA is a variable (e.g. a pathlib.Path) in which the data directory is stored.
Which are the first recommended commands after importing a pandas DataFrame?
Look at the head and tail of the dataset, because one often finds metadata or aggregations at the end of Excel files:
df.head() and df.tail()
Look also at some random samples with df.sample(5, random_state=42) # random_state ensures reproducibility
Get a short description of the df with the feature names, number of values and dtypes with df.info().
Get a summary of the numerical attributes via df.describe() (descriptive statistics).
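The commands from this card collected into one runnable sketch (the CSV path is an assumption):

import pandas as pd

df = pd.read_csv("housing.csv")  # path is an assumption

df.head()                        # first rows
df.tail()                        # last rows: watch out for metadata / aggregations
df.sample(5, random_state=42)    # random rows; random_state ensures reproducibility
df.info()                        # feature names, number of values, dtypes
df.describe()                    # descriptive statistics of the numeric columns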