Data Science

lesson01-lesson15


Deck details

Cards: 118
Language: English
Category: Computer science
Level: University
Created / Updated: 15.06.2020 / 28.12.2022
Weblink: https://card2brain.ch/box/20200615_data_science

Why are texts encoded as vectors or matrices for NLP?

Because most machine learning methods can only handle numerical data such as vectors and matrices.

What is the encoding of input texts as vectors or matrices called in NLP?

Vector encodings

Explain how text data is preprocessed for NLP:

Being able to filter/combine/manipulate strings is a crucial skill for natural language processing.

Cleaning up text for NLP tasks usually involves the following steps:

  • Normalization -> process of transforming the text to lower-case
  • Tokenization -> split the text in words/tokens
  • Remove stop-words -> remove words that are too common and do not add to the content of sentences
  • Remove non-alphabetical tokens -> remove all tokens that are not composed of letters (e.g. punctuation and numbers)
  • Stemming -> trim the words to their stem. This drastically decreases the vocabulary size and maps similar/same words onto the same stem (e.g. plural/singular words or different forms of verbs)

Some of the steps might not be necessary, or you may need to add steps, depending on the text, task and method. A minimal example pipeline is sketched below.
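A minimal preprocessing sketch, assuming the NLTK library is installed; the regex tokenizer, the English stop-word list and the Porter stemmer are illustrative choices, not the only possible pipeline:

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('stopwords')                               # one-time download of the stop-word lists

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def preprocess(text):
    text = text.lower()                                  # normalization
    tokens = re.findall(r'[a-z]+', text)                 # tokenization, keeping only alphabetical tokens
    tokens = [t for t in tokens if t not in stop_words]  # remove stop-words
    return [stemmer.stem(t) for t in tokens]             # stemming

print(preprocess('The movies were good and not bad!'))
# e.g. ['movi', 'good', 'bad']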

What are the main parts of NLP?

  1. Preparing/having a dataset
  2. Preprocessing text
  3. Vector encoding
  4. Model (e.g. Naive Bayes classifier)

In NLP, what is the next part after preprocessing?

Vector encoding

After the text is cleaned up and tokenized, the corpus is ready to be encoded as vectors. One-hot encodings can be extended to count encodings and TF-IDF encodings.

In scikit-learn we use it the following way:

from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer(your_settings)

count_vectorizer.fit(your_dataset)

vec = count_vectorizer.transform(['your_text'])

This creates a vectorizer that can transform texts to vectors. We can also limit the number of words taken into account when building the vector. This limits the vector size and cuts off words that occur rarely. If you set max_features=1000, only the 1000 most frequently occurring words are used to build the vector and all rare words are excluded. The encoding vector then has a dimension of 1000. For now we take all words. Since we used our own tokenizer and preprocessing step, we overwrite the standard steps in the vectorizer library with the vec_default_settings.
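A small runnable sketch, assuming scikit-learn; the toy corpus and the max_features value are made up for illustration, and TfidfVectorizer shows that TF-IDF encodings are built the same way:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    'the movie was good',
    'the movie was not bad',
    'a truly bad movie',
]

# Count encoding limited to the 5 most frequent words
count_vec = CountVectorizer(max_features=5)
X_counts = count_vec.fit_transform(corpus)
print(count_vec.get_feature_names_out())   # kept vocabulary (get_feature_names() on older scikit-learn)
print(X_counts.toarray())                  # one row per text, one column per word

# TF-IDF encoding of the same corpus
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(corpus)
print(X_tfidf.toarray().round(2))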

Explain what n-grams are used for in NLP:

When we use a count or TF-IDF vectorizer we throw all sequential information in the texts away. From the vector encodings above we could not reconstruct the original sentences. For this reason these encodings are called Bag-of-Words encodings (all words go in a bag and are shuffled). However, sequential information can be important for the meaning of a sentence. As an example imagine the sentence:

'The movie was good and not bad.'

It is important to know whether the word 'not' is in front of 'good' or 'bad' for determining the sentiment of the sentence. We can preserve some of that information by using n-grams. Instead of just encoding single words we can also encode tuples, triplets etc., called n-grams. The n encodes how many words we bundle together.
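A short sketch, assuming scikit-learn: ngram_range=(1, 2) keeps single words and word pairs, so 'not bad' becomes its own feature and some word order is preserved:

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(1, 2))
X = vec.fit_transform(['The movie was good and not bad.'])
print(vec.get_feature_names_out())   # includes bigrams such as 'not bad' alongside the single words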

What is a strength of modern GPUs?

They can run a lot of computations in parallel, and many deep learning computations are easily parallelizable.

What is transfer learning about?

Training deep learning models requires a lot of data. It is not uncommon to train models on millions of images or gigabytes of text data to achieve good results. Most real-world problems don't have that amount of labeled data ready, and not all companies or individuals who want to train a model can afford to hire people to label data for them.

For many years this was very challenging. Fortunately, it was solved for image-based models a couple of years ago and recently also for NLP. One approach that helps train models with limited labeled data is called transfer learning.

The idea is that once a model is trained on a large dataset for a specific task (e.g., classifying houses vs. planes), the model has learned certain features of the data that can be reused for another task. Such features could be how to detect edges or textures in images. If these features are useful for another task, then we can train the model on new data without requiring as many labels as if we were training it from scratch.

How did Jeremy Howard and Sebastian Ruder manage to do transfer learning in NLP?

For transfer learning in NLP Jeremy Howard and Sebastian Ruder came up with a similar approach called ULMFiT (Universal Language Model Fine-tuning for Text Classification) for texts. The central theme of the approach is language modeling.

What are the three steps of language modeling in ULMFiT (the transfer learning method of Jeremy Howard and Sebastian Ruder), and what is its goal?

In language modeling the goal is to predict the next word based on the previous words in a text.

The steps include:

  1. Language model (wiki): A language model is trained on a large dataset. Wikipedia is a common choice for this task as it includes many topics, and the text is of high quality. This step usually takes the most time, on the order of days. In this step, the model learns the general structure of language.

  2. Language model (domain): The language model trained on Wikipedia might be missing some aspects of the domain we are interested in. If we want to do sentiment classification, Wikipedia does not offer much insight since Wikipedia articles are generally of neutral sentiment. Therefore, we continue training the language model on the text we are interested in. This step still takes several hours.

  3. Classifier (domain): Now that the language model works well on the text we are interested in, it is time to build a classifier. We do this by adapting the output of the network to yield classes instead of words. This step only takes a couple of minutes to an hour to complete.

What is the power of the language modeling task in ULMFiT?

The power of this approach is that you only need little labeled data for the last step and only need to go through the expensive first step once. Even the second step can be reused on the dataset if you, for example, build a sentiment classifier and additionally a topic classifier. This can be done in minutes and allows us to achieve great results with little time and resources.

What is the corresponding function of get_dataset() in the fastai library?

untar_data()

Preprocess text data for language modeling:

data_lm = (TextList.from_folder(path)
           # Inputs: all the text files in path
           .filter_by_folder(include=['train', 'test', 'unsup'])
           # We may have other temp folders that contain text files so we only keep what's in train and test
           .split_by_rand_pct(0.1)
           # We randomly split and keep 10% (10,000 reviews) for validation
           .label_for_lm()
           # We want to do a language model so we label accordingly
           .databunch(bs=bs))

What does bs stand for?

bs = batch size

The batch size bs specifies how many samples the model is optimised on at each step.

What do xxunk, xxbos, xxeos and xxmaj stand for in the fastai token vocabulary (NLP)?

xxunk = A word that is not in the dictionary

xxbos and xxeos = Identify the beginning and the end of a string. So if the first entry in the encoding vector is 1, the token is xxunk; if the third entry is 1, the token is xxbos.

xxmaj = Signifies that the first letter of the following word is capitalised.

 

What does LSTM stand for?

Long short-term memory network, a deep learning model used for text classification. It is a neural network with a feedback loop: when fed a sequence of tokens, it feeds back its output for the next prediction. This gives the model a mechanism for remembering past inputs, which is especially useful when dealing with sequential data such as texts, where the sequence of words and characters carries important meaning.

What does the following code do?

learn = language_model_learner(data_lm, AWD_LSTM, drop_mult=0.3, model_dir="../data/")

It loads the language model pretrained on Wikipedia that comes with the fastai library (the AWD_LSTM architecture). The dataset it will be fine-tuned on (data_lm) is passed in as well.

What is a key parameter when training models in deep learning?

The learning rate

How do you find the optimal learning rate in deep learning and what is the API in fastai to do so?

With the lr_find() function, we can explore how the loss function behaves with regard to the value of the learning rate:

learn.lr_find()

In the graph we want to find the spot where the loss function decreases the steepest with the largest learning rate.
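A short fastai v1 sketch of the typical workflow around lr_find(), reusing the learn object from above; the learning rate passed to fit_one_cycle is a hypothetical value picked from the plot:

learn.lr_find()
learn.recorder.plot()          # plot loss vs. learning rate
learn.fit_one_cycle(1, 1e-2)   # hypothetical learning rate chosen near the steepest descent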

What is the objective of a language model?

To predict the next word based on a sequence of words.

What do you achieve with the following code cells?

TEXT = "I liked this movie because"
N_WORDS = 40
N_SENTENCES = 2

print("\n".join(learn.predict(TEXT, N_WORDS, temperature=0.75) for _ in range(N_SENTENCES)))

Generate text examples (in this case movie reviews).

How would you create a new ImageDownloader object?

from fastai.vision import *

img_dl = ImageDownloader(data_path, dataset_name)

What are you doing with the following code?

class_name = 'dank_meme'
search_query = 'dank meme'

img_dl.add_images_to_class(class_name, search_query)

Creating a new class for the image downloader and filling it with images found via the search query.

What is done with the following code?

(data_path/dataset_name).ls()

data = ImageDataBunch.from_folder(data_path/dataset_name, valid_pct=0.2, size=224)

data.show_batch(rows=3, figsize=(8, 8)) 

Look at examples of the downloaded images with their classes.

What does the following code do?

!tar -zcf {data_path/dataset_name}.tar.gz {data_path/dataset_name}

Compressing the image dataset into a tar.gz archive so that it can be downloaded.

What is this code for?

import tarfile

path = Path('')

tarfile.open(path/'cats_vs_dogs.tar.gz', 'r:gz').extractall(path)

To extract an image dataset archive so it can be loaded with the fastai library.

We apply a few tricks when we load the data:

  1. A trick often used in image classification is data augmentation. This means one creates more data by manipulating existing data. For images this means that the images can be rotated, flipped, cropped, etc. This generally improves the performance of the classifier and is set up with the get_transforms function.
  2. We split the data 80/20 into train and validation data.
  3. We crop the images to a size of 224 pixels.

tfms = get_transforms()
data = ImageDataBunch.from_folder(path, ds_tfms=tfms, valid_pct=0.2, size=224)

Peek at the data:

data.show_batch(rows=3, figsize=(8, 8)) 

What does training a learner for image classification involve?

Hint: Same steps as with ULMFiT

1. Load a pretrained model (e.g. resnet34, resnet50, etc.)

2. Find the optimal learning rate

3. Fit the head of the network

4. Unfreeze all layers and fine-tune

5. Evaluate the results
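A hedged fastai v1 sketch of these five steps, reusing the data object loaded above; resnet34, the number of epochs and the learning-rate slice are example choices, not prescribed values:

learn = cnn_learner(data, models.resnet34, metrics=accuracy)   # 1. load a pretrained model
learn.lr_find()                                                # 2. find a good learning rate
learn.recorder.plot()
learn.fit_one_cycle(4)                                         # 3. fit the head of the network
learn.unfreeze()                                               # 4. unfreeze all layers and fine-tune
learn.fit_one_cycle(2, max_lr=slice(1e-6, 1e-4))
interp = ClassificationInterpretation.from_learner(learn)      # 5. evaluate the results
interp.plot_confusion_matrix()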

Why is the ResNet34 model often used for image classification tasks?

It is a convolutional neural network with 34 layers. There are also larger networks with up to 152 layers, but this model usually takes the least effort to train on and still gives good results.

What is the following code for?

learn = cnn_learner(data, models.resnet34, metrics=accuracy)

Load pretrained image model

How do you evaluate an image classifier?

With the confusion matrix

What is the following code for?

ds, idxs = DatasetFormatter().from_toplosses(learn)

ImageCleaner(ds, idxs, path)

data = ImageDataBunch.from_csv(path, csv_labels='cleaned.csv', ds_tfms=tfms, valid_pct=0.2, size=224)

data.show_batch(rows=3, figsize=(8, 8))

 

fast.ai's ImageCleaner for data cleaning.

Give the formulas for the following two metrics (for evaluating classifiers with a confusion matrix):

  • Accuracy of positive preds (precision)
  • True positive rate (recall)

  • precision = TP / (TP+FP)
  • recall = TP / (TP+FN)

Note: Very hard (impossible) to get a model with perfect precision AND recall.
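A tiny worked example with made-up counts from a binary confusion matrix, just to illustrate the two formulas:

TP, FP, FN = 40, 10, 5       # hypothetical counts: true positives, false positives, false negatives

precision = TP / (TP + FP)   # accuracy of positive predictions: 40 / 50 = 0.80
recall    = TP / (TP + FN)   # true positive rate: 40 / 45 ≈ 0.89

print(f"precision = {precision:.2f}, recall = {recall:.2f}")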

What does PCA stand for?

Principal Component Analysis.

PCA aims at finding new axes that contain most of the variance in the dataset.

Note: PCA is sensitive to the scaling of the features --> larger scaling <--> larger variance
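A minimal sketch, assuming scikit-learn and made-up data, illustrating why features are usually scaled before PCA:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.random.rand(100, 5) * [1, 10, 100, 1, 1]     # features on very different scales

X_scaled = StandardScaler().fit_transform(X)        # zero mean, unit variance per feature
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)                  # project onto the two main axes of variance

print(pca.explained_variance_ratio_)                # fraction of the variance captured by each new axis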

What is meant with topology?

The study of shapes

One way to study shapes is to look at their holes: the exact form of a figure does not matter; as long as it has the same holes, it is considered the same shape.

What are count-encodings exactly in NLP?

Count-encodings are the sum of the one-hot encodings of the words in a text.

What are the advantages and disadvantages of n-grams in NLP?

Advantage: some information on the sequence is conserved

Disadvantage: Vocabulary can get very large

How does deep learning work (simple explanation)?

In deep learning, many neuron layers are stacked on top of each other -> the output y of one layer becomes the input X of the next layer.
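A toy NumPy sketch of that idea; the layer sizes and the ReLU non-linearity are arbitrary illustration choices:

import numpy as np

def layer(X, W, b):
    return np.maximum(0, X @ W + b)        # linear step followed by a ReLU non-linearity

X = np.random.rand(4, 3)                   # 4 samples, 3 features
W1, b1 = np.random.rand(3, 5), np.zeros(5)
W2, b2 = np.random.rand(5, 2), np.zeros(2)

h = layer(X, W1, b1)                       # the output of the first layer ...
y = layer(h, W2, b2)                       # ... becomes the input of the second layer
print(y.shape)                             # (4, 2)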

What does the Backpropagation algorithm do?

Backpropagation uses the chain rule to calculate the gradients over the whole network.
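A tiny worked example of the chain rule that backpropagation applies layer by layer; the two functions are made up for illustration:

# For y = f(g(x)) with g(x) = 3*x and f(u) = u**2, the chain rule gives dy/dx = f'(g(x)) * g'(x).
x = 2.0
g = 3 * x              # forward pass through the "first layer"
y = g ** 2             # forward pass through the "second layer"

dy_dg = 2 * g          # local gradient of the second layer
dg_dx = 3.0            # local gradient of the first layer
dy_dx = dy_dg * dg_dx  # chain rule: the gradient flows backwards through both layers

print(y, dy_dx)        # 36.0 36.0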

Name 3 different kinds of neural networks:

  • Feedforward networks: Information flows forward only
  • Recurrent neural network: Neuron has a feedback loop
  • Convolutional neural network: Network "scans" through input instead of consuming all input at once