AML
Cards for AML Course
File Details
Flashcards | 34 |
---|---|
Language | English |
Category | Computer Science |
Level | University |
Created / Updated | 08.04.2025 / 28.05.2025 |
Web link | https://card2brain.ch/box/20250408_aml |
What is the most significant difference between a standard feedforward neural network and a recurrent one?
There is a new set of weights that connects the hidden layer from the previous time step to the current hidden layer. These weights determine how the network makes use of past context in calculating the output for the current input. They are also trained via backpropagation.
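A minimal NumPy sketch of this recurrent update (names and dimensions are illustrative, not from the course material):

```python
import numpy as np

d = 4                        # hidden/embedding dimension (illustrative)
W = np.random.randn(d, d)    # input-to-hidden weights
U = np.random.randn(d, d)    # hidden-to-hidden weights: the "new" recurrent set

h_prev = np.zeros(d)         # hidden layer from time step t-1
x_t = np.random.randn(d)     # input (embedding) at time step t

# Recurrent update: the previous hidden state enters through U
h_t = np.tanh(W @ x_t + U @ h_prev)
```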
What is meant by unrolling the network in RNNs?
Because the computation at time t requires the value of the hidden layer from time t-1, the inference has to be done incrementally starting from the beginning of the sequence.
Describe the training process of an RNN...
There are two passes. In the first pass, we perform forward inference, computing \(h_t\) and \(y_t\), accumulating the loss at each step in time, and saving the value of the hidden layer at each step for use at the next time step. In the second pass, we process the sequence in reverse, computing the required gradients as we go, computing and saving the error term for the hidden layer at each step backward in time. This general approach is referred to as backpropagation through time.
By unrolling the network into a feedforward computation graph, we get rid of the explicit recurrence and train the network directly without a special procedure.
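A sketch of training on the unrolled graph, here using PyTorch autograd with dummy data and a squared-error loss purely for illustration (the real model would use cross-entropy over a vocabulary):

```python
import torch

d, T = 4, 5
W = torch.randn(d, d, requires_grad=True)   # input weights
U = torch.randn(d, d, requires_grad=True)   # recurrent weights
V = torch.randn(d, d, requires_grad=True)   # output weights

xs = [torch.randn(d) for _ in range(T)]         # dummy input sequence
targets = [torch.randn(d) for _ in range(T)]    # dummy targets

# First pass: forward inference over the unrolled sequence,
# accumulating the loss and keeping every hidden state in the graph.
h = torch.zeros(d)
loss = torch.tensor(0.0)
for x_t, y_true in zip(xs, targets):
    h = torch.tanh(W @ x_t + U @ h)
    y_t = V @ h
    loss = loss + ((y_t - y_true) ** 2).mean()

# Second pass: backpropagation through time -- the gradients flow
# backward through the unrolled graph into W, U and V.
loss.backward()
```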
What is the output vector y that results from forward inference in a (recurrent) language model?
It's a vector representing a probability distribution over the vocabulary to predict the next word given the previous sequence.
How does forward inference work?
At each step, the model uses the word embedding matrix E to retrieve the embedding for the current word, multiplies it by the weight matrix W, and then adds it to the hidden layer from the previous step (weighted by weight matrix U) to compute a new hidden layer.
This hidden layer is then used to generate an output layer which is passed through a softmax layer to generate a probability distribution over the entire vocabulary.
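A minimal NumPy sketch of one step of forward inference in a toy RNN language model (all matrices and sizes are illustrative):

```python
import numpy as np

V_size, d = 10, 4                  # vocabulary size |V| and model dimension d (illustrative)
E = np.random.randn(d, V_size)     # embedding matrix, [d x |V|]
W = np.random.randn(d, d)          # input weights
U = np.random.randn(d, d)          # recurrent weights
V = np.random.randn(V_size, d)     # output (unembedding) weights, [|V| x d]

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

h_prev = np.zeros(d)
w_t = 3                            # index of the current word

e_t = E[:, w_t]                        # retrieve the embedding of the current word
h_t = np.tanh(W @ e_t + U @ h_prev)    # combine with the previous hidden layer
y_t = softmax(V @ h_t)                 # probability distribution over the vocabulary
```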
Assume the embedding dimension and the hidden dimension are the same, so we have a single model dimension d. What shape does the embedding matrix E have?
\([d \times |V|]\), with \(|V|\) being the size of the vocabulary
Assume the embedding dimension and the hidden dimension are the same, so we have a single model dimension d. What shape does the one-hot encoded \(x_t\) have?
\([|V| \times 1]\)
Assume the embedding dimension and the hidden dimension are the same, so we have a single model dimension d. What shape do the weight matrices W and U have?
\([d \times d]\)
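A quick shape check of these matrices in NumPy (illustrative sizes):

```python
import numpy as np

V_size, d = 10, 4
E = np.random.randn(d, V_size)              # [d x |V|]
x_t = np.zeros((V_size, 1)); x_t[3] = 1.0   # one-hot column vector, [|V| x 1]
W = np.random.randn(d, d)                   # [d x d]
U = np.random.randn(d, d)                   # [d x d]

e_t = E @ x_t                               # embedding lookup as a matrix product, [d x 1]
print(E.shape, x_t.shape, W.shape, U.shape, e_t.shape)
```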
How does pretraining an RNN work (self-supervision)?
We train the model to minimize the error in predicting the true next word in the training sequence, using cross-entropy as the loss function.
Cross-entropy measures the difference between a predicted probability distribution and the correct distribution, which in our case comes from knowing the next word. With a one-hot target, this boils down to the negative log probability that the model assigns to the correct next word (i.e. to its index in the vector).
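A tiny worked example of this loss with a one-hot target (toy numbers):

```python
import numpy as np

# With a one-hot target, cross-entropy reduces to the negative log
# probability the model assigns to the correct next word.
y_pred = np.array([0.1, 0.6, 0.2, 0.1])   # model's distribution over a toy vocabulary
correct_index = 1                          # index of the true next word
loss = -np.log(y_pred[correct_index])      # = -log(0.6)
print(loss)
```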
What is weight tying and what are its benefits?
For RNNs, the input embedding matrix E and the final layer matrix V can be tied, because V has the transposed shape of E; we simply use \(E^T\) in place of V.
In addition to providing improved model perplexity, this approach significantly reduces the number of parameters required for the model.
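A minimal sketch of weight tying, reusing \(E^T\) as the output layer (illustrative values):

```python
import numpy as np

V_size, d = 10, 4
E = np.random.randn(d, V_size)   # input embedding matrix, [d x |V|]
h_t = np.random.randn(d)

logits = E.T @ h_t               # E^T ([|V| x d]) replaces a separate output matrix V
```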
What are three domains in which RNNs can be used?
- Sequence Labeling like POS
- Sequence Classification
- Text Generation like machine translation, text summarization, grammar correction, story generation, and conversational dialogue
What is an autoregressive model?
An autoregressive model is a model that predicts a value at time t based on a linear function of the previous values at times t-1, t-2, and so on.
Language models are not linear, but language generation is still often referred to as autoregressive generation, because we predict a word based on the embeddings of previous words.
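A sketch of autoregressive generation with a toy RNN language model, feeding each sampled word back in as the next input (random weights, so the output is meaningless; the start index is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, d = 10, 4
E = rng.normal(size=(d, V_size))
W, U = rng.normal(size=(d, d)), rng.normal(size=(d, d))
V = rng.normal(size=(V_size, d))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Autoregressive generation: each sampled word is fed back in as the
# input for the next time step.
h = np.zeros(d)
w = 0                        # start token index (illustrative)
generated = [w]
for _ in range(5):
    h = np.tanh(W @ E[:, w] + U @ h)
    w = int(rng.choice(V_size, p=softmax(V @ h)))
    generated.append(w)
print(generated)
```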
What are the two basic processes of Diffusion models?
The forward (noising) process, which gradually corrupts a sample by adding Gaussian noise over many steps, and the reverse (denoising) process, which is learned and gradually removes the noise again to generate new samples.
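A minimal NumPy sketch of the forward (noising) process under an assumed linear variance schedule; the reverse process is the part a network has to learn:

```python
import numpy as np

rng = np.random.default_rng(0)

# Forward (noising) process: add Gaussian noise over T steps according
# to a variance schedule beta_t, so x_T is approximately pure noise.
T = 1000
betas = np.linspace(1e-4, 0.02, T)    # illustrative linear schedule
alphas_bar = np.cumprod(1.0 - betas)

x0 = rng.normal(size=(8, 8))          # a toy "image"
t = 500
noise = rng.normal(size=x0.shape)
x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * noise

# The reverse (denoising) process is the learned part: a network is
# trained to predict the noise and step from x_T back towards x_0.
```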
What is the basic concept of generative models?
They learn to model a true distribution (e.g. the real-world process that generates our data) \(p(x)\) by observing samples x from that distribution. We know that the true distribution exists, but we don't know how it is defined. There are many latent variables (unobservable variables) so we can only approximate the true distribution through parameters that are learned during model training on samples from the true distribution (in our case real images).
What is a solution to approximate the true distribution?
One solution is to approximate it by maximizing the (log-)likelihood \(p(x)\), where the shape of the distribution \(p\) is controlled by latent variables \(z\). Formally, we extend our probability representation to include these latent variables explicitly by marginalizing over them: \(p(x) = \int p(x, z)\, dz = \int p(x \mid z)\, p(z)\, dz\)
What does the distributional hypothesis state?
Words that occur in similar contexts tend to have similar meanings (e.g. synonyms).
What are word embeddings?
Representations of the meaning of words as vectors learned from their distribution in texts.
Which two kinds of embeddings exist?
- Static embeddings
- Contextualized embeddings
What aspects of word meaning flow into vector semantics?
- Lemmas and Senses
- Synonymy
- Word Similarity
- Word Relatedness
- Words from the same semantic field (e.g. hospitals -> surgeon, scalpel, nurse, etc.)
- Semantic Frames and Roles (buyer - seller)
- Connotation
- Sentiment
What is the idea of vector semantics?
The idea of vector semantics is to represent a word as a point in a multidimensional semantic space that is derived from the distributions of word neighbors.
Vector semantic models can be learned automatically from text without supervision.
What are examples of sparse and dense vector semantic models that are typically used as baselines?
- Sparse: tf-idf, where the meaning of a word is defined by a simple function of the counts of nearby words.
- Dense: Word2vec.
Explain the term-document matrix...
In the term-document matrix, each row represents a word in the vocabulary and each column represents a document from some collection of documents. Each cell therefore contains the number of times a word occurs in a given document. The matrix can be used to represent a document as a vector of word counts, for example for information retrieval and document similarity measures. Its dimensions are: D columns for the number of documents and \(|V|\) rows for the number of words in the vocabulary across all documents.
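A toy term-document matrix in NumPy (made-up documents):

```python
import numpy as np

# Toy term-document matrix: rows = words, columns = documents,
# cells = raw counts of the word in the document.
vocab = ["battle", "good", "fool", "wit"]
docs = {
    "doc1": "good good fool wit",
    "doc2": "battle battle good",
}
M = np.array([[doc.split().count(w) for doc in docs.values()] for w in vocab])
print(M.shape)   # (|V|, D) = (4, 2)
print(M)
```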
Explain the term-term matrix / word-word matrix...
In the term-term matrix, each cell records the number of times the row (target) word and the column (context) word co-occur in some context in a training corpus. The context is typically a sliding window (e.g. 4 words to the left and 4 words to the right). Since most word pairs never co-occur, this results in very sparse vectors.
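A toy sliding-window co-occurrence count in Python (window of 2 words on each side, made-up sentence):

```python
from collections import defaultdict

# Toy sliding-window co-occurrence counts (window of 2 words on each side).
tokens = "the cherry pie and the strawberry pie".split()
window = 2
counts = defaultdict(int)
for i, target in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            counts[(target, tokens[j])] += 1

print(counts[("cherry", "pie")])   # how often 'pie' occurs in the context of 'cherry'
```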
How can we measure the similarity between two words?
We can use the cosine of the angle between two vectors as a measure of similarity. It is based on the dot product of the two vectors, which is large when the vectors have large values in the same positions. To counter the problem that the raw dot product favours long vectors (frequent words tend to have longer vectors), we normalize it by dividing by the product of the lengths of the two vectors, which gives the cosine.
We can also pre-normalize each vector by dividing it by its length to create a unit vector of length 1.
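A minimal cosine-similarity sketch in NumPy (toy vectors):

```python
import numpy as np

def cosine(v, w):
    # Dot product normalized by the product of the vector lengths
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

v = np.array([1.0, 2.0, 0.0])
w = np.array([2.0, 4.0, 1.0])
print(cosine(v, w))

# Equivalent: pre-normalize each vector to unit length, then take the dot product
v_hat, w_hat = v / np.linalg.norm(v), w / np.linalg.norm(w)
print(np.dot(v_hat, w_hat))
```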
What is the motivation for TF-IDF?
Raw word counts are not very informative. Common words like 'the', 'a', or 'it' frequently co-occur with many other words and are therefore pretty useless for discriminating between documents or word meanings. At the same time, words that occur often in the context of another word (like pie and cherry) are important to determine the meaning of the word. We need to find a way to distinguish between important and unimportant words.
What do we do in the tf-idf weighting?
tf: term frequency, meaning the frequency/count of the word t in the document d. Often we use the log of the raw count.
idf: inverse document frequency (N/df, with N = number of documents), meaning we divide the total number of documents N by the document frequency df, i.e. the number of documents the term occurs in. If a word appears in all documents, it is less informative than a word that only appears in specific documents. Typically the logarithm is also taken.
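A toy tf-idf computation using the log-scaled variants described above (exact weighting variants differ between implementations; all numbers are made up):

```python
import numpy as np

# One tf-idf weight for a single word/document pair.
count_t_d = 7    # raw count of term t in document d
N = 1000         # number of documents in the collection
df_t = 50        # number of documents containing term t

tf = np.log10(count_t_d + 1)   # log-scaled term frequency
idf = np.log10(N / df_t)       # inverse document frequency
print(tf * idf)
```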
What is pointwise mutual information (PMI)?
The PMI is an alternative weighting function to tf-idf and measures how often two events (words) co-occur, compared with what we would expect if they were independent.
The numerator tells us how often we observed the two words together and the denominator tells us how often we would expect the two words to co-occur assuming they each occurred independently.
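A toy PMI computation (made-up probabilities):

```python
import numpy as np

# PMI(w, c) = log2( P(w, c) / (P(w) * P(c)) )
p_w_c = 0.002    # joint probability of target w and context c (toy value)
p_w = 0.01       # probability of the target word
p_c = 0.05       # probability of the context word

pmi = np.log2(p_w_c / (p_w * p_c))
print(pmi)       # > 0: the words co-occur more often than expected under independence
```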
Why do dense embeddings work better than sparse representations like TF-IDF?
The intuition is that representing words as 300-dimensional dense vectors requires our classifiers to learn far fewer weights than if we represented words as 50'000-dimensional sparse vectors, and the smaller parameter space possibly helps with generalization and avoids overfitting.
What is the intuition for word2vec (skip-gram with negative sampling)?
The intuition of word2vec is that instead of counting how often each word w occurs near another word, we'll instead train a classifier on a binary prediction task: "Is word w likely to show up near the other word?" and then take the learned classifier weights as the word embeddings.
The revolutionary intuition here is that we can just use running text as implicitly supervised training data for such a classifier, which is called self-supervision and avoids the need for any hand-labeled supervision signal. Intuitively, we perform the following steps (a code sketch follows this list):
- Treat the target word and a neighboring context word as positive examples.
- Randomly sample other words in the lexicon to get negative samples.
- Use logistic regression to train a classifier to distinguish those two cases.
- Use the learned weights as the embeddings.
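A minimal sketch of one skip-gram-with-negative-sampling update (toy dimensions; real word2vec samples negatives from a weighted unigram distribution rather than uniformly, and all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, d, lr, k = 50, 8, 0.1, 2      # vocab size, dimension, learning rate, negatives per positive

W = rng.normal(scale=0.1, size=(V_size, d))   # target-word embeddings
C = rng.normal(scale=0.1, size=(V_size, d))   # context-word embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(target, context):
    """One logistic-regression update for a (target, context) positive pair."""
    # Note: real word2vec samples negatives from a weighted unigram
    # distribution; here we sample uniformly for simplicity.
    negatives = rng.integers(0, V_size, size=k)
    for c_idx, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        grad = sigmoid(W[target] @ C[c_idx]) - label   # gradient of the logistic loss
        dW, dC = grad * C[c_idx], grad * W[target]
        W[target] -= lr * dW
        C[c_idx] -= lr * dC

sgns_step(target=3, context=7)   # after training, the rows of W are used as the embeddings
```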
What is a problem of word2vec?
There is no good way to deal with unknown words. fastText (an alternative to word2vec) solves this by using a subword model, representing each word as itself plus a bag of constituent character n-grams, with special boundary symbols < and > added to each word.
E.g. for n = 3, "where" is represented as <where> plus <wh, whe, her, ere, re>.
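A small sketch of these boundary-marked character n-grams (assuming n = 3):

```python
def char_ngrams(word, n=3):
    # Add boundary symbols, then take the whole word plus all character n-grams
    w = f"<{word}>"
    return [w] + [w[i:i + n] for i in range(len(w) - n + 1)]

print(char_ngrams("where"))   # ['<where>', '<wh', 'whe', 'her', 'ere', 're>']
```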