08_Summarizing
Deck Details
Cards | 13 |
---|---|
Language | English |
Category | Computer science |
Level | University |
Created / Updated | 07.02.2018 / 28.05.2020 |
License | Not specified |
Weblink | https://card2brain.ch/box/20180207_8summarizing |
Main components and parameters of summarization systems
- Components:
  - Content selection: selection of information from the document(s) to be summarized
  - Information ordering: ordering of the extracted units
  - Sentence realization: improve the output to obtain fluent text
- Main parameter:
  - Compression rate: length of the summary, or the proportion of the text to be kept
Main use of summarization
Reduce information overload by extracting relevant information
Types of Summaries
- Main Type Dimensions:
- Single document vs. Multi-document summarization
- Generic summarization vs. Query-focused summarization:
- Generic summarization does not take the particular user or their information need into account, whereas query-focused summarization does
- Abstractive vs. Extractive summarization
- Abstractive summarization results in a text that differs from the original
- Extractive summarization consists of the original phrases and sentences
- Most state-of-the-art systems focus on extractive summarization, since it is easier
- Other types:
  - Contrastive multi-document summaries: cover the topics common to all documents as well as the topics unique to each document
  - Update summaries: contain only new information that has not been covered before
Single-document summarization
Steps:
- Content selection:
- choose sentences to extract from the document, either with an unsupervised or supervised method
- Baseline: use first k sentences
- Information ordering:
- choose an order for the sentences
- Baseline: keep the order of the original text
- Sentence realization:
- clean up sentences, e.g. sentence simplification, sentence fusion, etc.
- Baseline: do not perform any combination or clean-up
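The three baseline choices above can be combined into a minimal extractive summarizer. A toy sketch (the regex-based sentence splitter is an assumption; real systems use a proper tokenizer that handles abbreviations):

```python
import re

def baseline_summary(text, k=3):
    """Baseline summarizer: content selection = first k sentences,
    information ordering = original order, no sentence realization."""
    # naive sentence splitting on ., !, ? followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return " ".join(sentences[:k])

doc = ("Summarization reduces information overload. "
       "Extractive systems select original sentences. "
       "Abstractive systems generate new text. "
       "This sketch keeps the first k sentences.")

summary = baseline_summary(doc, k=2)
```

Despite its simplicity, the first-k-sentences baseline is surprisingly hard to beat on news text, which is why it is the standard point of comparison.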
Centroid-based content selection
- Simplest approach: select sentences that contain more informative words
  - Informativeness measured e.g. with the maximum likelihood estimate or TF-IDF
- Sentences are scored based on the scores of the informative words they contain
- All sentences are ranked by their score, and the top-ranked sentences are kept for the summary
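A toy sketch of the TF-IDF variant of this scoring (the smoothed IDF formula and the tiny background corpus are assumptions for illustration):

```python
import math
from collections import Counter

def tfidf_sentence_scores(sentences, background):
    """Score each sentence by the average TF-IDF weight of its words.
    `background` is a list of tokenized documents used only for IDF."""
    n_docs = len(background)
    df = Counter()                       # document frequency per word
    for doc in background:
        df.update(set(doc))
    scores = []
    for sent in sentences:
        words = sent.lower().split()
        tf = Counter(words)
        # smoothed IDF variant (assumption): log(1 + N / (1 + df))
        weight = sum(tf[w] * math.log(1 + n_docs / (1 + df[w])) for w in tf)
        scores.append(weight / max(len(words), 1))
    return scores

background = [["the", "cat"], ["the", "dog"], ["the", "cat", "dog"]]
sents = ["the the the", "quantum summarization"]
scores = tfidf_sentence_scores(sents, background)
# rank sentences by score; the top-ranked ones form the summary
ranked = sorted(range(len(sents)), key=lambda i: scores[i], reverse=True)
```

Sentences made of background-frequent words score low; sentences with words rare in the background corpus score high.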
Further Approaches
The SumBasic system
- Compute a probability for each word: p(w) = n/N, where n is the number of times the word appears in the input and N is the total number of words in the input
- Score each sentence by the average probability of the words it contains
- Pick the best-scoring sentence
- Update the probability of each word in the chosen sentence: p_new(w) = p_old(w)^2, which penalizes redundancy
- Repeat until the desired summary length is reached
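The SumBasic loop above in a compact sketch (tokenization by whitespace is a simplifying assumption):

```python
from collections import Counter

def sumbasic(sentences, n_select=2):
    """SumBasic: p(w) = n/N over the input; score sentences by the
    average p(w) of their words; after picking a sentence, square the
    probabilities of its words to discourage redundancy."""
    tokens = [s.lower().split() for s in sentences]
    counts = Counter(w for toks in tokens for w in toks)
    total = sum(counts.values())
    p = {w: c / total for w, c in counts.items()}
    chosen, remaining = [], list(range(len(sentences)))
    while remaining and len(chosen) < n_select:
        # sentence score = average word probability
        best = max(remaining,
                   key=lambda i: sum(p[w] for w in tokens[i]) / len(tokens[i]))
        chosen.append(sentences[best])
        remaining.remove(best)
        for w in set(tokens[best]):      # update: p_new(w) = p_old(w)^2
            p[w] = p[w] ** 2
    return chosen

sents = ["data data data", "other words here", "data again now"]
selected = sumbasic(sents, n_select=2)
```

After the first pick, the squared probability of "data" drops enough that the second pick still favors the sentence with remaining high-probability words rather than pure repetition.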
Log-Likelihood ratio (LLR)
- a quite common measure of word informativeness
- The document(s) to summarize are compared to a background corpus: words that are significantly more frequent in the input than in the background are treated as informative "topic signature" words
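A sketch of the per-word LLR statistic: compare the likelihood of the counts under one shared rate (null hypothesis) against separate rates for input and background. The binomial likelihood formulation is the standard one; the edge-case guard is an assumption for the toy setting:

```python
import math

def log_likelihood(k, n, p):
    """Binomial log-likelihood of k occurrences in n trials."""
    if p <= 0 or p >= 1:   # guard for degenerate rates (assumption)
        return 0.0
    return k * math.log(p) + (n - k) * math.log(1 - p)

def llr(k_in, n_in, k_bg, n_bg):
    """-2 log lambda for one word: does its rate in the input differ
    from the pooled rate? High values mark topic-signature words."""
    p_all = (k_in + k_bg) / (n_in + n_bg)
    null = (log_likelihood(k_in, n_in, p_all)
            + log_likelihood(k_bg, n_bg, p_all))
    alt = (log_likelihood(k_in, n_in, k_in / n_in)
           + log_likelihood(k_bg, n_bg, k_bg / n_bg))
    return -2 * (null - alt)
```

A word occurring at the same relative frequency in input and background scores near zero; a word heavily over-represented in the input scores high and would pass the usual significance cutoff.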
LexRank
- graph-based content selection
- represents a cluster of documents as a network of related sentences
- sentences that are similar to many other sentences are more central
- sentence similarity is measured with cosine similarity over TF-IDF sentence vectors
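A minimal sketch of the graph-based idea: build a sentence-similarity graph and run power iteration (PageRank-style) to find central sentences. Plain term-frequency cosine is used here for brevity; real LexRank uses the IDF-modified cosine, and the threshold/damping values are assumptions:

```python
import math
from collections import Counter

def cosine(a, b):
    num = sum(a[w] * b.get(w, 0) for w in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def lexrank(sentences, threshold=0.1, damping=0.85, iters=50):
    """Sentence centrality via power iteration on the similarity graph."""
    vecs = [Counter(s.lower().split()) for s in sentences]
    n = len(sentences)
    # adjacency: edge if cosine similarity exceeds the threshold
    adj = [[1.0 if cosine(vecs[i], vecs[j]) > threshold else 0.0
            for j in range(n)] for i in range(n)]
    # row-normalize into a transition matrix (diagonal is 1, so rows sum > 0)
    trans = [[adj[i][j] / sum(adj[i]) for j in range(n)] for i in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [(1 - damping) / n
                  + damping * sum(scores[i] * trans[i][j] for i in range(n))
                  for j in range(n)]
    return scores

sents = ["apple banana", "apple cat", "banana dog"]
scores = lexrank(sents)
```

The first sentence overlaps with both others, so it sits in the center of the graph and receives the highest score.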
Lexical chains
- group sets of words, esp. nouns, which are semantically related (same word - same sense, synonyms, hypernyms/hyponyms, co-hyponyms, collocations)
- Lexical chains can be used to identify important concepts from a document
- Each noun instance usually belongs to exactly one lexical chain
- → it is necessary to perform word sense disambiguation
- For Summarization:
- All terms representing the same concept occur in the same chain
- → avoids repetition
- The chain combines the weights (frequencies) of its members, so that low-frequency terms may still help identify important concepts
- Build lexical chains:
  - Extract nouns and noun phrases
  - Use a lexical resource or statistics over large corpora to determine word relatedness
  - Identify strong chains based on their length
  - Extract significant sentences: one sentence for each chain (or more for longer chains)
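A toy sketch of greedy chain building. The hand-written relatedness relation stands in for a lexical resource such as WordNet, and word sense disambiguation is omitted; both are assumptions:

```python
def build_chains(nouns, related):
    """Greedy lexical chaining: each noun joins the first chain that
    contains a related word, otherwise it starts a new chain.
    `related` is a toy symmetric relatedness relation (real systems
    use WordNet synonyms/hypernyms plus sense disambiguation)."""
    def is_related(a, b):
        return a == b or (a, b) in related or (b, a) in related

    chains = []
    for noun in nouns:
        for chain in chains:
            if any(is_related(noun, member) for member in chain):
                chain.append(noun)
                break
        else:
            chains.append([noun])
    return chains

# hypothetical relatedness pairs for illustration
related = {("car", "vehicle"), ("vehicle", "truck")}
nouns = ["car", "vehicle", "road", "truck", "road"]
chains = build_chains(nouns, related)
```

Strong chains would then be picked by length (and member weights), and one sentence extracted per strong chain.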
Supervised Content Selection
- Define features to assess sentence saliency
- Training data: corpus where sentences are annotated as part of the extract summary (1) or not (0)
- Example features:
  - Fixed-phrase feature: cue phrases such as "in conclusion" indicate summary-worthy content
  - Position feature: sentences in the first/last paragraph, and initial/final sentences, are more likely to be important
  - Thematic word feature: repetition as an indicator of importance
  - Important words: sentences containing several words with high TF-IDF weight are more salient
  - Uppercase word feature: uppercase words often indicate named entities
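A sketch of extracting these features per sentence; a binary classifier (e.g. logistic regression) would then be trained on the 1/0 extract labels. The cue-phrase list and feature names are assumptions for illustration:

```python
def sentence_features(sentences, cue_phrases=("in conclusion", "in summary")):
    """Feature vectors for supervised extract-summary classification,
    following the feature types listed above."""
    feats = []
    for i, sent in enumerate(sentences):
        low = sent.lower()
        feats.append({
            "fixed_phrase": any(c in low for c in cue_phrases),  # cue phrase
            "position_first": i == 0,                            # initial sentence
            "position_last": i == len(sentences) - 1,            # final sentence
            # uppercase words past sentence start hint at named entities
            "uppercase_words": sum(w[0].isupper() for w in sent.split()[1:]),
            "length": len(sent.split()),
        })
    return feats

sents = ["Berlin is the capital.",
         "Some filler text.",
         "In conclusion, Berlin matters."]
feats = sentence_features(sents)
```

Thematic-word and TF-IDF salience features would be added analogously once word statistics over the input are available.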