
08_Summarizing

Card deck details

Cards: 13
Language: English
Category: Computer science
Level: University
Created / Updated: 07.02.2018 / 28.05.2020
Licensing: not specified
Weblink:
https://card2brain.ch/box/20180207_8summarizing

Main components and parameters of summarization systems

  • Components
    • Content selection:
      • selection of information from the document(s) to be summarized
    • Ordering of the extracted units
    • Sentence realization:
      • improve the output to obtain fluent text
  • Main Parameter
    • Compression rate:
      • length of the summary or proportion of text to be kept

Main use of summarization

Reduce information overload by extracting relevant information

Types of Summaries

  • Main Type Dimensions:
    • Single document vs. Multi-document summarization
    • Generic summarization vs. Query-focused summarization:
      • Generic summarization does not take into account the particular user or her information need, as opposed to query-focused summarization
    • Abstractive vs. Extractive summarization
      • Abstractive summarization results in a text that differs from the original
      • Extractive summarization consists of the original phrases and sentences
        • Most state-of-the-art systems are extractive, since extracting sentences is easier than generating new text

 

  • Other types:
    • Contrastive multiple-document summaries:
      • highlight the topics common to all the documents as well as the topics unique to each document
    • Update summaries:
      • only new information that has not been covered before

Single-document summarization

Steps:

 

  1. Content selection:
    • choose sentences to extract from the document, either with an unsupervised or supervised method
    • Baseline: use first k sentences
  2. Information ordering:
    • choose an order for the sentences
    • Baseline: keep the order of the original text
  3. Sentence realization:
    • clean up sentences, e.g. sentence simplification, sentence fusion, etc.
    • Baseline: do not perform any combination or clean-up
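Taken together, the three baselines above amount to a "lead-k" summarizer. A minimal sketch in Python (the period-based sentence splitting is a simplification; real systems use a proper sentence tokenizer):

```python
def lead_k_summary(text, k=2):
    """Baseline summarizer: content selection = first k sentences,
    information ordering = original order, realization = no clean-up."""
    # Naive sentence splitting on '. ' -- a simplification for illustration.
    sentences = [s.strip() for s in text.split('. ') if s.strip()]
    return '. '.join(sentences[:k]).rstrip('.') + '.'

doc = ("Summarization reduces information overload. "
       "Extractive systems select original sentences. "
       "Abstractive systems generate new text. "
       "Most deployed systems are extractive.")
print(lead_k_summary(doc, k=2))
```

Despite its simplicity, the lead-k baseline is known to be hard to beat on news text, where the most important content usually comes first.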

Centroid-based content selection

  • Simplest approach:
    • select sentences that have more informative words
      • e.g. measured with the
        • maximum likelihood estimate
        • TF-IDF
    • Sentences are scored based on the score of the informative words they contain
    • All sentences are ranked by their score and the top-ranked sentences are kept for the summary
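The centroid-based ranking above can be sketched as follows, using the maximum likelihood estimate p(w) = count(w)/N as the word informativeness score (TF-IDF weights could be substituted; the tiny stopword list is an assumption for illustration):

```python
from collections import Counter

STOPWORDS = frozenset({"the", "a", "of", "is", "in", "to", "and"})

def centroid_select(sentences, top_n=1):
    """Score each sentence by the average informativeness of its content
    words, where informativeness is the MLE p(w) = count(w)/N over the
    whole input; return the top-ranked sentences."""
    words = [w for s in sentences for w in s.lower().split()
             if w not in STOPWORDS]
    counts = Counter(words)
    total = sum(counts.values())

    def score(sentence):
        content = [w for w in sentence.lower().split() if w not in STOPWORDS]
        if not content:
            return 0.0
        return sum(counts[w] / total for w in content) / len(content)

    # Rank all sentences by score and keep the top-ranked ones.
    return sorted(sentences, key=score, reverse=True)[:top_n]
```

Averaging over the sentence length (rather than summing) keeps long sentences from winning merely by containing more words.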

Further Approaches

The SumBasic system

  • Compute a probability for each word: p(w) = n/N, where n is the number of times the word appears in the input and N is the total number of words in the input
  • Score each sentence by the average probability of the words it contains
  • Pick the best-scoring sentence for the summary
  • Update the probability of every word in the chosen sentence: p_new(w) = p_old(w)^2, which penalizes content the summary already covers
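A compact implementation of the SumBasic loop described above (whitespace tokenization is a simplification):

```python
from collections import Counter

def sumbasic(sentences, summary_len=2):
    """SumBasic: greedily pick the sentence with the highest average word
    probability, then square the probabilities of the chosen words so the
    next pick favours not-yet-covered content."""
    words = [w for s in sentences for w in s.lower().split()]
    n_total = len(words)
    p = {w: c / n_total for w, c in Counter(words).items()}  # p(w) = n/N

    summary, pool = [], list(sentences)
    while pool and len(summary) < summary_len:
        def avg_prob(s):
            ws = s.lower().split()
            return sum(p[w] for w in ws) / len(ws)
        best = max(pool, key=avg_prob)
        summary.append(best)
        pool.remove(best)
        for w in set(best.lower().split()):
            p[w] = p[w] ** 2  # update: p_new(w) = p_old(w)^2
    return summary
```

The squaring step is what distinguishes SumBasic from plain frequency ranking: after a sentence about one topic is chosen, sentences repeating that topic drop in score.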

Log-Likelihood ratio (LLR)

  • quite common approach to identifying informative words
  • the word distribution in the document(s) to summarize is compared to a large background corpus; words that occur significantly more often in the input than in the background are treated as informative
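The LLR statistic itself can be computed from four counts: the word's count and the total word count in the input, plus the same pair for the background corpus. A sketch under the usual binomial-likelihood formulation (the constant binomial coefficients cancel in the ratio):

```python
import math

def _log_binom(k, n, p):
    """Log of the binomial likelihood p^k * (1-p)^(n-k), dropping the
    binomial coefficient, which cancels in the likelihood ratio."""
    if p <= 0.0 or p >= 1.0:
        return 0.0 if (k == 0 or k == n) else float("-inf")
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def llr(k1, n1, k2, n2):
    """Log-likelihood ratio test for a word:
    k1/n1 = count / total words in the document(s) to summarize,
    k2/n2 = count / total words in the background corpus."""
    p = (k1 + k2) / (n1 + n2)   # H1: one shared occurrence probability
    p1, p2 = k1 / n1, k2 / n2   # H2: separate probabilities
    return 2.0 * (_log_binom(k1, n1, p1) + _log_binom(k2, n2, p2)
                  - _log_binom(k1, n1, p) - _log_binom(k2, n2, p))
```

In the topic-signature literature, words with an LLR above roughly 10 are commonly treated as significant; a word with identical relative frequency in input and background scores 0.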

LexRank

  • graph-based content selection
  • represents a cluster of documents as a network of related sentences
  • sentences that are similar to a lot of others are more central
  • sentence similarity is measured as cosine similarity over TF-IDF vectors
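A simplified degree-centrality variant of this idea: build a TF-IDF vector per sentence, connect sentences whose cosine similarity exceeds a threshold, and score each sentence by its degree in the resulting graph (full LexRank instead runs a PageRank-style iteration on this graph):

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Treat each sentence as a document: tf = count in the sentence,
    idf = log(N / df) over the sentence collection."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    df = Counter(w for toks in tokenized for w in set(toks))
    return [{w: c * math.log(n / df[w]) for w, c in Counter(toks).items()}
            for toks in tokenized]

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def degree_centrality(sentences, threshold=0.1):
    """Score each sentence by how many other sentences it resembles:
    count similarity-graph edges above the threshold."""
    vecs = tfidf_vectors(sentences)
    return [sum(1 for j, vj in enumerate(vecs)
                if i != j and cosine(vi, vj) > threshold)
            for i, vi in enumerate(vecs)]
```

Sentences similar to many others get high degree and are considered central; off-topic sentences end up isolated in the graph.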

Lexical chains

  • group sets of words, esp. nouns, which are semantically related (same word - same sense, synonyms, hypernyms/hyponyms, co-hyponyms, collocations)
  • Lexical chains can be used to identify important concepts from a document
  • Each noun instance usually belongs to exactly one lexical chain
    • → it is necessary to perform word sense disambiguation

 

  • For Summarization:
    • All terms representing the same concept occur in the same chain
      • → avoids repetition
    • The chain combines the weight (frequency) of its members, so that low frequency terms may still help identifying important concepts
    • Build lexical chains:
      • Extract nouns and noun phrases
      • Use a lexical resource or statistics over large corpora to determine word relatedness
    • Identify strong chains based on their length
    • Extract significant sentences
      • one sentence for each chain (or more for larger chains)
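A toy version of the chain-building procedure sketched above; the hardcoded relatedness table stands in for a real lexical resource such as WordNet, and word sense disambiguation is skipped entirely:

```python
# Stand-in relatedness table (synonyms/hypernyms) -- an assumption for
# illustration; a real system would query a lexical resource.
RELATED = [
    frozenset({"car", "vehicle", "automobile"}),
    frozenset({"wheel", "tire"}),
]

def related(a, b):
    return a == b or any(a in s and b in s for s in RELATED)

def build_chains(nouns):
    """Greedily attach each noun to the first chain containing a related
    word; otherwise start a new chain. Each noun instance thus belongs
    to exactly one chain."""
    chains = []
    for noun in nouns:
        for chain in chains:
            if any(related(noun, w) for w in chain):
                chain.append(noun)
                break
        else:
            chains.append([noun])
    return chains

def strong_chains(chains, min_len=2):
    # Chain strength approximated by length (member frequency).
    return [c for c in chains if len(c) >= min_len]
```

Running `build_chains` on the nouns of a document groups co-referring concept mentions; the longest chains then point at the document's central concepts.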

Supervised Content Selection

  • Define features to assess sentence saliency
  • Training data: corpus where sentences are annotated as part of the extract summary (1) or not (0)
  • Example Features:
    • Fixed-phrase feature:
      • “in conclusion” indicates summary
    • Position feature:
      • first/last paragraph and initial/final sentences are more likely to be important
    • Thematic word feature:
      • Repetition as indicator of importance
    • Important words:
      • sentences containing several words with high TF-IDF weight
    • Uppercase word feature:
      • Often indicates named entities
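The features above can be sketched as a feature extractor plus a logistic scoring function. The weights here are illustrative stand-ins; in a real system they would be learned from the annotated corpus of sentences labeled 1 (in the extract) or 0 (not):

```python
import math
from collections import Counter

FIXED_PHRASES = ("in conclusion", "in summary", "to sum up")

def features(sentence, position, n_sentences, word_freq):
    """Map a sentence to the feature vector described above."""
    words = sentence.lower().split()
    return {
        # Fixed-phrase feature: cue phrases such as "in conclusion".
        "fixed_phrase": float(any(p in sentence.lower() for p in FIXED_PHRASES)),
        # Position feature: initial or final sentence of the document.
        "position": float(position in (0, n_sentences - 1)),
        # Thematic word feature: average input frequency of the words.
        "thematic": sum(word_freq.get(w, 0) for w in words) / max(len(words), 1),
        # Uppercase word feature: capitalized non-initial words,
        # which often indicate named entities.
        "uppercase": float(any(w[0].isupper() for w in sentence.split()[1:])),
    }

# Illustrative weights; a trained classifier would supply these.
WEIGHTS = {"fixed_phrase": 2.0, "position": 1.0, "thematic": 0.5, "uppercase": 0.5}

def saliency(sentence, position, n_sentences, word_freq):
    f = features(sentence, position, n_sentences, word_freq)
    score = sum(WEIGHTS[k] * f[k] for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-score))  # logistic output in (0, 1)
```

Sentences are then ranked by their saliency score and the top-ranked ones extracted, exactly as in the unsupervised methods.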