BDMA_final

BDMA@LMU

BDMA@LMU


Fichier Détails

Cartes-fiches 18
Langue Deutsch
Catégorie Informatique
Niveau Université
Crée / Actualisé 11.08.2019 / 10.10.2019
Lien de web
https://card2brain.ch/box/20190811_bdmafinal
Intégrer
<iframe src="https://card2brain.ch/box/20190811_bdmafinal/embed" width="780" height="150" scrolling="no" frameborder="0"></iframe>

What is the main idea of LossyCounting algorithm? (4)

  • frequent pattern mining on stream data
  • error below user-specific threshold (epsilon)
  • Stream divided into buckets w = 1/epsilon
    • runtime inversely proportional to epsilon
  • count items in bucket, when bucket full remove infrequent items

What is special about clustering of data streams? (2)

  • maintain continuously consistent good clustering of observed sequence
  • memory and time efficient

What are requirements of stream clustering algorithms? (3)

  • fast, incremental processing
  • tracking changes
  • fast identification of outliers

Explain LEADER algorithm.(3). What does it depend on? (2)

  • next object: 
    • find closest cluster
    • if d(cluser, object) < delta >> assign to object cluster
    • else create new cluster
  • depends on
    • threshold delta (good choice)
    • order of incoming objects

Explain stream k-means (3)

  • stream into chunks
  • apply k-means on each chunk 
  • optional: k-means on cluster centers to get overall k-means clustering

What microclusters consists of (3)? How is the centroid defined each? How is the radius of each micro-cluster defined?

  • microcluster (CF) consists of (3)
    • N: # points
    • LS: sum of points
    • SS: squared sum of points
  • centroid: LS/N
  • radius: sqrt(SS/N - (LS/N)^2)

What is the main idea of BIRCH? (2)

  • online component
    • average microclusters in tree structure
  • offline component
    • apply global clustering on all leaf entries

What is the main idea of CluStream? (2: 3, 2)

  • online step
    • if new point x is close (below threshold) to microcluster clu: assign x to clu
    • else: create new microcluster
    • if too many microclusters: merge closest two
  • offline macro-clustering (on demand)
    • find k macro-clusters in time horizon h
    • apply k-means on micro-clusters >> k macro clusters

Discuss CluStream (4+ 2)

  • + clustering of large evolving data streams
  • + considers stream = changing process over time
  • + clustering over different time-horizons
  • + high flexibility
  • - # micro-clusters fixed over time
  • - sensitive to noise

What is different for stream classification (vs. batch classification)? What does this imply?

What is requirements for classifications for streams then? (2)

  • classification not static anymore
  • >> batch model regeneration not sufficient
  • requirements
    • coperate with new data
    • deal with non-stationary data generation (drift, shift) >> remove obsolete data

What is a concept drift? How can it be characterised? (3)

  • data distribution changes over time (input or relation of input, outout may change)
  • characterisation
    • prior probability of class p(y)
    • class conditional probabilities p(X|y)
    • posterior p(y|X) might change

What is the difference between real and virual concept drift?

 

  • real: posterior p(y|X) changes (independent of p(X))
  • virtual: p(X) changes (without affecting p(x|Y))

What is the difference between drift and shift?

  • drift = gradual changes
  • shift: abrupt changes

How is the splitting done in batch decision tree?

  • based on information gain = expected reduction of entropy due to splitting (the higher the better)

What is the problem of going from batch to stream decision trees? What is the solution?

  • problem: all data required for finding splitting, not given with streams (2)
    • not storable  
    • infinite stream
  • solution: sufficient to use subset of training examples (Hoeffding tree)

Why hoeffding tree is called like this? (name pro and cons (1,1))

  • how many instances necessary to find splitting? >> hoeffding bound 
    • + independent of distribution
    • - more conservative / pessimitic than distribution-dependent bound

(How to handle infinitiness of streams in case of hoeffding tree? ) What are problems? (2) Name a solution (1)

What is the idea of concept-adapting hoeffding tree (2)

  • problems
    • tree becomes complex
    • historical data dominates decisions
  • solutions
    • introduce maximum tree size: if reached: reset tree (all information lost)
  • CAHT
    • create new subtree when performance decreases
    • if subtree performs better ?? replace original :: discard subtree

What is the idea of ensembling classifiers? (1) How is this done? (3)

  • use combinations of models to increase accuracy
  • approaches
    • bagging: training on boostrapped samples (one model each)
    • boosting: adapt weights of misclassified samples
    • stacking: apply multiple base-learners