BDMA_final
BDMA@LMU
BDMA@LMU
Kartei Details
Karten | 18 |
---|---|
Sprache | Deutsch |
Kategorie | Informatik |
Stufe | Universität |
Erstellt / Aktualisiert | 11.08.2019 / 10.10.2019 |
Weblink |
https://card2brain.ch/box/20190811_bdmafinal
|
Einbinden |
<iframe src="https://card2brain.ch/box/20190811_bdmafinal/embed" width="780" height="150" scrolling="no" frameborder="0"></iframe>
|
Lernkarteien erstellen oder kopieren
Mit einem Upgrade kannst du unlimitiert Lernkarteien erstellen oder kopieren und viele Zusatzfunktionen mehr nutzen.
Melde dich an, um alle Karten zu sehen.
What is the main idea of LossyCounting algorithm? (4)
- frequent pattern mining on stream data
- error below user-specific threshold (epsilon)
- Stream divided into buckets w = 1/epsilon
- runtime inversely proportional to epsilon
- count items in bucket, when bucket full remove infrequent items
What is special about clustering of data streams? (2)
- maintain continuously consistent good clustering of observed sequence
- memory and time efficient
What are requirements of stream clustering algorithms? (3)
- fast, incremental processing
- tracking changes
- fast identification of outliers
Explain LEADER algorithm.(3). What does it depend on? (2)
- next object:
- find closest cluster
- if d(cluser, object) < delta >> assign to object cluster
- else create new cluster
- depends on
- threshold delta (good choice)
- order of incoming objects
Explain stream k-means (3)
- stream into chunks
- apply k-means on each chunk
- optional: k-means on cluster centers to get overall k-means clustering
What microclusters consists of (3)? How is the centroid defined each? How is the radius of each micro-cluster defined?
- microcluster (CF) consists of (3)
- N: # points
- LS: sum of points
- SS: squared sum of points
- centroid: LS/N
- radius: sqrt(SS/N - (LS/N)^2)
What is the main idea of BIRCH? (2)
- online component
- average microclusters in tree structure
- offline component
- apply global clustering on all leaf entries
What is the main idea of CluStream? (2: 3, 2)
- online step
- if new point x is close (below threshold) to microcluster clu: assign x to clu
- else: create new microcluster
- if too many microclusters: merge closest two
- offline macro-clustering (on demand)
- find k macro-clusters in time horizon h
- apply k-means on micro-clusters >> k macro clusters
Discuss CluStream (4+ 2)
- + clustering of large evolving data streams
- + considers stream = changing process over time
- + clustering over different time-horizons
- + high flexibility
- - # micro-clusters fixed over time
- - sensitive to noise
What is different for stream classification (vs. batch classification)? What does this imply?
What is requirements for classifications for streams then? (2)
- classification not static anymore
- >> batch model regeneration not sufficient
- requirements
- coperate with new data
- deal with non-stationary data generation (drift, shift) >> remove obsolete data
What is a concept drift? How can it be characterised? (3)
- data distribution changes over time (input or relation of input, outout may change)
- characterisation
- prior probability of class p(y)
- class conditional probabilities p(X|y)
- posterior p(y|X) might change
What is the difference between real and virual concept drift?
- real: posterior p(y|X) changes (independent of p(X))
- virtual: p(X) changes (without affecting p(x|Y))
What is the difference between drift and shift?
- drift = gradual changes
- shift: abrupt changes
How is the splitting done in batch decision tree?
- based on information gain = expected reduction of entropy due to splitting (the higher the better)
What is the problem of going from batch to stream decision trees? What is the solution?
- problem: all data required for finding splitting, not given with streams (2)
- not storable
- infinite stream
- solution: sufficient to use subset of training examples (Hoeffding tree)
Why hoeffding tree is called like this? (name pro and cons (1,1))
- how many instances necessary to find splitting? >> hoeffding bound
- + independent of distribution
- - more conservative / pessimitic than distribution-dependent bound
(How to handle infinitiness of streams in case of hoeffding tree? ) What are problems? (2) Name a solution (1)
What is the idea of concept-adapting hoeffding tree (2)
- problems
- tree becomes complex
- historical data dominates decisions
- solutions
- introduce maximum tree size: if reached: reset tree (all information lost)
- CAHT
- create new subtree when performance decreases
- if subtree performs better ?? replace original :: discard subtree
What is the idea of ensembling classifiers? (1) How is this done? (3)
- use combinations of models to increase accuracy
- approaches
- bagging: training on boostrapped samples (one model each)
- boosting: adapt weights of misclassified samples
- stacking: apply multiple base-learners
-
- 1 / 18
-