BDMA_final
BDMA@LMU
File details
Flashcards | 18 |
---|---|
Language | German |
Category | Computer science |
Level | University |
Created / Updated | 11.08.2019 / 10.10.2019 |
Web link | https://card2brain.ch/box/20190811_bdmafinal |
What is the main idea of the Lossy Counting algorithm? (5)
- frequent pattern mining on stream data
- error stays below a user-specified threshold (epsilon)
- stream divided into buckets of width w = 1/epsilon
- runtime inversely proportional to epsilon
- count items per bucket; when a bucket is full, remove infrequent items
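The steps above can be sketched in Python. This is a minimal sketch, not the full algorithm with support-threshold queries; the function name and the `(count, max_error)` bookkeeping are illustrative, and the bucket width follows the card's w = 1/epsilon:

```python
import math

# Hedged sketch of Lossy Counting on a finite list standing in for a stream.
# Each entry stores (count, max_error), where max_error is the id of the
# bucket at insertion time minus 1.
def lossy_counting(stream, epsilon):
    w = math.ceil(1 / epsilon)            # bucket width w = 1/epsilon
    counts = {}                           # item -> (count, max_error)
    bucket_id = 1
    for n, item in enumerate(stream, start=1):
        if item in counts:
            f, delta = counts[item]
            counts[item] = (f + 1, delta)
        else:
            counts[item] = (1, bucket_id - 1)
        if n % w == 0:                    # bucket full: drop infrequent items
            counts = {k: (f, d) for k, (f, d) in counts.items()
                      if f + d > bucket_id}
            bucket_id += 1
    return counts
```

Any item whose true frequency exceeds epsilon * N survives the pruning, so reported counts undercount the true counts by at most epsilon * N.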
What is special about clustering of data streams? (2)
- continuously maintain a consistently good clustering of the observed sequence
- memory and time efficient
What are requirements of stream clustering algorithms? (3)
- fast, incremental processing
- tracking changes
- fast identification of outliers
Explain the LEADER algorithm (3). What does it depend on? (2)
- for the next object:
- find closest cluster
- if d(cluster, object) < delta >> assign the object to that cluster
- else create new cluster
- depends on
- a good choice of the threshold delta
- order of incoming objects
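The steps above fit in a few lines of Python; this sketch assumes 1-D points with absolute difference as the distance, and uses each cluster's first object as its representative (real LEADER variants may update the representative):

```python
# Hedged sketch of LEADER for 1-D points.
def leader(stream, delta):
    clusters = []      # leader (first point) of each cluster
    assignment = []    # cluster index per object
    for x in stream:
        best, best_d = None, float("inf")
        for i, c in enumerate(clusters):
            d = abs(x - c)
            if d < best_d:
                best, best_d = i, d
        if best is not None and best_d < delta:
            assignment.append(best)               # close enough: join cluster
        else:
            clusters.append(x)                    # else: x founds a new cluster
            assignment.append(len(clusters) - 1)
    return clusters, assignment
```

The result depends on delta and on the input order, as the card states: feeding the same points in a different order can produce different leaders and clusters.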
Explain stream k-means (3)
- stream into chunks
- apply k-means on each chunk
- optional: k-means on cluster centers to get overall k-means clustering
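The three steps above can be sketched on 1-D data; `kmeans` here is a deliberately minimal Lloyd-style illustration (naive initialization, fixed iteration count), not a production implementation:

```python
# Hedged sketch of stream k-means: chunk the stream, run k-means per chunk,
# then cluster the chunk centers for an overall clustering.
def kmeans(points, k, iters=10):
    # naive init: k roughly evenly spaced points of the sorted input
    centers = sorted(points)[::max(1, len(points) // k)][:k]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            groups[i].append(p)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

def stream_kmeans(stream, k, chunk_size):
    chunk_centers = []
    for start in range(0, len(stream), chunk_size):
        chunk_centers.extend(kmeans(stream[start:start + chunk_size], k))
    return kmeans(chunk_centers, k)    # optional global clustering step
```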
What does a microcluster consist of (3)? How are its centroid and radius defined?
- microcluster (CF) consists of (3)
- N: # points
- LS: sum of points
- SS: squared sum of points
- centroid: LS/N
- radius: sqrt(SS/N - (LS/N)^2)
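The CF statistics above are additive, which is what makes microclusters incrementally maintainable. A minimal 1-D sketch (the class name is illustrative):

```python
import math

# Minimal 1-D clustering feature (CF) as defined above: N, LS, SS.
class MicroCluster:
    def __init__(self):
        self.N, self.LS, self.SS = 0, 0.0, 0.0

    def insert(self, x):
        self.N += 1
        self.LS += x
        self.SS += x * x

    def centroid(self):
        return self.LS / self.N

    def radius(self):
        # sqrt(SS/N - (LS/N)^2), i.e. the standard deviation of the points
        return math.sqrt(self.SS / self.N - (self.LS / self.N) ** 2)
```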
What is the main idea of BIRCH? (2)
- online component
- maintain microclusters in a tree structure (CF-tree)
- offline component
- apply global clustering on all leaf entries
What is the main idea of CluStream? (2: 3, 2)
- online step
- if new point x is close (below threshold) to microcluster clu: assign x to clu
- else: create new microcluster
- if too many microclusters: merge closest two
- offline macro-clustering (on demand)
- find k macro-clusters in time horizon h
- apply k-means on micro-clusters >> k macro clusters
Discuss CluStream (4 +, 2 -)
- + clustering of large evolving data streams
- + considers stream = changing process over time
- + clustering over different time-horizons
- + high flexibility
- - # micro-clusters fixed over time
- - sensitive to noise
What is different for stream classification (vs. batch classification)? What does this imply?
What are the requirements for stream classification then? (2)
- classification not static anymore
- >> batch model regeneration not sufficient
- requirements
- incorporate new data
- deal with non-stationary data generation (drift, shift) >> remove obsolete data
What is a concept drift? How can it be characterised? (3)
- data distribution changes over time (the input distribution, or the relation of input and output, may change)
- characterisation
- prior probability of class p(y)
- class conditional probabilities p(X|y)
- posterior p(y|X) might change
What is the difference between real and virtual concept drift?
- real: posterior p(y|X) changes (independent of p(X))
- virtual: p(X) changes (without affecting p(y|X))
What is the difference between drift and shift?
- drift: gradual changes
- shift: abrupt changes
How is the splitting done in batch decision tree?
- based on information gain = expected reduction of entropy due to splitting (the higher the better)
What is the problem of going from batch to stream decision trees? What is the solution?
- problem: finding the split requires all data, which is not available with streams (2)
- not storable
- infinite stream
- solution: sufficient to use subset of training examples (Hoeffding tree)
Why is the Hoeffding tree called that? (name pro and con (1, 1))
- how many instances are necessary to find the split? >> Hoeffding bound
- + independent of the distribution
- - more conservative / pessimistic than a distribution-dependent bound
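The bound behind the name: after n observations of a random variable with range R, the sample mean deviates from the true mean by at most epsilon = sqrt(R^2 ln(1/delta) / (2n)), with probability 1 - delta. A small sketch (the function name is illustrative):

```python
import math

# Hoeffding bound: maximum deviation epsilon of the sample mean from the
# true mean after n observations of a variable with range R, holding with
# probability 1 - delta.
def hoeffding_bound(R, delta, n):
    return math.sqrt((R * R * math.log(1.0 / delta)) / (2.0 * n))
```

Since epsilon shrinks as n grows, the tree can decide when enough examples have been seen to commit to a split, without any assumption on the data distribution.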
(How to handle the infiniteness of streams in the case of the Hoeffding tree?) What are the problems? (2) Name a solution (1)
What is the idea of the concept-adapting Hoeffding tree? (2)
- problems
- tree becomes complex
- historical data dominates decisions
- solutions
- introduce a maximum tree size: if reached, reset the tree (all information lost)
- CAHT
- create new subtree when performance decreases
- if the subtree performs better, replace the original; otherwise discard the subtree
What is the idea of ensembling classifiers? (1) How is this done? (3)
- use combinations of models to increase accuracy
- approaches
- bagging: training on bootstrapped samples (one model each)
- boosting: adapt weights of misclassified samples
- stacking: apply multiple base-learners and combine their predictions with a meta-learner
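A toy illustration of the bagging idea: here each "model" is simply the majority label of its bootstrap sample, standing in for a trained base learner (this simplification and all names are mine):

```python
import random
from collections import Counter

# Toy bagging: draw bootstrap samples, let each "model" vote with the
# majority label of its sample, then take the majority of the votes.
def bagging_predict(labels, n_models=5, seed=0):
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        sample = [rng.choice(labels) for _ in labels]   # bootstrap sample
        votes.append(Counter(sample).most_common(1)[0][0])
    return Counter(votes).most_common(1)[0][0]
```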