BDMA_final
BDMA@LMU
File details
Flashcards | 18 |
---|---|
Language | German |
Category | Computer science |
Level | University |
Created / Updated | 11.08.2019 / 10.10.2019 |
Web link | https://card2brain.ch/box/20190811_bdmafinal |
What is the main idea of the Lossy Counting algorithm? (5)
- frequent pattern mining on stream data
- error stays below a user-specified threshold (epsilon)
- stream divided into buckets of width w = 1/epsilon
- runtime inversely proportional to epsilon
- count items per bucket; when a bucket is full, remove infrequent items
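The steps above can be sketched in Python. This is a minimal sketch, not the full algorithm with support-threshold queries; the function name and the `(count, max_error)` bookkeeping are illustrative, and the bucket width follows the card's w = 1/epsilon:

```python
import math

# Hedged sketch of Lossy Counting on a finite list standing in for a stream.
# Each entry stores (count, max_error), where max_error is the id of the
# bucket at insertion time minus 1.
def lossy_counting(stream, epsilon):
    w = math.ceil(1 / epsilon)            # bucket width w = 1/epsilon
    counts = {}                           # item -> (count, max_error)
    bucket_id = 1
    for n, item in enumerate(stream, start=1):
        if item in counts:
            f, delta = counts[item]
            counts[item] = (f + 1, delta)
        else:
            counts[item] = (1, bucket_id - 1)
        if n % w == 0:                    # bucket full: drop infrequent items
            counts = {k: (f, d) for k, (f, d) in counts.items()
                      if f + d > bucket_id}
            bucket_id += 1
    return counts
```

Any item whose true frequency exceeds epsilon * N survives the pruning, so reported counts undercount the true counts by at most epsilon * N.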
What is special about clustering of data streams? (2)
- continuously maintain a consistently good clustering of the observed sequence
- memory and time efficient
What are requirements of stream clustering algorithms? (3)
- fast, incremental processing
- tracking changes
- fast identification of outliers
Explain the LEADER algorithm (3). What does it depend on? (2)
- for the next object:
- find closest cluster
- if d(cluster, object) < delta >> assign the object to that cluster
- else create new cluster
- depends on
- a good choice of the threshold delta
- order of incoming objects
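The steps above fit in a few lines of Python; this sketch assumes 1-D points with absolute difference as the distance, and uses each cluster's first object as its representative (real LEADER variants may update the representative):

```python
# Hedged sketch of LEADER for 1-D points.
def leader(stream, delta):
    clusters = []      # leader (first point) of each cluster
    assignment = []    # cluster index per object
    for x in stream:
        best, best_d = None, float("inf")
        for i, c in enumerate(clusters):
            d = abs(x - c)
            if d < best_d:
                best, best_d = i, d
        if best is not None and best_d < delta:
            assignment.append(best)               # close enough: join cluster
        else:
            clusters.append(x)                    # else: x founds a new cluster
            assignment.append(len(clusters) - 1)
    return clusters, assignment
```

The result depends on delta and on the input order, as the card states: feeding the same points in a different order can produce different leaders and clusters.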
Explain stream k-means (3)
- stream into chunks
- apply k-means on each chunk
- optional: k-means on cluster centers to get overall k-means clustering
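The three steps above can be sketched on 1-D data; `kmeans` here is a deliberately minimal Lloyd-style illustration (naive initialization, fixed iteration count), not a production implementation:

```python
# Hedged sketch of stream k-means: chunk the stream, run k-means per chunk,
# then cluster the chunk centers for an overall clustering.
def kmeans(points, k, iters=10):
    # naive init: k roughly evenly spaced points of the sorted input
    centers = sorted(points)[::max(1, len(points) // k)][:k]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            groups[i].append(p)
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers

def stream_kmeans(stream, k, chunk_size):
    chunk_centers = []
    for start in range(0, len(stream), chunk_size):
        chunk_centers.extend(kmeans(stream[start:start + chunk_size], k))
    return kmeans(chunk_centers, k)    # optional global clustering step
```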
What does a microcluster consist of (3)? How are its centroid and radius defined?
- microcluster (CF) consists of (3)
- N: # points
- LS: sum of points
- SS: squared sum of points
- centroid: LS/N
- radius: sqrt(SS/N - (LS/N)^2)
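The CF statistics above are additive, which is what makes microclusters incrementally maintainable. A minimal 1-D sketch (the class name is illustrative):

```python
import math

# Minimal 1-D clustering feature (CF) as defined above: N, LS, SS.
class MicroCluster:
    def __init__(self):
        self.N, self.LS, self.SS = 0, 0.0, 0.0

    def insert(self, x):
        self.N += 1
        self.LS += x
        self.SS += x * x

    def centroid(self):
        return self.LS / self.N

    def radius(self):
        # sqrt(SS/N - (LS/N)^2), i.e. the standard deviation of the points
        return math.sqrt(self.SS / self.N - (self.LS / self.N) ** 2)
```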
What is the main idea of BIRCH? (2)
- online component
- maintain microclusters in a tree structure (CF-tree)
- offline component
- apply global clustering on all leaf entries
What is the main idea of CluStream? (2: 3, 2)
- online step
- if new point x is close (below threshold) to microcluster clu: assign x to clu
- else: create new microcluster
- if too many microclusters: merge closest two
- offline macro-clustering (on demand)
- find k macro-clusters in time horizon h
- apply k-means on micro-clusters >> k macro clusters
Discuss CluStream (4 +, 2 -)
- + clustering of large evolving data streams
- + considers stream = changing process over time
- + clustering over different time-horizons
- + high flexibility
- - # micro-clusters fixed over time
- - sensitive to noise
What is different for stream classification (vs. batch classification)? What does this imply?
What are the requirements for stream classification then? (2)
- classification not static anymore
- >> batch model regeneration not sufficient
- requirements
- incorporate new data
- deal with non-stationary data generation (drift, shift) >> remove obsolete data
What is a concept drift? How can it be characterised? (3)
- data distribution changes over time (the input distribution, or the relation of input and output, may change)
- characterisation
- prior probability of class p(y)
- class conditional probabilities p(X|y)
- posterior p(y|X) might change
What is the difference between real and virtual concept drift?
- real: posterior p(y|X) changes (independent of p(X))
- virtual: p(X) changes (without affecting p(y|X))
What is the difference between drift and shift?
- drift: gradual changes
- shift: abrupt changes
How is the splitting done in batch decision tree?
- based on information gain = expected reduction of entropy due to splitting (the higher the better)
What is the problem of going from batch to stream decision trees? What is the solution?
- problem: finding the split requires all data, which is not available with streams (2)
- not storable
- infinite stream
- solution: sufficient to use subset of training examples (Hoeffding tree)
Why is the Hoeffding tree called that? (name pro and con (1, 1))
- how many instances are necessary to find the split? >> Hoeffding bound
- + independent of the distribution
- - more conservative / pessimistic than a distribution-dependent bound
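The bound behind the name: after n observations of a random variable with range R, the sample mean deviates from the true mean by at most epsilon = sqrt(R^2 ln(1/delta) / (2n)), with probability 1 - delta. A small sketch (the function name is illustrative):

```python
import math

# Hoeffding bound: maximum deviation epsilon of the sample mean from the
# true mean after n observations of a variable with range R, holding with
# probability 1 - delta.
def hoeffding_bound(R, delta, n):
    return math.sqrt((R * R * math.log(1.0 / delta)) / (2.0 * n))
```

Since epsilon shrinks as n grows, the tree can decide when enough examples have been seen to commit to a split, without any assumption on the data distribution.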
(How to handle the infiniteness of streams in the case of the Hoeffding tree?) What are the problems? (2) Name a solution (1)
What is the idea of the concept-adapting Hoeffding tree? (2)
- problems
- tree becomes complex
- historical data dominates decisions
- solutions
- introduce a maximum tree size: if reached, reset the tree (all information lost)
- CAHT
- create new subtree when performance decreases
- if the subtree performs better, replace the original; otherwise discard the subtree
What is the idea of ensembling classifiers? (1) How is this done? (3)
- use combinations of models to increase accuracy
- approaches
- bagging: training on bootstrapped samples (one model each)
- boosting: adapt weights of misclassified samples
- stacking: apply multiple base-learners and combine their predictions with a meta-learner
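A toy illustration of the bagging idea: here each "model" is simply the majority label of its bootstrap sample, standing in for a trained base learner (this simplification and all names are mine):

```python
import random
from collections import Counter

# Toy bagging: draw bootstrap samples, let each "model" vote with the
# majority label of its sample, then take the majority of the votes.
def bagging_predict(labels, n_models=5, seed=0):
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        sample = [rng.choice(labels) for _ in labels]   # bootstrap sample
        votes.append(Counter(sample).most_common(1)[0][0])
    return Counter(votes).most_common(1)[0][0]
```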