BDMA_final

BDMA@LMU

0.0 (0)

Set of flashcards Details

Flashcards	18
Language	Deutsch
Category	Computer Science
Level	University
Created / Updated	11.08.2019 / 10.10.2019
Weblink	https://card2brain.ch/cards/20190811_bdmafinal
Embed	<iframe src="https://card2brain.ch/box/20190811_bdmafinal/embed" width="780" height="150" scrolling="no" frameborder="0"></iframe>

Card list

What is the main idea of LossyCounting algorithm? (4)

frequent pattern mining on stream data
error below user-specific threshold (epsilon)
Stream divided into buckets w = 1/epsilon
- runtime inversely proportional to epsilon
count items in bucket, when bucket full remove infrequent items

What is special about clustering of data streams? (2)

maintain continuously consistent good clustering of observed sequence
memory and time efficient

What are requirements of stream clustering algorithms? (3)

fast, incremental processing
tracking changes
fast identification of outliers

Explain LEADER algorithm.(3). What does it depend on? (2)

next object:
- find closest cluster
- if d(cluser, object) < delta >> assign to object cluster
- else create new cluster
depends on
- threshold delta (good choice)
- order of incoming objects

Explain stream k-means (3)

stream into chunks
apply k-means on each chunk
optional: k-means on cluster centers to get overall k-means clustering

What microclusters consists of (3)? How is the centroid defined each? How is the radius of each micro-cluster defined?

microcluster (CF) consists of (3)
- N: # points
- LS: sum of points
- SS: squared sum of points
centroid: LS/N
radius: sqrt(SS/N - (LS/N)^2)

What is the main idea of BIRCH? (2)

online component
- average microclusters in tree structure
offline component
- apply global clustering on all leaf entries

What is the main idea of CluStream? (2: 3, 2)

online step
- if new point x is close (below threshold) to microcluster clu: assign x to clu
- else: create new microcluster
- if too many microclusters: merge closest two
offline macro-clustering (on demand)
- find k macro-clusters in time horizon h
- apply k-means on micro-clusters >> k macro clusters

Discuss CluStream (4+ 2)

+ clustering of large evolving data streams
+ considers stream = changing process over time
+ clustering over different time-horizons
+ high flexibility
- # micro-clusters fixed over time
- sensitive to noise

What is different for stream classification (vs. batch classification)? What does this imply?

What is requirements for classifications for streams then? (2)

classification not static anymore
>> batch model regeneration not sufficient
requirements
- coperate with new data
- deal with non-stationary data generation (drift, shift) >> remove obsolete data

What is a concept drift? How can it be characterised? (3)

data distribution changes over time (input or relation of input, outout may change)
characterisation
- prior probability of class p(y)
- class conditional probabilities p(X|y)
- posterior p(y|X) might change

What is the difference between real and virual concept drift?

real: posterior p(y|X) changes (independent of p(X))
virtual: p(X) changes (without affecting p(x|Y))

What is the difference between drift and shift?

drift = gradual changes
shift: abrupt changes

How is the splitting done in batch decision tree?

based on information gain = expected reduction of entropy due to splitting (the higher the better)

What is the problem of going from batch to stream decision trees? What is the solution?

problem: all data required for finding splitting, not given with streams (2)
- not storable
- infinite stream
solution: sufficient to use subset of training examples (Hoeffding tree)

Why hoeffding tree is called like this? (name pro and cons (1,1))

how many instances necessary to find splitting? >> hoeffding bound
- + independent of distribution
- - more conservative / pessimitic than distribution-dependent bound

(How to handle infinitiness of streams in case of hoeffding tree? ) What are problems? (2) Name a solution (1)

What is the idea of concept-adapting hoeffding tree (2)

problems
- tree becomes complex
- historical data dominates decisions
solutions
- introduce maximum tree size: if reached: reset tree (all information lost)
CAHT
- create new subtree when performance decreases
- if subtree performs better ?? replace original :: discard subtree

What is the idea of ensembling classifiers? (1) How is this done? (3)

use combinations of models to increase accuracy
approaches
- bagging: training on boostrapped samples (one model each)
- boosting: adapt weights of misclassified samples
- stacking: apply multiple base-learners

Card list

Study

BDMA_final

Create or copy sets of flashcards

Create or copy sets of flashcards

Log in to see all the cards.

SWITCHaai

Office 365

Edulog

Apple ID

Google