KDDM

KDDM@LMU

Set of flashcards Details

Flashcards 203
Language Deutsch
Category Computer Science
Level University
Created / Updated 01.08.2019 / 03.08.2019
Weblink
https://card2brain.ch/box/20190801_kddm

Explain the KDD-process (4)

  1. data cleaning and integration (ca. 60% of the effort)
  2. transformation and selection (features, dimensionality reduction)
  3. data mining (pattern extraction)
  4. visualisation / evaluation (presentation of results)
  5. >> the whole process is iterative

Name and briefly explain basic mining tasks (5)

  • frequent pattern mining
    • find co-occurring items, association rules
  • clustering
    • find groups via a (dis-)similarity function between objects, given no labels
  • classification
    • find a function to distinguish classes and predict the class of new objects, given labels
  • regression
    • find a function describing the output to predict values for new objects
  • outlier detection
    • find objects which do not comply with the general behavior of the data

Name the general setting of frequent pattern mining concerning the data, the goal, and how frequency is defined. Name a naive approach.

  • database D of transactions T (each a set of items)
  • goal: finding patterns, associations, correlations
  • X is frequent if: supp(X) >= minSup
    • \(supp(X) = |\{T \in D : X \subseteq T\}| / |D|\)
    • = how often X occurs in a transaction within the database
  • naive approach: count the frequency of all possible subsets (\(2^m\) itemsets for m items)
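
A minimal sketch of support counting and the naive enumeration of all \(2^m\) itemsets (Python; the toy database D and the helper supp are illustrative, not from the lecture):

```python
from itertools import combinations

def supp(itemset, transactions):
    """Relative support: fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

# toy transaction database
D = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c"}]
print(supp({"a", "c"}, D))  # 0.5

# naive approach: enumerate all 2^m - 1 non-empty itemsets over the m items
items = sorted(set().union(*D))
candidates = [set(c) for k in range(1, len(items) + 1) for c in combinations(items, k)]
frequent = [c for c in candidates if supp(c, D) >= 0.5]  # minSup = 0.5
```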

What is the apriori principle?

  • every non-empty subset of a frequent itemset is frequent
    • \(A' \subset A: supp(A') \ge supp(A) \ge minSup\)
  • every superset of a non-frequent itemset is non-frequent
    • \(A' \supset A: supp(A') \le supp(A) < minSup\)

Explain the apriori-algorithm: general idea and steps.

  • count 1-itemsets, then 2-itemsets, 3-itemsets, ...
  • count (k+1)-itemsets only where all their subsets of length k were frequent

 

  1. produce candidates: join \(L_k\) with itself, discard candidates that contain non-frequent k-itemsets (see the sketch below)
    1. if items are ordered: p and q are joined if they share the same first k-1 items
  2. prove candidates: increment the count of a candidate if it is contained in transaction T
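
A minimal sketch of the candidate generation (join + prune) step, assuming items within each itemset are kept sorted (Python; apriori_gen and the toy \(L_2\) are illustrative, not from the lecture):

```python
from itertools import combinations

def apriori_gen(L_k, k):
    """Generate (k+1)-candidates from the frequent k-itemsets L_k (join + prune)."""
    L_k = {tuple(sorted(s)) for s in L_k}
    candidates = set()
    # join: combine p and q if they agree on the first k-1 items
    for p in L_k:
        for q in L_k:
            if p[:k - 1] == q[:k - 1] and p[k - 1] < q[k - 1]:
                candidates.add(p + (q[k - 1],))
    # prune: apriori principle, every k-subset of a candidate must be frequent
    return {c for c in candidates
            if all(tuple(sorted(s)) in L_k for s in combinations(c, k))}

L2 = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]
print(apriori_gen(L2, 2))  # {('a', 'b', 'c')}; ('b', 'c', 'd') is pruned because ('c', 'd') is not frequent
```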

What are the requirements for the candidates of the apriori algorithm?

  • completeness (superset property): \(C_{k+1} \supseteq L_{k+1}\)
  • selectiveness: \(C_{k+1}\) is significantly smaller than the set of all (k+1)-itemsets

What is meant by pruning in case of apriori-algorithm?

  • applying the apriori principle: remove those candidates which contain a non-frequent k-subset \(s \notin L_k\)

 

Explain how the counting of candidate support can be supported by hash trees. Go into detail on how to search, insert and count.

  • useful if the support of a huge number of candidates has to be counted
  • search: apply the hash function at each node
  • insert: search for the leaf node
    • insert the itemset into the leaf node
    • in case of overflow
      • the leaf becomes an internal node; distribute its entries according to the hash function
  • counting (per transaction T):
    • root: compute the hash for every item of T
    • internal node: determine the hash value, continue the search
    • leaf: check whether the itemset is contained in T (increment its counter if yes)

Name the performance bottlenecks of apriori algorithm (2), what could be a solution?

  • huge candidate sets (when many frequent itemsets exist)
  • multiple DB scans (n = length of the longest pattern)
  • solution: avoid candidate generation >> FP-trees

What is the general idea of FP-tree?

Explain the steps.

  • idea: create conditional databases associated with one frequent item to avoid costly DB scans

 

  1. scan the DB once, order frequent items in descending frequency
  2. scan again; for each transaction:
    1. keep only its frequent items, in descending order
    2. does a common prefix with an existing path exist?
      1. yes: increment the counters along the prefix, append the remaining suffix to the tree
      2. no: create a new branch for the prefix
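
A minimal sketch of the two-scan FP-tree construction (Python; Node, build_fp_tree and the toy database are illustrative names, not from the lecture; min_sup is an absolute count):

```python
from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}              # item -> child Node

def build_fp_tree(transactions, min_sup):
    """Scan 1: count item frequencies. Scan 2: insert frequent items in descending order."""
    freq = Counter(i for t in transactions for i in set(t))
    freq = {i: c for i, c in freq.items() if c >= min_sup}
    root = Node(None, None)
    header = defaultdict(list)          # item -> node links
    for t in transactions:
        # keep only frequent items, sorted by descending global frequency
        items = sorted((i for i in set(t) if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            if item not in node.children:        # no common prefix: new branch
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1                      # shared prefix: increment counter
    return root, header

D = [{"a", "b"}, {"b", "c", "d"}, {"a", "b", "d"}, {"b", "d"}]
root, header = build_fp_tree(D, min_sup=2)
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})  # {'b': 4, 'a': 2, 'd': 3}
```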

What are the benefits of FP-tree? (2)

  • completeness: long pattern unbroken (complete information)
  • compactness:
    • reduce irrelevant information
    • frequency in descending order
    • compression (same prefix)

Explain how to mine in a FP-tree.

  • recursively grow frequent patterns (3)
    • create the conditional pattern base (3)
      • start from the header table
      • visit all nodes of the item via the node links (node-link property)
      • accumulate all prefix paths with their frequencies
    • construct the conditional FP-tree from the pattern base
    • mine the tree, grow frequent patterns (2)
      • base case: if the conditional FP-tree contains only a single path: enumerate all patterns/combinations
      • else: create conditional pattern bases >> recursive call

What is the pattern growth property?

  • "abcdef" is frequent iff "abcde" is frequent AND "f" is frequent in transactions containing "abcde" (conditional frequency)

What are the advantages of FP-growth? (4 + 1)

  • no candidate generation
  • compact data structure
  • no repeated DB-scans
  • operations: counting and tree construction
  • >> faster than apriori

What is a closed / maximal frequent itemset?

  • closed: for all \(Y \supset X: supp(Y) < supp(X)\)
    • complete information regarding frequent itemsets
  • maximal: for all \(Y \supset X: supp(Y) < minSup\)
    • implies closed

What are association rules in general? (2+2)

  • "if... then" implication in transactions with probability
  • items are sorted lexicographically
  • frequency: \(supp(X \Rightarrow Y) = P(X \cup Y) = |X\cup Y \subseteq T| / |D| = supp(X \cup Y)\)
    • = frequency of both occuring
  • confidence: \(conf(X \Rightarrow Y) = P(Y |X) = supp(X \cup Y) / supp(X)\)
    • = frequeny of both occuring devided by frequency of first occuring
    • strength of implication rule
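
A minimal worked example of support and confidence of a rule \(X \Rightarrow Y\) (Python; the toy database and helper are made-up illustrations):

```python
# supp and conf of a rule X => Y over a toy transaction database
D = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c"}]

def supp(X):
    return sum(set(X) <= t for t in D) / len(D)

X, Y = {"a"}, {"c"}
print(supp(X | Y))            # supp(X => Y) = supp(X u Y) = 0.5
print(supp(X | Y) / supp(X))  # conf(X => Y) = 0.666...
```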

What is a strong association rule?

  • all rules where supp >= minSup AND conf >= minConf

What are the main steps in finding association rules?

How many candidates exist for rules?

  1. find frequent itemsets (i.e. apriori or FP-growth)
  2. for each frequent itemset X and each non-empty \(Y \subset X\): generate the rule \(Y \Rightarrow (X \setminus Y)\) and check conf >= minConf

 

  • \(2^{|X|} - 2\) for each X
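
A minimal sketch that enumerates the \(2^{|X|} - 2\) candidate rules of one frequent itemset X and keeps those reaching minConf (Python; gen_rules, the threshold and the toy data are illustrative):

```python
from itertools import combinations

def supp(itemset, transactions):
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def gen_rules(X, transactions, min_conf):
    """All rules Y => X \\ Y with non-empty Y and non-empty X \\ Y that reach min_conf."""
    X = set(X)
    rules = []
    for k in range(1, len(X)):                 # 2^|X| - 2 candidate splits
        for Y in map(set, combinations(X, k)):
            c = supp(X, transactions) / supp(Y, transactions)
            if c >= min_conf:
                rules.append((Y, X - Y, c))
    return rules

D = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c"}]
print(gen_rules({"a", "c"}, D, min_conf=0.6))  # both 1-item antecedents give confidence 0.666...
```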

What is the anti-monotonicity?

  • if \(Y \Rightarrow Z\) is not strong: \(Y \Rightarrow (Z \cup A)\) is not strong
  • if \(Y \Rightarrow Z\) is not strong: for \(Y' \subseteq Y\), \((Y \setminus Y') \Rightarrow (Z \cup Y')\) is not strong

When is a rule interesting? (2)

  • unexpected
  • actionable

What is the criticism to support and confidence? What is the solution?

  • not all strong rules are interesting, results can be misleading
  • solution: correlation ("lift")
    • \(corr_{A, B} = \frac{conf(A \Rightarrow B)}{supp(B)} = \frac{supp(A \cup B)}{supp(A) \cdot supp(B)}\)
    • > 1: positively correlated, = 1: no correlation, < 1: negatively correlated
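
A minimal sketch of the lift/correlation measure (Python; the helper and toy data are illustrative):

```python
def lift(A, B, transactions):
    """corr(A, B) = conf(A => B) / supp(B) = supp(A u B) / (supp(A) * supp(B))."""
    def s(X):
        return sum(set(X) <= set(t) for t in transactions) / len(transactions)
    return s(set(A) | set(B)) / (s(A) * s(B))

D = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c"}]
print(lift({"a"}, {"c"}, D))  # 0.5 / (0.75 * 0.75) = 0.888... -> slightly negatively correlated
```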

Why are hierarchical association rules useful? (1+1)

  • problem:
    • large minSup leads to few rules
    • small minSup leads to (too) many rules
  • solution: exploit item taxonomies

What are generalised association rules?

  • Y is a generalisation (union) of several \(X_i\)
  • \(supp(Y \Rightarrow Z) \ge supp(X_i \Rightarrow Z) \forall i\)
  • \(supp(Y \Rightarrow Z) \neq \sum_{i = 1}^{k}supp(X_i \Rightarrow Z)\)

How to mine hierarchical association rules? (3)

  • top-down / progressive-deepening approach: from strong/high level to low level
  • adapting minSup (2)
  • redundancy filtering

How can minSup be adapted in hierarchical association rules? (2)

  • same across all levels (uniform support)
  • reduce minSup for lower levels (reduced / variable support) (3)
    • level-by-level independent method: regardless of frequency of parent node
    • level-cross filtering by single item: only if parent is frequent at its level
    • level-cross-filtering by k-itemset: only if parent k-itemset is frequent

What is redundancy filtering?

  • redundancy through ancestor relationships (3)
    • a rule is redundant if its support is \(\approx\) the expected value based on an ancestor's rule
    • ancestor rules of \(X \Rightarrow Y\): \(X' \Rightarrow Y'\), \(X' \Rightarrow Y\), \(X \Rightarrow Y'\) (X', Y' being generalisations of X, Y)
    • direct ancestor: \(X' \Rightarrow Y'\) is an ancestor of \(X \Rightarrow Y\) AND there is no \(X'' \Rightarrow Y''\) that is an ancestor of \(X \Rightarrow Y\) while \(X' \Rightarrow Y'\) is an ancestor of \(X'' \Rightarrow Y''\)
  • >> R-interestingness (3): a rule \(X \Rightarrow Y\) is R-interesting
    • if no direct ancestor of \(X \Rightarrow Y\) exists
    • OR supp > R * expected support from the ancestor's rule
    • OR conf > R * expected confidence from the ancestor's rule

What is sequential pattern mining? What is a sequence, a consecutive subsequence, a subsequence, a prefix, a suffix, relative support, a maximal frequent sequence and a closed frequent sequence?

  • ordering of patterns matters (spatial or temporal)
  • even higher complexity than frequent itemset mining
    • \(|\Sigma|^k\) possible k-sequences, often \(k > |\Sigma|\)
  • notions
    • sequence: ordered list of length k
    • consecutive subsequence: S contains R without gaps: \(R \subseteq S\)
    • subsequence: gaps allowed, same order of occurrence required: \(R \subseteq S\)
    • prefix: consecutive subsequence S[1:i]
    • suffix: consecutive subsequence S[i:n]
    • relative support: \(supp(R) = |\{R \subseteq S\}| / |D|\)
    • maximal frequent sequence: no supersequence is frequent
    • closed frequent sequence: all supersequences have smaller support
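
A minimal sketch of the subsequence-with-gaps check and relative support on item-level sequences (Python; illustrative names and toy data):

```python
def is_subsequence(R, S):
    """True if all items of R occur in S in the same order (gaps allowed)."""
    it = iter(S)
    return all(item in it for item in R)

def rel_support(R, database):
    return sum(is_subsequence(R, S) for S in database) / len(database)

D = ["abcd", "acbd", "abd", "cd"]
print(is_subsequence("abd", "abcd"))  # True
print(rel_support("ad", D))           # 0.75
```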

Name mining algorithms for sequential pattern mining. (2+2)

  • breadth-first search based (comparable to the apriori algorithm)
    • GSP
    • SPADE
  • depth-first search based
    • PrefixSpan
    • SPAM

Explain PrefixSpan algorithm.

  • the sequence space is organised as a prefix search tree
  • S' is an extension of prefix S
  • uses projected databases
    • find the occurrences of the prefix s in each sequence S
    • remove infrequent items
    • calculate the support of the items
    • recursive projection: depth-first
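
A minimal sketch of PrefixSpan-style prefix projection on item-level sequences (Python; this simplification treats every sequence position as a single item and uses absolute support counts, names are illustrative):

```python
from collections import Counter

def project(database, prefix_item):
    """Projected DB: the suffix of each sequence after the first occurrence of prefix_item."""
    return [S[S.index(prefix_item) + 1:] for S in database if prefix_item in S]

def prefix_span(database, min_sup, prefix=""):
    """Recursively grow frequent sequential patterns, depth-first."""
    patterns = []
    counts = Counter(item for S in database for item in set(S))
    for item, count in counts.items():
        if count >= min_sup:
            new_prefix = prefix + item
            patterns.append((new_prefix, count))
            patterns += prefix_span(project(database, item), min_sup, new_prefix)
    return patterns

D = ["abcd", "acbd", "abd", "bcd"]
print(prefix_span(D, min_sup=3))  # e.g. ('a', 3), ('ab', 3), ('abd', 3), ('ad', 3), ('b', 4), ...
```

Each recursion level projects the database on one frequent item and mines the smaller projected database depth-first.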

How can interval-based sequential pattern mining be done? (3)

  • code starting points as \(t_s^+\), end points as \(t_s^-\)
  • describe relationships between sequences
    • point-based
      • before
      • during
      • after
    • interval-based (Allen's relations) (7)
      • before
      • overlap
      • contains
      • started-by
      • finished-by
      • meets
      • equals
      • >> comparison: \(\mathcal{O}(k^2)\)
    • TPrefixSpan converts intervals to points, e.g. \(\{A^+\}, \{B^+\}, \{B^-\}, \{A^-\}\)
      • similar to PrefixSpan, but additional validation of the intervals is required

 

What is the general idea of clustering? What is clustering used for? (2)

  • grouping of data objects based on similarity
    • similar objects within cluster
    • dissimilar to objects of other clusters
  • goals
    • insights into data generation, e.g. pattern recognition, image processing
    • preprocessing of other algorithms

What are general approaches for clustering? (4)

  • partitioning algorithms: k clusters, minimize objective function
  • probabilistic model-based: EM
  • density-based: based on connectivity/density
  • hierarchical clustering

What are the general characteristics of partitioning algorithms? (3) What is the general problem and how can it be tackled?

  • k clusters (predefined), \(k \le n\)
  • each instance assigned to exactly one \(C_i\)
  • construct clusters by minimizing an objective function (e.g. the sum of squared distances between centroids and elements)
  • problem: considering all possible partitions into clusters is expensive (NP-hard)
    • solution: use a heuristic

Explain an example of a heuristic used by partitioning algorithms. Name two examples.

  1. choose random representatives
  2. assign each object to the closest representative
  3. compute new representatives (e.g. centroids)
  4. repeat 2 and 3 until changes are below a threshold
  • k-means: representation by the center (centroid)
  • k-medoid: representation by one object of the cluster

Characterise k-means and how the clustering can be evaluated (3) 

  • the mean \(\mu_S = \frac{1}{|S|} \sum_{p\in S}p\) must be defined
  • compactness of cluster \(C_j\) (sum of squared distances)
    • \(SSE(C_j) = \sum_{p\in C_j}||p - \mu_{C_j}||^2_2\)
  • compactness of clustering \(C\)
    • \(SSE(C) = \sum_{C_j \in C}SSE(C_j) = \sum_{p \in D} ||p-\mu_{C_{(p)}}||^2_2\)
  • \(\underset{C}{\mathrm{argmin}} (SSE(C))\) yields the optimal clustering (finding it is NP-hard)
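
A minimal sketch of the k-means heuristic with the SSE objective (Python with NumPy; kmeans, sse and the toy data are illustrative, with no handling of empty clusters):

```python
import numpy as np

def sse(X, labels, centroids):
    """Compactness: sum of squared distances of each point to its cluster centroid."""
    return sum(np.sum((X[labels == j] - c) ** 2) for j, c in enumerate(centroids))

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # 1. random representatives
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)                           # 2. assign to nearest centroid
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])  # 3. recompute means
        if np.allclose(new, centroids):                         # 4. stop when nothing changes
            break
        centroids = new
    return labels, centroids

X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
labels, centroids = kmeans(X, k=2)
print(labels, sse(X, labels, centroids))  # e.g. [0 0 1 1] (labels may be swapped), SSE = 1.0
```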

What is a Voronoi diagram? (3)

  • cluster centroids \(p \in P\) partition the space into Voronoi cells (one cell per cluster)
  • a cell covers all points for which \(p\) is the nearest neighbour (from \(P\))
  • cells are separated by perpendicular hyperplanes

Name advantages and disadvantages of K-means. (2 + 4)

  • + efficient: \(\mathcal{O}(tkn)\)
  • + easy to implement
  • - means must be defined
  • - k must be chosen in advance
  • - sensitive to outliers / noise
  • - solution depends on initialisation (>> local optimum)

Describe alternatives to k-means in case the mean is not defined.

  • median (50%-quantile): artificial representative object "in the middle"
  • mode: most frequently occurring value
  • medoid: representative object "in the middle" (if centroids don't make sense)
  • >> the sum of the total distances is minimized (robust against outliers)

Discuss the k-algorithms regarding data-type, complexity and sensitivity to outliers. (4*3)

And name overall pros and cons.(2+2)

  • k-means
    • data: numerical
    • complexity: \(\mathcal{O}(tkn)\)
    • outlier sensitivity: high
  • k-median
    • data: ordinal
    • complexity: \(\mathcal{O}(tkn)\)
    • outlier sensitivity: low
  • k-mode
    • data: categorical
    • complexity: \(\mathcal{O}(tkn)\)
    • outlier sensitivity: low
  • k-medoid
    • data: metric
    • complexity: \(\mathcal{O}(tk(n-k)^2)\)
    • outlier sensitivity: low
  • + easy to implement
  • + variations and optimizations exist
  • - k has to be defined in advance
  • - results depend on initial partition (local optimum)

Briefly describe the Partitioning Around Medoids algorithm (PAM).

  • choose k medoids, compute TD_current
  • delta_TD = -inf
  • while delta_TD < 0
    • over all pairs (medoid M, non-medoid N): find the minimal delta_TD = TD_M,N - TD_current
    • if delta_TD < 0
      • replace medoid M with non-medoid N
      • TD_current = TD_M,N
      • store medoids and assignments as best so far
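
A minimal runnable sketch of this swap loop (Python with NumPy; pam, total_distance and the toy data are illustrative, distances are Euclidean, k < n assumed):

```python
import numpy as np

def total_distance(X, medoids):
    """TD: sum of the distances from every point to its nearest medoid."""
    dists = np.linalg.norm(X[:, None] - X[medoids][None], axis=2)
    return dists.min(axis=1).sum()

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))   # initial medoids
    td_current = total_distance(X, medoids)
    delta_td = -np.inf
    while delta_td < 0:
        best = None
        # consider every swap of a medoid M with a non-medoid N
        for i in range(k):
            for N in range(len(X)):
                if N in medoids:
                    continue
                candidate = medoids[:i] + [N] + medoids[i + 1:]
                td_new = total_distance(X, candidate)
                if best is None or td_new - td_current < best[0]:
                    best = (td_new - td_current, candidate, td_new)
        delta_td, candidate, td_new = best
        if delta_td < 0:                      # apply the best swap only if TD improves
            medoids, td_current = candidate, td_new
    return medoids, td_current

X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.], [9., 9.]])
print(pam(X, k=2))
```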