KDDM
KDDM@LMU
Deck Details

| Cards | 203 |
|---|---|
| Language | German |
| Category | Computer Science |
| Level | University |
| Created / Updated | 01.08.2019 / 03.08.2019 |
| Web link | https://card2brain.ch/box/20190801_kddm |
Explain the KDD-process (4)
- data cleaning and integration (about 60 % of the effort)
- transformation and selection (features, reduction of dimension)
- data mining (pattern)
- visualisation / evaluation (presentation of results)
- >> iterative
Name and briefly explain basic mining tasks (5)
- frequent pattern mining
- find co-occurring items, association rules
- clustering
- find (dis-)similarity function between objects, given no labels
- classification
- find function to distinguish classes to predict new objects, given labels
- regression
- find function to describe output to predict new objects
- outlier detection
- find objects which do not comply with the general behavior of the data
Name the general setting of frequent pattern mining concerning the data, the goal, and how frequency is defined. Name a naive approach.
- database with transactions (set of items)
- goal: finding patterns, associations, correlations
- frequent if: supp(X) >= minSup
- \(supp(X) = |\{T \in D : X \subseteq T\}| / |D|\)
- fraction of transactions in the database D that contain X
- naive approach: count frequency of all possible subsets (\(2^m\) itemsets)
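A minimal sketch of this relative-support computation (the database is just a list of item collections; names and data are illustrative):

```python
def support(itemset, db):
    """Relative support: fraction of transactions in db that contain the itemset."""
    X = set(itemset)
    return sum(1 for T in db if X <= set(T)) / len(db)

# tiny hypothetical database
db = [{"a", "b", "c"}, {"a", "c"}, {"b", "d"}, {"a", "b", "c", "d"}]
print(support({"a", "c"}, db))  # 0.75
```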
What is the apriori principle?
- non-empty subset of frequent itemset is frequent
- \(A' \subset A: supp(A') \ge supp(A) \ge minSup\)
- superset of non-frequent itemset is non-frequent
- \(A' \supset A: supp(A') \le supp(A) < minSup\)
Explain the apriori-algorithm: general idea and steps.
- count 1-itemsets, then 2-itemsets, 3-itemsets, ...
- count (k+1)-itemsets only if all of their subsets of length k were frequent
- produce candidates: join \(L_k\) with itself, discard candidates that contain non-frequent subsets
- if items are ordered: p and q are joined if they agree on the first k-1 items
- test candidates: increment the count of a candidate if it is contained in transaction T
What are the requirements for the candidates of the apriori algorithm?
- completeness (superset property): \(C_{k+1} \supseteq L_{k+1}\)
- selectiveness: significantly smaller than set of all k+1 subsets
What is meant by pruning in case of apriori-algorithm?
- applying the apriori principle: remove those candidates which contain a non-frequent k-subset \(s \notin L_k\)
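A hedged Python sketch of the level-wise join / prune / count loop described in the last two cards (absolute minSup, plain frozensets; a sketch, not the lecture's reference implementation):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {frequent itemset (frozenset): absolute support}."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(L)
    k = 1
    while L:
        # join step: combine frequent k-itemsets that share k-1 items
        candidates = set()
        for a, b in combinations(L, 2):
            u = a | b
            if len(u) == k + 1:
                # prune step (apriori principle): every k-subset must be frequent
                if all(frozenset(s) in L for s in combinations(u, k)):
                    candidates.add(u)
        # test step: one DB scan, count candidates contained in each transaction
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        L = {s: c for s, c in counts.items() if c >= min_sup}
        frequent.update(L)
        k += 1
    return frequent
```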
Explain how the counting of candidate support can be supported by hash trees. Go in detail how to search, insert and count.
- useful when the support of a huge number of candidates has to be counted
- search: apply hash-function
- insert: search for the leaf node
- insert itemset into the leaf node
- in case of overflow
- leaf becomes an internal node, distribute entries according to the hash function
- counting:
- root: compute hash for all items
- internal node: determine hash value, continue search
- leaf: itemset contained in T or not (+1 or not)
Name the performance bottlenecks of apriori algorithm (2), what could be a solution?
- huge candidate sets (many frequent itemsets exist)
- multiple DB-scans (n = length of longest pattern)
- solution: avoid candidate generation >> FP-trees
What is the general idea of FP-tree?
Explain the steps.
- idea: creating conditional databases associated with one frequent item to avoid costly DB-scans
- scan DB once, order frequent items by descending frequency
- scan again
- keep only the frequent items of each transaction, in descending order
- does a common prefix exist?
- yes: increment counter, append suffix in tree
- no: create new branch with prefix
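A compact sketch of the two-scan construction just described (minimal node structure assumed; absolute minSup; ties in the frequency order broken lexicographically):

```python
from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fptree(transactions, min_sup):
    # 1st scan: global item frequencies
    freq = defaultdict(int)
    for t in transactions:
        for item in set(t):
            freq[item] += 1
    freq = {i: c for i, c in freq.items() if c >= min_sup}

    root = FPNode(None, None)
    header = defaultdict(list)  # item -> node links (all tree nodes carrying that item)

    # 2nd scan: insert each transaction, keeping only frequent items in descending frequency
    for t in transactions:
        items = sorted((i for i in set(t) if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            if item in node.children:      # common prefix exists: just increment the counter
                node = node.children[item]
            else:                          # otherwise start a new branch
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
                node = child
            node.count += 1
    return root, header, freq
```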
What are the benefits of FP-tree? (2)
- completeness: long patterns remain unbroken (complete information)
- compactness:
- reduce irrelevant information
- frequency in descending order
- compression (same prefix)
Explain how to mine in a FP-tree.
- recursively grow frequent pattern path (3)
- create conditional pattern base (3)
- start from header
- visit all nodes via node-links (node-link property: all nodes carrying the same item are linked via node-links starting from the header table)
- accumulate all prefix paths with frequency
- construct the conditional FP-tree
- mine tree, grow frequent pattern (2)
- base case: if conditional FP-tree only contains single path: enumerate all pattern/combinations
- else: create conditional pattern base >> recursive call
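A hedged sketch of the recursive mining step; it reuses `build_fptree` from the construction sketch above and materialises each conditional pattern base as a list of weighted prefix paths (the single-path shortcut is omitted for brevity):

```python
def fpgrowth(transactions, min_sup, suffix=()):
    """Return {frequent pattern (tuple): support}, grown from `suffix`."""
    root, header, freq = build_fptree(transactions, min_sup)
    patterns = {}
    for item in sorted(freq, key=lambda i: freq[i]):   # start from the least frequent item
        new_suffix = (item,) + suffix
        patterns[new_suffix] = freq[item]
        # conditional pattern base: all prefix paths of `item`, weighted by node counts
        cond_db = []
        for node in header[item]:
            path, p = [], node.parent
            while p.item is not None:
                path.append(p.item)
                p = p.parent
            cond_db.extend([path] * node.count)
        if cond_db:
            # recursive call on the conditional database (pattern growth)
            patterns.update(fpgrowth(cond_db, min_sup, new_suffix))
    return patterns
```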
What is the pattern growth property?
- "abcdef" is frequent iff "abcde" is frequent AND "f" is frequent in transactions containing "abcde" (conditional frequency)
What are the advantages of FP-growth? (4 + 1)
- no candidate generation
- compact data structure
- no repeated DB-scans
- operations: counting and tree construction
- >> faster than apriori
What is a closed / maximal frequent itemset?
- closed: for all \(Y \supset X: supp(Y) < supp(X)\)
- complete information regarding frequent itemsets
- maximal: for all \(Y \supset X: supp(Y) < minSup\)
- implies closed
What are association rules in general? (2+2)
- "if... then" implication in transactions with probability
- items are sorted lexicographically
- frequency: \(supp(X \Rightarrow Y) = P(X \cup Y) = |\{T \in D : X \cup Y \subseteq T\}| / |D| = supp(X \cup Y)\)
- = frequency of both occurring together
- confidence: \(conf(X \Rightarrow Y) = P(Y \mid X) = supp(X \cup Y) / supp(X)\)
- = frequency of both occurring divided by the frequency of the antecedent occurring
- strength of implication rule
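A toy worked example (numbers are hypothetical): in a database of 10 transactions in which \(X \cup Y\) occurs in 3 transactions and \(X\) in 5, \(supp(X \Rightarrow Y) = 3/10 = 0.3\) and \(conf(X \Rightarrow Y) = 0.3 / 0.5 = 0.6\).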
What is a strong association rule?
- all rules where supp >= minSup AND conf >= minConf
What are the main steps in finding association rules?
How many candidates exist for rules?
- find frequent itemsets (e.g. with Apriori or FP-growth)
- for each \(X: Y \subset X \) generate rules \(Y \Rightarrow (X \setminus Y)\), check conf >= minConf?
- \(2^{|X|} - 2\) for each X
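A minimal sketch of this rule-generation step (relative supports assumed; `freq_itemsets` is expected to contain every frequent itemset found by Apriori or FP-growth, so all subset supports are available):

```python
from itertools import combinations

def generate_rules(freq_itemsets, min_conf):
    """freq_itemsets: {frozenset: relative support}. Return strong rules Y => X \\ Y."""
    rules = []
    for X, supp_X in freq_itemsets.items():
        if len(X) < 2:
            continue
        # enumerate all 2^|X| - 2 non-empty proper subsets Y of X
        for r in range(1, len(X)):
            for Y in map(frozenset, combinations(X, r)):
                conf = supp_X / freq_itemsets[Y]   # conf = supp(X) / supp(Y)
                if conf >= min_conf:
                    rules.append((set(Y), set(X - Y), supp_X, conf))
    return rules
```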
What is the anti-monotonicity?
- if \(Y \Rightarrow Z\) not strong: \(Y \Rightarrow Z \cup A\) not strong
- if \(Y \Rightarrow Z\) not strong: for \(Y' \subseteq Y\): \((Y \setminus Y') \Rightarrow (Z \cup Y')\) not strong
When is a rule interesting? (2)
- unexpected
- actionable
What is the criticism to support and confidence? What is the solution?
- not all strong rules are interesting, results can be misleading
- solution: correlation ("Lift")
- \(corr_{A, B} = \frac{conf(A \Rightarrow B)} {supp(B)}\)
- > 1: positively correlated, = 1: no correlation, < 1: negatively correlated
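A toy example (hypothetical numbers): with \(supp(A \cup B) = 0.4\), \(supp(A) = 0.6\) and \(supp(B) = 0.75\), the rule \(A \Rightarrow B\) has \(conf \approx 0.67\), but \(corr_{A,B} \approx 0.67 / 0.75 \approx 0.89 < 1\): A and B are negatively correlated even though the rule may pass a typical minConf threshold.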
Why are hierarchical association rules useful? (1+1)
- problem:
- large minSup leads to few rules
- small minSup leads to many rules
- solution: exploit item taxonomies
What are generalised association rules?
- Y is generalisation (union) of several X_i
- \(supp(Y \Rightarrow Z) \ge supp(X_i \Rightarrow Z) \forall i\)
- \(supp(Y \Rightarrow Z) \neq \sum_{i = 1}^{k}supp(X_i \Rightarrow Z)\)
How to mine hierarchical association rules? (3)
- top-down / progressive-deepening approach: from strong/high level to low level
- adapting minSup (2)
- redundancy filtering
How the minSup can be adapted in hierarchical association rules? (2)
- same across levels (uniform support)
- reduce minSup for lower levels (reduced / variable support) (3)
- level-by-level independent method: regardless of frequency of parent node
- level-cross filtering by single item: only if parent is frequent at its level
- level-cross-filtering by k-itemset: only if parent k-itemset is frequent
What is redundancy filtering?
- redundancy through ancestor-relationship (3)
- redundant if support \(\approx\) expected value based on the ancestor's rule
- ancestor rules of \(X \Rightarrow Y\): \(X' \Rightarrow Y'\); \(X' \Rightarrow Y\); \(X \Rightarrow Y'\)
- direct ancestor: \(X' \Rightarrow Y'\) is an ancestor of \(X \Rightarrow Y\) AND there is no \(X'' \Rightarrow Y''\) that is an ancestor of \(X \Rightarrow Y\) while \(X' \Rightarrow Y'\) is an ancestor of \(X'' \Rightarrow Y''\)
- >> R-Interestingness (3)
- if no direct ancestor of \(X \Rightarrow Y\) exists
- OR supp > R * expected support from ancestor's rule
- OR conf > R * expected confidence from ancestor's rule
What is sequential pattern mining? What is a sequence, consecutive sequence, subsequence, prefix, suffix, relative support, maximal frequent sequence and closed frequent sequence?
- ordering of patterns matters (spatial or temporal)
- even higher complexity than frequent item mining
- \(|\Sigma|^k\) k-sequences, often \(k > |\Sigma|\)
- notions
- sequence: ordered list of length k
- consecutive subsequence: R occurs in S as a contiguous block: \(R \subseteq S\)
- subsequence: gaps allowed, same order of occurrence required; \(R \subseteq S\)
- prefix: consecutive subsequence S[1:i]
- suffix: consecutive subsequence S[i:n]
- relative support: \(supp(R) = |\{S \in D : R \subseteq S\}| / |D|\)
- maximal frequent sequence: no supersequence is frequent
- closed frequent sequence: all supersequences have smaller support
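A tiny sketch contrasting the two containment notions above (sequences as Python lists; data is illustrative):

```python
def is_consecutive_subsequence(r, s):
    """R occurs in S as one contiguous block."""
    n = len(r)
    return any(s[i:i + n] == r for i in range(len(s) - n + 1))

def is_subsequence(r, s):
    """Gaps allowed, but the order of occurrence must be preserved."""
    it = iter(s)
    return all(x in it for x in r)

print(is_consecutive_subsequence(list("bc"), list("abcd")))  # True
print(is_subsequence(list("bd"), list("abcd")))              # True (gap allowed)
print(is_consecutive_subsequence(list("bd"), list("abcd")))  # False
```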
Name mining algorithms for sequential pattern mining. (2+2)
- breadth-first search based (comparable to the apriori algorithm)
- GSP
- SPADE
- depth-first search based
- PrefixSpan
- SPAM
Explain PrefixSpan algorithm.
- sequence space is organised as prefix search tree
- S' is extension of prefix S
- uses projected DB
- find occurrences of s in S
- remove infrequent items
- calculate support for items
- recursive projection: depth-first
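A hedged sketch of the projected-database recursion for the simplified case where every sequence element is a single item (the full algorithm also handles itemset elements and needs extra bookkeeping):

```python
def prefixspan(db, min_sup, prefix=()):
    """db: list of sequences (lists of items). Return {pattern (tuple): support}."""
    patterns = {}
    # count items occurring in the (projected) database
    counts = {}
    for s in db:
        for item in set(s):
            counts[item] = counts.get(item, 0) + 1
    for item, sup in counts.items():
        if sup < min_sup:
            continue                      # remove infrequent items
        new_prefix = prefix + (item,)
        patterns[new_prefix] = sup
        # project: keep the suffix after the first occurrence of `item`
        projected = []
        for s in db:
            if item in s:
                suffix = s[s.index(item) + 1:]
                if suffix:
                    projected.append(suffix)
        # recursive, depth-first projection
        patterns.update(prefixspan(projected, min_sup, new_prefix))
    return patterns
```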
How can interval-based sequential pattern mining be done? (3)
- code starting points as \(t_s^+\), end points as \(t_s^-\)
- describe relationships between sequences
- point-based
- before
- during
- after
- interval-based (Allen's relations) (7)
- before
- overlap
- contains
- started-by
- finished-by
- meets
- equals
- >> comparison: \(\mathcal{O}(k^2)\)
- TPrefixSpan converts intervals to points, e.g. \(\{A^+\}, \{B^+\}, \{B^-\}, \{A^-\}\)
- similar to PrefixSpan but additional validation required (intervals)
What is the general idea of clustering? What is clustering used for? (2)
- grouping of data objects based on similarity
- similar objects within cluster
- dissimilar to objects of other clusters
- goals
- insights into data generation i.e. pattern recognition, image processing
- preprocessing of other algorithms
What are general approaches for clustering? (4)
- partitioning algorithms: k clusters, minimize objective function
- probabilistic model-based: EM
- density-based: based on connectivity/density
- hierarchical clustering
What are the general characteristics of partitioning algorithms? (3) What is the general problem and how can it be tackled?
- k clusters (predefined), \(k \le n\)
- each instance assigned to exactly one \(C_i\)
- construct clusters by minimizing an objective function (e.g. sum of squared distances between centroids and elements)
- problem: considering all possible partitions into clusters is expensive (NP-hard)
- solution: using heuristic
Explain an example heuristic for partitioning algorithms. Name two examples.
- random representatives
- assign each object to representative
- compute new centroids
- repeat the assignment and update steps until changes are below a threshold
- k-means: representation by center (centroid)
- k-medoid: representation by one object of cluster
Characterise k-means and how the clustering can be evaluated (3)
- \(\mu_s = \frac{1}{|S|} \sum_{p\in S}p\) must be defined
- compactness of cluster \(C_j\) (sum of squared distances)
- \(SSE(C_j) = \sum_{p\in C_j}||p - \mu_{C_j}||^2_2\)
- compactness of clustering \(C\)
- \(SSE(C) = \sum_{C_j \in C}SSE(C_j) = \sum_{p \in D} ||p-\mu_{C_{(p)}}||^2_2\)
- \(\underset{C}{\mathrm{argmin}} (SSE(C))\) leads to optimal clustering (NP-hard)
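A plain Lloyd-style sketch of the procedure characterised above (random initial centroids, alternating assignment and update until the centroids stop moving; `numpy` assumed):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means. X: (n, d) array. Returns labels, centroids and SSE(C)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: centroid = mean of assigned points (keep old centroid if a cluster is empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    labels = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
    sse = float(((X - centroids[labels]) ** 2).sum())  # compactness SSE(C)
    return labels, centroids, sse
```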
What is a Voronoi diagram? (3)
- cluster centroids \(p \in P\) partition space into Voronoi-cells (one cell for each cluster)
- each cell contains all points for which \(p\) is the nearest neighbour (from \(P\))
- cells are separated by perpendicular hyperplanes
Name advantages and disadvantages of K-means. (2 + 4)
- + efficient: \(\mathcal{O}(tkn)\)
- + easy to implement
- - means must be defined
- - k must be chosen in advance
- - sensitive to outliers / noise
- - solution depends on initialisation (>> local optimum)
Describe alternatives for k-means in case the mean is not defined.
- median (50%-quantile): artificial representative object "in the middle"
- mode: most often occurring value
- medoid: representative object "in the middle" (if centroids don't make sense)
- >> Sum of the total distances is minimized (robust against outliers)
Discuss the k-algorithms regarding data-type, complexity and sensitivity to outliers. (4*3)
And name overall pros and cons.(2+2)
| algorithm | data type | complexity | outlier sensitivity |
|---|---|---|---|
| k-means | numerical | \(\mathcal{O}(tkn)\) | high |
| k-median | ordinal | \(\mathcal{O}(tkn)\) | low |
| k-mode | categorical | \(\mathcal{O}(tkn)\) | low |
| k-medoid | metric | \(\mathcal{O}(tk(n-k)^2)\) | low |
- + easy to implement
- + variations and optimizations exist
- - k has to be defined in advance
- - results depend on initial partition (local optimum)
Briefly describe the Partitioning Around Medoids algorithm (PAM).
- delta_TD = -inf
- while delta_TD < 0:
- determine the swap of a medoid M with a non-medoid N that has minimal delta_TD = TD_(M<->N) - TD_current
- if delta_TD < 0:
- replace medoid M with non-medoid N
- TD_current = TD_(M<->N)
- store medoids and assignments as best so far
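A hedged Python sketch of the swap loop above (greedy best-swap variant over a precomputed distance matrix; `numpy` assumed, names illustrative):

```python
import numpy as np

def pam(X, k, seed=0):
    """Minimal PAM: swap medoids with non-medoids while the total deviation TD decreases."""
    rng = np.random.default_rng(seed)
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = list(rng.choice(n, size=k, replace=False))

    def td(meds):
        # TD: sum of distances of every point to its closest medoid
        return dist[:, meds].min(axis=1).sum()

    current_td = td(medoids)
    improved = True
    while improved:
        improved = False
        best_td, best_swap = current_td, None
        for mi in range(k):                      # every medoid M ...
            for o in range(n):                   # ... against every non-medoid N
                if o in medoids:
                    continue
                cand = medoids[:mi] + [o] + medoids[mi + 1:]
                cand_td = td(cand)
                if cand_td < best_td:            # delta_TD = cand_td - current_td < 0
                    best_td, best_swap = cand_td, (mi, o)
        if best_swap is not None:                # apply the best improving swap
            mi, o = best_swap
            medoids[mi] = o
            current_td = best_td
            improved = True
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels, current_td
```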