KDDM

KDDM@LMU

Set of flashcards Details

Flashcards 203
Language Deutsch
Category Computer Science
Level University
Created / Updated 01.08.2019 / 03.08.2019
Weblink
https://card2brain.ch/box/20190801_kddm

Explain the KDD-process (4)

  1. data cleaning and integration (ca. 60% of the effort)
  2. transformation and selection (features, dimensionality reduction)
  3. data mining (pattern extraction)
  4. visualisation / evaluation (presentation of results)
  5. >> the whole process is iterative

Name and briefly explain basic mining tasks (5)

  • frequent pattern mining
    • find co-occurring items, association rules
  • clustering
    • find groups via a (dis-)similarity function between objects, given no labels
  • classification
    • find a function to distinguish classes and predict the class of new objects, given labels
  • regression
    • find a function describing the output to predict values for new objects
  • outlier detection
    • find objects which do not comply with the general behavior of the data

Name the general setting of frequent pattern mining concerning the data, the goal, and how frequency is defined. Name a naive approach.

  • database D of transactions T (each a set of items)
  • goal: finding patterns, associations, correlations
  • X is frequent if: supp(X) >= minSup
    • \(supp(X) = |\{T \in D : X \subseteq T\}| / |D|\)
    • = how often X occurs in a transaction within the database
  • naive approach: count the frequency of all possible subsets (\(2^m\) itemsets for m items)
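
A minimal sketch of support counting and the naive enumeration of all \(2^m\) itemsets (Python; the toy database D and the helper supp are illustrative, not from the lecture):

```python
from itertools import combinations

def supp(itemset, transactions):
    """Relative support: fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

# toy transaction database
D = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c"}]
print(supp({"a", "c"}, D))  # 0.5

# naive approach: enumerate all 2^m - 1 non-empty itemsets over the m items
items = sorted(set().union(*D))
candidates = [set(c) for k in range(1, len(items) + 1) for c in combinations(items, k)]
frequent = [c for c in candidates if supp(c, D) >= 0.5]  # minSup = 0.5
```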

What is the apriori principle?

  • every non-empty subset of a frequent itemset is frequent
    • \(A' \subset A: supp(A') \ge supp(A) \ge minSup\)
  • every superset of a non-frequent itemset is non-frequent
    • \(A' \supset A: supp(A') \le supp(A) < minSup\)

Explain the apriori-algorithm: general idea and steps.

  • count 1-itemsets, then 2-itemsets, 3-itemsets, ...
  • count (k+1)-itemsets only where all their subsets of length k were frequent

 

  1. produce candidates: join \(L_k\) with itself, discard candidates that contain non-frequent k-itemsets (see the sketch below)
    1. if items are ordered: p and q are joined if they share the same first k-1 items
  2. prove candidates: increment the count of a candidate if it is contained in transaction T
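
A minimal sketch of the candidate generation (join + prune) step, assuming items within each itemset are kept sorted (Python; apriori_gen and the toy \(L_2\) are illustrative, not from the lecture):

```python
from itertools import combinations

def apriori_gen(L_k, k):
    """Generate (k+1)-candidates from the frequent k-itemsets L_k (join + prune)."""
    L_k = {tuple(sorted(s)) for s in L_k}
    candidates = set()
    # join: combine p and q if they agree on the first k-1 items
    for p in L_k:
        for q in L_k:
            if p[:k - 1] == q[:k - 1] and p[k - 1] < q[k - 1]:
                candidates.add(p + (q[k - 1],))
    # prune: apriori principle, every k-subset of a candidate must be frequent
    return {c for c in candidates
            if all(tuple(sorted(s)) in L_k for s in combinations(c, k))}

L2 = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")]
print(apriori_gen(L2, 2))  # {('a', 'b', 'c')}; ('b', 'c', 'd') is pruned because ('c', 'd') is not frequent
```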

What are the requirements for the candidates of the apriori algorithm?

  • completeness (superset property): \(C_{k+1} \supseteq L_{k+1}\)
  • selectiveness: \(C_{k+1}\) is significantly smaller than the set of all (k+1)-itemsets

What is meant by pruning in case of apriori-algorithm?

  • applying the apriori principle: remove those candidates which contain a non-frequent k-subset \(s \notin L_k\)

 

Explain how the counting of candidate support can be supported by hash trees. Go into detail on how to search, insert and count.

  • useful if the support of a huge number of candidates has to be counted
  • search: apply the hash function at each node
  • insert: search for the leaf node
    • insert the itemset into the leaf node
    • in case of overflow
      • the leaf becomes an internal node; distribute its entries according to the hash function
  • counting (per transaction T):
    • root: compute the hash for every item of T
    • internal node: determine the hash value, continue the search
    • leaf: check whether the itemset is contained in T (increment its counter if yes)

Name the performance bottlenecks of apriori algorithm (2), what could be a solution?

  • huge candidate sets (when many frequent itemsets exist)
  • multiple DB scans (n = length of the longest pattern)
  • solution: avoid candidate generation >> FP-trees

What is the general idea of FP-tree?

Explain the steps.

  • idea: create conditional databases associated with one frequent item to avoid costly DB scans

 

  1. scan the DB once, order frequent items in descending frequency
  2. scan again; for each transaction:
    1. keep only its frequent items, in descending order
    2. does a common prefix with an existing path exist?
      1. yes: increment the counters along the prefix, append the remaining suffix to the tree
      2. no: create a new branch for the prefix
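
A minimal sketch of the two-scan FP-tree construction (Python; Node, build_fp_tree and the toy database are illustrative names, not from the lecture; min_sup is an absolute count):

```python
from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}              # item -> child Node

def build_fp_tree(transactions, min_sup):
    """Scan 1: count item frequencies. Scan 2: insert frequent items in descending order."""
    freq = Counter(i for t in transactions for i in set(t))
    freq = {i: c for i, c in freq.items() if c >= min_sup}
    root = Node(None, None)
    header = defaultdict(list)          # item -> node links
    for t in transactions:
        # keep only frequent items, sorted by descending global frequency
        items = sorted((i for i in set(t) if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            if item not in node.children:        # no common prefix: new branch
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1                      # shared prefix: increment counter
    return root, header

D = [{"a", "b"}, {"b", "c", "d"}, {"a", "b", "d"}, {"b", "d"}]
root, header = build_fp_tree(D, min_sup=2)
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})  # {'b': 4, 'a': 2, 'd': 3}
```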

What are the benefits of FP-tree? (2)

  • completeness: long pattern unbroken (complete information)
  • compactness:
    • reduce irrelevant information
    • frequency in descending order
    • compression (same prefix)

Explain how to mine in a FP-tree.

  • recursively grow frequent patterns (3)
    • create the conditional pattern base (3)
      • start from the header table
      • visit all nodes of the item via the node links (node-link property)
      • accumulate all prefix paths with their frequencies
    • construct the conditional FP-tree from the pattern base
    • mine the tree, grow frequent patterns (2)
      • base case: if the conditional FP-tree contains only a single path: enumerate all patterns/combinations
      • else: create conditional pattern bases >> recursive call

What is the pattern growth property?

  • "abcdef" is frequent iff "abcde" is frequent AND "f" is frequent in transactions containing "abcde" (conditional frequency)

What are the advantages of FP-growth? (4 + 1)

  • no candidate generation
  • compact data structure
  • no repeated DB-scans
  • operations: counting and tree construction
  • >> faster than apriori

What is a closed / maximal frequent itemset?

  • closed: for all \(Y \supset X: supp(Y) < supp(X)\)
    • complete information regarding frequent itemsets
  • maximal: for all \(Y \supset X: supp(Y) < minSup\)
    • implies closed

What are association rules in general? (2+2)

  • "if... then" implication in transactions with probability
  • items are sorted lexicographically
  • frequency: \(supp(X \Rightarrow Y) = P(X \cup Y) = |X\cup Y \subseteq T| / |D| = supp(X \cup Y)\)
    • = frequency of both occuring
  • confidence: \(conf(X \Rightarrow Y) = P(Y |X) = supp(X \cup Y) / supp(X)\)
    • = frequeny of both occuring devided by frequency of first occuring
    • strength of implication rule
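
A minimal worked example of support and confidence of a rule \(X \Rightarrow Y\) (Python; the toy database and helper are made-up illustrations):

```python
# supp and conf of a rule X => Y over a toy transaction database
D = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c"}]

def supp(X):
    return sum(set(X) <= t for t in D) / len(D)

X, Y = {"a"}, {"c"}
print(supp(X | Y))            # supp(X => Y) = supp(X u Y) = 0.5
print(supp(X | Y) / supp(X))  # conf(X => Y) = 0.666...
```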

What is a strong association rule?

  • all rules where supp >= minSup AND conf >= minConf

What are the main steps in finding association rules?

How many candidates exist for rules?

  1. find frequent itemsets (i.e. apriori or FP-growth)
  2. for each frequent itemset X and each non-empty \(Y \subset X\): generate the rule \(Y \Rightarrow (X \setminus Y)\) and check conf >= minConf

 

  • \(2^{|X|} - 2\) for each X
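
A minimal sketch that enumerates the \(2^{|X|} - 2\) candidate rules of one frequent itemset X and keeps those reaching minConf (Python; gen_rules, the threshold and the toy data are illustrative):

```python
from itertools import combinations

def supp(itemset, transactions):
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def gen_rules(X, transactions, min_conf):
    """All rules Y => X \\ Y with non-empty Y and non-empty X \\ Y that reach min_conf."""
    X = set(X)
    rules = []
    for k in range(1, len(X)):                 # 2^|X| - 2 candidate splits
        for Y in map(set, combinations(X, k)):
            c = supp(X, transactions) / supp(Y, transactions)
            if c >= min_conf:
                rules.append((Y, X - Y, c))
    return rules

D = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c"}]
print(gen_rules({"a", "c"}, D, min_conf=0.6))  # both 1-item antecedents give confidence 0.666...
```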

What is the anti-monotonicity?

  • if \(Y \Rightarrow Z\) is not strong: \(Y \Rightarrow (Z \cup A)\) is not strong
  • if \(Y \Rightarrow Z\) is not strong: for \(Y' \subseteq Y\), \((Y \setminus Y') \Rightarrow (Z \cup Y')\) is not strong

When is a rule interesting? (2)

  • unexpected
  • actionable

What is the criticism to support and confidence? What is the solution?

  • not all strong rules are interesting, results can be misleading
  • solution: correlation ("lift")
    • \(corr_{A, B} = \frac{conf(A \Rightarrow B)}{supp(B)} = \frac{supp(A \cup B)}{supp(A) \cdot supp(B)}\)
    • > 1: positively correlated, = 1: no correlation, < 1: negatively correlated
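
A minimal sketch of the lift/correlation measure (Python; the helper and toy data are illustrative):

```python
def lift(A, B, transactions):
    """corr(A, B) = conf(A => B) / supp(B) = supp(A u B) / (supp(A) * supp(B))."""
    def s(X):
        return sum(set(X) <= set(t) for t in transactions) / len(transactions)
    return s(set(A) | set(B)) / (s(A) * s(B))

D = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c"}]
print(lift({"a"}, {"c"}, D))  # 0.5 / (0.75 * 0.75) = 0.888... -> slightly negatively correlated
```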

Why are hierarchical association rules useful? (1+1)

  • problem:
    • large minSup leads to few rules
    • small minSup leads to (too) many rules
  • solution: exploit item taxonomies

What are generalised association rules?

  • Y is a generalisation (union) of several \(X_i\)
  • \(supp(Y \Rightarrow Z) \ge supp(X_i \Rightarrow Z) \forall i\)
  • \(supp(Y \Rightarrow Z) \neq \sum_{i = 1}^{k}supp(X_i \Rightarrow Z)\)

How to mine hierarchical association rules? (3)

  • top-down / progressive-deepening approach: from strong/high level to low level
  • adapting minSup (2)
  • redundancy filtering

How can minSup be adapted in hierarchical association rules? (2)

  • same across all levels (uniform support)
  • reduce minSup for lower levels (reduced / variable support) (3)
    • level-by-level independent method: regardless of frequency of parent node
    • level-cross filtering by single item: only if parent is frequent at its level
    • level-cross-filtering by k-itemset: only if parent k-itemset is frequent

What is redundancy filtering?

  • redundancy through ancestor relationships (3)
    • a rule is redundant if its support is \(\approx\) the expected value based on an ancestor's rule
    • ancestor rules of \(X \Rightarrow Y\): \(X' \Rightarrow Y'\), \(X' \Rightarrow Y\), \(X \Rightarrow Y'\) (X', Y' being generalisations of X, Y)
    • direct ancestor: \(X' \Rightarrow Y'\) is an ancestor of \(X \Rightarrow Y\) AND there is no \(X'' \Rightarrow Y''\) that is an ancestor of \(X \Rightarrow Y\) while \(X' \Rightarrow Y'\) is an ancestor of \(X'' \Rightarrow Y''\)
  • >> R-interestingness (3): a rule \(X \Rightarrow Y\) is R-interesting
    • if no direct ancestor of \(X \Rightarrow Y\) exists
    • OR supp > R * expected support from the ancestor's rule
    • OR conf > R * expected confidence from the ancestor's rule

What is sequential pattern mining? What is a sequence, a consecutive subsequence, a subsequence, a prefix, a suffix, relative support, a maximal frequent sequence and a closed frequent sequence?

  • ordering of patterns matters (spatial or temporal)
  • even higher complexity than frequent itemset mining
    • \(|\Sigma|^k\) possible k-sequences, often \(k > |\Sigma|\)
  • notions
    • sequence: ordered list of length k
    • consecutive subsequence: S contains R without gaps: \(R \subseteq S\)
    • subsequence: gaps allowed, same order of occurrence required: \(R \subseteq S\)
    • prefix: consecutive subsequence S[1:i]
    • suffix: consecutive subsequence S[i:n]
    • relative support: \(supp(R) = |\{R \subseteq S\}| / |D|\)
    • maximal frequent sequence: no supersequence is frequent
    • closed frequent sequence: all supersequences have smaller support
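
A minimal sketch of the subsequence-with-gaps check and relative support on item-level sequences (Python; illustrative names and toy data):

```python
def is_subsequence(R, S):
    """True if all items of R occur in S in the same order (gaps allowed)."""
    it = iter(S)
    return all(item in it for item in R)

def rel_support(R, database):
    return sum(is_subsequence(R, S) for S in database) / len(database)

D = ["abcd", "acbd", "abd", "cd"]
print(is_subsequence("abd", "abcd"))  # True
print(rel_support("ad", D))           # 0.75
```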

Name mining algorithms for sequential pattern mining. (2+2)

  • breadth-first search based (comparable to the apriori algorithm)
    • GSP
    • SPADE
  • depth-first search based
    • PrefixSpan
    • SPAM

Explain PrefixSpan algorithm.

  • the sequence space is organised as a prefix search tree
  • S' is an extension of prefix S
  • uses projected databases
    • find the occurrences of the prefix s in each sequence S
    • remove infrequent items
    • calculate the support of the items
    • recursive projection: depth-first
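
A minimal sketch of PrefixSpan-style prefix projection on item-level sequences (Python; this simplification treats every sequence position as a single item and uses absolute support counts, names are illustrative):

```python
from collections import Counter

def project(database, prefix_item):
    """Projected DB: the suffix of each sequence after the first occurrence of prefix_item."""
    return [S[S.index(prefix_item) + 1:] for S in database if prefix_item in S]

def prefix_span(database, min_sup, prefix=""):
    """Recursively grow frequent sequential patterns, depth-first."""
    patterns = []
    counts = Counter(item for S in database for item in set(S))
    for item, count in counts.items():
        if count >= min_sup:
            new_prefix = prefix + item
            patterns.append((new_prefix, count))
            patterns += prefix_span(project(database, item), min_sup, new_prefix)
    return patterns

D = ["abcd", "acbd", "abd", "bcd"]
print(prefix_span(D, min_sup=3))  # e.g. ('a', 3), ('ab', 3), ('abd', 3), ('ad', 3), ('b', 4), ...
```

Each recursion level projects the database on one frequent item and mines the smaller projected database depth-first.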

How can interval-based sequential pattern mining be done? (3)

  • code starting points as \(t_s^+\), end points as \(t_s^-\)
  • describe relationships between sequences
    • point-based
      • before
      • during
      • after
    • interval-based (Allen's relations) (7)
      • before
      • overlap
      • contains
      • started-by
      • finished-by
      • meets
      • equals
      • >> comparison: \(\mathcal{O}(k^2)\)
    • TPrefixSpan converts intervals to points, e.g. \(\{A^+\}, \{B^+\}, \{B^-\}, \{A^-\}\)
      • similar to PrefixSpan, but additional validation of the intervals is required

 

What is the general idea of clustering? What is clustering used for? (2)

  • grouping of data objects based on similarity
    • similar objects within cluster
    • dissimilar to objects of other clusters
  • goals
    • insights into data generation, e.g. pattern recognition, image processing
    • preprocessing of other algorithms

What are general approaches for clustering? (4)

  • partitioning algorithms: k clusters, minimize objective function
  • probabilistic model-based: EM
  • density-based: based on connectivity/density
  • hierarchical clustering

What are the general characteristics of partitioning algorithms? (3) What is the general problem and how can it be tackled?

  • k clusters (predefined), \(k \le n\)
  • each instance assigned to exactly one \(C_i\)
  • construct clusters by minimizing an objective function (e.g. the sum of squared distances between centroids and elements)
  • problem: considering all possible partitions into clusters is expensive (NP-hard)
    • solution: use a heuristic

Explain an example of a heuristic used by partitioning algorithms. Name two examples.

  1. choose random representatives
  2. assign each object to the closest representative
  3. compute new representatives (e.g. centroids)
  4. repeat 2 and 3 until changes are below a threshold
  • k-means: representation by the center (centroid)
  • k-medoid: representation by one object of the cluster

Characterise k-means and how the clustering can be evaluated (3) 

  • the mean \(\mu_S = \frac{1}{|S|} \sum_{p\in S}p\) must be defined
  • compactness of cluster \(C_j\) (sum of squared distances)
    • \(SSE(C_j) = \sum_{p\in C_j}||p - \mu_{C_j}||^2_2\)
  • compactness of clustering \(C\)
    • \(SSE(C) = \sum_{C_j \in C}SSE(C_j) = \sum_{p \in D} ||p-\mu_{C_{(p)}}||^2_2\)
  • \(\underset{C}{\mathrm{argmin}} (SSE(C))\) yields the optimal clustering (finding it is NP-hard)
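
A minimal sketch of the k-means heuristic with the SSE objective (Python with NumPy; kmeans, sse and the toy data are illustrative, with no handling of empty clusters):

```python
import numpy as np

def sse(X, labels, centroids):
    """Compactness: sum of squared distances of each point to its cluster centroid."""
    return sum(np.sum((X[labels == j] - c) ** 2) for j, c in enumerate(centroids))

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # 1. random representatives
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)                           # 2. assign to nearest centroid
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])  # 3. recompute means
        if np.allclose(new, centroids):                         # 4. stop when nothing changes
            break
        centroids = new
    return labels, centroids

X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
labels, centroids = kmeans(X, k=2)
print(labels, sse(X, labels, centroids))  # e.g. [0 0 1 1] (labels may be swapped), SSE = 1.0
```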

What is a Voronoi diagram? (3)

  • cluster centroids \(p \in P\) partition the space into Voronoi cells (one cell per cluster)
  • a cell covers all points for which \(p\) is the nearest neighbour (from \(P\))
  • cells are separated by perpendicular hyperplanes

Name advantages and disadvantages of K-means. (2 + 4)

  • + efficient: \(\mathcal{O}(tkn)\)
  • + easy to implement
  • - means must be defined
  • - k must be chosen in advance
  • - sensitive to outliers / noise
  • - solution depends on initialisation (>> local optimum)

Describe alternatives to k-means in case the mean is not defined.

  • median (50%-quantile): artificial representative object "in the middle"
  • mode: most frequently occurring value
  • medoid: representative object "in the middle" (if centroids don't make sense)
  • >> the sum of the total distances is minimized (robust against outliers)

Discuss the k-algorithms regarding data-type, complexity and sensitivity to outliers. (4*3)

And name overall pros and cons.(2+2)

  • k-means
    • data: numerical
    • complexity: \(\mathcal{O}(tkn)\)
    • outlier sensitivity: high
  • k-median
    • data: ordinal
    • complexity: \(\mathcal{O}(tkn)\)
    • outlier sensitivity: low
  • k-mode
    • data: categorical
    • complexity: \(\mathcal{O}(tkn)\)
    • outlier sensitivity: low
  • k-medoid
    • data: metric
    • complexity: \(\mathcal{O}(tk(n-k)^2)\)
    • outlier sensitivity: low
  • + easy to implement
  • + variations and optimizations exist
  • - k has to be defined in advance
  • - results depend on initial partition (local optimum)

Briefly describe the Partitioning Around Medoids algorithm (PAM).

  • choose k medoids, compute TD_current
  • delta_TD = -inf
  • while delta_TD < 0
    • over all pairs (medoid M, non-medoid N): find the minimal delta_TD = TD_M,N - TD_current
    • if delta_TD < 0
      • replace medoid M with non-medoid N
      • TD_current = TD_M,N
      • store medoids and assignments as best so far
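
A minimal runnable sketch of this swap loop (Python with NumPy; pam, total_distance and the toy data are illustrative, distances are Euclidean, k < n assumed):

```python
import numpy as np

def total_distance(X, medoids):
    """TD: sum of the distances from every point to its nearest medoid."""
    dists = np.linalg.norm(X[:, None] - X[medoids][None], axis=2)
    return dists.min(axis=1).sum()

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))   # initial medoids
    td_current = total_distance(X, medoids)
    delta_td = -np.inf
    while delta_td < 0:
        best = None
        # consider every swap of a medoid M with a non-medoid N
        for i in range(k):
            for N in range(len(X)):
                if N in medoids:
                    continue
                candidate = medoids[:i] + [N] + medoids[i + 1:]
                td_new = total_distance(X, candidate)
                if best is None or td_new - td_current < best[0]:
                    best = (td_new - td_current, candidate, td_new)
        delta_td, candidate, td_new = best
        if delta_td < 0:                      # apply the best swap only if TD improves
            medoids, td_current = candidate, td_new
    return medoids, td_current

X = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.], [9., 9.]])
print(pam(X, k=2))
```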