KDDM

KDDM@LMU

Deck details

Cards 203
Language German
Category Computer Science
Level University
Created / Updated 01.08.2019 / 03.08.2019
Weblink
https://card2brain.ch/box/20190801_kddm

How can the initialisation of partitions be done? (2 + 1)

  • naive: sampling
  • k-means++
    • first points randomly
    • next point sampled with probability proportional to the squared distance to the nearest already-chosen center
    • repeat until k points sampled
  • >> in general: try several initialisations, choose the centers C with the lowest "error" (a k-means++ sketch follows below)
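
A minimal NumPy sketch of k-means++ seeding, assuming the data sit in an (n, d) array X; the function name kmeans_pp_init is illustrative, not from the lecture:

import numpy as np

def kmeans_pp_init(X, k, rng=np.random.default_rng(0)):
    """k-means++: first center uniform, further centers sampled
    proportional to squared distance to the nearest chosen center."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                    # first point randomly
    for _ in range(k - 1):
        C = np.array(centers)
        # squared distance of each point to its nearest chosen center
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).min(axis=1)
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)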

How to choose k in case of partition algorithms? What is a problem and how can it be solved?

  • idea: generate clusters for various k, choose "best"
  • problem: SSE, TD are monotonically decreasing in k
  • solution: Silhouette-Coefficient (not monotonic over k)
    • average distance of an object to the members of its own cluster: \(a(o) = \frac{1}{|C(o)|} \sum_{p \in C(o)} d(o, p)\)
    • average distance of an object to the members of its closest other cluster: \(b(o) = \underset{C_i \neq C(o)}{\mathrm{min}} \{ \frac{1}{|C_i|}\sum_{p \in C_i}d(o, p) \}\)
    • silhouette coefficient for one observation: \(s(o) = \begin{cases}0 & if~a(o) = 0\\\frac{b(o)-a(o)}{max(a(o), b(o))} & else\end{cases}\)
      • \(s(o) \in [-1, 1]\)
      • s(o) = 1 if b(o) >> a(o) ⇒ good assignment of o to its cluster
      • s(o) = 0 if b(o) = a(o) ⇒ o lies between clusters A and B
      • s(o) = -1 if b(o) << a(o) ⇒ bad assignment, o is closer to cluster B
    • silhouette coefficient for one cluster: \(s(C_i) = \frac{1}{|C_i|}\sum_{o \in C_i}s(o)\)
    • silhouette coefficient for clustering: \(s(C) = \frac{1}{|D|}\sum_{o \in D}s(o)\)
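
A small sketch mirroring these formulas (the helper name silhouette and toy inputs are illustrative; note that, unlike the card's a(o), the common convention averages over |C(o)| - 1, i.e. excludes o itself, which is done here):

import numpy as np
from scipy.spatial.distance import cdist

def silhouette(X, labels):
    labels = np.asarray(labels)
    D = cdist(X, X)                                   # pairwise distances
    s = np.zeros(len(X))
    for i, l in enumerate(labels):
        own = labels == l
        if own.sum() <= 1:                            # singleton cluster: s(o) = 0
            continue
        a = D[i, own].sum() / (own.sum() - 1)         # avg distance within own cluster
        b = min(D[i, labels == m].mean()              # avg distance to closest other cluster
                for m in np.unique(labels) if m != l)
        s[i] = (b - a) / max(a, b)
    return s                                          # s.mean() = s(C) of the clustering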

How to interpret the silhouette coefficient of a clustering \(s(C)\)? (4)

  • 0.7 - 1: strong structure
  • 0.5 - 0.7: medium structure
  • 0.25 - 0.5: weak structure
  • 0 - 0.25: no structure

What is the general idea of EM? (2) Explain the basic concept. (4)

  • statistical approach: ML estimates of the parameters of a probabilistic model
  • assumption: observations are drawn from one of several components of a mixture distribution
  • concept
    • clusters = probability distributions
    • for each object, calculate the probability of belonging to each cluster >> assign to the most probable one
    • iteratively improve parameters and assignments
    • any distribution can be used, e.g. Gaussian

How is a Gaussian mixture distribution defined?

  • \(p(x) = \sum_{l = 1}^{k}\pi_l \times N(x | \mu_l, \Sigma_l) = p(x | \theta)\), with \(\pi_l = p(z_l = 1)\) = mixing coefficients
    • \(\theta = (\pi_l, \mu_l, \Sigma_l), l = \{1, ..., k\}\)
  • likelihood: \(p(X|\theta) = \prod_{i = 1}^n p(x_i|\theta)\)
    • \(\theta_{ML} = \underset{\theta}{\mathrm{argmax}}~p(X|\theta)\)

What is the responsibility? In words and mathematically.

  • probability that component j generated x_i
  • \(\gamma_j(x_i) = \frac{\pi_j\times N(x_i|\mu_j, \Sigma_j)}{\sum_{l = 1}^{k}\pi_l\times N(x_i | \mu_l, \Sigma_l)} = p(z_j = 1 |x_i, \theta)=\frac{p(x_i|z_j = 1, \theta)\times p(z_j = 1)}{\sum_{l = 1}^{k}p(x_i|z_l = 1, \theta) \times p(z_l =1)}\)

Name the ML-estimates in case of EM. What problem does this imply and how to solve this?

  • \(\mu_j = \frac{\sum_{i = 1}^{n} \gamma_j(x_i) \times x_i}{\sum_{i = 1}^n \gamma_j(x_i)}\)
  • \(\gamma_j(x_i) = \frac{\pi_j \times N(x_i|\mu_j, \Sigma_j)}{\sum_{l = 1}^k \pi_l \times N(x_i|\mu_l, \Sigma_l)}\)
  • with \(\pi_j = \frac{N_j}{N}\), where \(N_j = \sum_{i = 1}^n \gamma_j(x_i)\)
  • >> non-linear mutual dependencies
    • solution: iterative, alternating optimization procedure (approximation)

Explain the iterative procedure of EM. (4)

  1. init \(\theta = (\mu_j, \Sigma_j, \pi_j)_{j = 1, ..., k}\) and calculate the log-likelihood
  2. E-step: evaluate responsibilities with current parameters
    1. \(\gamma_j^{new}(x_i) = \frac{\pi_j \times N(x_i|\mu_j, \Sigma_j)}{\sum_{l = 1}^k \pi_l \times N(x_i|\mu_l, \Sigma_l)}\)
  3. M-step: re-estimate parameters using the responsibilities, with \(N_j = \sum_{i = 1}^n \gamma_j^{new}(x_i)\)
    1. \(\mu_j^{new} = \frac{1}{N_j}\sum_{i = 1}^{n} \gamma_j^{new}(x_i) \times x_i\)
    2. \(\Sigma_j^{new} = \frac{1}{N_j}\sum_{i = 1}^{n} \gamma_j^{new}(x_i)(x_i - \mu_j^{new})(x_i - \mu_j^{new})^T\)
    3. \(\pi_j^{new} = \frac{N_j}{N}\)
  4. Evaluate log-likelihood: \(|log~p(X|\theta^{new}) - log~p(X|\theta)| \le \epsilon\) ? break : repeat
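
A compact NumPy/SciPy sketch of this loop (function name em_gmm and all parameter values are illustrative; a small ridge on the covariances guards against the singularity discussed below):

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, iters=100, eps=1e-4, rng=np.random.default_rng(0)):
    n, d = X.shape
    mu = X[rng.choice(n, k, replace=False)]            # 1. init
    Sigma = np.stack([np.eye(d)] * k)
    pi = np.full(k, 1 / k)
    ll_old = -np.inf
    for _ in range(iters):
        # 2. E-step: responsibilities gamma_j(x_i), shape (n, k)
        dens = np.stack([pi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                         for j in range(k)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # 3. M-step: re-estimate mu, Sigma, pi from the responsibilities
        N_j = gamma.sum(axis=0)
        mu = (gamma.T @ X) / N_j[:, None]
        for j in range(k):
            diff = X - mu[j]
            Sigma[j] = (gamma[:, j, None] * diff).T @ diff / N_j[j] + 1e-6 * np.eye(d)
        pi = N_j / n
        # 4. convergence check on the log-likelihood
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - ll_old) <= eps:
            break
        ll_old = ll
    return mu, Sigma, pi, gamma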

What is meant with soft clustering? (2)

  • object belongs to cluster with probability (expresses confidence / uncertainty)
  • partition: assign each object to the cluster to which it most probably belongs
    • \(C(x_i) = \underset{l \in \{1, ..., k\}}{\mathrm{argmax}} \{\gamma_l(x_i)\} \)

Discuss EM (1 + 4)

  • + clusters may vary in size (better representation)
  • - converges to local optimum
  • - complexity \(\mathcal{O}(tkn)\), usually with high t
  • - results and t depend on the initial assignment (2)
    • try multiple starts, choose the result with the highest likelihood
    • initialisation e.g. with k-means for faster convergence
  • - singularity of EM: a cluster containing a single point: \(\sigma^2 \rightarrow 0 \Rightarrow f(x) \rightarrow \infty\), hence \(l(\theta)\rightarrow \infty\)

How to choose k in case of EM? (2) Which problems occur? (2) How to solve it? (2)

  • low k = little flexibility, high k = overfitting >> trade-off: goodness of fit vs. simplicity
  • idea: introduce quality measure: \(\theta_{k^*} = \underset{k \in \{k_{min}, k_{max}\}}{\mathrm{argmax}} \{qual(\theta_k)\} \)
  • problems (2)
    • Silhouette coefficient works only for partitioning approaches
    • likelihood is increasing in k
  • solutions
    • deterministic: \(qual(\theta_k) = log(p(x|\theta_k)) + P(k)\) with \(P(k)\) as a penalty term for large k
    • stochastic: MCMC-simulation

What is the general idea of density-based clustering? (2)

  • clusters = dense regions
  • separated by regions of lower density

Explain basic concepts like

  • epsilon-radius
  • MinPts
  • Core Point
  • directly density reachable
  • density reachable
  • density connected
  • maximality
  • connectivity
  • noise
  • border-point

  • epsilon-radius: neighbourhood of q: \(N_\epsilon(q) = \{p\in D | dist(p, q) \le\epsilon\}, q \in N_\epsilon(q)\)
  • MinPts: required min points in \(N_\epsilon(q)\)
  • Core Point: if \(|N_\epsilon(q)| \ge minPts\)
  • directly density reachable if \(p \in N_\epsilon(q)\) and q is a core point
  • density reachable: transitive closure of directly density-reachable
  • density connected: there exists a point o s.t. both p and q are density-reachable from o
  • maximality: if \(q \in C\) and p is density-reachable from q >> \(p \in C\)
  • connectivity: all objects of one cluster are density-connected to all other objects in C
  • noise: instances which are not assigned to any cluster (not density-reachable from any core point): \(D \setminus (C_1~\cup ... \cup~C_k)\)
  • border-point: \(|N_\epsilon(p)| < minPts \land p \in N_\epsilon(q)\)
    • no core-point but directly density reachable from core-point

Briefly explain the steps of DBSCAN

  • = Density-Based Spatial Clustering of Applications with Noise
  • for all (unclassified) points
    • if the point is a core object: assign all density-reachable objects (via epsilon-neighbourhood queries) to the current (= new) cluster
    • else: mark it as noise
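
A short usage sketch with scikit-learn's implementation (toy data; the eps and min_samples values are arbitrary):

import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(500, 2)                      # toy data
db = DBSCAN(eps=0.05, min_samples=5).fit(X)     # eps = epsilon, min_samples = minPts
labels = db.labels_                             # cluster ids, -1 marks noise
print("clusters:", len(set(labels) - {-1}), "| noise points:", (labels == -1).sum())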

How to choose epsilon and minPts in case of DBSCAN? Explain the heuristic. (5)

  • idea: use the least dense cluster to define epsilon and minPts
  • heuristic
    • fix minPts (e.g. 2d - 1)
    • calculate k-distances (with k = minPts) for all p in D
    • plot all k-distances in decreasing order
    • choose a "border point" o at the "elbow" of the plot
    • set epsilon = k-distance(o)
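
A sketch of the k-distance ("elbow") plot used by this heuristic (toy data; minPts = 3 corresponds to 2d - 1 for d = 2):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(500, 2)                        # toy data
minPts = 3                                        # e.g. 2d - 1 with d = 2
nn = NearestNeighbors(n_neighbors=minPts + 1).fit(X)
dist, _ = nn.kneighbors(X)                        # column 0 is the point itself
k_dist = np.sort(dist[:, minPts])[::-1]           # k-distances in decreasing order
plt.plot(k_dist)                                  # set epsilon = k-distance at the elbow
plt.xlabel("points (sorted)"); plt.ylabel(f"{minPts}-distance"); plt.show()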

Discuss DBSCAN (4+2)

  • + arbitrary size and shape
  • + k automatically determined
  • + separates clusters from noise
  • + low complexity: neighbourhood-query: O(n), DBSCAN: O(n^2)
    • supports spatial index structures s.t. neighbourhood query: O(log(n)), overall O(n log(n))
  • - parameters need to be defined (epsilon, minPts)
  • - sensitive to parameters

How can DBSCAN be supported database-wise? (3)

  • epsilon-similarity join: all pairs of epsilon-similar objects from two datasets Q and P
    • \(Q \bowtie_\epsilon P = \{(q, p)\in Q\times P ~|~ dist(q, p)\le\epsilon\}\)
  • epsilon-similarity self-join
    • \(D \bowtie_\epsilon D = \{(q, p) \in D \times D ~ | ~ dist(q,p) \le \epsilon\}\)
  • "directly epsilon, minPts density-reachable"-query
    • SELECT q.id FROM D q, D p WHERE dist(q, p) <= epsilon GROUP BY q.id HAVING count(*) >= minPts

Explain idea (1) and steps of Mean Shift (2+4). What is a basin of attraction?

  • idea: find mode-points in density
  1. for each point m
    1. select window-size epsilon
    2. calculate mean inside window: w(m)
    3. shift window to w(m)
    4. repeat 2. and 3. until convergence
  2. group the instances that converge to the same mode; the set of points converging to one mode is that mode's "basin of attraction"
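
A minimal flat-kernel sketch of this loop (function name mean_shift is illustrative; grouping modes by rounding is a crude stand-in for detecting shared basins of attraction):

import numpy as np

def mean_shift(X, eps, iters=50):
    modes = X.copy()                                  # one window per point
    for _ in range(iters):
        # shift every window to the mean w(m) of the points inside it
        modes = np.stack([X[np.linalg.norm(X - m, axis=1) <= eps].mean(axis=0)
                          for m in modes])
    # points whose windows converged to (almost) the same mode form one cluster
    _, labels = np.unique(np.round(modes, 3), axis=0, return_inverse=True)
    return labels, modes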

Discuss Mean Shift. (5+1)

  • + clusters of arbitrary size and shape
  • + # clusters automatically determined
  • + robust to outliers
  • + easy implementation / parallelisation ("for each point")
  • + one parameter: window-size epsilon
  • - high complexity: neighbourhood-query: O(n), overall O(tn^2)
    • with spatial index support: neighbourhood query O(log(n)), overall O(tn log(n))

What is the main idea of spectral clustering? What is the main problem, what is the solution?

  • data modeled as a similarity graph (V = objects, edges weighted by the similarity of v_i and v_j)
  • "cutting" similarity graph by global minimum cut >> NP-hard problem
    • solution: use approximations

How can an undirected graph be characterised in case of spectral clustering? (2)

  • W: weighted adjacency matrix
  • D: degree matrix: \(D_{ii} = \sum_{j = 1}^n W_{ij}\), \(D_{ij} = 0\) for \(i \neq j\)
    • diagonal entry i = sum of the weights of edges incident to v_i

Explain an approximation of spectral clustering:

What is the aim? (2 + 1)

What is the main idea? (3)

 

  • aim:
    • partitioning G into k subsets
    • minimizing the weight of edges between partitions (the cut)
    • >> comparable to partitioning idea
  • idea: 
    • indicator vector for cluster C: \(f_C\) with \((f_C)_i = \begin{cases} 1 & if~v_i \in C\\ 0 & else \end{cases}\)
    • Laplacian matrix: L = D - W
    • analyse eigenvalues and -vectors of L
      • \(fLf^T\) represents the weight of the inter-cluster edges >> to be minimized (excluding the trivial solution f = (1, ..., 1))
        • small for good partitioning
      • \(fLf^T = fDf^T - fWf^T = ~...~ = \frac{1}{2}\sum_{i,j}w_{ij}(f_i - f_j)^2 \)

What happens if there are no edges between clusters in case of spectral clustering (perfect separation)? (2)

  • L becomes block-diagonal (one block per cluster)
  • \(fLf^T = 0\) for each C

What is the relation between connected components and eigenvalues of \(L\) in case of spectral clustering? (3).

How to find clusters based on this?

  • multiplicity of eigenvalue 0 = # connected components in the graph
  • in real data usually only one eigenvalue is 0 (one large connected component)
  • solution: analyse the eigenvectors belonging to the k smallest eigenvalues

 

  • vertices are represented via these eigenvectors \((v_1, v_2,..., v_k)\) >> clustering based on this representation (e.g. k-means)
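
A minimal sketch of this pipeline on a weighted adjacency matrix W (unnormalized Laplacian; the function name spectral is illustrative):

import numpy as np
from sklearn.cluster import KMeans

def spectral(W, k):
    D = np.diag(W.sum(axis=1))            # degree matrix
    L = D - W                             # Laplacian L = D - W
    vals, vecs = np.linalg.eigh(L)        # eigh sorts eigenvalues ascending
    U = vecs[:, :k]                       # eigenvectors of the k smallest eigenvalues
    # row i of U is the spectral representation of vertex v_i
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)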

When is hierarchical clustering required? (2) What is the result of hierarchical clustering? (5)

  • required if: 
    • hierarchical structure of clusters
    • different densities, sizes
  • result: dendrogram (5)
    • tree structure of clusters
    • root: whole dataset
    • leaf: single object
    • internal node: union of all objects in its subtree
    • height of an internal node = distance between its two children

What are the two main approaches of hierarchical clustering? (2)

  • agglomerative: bottom-up
  • divisive: top-down

Explain the steps of agglomerative clustering. (4) What is the resulting clustering based on?

  1. each object in own cluster
  2. compute pairwise distances between clusters; a distance function for clusters is required:
    1. single-link: \(dist(X, Y) = min_{x \in X, y \in Y}~dist(x, y)\)
    2. complete link: \(dist(X, Y) = max_{x \in X, y \in Y}~dist(x, y)\)
    3. average-link: \(dist(X, Y) = (|X|\times|Y|)^{-1}\times \sum_{x\in X, y\in Y}~dist(x, y)\)
  3. merge closest clusters into new one (\((A, B) \rightarrow A' = A \cup B\))
  4. repeat 2 and 3 until C includes all objects

>> based on local patterns (each merge decision only considers the two closest clusters)
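
A short usage sketch with SciPy's hierarchical-clustering routines (the toy data and the cut into 3 clusters are arbitrary choices):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(50, 2)                          # toy data
Z = linkage(X, method="single")                    # also: "complete", "average"
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the dendrogram into 3 clusters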

Explain basic idea of divisive clustering. (2)

How can the splitting be done? (2 + 2 + 2)

What characterises the clustering-results?

 

  1. all objects in one cluster
  2. split until all clusters are singletons: \(2^{n-1}-1\) splitting possibilities in the first iteration
    1. e.g. based on the largest diameter
    2. most disparate object in C: highest average dissimilarity
      1. splinter group: S = {o}
      2. assign \(o' \notin S\) with \(D(o') > 0\) to S until \(D(o') \le 0, ~\forall o'\notin S\)
        1. with \(D(o') = \sum_{o_j \in C \setminus S}\frac{d(o', o_j)}{|C\setminus S|} - \sum_{o_i\in S} \frac{d(o', o_i)}{|S|}\)
        2. = average distance from o' to the objects outside the splinter group minus its average distance to the splinter group

>> the more complex splitting procedure tends to be more accurate, as it uses more complete (global) information

What is the basic idea of density-based hierarchical clustering? (1) Name the algorithm and what it stands for.

  • using the processing order of objects to track density >> OPTICS = Ordering Points To Identify the Clustering Structure

What is the core-distance and the reach-distance in case of OPTICS?

  • core-distance: smallest distance s.t. o is a core point: core-dist(o)
    • undefined if core-dist(o) > epsilon
  • reach-distance: smallest distance s.t. p is directly density-reachable from o: reach-dist(p, o)
    • \(reach\_dist(p,o) = \begin{cases} dist(p, o) & if ~dist(p,o) \ge core\_dist(o)\\core\_dist(o) & if ~dist(p, o) < core\_dist(o)\\\infty & if ~ dist(p, o) > \epsilon\end{cases}\)
    • i.e. \(reach\_dist(p, o) = max(core\_dist(o),~dist(p, o))\), undefined beyond epsilon

Explain the single steps of OPTICS.

What is the result? What can it be used for? (1) What are the advantages of this? (2)

  • seedList: all objects seen so far, ordered by reachability (ascending)
  • clusterOrder: sequential cluster order (including reachabilities)
    • result object for the reachability plot
  • take the next entry o from the seedList (the closest one)
    • for each p in range(o, epsilon)
      • compute reach-dist(p, o)
      • update the seedList (insert p in order)

 

  • result: reachability plot
    • represents density-based clustering structure
    • easy to analyse
    • independent of dimensionality
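
A brief usage sketch with scikit-learn's OPTICS, whose details differ slightly from the lecture's pseudocode (toy data and parameters):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS

X = np.random.rand(300, 2)                    # toy data
opt = OPTICS(min_samples=5).fit(X)            # min_samples = minPts
reach = opt.reachability_[opt.ordering_]      # reachabilities in cluster order
plt.plot(reach)                               # valleys correspond to clusters
plt.xlabel("cluster order"); plt.ylabel("reachability"); plt.show()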

Discuss hierarchical clustering (5+2)

  • + # clusters need not be known in advance (it is part of the result)
  • + no or robust parameters (in case of OPTICS)
  • + generates complete nesting of clusters
  • + good visualisations included
  • + flat partition through cut of dendrogram / reachability plot possible
  • - complexity:
    • standard methods: O(n^2)
    • OPTICS: O(n^2) or O(n log(n)) with tree-based index support
  • - final clustering needs to be chosen manually

How can clustering results be evaluated? Name pros and cons of each. (3: 3, 2, 2)

  • expert opinion
    • + new insights in data
    • - expensive
    • - not comparable / reproducible
  • external measures
    • + objective
    • - ground-truth required
  • internal measures
    • + no additional information required
    • - amounts to optimizing the evaluation criterion itself

Define the following external measures:

  • TP
  • FP
  • TN
  • FN
  • recall
  • precision
  • F1-measure
  • rand-index
  • adjusted rand-index
  • Jaccard coefficient
  • confusion matrix
  • shannon entropy
  • conditional entropy
  • mutual information
  • normalized mutual information
  • adjusted mutual information

  • TP: \(|S_C \cap S_G|\)
  • FP: \(|S_C \cap \overline{S_G}|\)
  • TN: \(|\overline{S_C} \cap \overline{S_G}|\)
  • FN: \(|\overline{S_C} \cap {S_G}|\)
  • recall: \(\frac{TP}{TP + FN}\)
  • precision: \(\frac{TP}{TP + FP}\)
  • F1-measure: \(\frac{2\cdot rec\cdot prec}{rec + prec}\)
  • rand-index: \(\frac{TP + TN}{TP + FN + FP + TN}\)
  • adjusted rand-index: RI vs. expected RI from random cluster assignment
  • Jaccard coefficient: \(\frac{TP}{TP + FN + FP}\)
  • confusion matrix:
    • \(N_{ij} = | C_i \cap G_j|\) matrix with i = cluster index, j = ground-truth index
    • \(N_i = \sum_{j = 1}^l N_{ij} = |C_i|\) summing row-wise over all elements of cluster i
    • \(N = \sum_{i = 1}^k N_{i} = |D|\) summing over all clusters = total number of objects
  • Shannon entropy: \(H(C) = - \sum_{C_i \in C}p(C_i)~log(p(C_i)) = - \sum_{i = 1}^k \frac{N_i}{N}~log\frac{N_i}{N}\)
  • conditional entropy: \(H(G|C) = - \sum_{C_i \in C}\{p(C_i) \sum_{G_j \in G} p(G_j | C_i )~log(p(G_j | C_i))\} = - \sum_{i = 1}^k\{\frac{N_{i}}{N} \sum_{j = 1}^l \frac{N_{ij}}{N_i}~log\frac{N_{ij}}{N_i}\}\)
  • mutual information: \(I (C, G) = H(C) - H(C|G) = H(G) - H(G|C)\)
  • normalized mutual information: \(NMI(C, G) = \frac{I(C, G)}{\sqrt{H(C)H(G)}}\)
  • adjusted mutual information: MI vs expected MI of random cluster assignment
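
A quick usage sketch of the adjusted / normalized measures in scikit-learn (toy labels; both label vectors may use arbitrary, permuted cluster ids):

from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score,
                             adjusted_mutual_info_score)

truth = [0, 0, 1, 1, 2, 2]      # ground-truth labels G
found = [1, 1, 0, 0, 2, 2]      # cluster labels C (ids permuted)
print(adjusted_rand_score(truth, found))           # 1.0: permutations do not matter
print(normalized_mutual_info_score(truth, found))  # 1.0
print(adjusted_mutual_info_score(truth, found))    # 1.0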

Name three internal measures for cluster evaluation.

  • cohesion:
    • \(coh(C_i) = {|C_i| \choose 2}^{-1}\sum_{o, p \in C_i,~ o \neq p}d(o, p)\) = average distance of objects of same cluster
    • \(coh(C) = \sum_{i = 1}^k \frac{|C_i|}{n} coh(C_i)\) = weighted mean of all C_i 
  • separation:
    • separation between cluster i and j: \(sep(C_i, C_j) = \frac{1}{|C_i||C_j|}\sum_{o \in C_i,~p \in C_j}d(o, p)\) = average distance between all pairs of objects of C_i, and C_j
    • minimum separation: \(sep(C_i) = min~\{sep(C_i, C_j)\}, j \neq i\) = minimum separation to another cluster
    • separation of a clustering: \(sep(C) = \sum_{i = 1}^k \frac{|C_i|}{n} ~ sep(C_i)\) = weighted mean of cluster's minimum separations
  • evaluation of the distance matrix: data ordered by cluster labels >> clusters become recognizable as square blocks along the diagonal

When do cohesion and separation fail?

  • when clusters are stretched: large cohesion values and small separation are possible even for a good clustering

What is the ambiguity of clustering? (4)

  • correct clustering doesn't exist
  • check for reasonability of clustering required
  • no clusters found != no clusters exist
  • clusters found != good representation of reality

What is the idea of the Hopkins statistic? (1) How is it applied? (3) What do values of 0, 0.5, 1 imply?

  • compares the dataset with uniformly distributed objects
    • sample m objects (m << n) from the dataset, calculate the distance to their nearest neighbour: w_i
    • sample m uniformly distributed points, calculate their distance to the nearest data object: u_i
    • \(H = \frac{\sum u_i}{\sum u_i + \sum w_i}\)
      • H ≈ 0: regularly spaced data, as sum(w_i) >> sum(u_i)
      • H ≈ 0.5: uniformly distributed data, as sum(u_i) ≈ sum(w_i)
      • H ≈ 1: clustered data, as sum(w_i) ≈ 0 (objects lie very close together)
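
A small sketch of this procedure (function name hopkins and the sample size m are illustrative; uniform points are drawn from the data's bounding box):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(X, m=50, rng=np.random.default_rng(0)):
    n, d = X.shape
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    # w_i: distance from sampled data points to their nearest other data point
    sample = X[rng.choice(n, m, replace=False)]
    w = nn.kneighbors(sample)[0][:, 1]           # column 0 is the point itself
    # u_i: distance from uniform points to their nearest data point
    uniform = rng.uniform(X.min(0), X.max(0), size=(m, d))
    u = nn.kneighbors(uniform, n_neighbors=1)[0][:, 0]
    return u.sum() / (u.sum() + w.sum())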

What is the main idea of ensemble clustering? (1)

What are the benefits of it? (5)

 

  • consolidate multiple cluster solutions: given \(\mathcal{C} = \{C_1, ..., C_L\}\), find consensus clustering C*
  • benefits
    • knowledge reuse
    • improved quality
    • improved robustness
    • model selection
    • unifying distributed clustering

What are the main approaches for ensemble clustering? (2: 2, 2)

  • based on pairwise similarity: \(\phi(\mathcal{C}, C^*) = \sum_{C \in \mathcal{C}}\phi(C, C^*)\) with \(\phi\) = external evaluation measure
    • pair-counting based (based on RI, ...)
    • information theoretic (based on MI, NMI, ...)
  • probabilistic approach
    • assume the L cluster labels of each object x_i follow a (mixture) distribution
    • optimize the parameters via ML / EM
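
One common pairwise-similarity instance is evidence accumulation: build a co-association matrix from the base clusterings and cluster it hierarchically. A sketch, with the function name consensus and the average-link cut being illustrative choices:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def consensus(labelings, k):
    labelings = np.asarray(labelings)          # (L, n): L base clusterings
    L, n = labelings.shape
    co = np.zeros((n, n))
    for lab in labelings:                      # co[i, j] = fraction of clusterings
        co += lab[:, None] == lab[None, :]     # that put i and j together
    co /= L
    dist = 1.0 - co                            # co-association -> distance
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist), method="average")
    return fcluster(Z, t=k, criterion="maxclust")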