KDDM@LMU
Card Set Details

Cards | 203 |
---|---|
Language | German |
Category | Computer Science |
Level | University |
Created / Updated | 01.08.2019 / 03.08.2019 |
Weblink | https://card2brain.ch/box/20190801_kddm |
How can the initialisation of partitions be done? (2 + 1)
- naive: sampling
- k-means++
- first center chosen uniformly at random
- next center sampled with probability proportional to the squared distance to the already sampled center(s)
- repeat until k centers are sampled
- >> in general: try several initialisations and keep the set of centers C with the lowest error (e.g. SSE); see the sketch below
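A minimal NumPy sketch of the k-means++ initialisation described above; the function name `kmeans_pp_init` and the toy data are illustrative, not from the lecture:

```python
import numpy as np

def kmeans_pp_init(X, k, rng=np.random.default_rng(0)):
    """Pick k initial centers: first uniformly, then proportional to squared distance."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                  # first center sampled uniformly
    for _ in range(k - 1):
        # squared distance of every point to its closest already-chosen center
        d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1), axis=1)
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])   # sample ~ squared distance
    return np.array(centers)

X = np.vstack([np.random.randn(50, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
print(kmeans_pp_init(X, k=3))
```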
How to choose k in case of partition algorithms? What is a problem and how can it be solved?
- idea: generate clusters for various k, choose "best"
- problem: SSE, TD are monotonically decreasing in k
- solution: Silhouette-Coefficient (not monotonic over k)
- average distance of o to the objects of its own cluster: \(a(o) = \frac{1}{|C(o)|} \sum_{p \in C(o)} d(o, p)\)
- average distance of o to the objects of the second-closest cluster: \(b(o) = \underset{C_i \neq C(o)}{\mathrm{min}} \{ \frac{1}{|C_i|}\sum_{p \in C_i}d(o, p) \}\)
- silhouette coefficient for one observation: \(s(o) = \begin{cases}0, ~ if ~ a(o) = 0\\\frac{b(o)-a(o)}{max(a(o), b(o))}, ~ else\end{cases}\)
- \(s(o) \in [-1, 1]\)
- s(o) ≈ 1 if b(o) >> a(o) >> good assignment of o to its cluster
- s(o) = 0 if b(o) = a(o) >> o lies between clusters A and B
- s(o) ≈ -1 if b(o) << a(o) >> bad assignment, o is closer to cluster B
- silhouette coefficient for one cluster: \(s(C_i) = \frac{1}{|C_i|}\sum_{o \in C_i}s(o)\)
- silhouette coefficient for clustering: \(s(C) = \frac{1}{|D|}\sum_{o \in D}s(o)\)
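A short sketch of the silhouette-based choice of k; scikit-learn's `silhouette_score` is used as a stand-in for the formulas above, and the toy data are made up:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.vstack([np.random.randn(60, 2) + c for c in ([0, 0], [6, 0], [3, 6])])

# Cluster for several k and keep the k with the highest mean silhouette s(C).
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # not monotone in k, unlike SSE/TD
best_k = max(scores, key=scores.get)
print(scores, "-> best k:", best_k)
```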
How to interpret the silhouette coefficient of a clustering \(s(C)\)? (4)
- 0.7 - 1: strong structure
- 0.5 - 0.7: medium structure
- 0.25 - 0.5: weak structure
- 0 - 0.25: no structure
What is the general idea of EM? (2) Explain the basic concept. (4)
- statistical approach: ML estimates of the parameters (probabilistic model)
- assumption: observations are drawn from one of several components of a mixture distribution
- concept
- clusters = probability distributions
- for each object, calculate the probability of belonging to each cluster >> assign to the most probable one
- iteratively improve parameters and assignments
- any distribution can be used, e.g. Gaussian
How is a Gaussian mixture distribution defined?
- \(p(x) = \sum_{l = 1}^{k}\pi_l \times N(x | \mu_l, \Sigma_l) = p(x | \theta)\), with \(\pi_l = p(z_l = 1)\) = mixing coefficients
- \(\theta = (\pi_l, \mu_l, \Sigma_l), l = \{1, ..., k\}\)
- likelihood: \(p(X|\theta) = \prod_{i = 1}^n p(x_i|\theta)\)
- \(\theta_{ML} = \underset{\theta}{\mathrm{argmax}}~p(X|\theta)\)
What is the responsibility? In words and mathematically.
- probability that component j generated x_i
- \(\gamma_j(x_i) = \frac{\pi_j\times N(x_i|\mu_j, \Sigma_j)}{\sum_{l = 1}^{k}\pi_l\times N(x_i | \mu_l, \Sigma_l)} = p(z_j = 1 |x_i, \theta)=\frac{p(x_i|z_j = 1, \theta)\times p(z_j = 1)}{\sum_{l = 1}^{k}p(x_i|z_l = 1, \theta) \times p(z_l =1)}\)
Name the ML-estimates in case of EM. What problem does this imply and how to solve this?
- \(\mu_j = \frac{\sum_{i = 1}^{n} \gamma_j(x_i) \times x_i}{\sum_{i = 1}^n \gamma_j(x_i)}\)
- \(\Sigma_j = \frac{\sum_{i = 1}^{n} \gamma_j(x_i) (x_i - \mu_j)(x_i - \mu_j)^T}{\sum_{i = 1}^n \gamma_j(x_i)}\)
- \(\gamma_j(x_i) = \frac{\pi_j \times N(x_i|\mu_j, \Sigma_j)}{\sum_{l = 1}^k \pi_l \times N(x_i|\mu_l, \Sigma_l)}\)
- with \(\pi_j = \frac{N_j}{N}\), where \(N_j = \sum_{i = 1}^n \gamma_j(x_i)\)
- >> non-linear mutual dependencies
- solution: iterative independent optimization procedure (approximation)
Explain the iterative procedure of EM. (4)
- init \(\theta_j = (\mu_j, \Sigma_j, \pi_j)\) and calculate the log-likelihood
- E-step: evaluate responsibilities with current parameters
- \(\gamma_j^{new}(x_i) = \frac{\pi_j \times N(x_i|\mu_j, \Sigma_j)}{\sum_{l = 1}^k \pi_l \times N(x_i|\mu_l, \Sigma_l)}\)
- M-step: re-estimate parameters using the new responsibilities
- \(\mu_j^{new} = \frac{\sum_{i = 1}^{n} \gamma_j^{new}(x_i) \times x_i}{\sum_{i = 1}^n \gamma_j^{new}(x_i)}\)
- \(\Sigma_j^{new} = \frac{\sum_{i = 1}^{n} \gamma_j^{new}(x_i) (x_i - \mu_j^{new})(x_i - \mu_j^{new})^T}{\sum_{i = 1}^n \gamma_j^{new}(x_i)}\)
- \(\pi_j^{new} = \frac{1}{n}\sum_{i = 1}^{n} \gamma_j^{new}(x_i)\)
- evaluate log-likelihood: \(|log(p(X|\theta^{new})) - log(p(X|\theta))| \le \epsilon\)? break : repeat (see the sketch below)
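A compact NumPy sketch of this EM loop for a Gaussian mixture; the function name `em_gmm`, the regularisation term and the convergence threshold are my own illustrative choices:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, eps=1e-4, max_iter=200, rng=np.random.default_rng(0)):
    n, d = X.shape
    mu = X[rng.choice(n, k, replace=False)]               # init means from data points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * k)
    pi = np.full(k, 1.0 / k)
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: responsibilities gamma_j(x_i)
        dens = np.column_stack([pi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                                for j in range(k)])        # shape (n, k)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters with the new responsibilities
        Nj = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nj[:, None]
        for j in range(k):
            diff = X - mu[j]
            Sigma[j] = (gamma[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(d)
        pi = Nj / n
        # convergence check on the log-likelihood
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - prev_ll) <= eps:
            break
        prev_ll = ll
    return mu, Sigma, pi, gamma
```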
What is meant with soft clustering? (2)
- object belongs to cluster with probability (expresses confidence / uncertainty)
- partition: assign each object to the cluster to which it most probably belongs
- \(C(x_i) = \underset{l \in \{1, ..., k\}}{\mathrm{argmax}} \{\gamma_l(x_i)\} \)
Discuss EM. (1 + 4)
- + clusters vary in size (better representation)
- - converges to local optimum
- - \(\mathcal{O}(tkn)\), usually high t
- - results and t depend on initial assignment (2)
- try multiple starts, choose the result with the highest likelihood
- initialisation i.e. with k-means for faster convergence
- - singularity of EM: cluster with one point: \(\sigma^2 \rightarrow 0 \Rightarrow f(x) \rightarrow \infty\), hence \(l(\theta)\rightarrow \infty\)
How to choose k in case of EM? (2) Which problems occur? (2) How to solve it? (2)
- low k = no flexibility, high k = overfitting >> trade-off: goodness of fit vs. simplicity
- idea: introduce quality measure: \(\theta_{k^*} = \underset{k \in \{k_{min}, k_{max}\}}{\mathrm{argmax}} \{qual(\theta_k)\} \)
- problems (2)
- Silhouette coefficient works only for partitioning
- likelihood is increasing in k
- solutions
- deterministic: \(qual(\theta_k) = log(p(x|\theta_k)) + P(k)\) with \(P(k)\) as a penalty term for large k (e.g. a BIC-style penalty)
- stochastic: MCMC-simulation
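A sketch of the penalised-likelihood idea for choosing k, using scikit-learn's `GaussianMixture` and its BIC criterion as one concrete penalty; the toy data are made up:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.vstack([np.random.randn(80, 2) + c for c in ([0, 0], [6, 0], [3, 6])])

# BIC = -2 * log-likelihood + (#parameters) * log(n): lower is better,
# i.e. it trades goodness of fit against model complexity (large k).
bics = {k: GaussianMixture(n_components=k, n_init=3, random_state=0).fit(X).bic(X)
        for k in range(1, 7)}
print(bics, "-> chosen k:", min(bics, key=bics.get))
```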
What is the general idea of density-based clustering? (2)
- clusters = dense regions
- separated by regions of lower density
Explain basic concepts like
- epsilon-radius
- MinPts
- Core Point
- directly density reachable
- density reachable
- density connected
- maximality
- connectivity
- noise
- border-point
- epsilon-radius: neighbourhood of q: \(N_\epsilon(q) = \{p\in D | dist(p, q) \le\epsilon\}, q \in N_\epsilon(q)\)
- MinPts: required min points in \(N_\epsilon(q)\)
- Core Point: if \(|N_\epsilon(q)| \ge minPts\)
- directly density reachable if \(p \in N_\epsilon(q)\) and q is a core point
- density reachable: transitive closure of directly density-reachable
- density connected: there exists a point o s.t. both p and q are density-reachable from o
- maximality: if \(q \in C\) and p is density-reachable from q >> \(p \in C\)
- connectivity: all objects of one cluster are density-connected to all other objects in C
- noise: instances which are not assigned to any cluster (not density-reachable from any core point): \(D \setminus (C_1~\cup ... \cup~C_k)\)
- border-point: \(|N_\epsilon(p)| < minPts \land p \in N_\epsilon(q)\)
- no core-point but directly density reachable from core-point
Briefly explain the steps of DBSCAN
- = Density-Based Spatial Clustering of Applications with Noise
- for all (unclassified) points
- if the point is a core object: assign all objects density-reachable from it (via epsilon-neighbourhood queries) to the current (new) cluster
- else: label it as noise
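A minimal usage sketch of DBSCAN via scikit-learn; the parameter values and the `make_moons` toy data are arbitrary examples:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)   # label -1 marks noise
print("clusters found:", len(set(labels) - {-1}), "| noise points:", (labels == -1).sum())
```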
How to choose epsilon and minPts in case of DBSCAN? Explain the heuristic.(5)
- idea: use the least dense cluster to define epsilon and minPts
- heuristic
- fix minPts (e.g. 2d - 1, with d = dimensionality)
- calculate k-distances (with k = minPts) for all p in D
- plot all k-distances in decreasing order
- choose the "border point" o at the elbow of the plot
- set epsilon = k-distance(o) (see the sketch below)
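A sketch of this k-distance heuristic; the "elbow" is still read off the sorted plot manually, and the toy data are made up:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

min_pts = 2 * X.shape[1] - 1                       # heuristic: 2*d - 1
# k-distance = distance to the minPts-th nearest neighbour (excluding the point itself)
dists, _ = NearestNeighbors(n_neighbors=min_pts + 1).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, -1])[::-1]               # decreasing order

plt.plot(k_dist)
plt.xlabel("points sorted by k-distance"); plt.ylabel(f"{min_pts}-distance")
plt.show()                                         # choose eps at the elbow of this curve
```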
Discuss DBSCAN (4+2)
- + arbitrary size and shape
- + k automatically determined
- + separates clusters from noise
- + low complexity: neighbourhood-query: O(n), DBSCAN: O(n^2)
- supports spatial index structures s.t. neighbourhood-query: O(log(n)), overall O(n*log(n))
- - parameters need to be defined (epsilon, minPts)
- - sensitive to parameters
How can DBSCAN be supported on the database side? (3)
- epsilon-similarity join: all pairs of epsilon-similar objects from two datasets Q and P
- \(Q \bowtie_\epsilon P = \{(q, p)\in Q\times P ~|~ dist(q, p)\le\epsilon\}\)
- epsilon-similarity self-join
- \(D \bowtie_\epsilon D = \{(q, p) \in D \times D ~ | ~ dist(q,p) \le \epsilon\}\)
- "directly epsilon, minPts density-reachable"-query
- SELECT q.id FROM D q, D p WHERE dist(q, p) <= epsilon GROUP BY q.id HAVING count(q.id) >= minPts
Explain idea (1) and steps of Mean Shift (2+4). What is a basin of attraction?
- idea: find mode-points in density
- for each point m
- select window-size epsilon
- calculate mean inside window: w(m)
- shift window to w(m)
- repeat the previous two steps until convergence (the window no longer moves)
- instances that converge to the same mode are grouped into one cluster; the set of points converging to a given mode is its "basin of attraction" (see the sketch below)
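A usage sketch of Mean Shift with scikit-learn; the bandwidth plays the role of the window size epsilon, and the toy data and quantile are arbitrary:

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

X = np.vstack([np.random.randn(80, 2) + c for c in ([0, 0], [6, 0], [3, 6])])

bw = estimate_bandwidth(X, quantile=0.2)          # window size epsilon
ms = MeanShift(bandwidth=bw).fit(X)
print("modes (cluster centers):", ms.cluster_centers_)
print("number of clusters:", len(np.unique(ms.labels_)))
```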
Discuss Mean Shift.(5+1)
- + arbitrary size and shape
- + # clusters automatically determined
- + robust to outliers
- + easy implementation / parallelisation ("for each point")
- + one parameter: window-size epsilon
- - high complexity: neighbourhood-query: O(n), overall O(tn^2)
- supports spatial index neighbourhood-query: O(log(n)), overall O(tn*log(n))
What is the main idea of spectral clustering? What is the main problem, what is the solution?
- data modeled as similarity graph (V = objects, E = similarity of v_i, v_j)
- "cutting" similarity graph by global minimum cut >> NP-hard problem
- solution: use approximations
How can an undirected graph be characterised in case of spectral clustering? (2)
- W: weighted adjacency matrix
- D: degree matrix: \(D_{ii} = \sum_{j = 1}^n W_{ij}\), \(D_{ij} = 0\) for \(i \neq j\)
- diagonal entry \(D_{ii}\) = sum of the weights of the edges incident to \(v_i\)
Explain an approximation of spectral clustering:
What is the aim? (2 + 1)
What is the main idea? (3)
- aim:
- partitioning G into k subsets
- minimizing weights within partitions
- >> comparable to partitioning idea
- idea:
- indicator vector for cluster C: \((f_C)_i = \begin{cases} 1, ~ if~v_i \in C\\ 0, ~ else \end{cases}\)
- Laplacian matrix: L = D - W
- analyse eigenvalues and -vectors of L
- \(fLf^T\) represents weight of inter-cluster edges >> to be minimized (besides solution f = (1, ..., 1))
- small for good partitioning
- \(fLf^T = fDf^T - fWf^T = ~...~ = \frac{1}{2}\sum_{ij}w_{ij}(f_i - f_j)^2\)
What happens if there are no edges between clusters in case of spectral clustering (perfect separation)? (2)
- blocks in L
- \(fLf^T = 0\) for each C
What is the relation between connected components and eigenvalues of \(L\) in case of spectral clustering? (3).
How to find clusters based on this?
- # (eigenvalue of 0) = # connected components in the graph
- eigenvalues are usually != 0 (one large connected component)
- solution: analyse eigenvectors belonging to k smallest eigenvalues
- vertices represented via the eigenvectors \((v_1, v_2,..., v_k)\) >> clustering based on this representation (e.g. with k-means); see the sketch below
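A small sketch of this approximation: build L = D - W, take the eigenvectors of the k smallest eigenvalues, and run k-means on that representation. The similarity graph here is a simple RBF kernel with an arbitrary width, chosen only for illustration:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
k = 2

W = np.exp(-cdist(X, X) ** 2 / (2 * 0.1 ** 2))     # weighted adjacency (RBF similarity)
np.fill_diagonal(W, 0)
D = np.diag(W.sum(axis=1))                         # degree matrix
L = D - W                                          # (unnormalised) Laplacian

eigvals, eigvecs = np.linalg.eigh(L)               # eigh: L is symmetric
U = eigvecs[:, :k]                                 # eigenvectors of the k smallest eigenvalues
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)
print(labels[:20])
```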
When is hierarchical clustering required? (2) What is the result of hierarchical clustering? (5)
- required if:
- hierarchical structure of clusters
- different densities, sizes
- result: dendrogram (5)
- tree structure of clusters
- root: whole dataset
- leaf: single object
- internal node: union of all objects in its sub-tree
- height of an internal node = distance between its two children
What are the two main approaches of hierarchical clustering? (2)
- agglomerative: bottom-up
- divisive: top down
Explain steps of agglomerative clustering. (4) What is the resulting cluster based on?
- each object in own cluster
- compute pairwise distance between clusters, distance function for clusters required
- single-link: \(dist(X, Y) = min_{x \in X, y \in Y}~dist(x, y)\)
- complete link: \(dist(X, Y) = max_{x \in X, y \in Y}~dist(x, y)\)
- average-link: \(dist(X, Y) = (|X|\times|Y|)^{-1}\times \sum_{x\in X, y\in Y}~dist(x, y)\)
- merge closest clusters into new one (\((A, B) \rightarrow A' = A \cup B\))
- repeat the previous two steps until one cluster includes all objects
>> merging is based on local patterns (pairwise distances); see the sketch below
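A short SciPy sketch of agglomerative clustering with the three linkage functions mentioned above; the toy data are made up:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.vstack([np.random.randn(30, 2) + c for c in ([0, 0], [6, 0], [3, 6])])

for method in ("single", "complete", "average"):     # single-, complete-, average-link
    Z = linkage(X, method=method)                    # bottom-up merge sequence
    labels = fcluster(Z, t=3, criterion="maxclust")  # flat partition by cutting the tree
    print(method, np.bincount(labels))
# scipy.cluster.hierarchy.dendrogram(Z) would draw the full tree
```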
Explain basic idea of divisive clustering. (2)
How can the splitting be done? (2 + 2 + 2)
What characterises the clustering-results?
- all objects in one cluster
- split until all clusters are singletons: 2^(n-1)-1 splitting possibilities in the first iteration
- i.e. based on largest diameter
- most disparate object o in C: highest average dissimilarity to the other objects
- splinter group: S = {o}
- assign \(o' \notin S\) with \(D(o') > 0\) to S until \(D(o') \le 0, ~\forall o'\notin S\)
- with \(D(o') = \sum_{o_j \in C \setminus S}\frac{d(o', o_j)}{|C\setminus S|} - \sum_{o_i\in S} \frac{d(o', o_i)}{|S|}\)
- = avg distance from o' to elements not in the splinter group minus avg distance to elements in the splinter group
>> more complex splitting procedure, but based on global patterns (it uses the complete data), hence potentially more accurate
What is the basic idea of density-based hierarchical clustering? (1) Name the algorithm and what it stands for.
- using the processing order of objects to track density >> OPTICS = Ordering points to identify the clustering structure
What is the core-distance and the reach-distance in case of OPTICS?
- core-distance: smallest distance s.t. o is a core point (core-dist(o))
- if core-dist > epsilon => undefined
- reach-distance: smallest distance s.t. p is directly density-reachable from o (reach-dist(p, o))
- \(reach\_dist(p,o) = \begin{cases} dist(p, o), ~ if ~ core\_dist(o) \le dist(p,o) \le \epsilon\\core\_dist(o), ~ if ~dist(p, o) < core\_dist(o)\\\infty, ~ if ~ dist(p, o) > \epsilon\end{cases}\)
Explain the single steps of OPTICS.
What is the result? What can it be used for? (1) What are the advantages of it? (2)
- seedList: all objects seen so far, sorted by reachability (ascending)
- clusterOrder: sequential cluster-order (including reachabilities)
- result-object for reachability plot
- take the next entry o from the seedList (smallest reachability) and append it to the clusterOrder
- for each p in range(o, epsilon)
- compute reach-dist(p, o)
- update seedList (insert p / decrease its reachability, keeping the ascending order)
- repeat until the seedList is empty and all objects are processed
- result: reachability plot
- represents density-based clustering structure
- easy to analyse
- independent of dimensionality
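A usage sketch of OPTICS and its reachability plot via scikit-learn; parameters and toy data are arbitrary:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

opt = OPTICS(min_samples=5).fit(X)          # max_eps defaults to infinity
reach = opt.reachability_[opt.ordering_]    # reachabilities in cluster order (first value is inf)
plt.plot(reach)                             # valleys = dense regions (clusters)
plt.xlabel("cluster order"); plt.ylabel("reachability distance")
plt.show()
```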
Discuss hierarchical clustering (5+2)
- + # clusters need not be known in advance (it is part of the result)
- + no or robust parameter (in case of OPTICS)
- + generates complete nesting of clusters
- + good visualisations included
- + flat partition through cut of dendrogram / reachability plot possible
- - complexity:
- standard methods: O(n log(n))
- OPTICS: O(n^2) or O(n log(n)) with tree-based index support
- - final clustering needs to be chosen manually
How can clustering results be evaluated? Name pros and cons of each. (3: 3, 2, 2)
- expert opinion
- + new insights in data
- - expensive
- - not comparable / reproducible
- external measures
- + objective
- - ground-truth required
- internal measures
- + no additional information required
- - amounts to an optimisation of the evaluation criterion itself
Define the following external measures:
- TP
- FP
- TN
- FN
- recall
- precision
- F1-measure
- rand-index
- adjusted rand-index
- Jaccard coefficient
- confusion matrix
- shannon entropy
- conditional entropy
- mutual information
- normalized mutual information
- adjusted mutual information
- TP: \(|S_C \cap S_G|\)
- FP: \(|S_C \cap \overline{S_G}|\)
- TN: \(|\overline{S_C} \cap \overline{S_G}|\)
- FN: \(|\overline{S_C} \cap {S_G}|\)
- recall: \(\frac{TP}{TP + FN}\)
- precision: \(\frac{TP}{TP + FP}\)
- F1-measure: \(\frac{2\cdot rec\cdot prec}{rec + prec}\)
- rand-index: \(\frac{TP + TN}{TP + FN + FP + TN}\)
- adjusted rand-index: RI vs. expected RI from random cluster assignment
- Jaccard coefficient: \(\frac{TP}{TP + FN + FP}\)
- confusion matrix:
- \(N_{ij} = | C_i \cap G_j|\) matrix with i = cluster index, j = ground-truth index
- \(N_i = \sum_{j = 1}^l N_{ij} = |C_i|\) = row-wise sum, i.e. the number of objects in cluster i
- \(N = \sum_{i = 1}^k N_{i} = |D|\) = total number of objects
- shannon entropy: \(H(C) = - \sum_{C_i \in C}p(C_i)~log(p(C_i)) = - \sum_{i = 1}^k \frac{N_i}{N}~log\frac{N_i}{N}\)
- conditional entropy: \(H(G|C) = - \sum_{C_i \in C}\{p(C_i) \sum_{G_j \in G} p(G_j | C_i )~log(p(G_j | C_i))\} = - \sum_{i = 1}^k\{\frac{N_{i}}{N} \sum_{j = 1}^l \frac{N_{ij}}{N_i}~log\frac{N_{ij}}{N_i}\}\)
- mutual information: \(I (C, G) = H(C) - H(C|G) = H(G) - H(G|C)\)
- normalized mutual information: \(NMI(C, G) = \frac{I(C, G)}{\sqrt{H(C)H(G)}}\)
- adjusted mutual information: MI vs expected MI of random cluster assignment
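The pair-counting and information-theoretic measures above are also available in scikit-learn; a quick sketch with made-up label vectors:

```python
from sklearn.metrics import (adjusted_rand_score, normalized_mutual_info_score,
                             adjusted_mutual_info_score, confusion_matrix)

ground_truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
clustering   = [0, 0, 1, 1, 1, 1, 2, 2, 0]

print(confusion_matrix(ground_truth, clustering))   # contingency table (rows = ground truth, cols = clusters)
print("ARI:", adjusted_rand_score(ground_truth, clustering))
print("NMI:", normalized_mutual_info_score(ground_truth, clustering,
                                           average_method="geometric"))  # as in the formula above
print("AMI:", adjusted_mutual_info_score(ground_truth, clustering))
```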
Name three internal measures for cluster evaluation.
- cohesion:
- \(coh(C_i) = {|C_i| \choose 2}^{-1}\sum_{o, p \in C_i,~ o \neq p}d(o, p)\) = average distance of objects of same cluster
- \(coh(C) = \sum_{i = 1}^k \frac{|C_i|}{n} coh(C_i)\) = weighted mean of all C_i
- separation:
- separation between cluster i and j: \(sep(C_i, C_j) = \frac{1}{|C_i||C_j|}\sum_{o \in C_i,~p \in C_j}d(o, p)\) = average distance between all pairs of objects of C_i, and C_j
- minimum separation: \(sep(C_i) = \underset{j \neq i}{\mathrm{min}}~\{sep(C_i, C_j)\}\) = minimum separation to another cluster
- separation of a clustering: \(sep(C) = \sum_{i = 1}^k \frac{|C_i|}{n} ~ sep(C_i)\) = weighted mean of cluster's minimum separations
- evaluation of the distance matrix: data ordered by cluster labels >> blocks (squares) along the diagonal indicate clusters
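A direct NumPy/SciPy translation of the cohesion and separation formulas above; the helper name `cohesion_separation` and the toy data are mine:

```python
import numpy as np
from scipy.spatial.distance import cdist

def cohesion_separation(X, labels):
    ks = np.unique(labels)
    n = len(X)
    coh, sep = 0.0, 0.0
    for i in ks:
        Ci = X[labels == i]
        # cohesion: average pairwise distance within C_i
        d_in = cdist(Ci, Ci)
        coh_i = d_in[np.triu_indices(len(Ci), k=1)].mean() if len(Ci) > 1 else 0.0
        # minimum separation: smallest average distance to another cluster
        sep_i = min(cdist(Ci, X[labels == j]).mean() for j in ks if j != i)
        coh += len(Ci) / n * coh_i       # weighted mean over clusters
        sep += len(Ci) / n * sep_i
    return coh, sep

X = np.vstack([np.random.randn(40, 2) + c for c in ([0, 0], [6, 6])])
labels = np.repeat([0, 1], 40)
print(cohesion_separation(X, labels))
```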
When do cohesion and separation fail?
- when clusters are stretched: large cohesion and small separation possible
What is the ambiguity of clustering? (4)
- correct clustering doesn't exist
- check for reasonability of clustering required
- no clusters found != no clusters exist
- clusters found != good representation of reality
What is the idea of the Hopkins statistic? (1) How is it applied? (3) What do values of 0, 0.5, 1 imply?
- compares the dataset with uniformly distributed objects
- sample m objects (m << n) from the dataset, calculate the distance to their nearest neighbour: w_i
- sample m uniformly distributed points in the data space, calculate the distance to their nearest data point: u_i
- \(H = \frac{sum(u_i)}{sum(u_i) + sum(w_i)}\)
- H = 0: regularly spaced data, since the w_i are large compared to the u_i
- H = 0.5: uniformly distributed data, since sum(u_i) ≈ sum(w_i)
- H = 1: clustered data, since sum(w_i) ≈ 0 (objects lie very close to each other)
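A sketch of the Hopkins statistic following the three steps above; drawing the uniform samples from the bounding box of the data is my assumption, and the function name `hopkins` is illustrative:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(X, m=None, rng=np.random.default_rng(0)):
    n, d = X.shape
    m = m or max(1, n // 10)                                  # m << n
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    # w_i: distance of m sampled data points to their nearest (other) data point
    sample = X[rng.choice(n, m, replace=False)]
    w = nn.kneighbors(sample)[0][:, 1]
    # u_i: distance of m uniform points (bounding box) to their nearest data point
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = nn.kneighbors(uniform, n_neighbors=1)[0][:, 0]
    return u.sum() / (u.sum() + w.sum())   # ~0.5 uniform, -> 1 clustered, -> 0 regular

X = np.vstack([np.random.randn(100, 2) + c for c in ([0, 0], [8, 8])])
print(hopkins(X))   # clearly above 0.5 for clustered data
```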
What is the main idea of ensemble clustering? (1)
What are the benefits of it? (5)
- consolidate multiple cluster solutions: given \(\mathcal{C} = \{C_1, ..., C_L\}\), find consensus clustering C*
- benefits
- knowledge reuse
- improved quality
- improved robustness
- model selection
- unifying distributed clustering
What are the main approaches for ensemble clustering? (2: 2, 2)
- based on pairwise similarity: \(\phi(\mathcal{C}, C^*) = \sum_{C \in \mathcal{C}}\phi(C, C^*)\) with \(\phi\) = external evaluation measure
- pair-counting based (based on RI, ...)
- information theoretic (based on MI, NMI, ...)
- probabilistic approach
- assume L labels of objects x_i follow a distribution
- optimize parameters based on ML / EM
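As an illustration of the pairwise-similarity family, a sketch of the evidence-accumulation variant: build a co-association matrix from several base clusterings and cluster it hierarchically. This is one common instantiation, not necessarily the exact formulation from the lecture, and the toy data and parameters are arbitrary:

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

X = np.vstack([np.random.randn(60, 2) + c for c in ([0, 0], [6, 0], [3, 6])])
n, L = len(X), 10

# co-association: fraction of base clusterings that put x_i and x_j into the same cluster
co = np.zeros((n, n))
for l in range(L):
    labels = KMeans(n_clusters=np.random.randint(2, 6), n_init=5, random_state=l).fit_predict(X)
    co += (labels[:, None] == labels[None, :])
co /= L

# consensus clustering C*: cluster the objects using 1 - co-association as a distance
consensus = AgglomerativeClustering(n_clusters=3, metric="precomputed",
                                    linkage="average").fit_predict(1 - co)
print(np.bincount(consensus))
```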