KDDM@LMU
Card Set Details

Cards | 203 |
---|---|
Language | German |
Category | Computer Science |
Level | University |
Created / Updated | 01.08.2019 / 03.08.2019 |
Weblink | https://card2brain.ch/box/20190801_kddm |
How can the initialisation of partitions be done? (2 + 1)
- naive: sampling
- k-means++
- first center chosen uniformly at random
- next center sampled with probability proportional to the squared distance to the already sampled center(s)
- repeat until k centers are sampled
- >> in general: try several initialisations and keep the set of centers C with the lowest error (e.g. SSE); see the sketch below
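A minimal NumPy sketch of the k-means++ initialisation described above; the function name `kmeans_pp_init` and the toy data are illustrative, not from the lecture:

```python
import numpy as np

def kmeans_pp_init(X, k, rng=np.random.default_rng(0)):
    """Pick k initial centers: first uniformly, then proportional to squared distance."""
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                  # first center sampled uniformly
    for _ in range(k - 1):
        # squared distance of every point to its closest already-chosen center
        d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1), axis=1)
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])   # sample ~ squared distance
    return np.array(centers)

X = np.vstack([np.random.randn(50, 2) + c for c in ([0, 0], [5, 5], [0, 5])])
print(kmeans_pp_init(X, k=3))
```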
How to choose k in case of partition algorithms? What is a problem and how can it be solved?
- idea: generate clusters for various k, choose "best"
- problem: SSE, TD are monotonically decreasing in k
- solution: Silhouette-Coefficient (not monotonic over k)
- average distance of o to the objects of its own cluster: \(a(o) = \frac{1}{|C(o)|} \sum_{p \in C(o)} d(o, p)\)
- average distance of o to the objects of the second-closest cluster: \(b(o) = \underset{C_i \neq C(o)}{\mathrm{min}} \{ \frac{1}{|C_i|}\sum_{p \in C_i}d(o, p) \}\)
- silhouette coefficient for one observation: \(s(o) = \begin{cases}0, ~ if ~ a(o) = 0\\\frac{b(o)-a(o)}{max(a(o), b(o))}, ~ else\end{cases}\)
- \(s(o) \in [-1, 1]\)
- s(o) ≈ 1 if b(o) >> a(o) >> good assignment of o to its cluster
- s(o) = 0 if b(o) = a(o) >> o lies between clusters A and B
- s(o) ≈ -1 if b(o) << a(o) >> bad assignment, o is closer to cluster B
- silhouette coefficient for one cluster: \(s(C_i) = \frac{1}{|C_i|}\sum_{o \in C_i}s(o)\)
- silhouette coefficient for clustering: \(s(C) = \frac{1}{|D|}\sum_{o \in D}s(o)\)
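A short sketch of the silhouette-based choice of k; scikit-learn's `silhouette_score` is used as a stand-in for the formulas above, and the toy data are made up:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.vstack([np.random.randn(60, 2) + c for c in ([0, 0], [6, 0], [3, 6])])

# Cluster for several k and keep the k with the highest mean silhouette s(C).
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # not monotone in k, unlike SSE/TD
best_k = max(scores, key=scores.get)
print(scores, "-> best k:", best_k)
```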
How to interpret the silhouette coefficient of a clustering \(s(C)\)? (4)
- 0.7 - 1: strong structure
- 0.5 - 0.7: medium structure
- 0.25 - 0.5: weak structure
- 0 - 0.25: no structure
What is the general idea of EM? (2) Explain the basic concept. (4)
- statistical approach: ML estimates of the parameters (probabilistic model)
- assumption: observations are drawn from one of several components of a mixture distribution
- concept
- clusters = probability distributions
- for each object, calculate the probability of belonging to each cluster >> assign to the most probable one
- iteratively improve parameters and assignments
- any distribution can be used, e.g. Gaussian
How is a Gaussian mixture distribution defined?
- \(p(x) = \sum_{l = 1}^{k}\pi_l \times N(x | \mu_l, \Sigma_l) = p(x | \theta)\), with \(\pi_l = p(z_l = 1)\) = mixing coefficients
- \(\theta = (\pi_l, \mu_l, \Sigma_l), l = \{1, ..., k\}\)
- likelihood: \(p(X|\theta) = \prod_{i = 1}^n p(x_i|\theta)\)
- \(\theta_{ML} = \underset{\theta}{\mathrm{argmax}}~p(X|\theta)\)
What is the responsibility? In words and mathematically.
- probability that component j generated x_i
- \(\gamma_j(x_i) = \frac{\pi_j\times N(x_i|\mu_j, \Sigma_j)}{\sum_{l = 1}^{k}\pi_l\times N(x_i | \mu_l, \Sigma_l)} = p(z_j = 1 |x_i, \theta)=\frac{p(x_i|z_j = 1, \theta)\times p(z_j = 1)}{\sum_{l = 1}^{k}p(x_i|z_l = 1, \theta) \times p(z_l =1)}\)
Name the ML-estimates in case of EM. What problem does this imply and how to solve this?
- \(\mu_j = \frac{\sum_{i = 1}^{n} \gamma_j(x_i) \times x_i}{\sum_{i = 1}^n \gamma_j(x_i)}\)
- \(\Sigma_j = \frac{\sum_{i = 1}^{n} \gamma_j(x_i) (x_i - \mu_j)(x_i - \mu_j)^T}{\sum_{i = 1}^n \gamma_j(x_i)}\)
- \(\gamma_j(x_i) = \frac{\pi_j \times N(x_i|\mu_j, \Sigma_j)}{\sum_{l = 1}^k \pi_l \times N(x_i|\mu_l, \Sigma_l)}\)
- with \(\pi_j = \frac{N_j}{N}\), where \(N_j = \sum_{i = 1}^n \gamma_j(x_i)\)
- >> non-linear mutual dependencies
- solution: iterative independent optimization procedure (approximation)
Explain the iterative procedure of EM. (4)
- init \(\theta_j = (\mu_j, \Sigma_j, \pi_j)\) and calculate the log-likelihood
- E-step: evaluate responsibilities with current parameters
- \(\gamma_j^{new}(x_i) = \frac{\pi_j \times N(x_i|\mu_j, \Sigma_j)}{\sum_{l = 1}^k \pi_l \times N(x_i|\mu_l, \Sigma_l)}\)
- M-step: re-estimate parameters using the new responsibilities
- \(\mu_j^{new} = \frac{\sum_{i = 1}^{n} \gamma_j^{new}(x_i) \times x_i}{\sum_{i = 1}^n \gamma_j^{new}(x_i)}\)
- \(\Sigma_j^{new} = \frac{\sum_{i = 1}^{n} \gamma_j^{new}(x_i) (x_i - \mu_j^{new})(x_i - \mu_j^{new})^T}{\sum_{i = 1}^n \gamma_j^{new}(x_i)}\)
- \(\pi_j^{new} = \frac{1}{n}\sum_{i = 1}^{n} \gamma_j^{new}(x_i)\)
- evaluate log-likelihood: \(|log(p(X|\theta^{new})) - log(p(X|\theta))| \le \epsilon\)? break : repeat (see the sketch below)
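A compact NumPy sketch of this EM loop for a Gaussian mixture; the function name `em_gmm`, the regularisation term and the convergence threshold are my own illustrative choices:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, eps=1e-4, max_iter=200, rng=np.random.default_rng(0)):
    n, d = X.shape
    mu = X[rng.choice(n, k, replace=False)]               # init means from data points
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * k)
    pi = np.full(k, 1.0 / k)
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: responsibilities gamma_j(x_i)
        dens = np.column_stack([pi[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
                                for j in range(k)])        # shape (n, k)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters with the new responsibilities
        Nj = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nj[:, None]
        for j in range(k):
            diff = X - mu[j]
            Sigma[j] = (gamma[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(d)
        pi = Nj / n
        # convergence check on the log-likelihood
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - prev_ll) <= eps:
            break
        prev_ll = ll
    return mu, Sigma, pi, gamma
```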
What is meant with soft clustering? (2)
- object belongs to cluster with probability (expresses confidence / uncertainty)
- partition: assign each object to the cluster to which it most probably belongs
- \(C(x_i) = \underset{l \in \{1, ..., k\}}{\mathrm{argmax}} \{\gamma_l(x_i)\} \)
Discuss EM. (1 + 4)
- + clusters vary in size (better representation)
- - converges to local optimum
- - \(\mathcal{O}(tkn)\), usually high t
- - results and t depend on initial assignment (2)
- try multiple starts, choose the result with the highest likelihood
- initialisation i.e. with k-means for faster convergence
- - singularity of EM: cluster with one point: \(\sigma^2 \rightarrow 0 \Rightarrow f(x) \rightarrow \infty\), hence \(l(\theta)\rightarrow \infty\)
How to choose k in case of EM? (2) Which problems occur? (2) How to solve it? (2)
- low k = no flexibility, high k = overfitting >> trade-off: goodness of fit vs. simplicity
- idea: introduce quality measure: \(\theta_{k^*} = \underset{k \in \{k_{min}, k_{max}\}}{\mathrm{argmax}} \{qual(\theta_k)\} \)
- problems (2)
- Silhouette coefficient works only for partitioning
- likelihood is increasing in k
- solutions
- deterministic: \(qual(\theta_k) = log(p(x|\theta_k)) + P(k)\) with \(P(k)\) as a penalty term for large k (e.g. a BIC-style penalty)
- stochastic: MCMC-simulation
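A sketch of the penalised-likelihood idea for choosing k, using scikit-learn's `GaussianMixture` and its BIC criterion as one concrete penalty; the toy data are made up:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.vstack([np.random.randn(80, 2) + c for c in ([0, 0], [6, 0], [3, 6])])

# BIC = -2 * log-likelihood + (#parameters) * log(n): lower is better,
# i.e. it trades goodness of fit against model complexity (large k).
bics = {k: GaussianMixture(n_components=k, n_init=3, random_state=0).fit(X).bic(X)
        for k in range(1, 7)}
print(bics, "-> chosen k:", min(bics, key=bics.get))
```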
What is the general idea of density-based clustering? (2)
- clusters = dense regions
- separated by regions of lower density
Explain basic concepts like
- epsilon-radius
- MinPts
- Core Point
- directly density reachable
- density reachable
- density connected
- maximality
- connectivity
- noise
- border-point
- epsilon-radius: neighbourhood of q: \(N_\epsilon(q) = \{p\in D | dist(p, q) \le\epsilon\}, q \in N_\epsilon(q)\)
- MinPts: required min points in \(N_\epsilon(q)\)
- Core Point: if \(|N_\epsilon(q)| \ge minPts\)
- directly density reachable if \(p \in N_\epsilon(q)\) and q is a core point
- density reachable: transitive closure of directly density-reachable
- density connected: there exists a point o s.t. both p and q are density-reachable from o
- maximality: if \(q \in C\) and p is density-reachable from q >> \(p \in C\)
- connectivity: all objects of one cluster are density-connected to all other objects in C
- noise: instances which are not assigned to any cluster (not density-reachable from any core point): \(D \setminus (C_1~\cup ... \cup~C_k)\)
- border-point: \(|N_\epsilon(p)| < minPts \land p \in N_\epsilon(q)\)
- no core-point but directly density reachable from core-point
Briefly explain the steps of DBSCAN
- = Density-Based Spatial Clustering of Applications with Noise
- for all (unclassified) points
- if the point is a core object: assign all objects density-reachable from it (via epsilon-neighbourhood queries) to the current (new) cluster
- else: label it as noise
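A minimal usage sketch of DBSCAN via scikit-learn; the parameter values and the `make_moons` toy data are arbitrary examples:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)   # label -1 marks noise
print("clusters found:", len(set(labels) - {-1}), "| noise points:", (labels == -1).sum())
```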
How to choose epsilon and minPts in case of DBSCAN? Explain the heuristic.(5)
- idea: use the least dense cluster to define epsilon and minPts
- heuristic
- fix minPts (e.g. 2d - 1, with d = dimensionality)
- calculate k-distances (with k = minPts) for all p in D
- plot all k-distances in decreasing order
- choose the "border point" o at the elbow of the plot
- set epsilon = k-distance(o) (see the sketch below)
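A sketch of this k-distance heuristic; the "elbow" is still read off the sorted plot manually, and the toy data are made up:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

min_pts = 2 * X.shape[1] - 1                       # heuristic: 2*d - 1
# k-distance = distance to the minPts-th nearest neighbour (excluding the point itself)
dists, _ = NearestNeighbors(n_neighbors=min_pts + 1).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, -1])[::-1]               # decreasing order

plt.plot(k_dist)
plt.xlabel("points sorted by k-distance"); plt.ylabel(f"{min_pts}-distance")
plt.show()                                         # choose eps at the elbow of this curve
```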
Discuss DBSCAN (4+2)
- + arbitrary size and shape
- + k automatically determined
- + separates clusters from noise
- + low complexity: neighbourhood-query: O(n), DBSCAN: O(n^2)
- supports spatial index structures s.t. neighbourhood-query: O(log(n)), overall O(n*log(n))
- - parameters need to be defined (epsilon, minPts)
- - sensitive to parameters
How can DBSCAN be supported on the database side? (3)
- epsilon-similarity join: all pairs of epsilon-similar objects from two datasets Q and P
- \(Q \bowtie_\epsilon P = \{(q, p)\in Q\times P ~|~ dist(q, p)\le\epsilon\}\)
- epsilon-similarity self-join
- \(D \bowtie_\epsilon D = \{(q, p) \in D \times D ~ | ~ dist(q,p) \le \epsilon\}\)
- "directly epsilon, minPts density-reachable"-query
- SELECT q.id FROM D q, D p WHERE dist(q, p) <= epsilon GROUP BY q.id HAVING count(q.id) >= minPts
Explain idea (1) and steps of Mean Shift (2+4). What is a basin of attraction?
- idea: find mode-points in density
- for each point m
- select window-size epsilon
- calculate mean inside window: w(m)
- shift window to w(m)
- repeat the previous two steps until convergence (the window no longer moves)
- instances that converge to the same mode are grouped into one cluster; the set of points converging to a given mode is its "basin of attraction" (see the sketch below)
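A usage sketch of Mean Shift with scikit-learn; the bandwidth plays the role of the window size epsilon, and the toy data and quantile are arbitrary:

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

X = np.vstack([np.random.randn(80, 2) + c for c in ([0, 0], [6, 0], [3, 6])])

bw = estimate_bandwidth(X, quantile=0.2)          # window size epsilon
ms = MeanShift(bandwidth=bw).fit(X)
print("modes (cluster centers):", ms.cluster_centers_)
print("number of clusters:", len(np.unique(ms.labels_)))
```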
Discuss Mean Shift.(5+1)
- + arbitrary size and shape
- + # clusters automatically determined
- + robust to outliers
- + easy implementation / parallelisation ("for each point")
- + one parameter: window-size epsilon
- - high complexity: neighbourhood-query: O(n), overall O(tn^2)
- supports spatial index neighbourhood-query: O(log(n)), overall O(tn*log(n))
What is the main idea of spectral clustering? What is the main problem, what is the solution?
- data modeled as similarity graph (V = objects, E = similarity of v_i, v_j)
- "cutting" similarity graph by global minimum cut >> NP-hard problem
- solution: use approximations
How can an undirected graph be characterised in case of spectral clustering? (2)
- W: weighted adjacency matrix
- D: degree matrix: \(D_{ii} = \sum_{j = 1}^n W_{ij}\), \(D_{ij} = 0\) for \(i \neq j\)
- diagonal entry \(D_{ii}\) = sum of the weights of the edges incident to \(v_i\)
Explain an approximation of spectral clustering:
What is the aim? (2 + 1)
What is the main idea? (3)
- aim:
- partitioning G into k subsets
- minimizing weights within partitions
- >> comparable to partitioning idea
- idea:
- indicator vector for cluster C: \((f_C)_i = \begin{cases} 1, ~ if~v_i \in C\\ 0, ~ else \end{cases}\)
- Laplacian matrix: L = D - W
- analyse eigenvalues and -vectors of L
- \(fLf^T\) represents weight of inter-cluster edges >> to be minimized (besides solution f = (1, ..., 1))
- small for good partitioning
- \(fLf^T = fDf^T - fWf^T = ~...~ = \frac{1}{2}\sum_{ij}w_{ij}(f_i - f_j)^2\)
What happens if there are no edges between clusters in case of spectral clustering (perfect separation)? (2)
- blocks in L
- \(fLf^T = 0\) for each C
What is the relation between connected components and eigenvalues of \(L\) in case of spectral clustering? (3).
How to find clusters based on this?
- # (eigenvalue of 0) = # connected components in the graph
- eigenvalues are usually != 0 (one large connected component)
- solution: analyse eigenvectors belonging to k smallest eigenvalues
- vertices represented via the eigenvectors \((v_1, v_2,..., v_k)\) >> clustering based on this representation (e.g. with k-means); see the sketch below
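A small sketch of this approximation: build L = D - W, take the eigenvectors of the k smallest eigenvalues, and run k-means on that representation. The similarity graph here is a simple RBF kernel with an arbitrary width, chosen only for illustration:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
k = 2

W = np.exp(-cdist(X, X) ** 2 / (2 * 0.1 ** 2))     # weighted adjacency (RBF similarity)
np.fill_diagonal(W, 0)
D = np.diag(W.sum(axis=1))                         # degree matrix
L = D - W                                          # (unnormalised) Laplacian

eigvals, eigvecs = np.linalg.eigh(L)               # eigh: L is symmetric
U = eigvecs[:, :k]                                 # eigenvectors of the k smallest eigenvalues
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)
print(labels[:20])
```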
When is hierarchical clustering required? (2) What is the result of hierarchical clustering? (5)
- required if:
- hierarchical structure of clusters
- different densities, sizes
- result: dendrogram (5)
- tree structure of clusters
- root: whole dataset
- leaf: single object
- internal node: union of all objects in its sub-tree
- height of an internal node = distance between its two children
What are the two main approaches of hierarchical clustering? (2)
- agglomerative: bottom-up
- divisive: top down
Explain steps of agglomerative clustering. (4) What is the resulting cluster based on?
- each object in own cluster
- compute pairwise distance between clusters, distance function for clusters required
- single-link: \(dist(X, Y) = min_{x \in X, y \in Y}~dist(x, y)\)
- complete link: \(dist(X, Y) = max_{x \in X, y \in Y}~dist(x, y)\)
- average-link: \(dist(X, Y) = (|X|\times|Y|)^{-1}\times \sum_{x\in X, y\in Y}~dist(x, y)\)
- merge closest clusters into new one (\((A, B) \rightarrow A' = A \cup B\))
- repeat the previous two steps until one cluster includes all objects
>> merging is based on local patterns (pairwise distances); see the sketch below
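A short SciPy sketch of agglomerative clustering with the three linkage functions mentioned above; the toy data are made up:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.vstack([np.random.randn(30, 2) + c for c in ([0, 0], [6, 0], [3, 6])])

for method in ("single", "complete", "average"):     # single-, complete-, average-link
    Z = linkage(X, method=method)                    # bottom-up merge sequence
    labels = fcluster(Z, t=3, criterion="maxclust")  # flat partition by cutting the tree
    print(method, np.bincount(labels))
# scipy.cluster.hierarchy.dendrogram(Z) would draw the full tree
```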
Explain basic idea of divisive clustering. (2)
How can the splitting be done? (2 + 2 + 2)
What characterises the clustering-results?
- all objects in one cluster
- split until all clusters are singletons: 2^(n-1)-1 splitting possibilities in the first iteration
- i.e. based on largest diameter
- most disparate object o in C: highest average dissimilarity to the other objects
- splinter group: S = {o}
- assign \(o' \notin S\) with \(D(o') > 0\) to S until \(D(o') \le 0, ~\forall o'\notin S\)
- with \(D(o') = \sum_{o_j \in C \setminus S}\frac{d(o', o_j)}{|C\setminus S|} - \sum_{o_i\in S} \frac{d(o', o_i)}{|S|}\)
- = avg distance from o' to elements not in the splinter group minus avg distance to elements in the splinter group
>> more complex splitting procedure, but based on global patterns (it uses the complete data), hence potentially more accurate
What is the basic idea of density-based hierarchical clustering? (1) Name the algorithm and what it stands for.
- using the processing order of objects to track density >> OPTICS = Ordering points to identify the clustering structure
What is the core-distance and the reach-distance in case of OPTICS?
- core-distance: smallest distance s.t. o is a core point (core-dist(o))
- if core-dist > epsilon => undefined
- reach-distance: smallest distance s.t. p is directly density-reachable from o (reach-dist(p, o))
- \(reach\_dist(p,o) = \begin{cases} dist(p, o), ~ if ~ core\_dist(o) \le dist(p,o) \le \epsilon\\core\_dist(o), ~ if ~dist(p, o) < core\_dist(o)\\\infty, ~ if ~ dist(p, o) > \epsilon\end{cases}\)
Explain the single steps of OPTICS.
What is the result? What can it be used for? (1) What are the advantages of it? (2)
- seedList: all objects seen so far, sorted by reachability (ascending)
- clusterOrder: sequential cluster-order (including reachabilities)
- result-object for reachability plot
- take the next entry o from the seedList (smallest reachability) and append it to the clusterOrder
- for each p in range(o, epsilon)
- compute reach-dist(p, o)
- update seedList (insert p / decrease its reachability, keeping the ascending order)
- repeat until the seedList is empty and all objects are processed
- result: reachability plot
- represents density-based clustering structure
- easy to analyse
- independent of dimensionality
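A usage sketch of OPTICS and its reachability plot via scikit-learn; parameters and toy data are arbitrary:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

opt = OPTICS(min_samples=5).fit(X)          # max_eps defaults to infinity
reach = opt.reachability_[opt.ordering_]    # reachabilities in cluster order (first value is inf)
plt.plot(reach)                             # valleys = dense regions (clusters)
plt.xlabel("cluster order"); plt.ylabel("reachability distance")
plt.show()
```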
Discuss hierarchical clustering (5+2)
- + # clusters need not be known in advance (it is part of the result)
- + no or robust parameter (in case of OPTICS)
- + generates complete nesting of clusters
- + good visualisations included
- + flat partition through cut of dendrogram / reachability plot possible
- - complexity:
- standard methods: O(n log(n))
- OPTICS: O(n^2) or O(n log(n)) with tree-based index support
- - final clustering needs to be chosen manually
How can clustering results be evaluated? Name pros and cons of each. (3: 3, 2, 2)
- expert opinion
- + new insights in data
- - expensive
- - not comparable / reproducible
- external measures
- + objective
- - ground-truth required
- internal measures
- + no additional information required
- - amounts to an optimisation of the evaluation criterion itself
Define the following external measures:
- TP
- FP
- TN
- FN
- recall
- precision
- F1-measure
- rand-index
- adjusted rand-index
- Jaccard coefficient
- confusion matrix
- shannon entropy
- conditional entropy
- mutual information
- normalized mutual information
- adjusted mutual information
- TP: \(|S_C \cap S_G|\)
- FP: \(|S_C \cap \overline{S_G}|\)
- TN: \(|\overline{S_C} \cap \overline{S_G}|\)
- FN: \(|\overline{S_C} \cap {S_G}|\)
- recall: \(\frac{TP}{TP + FN}\)
- precision: \(\frac{TP}{TP + FP}\)
- F1-measure: \(\frac{2\cdot rec\cdot prec}{rec + prec}\)
- rand-index: \(\frac{TP + TN}{TP + FN + FP + TN}\)
- adjusted rand-index: RI vs. expected RI from random cluster assignment
- Jaccard coefficient: \(\frac{TP}{TP + FN + FP}\)
- confusion matrix:
- \(N_{ij} = | C_i \cap G_j|\) matrix with i = cluster index, j = ground-truth index
- \(N_i = \sum_{j = 1}^l N_{ij} = |C_i|\) = row-wise sum, i.e. the number of objects in cluster i
- \(N = \sum_{i = 1}^k N_{i} = |D|\) = total number of objects
- shannon entropy: \(H(C) = - \sum_{C_i \in C}p(C_i)~log(p(C_i)) = - \sum_{i = 1}^k \frac{N_i}{N}~log\frac{N_i}{N}\)
- conditional entropy: \(H(G|C) = - \sum_{C_i \in C}\{p(C_i) \sum_{G_j \in G} p(G_j | C_i )~log(p(G_j | C_i))\} = - \sum_{i = 1}^k\{\frac{N_{i}}{N} \sum_{j = 1}^l \frac{N_{ij}}{N_i}~log\frac{N_{ij}}{N_i}\}\)
- mutual information: \(I (C, G) = H(C) - H(C|G) = H(G) - H(G|C)\)
- normalized mutual information: \(NMI(C, G) = \frac{I(C, G)}{\sqrt{H(C)H(G)}}\)
- adjusted mutual information: MI vs expected MI of random cluster assignment
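The pair-counting and information-theoretic measures above are also available in scikit-learn; a quick sketch with made-up label vectors:

```python
from sklearn.metrics import (adjusted_rand_score, normalized_mutual_info_score,
                             adjusted_mutual_info_score, confusion_matrix)

ground_truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]
clustering   = [0, 0, 1, 1, 1, 1, 2, 2, 0]

print(confusion_matrix(ground_truth, clustering))   # contingency table (rows = ground truth, cols = clusters)
print("ARI:", adjusted_rand_score(ground_truth, clustering))
print("NMI:", normalized_mutual_info_score(ground_truth, clustering,
                                           average_method="geometric"))  # as in the formula above
print("AMI:", adjusted_mutual_info_score(ground_truth, clustering))
```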
Name three internal measures for cluster evaluation.
- cohesion:
- \(coh(C_i) = {|C_i| \choose 2}^{-1}\sum_{o, p \in C_i,~ o \neq p}d(o, p)\) = average distance of objects of same cluster
- \(coh(C) = \sum_{i = 1}^k \frac{|C_i|}{n} coh(C_i)\) = weighted mean of all C_i
- separation:
- separation between cluster i and j: \(sep(C_i, C_j) = \frac{1}{|C_i||C_j|}\sum_{o \in C_i,~p \in C_j}d(o, p)\) = average distance between all pairs of objects of C_i, and C_j
- minimum separation: \(sep(C_i) = \underset{j \neq i}{\mathrm{min}}~\{sep(C_i, C_j)\}\) = minimum separation to another cluster
- separation of a clustering: \(sep(C) = \sum_{i = 1}^k \frac{|C_i|}{n} ~ sep(C_i)\) = weighted mean of cluster's minimum separations
- evaluation of the distance matrix: data ordered by cluster labels >> blocks (squares) along the diagonal indicate clusters
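A direct NumPy/SciPy translation of the cohesion and separation formulas above; the helper name `cohesion_separation` and the toy data are mine:

```python
import numpy as np
from scipy.spatial.distance import cdist

def cohesion_separation(X, labels):
    ks = np.unique(labels)
    n = len(X)
    coh, sep = 0.0, 0.0
    for i in ks:
        Ci = X[labels == i]
        # cohesion: average pairwise distance within C_i
        d_in = cdist(Ci, Ci)
        coh_i = d_in[np.triu_indices(len(Ci), k=1)].mean() if len(Ci) > 1 else 0.0
        # minimum separation: smallest average distance to another cluster
        sep_i = min(cdist(Ci, X[labels == j]).mean() for j in ks if j != i)
        coh += len(Ci) / n * coh_i       # weighted mean over clusters
        sep += len(Ci) / n * sep_i
    return coh, sep

X = np.vstack([np.random.randn(40, 2) + c for c in ([0, 0], [6, 6])])
labels = np.repeat([0, 1], 40)
print(cohesion_separation(X, labels))
```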
When do cohesion and separation fail?
- when clusters are stretched: large cohesion and small separation possible
What is the ambiguity of clustering? (4)
- correct clustering doesn't exist
- check for reasonability of clustering required
- no clusters found != no clusters exist
- clusters found != good representation of reality
What is the idea of the Hopkins statistic? (1) How is it applied? (3) What do values of 0, 0.5, 1 imply?
- compares the dataset with uniformly distributed objects
- sample m objects (m << n) from the dataset, calculate the distance to their nearest neighbour: w_i
- sample m uniformly distributed points in the data space, calculate the distance to their nearest data point: u_i
- \(H = \frac{sum(u_i)}{sum(u_i) + sum(w_i)}\)
- H = 0: regularly spaced data, since the w_i are large compared to the u_i
- H = 0.5: uniformly distributed data, since sum(u_i) ≈ sum(w_i)
- H = 1: clustered data, since sum(w_i) ≈ 0 (objects lie very close to each other)
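A sketch of the Hopkins statistic following the three steps above; drawing the uniform samples from the bounding box of the data is my assumption, and the function name `hopkins` is illustrative:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins(X, m=None, rng=np.random.default_rng(0)):
    n, d = X.shape
    m = m or max(1, n // 10)                                  # m << n
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    # w_i: distance of m sampled data points to their nearest (other) data point
    sample = X[rng.choice(n, m, replace=False)]
    w = nn.kneighbors(sample)[0][:, 1]
    # u_i: distance of m uniform points (bounding box) to their nearest data point
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = nn.kneighbors(uniform, n_neighbors=1)[0][:, 0]
    return u.sum() / (u.sum() + w.sum())   # ~0.5 uniform, -> 1 clustered, -> 0 regular

X = np.vstack([np.random.randn(100, 2) + c for c in ([0, 0], [8, 8])])
print(hopkins(X))   # clearly above 0.5 for clustered data
```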
What is the main idea of ensemble clustering? (1)
What are the benefits of it? (5)
- consolidate multiple cluster solutions: given \(\mathcal{C} = \{C_1, ..., C_L\}\), find consensus clustering C*
- benefits
- knowledge reuse
- improved quality
- improved robustness
- model selection
- unifying distributed clustering
What are the main approaches for ensemble clustering? (2: 2, 2)
- based on pairwise similarity: \(\phi(\mathcal{C}, C^*) = \sum_{C \in \mathcal{C}}\phi(C, C^*)\) with \(\phi\) = external evaluation measure
- pair-counting based (based on RI, ...)
- information theoretic (based on MI, NMI, ...)
- probabilistic approach
- assume L labels of objects x_i follow a distribution
- optimize parameters based on ML / EM
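As an illustration of the pairwise-similarity family, a sketch of the evidence-accumulation variant: build a co-association matrix from several base clusterings and cluster it hierarchically. This is one common instantiation, not necessarily the exact formulation from the lecture, and the toy data and parameters are arbitrary:

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

X = np.vstack([np.random.randn(60, 2) + c for c in ([0, 0], [6, 0], [3, 6])])
n, L = len(X), 10

# co-association: fraction of base clusterings that put x_i and x_j into the same cluster
co = np.zeros((n, n))
for l in range(L):
    labels = KMeans(n_clusters=np.random.randint(2, 6), n_init=5, random_state=l).fit_predict(X)
    co += (labels[:, None] == labels[None, :])
co /= L

# consensus clustering C*: cluster the objects using 1 - co-association as a distance
consensus = AgglomerativeClustering(n_clusters=3, metric="precomputed",
                                    linkage="average").fit_predict(1 - co)
print(np.bincount(consensus))
```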