KDDM

KDDM@LMU



Set of flashcards Details

Flashcards 203
Language German
Category Computer Science
Level University
Created / Updated 01.08.2019 / 03.08.2019
Weblink
https://card2brain.ch/box/20190801_kddm

Explain pair-counting-based ensemble clustering. What is the problem, how to tackle this? (2: 3, 0)

  • choose the consensus clustering C* s.t. \(\sum_{C \in \mathcal{C}}\phi(C, C^*)\) is maximal
  • problem: this optimization is intractable for most choices of phi
  • solutions
    • use co-association matrix (similarity matrix) S (= adjacency matrix)
      • \(S_{ij}^\mathcal{C} = \sum_{C \in \mathcal{C}}1_{\{x_i \in C~\land~x_j \in C \}}\)
      • apply clustering by graph partitioning (spectral clustering)
      • >> solution comparable to using phi = RI (Rand index)
    • graph-partitioning / cumulative voting
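A minimal sketch of the co-association approach, assuming the base clusterings are given as label arrays; the graph-partitioning step is done here with scikit-learn's SpectralClustering on the co-association matrix as a precomputed affinity (an illustrative choice, not necessarily the lecture's exact procedure):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def co_association(labelings):
    """S_ij = number of base clusterings that put x_i and x_j in the same cluster."""
    n = len(labelings[0])
    S = np.zeros((n, n))
    for labels in labelings:
        labels = np.asarray(labels)
        S += (labels[:, None] == labels[None, :]).astype(float)
    return S

# three hypothetical base clusterings of 6 points
ensemble = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 2, 2],
    [1, 1, 1, 0, 0, 0],
]
S = co_association(ensemble)

# graph partitioning on the similarity (adjacency) matrix S
consensus = SpectralClustering(n_clusters=2, affinity="precomputed").fit_predict(S)
print(consensus)
```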

Explain the information theoretic approach of ensemble clustering.

  • phi = (N)MI; maximization is an NP-hard problem >> approximate
  • solution: 
    • \(C_{opt} = argmax_{C^*}\left(\sum_{C \in \mathcal{C}}I^s(C, C^*)\right)\) with generalized mutual information \(I^s\) based on the generalized entropy of degree s, \(H^s(X) = (2^{1-s}-1)^{-1}\left(\sum_{x_i \in X}p_i^s - 1\right)\)
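A small sketch of the degree-s generalized entropy used above (function and variable names are illustrative):

```python
import numpy as np

def generalized_entropy(p, s):
    """H^s(X) = (2^(1-s) - 1)^(-1) * (sum_i p_i^s - 1), generalized entropy of degree s."""
    p = np.asarray(p, dtype=float)
    return (2 ** (1 - s) - 1) ** -1 * (np.sum(p ** s) - 1)

print(generalized_entropy([0.5, 0.5], s=2))  # quadratic entropy of a fair coin
```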

What is the general definition of an outlier?

What are general differences in approaches of outlier-detection? (3)

  • outlier = strong deviation from other observations (i.e. likely generated by a different distribution/mechanism)

 

  • global vs. local outliers
  • labeling vs. scoring of outliers
  • assumptions about outlierness

Name five approaches for outlier detection.

  • clustering based
  • statistical tests
  • distance based methods
  • density based
  • angle based

Name applications of outlier detection. (5)

  • medicine: unusual symptoms
  • sports: talent scouting (abnormal parameters/skills)
  • fraud detection: e.g. stolen credit card (changing behavior)
  • public health: abnormal occurrence of diseases
  • error detection: e.g. errors in sensor data (to be removed)

Explain clustering based outlier detection. (3)

  • outlier candidates = clusters with few instances
  • calculate distance between outlier-candidates and non-candidates: if large >> outliers
  • or: if removing candidate leads to improved objective function >> outlier

Explain outlier detection based on statistical tests. Name parameters (3), give examples (2). What are problems? (4)

  • estimate distribution and parameter given all samples
  • outlier = low probability of being generated by that distribution (e.g. outside mu ± 3*sigma)
  • parameters
    • distribution type
    • parametric / non parametric
    • # distributions
  • examples
    • multivariate normal
    • Mahalanobis distance: \(MDist(x, \mu) = \sqrt{(x-\mu)^T\Sigma^{-1}(x-\mu)}\) with \(MDist^2 \sim \chi^2_d\)
  • problems
    • curse of dimensionality: similar MDs for large d
    • robustness: parameters are sensitive to outliers (mean/covariance are estimated from data that already contains the outliers to be detected)
    • low flexibility as distribution fixed
    • global method 
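A hedged sketch of this statistical approach using the Mahalanobis distance; the chi-square threshold and the toy data are assumptions for illustration:

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(X, alpha=0.975):
    """Flag points whose squared Mahalanobis distance exceeds the chi^2_d quantile.
    Note: mu and Sigma are estimated from all samples, including the outliers we
    want to detect -- the robustness problem mentioned above."""
    mu = X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mu
    md2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)   # squared MDist per point
    return md2 > chi2.ppf(alpha, df=X.shape[1])

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[0] = [8.0, 8.0]                                         # one injected outlier
print(np.where(mahalanobis_outliers(X))[0])
```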


Explain distance-based outlier detection and its approaches (1+2)

  • normal = dense neighbourhood, outlier = sparse neighbourhood >> distance as indicator
  • approaches
    • scoring based on kNN-distances
      • outlier-score = kNN-distance; outlier if kNN-dist > tau
      • problem: sensitive to k (e.g. if a cluster contains fewer than k points), global method due to the global threshold tau
    • D(epsilon, pi)-outliers
      • outlier set: \(\{\, p~|~\frac{|\{q\in D ~|~ dist(q, p) < \epsilon\}|}{|D|} \le \pi \,\}\)
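A compact sketch of both distance-based variants on toy data (brute-force pairwise distances; the function names are made up for illustration):

```python
import numpy as np

def knn_dist_scores(X, k):
    """Outlier score = distance to the k-th nearest neighbour (larger = more outlying)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    D.sort(axis=1)
    return D[:, k]                    # column 0 is the distance of a point to itself

def d_eps_pi_outliers(X, eps, pi):
    """D(eps, pi)-outliers: points whose eps-neighbourhood contains at most a
    fraction pi of the whole data set."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    frac = (D < eps).mean(axis=1)
    return frac <= pi

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(100, 2)), [[10.0, 10.0]]])       # cluster + one outlier
print(knn_dist_scores(X, k=3).argmax())                          # most outlying point
print(np.where(d_eps_pi_outliers(X, eps=2.0, pi=0.05))[0])
```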

Explain the basic idea of density-based outlier-detection. (1) What is the problem here? (1)

  • compare the density of a point with the densities of its neighbors >> relative density = outlier-score
    • similar density >> normal point
    • different density >> outlier
  • problem: global definition of density (epsilon), density may differ between regions

Explain outlier detection based on local density. How is outlier defined in this case? State different values of LOF.

What is the meaning of k in this case?

  • local density: \(ld_k(p) = (1/k ~\cdot~\sum_{o \in kNN(p)}dist(o, p) )^{-1}\) = inverse of average distance of kNNs of p
  • local outlier factor: \(LOF_k(p) = 1/k ~\cdot~\sum_{o\in kNN(p)} \frac{ld_k(o)}{ld_k(p)}\) = average ratio of ld(o) of the kNNs of p over ld(p)
    • if LOF_k(p) > tau >> outlier
    • LOF = 1 = normal
    • LOF >> 1 = outlier
  • k defines reference set (kNNs)

Explain outlier detection based on reachability-distance.

  • reachability-distance: \(rd_k(p, o) = max(kdist(o), dist(p,o))\)
  • local reachability-distance: \(lrd_k(p) = (1/k ~\cdot \sum_{o \in kNN(p)}rd_k(p, o))^{-1}\)
  • local outlier factor: \(LOF_k(p) = 1/k ~\cdot \sum_{o \in kNN(p)} \frac{lrd_k(o)}{lrd_k(p)}\)
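A small numpy sketch of LOF with reachability distances, following the three formulas above (brute force on toy data, not the optimised original algorithm):

```python
import numpy as np

def lof(X, k):
    """Local outlier factor with reachability distances. LOF ~ 1: normal, LOF >> 1: outlier."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    knn_idx = np.argsort(D, axis=1)[:, 1:k + 1]          # k nearest neighbours (self excluded)
    kdist = D[np.arange(n), knn_idx[:, -1]]              # distance to the k-th neighbour

    # reachability distance rd_k(p, o) = max(kdist(o), dist(p, o))
    rd = np.maximum(kdist[knn_idx], D[np.arange(n)[:, None], knn_idx])
    lrd = 1.0 / rd.mean(axis=1)                          # local reachability density

    # LOF_k(p) = average of lrd(o) / lrd(p) over the kNNs of p
    return (lrd[knn_idx] / lrd[:, None]).mean(axis=1)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(size=(100, 2)), [[6.0, 6.0]]])
scores = lof(X, k=5)
print(scores[-1], scores[:-1].mean())                    # outlier score vs. typical score
```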

Explain angle-based outlier detection. What is its main advantage?

  • idea: measure variance of angle-spectrum (= ABOD)
    • outliers: similar angles to all other points (since they lie far away); normal points: high variance of angles (central in the data)
    • >> small ABOD = outlier
  • adv: stable in high-dimensional space
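A simplified sketch of the ABOD idea: the score is the variance of the cosines of angles between difference vectors to all other point pairs (the original ABOD additionally weights by distances, which is omitted here):

```python
import numpy as np

def abod_scores(X):
    """Angle-based outlier score: variance of angle cosines. Small score = outlier."""
    n = len(X)
    scores = np.empty(n)
    for i in range(n):
        diffs = np.delete(X, i, axis=0) - X[i]           # vectors from x_i to all others
        diffs /= np.linalg.norm(diffs, axis=1, keepdims=True)
        cosines = diffs @ diffs.T                        # cosine of the angle for every pair
        iu = np.triu_indices(len(diffs), k=1)
        scores[i] = cosines[iu].var()
    return scores

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(size=(50, 3)), [[8.0, 8.0, 8.0]]])
print(abod_scores(X).argmin())       # index of the point with the smallest angle variance
```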

What are drivers of modern Data Science? (3+)

  • broadband connection
  • cheaper and faster storage / main memories
  • mobile devices
  • cloud computing
  • IoT and sensors

Which problems occur with increasing amounts of data? (4)

  • integration of data (various sources)
  • interpretation of data (no clear structure)
  • updating of data (stream vs. batch)
  • ensure data quality (e.g. missing values)

How to handle small, medium and big data volumes? Give an example for each.

  • small: manipulation in main memory (from file to main memory: csv, Excel)
  • medium: manipulation on disk (db-system: SQL-queries...)
  • big: manipulation of data using cloud frameworks (NoSQL, distributed file systems, manipulation via e.g. MapReduce)

What is the difference between OLAP and OLTP?

  • OLAP: read whole database (aggregated)
  • OLTP: read + write single transactions

How are business models created with Big Data?

  • use a service, "pay" for it by providing data >> people pay with their data

What are the V's of Big Data? (4+...) Briefly explain.

  • volume: # instances, # features
  • velocity: quick changes, new data
  • variety: various structures
  • veracity: data = samples != facts (>> interpretation of and uncertainty in data)
  • value
  • validity
  • vagueness

What is the main idea of Deep Learning architecture?

Name example applications of deep learning. (4)

  • integrate data transformation and model training >> joint-optimization
    • each output is input of the next step
    • minimizing nested loss-function with gradient descent
  • examples
    • image recognition
    • autonomous driving
    • search-engines
    • data management
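A toy sketch of the joint optimization idea: a learned transformation (W1) feeds a model (W2), and gradient descent on the nested loss updates both together (all names, data and hyper-parameters are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                      # toy inputs
y = (X[:, 0] - 2 * X[:, 1] > 0).astype(float)      # toy labels

W1 = rng.normal(scale=0.5, size=(4, 8))            # learned data transformation
W2 = rng.normal(scale=0.5, size=(8, 1))            # model on top of the transformation

for step in range(2000):
    H = np.tanh(X @ W1)                            # output of step 1 is input of step 2
    p = 1 / (1 + np.exp(-(H @ W2)))[:, 0]
    p = np.clip(p, 1e-8, 1 - 1e-8)
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))   # nested loss function

    # gradient descent: the gradient flows through the nested functions,
    # so transformation (W1) and model (W2) are optimised jointly
    grad_out = ((p - y) / len(y))[:, None]
    grad_W2 = H.T @ grad_out
    grad_H = grad_out @ W2.T
    grad_W1 = X.T @ (grad_H * (1 - H ** 2))
    W1 -= 0.1 * grad_W1
    W2 -= 0.1 * grad_W2

print(f"final loss: {loss:.3f}")
```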

What are Gabor filters?

  • linear filters to detect edges and similarly oriented structures in images
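An illustrative Gabor-kernel implementation (parameter names follow the usual sigma/theta/lambda convention; real pipelines would typically use a library implementation such as OpenCV's ready-made Gabor kernels):

```python
import numpy as np

def gabor_kernel(size, sigma, theta, lam, psi=0.0, gamma=1.0):
    """Gabor kernel: a Gaussian envelope times a sinusoid oriented at angle theta.
    Convolving an image with kernels of different theta responds to edges of that orientation."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t ** 2 + (gamma * y_t) ** 2) / (2 * sigma ** 2))
    return envelope * np.cos(2 * np.pi * x_t / lam + psi)

kernel = gabor_kernel(size=21, sigma=4.0, theta=np.pi / 4, lam=10.0)
print(kernel.shape)   # apply with e.g. scipy.signal.convolve2d(image, kernel, mode="same")
```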

Name current developments in deep learning. (4)

  • recurrent neural networks (i.e. Long short-term memory)
  • Autoencoders (compact compressed representation)
  • generative adversarial networks (data generation based on observations)
  • deep dreams (visualisation of intermediate results >> interpretability)

Explain the data pyramid.

  • raw data >> preprocessed data >> usable data >> labeled data (decreasing amount)

What is text processing? Name challenges. (4)

  • find patterns/similarities in large sets of documents (e.g. plagiarism detection)
  • challenges
    • stop words (meaningless)
    • high dimensionality, d > 10,000
    • sparse feature-space with low frequencies
    • identifying word-stems

How to measure the relevancy of a term?

  • term frequency
    • \(TF(t, d) = \frac{n(t, d)}{max_{w \in d} ~(n(w, d))}\)
    • = frequency of term t in document d, normalized by the most frequent term
  • Inverse document frequency 
    • \(IDF(t) = \frac{|DB|}{|\{d~|~d\in DB ~\land~t\in d\}|}\)
    • = inverse of the fraction of documents that contain t
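A minimal sketch of TF and IDF exactly as defined above, on three toy documents (note that many textbooks additionally take the logarithm of the IDF ratio):

```python
from collections import Counter

docs = [
    "data mining finds patterns in data".split(),
    "text mining works on documents".split(),
    "outlier detection finds unusual points".split(),
]

def tf(term, doc):
    """TF(t, d) = n(t, d) / max_w n(w, d), as defined above."""
    counts = Counter(doc)
    return counts[term] / max(counts.values())

def idf(term, docs):
    """IDF(t) = |DB| / |{d in DB : t in d}| (the definition above, without a log)."""
    df = sum(1 for d in docs if term in d)
    return len(docs) / df if df else 0.0

for t in ("data", "mining", "finds"):
    print(t, tf(t, docs[0]), idf(t, docs))
```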

How to measure distance in case of high sparsity? (1 + 2)

  • problem: sparsity: most frequencies are 0, so Euclidean distances become very similar and uninformative
  • Jaccard Coefficient: \(d_j(D_1, D_2) = \frac{|D_1 ~\cap ~D_2|}{|D_1~\cup~D_2|}\)
  • Cosine Coefficient: \(d_c(D_1, D_2) = \frac{\langle D_1, ~D_2\rangle }{||D_1||\cdot||D_2||}\)
    • useful for large d
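A small sketch of both coefficients on two toy documents (the vectors are simple term counts; all names are illustrative):

```python
import numpy as np

def jaccard(d1, d2):
    """Jaccard coefficient on the term sets of two documents."""
    s1, s2 = set(d1), set(d2)
    return len(s1 & s2) / len(s1 | s2)

def cosine(v1, v2):
    """Cosine coefficient of two (possibly very sparse) term-frequency vectors."""
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

d1 = "data mining finds patterns in data".split()
d2 = "text mining works on documents".split()
vocab = sorted(set(d1) | set(d2))
v1 = np.array([d1.count(w) for w in vocab], dtype=float)
v2 = np.array([d2.count(w) for w in vocab], dtype=float)
print(jaccard(d1, d2), cosine(v1, v2))
```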

What is the idea of shingling? (1) What is a k-shingle? What is the idea of hashing shingles?

  • construct the set of short strings occurring often in a document
  • k-shingle: substring of length k occurring at least n times in d
  • map each shingle (string of length k) to a hash value >> compact representation of the shingle
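A sketch of character-level k-shingles and their hashed representation (this version keeps every k-substring, i.e. it ignores the frequency threshold n mentioned above):

```python
def k_shingles(text, k):
    """All substrings of length k of a document (character-level shingles)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def hashed_shingles(text, k, buckets=2**32):
    """Represent each shingle by a hash value instead of the raw string."""
    return {hash(s) % buckets for s in k_shingles(text, k)}

doc = "the quick brown fox jumps over the lazy dog"
print(sorted(k_shingles(doc, 5))[:3])
print(len(hashed_shingles(doc, 5)))
```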

How to summarise sets of shingles in a similarity-preserving manner? (characteristic matrix, minhashing, minhash signature)

 

  • idea: replace set of shingles by smaller representations
  • using characteristic matrix
    • columns = documents
    • rows = shingles
  • calculate minhash signature
    • permute rows
    • compute h_i(S) for all S (index of the first row with a 1)
    • for each row r, for each column c
      • if [r, c] = 0 do nothing
      • if [r, c] = 1 
        • for each i = 1...n (hashfunctions) calculate SIG(i, c) = min(SIG(i, c), h_i(r))
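A direct sketch of the signature computation above, using explicit random row permutations as the hash functions h_i (the toy characteristic matrix is made up):

```python
import numpy as np

def minhash_signatures(char_matrix, n_hashes, seed=0):
    """Minhash signatures from a characteristic matrix (rows = shingles, columns =
    documents); SIG[i, c] is the index of the first row, under permutation i, in
    which column c has a 1."""
    n_rows, n_docs = char_matrix.shape
    rng = np.random.default_rng(seed)
    sig = np.full((n_hashes, n_docs), np.inf)
    for i in range(n_hashes):
        perm = rng.permutation(n_rows)        # h_i: row r -> perm[r]
        for r in range(n_rows):
            for c in range(n_docs):
                if char_matrix[r, c] == 1:    # if [r, c] = 0 do nothing
                    sig[i, c] = min(sig[i, c], perm[r])
    return sig.astype(int)

# toy characteristic matrix: 5 shingles x 3 documents
C = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 0, 0],
              [0, 1, 1],
              [1, 1, 0]])
print(minhash_signatures(C, n_hashes=4))
```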

Why is dimensionality reduction useful? (5)

  • discover correlations
  • remove redundant features
  • visualisation/interpretation
  • sparse matrix to low dimensional dense matrix
  • storage reduction
  • >> low-dimensional representation of data

Explain variance analysis. What is the general idea? (2) Explain Steps of PCA. (4)

  • idea
    • search for large |x_i - mu| = large variance
    • choose such features k with high variance
  1. calculate the covariance matrix sigma
  2. calculate eigenvalues, eigenvectors of sigma: sigma = V * lambda * t(V)
    1. V columnwise orthogonal, normalized s.t. V * t(V) = I
  3. select k largest eigenvalues (lambda') and corresponding eigenvectors (V')
  4. transform data: D' = D * V' (projection onto the selected eigenvectors)
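A numpy sketch of the four PCA steps on toy data (centering is added explicitly, and the explained-variance ratio from the next card is computed along the way):

```python
import numpy as np

def pca(D, k):
    """PCA following the steps above: covariance, eigendecomposition, selection of
    the k largest eigenvalues, projection of the centred data."""
    Dc = D - D.mean(axis=0)                       # centre the data
    sigma = np.cov(Dc, rowvar=False)              # 1. covariance matrix
    lam, V = np.linalg.eigh(sigma)                # 2. eigenvalues/-vectors (ascending)
    order = np.argsort(lam)[::-1][:k]             # 3. k largest eigenvalues
    lam_k, V_k = lam[order], V[:, order]
    explained = lam_k.sum() / lam.sum()           # fraction of variance kept (next card)
    return Dc @ V_k, explained                    # 4. project / rotate the data

rng = np.random.default_rng(4)
D = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))   # correlated toy data
D_red, explained = pca(D, k=2)
print(D_red.shape, round(explained, 3))
```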

How to choose k in case of PCA?

  • according to percentage of variance explained by k first principal components
    • sum(lambda_i, i = 1..k) / sum(all lambda_j), e.g. >= 85%

What is the complexity of calculating PCA? What is a solution? Describe.

  • O(n^3) for calculating characteristic polynomial >> expensive
  • solution: power iteration method
    • O(n^2), can be stopped after the first k eigenvalues are found
    • idea: multiplying the matrix with itself increases the strongest directions relative to the others
    • sigma^k = V * lambda^k * t(V)
      • large lambda become larger, small ones even smaller
      • normalizing lambda_i by sum(lambda) >> only one component remains (converges to 1) = strongest eigenvalue
    • for the k largest eigenvalues: repeatedly project onto the space orthogonal to the previously found eigenvectors
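A sketch of power iteration with deflation: after the strongest eigenpair is found, its direction is removed and the iteration is repeated (the toy matrix and iteration count are illustrative):

```python
import numpy as np

def power_iteration(sigma, n_iter=100):
    """Largest eigenvalue/eigenvector of a symmetric matrix by power iteration.
    Repeated multiplication amplifies the strongest direction."""
    v = np.random.default_rng(0).normal(size=sigma.shape[0])
    for _ in range(n_iter):
        v = sigma @ v
        v /= np.linalg.norm(v)                   # normalise each step
    return v @ sigma @ v, v                      # Rayleigh quotient = eigenvalue

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
lam1, v1 = power_iteration(A)
print(round(lam1, 4))

# second eigenpair: remove (deflate) the direction already found, then iterate again
A_defl = A - lam1 * np.outer(v1, v1)
lam2, v2 = power_iteration(A_defl)
print(round(lam2, 4))
```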

Explain the idea of singular value decomposition? How to find U, V and Sigma? What is the interpretation of SVD: U, V, Sigma?

  • X = U * Sigma * t(V)
    • X: input, n x d
    • U: left singular vectors (columnwise orthogonal), n x k
    • Sigma: singular values, k x k
    • t(V): right singular vectors (columnwise orthogonal), k x d 
  • \(X^TX = V \Sigma^2V^T \Leftrightarrow(X^TX)~V = V\Sigma^2\)
    • compute eigenvalue and eigenvectors of \(X^TX\) to get V (eigenvectors)
    • square roots of the eigenvalues give the singular values Sigma of X
  • \(XX^T = U ~\Sigma^2~U^T\) to find U
  • meaning:
    • U: connects entities to concepts
    • Sigma: strength of each concept (k concepts)
    • V: relates objects to concepts

How SVD can be used to find a low-dimensional approximation? (2) How is it called?

  • \(X_k = U_k ~\Sigma_k~V_k^T \)
  • only use first k concepts
  • >> truncated SVD
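A short sketch of truncated SVD on a toy low-rank matrix, using numpy's SVD:

```python
import numpy as np

def truncated_svd(X, k):
    """Rank-k approximation X_k = U_k Sigma_k V_k^T keeping the k strongest concepts."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(5)
X = rng.normal(size=(8, 3)) @ rng.normal(size=(3, 6))    # toy matrix of true rank 3
X2 = truncated_svd(X, k=2)
print(np.linalg.norm(X - X2))            # approximation error in the Frobenius norm
```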

What are problems of PCA, SVD? (2) What are alternatives?

  • maybe better low-rank approximations exist regarding
    • interpretability
    • structure for application
  • >> CX, CUR matrix decompositions

What is CX?

  • low-rank approximation built from only a small number of actual columns of X

What is a CUR decomposition in general? What is the general idea?

  • X = C * U * R
    • with C = columns, R = rows of X (only few, not all)
  • a sparse X yields a sparse decomposition: sparse C and R (unlike PCA/SVD, whose factors are dense)

Explain the single steps of CUR. (2+4)

  1. step
    1. choose number of concepts to be found
    2. biased sampling of r rows and r columns of X (n x m)
      1. C = n x r matrix
      2. R = r x m matrix
      3. sampling probability according to the squared Frobenius norm \(\|X\|_F^2 = \sum_{i, j}x_{i, j}^2\)
        1. p = sum of squares of a row or column / squared Frobenius norm
  2. construct U from C and R
    1. W = intersection of chosen C and R
    2. apply SVD: W = X Sigma t(Y)
    3. compute Sigma+, pseudo-inverse of Sigma (numerical inverse of diagonal entries)
    4. compute U = Y (Sigma+)^2 t(X)
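A mechanical sketch of the listed steps (biased sampling, intersection W, SVD of W, U = Y (Sigma+)^2 t(X)); the full randomised CUR additionally rescales the sampled rows/columns, which is omitted here, so this only illustrates the mechanics, not the approximation guarantees:

```python
import numpy as np

def cur_sketch(X, r, seed=0):
    """CUR sketch following the steps above (no row/column rescaling)."""
    rng = np.random.default_rng(seed)
    frob = (X ** 2).sum()                                   # squared Frobenius norm
    p_col = (X ** 2).sum(axis=0) / frob                     # column sampling probabilities
    p_row = (X ** 2).sum(axis=1) / frob                     # row sampling probabilities
    cols = rng.choice(X.shape[1], size=r, replace=False, p=p_col)
    rows = rng.choice(X.shape[0], size=r, replace=False, p=p_row)
    C, R = X[:, cols], X[rows, :]                           # n x r and r x m
    W = X[np.ix_(rows, cols)]                               # intersection of chosen rows/cols
    Xw, s, Ytw = np.linalg.svd(W)                           # W = Xw diag(s) Ytw
    s_pinv = np.where(s > 1e-10, 1.0 / s, 0.0)              # pseudo-inverse of Sigma
    U = Ytw.T @ np.diag(s_pinv ** 2) @ Xw.T                 # U = Y (Sigma+)^2 t(X)
    return C, U, R

A = np.random.default_rng(7).normal(size=(20, 3)) @ np.random.default_rng(8).normal(size=(3, 15))
C, U, R = cur_sketch(A, r=4)
print(C.shape, U.shape, R.shape)                            # (20, 4) (4, 4) (4, 15)
```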

Name examples of graphs. (5) What is analysed?

  • internet
  • social media
  • networks
  • publications
  • maps
  • >> groups, connections etc.

What are the challenges in Web Search? (2) What was the first approach? What is the recent approach? (4) What are flow-equations?

  • challenges
    • trustworthiness
    • correctness of results
  • first: rank defined by word occurrences (easily manipulated by SEO)
  • then: ranking based on link-structure (importance matters)
    • link's vote is proportional to importance of source
    • page j with importance r_j and n outlinks: each link gets r_j/n votes
    • importance \(r_j = \sum_{i \rightarrow j}\frac{r_i}{d_i}\)
    • >> recursive definition: flow equations
      • no unique solution (n equations, n variables, no constant term)
      • add constraint for uniqueness: sum(r_i) = 1
      • solvable for small examples

Explain the matrix formulation of page-rank (3). What results from this formulation? (2)

  • M: stochastic adjacency matrix
    • if i -> j: M_ji = 1/d_i, else 0
    • col-sum = 1
  • rank-vector r
    • entry per page: r_i = importance-score
    • sum(r_i) = 1
  • >> r = M * r
    • r is the principal eigenvector of M (eigenvalue 1, since M is column-stochastic)
    • solve for r by power-iteration
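A small sketch of the matrix formulation and power iteration on a toy three-page graph (the edge list is made up for illustration):

```python
import numpy as np

# toy web graph: directed edges i -> j
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
n = 3

# column-stochastic adjacency matrix M: M[j, i] = 1/d_i if i -> j, else 0
M = np.zeros((n, n))
out_deg = np.bincount([i for i, _ in edges], minlength=n)
for i, j in edges:
    M[j, i] = 1.0 / out_deg[i]

# power iteration: r = M r, starting from the uniform vector with sum(r) = 1
r = np.full(n, 1.0 / n)
for _ in range(100):
    r = M @ r
print(r.round(3), r.sum())
```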