KDDM

KDDM@LMU



Set of flashcards Details

Flashcards 203
Language German
Category Computer Science
Level University
Created / Updated 01.08.2019 / 03.08.2019
Weblink
https://card2brain.ch/box/20190801_kddm

Explain pair-counting-based ensemble clustering. What is the problem, how to tackle this? (2: 3, 0)

  • choose the consensus clustering C* s.t. \(\sum_{C \in \mathcal{C}}\phi(C, C^*)\) is maximal
  • problem: this optimization is intractable for most choices of phi
  • solutions
    • use co-association matrix (similarity matrix) S (= adjacency matrix)
      • \(S_{ij}^\mathcal{C} = \sum_{C \in \mathcal{C}}1_{\{x_i \in C~\land~x_j \in C \}}\)
      • apply clustering by graph partitioning (spectral clustering)
      • >> solution comparable to using phi = RI (Rand index)
    • graph-partitioning / cumulative voting
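A minimal sketch of the co-association approach, assuming the base clusterings are given as label arrays; the graph-partitioning step is done here with scikit-learn's SpectralClustering on the co-association matrix as a precomputed affinity (an illustrative choice, not necessarily the lecture's exact procedure):

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def co_association(labelings):
    """S_ij = number of base clusterings that put x_i and x_j in the same cluster."""
    n = len(labelings[0])
    S = np.zeros((n, n))
    for labels in labelings:
        labels = np.asarray(labels)
        S += (labels[:, None] == labels[None, :]).astype(float)
    return S

# three hypothetical base clusterings of 6 points
ensemble = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 2, 2],
    [1, 1, 1, 0, 0, 0],
]
S = co_association(ensemble)

# graph partitioning on the similarity (adjacency) matrix S
consensus = SpectralClustering(n_clusters=2, affinity="precomputed").fit_predict(S)
print(consensus)
```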

Explain the information theoretic approach of ensemble clustering.

  • phi = (N)MI; maximization is an NP-hard problem >> approximate
  • solution: 
    • \(C_{opt} = argmax_{C^*}\left(\sum_{C \in \mathcal{C}}I^s(C, C^*)\right)\) with generalized mutual information \(I^s\) based on the generalized entropy of degree s, \(H^s(X) = (2^{1-s}-1)^{-1}\left(\sum_{x_i \in X}p_i^s - 1\right)\)
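A small sketch of the degree-s generalized entropy used above (function and variable names are illustrative):

```python
import numpy as np

def generalized_entropy(p, s):
    """H^s(X) = (2^(1-s) - 1)^(-1) * (sum_i p_i^s - 1), generalized entropy of degree s."""
    p = np.asarray(p, dtype=float)
    return (2 ** (1 - s) - 1) ** -1 * (np.sum(p ** s) - 1)

print(generalized_entropy([0.5, 0.5], s=2))  # quadratic entropy of a fair coin
```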

What is the general definition of an outlier?

What are general differences in approaches of outlier-detection? (3)

  • outlier = strong deviation from other observations (i.e. likely generated by a different distribution/mechanism)

 

  • global vs. local outliers
  • labeling vs. scoring of outliers
  • assumptions about outlierness

Name five approaches for outlier detection.

  • clustering based
  • statistical tests
  • distance based methods
  • density based
  • angle based

Name applications of outlier detection. (5)

  • medicine: unusual symptoms
  • sports: talent scouting (abnormal parameters/skills)
  • fraud detection: e.g. stolen credit card (changing behavior)
  • public health: abnormal occurrence of diseases
  • error detection: e.g. errors in sensor data (to be removed)

Explain clustering based outlier detection. (3)

  • outlier candidates = clusters with few instances
  • calculate distance between outlier-candidates and non-candidates: if large >> outliers
  • or: if removing candidate leads to improved objective function >> outlier

Explain outlier detection based on statistical tests. Name parameters (3), give examples (2). What are problems? (4)

  • estimate distribution and parameter given all samples
  • outlier = low probability of being generated by that distribution (e.g. outside mu ± 3*sigma)
  • parameters
    • distribution type
    • parametric / non parametric
    • # distributions
  • examples
    • multivariate normal
    • Mahalanobis distance: \(MDist(x, \mu) = \sqrt{(x-\mu)^T\Sigma^{-1}(x-\mu)}\) with \(MDist^2 \sim \chi^2_d\)
  • problems
    • curse of dimensionality: similar MDs for large d
    • robustness: parameters are sensitive to outliers (mean/covariance are estimated from data that already contains the outliers to be detected)
    • low flexibility as distribution fixed
    • global method 
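A hedged sketch of this statistical approach using the Mahalanobis distance; the chi-square threshold and the toy data are assumptions for illustration:

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(X, alpha=0.975):
    """Flag points whose squared Mahalanobis distance exceeds the chi^2_d quantile.
    Note: mu and Sigma are estimated from all samples, including the outliers we
    want to detect -- the robustness problem mentioned above."""
    mu = X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mu
    md2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)   # squared MDist per point
    return md2 > chi2.ppf(alpha, df=X.shape[1])

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[0] = [8.0, 8.0]                                         # one injected outlier
print(np.where(mahalanobis_outliers(X))[0])
```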


Explain distance-based outlier detection and its approaches (1+2)

  • normal = dense neighbourhood, outlier = sparse neighbourhood >> distance as indicator
  • approaches
    • scoring based on kNN-distances
      • outlier-score = kNN-distance; outlier if kNN-dist > tau
      • problem: sensitive to k (e.g. if a cluster contains fewer than k points), global method due to the global threshold tau
    • D(epsilon, pi)-outliers
      • outlier set: \(\{\, p~|~\frac{|\{q\in D ~|~ dist(q, p) < \epsilon\}|}{|D|} \le \pi \,\}\)
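A compact sketch of both distance-based variants on toy data (brute-force pairwise distances; the function names are made up for illustration):

```python
import numpy as np

def knn_dist_scores(X, k):
    """Outlier score = distance to the k-th nearest neighbour (larger = more outlying)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    D.sort(axis=1)
    return D[:, k]                    # column 0 is the distance of a point to itself

def d_eps_pi_outliers(X, eps, pi):
    """D(eps, pi)-outliers: points whose eps-neighbourhood contains at most a
    fraction pi of the whole data set."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    frac = (D < eps).mean(axis=1)
    return frac <= pi

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(100, 2)), [[10.0, 10.0]]])       # cluster + one outlier
print(knn_dist_scores(X, k=3).argmax())                          # most outlying point
print(np.where(d_eps_pi_outliers(X, eps=2.0, pi=0.05))[0])
```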

Explain the basic idea of density-based outlier-detection. (1) What is the problem here? (1)

  • compare the density of a point with the densities of its neighbors >> relative density = outlier-score
    • similar density >> normal point
    • different density >> outlier
  • problem: global definition of density (epsilon), density may differ between regions

Explain outlier detection based on local density. How is outlier defined in this case? State different values of LOF.

What is the meaning of k in this case?

  • local density: \(ld_k(p) = (1/k ~\cdot~\sum_{o \in kNN(p)}dist(o, p) )^{-1}\) = inverse of average distance of kNNs of p
  • local outlier factor: \(LOF_k(p) = 1/k ~\cdot~\sum_{o\in kNN(p)} \frac{ld_k(o)}{ld_k(p)}\) = average ratio of ld(o) of the kNNs of p over ld(p)
    • if LOF_k(p) > tau >> outlier
    • LOF = 1 = normal
    • LOF >> 1 = outlier
  • k defines reference set (kNNs)

Explain outlier detection based on reachability-distance.

  • reachability-distance: \(rd_k(p, o) = max(kdist(o), dist(p,o))\)
  • local reachability-distance: \(lrd_k(p) = (1/k ~\cdot \sum_{o \in kNN(p)}rd_k(p, o))^{-1}\)
  • local outlier factor: \(LOF_k(p) = 1/k ~\cdot \sum_{o \in kNN(p)} \frac{lrd_k(o)}{lrd_k(p)}\)
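A small numpy sketch of LOF with reachability distances, following the three formulas above (brute force on toy data, not the optimised original algorithm):

```python
import numpy as np

def lof(X, k):
    """Local outlier factor with reachability distances. LOF ~ 1: normal, LOF >> 1: outlier."""
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    knn_idx = np.argsort(D, axis=1)[:, 1:k + 1]          # k nearest neighbours (self excluded)
    kdist = D[np.arange(n), knn_idx[:, -1]]              # distance to the k-th neighbour

    # reachability distance rd_k(p, o) = max(kdist(o), dist(p, o))
    rd = np.maximum(kdist[knn_idx], D[np.arange(n)[:, None], knn_idx])
    lrd = 1.0 / rd.mean(axis=1)                          # local reachability density

    # LOF_k(p) = average of lrd(o) / lrd(p) over the kNNs of p
    return (lrd[knn_idx] / lrd[:, None]).mean(axis=1)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(size=(100, 2)), [[6.0, 6.0]]])
scores = lof(X, k=5)
print(scores[-1], scores[:-1].mean())                    # outlier score vs. typical score
```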

Explain angle-based outlier detection. What is its main advantage?

  • idea: measure variance of angle-spectrum (= ABOD)
    • outliers: similar angles to all other points (since they lie far away); normal points: high variance of angles (central in the data)
    • >> small ABOD = outlier
  • adv: stable in high-dimensional space
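A simplified sketch of the ABOD idea: the score is the variance of the cosines of angles between difference vectors to all other point pairs (the original ABOD additionally weights by distances, which is omitted here):

```python
import numpy as np

def abod_scores(X):
    """Angle-based outlier score: variance of angle cosines. Small score = outlier."""
    n = len(X)
    scores = np.empty(n)
    for i in range(n):
        diffs = np.delete(X, i, axis=0) - X[i]           # vectors from x_i to all others
        diffs /= np.linalg.norm(diffs, axis=1, keepdims=True)
        cosines = diffs @ diffs.T                        # cosine of the angle for every pair
        iu = np.triu_indices(len(diffs), k=1)
        scores[i] = cosines[iu].var()
    return scores

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(size=(50, 3)), [[8.0, 8.0, 8.0]]])
print(abod_scores(X).argmin())       # index of the point with the smallest angle variance
```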

What are drivers of modern Data Science? (3+)

  • broadband connection
  • cheaper and faster storage / main memories
  • mobile devices
  • cloud computing
  • IoT and sensors

Which problems occur with increasing amounts of data? (4)

  • integration of data (various sources)
  • interpretation of data (no clear structure)
  • updating of data (stream vs. batch)
  • ensure data quality (e.g. missing values)

How to handle small, medium and big data volumes? Give an example for each.

  • small: manipulation in main memory (from file to main memory: csv, Excel)
  • medium: manipulation on disk (db-system: SQL-queries...)
  • big: manipulation of data using cloud frameworks (NoSQL, distributed file systems, manipulation via e.g. MapReduce)

What is the difference between OLAP and OLTP?

  • OLAP: read whole database (aggregated)
  • OLTP: read + write single transactions

How are business models created with Big Data?

  • use a service, "pay" for it by providing data >> people pay with their data

What are the V's of Big Data? (4+...) Briefly explain.

  • volume: # instances, # features
  • velocity: quick changes, new data
  • variety: various structures
  • veracity: data = samples != facts (>> interpretation of and uncertainty in data)
  • value
  • validity
  • vagueness

What is the main idea of Deep Learning architecture?

Name example applications of deep learning. (4)

  • integrate data transformation and model training >> joint-optimization
    • each output is input of the next step
    • minimizing nested loss-function with gradient descent
  • examples
    • image recognition
    • autonomous driving
    • search-engines
    • data management
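A toy sketch of the joint optimization idea: a learned transformation (W1) feeds a model (W2), and gradient descent on the nested loss updates both together (all names, data and hyper-parameters are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                      # toy inputs
y = (X[:, 0] - 2 * X[:, 1] > 0).astype(float)      # toy labels

W1 = rng.normal(scale=0.5, size=(4, 8))            # learned data transformation
W2 = rng.normal(scale=0.5, size=(8, 1))            # model on top of the transformation

for step in range(2000):
    H = np.tanh(X @ W1)                            # output of step 1 is input of step 2
    p = 1 / (1 + np.exp(-(H @ W2)))[:, 0]
    p = np.clip(p, 1e-8, 1 - 1e-8)
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))   # nested loss function

    # gradient descent: the gradient flows through the nested functions,
    # so transformation (W1) and model (W2) are optimised jointly
    grad_out = ((p - y) / len(y))[:, None]
    grad_W2 = H.T @ grad_out
    grad_H = grad_out @ W2.T
    grad_W1 = X.T @ (grad_H * (1 - H ** 2))
    W1 -= 0.1 * grad_W1
    W2 -= 0.1 * grad_W2

print(f"final loss: {loss:.3f}")
```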

What are Gabor filters?

  • linear filters to detect edges and similarly oriented structures in images
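An illustrative Gabor-kernel implementation (parameter names follow the usual sigma/theta/lambda convention; real pipelines would typically use a library implementation such as OpenCV's ready-made Gabor kernels):

```python
import numpy as np

def gabor_kernel(size, sigma, theta, lam, psi=0.0, gamma=1.0):
    """Gabor kernel: a Gaussian envelope times a sinusoid oriented at angle theta.
    Convolving an image with kernels of different theta responds to edges of that orientation."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_t = x * np.cos(theta) + y * np.sin(theta)
    y_t = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_t ** 2 + (gamma * y_t) ** 2) / (2 * sigma ** 2))
    return envelope * np.cos(2 * np.pi * x_t / lam + psi)

kernel = gabor_kernel(size=21, sigma=4.0, theta=np.pi / 4, lam=10.0)
print(kernel.shape)   # apply with e.g. scipy.signal.convolve2d(image, kernel, mode="same")
```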

Name current developments in deep learning. (4)

  • recurrent neural networks (i.e. Long short-term memory)
  • Autoencoders (compact compressed representation)
  • generative adversarial networks (data generation based on observations)
  • deep dreams (visualisation of intermediate results >> interpretability)

Explain the data pyramid.

  • raw data >> preprocessed data >> usable data >> labeled data (decreasing amount)

What is text processing? Name challenges. (4)

  • find patterns/similarities in large sets of documents (e.g. plagiarism detection)
  • challenges
    • stop words (meaningless)
    • high dimensionality, d > 10,000
    • sparse feature-space with low frequencies
    • identifying word-stems

How to measure the relevancy of a term?

  • term frequency
    • \(TF(t, d) = \frac{n(t, d)}{max_{w \in d} ~(n(w, d))}\)
    • = frequency of term t in document d, normalized by the most frequent term
  • Inverse document frequency 
    • \(IDF(t) = \frac{|DB|}{|\{d~|~d\in DB ~\land~t\in d\}|}\)
    • = inverse of the fraction of documents that contain t
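A minimal sketch of TF and IDF exactly as defined above, on three toy documents (note that many textbooks additionally take the logarithm of the IDF ratio):

```python
from collections import Counter

docs = [
    "data mining finds patterns in data".split(),
    "text mining works on documents".split(),
    "outlier detection finds unusual points".split(),
]

def tf(term, doc):
    """TF(t, d) = n(t, d) / max_w n(w, d), as defined above."""
    counts = Counter(doc)
    return counts[term] / max(counts.values())

def idf(term, docs):
    """IDF(t) = |DB| / |{d in DB : t in d}| (the definition above, without a log)."""
    df = sum(1 for d in docs if term in d)
    return len(docs) / df if df else 0.0

for t in ("data", "mining", "finds"):
    print(t, tf(t, docs[0]), idf(t, docs))
```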

How to measure distance in case of high sparsity? (1 + 2)

  • problem: sparsity: most frequencies are 0, so Euclidean distances become very similar and uninformative
  • Jaccard Coefficient: \(d_j(D_1, D_2) = \frac{|D_1 ~\cap ~D_2|}{|D_1~\cup~D_2|}\)
  • Cosine Coefficient: \(d_c(D_1, D_2) = \frac{\langle D_1, ~D_2\rangle }{||D_1||\cdot||D_2||}\)
    • useful for large d
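A small sketch of both coefficients on two toy documents (the vectors are simple term counts; all names are illustrative):

```python
import numpy as np

def jaccard(d1, d2):
    """Jaccard coefficient on the term sets of two documents."""
    s1, s2 = set(d1), set(d2)
    return len(s1 & s2) / len(s1 | s2)

def cosine(v1, v2):
    """Cosine coefficient of two (possibly very sparse) term-frequency vectors."""
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

d1 = "data mining finds patterns in data".split()
d2 = "text mining works on documents".split()
vocab = sorted(set(d1) | set(d2))
v1 = np.array([d1.count(w) for w in vocab], dtype=float)
v2 = np.array([d2.count(w) for w in vocab], dtype=float)
print(jaccard(d1, d2), cosine(v1, v2))
```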

What is the idea of shingling? (1) What is a k-shingle? What is the idea of hashing shingles?

  • construct the set of short strings occurring often in a document
  • k-shingle: substring of length k occurring at least n times in d
  • map each shingle (string of length k) to a hash value >> compact representation of the shingle
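A sketch of character-level k-shingles and their hashed representation (this version keeps every k-substring, i.e. it ignores the frequency threshold n mentioned above):

```python
def k_shingles(text, k):
    """All substrings of length k of a document (character-level shingles)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def hashed_shingles(text, k, buckets=2**32):
    """Represent each shingle by a hash value instead of the raw string."""
    return {hash(s) % buckets for s in k_shingles(text, k)}

doc = "the quick brown fox jumps over the lazy dog"
print(sorted(k_shingles(doc, 5))[:3])
print(len(hashed_shingles(doc, 5)))
```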

How to summarise sets of shingles in a similarity-preserving manner? (characteristic matrix, minhashing, minhash signature)

 

  • idea: replace set of shingles by smaller representations
  • using characteristic matrix
    • columns = documents
    • rows = shingles
  • calculate minhash signature
    • permute rows
    • compute h_i(S) for all S (index of the first row with a 1)
    • for each row r, for each column c
      • if [r, c] = 0 do nothing
      • if [r, c] = 1 
        • for each i = 1...n (hashfunctions) calculate SIG(i, c) = min(SIG(i, c), h_i(r))
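A direct sketch of the signature computation above, using explicit random row permutations as the hash functions h_i (the toy characteristic matrix is made up):

```python
import numpy as np

def minhash_signatures(char_matrix, n_hashes, seed=0):
    """Minhash signatures from a characteristic matrix (rows = shingles, columns =
    documents); SIG[i, c] is the index of the first row, under permutation i, in
    which column c has a 1."""
    n_rows, n_docs = char_matrix.shape
    rng = np.random.default_rng(seed)
    sig = np.full((n_hashes, n_docs), np.inf)
    for i in range(n_hashes):
        perm = rng.permutation(n_rows)        # h_i: row r -> perm[r]
        for r in range(n_rows):
            for c in range(n_docs):
                if char_matrix[r, c] == 1:    # if [r, c] = 0 do nothing
                    sig[i, c] = min(sig[i, c], perm[r])
    return sig.astype(int)

# toy characteristic matrix: 5 shingles x 3 documents
C = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 0, 0],
              [0, 1, 1],
              [1, 1, 0]])
print(minhash_signatures(C, n_hashes=4))
```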

Why is dimensionality reduction useful? (5)

  • discover correlations
  • remove redundant features
  • visualisation/interpretation
  • sparse matrix to low dimensional dense matrix
  • storage reduction
  • >> low-dimensional representation of data

Explain variance analysis. What is the general idea? (2) Explain Steps of PCA. (4)

  • idea
    • search for large |x_i - mu| = large variance
    • choose such features k with high variance
  1. calculate the covariance matrix sigma
  2. calculate eigenvalues, eigenvectors of sigma: sigma = V * lambda * t(V)
    1. V columnwise orthogonal, normalized s.t. V * t(V) = I
  3. select k largest eigenvalues (lambda') and corresponding eigenvectors (V')
  4. transform data: D' = D * V' (projection onto the selected eigenvectors)
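A numpy sketch of the four PCA steps on toy data (centering is added explicitly, and the explained-variance ratio from the next card is computed along the way):

```python
import numpy as np

def pca(D, k):
    """PCA following the steps above: covariance, eigendecomposition, selection of
    the k largest eigenvalues, projection of the centred data."""
    Dc = D - D.mean(axis=0)                       # centre the data
    sigma = np.cov(Dc, rowvar=False)              # 1. covariance matrix
    lam, V = np.linalg.eigh(sigma)                # 2. eigenvalues/-vectors (ascending)
    order = np.argsort(lam)[::-1][:k]             # 3. k largest eigenvalues
    lam_k, V_k = lam[order], V[:, order]
    explained = lam_k.sum() / lam.sum()           # fraction of variance kept (next card)
    return Dc @ V_k, explained                    # 4. project / rotate the data

rng = np.random.default_rng(4)
D = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))   # correlated toy data
D_red, explained = pca(D, k=2)
print(D_red.shape, round(explained, 3))
```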

How to choose k in case of PCA?

  • according to percentage of variance explained by k first principal components
    • sum(lambda_i, i = 1..k) / sum(all lambda_j), e.g. >= 85%

What is the complexity of calculating PCA? What is a solution? Describe.

  • O(n^3) for calculating characteristic polynomial >> expensive
  • solution: power iteration method
    • O(n^2), can be stopped after the first k eigenvalues are found
    • idea: multiplying the matrix with itself increases the strongest directions relative to the others
    • sigma^k = V * lambda^k * t(V)
      • large lambda become larger, small ones even smaller
      • normalizing lambda_i by sum(lambda) >> only one component remains (converges to 1) = strongest eigenvalue
    • for the k largest eigenvalues: repeatedly project onto the space orthogonal to the previously found eigenvectors
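A sketch of power iteration with deflation: after the strongest eigenpair is found, its direction is removed and the iteration is repeated (the toy matrix and iteration count are illustrative):

```python
import numpy as np

def power_iteration(sigma, n_iter=100):
    """Largest eigenvalue/eigenvector of a symmetric matrix by power iteration.
    Repeated multiplication amplifies the strongest direction."""
    v = np.random.default_rng(0).normal(size=sigma.shape[0])
    for _ in range(n_iter):
        v = sigma @ v
        v /= np.linalg.norm(v)                   # normalise each step
    return v @ sigma @ v, v                      # Rayleigh quotient = eigenvalue

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
lam1, v1 = power_iteration(A)
print(round(lam1, 4))

# second eigenpair: remove (deflate) the direction already found, then iterate again
A_defl = A - lam1 * np.outer(v1, v1)
lam2, v2 = power_iteration(A_defl)
print(round(lam2, 4))
```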

Explain the idea of singular value decomposition? How to find U, V and Sigma? What is the interpretation of SVD: U, V, Sigma?

  • X = U * Sigma * t(V)
    • X: input, n x d
    • U: left singular vectors (columnwise orthogonal), n x k
    • Sigma: singular values, k x k
    • t(V): right singular vectors (columnwise orthogonal), k x d 
  • \(X^TX = V \Sigma^2V^T \Leftrightarrow(X^TX)~V = V\Sigma^2\)
    • compute eigenvalue and eigenvectors of \(X^TX\) to get V (eigenvectors)
    • square roots of the eigenvalues give the singular values Sigma of X
  • \(XX^T = U ~\Sigma^2~U^T\) to find U
  • meaning:
    • U: connects entities to concepts
    • Sigma: strength of each concept (k concepts)
    • V: relates objects to concepts

How SVD can be used to find a low-dimensional approximation? (2) How is it called?

  • \(X_k = U_k ~\Sigma_k~V_k^T \)
  • only use first k concepts
  • >> truncated SVD
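A short sketch of truncated SVD on a toy low-rank matrix, using numpy's SVD:

```python
import numpy as np

def truncated_svd(X, k):
    """Rank-k approximation X_k = U_k Sigma_k V_k^T keeping the k strongest concepts."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(5)
X = rng.normal(size=(8, 3)) @ rng.normal(size=(3, 6))    # toy matrix of true rank 3
X2 = truncated_svd(X, k=2)
print(np.linalg.norm(X - X2))            # approximation error in the Frobenius norm
```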

What are problems of PCA, SVD? (2) What are alternatives?

  • maybe better low-rank approximations exist regarding
    • interpretability
    • structure for application
  • >> CX, CUR matrix decompositions

What is CX?

  • low-rank approximation built from only a small number of actual columns of X

What is a CUR decomposition in general? What is the general idea?

  • X = C * U * R
    • with C = columns, R = rows of X (only few, not all)
  • a sparse X yields a sparse decomposition: sparse C and R (unlike PCA/SVD, whose factors are dense)

Explain the single steps of CUR. (2+4)

  1. step
    1. choose number of concepts to be found
    2. biased sampling of r rows and r columns of X (n x m)
      1. C = n x r matrix
      2. R = r x m matrix
      3. sampling probability according to the squared Frobenius norm \(\|X\|_F^2 = \sum_{i, j}x_{i, j}^2\)
        1. p = sum of squares of a row or column / squared Frobenius norm
  2. construct U from C and R
    1. W = intersection of chosen C and R
    2. apply SVD: W = X Sigma t(Y)
    3. compute Sigma+, pseudo-inverse of Sigma (numerical inverse of diagonal entries)
    4. compute U = Y (Sigma+)^2 t(X)
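A mechanical sketch of the listed steps (biased sampling, intersection W, SVD of W, U = Y (Sigma+)^2 t(X)); the full randomised CUR additionally rescales the sampled rows/columns, which is omitted here, so this only illustrates the mechanics, not the approximation guarantees:

```python
import numpy as np

def cur_sketch(X, r, seed=0):
    """CUR sketch following the steps above (no row/column rescaling)."""
    rng = np.random.default_rng(seed)
    frob = (X ** 2).sum()                                   # squared Frobenius norm
    p_col = (X ** 2).sum(axis=0) / frob                     # column sampling probabilities
    p_row = (X ** 2).sum(axis=1) / frob                     # row sampling probabilities
    cols = rng.choice(X.shape[1], size=r, replace=False, p=p_col)
    rows = rng.choice(X.shape[0], size=r, replace=False, p=p_row)
    C, R = X[:, cols], X[rows, :]                           # n x r and r x m
    W = X[np.ix_(rows, cols)]                               # intersection of chosen rows/cols
    Xw, s, Ytw = np.linalg.svd(W)                           # W = Xw diag(s) Ytw
    s_pinv = np.where(s > 1e-10, 1.0 / s, 0.0)              # pseudo-inverse of Sigma
    U = Ytw.T @ np.diag(s_pinv ** 2) @ Xw.T                 # U = Y (Sigma+)^2 t(X)
    return C, U, R

A = np.random.default_rng(7).normal(size=(20, 3)) @ np.random.default_rng(8).normal(size=(3, 15))
C, U, R = cur_sketch(A, r=4)
print(C.shape, U.shape, R.shape)                            # (20, 4) (4, 4) (4, 15)
```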

Name examples of graphs. (5) What is analysed?

  • internet
  • social media
  • networks
  • publications
  • maps
  • >> groups, connections etc.

What are the challenges in Web Search? (2) What was the first approach? What is the recent approach? (4) What are flow-equations?

  • challenges
    • trustworthiness
    • correctness of results
  • first: rank defined by word occurrences (easily manipulated by SEO)
  • then: ranking based on link-structure (importance matters)
    • link's vote is proportional to importance of source
    • page j with importance r_j and n outlinks: each link gets r_j/n votes
    • importance \(r_j = \sum_{i \rightarrow j}\frac{r_i}{d_i}\)
    • >> recursive definition: flow equations
      • no unique solution (n equations, n variables, no constant term)
      • add constraint for uniqueness: sum(r_i) = 1
      • solvable for small examples

Explain the matrix formulation of page-rank (3). What results from this formulation? (2)

  • M: stochastic adjacency matrix
    • if i -> j: M_ji = 1/d_i, else 0
    • col-sum = 1
  • rank-vector r
    • entry per page: r_i = importance-score
    • sum(r_i) = 1
  • >> r = M * r
    • r is the principal eigenvector of M (eigenvalue 1, since M is column-stochastic)
    • solve for r by power-iteration
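A small sketch of the matrix formulation and power iteration on a toy three-page graph (the edge list is made up for illustration):

```python
import numpy as np

# toy web graph: directed edges i -> j
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
n = 3

# column-stochastic adjacency matrix M: M[j, i] = 1/d_i if i -> j, else 0
M = np.zeros((n, n))
out_deg = np.bincount([i for i, _ in edges], minlength=n)
for i, j in edges:
    M[j, i] = 1.0 / out_deg[i]

# power iteration: r = M r, starting from the uniform vector with sum(r) = 1
r = np.full(n, 1.0 / n)
for _ in range(100):
    r = M @ r
print(r.round(3), r.sum())
```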