KDDM

KDDM@LMU



Deck details

Cards 203
Language German
Category Computer Science
Level University
Created / Updated 01.08.2019 / 03.08.2019
Weblink
https://card2brain.ch/box/20190801_kddm

Characterise large-scale file systems. (2) Name an example of a distributed file system.

  • huge amount of data (terabytes)
  • files are rarely updated (update = appending)
  • example: HDFS (Hadoop Distributed File System)

How are files found in a DFS? (1)

  • the master node holds the information about all replicas (block locations)

What is the heartbeat-mechanism? What is it used for?

  • each DataNode periodically sends a heartbeat message to the NameNode (master node)
  • absence of heartbeats = DataNode has lost connection / failed

How does HDFS ensure robustness? (4)

  • data integrity: checksum checking of blocks 
  • cluster rebalancing: move data between DataNodes depending on free space
  • data / disk failure: detected via heartbeats
  • snapshots: storing a copy of a previous state

Why is MapReduce useful? (3)

  • manage large-scale computations
  • tolerant to hardware faults
  • parallelisation / distributed computation

What are the functions of MapReduce? Explain them. What is special about MapReduce?

  • map(in_key, in_value) -> list(map_key, intermediate_value)
  • reduce(map_key, list(intermediate_values)) -> list(out_values)
    • merges the results per key
  • >> these two functions are the only thing that has to be programmed (see the sketch below)
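
A minimal pure-Python sketch of the two functions for the classic word-count example (this is not the Hadoop API; function names and the toy documents are illustrative, and the shuffle step is simulated in the driver):

```python
from collections import defaultdict

def map_fn(doc_id, text):
    """map(in_key, in_value) -> list(map_key, intermediate_value)"""
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    """reduce(map_key, list(intermediate_values)) -> list(out_values)"""
    return [(word, sum(counts))]

def run(docs):
    # shuffle and sort (group-by-key) is normally handled by the framework
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)
    return [out for key, values in groups.items() for out in reduce_fn(key, values)]

print(run({"d1": "big data big", "d2": "data stream"}))
# [('big', 2), ('data', 2), ('stream', 1)]
```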

What is the general processing of MapReduce? (3)

  • map data in each chunk >> sequences of key-value pairs
  • pairs collected by master, sorted by key
  • reduce task on each key separately >> combine all values to output

What does MapReduce handle on its own? (5)

  • partitioning of input data
  • scheduling
  • shuffle and sort (group-by-key step)
  • handling failures
  • inter-machine communication
  • >> no programming required

Explain how to apply k-means with MapReduce. (3)

  1. read data + initialize centers
  2. mappers (for each data point separately)
    1. calculate the distance from the data point to each centroid
    2. assign the point to the nearest centroid
  3. reducers: recalculation of centroids (aggregation: mean value for each group)
  4. repeat 2. and 3. until convergence (no change of the centroids), as sketched below
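
A minimal plain-Python sketch of one such iteration (toy data and names are illustrative; a real job would run the map and reduce phases distributed over the cluster):

```python
from collections import defaultdict

def map_point(point, centroids):
    # mapper: emit (index of nearest centroid, point)
    nearest = min(range(len(centroids)),
                  key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])))
    return nearest, point

def reduce_cluster(points):
    # reducer: recompute the centroid as the mean of the assigned points
    return [sum(dim) / len(points) for dim in zip(*points)]

def kmeans_iteration(data, centroids):
    groups = defaultdict(list)
    for point in data:                       # map phase
        j, p = map_point(point, centroids)
        groups[j].append(p)
    return [reduce_cluster(pts) for j, pts in sorted(groups.items())]   # reduce phase

data = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.8)]
centroids = [(0.0, 0.0), (5.0, 5.0)]
print(kmeans_iteration(data, centroids))     # updated centroids after one iteration
```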

Briefly characterise Apache Hadoop. (2)

  • based on Google's GFS / MapReduce publications (2003/2004), independent Apache project since 2008 (open source)
  • platform for storage, analysis of big data
    • reliable
    • scalable
    • distributed computing

Name limitations of MapReduce. (2)

  • limited to batch-programming (not real-time / ad-hoc analysis)
  • limited analysis modes (not for graph, stream, text analysis)

What is the main advantage of Apache Spark compared to MapReduce? (2)

  • better performance than Hadoop
    • less I/O: intermediate results are kept in main memory instead of being written to and read from disk
  • more operations than only MapReduce
    • pipeline of operations

What is RDD? (3)

  • resilient distributed dataset
  • read-only collection of objects
  • partitioned across machines >> parallelisation

What are RDD transformations and actions? Give examples (3 + 1)

  • transformations: lazily evaluated
    • map
    • groupByKey
    • mapValues
  • actions: trigger evaluation of the transformations (see the sketch below)
    • collect
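
A short PySpark sketch (assuming a local pyspark installation; data and app name are illustrative) that chains the lazy transformations and only triggers computation with the collect action:

```python
from pyspark import SparkContext

sc = SparkContext("local", "rdd-demo")

words = sc.parallelize(["spark", "hadoop", "spark", "flink"])

# transformations: lazily evaluated, nothing is computed yet
pairs  = words.map(lambda w: (w, 1))
counts = pairs.groupByKey().mapValues(lambda values: sum(values))

# action: triggers evaluation of the whole pipeline
print(counts.collect())   # e.g. [('spark', 2), ('hadoop', 1), ('flink', 1)]

sc.stop()
```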

What is the concept of shared variables?

  • normal variables are shipped as copies to the tasks = not updateable
  • common read-only data is broadcast within each stage (updated only by a new broadcast), see the sketch below
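
A minimal PySpark sketch of a broadcast variable (assuming a local pyspark installation; the lookup table is illustrative): the read-only data is shipped to each worker once instead of being serialized with every task.

```python
from pyspark import SparkContext

sc = SparkContext("local", "broadcast-demo")

# read-only copy distributed to every worker once
lookup = sc.broadcast({"a": 1, "b": 2, "c": 3})

rdd = sc.parallelize(["a", "b", "c", "a"])
print(rdd.map(lambda k: lookup.value[k]).sum())   # 1 + 2 + 3 + 1 = 7

sc.stop()
```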

Define a data stream. (1) What is characteristic? (2) What is the challenge then?

  • continuous, potentially infinite stochastic process
  • characteristic
    • huge amount of data (not storable)
    • no control over the arrival order
    • >> difficult to analyse history

What are the components of data stream management system? (5)

  • data streams
  • stream processor
  • output streams
  • ad-hoc queries
  • archive storage

Name different data stream models (3)

  • insert-only model
    • once seen, x_i is unchangeable
  • insert-delete model
    • x_i can be deleted or updated
  • additive model
    • x_i is an increment of the previous version = store sums only

Why are streaming methods required? (2)

How do they tackle these problems? What is the trade-off of each?

  • not all data is storable
    • solution: storing summaries of previous data
    • trade-off: storage space vs. precise queries
  • not all data processable
    • solution: data reduction / approximation
    • trade-off: processing speed vs. precise queries

Name and explain streaming methods in case of limited storage. (2)

  • sliding window
    • keep most recent stream in main memory
    • timestamp vs. sequence based
  • ageing
    • decay the importance of older objects (see the sketch below)
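
A small Python sketch of both ideas: a fixed-size (sequence-based) sliding window kept in main memory, and exponential ageing that decays the weight of older observations (window size, decay factor and the stream are illustrative):

```python
from collections import deque

WINDOW_SIZE = 3
window = deque(maxlen=WINDOW_SIZE)    # keeps only the most recent items

decay = 0.5                           # ageing factor in (0, 1)
aged_sum = 0.0                        # exponentially decayed running sum

for x in [4, 8, 15, 16, 23, 42]:      # incoming stream
    window.append(x)                  # old items fall out automatically
    aged_sum = decay * aged_sum + x   # older items count less and less

print(list(window))   # [16, 23, 42]
print(aged_sum)
```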

Name streaming methods in case of restricted processing speed. (3)

  • sampling
    • obtain a representative sample, e.g. reservoir sampling, min-wise sampling (see the sketch below)
  • summaries
    • e.g. DWT (discrete wavelet transformation)
  • histograms
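
A minimal Python sketch of reservoir sampling: one pass over the stream, constant memory, and every element ends up in the sample of size k with equal probability k/n.

```python
import random

def reservoir_sample(stream, k):
    reservoir = []
    for n, x in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(x)          # fill the reservoir first
        else:
            j = random.randrange(n)      # uniform index in [0, n)
            if j < k:
                reservoir[j] = x         # replace with probability k/n
    return reservoir

print(reservoir_sample(range(10_000), k=5))
```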

What is special about Spark streaming?

  • ingestion of data from several sources:
    • input stream >> spark streaming >> batches of input (RDD) >> spark engine >> batches of processed data

What is Apache Storm? What is special about it?

  • alternative to spark streaming
  • supports real-time processing

What is Apache Flink? (1)

What is special about it? (4)

  • stream processing framework (open source)
  • special
    • low latency with high throughput (buffer sent when full or time out)
    • stateful operators: internal state (i.e. state = model parameter)
    • distributed execution
    • iterative data flows integrated (looping)
      • loops via a feedback stream (Hadoop, Spark: loop in the client program)

What do we need for learning? What is the general problem?

  • domain set X
  • label set y
  • training data S
  • prediction rule / hypothesis h (X -> y) from hypothesis space H
  • distribution of X: x ~ D (unknown)
  • labeling function (f: X -> y) (unknown)
  • true risk (L_D,f(h)) (unknown)
  • >> problem: D unknown
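
Written out with the 0-1 loss (notation as above, m = size of the training data S), the unknown true risk and the observable empirical risk are:

```latex
L_{D,f}(h) = \Pr_{x \sim D}\big[h(x) \neq f(x)\big]
\qquad\qquad
L_S(h) = \frac{1}{m}\,\big|\{\, i \in [m] : h(x_i) \neq y_i \,\}\big|
```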

What is the aim of statistical learning theory? How is this achieved?

  • find the function in H which describes S best
  • >> empirical risk minimization (ERM): choose the function with the lowest empirical risk
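
As a formula, the ERM rule returns:

```latex
h_S \in \operatorname*{argmin}_{h \in H} L_S(h)
```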

Name conditions for ERM. (1 >> 2)

  • P(bad hypothesis chosen) <= |H| * e^(-epsilon * m)
    • if |H| is finite >> learnable with ERM
    • for larger m the chance that a bad hypothesis is chosen by accident decreases
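
A short sketch of where this bound comes from (assuming realizability, i.e. some hypothesis in H has zero true risk): a fixed "bad" hypothesis with L_D(h) > epsilon survives m i.i.d. samples with zero empirical risk only with small probability, and a union bound over H adds the factor |H|:

```latex
\Pr\big[L_S(h) = 0\big] \le (1-\epsilon)^m \le e^{-\epsilon m}
\qquad\Rightarrow\qquad
\Pr\big[\exists\, h \in H : L_D(h) > \epsilon \wedge L_S(h) = 0\big] \le |H|\, e^{-\epsilon m}
```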

What is the aim of PAC-learnability? (2)

  • PAC = probably approximately correct
  • given H, epsilon and confidence 1 - delta, determine m
    • m = theoretical sample complexity of learning H
    • e.g. finite hypothesis classes: m <= |H| / (epsilon^2 * delta) (Chebyshev's inequality)

What is an epsilon representative sample?

  • for all rules h in H: |L_S(h) - L_D(h)| <= epsilon

What is the main aspect of uniform convergence?

What is the idea of concentration measures?

  • empirical loss converges to true loss (law of large numbers)
  • quantify how fast empirical risk converges
    • Markov's inequality: P(x > epsilon) <= E(x) / epsilon
    • Chebyshev's inequality: P(|z - E(z)| > epsilon) <= Var(z) / epsilon^2
      • e.g. probability of a bad hypothesis: P(|L_S(h) - L_D(h)| > epsilon) <= 1 / (epsilon^2 * m)
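
The last bound follows by applying Chebyshev's inequality to the empirical risk, which is an average of m independent losses bounded in [0, 1] (so its variance is at most 1/m):

```latex
L_S(h) = \frac{1}{m}\sum_{i=1}^{m} \ell(h, x_i), \qquad
\operatorname{Var}\big(L_S(h)\big) = \frac{\operatorname{Var}(\ell)}{m} \le \frac{1}{m}
\ \Rightarrow\
\Pr\big[\,|L_S(h) - L_D(h)| > \epsilon\,\big] \le \frac{1}{\epsilon^2\, m}
```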

What is PAC-learnable?

  • finite hypothesis classes: m <= |H| / (epsilon^2 * delta) (Chebyshev's inequality)
  • infinite hypothesis classes, e.g. threshold functions, IF the hypothesis class is restricted (VC(H) < inf)

What is shattering?

  • H shatters a finite set C (subset of X) if |H_C| = 2^|C|

Define VC-dimension (1). How to show this (2)?

  • maximal size of a set C that can be shattered by H
  • to show VC(H) = d:
    • there exists a set C of size d that is shattered by H >> VC(H) >= d
    • every set C of size d+1 can't be shattered by H >> VC(H) <= d
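
A small worked example with threshold functions h_a(x) = 1 iff x < a on the real line: any single point can be shattered, but no pair can, so VC(H) = 1.

```latex
C = \{x_1\}: \ a > x_1 \text{ labels } x_1 \text{ with } 1,\ a \le x_1 \text{ with } 0
\ \Rightarrow\ |H_C| = 2 = 2^{|C|}
```

```latex
C = \{x_1 < x_2\}: \ \text{the labeling } (0,1) \text{ would need } x_1 \ge a \text{ and } x_2 < a,
\text{ contradicting } x_1 < x_2 \ \Rightarrow\ \text{not shattered}
```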

What is the VC-dimension of a linear classifier (with b = 0 and b != 0)?

  • d
  • d+1

Why is it useful to use a surrogate loss function? (2)

  • easier to solve (optimizing the original loss is infeasible for large m, d >> NP-hard)
    • e.g. analytically solvable
    • else: iterative optimization (e.g. gradient descent)
  • approximates the original loss (see the hinge loss below)
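
The classic example is the hinge loss, a convex upper bound of the 0-1 loss (used by soft SVMs further below):

```latex
\ell_{\text{hinge}}\big(y, f(x)\big) = \max\big(0,\ 1 - y\,f(x)\big) \ \ge\ \mathbb{1}\big[\, y\, f(x) \le 0 \,\big]
```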

What is the main idea of SVMs? (3)

What does the solution depend on?

  • transform features into a high-dimensional space
  • assumes separable data
  • maximizing the margin around the hyperplane
    • constrained optimization problem: maximize the minimal distance to the hyperplane (see below)
    • = minimize length of w
  • solution only depends on support vectors (y_i * f(x_i) = 1)
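
Written out, the (hard-margin) constrained optimization problem is:

```latex
\min_{w, b}\ \tfrac{1}{2}\,\lVert w \rVert^2
\quad\text{s.t.}\quad
y_i\,\big(\langle w, x_i \rangle + b\big) \ge 1 \quad \text{for } i = 1, \dots, m
```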

Why is the dual form useful? (3) How is it obtained?

  • primal form not analytically solvable
  • get rid of constraint: y_i * f(x_i) - 1 >= 0
  • the solution only involves inner products <x_j, x_i> = similarity (distance) of two samples >> kernels can be introduced
  • obtained via Lagrange multipliers (gradients of objective and constraint point in the same direction)
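
The resulting dual problem only accesses the data through inner products, which is exactly what makes kernels applicable:

```latex
\max_{\alpha \ge 0}\ \sum_{i=1}^{m} \alpha_i \;-\; \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i\, \alpha_j\, y_i\, y_j\, \langle x_i, x_j \rangle
\quad\text{s.t.}\quad \sum_{i=1}^{m} \alpha_i\, y_i = 0
```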

What is the difference between hard and soft SVMs?

  • hard: no classification error allowed (linearly separable data)
  • soft: slack variables introduced >> hinge loss to handle margin violations (see below)
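
The soft-margin objective adds slack variables xi_i (equivalently, the hinge loss) with a trade-off parameter C:

```latex
\min_{w, b, \xi}\ \tfrac{1}{2}\,\lVert w \rVert^2 + C \sum_{i=1}^{m} \xi_i
\quad\text{s.t.}\quad
y_i\,\big(\langle w, x_i \rangle + b\big) \ge 1 - \xi_i,\ \ \xi_i \ge 0
```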

What is a Hilbert space? (1) Characterise it. (2)

  • generalisation of the Euclidean space
  • characteristics
    • complete: all Cauchy sequences converge (minimization)
    • an inner product exists

What is the Gram matrix? How does it scale? What does it represent?

  • matrix with entries K_ij = k(x_i, x_j)
  • = similarity of x_i and x_j
  • scales quadratically with the number of samples m (m x m matrix), see the sketch below
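
A small NumPy sketch that builds the m x m Gram matrix for an RBF kernel, illustrating that its size grows with m^2 regardless of the feature dimension (data and gamma are illustrative):

```python
import numpy as np

def gram_matrix(X, gamma=0.5):
    """K[i, j] = exp(-gamma * ||x_i - x_j||^2), an m x m similarity matrix."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dists)

X = np.random.rand(100, 3)     # m = 100 samples, d = 3 features
K = gram_matrix(X)
print(K.shape)                 # (100, 100) -> grows with m^2, independent of d
```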