KDDM@LMU
Deck details

Cards | 203
---|---
Language | German
Category | Computer science
Level | University
Created / Updated | 01.08.2019 / 03.08.2019
Weblink | https://card2brain.ch/box/20190801_kddm
Characterise large-scale file systems. (2) Name an example of a distributed file system.
- huge amount of data (terabytes)
- files rarely updated (update = appending)
- HDFS: Hadoop Distributed File System
How are files found in a DFS? (1)
- the master node holds the information about all replicas
What is the heartbeat mechanism? What is it used for?
- each DataNode sends a heartbeat message to the NameNode (master node) periodically
- absence of heartbeats = the DataNode has lost connection
How does HDFS ensure robustness? (4)
- data integrity: checksum checking of blocks
- cluster rebalancing: move data between DataNodes when free space runs low
- data disk failure: heartbeats
- snapshots: storing copy
Why is MapReduce useful? (3)
- manage large-scale computations
- tolerant to hardware faults
- parallelisation / distributed computation
What are the functions of MapReduce? Explain them. What is special about MapReduce?
- map(in_key, in_value) -> list(map_key, intermediate value)
- reduce(map_key, list(intermediate values)) -> list(out_values)
- the reducers' results are merged into the final output
- >> map and reduce are the only functions the user has to program (see the word-count sketch below)
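A minimal word-count sketch in plain Python (not Hadoop's actual API); the names map_fn, reduce_fn and the example documents are illustrative, and the group-by-key step in the middle is what the framework would normally do for you:

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # map(in_key, in_value) -> list of (map_key, intermediate_value)
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # reduce(map_key, list of intermediate values) -> list of out_values
    return [sum(counts)]

# simulate the shuffle / group-by-key step the framework would perform
docs = {1: "big data big", 2: "data stream"}
groups = defaultdict(list)
for doc_id, text in docs.items():
    for map_key, value in map_fn(doc_id, text):
        groups[map_key].append(value)

result = {key: reduce_fn(key, values) for key, values in groups.items()}
print(result)  # {'big': [2], 'data': [2], 'stream': [1]}
```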
What is the general processing of MapReduce? (3)
- map data in each chunk >> sequences of key-value pairs
- pairs collected by master, sorted by key
- reduce task on each key separately >> combine all values to output
What does MapReduce handle on its own? (5)
- partitioning of input data
- scheduling
- shuffle and sort (group-by-key step)
- handling failures
- inter-machine communication
- >> no programming required
Explain how to apply k-means with MapReduce. (3)
- read data + initialize centers
- mappers (for each data point separately)
- calculate the distance from each data point to the centroids
- assignment
- reducing: recalculation of centroids (aggregation: mean value for each group)
- repeat steps 2 and 3 until convergence (no more changes of the centroids); see the sketch below
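A sketch of one k-means iteration in the MapReduce style, written in plain Python with NumPy; the helper names (kmeans_map, kmeans_reduce, kmeans_iteration) are made up for illustration, not part of any framework:

```python
import numpy as np
from collections import defaultdict

def kmeans_map(point, centroids):
    # map: assign the point to its nearest centroid -> (centroid_id, point)
    dists = np.linalg.norm(centroids - point, axis=1)
    return int(np.argmin(dists)), point

def kmeans_reduce(centroid_id, points):
    # reduce: recompute the centroid as the mean of its assigned points
    return centroid_id, np.mean(points, axis=0)

def kmeans_iteration(data, centroids):
    groups = defaultdict(list)
    for p in data:                       # map phase (per data point)
        cid, point = kmeans_map(p, centroids)
        groups[cid].append(point)        # shuffle / group by centroid id
    new_centroids = centroids.copy()
    for cid, pts in groups.items():      # reduce phase (per centroid)
        _, new_centroids[cid] = kmeans_reduce(cid, pts)
    return new_centroids

data = np.random.rand(100, 2)
centroids = data[:3].copy()              # initialise centers with 3 points
for _ in range(10):                      # in practice: repeat until centroids stop changing
    centroids = kmeans_iteration(data, centroids)
```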
Briefly characterise Apache Hadoop. (2)
- based on Google's GFS/MapReduce papers (2003); independent Apache project since 2008 (open source)
- platform for storage, analysis of big data
- reliable
- scalable
- distributed computing
Name limitations of MapReduce. (2)
- limited to batch-programming (not real-time / ad-hoc analysis)
- limited analysis modes (not for graph, stream, text analysis)
What is the main advantage of Apache Spark compared to MapReduce? (2)
- better performance than Hadoop
- less I/O by using main memory instead of writing and reading intermediate results on disk
- more operations than only MapReduce
- pipeline of operations
What is RDD? (3)
- resilient distributed dataset
- read-only collection of objects
- partitioned across machines >> parallelisation
What are RDD transformations and actions? Give examples (3 + 1)
- transformations: lazily evaluated
- map
- groupByKey
- mapValues
- actions: trigger the evaluation of the transformations
- collect (see the PySpark sketch below)
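A minimal PySpark sketch (assuming a local Spark installation) showing that transformations are lazy and only the collect action triggers the computation:

```python
from pyspark import SparkContext

sc = SparkContext("local", "rdd-demo")
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

grouped = pairs.groupByKey()                   # transformation: lazy, nothing runs yet
sums = grouped.mapValues(lambda vs: sum(vs))   # transformation: still lazy
print(sums.collect())                          # action: triggers evaluation, e.g. [('a', 4), ('b', 2)]

sc.stop()
```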
What is the concept of shared variables?
- normal variables are copied to each worker = not updateable
- common data is broadcast within each stage (updates via broadcast); see the sketch below
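A minimal PySpark sketch of a broadcast variable (a read-only copy shipped to every worker); the lookup table is just an example value:

```python
from pyspark import SparkContext

sc = SparkContext("local", "broadcast-demo")
lookup = sc.broadcast({"a": 1, "b": 2})              # read-only copy sent to all workers

rdd = sc.parallelize(["a", "b", "a"])
print(rdd.map(lambda k: lookup.value[k]).collect())  # [1, 2, 1]

sc.stop()
```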
Define a data stream. (1) What is characteristic? (2) What's the challenge then?
- continuous, potentially infinite stochastic process
- characteristic
- huge amount of data (not storable)
- no control over the arrival order
- >> difficult to analyse history
What are the components of data stream management system? (5)
- data streams
- stream processor
- output streams
- ad-hoc queries
- archive storage
Name different data stream models (3)
- insert-only model
- once seen, x_i is unchangeable
- insert-delete model
- x_i can be deleted or updated
- additive model
- x_i is an increment of the previous version = only sums are stored
Why are streaming methods required? (2)
How do they tackle these problems? What is the trade-off in each case?
- not all data can be stored
- solution: storing summaries of previous data
- trade-off: storage space vs. precise queries
- not all data can be processed
- solution: data reduction / approximation
- trade-off: processing speed vs. precise queries
Name and explain streaming methods in case of limited storage. (2)
- sliding window
- keep most recent stream in main memory
- timestamp-based vs. sequence-based (see the sketch below)
- ageing
- decay importance of objects
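A minimal sketch of a timestamp-based sliding window in plain Python (the class name and window size are illustrative): only the most recent objects are kept in main memory, older ones are expired.

```python
import time
from collections import deque

class SlidingWindow:
    """Time-based sliding window: keep only the most recent stream objects."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.items = deque()                 # (timestamp, value) pairs, oldest first

    def add(self, value, now=None):
        now = time.time() if now is None else now
        self.items.append((now, value))
        while self.items and now - self.items[0][0] > self.window:
            self.items.popleft()             # expire objects that aged out of the window

    def values(self):
        return [v for _, v in self.items]

w = SlidingWindow(window_seconds=60)
w.add(42)
print(w.values())
```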
Name streaming methods in case of restricted processing speed. (3)
- sampling
- obtain a representative sample, e.g. reservoir sampling (sketched below), min-wise sampling
- summaries
- e.g. DWT (discrete wavelet transform)
- histograms
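A sketch of reservoir sampling in plain Python: after processing n elements of a (possibly unbounded) stream, every element is in the sample of size k with equal probability k/n.

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of size k from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir with the first k items
        else:
            j = random.randint(0, i)     # keep the new item with probability k / (i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(10_000), 5))
```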
What is special about Spark streaming?
- ingestion of data from several sources:
- input stream >> Spark Streaming >> batches of input (RDDs) >> Spark engine >> batches of processed data (see the sketch below)
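A minimal word-count sketch with the micro-batch DStream API, assuming a text source on localhost:9999; newer Spark versions favour Structured Streaming, so treat this as illustrative only:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-demo")     # streaming needs at least 2 local threads
ssc = StreamingContext(sc, batchDuration=1)         # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)     # ingest lines from a socket source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                     # each batch is processed as an RDD

ssc.start()
ssc.awaitTermination()
```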
What is Apache Storm? What is special about it?
- alternative to spark streaming
- supports real-time processing
What is Apache Flink? (1)
What is special about it? (4)
- stream processing framework (open source)
- special
- low latency with high throughput (buffer sent when full or time out)
- stateful operators: internal state (e.g. state = model parameters)
- distributed execution
- iterative data flow integrated (looping)
- loops via a feedback stream (Hadoop, Spark: loop in the client program)
What do we need for learning? What is the general problem?
- domain set X
- label set y
- training data S
- prediction rule / hypothesis h (X -> y) from hypothesis space H
- distribution of X: x ~ D (unknown)
- labeling function (f: X -> y) (unknown)
- true risk (L_D,f(h)) (unknown)
- >> problem: D unknown
What is the aim of statistical learning theory? How is this achieved?
- find the function in H which describes S best
- >> empirical risk minimization: choose the function with the lowest empirical risk
Name conditions for ERM. (1 >> 2)
- P(bad hypothesis chosen) <= |H| * e^(-epsilon * m)
- if |H| is finite >> learnable via ERM
- for larger m, the chance that a bad hypothesis is chosen by accident decreases
What is the aim of PAC-learnability? (2)
- PAC = probably approximately correct
- given H, z, h, epsilon, 1 - delta, define m
- m = theoretical sample complexity of learning H
- e.g. finite hypothesis classes: m <= |H| / (epsilon^2 * delta) (Chebyshev's inequality)
What is an epsilon representative sample?
- for all rules h: |L_s(h) - L_D(h)| <= epsilon
What is the main aspect of uniform convergence?
What is the idea of concentration measures?
- empirical loss converges to true loss (law of large numbers)
- quantify how fast empirical risk converges
- Markov's inequality: P(x > epsilon) <= E(x) / epsilon
- Chebyshev's inequality: P(|z - E(z)| > epsilon) <= Var(z) / epsilon^2
- e.g. probability of a bad hypothesis: P(|L_S(h) - L_D(h)| > epsilon) <= 1/(epsilon^2 * m) (derivation sketched below)
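A short derivation of the finite-class bound quoted on these cards, combining Chebyshev's inequality for a single hypothesis with a union bound over H (assuming the loss is bounded so that Var(z) <= 1):

```latex
% Chebyshev for one fixed h, then a union bound over the |H| hypotheses:
\[
P\big(\,|L_S(h) - L_D(h)| > \epsilon\,\big) \le \frac{1}{\epsilon^2 m}
\quad\Rightarrow\quad
P\big(\,\exists\, h \in H:\ |L_S(h) - L_D(h)| > \epsilon\,\big) \le \frac{|H|}{\epsilon^2 m}
\]
% Requiring the failure probability to be at most delta gives the sample complexity bound:
\[
\frac{|H|}{\epsilon^2 m} \le \delta
\quad\Longleftrightarrow\quad
m \ge \frac{|H|}{\epsilon^2 \delta}
\]
```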
What is PAC-learnable?
- finite hypothesis classes: m <= |H| / (epsilon^2 * delta) (Chebyshev's inequality)
- infinite hypothesis classes, e.g. threshold functions, IF the hypothesis class is restricted (VC(H) < inf)
What is shattering?
- H shatters a finite set C (subset of X) if |H_C| = 2^|C|
Define VC-dimension (1). How to show this (2)?
- maximal size of a set C that can be shattered by H
- to show:
- there exists a set C of size d that is shattered by H >> VC(H) >= d
- no set C of size d+1 can be shattered by H >> VC(H) <= d (together: VC(H) = d)
What is the VC-dimension of a linear classifier (with b = 0 and b != 0)?
- b = 0: d
- b != 0: d + 1
Why is it useful to use a surrogate loss function? (2)
- easier to solve (global optimum infeasible for large m, d >> NP-hard)
- e.g. analytically solvable
- else: iterative optimization (e.g. gradient descent)
- approximates original loss
What is the main idea of SVMs? (3)
What does the solution depend on?
- transform features into a high-dimensional space
- assumes separable data
- maximizing margin from hyperplane
- constraint optimization problem: maximize minimal distance to hyperplane
- = minimize length of w
- solution only depends on the support vectors (y_i * f(x_i) = 1); see the primal form sketched below
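For reference, a sketch of the hard-margin primal problem described above (maximising the margin is equivalent to minimising the length of w):

```latex
% Hard-margin primal: minimise ||w||, subject to every point being classified with margin >= 1
\[
\min_{w,\,b}\ \tfrac{1}{2}\,\lVert w \rVert^2
\qquad \text{s.t.} \qquad
y_i\,\big(\langle w, x_i \rangle + b\big) \ge 1 \quad \text{for all } i
\]
```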
Why is the dual form useful? (3) How to obtain it?
- primal form not analytically solvable
- get rid of the constraint: y_i * f(x_i) - 1 >= 0
- inner product <x_j, x_i> included in the solution = similarity (distance) of two samples >> a kernel can be introduced
- derived with Lagrange multipliers (gradients point in the same direction); dual form sketched below
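A sketch of the resulting dual problem, obtained via Lagrange multipliers alpha_i >= 0; the data enters only through inner products, which is exactly where a kernel k(x_i, x_j) can be substituted:

```latex
\[
\max_{\alpha}\ \sum_{i=1}^{m} \alpha_i
\;-\; \tfrac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m}
\alpha_i\,\alpha_j\, y_i\, y_j\, \langle x_i, x_j \rangle
\qquad \text{s.t.} \qquad
\alpha_i \ge 0,\quad \sum_{i=1}^{m} \alpha_i\, y_i = 0
\]
```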
What is the difference between hard and soft SVMs?
- hard: no classification error allowed (linearly separable data)
- soft: slack variables introduced >> hinge loss to handle margin violations (see the sketch below)
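A minimal NumPy sketch of the soft-margin objective with hinge loss; the function name and the trade-off parameter C are illustrative:

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C=1.0):
    """Soft-margin SVM objective: margin term + C * sum of hinge losses (slack)."""
    margins = y * (X @ w + b)                  # y_i * f(x_i)
    hinge = np.maximum(0.0, 1.0 - margins)     # slack: how far each point violates the margin
    return 0.5 * np.dot(w, w) + C * hinge.sum()

X = np.array([[1.0, 2.0], [-1.0, -1.5], [2.0, 0.5]])
y = np.array([1, -1, 1])
print(soft_margin_objective(np.array([0.5, -0.2]), 0.1, X, y))
```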
What is a Hilbert space? (1) Characterise it. (2)
- generalisation of Euclidean space
- characteristics
- complete: all Cauchy sequences converge (minimization)
- inner product exists
What is the Gram matrix? How does it scale? What does it represent?
- K(x_i, x_j)
- = similarity of x_i and x_j
- scales with the number of samples (m x m matrix); see the sketch below
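A minimal NumPy sketch of a Gram matrix for the RBF kernel (the kernel choice and gamma are illustrative); note the m x m size:

```python
import numpy as np

def rbf_gram_matrix(X, gamma=1.0):
    """Gram matrix K with K[i, j] = k(x_i, x_j) for the RBF kernel; size m x m."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T  # ||x_i - x_j||^2
    return np.exp(-gamma * sq_dists)

X = np.random.rand(5, 3)    # m = 5 samples, d = 3 features
K = rbf_gram_matrix(X)
print(K.shape)              # (5, 5): grows quadratically with the number of samples m
```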