KDDM@LMU
Deck details

Cards | 203
---|---
Language | German
Category | Computer science
Level | University
Created / Updated | 01.08.2019 / 03.08.2019
Weblink | https://card2brain.ch/box/20190801_kddm
Characterise large-scale file systems. (2) Name an example of a distributed file system.
- huge amount of data (terabytes)
- files rarely updated (update = appending)
- HDFS: Hadoop Distributed File System
How are files found in a DFS? (1)
- the master node holds the information about all replicas
What is the heartbeat mechanism? What is it used for?
- each DataNode sends a heartbeat message to the NameNode (master node) periodically
- absence of heartbeats = the DataNode has lost connection
How does HDFS ensure robustness? (4)
- data integrity: checksum checking of blocks
- cluster rebalancing: move data between DataNodes when free space runs low
- data disk failure: heartbeats
- snapshots: storing copy
Why is MapReduce useful? (3)
- manage large-scale computations
- tolerant to hardware faults
- parallelisation / distributed computation
What are the functions of MapReduce? Explain them. What is special about MapReduce?
- map(in_key, in_value) -> list(map_key, intermediate value)
- reduce(map_key, list(intermediate values)) -> list(out_values)
- the reducers' results are merged into the final output
- >> map and reduce are the only functions the user has to program (see the word-count sketch below)
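A minimal word-count sketch in plain Python (not Hadoop's actual API); the names map_fn, reduce_fn and the example documents are illustrative, and the group-by-key step in the middle is what the framework would normally do for you:

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # map(in_key, in_value) -> list of (map_key, intermediate_value)
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # reduce(map_key, list of intermediate values) -> list of out_values
    return [sum(counts)]

# simulate the shuffle / group-by-key step the framework would perform
docs = {1: "big data big", 2: "data stream"}
groups = defaultdict(list)
for doc_id, text in docs.items():
    for map_key, value in map_fn(doc_id, text):
        groups[map_key].append(value)

result = {key: reduce_fn(key, values) for key, values in groups.items()}
print(result)  # {'big': [2], 'data': [2], 'stream': [1]}
```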
What is the general processing of MapReduce? (3)
- map data in each chunk >> sequences of key-value pairs
- pairs collected by master, sorted by key
- reduce task on each key separately >> combine all values to output
What does MapReduce handle on its own? (5)
- partitioning of input data
- scheduling
- shuffle and sort (group-by-key step)
- handling failures
- inter-machine communication
- >> no programming required
Explain how to apply k-means with MapReduce. (3)
- read data + initialize centers
- mappers (for each data point separately)
- calculate the distance from each data point to the centroids
- assignment
- reducing: recalculation of centroids (aggregation: mean value for each group)
- repeat steps 2 and 3 until convergence (no more changes of the centroids); see the sketch below
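A sketch of one k-means iteration in the MapReduce style, written in plain Python with NumPy; the helper names (kmeans_map, kmeans_reduce, kmeans_iteration) are made up for illustration, not part of any framework:

```python
import numpy as np
from collections import defaultdict

def kmeans_map(point, centroids):
    # map: assign the point to its nearest centroid -> (centroid_id, point)
    dists = np.linalg.norm(centroids - point, axis=1)
    return int(np.argmin(dists)), point

def kmeans_reduce(centroid_id, points):
    # reduce: recompute the centroid as the mean of its assigned points
    return centroid_id, np.mean(points, axis=0)

def kmeans_iteration(data, centroids):
    groups = defaultdict(list)
    for p in data:                       # map phase (per data point)
        cid, point = kmeans_map(p, centroids)
        groups[cid].append(point)        # shuffle / group by centroid id
    new_centroids = centroids.copy()
    for cid, pts in groups.items():      # reduce phase (per centroid)
        _, new_centroids[cid] = kmeans_reduce(cid, pts)
    return new_centroids

data = np.random.rand(100, 2)
centroids = data[:3].copy()              # initialise centers with 3 points
for _ in range(10):                      # in practice: repeat until centroids stop changing
    centroids = kmeans_iteration(data, centroids)
```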
Briefly characterise Apache Hadoop. (2)
- based on Google's GFS/MapReduce papers (2003); independent Apache project since 2008 (open source)
- platform for storage, analysis of big data
- reliable
- scalable
- distributed computing
Name limitations of MapReduce. (2)
- limited to batch-programming (not real-time / ad-hoc analysis)
- limited analysis modes (not for graph, stream, text analysis)
What is the main advantage of Apache Spark compared to MapReduce? (2)
- better performance than Hadoop
- less I/O by using main memory instead of writing and reading intermediate results on disk
- more operations than only MapReduce
- pipeline of operations
What is RDD? (3)
- resilient distributed dataset
- read-only collection of objects
- partitioned across machines >> parallelisation
What are RDD transformations and actions? Give examples (3 + 1)
- transformations: lazily evaluated
- map
- groupByKey
- mapValues
- actions: trigger the evaluation of the transformations
- collect (see the PySpark sketch below)
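A minimal PySpark sketch (assuming a local Spark installation) showing that transformations are lazy and only the collect action triggers the computation:

```python
from pyspark import SparkContext

sc = SparkContext("local", "rdd-demo")
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

grouped = pairs.groupByKey()                   # transformation: lazy, nothing runs yet
sums = grouped.mapValues(lambda vs: sum(vs))   # transformation: still lazy
print(sums.collect())                          # action: triggers evaluation, e.g. [('a', 4), ('b', 2)]

sc.stop()
```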
What is the concept of shared variables?
- normal variables are copied to each worker = not updateable
- common data is broadcast within each stage (updates via broadcast); see the sketch below
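A minimal PySpark sketch of a broadcast variable (a read-only copy shipped to every worker); the lookup table is just an example value:

```python
from pyspark import SparkContext

sc = SparkContext("local", "broadcast-demo")
lookup = sc.broadcast({"a": 1, "b": 2})              # read-only copy sent to all workers

rdd = sc.parallelize(["a", "b", "a"])
print(rdd.map(lambda k: lookup.value[k]).collect())  # [1, 2, 1]

sc.stop()
```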
Define a data stream. (1) What is characteristic? (2) What's the challenge then?
- continuous, potentially infinite stochastic process
- characteristic
- huge amount of data (not storable)
- no control over the arrival order
- >> difficult to analyse history
What are the components of data stream management system? (5)
- data streams
- stream processor
- output streams
- ad-hoc queries
- archive storage
Name different data stream models (3)
- insert-only model
- once seen, x_i is unchangeable
- insert-delete model
- x_i can be deleted or updated
- additive model
- x_i is an increment of the previous version = only sums are stored
Why are streaming methods required? (2)
How do they tackle these problems? What is the trade-off in each case?
- not all data can be stored
- solution: storing summaries of previous data
- trade-off: storage space vs. precise queries
- not all data can be processed
- solution: data reduction / approximation
- trade-off: processing speed vs. precise queries
Name and explain streaming methods in case of limited storage. (2)
- sliding window
- keep most recent stream in main memory
- timestamp-based vs. sequence-based (see the sketch below)
- ageing
- decay importance of objects
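A minimal sketch of a timestamp-based sliding window in plain Python (the class name and window size are illustrative): only the most recent objects are kept in main memory, older ones are expired.

```python
import time
from collections import deque

class SlidingWindow:
    """Time-based sliding window: keep only the most recent stream objects."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.items = deque()                 # (timestamp, value) pairs, oldest first

    def add(self, value, now=None):
        now = time.time() if now is None else now
        self.items.append((now, value))
        while self.items and now - self.items[0][0] > self.window:
            self.items.popleft()             # expire objects that aged out of the window

    def values(self):
        return [v for _, v in self.items]

w = SlidingWindow(window_seconds=60)
w.add(42)
print(w.values())
```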
Name streaming methods in case of restricted processing speed. (3)
- sampling
- obtain a representative sample, e.g. reservoir sampling (sketched below), min-wise sampling
- summaries
- e.g. DWT (discrete wavelet transform)
- histograms
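A sketch of reservoir sampling in plain Python: after processing n elements of a (possibly unbounded) stream, every element is in the sample of size k with equal probability k/n.

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of size k from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)       # fill the reservoir with the first k items
        else:
            j = random.randint(0, i)     # keep the new item with probability k / (i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

print(reservoir_sample(range(10_000), 5))
```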
What is special about Spark streaming?
- ingestion of data from several sources:
- input stream >> Spark Streaming >> batches of input (RDDs) >> Spark engine >> batches of processed data (see the sketch below)
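A minimal word-count sketch with the micro-batch DStream API, assuming a text source on localhost:9999; newer Spark versions favour Structured Streaming, so treat this as illustrative only:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-demo")     # streaming needs at least 2 local threads
ssc = StreamingContext(sc, batchDuration=1)         # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)     # ingest lines from a socket source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                     # each batch is processed as an RDD

ssc.start()
ssc.awaitTermination()
```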
What is Apache Storm? What is special about it?
- alternative to spark streaming
- supports real-time processing
What is Apache Flink? (1)
What is special about it? (4)
- stream processing framework (open source)
- special
- low latency with high throughput (buffer sent when full or time out)
- stateful operators: internal state (e.g. state = model parameters)
- distributed execution
- iterative data flow integrated (looping)
- loops via a feedback stream (Hadoop, Spark: loop in the client program)
What do we need for learning? What is the general problem?
- domain set X
- label set y
- training data S
- prediction rule / hypothesis h (X -> y) from hypothesis space H
- distribution of X: x ~ D (unknown)
- labeling function (f: X -> y) (unknown)
- true risk (L_D,f(h)) (unknown)
- >> problem: D unknown
What is the aim of statistical learning theory? How is this achieved?
- find the function in H which describes S best
- >> empirical risk minimization: choose the function with the lowest empirical risk
Name conditions for ERM. (1 >> 2)
- P(bad hypothesis chosen) <= |H| * e^(-epsilon * m)
- if |H| is finite >> learnable via ERM
- for larger m, the chance that a bad hypothesis is chosen by accident decreases
What is the aim of PAC-learnability? (2)
- PAC = probably approximately correct
- given H, z, h, epsilon, 1 - delta, define m
- m = theoretical sample complexity of learning H
- e.g. finite hypothesis classes: m <= |H| / (epsilon^2 * delta) (Chebyshev's inequality)
What is an epsilon representative sample?
- for all rules h: |L_s(h) - L_D(h)| <= epsilon
What is the main aspect of uniform convergence?
What is the idea of concentration measures?
- empirical loss converges to true loss (law of large numbers)
- quantify how fast empirical risk converges
- Markov's inequality: P(x > epsilon) <= E(x) / epsilon
- Chebyshev's inequality: P(|z - E(z)| > epsilon) <= Var(z) / epsilon^2
- e.g. probability of a bad hypothesis: P(|L_S(h) - L_D(h)| > epsilon) <= 1/(epsilon^2 * m) (derivation sketched below)
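A short derivation of the finite-class bound quoted on these cards, combining Chebyshev's inequality for a single hypothesis with a union bound over H (assuming the loss is bounded so that Var(z) <= 1):

```latex
% Chebyshev for one fixed h, then a union bound over the |H| hypotheses:
\[
P\big(\,|L_S(h) - L_D(h)| > \epsilon\,\big) \le \frac{1}{\epsilon^2 m}
\quad\Rightarrow\quad
P\big(\,\exists\, h \in H:\ |L_S(h) - L_D(h)| > \epsilon\,\big) \le \frac{|H|}{\epsilon^2 m}
\]
% Requiring the failure probability to be at most delta gives the sample complexity bound:
\[
\frac{|H|}{\epsilon^2 m} \le \delta
\quad\Longleftrightarrow\quad
m \ge \frac{|H|}{\epsilon^2 \delta}
\]
```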
What is PAC-learnable?
- finite hypothesis classes: m <= |H| / (epsilon^2 * delta) (Chebyshev's inequality)
- infinite hypothesis classes, e.g. threshold functions, IF the hypothesis class is restricted (VC(H) < inf)
What is shattering?
- H shatters a finite set C (subset of X) if |H_C| = 2^|C|
Define VC-dimension (1). How to show this (2)?
- maximal size of a set C that can be shattered by H
- to show:
- there exists a set C of size d that is shattered by H >> VC(H) >= d
- no set C of size d+1 can be shattered by H >> VC(H) <= d (together: VC(H) = d)
What is the VC-dimension of a linear classifier (with b = 0 and b != 0)?
- b = 0: d
- b != 0: d + 1
Why is it useful to use a surrogate loss function? (2)
- easier to solve (global optimum infeasible for large m, d >> NP-hard)
- e.g. analytically solvable
- else: iterative optimization (e.g. gradient descent)
- approximates original loss
What is the main idea of SVMs? (3)
What does the solution depend on?
- transform features into a high-dimensional space
- assumes separable data
- maximizing margin from hyperplane
- constraint optimization problem: maximize minimal distance to hyperplane
- = minimize length of w
- solution only depends on the support vectors (y_i * f(x_i) = 1); see the primal form sketched below
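For reference, a sketch of the hard-margin primal problem described above (maximising the margin is equivalent to minimising the length of w):

```latex
% Hard-margin primal: minimise ||w||, subject to every point being classified with margin >= 1
\[
\min_{w,\,b}\ \tfrac{1}{2}\,\lVert w \rVert^2
\qquad \text{s.t.} \qquad
y_i\,\big(\langle w, x_i \rangle + b\big) \ge 1 \quad \text{for all } i
\]
```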
Why is the dual form useful? (3) How to obtain it?
- primal form not analytically solvable
- get rid of the constraint: y_i * f(x_i) - 1 >= 0
- inner product <x_j, x_i> included in the solution = similarity (distance) of two samples >> a kernel can be introduced
- derived with Lagrange multipliers (gradients point in the same direction); dual form sketched below
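A sketch of the resulting dual problem, obtained via Lagrange multipliers alpha_i >= 0; the data enters only through inner products, which is exactly where a kernel k(x_i, x_j) can be substituted:

```latex
\[
\max_{\alpha}\ \sum_{i=1}^{m} \alpha_i
\;-\; \tfrac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m}
\alpha_i\,\alpha_j\, y_i\, y_j\, \langle x_i, x_j \rangle
\qquad \text{s.t.} \qquad
\alpha_i \ge 0,\quad \sum_{i=1}^{m} \alpha_i\, y_i = 0
\]
```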
What is the difference between hard and soft SVMs?
- hard: no classification error allowed (linearly separable data)
- soft: slack variables introduced >> hinge loss to handle margin violations (see the sketch below)
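A minimal NumPy sketch of the soft-margin objective with hinge loss; the function name and the trade-off parameter C are illustrative:

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C=1.0):
    """Soft-margin SVM objective: margin term + C * sum of hinge losses (slack)."""
    margins = y * (X @ w + b)                  # y_i * f(x_i)
    hinge = np.maximum(0.0, 1.0 - margins)     # slack: how far each point violates the margin
    return 0.5 * np.dot(w, w) + C * hinge.sum()

X = np.array([[1.0, 2.0], [-1.0, -1.5], [2.0, 0.5]])
y = np.array([1, -1, 1])
print(soft_margin_objective(np.array([0.5, -0.2]), 0.1, X, y))
```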
What is a Hilbert space? (1) Characterise it. (2)
- generalisation of Euclidean space
- characteristics
- complete: all Cauchy sequences converge (minimization)
- inner product exists
What is the Gram matrix? How does it scale? What does it represent?
- K(x_i, x_j)
- = similarity of x_i and x_j
- scales with the number of samples (m x m matrix); see the sketch below
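A minimal NumPy sketch of a Gram matrix for the RBF kernel (the kernel choice and gamma are illustrative); note the m x m size:

```python
import numpy as np

def rbf_gram_matrix(X, gamma=1.0):
    """Gram matrix K with K[i, j] = k(x_i, x_j) for the RBF kernel; size m x m."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T  # ||x_i - x_j||^2
    return np.exp(-gamma * sq_dists)

X = np.random.rand(5, 3)    # m = 5 samples, d = 3 features
K = rbf_gram_matrix(X)
print(K.shape)              # (5, 5): grows quadratically with the number of samples m
```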