Cartes mémoires KDDM (Seite 4 von 6)

Cartes-fiches	203
Langue	Deutsch
Catégorie	Informatique
Niveau	Université
Crée / Actualisé	01.08.2019 / 03.08.2019
Lien de web	https://card2brain.ch/box/20190801_kddm
Intégrer	<iframe src="https://card2brain.ch/box/20190801_kddm/embed" width="780" height="150" scrolling="no" frameborder="0"></iframe>

What is the random walk in cae of page rank?

at time t from i juse any random out-link j by probability p(t)
- p(t+1) = M * p(t)

What is the problem of page-rank and random surfer? (2) What is the solution?

problems
- dead ends: no out -inks
- spider traps: all out-links in one group (stuck at trap)
solutions
- always teleport (dead ends): adjust M accordingly: if dead end >> uniform edge probability to all nodes (1/n)
- teleport (traps)
  - with probability beta: follow random link
  - with p = 1-beta: jump to random page
  - >> tradeoff: link-structure vs. stucked

State the final formular of page-rank

\(r_j = \sum_{i \rightarrow j}(\beta \frac{r_i}{d_i} + (1 - \beta)\frac{1}{n}) = \beta M+(1-\beta)\frac{1}{n}\)
assumes no dead ends (all columns col-stochastic, always teleport)

What are general problems of page-rank? (3) How to tackle them? (0, 1, 4)

only one measure of performance (rank by importance)
general probability of page (topics disregarded)
- >> topic-specific pagerank: modification of random-walk (teleports): biased random walk (pick page from (weighted) S) >> influences page-rank
link spam (artificial links for rank-boosting by link-farms)
- combating term spam: using statistical methods
- combating link-spam: detection and blacklisting
- trustrank: teleport set to trusted pages
- trust propagation: inital trusted pages, trust passed to outlinks, if trust below threshold = spam

What are communities? What is the assumption?

collection of entities
assumptino: non-randomness / locality: entities tend to cluster

Define betweenness. (1)

betweenness of (a,b) = # of pairs of (x, y) s.t. (a, b) lie on shortest path

Explain the girvan-Newman-Algorithm to compute betweenness (1:4, 1). What is the result of the algo?

for every node
- construct DAG
- label each node with number of shortest paths from root
- calculate credit for each edge
- >> betweenness
remove edges with largest betweenness until separated communities remain
result: hierachical decomposition of network

How to find the # of clusters in case of cummunity detection? (1:2)

use modularity as measure of partitioning into communities (>> max)
- sum over groups(# edges within group - expected # egdes within group)
  - expected # edges = 1/(2m) * k_i * k_j (m = # edges, k_i, j = degree of node i, j)
  - random connections, same degree distribution
- >> Q(G, S) = 1/(2m) * sum_s * sum_i * sum_j(a_ij, k_i * k_j / (2m))
  - 1/(2m) = normalisation s.t. G [-1, 1]

How to interpret the modularity Q (2)

> 0: observed # of edges exceeds expectation
0.3 - 0.7: significant structure

What is a good partitioning in general?

How is cut(A, B) defined?

What is the minimum cut?

What is a problem here?

max # within group-connections, min # between-group connections
# of edges linking groups
argmin_A, B cut(A, B): minimize weight of connections between groups
problem: only considers external cluster connections (not internal)

What is the idea of normalised cut? What is the problem here?

relate connectivity of a group to density of a group >> more balanced partitions
problem: optimal cuts: NP-hard

What are the 3 main steps of spectral clustering?

preprocessing:
- matrix-representation of graph
decomposition:
- eigenvalues, eigenvectors of matrix
- low-dimensional representation by k eigenvectors
grouping:
- assign points to custers (based on representation i.e. split at 0, k-means, minimum cut)

What is the general idea of databases?

What is the time to read?

persistence of data (main memory not enough)
TTR
- seek
- latency
- transfer

What are problems of file systems? (4-6)

unknown metadata (type, size...)
redundancy of information
distributed information (several files)
no access management
parallel read/write not possible
how to handle data loss?

Explain the architecture of a DB

App <> DBMS <> DB

Name the functionality of a DB-system (5-7)

description of data (metadata)
store, search, manipulate data
views
access control
transactions (validity)
safety: recovery, logging
synchronisation with concurrent users

What are the main differences between OLTP and OLAP? (4)

OLAP
- OLTP
  - consolidated, historical
    - single transaction, current data
  - read only (complex queries)
    - read/write/update
  - data warehouse
    - traditional DBMS
  - decision support
    - operations support

What does ACID stand for, explain each. (4) What does it focus on? (1)

atomicity
- transactions: all or nothing: not devidable
consistency
- transactions: from a valid state to valid state
isolation
- transactions: concurrent execution as if executed serially
durability
- transaction: commited = persistent
>> strict transactions (CONSISTENCY)

What are the main contents of a DB? (2)

schema: data type, structure (description of possible content)
- changes rarely
instance: content (objects and values)
- changes often

What are the main components of ER-model? (3)

ER = Entity Relation
entity
relation
- 1:1 (same relation)
- 1:n (by FK)
- n:m additional table
attributes

What is the idea of normalisation in context of relational DBs? What is the aim?

splitting objects into tuples in different tables
avoid redundancy while preserving dependencies

Explain the main operations of querying relational DBs.

Basic opations (5), dervied operations (4)

basic
- cross-product: all combinations of S, R
- selection: all tuples of subset
- projection: only specific attributes
- union: S + R same schema required
- difference: S - R, same schame required
derived
- intersection (union and differenc)
- join (selection of cross-product)
- natural join (shared attributes)
- division

What is SQL?

desciptive query language for relational DB-systems

How to scale storage systems? (2) Name drawbacks. (2 + 1)

vertical scaling: enlarge single machine
- limited in space
- expensive
horizontal scaling: use clusters / grids of computers
- cluster maintenance

Why NoSQL came up?

What is characteristic for a NoSQL-system? (4)

relational DBs don't scale well horizontally
characteristics
- horizonal scalable
- not relational
- distributed
- schema-less

Explain the BASE-concept. What is it's main focus? Relate it to ACID.

basically available (system generally available)
soft state: data might be expired (many changes)
eventual consistency: mostly consistent
>> main focus: availability
>> opposite to ACID regarding consisteny, availability

Name the levels of consistency (6)

eventual consistency: write-operations not spread immediately
monotonic read consistency: once read, never read older version
read your own writes
immediately consistency: not atomic
strong consistency: atomic
transactions: ACID

Name the two levels of consistency in NoSQL.

data sharding / logical consistency: data consistent within itself
data replication: data consistent across multiple replicas

Explain the CAP-theorem. (3+1)

CAP
- consistency: same data for all clients
- availability: each client can read and write
- partition tolerance: system works well distributed
- >> any networked shared-data system can fulfil at most 2/3

Explain single possible systems in context of CAP-theorem.

CP: only consistent nodes available
AP: answers always, consistence not given
AC: fully available, consistent but not distributed system

Name the four main NoSQL data models.

key-value store
document store
wide column store
graph databases

Explain the key-value store. (3) Relate it to CAP. Name a representative.

value by key
value: any format
consistency: eventually to serizability
>> CP, AP systems
>> redis

Characterise document store. (2) Relate it to CAP. Name one representative.

storing documents (XML, JSON) i.e. whole arrays
semi-structured (non homogeneous structure)
>> CP or AP system
>> MongoDB

Explain wide column store. (3) Explain it's structure. Relate it to CAP. Name on representative.

rows identified by keys
# columns can differ for rows
rows ordered by keys
structure:
- many rows >> family
- many families >> key space
>> CP or AP
>> Cassandra

Explain graph databases (1). Relate it to CAP. Name one representative.

storing of relationships: nodes, edges, properties
>> AC-system
>> neo4j

Name advantages of NoSQL-systems (4)

flexible scaling
less administration
better economics
flexible data models

What are drawbacks of NoSQL-systems?

only semi-structured
limited access control
limited indexing
CP or AP systems only
- AC not possible

Characterise distributed file systems in context of data analytics (3 + 1)

data analytics requires all data
well defined data (structure)
parallelisation of methods
>> distributed file system usefu l

What are challenges in context of distributed file systems (2)

redundant files (in case of failure)
divided computation of tasks

What is the parallel computing architecture? (2 + 1)

nodes are stored on racks
nodes are connected by network
>> cluster computing

KDDM

Créer ou copier des fichiers d'apprentissage

Créer ou copier des fichiers d'apprentissage

Connecte-toi pour voir toutes les cartes.

SWITCHaai

Office 365

Edulog

Apple ID

Google