KDDM

KDDM@LMU

KDDM@LMU


Fichier Détails

Cartes-fiches 203
Langue Deutsch
Catégorie Informatique
Niveau Université
Crée / Actualisé 01.08.2019 / 03.08.2019
Lien de web
https://card2brain.ch/box/20190801_kddm
Intégrer
<iframe src="https://card2brain.ch/box/20190801_kddm/embed" width="780" height="150" scrolling="no" frameborder="0"></iframe>

What is the random walk in cae of page rank?

  • at time t from i juse any random out-link j by probability p(t)
    • p(t+1) = M * p(t)

What is the problem of page-rank and random surfer? (2) What is the solution?

  • problems
    • dead ends: no out -inks
    • spider traps: all out-links in one group (stuck at trap)
  • solutions
    • always teleport (dead ends): adjust M accordingly: if dead end >> uniform edge probability to all nodes (1/n)
    • teleport (traps)
      • with probability beta: follow random link
      • with p = 1-beta: jump to random page
      • >> tradeoff: link-structure vs. stucked

State the final formular of page-rank

  • \(r_j = \sum_{i \rightarrow j}(\beta \frac{r_i}{d_i} + (1 - \beta)\frac{1}{n}) = \beta M+(1-\beta)\frac{1}{n}\)
  • assumes no dead ends (all columns col-stochastic, always teleport)

What are general problems of page-rank? (3) How to tackle them? (0, 1, 4)

  • only one measure of performance (rank by importance)
  • general probability of page (topics disregarded)
    • >> topic-specific pagerank: modification of random-walk (teleports): biased random walk (pick page from (weighted) S) >> influences page-rank
  • link spam (artificial links for rank-boosting by link-farms)
    • combating term spam: using statistical methods
    • combating link-spam: detection and blacklisting
    • trustrank: teleport set to trusted pages
    • trust propagation: inital trusted pages, trust passed to outlinks, if trust below threshold = spam

What are communities? What is the assumption?

  • collection of entities
  • assumptino: non-randomness / locality: entities tend to cluster

Define betweenness. (1)

  • betweenness of (a,b) = # of pairs of (x, y) s.t. (a, b) lie on shortest path

Explain the girvan-Newman-Algorithm to compute betweenness (1:4, 1). What is the result of the algo?

  • for every node
    • construct DAG
    • label each node with number of shortest paths from root
    • calculate credit for each edge
    • >> betweenness
  • remove edges with largest betweenness until separated communities remain
  • result: hierachical decomposition of network

How to find the # of clusters in case of cummunity detection? (1:2)

  • use modularity as measure of partitioning into communities (>> max)
    • sum over groups(# edges within group - expected # egdes within group)
      • expected # edges = 1/(2m) * k_i * k_j (m = # edges, k_i, j = degree of node i, j)
      • random connections, same degree distribution
    • >> Q(G, S) = 1/(2m) * sum_s * sum_i * sum_j(a_ij, k_i * k_j / (2m))
      • 1/(2m) = normalisation s.t. G [-1, 1]

How to interpret the modularity Q (2)

  • > 0: observed # of edges exceeds expectation
  • 0.3 - 0.7: significant structure

What is a good partitioning in general?

How is cut(A, B) defined?

What is the minimum cut?

What is a problem here?

  • max # within group-connections, min # between-group connections
  • # of edges linking groups
  • argmin_A, B cut(A, B): minimize weight of connections between groups
  • problem: only considers external cluster connections (not internal)

What is the idea of normalised cut? What is the problem here?

  • relate connectivity of a group to density of a group >> more balanced partitions
  • problem: optimal cuts: NP-hard

What are the 3 main steps of spectral clustering?

  • preprocessing:
    • matrix-representation of graph
  • decomposition:
    • eigenvalues, eigenvectors of matrix
    • low-dimensional representation by k eigenvectors
  • grouping:
    • assign points to custers (based on representation i.e. split at 0, k-means, minimum cut)

What is the general idea of databases?

What is the time to read?

  • persistence of data (main memory not enough)
  • TTR
    • seek
    • latency
    • transfer

What are problems of file systems? (4-6)

  • unknown metadata (type, size...)
  • redundancy of information
  • distributed information (several files)
  • no access management
  • parallel read/write not possible 
  • how to handle data loss?

Explain the architecture of a DB

  • App <> DBMS <> DB

Name the functionality of a DB-system (5-7)

  • description of data (metadata)
  • store, search, manipulate data
  • views
  • access control
  • transactions (validity)
  • safety: recovery, logging
  • synchronisation with concurrent users

What are the main differences between OLTP and OLAP? (4)

  • OLAP
    • OLTP
      • consolidated, historical
        • single transaction, current data
      • read only (complex queries)
        • read/write/update
      • data warehouse
        • traditional DBMS
      • decision support
        • operations support

What does ACID stand for, explain each. (4) What does it focus on? (1)

  • atomicity
    • transactions: all or nothing: not devidable
  • consistency
    • transactions: from a valid state to valid state
  • isolation
    • transactions: concurrent execution as if executed serially
  • durability
    • transaction: commited = persistent
  • >> strict transactions (CONSISTENCY)

What are the main contents of a DB? (2)

  • schema: data type, structure (description of possible content)
    • changes rarely
  • instance: content (objects and values)
    • changes often

What are the main components of ER-model? (3)

  • ER = Entity Relation
  • entity
  • relation
    • 1:1 (same relation)
    • 1:n (by FK)
    • n:m additional table
  • attributes

What is the idea of normalisation in context of relational DBs? What is the aim?

  • splitting objects into tuples in different tables
  • avoid redundancy while preserving dependencies

Explain the main operations of querying relational DBs.

Basic opations (5), dervied operations (4)

 

  • basic
    • cross-product: all combinations of S, R
    • selection: all tuples of subset
    • projection: only specific attributes
    • union: S + R same schema required
    • difference: S - R, same schame required
  • derived
    • intersection (union and differenc)
    • join (selection of cross-product)
    • natural join (shared attributes)
    • division

What is SQL?

  • desciptive query language for relational DB-systems

How to scale storage systems? (2) Name drawbacks. (2 + 1)

  • vertical scaling: enlarge single machine
    • limited in space
    • expensive
  • horizontal scaling: use clusters / grids of computers
    • cluster maintenance

Why NoSQL came up?

What is characteristic for a NoSQL-system? (4)

  • relational DBs don't scale well horizontally
  • characteristics
    • horizonal scalable
    • not relational
    • distributed
    • schema-less

Explain the BASE-concept. What is it's main focus? Relate it to ACID.

  • basically available (system generally available)
  • soft state: data might be expired (many changes)
  • eventual consistency: mostly consistent
  • >> main focus: availability
  • >> opposite to ACID regarding consisteny, availability

Name the levels of consistency (6)

  • eventual consistency: write-operations not spread immediately
  • monotonic read consistency: once read, never read older version
  • read your own writes
  • immediately consistency: not atomic
  • strong consistency: atomic
  • transactions: ACID

Name the two levels of consistency in NoSQL.

  • data sharding / logical consistency: data consistent within itself
  • data replication: data consistent across multiple replicas 

Explain the CAP-theorem. (3+1)

  • CAP
    • consistency: same data for all clients
    • availability: each client can read and write
    • partition tolerance: system works well distributed
    • >> any networked shared-data system can fulfil at most 2/3

Explain single possible systems in context of CAP-theorem.

  • CP: only consistent nodes available
  • AP: answers always, consistence not given
  • AC: fully available, consistent but not distributed system

Name the four main NoSQL data models.

  • key-value store
  • document store
  • wide column store
  • graph databases

Explain the key-value store. (3) Relate it to CAP. Name a representative.

  • value by key
  • value: any format
  • consistency: eventually to serizability
  • >> CP, AP systems
  • >> redis

Characterise document store. (2) Relate it to CAP. Name one representative.

  • storing documents (XML, JSON) i.e. whole arrays
  • semi-structured (non homogeneous structure)
  • >> CP or AP system
  • >> MongoDB

Explain wide column store. (3) Explain it's structure. Relate it to CAP. Name on representative.

  • rows identified by keys
  • # columns can differ for rows
  • rows ordered by keys
  • structure:
    • many rows >> family
    • many families >> key space
  • >> CP or AP
  • >> Cassandra

Explain graph databases (1). Relate it to CAP. Name one representative.

  • storing of relationships: nodes, edges, properties
  • >> AC-system
  • >> neo4j

Name advantages of NoSQL-systems (4)

  • flexible scaling
  • less administration
  • better economics
  • flexible data models

What are drawbacks of NoSQL-systems?

  • only semi-structured
  • limited access control
  • limited indexing
  • CP or AP systems only
    • AC not possible

Characterise distributed file systems in context of data analytics (3 + 1)

  • data analytics requires all data
  • well defined data (structure)
  • parallelisation of methods
  • >> distributed file system usefu l

What are challenges in context of distributed file systems (2)

  • redundant files (in case of failure)
  • divided computation of tasks

What is the parallel computing architecture? (2 + 1)

  • nodes are stored on racks
  • nodes are connected by network
  • >> cluster computing