KDDM
KDDM@LMU
KDDM@LMU
Fichier Détails
Cartes-fiches | 203 |
---|---|
Langue | Deutsch |
Catégorie | Informatique |
Niveau | Université |
Crée / Actualisé | 01.08.2019 / 03.08.2019 |
Lien de web |
https://card2brain.ch/box/20190801_kddm
|
Intégrer |
<iframe src="https://card2brain.ch/box/20190801_kddm/embed" width="780" height="150" scrolling="no" frameborder="0"></iframe>
|
What is the random walk in cae of page rank?
- at time t from i juse any random out-link j by probability p(t)
- p(t+1) = M * p(t)
What is the problem of page-rank and random surfer? (2) What is the solution?
- problems
- dead ends: no out -inks
- spider traps: all out-links in one group (stuck at trap)
- solutions
- always teleport (dead ends): adjust M accordingly: if dead end >> uniform edge probability to all nodes (1/n)
- teleport (traps)
- with probability beta: follow random link
- with p = 1-beta: jump to random page
- >> tradeoff: link-structure vs. stucked
State the final formular of page-rank
- \(r_j = \sum_{i \rightarrow j}(\beta \frac{r_i}{d_i} + (1 - \beta)\frac{1}{n}) = \beta M+(1-\beta)\frac{1}{n}\)
- assumes no dead ends (all columns col-stochastic, always teleport)
What are general problems of page-rank? (3) How to tackle them? (0, 1, 4)
- only one measure of performance (rank by importance)
- general probability of page (topics disregarded)
- >> topic-specific pagerank: modification of random-walk (teleports): biased random walk (pick page from (weighted) S) >> influences page-rank
- link spam (artificial links for rank-boosting by link-farms)
- combating term spam: using statistical methods
- combating link-spam: detection and blacklisting
- trustrank: teleport set to trusted pages
- trust propagation: inital trusted pages, trust passed to outlinks, if trust below threshold = spam
What are communities? What is the assumption?
- collection of entities
- assumptino: non-randomness / locality: entities tend to cluster
Define betweenness. (1)
- betweenness of (a,b) = # of pairs of (x, y) s.t. (a, b) lie on shortest path
Explain the girvan-Newman-Algorithm to compute betweenness (1:4, 1). What is the result of the algo?
- for every node
- construct DAG
- label each node with number of shortest paths from root
- calculate credit for each edge
- >> betweenness
- remove edges with largest betweenness until separated communities remain
- result: hierachical decomposition of network
How to find the # of clusters in case of cummunity detection? (1:2)
- use modularity as measure of partitioning into communities (>> max)
- sum over groups(# edges within group - expected # egdes within group)
- expected # edges = 1/(2m) * k_i * k_j (m = # edges, k_i, j = degree of node i, j)
- random connections, same degree distribution
- >> Q(G, S) = 1/(2m) * sum_s * sum_i * sum_j(a_ij, k_i * k_j / (2m))
- 1/(2m) = normalisation s.t. G [-1, 1]
- sum over groups(# edges within group - expected # egdes within group)
How to interpret the modularity Q (2)
- > 0: observed # of edges exceeds expectation
- 0.3 - 0.7: significant structure
What is a good partitioning in general?
How is cut(A, B) defined?
What is the minimum cut?
What is a problem here?
- max # within group-connections, min # between-group connections
- # of edges linking groups
- argmin_A, B cut(A, B): minimize weight of connections between groups
- problem: only considers external cluster connections (not internal)
What is the idea of normalised cut? What is the problem here?
- relate connectivity of a group to density of a group >> more balanced partitions
- problem: optimal cuts: NP-hard
What are the 3 main steps of spectral clustering?
- preprocessing:
- matrix-representation of graph
- decomposition:
- eigenvalues, eigenvectors of matrix
- low-dimensional representation by k eigenvectors
- grouping:
- assign points to custers (based on representation i.e. split at 0, k-means, minimum cut)
What is the general idea of databases?
What is the time to read?
- persistence of data (main memory not enough)
- TTR
- seek
- latency
- transfer
What are problems of file systems? (4-6)
- unknown metadata (type, size...)
- redundancy of information
- distributed information (several files)
- no access management
- parallel read/write not possible
- how to handle data loss?
Explain the architecture of a DB
- App <> DBMS <> DB
Name the functionality of a DB-system (5-7)
- description of data (metadata)
- store, search, manipulate data
- views
- access control
- transactions (validity)
- safety: recovery, logging
- synchronisation with concurrent users
What are the main differences between OLTP and OLAP? (4)
- OLAP
- OLTP
- consolidated, historical
- single transaction, current data
- read only (complex queries)
- read/write/update
- data warehouse
- traditional DBMS
- decision support
- operations support
- consolidated, historical
- OLTP
What does ACID stand for, explain each. (4) What does it focus on? (1)
- atomicity
- transactions: all or nothing: not devidable
- consistency
- transactions: from a valid state to valid state
- isolation
- transactions: concurrent execution as if executed serially
- durability
- transaction: commited = persistent
- >> strict transactions (CONSISTENCY)
What are the main contents of a DB? (2)
- schema: data type, structure (description of possible content)
- changes rarely
- instance: content (objects and values)
- changes often
What are the main components of ER-model? (3)
- ER = Entity Relation
- entity
- relation
- 1:1 (same relation)
- 1:n (by FK)
- n:m additional table
- attributes
What is the idea of normalisation in context of relational DBs? What is the aim?
- splitting objects into tuples in different tables
- avoid redundancy while preserving dependencies
Explain the main operations of querying relational DBs.
Basic opations (5), dervied operations (4)
- basic
- cross-product: all combinations of S, R
- selection: all tuples of subset
- projection: only specific attributes
- union: S + R same schema required
- difference: S - R, same schame required
- derived
- intersection (union and differenc)
- join (selection of cross-product)
- natural join (shared attributes)
- division
What is SQL?
- desciptive query language for relational DB-systems
How to scale storage systems? (2) Name drawbacks. (2 + 1)
- vertical scaling: enlarge single machine
- limited in space
- expensive
- horizontal scaling: use clusters / grids of computers
- cluster maintenance
Why NoSQL came up?
What is characteristic for a NoSQL-system? (4)
- relational DBs don't scale well horizontally
- characteristics
- horizonal scalable
- not relational
- distributed
- schema-less
Explain the BASE-concept. What is it's main focus? Relate it to ACID.
- basically available (system generally available)
- soft state: data might be expired (many changes)
- eventual consistency: mostly consistent
- >> main focus: availability
- >> opposite to ACID regarding consisteny, availability
Name the levels of consistency (6)
- eventual consistency: write-operations not spread immediately
- monotonic read consistency: once read, never read older version
- read your own writes
- immediately consistency: not atomic
- strong consistency: atomic
- transactions: ACID
Name the two levels of consistency in NoSQL.
- data sharding / logical consistency: data consistent within itself
- data replication: data consistent across multiple replicas
Explain the CAP-theorem. (3+1)
- CAP
- consistency: same data for all clients
- availability: each client can read and write
- partition tolerance: system works well distributed
- >> any networked shared-data system can fulfil at most 2/3
Explain single possible systems in context of CAP-theorem.
- CP: only consistent nodes available
- AP: answers always, consistence not given
- AC: fully available, consistent but not distributed system
Name the four main NoSQL data models.
- key-value store
- document store
- wide column store
- graph databases
Explain the key-value store. (3) Relate it to CAP. Name a representative.
- value by key
- value: any format
- consistency: eventually to serizability
- >> CP, AP systems
- >> redis
Characterise document store. (2) Relate it to CAP. Name one representative.
- storing documents (XML, JSON) i.e. whole arrays
- semi-structured (non homogeneous structure)
- >> CP or AP system
- >> MongoDB
Explain wide column store. (3) Explain it's structure. Relate it to CAP. Name on representative.
- rows identified by keys
- # columns can differ for rows
- rows ordered by keys
- structure:
- many rows >> family
- many families >> key space
- >> CP or AP
- >> Cassandra
Explain graph databases (1). Relate it to CAP. Name one representative.
- storing of relationships: nodes, edges, properties
- >> AC-system
- >> neo4j
Name advantages of NoSQL-systems (4)
- flexible scaling
- less administration
- better economics
- flexible data models
What are drawbacks of NoSQL-systems?
- only semi-structured
- limited access control
- limited indexing
- CP or AP systems only
- AC not possible
Characterise distributed file systems in context of data analytics (3 + 1)
- data analytics requires all data
- well defined data (structure)
- parallelisation of methods
- >> distributed file system usefu l
What are challenges in context of distributed file systems (2)
- redundant files (in case of failure)
- divided computation of tasks
What is the parallel computing architecture? (2 + 1)
- nodes are stored on racks
- nodes are connected by network
- >> cluster computing