Databases and Data Warehousing
This is an introduction to databases with emphasis on data warehousing.
This is an introduction to databases with emphasis on data warehousing.
Set of flashcards Details
| Flashcards | 26 | 
|---|---|
| Language | English | 
| Category | Computer Science | 
| Level | Other | 
| Created / Updated | 04.03.2021 / 04.03.2021 | 
| Weblink | 
                                
                                
                                https://card2brain.ch/cards/20210304_databases_and_data_warehousing
                             | 
| Embed | 
                                
                                
                                <iframe src="https://card2brain.ch/box/20210304_databases_and_data_warehousing/embed" width="780" height="150" scrolling="no" frameborder="0"></iframe>
                             | 
What are the two main architectures of databases?
- Shared Everything (SMP, Symmetric Multiprocessing Architecture)
- Shared Disk (SMP multi-node)- Two problems with this system: Distributed shared manager:- Applies locks and shares data blocks.
- Everything comes from one shared storage. Pipeline can be a bottleneck.
 
- examples: Oracle RAC, DB2 Pure Scale
 
- Two problems with this system: Distributed shared manager:
- Shared Nothing (MPP, Massively Parallel Processing)- Exasol, IBM Netica, Teradata, SAP HANA, Greenplum
 
Data distribution of a MPP system.
The goal is that all nodes work equally.
For a MPP system one would rather not distribute the data by date. The reason is that data is often requested in form of timeintervals, meaning that some nodes are idle.
Aggregations on a MPP system can be often done without any data exchange between the nodes.
Joins are more complex. Local joins are possible if the tables to be joined have the same distribution key. Otherwise one get a global join, which needs redistribution of data between the nodes.
The resdistribution overhead is critical to make the system valueable.
What means ACID?
Stands for Atomicity, Consistency, Isolation, Durability;
ACID gives guiding principles that can ensure database transactions can be processed reliably.
- Atomicity: Essentially, an atomic transaction ensures that any commit you make finishes the entire operation successfully or, in cases of the connection being lost in the middle of an operation, the database is rolled back to its state prior to the commit being initiated.
- Consistency: refers to maintaining data integrity constraints. If a databse enters into an illegal state the process will be aborted and changes roll back to their previous, legal state.
- Isolation: Isolated transactions are considered to be "serializeable". Multiple transaction at the same time can only occur as long as the have no possibility to influence each other.
- Durability: ensures that changes made to the database (transactions) that are successfully commited will survive permanently.
MPP System basically avoid Overhead necessary for Online Transaction Processing (OLTP). What does this mean?
- minimal locking: no row locks but table locks only- MPP is not approbriate for e-business systems, where constantly new entries need to be implemented on a table.
- we have only bulk loads and do then queries
 
- column oriented design and compression
- small table redundancy replication instead of broadcasting
- no Redo-/Archive Logs: no point-in -time restore as we are used from a Oracle database. We actually don't need that because we can roll back.
- update = delete+insert
What is a database?
A databse system is a data management system that follows the ACID standard.
It consists of two important components. The database management system (DBSM) and the actuual database.
What is a relational database?
enable a structured storage of data.
relational databases use SQL
it is designed with the ER Diagram (Entity Relationship Diagram), which helps to explain the logical structure of databases.
What is a NoSQL database?
Relatively young technology that stores data in a unstructured form.
What are the advantages of database systems?
- Avoidance of redundancy.
- DBSM systems allow for complex data analysis methods.
- DBSM systems are request-optimized. data can be read very fast.
What is the database design?
Databse design is the organization of data according to a database model.
What is the database model?
Is the theoretical foundation of the database. It consists of three important components:
- generic data structure
- generic operators
- integrity constraints
The relational database is a database model.
What is the database architecture?
Explains details to the project. For instance, if the database is established on a single computer or on a server.
A Stand-Alone Database system does not need to worry about concurrency situations, and thus can be much simpler than a Multi-User Database system.
Optimistic vs. Pessimistic
Optimistic Locking: All users can carry out all actions and the DBSM decides which data is available to which user.
Pessimistic Locking: All users can carry out all actions. However, the database locks the tables for other users over this time.
The optimistic lock is much more productive, but also much more complex.
What is the three-tier architecture for DBMS?
Most widely used architecture to design a DBSM.
Database (Data) Tier: The databse resides along with its query processing languages. We also have the relations that define the data and their constraints at this level.
Application (Middle) Tier: At this tier resides the application server and the programs that access the database. End-users are unaware of any existence of the database beyond the application tier. On the other hand, the database tier is not aware of any other user beyond the application tier.
User (Presentation) Tier: End-users operate on this tier and they know nothing about any existence of the database beyond this layer.
Name the most important terms in data modelling.
- entity: An entity is an object in the actual world, and that can be identified from other objects. In a database, a entity is defined by a set of attributes.
- entity set: An entity set is a group of entities that posses the same set of attributes. Entities are always distinguishable since entities will always differ by some attributes.
- relation (table): encloses the entity and the corresponding entity set.
- tuple (data row): represents all attributes of an entity within an entity set.
How does the ERM work?
The Entity-Relationship-Model basically consists of three components, which of the logical structure of a database can be designed.
- the entity (sympolized by a rectangle)
- the relation (symbolized by a diamond)
- the attribute (symbolized by an attribute)
A relation can mean a one-to-many or a many-to-many relation.
What is the difference between a clustered and a non-clustered index?
Clustered index:
- Cluster index is a type of index which sorts the data rows in the table on their key values. In the database, there is only one clustered index per table.
Non-clustered index:
- a non-clustered index stores the data at one location and indices at another location. The index contains pointers to the location of that data. A single table can have many non-clustered indexes as an index in the non-clustered index is stored in different places.
What is the primary key?
The primary key is used for unambigious identification of a set of entities. In a normalized database all tables have a primary key.
We distinguish between unambigious, composed and artificial primary keys. Ideally, there exists unambigious attributes (one or several), which can be used as a primary key. Otherwise, an artificial primary key is necessary.
What is a foreign key?
While a primary key may exist on its own, a foreign key must always reference to a primary key somewhere. The original table containing the primary key is the parent table (also known as the referenced table). This key can be referenced by multiple foreign keys from other tables, known as "child" tables. For any column acting as a foreign key, a corresponding value should exist in the linked table.
Which relations between tables exist?
1:1 relation: Should be avoided because tables can be unified. Makes only sense if a very complex table is supposed to be splitted or parts of a table are isolated because of accesibility restrictions.
1:n relation: ...
m:n relation: The many-to-many relation needs a intermediate/connective table, that consists of both foreign keys and a primary key. The primary key is formed by the foreign keys. Actually, a many-to-many relation is build of two one-to-many relations.
What means normalization of a database?
Normalization of a relational databases means the splitting of attributes into multiple tables to avoid redundancies.
For normalization clear rules are defined. Their exists five forms of normalization, whereas the third form is often sufficient to get the ideal result.
Explain the normal forms for normalization of databases.
0-NF: database is unnormalized.
1-NF: information is fully atomic
2-NF: to conform to 2-NF and remove duplicates, every non candidate-key attribute must depend on the whole candidate key.
3-NF: removing further all transitive functional dependencies in order to further reduce duplicates.
Their exists higher normalization forms, which are usually automatically satisfied for most database designs.
Which dependencies exists for normalization?
We distinguish between the functional, the fully functional and the transitive dependency.
functional dependency: for every attribute X exists exactly one Y.
fully functional dependency: every non-key attribute is not only partially functional dependent to a composed key attribute, but from all parts of the key attribute.
Explain the different normal forms of normalization.
1-NF: The values in each column of a table must be atomic.
2-NF: To conform to 2-NF and remove duplicates, every non candidate-key attribute must depend on the whole candidate key, not just part of it.
3-NF: Removing all transitive functional dependencies to further decrease redundancy.
What is referential data integrity?
Data rows that consists of a foreign key can be added only, if the value for the foreign key in the reference table is unique.
List components of the data warehouse and explain them.
Operative System or Online Transaction Processing (OLTP): Different parties of a company generate data on the Operative System (OLTP), which are extracted in regular intervals. Before they are incorporated in the data warehouse, they are stored in intermediate relational databases for further processing.
Data Staging Area: Means the complete ETL (computing, extract, transform, load) process, which lead to the loading of data into the data warehouse.
Operational Data Store (ODS): Exists because of the limitationsn of a data warehouse. It is responsible only for analytical queries.
Dashboards: presents meaningful statistics of the data and is close to the user.
Business Intelligence Board: Similar to Dashboards: content of a datawarehouse is graphically displayed.
Was ist die Granularität von Daten?
Die Granularität (lateinisch „granum“, übersetzt etwa mit „Körnigkeit“) von Daten gibt Auskunft über deren Verdichtungsgrad.[1] So werden Daten oft nach räumlichen (wie z. B. geografischen), und zeitlichen wie auch organisatorischen Aspekten verdichtet, d. h. zusammengefasst oder aggregiert. Die Verdichtung numerischer Einzeldaten kann mittels Addition, Mittelwertbildung oder anderen Rechenoperationen wie Min- oder Maximumbildung geschehen oder indem mehrere Objekte zu einem neuen zusammengeführt werden.
Liegen z. B. Umsatzzahlen eines international agierenden Handelsunternehmens auf Ebene der einzelnen Filialen vor, so können die Umsatzzahlen auf Stadt, Landes- und Kontinent-Ebene verdichtet werden. Die feingranularen Daten auf Filial-Ebene werden zu grobgranularen Daten auf Kontinent-Ebene. Die Ebene Filiale liefert die feinste Granularität der Umsatzdaten, hat aber den höchsten Detaillierungsgrad, während die Ebene Kontinent die gröbste Granularität und den niedrigsten Detaillierungsgrad aufweist.
Granularität von Daten spielt besonders beim Data-Warehousing eine Rolle. Die im Data-Warehouse hinterlegten Daten können z. B. mittels OLAP nach unterschiedlichen Aspekten ausgewählt und auf verschiedenen Verdichtungsebenen mit entsprechendem Verdichtungsgrad zusammengefasst und präsentiert werden.
 
                 
                 
                 
                 
                 
                 
                 
                 
                