DataMgmt FS23
Deck Details

| Cards | 99 |
|---|---|
| Language | English |
| Category | Computer Science |
| Level | University |
| Created / Updated | 11.09.2023 / 15.10.2023 |
| Web link | https://card2brain.ch/box/20230911_datamgmt_fs23 |
Polyglot Persistence is the practice of using multiple data storage technologies to cater to different data storage needs across an enterprise or within a single application.
An RDBMS organizes data in relations as tuples, with links between tuples established through primary and foreign keys. However, RDBMSs have limitations: they lack support for complex data modeling and versioning, scale poorly, and struggle with data and schema evolution.
ACID stands for Atomicity, Consistency, Isolation, and Durability. Atomicity ensures that a transaction is all or nothing. Consistency requires data to be in a consistent state before and after a transaction. Isolation prevents interference from other processes during a transaction, and Durability ensures changes made by a transaction persist.
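A minimal transaction sketch in PostgreSQL-flavored SQL illustrates these guarantees; the accounts table and its columns are hypothetical:

```sql
-- Atomicity: either both updates succeed or neither does.
BEGIN;

UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;

-- Durability: once COMMIT returns, the transfer survives a crash.
COMMIT;
```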
Modern SQL is a standardized query language that supports ACID compliance, nested and aggregated structures, hierarchic and recursive queries, and distributed processing. It aims to maintain the relational model while supporting user-defined types as objects.
NoSQL databases address the needs of Web 2.0, including large datasets, write access, permanent availability, and polyglot persistence. They have characteristics like horizontal scalability, weaker concurrency/transaction models, schema-free design, and use of distributed indexes. They were developed to overcome the limitations of traditional RDBMS in handling these new requirements.
Common types of NoSQL databases include Key/Value databases, Document stores, Column-Oriented databases, and Graph databases. Key/Value databases store data in a dictionary format, Document stores use flexible schemas, Column-Oriented databases are efficient for analytics, and Graph databases manage complex relationships.
When choosing a database technology, factors to consider include the use case (read/write, transactional/analytical), single-user or multi-user requirements, data quantities (small, medium, or big data), data structure (NoSQL or multi-model), and existing database management systems and in-house expertise.
In practice, Polyglot Persistence involves selecting the appropriate database technology for each component of an application or enterprise system based on its specific requirements. For example, you might use a document store for unstructured data, a relational database for structured data, and a graph database for managing complex relationships.
A suggested approach is to use a multi-model database like PostgreSQL as a workhorse for transactional and multi-user needs. Additionally, employ specialized databases for analytical queries (non-transactional, single user) to optimize performance. This two-database approach allows you to leverage the strengths of different database technologies for different aspects of your project while avoiding the pitfalls of a one-size-fits-all solution.
Modern SQL includes features such as Recursive CTE (for tree and graph queries), Window Functions (for analytics and sequential processing), enhanced Aggregation functions (GROUP BY), and various data types/structures like Time and Interval, Enumerated Values, Arrays, Key/Values, Documents (JSON, XML), Trees (JSON, XML), and Graphs. These features enhance the capabilities of SQL for handling complex data scenarios.
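As a brief sketch of one of these features, a window function computing a running total, assuming a hypothetical sales table with region, sale_date, and amount columns:

```sql
-- Running total per region, ordered by date.
SELECT region,
       sale_date,
       amount,
       SUM(amount) OVER (PARTITION BY region ORDER BY sale_date) AS running_total
FROM sales;
```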
Relational Algebra serves as a theoretical foundation for relational databases by defining operators that transform one or more input relations into an output relation. It provides a set of operations to perform selections, projections, unions, set differences, set intersections, and renaming of data in tables, enabling the manipulation and retrieval of data from relational databases.
The Selection operator (σ) is used to select specific tuples from relations that meet certain criteria. For example, σ(C > 3)R would select tuples from relation R where the value in column C is greater than 3. It filters the rows based on a specified condition and returns only those that satisfy it.
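The SQL counterpart of selection is the WHERE clause; a sketch assuming R is stored as a table:

```sql
-- Relational algebra: σ(C > 3)R
SELECT * FROM R WHERE c > 3;
```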
The Projection operator (π) is used to extract specific columns from a relation. It returns a relation with only the columns specified, removing duplicate data in the process. For example, if we want columns B and C from relation R, π(B, C)R would return a new relation with only columns B and C from R. It helps in selecting specific attributes of a relation for further processing or analysis.
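In SQL, projection corresponds to the select list; DISTINCT mirrors the operator's duplicate elimination (again assuming R is stored as a table):

```sql
-- Relational algebra: π(B, C)R
SELECT DISTINCT b, c FROM R;
```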
The Union operator (∪) combines two relations to produce a new relation that contains all unique tuples from both input relations. The constraint is that both input relations must have the same set of attributes (union compatibility). For example, if we have two relations FRENCH and GERMAN, the query π(Student_Name)FRENCH ∪ π(Student_Name)GERMAN would return all unique student names from both relations. The Union operator combines data from multiple relations while eliminating duplicates.
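The SQL counterpart, assuming FRENCH and GERMAN are stored as tables with a student_name column:

```sql
-- Relational algebra: π(Student_Name)FRENCH ∪ π(Student_Name)GERMAN
SELECT student_name FROM french
UNION
SELECT student_name FROM german;
```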
The Set Difference operator (-) in Relational Algebra returns a new relation that contains tuples from the first input relation that are not present in the second input relation. For example, if we have two relations FRENCH and GERMAN, the query π(Student_Name)FRENCH - π(Student_Name)GERMAN would return student names that are in the FRENCH relation but not in the GERMAN relation. The Set Difference operator is used to find the elements unique to one relation when compared to another.
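In SQL, set difference is expressed with EXCEPT (same table assumptions as above):

```sql
-- Relational algebra: π(Student_Name)FRENCH - π(Student_Name)GERMAN
-- EXCEPT is the SQL counterpart of set difference (MINUS in Oracle).
SELECT student_name FROM french
EXCEPT
SELECT student_name FROM german;
```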
The Rename operator (ρ) in Relational Algebra is used to change the name of an attribute or column in a relation. It renames the specified attribute with a new name, allowing for aliasing and making queries more readable. For example, ρ(a/b)R would rename the attribute 'b' to 'a' in the relation R. It is useful for creating more descriptive or user-friendly column names in query results.
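In SQL, renaming is done with AS aliases; a sketch against the same hypothetical table R:

```sql
-- Relational algebra: ρ(a/b)R — rename attribute b to a.
SELECT b AS a FROM R;

-- Renaming a whole relation works the same way:
SELECT r2.b FROM R AS r2;
```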
The Cross Product operator (×) combines two relations by generating all possible combinations of tuples between the two input relations. If relation A has n tuples and relation B has m tuples, the result of A × B will have n * m tuples. Each tuple in the result combines attributes from both input relations. The Cross Product is used for creating Cartesian products, which can lead to large result sets when used with large relations.
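The SQL counterpart, assuming A and B are stored as tables:

```sql
-- Relational algebra: A × B — every tuple of A paired with every tuple of B.
SELECT * FROM A CROSS JOIN B;
-- Equivalent comma syntax: SELECT * FROM A, B;
```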
Common Table Expressions (CTEs) in SQL are temporary named result sets that can be used as alternatives to derived tables (subqueries), views, and inline user-defined functions. Recursive CTEs enable declarative queries over tree and graph structures and are commonly used for hierarchical data, such as organizational hierarchies or nested categories, which non-recursive SQL alone cannot traverse.
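A sketch of a recursive CTE walking an organizational hierarchy, assuming a hypothetical employees table with id, name, and manager_id columns:

```sql
WITH RECURSIVE org AS (
    SELECT id, name, manager_id, 1 AS depth
    FROM employees
    WHERE manager_id IS NULL          -- anchor: the root of the tree
    UNION ALL
    SELECT e.id, e.name, e.manager_id, org.depth + 1
    FROM employees e
    JOIN org ON e.manager_id = org.id -- recursive step: direct reports
)
SELECT * FROM org ORDER BY depth;
```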
SQL databases support various data types and structures including Arrays, Key/Values, Documents (JSON, XML), Trees (JSON, XML), and Graphs. These data types allow SQL databases to handle a wide range of data scenarios beyond traditional relational models. For example, Arrays are used for storing lists of values, JSON documents enable flexible schema designs, and Graphs are used to represent complex relationships in data.
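A PostgreSQL-flavored sketch combining these types in one hypothetical table:

```sql
CREATE TABLE products (
    id      serial PRIMARY KEY,
    name    text NOT NULL,
    tags    text[],          -- array type for a list of values
    details jsonb            -- flexible schema-free document
);

INSERT INTO products (name, tags, details)
VALUES ('Laptop', ARRAY['electronics', 'portable'],
        '{"cpu": "8-core", "ram_gb": 16}');

-- Query into the document and the array.
SELECT name, details->>'cpu' AS cpu
FROM products
WHERE 'portable' = ANY(tags);
```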
NoSQL databases are characterized by principles like scalability (both scaling up and out), the CAP theorem, and the BASE model. They prioritize flexible schema designs, high availability, and the ability to handle large volumes of data. The CAP theorem states that a distributed database system can achieve at most two of three properties: Consistency, Availability, and Partition Tolerance. BASE (Basically Available, Soft state, Eventually consistent) represents a set of properties that contrast with the strict ACID guarantees of traditional RDBMS.
NewSQL is a category of RDBMS that aims to combine the scalability of NoSQL systems with the ACID guarantees of traditional databases, primarily for OLTP workloads. It focuses on features like easy partitioning, high availability, and web-friendly capabilities. NewSQL databases provide horizontal scalability while maintaining the strong transactional consistency associated with ACID properties, making them suitable for modern, distributed applications where data consistency is crucial.
GraphQL is a data query and manipulation language that allows clients to declaratively define the structure of the data they need from a server. It prevents over-fetching or under-fetching of data by specifying exactly what data is required. Projects like PostgREST, PostGraphQL, and Hasura support GraphQL in conjunction with PostgreSQL, enabling efficient and precise data retrieval from databases. GraphQL is particularly useful in modern API design, as it empowers clients to request only the data they need, improving performance and reducing network load.
Parallel and distributed databases tackle performance challenges with large datasets by increasing I/O bandwidth, partitioning data, parallelizing data access, and leveraging multiple nodes in parallel. Parallel databases optimize operations on a single multiprocessor system, while distributed databases store data across multiple sites, each managed by a DBMS. Techniques include data partitioning/fragmentation, replication for availability, and load balancing to distribute query workloads efficiently.
Data partitioning/fragmentation involves dividing a relation into smaller fragments stored across different sites, enhancing scalability and manageability in distributed databases. It can, however, hurt availability: if a site fails, the fragments it holds become unavailable. Proper partitioning strategies and redundancy mechanisms (such as replication) are necessary to mitigate this risk and keep data accessible.
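As an illustration of splitting a relation into fragments, a sketch of PostgreSQL's declarative range partitioning with a hypothetical measurements table; this partitions within a single instance, whereas spreading fragments across sites additionally requires a distributed DBMS:

```sql
CREATE TABLE measurements (
    logdate date NOT NULL,
    value   numeric
) PARTITION BY RANGE (logdate);

-- Each fragment is a separate table holding one date range.
CREATE TABLE measurements_2023 PARTITION OF measurements
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');

CREATE TABLE measurements_2024 PARTITION OF measurements
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
```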
Replication involves maintaining multiple copies of a relation or its fragments across different nodes, increasing data availability and speeding up queries. Replication can be synchronous or asynchronous and can limit write access to a single node (Master-Slave) or distribute it across multiple nodes (Master-Master). Replication strategies vary based on the desired balance between data consistency, availability, and performance, with Master-Slave and Master-Master replication being common approaches.
Load balancing in distributed databases distributes query workloads evenly across multiple nodes, preventing overloading of specific nodes and optimizing resource utilization. It ensures efficient use of available computing resources and improves response times for queries. Load balancing strategies consider factors such as node capacity, query complexity, and network conditions to intelligently distribute queries and maintain system performance and availability. Load balancing is essential for achieving high scalability and responsiveness in distributed systems.
Software optimizations can boost database performance through techniques like column-oriented database architectures, compression-aware databases, and the use of specialized database systems like VoltDB. Column stores optimize data storage and retrieval for analytics, while compression-aware databases reduce storage requirements. VoltDB, for example, offers high-performance database access through pre-compiled Java stored procedures and in-memory processing, making it suitable for real-time applications demanding both speed and scalability.
Cloud database systems offer benefits like scalability, accessibility, cost-effectiveness, and quick availability. They provide services such as Software as a Service (SaaS) and secure data access. However, when adopting cloud databases, considerations include data privacy and compliance, as cloud providers may have varying standards. Concerns about corporate giants like GAFAM dominating open-source technologies without adequate contributions are also important. On-premises solutions may be considered for data storage alternatives in some cases.
Database optimization plays a crucial role in improving performance by fine-tuning various aspects, such as hardware configuration, SQL queries, statistics, indexes, schema design, and architecture. Factors to consider include adapting the database configuration to hardware, ensuring efficient SQL queries, maintaining up-to-date statistics, implementing appropriate indexes, and evaluating the overall database schema and architecture. Optimization aims to align the database with the specific requirements and constraints of the application and infrastructure.
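A sketch of a typical tuning loop in PostgreSQL-flavored SQL, with a hypothetical orders table:

```sql
-- Inspect the actual execution plan of a slow query.
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 42;

-- If the plan shows a sequential scan, an index may help.
CREATE INDEX idx_orders_customer ON orders (customer_id);

-- Refresh planner statistics after large data changes.
ANALYZE orders;
```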
Transactional Processing primarily deals with numerous short everyday operations, where machines make decisions. On the other hand, Analytical Processing involves specialized queries on analytical database systems, geared towards supporting future business decisions, where humans make decisions. The key difference lies in the nature of operations and their respective goals, with transactional processing handling routine tasks and analytical processing focused on aiding strategic decision-making.
A Decision Support System (DSS) aims to enhance enterprise management by gathering both current and historical data from various operational systems. It reduces this data by generating summaries and extracting meaningful insights. DSS assists management in understanding and influencing the enterprise's behavior, providing comprehensive views of aspects like accounting, controlling, and generating reports. It empowers decision-makers with the information needed to make informed choices and mitigate risks associated with incomplete or uncertain data.
Information-oriented DSS includes Corporate Performance Management, which offers decision-oriented indicators like turnover, cash flow, ratios (e.g., ROI), and individual performance metrics. It also encompasses Scorecard Systems that aggregate indicators hierarchically into Key Performance Indicators (KPIs) and balanced scorecards. Balanced Scorecards integrate several perspectives (financial, customer, learning & growth, internal processes) to provide a holistic view of an enterprise's value chain.
Analysis-oriented DSS, particularly in Marketing & Sales Controlling, facilitates revenue calculation, efficiency assessment, and resource allocation to gain or maintain a competitive advantage. It helps determine the optimal allocation of resources and input factors that can maximize profit. By analyzing data related to marketing and sales efforts, this system empowers organizations to make data-driven decisions to enhance their market presence and profitability.
DSS extends beyond actual figures and incorporates planned figures for comparison. This facilitates effective decision-making by highlighting areas where management measures can be applied most effectively. Planning scenarios are developed within DSS to provide management with insights into different possible outcomes based on varying assumptions. These scenarios enable proactive planning and decision-making, helping organizations adapt to changing circumstances and make informed choices.
Challenges with analytical systems include data integration complexities, difficulties in querying data from across an organization, and the need to accommodate new data sources due to market or organizational changes. These challenges arise from the growing demand for comprehensive analysis that involves querying data from diverse sources and adapting to evolving business needs. Successfully addressing these challenges is crucial for effective decision support and business intelligence.
Prominent BI tools include Tableau, Microsoft PowerBI, and open-source options like Apache Superset. BI tools are vital for decision-making success due to their data visualization and publication capabilities. Visualization enables data analysis and sharing with others in a user-friendly format. BI tools empower decision-makers with actionable insights by presenting data in a comprehensible manner, making it easier to derive valuable information from complex datasets.
Analytical queries involve calculations and aggregations over data. Grouping and windowing constructs, including GROUP BY, GROUPING SETS, CUBE, ROLLUP, and window functions, play a crucial role in analytical queries. They enable calculations like SUM() and AVG() as defined in the SQL standards (aggregates since SQL-92, grouping extensions since SQL:1999). These constructs provide advanced capabilities for data analysis, making it possible to perform complex calculations efficiently within a single query or a few queries, without requiring domain-specific programming skills.
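A sketch of the grouping extensions, assuming a hypothetical sales table with region, product, and amount columns:

```sql
-- Per-group subtotals plus per-region totals plus a grand total.
SELECT region, product, SUM(amount) AS revenue
FROM sales
GROUP BY ROLLUP (region, product);
-- CUBE (region, product) would additionally add per-product totals,
-- and GROUPING SETS lets you enumerate the groupings explicitly.
```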
Analytic functions are valuable for performance indicator calculations in data warehouses, data lakehouses, and decision support systems. They enable complex business or statistical calculations to be executed with just a few queries, without the need for specialized programming skills. The calculations are performed by the DBMS, often in parallel, leading to faster query execution. Additionally, analytic functions reduce the need for excessive access to secondary memory, enhancing overall query performance.
Separating OLAP (Online Analytical Processing) from OLTP (Online Transaction Processing) is crucial to avoid disrupting business-critical operations and ensure data integrity. OLAP systems have their dedicated database systems fed with data from operational databases. Data Warehouses are centralized repositories optimized for querying and analyzing structured data. In contrast, Data Lakehouses are hybrid storage solutions that combine data lakes and data warehouses, designed to store and analyze both structured and unstructured data, typically built on cloud-based platforms.
Online Analytical Processing (OLAP) is an approach to data processing that is performed online, in real-time, and focuses on analytical tasks. Unlike transactional processing, which handles everyday operations, OLAP deals with analytical operations aimed at supporting decision-making processes. OLAP systems enable users to interactively explore and analyze multidimensional data to derive insights and make informed decisions.