Data Mgmt Chärtli
Flashcard set details

| Flashcards | 81 |
|---|---|
| Language | English |
| Category | Computer Science |
| Level | University |
| Created / Updated | 31.05.2023 / 31.05.2023 |
| Weblink | https://card2brain.ch/box/20230531_datamgmt |
The CAP theorem states that it is impossible for a distributed system to simultaneously provide all three of the following guarantees: consistency, availability, and partition tolerance.
Different types of databases make different trade-offs between consistency, availability, and partition tolerance. For example, traditional RDBMS prioritize consistency over availability and partition tolerance, while NoSQL databases often prioritize availability and partition tolerance over strong consistency.
Sharding can improve partition tolerance by allowing data to be distributed across multiple nodes in the system. However, it can also make it more difficult to maintain strong consistency across all nodes.
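A minimal sketch of hash-based sharding (the names NUM_SHARDS and shard_for are invented for the example; real databases use more elaborate schemes such as range partitioning or consistent hashing):

```python
import hashlib

# Hash-based sharding sketch: each key is mapped deterministically to one of
# N shards. The shard dictionaries below stand in for separate nodes.
NUM_SHARDS = 4
shards = {i: {} for i in range(NUM_SHARDS)}

def shard_for(key: str) -> int:
    """Map a key to a shard using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key: str, value) -> None:
    shards[shard_for(key)][key] = value

def get(key: str):
    return shards[shard_for(key)].get(key)

put("user:42", {"name": "Alice"})
print(shard_for("user:42"), get("user:42"))
```

Any query or transaction that touches keys on different shards now needs cross-node coordination, which is exactly why strong consistency becomes harder to maintain.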
Eventual consistency is a property of distributed systems in which updates to data are propagated asynchronously and may take some time to be fully replicated across all nodes in the system. As a result, different nodes may have slightly different views of the data at any given time.
NoSQL databases often use techniques such as vector clocks or conflict resolution algorithms to reconcile conflicting updates and ensure that all nodes eventually converge on a consistent view of the data.
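A toy vector-clock sketch showing how concurrent (conflicting) updates can be detected; the node names and dictionary representation are illustrative, not any particular database's wire format:

```python
# Each replica increments its own counter on a local write. Comparing clocks
# tells us whether one update happened after another or whether the two are
# concurrent, i.e. a conflict that still has to be reconciled.

def increment(clock: dict, node: str) -> dict:
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated

def dominates(a: dict, b: dict) -> bool:
    """True if clock `a` reflects every event recorded in `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def concurrent(a: dict, b: dict) -> bool:
    return not dominates(a, b) and not dominates(b, a)

v1 = increment({}, "nodeA")   # {'nodeA': 1}
v2 = increment(v1, "nodeB")   # later than v1
v3 = increment(v1, "nodeC")   # concurrent with v2: conflict to resolve
print(dominates(v2, v1), concurrent(v2, v3))   # True True
```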
Eventual consistency can make it more difficult to reason about the state of the system at any given time, since different nodes may have different views of the data. It can also make it more difficult to enforce constraints or perform transactions that span multiple nodes.
Polyglot persistence refers to the practice of using multiple types of databases within a single application or system, each optimized for a specific type of data or workload.
Polyglot persistence allows developers to choose the best tool for each job, rather than trying to fit all data into a single database model. This can lead to better performance, scalability, and flexibility.
Polyglot persistence can add complexity to an application or system, since developers must manage multiple types of databases and ensure that they work together seamlessly. It can also make it more difficult to maintain consistency across different types of data.
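A minimal polyglot-persistence sketch, using sqlite3 as a stand-in for the transactional store and a plain dict as a stand-in for a key-value cache such as Redis (the class, table, and key names are invented):

```python
import sqlite3

# Orders go to a relational store (durable, transactional); short-lived
# session data goes to a key-value store (fast, ephemeral).
class Storefront:
    def __init__(self):
        self.orders_db = sqlite3.connect(":memory:")
        self.orders_db.execute(
            "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
        )
        self.session_cache = {}   # dict standing in for a key-value cache

    def place_order(self, customer: str, total: float) -> int:
        cur = self.orders_db.execute(
            "INSERT INTO orders (customer, total) VALUES (?, ?)", (customer, total)
        )
        self.orders_db.commit()
        return cur.lastrowid

    def remember_session(self, session_id: str, data: dict) -> None:
        self.session_cache[session_id] = data

shop = Storefront()
order_id = shop.place_order("Alice", 19.90)
shop.remember_session("sess-1", {"cart": [], "last_order": order_id})
print(order_id, shop.session_cache["sess-1"])
```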
A data warehouse is a large, centralized repository of data that is used for reporting and analysis. It typically contains historical data from multiple sources, organized in a way that makes it easy to query and analyze.
A data warehouse is optimized for read-heavy workloads and complex queries, while a transactional database is optimized for write-heavy workloads and simple queries. Data warehouses also typically contain denormalized or aggregated data, rather than raw transactional data.
Data warehouses are often used for business intelligence, reporting, and analytics. They can be used to answer questions such as "What were our sales by region last quarter?" or "Which products are most frequently purchased together?"
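A small sqlite3 sketch of the kind of aggregated, read-oriented query a warehouse serves; the table layout and values are made up for the example:

```python
import sqlite3

# Warehouse-style query: a denormalized sales table answered with an
# aggregate, e.g. "sales by region for a given quarter".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("EMEA", "2023-Q1", 120.0), ("EMEA", "2023-Q1", 80.0), ("APAC", "2023-Q1", 200.0)],
)

rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales WHERE quarter = ? GROUP BY region",
    ("2023-Q1",),
).fetchall()
print(rows)   # e.g. [('APAC', 200.0), ('EMEA', 200.0)]
```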
Data integration refers to the process of combining data from multiple sources into a single, unified view. This can involve tasks such as cleaning and transforming the data, resolving conflicts between different sources, and ensuring that the resulting dataset is consistent and accurate.
Data integration can be challenging due to differences in format, structure, and semantics between different sources of data. It can also be difficult to ensure that the resulting dataset is complete and accurate.
Common tools or techniques used for data integration include ETL (extract-transform-load) processes, master data management systems, and semantic mapping tools.
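A minimal ETL sketch in Python (the CSV source, field names, and target table are invented for the example):

```python
import csv
import io
import sqlite3

# Extract customer rows from CSV, transform them into a consistent shape
# (trimmed names, upper-cased country codes), and load them into a target table.
raw_csv = "id,name,country\n1, Alice ,ch\n2,Bob,DE\n"

# Extract
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: normalize whitespace and country codes
clean = [(int(r["id"]), r["name"].strip(), r["country"].strip().upper()) for r in rows]

# Load
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
target.executemany("INSERT INTO customers VALUES (?, ?, ?)", clean)
target.commit()
print(target.execute("SELECT * FROM customers").fetchall())
```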
Traditional databases are designed to handle structured data, while Big Data databases are designed to handle unstructured or semi-structured data. Big Data databases also typically use distributed computing to process large amounts of data.
Some challenges associated with Big Data include storing and processing large amounts of data, ensuring data quality, and dealing with unstructured or semi-structured data.
Hadoop is an open-source software framework that allows for distributed storage and processing of large datasets across clusters of computers. It uses the Hadoop Distributed File System (HDFS) for storage and MapReduce for processing.
HDFS is a distributed file system that provides high-throughput access to application data. It is designed to store very large files across multiple machines in a cluster.
HDFS stores data by breaking it into blocks and replicating those blocks across multiple machines in a cluster. This allows for fault tolerance and high availability of the data.
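A toy sketch of block splitting and replica placement. The tiny block size and round-robin placement are illustrative only; real HDFS defaults to 128 MB blocks, a replication factor of 3, and rack-aware placement:

```python
# Split a file into fixed-size blocks and assign each block to several nodes,
# so the data survives the loss of any single machine.
BLOCK_SIZE = 8            # bytes, unrealistically small so the example is visible
REPLICATION = 3
NODES = ["node1", "node2", "node3", "node4"]

def place_blocks(data: bytes):
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    placement = []
    for idx, block in enumerate(blocks):
        replicas = [NODES[(idx + r) % len(NODES)] for r in range(REPLICATION)]
        placement.append((idx, block, replicas))
    return placement

for idx, block, replicas in place_blocks(b"hello distributed file systems"):
    print(f"block {idx}: {block!r} -> {replicas}")
```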
Using commodity hardware in a Hadoop cluster can be more cost-effective than using specialized hardware, as it allows for scaling out by adding more commodity machines as needed.
MapReduce is a programming model used for processing large datasets in parallel across clusters of computers. It consists of two phases: a map phase and a reduce phase.
In the map phase, data is divided into smaller chunks and processed in parallel across multiple machines. In the reduce phase, the results from the map phase are combined to produce a final output.
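A single-process Python sketch of the classic MapReduce word-count pattern; in a real cluster the map and reduce tasks run in parallel on different machines, and the function names here are illustrative:

```python
from collections import defaultdict

def map_phase(chunk: str):
    """Emit (word, 1) for every word in one input chunk."""
    for word in chunk.split():
        yield word.lower(), 1

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into the final result."""
    return key, sum(values)

chunks = ["the quick brown fox", "the lazy dog", "the fox"]
intermediate = [pair for chunk in chunks for pair in map_phase(chunk)]
result = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(result)   # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```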
Some examples of applications that use MapReduce include search engines, social media analytics, and log processing.
Apache Spark is an open-source distributed computing system used for processing large datasets. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Spark is faster than Hadoop MapReduce because it keeps intermediate data in memory between operations, while Hadoop MapReduce writes intermediate results to disk after each map and reduce step. Spark also provides a wider range of APIs and supports more programming languages than Hadoop MapReduce.
Some advantages of using Spark over MapReduce include faster processing times, better support for iterative algorithms, and a wider range of APIs and programming languages.
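A short PySpark sketch of the same word count, assuming the pyspark package and a local Spark runtime are available; the cache() call illustrates keeping an RDD in memory for reuse across actions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("wordcount-sketch").getOrCreate()

lines = spark.sparkContext.parallelize(["the quick brown fox", "the lazy dog", "the fox"])
counts = (
    lines.flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
         .cache()                      # keep the result in memory for reuse
)

print(counts.collect())                                 # first action
print(counts.filter(lambda kv: kv[1] > 1).collect())    # second action reuses the cached data
spark.stop()
```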
NoSQL stands for "not only SQL" and refers to a class of databases that do not use the traditional SQL relational database model. Instead, they use other data models such as key-value, document-oriented, or graph-based models.
NoSQL was developed to handle the large amounts of unstructured or semi-structured data that traditional SQL databases were not designed to handle. It also provides more flexibility in terms of data modeling and scalability.
Some examples of NoSQL databases include MongoDB (document-oriented), Cassandra (column-family), and Neo4j (graph-based).
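A minimal document-store sketch using pymongo, MongoDB's Python driver; the connection string, database, and collection names are assumptions for the example:

```python
from pymongo import MongoClient

# Assumes a MongoDB server is reachable at the connection string below.
client = MongoClient("mongodb://localhost:27017")
products = client["shop"]["products"]

# Documents in one collection need not share a fixed schema.
products.insert_one({"name": "keyboard", "price": 49.0, "tags": ["usb", "mechanical"]})
products.insert_one({"name": "ebook", "price": 9.9, "download_url": "https://example.org/x"})

# Query by field value, much like a WHERE clause but against JSON-like documents.
cheap = list(products.find({"price": {"$lt": 20}}))
print(cheap)
```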