Data Mgmt Chärtli
Flashcard set details

| Flashcards | 81 |
|---|---|
| Language | English |
| Category | Computer Science |
| Level | University |
| Created / Updated | 31.05.2023 / 31.05.2023 |
| Weblink | https://card2brain.ch/box/20230531_datamgmt |
The CAP theorem states that it is impossible for a distributed system to simultaneously provide all three of the following guarantees: consistency, availability, and partition tolerance.
Different types of databases make different trade-offs between consistency, availability, and partition tolerance. For example, traditional RDBMS prioritize consistency over availability and partition tolerance, while NoSQL databases often prioritize availability and partition tolerance over strong consistency.
Sharding can improve partition tolerance by allowing data to be distributed across multiple nodes in the system. However, it can also make it more difficult to maintain strong consistency across all nodes.
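A minimal sketch of hash-based sharding (the names NUM_SHARDS and shard_for are invented for the example; real databases use more elaborate schemes such as range partitioning or consistent hashing):

```python
import hashlib

# Hash-based sharding sketch: each key is mapped deterministically to one of
# N shards. The shard dictionaries below stand in for separate nodes.
NUM_SHARDS = 4
shards = {i: {} for i in range(NUM_SHARDS)}

def shard_for(key: str) -> int:
    """Map a key to a shard using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key: str, value) -> None:
    shards[shard_for(key)][key] = value

def get(key: str):
    return shards[shard_for(key)].get(key)

put("user:42", {"name": "Alice"})
print(shard_for("user:42"), get("user:42"))
```

Any query or transaction that touches keys on different shards now needs cross-node coordination, which is exactly why strong consistency becomes harder to maintain.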
Eventual consistency is a property of distributed systems in which updates to data are propagated asynchronously and may take some time to be fully replicated across all nodes in the system. As a result, different nodes may have slightly different views of the data at any given time.
NoSQL databases often use techniques such as vector clocks or conflict resolution algorithms to reconcile conflicting updates and ensure that all nodes eventually converge on a consistent view of the data.
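A toy vector-clock sketch showing how concurrent (conflicting) updates can be detected; the node names and dictionary representation are illustrative, not any particular database's wire format:

```python
# Each replica increments its own counter on a local write. Comparing clocks
# tells us whether one update happened after another or whether the two are
# concurrent, i.e. a conflict that still has to be reconciled.

def increment(clock: dict, node: str) -> dict:
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated

def dominates(a: dict, b: dict) -> bool:
    """True if clock `a` reflects every event recorded in `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def concurrent(a: dict, b: dict) -> bool:
    return not dominates(a, b) and not dominates(b, a)

v1 = increment({}, "nodeA")   # {'nodeA': 1}
v2 = increment(v1, "nodeB")   # later than v1
v3 = increment(v1, "nodeC")   # concurrent with v2: conflict to resolve
print(dominates(v2, v1), concurrent(v2, v3))   # True True
```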
Eventual consistency can make it more difficult to reason about the state of the system at any given time, since different nodes may have different views of the data. It can also make it more difficult to enforce constraints or perform transactions that span multiple nodes.
Polyglot persistence refers to the practice of using multiple types of databases within a single application or system, each optimized for a specific type of data or workload.
Polyglot persistence allows developers to choose the best tool for each job, rather than trying to fit all data into a single database model. This can lead to better performance, scalability, and flexibility.
Polyglot persistence can add complexity to an application or system, since developers must manage multiple types of databases and ensure that they work together seamlessly. It can also make it more difficult to maintain consistency across different types of data.
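A minimal polyglot-persistence sketch, using sqlite3 as a stand-in for the transactional store and a plain dict as a stand-in for a key-value cache such as Redis (the class, table, and key names are invented):

```python
import sqlite3

# Orders go to a relational store (durable, transactional); short-lived
# session data goes to a key-value store (fast, ephemeral).
class Storefront:
    def __init__(self):
        self.orders_db = sqlite3.connect(":memory:")
        self.orders_db.execute(
            "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
        )
        self.session_cache = {}   # dict standing in for a key-value cache

    def place_order(self, customer: str, total: float) -> int:
        cur = self.orders_db.execute(
            "INSERT INTO orders (customer, total) VALUES (?, ?)", (customer, total)
        )
        self.orders_db.commit()
        return cur.lastrowid

    def remember_session(self, session_id: str, data: dict) -> None:
        self.session_cache[session_id] = data

shop = Storefront()
order_id = shop.place_order("Alice", 19.90)
shop.remember_session("sess-1", {"cart": [], "last_order": order_id})
print(order_id, shop.session_cache["sess-1"])
```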
A data warehouse is a large, centralized repository of data that is used for reporting and analysis. It typically contains historical data from multiple sources, organized in a way that makes it easy to query and analyze.
A data warehouse is optimized for read-heavy workloads and complex queries, while a transactional database is optimized for write-heavy workloads and simple queries. Data warehouses also typically contain denormalized or aggregated data, rather than raw transactional data.
Data warehouses are often used for business intelligence, reporting, and analytics. They can be used to answer questions such as "What were our sales by region last quarter?" or "Which products are most frequently purchased together?"
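A small sqlite3 sketch of the kind of aggregated, read-oriented query a warehouse serves; the table layout and values are made up for the example:

```python
import sqlite3

# Warehouse-style query: a denormalized sales table answered with an
# aggregate, e.g. "sales by region for a given quarter".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, quarter TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("EMEA", "2023-Q1", 120.0), ("EMEA", "2023-Q1", 80.0), ("APAC", "2023-Q1", 200.0)],
)

rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales WHERE quarter = ? GROUP BY region",
    ("2023-Q1",),
).fetchall()
print(rows)   # e.g. [('APAC', 200.0), ('EMEA', 200.0)]
```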
Data integration refers to the process of combining data from multiple sources into a single, unified view. This can involve tasks such as cleaning and transforming the data, resolving conflicts between different sources, and ensuring that the resulting dataset is consistent and accurate.
Data integration can be challenging due to differences in format, structure, and semantics between different sources of data. It can also be difficult to ensure that the resulting dataset is complete and accurate.
Common tools or techniques used for data integration include ETL (extract-transform-load) processes, master data management systems, and semantic mapping tools.
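A minimal ETL sketch in Python (the CSV source, field names, and target table are invented for the example):

```python
import csv
import io
import sqlite3

# Extract customer rows from CSV, transform them into a consistent shape
# (trimmed names, upper-cased country codes), and load them into a target table.
raw_csv = "id,name,country\n1, Alice ,ch\n2,Bob,DE\n"

# Extract
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: normalize whitespace and country codes
clean = [(int(r["id"]), r["name"].strip(), r["country"].strip().upper()) for r in rows]

# Load
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
target.executemany("INSERT INTO customers VALUES (?, ?, ?)", clean)
target.commit()
print(target.execute("SELECT * FROM customers").fetchall())
```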
Traditional databases are designed to handle structured data, while Big Data databases are designed to handle unstructured or semi-structured data. Big Data databases also typically use distributed computing to process large amounts of data.
Some challenges associated with Big Data include storing and processing large amounts of data, ensuring data quality, and dealing with unstructured or semi-structured data.
Hadoop is an open-source software framework that allows for distributed storage and processing of large datasets across clusters of computers. It uses the Hadoop Distributed File System (HDFS) for storage and MapReduce for processing.
HDFS is a distributed file system that provides high-throughput access to application data. It is designed to store very large files across multiple machines in a cluster.
HDFS stores data by breaking it into blocks and replicating those blocks across multiple machines in a cluster. This allows for fault tolerance and high availability of the data.
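A toy sketch of block splitting and replica placement. The tiny block size and round-robin placement are illustrative only; real HDFS defaults to 128 MB blocks, a replication factor of 3, and rack-aware placement:

```python
# Split a file into fixed-size blocks and assign each block to several nodes,
# so the data survives the loss of any single machine.
BLOCK_SIZE = 8            # bytes, unrealistically small so the example is visible
REPLICATION = 3
NODES = ["node1", "node2", "node3", "node4"]

def place_blocks(data: bytes):
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    placement = []
    for idx, block in enumerate(blocks):
        replicas = [NODES[(idx + r) % len(NODES)] for r in range(REPLICATION)]
        placement.append((idx, block, replicas))
    return placement

for idx, block, replicas in place_blocks(b"hello distributed file systems"):
    print(f"block {idx}: {block!r} -> {replicas}")
```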
Using commodity hardware in a Hadoop cluster can be more cost-effective than using specialized hardware, as it allows for scaling out by adding more commodity machines as needed.
MapReduce is a programming model used for processing large datasets in parallel across clusters of computers. It consists of two phases: a map phase and a reduce phase.
In the map phase, data is divided into smaller chunks and processed in parallel across multiple machines. In the reduce phase, the results from the map phase are combined to produce a final output.
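A single-process Python sketch of the classic MapReduce word-count pattern; in a real cluster the map and reduce tasks run in parallel on different machines, and the function names here are illustrative:

```python
from collections import defaultdict

def map_phase(chunk: str):
    """Emit (word, 1) for every word in one input chunk."""
    for word in chunk.split():
        yield word.lower(), 1

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into the final result."""
    return key, sum(values)

chunks = ["the quick brown fox", "the lazy dog", "the fox"]
intermediate = [pair for chunk in chunks for pair in map_phase(chunk)]
result = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(result)   # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```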
Some examples of applications that use MapReduce include search engines, social media analytics, and log processing.
Apache Spark is an open-source distributed computing system used for processing large datasets. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Spark is faster than Hadoop MapReduce because it keeps intermediate data in memory between operations, while Hadoop MapReduce writes intermediate results to disk after each map and reduce step. Spark also provides a wider range of APIs and supports more programming languages than Hadoop MapReduce.
Some advantages of using Spark over MapReduce include faster processing times, better support for iterative algorithms, and a wider range of APIs and programming languages.
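A short PySpark sketch of the same word count, assuming the pyspark package and a local Spark runtime are available; the cache() call illustrates keeping an RDD in memory for reuse across actions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("wordcount-sketch").getOrCreate()

lines = spark.sparkContext.parallelize(["the quick brown fox", "the lazy dog", "the fox"])
counts = (
    lines.flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
         .cache()                      # keep the result in memory for reuse
)

print(counts.collect())                                 # first action
print(counts.filter(lambda kv: kv[1] > 1).collect())    # second action reuses the cached data
spark.stop()
```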
NoSQL stands for "not only SQL" and refers to a class of databases that do not use the traditional SQL relational database model. Instead, they use other data models such as key-value, document-oriented, or graph-based models.
NoSQL was developed to handle the large amounts of unstructured or semi-structured data that traditional SQL databases were not designed to handle. It also provides more flexibility in terms of data modeling and scalability.
Some examples of NoSQL databases include MongoDB (document-oriented), Cassandra (column-family), and Neo4j (graph-based).
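A minimal document-store sketch using pymongo, MongoDB's Python driver; the connection string, database, and collection names are assumptions for the example:

```python
from pymongo import MongoClient

# Assumes a MongoDB server is reachable at the connection string below.
client = MongoClient("mongodb://localhost:27017")
products = client["shop"]["products"]

# Documents in one collection need not share a fixed schema.
products.insert_one({"name": "keyboard", "price": 49.0, "tags": ["usb", "mechanical"]})
products.insert_one({"name": "ebook", "price": 9.9, "download_url": "https://example.org/x"})

# Query by field value, much like a WHERE clause but against JSON-like documents.
cheap = list(products.find({"price": {"$lt": 20}}))
print(cheap)
```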