DataMgmt

Data Mgmt Chärtli


File Details

Flashcards 81
Language English
Category Computer Science
Level University
Created / Updated 31.05.2023 / 31.05.2023
Web link
https://card2brain.ch/box/20230531_datamgmt
What is the difference between a traditional database and a Big Data database?

Traditional databases are designed to handle structured data, while Big Data databases are designed to handle unstructured or semi-structured data. Big Data databases also typically use distributed computing to process large amounts of data.

What are some of the challenges associated with Big Data?

Some challenges associated with Big Data include storing and processing large amounts of data, ensuring data quality, and dealing with unstructured or semi-structured data.

How does Hadoop help with processing Big Data?

Hadoop is an open-source software framework that allows for distributed storage and processing of large datasets across clusters of computers. It uses Hadoop Distributed File System (HDFS) for storage and MapReduce for processing.

What is Hadoop Distributed File System (HDFS)?

HDFS is a distributed file system that provides high-throughput access to application data. It is designed to store very large files across multiple machines in a cluster.

How does HDFS store data?

HDFS stores data by breaking it into blocks and replicating those blocks across multiple machines in a cluster. This allows for fault tolerance and high availability of the data.
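The block-and-replica idea can be sketched in plain Python. This is a toy simulation, not the HDFS API: the function names, the tiny block size, the node names, and the round-robin placement are all assumptions made for illustration. Real HDFS uses far larger blocks (128 MB by default in current versions) and a rack-aware placement policy.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut the data into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
    return placement


# Hypothetical toy values: a 10-byte "file", 4-byte blocks, 4 nodes, 3 replicas.
blocks = split_into_blocks(b"0123456789", block_size=4)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"], replication=3)

# Simulate losing node2: every block still has copies on other nodes.
surviving = {b: [n for n in ns if n != "node2"] for b, ns in placement.items()}
```

Because every block lives on several machines, the loss of a single node leaves at least one readable copy of each block, which is the fault-tolerance property described above.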

What is the advantage of using commodity hardware in a Hadoop cluster?

Using commodity hardware in a Hadoop cluster can be more cost-effective than using specialized hardware, as it allows for scaling out by adding more commodity machines as needed.

What is MapReduce?

MapReduce is a programming model used for processing large datasets in parallel across clusters of computers. It consists of two phases: a map phase and a reduce phase.

How does MapReduce work?

In the map phase, the input is divided into smaller chunks that are processed in parallel across multiple machines, each producing intermediate key-value pairs. These pairs are grouped by key, and in the reduce phase the values for each key are combined to produce the final output.
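A word count is the usual way to illustrate the two phases. The sketch below runs in a single Python process and only mimics the data flow; the function names and input chunks are hypothetical, and on a real cluster the framework would distribute the chunks and perform the grouping step itself.

```python
from collections import defaultdict


def map_phase(chunk: str) -> list[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group all emitted values by key before they reach the reducers."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


chunks = ["big data is big", "data is everywhere"]  # hypothetical input splits

# On a real cluster each chunk would be mapped on a different machine.
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

Each call to `map_phase` depends only on its own chunk, and each call to `reduce_phase` depends only on one key, which is what makes both phases easy to run in parallel.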

What are some examples of applications that use MapReduce?

Some examples of applications that use MapReduce include search engines, social media analytics, and log processing.

What is Apache Spark?

Apache Spark is an open-source distributed computing system used for processing large datasets. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

How does Spark differ from Hadoop MapReduce?

Spark is often faster than Hadoop MapReduce because it can keep intermediate data in memory, whereas Hadoop MapReduce writes intermediate results to disk between processing steps. Spark also provides a wider range of APIs and offers them in more programming languages than Hadoop MapReduce, including Scala, Java, Python, and R.

What are some advantages of using Spark over MapReduce?

Some advantages of using Spark over MapReduce include faster processing times, better support for iterative algorithms, and a wider range of APIs and programming languages.

What is NoSQL?

NoSQL stands for "not only SQL" and refers to a class of databases that do not use the traditional SQL relational database model. Instead, they use other data models such as key-value, document-oriented, or graph-based models.

Why was NoSQL developed?

NoSQL was developed to handle the large amounts of unstructured or semi-structured data that traditional SQL databases were not designed to handle. It also provides more flexibility in terms of data modeling and scalability.

What are some examples of NoSQL databases?

Some examples of NoSQL databases include MongoDB (document-oriented), Cassandra (column-family), and Neo4j (graph-based).

What is NewSQL?

NewSQL is a class of databases that combine the scalability and performance benefits of NoSQL with the ACID (atomicity, consistency, isolation, durability) properties of traditional SQL databases.

How does NewSQL differ from traditional SQL databases and NoSQL databases?

NewSQL differs from traditional SQL databases in its ability to scale horizontally across multiple machines while maintaining ACID properties. It differs from NoSQL databases in its support for complex queries and transactions.

What are some examples of NewSQL databases?

Some examples of NewSQL databases include CockroachDB, TiDB, and NuoDB.

What is OLAP?

OLAP stands for "online analytical processing" and refers to a class of databases used for business intelligence and data analytics. OLAP databases are designed to handle complex queries and provide fast query response times.

How does OLAP differ from OLTP (Online Transaction Processing)?

OLTP is used for transactional processing, such as recording sales transactions or updating inventory levels, while OLAP is used for analytical processing, such as performing complex queries and data analysis. OLTP databases are optimized for many short, concurrent read/write operations on individual records, while OLAP databases are optimized for complex, read-heavy queries that scan and aggregate large volumes of data. OLAP databases also typically store historical data and provide tools for trend analysis and forecasting.

What is ETL/ELT?

ETL (extract, transform, load) and ELT (extract, load, transform) are processes used to integrate data from multiple sources into a single database or data warehouse. Both extract data from source systems. In ETL, the data is transformed to fit the target schema before it is loaded into the target database or data warehouse. In ELT, the raw data is loaded first and then transformed inside the target system.
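A minimal ETL run might look like the following sketch. It uses Python's built-in sqlite3 module as a stand-in for the target warehouse; the source rows, the `transform` function, and the `sales` table are hypothetical.

```python
import sqlite3

# Extract: rows as they might arrive from a source system (hypothetical data).
source_rows = [
    {"name": " Alice ", "amount": "19.90"},
    {"name": "BOB", "amount": "5"},
]


def transform(row: dict) -> tuple[str, float]:
    """Transform: normalise names and convert amounts to numbers."""
    return row["name"].strip().title(), float(row["amount"])


# Load: write the cleaned rows into the target table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (customer TEXT NOT NULL, amount REAL NOT NULL)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?)", [transform(r) for r in source_rows])

loaded = warehouse.execute("SELECT customer, amount FROM sales ORDER BY customer").fetchall()
```

In an ELT variant of the same job, the raw strings would be inserted first and the cleanup would run afterwards as SQL inside the target database.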

What is information retrieval?

Information retrieval refers to the process of retrieving relevant information from a large collection of unstructured or semi-structured data. This can involve techniques such as natural language processing, text mining, and machine learning to identify patterns and extract meaningful information from text-based data sources such as documents or web pages.

What is database optimization?

Database optimization refers to the process of improving the performance of a database by optimizing its structure, indexes, queries, and other factors that affect its performance. This can involve techniques such as query

What is a database schema?

A database schema is a blueprint or plan for how data is organized in a database. It defines the structure of tables, columns, relationships, and other elements that make up the database.

Why is it important to have a well-designed database schema?

A well-designed database schema can improve data quality, reduce redundancy and inconsistency, and make it easier to query and analyze data. It can also help ensure that the database can scale effectively as more data is added over time.

What are some common elements of a database schema?

Common elements of a database schema include tables, columns, primary keys, foreign keys, indexes, constraints, and relationships between tables.

How do you create a new table in PostgreSQL using SQL?

To create a new table in PostgreSQL using SQL, you can use the CREATE TABLE statement followed by the table name and column definitions. For example: CREATE TABLE my_table ( id SERIAL PRIMARY KEY, name VARCHAR(50) NOT NULL, age INTEGER ); This creates a new table called "my_table" with three columns: "id" (a serial primary key), "name" (a non-null string), and "age" (an integer).
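To try the statement without a PostgreSQL server, the sketch below runs an equivalent table definition through Python's built-in sqlite3 module. This is a stand-in, not PostgreSQL: SQLite has no SERIAL type, so `INTEGER PRIMARY KEY` is used instead, and the inserted rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# SQLite stand-in for the PostgreSQL statement: INTEGER PRIMARY KEY
# auto-assigns ids here, much as SERIAL does in PostgreSQL.
conn.execute(
    """
    CREATE TABLE my_table (
        id   INTEGER PRIMARY KEY,
        name VARCHAR(50) NOT NULL,
        age  INTEGER
    )
    """
)

# The id column is filled in automatically; age may be left NULL.
conn.execute("INSERT INTO my_table (name, age) VALUES (?, ?)", ("Ada", 36))
conn.execute("INSERT INTO my_table (name, age) VALUES (?, ?)", ("Linus", None))

rows = conn.execute("SELECT id, name, age FROM my_table ORDER BY id").fetchall()
```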

What is a primary key in PostgreSQL?

In PostgreSQL, a primary key is a column or set of columns that uniquely identifies each row in a table. It enforces data integrity: no two rows can have the same primary key value, and primary key columns cannot contain NULL.
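The uniqueness rule can be observed directly. The sketch below uses Python's built-in sqlite3 module rather than PostgreSQL, and the `customers` table is hypothetical; PostgreSQL would likewise reject the second insert, though with its own error message.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")

try:
    # Same key value again: the database refuses the row.
    conn.execute("INSERT INTO customers VALUES (1, 'Grace')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
```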

How do you add or modify columns in an existing table in PostgreSQL?

To add or modify columns in an existing table in PostgreSQL using SQL, you can use the ALTER TABLE statement followed by the table name and the change to apply. For example: ALTER TABLE my_table ADD COLUMN email VARCHAR(100); This adds a new column called "email", with a maximum length of 100 characters, to the "my_table" table. To change an existing column, use ALTER COLUMN, for example: ALTER TABLE my_table ALTER COLUMN name TYPE VARCHAR(100);
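The effect of adding a column can be checked with a short script. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, so only the ADD COLUMN case is shown; the table contents are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER PRIMARY KEY, name VARCHAR(50) NOT NULL, age INTEGER)")
conn.execute("INSERT INTO my_table (name, age) VALUES ('Ada', 36)")

conn.execute("ALTER TABLE my_table ADD COLUMN email VARCHAR(100)")

# PRAGMA table_info is SQLite-specific; it returns one row per column.
columns = [row[1] for row in conn.execute("PRAGMA table_info(my_table)")]

# Rows that existed before the change get NULL in the new column.
existing_email = conn.execute("SELECT email FROM my_table WHERE name = 'Ada'").fetchone()[0]
```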

What is an inner join in SQL?

An inner join in SQL is a type of join that returns only the rows from two tables where there is a match on the join condition. It combines rows from both tables based on a common column or set of columns.

What is an outer join in SQL?

An outer join in SQL is a type of join that also returns rows that have no match in the other table, filling the missing columns with null values. There are three types of outer joins: a left outer join keeps all rows from the left table, a right outer join keeps all rows from the right table, and a full outer join keeps all rows from both tables.
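A left outer join is the most common of the three. The sketch below runs one through Python's built-in sqlite3 module; the `customers` and `orders` tables and their rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25.0);
    """
)

# Grace has no orders, but the left outer join still returns her row.
left_join_rows = conn.execute(
    """
    SELECT c.name, o.order_id
    FROM customers AS c
    LEFT OUTER JOIN orders AS o ON o.customer_id = c.customer_id
    ORDER BY c.customer_id
    """
).fetchall()
```

The customer without an order appears with `None` (SQL NULL) in the `order_id` column instead of being dropped.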

How do you write a SQL query to perform an inner join between two tables?

To perform an inner join between two tables in SQL, you can use the JOIN keyword followed by the name of the second table and the ON keyword followed by the join condition. For example: SELECT * FROM orders JOIN customers ON orders.customer_id = customers.customer_id; This query returns all columns from both the "orders" and "customers" tables where there is a match on the "customer_id" column.
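The same kind of query can be run end to end with Python's built-in sqlite3 module. The table contents below are made up so that one order (customer_id 99) and one customer (Grace) have no match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 99);
    """
)

# Only rows with a match on customer_id in both tables are returned.
inner_rows = conn.execute(
    """
    SELECT orders.order_id, customers.name
    FROM orders
    JOIN customers ON orders.customer_id = customers.customer_id
    ORDER BY orders.order_id
    """
).fetchall()
```

Order 12 and customer Grace are both absent from the result because neither has a partner row in the other table.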

What is aggregation in SQL?

Aggregation in SQL refers to the process of summarizing or grouping data based on one or more columns. Common aggregation functions include COUNT, SUM, AVG, MIN, and MAX.

How do you use the GROUP BY clause in SQL to aggregate data?

To use the GROUP BY clause in SQL to aggregate data, you can specify one or more columns to group by in your SELECT statement. For example: SELECT category, COUNT(*) as num_products FROM products GROUP BY category; This query groups all products by their category column and counts how many products are in each category.
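A runnable version of this grouping, using Python's built-in sqlite3 module and a made-up `products` table, might look like this. The extra `AVG(price)` column is added here only to show a second aggregate function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE products (name TEXT, category TEXT, price REAL);
    INSERT INTO products VALUES
        ('pen', 'office', 1.5),
        ('stapler', 'office', 8.0),
        ('mug', 'kitchen', 6.0);
    """
)

# One output row per category, with the aggregates computed per group.
summary = conn.execute(
    """
    SELECT category, COUNT(*) AS num_products, AVG(price) AS avg_price
    FROM products
    GROUP BY category
    ORDER BY category
    """
).fetchall()
```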

What is a subquery in SQL?

A subquery in SQL is a query that is nested inside another query. It can be used to retrieve data that will be used as input for another query or as a filter condition for a larger query.
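As an illustration of a subquery used as a filter condition, the sketch below selects products priced above the average. It uses Python's built-in sqlite3 module, and the `products` table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE products (name TEXT, price REAL);
    INSERT INTO products VALUES ('pen', 1.0), ('mug', 6.0), ('lamp', 20.0);
    """
)

# The inner query computes the average price (9.0 for this data);
# the outer query uses that value as its filter.
above_average = conn.execute(
    """
    SELECT name
    FROM products
    WHERE price > (SELECT AVG(price) FROM products)
    """
).fetchall()
```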

What is a common table expression (CTE) in SQL?

A common table expression (CTE) in SQL is a temporary named result set that you can reference within a SELECT, INSERT, UPDATE, or DELETE statement. It allows you to break down complex queries into smaller, more manageable parts.

How do you create a CTE in SQL?

To create a CTE in SQL, you can use the WITH keyword followed by the name of the CTE, the AS keyword, and the defining query in parentheses. For example: WITH my_cte AS ( SELECT * FROM my_table WHERE age > 30 ) SELECT * FROM my_cte; This query creates a CTE called "my_cte" that selects all rows from "my_table" where the age column is greater than 30. The CTE is then used as input for the outer SELECT statement.
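The CTE example can be executed with Python's built-in sqlite3 module, which also supports the WITH clause. The rows in `my_table` are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE my_table (name TEXT, age INTEGER);
    INSERT INTO my_table VALUES ('Ada', 36), ('Tim', 25), ('Grace', 45);
    """
)

# The CTE exists only for the duration of this one statement.
cte_rows = conn.execute(
    """
    WITH my_cte AS (
        SELECT name, age FROM my_table WHERE age > 30
    )
    SELECT name FROM my_cte ORDER BY age
    """
).fetchall()
```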

What are some benefits of using CTEs in SQL?

Some benefits of using CTEs in SQL include improved readability and maintainability of complex queries and reduced duplication of code. They can also perform better for certain types of queries, although this depends on the database engine and the query.

What is a window function in SQL?

A window function in SQL is a type of function that performs calculations across a set of rows that are related to the current row. Unlike GROUP BY aggregation, it does not collapse those rows into a single output row, so you can compute values such as running totals or moving averages while keeping every row in the result.

How do you use a window function in SQL?

To use a window function in SQL, you can specify it as part of your SELECT statement followed by an OVER clause that defines the window or group of rows to operate on. For example: SELECT date, revenue, SUM(revenue) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) as running_total FROM sales; This query calculates a running total of revenue for each date in the "sales" table using the SUM window function and the OVER clause. The window is defined as all rows from the start of the partition (unbounded preceding) up to and including the current row.
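The running total can be reproduced with Python's built-in sqlite3 module, assuming the bundled SQLite library is version 3.25 or newer (the first release with window functions). The `sales` rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales (date TEXT, revenue REAL);
    INSERT INTO sales VALUES
        ('2023-01-01', 100.0),
        ('2023-01-02', 50.0),
        ('2023-01-03', 25.0);
    """
)

# Every input row is kept; the window adds a cumulative sum next to it.
running = conn.execute(
    """
    SELECT date, revenue,
           SUM(revenue) OVER (
               ORDER BY date
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS running_total
    FROM sales
    ORDER BY date
    """
).fetchall()
```

The result has as many rows as the `sales` table, which is the difference from a GROUP BY query that would return a single total.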