DataMgmt FS23
File details

| Flashcards | 99 |
|---|---|
| Language | English |
| Category | Computer science |
| Level | University |
| Created / Updated | 11.09.2023 / 15.10.2023 |
| Web link | https://card2brain.ch/box/20230911_datamgmt_fs23 |
The multidimensional model stores data at a granular level and qualifies it with values in multiple dimensions. This structure allows for n-dimensional queries, where users can query for specific dimensions or combinations of dimensions. Key operations supported by the multidimensional model include:

1. Slice: extracting a sub-cube by fixing a specific value for one dimension.
2. Dice: filtering data by selecting multiple values from several dimensions simultaneously.
3. Roll-up: aggregating data across one or more dimensions to a coarser level of granularity.
4. Drill-down: breaking aggregated data down into finer levels of granularity.
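A minimal sketch of these operations on a tiny cube held in a pandas DataFrame; the column names and figures are invented for illustration:

```python
import pandas as pd

# Toy fact data: one row per (year, region, product) combination (illustrative values).
sales = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2023, 2023, 2023],
    "region":  ["EU", "EU", "US", "EU", "US", "US"],
    "product": ["A", "B", "A", "A", "A", "B"],
    "revenue": [100, 150, 80, 120, 90, 60],
})

# Slice: fix one dimension to a single value (year = 2023).
slice_2023 = sales[sales["year"] == 2023]

# Dice: restrict several dimensions at once (region in {EU, US} and product A).
dice = sales[sales["region"].isin(["EU", "US"]) & (sales["product"] == "A")]

# Roll-up: aggregate away the product dimension, keeping year and region.
rollup = sales.groupby(["year", "region"], as_index=False)["revenue"].sum()

# Drill-down: return to the finer granularity by including product again.
drilldown = sales.groupby(["year", "region", "product"], as_index=False)["revenue"].sum()

print(rollup)
```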
ROLAP is designed for scalability, particularly for handling larger data volumes, and builds on mature relational database management systems (RDBMSs). In ROLAP, dimensions and measures are typically stored in separate tables; the table holding the measures is called the fact table. Measures are identified via the keys of the dimension tables: the primary key of the fact table is composed of the primary keys of all related dimension tables, which act as foreign keys. ROLAP systems provide an effective means of handling multidimensional data in a relational context.
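A minimal ROLAP sketch using Python's built-in sqlite3 module: dimension tables plus a fact table whose primary key combines the dimension keys, queried with a join and an aggregation. Table and column names are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables and a fact table whose key is the combination of dimension keys.
cur.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    revenue    REAL,
    PRIMARY KEY (product_id, date_id)
);
INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
INSERT INTO dim_date    VALUES (10, 2022), (11, 2023);
INSERT INTO fact_sales  VALUES (1, 10, 100.0), (1, 11, 120.0), (2, 11, 60.0);
""")

# Typical ROLAP query: join fact and dimensions, then aggregate per category and year.
cur.execute("""
    SELECT p.category, d.year, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON f.product_id = p.product_id
    JOIN dim_date d    ON f.date_id = d.date_id
    GROUP BY p.category, d.year
""")
print(cur.fetchall())
```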
Concept hierarchies in OLAP refer to the structured organization of dimension values in a hierarchical manner. These hierarchies allow for drill-down and aggregation operations, accommodating values at different levels of granularity within the dimensions. Concept hierarchies are crucial in the design of dimensions because they enable users to navigate through data at varying levels of detail, making it easier to explore and analyze multidimensional data effectively. Designing dimensions with hierarchies in mind enhances the usability of OLAP systems.
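A small sketch of rolling up along a day → month → year hierarchy with pandas; the hierarchy columns and values are assumed for illustration.

```python
import pandas as pd

# Each row carries the full hierarchy path of its time dimension (illustrative data).
sales = pd.DataFrame({
    "year":   [2023, 2023, 2023, 2023],
    "month":  ["2023-01", "2023-01", "2023-02", "2023-02"],
    "day":    ["2023-01-05", "2023-01-20", "2023-02-03", "2023-02-17"],
    "amount": [10, 20, 30, 40],
})

# Roll-up along the hierarchy: day level -> month level -> year level.
by_month = sales.groupby("month")["amount"].sum()
by_year  = sales.groupby("year")["amount"].sum()

# Drill-down is the reverse: grouping by the finer level again.
by_day = sales.groupby("day")["amount"].sum()

print(by_year, by_month, sep="\n")
```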
Temporality is crucial in data warehousing because historical data often plays a significant role in decision support and analysis. Changes in dimension values over time, known as Slowly Changing Dimensions (SCD), can impact the validity of historical data. The common SCD types are:

- SCD Type 1: no historization (the existing record, e.g. of a seller, is simply overwritten).
- SCD Type 2: historization (a new record is added for each change).
- SCD Type 3: single-level historization (an attribute such as "previous_category" keeps the prior value).

Temporal data management is essential to ensure that historical data remains accurate and relevant for analysis.
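A minimal sketch of SCD Type 2 in plain Python, using assumed bookkeeping fields (valid_from, valid_to, current): instead of overwriting the seller's record, the old version is closed and a new one is appended.

```python
from datetime import date

# Dimension table for sellers: one "current" row per seller (illustrative data).
sellers = [
    {"seller_id": 7, "category": "Books", "valid_from": date(2020, 1, 1),
     "valid_to": None, "current": True},
]

def scd2_update(rows, seller_id, new_category, change_date):
    """SCD Type 2: close the old version and append a new one instead of overwriting."""
    for row in rows:
        if row["seller_id"] == seller_id and row["current"]:
            row["valid_to"] = change_date   # close the old version
            row["current"] = False
    rows.append({"seller_id": seller_id, "category": new_category,
                 "valid_from": change_date, "valid_to": None, "current": True})

scd2_update(sellers, 7, "Games", date(2023, 6, 1))
for row in sellers:
    print(row)
```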
Mono-temporal data management handles data with a single temporal dimension (for example, the validity period of a record), typically to maintain data integrity over time. Bitemporal data management adds a second temporal dimension, the transaction time, which records when updates were made to the data. Bitemporality is needed to track changes in data and to ensure that reports are based on the system knowledge that was current at the time. It allows organizations to manage and query data based on both its validity and the time when updates occurred, preventing reliance on outdated information.
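A bitemporal sketch in plain Python with assumed field names (valid_from/valid_to for valid time, tx_from/tx_to for transaction time): a query can ask for the value valid at some date as the system knew it at another date.

```python
from datetime import date

# Bitemporal versions of one record: valid time (real world) and transaction time
# (when the system knew it). Field names and values are invented for illustration.
versions = [
    {"price": 10, "valid_from": date(2023, 1, 1), "valid_to": date(2023, 7, 1),
     "tx_from": date(2023, 1, 1), "tx_to": date(2023, 3, 1)},   # superseded knowledge
    {"price": 12, "valid_from": date(2023, 1, 1), "valid_to": date(2023, 7, 1),
     "tx_from": date(2023, 3, 1), "tx_to": None},               # current knowledge
]

def as_of(versions, valid_at, known_at):
    """Return the version valid at `valid_at` as the system knew it at `known_at`."""
    for v in versions:
        valid_ok = v["valid_from"] <= valid_at and (v["valid_to"] is None or valid_at < v["valid_to"])
        known_ok = v["tx_from"] <= known_at and (v["tx_to"] is None or known_at < v["tx_to"])
        if valid_ok and known_ok:
            return v
    return None

# What did a report produced on 1 Feb 2023 see for mid-March 2023?  -> price 10
print(as_of(versions, valid_at=date(2023, 3, 15), known_at=date(2023, 2, 1)))
```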
Cross tables (cross tabulations) display data in rows and columns and are often used for summarizing or aggregating data. Pivoting is a data transformation technique that converts data from rows to columns, facilitating data summarization in a different format. Cross tables and pivoting are essential in data analysis because they let users view and manipulate data in ways that make it more readable and usable. These techniques help in summarizing, aggregating, and presenting data effectively for analysis and reporting.
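A pivoting sketch with pandas.pivot_table on invented data: long-format rows become a cross table with years on the rows and regions on the columns.

```python
import pandas as pd

# Long-format data: one row per (year, region) observation (illustrative values).
sales = pd.DataFrame({
    "year":    [2022, 2022, 2023, 2023],
    "region":  ["EU", "US", "EU", "US"],
    "revenue": [250, 80, 120, 150],
})

# Pivot: years as rows, regions as columns, summed revenue in the cells,
# with row/column totals added as margins.
cross_table = sales.pivot_table(index="year", columns="region",
                                values="revenue", aggfunc="sum", margins=True)
print(cross_table)
```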
MOLAP (Multidimensional OLAP), ROLAP (Relational OLAP), and HOLAP (Hybrid OLAP) are OLAP technologies that differ in their data storage and processing approaches:

- MOLAP stores data in n-dimensional arrays, with measures in the arrays and dimensions as indexes. It offers fast data access but can be inefficient with increasing data volume.
- ROLAP stores dimensions and measures separately in relational databases, offering scalability but requiring transformation between the multidimensional and relational models.
- HOLAP combines aspects of MOLAP and ROLAP, exporting dense data to multidimensional arrays and sparse data to relational databases. It optimizes performance while reducing data storage.
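A MOLAP-style sketch with NumPy: measures live in a dense n-dimensional array and dimension members are mapped to array positions. The dimensions and values are invented.

```python
import numpy as np

# Dimension members mapped to array positions.
years   = {2022: 0, 2023: 1}
regions = {"EU": 0, "US": 1}

# Measures stored as a dense 2-D array (year x region); cells hold revenue values.
cube = np.zeros((len(years), len(regions)))
cube[years[2022], regions["EU"]] = 250
cube[years[2023], regions["US"]] = 150

# Fast cell access and roll-up are simple array operations.
print(cube[years[2023], regions["US"]])   # point lookup
print(cube.sum(axis=1))                   # roll-up over the region dimension
```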
Join indexes are used to combine record IDs for attributes in multiple tables simultaneously, creating indexes for frequently used attribute combinations. They are particularly useful for dimension attributes. However, using join indexes can lead to a "combinatorial explosion" due to the number of attribute combinations. Bitmap indexes use a bitmap data structure to map a domain to bits, enabling efficient relational operations like join, union, and constraint checking without accessing the data itself. However, updates to bitmap indexes can be inefficient. The choice between join and bitmap indexes depends on the specific query and update requirements.
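A bitmap-index sketch in plain Python (illustrative data): one bitmap per attribute value, stored here as a Python integer, so a conjunctive predicate is evaluated with bitwise AND before any row is accessed.

```python
# Rows 0..5 with a 'region' and a 'category' attribute (illustrative data).
regions    = ["EU", "US", "EU", "EU", "US", "US"]
categories = ["A",  "A",  "B",  "A",  "B",  "A"]

def build_bitmaps(values):
    """One bitmap per distinct value; bit i is set if row i has that value."""
    bitmaps = {}
    for i, v in enumerate(values):
        bitmaps[v] = bitmaps.get(v, 0) | (1 << i)
    return bitmaps

region_bm   = build_bitmaps(regions)
category_bm = build_bitmaps(categories)

# Query: region = 'EU' AND category = 'A' -- evaluated purely on the bitmaps.
result = region_bm["EU"] & category_bm["A"]
matching_rows = [i for i in range(len(regions)) if result & (1 << i)]
print(matching_rows)   # -> [0, 3]
```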
The two main approaches to operating data warehousing systems are:

1. Virtual integrated database systems: these provide a virtual view of data from various sources without physically integrating or replicating it. Users can access and query data as needed without extensive data movement or consolidation.
2. Materialized systems: these store and manage data in a centralized repository, typically involving extraction, transformation, and loading (ETL) processes. Data is physically integrated and made available for analysis.

These approaches involve trade-offs in data consistency, query performance, and data movement complexity.
The primary goal of data integration is to merge and organize data from various heterogeneous sources into a structured and unified view that can be effectively interpreted by applications and users. This process is crucial in data engineering and data science for constructing clean databases, data warehouses, and integrated schemas for tasks such as machine learning. Data integration enables organizations to work with a comprehensive and consistent dataset, facilitating informed decision-making and meaningful analysis.
Data integration may not be necessary in situations where there is only a single data source, and the database product or deployment serves both Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) functions within the same system. In such cases, there may not be a need to merge and harmonize data from different sources since the entire dataset resides in a single system, simplifying data access and analysis. However, as complexity and the number of data sources increase, data integration becomes essential to ensure data consistency and accessibility.
Data integration faces several challenges, particularly related to data heterogeneity:

- Technical and structural heterogeneities cover variations in access methods, data formats (binary vs. human-readable formats such as CSV, JSON, XML), and encodings.
- Model and schema heterogeneities involve differences in data structure (e.g., attribute names) and data types (e.g., text vs. date).
- Semantic heterogeneities arise from different interpretations of data, such as variations in the meaning of terms like "date" or "building".

Addressing these challenges is crucial to achieve a unified and coherent dataset during the integration process.
ETL (Extraction, Transformation, Loading) is a critical process in data integration. Its main phases are:

- Extraction: extracting data from diverse operational systems, each with its own formats and encodings. The key objective is to capture data while minimizing disruption to business transactions and to selectively extract data based on future analytical requirements.
- Transformation: converting data into the formats required by the data warehouse (DW). Key objectives include handling data inconsistencies, cleansing data (correcting errors, filling missing values, etc.), and achieving a general level of data quality.
- Loading: preformatting data, checking integrity, and building indexes. The main objective is to refresh the DW, either through a total data exchange (Refresh-Copy) or by updating only changed data (Incremental Copy).
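A minimal ETL sketch in Python, with an in-memory CSV standing in for the operational source and SQLite standing in for the warehouse; the layout, column names, and cleaning rules are invented.

```python
import csv
import io
import sqlite3

# --- Extract: read raw records from an operational export (here an in-memory CSV).
raw = io.StringIO("order_id,amount,country\n1,10.5,ch\n2,,de\n3,7.0,CH\n")
rows = list(csv.DictReader(raw))

# --- Transform: fix formats, fill missing values, normalise country codes.
cleaned = []
for r in rows:
    amount = float(r["amount"]) if r["amount"] else 0.0   # simple missing-value rule
    cleaned.append((int(r["order_id"]), amount, r["country"].upper()))

# --- Load: refresh the warehouse table and (re)build an index.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL, country TEXT)")
dw.executemany("INSERT INTO orders VALUES (?, ?, ?)", cleaned)
dw.execute("CREATE INDEX idx_orders_country ON orders (country)")
print(dw.execute("SELECT country, SUM(amount) FROM orders GROUP BY country").fetchall())
```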
ETL (Extraction, Transformation, Loading) and ELT (Extract, Load, Transform) are two approaches to data integration. The key difference lies in when data transformation occurs:

- ETL: data transformation takes place after extraction and before loading into the data warehouse. The data is processed and cleaned before it is ready for analysis.
- ELT: data is first extracted and loaded into the data warehouse, and transformation occurs later, often on the fly during query execution. ELT uses SQL for transformation and typically has a simpler architecture than ETL. ELT is represented by tools and services such as Panoply and Stitch.
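For contrast, a minimal ELT sketch: the raw rows are loaded unchanged into a staging table first, and the transformation is expressed afterwards in SQL inside the (here SQLite) warehouse. Table and column names are invented.

```python
import sqlite3

dw = sqlite3.connect(":memory:")

# Load: raw data lands in a staging table as-is, without prior cleaning.
dw.execute("CREATE TABLE staging_orders (order_id TEXT, amount TEXT, country TEXT)")
dw.executemany("INSERT INTO staging_orders VALUES (?, ?, ?)",
               [("1", "10.5", "ch"), ("2", "", "de"), ("3", "7.0", "CH")])

# Transform: done inside the warehouse with SQL, e.g. as a view over the staging table.
dw.execute("""
    CREATE VIEW orders_clean AS
    SELECT CAST(order_id AS INTEGER)         AS order_id,
           COALESCE(NULLIF(amount, ''), '0') AS amount,
           UPPER(country)                    AS country
    FROM staging_orders
""")
print(dw.execute(
    "SELECT country, SUM(CAST(amount AS REAL)) FROM orders_clean GROUP BY country"
).fetchall())
```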
There are various data integration tools, services, and products available in the industry. Some examples include:

- Desktop tools: OpenRefine, Kettle/Spoon (Pentaho), Data Build Tool (dbt)
- SQL for data transformation: numerical transformations in SQL using scalar functions, aggregate functions, and window functions
- Data cleaning: techniques include conversion, normalization, domain-independent and domain-specific cleaning, and rule-based cleaning
- Deduplication and entity resolution: techniques for identifying and resolving records that refer to the same real-world entity
- Data enrichment: enhancing and completing data using third-party data from external sources
- (Synthetic) data generation: methods for generating synthetic data while preserving data characteristics
Dealing with missing data is crucial in data cleaning and preprocessing. Common techniques include:

- Data imputation: filling in missing values using approaches like global constants, central tendencies (mean, median), decision trees, regression, interpolation, maximum likelihood, and machine learning.
- Types of missing data: understanding missing data types, such as Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR).
- Outlier detection: identifying and handling outliers, which are abnormal values that can interfere with analysis.
- Noise reduction: applying filters, regression, and binning to eliminate noise in the data.
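A small imputation sketch with pandas on invented data: fill gaps with the mean or median, interpolate along an ordered axis, and flag outliers with a simple IQR rule.

```python
import numpy as np
import pandas as pd

measurements = pd.DataFrame({
    "day":   [1, 2, 3, 4, 5],
    "value": [10.0, np.nan, 14.0, np.nan, 18.0],   # illustrative gaps
})

mean_filled   = measurements["value"].fillna(measurements["value"].mean())
median_filled = measurements["value"].fillna(measurements["value"].median())
interpolated  = measurements["value"].interpolate()      # linear, uses neighbouring values

# A crude outlier check based on the interquartile range (IQR).
q1, q3 = measurements["value"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = measurements[(measurements["value"] < q1 - 1.5 * iqr) |
                        (measurements["value"] > q3 + 1.5 * iqr)]
print(mean_filled.tolist(), interpolated.tolist(), len(outliers))
```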
Entity resolution, also known as deduplication or record linkage, is the process of identifying and resolving records that refer to the same real-world entity. It's essential in data integration because it ensures that duplicate or redundant records from different sources are correctly identified and merged into a single entity. This helps maintain data consistency and accuracy, which is crucial for effective data integration and analysis. Entity resolution often involves string matching and data matching techniques to compare and align records with similar attributes.
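A toy entity-resolution sketch using string similarity from Python's standard difflib module; the records, the normalisation, and the 0.85 threshold are illustrative choices rather than a production-grade approach.

```python
from difflib import SequenceMatcher

records_a = [{"id": "a1", "name": "Müller AG", "city": "Zürich"},
             {"id": "a2", "name": "Example GmbH", "city": "Bern"}]
records_b = [{"id": "b1", "name": "Mueller AG", "city": "Zurich"},
             {"id": "b2", "name": "Other Corp", "city": "Basel"}]

def normalise(s):
    return s.lower().replace("ü", "ue")

def similarity(r1, r2):
    """Average string similarity over the compared attributes."""
    scores = [SequenceMatcher(None, normalise(r1[k]), normalise(r2[k])).ratio()
              for k in ("name", "city")]
    return sum(scores) / len(scores)

# Pairwise comparison; pairs above the threshold are treated as the same entity.
matches = [(a["id"], b["id"]) for a in records_a for b in records_b
           if similarity(a, b) >= 0.85]
print(matches)   # expected: [('a1', 'b1')]
```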
Data enrichment involves enhancing and completing existing data by incorporating additional information from external sources, such as open data portals. It contributes to data quality and completeness by adding valuable context and details to the dataset. For example, geocoding can add coordinates to an address list, enhancing location-based analyses. Fuzzy matching techniques can improve entity resolution by matching and cleaning data more effectively. Data enrichment reduces data silos, enhances the utility of data, and supports better decision-making and analysis by providing a more comprehensive view of the data.
Synthetic data generation aims to create data with similar characteristics to the original dataset while preserving privacy. Different types of synthetic data generation include:

- Anonymized data: applying noise obfuscation, masking, or encryption to real data to protect privacy.
- Artificial data: generated by probabilistic models through data sampling.
- Pure synthetic data: generated by models based on configuration, rules, or inferences.

The key distinction is that anonymized data retains some relationship with the original data, while artificial and pure synthetic data are entirely new and independent datasets. Synthetic data generation is valuable for preserving privacy while enabling data analysis and research.
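A small sketch contrasting anonymized and artificial data with NumPy: the "real" values are invented, the artificial values are sampled from a normal distribution fitted to them, and the anonymized values are the originals with added noise.

```python
import numpy as np

rng = np.random.default_rng(42)

# "Real" numeric attribute (illustrative values), e.g. transaction amounts.
real = np.array([12.1, 9.8, 11.5, 10.2, 13.0, 8.7, 10.9, 11.8])

# Artificial data: fit a simple probabilistic model and sample new, independent values.
mu, sigma = real.mean(), real.std(ddof=1)
artificial = rng.normal(mu, sigma, size=100)

# Anonymized data, in contrast, keeps a link to the originals (here: added noise).
anonymized = real + rng.normal(0, 0.5, size=real.shape)

print(round(mu, 2), round(artificial.mean(), 2))
```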
A data warehouse is a specialized type of materialized integrated database system designed primarily for analytical processing, in contrast to transactional processing systems that handle day-to-day operations. Data warehouses focus on storing and managing historical data for improved data quality and analytical tasks, such as reporting and data analysis. Unlike transactional systems, data warehouses are typically very large, with high growth rates, and offer read-only access to end-users, as write access is usually reserved for data loading using specific application tools.
The primary objective of a data warehouse is to provide a reliable and efficient platform for performing analytical processing tasks. It aims to support decision-making processes by allowing users to explore and analyze historical data, derive insights, and make informed decisions. By consolidating data from various sources, data warehouses enable organizations to obtain a comprehensive view of their data, leading to better decision support, reporting, and data analysis.
Historical data plays a crucial role in a data warehouse as it allows organizations to analyze past trends, patterns, and behaviors. By storing historical data, data warehouses facilitate the assessment of historical performance, identification of long-term trends, and the ability to make data-driven decisions based on historical insights. Historical data also supports predictive analytics and forecasting by providing a basis for modeling future scenarios.
Data warehouses are typically characterized by their large size, often measured in terabytes, and may experience high growth rates, sometimes up to 50% per year. End-user access is typically limited to read-only, so concurrency control mechanisms are not a primary concern. Write access is granted for loading data into the warehouse using specialized application tools, and transactions in data warehouses are often large and long-running, aimed at data loading and manipulation tasks.
A Data Mart is a subset of a data warehouse that is designed to serve the specific data needs of a particular group or department within an organization. It is essentially a smaller, focused data repository that contains a subset of the data available in the larger data warehouse. Data Marts are used to support the analytical requirements of specific business units, making it easier for them to access and analyze data relevant to their operations. Data Marts are often derived from the central data warehouse and are designed to be more tailored to the needs of individual teams.
A Data Lakehouse (DL) is a hybrid data architecture that combines the strengths of both data lakes and data warehouses. It serves as a central repository for storing and managing both structured and unstructured data. DL architectures typically separate storage and compute layers, enabling scalable and cost-effective processing of large datasets. Unlike traditional data warehouses, DLs provide a unified data platform that supports real-time analytics, machine learning, and AI on large datasets. They help organizations reduce data silos and make faster data-driven decisions.
Implementing a Data Lakehouse (DL) can present challenges related to data quality, security, and compliance. Ensuring data quality within a DL can be complex due to the variety of data types and sources. Maintaining data security and compliance with regulations is critical, especially when handling sensitive data. Managing the growth and complexity of data in a DL can also be a challenge. Organizations need effective strategies and tools to address these challenges successfully while harnessing the benefits of a DL architecture.
Dimensional modeling is a design technique used in data warehousing to structure data for efficient querying and reporting. It involves organizing data into dimensions (attributes that provide context) and measures (quantitative data to be analyzed). The purpose of dimensional modeling is to make it easier for users to explore data, create meaningful reports, and perform ad-hoc queries. It simplifies data navigation, enhances query performance, and aligns data with the way users think about their business, improving overall usability and effectiveness of data warehousing systems.
Dimensional Modeling differs from other data modeling approaches, such as Entity-Relationship Modeling (ER Modeling), in its focus on optimizing data structures for analytical queries and reporting. While ER Modeling emphasizes capturing relationships between entities, Dimensional Modeling prioritizes data accessibility and user-friendliness. It structures data into dimensions and measures, creating a star or snowflake schema, which simplifies data navigation and supports analytical tasks. Dimensional Modeling is tailored for data warehousing and analytical processing, making it suitable for decision support systems.
Common components of Dimensional Modeling include dimensions, facts, hierarchies, and attributes. Dimensions provide context to data, facts represent quantitative data to be analyzed, hierarchies organize dimension values into levels of detail, and attributes describe dimension members. Together, these components enhance data usability by simplifying data navigation, improving query performance, and allowing users to explore data with ease. Dimensional Modeling aligns data with business requirements, making it user-friendly and effective for analytical tasks in data warehousing.
Information Retrieval (IR) is the process of searching for unstructured and semi-structured data in a large number of documents to fulfill a user's information need. It encompasses web search (searching the web), personal information retrieval (integrated into operating systems and email), and enterprise or domain-specific search (searching internal documents using dedicated machines). These categories differ in scale and purpose, with enterprise search focusing on collections of internal documents and centralized file systems.
Text search in Information Retrieval systems presents several challenges. These include linguistic support for handling logical operators (AND/OR), word proximity, derived words, and stop words. Search results typically lack a relevance ordering, and substring searches using operators like LIKE and regular expressions (regex) can be slow because they are not supported by indexes. Additionally, text with different wording but similar meaning (word lemmas) is not matched, making precise text matching challenging.
Full-text search (FTS) indexing is used in Information Retrieval to enhance search capabilities by creating indexes over textual content. It improves search performance and supports more complex search operations by efficiently mapping words and phrases in documents to their corresponding locations. FTS allows for faster text-based searches, making it a valuable tool for Information Retrieval systems.
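A minimal full-text index sketch using SQLite's FTS5 extension through Python's sqlite3 module, assuming the SQLite build includes FTS5 (common, but not guaranteed); the documents are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# FTS5 builds an inverted index over the indexed columns.
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
conn.executemany("INSERT INTO docs VALUES (?, ?)", [
    ("Data warehousing", "Star schemas store facts and dimensions."),
    ("Information retrieval", "An inverted index maps terms to documents."),
])

# MATCH uses the full-text index instead of scanning the text with LIKE.
hits = conn.execute(
    "SELECT title FROM docs WHERE docs MATCH 'inverted AND index'").fetchall()
print(hits)
```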
Information Retrieval systems address the challenge of matching and ranking documents by determining the relevance of retrieved documents to a user's query. This involves ranking documents based on factors like common sets of index terms, weighted terms, and likelihood of relevance. Different Information Retrieval models, such as vector space models and probabilistic models, are used to assess document relevance and determine their order in search results. Ranking is a critical aspect of Information Retrieval, as it helps users find the most relevant information quickly.
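A compact vector-space ranking sketch in plain Python: documents and the query are turned into TF-IDF vectors and ordered by cosine similarity. The documents are invented and the weighting is deliberately simple.

```python
import math
from collections import Counter

docs = {
    "d1": "data warehouse stores historical data",
    "d2": "the inverted index maps terms to documents",
    "d3": "historical data supports analysis",
}
query = "historical data"

def tf_idf_vector(tokens, df, n_docs):
    """Weight each term by its frequency times the (log) inverse document frequency."""
    tf = Counter(tokens)
    return {t: tf[t] * math.log(n_docs / df[t]) for t in tf if t in df}

tokenised = {d: text.split() for d, text in docs.items()}
df = Counter(t for tokens in tokenised.values() for t in set(tokens))

doc_vecs = {d: tf_idf_vector(tokens, df, len(docs)) for d, tokens in tokenised.items()}
q_vec = tf_idf_vector(query.split(), df, len(docs))

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

ranking = sorted(docs, key=lambda d: cosine(q_vec, doc_vecs[d]), reverse=True)
print(ranking)   # documents ordered by estimated relevance to the query
```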
The fundamental difference between data retrieval and Information Retrieval (IR) lies in their objectives and error tolerance. Data retrieval focuses on finding documents containing specific keywords and follows well-defined semantics. It is not error-tolerant, and even a single incorrect object implies failure. In contrast, Information Retrieval aims to find information about a subject or topic, where semantics are frequently loose, and small errors are tolerated. IR differs conceptually from database queries and emphasizes obtaining information rather than specific data objects.
An inverted index is a crucial data structure in Information Retrieval systems. It is used to efficiently map terms or keywords to the documents where they appear. This enables quick retrieval of documents containing specific terms, significantly improving search performance. Inverted indexes consist of term-document pairs and allow for efficient querying, ranking, and retrieval of relevant documents based on user queries. They are essential for Information Retrieval systems to provide fast and accurate search results.
Constructing an inverted index involves several steps:

1. Tokenization: breaking documents into individual terms or words.
2. Sorting: sorting the terms alphabetically.
3. Term frequency: counting the frequency of each term in each document.
4. Pointers: creating pointers to the documents where each term appears.

The resulting inverted index efficiently maps terms to their document locations, enabling quick retrieval of relevant documents during searches.
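A minimal construction sketch in plain Python over three toy documents: tokenize, collect per-document term frequencies, and keep a sorted dictionary whose entries point to their postings.

```python
from collections import defaultdict

docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}

# Tokenize each document and collect term frequencies per document.
index = defaultdict(dict)            # term -> {doc_id: term_frequency}
for doc_id, text in docs.items():
    for token in text.lower().split():
        index[token][doc_id] = index[token].get(doc_id, 0) + 1

# Dictionary sorted alphabetically, each entry pointing to its postings list.
for term in sorted(index):
    postings = sorted(index[term].items())
    print(f"{term:10} df={len(postings)}  postings={postings}")
```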
Query processing in Information Retrieval involves evaluating user queries against the inverted index to retrieve relevant documents. For operations like "AND," which require documents to contain multiple terms, query processing identifies documents that satisfy all query terms. In the case of "AND," only documents with all specified terms are considered relevant. Query processing plays a critical role in filtering and ranking documents based on user queries, ensuring that retrieved results align with the user's information needs.
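A sketch of AND processing as a merge-style intersection of sorted postings lists (the doc IDs and terms are invented); processing the shortest lists first keeps intermediate results small.

```python
def intersect(p1, p2):
    """Merge-style intersection of two sorted postings lists (doc-ID lists)."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

# Postings lists taken from an inverted index (illustrative doc IDs).
postings = {"home": [1, 2, 3, 5, 8], "sales": [2, 3, 5, 13], "july": [2, 3, 21]}

# Query "home AND sales AND july": intersect, shortest lists first for efficiency.
terms = sorted(["home", "sales", "july"], key=lambda t: len(postings[t]))
result = postings[terms[0]]
for t in terms[1:]:
    result = intersect(result, postings[t])
print(result)   # -> [2, 3]
```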
The transition from full text to index terms involves representing documents by a set of index terms, usually nouns that capture the main themes. This transition simplifies document representation and retrieval. Full text assumes that all words are index terms, while index terms are carefully selected keywords or phrases. This process involves text operations like tokenization, handling stop words, normalization, stemming, and lemmatization. It is essential because it optimizes data structures, reduces noise in search results, and improves the accuracy and efficiency of Information Retrieval.
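A toy sketch of these text operations in plain Python; the stop-word list and the suffix-stripping "stemmer" are deliberately simplistic stand-ins for real linguistic tools such as the Porter stemmer.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are"}  # tiny illustrative list

def crude_stem(token):
    """Very naive suffix stripping as a stand-in for a real stemmer."""
    for suffix in ("ing", "ies", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def to_index_terms(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())       # tokenization + normalization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return [crude_stem(t) for t in tokens]                # stemming

print(to_index_terms("The Warehouses are storing historical Sales data."))
```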
Common data structures in Information Retrieval systems include dictionaries, which store the term vocabulary, document frequency, and pointers to postings lists. These structures can be implemented using binary trees or B-trees. Dictionaries are crucial because they enable efficient term lookup and document retrieval. B-trees, in particular, offer advantages like addressing rebalancing issues and providing better overall performance, making them a preferred choice for managing term vocabularies and facilitating fast retrieval in Information Retrieval systems.
Index Construction in Information Retrieval involves creating data structures that enable efficient and fast retrieval of documents matching a user's query. It optimizes the process of searching through a large collection of documents by mapping terms to their locations in documents, allowing for quick identification of relevant documents. The goal is to improve the efficiency and speed of information retrieval tasks.