DataMgmt FS23
Set of flashcards details

Flashcards | 99 |
---|---|
Language | English |
Category | Computer Science |
Level | University |
Created / Updated | 11.09.2023 / 15.10.2023 |
Weblink | https://card2brain.ch/box/20230911_datamgmt_fs23 |
Heaps' Law is used in Information Retrieval to estimate the size of the vocabulary based on the number of tokens in a collection. It follows a power-law relationship, where the size of the vocabulary (M) is proportional to the number of tokens (T) raised to a certain exponent (b). The constants (k and b) typically have values around 30 ≤ k ≤ 100 and b ≈ 0.5. This relationship helps estimate the growth of vocabulary in a collection, assisting in the design of efficient indexing structures and memory allocation for Information Retrieval systems.
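A compact way to state Heaps' Law, using the symbols from the card above, is:

```latex
M = k \cdot T^{b}, \qquad 30 \le k \le 100, \quad b \approx 0.5
```

For example, with k = 50 and b = 0.5, a collection of 1,000,000 tokens yields an estimated vocabulary of roughly 50 · 1000 = 50,000 distinct terms.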
Zipf's Law describes the frequency distribution of terms in natural language. It states that the frequency of the ith most frequent term is inversely proportional to its rank (1/i). Zipf's Law leads to a linear relationship between the logarithm of term frequency and the logarithm of the term's rank. This law is observed in various contexts, such as term frequency in text or population distribution in cities. In Information Retrieval, it helps inform ranking strategies and term weighting schemes to prioritize terms by their importance.
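Written as a formula, with cf_i the collection frequency of the i-th most frequent term and c a corpus-dependent constant:

```latex
cf_i \propto \frac{1}{i}, \qquad \log cf_i = \log c - \log i
```

The second form shows why a log-log plot of term frequency against rank is approximately a straight line with slope −1.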
The Boolean Model is a simple information retrieval model based on set theory and Boolean algebra. Index term weights are binary: a term either describes a document's content or it does not, so documents are represented as sets of index terms and queries are specified as Boolean expressions (AND, OR, NOT). Drawbacks of the Boolean Model in Information Retrieval include its reliance on binary decision criteria without partial matching, the absence of document ranking, the need for users to formulate queries as Boolean expressions, overly simplistic queries, and the tendency to return either too few or too many documents in response to a query.
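A minimal sketch of Boolean retrieval over an inverted index; the documents, terms, and query below are purely illustrative:

```python
# Minimal Boolean retrieval sketch: documents are sets of terms,
# queries are combined with set operations (AND = intersection,
# OR = union, NOT = difference). Toy data for illustration only.
docs = {
    "d1": {"data", "management", "sql"},
    "d2": {"data", "retrieval", "ranking"},
    "d3": {"sql", "nosql", "graph"},
}

# Build an inverted index: term -> set of document ids containing it
index = {}
for doc_id, terms in docs.items():
    for term in terms:
        index.setdefault(term, set()).add(doc_id)

# Query: data AND sql
result = index.get("data", set()) & index.get("sql", set())
print(result)  # {'d1'} -- no ranking, a document either matches or it doesn't
```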
The Vector Space Model (VSM) in Information Retrieval addresses the limitations of the Boolean Model by using non-binary term weights, allowing partial matching between queries and documents. Instead of binary weights, the VSM calculates weights based on term frequency (TF) within documents and inverse document frequency (IDF). This enables a more nuanced ranking of documents based on their degree of similarity to the query. The VSM thus provides better matching of relevant results, improving the effectiveness of information retrieval.
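The standard tf-idf weighting referred to above can be written as follows, with N the number of documents in the collection and df_t the number of documents containing term t:

```latex
w_{t,d} = \mathrm{tf}_{t,d} \cdot \log \frac{N}{\mathrm{df}_t}
```

Rare terms (small df_t) therefore receive a large idf factor, while a term that appears in every document contributes a weight of zero.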
In the Vector Space Model (VSM) of Information Retrieval, documents and queries are represented as weighted vectors in a t-dimensional space, where each term in the document has an associated weight determined by the term frequency (TF) within the document and the inverse document frequency (IDF). Similarity between a query and a document is measured using techniques like the cosine of the angle between their vectors or the distance between them. This allows for partial matching, meaning a document can be retrieved even if it only partially matches the query terms.
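A minimal sketch of the cosine similarity used in the VSM, assuming query and document have already been turned into weighted vectors over a shared vocabulary; the vectors below are made-up illustrations:

```python
import math

def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

# Illustrative tf-idf weights over the vocabulary [data, sql, graph]
query = [0.7, 0.7, 0.0]
doc_a = [0.5, 0.5, 0.0]   # partially overlaps the query
doc_b = [0.0, 0.0, 1.0]   # no overlap

print(cosine_similarity(query, doc_a))  # close to 1.0 -> ranked first
print(cosine_similarity(query, doc_b))  # 0.0 -> ranked last
```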
Search engines can be categorized into several types: 1. General-purpose search engines, such as Google, Bing, DuckDuckGo, and Qwant, which provide search results across a wide range of topics. 2. Specialized (Vertical) search engines, which focus on specific topics or regions, like search.ch for regional searches or search engines for images, events, and companies. 3. "Deep Web" search engines, which include library information retrieval systems. 4. Meta-search engines, which aggregate results from multiple search engines. Additionally, chatbots with question-answering capabilities have emerged as a recent type of search engine.
Semantic search is a search technique that goes beyond traditional keyword matching to consider the meaning and context of words, phrases, and documents. It aims to provide more accurate and relevant search results by analyzing the position and surrounding tokens of a query. Unlike traditional search engines that rely on lexical similarity and token frequency, semantic search employs techniques like embeddings and cosine similarity to measure similarity between queries and documents. Various models, including set-theoretic, algebraic, and probabilistic models, are used to represent documents and queries in semantic search.
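A minimal sketch of the embedding-plus-cosine-similarity idea described above. The `embed()` function is only a stand-in assumed for this sketch; in practice it would call a real embedding model (e.g. a sentence transformer) rather than return hard-coded vectors:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for a real embedding model (assumption for this sketch):
    in practice this would return a dense vector produced by the model.
    Here we return fake 3-dimensional vectors for illustration."""
    fake_vectors = {
        "How do I reset my password?": np.array([0.9, 0.1, 0.0]),
        "Steps to recover account access": np.array([0.8, 0.2, 0.1]),
        "Today's football results": np.array([0.0, 0.1, 0.9]),
    }
    return fake_vectors[text]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = "How do I reset my password?"
for doc in ["Steps to recover account access", "Today's football results"]:
    print(doc, round(cosine(embed(query), embed(doc)), 3))
# The semantically related document scores highest even though it shares
# no keywords with the query.
```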
Large Language Models (LLMs) have become essential in information retrieval due to their ability to represent complex language patterns. Models like BERT, GPT, and LLaMA have achieved improved performance and capabilities thanks to the growth of training data, computational resources, and refinements in model architecture. Bing (with ChatGPT) and Google (with BERT) are examples of competing LLM-powered approaches to information retrieval. These models excel at answering questions and implementing semantic search but are limited to topics they have been trained on. They can be used to answer questions about your own documents using approaches like the "modified horseshoe" or "piggyback" approach.
Users tend to interact with search results primarily on the first Search Engine Result Page (SERP), with most not going beyond the first three pages. On average, users use about 2.5 search words per query. Around 60% of queries are location-related, and over 50% of queries originate from mobile devices. Users have limited patience when evaluating search results and often trust the top-ranked results. However, it's important to note that search engines have limitations, including incomplete coverage of online content, outdated indexes, and biases introduced by ranking algorithms and advertisements.
Privacy concerns related to search engines include data collection by news apps and antivirus software without explicit consent. Cookie popups can be misleading and manipulate users into handing over their data. Online marketing has raised the question of why users willingly provide personal information to search engines while being cautious in other contexts. Google, with a dominant market share in search, has faced scrutiny for its handling of user privacy, including issues around location tracking on Android devices and the adoption of Apple's privacy labels. These concerns highlight the need for alternative search engines to reduce the dominance of major tech companies.
Web crawling is the process by which search engines discover and index web pages. It involves fetching and parsing web pages, extracting URLs, and organizing them for further processing. Crawlers are essential for search engines because they enable the discovery of web content and the creation of searchable indexes. Without web crawling, search engines wouldn't have access to the vast and continuously expanding web document collection, which includes various types of content, from truth to misinformation. Crawlers play a vital role in supporting universal search engines, vertical search engines, business intelligence, website monitoring, and addressing malicious activities like email harvesting for spam and phishing.
Basic web crawlers operate by starting with a set of known seed URLs, fetching and parsing them, extracting the URLs they point to, and placing them in a queue for further fetching. This process is repeated to crawl more web pages. However, web crawling presents various challenges, including the need for distributed systems, dealing with malicious pages, addressing latency and bandwidth variations, respecting webmasters' stipulations like robots.txt, handling duplicates, and implementing politeness rules. Challenges in implementation include avoiding duplicate fetches, managing the fast-growing frontier, determining file types, handling fetching errors, and removing duplicate URLs.
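A heavily simplified, single-threaded sketch of the fetch-parse-enqueue loop described above, with no politeness rules, robots.txt handling, or distribution; the seed URL and page limit are illustrative:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_url, max_pages=10):
    frontier = deque([seed_url])   # URLs waiting to be fetched
    visited = set()                # avoid duplicate fetches
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", errors="ignore")
        except Exception:
            continue               # fetching errors are simply skipped here
        visited.add(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            frontier.append(urljoin(url, link))  # resolve relative URLs
    return visited

# crawl("https://example.org")  # seed URL is illustrative
```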
Random walks are a method used in web crawling to determine where to begin exploring the web without bias. They view the web as a directed graph and build a random walk on this graph, including rules to jump back to visited sites and avoid getting stuck. The advantages of random walks include statistical cleanliness and the potential to work for infinite webs under certain conditions. However, they also have disadvantages, such as the challenge of selecting an appropriate list of seed URLs, potential invalidity of practical approximations, and susceptibility to non-uniform distribution and link spamming.
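A toy sketch of a random walk that occasionally jumps back to an already-visited page instead of following a link; the graph, jump probability, and step count are made up for illustration:

```python
import random

# Toy web graph: page -> pages it links to
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": [],          # dangling page with no outgoing links
}

def random_walk(graph, steps=1000, jump_prob=0.15, seed_pages=("A",)):
    visited = list(seed_pages)
    current = random.choice(visited)
    counts = {page: 0 for page in graph}
    for _ in range(steps):
        counts[current] += 1
        # With probability jump_prob, or if stuck on a dangling page,
        # jump back to a previously visited page instead of following a link.
        if random.random() < jump_prob or not graph[current]:
            current = random.choice(visited)
        else:
            current = random.choice(graph[current])
            visited.append(current)
    return counts

print(random_walk(graph))  # visit counts approximate how often pages are reached
```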
Anchor text refers to the text used within hyperlinks to other web pages. Link analysis involves examining the relationships between web pages based on these hyperlinks. In web search and ranking algorithms, anchor text and link analysis are used for several purposes. They can help search engines understand the content and context of linked pages, contribute to the determination of page relevance, and assist in ranking pages based on their authority and popularity. Techniques like PageRank and Hyperlink-Induced Topic Search (HITS) rely on anchor text and link structures to assess the importance of web pages.
PageRank is an algorithm used to measure the importance of a web page independently of any specific query. It models the web as a graph of interconnected pages, where pages have incoming and outgoing links. The algorithm simulates a surfer who follows links and occasionally jumps to a random page. PageRank addresses challenges like dangling nodes (pages with no outgoing links) by introducing random jumps to other web pages. When a query is made, pages that match the query are retrieved and ranked based on a combination of query relevance and PageRank; PageRank is just one of many factors considered for ranking.
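A minimal power-iteration sketch of the PageRank idea on a toy graph; the graph, damping factor, and iteration count are illustrative, and real implementations operate on sparse matrices at web scale:

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph: dict mapping page -> list of pages it links to."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start with a uniform rank
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}  # random-jump share
        for page, out_links in graph.items():
            if out_links:
                share = damping * rank[page] / len(out_links)
                for target in out_links:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank over all pages
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(toy_graph))  # page C accumulates the highest rank
```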
Polyglot Persistence is the practice of using multiple data storage technologies to cater to different data storage needs across an enterprise or within a single application.
An RDBMS organizes data in relations (tables) of tuples, with links between tuples established through primary and foreign keys. However, RDBMSs have limitations, such as limited support for complex data modeling, versioning, and horizontal scalability, and they struggle with data and schema evolution.
ACID stands for Atomicity, Consistency, Isolation, and Durability. Atomicity ensures that a transaction is all or nothing. Consistency requires data to be in a consistent state before and after a transaction. Isolation prevents interference from other processes during a transaction, and Durability ensures changes made by a transaction persist.
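A small sketch of atomicity and durability using Python's built-in sqlite3 module; the table and values are illustrative. Either both account updates are committed together or, on error, both are rolled back:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    # Transfer 30 from alice to bob as a single transaction
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
    conn.commit()      # Durability: the change is persisted only here
except sqlite3.Error:
    conn.rollback()    # Atomicity: on failure, neither update takes effect

print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('alice', 70), ('bob', 80)]
```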
Modern SQL is a standardized query language that supports ACID compliance, nested and aggregated structures, hierarchic and recursive queries, and distributed processing. It aims to maintain the relational model while supporting user-defined types as objects.
NoSQL databases address the needs of Web 2.0, including large datasets, write access, permanent availability, and polyglot persistence. They have characteristics like horizontal scalability, weaker concurrency/transaction models, schema-free design, and use of distributed indexes. They were developed to overcome the limitations of traditional RDBMS in handling these new requirements.
Common types of NoSQL databases include Key/Value databases, Document stores, Column-Oriented databases, and Graph databases. Key/Value databases store data in a dictionary format, Document stores use flexible schemas, Column-Oriented databases are efficient for analytics, and Graph databases manage complex relationships.
Factors to consider when choosing a database technology include the use case (read/write, transactional/analytical), single-user or multi-user requirements, data quantities (small, medium, or big data), data structure (NoSQL or multi-model), and the existing database management systems and expertise in the organization.
In practice, Polyglot Persistence involves selecting the appropriate database technology for each component of an application or enterprise system based on its specific requirements. For example, you might use a document store for unstructured data, a relational database for structured data, and a graph database for managing complex relationships.
A suggested approach is to use a multi-model database like PostgreSQL as a workhorse for transactional and multi-user needs. Additionally, employ specialized databases for analytical queries (non-transactional, single user) to optimize performance. This two-database approach allows you to leverage the strengths of different database technologies for different aspects of your project while avoiding the pitfalls of a one-size-fits-all solution.
Modern SQL includes features such as Recursive CTE (for tree and graph queries), Window Functions (for analytics and sequential processing), enhanced Aggregation functions (GROUP BY), and various data types/structures like Time and Interval, Enumerated Values, Arrays, Key/Values, Documents (JSON, XML), Trees (JSON, XML), and Graphs. These features enhance the capabilities of SQL for handling complex data scenarios.
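A small sketch of one of these features, a recursive CTE for a tree query, run through SQLite from Python; the table, data, and column names are illustrative, and syntax details vary between database systems:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    INSERT INTO employees VALUES
        (1, 'Ada', NULL),      -- root of the hierarchy
        (2, 'Ben', 1),
        (3, 'Cora', 2),
        (4, 'Dan', 2);
""")

# Recursive CTE: walk the management tree starting from the root
rows = conn.execute("""
    WITH RECURSIVE chain(id, name, depth) AS (
        SELECT id, name, 0 FROM employees WHERE manager_id IS NULL
        UNION ALL
        SELECT e.id, e.name, c.depth + 1
        FROM employees e JOIN chain c ON e.manager_id = c.id
    )
    SELECT name, depth FROM chain ORDER BY depth, name
""").fetchall()

print(rows)  # [('Ada', 0), ('Ben', 1), ('Cora', 2), ('Dan', 2)]
```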
Relational Algebra serves as a theoretical foundation for relational databases by defining operators that transform one or more input relations into an output relation. It provides a set of operations to perform selections, projections, unions, set differences, set intersections, and renaming of data in tables, enabling the manipulation and retrieval of data from relational databases.
The Selection operator (σ) is used to select the tuples of a relation that meet certain criteria. For example, σ(C > 3)R selects the tuples of relation R where the value in column C is greater than 3. It filters the rows based on a specified condition and returns only those that satisfy it.
The Projection operator (π) is used to extract specific columns from a relation. It returns a relation containing only the specified columns, removing duplicate tuples in the process. For example, to obtain columns B and C from relation R, π(B, C)R returns a new relation with only columns B and C of R. It is used to select specific attributes of a relation for further processing or analysis.
The Union operator (∪) combines two relations into a new relation that contains all unique tuples from both input relations. The constraint is that both input relations must have the same set of attributes (they must be union-compatible). For example, given two relations FRENCH and GERMAN, the query π(Student_Name)FRENCH ∪ π(Student_Name)GERMAN returns all unique student names from both relations. The Union operator is used to combine data from multiple union-compatible relations without duplicate tuples.
The Set Difference operator (-) in Relational Algebra returns a new relation that contains tuples from the first input relation that are not present in the second input relation. For example, if we have two relations FRENCH and GERMAN, the query π(Student_Name)FRENCH - π(Student_Name)GERMAN would return student names that are in the FRENCH relation but not in the GERMAN relation. The Set Difference operator is used to find the elements unique to one relation when compared to another.
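The four operators from the cards above can be illustrated with plain Python sets and comprehensions; the relation contents and column names are made up for illustration:

```python
# Relations as sets of tuples; the "schema" is the fixed column order.
# R has columns (A, B, C)
R = {(1, "x", 2), (2, "y", 5), (3, "z", 7)}

# Selection sigma_{C > 3}(R): keep only tuples whose C value exceeds 3
selection = {t for t in R if t[2] > 3}          # {(2, 'y', 5), (3, 'z', 7)}

# Projection pi_{B, C}(R): keep only columns B and C, duplicates removed by the set
projection = {(b, c) for (_, b, c) in R}        # {('x', 2), ('y', 5), ('z', 7)}

# Union and set difference on two relations with the same single attribute
FRENCH = {("Anna",), ("Boris",)}
GERMAN = {("Boris",), ("Clara",)}
union = FRENCH | GERMAN                          # all unique student names
difference = FRENCH - GERMAN                     # {('Anna',)} -- only in FRENCH
print(selection, projection, union, difference)
```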