Information and Data Management course overview
Lecture 1: Introduction
Lecture 2: Data modelling
Lecture 3: Data modelling
Lecture 4: Data modelling: relational schema normalization
Lecture 5: Towards Query Processing
Lecture 6: Storage Media
Lecture 7: Transaction Processing & Concurrency Control
Lecture 8: Logging & Recovery Distributed Transactions
Lecture 9: Access Paths
Lecture 10: Indexing: hash tables & trees
Lecture 11: Query Execution and Cardinality Estimation
Lecture 12/13: Query optimization and cardinality estimation
Lecture 14: Query optimization – Join orders.
Lecture 15: Research topics (bonus)
Lecture 16: Database security.
,Lecture 1: Introduction
Three parts of a database system:
1. SQL: the query language
2. Relations: strictly defined mathematical model on which the database is build
3. RDBMS: manages how the data is retrieved and stored on the disk (if you don’t have data
management systems you need to use files on your systems).
Databases are used to manage huge amounts of data. A database is a collection of related data:
• Data represents some aspects of the real world (universe of discourse).
• Data is logically coherent.
• Provided for an intended group of users and applications.
Databases are maintained by software called a database management system (DBMS). A DBMS
allows/enables:
• Definition of data and structure.
• Physical construction.
• Data manipulation.
• Sharing/protecting.
• Persistence/recovery.
Databases can have different underlying data models:
• Most popular: relational data model
• Earlier alternatives: network/hierarchical/object-oriented etc.
Relational Databases established a set of valuable features:
• Strict data modelling.
• Controlled redundancy.
• Data normalization.
• Data consistency and integrity constraints.
• SQL: simple and powerful query language.
• Effective and secure data sharing.
• Backup and recovery.
Databases are well-structured (ER-model → conceptual model)
• Catalog (data dictionary) contains all meta-data
• Defines the structure of the data in the database
Databases aim at efficient manipulation of data.
• Physical tuning allows for good data allocation.
• Indexes speed up search and access.
• Query plan are optimized for improved performance.
Data Independence: databases employ data abstraction by providing data models, applications work only
on the conceptual representation of data:
• Data is strictly typed (Integer, varchar, Timestamp, etc.)
• Details on where data is actually stored and how it is accessed is hidden by the DBMS
• Application can access and manipulate data by invoking declarative operations.
DBMS-controlled parts of the file system are strongly protected against outside manipulation.
Views provide a different perspective of the DB. For the application, a view does not differ from a table.
Views may contain subsets of a DB and/or contain virtual data. Virtual data is derived from the DB (mostly
by simple SQL statements, e.g. joins over several tables). Views can either be computed at query time or
materialized upfront.
Multiple users and applications may access the DB at the same time, concurrency control is necessary
for maintaining consistency.
,Transaction: a unit of work, possibly containing multiple data accesses and updates, that must commit or
abort as a single unit, and follow the ACID principles.
• Atomicity: a transaction is either executed completely (commit) or not at all (abort).
• Consistency: a transaction transforms a consistent database state into a (possibly different)
consistent database state.
• Isolation: a transaction is executed in isolation (i.e., does not see any effect of other concurrently
running transactions).
• Durability: a successfully completed transaction has a permanent effect on the database.
Storage Manager: provides the interface between the data stored in the database and the application
programs and queries submitted to the system. The storage manager is responsible for interaction with the
file manager and for efficient storing, retrieving and updating of data. Tasks: storage access, file
organization, indexing and hashing.
Query Processor: parses queries, optimizes query plans and evaluates the query. Needs to estimate the
cost of operations.
Transaction Manager: ensures that the database remains in a correct state despite system failures,
controls the interaction among concurrent transactions to ensure the database consistency.
A data model describes data objects, operations and their effects (example: relational model).
Data Definition Language (DDL): Create Table, Create View, Constraint/Check, etc.
Data Manipulation Language (DML): Select, Insert, Delete, Update, etc.
DML and DDL are usually clearly seperated, since they handle data and meta-data, respectively.
Schema: describe a part of the structure of the stored data as tables, attributes, views, constraints,
relationships, etc. (meta-data) (intensional database).
• Entity types (a real-world concept) as tables and their attributes (property of an entity)
• Types of attributes and integrity constraints.
• Relationships between entity types as tables.
• Schemas are intended to be stable and not change often
• Basic operations: operations for selections, insertions and updates.
• Optionally user defined operations and types.
System Catalogs: a collection of schemas, contain special schemas describing the schema collection.
The actually stored data is called an instance of a schema (extensional database).
Databases are organized using 3 levels of schemas
1. Internal schema (physical layer): describes the physical storage and access paths. Uses physical
models.
2. Conceptual schema (logical layer): describes structure of the whole database, hiding physical
details. Uses logical data models.
3. External schema (presentation layer): describes parts of the database for a certain user group as
views. Hides conceptual details.
Data Independence: ability to change schema of one level without changing the others.
Logical Data Independence: change of conceptual schema without change of external schemas.
Physical Data Independence: changes of the internal schema do not affect the conceptual schema
(physical tuning is one of the most important maintenance tasks of DB administrators).
, Lecture 2: Modelling 1
The Entity-Relationship Diagrams: the most used conceptual data diagramming style.
Entity types
Represent classes of objects that have:
• Common properties
• Autonomous existence
An occurrence of an entity type is an object of the class
that the entity type represents.
Consequence: an occurrence of an entity type has an
existence independent of the properties associated
with it. Difference from the relational model in which it
is not possible to represent an object without knowing
its properties.
Relationship types
Represent logical links between 2+ entity types.
• Defines a set of associations among
occurrences from these entity types.
• An entity type is said to participate in a
relationship.
Each relationship has a unique name (preferable a
noun instead of verb). There can be different
relationship types between the same entity types.
An occurrence of a relationship type is an n-tuple made
up of occurrences of entity types, one for each of the
entity types involved.
The set of occurrences of a relationship type is a
mathematical relation between the occurrences of the
involved entity types and thus a subset of the cartesian product of the entity occurrences. This means that
no n-tuple can be repeated among the occurrences of a relationship type of the ER-model, therefore, no
identical occurrences.
Modeling multiple occurrences (example: student and course related by an exam which can be retaken)
can be done by using higher degree relationship.
Degree of a relationship type: number of participating entity types (binary, ternary, recursive).
Cardinality: describes the maximum and minimum number of relationship occurrences in which an entity
occurrence can participate → specified for each entry participating in a relationship.
• Maximum can be:
o 1: each occurrence of the entity is associated at most with a single occurrence of the
relationship.
o N: each occurrence of the entity is associated with an arbitrary number of occurrences of the
relationship.
• Minimum can be 0 or 1.
o 0: the participation in the relationship is optional or partial.
o 1: the participation is mandatory or total (existence dependency).
In n-ary relationships, the entities involved almost always have a maximum cardinality that is equal to N. If
an entity E participates with maximum cardinality 1, it is possible to remove the n-ary relationship and relate
E with one of the other entities with a binary relationship.
Recursive relationship types: relationships between an entity type and itself (the same entity participates
in the relation with different roles).