Paper 1: Data warehouse design
1. Introduction to Data Warehousing
Too much data makes the extraction of the most important information difficult: ‘data = information’
is not always correct. Data warehousing grew from the huge amount of electronic data and the need
to use that data to accomplish goals. Quantifying and evaluating how each branch contributes to
global business performance by tailor-made queries is infeasible: it takes too much time and
resources, doesn’t always achieve the desired results, and slows down the system. Today’s advanced
data warehousing processes separate online analytical processing (OLAP) from online transactional
processing (OLTP) by creating a new information repository, that (1) integrates basic data from
various sources, (2) properly arranges data formats, and (3) makes data available for analysis and
evaluation.
Data warehousing can be applied to a wide range of fields and is not only restricted to enterprises. A
common property to all fields is the need for (1) storage and (2) query tools to retrieve information
summaries easily and quickly. This allows us to study business phenomena, learn about meaningful
correlations, and gain knowledge to support decision-making processes.
1.1 Decision Support Systems
In the past, enterprise databases stored only operational
data: data created by business operations involved in
daily management processes. To get quick,
comprehensive access to the information required by
decision-making processes, this strategic information is
extracted from operational data stored in enterprise
databases by means of progressive selection and
aggregation process.
Computers are the only tools suitable for providing data for decision-making due to an exponential
increase in operational data. This has affected the role of enterprise databases and introduced
decision support systems (DSS): “a set of expandable, interactive IT techniques and tools designed for
processing and analysing data and for supporting managers in decision making. To do this, the system
matches individual resources of managers with computer resources to improve the quality of the
decisions made.” It helps managers make decisions or choose among different alternatives (using
value estimates). It typically includes a (1) model-based management system connected to a (2)
knowledge engine, and an (3) interactive graphical user interface. Decision-makers can formulate
queries and conduct complex analyses on relevant information without slowing down operational
systems. A possible classification of DSSs on the basis of their functions:
,1.2 Data Warehousing
Data warehousing: “a collection of methods, techniques, and tools used to support knowledge
workers—senior managers, directors, managers, and analysts—to conduct data analyses that help
with performing decision-making processes and improving information resources.” It’s been brought
into business because classic information systems were unable to give the right access to the required
information in a proper format, substantially different results between different business area,
unable to manipulate data, unable to examine data at the maximum level of detail, and offer incorrect
or incomplete data. This gives the following key words distinguishing marks and essential
requirements for a data warehouse process:
• Accessibility – to users not very familiar with IT and data structures;
• Integration – of data on the basis of a standard enterprise model;
• Query flexibility – to maximize the advantages obtained from the existing information;
• Information conciseness – allowing for target-oriented and effective analyses;
• Multidimensional representation – giving users an intuitive and manageable view of
information;
• Correctness and completeness – of integrated data.
Data warehouses are placed in the middle of this process and act as repositories for data: they make
sure that the requirements set can be fulfilled. A data warehouse: “a collection of data that supports
decision-making processes.” It provides the following features:
• It’s subject-oriented: they hinge on enterprise-specific concepts, while operational databases
hinge on many different enterprise-specific applications;
• It’s integrated and consistent: multiple data sources are combined (integrated) and provides
a unified view of all the data (consistent). A data warehouse system doesn’t require new
information to be added, but existing information needs rearranging. So an IS should be
previously available;
• It shows its evolution over time and it is not volatile: enables analyses that cover a few years.
, Data is never deleted from data warehouses and updates are carried out when data warehouses are
offline: it can be seen as a read-only database. This improves short analysis query response time,
affects data warehouse-specific DBMS technologies (no need for advanced transactional
management techniques), and data warehouse-specific logical design solutions are completely
different (table normalization given up to improve performance).
Operational queries execute
transactions reading/writing a small
number of tuples from/to tables (data of
a customer or an order), called an OLTP
query. Data warehouses use OLAP
queries: dynamic, multi-dimensional
analyses scanning a huge amount of
records summing up the performance.
OLTP systems have an essential workload
core ‘frozen’ in application programs, so
ad hoc data queries are occasionally run
for data maintenance, while data
warehouse interactivity constantly changes the actual workload as time goes by. The distinctive
features of OLAP queries suggest adoption of a multidimensional representation for data warehouse
data.
1.3 Data Warehouse Architectures
Architecture properties essential for a data warehouse system:
• Separation – analytical and transactional processing should be kept apart as much as possible;
• Scalability – hardware and software architectures should be easy to upgrade as the data
volume, which has to be managed and processed, and the number of users’ requirements,
which have to be met, progressively increase;
• Extensibility – the architecture should be able to host new applications and technologies
without redesigning the whole system;
• Security – monitoring accesses is essential because of the strategic data stored in data
warehouses;
• Administrability – data warehouse management should not be overly difficult.
There are 2 different classifications commonly adopted for data warehouse architectures: (1) a
structure-oriented one (1.3.1, 1.3.2, 1.3.3), depending on the number of layers used by the
architecture, and (2) depending on how the different layers are employed to create enterprise-
oriented or department-oriented views of data warehouses (1.3.4).
1.3.1 Single-Layer Architecture (not frequently used in practice)