Data-Intensive Systems and Applications
2ID70: Quartile 3 2021-2022
Relational Databases
Indices
Order of tuples has a huge impact on which queries we can execute efficiently.
In queries you can search by id or by attributes. This is done in the WHERE clause.
Naïve selection in queries:
1. Load all records from the database (typically in disk blocks)
2. Keep the ones that satisfy the constraints
Indices can help with this (index):
• Prepare your data by building indices on the chosen attributes (library, phonebooks, etc)
BASIC CONCEPTS OF INDICES
Indices can speed up access to the desired data. <Search-Key, Pointer>
Search-Key: (set of) attributes used for looking up records in a file.
There are three types of indices that you can use based on your goals:
• Ordered indices: search keys are stored in sorted order
• Hashed indices: search keys are distributed uniformly across buckets using a hash function
• Bitmap indices: distinct values are mapped to positions of tuples
Indices are evaluated by: access types supported efficiently, access time, insertion time, deletion time,
space overhead. Indices also store data. If the data is big, the indices are big as well.
ORDERED INDICES: index entries are stored sorted on the search key value.
The classification of ordered indices:
• Density: Dense vs Sparse indices
• Primary vs Secondary indices
• Single-level vs Multi-level indices
Dense index: index record appears for every search-key value in the file.
Sparse index: contains index records only for some search-key value. This can be used when the
records are sequentially ordered on the search-key. (e.g. top of a dictionary)
To locate a record with search-key value 𝐾 we:
• Find index record with largest search-key value < 𝐾
• Search file sequentially starting at the record to which the index record points
Sparse indices need less space and less maintenance overhead for insertions & deletions. And they are
generally slower than dense index for locating records. Good trade-off: sparse index with an index
entry for every block of the table, corresponding to the least search-key value in the block.
,Primary index: the index whose search key specifies the sequential order of the file, clustering index.
Secondary index: an index whose search key specifies an order different from the sequential order of
the file. Also called the non-clustering index.
Index-sequential file: file with primary index on the search key.
A sequential scan using primary index is efficient, but a sequential scan using a secondary index is
expensive.
Single-level index: It is necessary to store the indices as well. It is like storing another table (disk
blocks). If the index is very large, it needs to be stored in the hard disk. Accessing the index will cost
non-negligible I/O. After the position is found, an additional I/O is needed for fetching the index.
Multi-level index: Keep the index on a disk, and treat is as a sequential sorted file and construct a
sparse index on it. It can be repeated recursively until the outer index fits into the RAM. The indices at
all levels must be updated on insertion or deletion from the file.
- Outer index: a sparse index of the index file
- Inner index: the index file
Composite keys: an index based on multiple search-keys. The order of in the key is important.
Indices with unique constraints: there can be only one record with a particular attribute value. This is
always the case for a primary key. Enforcing these unique constraints is costly. Without an index, this
takes O(n) time.
SQL COMMANDS
Non-clustered index (default):
CREATE INDEX idx_name ON TABLE tbl_name (list_of_attr)
Clustered index:
CREATE CLUSTERED INDEX idx_name ON TABLE tbl_name (list_of_attr)
Unique index:
CREATE UNIQUE INDEX idx_name ON TABLE tbl_name (list_of_attr)
Delete index:
DROP INDEX idx_name
If the index ID is not included in the table, it will take the first one as the index. So the ID is not needed
to be specified inside the query.
In the WHERE clause, the order of the attributes is irrelevant.
We can also use multiple indices for a query. You can either use only one of the indices. Or you can use
both. Or any other combination. In the case of all the indices, we need to get pointers to records from
each index. Then, compute the intersection and get the actual records.
Choices depends on the data characteristics and the query:
• Use only one: good when this attribute is very selective
• Use both: good when at least one of them is very selective. Bad is both are not selective, but
the combination is very selective. This can be slower than the sequential read.
UPDATING INDICES: this imposes overhead on database modification.
• Deletion:
o Single-level index entry deletion
▪ Dense Indices: if a deleted record was the only record in the file with its
particular search-key value, the search-key is deleted from the index as well.
▪ Sparse indices: if an entry for the search-key exists in the index, it is deleted
by replacing the entry in the index ith the next search-key value in the file. If
, the next search-key value already has an index entry, the entry is deleted
instead of being replaced.
o Multi-level insertion: algorithms are simple extensions of the single level algorithms
• Insertion:
o Single-level index insertion: perform a lookup using the search-key value appearing in
the record to be inserted.
▪ Dense indices: if the search-key value does not appear in the index, insert it.
▪ Spare indices: if the index stores an entry for each block of the file, no change
needs to be made to the index unless a new block is created. If this new block
is created, the first search-key value appearing in the new block is inserted
into the index.
o Multi-level insertion: algorithms are simple extensions of the single level algorithms
A limitation of ordered indices is that you still need to access an index structure to find the location in
the file, typically with a binary search (𝑂(log(𝑛)) I/O). And also, it can be expensive to update.
HASHED INDICES
Hashed indices: solution to the limitations of ordered indices. This optimizes for equality queries.
Static Hashing: a bucket is a unit of storage containing one or more records.
Hash file organization: obtain the bucket of a record directly from its search-key value using a
hash function.
Hash function ℎ: is an function from the set of all search-key values K to the set of all bucket
addresses B.
We can use this hash function to locate the buckets containing the records with a particular search
key. Records with different search-key values may be mapped to the same bucket. The entire bucket
has to be search sequentially to locate a record.
Hashing without integers is also possible. The hash function returns the sum of the binary
representations of the characters modulo the number of buckets. Binary representation of the 𝑖 𝑡ℎ
character is assumed to be the integer 𝑖.
The worst hash function maps all the input keys to the same bucket.
The ideal hash function is a uniform distribution of the keys to buckets, irrespective of the actual
distribution of the keys.
Typically, hash functions operate on the internal binary representation of the key.
Buckets have a fixed size. This means that they can overflow. There are too many records and
insufficient buckets. There is a skew in the distribution of the records.
The chosen hash function produces non-uniform distribution of the key values.
Multiple records have the same search-key value.
This can be fixed by overflow buckets, a place to store extra records.
Overflow chaining: The overflow buckets of a given bucket are chained together in a linked list. (1)
A hash index organizes the search keys, with their associated record
pointers, into a hash file structure. You do not need to specify a hash
function. The computation is very fast, but it can only be used for equality
queries (no <,>, regEx). Hash indices do not support range queries.