Class notes

Class notes of all lectures of Data Intensive Systems and Applications

3 purchases

Course
Data Intensive Systems and Applications (2ID70)

Institution
Technische Universiteit Eindhoven (TUE)

This file contains the notes of all the lectures from O. Papapetrou on this subject. It does not contains the extra lecture with an explanation of lambda expressions. It also does not contain the additional slides. It does contain extra comments the teacher made during the lectures.

[Show more]

Preview 3 out of 20 pages

View example

Uploaded on April 6, 2022
Number of pages 20
Written in 2021/2022
Type Class notes
Professor(s) O. papapetrou
Contains All classes

data intensive systems
hadoop
mapreduce
spark
postgresql
database optimizations
big data
relational databases
computer science
data science

Institution
Technische Universiteit Eindhoven (TUE)
Education
Computer Science And Engineering
Course
Data Intensive Systems and Applications (2ID70)

datasciencestudent

Member since 4 year 37 documents sold

$4.79

Add to cart

Add to wishlist

100% satisfaction guarantee
Immediately available after payment
Both online and in PDF
No strings attached

Data-Intensive Systems and Applications
2ID70: Quartile 3 2021-2022

Relational Databases

Indices
Order of tuples has a huge impact on which queries we can execute efficiently.

In queries you can search by id or by attributes. This is done in the WHERE clause.
Naïve selection in queries:
1. Load all records from the database (typically in disk blocks)
2. Keep the ones that satisfy the constraints

Indices can help with this (index):
• Prepare your data by building indices on the chosen attributes (library, phonebooks, etc)

BASIC CONCEPTS OF INDICES
Indices can speed up access to the desired data. <Search-Key, Pointer>
Search-Key: (set of) attributes used for looking up records in a file.

There are three types of indices that you can use based on your goals:
• Ordered indices: search keys are stored in sorted order
• Hashed indices: search keys are distributed uniformly across buckets using a hash function
• Bitmap indices: distinct values are mapped to positions of tuples

Indices are evaluated by: access types supported efficiently, access time, insertion time, deletion time,
space overhead. Indices also store data. If the data is big, the indices are big as well.

ORDERED INDICES: index entries are stored sorted on the search key value.
The classification of ordered indices:
• Density: Dense vs Sparse indices
• Primary vs Secondary indices
• Single-level vs Multi-level indices

Dense index: index record appears for every search-key value in the file.
Sparse index: contains index records only for some search-key value. This can be used when the
records are sequentially ordered on the search-key. (e.g. top of a dictionary)
To locate a record with search-key value 𝐾 we:
• Find index record with largest search-key value < 𝐾
• Search file sequentially starting at the record to which the index record points

Sparse indices need less space and less maintenance overhead for insertions & deletions. And they are
generally slower than dense index for locating records. Good trade-off: sparse index with an index
entry for every block of the table, corresponding to the least search-key value in the block.

,Primary index: the index whose search key specifies the sequential order of the file, clustering index.
Secondary index: an index whose search key specifies an order different from the sequential order of
the file. Also called the non-clustering index.
Index-sequential file: file with primary index on the search key.

A sequential scan using primary index is efficient, but a sequential scan using a secondary index is
expensive.

Single-level index: It is necessary to store the indices as well. It is like storing another table (disk
blocks). If the index is very large, it needs to be stored in the hard disk. Accessing the index will cost
non-negligible I/O. After the position is found, an additional I/O is needed for fetching the index.
Multi-level index: Keep the index on a disk, and treat is as a sequential sorted file and construct a
sparse index on it. It can be repeated recursively until the outer index fits into the RAM. The indices at
all levels must be updated on insertion or deletion from the file.
- Outer index: a sparse index of the index file
- Inner index: the index file

Composite keys: an index based on multiple search-keys. The order of in the key is important.

Indices with unique constraints: there can be only one record with a particular attribute value. This is
always the case for a primary key. Enforcing these unique constraints is costly. Without an index, this
takes O(n) time.

SQL COMMANDS
Non-clustered index (default):
CREATE INDEX idx_name ON TABLE tbl_name (list_of_attr)
Clustered index:
CREATE CLUSTERED INDEX idx_name ON TABLE tbl_name (list_of_attr)
Unique index:
CREATE UNIQUE INDEX idx_name ON TABLE tbl_name (list_of_attr)
Delete index:
DROP INDEX idx_name

If the index ID is not included in the table, it will take the first one as the index. So the ID is not needed
to be specified inside the query.
In the WHERE clause, the order of the attributes is irrelevant.

We can also use multiple indices for a query. You can either use only one of the indices. Or you can use
both. Or any other combination. In the case of all the indices, we need to get pointers to records from
each index. Then, compute the intersection and get the actual records.
Choices depends on the data characteristics and the query:
• Use only one: good when this attribute is very selective
• Use both: good when at least one of them is very selective. Bad is both are not selective, but
the combination is very selective. This can be slower than the sequential read.

UPDATING INDICES: this imposes overhead on database modification.
• Deletion:
o Single-level index entry deletion
▪ Dense Indices: if a deleted record was the only record in the file with its
particular search-key value, the search-key is deleted from the index as well.
▪ Sparse indices: if an entry for the search-key exists in the index, it is deleted
by replacing the entry in the index ith the next search-key value in the file. If

, the next search-key value already has an index entry, the entry is deleted
instead of being replaced.
o Multi-level insertion: algorithms are simple extensions of the single level algorithms
• Insertion:
o Single-level index insertion: perform a lookup using the search-key value appearing in
the record to be inserted.
▪ Dense indices: if the search-key value does not appear in the index, insert it.
▪ Spare indices: if the index stores an entry for each block of the file, no change
needs to be made to the index unless a new block is created. If this new block
is created, the first search-key value appearing in the new block is inserted
into the index.
o Multi-level insertion: algorithms are simple extensions of the single level algorithms

A limitation of ordered indices is that you still need to access an index structure to find the location in
the file, typically with a binary search (𝑂(log(𝑛)) I/O). And also, it can be expensive to update.

HASHED INDICES
Hashed indices: solution to the limitations of ordered indices. This optimizes for equality queries.
Static Hashing: a bucket is a unit of storage containing one or more records.
Hash file organization: obtain the bucket of a record directly from its search-key value using a
hash function.
Hash function ℎ: is an function from the set of all search-key values K to the set of all bucket
addresses B.
We can use this hash function to locate the buckets containing the records with a particular search
key. Records with different search-key values may be mapped to the same bucket. The entire bucket
has to be search sequentially to locate a record.

Hashing without integers is also possible. The hash function returns the sum of the binary
representations of the characters modulo the number of buckets. Binary representation of the 𝑖 𝑡ℎ
character is assumed to be the integer 𝑖.

The worst hash function maps all the input keys to the same bucket.
The ideal hash function is a uniform distribution of the keys to buckets, irrespective of the actual
distribution of the keys.
Typically, hash functions operate on the internal binary representation of the key.

Buckets have a fixed size. This means that they can overflow. There are too many records and
insufficient buckets. There is a skew in the distribution of the records.
The chosen hash function produces non-uniform distribution of the key values.
Multiple records have the same search-key value.
This can be fixed by overflow buckets, a place to store extra records.
Overflow chaining: The overflow buckets of a given bucket are chained together in a linked list. (1)

A hash index organizes the search keys, with their associated record
pointers, into a hash file structure. You do not need to specify a hash
function. The computation is very fast, but it can only be used for equality
queries (no <,>, regEx). Hash indices do not support range queries.

The benefits of buying summaries with Stuvia:

Guaranteed quality through customer reviews

Stuvia customers have reviewed more than 700,000 summaries. This how you know that you are buying the best documents.

Quick and easy check-out

You can quickly pay through credit card or Stuvia-credit for the summaries. There is no membership needed.

Focus on what matters

Your fellow students write the study notes themselves, which is why the documents are always reliable and up-to-date. This ensures you quickly get to the core!

Frequently asked questions

What do I get when I buy this document?

You get a PDF, available immediately after your purchase. The purchased document is accessible anytime, anywhere and indefinitely through your profile.

Satisfaction guarantee: how does it work?

Our satisfaction guarantee ensures that you always find a study document that suits you well. You fill out a form, and our customer service team takes care of the rest.

Who am I buying these notes from?

Stuvia is a marketplace, so you are not buying this document from us, but from seller datasciencestudent. Stuvia facilitates payment to the seller.

Will I be stuck with a subscription?

No, you only buy these notes for $4.79. You're not tied to anything after your purchase.

Can Stuvia be trusted?

4.6 stars on Google & Trustpilot (+1000 reviews)

75282 documents were sold in the last 30 days

Founded in 2010, the go-to place to buy study notes for 15 years now

Start selling

Popular Universities in the United States

Popular books

Find notes and summaries for these qualifications

Seller