Data Analytics (2IAB0) Summary Lectures 2020

EN: Data Analytics for Engineers (2IAB0) is a basic course of the Bachelor College at Eindhoven University of Technology. This means that all TU/e Bachelor students must complete this course. It is given in the third quartile of the first year. Data Analytics for Engineers provides more in...

  • May 28, 2020
  • April 2, 2021
  • 14
  • 2019/2020
  • Summary

Data Analytics (2IAB0) Summary Lectures
Week 1: EDA
Descriptive data analytics uses data that has already been collected (possibly for other purposes) to give
insight into the past. Predictive data analytics looks into the future: it only makes predictions and gives
no indication of what we should do. Prescriptive data analytics consists of data-driven advice on how to
take action to influence or change the future.
Data are raw, unorganized numbers, facts, etc. Information is structured, meaningful and useful numbers
and facts.
There are two data forms/types:
- categorical: a) dichotomous: exactly two categories (yes/no, male/female)
b) nominal: no ordering (genre)
c) ordinal: has ordering (ratings, bad - good)
- numerical: a) interval: no fixed “zero point”, only differences have meaning (temperature in °F, ranking)
b) ratio: has a fixed “zero point”, so ratios also make sense (budget, running time)
A reference table stores “all” data in a table so that it can be looked up easily. A demonstration table is a
table to illustrate a point (so present just enough data). In a table, pay attention to the kind of data type,
units of measurement, whether the values make sense when comparing columns or rows and which
column/row has the largest/smallest values.
Asking what to expect is also an important way to spot errors. You can ask two questions: “What are
reasonable values?” (e.g. for a human age) and “Given one value, what could the others be?” (e.g. given a
travel time, what distance is plausible?).
Typical questions for statistical plots are about whether values are as expected, what typical sizes are,
the variation of the values, the distribution of the values and whether there are any exceptional values.
A scatter plot is good for showing actual values and structure of numerical variables but it is not suitable
for large data sets because of the many overlapping dots. The jitter option (which changes horizontal
placement) may help to avoid this.
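The jitter idea can be sketched in a few lines. This is only an illustration: the function name `jitter`, the noise width of 0.2 and the example ratings are assumptions, not part of the lecture material.

```python
import random

def jitter(xs, width=0.2, seed=0):
    """Return a copy of xs with small uniform noise added to each value.

    Applied to the horizontal coordinate of a scatter plot, this spreads
    out dots that would otherwise be drawn exactly on top of each other.
    """
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    return [x + rng.uniform(-width, width) for x in xs]

# Hypothetical data: many identical x-values would overlap in a scatter plot.
ratings = [1, 1, 1, 2, 2, 3, 3, 3, 3]
jittered = jitter(ratings)
```

The noise only changes where a dot is drawn, so the width should stay small compared to the distance between neighbouring values.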
The choice of plot depends on the data type: bar charts –> categorical data, histograms –> numerical data.
A histogram has the range of data values split in bins (intervals of values). You can choose the number of
bins or the bin size. The histogram will show the number of observations in the dataset for every bin. The
rule of thumb for choosing number of bins is √𝑛 where n is the number of observations. If the bin width is
too small, the histogram will be too wiggly. If it is too large, there are too few details.
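A minimal sketch of the binning described above, assuming equal-width bins between the minimum and maximum value. The function name `histogram_counts` and the example data are made up for illustration.

```python
import math

def histogram_counts(values, n_bins=None):
    """Count observations per equal-width bin.

    If n_bins is not given, the rule of thumb sqrt(n) is used,
    rounded to the nearest whole number (the rounding is an assumption).
    """
    n = len(values)
    if n_bins is None:
        n_bins = max(1, round(math.sqrt(n)))
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width if all values are equal
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin instead of a new one.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts, width

values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 7, 7, 8, 9, 10, 12]
counts, width = histogram_counts(values)  # 16 observations, so 4 bins
```

Passing a larger `n_bins` gives narrower bins and a more wiggly histogram, while a smaller `n_bins` hides detail.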
In a cumulative histogram, the vertical axis reflects the share (or %) of the observations in a dataset with
values smaller than a value specified on the horizontal axis.
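The height of a cumulative histogram at a given point can be computed directly. The sketch below assumes that observations equal to the threshold are counted as well; the function name and data are illustrative only.

```python
def cumulative_share(values, threshold):
    """Share of observations with a value up to and including threshold."""
    return sum(v <= threshold for v in values) / len(values)

values = [2, 4, 4, 5, 7, 9, 10, 12]
share = cumulative_share(values, 5)  # 4 of the 8 observations are <= 5
```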
Kernel density plots make use of bandwidths. The idea is that each observation indicates that its value is
possible, and that nearby values could also occur (but are less likely). To make the plot: choose a
bandwidth to be taken around each observation, generate a kernel with that bandwidth for every observation
in the dataset, and sum the kernels; the sum is the kernel density plot. Choosing the bandwidth is important!
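The steps above can be sketched as follows. Two assumptions are made here that the notes do not state: the kernel is a Gaussian (bell) curve, and the sum is divided by the number of observations so that the total area under the curve is 1.

```python
import math

def gaussian_kernel(x, center, bandwidth):
    """Bell-shaped kernel centred on one observation (assumed kernel shape)."""
    z = (x - center) / bandwidth
    return math.exp(-0.5 * z * z) / (bandwidth * math.sqrt(2 * math.pi))

def kernel_density(x, observations, bandwidth):
    """Sum of the kernels of all observations, scaled to a total area of 1."""
    total = sum(gaussian_kernel(x, obs, bandwidth) for obs in observations)
    return total / len(observations)

observations = [1.0, 2.0, 2.5, 6.0]   # hypothetical data
bandwidth = 0.5                       # hypothetical choice
density_near_cluster = kernel_density(2.0, observations, bandwidth)
density_in_gap = kernel_density(4.0, observations, bandwidth)
```

A small bandwidth gives a separate bump for every observation; a large bandwidth smooths the observations into one broad hill.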
Summary statistics are numbers that describe level (location statistics: what are “typical” values?) and
spread (scale statistics: how much do values vary?). Typical distribution shapes are as follows:
- unimodal distribution (1 peak)
- bimodal distribution (2 peaks, possibly due to 2 different groups)
- symmetric distribution (left = right)
- right-skewed (asymmetric) distribution (peak at the left, tail at the right; a mean that is larger than
the median indicates a right-skewed distribution)
There are the following location statistics:
- mean (average, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$)
- mode (most frequently occurring value)
- median (the middle value, or the average of the two middle values)
- quartiles/percentiles (the 1st quartile is the cut-off point for 25% of the data; the pth percentile is the
cut-off point for p% of the data)
To compute the pth percentile, first compute its location in an ordered data set of n observations:
$L_p = \frac{p}{100}(n + 1)$. The percentile value then follows by linear interpolation:
let l and h be the observations at positions $\lfloor L_p \rfloor$ and $\lceil L_p \rceil$ in the ordered data set, then
pth percentile value = $l + (L_p - \lfloor L_p \rfloor)(h - l)$.
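The percentile computation can be written out as a short sketch. The function name and the data are illustrative; clamping the positions to the range 1..n for very small or very large p is an assumption that the notes do not cover.

```python
import math

def percentile(values, p):
    """pth percentile using L_p = p/100 * (n + 1) and linear interpolation."""
    data = sorted(values)
    n = len(data)
    location = p / 100 * (n + 1)   # L_p, a 1-based position in the ordered data
    lower_pos = math.floor(location)
    upper_pos = math.ceil(location)
    # Assumption: positions outside 1..n are clamped to the first/last value.
    lower_pos = min(max(lower_pos, 1), n)
    upper_pos = min(max(upper_pos, 1), n)
    l = data[lower_pos - 1]        # observation at position floor(L_p)
    h = data[upper_pos - 1]        # observation at position ceil(L_p)
    return l + (location - math.floor(location)) * (h - l)

data = [3, 5, 7, 8, 12, 13, 14, 18, 21]   # n = 9
q1 = percentile(data, 25)       # L_p = 2.5, halfway between 5 and 7
median = percentile(data, 50)   # L_p = 5, exactly the 5th value
q3 = percentile(data, 75)       # L_p = 7.5, halfway between 14 and 18
```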
There are the following scale statistics:
- range (max - min)
- interquartile range (IQR) (3rd quartile - 1st quartile)
- sample variance ($\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, with $\bar{x}$ the mean)
- sample standard deviation ($\sigma = \sqrt{\sigma^2}$)
- median absolute deviation (MAD) (median of the absolute deviations from the median)

Data Analytics (2IAB0) Summary Q3 2020 by Isabel Rutten

The higher these statistics, the more spread/variability in the data. Variance and standard deviation are
sensitive to outliers; IQR and MAD are not.
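The difference in sensitivity to outliers can be checked with a small example. The data are made up; `statistics.quantiles` (Python 3.8+) with its default "exclusive" method is used here because it places quartiles at position p/100 * (n + 1), and the helper names `iqr` and `mad` are illustrative.

```python
import statistics

def iqr(values):
    """Interquartile range: 3rd quartile minus 1st quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

def mad(values):
    """Median of the absolute deviations from the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

clean = [10, 12, 13, 14, 15, 16, 18]
with_outlier = [10, 12, 13, 14, 15, 16, 100]   # largest value replaced by an outlier

for name, data in [("clean", clean), ("with outlier", with_outlier)]:
    print(name,
          "variance:", round(statistics.variance(data), 1),
          "std:", round(statistics.stdev(data), 1),
          "IQR:", iqr(data),
          "MAD:", mad(data))
```

Replacing a single value changes the variance and standard deviation drastically, while the IQR and MAD stay the same in this example.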
Standardization / z-score normalization: the z-score transforms data from their original units into the
universal statistical unit of standard deviations from the mean, using the following formula:
$z = \frac{x - \bar{x}}{\sigma}$, with $\bar{x}$ the mean and $\sigma$ the sample standard deviation.
The mean value of the transformed data set is 0 and the sample standard deviation is 1. A negative z-score
means that the value is below the mean, a positive one means above the mean. Observations with an
absolute z-score larger than 2.5 are considered outliers.
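A minimal sketch of standardization with made-up data. It assumes the sample standard deviation (dividing by n - 1) and flags values whose absolute z-score exceeds 2.5; the function name `z_scores` is illustrative.

```python
import statistics

def z_scores(values):
    """Transform values into standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)   # sample standard deviation (n - 1)
    return [(v - mean) / sd for v in values]

data = [12, 13, 13, 14, 14, 14, 15, 15, 16, 16, 17, 40]
z = z_scores(data)

# Flag observations that lie more than 2.5 standard deviations from the mean.
outliers = [v for v, score in zip(data, z) if abs(score) > 2.5]
```

In a very small data set no value can reach a z-score of 2.5, however extreme it is, so this rule needs a reasonable number of observations.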
A Box(-and-Whisker) plot is a convenient way to graphically display summary statistics since it shows the
median, the 1st and 3rd quartile and the minimum and maximum values. It is better than histograms and
kernel density plots for comparing groups, but those are better for showing the shape of a distribution.
QUIZ: The variance of 5 numbers is 10. If each number is divided by 2, then the variance of the new
numbers is 2.5. Dividing every number by 2 also divides every difference from the mean by 2, so every
squared difference is divided by 2^2 = 4 and the variance becomes 10/2^2 = 2.5.
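The quiz answer can be verified numerically. The five numbers below are an assumed example, chosen so that their sample variance is exactly 10.

```python
import statistics

numbers = [2, 4, 6, 8, 10]            # hypothetical numbers with sample variance 10
halved = [x / 2 for x in numbers]     # each number divided by 2

variance_before = statistics.variance(numbers)
variance_after = statistics.variance(halved)
print(variance_before, variance_after)
```

The same reasoning gives the general rule: dividing all values by c divides the variance by c^2 and the standard deviation by c.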

Week 2: VIS
We make data visualizations, not infographics (which focus on telling a story creatively rather than on the data).
Visualization has always been important in history, for example when recording a pulse signal or when
trying to find the reason for an epidemic (visualizing the deaths which strangely occur near a pump). Also,
communicating data effectively is of importance, for example for a subway map.
Visualization is the process that transforms (abstract) data into (interactive) graphical representations for
the purpose of exploration, confirmation or communication. Communication is done to inform humans and
shows specific aspects of a larger dataset to allow the reader to better connect the presented information to
their existing knowledge. Exploration is done when questions are not well-defined and shows a large,
complex dataset which is meant for professionals. Confirmation is a combination of those two.
Why do we visualize data?
In the case of high-level actions, we analyze the data:
- visualization for consuming information: - discover – present – enjoy (meant for end-users)
- visualization for producing data: - annotate – record – derive (extends the dataset)
In the case of lower-level actions, we search in the data:
Lookup: search in a dictionary how to spell a certain word
Browse: look for a synonym for a certain word
Locate: try to find your lost keys
Explore: look for unexpected patterns

There are two kinds of targets:
- We look at all data and then at the trends (which define the “mainstream”), the outliers (which stand
out from the mainstream) and the features (task-dependent structures of interest).
- We look at attributes and then at one attribute (by analyzing the distribution or the extremes) or at
many attributes (by analyzing dependency, correlation or similarity).
Human perception can be influenced; this has been studied in a psychological theory called “Gestalt
theory”. Proximity: objects close to each other are perceived as a group. Similarity: objects that are
similar (color, shape, etc.) are perceived as a group. Continuity: we unconsciously draw a line through
the points in a graph. So: the position and arrangement of visual elements is the most important
channel for visualizations.
Perception of colors begins with 3 types of specialized retinal cells known as cone cells (commonly called
the red, green and blue cone cells). Their signals are combined into three channels: black-white,
red-green and blue-yellow. Combining these gives the perceived color. However, you can be color blind
when one of those cone cell types is missing. It is rare to miss the blue cone cell, so making
visualizations in blue-yellow colors is safe.
There are several ordering directions: - sequential (XS – S – M – etc.) – diverging (… -10 … 0 … 5 …)
– cyclic (days of the week)
A key attribute (also called an independent attribute) acts as an index that is used to look up value
attributes (also called dependent attributes).
Data visualization makes use of marks (geometric primitives: points, lines, areas, complex shapes) and
channels (appearance of marks: position, color, length, size, shape).
