Data Analytics (2IAB0) Summary Lectures 2020


Data Analytics for Engineers (2IAB0) is a basic course of the Bachelor College at Eindhoven University of Technology, which means that all TU/e Bachelor students have to complete it. It is given in the third quartile of the first year. Data Analytics for Engineers provides more in...

  • 28 May 2020
  • 2 April 2021
  • 14
  • 2019/2020
  • Summary
Data Analytics (2IAB0) Summary Lectures
Week 1: EDA
Descriptive data analytics gives an insight into the past; the data is collected now and may be used later
for other purposes. Predictive data analytics is for looking into the future: we only make predictions
but don’t give an indication of what we should do. Prescriptive data analytics consists of data-driven
advice on how to take action to influence or change the future.
Data are raw, unorganized numbers, facts, etc. Information is structured, meaningful and useful numbers
and facts.
There are two data forms/types:
- categorical: a) dichotomous: two possible values (yes/no, male/female) b) nominal: no ordering (genre)
c) ordinal: has ordering (ratings, bad – good)
- numerical: a) interval: no fixed “zero point”, only differences have meaning (temperature in °F, ranking)
b) ratio: has a fixed “zero point”, so ratios also make sense (budget, running time)
A reference table stores “all” data in a table so that it can be looked up easily. A demonstration table is a
table to illustrate a point (so present just enough data). In a table, pay attention to the kind of data type,
units of measurement, whether the values make sense when comparing columns or rows and which
column/row has the largest/smallest values.
Asking what to expect is also an important way to spot errors. You can ask two questions: “What are
reasonable values?” (for example, for a human age) and “Given one value, what could the others be?” (for
example, given a time, what distance is plausible?).
Typical questions for statistical plots are about whether values are as expected, what typical sizes are,
the variation of the values, the distribution of the values and whether there are any exceptional values.
A scatter plot is good for showing actual values and structure of numerical variables but it is not suitable
for large data sets because of the many overlapping dots. The jitter option (which changes horizontal
placement) may help to avoid this.
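As a minimal sketch of the jitter idea, the code below adds a small random horizontal offset to each point so that identical positions no longer overlap. The function name `jitter`, the `width` parameter and the sample values are assumptions made for illustration; they do not come from the lectures.

```python
import random

def jitter(xs, width=0.2, seed=0):
    """Return a copy of xs with a uniform random offset in [-width, width] added to each value."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    return [x + rng.uniform(-width, width) for x in xs]

# Hypothetical data: many observations share the same horizontal position.
xs = [1, 1, 1, 2, 2, 2, 2, 3]
jittered = jitter(xs)

# Each point stays within `width` of its original position,
# but the dots no longer sit exactly on top of each other.
print(jittered)
```

Only the horizontal placement changes, so the vertical values that the plot is meant to show stay exact.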
The choice of plot depends on the data type: bar charts –> categorical data, histograms –> numerical data.
A histogram has the range of data values split in bins (intervals of values). You can choose the number of
bins or the bin size. The histogram will show the number of observations in the dataset for every bin. The
rule of thumb for choosing number of bins is √𝑛 where n is the number of observations. If the bin width is
too small, the histogram will be too wiggly. If it is too large, there are too few details.
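The binning can be sketched as follows, assuming equal-width bins and rounding √n to the nearest whole number of bins. The helper `histogram_counts` and the data are hypothetical and only illustrate the rule of thumb.

```python
import math

def histogram_counts(values, num_bins=None):
    """Count observations per equal-width bin; the default bin count is round(sqrt(n))."""
    n = len(values)
    if num_bins is None:
        num_bins = max(1, round(math.sqrt(n)))  # rule of thumb: sqrt(n) bins
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1.0  # avoid a zero width when all values are equal
    counts = [0] * num_bins
    for v in values:
        # The maximum value is placed in the last bin instead of opening a new one.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts, width

# Hypothetical data set with n = 16 observations, so sqrt(n) = 4 bins.
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 7, 8, 8, 9, 9]
counts, width = histogram_counts(data)
print(counts, width)
```

Passing a larger `num_bins` gives narrower bins and a more wiggly result; a smaller value hides detail.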
In a cumulative histogram, the vertical axis reflects the share (or %) of the observations in a dataset with
values smaller than a value specified on the horizontal axis.
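A small sketch of what the vertical axis of a cumulative histogram represents, taking “smaller than” literally as a strict comparison. The function name and the data are assumptions for illustration.

```python
def cumulative_share(values, threshold):
    """Share of observations with a value smaller than `threshold`."""
    return sum(1 for v in values if v < threshold) / len(values)

data = [2, 4, 4, 5, 7, 9, 10, 12]  # hypothetical observations
for t in (5, 8, 13):
    print(t, cumulative_share(data, t))
```

The share never decreases as the threshold grows and reaches 1 once the threshold exceeds the largest observation.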
Kernel density plots make use of a bandwidth. The idea is that each observation indicates that its value is
possible, and that values nearby could also occur (but are less likely). To build the plot, choose a bandwidth
to be taken around each observation, generate a kernel with the chosen bandwidth for every observation in
the dataset, and sum the kernels: the sum is the kernel density plot. Choosing the bandwidth is important!
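The steps above can be sketched in a few lines. The Gaussian kernel shape, the function names and the data are assumptions; the summary does not say which kernel the course uses.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel shape (an assumed choice)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kernel_density(x, observations, bandwidth):
    """Sum one kernel per observation, scaled so the area under the curve is 1."""
    n = len(observations)
    total = sum(gaussian_kernel((x - obs) / bandwidth) for obs in observations)
    return total / (n * bandwidth)

observations = [1.0, 1.5, 2.0, 6.0]  # hypothetical data
for bandwidth in (0.3, 1.0):
    curve = [round(kernel_density(x, observations, bandwidth), 3) for x in range(0, 8)]
    print(bandwidth, curve)
```

Comparing the two printed curves shows the effect of the bandwidth: the small one gives sharp separate bumps, the large one a smoother curve.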
Summary statistics are numbers to describe level (location statistics: what are “typical” values?) and
spread (scale statistics: how much do values vary?). Typical distribution shapes are as follows:
- unimodal distribution (1 peak) – bimodal distribution (2 peaks, possibly due to 2 different groups)
- symmetric distribution (left = right) – right-skewed (asymmetric) distribution (peak at the left, tail at
the right; if the mean is bigger than the median, the distribution is right-skewed)
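The mean-versus-median check for right skew can be tried on a small example. The data below are made up; the variable names are only illustrative.

```python
import statistics

# Hypothetical right-skewed data: most values are small, with a long tail to the right.
incomes = [20, 22, 23, 25, 26, 28, 30, 35, 60, 120]

mean = statistics.mean(incomes)
median = statistics.median(incomes)
print(mean, median)

# The tail pulls the mean above the median, which points to right skew.
is_right_skewed = mean > median
print(is_right_skewed)
```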
There are the following location statistics:
- mean (average, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$)
- mode (most occurring value)
- median (the middle value, or the average of the two middle values)
- quartiles/percentiles (the 1st quartile is the cut-off point for 25% of the data, the pth percentile is the
cut-off point for p% of the data)
For the pth percentile we first compute its location in a data set of n observations:
$L_p = \frac{p}{100}(n + 1)$. The pth percentile value is then computed by linear interpolation:
let l and h be the observations at the positions $\lfloor L_p \rfloor$ and $\lceil L_p \rceil$ in the ordered data set.
pth percentile value = $l + (L_p - \lfloor L_p \rfloor)(h - l)$.
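The percentile computation can be sketched directly from the location formula and the interpolation rule. The clamping of locations that fall outside 1..n is an assumption added here so that extreme percentiles do not fail; the function name and the data are hypothetical.

```python
import math

def percentile(values, p):
    """p-th percentile via location L_p = p/100 * (n + 1) and linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    location = p / 100 * (n + 1)
    # Assumption: locations outside 1..n are clamped to the smallest/largest observation.
    location = min(max(location, 1), n)
    low_pos, high_pos = math.floor(location), math.ceil(location)
    low, high = ordered[low_pos - 1], ordered[high_pos - 1]  # positions are 1-based
    return low + (location - low_pos) * (high - low)

data = [15, 20, 35, 40, 50, 55, 60, 70]  # hypothetical data set, n = 8
print(percentile(data, 25))  # location 2.25, between 20 and 35
print(percentile(data, 50))  # location 4.5, the median
```

For p = 25 the location is 2.25, so the result lies a quarter of the way from the 2nd observation (20) to the 3rd (35).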
There are the following scale statistics:
- range (max – min)
- interquartile range (IQR) (3rd quartile – 1st quartile)
- sample variance ($\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, where $\bar{x}$ is the mean)
- sample standard deviation ($\sigma = \sqrt{\sigma^2}$)
- median absolute deviation (MAD) (median of the absolute deviations from the median)

Data Analytics (2IAB0) Summary Q3 2020 by Isabel Rutten

The higher these statistics, the more spread/variability in the data. Variance and standard deviation are
sensitive to outliers; IQR and MAD are not.
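The difference in outlier sensitivity can be seen on a small made-up example. This sketch uses Python's `statistics` module; its default quartile method should correspond to the (n + 1) location rule with linear interpolation, but treat that correspondence as an assumption. The helper names and the data are hypothetical.

```python
import statistics

def mad(values):
    """Median absolute deviation: median of |x - median(values)|."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])

def iqr(values):
    """3rd quartile minus 1st quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

def scale_statistics(values):
    return {
        "range": max(values) - min(values),
        "iqr": iqr(values),
        "variance": statistics.variance(values),  # sample variance (divides by n - 1)
        "stdev": statistics.stdev(values),
        "mad": mad(values),
    }

clean = [1, 2, 3, 4, 5, 6, 7]          # hypothetical data
with_outlier = [1, 2, 3, 4, 5, 6, 70]  # same data, largest value replaced by an outlier

print(scale_statistics(clean))
print(scale_statistics(with_outlier))
```

The single outlier inflates the range, variance and standard deviation, while the IQR and MAD are the same for both data sets.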
Standardization / z-score normalization: the z-score transforms data in their original units into the
universal statistical unit of standard deviations from the mean using the following formula:
$z = \frac{x - \bar{x}}{\sigma}$, where $\bar{x}$ is the mean and $\sigma$ the sample standard deviation.
The mean value of the transformed data set is 0 and the sample standard deviation is 1. A negative z-score
means that the value is below the mean, a positive one means above the mean. The observations with a z-score
larger than 2.5 are considered as outliers.
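A minimal sketch of standardization and the 2.5 rule. Applying the threshold to the absolute z-score (so that unusually low values are flagged too) is an assumption; the function name and the data are hypothetical.

```python
import statistics

def z_scores(values):
    """Standardize values: subtract the mean, divide by the sample standard deviation."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [(v - mean) / stdev for v in values]

# Hypothetical data; the value 40 looks exceptional.
data = [10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 12, 40]
zs = z_scores(data)

# Assumption: the 2.5 threshold is applied to the absolute z-score.
outliers = [v for v, z in zip(data, zs) if abs(z) > 2.5]
print(outliers)
```

After the transformation the z-scores have mean 0 and sample standard deviation 1, up to floating-point rounding.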
A Box(-and-Whisker) plot is a convenient way to graphically display summary statistics since it shows the
median, the 1st and 3rd quartile and the minimum and maximum values. It is better than histograms/kernel
density estimators for comparing groups, but those are better for showing the shape of a distribution.
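The five values that such a box plot displays can be computed per group, which is what makes side-by-side comparison easy. The group names and measurements are made up, and the quartiles come from Python's `statistics.quantiles` with its default method, which may differ slightly from other tools.

```python
import statistics

def five_number_summary(values):
    """Minimum, 1st quartile, median, 3rd quartile and maximum: the values a box plot draws."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, median, q3, max(values)

# Hypothetical measurements for two groups that we want to compare.
groups = {
    "A": [1, 2, 3, 4, 5, 6, 7],
    "B": [3, 5, 7, 9, 11, 13, 15],
}
for name, values in groups.items():
    print(name, five_number_summary(values))
```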
QUIZ: The variance of 5 numbers is 10. If each number is divided by 2, then the variance of the new
numbers is 2.5. Dividing each number by 2 also divides every difference from the mean by 2, so every squared
difference is divided by 2^2 = 4 and the variance becomes 10/2^2 = 2.5.
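The quiz answer can be checked numerically. The five numbers below are an assumed example chosen so that their sample variance is exactly 10; the quiz itself does not give the numbers.

```python
import statistics

numbers = [2, 4, 6, 8, 10]         # hypothetical numbers with sample variance 10
halved = [x / 2 for x in numbers]  # each number divided by 2

original_variance = statistics.variance(numbers)
halved_variance = statistics.variance(halved)
print(original_variance, halved_variance)
```

In general, dividing all values by a constant c divides the variance by c^2 and the standard deviation by c.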

Week 2: VIS
We make data visualisations, not infographics (which focus on telling a story creatively rather than on the data).
Visualization has always been important in history, for example when recording a pulse signal or when
trying to find the reason for an epidemic (visualizing the deaths which strangely occur near a pump). Also,
communicating data effectively is of importance, for example for a subway map.
Visualization is the process that transforms (abstract) data into (interactive) graphical representations for
the purpose of exploration, confirmation or communication. Communication is done to inform humans and
shows specific aspects of a larger dataset to allow the reader to better connect the presented information to
their existing knowledge. Exploration is done when questions are not well-defined and shows a large,
complex dataset which is meant for professionals. Confirmation is a combination of those two.
Why do we visualize data?
In the case of high-level actions, we analyze the data:
- visualization for consuming information: - discover – present – enjoy (meant for end-users)
- visualization for producing data: - annotate – record – derive (extends the dataset)
In the case of lower-level actions, we search in the data:
Lookup: search in a dictionary how to spell a certain word
Browse: look for a synonym for a certain word
Locate: try to find your lost keys
Explore: look for unexpected patterns

There are two kinds of targets: - We look at all data and then at the trends (which define the “mainstream”),
the outliers (which stand out from the mainstream) and the features (task-dependent structures of interest).
- We look at attributes and then at one attribute (by analyzing the distribution or the extremes) or many
attributes (by analyzing dependency, correlation or similarity).
Human perception can be influenced; this has been studied in a psychological theory called “Gestalt
theory”. Proximity: objects close to each other are perceived as a group. Similarity: objects that are
similar (color, shape, etc.) are perceived as a group. Continuity: we unconsciously draw a line through
points that are in a graph. So: the position and arrangement of visual elements are the most important
channel for visualizations.
Perception of colors begins with 3 types of specialized retinal cells known as cone cells. The red cone cell
shows black-white, the green one shows green-purple and the blue one shows blue-yellow. Combining
them gives the right color. However, you can be color blind when one of those cone cell types is missing. It is
rare to miss the blue cone cell, so doing visualizations in the colors blue-yellow is safe.
There are several ordering directions: - sequential (XS – S – M – etc.) – diverging (… -10 … 0 … 5 …)
– cyclic (days of the week)
A key attribute (also called an independent attribute) acts as an index that is used to look up value
attributes (also called dependent attributes).
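As a small illustration of key and value attributes, a dictionary can play the role of the table. The movie titles, attribute names and numbers are all made up for this sketch.

```python
# Hypothetical movie table: the title is the key attribute,
# budget and running time are value attributes that depend on it.
movies = {
    "Movie A": {"budget_musd": 50, "running_time_min": 110},
    "Movie B": {"budget_musd": 120, "running_time_min": 145},
}

def lookup(table, key, attribute):
    """Use the key attribute as an index to find one value attribute."""
    return table[key][attribute]

print(lookup(movies, "Movie B", "running_time_min"))
```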
Data visualization makes use of marks (geometric primitives: points, lines, areas, complex shapes) and
channels (appearance of marks: position, color, length, size, shape).

