Summary

Data Analytics (2IAB0) Summary Lectures 2020

Author: IsabelRutten

Course
Data Analytics 2IAB0 (2IAB0)

Institution
Technische Universiteit Eindhoven (TUE)

Data Analytics for Engineers (2IAB0) is a basic course of the Bachelor College at Eindhoven University of Technology, which means that all TU/e Bachelor students must complete it. It is given in the third quartile of the first year. Data Analytics for Engineers provides more in...

Preview: 2 of 14 pages

Uploaded on 28 May 2020
File last updated on 2 April 2021
Number of pages: 14
Written in 2019/2020
Type: Summary

Data Analytics (2IAB0) Summary Lectures
Week 1: EDA
Descriptive data analytics uses collected data (which may later be reused for other purposes) to give
insight into the past. Predictive data analytics looks into the future: it only makes predictions
and gives no indication of what we should do. Prescriptive data analytics consists of data-driven
advice on how to take action to influence or change the future.
Data are raw, unorganized numbers, facts, etc. Information is structured, meaningful and useful numbers
and facts.
There are two types of data:
- categorical: a) dichotomous: two values (yes/no, male/female) b) nominal: no ordering (genre)
c) ordinal: has ordering (ratings, bad – good)
- numerical: a) interval: no fixed “zero point”, only differences have meaning (temperature in °F, ranking)
b) ratio: has a fixed “zero point”, so ratios also make sense (budget, running time)
A reference table stores “all” data in a table so that it can be looked up easily. A demonstration table is a
table to illustrate a point (so present just enough data). In a table, pay attention to the kind of data type,
units of measurement, whether the values make sense when comparing columns or rows and which
column/row has the largest/smallest values.
Asking what to expect is also an important way to spot errors. You can ask two questions: “What are
reasonable values?” (e.g. for a human age) and “Given one value, what could the others be?” (e.g. given a
travel time, what distance is plausible?).
Typical questions for statistical plots are about whether values are as expected, what typical sizes are,
the variation of the values, the distribution of the values and whether there are any exceptional values.
A scatter plot is good for showing actual values and structure of numerical variables but it is not suitable
for large data sets because of the many overlapping dots. The jitter option (which changes horizontal
placement) may help to avoid this.
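The jitter idea can be sketched in a few lines. This is only an illustration, assuming a small uniform random horizontal offset; the function name `jitter` and its `width` and `seed` parameters are hypothetical and not taken from the lectures.

```python
import random

def jitter(xs, width=0.2, seed=0):
    """Return a copy of xs with a uniform offset in [-width/2, width/2] added to each value."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    return [x + rng.uniform(-width / 2, width / 2) for x in xs]

# Many observations share the same x position (e.g. a rating of 1, 2 or 3),
# so their dots would be drawn on top of each other without jitter.
ratings = [1, 1, 1, 2, 2, 3, 3, 3, 3]
jittered = jitter(ratings)
```

Only the horizontal placement is changed, and the offset should stay small compared to the distance between neighbouring x values so that the groups remain recognizable.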
The choice of plot depends on the data type: bar charts –> categorical data, histograms –> numerical data.
A histogram splits the range of data values into bins (intervals of values). You can choose the number of
bins or the bin size. The histogram shows the number of observations in the dataset for every bin. A
rule of thumb for the number of bins is √𝑛, where n is the number of observations. If the bin width is
too small, the histogram will be too wiggly. If it is too large, too little detail remains.
In a cumulative histogram, the vertical axis reflects the share (or %) of the observations in a dataset with
values smaller than a value specified on the horizontal axis.
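A single point on a cumulative histogram can be computed directly from this definition. The sketch below is illustrative only; `cumulative_share` is a hypothetical helper name.

```python
def cumulative_share(values, x):
    """Share of observations that are strictly smaller than x."""
    return sum(v < x for v in values) / len(values)

data = [2, 4, 4, 5, 7, 9, 10, 12]
share_below_5 = cumulative_share(data, 5)    # 3 of 8 observations: 0.375
share_below_13 = cumulative_share(data, 13)  # all observations: 1.0
```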
Kernel density plots make use of a bandwidth. The idea is that each observation indicates that its value is
possible, and that values nearby could also occur (but are less likely). Choose a bandwidth, place a kernel
with that bandwidth around every observation in the dataset, and sum the kernels: the result is the kernel
density plot. Choosing the bandwidth is important: a small bandwidth gives a wiggly curve, a large one
smooths away detail.
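The construction above can be sketched as follows. This is a minimal illustration, assuming a Gaussian kernel (the text does not name a kernel shape); the names `gaussian_kernel` and `kde` are hypothetical. The sum of kernels is divided by n times the bandwidth so that the area under the curve equals 1.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel shape (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(observations, bandwidth):
    """Return a function x -> density estimate built from one kernel per observation."""
    n = len(observations)

    def density(x):
        total = sum(gaussian_kernel((x - xi) / bandwidth) for xi in observations)
        return total / (n * bandwidth)  # normalize so the curve integrates to 1

    return density

obs = [1.0, 1.5, 2.0, 6.0]
narrow = kde(obs, bandwidth=0.3)  # follows the individual observations closely
wide = kde(obs, bandwidth=2.0)    # much smoother, fills the gap between 2 and 6
```

Evaluating both functions on a grid of x values and plotting them shows why the bandwidth matters: `narrow` drops to almost zero around x = 4, while `wide` still assigns clear density there.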
Summary statistics are numbers that describe level (location statistics: what are “typical” values?) and
spread (scale statistics: how much do values vary?). Typical distribution shapes are as follows:
- unimodal distribution (1 peak)
- bimodal distribution (2 peaks, possibly due to 2 different groups)
- symmetric distribution (left = right)
- right-skewed (asymmetric) distribution (peak at the left, tail at the right; a mean that is bigger than
the median indicates a right-skewed distribution)
There are the following location statistics:
- mean (average, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$)
- mode (most frequently occurring value)
- median (the middle value, or the average of the two middle values)
- quartiles/percentiles: the 1st quartile is the cut-off point for 25% of the data and the pth percentile is
the cut-off point for p% of the data. For the pth percentile, first compute its location in an ordered data
set of n observations: $L_p = \frac{p}{100}(n + 1)$. Then compute the percentile value by linear
interpolation: let $l$ and $h$ be the observations at positions $\lfloor L_p \rfloor$ and $\lceil L_p \rceil$
in the ordered data set; then pth percentile value $= l + (L_p - \lfloor L_p \rfloor)(h - l)$.
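The percentile rule above can be sketched directly in code. The function name `percentile` is hypothetical, and clamping locations that fall outside positions 1..n is an assumption, since the text does not say what to do for very small or very large p.

```python
import math

def percentile(values, p):
    """p-th percentile via location L_p = p/100 * (n + 1) and linear interpolation."""
    data = sorted(values)
    n = len(data)
    loc = p / 100 * (n + 1)        # 1-based position in the ordered data set
    loc = min(max(loc, 1), n)      # assumption: clamp positions outside 1..n
    lower_pos = math.floor(loc)
    upper_pos = math.ceil(loc)
    low = data[lower_pos - 1]      # observation l (1-based position to 0-based index)
    high = data[upper_pos - 1]     # observation h
    return low + (loc - lower_pos) * (high - low)

data = [3, 7, 8, 5, 12, 14, 21, 13, 18]  # n = 9, ordered: 3 5 7 8 12 13 14 18 21
q1 = percentile(data, 25)      # L_p = 2.5, halfway between 5 and 7: 6.0
median = percentile(data, 50)  # L_p = 5, the 5th ordered value: 12.0
q3 = percentile(data, 75)      # L_p = 7.5, halfway between 14 and 18: 16.0
```

Python's `statistics.quantiles` with its default `method='exclusive'` uses the same (n + 1) position rule, so it can serve as a cross-check for the quartiles.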
There are the following scale statistics:
- range (max – min)
- interquartile range (IQR) (3rd quartile – 1st quartile)
- sample variance ($\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, where $\bar{x}$ is the mean)
- sample standard deviation ($\sigma = \sqrt{\sigma^2}$)
- median absolute deviation (MAD) (median of the absolute deviations from the median)

Data Analytics (2IAB0) Summary Q3 2020 by Isabel Rutten

The higher these statistics, the more spread/variability there is in the data. Variance and standard
deviation are sensitive to outliers; IQR and MAD are not.
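The difference in outlier sensitivity can be checked on a small made-up data set. The sketch below uses Python's `statistics` module; the helper names `mad` and `spread` are hypothetical, and the quartiles come from `statistics.quantiles`, whose default method uses an (n + 1) position rule with linear interpolation.

```python
import statistics

def mad(values):
    """Median of the absolute deviations from the median."""
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)

def spread(values):
    """Collect the scale statistics listed above in one dictionary."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "variance": statistics.variance(values),  # sample variance, divides by n - 1
        "std": statistics.stdev(values),
        "mad": mad(values),
    }

clean = [3, 5, 7, 8, 12, 13, 14, 18, 21]
with_outlier = clean[:-1] + [210]  # replace the largest value by an extreme one

clean_stats = spread(clean)
outlier_stats = spread(with_outlier)
```

For these two lists the IQR (10) and the MAD (5) are identical, while the range, variance and standard deviation grow sharply once the extreme value is present.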
Standardization / z-score normalization: the z-score transforms data from their original units into the
universal statistical unit of standard deviations from the mean, using the following formula:
$z = \frac{x - \bar{x}}{\sigma}$, where $\bar{x}$ is the mean and $\sigma$ the sample standard deviation.
The mean value of the transformed data set is 0 and the sample standard deviation is 1. A negative z-score
means that the value is below the mean, a positive one that it is above the mean. Observations with a
z-score larger than 2.5 in absolute value are considered outliers.
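A minimal sketch of standardization, assuming the sample standard deviation (dividing by n - 1) is the one intended; the data values and the name `z_scores` are made up for illustration. The outlier check uses the absolute z-score, so values far below the mean would be flagged as well.

```python
import statistics

def z_scores(values):
    """Transform values into standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation
    return [(v - mean) / std for v in values]

data = [48, 50, 51, 49, 52, 50, 47, 53, 50, 49, 51, 90]
z = z_scores(data)

# Flag observations that are more than 2.5 standard deviations away from the mean.
outliers = [v for v, score in zip(data, z) if abs(score) > 2.5]  # [90]
```

After the transformation the list `z` has mean 0 and sample standard deviation 1 (up to floating-point rounding), which is what makes z-scores comparable across variables with different units.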
A Box(-and-Whisker) plot is a convenient way to graphically display summary statistics since it shows the
median, the 1st and 3rd quartile and the minimum and maximum values. It is better suited than histograms or
kernel density plots for comparing groups, but those are better at showing the shape of a distribution.
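QUIZ: The variance of 5 numbers is 10. If each number is divided by 2, then the variance of the new
numbers is 2.5. Reason: dividing every number by 2 also divides the mean, and hence every difference from
the mean, by 2. Each squared difference is then divided by 2^2 = 4, so the variance becomes 10/2^2 = 2.5.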
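The quiz answer can be verified numerically. The five numbers below are a made-up example chosen so that their sample variance is exactly 10; they are not from the lecture.

```python
import statistics

numbers = [2, 4, 6, 8, 10]          # hypothetical data with sample variance 10
halved = [x / 2 for x in numbers]   # every number divided by 2

var_original = statistics.variance(numbers)  # 10
var_halved = statistics.variance(halved)     # 2.5, i.e. 10 / 2**2
```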

Week 2: VIS
We make data visualisations, not infographics (which focus on telling a story creatively rather than on the data).
Visualization has always been important in history, for example when recording a pulse signal or when
trying to find the reason for an epidemic (visualizing the deaths which strangely occur near a pump). Also,
communicating data effectively is of importance, for example for a subway map.
Visualization is the process that transforms (abstract) data into (interactive) graphical representations for
the purpose of exploration, confirmation or communication. Communication is done to inform humans and
shows specific aspects of a larger dataset to allow the reader to better connect the presented information to
their existing knowledge. Exploration is done when questions are not well-defined and shows a large,
complex dataset which is meant for professionals. Confirmation is a combination of those two.
Why do we visualize data?
In the case of high-level actions, we analyze the data:
- visualization for consuming information: discover, present, enjoy (meant for end-users)
- visualization for producing data: annotate, record, derive (extends the dataset)
In the case of lower-level actions, we search in the data:
- Lookup: search in a dictionary how to spell a certain word
- Browse: look for a synonym for a certain word
- Locate: try to find your lost keys
- Explore: look for unexpected patterns

There are two kinds of targets:
- all data: we look at the trends (which define the “mainstream”), the outliers (which stand out from the
mainstream) and the features (task-dependent structures of interest).
- attributes: we look at one attribute (by analyzing the distribution or the extremes) or at many (by
analyzing dependency, correlation or similarity).
Human perception can be influenced; this is studied in a psychological theory called “Gestalt
theory”. Proximity: objects close to each other are perceived as a group. Similarity: objects that are
similar (color, shape, etc.) are perceived as a group. Continuity: we unconsciously draw a line through
points in a graph. So the position and arrangement of visual elements is the most important
channel for visualizations.
Perception of color begins with 3 specialized types of retinal cells known as cone cells. The red cone cell
shows black-white, the green one shows green-purple and the blue one shows blue-yellow. Combining
them gives the right color. Color blindness occurs when one of those cone cell types is missing. It is
rare to miss the blue cone cell, so using blue-yellow color schemes in visualizations is safe.
There are several ordering directions: sequential (XS – S – M – etc.), diverging (… -10 … 0 … 5 …)
and cyclic (days of the week).
A key attribute (also called an independent attribute) acts as an index that is used to look up value
attributes (also called dependent attributes).
Data visualization makes use of marks (geometric primitives: points, lines, areas, complex shapes) and
channels (appearance of marks: position, color, length, size, shape).
