Complete summary theory data analytics

Summary of the theory from all the lectures for data analytics for engineers.

  • 3 April 2022
  • 22 pages
  • 2021/2022
  • Summary
jbtue
2IAB0 - Data Analytics for Engineers
Week 1: EDA
EDA = Exploratory Data Analysis

Data types:
1. Categorical data - data that has no intrinsic numerical value
• Nominal: two or more outcomes that have no natural order
• Ordinal: two or more outcomes that have a natural order

2. Numerical data - data that has an intrinsic numerical value
• Continuous data: data that can attain any value on a given measurement scale
- Interval data: equal intervals represent equal differences
- Ratio data: both differences and ratios make sense; it has a fixed ‘zero point’
• Discrete data: data that can only attain certain values


Tables:
Reference table: store ‘all’ data in a table so that it can be looked up easily
Demonstration table: table to illustrate a point (so present just enough data, or a specific
summary)


Plots:
• Dot plots
- Good for showing actual values and structure of numerical values
- Not suitable for large data sets
- Jitter option may help to avoid overlapping dots
• Bar chart
- For comparing some numerical characteristics of groups defined by categories of
categorical data
- Levels of categorical variable are on the x-axis, numerical values on the y-axis
• Histogram
- Convenient for large data sets (unlike dot plots)
- Range of data is split in bins (= intervals of values)
- Histogram shows the number of observations in the data set for every bin
- Rule of thumb for choosing a sensible number of bins: ≈ $\sqrt{n}$, where n is the number of
data points
• Cumulative histogram
- Shows counts or percentages of the current bin together with the counts or percentages
of all bins to the left of that bin
• Scatter plot
- Allows investigating relations between two numerical variables
! Bar charts are for categorical data, histograms for numerical data
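
The histogram and cumulative histogram described above can be sketched in a few lines. This is only an illustration: it assumes the rule of thumb is read as roughly the square root of n, uses equal-width bins, and the function names and sample data are made up for the example.

```python
import math

def histogram_counts(data, num_bins=None):
    """Count observations per equal-width bin (default: about sqrt(n) bins)."""
    n = len(data)
    if num_bins is None:
        # Assumed rule of thumb: number of bins is roughly sqrt(n).
        num_bins = max(1, round(math.sqrt(n)))
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * num_bins
    for x in data:
        # The maximum value is placed in the last bin rather than a new one.
        idx = min(int((x - lo) / width), num_bins - 1)
        counts[idx] += 1
    return counts

def cumulative(counts):
    """Running total: each bin plus all bins to its left."""
    total, out = 0, []
    for c in counts:
        total += c
        out.append(total)
    return out

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9, 10, 12, 15, 15, 16, 20]
counts = histogram_counts(data)   # 16 points -> 4 bins
cum_counts = cumulative(counts)   # last entry equals len(data)
```

The last entry of the cumulative counts always equals the number of observations, which is a quick sanity check.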


Types of summary statistics:
• Level: location summary statistics
• Spread: scale summary statistics
• Relation: association summary statistics




Location summary statistics:
1. Mean (average): $\frac{1}{n} \sum_{i=1}^{n} x_i$
2. Median:
- Odd number of observations: middle value when ordered from small to large
- Even number of observations: average of two middle values when ordered from small to
large
3. Mode: most frequently occurring value, may be non-unique

! Mean is sensitive to ‘outliers’ => mean can be misleading / difficult to interpret for
non-symmetric data sets
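
A small sketch of the three location statistics, using Python's standard `statistics` module. The numbers are invented for the example; it shows how a single extreme value moves the mean much more than the median.

```python
import statistics

values = [2, 3, 3, 4, 5, 6, 7]      # made-up sample
with_outlier = values + [100]       # same sample plus one extreme value

mean_before = statistics.mean(values)            # 30 / 7, about 4.29
median_before = statistics.median(values)        # odd n: middle value
modes = statistics.multimode(values)             # list, since the mode may be non-unique

mean_after = statistics.mean(with_outlier)       # 130 / 8 = 16.25
median_after = statistics.median(with_outlier)   # even n: average of the two middle values
```

Here the mean jumps from about 4.29 to 16.25, while the median only moves from 4 to 4.5.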


Quartiles:
- 1st quartile = cut-off point for 25% of the data
- 2nd quartile = cut-off point for 50% of the data (= median)
- 3rd quartile = cut-off point for 75% of the data

Percentiles:
- Pth percentile - a cut-off point for P% of the data
- We define the 0th percentile to be the smallest element of the dataset and the 100th percentile
to be the largest element of it
- For a dataset with n observations, the 2nd smallest observation will be at the 100/(n − 1)th
percentile
- For percentile P we compute its location in a data set of n
observations: Lp = 1 + (P/100)*(n-1)
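- Computing Pth percentile value by linear interpolation: with k the integer part of Lp and
d = Lp − k, the value is $x_{(k)} + d \cdot (x_{(k+1)} - x_{(k)})$, where $x_{(k)}$ is the kth smallest observation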
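
The location formula and the linear interpolation step can be sketched as follows. The function name and the sample data are illustrative only; the logic follows Lp = 1 + (P/100)*(n-1) with 1-based positions in the sorted data.

```python
import math

def percentile(data, p):
    """P-th percentile via location Lp = 1 + (P/100)*(n-1) and linear interpolation."""
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    xs = sorted(data)
    n = len(xs)
    loc = 1 + (p / 100) * (n - 1)   # 1-based location in the sorted data
    k = math.floor(loc)             # integer part of the location
    d = loc - k                     # fractional part of the location
    if k >= n:                      # P = 100: the largest observation
        return xs[-1]
    # Interpolate between the k-th and (k+1)-th smallest observations.
    return xs[k - 1] + d * (xs[k] - xs[k - 1])

data = [15, 20, 35, 40, 50]
p0 = percentile(data, 0)      # smallest element
p25 = percentile(data, 25)    # 100/(n-1) = 25: the 2nd smallest element
p40 = percentile(data, 40)    # location 2.6: between 20 and 35
p100 = percentile(data, 100)  # largest element
```

With n = 5 the 2nd smallest observation sits at the 100/(5 − 1) = 25th percentile, matching the statement above.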




Scale statistics:
• Range = max - min
• Interquartile range (IQR) = 3rd quartile - 1st quartile
• Sample variance = $S^2$ or $\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n-1}$
• Sample standard deviation = $S$ or $\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n-1}}$
• Median absolute deviation (MAD): median of the absolute deviation from the median

The higher these statistics, the more the spread/variability in the data.
! The range, variance and standard deviation are sensitive to ‘outliers’, IQR and MAD are not.
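
A minimal sketch of the scale statistics with the standard `statistics` module. The data are invented, `mad` is a helper written for this example, and the quartiles use the "inclusive" method, which is assumed here because it matches the location formula Lp = 1 + (P/100)*(n-1).

```python
import statistics

def mad(data):
    """Median absolute deviation: median of |x - median(data)|."""
    med = statistics.median(data)
    return statistics.median([abs(x - med) for x in data])

def iqr(data):
    """Interquartile range using inclusive quartiles (assumed convention)."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return q3 - q1

data = [1, 3, 4, 5, 6, 7, 9]
value_range = max(data) - min(data)   # 9 - 1 = 8
variance = statistics.variance(data)  # sample variance, divides by n - 1
std = statistics.stdev(data)          # square root of the sample variance

# Replace the largest value by an extreme one to see which statistics react.
extreme = [1, 3, 4, 5, 6, 7, 90]
range_extreme = max(extreme) - min(extreme)
```

For this sample the IQR and MAD are identical for `data` and `extreme`, while the range (and with it the variance and standard deviation) grows sharply.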

Standardization:
The z-score transforms data from their original units into a universal statistical unit: the number
of standard deviations from the mean. The z-scores of a data set have mean 0 and standard
deviation 1.

$z_i = \frac{x_i - \mu}{\sigma}$, where μ is the mean and σ the standard deviation of the data set




Negative z-score: value is below mean
Positive z-score: value is above mean

Rule of thumb: observations with an absolute z-score larger than 2.5 are considered to be ‘outliers’.
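
Standardization and the outlier rule of thumb can be sketched as below. The data are made up, the sample standard deviation (dividing by n − 1) is used, and the threshold is applied to the absolute z-score, which is an assumption so that unusually low values are flagged as well.

```python
import statistics

def z_scores(data):
    """Return (x - mean) / standard deviation for every observation."""
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)  # sample standard deviation (n - 1)
    return [(x - mu) / sigma for x in data]

data = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10, 11, 9, 30]
zs = z_scores(data)

# Assumed reading of the rule of thumb: |z| > 2.5 marks an outlier.
outliers = [x for x, z in zip(data, zs) if abs(z) > 2.5]
```

By construction the z-scores have mean 0 and standard deviation 1 (up to rounding), whatever the original units were.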


Association statistics:
Association statistics try to capture in a number how strong the relation between two quantities is.
The sign of an association statistic indicates whether it is:
- A positive association
- A negative association
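
As an illustration, one widely used association statistic is Pearson's correlation coefficient. It is not named in this part of the notes, so treating it as the example here is an assumption; the function name and data are made up.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]
r_up = pearson_r(xs, [2, 4, 6, 8, 10])    # y rises with x: positive sign
r_down = pearson_r(xs, [10, 8, 6, 4, 2])  # y falls as x rises: negative sign
r_noisy = pearson_r(xs, [2, 1, 4, 3, 5])  # positive, but weaker than r_up
```

The sign gives the direction of the association, and the magnitude (between 0 and 1 for this statistic) indicates its strength.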

Box and whisker plot:
• Median
• 1st and 3rd quartiles
• Min and max values
• Endpoints of whiskers show minimum/maximum if within 1.5 IQR from the nearest 1st/3rd
quartile
• Points further away than 1.5 IQR from nearest quartile are outliers
• Yield a quick indication of symmetry
• Indicate whether there are outliers
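
The numbers behind a box and whisker plot can be computed as sketched below. Two assumptions are made: quartiles use the "inclusive" method of `statistics.quantiles`, and when the minimum or maximum lies beyond 1.5 IQR from the nearest quartile, the whisker ends at the most extreme observation still within that distance (a common convention). The function name and data are illustrative.

```python
import statistics

def box_plot_summary(data):
    """Quartiles, whisker endpoints and outliers for a box and whisker plot."""
    xs = sorted(data)
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [x for x in xs if low_fence <= x <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],    # minimum, if it lies within the fence
        "whisker_high": inside[-1],  # maximum, if it lies within the fence
        "outliers": [x for x in xs if x < low_fence or x > high_fence],
    }

summary = box_plot_summary([1, 3, 4, 5, 6, 7, 30])
```

In this sample the value 30 lies more than 1.5 IQR above the 3rd quartile, so it is reported as an outlier and the upper whisker stops at 7.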


Kernel density plots (improved histograms):
• Choose a bandwidth to be taken around each data point
• Generate a kernel with the chosen bandwidth for every data point
• Count the data points weighted by the kernel
• There is no direct interpretation of the scale of the y-axis!
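
The steps above can be sketched with a Gaussian kernel. The choice of kernel is an assumption (the notes do not name one), and the function name, bandwidth and data are made up for the example.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function estimating the density at x with Gaussian kernels."""
    n = len(data)
    norm = n * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # One kernel per data point: points close to x contribute the most.
        total = sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)
        return total / norm

    return density

density = gaussian_kde([1.0, 1.5, 2.0, 6.0], bandwidth=0.5)

near_cluster = density(1.5)  # three data points nearby: high density
in_gap = density(4.0)        # no data points nearby: low density

# Rough numerical check that the area under the curve is about 1.
step = 0.01
area = sum(density(-5 + i * step) for i in range(1701)) * step
```

The y-values are scaled so that the total area under the curve is 1, which is why individual heights have no direct interpretation as counts. A larger bandwidth gives a smoother curve, a smaller one shows more local detail.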


Violin plot:
• Combination of box-and-whisker plot and kernel density plot:
• Global shape of box-and-whisker plot
• Local details of kernel density plot


Typical distribution shapes:
- Unimodal distribution: 1 peak
- Bimodal distribution: 2 peaks
- Symmetric distribution
- Right-skewed distribution: long tail on the right, asymmetry may indicate ‘extreme values’



