Summary 2IAB0 + questions and correct answers

Full summary + all questions including the correct answers

Preview: 4 of 51 pages

  • 2 November 2021
  • 51 pages
  • 2020/2021
  • Summary
isabelveldhuizen
Lectures
EDA (Exploratory Data Analysis)
Data types and representations
What is data?
- Cambridge dictionary: “information, especially facts or numbers, collected to be examined and
considered and used to help decision-making, or information in an electronic form that can be
stored and used by a computer”
- In this course, we say “data” when referring to raw, unorganized numbers, facts, etc., and
use the word “information” for structured, meaningful and useful numbers and facts.

Numerical data
- Continuous data: data that can attain any value on a given measurement scale
o Interval data: continuous data for which only differences have meaning; there is no fixed “zero
point”, so ratios are not meaningful
o Ratio data: continuous data for which ratios make sense; there is a fixed “zero point”
- Discrete data: data that can only attain certain values (e.g. integers)
Categorical data
- Nominal data: two or more outcomes that have no natural order
- Ordinal data: two or more outcomes that have a natural order

More on data
- The difference between continuous and discrete data is that discrete data have “gaps”
- Temperature in degrees Celsius is not ratio data, since 20 °C is not twice as hot as 10 °C (this is
because 0 °C is not an absolute zero point)
- Lengths are ratio data, since 0 m is an absolute minimum and 20 m is twice as long as 10 m.
- Categorical data are sometimes labelled with numbers (e.g., bad = 1, neutral = 2, good = 3). Such
labels do not turn the data into numerical data, since arithmetic on the labels has no meaning.
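
The Celsius remark can be checked with a few lines of arithmetic. The sketch below is only an illustration: the helper name `celsius_to_kelvin` is made up here, and kelvin is used as an example of a temperature scale that does have an absolute zero point.

```python
# Illustrative sketch: ratios are meaningful on a ratio scale (kelvin, metres)
# but not on an interval scale (degrees Celsius).

def celsius_to_kelvin(temp_c: float) -> float:
    """Convert degrees Celsius to kelvin (0 K is an absolute zero point)."""
    return temp_c + 273.15

# Dividing the Celsius numbers gives 2.0, but that number has no physical meaning.
naive_ratio = 20.0 / 10.0

# On the kelvin scale the same two temperatures differ by only a few percent.
kelvin_ratio = celsius_to_kelvin(20.0) / celsius_to_kelvin(10.0)

# Lengths are ratio data: 20 m really is twice as long as 10 m.
length_ratio = 20.0 / 10.0

print(f"naive Celsius ratio: {naive_ratio:.3f}")
print(f"ratio in kelvin:     {kelvin_ratio:.3f}")
print(f"ratio of lengths:    {length_ratio:.3f}")
```

The kelvin ratio comes out close to 1, which is why "twice as hot" is not a valid reading of 20 °C versus 10 °C.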


Tables are good for two reasons
- For reading off values
- To draw attention to actual values

Two kinds of tables
- Reference table: store “all” data in a table so that it can be looked up easily
- Demonstration table: table to illustrate a point (so present just enough data)

Questions you should ask when investigating tables
- What kind of data types and units of measurement?
- Do the values make sense when you compare columns or rows?
- Which column/row has largest values?
- Which column/row has smallest values?




Exploratory Data Analysis
Numerical quantities focus on expected values, graphical summaries (graphs/plots) on unexpected
values (quote by John Tukey).
Tukey promoted the use of graphs to explore data before applying more advanced methods (he also
invented some new types of graphs).

Key features of EDA:
- Getting to know the data before doing further analysis
- Extensively using graphs
- Generating questions
- Detecting errors in data
o Important ways to spot errors:
▪ Ask yourself what reasonable values are (for example: lengths and ages of
humans)
▪ Given one value, what could the others be? (for example: given the time
walked, what distance can possibly be covered?)

Elementary statistical plots
Dot plots / strip plots
- Good for showing actual values and structure of numerical variables
- Not suitable for large datasets
- The jitter option (i.e. slight changes in horizontal placement) may help to avoid overlapping dots
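
As a rough sketch of what jitter does: each dot keeps its actual value, and only its horizontal position gets a small random offset. The function name `jitter_positions` and the default width of 0.2 are assumptions made for this example; plotting libraries offer their own jitter options.

```python
import random

def jitter_positions(values, width=0.2, seed=0):
    """Pair each value with a small random horizontal offset in [-width, width]."""
    rng = random.Random(seed)  # fixed seed so the layout is reproducible
    return [(rng.uniform(-width, width), value) for value in values]

# Made-up measurements with repeated values that would overlap without jitter.
measurements = [5, 5, 5, 6, 6, 7]
points = jitter_positions(measurements)

for x_offset, value in points:
    print(f"x = {x_offset:+.3f}, y = {value}")
```

The vertical coordinate is left untouched, so the plot still shows the actual data values.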

Histogram
- The range of data values is split in bins (intervals of values)
o You can choose the number of bins, or
o Choose the bin width you would like to have
- The histogram shows the number of observations in the data set for every bin (there are
versions that show percentages)
- Histograms are sensitive to bin width
o Bin width too small → too wiggly
o Bin width too large → too few details
- Rule of thumb: a sensible number of bins is about sqrt(n), where n is the number of observations
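
The binning described above can be sketched in plain Python. This is a minimal illustration assuming equal-width bins and the sqrt(n) rule of thumb; the function name `histogram_counts` and the sample data are made up.

```python
import math

def histogram_counts(values, num_bins=None):
    """Count observations per equal-width bin; defaults to about sqrt(n) bins."""
    n = len(values)
    if num_bins is None:
        num_bins = max(1, round(math.sqrt(n)))
    low, high = min(values), max(values)
    width = (high - low) / num_bins or 1.0  # avoid zero width if all values are equal
    counts = [0] * num_bins
    for value in values:
        # The maximum value goes into the last bin instead of opening a new one.
        index = min(int((value - low) / width), num_bins - 1)
        counts[index] += 1
    return low, width, counts

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 7, 7, 8, 9, 10, 12]  # made-up sample, n = 16
low, width, counts = histogram_counts(data)

# Print a text histogram: one row per bin, one '#' per observation.
for i, count in enumerate(counts):
    left = low + i * width
    print(f"[{left:5.2f}, {left + width:5.2f}) {'#' * count}")
```

Calling `histogram_counts(data, num_bins=16)` or `num_bins=2` on the same data shows the "too wiggly" and "too few details" effects.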

Cumulative histogram
- A cumulative histogram shows counts or percentages of the current bin together with the
counts or percentages of all bins to the left of that bin.
- Useful to illustrate thresholds
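
A cumulative histogram is a running total over the bins. The short sketch below uses made-up bin counts and `itertools.accumulate` from the standard library.

```python
from itertools import accumulate

bin_counts = [6, 4, 4, 2]  # made-up counts per bin, ordered from left to right

# Each entry is the count of the current bin plus all bins to its left.
cumulative_counts = list(accumulate(bin_counts))

# The same information expressed as percentages of all observations.
total = cumulative_counts[-1]
cumulative_percentages = [100 * count / total for count in cumulative_counts]

print(cumulative_counts)
print(cumulative_percentages)
```

Reading off a threshold then amounts to looking up one entry: for instance, `cumulative_percentages[1]` is the percentage of observations that fall in the first two bins.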

Bar charts versus histograms
Bar charts are for categorical data, histograms for numerical data.

Scatter plot
- Scatter plots make it possible to investigate relations between two numerical variables




Location summary statistics
Going beyond plots
- Plots help us to explore and give clues
- Numerical summaries like averages help us to document essential features of data sets
- One should use both plots and numerical summaries. They complement each other
- Numerical summaries are often called statistics (note the double meaning of the word: both a
scientific field and computed numbers)

Summary statistics
There are different types of summary statistics
- Level: location summary statistics → what are “typical” values?
- Spread: scale summary statistics → how much do values vary?
- Relation: association summary statistics → how do values of different quantities vary
simultaneously?

Location summary statistics (level)
- Mean (average)
- Median:
o Odd # of observations: middle value when ordered from small to large
o Even # of observations: average of the two middle values when ordered from small to large
- Mode: most frequently occurring value, may be non-unique

The mean is sensitive to outliers; the median is not. The mean can be misleading or difficult to interpret
for non-symmetric distributions.
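
The effect of an outlier on the mean and the median can be seen with Python's standard `statistics` module. The numbers below are made up for illustration.

```python
import statistics

values = [30, 32, 32, 35, 38]      # made-up observations
with_outlier = values + [400]      # the same data plus one extreme value

mean_before = statistics.mean(values)
median_before = statistics.median(values)      # odd n: the middle value
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)  # even n: average of the two middle values

print(f"mean:   {mean_before} -> {mean_after}")
print(f"median: {median_before} -> {median_after}")

# The mode may be non-unique; multimode returns every most frequent value.
modes = statistics.multimode([1, 1, 2, 2, 3])
print(f"modes:  {modes}")
```

One extreme value almost triples the mean, while the median barely moves.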

Quartiles
Re-order the data from small to large:
- 1st quartile = cut-off point for 25% of the data
- 2nd quartile = cut-off point for 50% of the data (=median)
- 3rd quartile = cut-off point for 75% of the data
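
Quartiles can be computed with `statistics.quantiles` from the standard library. Several conventions for quartiles exist, so other tools may give slightly different numbers; the sketch below assumes `method="inclusive"`, which treats the minimum and maximum as the 0th and 100th percentile. The data are made up.

```python
import statistics

data = [12, 4, 9, 2, 7, 4, 10, 5, 8]  # made-up observations, n = 9

# quantiles() sorts the data itself; n=4 asks for the three quartile cut-off points.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

print(f"1st quartile: {q1}")
print(f"2nd quartile: {q2} (median: {statistics.median(data)})")
print(f"3rd quartile: {q3}")
```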

Location statistics: percentiles
- Pth percentile: a cut-off point for P% of the data
- We define the 0th percentile to be the minimal element of the dataset and the 100th percentile to
be the maximal element of it.
- For a dataset with n observations, the 2nd smallest observation will be at the $\frac{100}{n-1}$-th percentile




Computing percentiles
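
One way to compute percentiles that agrees with the definitions above (0th percentile = minimum, 100th percentile = maximum, 2nd smallest observation at the 100/(n−1)-th percentile) is to place the ordered observations at equally spaced percentiles and interpolate linearly in between. The sketch below assumes that interpolation rule; it is not necessarily the exact procedure used in the course, and the function name `percentile` is chosen here for illustration.

```python
def percentile(values, p):
    """P-th percentile using linear interpolation between ordered observations."""
    if not values:
        raise ValueError("percentile() needs at least one observation")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    if n == 1:
        return float(ordered[0])
    # The k-th smallest observation (k = 1..n) sits at percentile 100*(k-1)/(n-1),
    # so percentile p corresponds to the 0-based fractional position below.
    position = p / 100 * (n - 1)
    lower = int(position)
    fraction = position - lower
    if lower + 1 >= n:  # p == 100: return the maximum
        return float(ordered[-1])
    return ordered[lower] + fraction * (ordered[lower + 1] - ordered[lower])

data = [15, 20, 35, 40, 50]  # made-up observations, n = 5

for p in (0, 25, 40, 50, 100):
    print(f"{p:3d}th percentile: {percentile(data, p)}")
```

With n = 5, the 2nd smallest observation (20) is at the 100/(5−1) = 25th percentile, and the 40th percentile falls between 20 and 35.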




Scale and Association Statistics
Scale statistics
Range: max – min
Interquartile range (IQR): 3rd quartile – 1st quartile
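Sample variance: $s^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2$
Sample standard deviation: $s = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2}$
Here $x_1, \dots, x_N$ are the observations in the dataset and $\bar{x}$ is the sample mean.
Median absolute deviation (MAD): median of the absolute deviations from the median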




The higher these statistics, the more spread/variability in the data

Remarks about scale summary statistics
- The standard deviation has the same (physical) unit as the data
- The variance is more convenient mathematically
- The range, variance and standard deviation are sensitive to outliers, IQR and MAD are not
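
The scale statistics and their sensitivity to outliers can be compared on a small example. This is a sketch with made-up data; the helper names `value_range`, `iqr` and `mad` are chosen here, the IQR assumes the `"inclusive"` quartile convention of `statistics.quantiles`, and MAD is taken as the median of the absolute deviations from the median.

```python
import statistics

def value_range(values):
    """Range: max - min."""
    return max(values) - min(values)

def iqr(values):
    """Interquartile range: 3rd quartile - 1st quartile (inclusive convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

def mad(values):
    """Median of the absolute deviations from the median."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])

clean = [2, 4, 4, 5, 7, 8, 9, 10, 12]    # made-up observations
dirty = [2, 4, 4, 5, 7, 8, 9, 10, 120]   # same data with one outlier

for name, data in (("clean", clean), ("dirty", dirty)):
    print(
        f"{name}: range={value_range(data)}, IQR={iqr(data)}, MAD={mad(data)}, "
        f"variance={statistics.variance(data):.1f}, stdev={statistics.stdev(data):.1f}"
    )
```

`statistics.variance` and `statistics.stdev` use the N − 1 denominator, matching the sample formulas above. In this example the range, variance and standard deviation grow strongly when the outlier is added, while the IQR and MAD stay the same.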
