Lectures
EDA (Exploratory Data Analysis)
Data types and representations
What is data?
- Cambridge dictionary: “information, especially facts or numbers, collected to be examined and
considered and used to help decision-making, or information in an electronic form that can be
stored and used by a computer”
- In this course: we will say “data” when referring to raw, unorganized numbers, facts, etc., and
use the word “information” for structured, meaningful and useful numbers and facts.
Numerical data
- Continuous data: data that can attain any value on a given measurement scale
o Interval data: continuous data for which only differences have meaning; there is no fixed “zero
point”, so you cannot make ratios
o Ratio data: continuous data for which ratios make sense; there is a fixed “zero point”
- Discrete data: data that can only attain certain values (e.g. integers)
Categorical data
- Nominal data: two or more outcomes that have no natural order
- Ordinal data: two or more outcomes that have a natural order
More on data
- The difference between continuous and discrete data is that discrete data have “gaps”
- Temperature in degrees Celsius is not ratio data, since 20 °C is not twice as hot as 10 °C (this is
because 0 °C is not a fixed minimum)
- Lengths are ratio data, since 0 m is an absolute minimum and 20 m is twice as long as 10 m.
- Categorical data are sometimes labelled with numbers (e.g., bad = 1, neutral = 2, good = 3). Such
labels do not turn such data into numerical data, since there is no meaning to the numbers.
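The interval versus ratio distinction can be checked with a little arithmetic. The sketch below is an added illustration, not part of the lecture: it converts the two Celsius readings to kelvin (a scale with a true zero point) and compares the ratios. The function name `celsius_to_kelvin` is chosen here for illustration.

```python
def celsius_to_kelvin(temp_c: float) -> float:
    """Convert degrees Celsius to kelvin (absolute zero is 0 K)."""
    return temp_c + 273.15


# Ratio of the raw Celsius readings: suggests "twice as hot", but is not meaningful
celsius_ratio = 20.0 / 10.0

# Ratio on the Kelvin scale, where 0 really is the minimum
kelvin_ratio = celsius_to_kelvin(20.0) / celsius_to_kelvin(10.0)

# Differences are meaningful on both scales (the interval property)
celsius_diff = 20.0 - 10.0
kelvin_diff = celsius_to_kelvin(20.0) - celsius_to_kelvin(10.0)

print(celsius_ratio)           # 2.0
print(round(kelvin_ratio, 3))  # 1.035
```

The difference between the two readings is 10 degrees on either scale (up to floating-point rounding), while the ratio depends entirely on where the zero is placed.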
Tables are good for two reasons
- For reading off values
- To draw attention to actual values
Two kinds of tables
- Reference table: store “all” data in a table so that it can be looked up easily
- Demonstration table: table to illustrate a point (so present just enough data)
Questions you should ask when investigating tables
- What kind of data types and units of measurement?
- Do the values make sense when you compare columns or rows?
- Which column/row has largest values?
- Which column/row has smallest values?
Exploratory Data Analysis
Numerical quantities focus on expected values, graphical summaries (graphs/plots) on unexpected
values (quote by John Tukey).
Tukey promoted the use of graphs to explore data before applying more advanced methods (he also
invented some new types of graphs).
Key features of EDA:
- Getting to know the data before doing further analysis
- Extensively using graphs
- Generating questions
- Detecting errors in data
o Important ways to spot errors:
   - Ask yourself what reasonable values are (for example: lengths and ages of humans)
   - Given one value, what could the others be? (for example: given the time walked, what
     distance can possibly be covered?)
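Both error checks can be automated in a few lines. This is a minimal sketch: the field names, the plausibility limits in `PLAUSIBLE_RANGES` and the example records are all hypothetical and chosen only for illustration.

```python
# Hypothetical plausibility limits, chosen for illustration only
PLAUSIBLE_RANGES = {
    "height_cm": (30.0, 250.0),
    "age_years": (0.0, 120.0),
}


def find_implausible(records, ranges=PLAUSIBLE_RANGES):
    """Return (row index, field, value) for every value outside its plausible range."""
    problems = []
    for i, record in enumerate(records):
        for field, (low, high) in ranges.items():
            value = record.get(field)
            if value is not None and not (low <= value <= high):
                problems.append((i, field, value))
    return problems


def implied_speed_kmh(distance_km, time_h):
    """Speed implied by a walked distance and the time it took."""
    return distance_km / time_h


people = [
    {"height_cm": 172.0, "age_years": 34},
    {"height_cm": 1.72, "age_years": 34},    # probably entered in metres
    {"height_cm": 180.0, "age_years": 340},  # probably a typing error
]
print(find_implausible(people))
# [(1, 'height_cm', 1.72), (2, 'age_years', 340)]

# Cross-check between columns: 30 km walked in 2 hours implies 15 km/h
print(implied_speed_kmh(30.0, 2.0))  # 15.0
```

A flagged value is not automatically wrong; it is a prompt to look at the record more closely.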
Elementary statistical plots
Dot plots / strip plots
- Good for showing actual values and structure of numerical variables
- Not suitable for large datasets
- The jitter option (i.e. slight changes in horizontal placement) may help to avoid overlapping dots
Histogram
- The range of data values is split into bins (intervals of values)
o You can choose the number of bins, or
o Choose the bin width you would like to have
- The histogram shows the number of observations in the data set for every bin (there are
versions that show percentages)
- Histograms are sensitive to bin width
o Bin width too small → too wiggly
o Bin width too large → too few details
- Rule of thumb for choosing a sensible number of bins: about sqrt(n), where n is the number of
observations
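The binning step behind a histogram can be sketched without a plotting library. The function below is an illustrative assumption, not the method used in the lecture: it uses equal-width bins, places the maximum value in the last bin, and defaults to roughly sqrt(n) bins. The data set is made up.

```python
import math


def histogram_counts(values, n_bins=None):
    """Count observations per equal-width bin; defaults to about sqrt(n) bins."""
    n = len(values)
    if n_bins is None:
        n_bins = max(1, round(math.sqrt(n)))
    low, high = min(values), max(values)
    width = (high - low) / n_bins or 1.0  # avoid a zero width if all values are equal
    counts = [0] * n_bins
    for v in values:
        # The maximum value would fall just past the last bin, so clamp it
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    edges = [low + k * width for k in range(n_bins + 1)]
    return edges, counts


data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 7, 7, 8, 9, 9, 10]  # n = 16, so 4 bins
edges, counts = histogram_counts(data)
print(edges)   # [1.0, 3.25, 5.5, 7.75, 10.0]
print(counts)  # [6, 3, 3, 4]
```

Calling `histogram_counts(data, n_bins=2)` or `n_bins=12` on the same data shows how strongly the picture depends on the bin width.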
Cumulative histogram
- A cumulative histogram shows counts or percentages of the current bin together with the
counts or percentages of all bins to the left of that bin.
- Useful to illustrate thresholds
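A cumulative histogram is just a running total of the bin counts. A minimal sketch, with hypothetical bin counts:

```python
from itertools import accumulate

bin_counts = [6, 3, 3, 4]  # hypothetical counts per bin, ordered left to right

# Each entry is the count of the current bin plus all bins to its left
cumulative_counts = list(accumulate(bin_counts))
total = cumulative_counts[-1]
cumulative_percentages = [100 * c / total for c in cumulative_counts]

print(cumulative_counts)       # [6, 9, 12, 16]
print(cumulative_percentages)  # [37.5, 56.25, 75.0, 100.0]
```

Read as a threshold statement: 75% of the observations fall in the first three bins.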
Bar charts versus histograms
Bar charts are for categorical data, histograms for numerical data.
Scatter plot
- Scatter plots allow you to investigate relations between two numerical variables
Location summary statistics
Going beyond plots
- Plots help us to explore and give clues
- Numerical summaries like averages help us to document essential features of data sets
- One should use both plots and numerical summaries. They complement each other
- Numerical summaries are often called statistics (note the double meaning of the word: both a
scientific field and computed numbers)
Summary statistics
There are different types of summary statistics
- Level: location summary statistics → what are “typical” values?
- Spread: scale summary statistics → how much do values vary?
- Relation: association summary statistics → how do values of different quantities vary
simultaneously?
Location summary statistics (level)
- Mean (average)
- Median:
o Odd # of observations: middle value when ordered from small to large
o Even # of observations: average of the two middle values when ordered from small to large
- Mode: most frequently occurring value, may be non-unique
The mean is sensitive to outliers; the median is not. The mean can be misleading or difficult to
interpret for non-symmetric distributions.
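The effect of a single outlier on the mean and the median is easy to see with Python's standard `statistics` module. The numbers are made up for illustration.

```python
import statistics

values = [2, 3, 3, 4, 5]
with_outlier = values + [100]  # one extreme observation added

# Mean shifts strongly, median barely moves
print(statistics.mean(values), statistics.median(values))              # 3.4 3
print(statistics.mean(with_outlier), statistics.median(with_outlier))  # 19.5 3.5

# The mode may be non-unique: multimode returns every most frequent value
print(statistics.multimode([1, 1, 2, 2, 3]))  # [1, 2]
```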
Quartiles
Re-order the data from small to large:
- 1st quartile = cut-off point for 25% of the data
- 2nd quartile = cut-off point for 50% of the data (=median)
- 3rd quartile = cut-off point for 75% of the data
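Quartiles can be computed with `statistics.quantiles`. Several conventions exist for the cut-off points, and different software may give slightly different answers. The sketch below assumes the `"inclusive"` method, which treats the smallest and largest observations as the 0% and 100% points; this choice is an assumption, not something the lecture prescribes.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 asks for the three cut-off points that split the data into four parts
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")

print(q1, q2, q3)               # 3.0 5.0 7.0
print(statistics.median(data))  # 5, the same value as the 2nd quartile
```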
Location statistics: percentiles
- Pth percentile – a cut-off point for P% of data
- We define the 0th percentile to be the minimal element of the dataset and the 100th percentile to
be the maximal element of it.
- For a dataset with n observations, the 2nd smallest observation will be at the $\frac{100}{n-1}$-th
percentile
Computing percentiles
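One way to compute percentiles that agrees with the convention above (minimum = 0th percentile, maximum = 100th, 2nd smallest of n observations at 100/(n − 1)) is linear interpolation between the ordered observations. This is a sketch under that assumption; the lecture may use a different rule, and the helper name `percentile` and the data are chosen here for illustration.

```python
def percentile(values, p):
    """P-th percentile by linear interpolation between ordered observations.

    Assumed convention: the k-th smallest value (k = 1..n) sits at
    the 100 * (k - 1) / (n - 1)-th percentile.
    """
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    if n == 1:
        return float(ordered[0])
    position = p / 100 * (n - 1)   # fractional index into the ordered data
    lower = int(position)          # observation just below the position
    upper = min(lower + 1, n - 1)  # observation just above, clamped at the end
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


data = [15, 20, 35, 40, 50]  # n = 5, so the 2nd smallest is at 100/4 = 25
print(percentile(data, 0))    # 15.0
print(percentile(data, 25))   # 20.0
print(percentile(data, 50))   # 35.0
print(percentile(data, 100))  # 50.0
```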
Scale and Association Statistics
Scale statistics
Range: max – min
Interquartile range (IQR): 3rd quartile – 1st quartile
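Sample variance: $s^2 = \frac{\sum_{x \in \text{dataset}} (x - \bar{x})^2}{N - 1}$, where $\bar{x}$ is the sample mean and $N$ the number of observations
Sample standard deviation: $s = \sqrt{\frac{\sum_{x \in \text{dataset}} (x - \bar{x})^2}{N - 1}}$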
Median absolute deviation (MAD): median of the absolute deviations from the median
The higher these statistics, the more spread/variability in the data
Remarks about scale summary statistics
- The standard deviation has the same (physical) unit as the data
- The variance is more convenient mathematically
- The range, variance and standard deviation are sensitive to outliers, IQR and MAD are not
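The sensitivity claims can be checked numerically. The sketch below computes each scale statistic for a made-up data set with and without one outlier. Assumptions made here: quartiles use the `"inclusive"` method of `statistics.quantiles`, and MAD is taken as the median of the absolute deviations from the median. The helper name `scale_summary` is hypothetical.

```python
import statistics


def scale_summary(values):
    """Range, IQR, sample variance, sample standard deviation and MAD."""
    ordered = sorted(values)
    q1, _, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    med = statistics.median(ordered)
    mad = statistics.median(abs(v - med) for v in ordered)
    return {
        "range": ordered[-1] - ordered[0],
        "iqr": q3 - q1,
        "variance": statistics.variance(ordered),  # divides by N - 1
        "std": statistics.stdev(ordered),          # square root of the variance
        "mad": mad,
    }


clean = scale_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
with_outlier = scale_summary([1, 2, 3, 4, 5, 6, 7, 8, 90])  # 9 replaced by 90

print(clean["range"], clean["iqr"], clean["mad"])                       # 8 4.0 2
print(with_outlier["range"], with_outlier["iqr"], with_outlier["mad"])  # 89 4.0 2
print(clean["variance"])                                                # 7.5
print(with_outlier["variance"] > clean["variance"])                     # True
```

The range and the variance change drastically with the single outlier, while the IQR and MAD stay the same.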