Life763
Data Visualisation
What is it?
- Taking information and placing it in a visual context
- Make it easier to understand big and small data.
- Make it easier to detect patterns, look for trends.
- Place meaning into complicated datasets, so their message is clear and concise
Importance
- Live in an inherently visual world
- Live in the age of big data
- Citizen science on the increase
- Practical, real-world applications
Tufte’s 6 Principles of Graphical Integrity
1. Representation of numbers should match the true proportions
a. The “lie factor” is a value used to describe the relation between the size of effect
shown in a graphic and the size of effect shown in the data
b. Lie factor = size of effect shown in graphic / size of effect in data
c. Normally aim for a lie factor between 0.95 and 1.05
2. Labelling should be clear and detailed.
3. Design should not vary for some ulterior motive, show only data variation.
4. To represent money, well known units are best
a. Dollars or Pounds
5. The number of dimensions represented should be the same as the number in the data
a. E.g. do not use area (two dimensions) to show one-dimensional data
6. Representations should not imply unintended context
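The lie factor from principle 1 can be worked through numerically. The sketch below is a minimal illustration, assuming each "size of effect" is measured as a relative change, (end - start) / start; the function name and the bar heights are made up for the example.

```python
def lie_factor(graphic_start, graphic_end, data_start, data_end):
    """Lie factor = size of effect shown in graphic / size of effect in data.

    Assumption: each effect is measured as a relative change,
    (end - start) / start.
    """
    effect_in_graphic = (graphic_end - graphic_start) / graphic_start
    effect_in_data = (data_end - data_start) / data_start
    return effect_in_graphic / effect_in_data


# Hypothetical example: the data rise from 20 to 30 (a 50% increase),
# but the bars are drawn 2.0 cm and 5.0 cm tall (a 150% increase).
factor = lie_factor(graphic_start=2.0, graphic_end=5.0, data_start=20, data_end=30)
within_range = 0.95 <= factor <= 1.05

print(f"lie factor = {factor:.2f}, acceptable = {within_range}")
```

Here the graphic exaggerates the effect threefold, so it falls well outside the 0.95 to 1.05 range.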
Introduction to different types of plots
1. Tabular Data
Advantages:
a. Representation of numerical precision
b. Understandable multivariate visualization: each column is a different dimension.
c. Representation of heterogeneous data
d. Compactness for small number of points
2. Line Charts
a. Show data points, not just fits
b. Line segments show connections, so do not use in categorical data
c. Connecting points by lines is often chart junk. A trend line or line of fit is better
3. Scatterplots
a. Show the values of each point and are a great way to represent 2D data
b. Higher dimensional datasets are often best projected to 2D
c. Avoid overplotting.
d. Color points on the basis of frequency
e. Ex: Use bubble charts for extra dimensions – using color, shape, size and shading
of the ‘dots’ allows plots to represent additional dimensions
4. Bar plots vs. Pie Charts
a. Bar plots: show the frequency or proportion of categorical variables
b. Pie charts: use more space and are harder to read and compare – DO NOT USE
FOR POSTER
5. Histograms
a. Data distribution
b. Assumptions of tests
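A rough sketch of what bar plots and histograms actually summarise, using only Python's standard library and text output instead of a plotting package. The species names, measurements and bin width are hypothetical values chosen for illustration.

```python
from collections import Counter

# Hypothetical categorical observations: a bar plot shows their
# frequency (count) or proportion per category.
species = ["oak", "ash", "oak", "birch", "oak", "ash"]
counts = Counter(species)
proportions = {name: n / len(species) for name, n in counts.items()}

for name, n in counts.most_common():
    print(f"{name:<6} {'#' * n} ({proportions[name]:.2f})")

# Hypothetical continuous measurements: a histogram shows how the
# data are distributed by counting values per bin (width 10 here).
values = [12, 15, 21, 22, 25, 27, 31, 33, 38, 45]
bin_width = 10
bins = Counter((v // bin_width) * bin_width for v in values)

for start in sorted(bins):
    print(f"{start}-{start + bin_width - 1}: {'#' * bins[start]}")
```

The shape of the binned counts is what you would inspect when judging whether a normality assumption looks reasonable.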
T tests & ANOVA
Introduction to T tests
- Used to check for differences in scale data (interval or ratio level), and only when you
have 2 treatments.
- Parametric statistics (t-test): test the means, used when the data are normally
distributed. Ex: One sample, Two sample (Independent), Paired
- Non-parametric statistics (alternatives to the t-test): used when the data aren’t normally
distributed. The difference from parametric tests is that they test the median rather than
the mean. Ex: Wilcoxon one-sample, Mann Whitney U, Wilcoxon paired
When to use a T-test
- To test for a significant difference in the means between two groups
- Essentially, a t-test allows us to compare the average values of the two data sets and
determine if they came from the same population.
Assumptions
1. Only two treatments are compared.
2. Replicates sampled from each treatment are independent of one another
3. Data in the samples are normally distributed
4. Both samples should have similar variance (homogeneity of variance) – one variance
should not be more than twice the other
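The rule of thumb in assumption 4 can be checked with a simple variance ratio. This is a minimal sketch using Python's standard library; the function name and the two samples are hypothetical, and a formal test of equal variances would be a separate step.

```python
from statistics import variance


def variance_ratio(sample_a, sample_b):
    """Return the larger sample variance divided by the smaller one."""
    var_a = variance(sample_a)  # sample variance (n - 1 denominator)
    var_b = variance(sample_b)
    return max(var_a, var_b) / min(var_a, var_b)


# Hypothetical measurements for two treatments.
group_a = [5.1, 4.9, 5.6, 5.8, 5.0, 5.4]
group_b = [6.0, 6.4, 5.7, 6.9, 6.1, 6.5]

ratio = variance_ratio(group_a, group_b)
similar_variance = ratio < 2  # rule of thumb from the notes

print(f"variance ratio = {ratio:.2f}, similar = {similar_variance}")
```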
One Sample T-Tests
- Statistical difference between a sample mean and a known or hypothesized value of the
mean in the population
- Need to follow assumptions
- Ex: Checking the average blood cholesterol value against the recommended value
- R Code: Two sided vs. One sided (if you know whether you're testing for either greater or
less than)
- Check for normality
- If the P value is below 0.05 then it is statistically significant
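The notes refer to R for running the test; the sketch below uses Python's standard library only to show the arithmetic behind a one-sample t statistic. The cholesterol readings and the recommended value of 5.0 are hypothetical, and the critical value 2.262 is the standard two-sided 5% value for 9 degrees of freedom from t tables, used here in place of an exact p-value.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(sample, hypothesised_mean):
    """Return (t statistic, degrees of freedom) for a one-sample t-test."""
    n = len(sample)
    standard_error = stdev(sample) / sqrt(n)
    t_statistic = (mean(sample) - hypothesised_mean) / standard_error
    return t_statistic, n - 1


# Hypothetical blood cholesterol readings and recommended value.
cholesterol = [5.4, 5.9, 6.1, 5.2, 5.8, 6.3, 5.6, 6.0, 5.7, 6.0]
recommended = 5.0

t, df = one_sample_t(cholesterol, recommended)

# Two-sided test at the 5% level: compare |t| with the tabulated
# critical value for df = 9 (2.262).
critical_t = 2.262
significant = abs(t) > critical_t

print(f"t = {t:.2f}, df = {df}, significant = {significant}")
```

For a one-sided test the same t statistic is compared against a one-sided critical value instead.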
Two Sample T-tests
- Statistical difference between two population means.
- Ex: Checking the average blood cholesterol value between men and women
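A minimal sketch of the independent two-sample t statistic, assuming equal variances (the pooled-variance form, consistent with the homogeneity assumption). The readings for men and women are hypothetical, and 2.228 is the standard two-sided 5% critical value for 10 degrees of freedom from t tables.

```python
from math import sqrt
from statistics import mean, variance


def two_sample_t(sample_a, sample_b):
    """Return (t statistic, degrees of freedom) using a pooled variance."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled_variance = (
        (n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)
    ) / df
    standard_error = sqrt(pooled_variance * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error, df


# Hypothetical cholesterol readings for two independent groups.
men = [5.1, 4.9, 5.6, 5.8, 5.0, 5.4]
women = [6.0, 6.4, 5.7, 6.9, 6.1, 6.5]

t, df = two_sample_t(men, women)
significant = abs(t) > 2.228  # two-sided 5% critical value for df = 10

print(f"t = {t:.2f}, df = {df}, significant = {significant}")
```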
Paired T-Tests
- Statistical difference between two variables for the same subject
- It’s used to determine whether the mean difference between two sets of observations is
zero. In a paired sample t-test, each subject or entity is measured twice, resulting in pairs
of observation.
- Null hypothesis: there is no difference between the observed mean difference and zero
- Ex: Checking the average blood cholesterol value of women, before and after treatment
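A paired t-test is a one-sample t-test on the per-subject differences, tested against zero. The sketch below shows that calculation with Python's standard library; the before and after readings are hypothetical, and 2.365 is the standard two-sided 5% critical value for 7 degrees of freedom from t tables.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(first, second):
    """Return (t statistic, degrees of freedom) for paired measurements."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    differences = [b - a for a, b in zip(first, second)]
    n = len(differences)
    standard_error = stdev(differences) / sqrt(n)
    return mean(differences) / standard_error, n - 1


# Hypothetical cholesterol readings for the same eight women,
# measured before and after treatment (same order in both lists).
before = [6.2, 5.9, 6.5, 6.1, 5.8, 6.4, 6.0, 6.3]
after = [5.8, 5.7, 6.0, 6.0, 5.5, 5.9, 5.9, 5.9]

t, df = paired_t(before, after)
significant = abs(t) > 2.365  # two-sided 5% critical value for df = 7

print(f"t = {t:.2f}, df = {df}, significant = {significant}")
```

A negative t here means the readings fell after treatment.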
Mann Whitney U Test
- Statistical difference between two population medians
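The U statistic can be computed by counting, for every pair of observations across the two groups, which group has the larger value. This is a minimal sketch with made-up data; it returns the statistic only, and the p-value would still need to be looked up in tables or obtained from statistical software.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b, U) where U is the smaller of the two.

    U_a counts the pairs (a, b) in which a > b; ties count as 0.5.
    """
    u_a = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in sample_a
        for b in sample_b
    )
    u_b = len(sample_a) * len(sample_b) - u_a
    return u_a, u_b, min(u_a, u_b)


# Hypothetical scores for two independent groups.
group_a = [3, 5, 8, 9]
group_b = [1, 2, 4, 6, 7]

u_a, u_b, u = mann_whitney_u(group_a, group_b)
print(f"U_a = {u_a}, U_b = {u_b}, U = {u}")
```

Because the test works on the ordering of values instead of their means, it does not need the normality assumption.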