Summary Data Science in Biomedicine
Lecture 1: Introduction to Data Science in Biomedicine
Date: 26-09-2022
- Patient data collection -> biomedical data (Electronic Health Records (EHR) and
omics) -> personalised health data analysis (large-volume data, data
management and high-performance computing)
- Translate large data sets into something you can understand and discuss
There are many types of (big) data available:
- Numerical
- Textual
- Categorical
- Imaging
- Clinical
- Demographic
- Psychosocial
- Lifestyle
- Environmental
- Genomic
- DNA
- Genes
- Proteins
- RNA
- SNPs
- ncRNA
- Splice variants
- RNA expression levels
Next Generation Sequencing (NGS)
- One Illumina NovaSeq 6000 run reads 6,000,000,000,000 bases (6,000,000,000 kb,
6,000,000 Mb, 6,000 Gb, 6 Tb) in ~44 hr, so computers and software are necessary
- Bioinformatics pipelines, e.g. for analyzing NGS data:
reference mapping -> transcript assembly, comparison, merging -> detection of
differentially expressed genes/transcripts (understand the input and output of the
programs, know your statistics, modify the graphical output).
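The unit conversion for the NovaSeq run can be checked with a few lines of Python. This is only a sketch: it assumes decimal prefixes (1 kb = 1,000 bases) and uses the run size and run time quoted in these notes; the names BASES_PER_RUN, RUN_HOURS and convert_bases are made up for the example.

```python
# Unit conversion for sequencing throughput.
# Assumption: decimal prefixes, i.e. 1 kb = 1,000 bases.
BASES_PER_RUN = 6_000_000_000_000  # bases per run, as quoted in the notes
RUN_HOURS = 44                     # approximate run time, as quoted in the notes

UNITS = {"kb": 10**3, "Mb": 10**6, "Gb": 10**9, "Tb": 10**12}


def convert_bases(n_bases):
    """Return the base count expressed in kb, Mb, Gb and Tb."""
    return {unit: n_bases / factor for unit, factor in UNITS.items()}


converted = convert_bases(BASES_PER_RUN)
for unit, value in converted.items():
    print(f"{value:,.0f} {unit}")

# Average throughput over the whole run.
bases_per_second = BASES_PER_RUN / (RUN_HOURS * 3600)
print(f"about {bases_per_second:,.0f} bases per second")
```

The average comes out at tens of millions of bases per second, which is why the data cannot be handled without dedicated software.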
Using R or Python
R -> Retrieve data from a database, apply statistical analyses and visualize results
Python -> If the data is in the wrong format, write a small Python script to convert it
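A sketch of what such a small conversion script could look like. The input format is an assumption for the example (semicolon-separated values with decimal commas, as some spreadsheet exports produce); the gene names, values and the function name semicolon_to_tsv are hypothetical.

```python
import csv
import io


def semicolon_to_tsv(text):
    """Convert semicolon-separated text with decimal commas to tab-separated text.

    Assumption: no cell contains a comma that is not a decimal separator.
    """
    reader = csv.reader(io.StringIO(text), delimiter=";")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in reader:
        # Strip stray spaces and turn decimal commas into decimal points.
        writer.writerow([cell.strip().replace(",", ".") for cell in row])
    return out.getvalue()


# Hypothetical input, as it might come out of a spreadsheet export.
raw = "gene;sample_1;sample_2\nGENE_A;1,5;2,25\nGENE_B;0,75;3,0\n"
tsv = semicolon_to_tsv(raw)
print(tsv)
```

The output is a tab-separated table that is easier to load into a statistics program.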
R vs Python
- R is dedicated to statistics
- R is very popular in research
- Many good libraries for R; Genomics, GWAS, Proteomics, Transcriptomics,
Metabolomics etc.
- R is designed primarily as a statistical scripting environment rather than as a
general-purpose programming language
- Python is easier to learn and much better at handling text files and text-based data files
- Both R and Python are generally slower than C++
- Although loads of people use R, its use may decline, so why still learn it?
R
- open source package for Statistics
- most popular statistics program in bioinformatics
- Also popular: pandas (the Python data analysis library) and MATLAB
R vs Excel
- In Excel you can load data by opening a file or by copy-pasting a data table
- You can edit this data directly in Excel
- In R you do NOT edit the data by hand: changes are made through script commands
R Graphics
- A popular graphics library is ggplot2 (Python ports exist, e.g. plotnine)
- You can log-transform the data with log(my_data)
- How to plot multiple classes:
multiple_classes <- c("N", "O", "P")
my_multi_subset <- subset(my_annotated_subset, classID %in% multiple_classes)
- c() combines values into a vector
- To show additional dimensions of the data, graphs are often arranged in a matrix of panels
- You have: Script, Data Sets, Text output and graphic output
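For comparison, the same subsetting and log-transform idea sketched in plain Python. The records and their values are hypothetical; only the names multiple_classes, my_annotated_subset, my_multi_subset and classID come from the R example in these notes.

```python
import math

# Hypothetical annotated data: each record has a class label and a measured value.
my_annotated_subset = [
    {"classID": "N", "value": 10.0},
    {"classID": "O", "value": 100.0},
    {"classID": "Q", "value": 5.0},
    {"classID": "P", "value": 1000.0},
]
multiple_classes = {"N", "O", "P"}

# Keep only the rows whose class is in the selection
# (the counterpart of classID %in% multiple_classes in R).
my_multi_subset = [
    row for row in my_annotated_subset if row["classID"] in multiple_classes
]

# Natural logarithm, which is also the default of log() in R.
log_values = [math.log(row["value"]) for row in my_multi_subset]

for row, log_value in zip(my_multi_subset, log_values):
    print(row["classID"], round(log_value, 3))
```

The record with class "Q" is dropped because it is not in the selected classes.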
Lecture 2: Data Science in Biomedicine, Basic Statistics 1
Date: 27-09-2022
What is statistics?
- Why do we need statistics?
- When is there a difference?
- P-value?
- Impact of risk
- Identify problems
- Where does the data come from?
- Which data and conclusions are trustworthy?
- Properties?
- Reliable p-value
Measurements
- experiments -> variation
- variation between persons, equipment and time of day
- define the experiments properly
- what is the main source of variation?
- after standardization, do we always get exactly the same value?
- measurements show variation!
P-value
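- a p-value is the probability of obtaining a result at least as extreme as the
observed result, assuming the null hypothesis is true
- common cutoff: 0.05
- in the distribution plot, the x-axis is the set of possible results
- the y-axis is the probability density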
- same statistics, same p-value, different “impact of risk”
- you can calculate a p-value, but it never tells you whether a result is good or bad
- especially in Biomedical sciences this can be an ethical discussion: Risk for
treating/not treating patients and until which age should you treat a patient
- 0.05 is a good starting point, but always evaluate this assumption
- p-value cutoff: a p-value below the cutoff means that, if your null hypothesis
is indeed correct and there is no difference between the groups, the result
that you obtained is rare. You would expect to obtain such a result fewer than 1
in 20 times if you collected samples over and over again.
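The "1 in 20" statement can be illustrated with a small simulation. This is a sketch under simplifying assumptions: it uses a two-sided z-test with a known standard deviation (not one of the tests listed for this course) because its p-value can be computed with the standard library alone. The sample size, seed and number of experiments are arbitrary choices.

```python
import math
import random


def two_sided_p_value(sample, mu0, sigma):
    """Two-sided z-test p-value for the sample mean, assuming sigma is known."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    # For a standard normal Z: P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


rng = random.Random(1)  # fixed seed so the run is reproducible
n_experiments = 5000
false_positives = 0
for _ in range(n_experiments):
    # The null hypothesis is true here: the data really have mean 0 and sd 1.
    sample = [rng.gauss(0.0, 1.0) for _ in range(10)]
    if two_sided_p_value(sample, mu0=0.0, sigma=1.0) < 0.05:
        false_positives += 1

fraction = false_positives / n_experiments
print(f"fraction of experiments with p < 0.05 under the null: {fraction:.3f}")
```

Even though there is no real effect in any of these simulated experiments, roughly 5% of them still fall below the 0.05 cutoff.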
Generating data
- A statistician wants: a well-designed study, trustworthy data and many
replicates
- A statistician knows how to: analyze data and calculate p-values
- A statistician does not know: the detailed theoretical background, the impact of risk
(threshold) and potential pitfalls.
Some basic statistics in this course
- t-test
- linear regression
- permutation testing
- FDR testing
- Fisher's exact test
- Chi-squared test
- Pearson's vs Spearman's correlation
- PCA
T-statistic
- Compares two data sets and tells you if they are different from each other
- e.g. compare two groups, one treated with a drug the other with a placebo
- Historical figures in statistics (year of birth):
- Pearson 1857
- Fisher 1890
- Neyman 1894 (Random stats)
- Bayes 1702 (probability stats)
- A t-test is a statistical test that is used to compare the means of two
groups. It is often used in hypothesis testing to determine whether a process
or treatment actually has an effect on the population of interest, or whether
two groups are different from one another
Types of t-test
1. Independent samples: compares the means of two independent groups
2. Paired samples: compares means from the same group (e.g. at different time
points)
3. One sample: tests the mean of a single group against a known mean (a standard or
reference)
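The independent-samples and one-sample variants can be sketched with the Python standard library. Assumptions for this example: the independent test is Student's version with pooled variance (equal variances in both groups), and all measurement values and the reference mean are hypothetical. Only the t-statistic is computed; turning it into a p-value needs the t-distribution with the matching degrees of freedom (from a table or a statistics package).

```python
import math
import statistics


def one_sample_t(sample, reference_mean):
    """t-statistic for one group against a known mean (df = n - 1)."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.mean(sample) - reference_mean) / standard_error


def independent_t(group_a, group_b):
    """Student's t-statistic for two independent groups (df = na + nb - 2).

    Assumption: both groups have the same variance (pooled variance).
    """
    na, nb = len(group_a), len(group_b)
    pooled_variance = (
        (na - 1) * statistics.variance(group_a)
        + (nb - 1) * statistics.variance(group_b)
    ) / (na + nb - 2)
    standard_error = math.sqrt(pooled_variance * (1 / na + 1 / nb))
    return (statistics.mean(group_a) - statistics.mean(group_b)) / standard_error


# Hypothetical measurements: one group treated with a drug, one with a placebo.
drug = [5.1, 4.9, 5.6, 5.8, 6.0]
placebo = [4.2, 4.8, 4.5, 5.0, 4.4]

print("independent samples t:", round(independent_t(drug, placebo), 3))
print("one sample t (drug vs reference 5.0):", round(one_sample_t(drug, 5.0), 3))
```

A larger absolute t-statistic means the difference in means is large compared with the variation within the groups.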
Paired data
- a group of 8 mice measured before and after albumin treatment
- the null hypothesis is that the mean pairwise difference between the two
measurements is zero (H0: μd = 0)
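A paired t-test for such a design can be sketched as follows. The before/after values are hypothetical (these notes do not give the measurements), and the decision uses the tabulated two-sided critical value for df = 7 at the 0.05 level instead of an exact p-value.

```python
import math
import statistics

# Hypothetical measurements for 8 mice (arbitrary units), same mouse order in both lists.
before = [10.2, 11.5, 9.8, 12.1, 10.9, 11.0, 9.5, 10.7]
after = [11.0, 12.1, 10.1, 13.0, 11.2, 11.9, 10.3, 11.1]

# A paired test works on the per-mouse differences, not on the two groups separately.
differences = [a - b for a, b in zip(after, before)]
n = len(differences)
mean_d = statistics.mean(differences)
sd_d = statistics.stdev(differences)

# t-statistic for H0: mean difference = 0, with n - 1 degrees of freedom.
t_stat = mean_d / (sd_d / math.sqrt(n))
df = n - 1

# Two-sided critical value of the t-distribution for df = 7 and alpha = 0.05 (from a t-table).
T_CRITICAL_DF7 = 2.365
reject_h0 = abs(t_stat) > T_CRITICAL_DF7

print(f"mean difference = {mean_d:.3f}, t = {t_stat:.2f}, df = {df}")
print("reject H0" if reject_h0 else "do not reject H0")
```

Because each mouse serves as its own control, variation between mice drops out of the differences, which is the main advantage of the paired design.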