NOTES Big Data in Biomedical Sciences
LECTURE 1. Intro to Big Data – opening lecture
X What is Big Data in the field of health care?
Big Data is a new paradigm and an ecosystem that transforms case-based studies into
large-scale, data-driven research.
- Ecosystem: it is not just one type of data analysis or strategy, but a combination of
approaches.
- From case-based studies to large-scale, data-driven research: the focus shifts from
looking at individual cases/patients to data-driven research in which all types of
data (sets) are first explored, and new hypotheses are then extracted from them that
can be tested on patients.
X In health care:
Big Data will use specific health data of a population (or of a particular individual) and
potentially help to prevent epidemics, cure disease, cut down healthcare costs, etc.
- Why do we need big data analysis in Healthcare?
o Costs are increasing.
o Patient outcomes need to be prioritized.
→ you need data and information for that; it is not just about analysing data,
but also about tying some sort of action to it.
- Integration is a key element of big data – we want to combine all sorts of data from
all sorts of resources so that we get a much broader understanding of our
underlying data problem.
X Data science:
Within the field of data science, the first thing you want to do is to make descriptions (1):
you have a large data set and you want to describe it.
Then you want to explore the large data set for new associations (2), so things that were not
yet known – this is the hypothesis generation phase. For this you want to integrate all types
of (new) data (3). Finally, you want to make predictions about future events (4).
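The four steps above can be sketched in a few lines of Python. This is only a minimal illustration: the patient ids, ages, and blood pressure values are invented toy data, and the function names (`describe`, `integrate`, `pearson`, `fit_line`) are hypothetical helpers, not part of the lecture.

```python
from statistics import mean, stdev

# Hypothetical toy data, invented purely for illustration.
ages = {"p1": 34, "p2": 51, "p3": 47, "p4": 62, "p5": 29}           # source A
systolic = {"p1": 118, "p2": 135, "p3": 128, "p4": 146, "p5": 112}  # source B


def describe(values):
    """(1) Description: summarise a single variable."""
    return {"n": len(values), "mean": mean(values), "sd": stdev(values)}


def integrate(a, b):
    """(3) Integration: join two sources on the shared patient id."""
    return [(a[k], b[k]) for k in sorted(a.keys() & b.keys())]


def pearson(pairs):
    """(2) Exploration: strength of the linear association between x and y."""
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def fit_line(pairs):
    """(4) Prediction: least-squares line y = slope * x + intercept."""
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in pairs) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx


pairs = integrate(ages, systolic)
print(describe(list(ages.values())))
print("r =", round(pearson(pairs), 3))
slope, intercept = fit_line(pairs)
print("predicted systolic at age 40:", round(slope * 40 + intercept, 1))
```

A strong correlation in such a sketch is only a hypothesis to test, in line with the hypothesis generation phase described above.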
LECTURE 2. Data, Data science, and Big Data
X What is Data?
“a set of values of qualitative or quantitative variables” – Wikipedia.
- Set: a collection of distinct objects.
- Variable: a measured property of an object.
- Qualitative: describes a quality or category (non-numerical).
- Quantitative: describes an amount (numerical).
Most often data does not come as a single predefined data set. It is usually collected as a
complete collection of data coming from many different types of resources.
o The scientist has to go from raw data, which comes from all these different
resources, to a structured form of data.
“information, especially facts or numbers, collected to be examined and considered and
used to help decision making” – Cambridge.
- values and actions; the actions in this case are predictions – the goal is to make
predictions.
o The scientist has to go from unstructured raw data to a structured
framework. Afterwards, they have to use this structured data set to examine
and explore the data.
▪ Ultimate goal: analyse data, visualize it, go from unstructured data to
structured data, and then make predictions.
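As a rough illustration of going from unstructured raw data to a structured data set, the sketch below parses free-text notes into rows with fixed fields. The notes, the field names, and the `structure` helper are all made up for this example; real raw data will look different and need different rules.

```python
import re

# Hypothetical raw notes, invented for illustration; real sources will differ.
RAW_NOTES = [
    "Patient 101, age 54, smoker: yes, systolic BP 142",
    "patient 102; AGE 37; smoker: no; systolic bp 118",
    "Patient 103, age unknown, smoker: no, systolic BP 131",
]


def structure(note):
    """Turn one free-text note into a row with fixed, typed fields."""
    text = note.lower()
    pid = re.search(r"patient\s+(\d+)", text)
    age = re.search(r"age\s+(\d+)", text)
    smoker = re.search(r"smoker:\s*(yes|no)", text)
    bp = re.search(r"systolic bp\s+(\d+)", text)
    return {
        "patient_id": int(pid.group(1)) if pid else None,
        "age": int(age.group(1)) if age else None,                # quantitative
        "smoker": smoker.group(1) == "yes" if smoker else None,   # qualitative
        "systolic_bp": int(bp.group(1)) if bp else None,          # quantitative
    }


rows = [structure(n) for n in RAW_NOTES]
for row in rows:
    print(row)
```

Values that cannot be found (such as the unknown age) are stored as `None` rather than guessed, so the missing information stays visible in the structured data.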
X What is Data science? – the science of data.
“Data science is an inter-disciplinary field that uses scientific methods, processes,
algorithms, and systems to extract or extrapolate knowledge and insights from structured
and unstructured data. Data science is related to data mining and big data” – Wikipedia.
- You have to go from data to information.
A data scientist: someone who works a lot with data collection and data engineering (going
from unstructured to structured data), and then performs all sorts of analyses.
- The data analytics may include:
o Data structuring.
o Data summary.
o Exploring.
o Machine learning (predictive part).
X What is Big Data?
“Very large sets of data that are produced by people using the internet, and that can only be
stored, understood, and used with the help of special tools and methods” – Cambridge.
- It does not have a single, precise definition.
- Big data is often discussed in the context of the three V’s:
1. Volume: the data sets are often very large.
2. Variety: data comes from all sorts of resources.
3. Velocity: data often comes in as a continuous stream.
o Sometimes a fourth V is added – Veracity: there is a lot of uncertainty about the
type of data that is coming in.
▪ Meaning that a lot of data points might be missing, and there might be
a lot of errors or noise in the data.
o A fifth V can also be added – Value: in big data we do not only want to collect
or analyse the data; the end goal is to add value to all of the
different types of resources.
▪ Going from a lot of data to information; the additive value of the
collected and processed data.
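Veracity can be made concrete with a small check for missing and implausible values. The sketch below uses invented heart-rate readings, and the plausibility range is an assumption chosen for this example only, not a clinical standard; `veracity_report` is a hypothetical helper.

```python
# Hypothetical heart-rate readings (beats per minute); None marks a missing value.
readings = [72, 75, None, 71, 430, 69, None, 74, -5, 73]

# Assumed plausibility range for this sketch, not a clinical standard.
LOW, HIGH = 20, 250


def veracity_report(values, low=LOW, high=HIGH):
    """Count missing and implausible values, and keep the usable ones."""
    missing = sum(v is None for v in values)
    present = [v for v in values if v is not None]
    implausible = [v for v in present if not low <= v <= high]
    usable = [v for v in present if low <= v <= high]
    return {
        "total": len(values),
        "missing": missing,
        "implausible": implausible,
        "usable": usable,
    }


report = veracity_report(readings)
print(report)
```

Only the usable values would go on to the analysis; the counts of missing and implausible points show how much uncertainty there is in the incoming data.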
LECTURE 3. MATLAB programming refresher
X What is programming?
- Designing a set of instructions for a computer to execute.
o Basically, making a machine able to perform automated actions – what do we
need in order to perform ‘automated operations’?
1. We need a mechanism to store information in a very systematic way.
2. We need some way to perform operations on this stored information.
3. We should make a list of these operations.
→ leading to automated procedures.
Within a computer you can store every type of information as zeros and ones (0 and 1),
which is the binary code. Each 0 or 1 is referred to as a bit.
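To see how information ends up as bits, the sketch below writes a number and a short text as zeros and ones and reads the text back. It uses Python rather than MATLAB purely for illustration; `to_bits` and `from_bits` are hypothetical helper names, and UTF-8 is an assumed encoding.

```python
def to_bits(text):
    """Encode text as a string of 0s and 1s (8 bits per UTF-8 byte)."""
    return " ".join(format(byte, "08b") for byte in text.encode("utf-8"))


def from_bits(bits):
    """Decode a space-separated bit string back into text."""
    return bytes(int(chunk, 2) for chunk in bits.split()).decode("utf-8")


print(format(5, "08b"))          # the number 5 as eight bits: 00000101
print(to_bits("Hi"))             # 01001000 01101001
print(from_bits(to_bits("Hi")))  # Hi
```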
Turing machine example:
1. We need a mechanism to store information in a very systematic way.
= the tape used to write the 0 and 1 on.
2. We need some way to perform operations on this stored information.
= the mechanism to perform the operations like the pen and the eraser.
3. We should make a list of these operations.
= the instructions themselves, which are information that can also be stored on the tape.
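The three ingredients can be tried out in a very small simulation. This is a simplified sketch, not a full formal Turing machine: the rule table (which inverts every bit and stops at the first blank cell), the state names, and the `run` function are assumptions made for this example.

```python
BLANK = "_"

# 3. The list of operations:
#    (state, symbol read) -> (symbol to write, head move, next state).
# This hypothetical rule table inverts every bit and halts at the first blank cell.
RULES = {
    ("scan", "0"): ("1", +1, "scan"),
    ("scan", "1"): ("0", +1, "scan"),
    ("scan", BLANK): (BLANK, 0, "halt"),
}


def run(tape_text, rules=RULES, state="scan", max_steps=1000):
    """Run the machine on the given tape contents and return the final tape."""
    tape = dict(enumerate(tape_text))       # 1. storage: cell position -> symbol
    head = 0
    for _ in range(max_steps):              # safety limit so the loop always ends
        if state == "halt":
            break
        symbol = tape.get(head, BLANK)      # read the cell under the head
        write, move, state = rules[(state, symbol)]
        tape[head] = write                  # 2. operation: erase and write
        head += move                        # 2. operation: move the head
    cells = (tape[i] for i in sorted(tape))
    return "".join(cells).rstrip(BLANK)


print(run("10110"))  # 01001
```

Changing only the `RULES` table changes what the machine does, which is why the list of operations is the actual "program".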
X Computer programming today
- The lowest level of instructions to the computer is the binary code (most similar to the
Turing machine).
o Zeros and ones – very clear instructions on what to do with every bit.
- Above that, you have the machine code.
o Slightly more general instructions that are translated into the binary code.
- Assembly code/ assembly language.
o Human-readable instructions that are translated into machine code, binary code, etc.
- Scripting/ interpreted languages.
o Languages at the top of the pyramid include Perl, Python, Java, and MATLAB.