LE: Introduction
Big data:
- A collection of data so large or complex that traditional data analysis software
and infrastructure are inadequate to deal with it.
- Data of a very large size, typically to the extent that its manipulation and
management present significant logistical challenges.
Big data is not only the volume of data:
- Dimensions of big data: volume, velocity and variety.
- The application of machine learning to detect patterns in data.
- New combinations of data.
- Answering questions that were impossible to address in the past.
6 V’s: the first three describe the data itself, the last three describe how to interpret the data
- Volume: High throughput technologies (automated generation of data), continuous
monitoring of vital signs, increased storage capacities, increased communication
technologies, user-generated data (smartphone apps, social media, wearables).
- Velocity: Speed at which data is gathered. High-speed processing for fast clinical
decision support, increased data generation rate by the health infrastructure,
nowcasting (live data, social media monitoring), continuous monitoring. Challenge:
high-speed analysis (efficient algorithms).
- Variety: Data at different scales (molecule to population), different time
periods (frequency), heterogeneous data (data formats/types: numerical,
categorical, binary), different nature of data (structured → experimental designs,
protocols; unstructured → text, social media, images).
- Variability: Dynamics of the big data tools (algorithms) or dynamics of
biological processes (biological rhythms → biological clocks; non-deterministic
disease processes → processes underlying data collection; seasonal health
effects and disease evolution).
- Veracity & Value: clinically relevant data, longitudinal studies, quality of data, validity
of results, clinical relevance
Why is big data important:
- Growth in data acquisition (‘omics techniques)
- Developments in ICT technology (increasing computing power → Moore’s
law)
- Developments in data analysis: machine learning
- Personalized healthcare
- Silicon Valley and healthcare (Google, Microsoft, Apple)
Applications:
- Personalized health care
- Computer simulations of biological systems (organs)
- Predicting public health trends
- Self-learning healthcare systems
Reject the statement "with enough data, the numbers speak for themselves" (cons of big data):
- Inherent limitations in big data tools (poor archiving and search functions; enormous
quantities of data may lead to detecting patterns where none actually exist).
- Understanding of data requires interpretation and prior knowledge.
- Bigger data is not always better data (a large amount of data is not necessarily
representative data).
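The warning that enormous quantities of data can surface patterns where none exist can be illustrated with a small simulation. This is a minimal sketch and not part of the lecture: the function names `pearson` and `strongest_noise_correlation`, the sample size of 20 and the feature counts are assumptions chosen for illustration. Every feature is pure random noise, yet the strongest correlation with an equally random outcome grows as more features are screened.

```python
import random
from statistics import fmean


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def strongest_noise_correlation(n_features, n_samples=20, seed=0):
    """Largest |r| between a random outcome and n_features random features.

    Nothing here is truly related: outcome and features are independent noise.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, outcome)))
    return best


if __name__ == "__main__":
    for k in (10, 100, 1000):
        print(k, "noise features -> strongest |r| =",
              round(strongest_noise_correlation(k), 2))
```

Because the same seed is reused, each larger run screens a superset of the features in the smaller run, so the strongest spurious correlation can only stay equal or increase.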
How big data can improve health
- By discovering associations and understanding patterns and trends within the data,
big data analytics has the potential to improve care, save lives and lower costs
- Detect diseases at earlier stages
- Prediction of health outcomes
- Analyzing disease patterns and tracking disease outbreaks and transmission
- Faster development of drugs and targeted vaccines
- Effective genomic analysis
- Continuous, remote monitoring
- Analysis of patient profiles
Ecological fallacy: When data collected at a group level are analyzed and the results are
assumed to apply to associations at the individual level.
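A small simulation may make the ecological fallacy concrete. The sketch below is an illustration under invented assumptions (five groups, 200 individuals each, hypothetical variables `x` and `y`; the function names are also made up): group averages of `x` and `y` rise together, while within every group individuals with higher `x` tend to have lower `y`.

```python
import random
from statistics import fmean


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_groups(n_groups=5, n_per_group=200, seed=1):
    """Return a list of (xs, ys) pairs, one pair per group (synthetic data)."""
    rng = random.Random(seed)
    groups = []
    for g in range(n_groups):
        centre = 10.0 * g  # group means of x and y increase together
        xs, ys = [], []
        for _ in range(n_per_group):
            dx = rng.gauss(0, 1)
            xs.append(centre + dx)
            # within a group, y decreases when x increases
            ys.append(centre - dx + rng.gauss(0, 0.5))
        groups.append((xs, ys))
    return groups


def group_level_r(groups):
    """Correlation between the group means of x and the group means of y."""
    return pearson([fmean(xs) for xs, _ in groups],
                   [fmean(ys) for _, ys in groups])


def within_group_rs(groups):
    """Correlation between x and y computed separately inside each group."""
    return [pearson(xs, ys) for xs, ys in groups]


if __name__ == "__main__":
    data = simulate_groups()
    print("group-level r:", round(group_level_r(data), 2))
    print("within-group r:", [round(r, 2) for r in within_group_rs(data)])
```

In this synthetic setting the group-level association is strongly positive while every individual-level association is negative, so applying the group result to individuals would give the wrong sign.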
LE: Data-driven science
Genome imputation: predict which genes (gene variants) a person has based on the genotypes of family members.
Personalized nutrition: computer model that can predict what blood sugar concentration you
have after a meal based on metabolism patterns of different people.
Advantages of big data: more statistical power. There are many ways to calculate statistical
power: the likelihood that a study will distinguish an effect of a certain size from pure chance.
Volume of data can increase certainty of findings.
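The link between data volume and statistical power can be sketched with a simulation. The example below is one of many possible ways to estimate power and rests on assumptions that are not in the notes: two groups with a known standard deviation of 1, a true effect of 0.3 standard deviations, a two-sided z-test at alpha = 0.05, and the hypothetical function name `simulated_power`.

```python
import random
from statistics import NormalDist, fmean


def simulated_power(n_per_group, effect_size=0.3, alpha=0.05,
                    n_sim=1000, seed=0):
    """Fraction of simulated studies in which the z-test detects the effect."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha=0.05
    # standard error of the difference in means when both SDs equal 1
    se = (2 / n_per_group) ** 0.5
    detected = 0
    for _ in range(n_sim):
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        treated = [rng.gauss(effect_size, 1) for _ in range(n_per_group)]
        z = (fmean(treated) - fmean(control)) / se
        if abs(z) > z_crit:
            detected += 1
    return detected / n_sim


if __name__ == "__main__":
    for n in (20, 50, 200):
        print("n per group =", n, "-> estimated power =", simulated_power(n))
```

With the effect size held fixed, the estimated power rises as the number of observations per group grows, which is the sense in which volume can increase the certainty of findings.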
Differences small data approach vs big data approach
- Speed of discovery: sequential testing of hypotheses (hypothesis-driven) vs
algorithmic hypothesis generation and testing (data-driven).
- Quantity vs quality of data: quality (missing values, noise) over quantity vs trade-off
between quality and quantity. Don’t need to sample in big data → less
bias, a higher number is more precise (but there can still be bias if
something is measured wrong).
- Method of reasoning: deductivism (general principles → special cases/specific
observations) vs inductivism (special cases/specific observations → general
principles) → never 100% sure the conclusion is true (probabilistic), but close to
absolute truth because of volume.
- Explanatory dimension: exaggerating causality → why vs pragmatic
embrace of what → what. Big data is able to find correlations/patterns
without knowing why.
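The remark in the list above that a higher number of observations is more precise but can still be biased when something is measured wrongly can be sketched as follows. All numbers are invented for illustration (a true value of 120, a device that systematically reads 5 units too high, random noise with standard deviation 10), and `biased_estimate` is a hypothetical helper name.

```python
import random
from statistics import fmean, stdev


def biased_estimate(n, true_value=120.0, offset=5.0, noise_sd=10.0, seed=0):
    """Mean and standard error of n readings from a miscalibrated device.

    Each reading = true value + systematic offset + random noise.
    """
    rng = random.Random(seed)
    readings = [true_value + offset + rng.gauss(0, noise_sd)
                for _ in range(n)]
    mean = fmean(readings)
    standard_error = stdev(readings) / n ** 0.5
    return mean, standard_error


if __name__ == "__main__":
    for n in (100, 10_000, 100_000):
        mean, se = biased_estimate(n)
        print("n =", n, "mean =", round(mean, 2), "standard error =", round(se, 3))
```

The standard error shrinks as n grows, so the estimate becomes more precise, but it converges on the true value plus the offset: more data does not remove the systematic measurement error.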
Challenges of big data
- Do you have all the data? Sample is defined relative to a target population, data is
collected by different experiments and methodologies, which are based on choices
and background theories.
- Curse of dimensionality. High dimensional data: each sample is described by many