Fundamentals of Bioinformatics
TABLE OF CONTENTS
LECTURES
Lecture 1 Fundamentals of Bioinformatics – Introduction
Lecture 2 Evolution
Lecture 3 Introduction to Machine Learning
Lecture 4 Protein structures & sequence profiles
Lecture 5 Impact prediction: SIFT and PolyPhen-2
Lecture 6 DNA Sequencing & Computational Analysis of Massively Parallel Sequencing (MPS) Data
Lecture 7 Title
Lecture 8 Big Data in Life Sciences
TUTORIALS
Tutorial 1
Tutorial 2 First steps in Python
Tutorial 3 Lists and loops
Tutorial 4 Functions
Tutorial 5 File I/O and dictionaries
Lectures
Lecture 1 Fundamentals of Bioinformatics – Introduction
Chapter 1
Keywords of bioinformatics: programming, knowledge of molecular biology, machine learning,
translating the research question into methods, designing workflows, general data science skills,
genomics/genetics, algorithms and statistics. Bioinformaticians often work on workflows.
Bioinformatics: studying the informatic processes in biotic systems.
Sequencing data analysis flow: images → reads (short stretches of DNA, 200 to 500 nucleotides long) →
alignment of the reads to existing genomes (computationally expensive) → significance values (SNP, …).
Small sequencing devices are available for laptops, but the computation remains heavy and is sent to a server.
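The alignment step of the flow above can be sketched with a toy example. This is only an illustration: it assumes exact matches and a tiny made-up reference, whereas real aligners use indexed genomes and tolerate mismatches. The function name map_reads and the example sequences are hypothetical.

```python
def map_reads(reference, reads):
    """Return {read: [start positions]} for exact matches of each read in the reference."""
    hits = {}
    for read in reads:
        positions = []
        start = reference.find(read)
        # Keep searching to the right of the previous hit until no match is left.
        while start != -1:
            positions.append(start)
            start = reference.find(read, start + 1)
        hits[read] = positions
    return hits


# Toy data (assumption): a 12-base "genome" and three 4-base "reads".
reference = "ACGTACGTTAGC"
reads = ["ACGT", "TTAG", "GGGG"]
hits = map_reads(reference, reads)
print(hits)
```

Every read is scanned against the whole reference, so the work grows with the number of reads times the genome length. This gives a feeling for why the step is computationally expensive for millions of reads and a genome of billions of bases.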
We measure molecules / variants in the cells. We can measure proteins, DNA and RNA; together these
measurements are called omics. In essence you get a profile of the molecular state of a sample.
The dimensionality is often a problem: there are far more features than samples, e.g. 10,000 features
per sample for only 150 people.
Applications are found in biomedicine, pharmacy, ecogenomics and plant breeding.
Data sources – Molecular profiling:
- Genome (genomics): point mutations / small variants (exome sequencing, DNAseq),
structural variants (arrayCGH, DNAseq), methylation.
- RNA transcription/expression (transcriptomics): RNAseq, microarrays, single-cell RNAseq.
- Proteins (proteomics): immunohistochemistry, mass spectrometry.
- Metabolites – small molecules (metabolomics): NMR, mass spectrometry.
- Microbes – bacteria/viruses/parasites: DNAseq (metagenomics), RNAseq (metatranscriptomics).
Genomics → transcriptomics → proteomics → metabolomics.
40% group project, 30% individual assignment, 30% conversion class.
Project: benchmark impact prediction methods.
Impact prediction methods try to predict the effect of a mutation. PolyPhen-2 and SIFT are existing
methods and will be used.
We need an annotated set in which the effect of each mutation is known; we use the ClinVar dataset.
Use the predictions from your method together with the gold standard annotations to create an ROC curve.
Which factors determine whether or not a mutation has an effect on the organism?
- Kind of mutation: point mutation, indel, deletion/duplication, frameshift, nonsense mutation.
- Effect on the codon: introduction or loss of a start or stop codon; whether the amino acid changes
at all (a change in the first two bases of a codon usually changes the amino acid, whereas a
wobble-base mutation often changes nothing); the type of amino acid change (another amino acid, or
an amino acid from another group).
- Location: to change the protein, the mutation needs to be in a gene, in an exon. A mutation in an
intron or in non-coding DNA (NCR, non-coding region) often has no effect. Other locations to
consider: IGR (intergenic region), RNA, repeat regions, pseudogenes.
- Regulation: a mutation can affect a promoter heavily; regulatory sequences, regulation signals and
splicing signals may change (introns and exons might change), leading to over-expression or
under-expression; epigenetic modification.
- Protein: whether the mutation lies in the active part of the protein and whether it alters the
structure of the protein; mutations may alter the eventual protein in different ways.
- Gene and cell context: whether the gene is expressed in the cell type concerned; whether it is part
of a major signalling pathway or of another system, e.g. DNA repair; redundancy (are other gene
copies available?); the number of mutations already present.
- Other: repair mechanisms of the cell, the ability to kill the cell, environmental changes
(spontaneous mutations).
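Several of the codon-level factors above (stop codons, amino acid changes, wobble bases) can be made concrete with a small sketch. It assumes the standard genetic code, written on the coding DNA strand, and looks at one codon in isolation. The function names are hypothetical and this is not how SIFT or PolyPhen-2 work.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_point_mutation(codon, position, new_base):
    """Classify a single-base change at position 0, 1 or 2 of one codon."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if before == after:
        return "synonymous"   # amino acid unchanged
    if after == "*":
        return "nonsense"     # new stop codon
    if before == "*":
        return "stop-loss"    # stop codon removed
    return "missense"         # different amino acid


def synonymous_fraction(position):
    """Fraction of single-base changes at a codon position that keep the amino acid."""
    same = total = 0
    for codon, amino_acid in CODON_TABLE.items():
        if amino_acid == "*":
            continue  # only start from codons that encode an amino acid
        for base in BASES:
            if base != codon[position]:
                total += 1
                same += classify_point_mutation(codon, position, base) == "synonymous"
    return same / total


print(classify_point_mutation("GGT", 2, "C"))  # GGT -> GGC, both glycine
print(classify_point_mutation("TGG", 2, "A"))  # TGG (Trp) -> TGA (stop)
print(classify_point_mutation("GAA", 1, "T"))  # GAA (Glu) -> GTA (Val)
print([round(synonymous_fraction(p), 2) for p in range(3)])
```

The last line compares the three codon positions. The third (wobble) position tolerates far more changes than the first two, which is why a wobble-base mutation often has no effect on the protein.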
One task of bioinformatics is to find which mutations cause a disease: many mutations are found, but
not all of them are important. The aim is to select the biomarker (which distinguishes disease from no disease).
Biomarker: an observation on which to base a clinical decision/diagnosis.
We use ClinVar for human SNPs as the gold standard dataset. We will compare the predictions from
each method with the gold standard to create an ROC curve, in order to assess and compare the quality
of the predictions made by the different methods.
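A minimal sketch of how such a comparison could be computed is given below. The scores and labels are made up for illustration and do not come from ClinVar, SIFT or PolyPhen-2. The code assumes that a higher score means "more likely damaging" (a score where low means deleterious, as with SIFT, would have to be flipped first) and that both classes occur in the gold standard.

```python
def roc_points(scores, labels):
    """Return ROC points (FPR, TPR) obtained by sweeping a threshold over the scores.

    labels: 1 = pathogenic, 0 = benign (the gold standard annotation).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    # Rank variants from most to least confidently predicted as damaging.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after all variants sharing this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve, using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical gold standard and predictions of two hypothetical methods.
gold = [1, 1, 0, 1, 0, 0]
method_a = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
method_b = [0.9, 0.2, 0.8, 0.4, 0.7, 0.1]
auc_a = auc(roc_points(method_a, gold))
auc_b = auc(roc_points(method_b, gold))
print(round(auc_a, 3), round(auc_b, 3))
```

An area of 1.0 corresponds to a perfect ranking and 0.5 to random guessing, so in this toy example method A ranks the pathogenic variants better than method B.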
SNV – single nucleotide variant: a position in the DNA where a person/cell has a different nucleotide
than the reference genome. SNVs are often heterozygous (present in only one of the two chromosome copies).
SNP – single nucleotide polymorphism: a position in the DNA where, across a population, variants
relative to the common reference are observed.
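The difference between the two definitions can be sketched in code. The example assumes sequences that are already aligned to the reference and have the same length, and treats each sample as a single (haploid) sequence, so heterozygosity is ignored. The sequences and function names are made up for illustration.

```python
def find_snvs(reference, sample):
    """Return [(position, reference base, sample base)] where the sample differs."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, r, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def variant_frequencies(reference, population):
    """Fraction of samples carrying a non-reference base at each variable position."""
    counts = {}
    for sample in population:
        for position, _, _ in find_snvs(reference, sample):
            counts[position] = counts.get(position, 0) + 1
    return {pos: n / len(population) for pos, n in sorted(counts.items())}


# Toy data (assumption): one reference and four individuals.
reference = "ACGTACGT"
population = ["ACGTACGT", "ACCTACGT", "ACCTACGA", "ACGTACGT"]

snvs = find_snvs(reference, population[2])        # SNVs of one individual
freqs = variant_frequencies(reference, population)  # variation across the population
print(snvs)
print(freqs)
```

The first result describes one individual (SNVs). The second describes how often a variant is seen at each position in the population, which is the view behind the term SNP.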
To make predictions, we look at conserved regions: we look into the past to see whether previous
changes at a certain position were allowed. Impact prediction can reduce the number of SNVs to
consider in biomarker selection.
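The idea of looking at conservation can be illustrated with a simple per-column score on a multiple sequence alignment. This is a sketch under assumptions: the alignment is made up, gaps ("-") are simply ignored, and the score (fraction of sequences sharing the most common residue) is only one of several possible conservation measures.

```python
from collections import Counter


def column_conservation(alignment):
    """For each alignment column, the fraction of sequences with the most common residue."""
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]  # ignore gaps
        if not residues:
            scores.append(0.0)
            continue
        most_common_count = Counter(residues).most_common(1)[0][1]
        scores.append(most_common_count / len(residues))
    return scores


# Toy alignment (assumption): the same protein region in four species.
alignment = ["MKVLA", "MKILA", "MRVLG", "MKLLA"]
scores = column_conservation(alignment)
print(scores)
```

Columns with a score of 1.0 never changed in this toy alignment, which suggests that changes there were not tolerated. A new mutation at such a position is more likely to have an impact than one at a variable position.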
SNPs occur almost once every 1000 nucleotides, so 4 to 5 million SNPs are present in one genome.
Reading
Central feature of life: ability to reproduce itself.
Three components of an evolutionary process: inheritance, the passing of characteristics from
parents to offspring; variation, the processes that make offspring other than exact copies of their
parents; and selection, the process that differentially favours the reproduction of some organisms,
and hence their characteristics, over others. Evolution is a cumulative process.
Life: the result of an evolutionary process taking place on Earth.
Reproductive fitness: a measure of how many surviving offspring an organism can produce.
All life is divided into four groups: viruses, archaea, bacteria and eucarya.
Vertebrates (animals with backbones (fish, reptiles, amphibians, birds, mammals)) are 3% of species.
Viruses: small amount of genetic material surrounded by a protein coat.
Sperm and eggs are germ cells (formed by meiosis); all other kinds of cells in the body are somatic.
Differentiated cells cannot reproduce an animal (except reproductive cells).
Membranes: boundaries between the cell and the outside world. All cells have a cell membrane made of
phospholipids (lipids with a phosphate group attached). The phosphate group is hydrophilic and the
lipid is hydrophobic. The membrane contains all sorts of signal transduction mechanisms.
Proteins: molecules that accomplish most of the functions of the living cell. They can, for example,
function as enzymes (which catalyse chemical reactions).
Proteins are built up out of 20 naturally occurring amino acids, sometimes as many as 4500 per protein.
Prosthetic groups: groups of atoms that some proteins need to bind in order to function.