Notes Fundamentals of
Bioinformatics
Lecture 1: intro + Big Data Challenges in Genomics (K.
Sjölander – 2/9)
How to think about Data Science:
- What types of data are involved?
- What is the structure of the data?
- What are the sources (and consequences) of noise/error, and are these obvious or hidden?
- What people are involved (in generating or analysing the data)?
- What questions do they have?
- What tools do they use?
- What are the limitations of those tools?
- What can we do, as data scientists, to improve on the state of the art?
Functional annotation of genomes using homology-based annotation transfer. The standard
protocol: given a (gene or protein) sequence, search for homologs using BLAST (Basic local alignment
search tool). If the top hit has a significant E-value, transfer the annotation. If resources permit, look
for functional domains using Pfam HMMs (hidden Markov models). However, approximately 25% of
genes are misannotated using this protocol. Another 30% have no annotation.
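The standard protocol above can be sketched in a few lines. This is only an illustration, assuming BLAST was run with tabular output (-outfmt 6, whose 12 default columns are listed in the code). The E-value cut-off of 1e-5, the sequence identifiers and the annotation table are hypothetical and not part of the lecture.

```python
from typing import Dict, Optional

# Default column order of BLAST tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]

def best_hits(blast_tsv: str) -> Dict[str, dict]:
    """Return the hit with the lowest E-value for each query."""
    best: Dict[str, dict] = {}
    for line in blast_tsv.strip().splitlines():
        if not line or line.startswith("#"):
            continue
        row = dict(zip(BLAST_COLUMNS, line.split("\t")))
        row["evalue"] = float(row["evalue"])
        query = row["qseqid"]
        if query not in best or row["evalue"] < best[query]["evalue"]:
            best[query] = row
    return best

def transfer_annotation(hit: Optional[dict],
                        annotations: Dict[str, str],
                        max_evalue: float = 1e-5) -> Optional[str]:
    """Copy the annotation of the top hit if its E-value is significant.

    max_evalue is an assumed threshold, not a value from the lecture.
    """
    if hit is None or hit["evalue"] > max_evalue:
        return None
    return annotations.get(hit["sseqid"])

# Hypothetical BLAST results and annotation table, for illustration only.
example_blast = "\n".join([
    "q1\tsubjA\t92.0\t200\t16\t0\t1\t200\t1\t200\t1e-80\t350",
    "q1\tsubjB\t40.0\t180\t100\t8\t5\t185\t3\t180\t1e-10\t60",
    "q2\tsubjC\t25.0\t50\t37\t1\t1\t50\t1\t50\t2.5\t20",
])
example_annotations = {"subjA": "function A", "subjC": "function C"}

hits = best_hits(example_blast)
q1_annotation = transfer_annotation(hits.get("q1"), example_annotations)
q2_annotation = transfer_annotation(hits.get("q2"), example_annotations)
```

Note that the rule looks only at the top hit and its E-value. It does not check whether the hit covers the whole protein or whether the hit's own annotation is reliable, which is how the error sources listed further on (domain shuffling, percolation of errors) can slip in.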
Basic concepts involving homology
Homology -> same/similar form. If two genes are homologous then they are related by
evolution. But they may not have the same function or structure!
How is homology inferred? On the basis of sequence similarity (statistical models) and/or
structural and functional similarity
Partial homology: restricted to a subregion of a protein (related by domain fusion or fission)
Convergent vs divergent evolution. If two proteins are related by divergent evolution, they
share a common ancestor. If they are “related” by convergent evolution, they have
converged on the same function but there is no common ancestor.
Sources of functional annotation error:
1. Neofunctionalization stemming from gene duplication
2. Domain shuffling
3. Percolation (chain of) annotation errors
Question: researchers claimed to have cloned a gene that is a human neutral sphingomyelinase.
However, analysis of the gene sequence places the gene in a branch of the phylogenetic tree with
only bacterial genes. What has happened? Most likely the researchers made a mistake during
cloning.
Trees are a special type of graph
Graphs have nodes (vertices) and edges (branches)
Edges can be directed or undirected
Nodes can be internal or terminal
o Terminal nodes in a phylogenetic tree are called leaves (or taxa)
o The term taxon refers to (groups of) species, but is commonly used to describe
genes in multi-gene families, even when the same species may be found in multiple
copies in the tree
Trees are a special subtype of graphs (acyclic connected graphs)
The valency (or degree) of a node equals the number of edges incident to it
A tree for which every internal node (except for the root) has degree 3 (one ancestor and
two children) is called a bifurcating or binary tree.
Trees for which internal nodes can have >2 children are called multifurcating trees
The diameter of a tree is equal to the length of the longest path between two leaves
(summing edge lengths, not simply counting edges)
Most phylogenetic trees are unrooted, and special methods must be used to infer the root.
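The graph terms above (degree, leaves, bifurcating, diameter) can be made concrete with a small sketch. It assumes an unrooted tree stored as a weighted adjacency map. The node names and edge lengths are invented for the example. The diameter is computed with the usual two-pass search (farthest node from any start, then farthest node from that one), which is valid for trees with non-negative edge lengths.

```python
from collections import defaultdict

def build_tree(edges):
    """Undirected weighted tree as an adjacency map: node -> {neighbour: edge length}."""
    adj = defaultdict(dict)
    for u, v, length in edges:
        adj[u][v] = length
        adj[v][u] = length
    return adj

def degree(adj, node):
    """Number of edges incident to the node."""
    return len(adj[node])

def leaves(adj):
    """Terminal nodes have degree 1."""
    return sorted(n for n in adj if len(adj[n]) == 1)

def is_bifurcating(adj):
    """Unrooted case: every internal node has degree 3."""
    return all(len(nbrs) == 3 for nbrs in adj.values() if len(nbrs) > 1)

def farthest(adj, start):
    """Return (node, distance) for the node farthest from start, summing edge lengths."""
    best_node, best_dist = start, 0.0
    stack = [(start, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if dist > best_dist:
            best_node, best_dist = node, dist
        for nbr, length in adj[node].items():
            if nbr != parent:
                stack.append((nbr, node, dist + length))
    return best_node, best_dist

def diameter(adj):
    """Length of the longest leaf-to-leaf path (two-pass search)."""
    any_node = next(iter(adj))
    end, _ = farthest(adj, any_node)
    _, dist = farthest(adj, end)
    return dist

# Hypothetical unrooted tree: leaves A-D, internal nodes X and Y.
example_tree = build_tree([
    ("A", "X", 1.0), ("B", "X", 2.0), ("X", "Y", 0.5),
    ("C", "Y", 3.0), ("D", "Y", 1.5),
])
```

In this example the longest path runs from B to C (2.0 + 0.5 + 3.0), even though the paths A-C, B-D and so on contain the same number of edges.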
Uses of phylogenetic trees:
- Traditional: reconstructing species phylogenies. Input is the multiple sequence alignment
(MSA) of a single gene family.
- Bioinformatics: exploiting multi-gene families (protein superfamilies) for:
o Phylogenomic function prediction
o Improving multiple sequence alignment accuracy (guide trees)
o Functional site prediction
o Etc.
Interpreting tree topologies
Many phylogenetic trees are not meant to be interpreted as rooted (more about this later)
Terminal nodes (leaves) represent contemporary taxa (organisms, genes, proteins, or other
objects)
Internal nodes represent inferred ancestors - not generally from species existing today!
o In multi-gene families, these internal nodes may also represent duplication events
and domain architecture changes
Edge lengths are supposed to be proportional to the evolutionary distance
Figure legend: red = orthologs; yellow block = super-orthologs.
Orthology prediction is critical to many areas of bioinformatics.
Orthologs: genes related by speciation (must be in different species)
Paralogs: genes related by duplication (can be in same species or different species)
Super-orthologs: genes joined by a path s.t. all nodes correspond to speciation.
Ultra-paralogs: genes joined by a path s.t. all nodes on path correspond to duplication events.
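The four definitions above can be read as rules on a rooted gene tree whose internal nodes are labelled as speciation or duplication events. A minimal sketch, assuming such labels are already available (in practice they come from reconciling the gene tree with a species tree). The gene names and the tree are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass
class Node:
    name: str
    parent: Optional[str] = None
    event: Optional[str] = None  # "speciation" or "duplication"; None for leaves

def ancestors(tree: Dict[str, Node], name: str) -> List[str]:
    """Ancestors of a node, from its parent up to the root."""
    path = []
    current = tree[name].parent
    while current is not None:
        path.append(current)
        current = tree[current].parent
    return path

def connecting_nodes(tree: Dict[str, Node], a: str, b: str) -> Tuple[List[str], str]:
    """Internal nodes on the path between leaves a and b, plus their last common ancestor."""
    anc_a, anc_b = ancestors(tree, a), ancestors(tree, b)
    in_b = set(anc_b)
    lca = next(n for n in anc_a if n in in_b)
    nodes = anc_a[:anc_a.index(lca)] + [lca] + anc_b[:anc_b.index(lca)]
    return nodes, lca

def classify(tree: Dict[str, Node], a: str, b: str) -> str:
    """Classify a pair of genes using the event labels on the connecting path."""
    nodes, lca = connecting_nodes(tree, a, b)
    events = {tree[n].event for n in nodes}
    if tree[lca].event == "speciation":
        return "super-orthologs" if events == {"speciation"} else "orthologs"
    return "ultra-paralogs" if events == {"duplication"} else "paralogs"

# Hypothetical gene tree: an old duplication (D1) gives gene copies A and B;
# each copy then splits at a human/mouse speciation (S1, S2); copy B is
# duplicated again in mouse (D2).
gene_tree = {n.name: n for n in [
    Node("D1", None, "duplication"),
    Node("S1", "D1", "speciation"),
    Node("S2", "D1", "speciation"),
    Node("D2", "S2", "duplication"),
    Node("human_A", "S1"), Node("mouse_A", "S1"),
    Node("human_B", "S2"),
    Node("mouse_B1", "D2"), Node("mouse_B2", "D2"),
]}
```

Here human_A and mouse_A come out as super-orthologs, human_B and mouse_B1 as orthologs only (a duplication lies on the path), mouse_B1 and mouse_B2 as ultra-paralogs, and human_A and human_B as paralogs.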
For species tree reconstruction there are two methods. Using multiple genes from each species, you
can build a supermatrix (preferred) or a supertree.
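A supermatrix is the concatenation of the per-gene alignments into one long alignment with one row per species. A minimal sketch of that concatenation step, assuming each gene alignment is a dictionary from species name to aligned sequence. Padding missing genes with gap characters is an assumption of this example (some tools use "?" for missing data), and the sequences are made up.

```python
from typing import Dict, List

def build_supermatrix(alignments: List[Dict[str, str]],
                      species: List[str]) -> Dict[str, str]:
    """Concatenate per-gene alignments; pad species lacking a gene with gaps."""
    rows = {sp: [] for sp in species}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        width = lengths.pop()
        for sp in species:
            rows[sp].append(aln.get(sp, "-" * width))
    return {sp: "".join(parts) for sp, parts in rows.items()}

# Hypothetical alignments of two genes; mouse lacks the second gene.
gene1 = {"human": "ACGT", "mouse": "ACGA", "fly": "AC-T"}
gene2 = {"human": "GGC", "fly": "GAC"}
supermatrix = build_supermatrix([gene1, gene2], ["human", "mouse", "fly"])
```

A supertree, in contrast, would first build one tree per gene and then merge the trees, so it has no equivalent of this concatenation step.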
Major sources of phylogenetic error
Sparse “taxon sampling”
o Historically refers to reconstructing phylogenies for single genes (restricted to
orthologs in different species)
o In protein superfamily reconstruction, including paralogous groups, simply refers to
the selection of proteins (multiple genes and multiple species)
Lineage-specific rate variation
o Historically refers to species that are evolving more rapidly than others
o In protein superfamily reconstruction, refers to genes (a group of orthologs) that are
evolving rapidly (perhaps due to neo-functionalization)
Site-specific rate variation
o Less common in single gene trees (orthologs in different species)
o Very common in protein superfamilies due to diversification of function following
gene duplication
Sequence fragments (or gene model errors)
o Very common in protein sequence databases
Insufficient site data (e.g., short MSA)
o Very common for trees based on single domains (esp. if <100 aa)
Few informative sites
Question: which sources of error are more likely to occur in phylogenomic species tree estimation?
Which are more likely in protein superfamily phylogeny reconstruction?
Types of errors in trees
In the branching order (topology)
o Example coarse branching order:
Relative branching order between taxonomic groups (primates, rodents,
ruminants)
Relative branching order between clades representing different genes (in a
multi-gene tree including duplication events/paralogous groups)
o Example fine branching order:
Relative branching order within Hominidae (human, chimps, bonobos,
gorilla, orangutan)
In branch lengths
Question: which type of error will have a bigger impact on orthology prediction?
The 4 Vs & Big Data Challenges in biology
Veracity:
Errors in gene models (missing exons, etc., esp. for eukaryotic genomes)
Errors in assigned functions (25% expected to be incorrect)
<1% of genes have experimental support for any aspect of their predicted function(s)
Errors are neither detected nor corrected
Volume (petabytes):
Continued exponential growth of sequence databases
Metagenome and next-generation sequencing data sets are too big to store
Velocity
Streaming approaches needed for next-gen sequencing data and other data types
New data types emerge annually
Variety (Too many data types to list!)
Missing data:
~30% of the genes in a typical genome are annotated as “hypothetical” or “unknown”
Ontologies biased towards selected genomes and processes/functions
The provenance of functional annotations is rarely stored
Links between sequences and papers/data may not be stored
Other practical issues: sociological/economic
Biologists seldom obtain training in algorithms, statistics, machine learning, software
engineering (and people with those skills seldom understand biology)
Cost of IT expertise
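The streaming idea mentioned under Velocity can be illustrated with a generator that reads FASTA records one at a time, so that summary statistics are computed without holding the whole file in memory. This is a sketch only; the function names and the demo records are made up for the example.

```python
import io
from typing import Iterable, Iterator, TextIO, Tuple

def stream_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def running_stats(records: Iterable[Tuple[str, str]]) -> Tuple[int, int]:
    """Count records and total residues while consuming the stream once."""
    count = total = 0
    for _, seq in records:
        count += 1
        total += len(seq)
    return count, total

# Hypothetical in-memory "file" standing in for a large sequence file.
demo = io.StringIO(">seq1 demo\nACGT\nAC\n>seq2\nGGG\n")
n_records, n_residues = running_stats(stream_fasta(demo))
```

Only the current record is kept in memory, so the same code works on a file of any size, as long as each individual sequence fits.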
Summary points
Homology (detection, and use in function and structure prediction) is a central theme in
bioinformatics
Proteins evolve novel functions and structures by numerous processes - these must be taken
into account to prevent annotation errors based on homology
Functional annotation error rates are estimated at ~20-25%
Orthology relations are a subset of homology relations, and provide a greater degree of
specificity
o Not all orthology-prediction methods are equally accurate
Prediction methods almost always have a specificity-recall trade-off
How methods are benchmarked for expected accuracy is critical
o Separation of training and test data is essential
o Some benchmark datasets are easy, others are hard, and some are fundamentally
flawed
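The trade-off mentioned in the summary can be seen by moving the score threshold of a predictor: a lower threshold recovers more true positives (higher recall) but accepts more false positives (lower specificity). A small sketch with invented scores and labels:

```python
from typing import List, Tuple

def confusion_counts(scores: List[float], labels: List[bool],
                     threshold: float) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) when predicting positive for score >= threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted and not label:
            fp += 1
        elif not predicted and label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def recall_and_specificity(scores: List[float], labels: List[bool],
                           threshold: float) -> Tuple[float, float]:
    """Recall = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return recall, specificity

# Hypothetical prediction scores and true labels (True = real ortholog pair, say).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False, False]

strict = recall_and_specificity(scores, labels, 0.85)   # few, confident predictions
lenient = recall_and_specificity(scores, labels, 0.2)   # many predictions
```

With the strict threshold, specificity is perfect but two of the three positives are missed. With the lenient threshold, every positive is found but two of the three negatives are wrongly accepted. When benchmarking, the threshold should be chosen on training data and the reported values computed on separate test data.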
Lecture 2: gene expression, evolution and homology
(Heringa – 2018)
A gene can be anywhere from a few hundred to about two million base pairs long. The human genome
is 3.2-3.3 billion base pairs long. Neither the length of the genome nor the number of genes is
related to the complexity of the organism.
Transcription. Gene expression depends on a transcription factor (TF) binding a transcription
factor binding site (TFBS – a DNA motif) and a polymerase (Pol II in eukaryotes). The polymerase
attaches to the TATA-box, binding of the TF induces a conformational change that activates the
polymerase and starts transcription. Bacterial systems have only one TF, while eukaryotic cells need
multiple TFs and additional proteins to start transcription. TFBSs and TATA-boxes are very conserved
structures in evolution, though small differences exist. One gene can have multiple TFBSs. One TF
can activate the transcription of multiple genes. Gene expression is controlled by proximal and distal
regulatory elements, commonly bound by combinatorial transcription factor complexes. TFs can
activate (enhancers) or inhibit (repressors, usually within gene regions) transcription. Gene
regulatory network models can be constructed from the TFs and the cis-regulatory elements with
which they interact.
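A gene regulatory network of this kind can be stored as a directed graph with edges from TFs to their target genes, labelled with the effect. The sketch below is only an illustration of the many-to-many relation described above (one TF regulating several genes, one gene regulated by several TFs). All TF and gene names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical (TF, target gene, effect) interactions.
interactions: List[Tuple[str, str, str]] = [
    ("TF_A", "gene1", "activates"),
    ("TF_A", "gene2", "activates"),
    ("TF_B", "gene1", "represses"),
    ("TF_C", "gene3", "activates"),
]

def build_network(edges: List[Tuple[str, str, str]]):
    """Index the network in both directions: TF -> targets and gene -> regulators."""
    targets_of: Dict[str, Dict[str, str]] = defaultdict(dict)
    regulators_of: Dict[str, Dict[str, str]] = defaultdict(dict)
    for tf, gene, effect in edges:
        targets_of[tf][gene] = effect
        regulators_of[gene][tf] = effect
    return targets_of, regulators_of

targets_of, regulators_of = build_network(interactions)
```

Looking up targets_of["TF_A"] gives the genes activated by one TF, while regulators_of["gene1"] gives the combination of TFs acting on one gene.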
DNA packaging. DNA is wound around histone proteins, forming nucleosomes. Other proteins wind
the DNA into a more tightly packed form, the chromosome. Unwinding portions of the chromosome is
important for mitosis, replication and RNA synthesis. Many TFBSs are possible upstream of a gene.