Lecture notes Fundamentals of Bioinformatics (VU)


This file contains extensive (40 pages) lecture notes of the course Fundamentals of Bioinformatics, complemented with summaries of the recommended reading material. The notes are written so they can be understood by students without a prior background in programming or informatics. The included fig...


  • March 2, 2020
  • 40
  • 2019/2020
Notes Fundamentals of Bioinformatics
Lecture 1: intro + Big Data Challenges in Genomics (K. Sjölander – 2/9)
How to think about Data Science:
- What types of data are involved?
- What is the structure of the data?
- What are the sources (and consequences) of noise/error, and are these obvious or hidden?
- What people are involved (in generating or analysing the data)?
- What questions do they have?
- What tools do they use?
- What are the limitations of those tools?
- What can we do, as data scientists, to improve on the state of the art?

Functional annotation of genomes using homology-based annotation transfer. The standard
protocol: given a (gene or protein) sequence, search for homologs using BLAST (Basic local alignment
search tool). If the top hit has a significant E-value, transfer the annotation. If resources permit, look
for functional domains using Pfam HMMs (hidden Markov models). However, approximately 25% of
genes are misannotated using this protocol. Another 30% have no annotation.
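
A minimal Python sketch of the standard protocol described above, to make the decision rule explicit. The `Hit` record, the function name `transfer_annotation` and the E-value cut-off of 1e-5 are illustrative assumptions; the notes do not give a threshold, and real BLAST output would first have to be parsed into such records.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """One homology-search hit (hypothetical record, not BLAST's own format)."""
    subject_id: str
    evalue: float
    annotation: Optional[str]  # None if the database entry is unannotated


def transfer_annotation(hits: List[Hit], max_evalue: float = 1e-5) -> Optional[str]:
    """Return the top hit's annotation if its E-value is significant, else None."""
    if not hits:
        return None
    top = min(hits, key=lambda h: h.evalue)  # lowest E-value = best hit
    if top.evalue <= max_evalue:
        return top.annotation
    return None


# Toy data, invented for illustration.
hits = [
    Hit("db|A", 3e-40, "neutral sphingomyelinase"),
    Hit("db|B", 2e-3, "hypothetical protein"),
]
print(transfer_annotation(hits))  # neutral sphingomyelinase
```

Because the rule looks only at the top hit, any error in that database entry is copied to the query, which is one way annotation errors spread.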

Basic concepts involving homology
 Homology -> same/similar form. If two genes are homologous then they are related by
evolution. But they may not have the same function or structure!
 How is homology inferred? On the basis of sequence similarity (statistical models) and/or
structural and functional similarity
 Partial homology: restricted to a subregion of a protein (related by domain fusion or fission)
 Convergent vs divergent evolution. If two proteins are related by divergent evolution, they
share a common ancestor. If they are “related” by convergent evolution, they have
converged on the same function but there is no common ancestor.

Sources of functional annotation error:
1. Neofunctionalization stemming from gene duplication
2. Domain shuffling
3. Percolation of annotation errors (an error is transferred along a chain of annotations)

Question: researchers claimed to have cloned a gene that is a human neutral sphingomyelinase.
However, analysis of the gene sequence places the gene in a branch of the phylogenetic tree with
only bacterial genes. What has happened?
Answer: most likely the researchers made a mistake during cloning, and the cloned sequence is
probably of bacterial origin.

Trees are a special type of graph
 Graphs have nodes (vertices) and edges (branches)
 Edges can be directed or undirected
 Nodes can be internal or terminal
o Terminal nodes in a phylogenetic tree are called leaves (or taxa)
o The term taxon refers to (groups of) species, but is commonly used to describe
genes in multi-gene families, even when the same species may be found in multiple
copies in the tree
 Trees are a special subtype of graphs (acyclic connected graphs)
 The valency (or degree) of a node equals the number of edges attached to it
 A tree for which every internal node (except for the root) has degree 3 (one ancestor and
two children) is called a bifurcating or binary tree.
 Trees for which internal nodes can have >2 children are called multifurcating trees
 The diameter of a tree is equal to the length of the longest path between two leaves (summing
edge lengths, not simply counting edges)
 Most phylogenetic trees are unrooted, and special methods must be used to infer the root.
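
The graph terms above can be checked on a small example. The sketch below stores an unrooted tree as a weighted adjacency map and computes node degrees, the leaves, whether the tree is bifurcating, and the diameter. The taxa, internal node names and edge lengths are made up for illustration.

```python
from collections import defaultdict

# Unrooted tree with leaves A-D and internal nodes X, Y (toy edge lengths).
edges = [("A", "X", 1.0), ("B", "X", 2.0), ("X", "Y", 0.5),
         ("C", "Y", 1.5), ("D", "Y", 3.0)]

adj = defaultdict(dict)
for u, v, length in edges:
    adj[u][v] = length  # undirected: store the edge in both directions
    adj[v][u] = length

degree = {node: len(nbrs) for node, nbrs in adj.items()}
leaves = sorted(n for n, d in degree.items() if d == 1)
# In an unrooted bifurcating tree every internal node has degree 3.
is_bifurcating = all(d == 3 for d in degree.values() if d > 1)


def farthest(start):
    """Return (node, distance) for the node farthest from start, summing edge lengths."""
    best = (start, 0.0)
    stack = [(start, None, 0.0)]
    while stack:
        node, came_from, dist = stack.pop()
        if dist > best[1]:
            best = (node, dist)
        for nbr, length in adj[node].items():
            if nbr != came_from:  # a tree has no cycles, so this avoids revisits
                stack.append((nbr, node, dist + length))
    return best


# Two sweeps: the node farthest from any start is one end of a longest path.
end1, _ = farthest(leaves[0])
end2, diameter = farthest(end1)
print(leaves, is_bifurcating, diameter)  # ['A', 'B', 'C', 'D'] True 5.5
```

The longest path here runs between B and D (2.0 + 0.5 + 3.0), even though every leaf pair in different halves of the tree is separated by the same number of edges.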

Uses of phylogenetic trees:
- Traditional: reconstructing species phylogenies. Input is the multiple sequence alignment
(MSA) of a single gene family.
- Bioinformatics uses that exploit multi-gene families (protein superfamilies) for:
o Phylogenomic function prediction
o Improving multiple sequence alignment accuracy (guide trees)
o Functional site prediction
o Etc.

Interpreting tree topologies
 Many phylogenetic trees are not meant to be interpreted as rooted (more about this later)
 Terminal nodes (leaves) represent contemporary taxa (organisms, genes, proteins, or other
objects)
 Internal nodes represent inferred ancestors - not generally from species existing today!
o In multi-gene families, these internal nodes may also represent duplication events
and domain architecture changes
 Edge lengths are supposed to be proportional to the evolutionary distance

[Figure not shown. Legend: red = orthologs; yellow block = super-orthologs.]
Orthology prediction is critical to many areas of bioinformatics.
Orthologs: genes related by speciation (must be in different species)
Paralogs: genes related by duplication (can be in same species or different species)
Super-orthologs: genes joined by a path such that all nodes on the path correspond to speciation events.
Ultra-paralogs: genes joined by a path such that all nodes on the path correspond to duplication events.
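
The four definitions can be applied mechanically once every internal node of a rooted gene tree is labelled as a speciation or a duplication. The sketch below assumes such labels are already known (in practice they have to be inferred) and uses an invented tree; gene names, node names and the `classify` helper are hypothetical.

```python
# Child -> parent links of a toy rooted gene tree.
parent = {
    "fish_G": "S0", "D1": "S0",
    "S1": "D1", "S2": "D1",
    "human_A": "S1", "mouse_A": "S1",
    "D2": "S2", "mouse_B": "S2",
    "human_B1": "D2", "human_B2": "D2",
}
# Event assumed for each internal node.
event = {"S0": "speciation", "D1": "duplication", "S1": "speciation",
         "S2": "speciation", "D2": "duplication"}


def ancestors(leaf):
    """Internal nodes from the leaf's parent up to the root."""
    chain = []
    node = leaf
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain


def classify(a, b):
    """Classify two leaves by the events on the tree path that joins them."""
    anc_a, anc_b = ancestors(a), ancestors(b)
    lca = next(n for n in anc_a if n in anc_b)  # last common ancestor
    path = anc_a[:anc_a.index(lca) + 1] + anc_b[:anc_b.index(lca)]
    kinds = {event[n] for n in path}
    if event[lca] == "speciation":
        return "super-orthologs" if kinds == {"speciation"} else "orthologs"
    return "ultra-paralogs" if kinds == {"duplication"} else "paralogs"


print(classify("human_A", "mouse_A"))    # super-orthologs
print(classify("fish_G", "human_A"))     # orthologs
print(classify("human_A", "mouse_B"))    # paralogs
print(classify("human_B1", "human_B2"))  # ultra-paralogs
```

The sketch uses the common convention that a pair is orthologous when its last common ancestor is a speciation node and paralogous when it is a duplication node.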

For species tree reconstruction from multiple genes per species there are two methods: building a
supermatrix (preferred) or a supertree.
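
A rough sketch of the supermatrix idea: the per-gene alignments are concatenated into one long alignment per species, and a single tree is then inferred from it (a supertree instead combines trees inferred separately per gene). The sequences, gene names and the use of `?` for a missing gene are assumptions made for illustration.

```python
# Toy per-gene alignments: gene -> {species: aligned sequence}.
alignments = {
    "geneA": {"human": "ATGC", "mouse": "ATGT", "fish": "AT-C"},
    "geneB": {"human": "GGA", "mouse": "GGA"},  # no fish sequence for geneB
}


def build_supermatrix(alignments, missing="?"):
    """Concatenate gene alignments per species, padding absent genes."""
    species = sorted({sp for aln in alignments.values() for sp in aln})
    matrix = {sp: "" for sp in species}
    for gene in sorted(alignments):
        aln = alignments[gene]
        width = len(next(iter(aln.values())))  # alignment length of this gene
        for sp in species:
            matrix[sp] += aln.get(sp, missing * width)
    return matrix


supermatrix = build_supermatrix(alignments)
for sp, row in supermatrix.items():
    print(sp, row)
# fish AT-C???
# human ATGCGGA
# mouse ATGTGGA
```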

Major sources of phylogenetic error
 Sparse “taxon sampling”
o Historically refers to reconstructing phylogenies for single genes (restricted to
orthologs in different species)
o In protein superfamily reconstruction, including paralogous groups, simply refers to
the selection of proteins (multiple genes and multiple species)
 Lineage-specific rate variation
o Historically refers to species that are evolving more rapidly than others
o In protein superfamily reconstruction, refers to genes (a group of orthologs) that are
evolving rapidly (perhaps due to neo-functionalization)
 Site-specific rate variation
o Less common in single gene trees (orthologs in different species)
o Very common in protein superfamilies due to diversification of function following
gene duplication
 Sequence fragments (or gene model errors)
o Very common in protein sequence databases
 Insufficient site data (e.g., short MSA)
o Very common for trees based on single domains (esp. if <100 aa)
 Few informative sites
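
The notes do not define "informative sites". One common definition, assumed here, is the parsimony-informative site: an alignment column with at least two different residues that each occur in at least two sequences. The sketch below counts such columns in a toy MSA (sequences invented, gaps ignored).

```python
from collections import Counter


def informative_columns(msa):
    """Indices of parsimony-informative columns (assumed definition, see text)."""
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for i in range(length):
        counts = Counter(seq[i] for seq in msa if seq[i] != "-")
        # Need at least two residue types that each appear at least twice.
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            result.append(i)
    return result


msa = ["ACGTA",
       "ACGTT",
       "ATGAA",
       "ATG-T"]
print(informative_columns(msa))  # [1, 4]
```

A short alignment, such as a single domain of fewer than 100 residues, has few columns to begin with, so the number of informative ones can be very small.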

Question: which sources of error are more likely to occur in phylogenomic species tree estimation?
Which are more likely in protein superfamily phylogeny reconstruction?

Types of errors in trees
 In the branching order (topology)
o Example coarse branching order:
 Relative branching order between taxonomic groups (primates, rodents,
ruminants)
 Relative branching order between clades representing different genes (in a
multi-gene tree including duplication events/paralogous groups)
o Example fine branching order:
 Relative branching order within Hominidae (human, chimps, bonobos,
gorilla, orangutan)
 In branch lengths
Question: which type of error will have a bigger impact on orthology prediction?

The 4 Vs & Big Data Challenges in biology
Veracity:
 Errors in gene models (missing exons, etc., esp. for eukaryotic genomes)
 Errors in assigned functions (25% expected to be incorrect)
 <1% of genes have experimental support for any aspect of their predicted function(s)
 Errors are neither detected nor corrected
Volume (petabytes):
 Continued exponential growth of sequence databases
 Metagenome and next-generation sequencing data sets are too big to store
Velocity
 Streaming approaches needed for next-gen sequencing data and other data types
 New data types emerge annually
Variety (Too many data types to list!)
Missing data:
 ~30% of the genes in a typical genome are annotated as “hypothetical” or “unknown”

,  Ontologies biased towards selected genomes and processes/functions
 The provenance of functional annotations is rarely stored
 Links between sequences and papers/data may not be stored
Other practical issues: sociological/economic
 Biologists seldom obtain training in algorithms, statistics, machine learning, software
engineering (and people with those skills seldom understand biology)
 Cost of IT expertise

Summary points
 Homology (detection, and use in function and structure prediction) is a central theme in
bioinformatics
 Proteins evolve novel functions and structures by numerous processes - these must be taken
into account to prevent annotation errors based on homology
 Functional annotation error rates are estimated at ~20-25%
 Orthology relations are a subset of homology relations, and provide a greater degree of
specificity
o Not all orthology-prediction methods are equally accurate
 Prediction methods almost always have a specificity-recall trade-off
 How methods are benchmarked for expected accuracy is critical
o Separation of training and test data is essential
o Some benchmark datasets are easy, others are hard, and some are fundamentally
flawed
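
The trade-off mentioned in the summary can be seen by sliding a score threshold over a set of predictions. The sketch below uses invented scores and true labels, and computes recall (the fraction of true cases found) and specificity (the fraction of false cases rejected) at three thresholds. It assumes both classes are present, otherwise the divisions fail.

```python
def recall_and_specificity(scores, labels, threshold):
    """Predict positive when score >= threshold; return (recall, specificity)."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y)
    fn = sum(1 for s, y in pairs if s < threshold and y)
    tn = sum(1 for s, y in pairs if s < threshold and not y)
    fp = sum(1 for s, y in pairs if s >= threshold and not y)
    return tp / (tp + fn), tn / (tn + fp)


# Toy benchmark: prediction scores and whether each case is truly positive.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [True, True, False, True, False, True, False, False]

for threshold in (0.75, 0.5, 0.25):
    print(threshold, recall_and_specificity(scores, labels, threshold))
# 0.75 (0.5, 1.0)
# 0.5 (0.75, 0.75)
# 0.25 (1.0, 0.5)
```

Lowering the threshold finds more true cases but rejects fewer false ones. A threshold chosen on the same data used for reporting these numbers would look better than it is, which is why training and test data must be kept separate.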

Lecture 2: gene expression, evolution and homology (Heringa – 2018)
A gene can be a couple of hundred to two million base pairs long. The human genome is 3.2-3.3 billion
base pairs long. Neither the length of the genome nor the number of genes is related to the
complexity of the organism.

Transcription. Gene expression depends on a transcription factor (TF) binding a transcription
factor binding site (TFBS – a DNA motif) and on a polymerase (Pol II in eukaryotes). The polymerase
attaches to the TATA-box; binding of the TF induces a conformational change that activates the
polymerase and starts transcription. Bacterial systems need only one TF, while eukaryotic cells need
multiple TFs and additional proteins to start transcription. TFBSs and TATA-boxes are highly conserved
in evolution, though small differences exist. One gene can have multiple TFBSs. One TF
can activate the transcription of multiple genes. Gene expression is controlled by proximal and distal
regulatory elements, commonly bound by combinatorial transcription factor complexes. TFs can
activate (enhancers) or inhibit (repressors, usually within gene regions) transcription. Gene
regulatory network models can be constructed from the TFs and the cis-regulatory elements with
which they interact.
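
As a rough illustration of a gene regulatory network model, the sketch below stores TF-to-gene interactions as a small directed graph. The TF and gene names are placeholders, and the rule in `expressed` (on if an activator is present and no repressor is present) is a deliberately simple assumption, not a model taken from the lecture.

```python
from collections import defaultdict

# (TF, target gene, effect) triples; all names are hypothetical.
interactions = [
    ("TF1", "geneA", "activates"),
    ("TF1", "geneB", "activates"),
    ("TF2", "geneA", "represses"),
    ("TF3", "geneC", "activates"),
]

targets = defaultdict(set)      # TF -> genes it regulates
regulators = defaultdict(dict)  # gene -> {TF: effect}
for tf, gene, effect in interactions:
    targets[tf].add(gene)
    regulators[gene][tf] = effect


def expressed(gene, active_tfs):
    """Toy rule: on if at least one activator and no repressor is present."""
    effects = [e for tf, e in regulators[gene].items() if tf in active_tfs]
    return "activates" in effects and "represses" not in effects


print(sorted(targets["TF1"]))               # ['geneA', 'geneB']
print(expressed("geneA", {"TF1"}))          # True
print(expressed("geneA", {"TF1", "TF2"}))   # False
```

The two lookups mirror the statements in the text: one TF can regulate several genes (`targets`), and one gene can be bound by several TFs (`regulators`).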

DNA packaging. DNA is wound around histone proteins, forming nucleosomes. Other proteins wind the
DNA into a more tightly packed form, the chromosome. Unwinding portions of the chromosome is
important for mitosis, replication and RNA synthesis. Many TFBSs are possible upstream of a gene.


