Lecture notes Fundamentals of Bioinformatics (VU)

Institution: Vrije Universiteit Amsterdam (VU)

This file contains extensive (40 pages) lecture notes of the course Fundamentals of Bioinformatics, complemented with summaries of the recommended reading material. The notes are written so they can be understood by students without a prior background in programming or informatics. The included fig...

Preview: 4 of the 40 pages

Uploaded on: 2 March 2020
Number of pages: 40
Written in: 2019/2020
Type: Lecture notes
Lecturer(s): Unknown
Contains: All lectures

Seller: BMWer (member for 7 years, 21 documents sold)

Notes Fundamentals of Bioinformatics
Lecture 1: intro + Big Data Challenges in Genomics (K. Sjölander – 2/9)
How to think about Data Science:
- What types of data are involved?
- What is the structure of the data?
- What are the sources (and consequences) of noise/error, and are these obvious or hidden?
- What people are involved (in generating or analysing the data)?
- What questions do they have?
- What tools do they use?
- What are the limitations of those tools?
- What can we do, as data scientists, to improve on the state of the art?

Functional annotation of genomes using homology-based annotation transfer. The standard protocol: given a (gene or protein) sequence, search for homologs using BLAST (Basic Local Alignment Search Tool). If the top hit has a significant E-value, transfer its annotation to the query. If resources permit, look for functional domains using Pfam HMMs (hidden Markov models). However, approximately 25% of genes are misannotated using this protocol. Another 30% have no annotation.
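
The transfer step of this protocol can be sketched in a few lines. This is only an illustration under stated assumptions: the `Hit` record, the `transfer_annotation` function and the E-value cutoff of 1e-5 are made up for this sketch and are not part of BLAST or of the lecture.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    """One database hit for a query (hypothetical record, not a BLAST API)."""
    subject_id: str
    evalue: float
    annotation: Optional[str]  # None if the hit itself has no annotation


def transfer_annotation(hits, evalue_cutoff=1e-5):
    """Return the annotation of the top hit if its E-value is significant.

    The cutoff is an assumed value; the notes do not specify one.
    """
    if not hits:
        return None
    top = min(hits, key=lambda h: h.evalue)  # lowest E-value = best hit
    if top.evalue <= evalue_cutoff:
        return top.annotation
    return None


hits = [
    Hit("subjA", 3e-40, "kinase"),
    Hit("subjB", 2e-12, "phosphatase"),
    Hit("subjC", 0.5, "transporter"),
]
print(transfer_annotation(hits))                                 # kinase
print(transfer_annotation([Hit("subjC", 0.5, "transporter")]))   # None
```

Note that the function copies whatever annotation the top hit carries without checking it, which is exactly how an existing error can be passed on to new sequences.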

Basic concepts involving homology
- Homology -> same/similar form. If two genes are homologous, they are related by evolution. But they may not have the same function or structure!
- How is homology inferred? On the basis of sequence similarity (statistical models) and/or structural and functional similarity
- Partial homology: restricted to a subregion of a protein (related by domain fusion or fission)
- Convergent vs divergent evolution. If two proteins are related by divergent evolution, they share a common ancestor. If they are "related" by convergent evolution, they have converged on the same function but there is no common ancestor

Sources of functional annotation error:
1. Neofunctionalization stemming from gene duplication
2. Domain shuffling
3. Percolation of annotation errors (a chain of errors)

Question: researchers claimed to have cloned a gene that encodes a human neutral sphingomyelinase. However, analysis of the gene sequence places the gene in a branch of the phylogenetic tree with only bacterial genes. What has happened?
Answer: most likely the researchers made a mistake during cloning.

Trees are a special type of graph
- Graphs have nodes (vertices) and edges (branches)
- Edges can be directed or undirected
- Nodes can be internal or terminal
  o Terminal nodes in a phylogenetic tree are called leaves (or taxa)
  o The term taxon refers to (groups of) species, but is commonly used to describe genes in multi-gene families, even when the same species may be found in multiple copies in the tree
- Trees are a special subtype of graphs (acyclic connected graphs)
- The valency (or degree) of a node equals the number of edges connected to it
- A tree in which every internal node (except for the root) has degree 3 (one ancestor and two children) is called a bifurcating or binary tree
- Trees in which internal nodes can have >2 children are called multifurcating trees
- The diameter of a tree is the length of the longest path between two leaves (summing edge lengths, not simply counting edges)
- Most phylogenetic trees are unrooted, and special methods must be used to infer the root
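
The graph terms above can be made concrete with a small sketch. The tree, its node names and its edge lengths are invented for illustration; the diameter is computed with the common two-sweep approach (walk to the farthest node, then walk to the node farthest from that one), which assumes non-negative edge lengths.

```python
from collections import defaultdict

# Hypothetical unrooted tree: leaves A-D, internal nodes n1 and n2.
# Each edge is (node, node, edge length).
EDGES = [
    ("A", "n1", 1.0),
    ("B", "n1", 2.0),
    ("n1", "n2", 3.0),
    ("C", "n2", 4.0),
    ("D", "n2", 5.0),
]


def build_adjacency(edges):
    """Store the undirected tree as node -> {neighbour: edge length}."""
    adj = defaultdict(dict)
    for a, b, length in edges:
        adj[a][b] = length
        adj[b][a] = length
    return adj


def degree(adj, node):
    """Valency of a node: the number of edges connected to it."""
    return len(adj[node])


def leaves(adj):
    """Terminal nodes are the nodes of degree 1."""
    return sorted(n for n in adj if len(adj[n]) == 1)


def is_bifurcating(adj):
    """True if every internal node of this unrooted tree has degree 3."""
    return all(len(nb) == 3 for nb in adj.values() if len(nb) > 1)


def farthest(adj, start):
    """Return (node, distance) for the node farthest from start."""
    best = (start, 0.0)
    stack = [(start, None, 0.0)]
    while stack:
        node, came_from, dist = stack.pop()
        if dist > best[1]:
            best = (node, dist)
        for nxt, length in adj[node].items():
            if nxt != came_from:  # a tree has no cycles, so this suffices
                stack.append((nxt, node, dist + length))
    return best


def diameter(adj):
    """Longest leaf-to-leaf path length, summing edge lengths."""
    first_end, _ = farthest(adj, next(iter(adj)))
    _, length = farthest(adj, first_end)
    return length


adj = build_adjacency(EDGES)
print(leaves(adj))          # ['A', 'B', 'C', 'D']
print(degree(adj, "n1"))    # 3
print(is_bifurcating(adj))  # True
print(diameter(adj))        # 10.0 (path B - n1 - n2 - D)
```

In a rooted binary tree the root would have degree 2, which is why the definition above makes an exception for it.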

Uses of phylogenetic trees:
- Traditional: reconstructing species phylogenies. Input is the multiple sequence alignment (MSA) of a single gene family.
- Bioinformatics: exploiting multi-gene families (protein superfamilies) for:
  o Phylogenomic function prediction
  o Improving multiple sequence alignment accuracy (guide trees)
  o Functional site prediction
  o Etc.

Interpreting tree topologies
- Many phylogenetic trees are not meant to be interpreted as rooted (more about this later)
- Terminal nodes (leaves) represent contemporary taxa (organisms, genes, proteins, or other objects)
- Internal nodes represent inferred ancestors - not generally from species existing today!
  o In multi-gene families, these internal nodes may also represent duplication events and domain architecture changes
- Edge lengths are supposed to be proportional to the evolutionary distance

[Figure: gene tree. Red: orthologs. Yellow block: super-orthologs.]
Orthology prediction is critical to many areas of bioinformatics.
Orthologs: genes related by speciation (must be in different species)
Paralogs: genes related by duplication (can be in same species or different species)
Super-orthologs: genes joined by a path such that all nodes on the path correspond to speciation events.
Ultra-paralogs: genes joined by a path such that all nodes on the path correspond to duplication events.
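
These four definitions can be turned into a small classifier. The sketch assumes a rooted gene tree in which every internal node is already labelled as a speciation or duplication event, and it uses the usual convention that the event at the most recent common ancestor decides between ortholog and paralog. The tree, the gene names and the function names are hypothetical.

```python
# child -> parent map for a hypothetical rooted gene tree
PARENT = {
    "human_A": "s1", "mouse_A": "s1",
    "human_B1": "d2", "human_B2": "d2",
    "d2": "s2", "mouse_B": "s2",
    "s1": "d1", "s2": "d1",
    "d1": "s0", "fly_AB": "s0",
}

# Event assigned to each internal node (assumed to be known already)
EVENT = {
    "s0": "speciation",
    "d1": "duplication",
    "s1": "speciation",
    "s2": "speciation",
    "d2": "duplication",
}


def ancestors(node):
    """Internal nodes from the parent of `node` up to the root."""
    path = []
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def classify(gene_a, gene_b):
    """Classify a pair of leaves using the events on the path between them."""
    up_a, up_b = ancestors(gene_a), ancestors(gene_b)
    lca = next(n for n in up_a if n in up_b)  # most recent common ancestor
    # All internal nodes on the path: a -> ... -> lca <- ... <- b
    path = up_a[: up_a.index(lca) + 1] + up_b[: up_b.index(lca)]
    events = {EVENT[n] for n in path}
    if events == {"speciation"}:
        return "super-orthologs"
    if events == {"duplication"}:
        return "ultra-paralogs"
    return "orthologs" if EVENT[lca] == "speciation" else "paralogs"


print(classify("human_A", "mouse_A"))    # super-orthologs
print(classify("human_A", "fly_AB"))     # orthologs
print(classify("human_A", "mouse_B"))    # paralogs
print(classify("human_B1", "human_B2"))  # ultra-paralogs
```

The sketch does not check that orthologs come from different species; it relies on the event labels being correct.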

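For species tree reconstruction there are two methods, both using multiple genes from each species: a supermatrix (preferred), in which the alignments of the individual genes are concatenated into one large alignment, or a supertree, in which the trees of the individual genes are combined into one tree.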
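
A minimal sketch of the supermatrix idea: concatenate the per-gene alignments species by species. The function name, the toy sequences and the choice to pad a missing gene with gap characters are assumptions made for this illustration.

```python
def build_supermatrix(gene_alignments):
    """Concatenate per-gene alignments into one alignment per species.

    gene_alignments: list of dicts mapping species -> aligned sequence.
    A species missing from a gene alignment is padded with '-' (assumed
    convention for missing data in this sketch).
    """
    species = sorted({sp for aln in gene_alignments for sp in aln})
    parts = {sp: [] for sp in species}
    for aln in gene_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in one alignment must have equal length")
        width = lengths.pop()
        for sp in species:
            parts[sp].append(aln.get(sp, "-" * width))
    return {sp: "".join(chunks) for sp, chunks in parts.items()}


gene1 = {"human": "ACGT", "mouse": "ACGA", "fly": "AC-T"}
gene2 = {"human": "GG", "mouse": "GA"}  # no fly sequence for this gene

for sp, row in build_supermatrix([gene1, gene2]).items():
    print(sp, row)
# fly AC-T--
# human ACGTGG
# mouse ACGAGA
```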

Major sources of phylogenetic error
- Sparse "taxon sampling"
  o Historically refers to reconstructing phylogenies for single genes (restricted to orthologs in different species)
  o In protein superfamily reconstruction, including paralogous groups, simply refers to the selection of proteins (multiple genes and multiple species)
- Lineage-specific rate variation
  o Historically refers to species that are evolving more rapidly than others
  o In protein superfamily reconstruction, refers to genes (a group of orthologs) that are evolving rapidly (perhaps due to neo-functionalization)
- Site-specific rate variation
  o Less common in single gene trees (orthologs in different species)
  o Very common in protein superfamilies due to diversification of function following gene duplication
- Sequence fragments (or gene model errors)
  o Very common in protein sequence databases
- Insufficient site data (e.g., short MSA)
  o Very common for trees based on single domains (esp. if <100 aa)
- Few informative sites

Question: which sources of error are more likely to occur in phylogenomic species tree estimation?
Which are more likely in protein superfamily phylogeny reconstruction?

Types of errors in trees
- In the branching order (topology)
  o Example coarse branching order:
    - Relative branching order between taxonomic groups (primates, rodents, ruminants)
    - Relative branching order between clades representing different genes (in a multi-gene tree including duplication events/paralogous groups)
  o Example fine branching order:
    - Relative branching order within Hominidae (human, chimps, bonobos, gorilla, orangutan)
- In branch lengths

Question: which type of error will have a bigger impact on orthology prediction?

The 4 Vs & Big Data Challenges in biology
Veracity:
- Errors in gene models (missing exons, etc., esp. for eukaryotic genomes)
- Errors in assigned functions (25% expected to be incorrect)
- <1% of genes have experimental support for any aspect of their predicted function(s)
- Errors are neither detected nor corrected
Volume (petabytes):
- Continued exponential growth of sequence databases
- Metagenome and next-generation sequencing data are too big to store
Velocity:
- Streaming approaches needed for next-gen sequencing data and other data types
- New data types emerge annually
Variety (too many data types to list!)
Missing data:
- ~30% of the genes in a typical genome are annotated as "hypothetical" or "unknown"
- Ontologies biased towards selected genomes and processes/functions
- The provenance of functional annotations is rarely stored
- Links between sequences and papers/data may not be stored
Other practical issues: sociological/economic
- Biologists seldom obtain training in algorithms, statistics, machine learning, or software engineering (and people with those skills seldom understand biology)
- Cost of IT expertise

Summary points
- Homology (detection, and use in function and structure prediction) is a central theme in bioinformatics
- Proteins evolve novel functions and structures by numerous processes - these must be taken into account to prevent annotation errors based on homology
- Functional annotation error rates are estimated at ~20-25%
- Orthology relations are a subset of homology relations, and provide a greater degree of specificity
  o Not all orthology-prediction methods are equally accurate
- Prediction methods almost always have a specificity-recall trade-off
- How methods are benchmarked for expected accuracy is critical
  o Separation of training and test data is essential
  o Some benchmark datasets are easy, others are hard, and some are fundamentally flawed
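
The specificity-recall trade-off mentioned above can be seen by moving the score threshold of a predictor. The scores and labels below are invented toy data, and the function name is an assumption for this sketch; recall is taken as TP / (TP + FN) and specificity as TN / (TN + FP).

```python
def recall_and_specificity(scores, labels, threshold):
    """Call everything with score >= threshold positive, then evaluate."""
    tp = fp = tn = fn = 0
    for score, is_positive in zip(scores, labels):
        predicted = score >= threshold
        if predicted and is_positive:
            tp += 1
        elif predicted and not is_positive:
            fp += 1
        elif not predicted and is_positive:
            fn += 1
        else:
            tn += 1
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return recall, specificity


# Toy benchmark: 4 true positives and 4 true negatives (made-up values)
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [True, True, False, True, False, True, False, False]

for threshold in (0.75, 0.5, 0.25):
    print(threshold, recall_and_specificity(scores, labels, threshold))
# 0.75 (0.5, 1.0)
# 0.5 (0.75, 0.75)
# 0.25 (1.0, 0.5)
```

Lowering the threshold finds more of the true positives (higher recall) but also accepts more false positives (lower specificity).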

Lecture 2: gene expression, evolution and homology (Heringa – 2018)
A gene can be a couple of hundred to two million base pairs long. The human genome is 3.2-3.3 billion base pairs long. Neither the length of the genome nor the number of genes is related to the complexity of the organism.

Transcription. Gene expression depends on a transcription factor (TF) binding a transcription factor binding site (TFBS – a DNA motif) and on a polymerase (Pol II in eukaryotes). The polymerase attaches to the TATA-box; binding of the TF induces a conformational change that activates the polymerase and starts transcription. Bacterial systems need only one TF, while eukaryotic cells need multiple TFs and additional proteins to start transcription. TFBSs and TATA-boxes are highly conserved in evolution, though small differences exist. One gene can have multiple TFBSs, and one TF can activate the transcription of multiple genes. Gene expression is controlled by proximal and distal regulatory elements, commonly bound by combinatorial transcription factor complexes. TFs can activate (enhancers) or inhibit (repressors, usually within gene regions) transcription. Gene regulatory network models can be constructed from the TFs and the cis-regulatory elements with which they interact.
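
As a rough illustration of how a gene regulatory network model could be derived from TFs and their binding sites, the sketch below scans promoter sequences for TFBS motifs and records a TF -> gene edge for every match. The TF names, motifs and promoter sequences are made up for this example, only the forward strand is scanned, and real methods use statistical motif models rather than exact pattern matching.

```python
import re

# IUPAC nucleotide codes used in the motifs below
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "[AT]", "R": "[AG]", "Y": "[CT]", "N": "[ACGT]",
}


def find_sites(motif, sequence):
    """0-based start positions of all (possibly overlapping) motif matches."""
    body = "".join(IUPAC[ch] for ch in motif)
    pattern = re.compile("(?=(" + body + "))")  # lookahead allows overlaps
    return [m.start() for m in pattern.finditer(sequence)]


def build_network(motifs, promoters):
    """Map each TF to the genes whose promoter contains at least one site."""
    network = {}
    for tf, motif in motifs.items():
        network[tf] = sorted(
            gene for gene, seq in promoters.items() if find_sites(motif, seq)
        )
    return network


# Hypothetical TFs, motifs and promoter sequences (illustration only)
motifs = {"TF_A": "GGGCGG", "TF_B": "CANNTG"}
promoters = {
    "gene1": "TTGGGCGGAACACGTGTT",
    "gene2": "AAACAGCTGAAA",
    "gene3": "ATATATATAT",
}

print(find_sites("GGGCGG", promoters["gene1"]))  # [2]
print(find_sites("CANNTG", promoters["gene1"]))  # [10]
print(build_network(motifs, promoters))
# {'TF_A': ['gene1'], 'TF_B': ['gene1', 'gene2']}
```

In this toy network gene1 has binding sites for two TFs, and TF_B regulates two genes, matching the many-to-many relation described above.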

DNA packaging. DNA is wound around histone proteins, forming nucleosomes. Other proteins wind the DNA into a more tightly packed form, the chromosome. Unwinding portions of the chromosome is important for mitosis, replication and RNA synthesis. Many TFBSs are possible upstream of a gene.
