Summary of the study book Bioinformatics and Functional Genomics by Jonathan Pevsner (Chapters 2, 3, 4, 5, 9). ISBN: 9781118581780, Edition: 3rd Edition
Written for: Hanzehogeschool Groningen (Hanze), Bio-informatica, Bio-informatica1
Seller: RSusan
Chapter 2 Access to Sequence Data and Related Information
Introduction to biological databases
There are two main technologies for DNA sequencing. Beginning in the 1970s,
dideoxynucleotide sequencing (“Sanger sequencing”) was the principal method. Since 2005,
next-generation sequencing (NGS) technology has emerged, allowing orders of magnitude
more sequence data to be generated. The availability of vastly more sequence data has
affected most areas of bioinformatics and genomics.
Two ways of thinking about accessing data:
1. In terms of individual genes, proteins, or related molecules.
2. In terms of large datasets related to a problem of interest, for example:
a. Studying all the variants that have been identified across all human globin genes.
b. In patients having mutations in a gene, studying the collection of all of the tens of
thousands of RNA transcripts in a given cell type in order to assess the functional
consequences of that variation.
c. Sequencing the DNA corresponding to a set of 100 genes implicated in hemoglobin
function.
Databases and resources such as Entrez, BioMart, and Galaxy facilitate the manipulation of
larger datasets.
CENTRALIZED DATABASES STORE DNA SEQUENCES
Three main sites have been responsible for storing nucleotide sequence data from 1982 to
the present:
1. GenBank, at the National Center for Biotechnology Information (NCBI) of the National
Institutes of Health (NIH)
2. the European Molecular Biology Laboratory (EMBL)-Bank
3. the DNA Database of Japan (DDBJ)
All three are coordinated by the International Nucleotide Sequence Database Collaboration
(INSDC) and they share their data.
GenBank, EMBL-Bank and DDBJ accept sequence data that consist of complete or
incomplete genomes (or chromosomes) analyzed by a whole-genome shotgun (WGS)
strategy. The WGS division consists of sequences generated by high-throughput sequencing
efforts.
CONTENTS OF DNA, RNA AND PROTEIN DATABASES
Organisms in GenBank/EMBL-Bank/DDBJ
Types of Data in GenBank/EMBL-Bank/DDBJ
Suppose we want to find out the sequence of human beta globin. A fundamental distinction is
that DNA, RNA-based, and protein sequences are stored in discrete databases.
Furthermore, within each database sequence data are represented in a variety of forms.
Because RNA is relatively unstable, it is typically converted to complementary DNA (cDNA),
and a variety of databases contain cDNA sequences corresponding to RNA transcripts.
Beginning with the DNA, a first task is to learn the official name and symbol of a gene. For
humans and many other species, the RNA or cDNA is generally given the same name as the
gene, while the protein name may differ and is not italicized.
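As a rough illustration of retrieving such a sequence programmatically, the sketch below builds a request URL for EFetch, one of the NCBI Entrez programming utilities (E-utilities). It is not taken from the textbook: the helper name efetch_url is hypothetical, and NM_000518 is assumed to be the RefSeq mRNA accession for human beta globin (official gene symbol HBB). The code only constructs the URL; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities EFetch service.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, db="nuccore", rettype="fasta"):
    """Build an EFetch URL that requests one record as plain text.

    efetch_url is a hypothetical helper name used only for this sketch.
    """
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return EFETCH_BASE + "?" + urlencode(params)


# Assumption: NM_000518 is the RefSeq accession of human HBB mRNA.
url = efetch_url("NM_000518")
print(url)
```

Opening the printed URL in a browser should return the record in FASTA format. Changing db to "protein" and supplying a protein accession would target the separate protein database instead, which reflects the point that DNA, RNA-based, and protein sequences live in discrete databases.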
Genomic DNA Databases
A gene is localized to a chromosome. The gene is the functional unit of heredity and is a
DNA sequence that typically consists of regulatory regions, protein-coding exons, and
introns. A bacterial artificial chromosome (BAC) is a large segment of DNA that is cloned into
bacteria. Similarly, yeast artificial chromosomes (YACs) are used to clone large amounts of
DNA into yeast. BACs and YACs are useful vectors with which to sequence large portions of
genomes.
DNA-Level Data: Sequence-Tagged Sites (STSs)
The Probe database at NCBI includes STSs, which are short genomic landmark sequences
for which both DNA sequence data and mapping data are available. Because they are
sometimes polymorphic, containing short sequence repeats, STSs can be useful for mapping
studies.
DNA-Level Data: Genome Survey Sequences (GSSs)
All searches of the NCBI Nucleotide database provide results that are divided into three
sections: GSS, ESTs and “CoreNucleotide”. The GSS division of GenBank consists of
sequences that are genomic in origin. The GSS division contains:
• random “single-pass read” genome survey sequences
• cosmid/BAC/YAC end sequences
• exon-trapped genomic sequences
• Alu polymerase chain reaction (PCR) sequences
DNA-Level Data: High-Throughput Genomic Sequence (HTGS)
The HTGS division was created to make “unfinished” genomic sequence data rapidly
available to the scientific community. The HTGS division contains unfinished DNA
sequences generated by the high-throughput sequencing centers.
RNA data
RNA-Level Data: cDNA Databases Corresponding to Expressed Genes
Protein-coding genes, pseudogenes, and noncoding genes are all transcribed from DNA to
RNA. Genes are expressed in particular regions of the body and at particular times of
development. To study expressed genes, one can obtain a tissue such as liver, purify RNA,
and then convert the RNA to the more stable form of complementary DNA (cDNA).
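The relationship between an RNA transcript and its cDNA can be sketched in a few lines. This is a minimal illustration rather than anything from the textbook: the function names are hypothetical, the input fragment is only an example, and the code assumes an unambiguous A/C/G/U sequence. The first-strand cDNA made by reverse transcriptase is complementary to the mRNA, so it is computed here as the reverse complement written with T instead of U.

```python
# Watson-Crick pairing from an RNA template base to the DNA base that is
# incorporated opposite it during reverse transcription.
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def first_strand_cdna(mrna):
    """Return the first-strand cDNA (5'->3') complementary to an mRNA (5'->3')."""
    # Complement each base while walking the mRNA backwards, so the result
    # also reads 5'->3'.
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(mrna.upper()))


def sense_strand_dna(mrna):
    """Return the DNA sequence with the same sense as the mRNA (U replaced by T)."""
    return mrna.upper().replace("U", "T")


mrna = "AUGGUGCAUCUG"  # short example fragment, for illustration only
print(first_strand_cdna(mrna))  # CAGATGCACCAT
print(sense_strand_dna(mrna))   # ATGGTGCATCTG
```

Database entries for cDNA are usually shown in the same sense as the transcript, which corresponds to the second function.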
RNA-Level Data: Expressed Sequence Tags (ESTs)
The database of expressed sequence tags (dbEST) is a division of GenBank that contains
sequence data and other information on “single-pass” cDNA sequences from a number of
organisms. An EST is a partial DNA sequence of a cDNA clone. All cDNA clones, and
therefore all ESTs, are derived from some specific RNA source. The RNA is converted into a
more stable form, cDNA, which may then be packaged into a cDNA library. Typically, ESTs
are randomly selected cDNA clones that are sequenced on one strand (and therefore may
have a relatively high sequencing error rate). ESTs are often 300-800 base pairs in length.
Currently, GenBank divides ESTs into three major categories:
• human
• mouse
• other
RNA-Level Data: UniGene
The goal of the UniGene project is to create gene-oriented clusters by automatically
partitioning ESTs into nonredundant sets. Ultimately there should be one UniGene cluster
assigned to each gene of an organism.
A UniGene cluster is a database entry for a gene containing a group of corresponding ESTs.
There are far more human UniGene clusters than there are genes, because:
1. Much of the genome is transcribed at low levels. Currently, 64,000 human UniGene
clusters consist of a single EST and ~100,000 UniGene clusters consist of just 1-4
ESTs. These could reflect rare transcription events of unknown biological relevance.
2. Some DNA may be transcribed during the creation of a cDNA library without
corresponding to an authentic transcript; it is therefore a cloning artifact. Alternative
splicing may also introduce apparently new clusters because the spliced exon
has no homology to the rest of the sequence.
3. Clusters of ESTs could correspond to distinct regions of one gene. In that case there
would be two (or more) UniGene entries corresponding to a single gene. As a
genome sequence becomes finished, it may become apparent that the two UniGene
clusters should properly cluster into one. The number of UniGene clusters may
therefore collapse over time.
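The third point can be made concrete with a toy clustering sketch. This is not the UniGene procedure: it simply assumes that two ESTs belong together when they share an exact substring of length k, ignores strand and sequencing errors, and uses hypothetical names (kmers, cluster_ests) and an illustrative 40-base sequence. It shows two ESTs from distinct regions of one transcript forming separate clusters, and those clusters merging once a sequence that bridges the regions becomes available.

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests, k=8):
    """Group ESTs that are linked, directly or indirectly, by a shared k-mer."""
    parent = list(range(len(ests)))  # union-find forest, one node per EST

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_owner = {}  # k-mer -> index of the first EST that contained it
    for idx, seq in enumerate(ests):
        for kmer in kmers(seq, k):
            if kmer in first_owner:
                # Merge this EST's cluster with the cluster of the earlier EST.
                parent[find(idx)] = find(first_owner[kmer])
            else:
                first_owner[kmer] = idx

    clusters = {}
    for idx in range(len(ests)):
        clusters.setdefault(find(idx), []).append(idx)
    return sorted(clusters.values())


# Illustrative 40-base transcript and reads derived from it (assumed data).
transcript = "ATGGTGCATCTGACTCCTGAGGAGAAGTCTGCCGTTACTG"
est_a = transcript[:16]      # read from the 5' region
est_b = transcript[24:]      # read from the 3' region, no overlap with est_a
unrelated = "GGGGCCCCAAAATTTTGGCC"
bridge = transcript[6:34]    # later sequence spanning both regions

print(cluster_ests([est_a, est_b, unrelated]))          # [[0], [1], [2]]
print(cluster_ests([est_a, est_b, unrelated, bridge]))  # [[0, 1, 3], [2]]
```

With the bridging sequence added, the two clusters for the same transcript collapse into one, while the unrelated EST stays on its own.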
Access to Information: Protein Databases
The Protein database at NCBI consists of translated coding regions from GenBank and
external databases such as UniProt, the Protein Information Resource (PIR), SWISS-PROT,
Protein Research Foundation (PRF) and the Protein Data Bank (PDB). The EBI similarly
provides information on proteins via these major databases.
UniProt
The Universal Protein Resource (UniProt) is the most comprehensive, centralized protein
sequence catalog. Formed as a collaborative effort in 2002, it consists of a combination of
three key databases:
1. Swiss-Prot is considered the best annotated protein database, with descriptions of
protein structure and function added by expert curators.
2. The translated EMBL (TrEMBL) Nucleotide Sequence Database Library provides
automated annotations of proteins not in Swiss-Prot. It was created because of the
vast number of protein sequences that have become available through genome
sequencing projects.
3. PIR maintains the Protein Sequence Database, another protein database curated by
experts.
UniProt is organized in three database layers:
1. The UniProt Knowledgebase (UniProtKB) is the central database that is divided into
the manually annotated UniProtKB/Swiss-Prot and the computationally annotated
UniProtKB/TrEMBL.
2. The UniProt Reference Clusters (UniRef) offer nonredundant reference clusters
based on UniProtKB. UniRef clusters are available with members sharing at least
50%, 90% or 100% identity.
3. The UniProt Archive, UniParc, consists of a stable, nonredundant archive of protein
sequences from a wide variety of sources.
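The identity thresholds used by UniRef can be illustrated with a small calculation. The sketch below is an assumption-laden toy, not how UniRef is built: it takes two sequences that are already aligned to the same length, counts identical positions, and reports which of the 50%, 90% and 100% levels that identity would satisfy. The function names and the short peptide strings are hypothetical, and real clustering involves additional criteria not covered in these notes.

```python
def percent_identity(a, b):
    """Percent of identical positions between two equal-length, pre-aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def identity_levels(identity):
    """Return which of the 50/90/100% thresholds an identity value meets."""
    return [level for level in (50, 90, 100) if identity >= level]


reference = "MVHLTPEEKS"  # illustrative 10-residue peptide
print(identity_levels(percent_identity(reference, "MVHLTPEEKS")))  # [50, 90, 100]
print(identity_levels(percent_identity(reference, "MVHLTPEEKA")))  # [50, 90]
print(identity_levels(percent_identity(reference, "MVHLTAAAAA")))  # [50]
```

A pair that meets only the 50% level could share a cluster at that level while falling into different clusters at 90% and 100%, which is why the lower-identity sets contain fewer, broader clusters.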
CENTRAL BIOINFORMATICS RESOURCES: NCBI AND EBI
Introduction to NCBI
The NCBI creates public databases, conducts research in computational biology, develops
software tools for analyzing genome data, and disseminates biomedical information.
Prominent resources include the following:
• PubMed is the search service from the National Library of Medicine (NLM) that
provides access to over 24 million citations in MEDLINE (Medical Literature Analysis
and Retrieval System Online) and other related databases, with links to participating
online journals.