Biology Review
• The sequence of nucleotides in DNA is important because it encodes the sequence of amino
acids that constitutes the corresponding polypeptide
• Genetic Code: The relationship between a sequence of DNA and the sequence of the
corresponding polypeptide
• The coding sequence is read in groups of three nucleotides, each group representing one
amino acid.
• Codon: A sequence of three nucleotides.
• Reading in the conventional 5’ to 3’ direction, the nucleotide sequence of the DNA strand that
encodes a polypeptide corresponds to the amino acid sequence of the polypeptide, read in
the direction from the N terminal to the C terminal of the protein
• The genetic code is read in non-overlapping triplets. Therefore there are three possible
ways of translating any nucleotide sequence, depending on the starting point. These are called
reading frames. A reading frame that consists exclusively of triplets encoding amino acids is
called an Open Reading Frame (ORF)
• The reading frame affects which protein is made.
• Frameshift mutations are induced by acridines, which cause additional bases to be
incorporated or deleted.
• Each mutagenic event in the presence of an acridine results in the addition or removal of a
single base pair
• Combinations of mutations reveal evidence about the nature of the genetic code
• The genetic code is degenerate (most amino acids are encoded by more than one codon), which
is why mutations can be synonymous or non-synonymous
• Synonymous mutation: a change in the coding DNA sequence that does not change the encoded
amino acid
• Non-synonymous mutation: a change in the coding DNA sequence that changes the encoded
amino acid, resulting in a change in the protein sequence
• The most commonly used sequence format is FASTA, for both nucleic acid and amino acid
sequences. DNA sequences are represented from 5’ to 3’; this is the sense (forward) strand
and gives a positive frame
• The reverse would be a negative frame and would run from 3’ to 5’. Protein
sequences are represented from the N to the C terminal.
• The first line of a FASTA format sequence is the metadata. It must start with the
greater than sign (>), which indicates the start of a FASTA record. The sequence, either
nucleic acid or amino acid, starts on the second line. The second line cannot be used for
metadata
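The ideas above (FASTA records, codons, reading frames, synonymous changes) can be sketched together in a few lines of Python. This is only an illustrative sketch: the record `demo_seq` and the function names `parse_fasta`, `translate` and `is_synonymous` are made up for this example, the standard genetic code is assumed, and only the three forward (positive) frames are translated.

```python
from itertools import product

# Standard genetic code, built compactly: codons are enumerated with the
# bases in the order T, C, A, G, and "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A ">" line is metadata and starts a new record.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Every other line is sequence; it may be wrapped over many lines.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def translate(dna, frame=0):
    """Translate 5' to 3' in non-overlapping codons from offset 0, 1 or 2."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    # Codons containing unexpected characters are shown as "X".
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def is_synonymous(codon_a, codon_b):
    """True if both codons encode the same amino acid."""
    return CODON_TABLE[codon_a] == CODON_TABLE[codon_b]


# Hypothetical record, invented for this example.
example = """>demo_seq hypothetical example record
ATGGCCAAA
TGA
"""
header, seq = parse_fasta(example)[0]
for frame in range(3):
    print(frame, translate(seq, frame))
print(is_synonymous("GCC", "GCT"), is_synonymous("GCC", "GTC"))
```

Each starting offset gives a different protein, which is the point about reading frames. The negative frames would be obtained the same way after taking the reverse complement of the sequence.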
Part 2
• Proteins are polypeptide polymers consisting of a linear arrangement of amino acids.
• Protein structure can be categorized in several levels: Primary (the linear amino acid
sequence), Secondary (arrangement of the primary amino acid sequence into elements such as
alpha helices, beta sheets, coils and loops), Tertiary (3D arrangement formed by packing
secondary structure elements into globular domains), Quaternary (involves the arrangement of
several polypeptide chains)
• Protein Domain: A region of protein that can adopt a particular 3D structure. Domains are
also called modules. Together a group of proteins that share a domain is called a Family.
• There are many databases of protein families, such as Pfam and SMART, that explore
the protein domain aspect and can give you a good idea of the protein architecture in terms of
domains
• Motif: Short conserved region of a protein. Typically consists of a pattern of amino acids
that can characterize a protein family
Databases
Database
Computational archives used to store and organize data.
o To easily query/retrieve data
o Query: a question put to the database
Organise data in structured records
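As a rough illustration of structured records and querying, the sketch below models a tiny in-memory protein "database". Everything in it is hypothetical: the `ProteinRecord` class, the `query` function, the accessions and the sequences are invented for the example and do not correspond to any real database or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProteinRecord:
    """One database entry; each attribute is a pre-defined field."""
    accession: str  # hypothetical identifier, not a real accession
    name: str
    sequence: str

    @property
    def length(self):
        # Derived field: number of amino acids in the sequence.
        return len(self.sequence)


# A toy database: a list of structured records (all values invented).
DATABASE = [
    ProteinRecord("DEMO001", "example kinase", "MKTAYIAKQR"),
    ProteinRecord("DEMO002", "example protease", "MSLVNPT"),
    ProteinRecord("DEMO003", "example kinase-like protein", "MKTWWIAKQRGG"),
]


def query(records, name_contains=None, min_length=0):
    """Return records whose name contains a substring and meet a minimum length."""
    hits = []
    for rec in records:
        if name_contains is not None and name_contains.lower() not in rec.name.lower():
            continue
        if rec.length < min_length:
            continue
        hits.append(rec)
    return hits


for rec in query(DATABASE, name_contains="kinase", min_length=11):
    print(rec.accession, rec.name, rec.length)
```

Because every record has the same named fields, a query is just a filter on those fields. Real biological databases work on the same principle, with far more fields and with indexes for fast searching.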
Biological Databases
- Structured collections of information; they do not all look the same.
- Consists of basic units called records or entries.
- Each record consists of fields, which hold pre-defined data related to the record.
- For example, a protein database would have protein entries as records and protein
properties as fields (e.g., name of protein, length, amino-acid sequence)
- Biological databases use different organizing principles. Hyperlinks connect records in
different databases
- When you do database research it is important to state the name and version of the
database, because the results change as the database is updated
- The aim is to collect as much general information as possible about:
o Nucleotide sequence Databases
NCBI GenBank
EMBL Nucleotide Sequence Database
DDBJ
o Protein Sequences
UniProtKB
PDB – experimental data
AlphaFold – predictions
- Primary: Raw data
o Sequence database:
GenBank
UniProt
o Tertiary structure database:
PDB, AlphaFold
- Secondary:
o Motif database: PROSITE, PRINTS
Regular expression
o Domain database: Pfam, SMART
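Motif databases such as PROSITE describe motifs as patterns that map naturally onto regular expressions. The sketch below converts a simple PROSITE-style pattern into a Python regular expression and scans a sequence with it. The function `prosite_to_regex` and the test sequence are invented for this example, and only the basic pattern syntax is handled (x, [..], {..}, repetition counts, and the < and > anchors). The pattern used is the commonly cited N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern into a Python regular expression.

    Handles: x (any residue), [..] (allowed residues), {..} (forbidden
    residues), (n) or (n,m) repetition, and the < and > terminal anchors.
    """
    pattern = pattern.rstrip(".")  # patterns are conventionally ended by "."
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):
            prefix, element = "^", element[1:]
        if element.endswith(">"):
            suffix, element = "$", element[:-1]
        # Split off an optional repetition such as "(3)" or "(2,4)".
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"
        if repeat:
            core += "{" + repeat + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


# N-glycosylation site pattern: N, then anything but P, then S or T,
# then anything but P.
regex = prosite_to_regex("N-{P}-[ST]-{P}.")
print(regex)

# Hypothetical test sequence, invented for this example.
sequence = "MKNGSAANPTQ"
for hit in re.finditer(regex, sequence):
    print(hit.start(), hit.group())
```

Note that `re.finditer` reports non-overlapping matches only, so a real motif scanner would need extra care where sites can overlap.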
The ‘Perfect’ Database
Comprehensive, but easy to search.
Annotated, but not “too annotated.”
A simple, easy to understand structure.
Cross-referenced
Minimum redundancy – not having multiple copies of the same entry
Easy retrieval of data
Manually curated databases are more reliable; however, automated databases are larger
Central Bioinformatics Resource (NCBI)
- Largest collection is housed at the National Center for Biotechnology Information
(NCBI), part of the National Library of Medicine
- A large staff of curators processes the information and compiles it into derivative
databases
- NCBI (with Ensembl, EBI, UCSC) is one of the central bioinformatics sites. It includes:
o PubMed
o Entrez search engine integrating many databases
o BLAST
o OMIM (Online Mendelian Inheritance in Man)
o Taxonomy
o Books
o Many additional resources
Databases & Sequence Analysis
- The main approach to investigating the meaning of sequences is the detection of similarity
between sequences, in order to infer related structures and functions
- It relies on the fact that some characteristic has been seen before
- If there is no hit, the sequence may simply not have been entered in the database. If there
is a hit, it may still be meaningless if it is not reliable
Similarity & Homology
- We are trying to find the similarity between sequences or structures to infer homology
- Homologous sequences are those that have diverged from a common ancestor. Homology is
therefore a definitive statement: sequences either share an ancestor or they do not
- Data on sequence identity are used to infer homology
- We can further distinguish between homologous sequences:
o Orthologues: homologous proteins that do the same function in different species
o Paralogues: homologous proteins that perform different, but related, functions
within one organism
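Identity is the measurable quantity here, while homology is the inference drawn from it. A minimal sketch of computing percent identity between two already aligned sequences is given below. The function name `percent_identity` and the two sequences are invented for this example, and the choice to ignore gapped columns is an assumption: different tools use different denominators, so identity values are only comparable when the definition is stated.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns that contain a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Two hypothetical aligned protein fragments: one gap and one mismatch.
print(percent_identity("MKT-AYIAK", "MKTWAYLAK"))  # 7 of 8 columns match: 87.5
```

The number on its own does not prove homology. It is evidence that has to be judged against sequence length and what would be expected by chance.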
Issues for Biological Databases
- Annotation is additional information to the raw sequence; for example, references, gene
position, cross-links to other databases.
- Is it correct?
o Most genes are annotated based on the similarity to other genes. This can cause
many problems
o Comparison was made when data was less complete
o If sequence is incorrectly annotated, the error propagates in the database
o Must note when annotation is based on similarity
- Is it good quality?
o Is the annotator an expert?
o Many databases have defined groups of ‘experts’ who annotate particular protein
families etc.
Sequence formats