Terminology
Open reading frame: A DNA sequence of reasonable length that begins with an initiation
codon (ATG) and ends with a stop codon (TAG, TAA, TGA).
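As a rough illustration of the definition above, the sketch below scans the three forward reading frames for ATG...stop stretches. The function name find_orfs and the min_codons threshold (a stand-in for "reasonable length") are assumptions made for this example; the reverse strand is not handled.

```python
START = "ATG"
STOPS = {"TAG", "TAA", "TGA"}

def find_orfs(seq, min_codons=30):
    """Return sorted (start, end) pairs for ORFs on the forward strand.

    start is the index of the A in ATG; end is exclusive and includes
    the stop codon. min_codons is an assumed cut-off, counted with the
    start and stop codons included.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk the frame codon by codon.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i  # first ATG opens a candidate ORF
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ATG in this frame
    return sorted(orfs)
```

Usage: find_orfs("CCATGAAATTTTAGGG", min_codons=4) reports one ORF covering ATG AAA TTT TAG.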
Single nucleotide polymorphism: Single base variations between genomes.
Haplotype: A local combination of genetic polymorphisms that tend to be inherited together.
CRISPR/Cas: Clustered Regularly Interspaced Short Palindromic Repeats / CRISPR-associated proteins.
Sequence assembly: Inference of the complete sequence of a region from data on individual
fragments, by piecing together overlaps.
Coverage: Ratio of the total number of sequenced bases over the genome length.
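A worked example of this ratio, assuming all reads have the same length so that the total number of sequenced bases is simply reads times read length. The function and parameter names are made up for the illustration.

```python
def coverage(num_reads, read_length, genome_length):
    """Average coverage = total sequenced bases / genome length.

    Assumes every read has the same length (a simplification).
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return num_reads * read_length / genome_length
```

For instance, one million 150-base reads over a 5 Mb genome give coverage(1_000_000, 150, 5_000_000), which works out to 30-fold coverage.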
Contig: A partial assembly in which overlapping fragments are merged into a contiguous
region of sequence.
Single-end read: Sequence reported from one end of a fragment.
Paired-end read: Sequence reported from both ends of a fragment.
Read length: Number of bases reported on a single fragment from a single experiment.
De novo sequencing: Determining the sequence of the first genome of a species.
Resequencing: Once a reference genome is available, further genomes of the species are not
assembled de novo, but rather mapped to the existing reference.
Exome sequencing: Resequencing project that only sequences exons/protein-coding regions.
Linkage: Absence or reduction of independent assortment of parental genes, which are
usually transmitted together because they lie on the same chromosome. Linkage concerns the
distribution of loci among chromosomes.
Linkage disequilibrium: The non-random association between two genetic markers or loci. It
concerns the distribution of allelic patterns in populations: the deviation of the genotype
distribution in the population from the ultimate 1:2:1 ratio.
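One common way to put a number on non-random association is the coefficient D = p(AB) - p(A)p(B), together with its normalized form r^2. These measures are not given in the notes above; the sketch below is only an illustration for two biallelic loci (alleles A/a and B/b), and the function name and the haplotype-count inputs are assumptions.

```python
def linkage_disequilibrium(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, r2) from counts of the four two-locus haplotypes.

    D  = p_AB - p_A * p_B
    r2 = D^2 / (p_A * (1 - p_A) * p_B * (1 - p_B))
    """
    total = n_AB + n_Ab + n_aB + n_ab
    if total <= 0:
        raise ValueError("at least one haplotype must be observed")
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total  # frequency of allele A
    p_B = (n_AB + n_aB) / total  # frequency of allele B
    d = p_AB - p_A * p_B
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    # r2 is undefined when either locus is monomorphic.
    r2 = d * d / denom if denom > 0 else float("nan")
    return d, r2
```

Equal counts of all four haplotypes give D = 0 (no association); a population containing only AB and ab haplotypes in equal numbers gives D = 0.25 and r^2 = 1.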
Retrotransposons (class I): Replicate via an RNA intermediate, so there will ultimately
be two copies of the same element at two different locations; that is, they use a
‘copy-and-paste’ mode.
Transposons (class II): Produce DNA copies without an intermediate RNA stage; they encode
transposase, which recognizes sequences within the transposon itself, cuts it out, and inserts
it elsewhere; thus they use a ‘cut-and-paste’ mode.
Restriction fragment: A fragment of a DNA molecule that has been cleaved by a restriction
enzyme.
Restriction map: The reconstruction of restriction fragments into an entire sequence.
VNTR: Variable number tandem repeats
RFLP: Restriction fragment length polymorphism
STRs/STPRs/SSRs: Short Tandem (Palindromic) Repeats: Regions that contain a 2–5 bp unit
repeated from a few to a dozen times.
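A minimal sketch of how such regions might be located with a regular-expression backreference. The function name find_strs and the min_repeats default are assumptions for the example; note that a homopolymer run such as AAAAAA is also reported (as the unit AA repeated), which a real tool would probably filter out.

```python
import re

def find_strs(seq, min_unit=2, max_unit=5, min_repeats=3):
    """Return (start, unit, repeat_count) for each tandem repeat found.

    The unit is matched lazily so that the shortest repeating unit
    is preferred; \\1 then requires further back-to-back copies.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        hits.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return hits
```

Usage: find_strs("GGCACACACATT") finds the CA unit repeated four times starting at index 2.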
C-value: The amount of DNA per haploid cell, which is constant within a species
Homologues: Regions of genomes, or portions of proteins, that are derived from a common
ancestor
Paralogues: Related genes (i.e. homologues) that have diverged to provide separate functions
within the same species
Orthologues: Homologues that perform the same function in different species
Neo-functionalization: Following duplication, one copy may acquire a novel, beneficial
function and become preserved by natural selection, while the other copy retains the original
function
Sub-functionalization: Following duplication, both copies may become partially compromised
by mutation accumulation to the point at which their total capacity is reduced to the level of
the single-copy ancestral gene
Non-functionalization: Following duplication, one copy may simply become silenced by
degenerative/deleterious mutations, while the other copy retains the original function
Ka/Ks ≈ 1 neutral evolution: silent (Ks) and amino-acid-changing substitution (Ka) mutations
have occurred to approximately equal extents
Ka/Ks > 1 positive (adaptive) selection: substitution mutations are more prevalent than silent
mutations, implying that selective pressures are active and the substitutions are
advantageous
Ka/Ks < 1 purifying (negative) selection: substitution mutations are underrepresented,
implying that the sequence is optimized fairly rigidly, with relatively little tolerance for
mutation
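The three Ka/Ks cases above can be sketched as a simple classifier. The tolerance that decides what counts as "approximately 1" is an arbitrary assumption for illustration, as is the function name; in practice the call would rest on a statistical test rather than a fixed cut-off.

```python
def classify_selection(ka, ks, tolerance=0.1):
    """Label a Ka/Ks ratio as neutral, positive or purifying.

    tolerance is an assumed width around 1 treated as neutral.
    """
    if ks <= 0:
        raise ValueError("Ks must be positive to form the ratio")
    ratio = ka / ks
    if abs(ratio - 1) <= tolerance:
        return "neutral"
    return "positive" if ratio > 1 else "purifying"
```

Usage: classify_selection(0.02, 0.2) gives a ratio of 0.1 and is labelled "purifying".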
Polyploid: An organism that contains multiple sets of entire chromosomes
Autopolyploid: An organism that contains multiple copies of genomes from the same parent
Allopolyploid: An organism that contains multiple copies of genomes from different parents.
Long Questions:
Chapter 01
Contrast the challenges of gene identification in prokaryotes vs eukaryotes
Easier in prokaryotes than in eukaryotes.
Prokaryotes have smaller genomes, fewer genes, contiguous genes that lack introns and
small intergenic regions.
Eukaryotes have sparsely distributed genes, most of which contain introns; alternative
splicing also complicates gene ID.
Distinguish between two general methods of gene identification (4)
a priori methods – they work by recognizing sequence patterns within expressed genes
and the regions flanking them (e.g., codon statistics and no stop codons)
‘been there, seen that’ methods – they work by recognizing regions corresponding to
previously known genes (e.g., expressed sequence tag, EST)
Describe useful features of gene identification in addition to codon usage: what to look
for in the beginning, middle, and end of genes (9)
5′ exons will typically start with transcription start sites (TSS); there will be a core
promoter site that includes, for example, a TATA box roughly at -30 bp; initial exons
are usually free of in-frame stop codons, and they end immediately before a GT splice
signal
Internal exons will also be free of in-frame stop codons; they will begin immediately
after an AG splice signal and end immediately before the next GT splice signal