BIOINFORMATICS & GENOME TECHNOLOGY: Genome Analysis
Chapter 1: Genome complexity and genome mapping
1. Genome structure and organization
• Genome complexity = a crucial factor in genome analysis (increasing genome complexity; variability)
• Genome complexity: DNA content
o Large variation in DNA content is not necessarily reflected in the approx. number of genes (S3)
▪ Humans: 30,000 genes; gene number varies less relative to bacteria than DNA content does, because humans have much more non-coding DNA
o Human DNA content
▪ few protein coding genes & pseudogenes, many unique non-coding & repetitive DNA
▪ interspersed repeats: DNA transposon, Long interspersed nuclear elements (LINE), Short
interspersed nuclear elements (SINE), LTR retrotransposon
▪ tandem repeats: telomeres etc.
• Transposons:
o Classification according to mode of transposition
▪ 1) DNA transposons = cut and paste = class II: using DNA intermediate
▪ 2) retrotransposons = via replicative mechanism = class I: using an RNA intermediate
o Classification according to autonomy
▪ 1) autonomous: encode all gene products necessary for transposition
• DNA transposons (e.g. Tc1/mariner family), non-LTR retrotransposons (e.g. LINEs), LTR
retrotransposons (e.g. Ty/copia/HERV)
▪ 2) non-autonomous: no coding capacity
• SINEs (e.g. Alu elements), processed pseudogenes
o General: The eukaryotic genome = very plastic = there are rearrangements, insertions/deletions
• Genome complexity in eukaryotes
o From DNA -> RNA: reduction in complexity: RNA represents 5% of genome
o From pre-mRNA -> mature mRNA via RNA processing (differential splicing): increase in complexity
o Conclusion: the same gene can give rise to different RNA molecules/different coding sequences (e.g. male- vs
female-specific splicing) => this increases complexity at the transcript level
• Reduction of complexity: DNA renaturation kinetics
o = method to physically distinguish between e.g. regions of coding DNA and repetitive DNA (see S4)
▪ After denaturation, repetitive DNA renatures more easily than unique protein-coding DNA
▪ Reason: repetitive DNA is present in more copies => a strand can pair not only with its exact complement but also
with highly similar copies, so partners are found fast => consequence: more rapid, less specific renaturation
o = basis of Cot filtration technique
▪ = to separate (physically) the repetitive DNA sequences from "gene-rich" single/low-copy
sequences, thus allowing ‘focused’ sequencing on gene-rich genome regions
▪ Process of physically separating fractions of DNA
• 1) Denaturation of DNA => get ssDNA
• 2) Renaturation of ssDNA
o First: renaturation of Foldback DNA that is almost identical
o Second: renaturation of high & middle redundant (repetitive DNA) (low cot
value, high reassociation rate) (blue)
o Third: renaturation of single-low copy DNA (gene rich DNA) (high cot value,
low reassociation rate) (green)
▪ Renaturation process: Cot = [DNA] (mol/L) x renaturation time (s) x buffer factor
([cation], viscosity (DNA length))
• => At a given temperature
• Note: k = reassociation rate (in M^-1 s^-1); FB = foldback DNA (k & Cot undefined);
HR = high redundant; MR = middle redundant; SL = single-low copy
▪ What happens if we denature DNA at 95°C & then cool it down to room temperature rapidly or slowly (parameters: T,
renaturation time)?
• 1) slowly: more time for right strands to find each other => correct renaturation
• 2) rapidly: not much time => repetitive DNA sticks together, unique DNA stays apart
o Cot filtration kinetics: calculation of Cot values
▪ At the moment half of the DNA has renatured: C/C0 = ½ => Cot value (Cot½) = 1/k
• In an idealized reassociation curve in a Cot diagram: the Cot value read where the reaction is halfway = Cot½
▪ Conclusion: In this manner the value of k can be derived experimentally from the
reassociation curve => This value depends on the cation concentration, temperature,
fragment size, etc.
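The relation C/C0 = ½ => Cot = 1/k follows from ideal second-order reassociation kinetics, C/C0 = 1 / (1 + k x Cot). The sketch below evaluates that curve for the three kinetic classes. The rate constants are hypothetical values chosen only to show the ordering HR > MR > SL; they are not measurements from the lecture.

```python
def fraction_single_stranded(cot, k):
    """Fraction C/C0 still single-stranded under ideal second-order kinetics."""
    return 1.0 / (1.0 + k * cot)


def cot_half(k):
    """Cot value at which half of the DNA has renatured (C/C0 = 1/2)."""
    return 1.0 / k


# Hypothetical rate constants (M^-1 s^-1), for illustration only.
ILLUSTRATIVE_K = {"HR": 1e3, "MR": 1.0, "SL": 1e-3}


def remaining_by_class(cot, rate_constants=ILLUSTRATIVE_K):
    """C/C0 of each kinetic class at a given Cot value."""
    return {name: fraction_single_stranded(cot, k)
            for name, k in rate_constants.items()}


if __name__ == "__main__":
    for name, k in ILLUSTRATIVE_K.items():
        print(name, "Cot1/2 =", cot_half(k))
    # At an intermediate Cot, HR is almost fully renatured while SL is still single-stranded.
    print(remaining_by_class(1.0))
```

Reading the half-way point off a measured curve and inverting it (k = 1 / Cot½) is how k is obtained experimentally in this idealized picture.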
o Cot filtration: HR, MR, SL
▪ 1) Sheared genomic DNA => denature, renature to certain cot value => get dsDNA formed:
highly repetitive DNA
▪ 2) ssDNA that hasn’t renatured yet: middle repetitive DNA & low copy DNA => denature,
renature to higher cot value => get dsDNA formed: middle repetitive DNA
▪ 3) ssDNA hasn’t renatured yet: low copy DNA => denature, renature to highest cot value =>
get dsDNA formed: low copy DNA
▪ Conclusion: each round increases the enrichment in unique DNA (see red figure)
2. Genome mapping and sequencing
• Several historical techniques also reduce complexity, here in the sense of genome mapping
2.1 Restriction Fragment Length Polymorphism (RFLP)
• Restriction Fragment Length Polymorphism (RFLP)
o 1) You have a normal allele & affected allele
▪ Affected allele can be:
• Loss of cleavage site by mutation, e.g. normal allele has 3 recognition sites => affected allele has 2
• DNA insertions/deletions
• Length variation in microsatellite repeats
o 2) Then use a probe in Southern blotting with DNA from parents/progeny for detection of the
affected offspring
▪ 1) Two Heterozygous parents: a normal allele & affected allele: large & short fragment
▪ 2) Offspring can be: normal, affected, heterozygous
• 2 affected alleles: only larger fragment
• 2 normal alleles: only short fragment
• Heterozygous: large & short fragment
o Goal: to link the normal & affected alleles of a phenotypic trait to a certain site & match
that site to the presence or absence of a restriction site
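The band-pattern logic above (affected allele = larger fragment because a cleavage site is lost; normal allele = shorter fragment) can be sketched as a small lookup. The function name and the fragment sizes in the usage lines are hypothetical and serve only as an illustration.

```python
def rflp_genotype(band_sizes, normal_size, affected_size):
    """Infer a genotype from the fragment sizes seen on a Southern blot.

    band_sizes: iterable of observed fragment lengths (bp) for one individual.
    normal_size / affected_size: expected fragment lengths for each allele.
    """
    bands = set(band_sizes)
    has_normal = normal_size in bands
    has_affected = affected_size in bands
    if has_normal and has_affected:
        return "heterozygous"
    if has_affected:
        return "homozygous affected"
    if has_normal:
        return "homozygous normal"
    return "undetermined"


if __name__ == "__main__":
    # Hypothetical sizes: short (normal) fragment 2000 bp, large (affected) fragment 5000 bp.
    for bands in ([2000], [5000], [2000, 5000]):
        print(bands, "->", rflp_genotype(bands, normal_size=2000, affected_size=5000))
```

Real blots need a size tolerance when matching bands; exact matching is used here to keep the sketch short.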
• Southern blotting:
o 1) Immobilize DNA on a permanent substrate
o 2) Directed identification of specific DNA sequences
o Method: Genomic DNA => restriction endonuclease cleavage => gel electrophoresis on agarose gel
=> denature in alkali & blot transfer to nitrocellulose membrane: ssDNA fragments are on that blot
▪ Add radioactive probe containing sequences complementary to gene X => see whether the probe binds
at certain positions, e.g. to gene X
o Probe = 25-2000 bp ssDNA/RNA = complementary to the sequence being searched = labeled
2.2 Sequence-tagged sites (STS)
• Sequence-tagged sites (STS)
o = short unique sequence (200-300 bp)
o Method: a functional STS marker will amplify a single locus and produce a single band after PCR
o Process: design a primer pair that amplifies a unique region/single locus
▪ E.g. apply this to samples (12) => result: specific PCR products with some variation in PCR
products based on variability in the amplified region
o Disadvantage: time-consuming to develop!
o Note: same principle as RFLP, but RFLP uses Southern blot hybridization, whereas STS uses simple PCR
2.3 Amplified Fragment Length Polymorphism (AFLP)
• Amplified Fragment Length Polymorphism (AFLP)
o Method: complex genome => get barcodes that in random way represent that genome
o 1) Cut chromosomal DNA at recognition sites with the restriction enzymes EcoRI & MseI (type II
RE) => create overhangs: some fragments have one EcoRI overhang & one MseI overhang
o 2) Add adaptors (dsDNA with sticky ends): 1 adaptor fits overhang EcoRI & other adaptor fits
overhang MseI => result: fragments with 2 different ends are flanked by 2 different adaptors
o 3) Design 2 primers based on the sequence of the adaptors, to perform PCR (see drawing):
▪ Make a primer that ends with a nucleotide (here T at the 3’ end) that matches the first nucleotide
of the insert (T)
▪ Consequence: for any random fragment the chance that the last nucleotide of the primer matches the first
nucleotide of the insert = ¼ & on the other side the same: ¼
• => ¼ x ¼ = 1/16 = only 1 in 16 fragments will be amplified (here only 3 on the slide) in
PCR, provided the primers are not too far apart
▪ Result: this oversimplified AFLP barcode, representing one sample, can be compared
with those of other samples
• => the better the samples match each other, the more closely they are related
(phylogenetic tree)
o In highly complex organisms, e.g. humans, using 2 restriction enzymes may not simplify the pattern enough
▪ => we can simplify further by making the primers longer, e.g. an extra G-C at the 3’ end (see drawing)
• => 1/16 x 1/16 = 1/256 of the fragments will be amplified
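The ¼ x ¼ = 1/16 and 1/16 x 1/16 = 1/256 arithmetic generalizes to any number of selective nucleotides. A minimal sketch, assuming all four bases are equally likely at each selective position and that positions are independent (an idealization; real genomes have biased base composition). The function names and the fragment count in the usage lines are hypothetical.

```python
from fractions import Fraction


def aflp_amplified_fraction(selective_nt_per_primer, n_primers=2):
    """Expected fraction of adaptor-ligated fragments that both primers can extend.

    Each selective nucleotide matches the insert with probability 1/4.
    """
    return Fraction(1, 4) ** (selective_nt_per_primer * n_primers)


def expected_band_count(n_fragments, selective_nt_per_primer):
    """Expected number of amplified fragments out of n_fragments candidates."""
    return n_fragments * aflp_amplified_fraction(selective_nt_per_primer)


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, "selective nt per primer ->", aflp_amplified_fraction(n))
    # Hypothetical pool of 48 EcoRI/MseI fragments with one selective nucleotide per primer.
    print(expected_band_count(48, 1))
```

Each extra selective nucleotide on both primers divides the expected number of bands by 16, which is why longer primers are used for more complex genomes.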
• AFLP gel electrophoresis diagram:
o Alternative to interpreting a gel (previous slide)
o Each peak of a specific length represents a PCR product => match peaks & quantify conserved
and differential bands, which allows a phylogenetic tree to be built
o Result depends on: choice of primers & endonucleases & reproducibility
2.4 Random Amplified Polymorphic DNA (RAPD)
• Random Amplified Polymorphic DNA (RAPD)
o = markers are DNA fragments from PCR amplification of random segments of genomic DNA with
single primer of arbitrary nucleotide sequence
o = does not require specific knowledge of the DNA sequence of the target organism:
▪ Instead, use short identical 10-mer primers => because they are short, they bind at certain
positions in the genome => consequence: the primers will or will not amplify a segment of DNA
(PCR), depending on whether sites complementary to the primer sequence are present & close enough
together
▪ No fragment is produced if
• 1) primers annealed too far apart
• 2) 3' ends of the primers are not facing each other
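The two conditions above (sites close enough, 3' ends facing each other) can be illustrated with a small search over a template sequence. This is a simplified sketch: it assumes exact primer matches only, whereas real RAPD reactions tolerate some mismatches, and the 2000 bp size limit, function names and sequences are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def find_all(seq, pattern):
    """Start positions of every (possibly overlapping) occurrence of pattern."""
    positions = []
    start = seq.find(pattern)
    while start != -1:
        positions.append(start)
        start = seq.find(pattern, start + 1)
    return positions


def rapd_products(template, primer, max_len=2000):
    """Sizes of products a single arbitrary primer could amplify from template.

    A product needs the primer sequence on the given strand (extends rightwards)
    and, downstream of it, the primer's reverse complement (primer bound to the
    given strand, extends leftwards), so that the 3' ends face each other.
    """
    n = len(primer)
    forward_sites = find_all(template, primer)
    reverse_sites = find_all(template, reverse_complement(primer))
    sizes = []
    for i in forward_sites:
        for j in reverse_sites:
            size = j + n - i
            if j >= i + n and size <= max_len:
                sizes.append(size)
    return sorted(sizes)


if __name__ == "__main__":
    primer = "GATTACAGGC"  # hypothetical 10-mer
    template = primer + "T" * 30 + reverse_complement(primer)
    print(rapd_products(template, primer))
```

Two sites in the same orientation give no product in this model, because the 3' ends do not face each other.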
• Limitations of RAPD
o Nearly all RAPD markers associated with a trait, are dominant
▪ Consequence: no visible difference between a DNA segment amplified from a locus that is heterozygous
(1 copy) and one that is homozygous (2 copies)
o Co-dominant RAPD markers (different-sized DNA segments amplified from the same locus) are
detected only rarely
o RAPD technique is notoriously laboratory dependent: the quality of the template DNA, the concentrations of
PCR components & the PCR cycling conditions may greatly influence results
o Results can be difficult to interpret (primer/template mismatches => absent/ decreased PCR yield)
2.5 Single Nucleotide Polymorphisms (SNPs)
• Single Nucleotide Polymorphisms (SNPs)
o = single-bp positions at which different sequence alternatives (alleles) exist in a population
▪ E.g. presence or absence of a recognition site (RFLP) = a SNP present or absent
o = highly abundant (1 per 1000 bp in humans)
▪ SNP in coding regions => potential functional impact
▪ SNP in non-coding regions with or without phenotypic impact = potential marker*
• => Association of SNP to a certain phenotypical trait using LINKAGE DISEQUILIBRIUM
o = potentially suitable markers for multifactorial disorders using LINKAGE DISEQUILIBRIUM mapping*
o 2 approaches: random genome-wide SNPs or directed SNPs in specific loci
▪ => Dependent on cost, assay, throughput & accuracy
• Example: genetic variation in the human androgen receptor gene
o = primary indicator of alopecia in men (going bald at an early age)
o => Phenotypic linkage to specific gene(s)
o => Testing potential SNPs for strong correlation
▪ A single SNP could be associated with whether or not you had alopecia
▪ E.g. bowel disease: multiple SNPs at multiple positions determine different levels of the
disorder (not simply on or off)
• Biochemical reaction underlying SNP genotyping (ways to detect SNPs in a biochemical way)
o = Hybridisation or Enzyme based
o 1) Hybridisation with allele-specific oligonucleotides (ASOs) probes
▪ 1) 2 probes (ASOs) complementary to the SNP variants
• The SNP position is centrally located within the probe
▪ 2) Only perfectly matched probes are stable, e.g. A in probe & T in sequence (SNP)
▪ 3) Mismatches are unstable
▪ => hybridization shows whether a certain SNP allele is present or absent
o 2) Allele-specific primer extension
▪ 1) Allele-specific primer anneals to the region adjacent to the SNP
• Primer is complementary to the SNP at the 3’ end
▪ 2) Only matching primers will be extended by DNA polymerase (PCR reaction)
▪ 3) Mismatch => no binding, no extension, no PCR products formed
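The allele-specific primer extension logic can be sketched as follows. This is an idealized model, assuming a primer is extended only when it matches the sample perfectly up to and including the SNP base at its 3' end. For simplicity the sample is written as the strand with the same sense as the primer; the function names, sequences and SNP position are hypothetical.

```python
def primer_extends(primer, sample_strand, snp_index):
    """True if the primer, with its 3' base on the SNP, matches the sample perfectly."""
    start = snp_index - len(primer) + 1
    if start < 0:
        return False
    return sample_strand[start:snp_index + 1] == primer


def call_alleles(sample_strands, snp_index, allele_primers):
    """Alleles for which at least one chromosomal copy yields an extension product.

    sample_strands: one sequence per chromosomal copy of the individual.
    allele_primers: dict mapping allele base -> allele-specific primer.
    """
    detected = set()
    for allele, primer in allele_primers.items():
        if any(primer_extends(primer, s, snp_index) for s in sample_strands):
            detected.add(allele)
    return sorted(detected)


if __name__ == "__main__":
    flank = "ACGTAC"  # hypothetical sequence directly upstream of the SNP
    primers = {"A": flank + "A", "G": flank + "G"}
    copy_a = "ACGTACATT"  # allele A at index 6
    copy_g = "ACGTACGTT"  # allele G at index 6
    print(call_alleles([copy_a, copy_g], 6, primers))
    print(call_alleles([copy_a, copy_a], 6, primers))
```

Running one reaction per allele-specific primer and checking which ones give a product distinguishes homozygous from heterozygous samples in this model.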
o 3) Minisequencing