BHCS2003 Summary Notes
Human genome - complete nucleic acid sequence
• Present as DNA within 23 chromosome pairs
• Contains protein-coding and non-coding genes
• Genetic info stored as DNA
• Total length of the haploid genome is about 1 metre; as most cells are diploid, each contains about 2 metres of DNA
Deoxyribonucleic acid (DNA)
• Made of nucleotide subunits
• Nucleotide has 3 parts
o Five carbon ring – deoxyribose
o Phosphate group
o Nitrogenous base
Adenine (A), thymine (T), Guanine (G), Cytosine (C)
• Successive deoxyribose units linked through carbon atoms labelled 5’ and 3’
• Chains in the DNA double helix are anti-parallel – they run in opposite directions, so the 5’ end of one strand lies opposite the 3’ end of the other
• Enzymes which act on 5’ end cannot act on 3’ (and vice versa)
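The anti-parallel arrangement can be illustrated with a minimal Python sketch. It assumes standard base pairing (A with T, G with C); the function name and the example sequence are made up for illustration.

```python
# Standard DNA base pairing: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand):
    """Return the partner strand, written 5'->3'.

    The input is assumed to be written 5'->3'. Because the two strands are
    anti-parallel, each base is complemented and the order is reversed so
    that the result also reads 5'->3'.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


top = "ATGGCA"                       # made-up sequence, written 5'->3'
bottom = reverse_complement(top)     # partner strand, also written 5'->3'
print(top, bottom)
```

Applying the function twice returns the original sequence, which is a quick check that the two strands are complementary and run in opposite directions.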
Ribonucleic acid (RNA)
• Five-carbon sugar is ribose (one more oxygen than deoxyribose, as a hydroxyl group on the 2’ carbon)
• Nitrogenous bases are A, uracil (U), G and C
• E.g., messenger RNA (mRNA), transfer RNA (tRNA), ribosomal RNA (rRNA)
• Functions
o Structural – ribosomal complexes
o Regulatory – e.g., XIST in Chr X inactivation
o 1000s of non-coding RNA transcribed in mammals essential for physiological processes
DNA is more chemically stable than RNA, and can be copied more faithfully for transmission to daughter cells
DNA found in nucleus, and a small amount in the mitochondria (mitochondrial DNA, mtDNA)
Average chromosome has a DNA double helix with approx. 140 million nucleotides per strand and is 4.8 cm long
Nucleus is only 10 µm in diameter
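The 4.8 cm figure can be checked with a short calculation. This is a sketch using the approximate numbers in these notes and the 0.34 nm spacing between base pairs; the variable and function names are illustrative.

```python
RISE_PER_BP_NM = 0.34        # nm between adjacent base pairs in the double helix
BP_PER_CHROMOSOME = 140e6    # approximate figure from the notes
NUCLEUS_DIAMETER_UM = 10     # approximate figure from the notes


def dna_length_cm(base_pairs):
    """Length of a fully extended DNA double helix in centimetres."""
    length_nm = base_pairs * RISE_PER_BP_NM
    return length_nm * 1e-7  # 1 nm = 1e-7 cm


length_cm = dna_length_cm(BP_PER_CHROMOSOME)
nucleus_cm = NUCLEUS_DIAMETER_UM * 1e-4   # 1 um = 1e-4 cm
ratio = length_cm / nucleus_cm

print(f"Average chromosome: {length_cm:.2f} cm")
print(f"Length relative to nucleus diameter: about {ratio:.0f}x")
```

The result is about 4.76 cm, several thousand times the diameter of the nucleus, which is why the compaction into chromatin is needed.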
Chromosome functions
• Transmission of genetic info
• Expression of genetic info - spatial and temporal control
DNA complexes with histone proteins to form chromatin, which undergoes several levels of coiling and compaction
During division, chromosomes attach to the spindle at their centromeres and become tightly condensed, so genes can't be expressed
During interphase, chromosomes are highly extended so genes can be expressed - chromatin in this extended state is euchromatin
Heterochromatin remains highly condensed throughout the cell cycle and is genetically inactive
Chromosome structure
• Light and dark bands allow for identification
o A visible band has 8-10 million bp containing 10-100 genes
o Different degrees of contraction alter band resolution
• More extended = more bands visible (increased resolution)
• As the chromosome contracts, bands merge and are harder to distinguish (decreased resolution)
• Centromere - point of attachment to mitotic spindle
• Telomere - structures to protect ends of chromosome, important for integrity, abused in cancer cells
Gene loci described as 7 p 15.2
- Chr 7, arm p (short arm), region 1, band 5, subregion 2 (within band)
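The locus notation can be unpacked mechanically. The sketch below is a hypothetical parser that only handles simple locations of the form chromosome, arm, region, band and optional sub-band, as in the example above; it is not a full cytogenetic nomenclature parser.

```python
import re

# Simple cytogenetic location such as "7p15.2":
# chromosome (1-22, X or Y), arm (p or q), one region digit, one band digit,
# and an optional sub-band after the dot.
LOCUS_PATTERN = re.compile(r"^(\d{1,2}|X|Y)([pq])(\d)(\d)(?:\.(\d+))?$")


def parse_locus(locus):
    """Split a locus string into its named parts (spaces are ignored)."""
    match = LOCUS_PATTERN.match(locus.replace(" ", ""))
    if match is None:
        raise ValueError(f"Unrecognised locus: {locus!r}")
    chromosome, arm, region, band, sub_band = match.groups()
    return {
        "chromosome": chromosome,
        "arm": "short (p)" if arm == "p" else "long (q)",
        "region": int(region),
        "band": int(band),
        "sub_band": sub_band,  # None when no sub-band is given
    }


print(parse_locus("7 p 15.2"))
```

For "7 p 15.2" this gives chromosome 7, short arm, region 1, band 5 and sub-band 2, matching the reading given above.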
Genes – discrete segments of DNA used to direct synthesis of proteins and RNA; the part of a DNA molecule that serves as a
template for making functionally important RNA molecules
Coding RNA (mRNA) – contain sequence that is decoded to make proteins
Noncoding RNA – involved in regulation of gene expression, including catalytic RNAs (ribozymes)
Genes are transcribed (‘read’) in the 5’ to 3’ direction
DNA sequences
• Single copy – 50% of human genome, distributed through euchromatin
• Low copy number repeat – 10%, throughout euchromatin, 2-20 copies
• Moderately repeated DNA – 25%, individual copies scattered through entire genome, approx. 500 copies
• Highly repeated DNA – 15%, clustered in heterochromatin as satellite DNA, 10000-500000+ copies
Gene structure
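• Sense (coding) strand has the same sequence as the mRNA (with T in place of U) and is complementary to the antisense (template) strand, which is the strand that is transcribed
• Introns are removed from the primary transcript (pre-mRNA) during splicing as they are non-coding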
• Exons are spliced together to form mature mRNA (exons are smaller than introns, and more consistent in size –
exons usually 300bp)
• The signal to start translation (AUG start codon, or ATG in the gene) lies downstream of the 5’-untranslated region (5’-UTR),
which is transcribed but not translated
• ATG codon is not at the start of the first exon – allows for variable gene regulation
• Last exon at 3’ end contains a 3’-untranslated region (3’-UTR) which is usually long
• 5’ end rich in dinucleotide CG (i.e., C phosphate G) – CpG island (a sign of a promoter region, so signposts the start
of a gene)
• A few genes lack introns and are small – highly expressed genes often have short/no introns
• Large genes have large introns – transcription of long introns takes time (dystrophin transcription takes approx. 16
hours)
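Splicing and the position of the start codon can be sketched on a toy transcript. The sequence and exon coordinates below are invented for illustration (exons in upper case, intron in lower case), and taking the first AUG as the start codon is a simplifying assumption.

```python
def splice(pre_mrna, exons):
    """Join exon segments in order.

    `exons` is a list of (start, end) positions, 0-based and end-exclusive.
    Everything outside these positions (the introns) is discarded.
    """
    return "".join(pre_mrna[start:end] for start, end in exons)


# Made-up primary transcript: exon 1, one intron, exon 2.
pre_mrna = "GGCAUGGCU" + "guaaguuuucag" + "GAAUAAACC"
exons = [(0, 9), (21, 30)]          # hypothetical exon coordinates

mature = splice(pre_mrna, exons)
start = mature.find("AUG")          # first AUG, assumed to be the start codon
five_prime_utr = mature[:start]     # sequence before the start codon

print("Mature mRNA:", mature)
print("5'-UTR:", five_prime_utr)
```

In this toy example the start codon is not the first base of exon 1: the three bases before it form a short 5'-UTR, and the bases after the UAA stop codon form a 3'-UTR.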
Protein-encoding genes
• Large number of proteins in humans due to alternative splicing of pre-mRNA rather than large number of genes
• Distribution of protein-encoding genes is non-random; chr 1, 17 and 19 have high gene density
• Gene-rich and gene-poor areas of chromosomes correlated with chromosome bands and G-C rich regions
• Dystrophin – protein that connects the cytoskeleton of muscle fibre to EC matrix
o DMD gene is one of the largest (>2Mbp)
• Titin (connectin) – giant protein responsible for passive elasticity of muscle
o TTN gene (304,815 bp) has longest coding sequence (114,414 bp), largest exon number (365) and longest
single exon (17,106bp)
• Large genes tend to produce large proteins – EXCEPT:
o DMD gene is 50 times larger than the apolipoprotein B gene, but the total amino acid number in dystrophin is only 80% of
apolipoprotein B
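The gene-size figures above lend themselves to some quick arithmetic. The sketch below uses only numbers from these notes; the DMD size is taken as the 2 Mbp lower bound, so the transcription rate is a rough minimum, and the codon count assumes the quoted coding length is a whole number of codons.

```python
TTN_GENE_BP = 304_815         # TTN gene length from the notes
TTN_CODING_BP = 114_414       # TTN coding sequence length from the notes
DMD_GENE_BP_MIN = 2_000_000   # notes give ">2Mbp", so this is a lower bound
DMD_TRANSCRIPTION_H = 16      # approximate transcription time from the notes

# Fraction of the TTN gene that is coding sequence.
coding_fraction = TTN_CODING_BP / TTN_GENE_BP

# Three nucleotides per codon.
codons = TTN_CODING_BP // 3

# Average transcription rate for DMD in nucleotides per second.
rate_nt_per_s = DMD_GENE_BP_MIN / (DMD_TRANSCRIPTION_H * 3600)

print(f"TTN coding fraction: {coding_fraction:.1%}")
print(f"TTN codons: {codons}")
print(f"DMD transcription rate: at least about {rate_nt_per_s:.0f} nt/s")
```

Even for titin, which has the longest coding sequence, well under half of the gene is coding; for DMD most of the 16 hours is spent transcribing introns.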
Multigene families
• Most genes are unique single copy genes
• Genes with similar functions have arisen through gene duplication and subsequent divergence to form multigene
families
• Some families clustered close together (alpha- and beta-globin gene clusters), where others are dispersed through
genome (PAX genes)
• Classic gene family – high degree of sequence homology
• Gene superfamily – limited homology but functionally related with similar structural domains (e.g., HLA genes)
• Globin genes
o 8 gene loci code for 6 types of globin chains
o Located on chr 11 and 16 with a number of pseudogenes
o Genes on both chromosomes expressed to produce the 2 types of globin chains in various haemoglobin
forms
ENCyclopaedia Of DNA Elements (ENCODE)
• Aims to build a comprehensive list of functional elements in the genome, including those acting at the protein and RNA
level, and regulatory elements that control the cells and circumstances in which a gene is active
• ENCODE 2 and 3 conducted whole-genome analysis
• ENCODE 4 seeks to expand catalogue of regulatory elements by using broader diversity of samples
• Some sequences highly conserved in humans, but not across mammals – indicates 4% of genome is newly under
selection in humans
• Some sequences are highly conserved in mammals but vary between humans – suggests these regions are no longer
functional
• Defined 7576 small RNA molecules, 17905 long non-coding molecules (2019)
• GENCODE
o Part of ENCODE to identify all protein-coding regions within ENCODE regions
o Aims to build encyclopaedia of genes and variants by identifying all gene features in human and mouse
genomes
o Describes protein-coding loci including alternatively transcribed variants, non-coding loci and pseudogenes
Non-coding DNA
• Pseudogenes
o Closely resemble structural genes but not functionally expressed
o Unprocessed – copies produced by gene duplication that are silenced by mutations in coding or regulatory
elements
o Processed – complementary DNA (cDNA) copies made by reverse transcriptase from an mRNA transcript and inserted into
the genome; they lack the promoter sequence needed for expression
o Functional – identified by comparative genome studies, more conserved than expected for functionless
sequences
o Many have regulatory roles
PTEN gene (chr10) regulated by pseudogene PTENP1 (chr9) by competing for binding – PTEN
mutated in advanced cancers, PTENP1 acts as tumour suppressor gene
Non-coding RNA (ncRNA) – accessories needed to process genes to make proteins (e.g., tRNA, rRNA)
Repetitive DNA
• >50% of genome consists of repetitive DNA
• Tandemly repeated sequences – throughout genome, or highly restricted in location
• Highly repeated interspersed sequences found throughout genome
• Short interspersed elements (SINEs) – less than 300bp, spliced into active RNA transcripts (affects activity)
• Long interspersed elements (LINEs) – up to 8kbp long; a small number can self-replicate as retrotransposons,
contributing to genome expansion
o Long terminal repeat-containing (LTR) retrotransposons
Repeats of 100b oriented in same direction
Help transcription (of elements) into RNA, which can be reverse transcribed into DNA and
inserted at new location in genome
o Encode for reverse transcriptase – produce DNA sequences from RNA template
• DNA transposons - Move by different mechanism involving excision of elements and re-insertion elsewhere, most
are immobile
• Exon shuffling
o LINE1 repeats have promoters and can be transcribed – weak poly (A) signals mean transcription
continues until another poly (A) signal is reached = RNA copy has LINE1 as well as downstream sequences
o LINE1 reverse transcriptase gives hybrid cDNA
o Subsequent transposition into a new chromosome location may lead to insertion of downstream exon into a
different gene
• Heterochromatin consists of long arrays of high copy number tandem repeats
• Megabase-sized regions at centromeres
• Chr 1, 9 and 19 have large amount of heterochromatin near centromeres
• Most of the Y chromosome and the short arms of the acrocentric chromosomes (Chr 13, 14, 15, 21 and 22) are heterochromatin
• Short lengths of heterochromatin at telomeres
• Short tandem repeats (STRs)
o Minisatellites
repeat sequences of 100bp to 20kbp
telomeric have TTAGGG repeats
hypervariable minisatellites associated with euchromatin of all chromosomes, especially in sub-
telomeric regions
o Microsatellites
Repeat sequences of <100bp with repeat unit of 1-4bp
Widely dispersed throughout all chromosomes
Form basis of forensic genetics
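Tandem repeats are usually described by how many copies of the repeat unit sit back to back. A minimal Python sketch of counting them is given below; the function name and test sequences are made up, and real forensic typing uses validated marker sets rather than a simple pattern search like this.

```python
import re


def longest_tandem_run(sequence, unit):
    """Return the largest number of back-to-back copies of `unit`."""
    # Find every stretch made up of one or more consecutive copies of the unit.
    runs = re.findall(f"(?:{re.escape(unit)})+", sequence)
    return max((len(run) // len(unit) for run in runs), default=0)


# Made-up sequences for illustration.
telomere_like = "ACGT" + "TTAGGG" * 4 + "CA" + "TTAGGG" * 2
microsatellite = "CC" + "GATA" * 7 + "TT"

print(longest_tandem_run(telomere_like, "TTAGGG"))
print(longest_tandem_run(microsatellite, "GATA"))
```

The copy number at a given microsatellite varies between individuals, and that variation is what makes these repeats useful as genetic markers.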
Mitochondrial DNA (mtDNA)
• Mitochondria – principal site of ATP production by oxidative phosphorylation
• 5-10 copies of mtDNA in each mitochondrion
• Several 100 proteins found in mitochondrion encoded by nuclear genes
• Nuclear DNA vs mtDNA
Human reference genome (Ensembl genome browser, July 2012)
• Chromosome lengths calculated by multiplying number of bp by 0.34nm (distance between bp in DNA double
helix)
• Number of proteins based on number of inherited precursor mRNA transcripts
• Does NOT include products of alternative pre-mRNA splicing
• Does NOT include modification to protein structure after translation
• Number of variations is a summary of unique DNA sequence changes within sequences
• Large number of non-expressed functional sequences identified throughout human genome