NEXT GENERATION SEQUENCING BIOINFORMATICS
Introduction to DNA and Genomics
DNA (1953): James Watson, Francis Crick, Maurice Wilkins en Rosalind Franklin 1st genome (1980-2000)
Personal genome:
- Genome costs now <1000 EUR
- > 10 0000 cancer genomes sequences
- 100.000 healthy genomes
- +1 Mio Genomes Europa
What is genetic variation?
• Differences in DNA content or structure among individuals
• Any two individuals have ~99.8% identical DNA.
• But the human genome is big - each haploid set of 23 chromosomes has 3.1 billion nucleotides.
• There are >100,000,000 know genetic variants in the human genome
• Effectively infinite combinations of alleles. The details matter.
Types of genetic variation
Single-nucleotide polymorphisms Insertion-deletion polymorphisms Structural variants (SVs)
(SNPs) (INDELs)
“DNA spelling mistakes” “extra or missing DNA” “Large blocks of extra, missing or
rearranged DNA”
A typical human genome
"We find that a typical [human] genome differs from the reference human genome at 4.1 million to 5.0 million sites.
Although >99.9% of variants consist of SNPs and short indels, structural variants affect more bases: the typical
genome contains an estimated 2,100 to 2,500 structural variants (~1,000 large deletions, ~160 copy-number variants,
~915 Alu insertions,
~128 L1 insertions, ~51 SVA insertions, ~4 NUMTs, and ~10 inversions),
affecting ~20 million bases of sequence.
Nucleotide diversity (𝚷): 1/756 bp to 1/620 bp
Applications Of Genomic Variation
GENOMIC MEDICINE
Helps in diagnosis and understanding of various diseases.
NON-INVASIVE PRENATAL TEST (NIPT)
Used for detecting genetic abnormalities in fetuses.
MENDELIAN DISORDERS
▸ ~7,000 Mendelian or monogenic disorders that collectively impact ~0.4% of live births
▸ Many have been discovered by ‘‘trio-based sequencing’’ of unaffected parents and an affected offspring
▸ Even if there is no cure, accurate diagnoses can provide meaningful resolution for patients and families, connect
them to disease-specific support networks, inform prognosis and co-morbidities, and facilitate family planning.
Trio sequencing
1
, SMA - SPINAL MUSCULAR ATROPHY → Baby PIA
▸ To exclude all problematic SMN1 variants, WGS (=whole genome sequencing) is needed
▸ Will we ultimately sequence the whole genome of all individuals ?
BRCA1 MUTATION
BRCA1 is also a mutation, leads to breast cancer.
Is also an example for genetic mutations
COMPLEX DISEASES;
POLYGENIC RISK
▸ Many traits are polygenic
▸Wide Association Study
▸ For example: Alzheimer’s disease
GENETIC META-ANALYSIS OF DIAGNOSED ALZHEIMER’s disease
Manhattn plot. Each spot is a location in the genome associated with a p-value.
Each spot is a SNP that has an association with the disease. So we have hundred SNP’s.
APOE = a large SNP and if we zoom in (we see dia 20)
Assessed through genome-wide association studies (GWAS), using SNPs to calculate disease risk.
GENE PRIORITIZATION
Polygenic risk scores estimate the likelihood of developing conditions like Alzheimer’s.
QUANTIFY GENETIC RISK
▸Long tail: people can benefit from sequence info
▸How can genetic info help if you’re not in the long tail?
▸e.g., when you get a disease, can help with diagnosis
If we have genetic risk, try to calculate polygenic risk. Score gives you how likely you are to get Alzheimers disease.
80% is a high risk of getting Alzheimer.
CANCER GENOMICS
▸ Somatic mutations = are non-inheritable DNA changes that occur in non-germline cells, often due to
environmental factors, and can contribute to diseases like cancer.
▸ Far more so than in the other areas discussed above, driver genes and mutations in cancer provide
clear molecular targets for therapeutic agents.
▸ non- small cell lung cancers with activating somatic mutations in the EGFR kinase => EGFR kinase
inhibitor gefitinib
▸ TCGA and PACWG: broad surveys
▸ about half of common tumors contain one or more clinically relevant mutations, predicting sensitivity
or resistance to specific agents or suggesting clinical trial eligibility
▸ Tumors shed DNA in the blood => circulating tumor DNA (ctDNA) => liquid biopsies
TUMOUR
- Somatic Mutations: Mutations in cancer cells provide molecular targets for therapy (e.g., EGFR mutations in lung cancer).
- Circulating Tumor DNA (ctDNA): Tumors shed DNA into the bloodstream, enabling liquid biopsies. Eg.: BRAF Mutation in Melanoma.
PCAWG PAN-CANCER
2
, SOMATIC BRAF MUTATION IN MELANOMA
APPLICATIONS OF GENOMIC VARIATION
- Traits & Ancestry: Genomic variants help trace human evolution and lineage (e.g., 23andMe).
- NGS: Modern methods used to detect genetic variation, sequencing genomes rapidly and comprehensively.
LENGTH
TRAITS
ANCESTRY
▸ Genetic variants are the "bread crumbs" for tracking evolution
History of genetic variation study
23ANDME
- Early Genetic Study: Began with agriculture and hemophilia in royal families,
accelerated after DNA discovery.
- Sanger Sequencing: Early method of DNA sequencing.
- Modern Techniques: Arrays and NGS led to an explosion of genomic testing, including
direct-to-consumer services (e.g., 23andMe).
(NGS) METHODS TO DETERMINE GENETIC VARIATION
- High-Density Arrays: Genotype millions of positions in the genome at low cost.
- Illumina NovaSeq6000: A machine that outputs billions of sequencing reads quickly.
- Trio Sequencing: Sequences parents and child to detect genetic conditions.
- Mapping Paired-End Sequencing: Aligning sequencing reads to a reference genome for analysis.
HISTORY OF STUDYING GENETIC VARIATION
▸ Long history of understanding heritability
▸ Selection of animals and plants in ancient agriculture
▸ Hemophilia in european royal families
▸ Gained momentum after discovery of DNA
▸ Slow and painful
▸ Early DNA sequencing approaches (Sanger)
▸ Molecular markers (eg RFLP)
RESTRICTION FRAGMENT LENGTH POLYMORPHISM
Restriction enzymes cut DNA– yielding fragments of different size
▸ Mutations may disrupt this pattern
▸ Link disruptions to disease
Types of mutations: Point mutations, structural alterations, and mutations in regulatory regions,
although the latter are harder to interpret.
ARRAYS AND NGS HAVE RESULTED IN AN EXPLOSION OFGENOMIC TESTING
▸ high-density DNA microarrays to genotype millions of specific positions in each of many human genomes. Coupled
with population-based maps of linkage disequilibrium (LD), array-based genotyping enables the ascertainment of
most common genetic variation in a human genome for a remarkably low cost (initially hundreds, now tens, of dollars
per individual) => GWAS and direct- to-consumer (eg. 23andme, MyHeritage, Ancestry,..)
3
,▸ massively parallel DNA sequencing technologies, which have steadily improved since their
introduction in 2005, can generate billions of short sequencing reads within a day or less => next-
generation sequencing or NGS now permit the near-comprehensive ascertainment of both rare and
common genetic variation for about $1,000 per individual (or a few hundred dollars, if one selectively
sequences the exome or coding regions of the genome).
ILLUMINA NOVASEQ6000 OUTPUT
TRIO SEQUENCING
INDEXING / BARCODES
=> Demultiplex reads => generates 2 fastq files for each sample (forwards and
reverse read)
MAPPING PAIRED-END SEQUENCING DATA TO A REFERENCE GENOME
TYPES OF GENOME ALTERATIONS THAT CAN BE DETECTED BY NEXT-GENERATION SEQ.
TYPES OF POINT MUTATIONS IN PROTEIN-CODING GENES
MUTATIONS IN REGULATORY REGIONS ARE HARDER TO INTERPRET
4
, ML APPROACHES IS UNDERSTANDING GENETIC VARIATION
EVALUATING AND PROCESSING RAW SEQUENCE DATA
TWO CASE STUDIES
◗ Cancer
◗ you will analyze a T-ALL patient and will try to find somatic mutations in NOTCH1 gene
◗ Mendelian disorders
◗ you will analyze a family trio with a lipid metabolism disorder
◗ you will try to find disease associated variants in the affected individuals
TYPICAL REFERENCE-BASED NGS ANALYSIS PIPELINE
2 samples of individual (before and after)
Most steps are rather black box. Algorithms are not very well
understood.
Need to understand the formats to understands output.
PROCESS
We start with sequencing, Illumina sequencing. It takes our input DNA and gives us back many reads from random
places in the genome.
THE GOAL
Find places where the align data is different from reference data.
Red line: variance, 1 location is different
FASTQ FILE
What comes out of the sequencer, don’t try to open it in Word. It’s a large file.
First one: identifier (starts with a @), than we have de sequence, than quality sequence.
ILLUMINA - SEQUENCING BY SYNTHESIS
5
, FASTQ AND PHRED
Every character has a number, explanation mark is 33.
A Phred score indicates the reliability of a DNA base call. The
higher the score, the lower the error probability, and the more
accurate the base:
• Score 10: 10% error probability.
• Score 20: 1% error probability.
• Score 30: 0.1% error probability.
QUALITY CONTROL OF FASTQ
◗ Garbage in = Garbage out
◗ Quality control is the first step of any data analysis
◗ This is where you first meet your data
◗ Identify problems then improve of them.
QUALITY CONTROL TOOLS: FASTQC
Tool that thakes my FASTQ-data → for statistics
Helps you decide if sequence trimming is needed before alignment.
Trimming = the process of removing low-quality bases or adapter sequences from raw
sequencing data to improve accuracy. Trimming ensures cleaner data and more reliable alignments.
It is needed if:
• Low Phred scores are present at sequence ends.
• Adapters or contaminants are detected.
• Poor-quality regions affect downstream alignment.
Helps you evaluate library enrichment / complexity. But note that different experiment
types are expected to have vastly different duplication profiles.
Helps evaluate adapter contamination
FASTQ PROCESSING TOOLS
◗ Trimming low quality bases → FASTX-Toolkit
◗ The -l 50 option says that base 50 should be the last base (i.e., trim down to 50 bases)
◗ the -Q 33 option specifies how base qualities on the 4th line of each fastq entry are encoded.
◗ Adapter trimming
◗ Cutadapt
◗
◗ The -m 22 option says to discard any sequence that is smaller than 22 bases after trimming. This avoids
problems trying to map very short, highly ambiguous sequences.
◗ The -O 10 option says not to trim 3' adapter sequences unless at least the first 10 bases of the adapter are
seen at the 3' end of the read. This prevents trimming short 3' sequences that just happen by chance to
match the first few adapter sequence bases.
◗ Flexbar
◗ Trimmomatic
◗ Fastp
REFERENCE-BASED NGS ANALYSIS
◗ Sequence of human genome was completed in 2001
PROBLEMS
◗ The human genome is big and complex.
◗ Half of the human genome is comprised of repeats
◗ Sequencers can produce up to 20 billion reads per run
◗ NovaSeq 6000 system outputs 6 Tb of data <2 days (~48 whole genomes)
◗ But they make mistakes. Frequently.
◗ Accurate alignment takes time, but it’s worth it.
◗ Alignment strategy is highly nuanced, depending on experimental context
6
, MAPPING
Best case scenario: an error free sequencing technology
Differences between reference and alignment.
HASH-BASED MAPPING
Step 1: hash/index the genome
Step 2: use the index to map reads
Step 3: local alignment
◗ There are >50 short-read aligners (see here)
◗ Running nearly all aligners follows a very similar procedure
◗ build index of the reference
◗ align
◗ We will stick to BWA in this class (specifically, BWA-MEM)
◗ there are three BWA algorithms
◗ BWA-backtrack (for shorter reads)
◗ BWA-SW (for longer reads, but with more gaps)
◗ BWA-MEM (longer reads, most commonly used)
MAPPING RESULTS: SAM FORMAT
Understanding the CIGAR string
CIGAR = “Concise Idiosyncratic Gapped Alignment Report “
M=match I=insertion D=deletion S=soft clip
You have to be ware, did you want paired ends? Than you need
to do your mapping differently.
Read: ACGCA-TGCAGTtagacgt
Ref: ACTCAGTG--GT
Cigar: 5M1D2M2I2M7S
EVALUATION OF MAPPED READS
◗ SAM files are great but
◗ are a bit too large – a compressed format would be better
◗ cannot be easily searched – would be great to filter or query them
◗ SAM→BAM (Binary Alignment Format)
SAM file BAM file
◗ information on the alignment of each read ◗ compression saves space
◗ optimized for readability and sequential access ◗ may be sorted and indexed
◗ the file is not readable by eye
Your default format should be BAM – only turn it into SAM when viewing the file
7
, SAM/BAM HIERARCHY
VISUALIZING ALIGNMENT FILES
◗ Samtools tview - text based alignment viewer, In terminal type the following command for a scrollable view:
◗ or try -d T option in Jupyter notebook → IGV: Integrative Genomics Viewer
◗ http://igv.org/
◗ Desktop application and web app
◗ Go to the web app, make sure we use genome version hg19.
◗ Upload tracks (see slide 3)
◗ Go to NOTCH1:
◗ and zoom in a little:
Blue lines:
Brown lines: No confirmation
Red lines: in the beginning a SNP in all the locations: homozygote SNP
Green lines: not present in all reads = heterozygoot
PRACTICAL SESSION 1
◗ “T- cell acute lymphocytic leukemia (T-ALL) is a cancer that starts from the early
version of a subset of white blood cells called T - lymphocytes in the bone marrow
(the soft inner part of the bones, where new blood cells are made). “
◗ Data comes from a T-ALL patient
◗ Illumina single-end exome re-sequencing
◗ 1 x 75bp
◗ subset of the data (only reads mapping to end of chr9 ~ 10Mb)
→ 2 samples, the sample name is TLE55_T.fq
"Hypobetalipoproteinemia is a disorder consisting of low levels of LDL cholesterol or apolipoprotein B."
◗ In people who do not have the genetic disorder hypobetalipoproteinemia, a low cholesterol level may be a marker
for poor nutrition, wasting disease, cancer, hyperthyroidism, and liver disease.
◗ A trio, with mother and child exhibiting low LDL cholestrol
◗ Lipid metabolism disorder ?
◗ Autosomal dominant
◗ Both mother & child are affected
◗ Can we find any disease associated variants?
EVALUATING AND PROCESSING RAW SEQUENCE DATA
Retrieve the data to analyze
1) Make a new folder in Jupyter and upload (in the folder)
2) Check in which folder we are:
→pwd
3) Now check the contents of your new folder with the ls command. If you run this for the first time it may contain
nothing more than the ipython notebook(s):
→ l -ls
4) We analyze fastq files from 2 different samples, a tumor sample (TLE66_T.fq) and normal control (TLE_66_N.fq).
They are located in mnt/nfs/data:
→ ls -lh /mnt/nfs/data/Case1/TLE66_?.fq
5) To create a symbolic link to the files we need for this session run:
→ ln -sf /mnt/nfs/data/Case1/TLE66_?.fq .
6) You should see the fastq files for tumor (T) and a fastq files for the normal (N) sample. Check with ls command.
→ pwd
→ ls -l
7) What is the 1st sequencing record in the file TLE66_T.fq file?
Hint: use the headcommand→ head -8 TLE66_T.fq
→ head -8 TLE66_T.fq
→ tail -4 TLE66_T.fq
8) How many sequence reads are there in the fastq files? Any difference between normal and tumor samples?
8
, Hint: use the wc -l command (but remember that fastq format contains 4 lines per sequence!)
wc -l TLE66_T.fq (count number of lines) → 1911552
wc -l TLE55_N.fq → 2162496
echo '1911552/4' | bc → 477888
echo '2162496/4' | bc → 540624
FASTQ Evaluation Tools
9) FASTQ evaluation tool
→fastqc TLE66_N.fq TLE66_T.fq -o . → output downloaden (HTML)
10) Note that the last two files (the *.fq files) are symlinks, you can see where they point to.
→ ls -lth
Mapping
11) We will use BWA for mapping. Type in bwa to invoke help. Note, for many linux command line commands, you
can get help by typing the command without any arguments. It often returns with a help text.
→bwa
12) We will use bwa mem algorithm for mapping our tumor and normal samples. Again we can get help by just typing
the command without any extra arguments:
→ bwa mem
13) Construct the actual command by inputting the actual filenames. Make Sam-file
bwa mem \
/mnt/nfs/data/resources/Homo_sapiens_assembly19_sorted.fasta \
TLE66_N.fq > TLE66_N.sam
bwa mem \
/mnt/nfs/data/resources/Homo_sapiens_assembly19_sorted.fasta \
TLE66_T.fq > TLE66_T.sam
14) You made 2 SAM files, check them
→ ls -lth
Understanding and assessing alignment results
15) However, they are large, so if you try to display everything it will crash your notebook. So you should use head to
look only a the first part of the file:
→ head -n 30 TLE66_N.sam
16) How many reads have been mapped for the normal sample? We need to exclude lines starting with @ symbol
which indicates header lines (use grep -v); and count the number of lines with wc -l command)
→ cat TLE66_N.sam | grep -v '^@' | wc -l→ 540707
17) Number of reads mapped for tumor sample:
→ cat TLE66_T.sam | grep -v '^@' | wc -l→ 477976
Look at the number of mapped reads, and the number of reads that are in fq files. Do you notice a discrepancy? There
are more reads in the sam file compared to fastq file. This is because bwa can report multiple alignments for a read (if
a read maps to several genomic loci). Given that we were only mapping against chr9, we should be able find a few by
looking for other chromosomes:
Specifically note the XA:Z flag - showing 'alternative hits'
→ grep -v '^@' TLE66_T.sam | grep chr4 | head -3
18) The sixth column in the SAM file is called the CIGAR string (more information) which indicated what the alignment looks
like, how many nucleotides match (M) are inserted (I), deleted (D) or clipped (H/S).
Can you find out how many reads have been mapped without mismatches? (CIGAR = 75M)? You can modify and
use the following commands:
→ cat TLE66_N.sam | awk '$6=="75M"' | wc -l → 530200
→ cat TLE66_T.sam| awk '$6=="75M"' | wc -l → 467860
19) Give some examples of reads with mismatches or soft clips. Note - use head to prevent crashing your notebook:
chromosome, length, mismatches, strength
→ cat TLE66_N.sam | grep -v '^@'| awk '$6=="75M"' | head -5
→ cat TLE66_N.sam| grep -v '^@' | awk '$6!="75M"' | head -5
Evaluation of the mapped reads
20) Start the Samtools-tool for the analyse of sequence data (help commands)
→ samtools
→ santools view
21) We'll now convert our sam files to compressed bam files:
- first we `cat` our sam files into a `grep` to get rid of all hits that have alternative hits (eg, have the `XA:Z` flag).
- then we use samtools to convert to bam:
- `samtools view` is the core command
- `-b` ensure bam output format
- `-q 20` filters on reads with a quality score of at least 20
- the extra `-` indicates that samtools is reading data from stdin
- and finally `-o <FILENAME>` indicates where to write the output to
→ cat TLE66_N.sam | grep -v -e 'XA:Z:' -e 'SA:Z:' | samtools view -b -q 20 - -o TLE66_N.bam
→ cat TLE66_T.sam | grep -v -e 'XA:Z:' -e 'SA:Z:' | samtools view -b -q 20 - -o TLE66_T.bam
22) Compare the sizes of SAM and BAM files. (Hint: use ls -lth command.)
9