A Hierarchical Learning Approach to Calibrate Allele Frequencies for SNP Based Genotyping of DNA Pools

Andrew D. Hellicar, Daniel Smith, Ashfaqur Rahman, Ulrich Engelke
Computational Informatics, CSIRO
Hobart, TAS, 7000, Australia

John Henshall
Division Animal, Food and Health Sciences, CSIRO
Armidale, NSW 2350, Australia

Abstract— The combination of low density SNP arrays and DNA pooling is a fast and cost effective approach to genotyping that opens up basic genomics to a range of new applications and studies. However, we have identified significant limitations in the existing approach to calculating allele frequencies with DNA pooling. These limitations include a reduced ability to deal with SNP to SNP variation via the standard interpolation method. Our contribution is a new hierarchical learning framework which resolves these drawbacks. The framework involves a hierarchy of two greedily trained layers of learners. The first layer learns the bias of each SNP, then applies a calibration to reduce SNP bias by mapping into a common coordinate system across all SNPs. The second layer learns an allele frequency function exploiting the global SNP data. A range of algorithms has been applied, including linear regression, neural network and support vector regression. The framework has been tested on pooled samples of Black Tiger prawns that have been genotyped with low density Sequenom iPLEX panels. Analysis of pooled samples and the corresponding individually genotyped SNP samples indicates that the pooling approach introduces an allele frequency RMS error of 0.12. The existing calibration approach corrects ~14% of the error. Our hierarchical approach is 4.5 times as effective, correcting ~64% of the introduced error. This is a significant reduction and has the potential to enable genetic studies previously not possible due to allele frequency error. Although testing so far is limited to low density SNP arrays, the approach was developed to generalize to other SNP genotyping technologies.

Keywords—Machine learning, DNA

I. INTRODUCTION

Single Nucleotide Polymorphism (SNP) based genotyping is a fast and cost effective approach to identify functionally important polymorphisms of a species [1]. Such gene markers can be linked with disease or complex traits, or be used to provide family information. Multiplex microarray systems have been developed for SNP genotyping to provide sufficiently dense coverage for genome wide association studies [2, 3]. For instance, Affymetrix array technology [2] can interrogate 906,600 SNPs, whilst Illumina [3] released the Human Omni5 Beadarray that can genotype 4.3 million SNPs. The costs associated with developing such high density technologies are still prohibitive for genotyping many species. This is particularly true for species where research has not yet been conducted to identify polymorphic markers that provide coverage of the genome or genes of interest. Even for species where SNP microarrays have been developed, SNP based association studies still require a large number of samples to be genotyped. This is an expensive exercise given each microarray can only be used once. DNA pooling is an attempt to address this issue by combining multiple DNA samples prior to genotyping. Each pool is genotyped as a single sample, greatly reducing the number of microarrays and hence the cost and time required to undertake a study [4-5].

For the typical case of genotyping individuals, SNP alleles are arbitrarily labeled as A or B and the SNP genotype is one of AA, AB, BB, due to the presence of two copies of the DNA. The raw output of a microarray is therefore quantized into one of three possible values. This quantization means the value can be retrieved despite the presence of genotyping noise. In contrast to individual SNPs, DNA pools require the raw array output to be used to directly compute a quantitative genotype of each SNP [4]. This is known as the allele frequency. Pooled samples are subject to greater genotyping inaccuracies than individual samples [4-6]. This is a consequence of the continuity of pooled allele frequency estimates, which are more susceptible to genotyping noise than the discrete alleles of bi-allelic SNP data.

Low density SNP array technologies have been utilized to genotype species where genomic research is limited and where it is economically infeasible to invest in expensive, higher density microarrays. The Sequenom MassARRAY iPLEX platform [7] is one such low cost technology that genotypes between tens and a few thousand SNPs. By combining low density SNP technology with DNA pooling, costs can be greatly reduced, opening up genomics to a range of low cost studies and applications. Aquaculture is one such application, and this paper forms part of an evaluation of pool based genotyping of Black Tiger prawns (Penaeus monodon) for a selective breeding program. The study genotyped 22 pooled samples (each comprising between 18 and 23 prawns) using a Sequenom iPLEX panel of 63 SNPs. The pooled assays are being considered to construct the pedigree of individual farmed prawns, which are not sufficiently valued to employ high density, individual genotyping.

There have been few feasibility studies using low density and low cost arrays with pooled samples. Pooling studies have generally focused upon higher density SNP based microarrays that provide genome wide coverage [8-11]. In this paper, we investigate the accuracy of using low density SNP arrays to
genotype pooled samples. To examine the effect of pooling upon these lower cost technologies, pooled allele frequencies are compared to “ground truth” allele frequencies computed from genotypes of the individuals belonging to the same pool.

The contribution of this paper is an allele frequency estimation method based upon supervised machine learning, which we propose to correct for errors associated with pooled allele frequencies. The estimation method involves two stages. The first stage calibrates each SNP by correcting for the bias exhibited in the SNP's raw outputs. These biases exist for Sequenom [12] and Illumina systems and are caused by combinations of differential amplification and hybridization [4]. Bias creates errors in the pooled allele frequencies when estimating allele frequency directly. To correct for this bias, the pooled results are mapped onto a common domain across all of the SNPs. The second stage then involves training a model which estimates allele frequency as a function on this common domain. Both stages use supervised training based on “ground truth” allele frequencies. The machine learning algorithms are implemented with SVM and radial basis neural networks. In addition, we examine the linearity of the error characteristics by comparing the accuracy of linear and non-linear methods.

The paper is structured as follows. Section II outlines previous related work that has considered the correction of allele frequency estimates from DNA pools and discusses how our machine learning based approach differs. In Section III, we investigate the accuracy of the pooled allele frequency estimates with the Sequenom iPLEX platform. This investigation is used to motivate our proposed calibration method for pooled allele estimation that is described in Section IV. In Section V, we present the results of our calibration methodology and draw our conclusions in Section VI.

II. BACKGROUND

Genotyping errors are recognized to have an impact on the conclusions drawn from a study; however, they are too often neglected. In [13] genotyping errors are defined as a discrepancy between the observed and true genotype of an individual. Four classes of errors are identified, related to (1) variation in DNA sequence, (2) low quantity and quality of DNA, (3) biochemical artifacts, and (4) human factors. A protocol for estimating error rates within these classes is proposed.

In this work, we are particularly concerned with errors stemming from incorrect allele frequency estimates, and hence misleading biological conclusions, from pooled DNA samples as compared to individual DNA samples. In SNP genotyping systems, such as the ones by Sequenom and Illumina, these errors are a result of biochemical reactions during the genotyping process and hence fall mainly into category (3) of the above classification (notwithstanding any other potential error sources). Whereas Illumina systems typically utilize a process based on differential hybridization and fluorescent detection, the Sequenom iPLEX platform is based on single nucleotide primer extension and mass spectrometry [14]. Although these errors may have only a minor impact on genotype calling for individuals, they have a major impact on the estimation of allele frequencies from pooled DNA data. Identifying and correcting for the bias and errors is therefore instrumental for the success of the estimation of allele frequencies in pooled DNA experiments.

Several research contributions aimed at solving this problem exist. In [15] the degree of bias is quantified using the coefficient of preferential amplification/hybridization (CPA), which is defined as the ratio of average peak intensities between two alleles. It was found that lognormal distributions adequately model bias introduced through preferential hybridization, resulting in reduced error of allele frequency estimation for the human genome. The authors in [16] propose a SNP genotyping method based on a general linear model that accounts for the nested structure of the data. The proposed method does not require the CPA to be known and hence avoids the need for individual SNP genotyping to determine the allelic ratio of hybridization, therefore scaling up to arrays with many thousands of SNPs. Finally, in [17] piecewise linear interpolation of pooled alleles was used to correct for the bias in pooled DNA data of the human genome.

Unlike any of the previous methods, we use a hierarchical machine learning approach to achieve both individual SNP calibration and bias corrected allele frequency estimation on DNA pooled data. We show that piecewise linear interpolation is not always optimal and that for some SNPs Lagrange or Hermite interpolation results in improved performance. To the best of our knowledge, this is the first line of research that utilizes these techniques to successfully correct for bias in DNA pooled data. Given the application to low-cost arrays, this framework is particularly beneficial in the agriculture and aquaculture domain, and in the following it is demonstrated on pooled Black Tiger prawn (Penaeus monodon) data.

III. ACCURACY OF ALLELE FREQUENCY ESTIMATES

We investigate the accuracy of the pooled allele frequency estimates with the Sequenom iPLEX platform.

The existing approach for calculating allele frequencies follows the process shown on the left of Fig. 1. The following paragraphs describe each of the steps in the process.

A. Data sets

We have four data sets, including three data sets from the Sequenom iPLEX platform. Two data sets (iE1, iE2 in Fig. 1) contain SNP results for individuals, and a third (pE in Fig. 1) contains SNP results for pools of individuals. A final pool membership database (pM) identifies which individuals were in each of the pooled samples.

Additional sub-sets of the available data were generated for analysis. 48 individuals had duplicate samples in iE2, allowing measurement repeatability to be assessed. 78 individuals in iE2 were also sequenced in iE1. Of the 63 SNPs called by the iPLEX platform, 2 SNPs consistently failed and were initially
This work was supported in part by a grant from the Tasmanian Government, which is administered by the Tasmanian Department of Economic Development, Tourism and the Arts, and in part by the CSIRO Food Futures Flagship.