The variant and genotype calling pipeline described in this vignette is intended for recent or ancient allopolyploid species, in which the reference genome sequence contains many paralogous regions that do not recombine with each other at meiosis. In highly duplicated genomes such as these, conventional alignment software, especially when restricted to only returning one alignment, will often align sequences to the incorrect paralog, causing issues with variant and genotype calling downstream. The pipeline described in this vignette uses read depth distribution across a population (the \(H_{ind}/H_E\) statistic) to determine whether a group of tags is behaving like a Mendelian locus. Attempting to optimize this statistic, it rearranges tags among paralogous alignment locations to correct alignments and generate well-behaved markers.
This pipeline is currently only designed for natural populations,
diversity panels, or mapping populations in which the most recent
generation was produced by random intermating among all progeny. F1,
self-fertilized, and backcrossed populations will not be processed
correctly because expected heterozygosity cannot be estimated directly
from allele frequencies in these populations. However, the \(H_{ind}/H_E\) statistic can still be used
from within polyRAD to filter non-Mendelian markers from these
populations using the HindHeMapping
function.
If you use this pipeline, please cite:
Clark LV, Mays W, Lipka AE, and Sacks EJ (2020) A population-level statistic for assessing Mendelian behavior of genotyping-by-sequencing data from highly duplicated genomes. bioRxiv doi: 10.1101/2020.01.11.902890, Version 1.
Below is an overview of the pipeline. It uses a combination of existing bioinformatics tools, custom Python scripts included with polyRAD, and R functions within polyRAD.
process_sam_multi.py
to group tags based on sets of
alignment locations.readProcessSamMulti
. Run HindHe
on the
imported data to identify problematic samples and estimate
inbreeding.process_isoloci.py
to sort tags into their correct
alignment locations within the groups of paralogs identified by
process_sam_multi.py
, and filter out tags that cannot be
adequately sorted.readProcessIsoloci
, then perform genotype calling.Everything except for the R portion of this pipeline will need to be run on your operating system’s Terminal/Shell/Command Prompt. Bowtie2 requires Mac or Linux. Python 3 is needed for the Python portions of the pipeline. If any of this presents a barrier to you, I recommend finding someone experienced with bioinformatics at your institution who can give you some advice.
Line breaks have been inserted in some terminal commands for readability in this tutorial. (Those same line breaks should not be typed at the command line.)
An overview of the TASSEL-GBSv2 pipeline is available at https://bitbucket.org/tasseladmin/tassel-5-source/wiki/Tassel5GBSv2Pipeline. TASSEL can be downloaded from https://www.maizegenetics.net/tassel.
We will only run the first couple steps of the TASSEL-GBS pipeline. The files that you will need to start with are the key file and your FASTQ files.
-c
parameter to something higher
than the default (10 is still pretty conservative but will discard most
sequencing errors) to save some processing downstream.The commands might look like (on Windows):
run_pipeline.bat -fork1 -GBSSeqToTagDBPlugin -e PstI-MspI -i D:\Msa\raw_data\
-db D:\Msa\Msa.db -k D:\Msa\key.txt -kmerLength 80 -endPlugin -runfork1
run_pipeline.bat -fork1 -TagExportToFastqPlugin -c 10 -db D:\Msa\Msa.db
-o D:\Msa\Msa_tags.fq -endPlugin -runfork1
run_pipeline.bat -fork1 -GetTagTaxaDistFromDBPlugin -db D:\Msa\Msa.db
-o D:\Msa\Msa_ttd.txt -endPlugin -runfork1
You may need to adjust the memory parameters to accomodate a large dataset.
If for some reason you can’t use TASSEL-GBS (for example, your FASTQ files don’t have inline barcodes, or your protocol doesn’t involve restriction enzymes) you might be able to craft a custom alternative for yourself. What you will need out of this step is a file formatted identically to a TagTaxaDist file from TASSEL, and a FASTA or FASTQ file containing the same sequence tags as the TagTaxaDist file.
A TagTaxaDist file is a tab-delimited text file formatted as below, and contains the read depth of every tag in every sample (taxon). I have used a simple convention for sample names here, but the names can be anything you want. Depending on the complexity of your dataset, it can be quite a large file.
Tag | Sam1 | Sam2 | Sam3 | Sam4 |
---|---|---|---|---|
TGCAGAAATCATAGATTAAGGATAT | 0 | 24 | 1 | 0 |
TGCAGAACAGGATACGATACCCCTT | 2 | 0 | 0 | 0 |
TGCAGAACAGTATACGATACCCCTT | 0 | 5 | 0 | 0 |
Next, align the FASTQ file generated in step 2 of the previous section to your reference genome. If you have never used Bowtie2 on your reference genome before, you will first need to make an index.
bowtie2-build Msinensis_497_v7.0.hardmasked.fa Msi_DH1_7
Importantly, you will need to set the -k
argument to
something higher than the number of subgenomes within your reference
genome. For example, in an allotetraploid do at least -k 3
,
and in an allohexaploid at least -k 4
.
For my Miscanthus dataset, I ran:
bowtie2 -k 3 --very-sensitive -x Msi_DH1_7 -U Msa_tags.fq -S Msa_align.sam
Before proceeding, we will want to get a look at the data to identify any outlier samples that may be a different ploidy from the rest, interspecific hybrids, or highly contaminated. I recommend removing them to avoid biasing \(H_{ind}/H_E\) estimates, but keep in mind that this means they will be excluded from all downstream analysis. If a sample is important to your study and you know of a reason why it would have a different heterozygosity from most samples, you should keep it.
We will import just 1000 loci using readProcessSamMulti
.
Sequence tags will be assigned preliminary alignment locations based on
where they had highest sequence similarity, or to a random location if
there was a tie. This is essentially a preview of what the data would
look like if we went ahead with variant calling using the top alignments
returned by Bowtie2.
library(polyRAD)
<- readProcessSamMulti("Msa_split_1_align.csv") myRADprelim
We will then estimate a \(H_{ind}/H_E\) matrix for this data, as well as getting a sum of the read depth for each individual.
<- HindHe(myRADprelim)
hh <- rowSums(myRADprelim$locDepth) TotDepthT
Now we will look at the distribution of \(H_{ind}/H_E\) across samples, and remove outlier samples.
<- rowMeans(hh, na.rm = TRUE)
hhByInd
plot(TotDepthT, hhByInd, xlog = TRUE,
xlab = "Depth", ylab = "Hind/He", main = "Samples")
abline(h = 0.5, lty = 2)
For a diploid population, we aren’t expecting values above 0.5. (In general, you shouldn’t see values above \(\frac{ploidy - 1}{ploidy}\).) So, let’s check which are the outlier samples, then remove them from the dataset.
<- mean(hhByInd) + 3 * sd(hhByInd)
threshold threshold
## [1] 0.8184066
> threshold] hhByInd[hhByInd
## KMS389 JY204 KMS359 KMS365 KMS394 KMS361 KMS454-66
## 1.0038990 0.9136827 1.0982234 1.0417214 0.9956114 1.2012728 1.0125090
## KMS397 KMS444 UI11-00032
## 0.9393743 1.0006418 0.9345475
<- hh[hhByInd <= threshold,] hh
<- SubsetByTaxon(myRADprelim, rownames(hh)) myRADprelim
If you remove any samples like this, you should also make a text file indicating samples to retain in the analysis, which we will use in the next Python step.
writeLines(rownames(hh), con = "samples.txt")
Now we can get a sense of what \(H_{ind}/H_E\) looks like across loci. From
this we can estimate inbreeding and get a rough sense of what proportion
of loci will need to be adjusted or filtered by
process_isoloci.py
.
<- colMeans(hh, na.rm = TRUE)
hhByLoc
hist(hhByLoc, breaks = 50, xlab = "Hind/He", main = "Loci", col = "lightgrey")
Again, we expect values below 0.5 (i.e. \(\frac{ploidy - 1}{ploidy}\)). The tail that we observe above that value likely represents groups of tags that aligned to the same location but in fact represent paralogous loci. The values near zero likely represent loci with very high amplification bias or overdispersion. We will take the peak at 0.3 to represent typical well-behaved markers.
To estimate inbreeding, we’ll also need to estimate overdispersion. You may want to check a broader range of values than 9 to 14. In the matrix of quantiles, we are looking for columns where the 25th percentile is about 0.25, the 50th percentile is about 0.5, etc.
<- TestOverdispersion(myRADprelim, to_test = 9:14)
overdispersionP
sapply(overdispersionP[names(overdispersionP) != "optimal"],
probs = c(0.01, 0.25, 0.5, 0.75, 0.99)) quantile,
## Optimal value is 12.
## 9 10 11 12 13 14
## 1% 0.02579315 0.01944909 0.01477767 0.01141705 0.008937821 0.006978093
## 25% 0.25334097 0.24125874 0.22646624 0.21165050 0.197336617 0.185307346
## 50% 0.55288835 0.54168785 0.52330409 0.50582868 0.490038345 0.477713465
## 75% 0.84390808 0.83639461 0.82940721 0.82350760 0.818544664 0.813849077
## 99% 1.00000000 1.00000000 1.00000000 1.00000000 1.000000000 1.000000000
If \(H_{ind}/H_E\) is 0.3 and overdispersion is 12, we can estimate inbreeding to be about 0.4 using the figure below, which was generated using simulated data.
If all markers were Mendelian, we can see what the expected distribution of \(H_{ind}/H_E\) would look like, given the sample size and read depth distribution observed in the dataset.
ExpectedHindHe(myRADprelim, inbreeding = 0.4, ploidy = 2, overdispersion = 12)
## Simulating rep 1
## Completed 5 simulation reps
## Mean Hind/He: 0.286
## Standard deviation: 0.0891
## 95% of observations are between 0.144 and 0.481
We can now import the sorted dataset into polyRAD. Since there is a
small number of individuals in this dataset, we will lower the filtering
thresholds for loci we import. The possiblePloidies
argument is left at the default of 2
because although the
species is allotetraploid, the sorted loci should now behave in a
diploid fashion.
<- readProcessIsoloci("Msa_split_1_sorted.csv", min.ind.with.reads = 80,
myRAD min.ind.with.minor.allele = 5)
We can see what the distribution of \(H_{ind}/H_E\) looks like after we have sorted and filtered isoloci.
<- HindHe(myRAD)
hh2 <- rowMeans(hh2, na.rm = TRUE)
hh2ByInd <- colMeans(hh2, na.rm = TRUE) hh2ByLoc
hist(hh2ByInd, xlab = "Hind/He", main = "Samples", breaks = 20, col = "lightgrey")
hist(hh2ByLoc, xlab = "Hind/He", main = "Loci", breaks = 50, col = "lightgrey")
By default, readProcessIsoloci
runs
MergeRareHaplotypes
internally. (This behavior can be
turned off with mergeRareHap = FALSE
.) Those loci that now
appear to have a very inflated \(H_{ind}/H_E\) are ones where all true minor
alleles were rare enough to be merged into the common allele, and any
remaining minor alleles only represent sequencing error or alleles that
were amplified very poorly. We can get rid of these.
mean(hh2ByLoc <= 0.5) # proportion of loci retained
## [1] 0.926087
<- names(hh2ByLoc)[hh2ByLoc <= 0.5]
keeploci head(keeploci)
## [1] "Chr01-000136198-top" "Chr01-000151033-top" "Chr01-000152532-top"
## [4] "Chr01-000260665-bot" "Chr01-000282781-top" "Chr01-000530373-bot"
hist(hh2ByLoc[keeploci], xlab = "Hind/He", main = "Loci", breaks = 50, col = "lightgrey")
<- SubsetByLocus(myRAD, keeploci) myRAD
Genotype calling can then be performed as normal.
<- IteratePopStruct(myRAD, overdispersion = 12) myRAD
If you want genotypes output as traditional SNP markers, you can use
the RADdata2VCF
function.
RADdata2VCF(myRAD, file = "Msa_test.vcf")