Estimate Genome Size of Polyploid Species Using K-mer Frequencies

library(findGSEP)
#> Loading required package: RColorBrewer
#> Loading required package: ggplot2

Description

findGSEP is a function for estimating the genome size of polyploid species by iteratively fitting k-mer frequencies with a normal distribution model.

To use findGSEP, prepare a histo file containing two tab-separated columns. The first column gives the frequency at which k-mers occur in the reads, and the second column gives the number of distinct k-mers observed at that frequency. Both the k-mer size k and the corresponding histo file are required for any estimation.
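As an illustration of the two-column layout described above, the following sketch writes a tiny histo file and reads it back. The file name toy_example.histo and all counts are made up for demonstration; they do not come from real sequencing data or from the findGSEP package.

```r
# Write a tiny, made-up k-mer histogram to a temporary file.
# Column 1: k-mer frequency; column 2: number of distinct k-mers at that frequency.
histo_file <- file.path(tempdir(), "toy_example.histo")  # hypothetical file name
toy <- data.frame(
  frequency = 1:6,
  count     = c(5000, 1200, 300, 800, 1500, 900)          # illustrative values only
)
write.table(toy, histo_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Read it back as two tab-separated numeric columns.
histo <- read.table(histo_file, sep = "\t", header = FALSE,
                    col.names = c("frequency", "count"))

# Basic sanity checks on the layout before passing a file to findGSEP.
stopifnot(ncol(histo) == 2, all(histo$frequency > 0), all(histo$count >= 0))

# Total number of k-mer occurrences represented by the histogram.
total_kmers <- sum(as.numeric(histo$frequency) * as.numeric(histo$count))
```

A quick check like this can catch a wrong delimiter or a stray header line before the fitting step is run.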

Required R packages include pracma and fGarch; see the DESCRIPTION file for the full list.

Usage

findGSEP(
  path,
  samples,
  sizek,
  exp_hom,
  ploidy,
  range_left,
  range_right,
  xlimit,
  ylimit,
  output_dir
)

Arguments

Example Usage

To run the algorithm, follow these steps:

  1. Prepare a Path: Create a directory where the histo file will be stored. For example, create a directory named test_findGSEP.

  2. Put Histo File in the Path: Place your histo file in the test_findGSEP directory. In this example, the histo file name is ara_simulate_4ploidy_25x_rep4.histo.

  3. Provide Output Directory: Specify the output directory where the results will be saved. In this example, we use tempdir() as the output directory.

  4. Run the Algorithm: Use the following command to run the algorithm with the specified parameters:

    findGSEP(
        path = 'test_findGSEP',
        samples = 'ara_simulate_4ploidy_25x_rep4.histo',
        sizek = 21,
        exp_hom = 35,
        ploidy = 4,
        output_dir = tempdir(),
        range_left = 35 * 0.2, ## exp_hom*0.2
        range_right = 35 * 0.2, ## exp_hom*0.2
        xlimit = -1, ## will calculate automatically
        ylimit = -1 ## will calculate automatically
    )

  5. Output: The output will include:

    • A PDF file named ${samples}._hap_genome_size_est.pdf, which contains the estimated genome size.
    • A CSV file named ${samples}._haploid_size.csv, which contains the predicted genome size.
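The steps above could be scripted roughly as follows. This is a sketch under stated assumptions: source_histo is a hypothetical location for the histo file, the input directory is created under tempdir() rather than the working directory, and the call to findGSEP is skipped when the package or the histo file is not available. The output files are located by extension only, since the exact file names depend on the sample name.

```r
sample_name  <- "ara_simulate_4ploidy_25x_rep4.histo"
source_histo <- file.path("path", "to", sample_name)     # hypothetical source location

# Step 1: create the input directory (placed under tempdir() in this sketch).
input_dir <- file.path(tempdir(), "test_findGSEP")
dir.create(input_dir, showWarnings = FALSE, recursive = TRUE)

# Step 2: copy the histo file into the input directory, if it exists.
if (file.exists(source_histo)) {
  file.copy(source_histo, file.path(input_dir, sample_name), overwrite = TRUE)
}

# Step 3: choose the output directory.
output_dir <- tempdir()

# Step 4: run the estimation only when the package and the input file are present.
exp_hom <- 35
ready <- requireNamespace("findGSEP", quietly = TRUE) &&
  file.exists(file.path(input_dir, sample_name))
if (ready) {
  findGSEP::findGSEP(
    path        = input_dir,
    samples     = sample_name,
    sizek       = 21,
    exp_hom     = exp_hom,
    ploidy      = 4,
    range_left  = exp_hom * 0.2,
    range_right = exp_hom * 0.2,
    xlimit      = -1,   # calculated automatically
    ylimit      = -1,   # calculated automatically
    output_dir  = output_dir
  )
}

# Step 5: list PDF and CSV files whose names start with the sample name.
results <- list.files(output_dir, pattern = "\\.(pdf|csv)$", full.names = TRUE)
results <- results[startsWith(basename(results), sample_name)]
```

Deriving range_left and range_right from a single exp_hom variable keeps the three values consistent when the expected homozygous coverage changes.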

References

Laiyi Fu, Yanxin Xie, Shunkang Ling, Hequan Sun, et al. findGSEP: a web application for estimating genome size of polyploid species using k-mer frequencies.

Session Info

R version 4.3.3 (2024-02-29)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.4.1

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Asia/Shanghai
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] findGSEP_1.2.0     dplyr_1.1.4        png_0.1-8          scales_1.3.0       fGarch_4033.92    
[6] pracma_2.4.4       ggplot2_3.5.0      RColorBrewer_1.1-3

loaded via a namespace (and not attached):
 [1] Matrix_1.6-5        gtable_0.3.5        compiler_4.3.3      fBasics_4032.96     gbutils_0.5        
 [6] tidyselect_1.2.1    cvar_0.5            timeSeries_4032.109 yaml_2.3.8          fastmap_1.1.1      
[11] lattice_0.22-5      R6_2.5.1            generics_0.1.3      knitr_1.45          rbibutils_2.2.16   
[16] tibble_3.2.1        spatial_7.3-17      munsell_0.5.1       timeDate_4032.109   pillar_1.9.0       
[21] rlang_1.1.3         utf8_1.2.4          xfun_0.43           pkgload_1.3.4       cli_3.6.2          
[26] withr_3.0.0         magrittr_2.0.3      Rdpack_2.6          digest_0.6.35       grid_4.3.3         
[31] rstudioapi_0.16.0   lifecycle_1.0.4     vctrs_0.6.5         evaluate_0.23       glue_1.7.0         
[36] fansi_1.0.6         colorspace_2.1-0    rmarkdown_2.26      htmltools_0.5.8.1   tools_4.3.3        
[41] pkgconfig_2.0.3