Figshare+
SZFBb.tar.gz (332.15 MB)
SZFBa.tar.gz (362.39 MB)
SVA6b.tar.gz (342.69 MB)
SVA6a.tar.gz (354.89 MB)
SVA12b.tar.gz (340.04 MB)
SVA12a.tar.gz (361.05 MB)
STHb.tar.gz (346.41 MB)
STHa.tar.gz (360.83 MB)
SN1v3b.tar.gz (340.45 MB)
SN1v3a.tar.gz (506.07 MB)
SKFBb.tar.gz (363.71 MB)
SKFBa.tar.gz (357.1 MB)
SHH5b.tar.gz (339.19 MB)
SHH5a.tar.gz (343.71 MB)
SDFBb.tar.gz (343.38 MB)
SDFBa.tar.gz (363.88 MB)
S8Hb.tar.gz (342.93 MB)
S8Ha.tar.gz (339.9 MB)
H3S7b.tar.gz (345.43 MB)
H3S7a.tar.gz (364.53 MB)
24 files

Cannabis Pangenome Scaffolded Genomes

dataset
posted on 2024-05-30, 21:48 authored by Ryan Lynch, Lillian Padgitt-Cobb, Andrea R. Garfinkel, Brian Knaus, Nolan Hartwick, Nicholas Allsing, Anthony Aylward, Allen Mamerto, Justine Kipruto Kitony, Kelly Colt, Emily Murray, Tiffany Duong, Aaron Trippe, Seth Crawford, Kelly Vining, Todd Michael

Abstract

Cannabis sativa is a globally significant seed-oil, fiber, and drug-producing plant species. However, a century of prohibition has severely restricted legal breeding and germplasm resource development, leaving potential hemp-based nutritional and fiber applications unrealized. Existing cultivars are highly heterozygous and lack competitiveness in the overall fiber and grain markets, relegating hemp to less than 200,000 hectares globally [1]. The relaxation of drug laws in recent decades has generated widespread interest in expanding and reincorporating cannabis into agricultural systems, but progress has been impeded by the limited understanding of genomics and breeding potential. No studies to date have examined the genomic diversity and evolution of cannabis populations using haplotype-resolved, chromosome-scale assemblies from publicly available germplasm. Here we present a cannabis pangenome, constructed with 181 new and 12 previously released genomes from a total of 156 biological samples from both male (XY) and female (XX) plants, including 42 trio-phased and 36 haplotype-resolved, chromosome-scale assemblies. We discovered widespread regions of the cannabis pangenome that are surprisingly diverse for a single species, with high levels of genetic and structural variation, and propose a novel population structure and hybridization history. Conversely, the cannabinoid synthase genes contain very low levels of diversity, despite being embedded within a variable region containing multiple pseudogenized paralogs and distinct transposable element arrangements. Additionally, we identified variants of acyl-lipid thioesterase (ALT) genes [2] that are associated with fatty acid chain length variation and the production of the rare cannabinoids tetrahydrocannabinol varin (THCV) and cannabidiol varin (CBDV).
We conclude the Cannabis sativa gene pool has only been partially characterized, and that the existence of wild relatives in Asia remains likely, while its potential as a crop species remains largely unrealized.

1. United Nations. Commodities at a glance: Special issue on industrial hemp. Commod Glance (2023) doi:10.18356/9789210019958.

2. Pulsifer, I. P. et al. Acyl-lipid thioesterase1-4 from Arabidopsis thaliana form a novel family of fatty acyl-acyl carrier protein thioesterases with divergent expression patterns and substrate specificities. Plant Mol. Biol. 84, 549–563 (2014).

Pangenome assembly and scaffolding

All genomes labeled Hifiasm_HiC, Hifiasm_Trio_RagTag, Hifiasm_RagTag, and Hifiasm (Supplementary Table 1) were assembled using Hifiasm v0.16.1 [1]. When available, HiC data and HiFi parental trio data were also incorporated into the assembly process, defining the Hifiasm_HiC and Hifiasm_Trio_RagTag types, respectively. CLR (continuous long reads) assemblies were generated using FALCON Unzip from PacBio SMRT Tools 9.0 Suite [15], and CCS (circular consensus sequencing) labeled genomes were assembled with HiCanu v2.2 [16]. After assembly, HiC reads were aligned to the Hifiasm_HiC contigs using the Juicer v1.6.2 pipeline [2], followed by ordering and orientation using version 180922 of the 3D-DNA pipeline [3]. The scaffolded assemblies were then manually corrected using Juicebox v1.11.08 [4]. Hifiasm_RagTag and Hifiasm_Trio_RagTag assemblies were scaffolded using the split chromosomes of the 24 HiC scaffolded genomes and error checked with yak-0.1 (github.com/lh3/yak). Sourmash v4.6.1 [17] was used to generate a Jaccard similarity matrix between the chromosomes and each un-scaffolded assembly, and the most similar versions of chromosomes 1 through X were concatenated to generate a reference for scaffolding via RagTag v2.1.0 [18]. If the similarity matrix identified the Y chromosome as the best match, the assembly remained un-scaffolded. BUSCO v5.4.3 [21] with the eudicots_odb10 dataset and assembly-stats v1.0.1 (https://github.com/sanger-pathogens/assembly-stats) were used on all assemblies to measure completeness and contiguity.
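The reference-selection step can be illustrated with a minimal sketch. It is not the authors' pipeline and does not call sourmash: exact k-mer sets stand in for MinHash sketches, and the function names, the `chrY` label, and the reading of "best match" as the single highest-scoring chromosome overall are all assumptions made for illustration.

```python
from typing import Dict, Optional, Set, Tuple


def kmers(seq: str, k: int = 5) -> Set[str]:
    """Return the set of all k-mers in a sequence (toy stand-in for a MinHash sketch)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |A & B| / |A | B|; taken as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pick_reference(
    assembly: Set[str],
    chromosomes: Dict[Tuple[str, str], Set[str]],
) -> Optional[Dict[str, str]]:
    """Choose, per chromosome name, the source genome most similar to the assembly.

    `chromosomes` maps (genome, chromosome_name) to a k-mer set.
    Returns {chromosome_name: genome}, or None when the best hit overall
    is a Y chromosome (assumption: the assembly is then left un-scaffolded).
    """
    scores = {key: jaccard(assembly, ks) for key, ks in chromosomes.items()}
    if not scores:
        return None
    best_overall = max(scores, key=scores.get)
    if best_overall[1] == "chrY":  # hypothetical label for the Y chromosome
        return None
    best: Dict[str, Tuple[str, float]] = {}
    for (genome, chrom), score in scores.items():
        if chrom == "chrY":
            continue
        if chrom not in best or score > best[chrom][1]:
            best[chrom] = (genome, score)
    return {chrom: genome for chrom, (genome, _) in best.items()}
```

In this sketch the returned mapping names, for each chromosome, the genome whose copy would be concatenated into the scaffolding reference; the concatenation and the RagTag run itself are not shown.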

1. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).

2. Durand, N. C. et al. Juicer Provides a One-Click System for Analyzing Loop-Resolution Hi-C Experiments. Cell Syst 3, 95–98 (2016).

3. Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95 (2017).

4. Durand, N. C. et al. Juicebox Provides a Visualization System for Hi-C Contact Maps with Unlimited Zoom. Cell Syst 3, 99–101 (2016).

15. Chin, C.-S. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods 13, 1050–1054 (2016).

16. Nurk, S. et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 30, 1291–1305 (2020).

17. Brown, C. T. & Irber, L. sourmash: a library for MinHash sketching of DNA. J. Open Source Softw. 1, 27 (2016).

18. Alonge, M. et al. Automated assembly scaffolding using RagTag elevates a new tomato system for high-throughput genome editing. Genome Biol. 23, 258 (2022).

21. Smit, A. F. A., Hubley, R. & Green, P. RepeatMasker Open-4.0. 2013–2015. Preprint at (2015).
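For the contiguity measurement mentioned in the assembly section, the most commonly reported statistic is N50. The sketch below shows the usual definition, assuming contig or scaffold lengths are already available as integers; it is an illustration rather than the code of assembly-stats, and the function name is hypothetical.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


# Example: total length 28, half is 14; 10 + 8 = 18 crosses it, so N50 is 8.
print(n50([10, 8, 5, 3, 2]))
```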

Funding

NSF Postdoctoral Fellowship in Biology

Directorate for Biological Sciences

Tang Genomics Fund

To develop high-quality genome assemblies of heterozygous cassava varieties and new tools for pangenome analyses to serve breeding programs that need this detailed genomic understanding for more efficient breeding

Bill & Melinda Gates Foundation

Research Institution(s)

The Salk Institute for Biological Studies

Contact email

tmichael@salk.edu

I confirm there is no human personally identifiable information in the files or description shared

  • Yes

I confirm the files and description shared may be publicly distributed under the license selected

  • Yes

Competing Interest Statement

S.C. was a co-founder of Oregon CBD. A.R.G. and A.T. were employees of Oregon CBD. R.C.L. is a stakeholder in Saint Vrain Research LLC, which manufactures hemp-based products. T.P.M. is a founder of the carbon sequestration company CQuesta.