Figshare+

Files (7)

  • EDTAOutput.tar.gz (1.94 GB)
  • TRANS_query_coord.bed.tar.gz (200.01 kB)
  • INVs_query_coord.bed.tar.gz (70.26 kB)
  • INVTR_query_coord.bed.tar.gz (226.2 kB)
  • DUP_query_coord.bed.tar.gz (232.63 kB)
  • csat_orientations.tsv (16.66 kB)
  • cannabinoid_synthase_annotations.tar.gz (29.63 kB)

Cannabis Pangenome Annotation Data

dataset
posted on 2024-05-30, 21:48 authored by Ryan Lynch, Lillian Padgitt-Cobb, Andrea R. Garfinkel, Brian Knaus, Nolan Hartwick, Nicholas Allsing, Anthony Aylward, Allen Mamerto, Justine Kipruto Kitony, Kelly Colt, Emily Murray, Tiffany Duong, Aaron Trippe, Seth Crawford, Kelly Vining, Todd Michael

Abstract

Cannabis sativa is a globally significant seed-oil, fiber, and drug-producing plant species. However, a century of prohibition has severely restricted legal breeding and germplasm resource development, leaving potential hemp-based nutritional and fiber applications unrealized. Existing cultivars are highly heterozygous and lack competitiveness in the overall fiber and grain markets, relegating hemp to less than 200,000 hectares globally [1]. The relaxation of drug laws in recent decades has generated widespread interest in expanding and reincorporating cannabis into agricultural systems, but progress has been impeded by the limited understanding of genomics and breeding potential. No studies to date have examined the genomic diversity and evolution of cannabis populations using haplotype-resolved, chromosome-scale assemblies from publicly available germplasm. Here we present a cannabis pangenome, constructed with 181 new and 12 previously released genomes from a total of 156 biological samples from both male (XY) and female (XX) plants, including 42 trio-phased and 36 haplotype-resolved, chromosome-scale assemblies. We discovered widespread regions of the cannabis pangenome that are surprisingly diverse for a single species, with high levels of genetic and structural variation, and propose a novel population structure and hybridization history. Conversely, the cannabinoid synthase genes contain very low levels of diversity, despite being embedded within a variable region containing multiple pseudogenized paralogs and distinct transposable element arrangements. Additionally, we identified variants of acyl-lipid thioesterase (ALT) genes [2] that are associated with fatty acid chain length variation and the production of the rare cannabinoids tetrahydrocannabivarin (THCV) and cannabidivarin (CBDV).
We conclude the Cannabis sativa gene pool has only been partially characterized, and that the existence of wild relatives in Asia remains likely, while its potential as a crop species remains largely unrealized.

1. United Nations. Commodities at a glance: Special issue on industrial hemp. Commod Glance (2023) doi:10.18356/9789210019958.

2. Pulsifer, I. P. et al. Acyl-lipid thioesterase1-4 from Arabidopsis thaliana form a novel family of fatty acyl-acyl carrier protein thioesterases with divergent expression patterns and substrate specificities. Plant Mol. Biol. 84, 549–563 (2014).

Transposable element analysis

To identify transposable elements, we used the EDTA pipeline with default settings. EDTAOutput.tar.gz includes EDTA transposon annotations for 78 scaffolded, chromosome-level cannabis genomes.
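As a starting point for exploring the annotations, the following is a minimal Python sketch that tallies feature types across GFF3 files inside a tarball. It assumes that EDTAOutput.tar.gz contains tab-separated GFF3 annotation files ending in `.gff3`; the internal layout of the archive is not described here, so the suffix and the helper names `count_gff3_feature_types` and `iter_gff3_members` are illustrative assumptions.

```python
import io
import tarfile
from collections import Counter


def count_gff3_feature_types(lines):
    """Tally column 3 (feature type) over GFF3 records, skipping comments."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # blank line, header, or directive
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a well-formed 9-column GFF3 record
        counts[fields[2]] += 1
    return counts


def iter_gff3_members(archive_path, suffix=".gff3"):
    """Yield (member name, text-line iterator) for matching files in a .tar.gz.

    The suffix is an assumption about how annotation files are named.
    Consume each iterator before advancing to the next member.
    """
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(suffix):
                handle = tar.extractfile(member)
                yield member.name, io.TextIOWrapper(handle, encoding="utf-8")


if __name__ == "__main__":
    # Assumes the archive has been downloaded to the working directory.
    for name, lines in iter_gff3_members("EDTAOutput.tar.gz"):
        top = count_gff3_feature_types(lines).most_common(5)
        print(name, top)
```

Reading members directly from the archive avoids unpacking the full 1.94 GB to disk; if the files use a different suffix, pass it through the `suffix` argument.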

Structural variation analysis

The 78 fully scaffolded assembly haplotypes were each aligned to the EH23a reference assembly using minimap2 (Li 2018). SyRI was then used to call structural variants (SVs) from each alignment (Goel et al. 2019), and plotsr was used to visualize the alignments and SVs (Goel and Schneeberger 2022).
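The align, call, and plot steps can be sketched as a small Python helper that assembles the three command lines without running them. The exact parameters used for this dataset are not stated here: the `asm5` preset and `--eqx` flag follow the usage shown in the SyRI documentation and are assumptions, as are the file names (`EH23a.fasta`, `query_haplotype.fasta`, `genomes.txt`) and the function name `build_sv_commands`.

```python
import shlex


def build_sv_commands(ref_fasta, query_fasta, prefix, threads=4):
    """Return argument lists for the align, call, and plot steps (nothing is executed)."""
    sam_path = f"{prefix}.sam"
    # Whole-genome alignment to SAM. The asm5 preset and --eqx are assumed
    # settings taken from the SyRI documentation, not from this dataset.
    align = ["minimap2", "-ax", "asm5", "--eqx", "-t", str(threads),
             "-o", sam_path, ref_fasta, query_fasta]
    # SV calling from the alignment; -F S declares SAM input.
    call = ["syri", "-c", sam_path, "-r", ref_fasta, "-q", query_fasta,
            "-F", "S", "--prefix", f"{prefix}."]
    # Visualization. genomes.txt is a hypothetical file listing the FASTA
    # paths that plotsr should draw.
    plot = ["plotsr", "--sr", f"{prefix}.syri.out", "--genomes", "genomes.txt",
            "-o", f"{prefix}.png"]
    return [align, call, plot]


if __name__ == "__main__":
    commands = build_sv_commands("EH23a.fasta", "query_haplotype.fasta", "query_vs_EH23a")
    for cmd in commands:
        print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the tools are installed; keeping command construction separate from execution makes it easy to inspect the commands for all 78 haplotypes before launching them.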

  • DUP_query_coord.bed.tar.gz includes duplications for the 78 assemblies with EH23a as reference
  • INVTR_query_coord.bed.tar.gz includes inverted translocations for the 78 assemblies with EH23a as reference
  • INVs_query_coord.bed.tar.gz includes inversions for the 78 assemblies with EH23a as reference
  • TRANS_query_coord.bed.tar.gz includes translocations for the 78 assemblies with EH23a as reference

  • csat_orientations.tsv is a scaffold orientation file for the 78 assemblies with EH23a as reference
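The BED archives listed above can be summarized with a short Python sketch such as the one below, which counts intervals and sums their lengths per sequence. It assumes standard tab-separated BED with 0-based, half-open coordinates in the first three columns; the file names suggest that intervals are given in query-assembly coordinates, but that is an inference. The function name `summarize_bed` and the example file name are hypothetical.

```python
from collections import defaultdict


def summarize_bed(lines):
    """Return {chrom: {"count": n, "bp": total_length}} for BED intervals."""
    summary = defaultdict(lambda: {"count": 0, "bp": 0})
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments, and optional BED header lines.
        if not line or line.startswith(("#", "track ", "browser ")):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # not a valid BED record
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        summary[chrom]["count"] += 1
        # BED intervals are half-open, so length is simply end - start.
        summary[chrom]["bp"] += end - start
    return dict(summary)


if __name__ == "__main__":
    # Hypothetical path to one BED file extracted from an archive.
    with open("example_INV_query_coord.bed", encoding="utf-8") as handle:
        for chrom, stats in sorted(summarize_bed(handle).items()):
            print(chrom, stats["count"], stats["bp"])
```

Running the same summary on the duplication, inversion, inverted translocation, and translocation files gives a quick per-chromosome comparison of SV burden across assemblies.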

Funding

NSF Postdoctoral Fellowship in Biology

Directorate for Biological Sciences

Tang Genomics Fund

To develop high-quality genome assemblies of heterozygous cassava varieties and new tools for pangenome analyses to serve breeding programs that need this detailed genomic understanding for more efficient breeding

Bill & Melinda Gates Foundation

Research Institution(s)

The Salk Institute for Biological Studies

Contact email

tmichael@salk.edu

I confirm there is no human personally identifiable information in the files or description shared

  • Yes

I confirm the files and description shared may be publicly distributed under the license selected

  • Yes

Competing Interest Statement

S.C. was a co-founder of Oregon CBD. A.R.G. and A.T. were employees of Oregon CBD. R.C.L. is a stakeholder in Saint Vrain Research LLC, which manufactures hemp-based products. T.P.M. is a founder of the carbon sequestration company CQuesta.