Supporting data for "Low-coverage whole genome sequencing for a highly selective cohort of severe COVID-19 patients"
Background
Despite advances in the identification of genetic markers associated with severe COVID-19, the full genetic characterisation of the disease remains elusive. Imputation of low-coverage whole genome sequencing (lcWGS) data has emerged as a competitive method to study such disease-related genetic markers, as it enables genotyping of most of the common genetic variants used in genome-wide association studies. This study aims to explore the potential use of imputation in lcWGS for a highly selected cohort of severe COVID-19 patients.
Findings
We generated an imputed dataset of 79 patient variant call format (VCF) files using the GLIMPSE1 tool, each containing, on average, 9.5 million single nucleotide variants. The validation assessment of imputation accuracy yielded a squared Pearson correlation of approximately 0.97 across sequencing platforms, showing that GLIMPSE1 can be used to confidently impute variants with minor allele frequency as low as approximately 2% in individuals of Spanish ancestry. We also conducted a comprehensive analysis of the patient cohort, examining hospitalisation and intensive care utilisation, sex- and age-based differences, and clinical phenotypes, using a standardised set of medical terms specifically developed to characterise severe COVID-19 symptoms in this cohort.
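As an illustration of the accuracy metric mentioned above, the following is a minimal sketch of how a squared Pearson correlation between imputed dosages and reference genotypes could be computed. The function name `squared_pearson` and the toy values are hypothetical and are not taken from the dataset; the sketch also does not attempt to reproduce how the validation was aggregated across variants, frequency bins, or sequencing platforms.

```python
def squared_pearson(imputed_dosages, true_genotypes):
    """Squared Pearson correlation (r^2) between two equal-length sequences.

    imputed_dosages: expected alternate-allele counts in [0, 2] (assumed input).
    true_genotypes:  reference alternate-allele counts, 0, 1 or 2 (assumed input).
    """
    n = len(imputed_dosages)
    if n != len(true_genotypes):
        raise ValueError("inputs must have the same length")
    if n < 2:
        raise ValueError("at least two observations are required")

    mean_x = sum(imputed_dosages) / n
    mean_y = sum(true_genotypes) / n

    # Sums of centred cross-products and squares; the 1/n factors cancel in r^2.
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(imputed_dosages, true_genotypes))
    var_x = sum((x - mean_x) ** 2 for x in imputed_dosages)
    var_y = sum((y - mean_y) ** 2 for y in true_genotypes)

    if var_x == 0 or var_y == 0:
        # r^2 is undefined when one of the vectors is constant.
        raise ValueError("correlation is undefined for a constant input")

    return (cov * cov) / (var_x * var_y)


# Toy example with made-up values, for illustration only.
toy_imputed = [0.1, 1.9, 1.0, 0.0, 2.0, 1.1]
toy_truth = [0, 2, 1, 0, 2, 1]
toy_r2 = squared_pearson(toy_imputed, toy_truth)
```

Values close to 1 indicate that imputed dosages track the reference genotypes closely. In practice such a metric is typically evaluated within minor allele frequency bins, since accuracy tends to fall as variants become rarer.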
Conclusion
This dataset highlights the utility and accuracy of lcWGS imputation in the study of COVID-19 severity, setting a precedent for other applications in resource-constrained environments. The methods and findings presented here may be leveraged in future genomic projects, providing vital insights for health challenges like COVID-19.