Peptide Conformers Dataset - PeptideCs
The dataset aims to describe the potential energy surfaces of dipeptides (near-)completely at the density functional theory (DFT) level. Knowledge of these energy surfaces is important, for example, for solving the protein folding problem from first principles.
The dataset consists of DFT (BP86-D3/dgauss-DZVP) calculations of CH3COO-(X)n-NHMe peptides (monopeptides and dipeptides) of the 20 proteinogenic amino acids in their protonation state at pH = 7 (except for histidine, where both the two deprotonated forms and the one protonated form were considered).
The geometries were optimized with the GFN2-xTB semiempirical method and the ALPB solvation model (water) in the xTB program, with harmonically constrained dihedral angles, to map the respective energy surface systematically. This combination of semiempirical geometry optimization and DFT single-point calculations has been used previously to describe the potential energy surface of peptides.
All resulting geometries were then optimized to the closest local minimum. These minima were clustered to remove similar conformations, and the lowest-energy conformations were kept.
For all conformations, DFT calculations with the BP86 functional, the dgauss-DZVP basis set and custom dispersion parameters were performed in Turbomole 7.5.1 with the COSMO solvation model in water; energies, energy gradients and natural population analysis charges were obtained. In addition, COSMO-RS calculations were carried out with COSMOTherm 2020 to obtain COSMO-RS energies in different solvents.
Visualizations of Ramachandran-like plots of the potential energy surfaces from this dataset can be found at peptidecs.uochb.cas.cz.
All archive files can be unpacked into the same directory; after unpacking, they yield the structure described below. It is also possible to unpack only some of the archives: the "minima_*.tar" archives contain only local minima, and the "constrained_*.tar" archives contain data from the constrained structures. Archives with the "*_main.tar" suffix contain most of the files; coordinates, energy gradients and NBO charges are stored in separate archives.
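The selective unpacking described above could be scripted with Python's standard tarfile module. This is only a minimal sketch: the function name unpack_archives and the directory names in the usage note are hypothetical, and only the archive name patterns are taken from the description.

```python
import glob
import os
import tarfile


def unpack_archives(archive_dir, target_dir, pattern="*.tar"):
    """Extract every archive in archive_dir that matches pattern into target_dir.

    Returns the sorted list of archive paths that were unpacked.
    """
    os.makedirs(target_dir, exist_ok=True)
    archives = sorted(glob.glob(os.path.join(archive_dir, pattern)))
    for path in archives:
        with tarfile.open(path) as tar:
            # The archives share one directory layout, so they can be
            # extracted on top of each other in the same target directory.
            tar.extractall(target_dir)
    return archives
```

For example, unpack_archives("downloads", "PeptideCs", pattern="minima_*.tar") would extract only the local minima, assuming the archives were downloaded into a directory named "downloads".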
Directories: First-level directories are named after each mono/dipeptide using the one-letter amino acid abbreviations separated by a dash (Hd and He denote the two isomers of non-protonated histidine, with the hydrogen in the delta or epsilon position, respectively). Subdirectories are named either "constrained", which corresponds to the systematic mapping of the whole dihedral angle potential energy surface, or "minima", which corresponds to nonredundant local minima. Each subdirectory contains the 16 files described under "Files".
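The directory layout above can be traversed with a few lines of Python. The sketch below assumes the dataset has been unpacked under a single root directory; the helper names parse_peptide_name and list_subsets are hypothetical and not part of the dataset.

```python
import os

# Subdirectory names as given in the directory description.
SUBSETS = ("constrained", "minima")


def parse_peptide_name(dirname):
    """Split a first-level directory name such as 'Hd-E' into residue labels."""
    return dirname.split("-")


def list_subsets(root):
    """Yield (residues, subset, path) for each data subdirectory under root."""
    for peptide in sorted(os.listdir(root)):
        for subset in SUBSETS:
            path = os.path.join(root, peptide, subset)
            # Skip subsets that were not unpacked.
            if os.path.isdir(path):
                yield parse_peptide_name(peptide), subset, path
```

Checking for the existence of each subdirectory keeps the traversal usable when only the "minima" or only the "constrained" archives were unpacked.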
Files: All files are stored in the uncompressed binary NPY format of the NumPy Python library. The order of the entries is identical in all files within one directory (i.e. the first item in coords.npy corresponds to the first item in nboCharges.npy).
- atoms.npy: proton numbers of atoms in the species
- coords.npy: Cartesian atomic position [10^-3 angstrom; stored as integers]
- nboCharges.npy: charges from the Natural population analysis [10^-4; stored as integers]
- constrForce.npy: force constant of the constraining harmonic potential of dihedral angles [hartree bohr^-2]
- energy.npy: COSMO energy in water (epsilon=80.) [hartree]
- eCosmoRSWater.npy: COSMO-RS energy in water [hartree]
- eCosmoRSDMF.npy: COSMO-RS energy in N,N-dimethylformamide [hartree]
- eCosmoRSHexane.npy: COSMO-RS energy in n-hexane [hartree]
- eCosmoRSOctanol.npy: COSMO-RS energy in 1-octanol [hartree]
- neighbourEnergyDiff.npy: difference of differences between the COSMO and COSMO-RS energies of the structure and its approximate 10 nearest neighbors in water [hartree]
- stid.npy: structure ID in this dataset
- dihedralAngles.npy: dihedral angles (phi_1, psi_1,.. chi_n) [degree]
- wallTime.npy: calculation time [second]
- dihedralAtoms.npy: atom numbers corresponding to each dihedral angle
- dihedralNumbers.npy: how many dihedral angles each residue has
- gradients.npy: gradients of COSMO energy in water [hartree bohr^-1]
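Because coordinates and charges are stored as scaled integers, they need to be rescaled after loading. The following is a minimal sketch of how one subdirectory might be read with NumPy. It assumes that each file holds a regular numeric array loadable with the default np.load settings and that the charges are in units of the elementary charge; the helper names and the kcal/mol conversion are illustrative additions, not part of the dataset.

```python
import glob
import os

import numpy as np

HARTREE_TO_KCAL_PER_MOL = 627.509474  # standard conversion factor


def load_subset(path):
    """Load every .npy file in one subdirectory into a dict keyed by file stem."""
    data = {}
    for filename in glob.glob(os.path.join(path, "*.npy")):
        stem = os.path.splitext(os.path.basename(filename))[0]
        data[stem] = np.load(filename)
    return data


def to_physical_units(data):
    """Return coordinates in angstrom and charges as floating-point values."""
    coords = data["coords"] * 1e-3       # stored as integers of 10^-3 angstrom
    charges = data["nboCharges"] * 1e-4  # stored as integers of 10^-4 (assumed e)
    return coords, charges


def relative_energies_kcal(energy_hartree):
    """Energies relative to the lowest non-NaN entry, converted to kcal/mol."""
    energy_hartree = np.asarray(energy_hartree, dtype=float)
    return (energy_hartree - np.nanmin(energy_hartree)) * HARTREE_TO_KCAL_PER_MOL
```

Since the entry order is identical across files in a directory, index i in the converted coordinates refers to the same structure as index i in the energies. Note that to_physical_units requires coords.npy and nboCharges.npy, which are distributed in separate archives.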
Notes: COSMO-RS energies are missing (indicated by a NaN value) in two cases because the COSMO-RS calculation failed (structure IDs R-E 321782725 and K-E 487750235). Apart from these, around 40 000 calculations failed for various reasons and are not included in the dataset.
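When working with the COSMO-RS energies, the NaN entries mentioned above can be filtered out with a boolean mask. The sketch below assumes each energy file holds a one-dimensional array with one value per structure; the function name cosmors_valid_mask is hypothetical.

```python
import numpy as np


def cosmors_valid_mask(*energy_arrays):
    """Boolean mask that is True where none of the given arrays is NaN."""
    mask = np.ones(len(energy_arrays[0]), dtype=bool)
    for arr in energy_arrays:
        mask &= ~np.isnan(np.asarray(arr, dtype=float))
    return mask
```

The same mask can then be applied to any other array from the same directory (for example the structure IDs), because all files share the same entry order.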