
Image collection and supporting data for: An image dataset of cleared, x-rayed, and fossil leaves vetted to plant family for human and machine learning. Version 2.0.

Version 2 2024-05-23, 21:44
Version 1 2021-12-16, 19:05
posted on 2024-05-23, 21:44 authored by Peter Wilf, Scott L Wing, Herbert W. Meyer, Jacob A. Rose, Rohit Saha, Thomas Serre, N. Rubén Cúneo, Michael Donovan, Diane M. Erwin, Maria A. Gandolfo, Erika B. Gonzalez-Akre, Fabiany Herrera, Shusheng Hu, Ari Iglesias, Kirk R. Johnson, Talia S. Karim, Xiaoyu Zou, Atsushi Yabe

Here we provide an updated image dataset and supporting data files, version 2, for the following primary article. Please refer to the primary article as well as the supporting data and updates provided here for all details.

Wilf P, SL Wing, HW Meyer, J Rose, R Saha, T Serre, NR Cúneo, MP Donovan, DM Erwin, MA Gandolfo, E González-Akre, F Herrera, S Hu, A Iglesias, KR Johnson, TS Karim, X Zou. 2021. An image dataset of cleared, x-rayed, and fossil leaves vetted to plant family for human and machine learning. PhytoKeys 187: 93–128, doi:10.3897/phytokeys.187.72350

The dataset version that corresponds exactly to the published article remains archived here as version 1 and is easily accessed by toggling the dataset version in this window.

The total image-collection size is now 34,368, consisting of 30,252 images of cleared and x-rayed leaves and 4,116 images of fossils.
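After downloading and extracting the archives, the stated totals can be checked with a short script. The following is a minimal sketch, not part of the dataset: the function name, the dictionary keys, and the assumption that every image carries a .jpg or .jpeg extension are illustrative only, and the folder passed in is whatever local directory the archives were extracted to.

```python
from pathlib import Path

# Totals as stated in this description (version 2).
STATED_COUNTS = {"cleared_and_xrayed": 30252, "fossil": 4116}

# Assumption: images use these extensions (the description only shows ".jpg").
IMAGE_SUFFIXES = {".jpg", ".jpeg"}


def count_images(root):
    """Count image files anywhere below `root`, ignoring extension case."""
    return sum(
        1
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )


# The two stated subtotals add up to the stated collection size of 34,368.
assert sum(STATED_COUNTS.values()) == 34368
```

Calling count_images on the folder holding the extracted cleared and x-rayed leaf archives, and again on the folder holding the fossil archives, gives numbers that can be compared against the subtotals above.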

Change list, version 1 to version 2:

1) Addition of NMNS Cleared Leaf Database (4,076 images).

The most significant change in version 2 is the addition of 4,076 images from the National Museum of Nature and Science (NMNS, Ibaraki, Japan) Cleared Leaf Database, made possible by the kind assistance of Dr. Atsushi Yabe, who is included here as a coauthor of version 2. The collection was made by Drs. Toshimasa Tanai of Hokkaido University and Kazuhiko Uemura of NMNS. More information on the NMNS Cleared Leaf Database, along with one-at-a-time image access, is available on the database's website. A prior publication using the database is Iwamasa and Noshita (2023), PLoS Comput Biol 19: e1010581.

All taxonomic names for the NMNS specimens as given were vetted and updated to species level by Edward Spagnuolo (acknowledged for his kind assistance) and P. Wilf using the Taxonomic Name Resolution Service (TNRS) and other sources. We note that the names attached to the prior cleared and x-rayed leaf images carried over from version 1 remain vetted only to family level, although taxa of interest can be easily updated using TNRS and many other resources. The vetted names were then used to update the as-given NMNS filenames to the same alpha-sortable format as the prior images (Family_Genus_species_dataset_catalognumber.jpg), and the renamed images were integrated into the same family folders for maximum ease of use. We thank Ivan Rodríguez for his kind assistance with this step.
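For users who want to recover the taxonomic fields from a filename, the Family_Genus_species_dataset_catalognumber.jpg pattern can be split on underscores. The sketch below is an illustration rather than the authors' tooling. It assumes that the family and genus are the first two fields, that the dataset and catalog number are the last two fields and contain no underscores themselves, and that anything in between belongs to the species field. The example filename in the comment is hypothetical, and real filenames should be checked against the Master Inventory.

```python
from pathlib import Path
from typing import NamedTuple


class LeafName(NamedTuple):
    family: str
    genus: str
    species: str
    dataset: str
    catalog_number: str


def parse_leaf_filename(filename):
    """Split a Family_Genus_species_dataset_catalognumber filename into fields.

    Assumption: the first two and last two underscore-separated fields are
    family, genus, dataset, and catalog number; the middle is the species.
    """
    parts = Path(filename).stem.split("_")
    if len(parts) < 5:
        raise ValueError(f"expected at least 5 fields, got {len(parts)}: {filename!r}")
    return LeafName(
        family=parts[0],
        genus=parts[1],
        species="_".join(parts[2:-2]),
        dataset=parts[-2],
        catalog_number=parts[-1],
    )


# Hypothetical filename, for illustration only:
# parse_leaf_filename("Annonaceae_Alphonsea_arborea_Wing_199-001.jpg")
```

Because the family comes first, a plain alphabetical sort of the filenames groups images by family and then by genus, which is what makes the format alpha-sortable.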

2) Filename cleanup in all directories and updates to affected catalog files.

Thousands of filenames and their associated catalog entries were cleaned by batch-removing all periods and spaces (e.g., from "sp.", "sp. ", "cf.", "aff.", and "x. ") and were then cross-checked for consistency.
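A cleanup of this kind can be sketched in a few lines. The example below only illustrates the rule as described (remove periods and spaces from the name while keeping the file extension). It is not the procedure the authors actually ran, and the function name and the example filename are hypothetical.

```python
import os


def clean_filename(name):
    """Remove periods and spaces from a filename, keeping its extension."""
    stem, ext = os.path.splitext(name)
    # Strip every period and space from the part before the extension.
    stem = stem.replace(".", "").replace(" ", "")
    return stem + ext


# Hypothetical example: the period after "sp" is dropped, ".jpg" is kept.
# clean_filename("Fagaceae_Quercus_sp._NMNS_123.jpg")
```

When renaming files in bulk, it is safer to build the full old-to-new mapping first and confirm that no two files collapse to the same cleaned name before renaming anything.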

3) A small number of new fossils were added, namely 34 leaves from Dipterocarpaceae and other families from the Pliocene of Brunei (Wilf et al. 2022, PeerJ and online supplement) and seven leaves of Macaranga kirkjohnsonii from the Eocene Laguna del Hunco flora, Chubut, Argentina (Wilf et al. 2023, Am. J. Bot. and online supplement).

4) The following nomenclatural updates were applied to the filenames of all affected images (fossils and extant) and related catalog entries:

Adoxaceae to Viburnaceae.

Vauquelinia coloradensis to Kageneckia coloradensis (after Denk et al. 2023).

Vauquelinia lineara to Vauquelinia liniara (typo correction).

Browniea, Camptotheca, and the five living Nyssaceae genera are all now categorized in Nyssaceae (some were in Cornaceae in v. 1).

File annotations

The version 2 files are provided here as zip archives, as follows. As noted above, the version 1 files remain available by toggling the dataset version.

Families A–E, F–O, and P–Z, respectively, of cleared and x-rayed leaf images (30,252 images).

Fossil-leaf image collection from Florissant Fossil Beds National Monument (3,320 images).

Fossil-leaf image collection from several other sites (796 images).

Reference set of most of the uncropped image versions for the General Fossil collection, for access to scale bars and other archival information not otherwise available digitally (see main article and supplements linked in item 3 above). Filenames are suffixed with "_uncropped" and may have minor differences in format from the cropped set.

Archive containing three files:


Master inventory file listing all extant and fossil specimens.

See details in the main article (esp. table 1) for how to look up additional specimen data, which are easily available on the Web for most of the collections using the catalog numbers listed in this inventory file (also see below). Please note that the catalog numbers listed here may be primary or secondary, as described in the main article (table 1). The "old_Family" field preserves legacy data that can assist in locating physical specimens in the collections, which usually retain their original taxonomic organization (see main text).

The other two files are catalogs of specimen data not otherwise available on the Web (see main article).


Specimen data for the "General fossil" image collection. As mentioned in the primary article, several fossils retain their generic names even though those names are known to be botanically incorrect, either from publications or in the opinion of the present authors, and are therefore placed in scare quotes in the primary article. In these cases, the listed family name is regarded as correct. Scare quotes cannot be used in filenames and are thus omitted.


Voucher data for the Wing X-Ray image collection.

Technical notes for the Wing x-rays:

Catalog number field in the Master Inventory file = negative number + leaf number as listed in this file.

Example: "Wing_199-001" in the Master Inventory = negative 199, leaf 1 here = Alphonsea arborea (Annonaceae) = primary voucher US 904529.

Some typographical errors in this legacy catalog are left as-is, and identifications are not updated here. Vetted spellings and updated family and order assignments can be found by catalog number (= negative + leaf number) in the Master Inventory file. This file includes some additional records that did not meet criteria for the image dataset.
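The catalog-number convention in the technical notes above can be sketched as a small lookup helper. This is an illustration only: the regular expression is inferred from the single example "Wing_199-001", so records that use a different pattern (for instance, extra letters) would need the pattern adjusted. The function name is hypothetical.

```python
import re

# Pattern inferred from the one documented example, "Wing_199-001".
_WING_PATTERN = re.compile(r"^Wing_(\d+)-(\d+)$")


def split_wing_catalog(catalog_number):
    """Return (negative_number, leaf_number) for a Wing catalog number."""
    match = _WING_PATTERN.match(catalog_number)
    if match is None:
        raise ValueError(f"not a recognized Wing catalog number: {catalog_number!r}")
    return int(match.group(1)), int(match.group(2))


print(split_wing_catalog("Wing_199-001"))  # (199, 1)
```

The resulting pair (negative 199, leaf 1) is the key used to find the matching row in the Wing voucher file.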


Collaborative Research: Origins of Southeast Asian Rainforests from Paleobotany and Machine Learning

Directorate for Geosciences


Collaborative Research: Patagonian Fossil Floras, the Keys to the Origins, Biogeography, Biodiversity, and Survival of the Gondwanan Rainforest Biome

Directorate for Biological Sciences


National Park Service


Research Institution(s)

Pennsylvania State University, Smithsonian Institution, Florissant Fossil Beds National Monument, Brown University, and many others

Contact email

I confirm there is no human personally identifiable information in the files or description shared

  • Yes

I confirm the files and description shared may be publicly distributed under the license selected

  • Yes