The publication of AI-generated synthetic structured data in data repositories is beginning to reveal which documentation elements must accompany synthetic datasets to ensure reproducibility and enable reuse. This document identifies actions that research repositories can take to encourage users to deposit AI-generated synthetic datasets with appropriate structure and documentation. The recommendations apply specifically to AI-generated data, not (for example) to data produced using pre-configured models or to missing data filled in by statistical inference. The document covers metadata and README elements for synthetic structured datasets (tabular and multi-modal); it does not cover textual data from LLMs or images for computer vision.
This document is the result of a workshop held on 23 January 2025 with participants from the Swedish National Data Service, Linköping University and Manchester University. It also draws on survey responses about current practice from 17 data repositories and on a review of existing metadata and README requirements.