The Data Provenance Initiative is a large-scale audit of the datasets used to train large language models. As a first step, we've traced 2000+ popular text-to-text finetuning datasets from origin to creation, cataloging their data sources, licenses, creators, and other metadata so that researchers can explore them using this tool. The purpose of this work is to improve the transparency, documentation, and informed use of datasets in AI.