Description
Foundation Models require significantly more training data than earlier generations of AI models. The scarcity of clinical data, together with the need to improve model generalization, makes it necessary to aggregate training and validation data from multiple datasets. In this work, we explore the challenges related to FAIR (findable, accessible, interoperable, and reusable) clinical imaging data that we encountered while sourcing real open clinical imaging datasets from large public cohort studies, existing public data repositories, and individual dataset publications. We present anonymized real-world examples detailing access, metadata, and licensing configurations, illustrating specific problems that can arise with respect to the individual FAIR principles. We introduce a tier system designed to identify dataset issues that impair machine readability. Furthermore, we evaluate the manual effort and resources required to find, access, and fetch data for Foundation Model training, and link these activities to our tier-based framework for assessing dataset machine readability. Finally, we offer suggestions on how to refine datasets at different tiers to make them compliant with the FAIR criteria and thereby reduce the human workload involved in procuring them. Key strategies, such as using the Resource Description Framework (RDF) to export key-value pairs from the Imaging Data Repository and constructing a FAIR Data Point, are presented as ways to enable highly automated, FAIR-compliant dataset access.