Types of Data Repositories
Percentage of values for the condition field covered by each UMLS ontology. Each column gives the percentage of the 497,124 values for the condition field contained in ClinicalTrials.gov records that are an exact match for a term from the given ontology. Definitions for significant fields in our analyses are given in Table 2. Did you wear a seatbelt today? Records from ClinicalTrials.gov are also commonly used in meta-analyses about trends in sources of funding for trials [12–16], diseases and interventions studied [13, 17–21], study design [17, 19, 22], time to publication following study completion [11, 23–26], geographical availability of trial sites [27–32], and the causes of delays and early terminations in studies [33–40]. No field within ClinicalTrials.gov records is required to use values from an ontology, and the data dictionary recommendation that the condition field use values that can be mapped to MeSH through the UMLS Metathesaurus is too vague to provide a defined set of expected values. Enforcing the presence of all required elements, requiring values for certain fields to be drawn from ontologies, and creating a structured eligibility criteria element would improve the reusability of data from ClinicalTrials.gov in systematic reviews, meta-analyses, and matching of eligible patients to trials. Our manual review of the eligibility criteria element suggested that at least 5% of trials have criteria for multiple groups. A clinical data repository consolidates data from various clinical sources, such as an EMR or a lab system, to provide a full picture of the care a patient has received. Dechartres, A., Boutron, I., Trinquart, L., Charles, P. & Ravaud, P.
Single-Center Trials Show Larger Treatment Effects Than Multicenter Trials: Evidence From a Meta-epidemiologic Study. Sharing and archiving data on the platform is fee-based. Compliance with Results Reporting at ClinicalTrials.gov. This provides the ability to connect and seamlessly share data between computerized systems and allows for information exchange with other applications and databases. Metadata describe the source of the data (e.g., investigators, sponsoring organizations, data submission and update dates), the structure of datasets, experimental protocols, identifying and summarizing information, and other domain-specific information. There is inevitably some subjectivity in setting the question to be posed in a systematic review (see Section 11.5.1) and there is likely to be a trade-off between pooling only very similar trials, and achieving high statistical power. Stanford Center for Biomedical Informatics Research, Stanford University School of Medicine, 1265 Welch Rd, Stanford, CA, 94305, USA: Laura Miron, Rafael S. Gonçalves & Mark A. Musen. Obesity Challenge data consisted of 1237 discharge summaries from the Partners HealthCare Research Patient Data Repository. When SARS-CoV-2, and the disease it causes, COVID-19, emerged in late 2019, researchers around the world began planning studies to figure out how to combat this global pandemic. A data repository, often called a data archive or library, is a generic term for a segmented data set used for reporting or analysis. Metadata schemas often require that values for certain fields be drawn from a particular biomedical ontology in order to prevent the usage of synonyms and to provide a defined range of values that can be used to query the metadata.
We manually reviewed a random sample of 400 of the 117,906 records from the second and third groups in Table 7 (i.e., all records in which eligibility criteria were present but incorrectly formatted). The PRS form-based entry system employs several methods to improve data quality. In this analysis, we investigated whether clinical-trial metadata values conform to expected data types, whether values are ontology terms where recommended, and whether unstructured free-text elements could be replaced with structured elements. The FDA requires only a responsible party, which is permitted to be a sponsor. DOI: https://doi.org/10.1038/s41597-020-00780-z. Scientific Data. Based on these results, we estimate that 16,200 records, 5% of records in the entire repository, define eligibility criteria for multiple groups of participants. The Ontology of Clinical Research (OCRe): An informatics foundation for the science of clinical research. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. One of several form pages for entering data in the PRS. Miron, L., Gonçalves, R. & Musen, M. Data from Obstacles to the Reuse of Study Metadata in ClinicalTrials.gov. This enables reuse of data across multiple sources, which increases statistical power and accelerates our understanding of this disease. FAIRsharing as a community approach to standards, repositories and policies. The dictionary lists the acceptable values for allocation as "Single Group", "Parallel", "Crossover", "Factorial", and "Sequential", but values appear in the records as "Single Group Assignment", "Parallel Group Assignment", etc. Such analyses are limited, however, by the number of patients available.
Hughes, S., Cohen, D. & Jaggi, R. Differences in reporting serious adverse events in industry sponsored clinical trial registries and journal articles on antidepressant and antipsychotic drugs: a cross-sectional study. However, there is unexplained inconsistency in the warning levels for missing required elements (missing masking generates a note, while missing interventional study model and number of arms both generate warnings). Robin Taylor, MLIS, joined NLM in 2016. MeSH provides the best coverage of any single ontology, but it does not cover significantly more terms than MEDDRA, which contains matches for 230,639 conditions (46%), or SNOMED-CT, which contains matches for 224,008 conditions (45%). The numbers of records missing the remaining fourteen required fields (missing in a non-negligible number of records) are displayed in Table 5. FAIRsharing.org [51] provides a registry of such standards for metadata in various scientific disciplines. Use of terms from well-known domain-specific ontologies is one of the fundamental guidelines enumerated by the FAIR principles for making scientific data and metadata Findable, Accessible, Interoperable, and Reusable [56]. Portfolio of prospective clinical trials including brachytherapy: an analysis of the ClinicalTrials.gov database. A Clinical Data Repository is a database or data warehouse where health data, generally with a granularity around each patient, is consolidated from multiple sources to give health professionals an organized way to analyze the data and create reports. Both of these studies also concluded that the metadata entry pipelines, which allowed the submission of user-defined fields and provided limited automated validation, contributed significantly to the quality of records. The largest such registry is ClinicalTrials.gov [1], a Web-based resource created and maintained by the U.S. National Library of Medicine (NLM).
Hosting the NIH CDE Task Force (CDETF), a trans-NIH community of practice. and determining whether ORCID and ROR entries should be created for entities if they do not exist. We found that records from trials with a lead sponsor of NIH contain significantly more missing values than do those from the other three agency classes. permissions allowing them to be available for text mining. Of 72 ontologies in the UMLS, 29 contained at least one match for a condition value, and 43 contained no matches (omitted from figure). Clinical-trial registries are repositories of structured records of key-value pairs (registrations) summarizing a trial's start and end dates, eligibility criteria, interventions prescribed, study design, names of sponsors and investigators, and prespecified outcome measures, among other details. We counted the number of records missing each field for 28 of the 41 fields required by the FDAAA801 Final Rule. Our analysis highlights the limitations of the current metadata stored in ClinicalTrials.gov and the benefits that would ensue from making ClinicalTrials.gov records more structured, and thus more findable by specific searches, interoperable with other knowledge sources, and reusable in statistical analyses of multiple studies. Biomedical metadata are typically records of key-value pairs that are created when investigators submit data to a repository such as the Gene Expression Omnibus (GEO), Protein Data Bank (PDB), or NCBI BioSample repository.
Although the registrations in ClinicalTrials.gov serve the important purpose of enabling FDA oversight and protecting human subjects, they are also an invaluable source of metadata about clinical trials for systematic reviews, adverse events [10, 11], and analyses about funding sources [12–16], study design [17, 19, 22], time to publication following study completion [11, 23–26], geographical availability of trials [27–32], causes of delays and early terminations [33–40], and more. Timing and Completeness of Trial Results Posted at ClinicalTrials.gov and Published in Journals. Background. The new element definitions are legally required for all trials with start dates on or after January 18, 2017, and ClinicalTrials.gov released updated element definitions on January 11, 2017 to support the Final Rule regulations. Rather than being restricted to a single ontology, authors are encouraged to use either MeSH terms, or terms that can be mapped to MeSH through the UMLS Metathesaurus. Currently within the PRS, investigator information must be manually reentered everywhere it occurs, which may validly be as the responsible party, as the overall official, and as the site-specific investigator for one or more locations for a single trial.
A searchable database of more than 2,000 research data repositories, including 100+ relating to cancer. To evaluate completeness, we counted the numbers of records missing all fields required by FDAAA801. For data generated from research subject to such policies or funded under such FOAs, researchers should use the designated data repository(ies). We review traits of reusable clinical data and offer a typology of clinical repositories with a range of known examples. We found that the reusability of clinical-trial metadata is hindered by the lack of a single minimum information standard for the fields required for registering clinical trials, and by the discrepancies between the 24 fields required by the WHO Trial Registration Data Set [52] and the 41 fields required by FDAAA801. Barriers to clinical trial recruitment in head and neck cancer. Oral Oncol. Clinical Trials Registration and Results Information. The condition and intervention fields within ClinicalTrials.gov records share characteristics of fields that could support and be improved by ontology restrictions on the allowed values: expected values for these fields are already likely to be found in well-known ontologies such as MeSH or RXNORM, unrestricted values for these fields are likely to introduce synonyms (e.g., the proprietary name and generic name for a drug), and they are important fields for querying the repository. As we do, several respondents suggested standardizing the vocabulary used in records by encouraging greater use of well-known controlled terminologies. We also counted the number of records with no listed Principal Investigator, required by the WHO dataset (called Contact for Scientific Inquiries), but not required by FDAAA801.
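The completeness count described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the element paths in REQUIRED_PATHS are a hypothetical subset chosen for the example (the actual list of required fields comes from FDAAA801 and the data dictionary), and the function name missing_field_counts is invented here. An element counts as missing when it is absent or contains only whitespace.

```python
import xml.etree.ElementTree as ET

# Hypothetical subset of required element paths, relative to the record root.
# The real set of required fields is defined by FDAAA801, not by this list.
REQUIRED_PATHS = [
    "brief_title",
    "overall_status",
    "study_design_info/masking",
    "overall_official/last_name",
]


def missing_field_counts(xml_records, required_paths=REQUIRED_PATHS):
    """Count, for each required path, how many records lack a usable value."""
    counts = {path: 0 for path in required_paths}
    for xml_text in xml_records:
        root = ET.fromstring(xml_text)
        for path in required_paths:
            element = root.find(path)
            # Treat absent elements and whitespace-only text as missing.
            if element is None or not (element.text or "").strip():
                counts[path] += 1
    return counts
```

Passing a list of record XML strings returns a dictionary from each path to the number of records missing it, which can then be tabulated per field.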
Data warehouses, as discussed in detail in an upcoming section of this chapter, have distinguishing characteristics such as support for data query or analysis 'in place', and the straightforward transformation of data into smaller 'data marts'. Developing structured representations of inclusion and exclusion criteria that may be reused in future studies, or used to automatically match eligible patients (e.g., from a hospital's patient database), is an active area of research [63, 64]. Eligibility criteria, which could be used to facilitate the matching of patients to applicable clinical trials if stored in a structured format, are currently stored as semi-structured free text and cannot be used to query the repository. meet criteria that merit their recommendation for use in NIH-funded research. We noticed irregularities in the structure of both investigator and contact-related elements. Metadata that are structured using principled schemas and that use terms from ontologies are essential to making biomedical data findable and reusable for downstream analyses. The expected format for eligibility criteria in ClinicalTrials.gov is a bulleted list of strings that enumerate the criteria below the headers "Inclusion Criteria" and "Exclusion Criteria". Eligibility criteria are stored as semi-structured free text. The ClinicalTrials.gov XSD schema contained type definitions for all Boolean, integer, date, and age fields, and all records validated against this XSD (Table 3).
CDEs are standardized, precisely defined questions that are paired with a set of specific allowable responses, then used systematically across different sites, studies, or clinical trials to ensure consistent data collection. Developing Human Connectome Project (dHCP) include images of neonatal subjects. New Data Management and Sharing Policy: January 25, 2023. Unless otherwise specified, we refer to elements by the name provided in the data dictionary. The package is built upon TCGAbiolinks to query TCGA databases, and makes use of the R package data.table to handle query results. We found that automated validation rules within the PRS have been successful at enforcing type restrictions on numeric, date, and Boolean fields, and fields with enumerated values. PubMed Central Open Access Subset. If no appropriate discipline or data-type specific repository is available, researchers should consider a variety of other potentially suitable data sharing options: Small datasets (up to 2 GB in size) may be included as supplementary material to accompany articles submitted to PubMed Central. We found 6,851 trials sponsored by the NIH, 3,032 trials sponsored by U.S.
Those users cannot access all the data in the data repository. Hu, W., Zaveri, A., Qiu, H. & Dumontier, M. Cleaning by clustering: methodology for addressing data quality issues in biomedical metadata. World Health Organization (WHO)/International Committee of Medical Journal Editors (ICMJE)-ClinicalTrials.gov Cross Reference, https://prsinfo.clinicaltrials.gov/trainTrainer/WHO-ICMJE-ClinTrialsgov-Cross-Ref.pdf (2019). It sounds like a simple question, but there are so many ways to ask the question, and even more possible responses. Typically, it contains a subset of the clinical data as well as the operational and financial data of the enterprise and is focused primarily on administrative, managerial, and executive decision-making. At NLM, we think a lot about data standards, particularly health data standards. Therefore, all records contain correctly typed values for all occurrences of these elements. Even when values for fields in ClinicalTrials.gov records are drawn from an ontology, they are not specified using globally unique and persistent identifiers, which would enable the interoperability of data with systems that expect these well-defined terms as input. Cohort definition and recruitment are among the most challenging aspects of conducting clinical trials [65], and difficulties in recruitment cause delays for the majority of trials [66, 67]. The authors declare no competing interests. Only nine of fifteen fields are typed within the XSD, however, and the untyped fields appear as free text to programs ingesting the raw XML files. Of all 385,279 contact details that are provided, either as the overall contact or a location-specific contact, 81,195 (21%) lack a phone number and 86,611 (22%) lack an email address.
If your data contains a patient whose complete clinical record contains one or more encounters that have been filtered out by this policy . Python notebooks which reproduce all other analyses, tables, and figures are available at https://github.com/lauramiron/CTMetadataAnalysis. Califf, R. M. Characteristics of Clinical Trials Registered in ClinicalTrials.gov, 2007–2010. Research questions, such as "What date did the patient first display COVID-19 symptoms?", arose continuously. Thadani, S. R., Weng, C., Bigger, J. T., Ennever, J. F. & Wajngurt, D. Electronic Screening Improves Efficiency in Clinical Trial Recruitment. Ross, J. S., Mocanu, M., Lampropulos, J. F., Tse, T. & Krumholz, H. M. Time to Publication Among Completed Clinical Trials. Fed (US governmental agency other than NIH), 69,100 trials sponsored by industry, and 160,291 trials with agency class other. volume 7, Article number: 443 (2020). If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. Food and Drug Administration Amendments Act of 2007 (2007). The center for expanded data annotation and retrieval. To test adherence to this restriction, we used BioPortal to search for exact matches for each term, restricted to the 72 ontologies in the 2019 version of UMLS. Non-publication of large randomized clinical trials: cross sectional analysis. Values for fields commonly used in search queries, such as condition(s) and intervention(s), are not restricted to ontology terms, impeding search.
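To make the exact-match coverage measurement concrete, here is a minimal sketch. It does not call BioPortal; instead it assumes each ontology's terms have already been loaded into an in-memory collection. The function name ontology_coverage, the case-insensitive and whitespace-trimmed comparison, and the toy terms in the usage note are all assumptions made for illustration, not details taken from the analysis itself.

```python
def ontology_coverage(values, ontologies):
    """Return {ontology name: percentage of values that exactly match a term}.

    values:     iterable of free-text field values (e.g., condition strings)
    ontologies: dict mapping an ontology name to an iterable of its terms
    Matching ignores case and surrounding whitespace (an assumption).
    """
    normalized = [value.strip().lower() for value in values]
    coverage = {}
    for name, terms in ontologies.items():
        term_set = {term.strip().lower() for term in terms}
        hits = sum(1 for value in normalized if value in term_set)
        # Guard against an empty input list to avoid division by zero.
        coverage[name] = 100.0 * hits / len(normalized) if normalized else 0.0
    return coverage
```

For example, with the invented values ["Breast Cancer", "tumor", "heart attack", "Asthma"] and two toy term lists, one containing "Asthma" and "Breast Neoplasms" and the other containing "Asthma", "Tumor", and "Heart attack", the first list covers one of the four values and the second covers three. A synonym such as "Breast Cancer" goes unmatched, which is the kind of gap unrestricted fields introduce.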
Until then, the Ministerial Order 221/1984, that only required the drawing up of a discharge report for patients seen in . Researchers may wish to consult experts in their own institutions (e.g., librarians, data managers) for assistance in selecting an appropriate data repository. The advantage of CEDAR is that it provides tight integration with biomedical ontologies to control both field names and values, while not being tied to a single repository or metadata schema. A) For updating the clinical data, the putative steps should be the following: clin.dat <- getClinicalData(coad.maf). Convert the column Tumor_Sample_Barcode of the above data frame to its first 12 characters, remove the 2 samples that are not primary, and store the result in a data frame called clinical.updated.dat. First, we checked adherence to the restrictions as they are defined. For example, the same specimens originally collected for a clinical trial could also be used in secondary genomic research. A tabular format typically uses each row to represent data from a participant and each column to represent an item from a case report form. Tabular formats have the advantage of being intuitive, relatively simple to create and machine . Introduction. In recent years, the pace of the advancement of society has been enhanced by uninterrupted growth in the spread of information and communication technologies (ICT) and in ICT uptake by citizens, enterprises, and public organizations, as well as the increasing role of information in all spheres of life [1].
If researchers apply health data standards in their investigations (if they ask questions and collect responses in a standardized way), the data they collect can be combined and compared with data from other COVID-19 studies and EHRs. Since we are primarily concerned with the reusability of the existing metadata, we did not evaluate whether protocol elements and results were added in a timely manner in accordance with FDAAA801 and the Final Rule. CDEs are in use across NIH, to varying degrees. Trends in the globalization of clinical trials. Ms. Babski is responsible for overall management of one of NLM's largest divisions with more than 450 staff who provide health information services to a global audience of health care professionals, researchers, administrators, students, historians, patients, and the public. They are secondary databases, that is, they receive data that has been originally input into other sources. Tse et al. [48] identify additional obstacles to clinical trial data reuse: follow-up studies are not always linked to the original study, records can be modified by the responsible party at any . For clinical-trial data, two main policies govern minimum information standards: the International Committee of Medical Journal Editors (ICMJE)/World Health Organization (WHO) trial registration dataset [52], and Section 801 of the Food and Drug Administration (FDA) Amendments Act of 2007 (FDAAA801) [53] together with its Final Rule (42 CFR Part 11) [54], which updated and finalized required element definitions. The presence of some required fields is strictly enforced by the Protocol Registration System, but contact information, principal investigators, study design information, and outcome measures are frequently missing. Data marts also are more secure because they limit authorized users to isolated data sets.
Both DeVito and Anderson [69] found that lower levels of compliance with results reporting were associated with trials funded by NIH and other US governmental institutions versus trials funded by industry. Primary data nearly always come from research studies and electronic medical records. (Agency: National Institutes of Health, Department of Health and Human Services; Action: Final rule; Publication Date: 09/21/2016, 2016). ClinicalTrials.gov is also an important resource for patients and health care providers to search for studies for which patients are eligible. We verified that the intervention field could be restricted to ontology terms without significant loss of specificity, by demonstrating that 256,463 out of 557,436 listed intervention values (46%) can be matched to terms from BioPortal ontologies, even without any pre-processing (Fig. Including registry metadata in systematic reviews can help to identify selective reporting bias by comparing published outcomes to prespecified outcomes [8, 9], and adverse events are more likely to be reported in clinical trial registries than in published literature [10, 11]. Further, the PRS should allow users to enter multiple blocks of inclusion and exclusion criteria and an associated criteria group name, such as the label of the corresponding study arm. Since 2006, data have been collected for >500,000 individuals aged 40-69 years with ongoing passive follow-up of clinical status [16]. "Tumor" is always included in user queries for "cancer", however (Table 6). Markers by each field name indicate whether the element is required. Data include longitudinal clinical data from the VA's nationwide electronic health record system and There will also be a need to develop or adapt sustainable systems to assess repositories for clinical data and data objects against these standards.
How to avoid common problems when using ClinicalTrials.gov in research: 10 issues to consider. BMJ j448 (2017). Baldi, I., Lanera, C., Berchialla, P. & Gregori, D. Early termination of cardiovascular trials as a consequence of poor accrual: analysis of ClinicalTrials.gov 2006–2015. We used regular expressions to test whether values for eligibility criteria conformed to the expected semi-structured format.
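A regular-expression check of this kind might look like the sketch below. It is an illustration only, not the pattern used in the analysis: it assumes the expected format is an "Inclusion Criteria" header followed by at least one bulleted line, then an "Exclusion Criteria" header followed by at least one bulleted line. The accepted bullet markers (-, *, •, or a number followed by a period), the optional colon after each header, and the name is_expected_format are assumptions.

```python
import re

# One or more bulleted lines. The set of bullet markers is an assumption.
# Each bullet must run to a newline or the end of the text, which keeps
# matching linear on long single-line inputs.
BULLETS = r"(?:[ \t]*(?:-|\*|•|\d+\.)[ \t]+\S[^\n]*(?:\n|\Z))+"

# Header, bullets, header, bullets, with nothing else before or after.
EXPECTED_FORMAT = re.compile(
    r"\A\s*Inclusion Criteria:?[ \t]*\n+" + BULLETS
    + r"\s*Exclusion Criteria:?[ \t]*\n+" + BULLETS
    + r"\s*\Z",
    re.IGNORECASE,
)


def is_expected_format(criteria_text: str) -> bool:
    """Return True if the text has both headers, each followed by bullets."""
    return EXPECTED_FORMAT.match(criteria_text) is not None
```

Text with no headers, with only one of the two headers, or with headers qualified by a group name (for example, separate criteria blocks per study arm) fails this check, so such records would need manual review rather than being classified automatically.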