Healthsheet: Development of a Transparency Artifact for Health Datasets

Rostamzadeh, Negar; Mincu, Diana; Roy, Subhrajit; Smart, Andrew; Wilcox, Lauren; Pushkarna, Mahima; Schrouff, Jessica; Amironesei, Razvan; Moorosi, Nyalleng; Heller, Katherine

doi:10.48550/arxiv.2202.13028

Cited by 4 publications

(6 citation statements)

References 18 publications

“…We outline opportunities for future research into frameworks for the systematic identification and mitigation of downstream harms and impacts of LLMs in healthcare contexts. Key principles include the use of participatory methods to design contextualized evaluations that reflect the values of patients that may benefit or be harmed, grounding the evaluation in one or more specific downstream clinical use cases [39,40], and the use of dataset and model documentation frameworks for transparent reporting of choices and assumptions made during data collection and curation, model development and evaluation [41][42][43]. Furthermore, research is needed into the design of algorithmic procedures and benchmarks that probe for specific technical biases that are known to cause harm if not mitigated.…”

Section: Fairness and Equity Considerations (mentioning)

confidence: 99%

Large language models encode clinical knowledge

Singhal, Azizi, et al. 2023. Nature.

Large language models (LLMs) have demonstrated impressive capabilities, but the bar for clinical applications is high. Attempts to assess the clinical knowledge of models typically rely on automated evaluations based on limited benchmarks. Here, to address these limitations, we present MultiMedQA, a benchmark combining six existing medical question answering datasets spanning professional medicine, research and consumer queries and a new dataset of medical questions searched online, HealthSearchQA. We propose a human evaluation framework for model answers along multiple axes including factuality, comprehension, reasoning, possible harm and bias. In addition, we evaluate Pathways Language Model [1] (PaLM, a 540-billion parameter LLM) and its instruction-tuned variant, Flan-PaLM [2], on MultiMedQA. Using a combination of prompting strategies, Flan-PaLM achieves state-of-the-art accuracy on every MultiMedQA multiple-choice dataset (MedQA [3], MedMCQA [4], PubMedQA [5] and Measuring Massive Multitask Language Understanding (MMLU) clinical topics [6]), including 67.6% accuracy on MedQA (US Medical Licensing Exam-style questions), surpassing the prior state of the art by more than 17%. However, human evaluation reveals key gaps. To resolve this, we introduce instruction prompt tuning, a parameter-efficient approach for aligning LLMs to new domains using a few exemplars. The resulting model, Med-PaLM, performs encouragingly, but remains inferior to clinicians. We show that comprehension, knowledge recall and reasoning improve with model scale and instruction prompt tuning, suggesting the potential utility of LLMs in medicine. Our human evaluations reveal limitations of today’s models, reinforcing the importance of both evaluation frameworks and method development in creating safe, helpful LLMs for clinical applications.

“…Second, the reviewed publications enable the documentation of information on the data selection, data version and data collection of the training data. Guidelines and checklists from different fields empower researchers to track data collection and selection rationale (Artrith et al, 2021; Bender and Friedman, 2018; Hutchinson et al, 2021; Isdahl and Gundersen, 2019; Rostamzadeh et al, 2022; Rule et al, 2019; Vasey et al, 2022; Walsh et al, 2021). Documentation guidelines also offer methods for recording how and when data was collected (Artrith et al, 2021; Gebru et al, 2021; Hutchinson et al, 2021; Norgeot et al, 2020; Srinivasan et al, 2021).…”

Section: Figure 1 Structure Of the Results (mentioning)

confidence: 99%

“…First, literature describes tools and methods for documenting data set size and composition. Guidelines help researchers with providing this information via summary statistics and visualizations (Gebru et al, 2021; Holland et al, 2018; Isdahl and Gundersen, 2019; Mitchell et al, 2019; Mora-Cantallops et al, 2021; Rostamzadeh et al, 2022; Schelter et al, 2017). In addition,…”

Section: Documenting the Training Data (mentioning)

confidence: 99%

A comprehensive review of techniques for documenting artificial intelligence

Königstorfer 2024. DPRG.

Purpose: Companies are increasingly benefiting from artificial intelligence (AI) applications in various domains, but also facing its negative impacts. The challenge lies in the lack of clear governance mechanisms for AI. While documentation is a key governance tool, standard software engineering practices are inadequate for AI. Practitioners are unsure about how to document AI, raising questions about the effectiveness of current documentation guidelines. This review examines whether AI documentation guidelines meet regulatory and industry needs for AI applications and suggests directions for future research. Design/methodology/approach: A structured literature review was conducted. In total, 38 papers from top journals and conferences in the fields of medicine and information systems as well as journals focused on fair, accountable and transparent AI were reviewed. Findings: This literature review contributes to the literature by investigating the extent to which current documentation guidelines can meet the documentation requirements for AI applications from regulatory bodies and industry practitioners and by presenting avenues for future research. This paper finds contemporary documentation guidelines inadequate in meeting regulators’ and professionals’ expectations. This paper concludes with three recommended avenues for future research. Originality/value: This paper benefits from the insights from comprehensive and up-to-date sources on the documentation of AI applications.

“…Additionally, a single de-identified file containing all annotated entities for ophthalmic medications was included. To promote transparency in our data collection methods and intended uses for this data, we have provided a HealthSheet, a structured datasheet specific to healthcare datasets as recommended by Rostamzadeh et al [31] based on the original datasheet by Gebru et al [32] which was developed for open-datasets for all use cases in AI. This datasheet is provided in the Supplementary Materials.…”

Section: Results (mentioning)

confidence: 99%

Development of an Open-Source Annotated Glaucoma Medication Dataset From Clinical Notes in the Electronic Health Record

Chen, Lin, Yang, et al. 2022. Trans. Vis. Sci. Tech.

Purpose: To describe the methods involved in processing and characteristics of an open dataset of annotated clinical notes from the electronic health record (EHR) annotated for glaucoma medications. Methods: In this study, 480 clinical notes from office visits, medical record numbers (MRNs), visit identification numbers, provider names, and billing codes were extracted for 480 patients seen for glaucoma by a comprehensive or glaucoma ophthalmologist from January 1, 2019, to August 31, 2020. MRNs and all visit data were de-identified using a hash function with salt from the deidentifyr package. All progress notes were annotated for glaucoma medication name, route, frequency, dosage, and drug use using an open-source annotation tool, Doccano. Annotations were saved separately. All protected health information (PHI) in progress notes and annotated files were de-identified using the published de-identifying algorithm Philter. All progress notes and annotations were manually validated by two ophthalmologists to ensure complete de-identification. Results: The final dataset contained 5520 annotated sentences, including those with and without medications, for 480 clinical notes. Manual validation revealed 10 instances of remaining PHI which were manually corrected. Conclusions: Annotated free-text clinical notes can be de-identified for upload as an open dataset. As data availability increases with the adoption of EHRs, free-text open datasets will become increasingly valuable for “big data” research and artificial intelligence development. This dataset is published online and publicly available at https://github.com/jche253/Glaucoma_Med_Dataset. Translational Relevance: This open access medication dataset may be a source of raw data for future research involving big data and artificial intelligence research using free-text.
