Radiology reports are an instrumental part of modern medicine, informing key clinical decisions such as diagnosis and treatment. The worldwide shortage of radiologists, however, restricts access to expert care and imposes heavy workloads, contributing to avoidable errors in report delivery. While recent progress in automated report generation with vision-language models offers clear potential to ameliorate this situation, the path toward real-world adoption has been stymied by the challenge of evaluating the clinical quality of AI-generated reports. In this study, we build a state-of-the-art report generation system for chest radiographs, Flamingo-CXR, by fine-tuning a well-known vision-language foundation model on radiology data. To measure the quality of the AI-generated reports, we conduct an expert evaluation, the largest in scale and diversity to date, engaging a panel of 27 certified radiologists in the United States and India to provide detailed assessments of AI-generated and human-written reports from an intensive care setting as well as an inpatient setting. We observe a wide distribution of preferences across the panel, ranging from full consensus to dissensus, across clinical settings and regions: 55.4% of Flamingo-CXR intensive care reports are judged preferable or equivalent to clinician reports by half or more of the panel, rising to 77.7% for inpatient X-rays overall and to 94% for the subset of cases with no pertinent abnormal findings. For reports that contain errors, we develop an assistive setting, the first demonstration of clinician-AI collaboration for radiology report composition, and we observe a synergistic improvement in report quality across all clinical settings. Altogether, these nuanced evaluations reveal disparities between the AI system and radiologists, identify areas of potential clinical utility and pave the way toward a collaborative system that enhances the clinical accuracy of radiology reporting.