Metrics to benchmark AI systems in clinical data abstraction tasks
Sai Anurag Modalavalasa
While clinicians mostly document longitudinal health record information as unstructured text[1], clinical research data collection instruments, such as disease registries and Case Report Forms (CRFs), are designed to capture standardized clinical variables to enable systematic analysis. Consequently, unstructured health record data are often abstracted into structured datasets for use in clinical research [2,3,4,5]. In this blog, we introduce the set of metrics we use to evaluate the performance of AI systems in automating clinical data abstraction tasks.
Clinical Data Abstraction Tasks: Formulation
The input for clinical data abstraction tasks is unstructured data, predominantly text, and in some cases a combination of text and images. The output is a set of structured clinical variables. These variables can span different types: multiple-choice fields, dates, tables, or subjective text.
While some tasks are purely extractive (for example, determining smoking history from the record), many aren’t. Some require numeric computation: body surface area (BSA), for instance, requires extracting height and weight and then applying a formula. Other tasks require dynamic reasoning and decision making. For example, identifying adverse events and applying Common Terminology Criteria for Adverse Events (CTCAE) grading[6] involves extracting relevant information from the health record and making a context-dependent determination.
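As a concrete illustration of the computational case, the sketch below derives BSA from extracted height and weight using the Mosteller formula. The formula choice and the helper name are assumptions for illustration; the actual formula and units are dictated by the specific CRF or registry.

```python
import math

def bsa_mosteller(height_cm: float, weight_kg: float) -> float:
    """Body surface area (m^2) via the Mosteller formula: sqrt(height * weight / 3600)."""
    return math.sqrt(height_cm * weight_kg / 3600.0)

# Height and weight would come from the extraction step; the derived value fills the BSA field.
print(round(bsa_mosteller(170.0, 65.0), 2))  # ~1.75
```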
Evaluation Methods and Metrics to Benchmark AI systems
The primary metrics we use are largely derived from quality control (QC) frameworks used by clinical researchers in manual data abstraction workflows. Grounding evaluation in these established systems enables direct benchmarking of AI performance against current manual processes. AI outputs are evaluated against expert-annotated datasets derived from clinical records. The key evaluation metrics are as follows:
| Metric | Evaluation method | Notes/Challenges |
| --- | --- | --- |
| Correctness [% of variables correct] | Automated evaluation for structured variable types such as multiple-choice and dates (see the scoring sketch after this table). | For subjective text and tables, semantic similarity is a poor proxy for correctness: cosine similarity does not capture whether the value is right. Summarization metrics such as ROUGE [19] are also ineffective, since tasks often require calculation, abstraction, or rule-based decisions rather than extraction. Example: a model that correctly captures the adverse event (AE) description but assigns an incorrect grade may still achieve a high ROUGE score, because most of the textual overlap comes from the extracted description, even though the structured variable is incorrect. |
| Completeness [% of variables with no missing critical information] | Expert evaluation on sampled outputs | This metric is particularly challenging to automate: semantic similarity and summarization metrics are often ineffective at quantifying completeness. Example: two abstracted outputs can have high semantic similarity even though one omits the tumor size and spread, which are critical clinical details. |
| Timeliness [Time from record availability to structured data output] | Fully automated measurement | Similar metrics have been automatically measured in other evaluation studies [19]. This is an important metric for researchers, as delays can reduce utility in use cases requiring continuous monitoring, trend analysis, or immediate decision-making. |
| Time to Audit [Time needed to verify correctness] | Automated measurement during human reviews | Clinicians often audit the abstracted outputs and identify failure modes. Manual abstraction, however, provides results without supporting rationale. AI systems that generate evidence artifacts alongside their outputs can reduce the time required for audit. |
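For the structured variable types where correctness can be scored automatically (multiple-choice fields and dates), the check reduces to exact match after normalization. The sketch below shows one way this might look; the field names, accepted date formats, and normalization rules are illustrative assumptions, not a description of our production pipeline.

```python
from datetime import datetime

def normalize(field_type: str, value: str) -> str:
    """Normalize a value before exact-match comparison."""
    value = value.strip().lower()
    if field_type == "date":
        # Assumed formats for illustration; real records need broader date parsing.
        for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
            try:
                return datetime.strptime(value, fmt).date().isoformat()
            except ValueError:
                continue
    return value

def correctness(predictions: dict, gold: dict, field_types: dict) -> float:
    """% of structured variables whose normalized value matches the expert annotation."""
    matches = sum(
        normalize(field_types[k], predictions.get(k, "")) == normalize(field_types[k], v)
        for k, v in gold.items()
    )
    return 100.0 * matches / len(gold)

# Hypothetical fields and values, purely for illustration.
field_types = {"smoking_history": "choice", "diagnosis_date": "date"}
gold = {"smoking_history": "Former smoker", "diagnosis_date": "2021-03-05"}
pred = {"smoking_history": "former smoker", "diagnosis_date": "03/05/2021"}
print(correctness(pred, gold, field_types))  # 100.0
```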
In addition to the metrics described above, clinicians we have collaborated with have found value in quantifying an additional measure that is specific to AI-based clinical data abstraction tasks:
% Fields with Hallucinations: % of variables where the AI generates outputs based on fabricated assumptions. For instance, if a patient’s height is missing but the AI assumes a value to calculate body surface area (BSA), this is flagged as a hallucination. Since humans rarely introduce such fabricated data, tracking hallucinations is uniquely important for evaluating AI reliability and trustworthiness.
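One way to operationalize this measure is a grounding check: every numeric input that feeds a derived field must be traceable to the source record, otherwise the field is counted as hallucinated. The sketch below is a deliberately naive version of that idea (verbatim substring matching on the formatted number); real checks need unit handling, tokenization, and fuzzier matching, and the function names here are hypothetical.

```python
def is_grounded(value: float, source_text: str) -> bool:
    """Naive grounding check: the number appears verbatim in the record text."""
    return str(int(value)) in source_text or f"{value:.1f}" in source_text

def bsa_is_hallucinated(inputs_used: dict, source_text: str) -> bool:
    """Flag the derived BSA field if any input the AI relied on is not found in the record."""
    return not all(is_grounded(v, source_text) for v in inputs_used.values())

record = "Weight 65 kg recorded at visit; height not documented."
inputs_used = {"height_cm": 170.0, "weight_kg": 65.0}  # the AI assumed a height value
print(bsa_is_hallucinated(inputs_used, record))  # True -> counts toward % fields with hallucinations
```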
Classical metrics such as precision, recall, and F1-score are better suited to evaluate classification tasks than clinical data abstraction tasks, as they do not distinguish between qualitatively different failure modes such as incomplete extraction or hallucinated values. We therefore adopt evaluation metrics aligned with clinical data abstraction quality assurance frameworks.
If you are a clinician or researcher interested in accessing anonymized datasets, benchmark results, or learning more about experimental findings (e.g., whether performance on MCQ-style abstraction tasks correlates with performance on subjective clinical variables), please reach out to us.