Metrics to benchmark AI systems in clinical data abstraction tasks

Sai Anurag Modalavalasa

While clinicians mostly log longitudinal health record information as unstructured text[1], clinical research data collection instruments, such as disease registries and Case Report Forms (CRFs), are designed to capture standardized clinical variables that enable systematic analysis. Consequently, unstructured health record data are often abstracted into structured datasets for use in clinical research [2,3,4,5]. In this blog, we introduce the set of metrics we use to evaluate the performance of AI systems in automating clinical data abstraction tasks.

Clinical Data Abstraction Tasks: Formulation

The input for clinical data abstraction tasks is unstructured data, predominantly text, and in some cases a combination of text and images. The output is a set of structured clinical variables. These variables can span different types: multiple-choice fields, dates, tables, or subjective text.
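The output types above can be pictured as a small record schema. The following is an illustrative sketch only; the field names are hypothetical and not taken from any actual CRF:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

# Hypothetical structured output for one patient record.
# Field names are illustrative; real CRFs define their own variables.
@dataclass
class AbstractedRecord:
    smoking_history: str                 # multiple-choice, e.g. "never" / "former" / "current"
    diagnosis_date: Optional[date]       # date field
    adverse_events: list = field(default_factory=list)  # table: one dict per adverse event
    disease_course_summary: str = ""     # subjective free text

record = AbstractedRecord(
    smoking_history="former",
    diagnosis_date=date(2023, 5, 14),
    adverse_events=[{"term": "nausea", "ctcae_grade": 2}],
    disease_course_summary="Stable disease after two cycles of therapy.",
)
```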

While some tasks are purely extractive (for example, determining smoking history from the record), many are not. Some require numeric computation: body surface area (BSA), for instance, requires extracting height and weight and then applying a formula. Others require dynamic reasoning and decision making: identifying adverse events and applying Common Terminology Criteria for Adverse Events (CTCAE) grading[6] involves extracting relevant information from the health record and making a context-dependent determination.
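The BSA computation mentioned above can be sketched as follows. We use the Mosteller formula here for illustration; the source does not specify which BSA formula a given registry requires:

```python
import math

def bsa_mosteller(height_cm: float, weight_kg: float) -> float:
    """Body surface area (m^2) via the Mosteller formula:
    BSA = sqrt(height_cm * weight_kg / 3600)."""
    return math.sqrt(height_cm * weight_kg / 3600)

# Height and weight are first extracted from the record, then plugged in.
print(round(bsa_mosteller(170, 70), 2))  # -> 1.82
```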

Evaluation Methods and Metrics to Benchmark AI Systems

The primary metrics we use are largely derived from quality control (QC) frameworks used by clinical researchers in manual data abstraction workflows. Grounding evaluation in these established systems enables direct benchmarking of AI performance against current manual processes. AI outputs are evaluated against expert-annotated datasets derived from clinical records. The key evaluation metrics are as follows:

Correctness [% of variables correct]

Evaluation method: Automated evaluation for structured variable types such as multiple-choice fields and dates. For subjective text and tables, expert reviewers manually evaluate sampled outputs.

Notes/Challenges: For subjective text and tables, semantic similarity is a poor proxy for correctness: cosine similarity does not capture whether the value is right. Summarization metrics such as ROUGE [19] are also ineffective, since tasks often require calculation, abstraction, or rule-based decisions rather than extraction.

Example: A model that correctly captures the adverse event (AE) description but assigns an incorrect grade may still achieve a high ROUGE score, because most of the textual overlap comes from the extracted description, even though the structured variable is wrong.
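This failure mode can be illustrated with a simple unigram-overlap score in the spirit of ROUGE-1 recall (a deliberately minimal stand-in for a full ROUGE implementation):

```python
def unigram_recall(reference: str, candidate: str) -> float:
    """ROUGE-1-style recall: fraction of reference tokens found in the candidate."""
    ref = reference.lower().split()
    cand = set(candidate.lower().split())
    return sum(t in cand for t in ref) / len(ref)

reference = "adverse event: nausea with vomiting, ctcae grade 3"
candidate = "adverse event: nausea with vomiting, ctcae grade 1"  # wrong grade

# Overlap is high even though the structured grade variable is incorrect.
print(round(unigram_recall(reference, candidate), 2))  # -> 0.88
```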

Completeness [% of variables with no missing critical information]

Evaluation method: Expert evaluation on sampled outputs.

Notes/Challenges: This metric is particularly challenging to automate; semantic similarity and summarization metrics are often ineffective at quantifying completeness.

Example:
Reference: Disease stage: Stage II, T2N0M0
Output: Disease stage: Stage II

The two values have high semantic similarity, but the second omits the tumor size and spread, which are critical clinical details.
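A token-overlap calculation makes the problem concrete. Here Jaccard similarity stands in crudely for a semantic-similarity score; a completeness check instead has to look for the critical detail explicitly:

```python
import re

def tokens(s: str) -> set:
    """Lowercased word tokens, punctuation stripped."""
    return set(re.findall(r"\w+", s.lower()))

def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity, a crude stand-in for semantic similarity."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

reference = "Stage II, T2N0M0"
candidate = "Stage II"

# Overall overlap is substantial...
print(round(jaccard(reference, candidate), 2))  # -> 0.67
# ...but the critical TNM detail must be checked for explicitly:
print("t2n0m0" in tokens(candidate))  # -> False
```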

Timeliness [Time from record availability to structured data output]

Evaluation method: Fully automated measurement. Similar metrics have been measured automatically in other evaluation studies[19].

Notes/Challenges: This is an important metric for researchers, as delays reduce utility in use cases requiring continuous monitoring, trend analysis, or immediate decision-making.
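Because both endpoints are timestamped, this metric reduces to a timestamp difference (a minimal sketch; the timestamps are illustrative):

```python
from datetime import datetime

def timeliness_hours(record_available: datetime, output_ready: datetime) -> float:
    """Hours from record availability to structured data output."""
    return (output_ready - record_available).total_seconds() / 3600

t0 = datetime(2024, 3, 1, 9, 0)    # record becomes available
t1 = datetime(2024, 3, 1, 15, 30)  # structured output produced
print(timeliness_hours(t0, t1))  # -> 6.5
```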

Time to Audit [Time needed to verify correctness]

Evaluation method: Automated measurement during human reviews.

Notes/Challenges: Clinicians often audit abstracted outputs to identify failure modes. Manual abstraction, however, yields results without supporting rationale. AI systems that generate evidence artifacts alongside their outputs can reduce the time required for audit.

In addition to the metrics described above, clinicians we have collaborated with found value in quantifying an additional measure that is specific to AI-based clinical data abstraction tasks:

  • % Fields with Hallucinations: % of variables where the AI generates outputs based on fabricated assumptions. For instance, if a patient’s height is missing but the AI assumes a value to calculate body surface area (BSA), this is flagged as a hallucination. Since humans rarely introduce such fabricated data, tracking hallucinations is uniquely important for evaluating AI reliability and trustworthiness.
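One way to operationalize this flag is to refuse derived values whose inputs are not grounded in the record. The sketch below reuses the BSA example; the dictionary keys are illustrative, not from an actual extraction schema:

```python
import math

def compute_bsa(extracted: dict) -> tuple:
    """Return (bsa_m2, hallucination_flag). BSA (Mosteller formula) is computed
    only when both height and weight were actually extracted from the record;
    a missing input that would have to be assumed is flagged instead of filled in."""
    height = extracted.get("height_cm")
    weight = extracted.get("weight_kg")
    if height is None or weight is None:
        return None, True  # would require a fabricated assumption
    return math.sqrt(height * weight / 3600), False

print(compute_bsa({"height_cm": 170, "weight_kg": 70}))  # grounded -> not flagged
print(compute_bsa({"weight_kg": 70}))                    # height missing -> flagged
```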

Classical metrics such as precision, recall, and F1-score are better suited to evaluate classification tasks than clinical data abstraction tasks, as they do not distinguish between qualitatively different failure modes such as incomplete extraction or hallucinated values. We therefore adopt evaluation metrics aligned with clinical data abstraction quality assurance frameworks.

If you are a clinician or researcher interested in accessing anonymized datasets, benchmark results, or learning more about experimental findings (e.g., whether performance on MCQ-style abstraction tasks correlates with performance on subjective clinical variables), please reach out to us.

FINISH YOUR DOCUMENTATION WHILE YOU TREAT WITH FOSTER

© 2024 by Foster AI Inc.
