COMPUTATIONAL LINGUISTICS APPROACHES TO READABILITY AND AUTOMATIC SIMPLIFICATION OF MEDICAL DISCOURSE (CLARA-MED)

WHAT IS CLARA-MeD?

(Spanish summary below)

The sheer number of technical terms in medical texts is a language barrier to patients’ informed decision-making. Laypeople and patients often need explanations of technical terms in clinical trials, medical records or medication leaflets. However, healthcare professionals often lack the time to give full details about pathologies or procedures during a consultation. This is especially critical for patients’ participation in preventive-care screenings and clinical trials (CT). Protocols and CT announcements must be clear enough for candidate patients to understand the procedures they may take part in.

To narrow this language gap, automatic natural language processing methods may make health information more accessible and improve patients’ health literacy. One approach is term simplification, i.e. substituting a difficult-to-read word (e.g. «amigdalectomía», 'tonsillectomy') with an easier or more explanatory paraphrase (e.g. «operación de anginas», 'tonsil operation').
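As an illustration of the term simplification described above, the following is a minimal sketch of dictionary-based lexical substitution. It is not the project's implementation: the function name `simplify_terms` and the one-entry `LEXICON` are hypothetical, and only the «amigdalectomía» / «operación de anginas» pair comes from the text above.

```python
import re

# Hypothetical mini-lexicon (keys in lowercase). The single entry is the
# example pair given in the project description; a real resource would
# hold many more technical/patient equivalences.
LEXICON = {
    "amigdalectomía": "operación de anginas",
}


def simplify_terms(text, lexicon):
    """Replace every technical term found in `lexicon` with its lay paraphrase."""
    if not lexicon:
        return text
    # Try longer terms first so multiword terms are matched before shorter ones.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )

    def replace(match):
        # Look the matched term up in lowercase, since keys are lowercase.
        return lexicon[match.group(0).lower()]

    return pattern.sub(replace, text)


print(simplify_terms("Le van a hacer una amigdalectomía mañana.", LEXICON))
```

A plain substitution like this ignores gender and number agreement and word sense, which is one reason a comparison with neural and hybrid methods is of interest.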

The CLARA-MeD project aims to:

- Develop linguistic resources for automatic medical term simplification in Spanish.
- Conduct experiments in automatic text simplification.

The project involves the following work:

- A comparable corpus of technical and lay texts will be collected to map and extract patient-oriented equivalents of medical terms.
- A simplified medical lexicon of Spanish, SimpMedLexSp, will gather equivalences between technical and patient terms.
- Experiments will be run to compare lexical substitution approaches, methods based on state-of-the-art neural networks, and hybrid approaches.

The results may be of interest to:

- Terminologists, especially the Medical Terminology Unit of the Spanish Royal Academy of Medicine.
- The biomedical natural language processing research community working in Spanish.

The project addresses the social challenge of improving patients’ understanding of medical language, which is essential for countering information manipulation and medical fake news.

The CLARA-MeD project (PID2020-116001RA-C33) was funded by MICIU/AEI/10.13039/501100011033/ under the call «Proyectos I+D+i Retos de Investigación».

Resumen en español

La infinidad de términos en los textos médicos es una barrera lingüística para la toma de decisiones bien informada del paciente. Los pacientes y usuarios no especializados a menudo requieren explicaciones sobre los términos técnicos de los estudios clínicos, los informes médicos o los prospectos de medicamentos. Sin embargo, los profesionales sanitarios carecen del tiempo suficiente durante la consulta para aportar detalles sobre sus patologías o procedimientos. Esto es especialmente importante para la participación de los pacientes en pruebas y exámenes de cuidados preventivos así como en ensayos clínicos. Los protocolos y anuncios de ensayos clínicos han de ser suficientemente comprensibles para que los pacientes candidatos comprendan los procedimientos a los que se podrían someter.

Para aliviar esta brecha lingüística, existen métodos automáticos de procesamiento del lenguaje natural que pueden mejorar la accesibilidad a la información clínica o de salud y aumentar la alfabetización sanitaria de los pacientes. Uno de los enfoques es la simplificación de términos. Estos métodos permiten sustituir un término difícil de comprender (p. ej., «amigdalectomía») con una paráfrasis más explicativa (p. ej., «operación de anginas»).

El proyecto CLARA-MeD tiene como objetivo:

Desarrollar recursos lingüísticos para la simplificación automática de términos médicos en español.
Realizar experimentos en simplificación automática de textos en dominio médico.

En concreto, se llevarán a cabo los siguientes trabajos:

Se recogerá un corpus comparable de textos médicos técnicos y simplificados para extraer equivalencias de términos médicos en registro paciente.
Se creará un léxico médico simplificado del español, SimpMedLexSp, con equivalencias entre términos técnicos y orientados al paciente.
Se llevarán a cabo experimentos para comparar enfoques de simplificación basados en sustitución léxica, métodos basados en redes neuronales de última generación y enfoques híbridos.

Los resultados del proyecto CLARA-MeD pueden ser de interés para:

La comunidad investigadora en procesamiento del lenguaje natural biomédico que trabaja en español
Terminólogos, especialmente la Unidad de Terminología Médica de la Real Academia Nacional de Medicina de España.

El proyecto se enmarca en el reto social de mejorar la comprensión del lenguaje médico, que es indispensable para evitar la manipulación informativa y los bulos de información médica.

El proyecto CLARA-MeD (PID2020-116001RA-C33) fue financiado por MICIU/AEI/10.13039/501100011033/ en la convocatoria «Proyectos I+D+i Retos de Investigación».

ACKNOWLEDGMENTS

MARISOL HERNANDO TUNDIDOR, Unidad de tratamiento de la información (CCHS, CSIC)
YARA MOSTAZO FERNÁNDEZ, Unidad de tratamiento de la información (CCHS, CSIC)

CLARA-MeD RESOURCES

▶ CORPUS

Corpus CLARA-MeD: a collection of 24 298 pairs of professional and simplified texts (>96 million tokens) for automatic medical text simplification in Spanish. A parallel corpus consisting of a subset of 3800 sentence pairs of professional and lay variants (149 862 tokens) is released as a benchmark for medical text simplification. https://digital.csic.es/handle/10261/269887
If you use this corpus, please cite it as follows:
Campillos-Llanos, Leonardo, Ana Rosa Terroba Reinares, Sofía Zakhir Puig, Ana Valverde-Mateos, and Adrián Capllonch-Carrión (2022) Building a comparable corpus and a benchmark for Spanish medical text simplification. Procesamiento del lenguaje natural, nº 69, pp. 189-196.
A new collection of sentences simplified by experts is available at https://digital.csic.es/handle/10261/346579. If you use this dataset, please cite it as follows:
Campillos-Llanos, Leonardo, Rocío Bartolomé Rodríguez, Ana Rosa Terroba Reinares (2024) Enhancing the understanding of clinical trials with a sentence-level simplification dataset. Procesamiento del lenguaje natural, nº 72, pp. 31-43.
An annotated corpus of 225 patient-oriented documents in which complex words were marked for automatic complex word identification: https://digital.csic.es/handle/10261/373675. If you use this dataset, please cite it as follows:
Ortega-Riba, Federico, Leonardo Campillos-Llanos, Doaa Samy (2025) Complex Word Identification for Lexical Simplification in Spanish Texts for Patients. Procesamiento del lenguaje natural, nº 74, pp. 95-108.

▶ LEXICON

SimpMedLexSp: a lexicon of technical and lay medical terms. Sample file available here. If you use this dataset, please cite it as follows:
Campillos-Llanos, Leonardo, Ana Rosa Terroba-Reinares, Rocío Bartolomé-Rodríguez, Ana Valverde-Mateos, Cristina González-Sánchez, Adrián Capllonch-Carrión, Jónathan Heras-Vicente (2024) Replace, Paraphrase or Fine-tune? Evaluating Automatic Simplification for Medical Texts in Spanish. Proc. of LREC-COLING 2024, Torino, Italy, May 2024; pp. 13929–13945.

▶ ANNOTATION TOOL

MEDSPANER is a semantic annotation system for Spanish medical texts.
You can access it here (registration required).
Check also the companion GitHub repository or the demonstration video.
If you use MEDSPANER, please cite it as follows:
Campillos-Llanos, Leonardo, Ana Valverde-Mateos, Adrián Capllonch-Carrión (2025) Hybrid natural language processing tool for semantic annotation of medical texts in Spanish. BMC Bioinformatics, 26(7). https://doi.org/10.1186/s12859-024-05949-6

▶ DEMO

CLARA-MeD tool: a system to help readers understand medical texts. Try it here (if the link does not work, copy and paste the following URL into your web browser: http://claramed.csic.es/demo).
If you want to cite this tool, please do so as follows:
Campillos-Llanos, Leonardo, Federico Ortega-Riba, Ana Rosa Terroba-Reinares, Ana Valverde-Mateos, Adrián Capllonch-Carrión (2024) CLARA-MeD Tool – A System to Help Patients Understand Clinical Trial Announcements and Consent Forms in Spanish. Studies in Health Technology and Informatics, vol. 316, pp. 95-99.

▶ OTHER

Readability score analysis: compute the Inflesz score of technical and simplified sentences (Python script).
Frequency-based complex word identification (Python script).
Embedding-based sentence aligner (Python script): given a comparable corpus of technical and simplified sentences, obtain aligned parallel sentences.
Trained neural-based and prompt-learning-based models for simplification are available at the CLARA-MeD HuggingFace repository.
N-grams from the CLARA-MeD corpus: 2-grams, 3-grams and 4-grams extracted from:
- Texts in technical register (source)
- Texts in simplified register (target)
Files available at this address.
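
As an illustration of the readability score analysis listed above, here is a minimal sketch that compares a technical and a simplified sentence. It is not the project's script. It assumes that the score underlying the Inflesz scale is the Flesch-Szigriszt index, 206.835 − 62.3 × (syllables / words) − (words / sentences), and it approximates Spanish syllables by counting vowel groups, which undercounts words with hiatus. The names `count_syllables` and `flesch_szigriszt` and both example sentences are hypothetical.

```python
import re

# Runs of vowels approximate syllable nuclei (assumption: hiatus is ignored,
# so "realizó" counts 3 syllables instead of 4).
VOWEL_GROUP = re.compile(r"[aeiouáéíóúü]+", re.IGNORECASE)
WORD = re.compile(r"[a-záéíóúüñ]+", re.IGNORECASE)
SENTENCE_END = re.compile(r"[.!?]+")


def count_syllables(word):
    """Approximate the syllable count of a Spanish word (at least 1)."""
    return max(1, len(VOWEL_GROUP.findall(word)))


def flesch_szigriszt(text):
    """Return the readability index; higher values mean easier text."""
    words = WORD.findall(text)
    # Keep only segments that actually contain a word.
    sentences = [s for s in SENTENCE_END.split(text) if WORD.search(s)]
    if not words or not sentences:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 62.3 * syllables / len(words) - len(words) / len(sentences)


technical = "Se realizó una amigdalectomía."
simplified = "Le hicieron una operación de anginas."
print(round(flesch_szigriszt(technical), 2), round(flesch_szigriszt(simplified), 2))
```

With this approximation, the simplified sentence scores higher (easier) than the technical one; a proper syllabifier would shift the absolute values.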

▶ PUBLIC DISSEMINATION

«Descubre las innovaciones del procesamiento del lenguaje: herramientas en acción», XXIV Semana de la Ciencia 2024, CCHS, CSIC (15/11/2024).
«New Perspectives and Progress on Medical Natural Language Processing», seminar at the CLARA-NLP Final Expert Workshop, co-organized with UAM and UNED. CCHS, CSIC (3/7/2024). The program of the CLARA-NLP Final Expert Workshop is available here.
«Cómo la inteligencia artificial nos puede ayudar a procesar textos médicos», XIII Feria Madrid es Ciencia 2024, IFEMA (8/3/2024)
«Aprende cómo funciona el procesamiento del lenguaje en la Inteligencia Artificial», XXIII Semana de la Ciencia, CCHS, CSIC (10 and 17/11/2023). Check some of the slides.
«¿Cómo ayuda el procesamiento del lenguaje a simplificar textos médicos?», Jornadas EnClaro 5ª edición (24/10/2023)
Interview on Hoy empieza todo 2 (Radio 3) (23/10/2023)
«Recursos para el procesamiento del lenguaje médico en español», at the Jornada de Biología Computacional, Ciencia de datos e Inteligencia Artificial (CSIC, 3/7/2023)
«Simplificación de textos médicos con procesamiento del lenguaje: el proyecto CLARA-MeD», Seminario Mirian Andrés, Universidad de La Rioja (23/6/2023)
«Proyecto CLARA-MeD. Procesamiento del lenguaje médico para la simplificación automática de textos», Jornada de Grandes infraestructuras europeas de Ciencias Sociales y Humanidades en el CSIC: DARIAH y CLARÍN en el horizonte (11/5/2023)
«Advances in processing and simplification of clinical trials texts», invited seminar at LISN (14/3/2023) and at CENTAL (16/3/2023)

Grant PID2020-116001RA funded by MICIU/AEI/10.13039/501100011033

CLARA-MeD RESEARCH TEAM

RESEARCHERS

LEONARDO CAMPILLOS-LLANOS

ADRIÁN CAPLLONCH CARRIÓN

CRISTINA GONZÁLEZ SÁNCHEZ

ANA VALVERDE MATEOS

OTHER COLLABORATORS

ANA ROSA TERROBA REINARES

SOFÍA ZAKHIR PUIG

ROCÍO BARTOLOMÉ RODRÍGUEZ

JÓNATHAN HERAS VICENTE

FEDERICO ORTEGA RIBA