COMPUTATIONAL LINGUISTICS APPROACHES TO READABILITY AND AUTOMATIC SIMPLIFICATION
CLARA-FIN
- Yanco Torterolo, Antonio Moreno Sandoval (2023) SimFin: a financial term simplifier demo designed to make complex financial terminology more accessible to non-experts.
CLARA-DH
- Juan Cigarrán, Andrés Rodriguez-Francés, Ana García-Serrano (2024) CLARA-DM corpus: public access to the CLARA-DM paleographic corpus. The corpus was transcribed with TRANSKRIBUS and includes some historical issues of the Diario de Madrid available at the “Hemeroteca de la BNE”.
- Sánchez Salido, E. and A. García Serrano (2023) Automatic «paleographic» transcription model for 18th- and 19th-century texts from the Diario de Madrid (BNE), which can be downloaded from the website of the TRANSKRIBUS tool: https://readcoop.eu/model/spanish-print-xviii-xix
- Antonio Menta (2023) Python framework for testing different models and datasets. It facilitates the reproducibility of the experiments performed and allows results obtained with different hyperparameters to be compared. The model was trained using PyTorch Lightning and HuggingFace on an Nvidia GeForce GTX 1070 Ti GPU with 8 GB of memory. https://github.com/Hisarlik/Simplification_experiments
- Lara-Clares, A.; Lastra-Díaz, J. J.; Garcia-Serrano, A. (2022) «HESML V2R1 Java software library of semantic similarity measures for the biomedical domain», https://doi.org/10.21950/AQLSMV, e-cienciaDatos
CLARA-MeD
CORPUS
- Corpus CLARA-MeD: A collection of 24 298 pairs of professional and simplified texts (>96 million tokens) for automatic medical text simplification in Spanish. A parallel corpus with a subset of 3800 sentence pairs of professional and lay variants (149 862 tokens) is released as a benchmark for medical text simplification. https://digital.csic.es/handle/10261/269887
If you use this corpus, please cite it as follows:
Campillos-Llanos, Leonardo, Ana Rosa Terroba Reinares, Sofía Zakhir Puig, Ana Valverde-Mateos, and Adrián Capllonch-Carrión (2022) Building a comparable corpus and a benchmark for Spanish medical text simplification. Procesamiento del lenguaje natural, nº 69, pp. 189-196. → This article describes the process and criteria to simplify the technical sentences in two versions: at the syntax level, and both at the syntax and lexical levels.
- A new collection of sentences simplified by experts is released at this address: https://digital.csic.es/handle/10261/346579.
If you use this dataset, please cite it as follows:
Campillos-Llanos, Leonardo, Ana Rosa Terroba Reinares, Rocío Bartolomé Rodríguez (2024) Enhancing the understanding of clinical trials with a sentence-level simplification dataset. Procesamiento del lenguaje natural, nº 72, pp. 31-43. → This article describes the process and criteria to simplify the technical sentences in two versions: at the syntax level, and both at the syntax and lexical levels.
LEXICON
- SimpMedLexSp: a lexicon of technical and lay medical terms. Sample file available here.
If you use this resource, please cite it as follows:
Campillos-Llanos, Leonardo, Ana Rosa Terroba-Reinares, Rocío Bartolomé, Ana Valverde-Mateos, Cristina González, Adrián Capllonch, Jónathan Heras (2024) Replace, Paraphrase or Fine-tune? Evaluating Automatic Simplification for Medical Texts in Spanish. Proc. of LREC-COLING 2024, Torino, Italy, May 2024; pp. 13929–13945.
DEMO
- CLARA-MeD tool: a system to help readers understand medical texts. Try it here!
If you use this tool, please cite it as follows:
Campillos-Llanos, Leonardo, Federico Ortega-Riba, Ana Rosa Terroba-Reinares, Ana Valverde-Mateos, Adrián Capllonch-Carrión (2024) CLARA-MeD Tool – A System to Help Patients Understand Clinical Trial Announcements and Consent Forms in Spanish. Studies in Health Technology and Informatics 316, pp. 95-99.
OTHER
- Readability score analysis: compute the Inflesz score of technical and simplified sentences (Python script).
- Frequency-based complex word identification (Python script).
- Embedding-based sentence aligner (Python script): given a comparable corpus of technical and simplified sentences, obtain aligned parallel sentences.
- Trained neural-based and prompt-learning-based models for simplification are available at the CLARA-MeD HuggingFace repository.
- N-grams from the CLARA-MeD corpus: 2-grams, 3-grams and 4-grams extracted from:
  - Texts in technical register (source)
  - Texts in simplified register (target)
  Files available at this address.
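As an illustration of what the n-gram files contain, the following is a minimal sketch of n-gram counting over a list of sentences. It is not the procedure used to build the released files: the tokenizer, the lowercasing and the function names (`tokenize`, `ngram_counts`) are assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Hypothetical tokenizer: lowercase the text and keep runs of word characters."""
    return re.findall(r"\w+", text.lower())


def ngram_counts(sentences, n):
    """Count n-grams (as tuples of tokens) without crossing sentence boundaries."""
    counts = Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        # zip over n shifted copies of the token list yields consecutive n-grams
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


if __name__ == "__main__":
    # Invented example sentences, not taken from the corpus
    source = ["El ensayo clínico aleatorizado", "El ensayo clínico abierto"]
    for n in (2, 3, 4):
        print(n, ngram_counts(source, n).most_common(3))
```

Running the same function separately on the technical (source) and simplified (target) texts would give one frequency table per register.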
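For readers unfamiliar with the readability score listed above, here is a minimal sketch of how it can be computed. It assumes the Szigriszt-Pazos formula (206.835 − 62.3 × syllables/words − words/sentences), interpreted on the Inflesz scale. It is not the project's script: the syllable counter is a rough heuristic written for this example, and all function names are hypothetical.

```python
import re

STRONG = set("aeoáéó")      # strong vowels: two in a row form a hiatus
ACCENTED_WEAK = set("íú")   # an accented weak vowel also breaks a diphthong
VOWELS = "aeiouáéíóúü"


def count_syllables(word):
    """Approximate Spanish syllable count (heuristic, assumption of this sketch)."""
    w = word.lower()
    w = re.sub(r"qu", "q", w)            # silent u in "que", "qui"
    w = re.sub(r"gu(?=[eiéí])", "g", w)  # silent u in "gue", "gui"
    total = 0
    for group in re.findall(f"[{VOWELS}]+", w):
        total += 1  # each vowel group holds at least one syllable
        for a, b in zip(group, group[1:]):
            if (a in STRONG and b in STRONG) or a in ACCENTED_WEAK or b in ACCENTED_WEAK:
                total += 1  # hiatus: the pair splits into two syllables
    return max(total, 1)


def szigriszt_score(text):
    """Szigriszt-Pazos index: 206.835 - 62.3 * syllables/words - words/sentences."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-záéíóúüñ]+", text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 62.3 * syllables / len(words) - len(words) / len(sentences)


def inflesz_grade(score):
    """Map a score to the Inflesz scale bands."""
    if score < 40:
        return "very difficult"
    if score < 55:
        return "somewhat difficult"
    if score < 65:
        return "normal"
    if score < 80:
        return "fairly easy"
    return "very easy"


if __name__ == "__main__":
    sentence = "El paciente toma el medicamento."  # invented example
    score = szigriszt_score(sentence)
    print(round(score, 1), inflesz_grade(score))
```

Scoring a technical sentence and its simplified counterpart with the same function gives a simple way to check whether the simplification moved the text toward an easier band.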