claranlp

COMPUTATIONAL LINGUISTICS APPROACHES TO READABILITY AND AUTOMATIC SIMPLIFICATION

CLARA-FIN

  • Yanco Torterolo, Antonio Moreno Sandoval (2023) Financial term simplifier demo (SimFin), designed to make complex financial terminology more accessible to non-experts.

CLARA-DH

  • Juan Cigarrán, Andrés Rodriguez-Francés, Ana García-Serrano (2024) CLARA-DM corpus. Public access to the CLARA-DM paleographic corpus, transcribed with TRANSKRIBUS and including some historical issues of the Diario de Madrid available at the “Hemeroteca de la BNE”.
  • Sánchez Salido, E. and A. García Serrano (2023). Automatic «paleographic» transcription model for 18th- and 19th-century texts from the Diario de Madrid (BNE), which can be downloaded from the TRANSKRIBUS tool website: https://readcoop.eu/model/spanish-print-xviii-xix
  • Antonio Menta (2023) Python framework for testing different models and datasets. It makes the experiments easier to reproduce and allows results obtained with different hyperparameters to be compared. The model was trained using PyTorch Lightning and Hugging Face on an Nvidia GeForce GTX 1070 Ti GPU with 8 GB of memory. https://github.com/Hisarlik/Simplification_experiments
  • Lara-Clares, A.; Lastra-Díaz, J. J.; Garcia-Serrano, A. (2022) «HESML V2R1 Java software library of semantic similarity measures for the biomedical domain», https://doi.org/10.21950/AQLSMV, e-cienciaDatos

CLARA-MeD

CORPUS

  • Corpus CLARA-MeD: A collection of 24 298 pairs of professional and simplified texts (>96 million tokens) for automatic medical text simplification in Spanish. A parallel subset of 3800 sentence pairs in professional and lay variants (149 862 tokens) is released as a benchmark for medical text simplification. https://digital.csic.es/handle/10261/269887
    If you use this corpus, please cite it as follows:
    Campillos-Llanos, Leonardo, Ana Rosa Terroba Reinares, Sofía Zakhir Puig, Ana Valverde-Mateos, and Adrián Capllonch-Carrión (2022) «Building a comparable corpus and a benchmark for Spanish medical text simplification». Procesamiento del lenguaje natural, nº 69, pp. 189-196. → This article describes the process and criteria to simplify the technical sentences in two versions: at the syntax level, and both at the syntax and lexical levels.
  • A new collection of sentences simplified by experts is available at this address: https://digital.csic.es/handle/10261/346579. If you use this dataset, please cite it as follows:
    Campillos-Llanos, Leonardo, Ana Rosa Terroba Reinares, Rocío Bartolomé Rodríguez (2024) «Enhancing the understanding of clinical trials with a sentence-level simplification dataset». Procesamiento del lenguaje natural, nº 72, pp. 31-43. → This article describes the process and criteria to simplify the technical sentences in two versions: at the syntax level, and both at the syntax and lexical levels.

LEXICON

  • SimpMedLexSp: a lexicon of technical and lay medical terms. Sample file available here.

DEMO

  • CLARA-MeD tool: a system to help readers understand medical texts. Try it here!

OTHER

  • Readability score analysis: compute the Inflesz score of technical and simplified sentences (Python script).
  • Frequency-based complex word identification (Python script).
  • Embedding-based sentence aligner (Python script): given a comparable corpus of technical and simplified sentences, obtain aligned parallel sentences.
  • Trained neural-based and prompt-learning-based models for simplification are available at the CLARA-MeD HuggingFace repository.
  • N-grams from the CLARA-MeD corpus: 2-grams, 3-grams and 4-grams extracted from:
    • Texts in technical register (source)
    • Texts in simplified register (target)

    Files available at this address.
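
    As an illustration of the readability score analysis listed above, the following is a minimal sketch and not the CLARA-MeD script itself. It assumes that the Inflesz score is the Szigriszt-Pazos perspicuity index, 206.835 − 62.3 × (syllables / words) − (words / sentences), read against the INFLESZ scale. The function names, the English band labels, the handling of exact band boundaries and the example counts are all hypothetical. Syllable, word and sentence counts are taken as inputs, because Spanish syllabification is outside the scope of this sketch.

```python
def szigriszt_pazos(syllables: int, words: int, sentences: int) -> float:
    """Szigriszt-Pazos perspicuity index from raw counts (assumed formula)."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 206.835 - 62.3 * (syllables / words) - (words / sentences)


def inflesz_band(score: float) -> str:
    """Map a score to an INFLESZ band.

    Assumed cut-offs: 40, 55, 65 and 80. A score that falls exactly on a
    cut-off is placed in the easier band, which is an arbitrary choice here.
    """
    if score < 40:
        return "very difficult"
    if score < 55:
        return "somewhat difficult"
    if score < 65:
        return "normal"
    if score < 80:
        return "fairly easy"
    return "very easy"


# Hypothetical counts for a technical text and its simplified version.
technical = szigriszt_pazos(syllables=260, words=100, sentences=4)
simplified = szigriszt_pazos(syllables=200, words=100, sentences=10)

print(f"technical:  {technical:.2f} ({inflesz_band(technical)})")
print(f"simplified: {simplified:.2f} ({inflesz_band(simplified)})")
```

    Comparing the two scores for a sentence pair shows whether the simplified version moves to an easier band. Shorter sentences and fewer syllables per word both raise the score.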
