COMPUTATIONAL LINGUISTICS APPROACHES TO READABILITY AND AUTOMATIC SIMPLIFICATION
WHAT IS CLARA-NLP?
The CLARA-NLP proposal integrates the experience of researchers from three institutions, UAM, UNED and CSIC, who have collaborated for more than a decade on different NLP projects. On this occasion we explore the automatic simplification of specialised text into easier-to-read text. The common basis will therefore be the evaluation framework, while the teams will differ in domain (financial, digital humanities and medical), each group having proven experience in its domain from previous projects. What does each research team contribute?
- LLI-UAM: one of the oldest computational linguistics laboratories in the country, founded with the EUROTRA machine translation project. For three decades it has worked on numerous NLP applications, excelling in the development of language resources in multiple languages (Spanish, Arabic, Japanese, Chinese), both spoken and written. In the coordinated MULTIMEDICA project it addressed language resources in medicine, and in the recent FINT-ESP project it has created the largest data repository on financial narrative in Spanish to date.
- UNED: The senior researchers in the proposal belong to the well-known nlp.uned.es laboratory (Research Group in Natural Language Processing and Information Retrieval at UNED, the Spanish open distance university). Its main work focuses on natural language processing, textual and multimedia information retrieval, systems and platform development, cultural heritage, and open resources for online learning. The group's expertise is demonstrated by its publications as well as by recent reproducibility software and open datasets published at e-CienciaDatos. Also worth mentioning is the software prototype developed in a previous project to support humanists' collaborative work on the Museo del Prado's digital information. The group has participated with the UAM research group in the MAVIR consortium and its continuation, MA2VIRCM.
- CSIC: the Principal Investigator (PI) carried out pre- and post-doctoral research at LLI-UAM and collaborated in the MULTIMEDICA project and in the MAVIR and MA2VIRCM consortia. His postdoctoral work at LIMSI (Université Paris-Saclay – CNRS) gave him deeper experience in medical NLP. After returning to Spain with a Marie Curie H2020 grant, the PI developed resources for Spanish medical texts (a computational lexicon, a semantic term annotator and an annotated clinical trial corpus). He recently started a tenure-track position at CSIC (August 2020).
The interaction among the partners’ objectives and tasks is as follows. Regarding the computational infrastructure, the UNED group will support the framework for conducting the machine- and deep-learning experiments and for integrating the developed corpora and resources into it. With regard to the generation of resources, each team will focus on a domain: CSIC and UNED will work on medical texts; UAM will work on financial texts; and UNED and UAM will collaborate in annotating texts for digital humanities. In addition, the UAM team will share resources from previous projects (e.g. a morphosyntactic tagger), and all teams will organise a new shared task in Spanish. Regarding training and educational resources, the UNED group will develop MOOCs and an ebook in collaboration with the other groups. All teams will share outreach activities.
Therefore, this proposal, which is part of the “Retos de la Sociedad” (Challenges to Society) call, clearly meets the objectives of multidisciplinary (computer scientists, linguists, economists, historians, medical doctors) and cross-domain (finance, humanities, medicine) collaboration. Interdisciplinarity is achieved through the proposed work on natural language processing and the development of linguistic resources. The central problem we address is open access to specialised documentation in three domains by automating the process of rewriting it simply, in a way that is accessible to citizens. This coordinated project exemplifies the synergy of complementary knowledge across scientific fields and institutions (in addition to the three mentioned, the teams include researchers from the Universities of Alicante, Granada and Lancaster).
Other notable aspects of this proposal are the participation of a young yet established researcher, holding a permanent position, who leads one of the sub-projects, and the collaboration with a prestigious European university in the field of NLP (Lancaster, UK).
The challenges chosen are:
– Challenge 7: Digital economy, society and culture. This is the application of AI technologies to help advance the digital society by automatically simplifying specialised documents consumed by citizens. According to the National Research Plan (Plan Estatal de IDI), we work on “Tecnologías avanzadas para el procesamiento del lenguaje natural” (“Advanced technologies for natural language processing”; point VI, p. 83).
– Challenge 6: Social sciences and humanities, science with and for society. The research teams in the coordinated project include social scientists (linguists, economists) and humanists (historians). Their contribution is essential for bringing science and digital technology to society in the three domains addressed (finance, medicine and humanities).
Beyond these two challenges, a third challenge relates to health. Simplifying medical discourse is a necessity for patients to engage actively in their care and better understand clinical protocols and procedures. Patient empowerment is especially critical not only in the context of the global COVID-19 pandemic, but also with regard to disorders currently prevalent in Western countries, such as diabetes, obesity and chronic depression.
In this project we adopt a hybrid methodology (rules and a lexicon for text annotation, plus machine learning on the annotated texts) based on experimentation with datasets and deep learning algorithms. The resources generated will be offered in open repositories, according to each institution’s plan. The results will be presented mainly through publications and participation in conferences. Competitive events (shared tasks) will also be organised with the project’s datasets in order to reach the widest possible audience for the results. The coordinated project also has a training plan for young researchers, who will benefit from the international contacts of the three groups. Finally, the shared knowledge and expertise of the research groups will be offered through an onsite course, several webinars, a MOOC and an ebook produced in the project.
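To make the rule-and-lexicon step of the hybrid methodology concrete, the following is a minimal illustrative sketch in Python. The mini-lexicon, the example terms and the function names are hypothetical stand-ins, not the project’s actual resources; a real lexicon would be far larger and expert-curated, and the annotated output would feed the machine-learning stage.

```python
import re

# Hypothetical mini-lexicon mapping specialised medical terms to
# plain-language equivalents (illustrative entries only).
LEXICON = {
    "cefalea": "dolor de cabeza",           # headache
    "disnea": "dificultad para respirar",   # shortness of breath
    "hipertensión": "tensión alta",         # high blood pressure
}

def annotate(text):
    """Rule-based step: locate lexicon terms and return their spans."""
    spans = []
    for term in LEXICON:
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE):
            spans.append((m.start(), m.end(), term))
    return sorted(spans)

def simplify(text):
    """Substitute each annotated term with its plain-language equivalent."""
    for term, plain in LEXICON.items():
        text = re.sub(r"\b" + re.escape(term) + r"\b", plain, text, flags=re.IGNORECASE)
    return text

print(simplify("El paciente presenta cefalea y disnea."))
```

In the full pipeline, the spans produced by the annotation step would also serve as training data for supervised models, rather than only driving direct substitution as in this sketch.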
The Work Plan is divided into five Work Packages:
- A coordination package (WP0), which includes the management of the project.
- A package for the infrastructure and experimental framework hub (WP1). Its main goal is to provide the project with a solid hardware and software hub, not only to test the developed algorithms and models on the proposed document collections, but also to allow external researchers to evaluate their collections against the project’s outcomes.
- A package for generation of shared resources (WP2), which includes different tasks for each domain.
- A package devoted to training and educational resources (WP3).
- A package for dissemination (WP4), which plans publications, datasets and participation in events.
Here is the list of all project members, divided by sub-projects:
The CARCEM research team is also collaborating with us (https://dimh.hypotheses.org/).