COMPUTATIONAL LINGUISTICS APPROACHES TO READABILITY AND AUTOMATIC SIMPLIFICATION IN FINANCIAL NARRATIVE (CLARA-FIN)
WHAT IS CLARA-FIN?
One of the conclusions drawn from the FinT-esp project is that financial reports hide a significant amount of implicit information. Understandably, company communicators (especially presidents and CEOs) do not want to reveal the losses incurred under their management. In our experiments on the automatic classification of annual reports from profit-making and loss-making companies (Moreno et al. 2019, El-Haj et al. in preparation), we found that it is not easy to distinguish one from the other by purely lexical methods (whether lexicon-based or machine learning). The task is difficult even for human specialists, because the relevant information that would make the difference is absent.
The main goal of CLARA-FIN is to describe the core reporting structure that allows the contents of financial reports to be compared. Therefore, our approach to simplification involves, above all, a clear discourse and syntactic structure, not just a basic vocabulary.
We will collect new texts to increase the size and, especially, the variety of the FinT-esp corpus. We will include news from the specialised press as well as information from websites. A more complete and varied corpus will enable us to develop more representative financial language models in Spanish.
A second specific objective is to participate in shared evaluation tasks within the framework of the Workshops on Financial Narrative Processing and MultiLing Financial Summarisation, organised by the researchers of UCREL – Lancaster. This will allow texts in Spanish to be included in the summarisation competitions. One of the conferences where these tasks may be proposed is the annual meeting of the SEPLN.
A third objective is to advance knowledge of financial narrative, from both an economic and a linguistic perspective. The large and valuable body of data collected is a significant source for compiling specialised lexicons or glossaries of financial terms and for publications on the characteristics of financial discourse and its ways of organising information and argumentation. This knowledge has an impact on applied language disciplines such as Translation or Communication.
CLARA-FIN RESEARCH TEAM
RESEARCHERS
ANTONIO MORENO SANDOVAL
Antonio Moreno Sandoval is Professor at the Universidad Autónoma de Madrid. He is the head of the Laboratorio de Lingüística Informática and director of the Cátedra UAM-IIC in Computational Linguistics.
JOSÉ MARÍA GUIRAO
José María Guirao is a lecturer at the School of Technology and Telecommunications Engineering of the Universidad de Granada. He specialises in Internet applications and web-based systems. He has been a senior programmer at the Laboratorio de Lingüística Informática since 2002.
ANA GISBERT
Ana Gisbert Clemente is Associate Professor in the Department of Accounting in the Faculty of Economics at Universidad Autónoma de Madrid. She was a Predoctoral Fellow at Lancaster University within the context of the HARMONIA European project on Accounting Harmonisation and Standardisation in Europe. Since 2018 she has been collaborating with the Laboratorio de Lingüística Informática to develop a financial narrative corpus to analyse the use of language in Spanish listed companies' annual reports. She has published papers in the areas of international accounting, corporate governance, audit oversight and earnings management. Her current research interests focus on the analysis of financial reporting narratives.
CHELO VARGAS
Chelo Vargas Sierra earned her Master's Degree in Translation and Interpreting and a Ph.D. in Translation from the University of Alicante (UA). She is a Senior Lecturer at the Department of English Studies (UA) and she is the Director of the Institute of Modern Languages (IULMA).
JORDI PORTA
Jordi Porta Zamorano received a BSc degree in Computer Science from the Polytechnic University of Catalonia and a Ph.D. from the Universidad Autónoma de Madrid. He is a member of the Technology Department at the Royal Spanish Academy and a part-time lecturer in the Computer Science Department at Universidad Autónoma de Madrid. His current R&D interests include natural language processing, computational lexicography and artificial intelligence.
MARTA TORDESILLAS
PAUL RAYSON
Paul Rayson is a Professor in Computer Science at Lancaster University, UK, and Director of the UCREL interdisciplinary research centre, which carries out research in corpus linguistics and natural language processing (NLP). A long-term focus of his work is semantic multilingual NLP in extreme circumstances where language is noisy, e.g. in historical, learner, speech, email, txt and other CMC varieties.
MAHMOUD EL-HAJ
Dr Mo El-Haj is a Senior Lecturer in Computer Science and the Co-Director of the UCREL NLP Group at Lancaster University. He received his Ph.D. in Computer Science from the University of Essex, and his research interests include Natural Language Processing (NLP), Language Resources, Data Science, Health and Medicine, Biomedical NLP, Text Summarization, Corpus and Computational Linguistics, Financial NLP, Machine Learning and Information Extraction.
DOAA SAMY
Doaa Samy earned her Ph.D. in Computational Linguistics (Universidad Autónoma de Madrid, 2005); her main interests are in multilingual resources. She has had a solid academic career as Tenured Associate Professor at Cairo University since 2006. She is currently an Advanced Computational Linguist at the Instituto de Ingeniería del Conocimiento, Madrid, and an Associate (External) Professor of Linguistics and Computational Linguistics at Universidad Complutense de Madrid.
PABLO HAYA
Pablo Haya is the head of the Social Business Analytics group at IIC-Knowledge Engineering Institute (Madrid, Spain). He collaborates with the Department of Computer Science of the Universidad Autónoma de Madrid (Spain), from which he received a Ph.D. in Computer Science in 2006, teaching master's courses on data science and ubiquitous computing.
BLANCA CARBAJO CORONADO
Blanca Carbajo Coronado holds a BA in Translation and Interpreting and an MA in Spanish Linguistics. She is currently a Ph.D. student at the Computational Linguistics Laboratory at Universidad Autónoma de Madrid, with a scholarship (FPU) awarded by the Spanish Ministry of Science, Innovation and Universities. Her thesis deals with cause-effect relations in financial narratives using computational linguistic methods. She has also published work on financial terminology and corpus linguistics.
CLARA-FIN RESOURCES
PUBLICATIONS
Authors: Ana García Toro, Jordi Porta Zamorano and Antonio Moreno-Sandoval
Year: 2022
Overview: This paper introduces an automatic discourse marker (DM) tagger for Spanish and discusses the development and evaluation of the tool, reporting significant agreement rates among human annotators and an impressive F1-score using Transformers.
Authors: Mahmoud El-Haj, Nadhem Zmandar, Paul Rayson, Ahmed AbuRa’ed, Marina Litvak, Nikiforos Pittaras, George Giannakopoulos, Aris Kosmopoulos, Blanca Carbajo-Coronado and Antonio Moreno-Sandoval
Year: 2022
Overview: The paper showcases the outcomes of the FNS 2022, an initiative for summarizing financial annual reports from the UK, Greece, and Spain, as part of the FNP 2022 Workshop.
Authors: Abderrahim Ait Azzi, Sandra Bellato, Blanca Carbajo Coronado, Mahmoud El-Haj, Ismail El Maarouf, Mei Gan, Ana Gisbert, Juyeon Kang and Antonio Moreno Sandoval
Year: 2022
Overview: This paper details the FinTOC-2022 Shared Task, which focuses on extracting and hierarchically organizing the structure of financial documents, fostering progress in table-of-contents extraction technologies.
Authors: Antonio Moreno-Sandoval, Jordi Porta-Zamorano, Blanca Carbajo-Coronado, Doaa Samy, Dominique Mariko and Mahmoud El-Haj
Year: 2023
Overview: This paper presents the results and insights from the Financial Document Causality Detection Shared Task (FinCausal 2023). It outlines the task’s objectives, methodology, dataset creation, and evaluation metrics. It also discusses the approaches and results of participating teams.
Authors: Elias Zavitsanos, Aris Kosmopoulos, George Giannakopoulos, Marina Litvak, Blanca Carbajo-Coronado, Antonio Moreno-Sandoval and Mo El-Haj
Year: 2023
Overview: This paper presents the results and insights from the Financial Narrative Summarisation Shared Task (FNS 2023), focusing on summarizing annual reports from the UK, Greece, and Spain. The task, part of the 5th Financial Narrative Processing Workshop, aimed at using automatic summarization techniques, either abstractive or extractive, to condense long financial documents. The challenge attracted six systems from three teams.
Authors: Jordi Porta-Zamorano, Yanco Torterolo and Antonio Moreno-Sandoval
Year: 2023
Overview: This paper presents a T5-based system developed by LLI-UAM for the FinancES 2023 Shared Task. It includes noise and data augmentation experiments, using corrected datasets and ChatGPT for data improvement. The paper reports on the system’s performance across tasks, detailing the impact of noise, data augmentation, and hallucinations on model accuracy.
Author: Blanca Carbajo Coronado
Year: 2023
Overview: This study examines linguistic differences in shareholder letters from profitable and loss-making Spanish companies, focusing on verbs and nouns to discern financial performance indicators.
Authors: Blanca Carbajo Coronado and Antonio Moreno Sandoval
Year: 2024
Overview: This paper explores automatic concept extraction and lexical simplification in Spanish financial texts, employing AI language models for term identification and proposing strategies for making complex financial language more accessible.
DEMOS
Author: Yanco Torterolo
Year: 2023
Overview: Financial term simplifier demo, designed to make complex financial terminology more accessible for non-experts.
PUBLIC DISSEMINATION
Antonio Moreno Sandoval Invited Talks:
- «Annotating discourse markers and key financial terms in Spanish with transformers», Invited talk, 3rd Financial Narrative Processing workshop, Lancaster, 15 September 2021.
- «Some issues on Financial Narrative Processing in Spanish», Plenary session, Meaning and Knowledge Representation Conference, UAM, 6 July 2022.
- «Algunas cuestiones sobre el procesamiento de la narrativa financiera», Invited talk, CITIUS, Centro Singular de Investigación en Tecnoloxías Intelixentes, Universidad de Santiago de Compostela, 28 November 2022.
- «How does the technology assist organizations in delivering more significant research impact?», Invited talk, University Leaders’ Forum 2023, Universitas Muhammadiyah, Yogyakarta, Indonesia, 10 March 2023.
- «Lingüística e IA», Invited talk, Mundo actual, UAM, 3 May 2023.
- «¿Cómo pueden las tecnologías del lenguaje ayudar a la enseñanza del español?», Invited talk, Congreso Español para todos, Salamanca, 27 June 2023.
- «Panorama histórico de la Lingüística Computacional en España a través del LLI-UAM», Inaugural lecture of the Máster de Lingüística aplicada y Tecnologías del lenguaje, UCM, 15 September 2023.
- «Tecnologías lingüísticas al servicio de la enseñanza de lenguas», Inaugural lecture of the Máster de Español, Universidad de Valladolid, 21 September 2023.
- «Evolución de la Traducción Automática: desde los diccionarios a los transformers en 40 años», Invited talk, Jornadas DARIAH-ES at the BNE, 7 November 2023.
- «LLI-UAM (1988-2023): 35 años de investigación, docencia y transferencia en Lingüística Computacional», Invited talk, Jornadas sobre Docencia e Investigación Lingüística en la Era de la Inteligencia Artificial, Universidad de La Rioja, 13 December 2023.
- «Herramientas digitales para la literatura: el diccionario de lemas y formas del Quijote», I Jornadas de Humanidades Digitales, UAM, 9 January 2024.