My research work focuses on two main areas: information retrieval (IR) on one hand, and learning from medical data on the other. In the first area, the objectives are to explore and improve IR models and to thoroughly study the methods used to evaluate them. This interest in IR evaluation methods stems from a strong involvement in evaluation campaigns and the creation of test collections. In the second area, the goal of learning from clinical data is to leverage heterogeneous data by relying on textual resources (which inherently carry ambiguity and noise). This heterogeneity requires questioning and modifying data representations as well as learning approaches. Being particularly committed to making my work beneficial to society, I have been interested since my thesis in application domains such as healthcare.
The project, funded by BPI France, aims to accelerate the creation and accessibility of digital commons across the entire value chain of generative AI to ensure its use by the widest possible audience and to foster the emergence of innovative products and services. The specific approach to commons chosen for this initiative enables critical sectors such as healthcare to benefit from and contribute to the development of a foundation of sovereign and secure technologies.
The Pantagruel project aims to develop and evaluate multimodal (written, spoken, pictograms) and inclusive language models for French. The main contributions of the project are the development of freely accessible self-supervised models for French, covering one to three modalities for general and clinical domains. The project will not only produce models but will also design benchmarks to evaluate the generalization capabilities of such models. A portion of the project will focus on biases and stereotypes present in training corpora and downstream models.
This project addresses several limitations of current LLMs (Large Language Models) related to General Purpose Dialogue-assisted Digital Information Access (DbIA). Specifically, the project aims to enable users to access digital information more effectively by overcoming four challenges: (1) LLMs were not designed for information access; (2) LLMs have limited generalization capabilities to new domains and languages; (3) the truthfulness and reliability of their outputs are questionable; and (4) state-of-the-art LLMs are not all open-access, and their scientific methodology and evaluation are barely described in the scientific literature. From a community-building perspective, Guidance aims to bring together the French information retrieval community and advance the development of DbIA models by leveraging LLMs.
The KODICARE project is a Franco-Austrian project funded by the ANR and the FWF (Austrian Science Fund). The partners include the company Qwant on the French side and Research Studio Austria on the Austrian side. The KODICARE project aims to create a new paradigm for evaluating information retrieval targeting the industrial context. Indeed, it is difficult in such a context to maintain Cranfield-style evaluations. In this project, we propose to focus on the impact of varying evaluation environments (knowledge delta) on system performance (result delta). Ultimately, a formalization of these deltas will enable the continuous evaluation of systems in changing environments while ensuring the interpretability of results.
The main objective is to enable more patient trajectory-oriented medicine using artificial intelligence. The application framework for this work is focused on patients with sleep apnea. The trajectories are based on all clinical, environmental, and societal markers that can be measured and have an influence on the patient's condition. Deep learning models enable the exploitation of such a large amount of heterogeneous data.
The FUI project is a partnership with the companies Ophrys Systèmes, Pole Star, Charvet, and Globe VIP. The goal is to create new interactive devices for museums, enabling guided tours using audio and cameras.
The Khresmoi project, a European FP7 project, brought together 12 European partners. It aimed to build multilingual and multimodal search and access systems for biomedical information sources. My team participated in three areas: the exploration and retrieval of biomedical texts; the user interface and search system; and multilingual resources and information dissemination. Additionally, we led the project's evaluation component and the development of the system. My work across these areas involved conducting research, performing development tasks, managing progress, organizing meetings and teleconferences, and writing deliverables and research papers. This first experience on such a large-scale project was extremely formative: it allowed me to gain expertise in the field of information retrieval, build a strong international network, and learn how to manage highly collaborative research work. My collaborations with members of this project have continued since 2014.
Social media are commonly used to express opinions about subjects of interest. Our objective is to develop an effective method for sentiment analysis and summarization of social media content, especially in the health and medical fields. As a target domain, we focus on drugs. We aim to build a web-based system that provides a summarized view of public opinions. A sentence-based system has been built to annotate sentences semantically, based on medical thesaurus semantic types (e.g. Chemical and drugs, Symptom), and then to predict sentiments toward various aspects of a drug (e.g. side effects, cost) using machine learning and linguistic approaches. This project has led to two publications in international refereed conferences and one journal paper.
Comparable corpora are sets of texts written in different languages that are not translations of each other but that share common characteristics. Their main advantage is that they are fully representative of the linguistic and cultural specificities of their respective languages. The Web could theoretically be considered a source of comparable corpora. However, the quality of corpora and of the resources extracted from them depends on the preliminary definition of the corpora and on the care taken in their compilation (i.e. the definition of common features in comparable corpora). In this thesis, we focus on the compilation of specialized comparable corpora in French and Japanese whose documents are extracted from the Web. We propose a definition of these corpora and a set of common features: a specialized domain, a topic and a type of discourse (science or popular science). Our goal is to create a tool to assist the compilation of comparable corpora. First, we present the automatic recognition of common features. Topics can be easily identified with the keywords used in Web searches. By contrast, detecting the type of discourse requires a broad stylistic analysis. This analysis is performed over a learning corpus and leads to the creation of a bilingual typology based on three levels of analysis: structural, modal and lexical. Second, we use this typology to learn a classification model with SVMlight and C4.5. This classification model is tested over an evaluation corpus. Our test results indicate that more than 70% of the documents are correctly classified. Finally, the classifier is integrated into a tool that assists the compilation of comparable corpora, developed on the UIMA system.