Semantic Indexing of Video Documents
Searching in image, video and audio collections raises its own specific difficulties, among which the “semantic gap” problem is the most challenging. The semantic gap refers to the “distance” between the signal samples (audio samples or pixels) of which raw multimedia documents are made and the concepts and/or relations that make sense to human beings. Concept indexing, or document categorization, is very important for multimedia content-based search. The most common approach is based on supervised learning from labeled examples. Several challenges need to be addressed for efficient and practical multimedia content-based indexing and retrieval:
- the level of concept classification performance, still quite low in “wild” conditions (in the 0.2-0.3 range on a scale from 0 to 1);
- the generalization capabilities: the performance of the classifiers degrades significantly when they are used in domains different from those on which they were trained;
- scalability: classification methods need to remain operational when applied to large numbers of documents, target concepts and content types.
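The report does not name the tools used, but the supervised-learning approach on highly imbalanced data can be sketched as follows. This is a minimal illustration with synthetic shot descriptors (the data, dimensions and use of scikit-learn are assumptions, not the authors' actual setup); the `class_weight="balanced"` option is one standard way to compensate for a concept that appears in only a small fraction of the shots.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
# Synthetic shot descriptors: 500 negatives, 25 positives (rare concept)
X_neg = rng.normal(0.0, 1.0, size=(500, 64))
X_pos = rng.normal(0.8, 1.0, size=(25, 64))
X = np.vstack([X_neg, X_pos])
y = np.array([0] * 500 + [1] * 25)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight='balanced' reweights errors inversely to class frequency,
# which matters for the heavy class imbalance typical of concept indexing
clf = LinearSVC(class_weight="balanced", C=1.0, max_iter=5000)
clf.fit(X_tr, y_tr)
scores = clf.decision_function(X_te)

# Average precision is the usual per-concept retrieval metric (0 to 1 scale)
ap = average_precision_score(y_te, scores)
```

Ranking the test shots by `scores` and evaluating with average precision mirrors how campaigns such as TRECVid score each concept.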
We addressed these challenges using a generalized and sophisticated classification pipeline, working on all of its important stages: descriptor extraction and aggregation, descriptor optimization, classification for highly imbalanced datasets, fusion of classifiers, and re-ranking using the temporal and conceptual contexts. We also worked on efficiently producing annotations using active learning and active cleaning approaches. We evaluated these approaches in the context of the TRECVid and MediaEval international evaluation campaigns. In the 2013 editions, we ranked second among 26 participating groups at the TRECVid semantic indexing task, and first (resp. second) among 5 (resp. 9) participants at the MediaEval subjective (resp. objective) violence detection task. This work led to five publications in international journals (of which two are to appear), one book chapter, and several papers in national and international conferences.
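Two of the pipeline stages mentioned above, fusion of classifiers and re-ranking with the temporal context, can be sketched with a simple late-fusion scheme. The weights, window size and blending factor below are illustrative placeholders, not values from the actual system: per-descriptor classifier scores are combined by a weighted average, and each shot's fused score is then blended with the mean score of its temporal neighborhood, exploiting the fact that a concept present in one shot is often present in adjacent shots.

```python
import numpy as np

def late_fusion(score_lists, weights):
    """Weighted average of per-classifier score arrays (late fusion)."""
    score_lists = np.asarray(score_lists, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()          # normalize the fusion weights
    return np.tensordot(weights, score_lists, axes=1)

def temporal_rerank(scores, window=3, alpha=0.5):
    """Blend each shot's score with the mean over its temporal neighborhood."""
    kernel = np.ones(window) / window
    smoothed = np.convolve(scores, kernel, mode="same")
    return alpha * scores + (1 - alpha) * smoothed

# Hypothetical per-shot scores from two descriptor-specific classifiers
color_scores = np.array([0.1, 0.9, 0.2, 0.8, 0.85])
sift_scores  = np.array([0.2, 0.7, 0.3, 0.9, 0.80])

fused = late_fusion([color_scores, sift_scores], weights=[0.4, 0.6])
reranked = temporal_rerank(fused)
```

In a real system the fusion weights would typically be learned on a validation set rather than fixed by hand.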
Person Identification in Video Documents
People are among the main elements of interest for users searching in video collections. Therefore, indexing who is appearing and who is speaking, or who is mentioned either in the speech or in the image track, is a major goal for content-based video indexing. Although the problem is similar to general concept indexing, quite specific techniques can be used to obtain maximal performance in this important practical case. In this domain, we focused on:
- written name extraction by developing and improving overlaid text recognition techniques;
- unsupervised naming of persons using written names, pronounced names or both;
- multimodal fusion for person identification in video documents.
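The unsupervised naming and multimodal fusion steps above can be sketched as a simple evidence-combination scheme. All names, cluster identifiers and weights below are hypothetical, and the report does not describe the actual fusion model: each person cluster (e.g. a face track) accumulates weighted co-occurrence evidence from overlaid (written) names and pronounced (spoken) names, and is assigned the best-scoring name.

```python
from collections import defaultdict

def name_persons(written_evidence, spoken_evidence, w_written=0.7, w_spoken=0.3):
    """Assign a name to each person cluster by fusing two evidence sources.

    Each source maps cluster_id -> {candidate name: co-occurrence count}.
    Written (overlaid) names are usually more reliable than pronounced
    names, hence the higher default weight (an illustrative choice).
    """
    scores = defaultdict(lambda: defaultdict(float))
    for source, w in ((written_evidence, w_written), (spoken_evidence, w_spoken)):
        for cluster, counts in source.items():
            total = sum(counts.values())
            for name, count in counts.items():
                scores[cluster][name] += w * count / total
    # Pick the highest-scoring candidate name for every cluster
    return {cluster: max(names, key=names.get) for cluster, names in scores.items()}

# Hypothetical co-occurrence counts between clusters and candidate names
written = {"face_0": {"Alice Martin": 3, "Bob Durand": 1}}
spoken  = {"face_0": {"Bob Durand": 2}, "face_1": {"Alice Martin": 1}}
assignment = name_persons(written, spoken)
```

Here the written-name evidence outweighs the spoken-name evidence for `face_0`, and `face_1`, seen only in the speech track, still receives a name, illustrating why combining modalities improves coverage.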
We evaluated these approaches in the context of the REPERE national evaluation campaign, where we were often ranked first among the three participants. This work led to one publication in a national journal and several papers in national and international conferences. A tool for overlaid text extraction has been made publicly available.