This paper presents the results of the UPC-UB-STP team in the 2015 MediaEval Retrieving Diverse Images Task. The goal of the challenge is to provide a ranked list of Flickr photos for a predefined set of queries. Our approach first generates a ranking of images based on a query-independent estimation of their relevance. Only the top results are kept and iteratively re-ranked based on their intra-similarity to introduce diversity.
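The relevance-then-diversity strategy described above can be sketched as a greedy re-ranking: keep the top relevant results, then repeatedly pick the candidate least similar to what has already been selected. The function names and the toy similarity measure are illustrative, not the team's actual implementation.

```python
def rerank_for_diversity(ranked, similarity, top_k=50):
    """Greedily re-rank the top-k relevant items so that each picked
    item is maximally dissimilar to the items already selected."""
    candidates = list(ranked[:top_k])   # keep only the top results
    diverse = [candidates.pop(0)]       # seed with the most relevant item
    while candidates:
        # pick the candidate least similar to anything already selected
        best = min(candidates,
                   key=lambda c: max(similarity(c, d) for d in diverse))
        diverse.append(best)
        candidates.remove(best)
    return diverse

# toy similarity: items close in value are "similar"
sim = lambda a, b: 1.0 / (1.0 + abs(a - b))
reranked = rerank_for_diversity([1, 2, 10, 11, 20], sim)
```

This is essentially a maximal-marginal-relevance-style selection: relevance fixes the seed and the order of consideration, and dissimilarity drives the rest of the ranking.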
This paper extends our previous work on the potential of EEG-based brain-computer interfaces to segment salient objects in images. The proposed system analyzes the Event-Related Potentials (ERP) generated by the rapid serial visual presentation of windows on the image. The detection of the P300 signal allows estimating a saliency map of the image, which is used to seed a semi-supervised object segmentation algorithm. Thanks to the new contributions presented in this work, the average Jaccard index improved from 0.47 to 0.66 on our publicly available dataset of images, object masks and captured EEG signals. This work also studies alternative architectures to the original one, the impact of object occupation in each image window, and a more robust evaluation based on statistical analysis and a weighted F-score.
Bolaños, M.; Mestre, R.; Talavera, E.; Giro, X.; Radeva, P. International Workshop on Wearable and Ego-vision Systems for Augmented Experience DOI: 10.1109/ICMEW.2015.7169863 Presentation's date: 2015-07-03 Presentation of work at congresses
Building a visual summary from an egocentric photostream captured by a lifelogging wearable camera is of high interest for different applications (e.g. memory reinforcement). In this paper, we propose a new summarization method based on keyframe selection that uses visual features extracted by means of a convolutional neural network. Our method applies unsupervised clustering to divide the photostream into events, and finally extracts the most relevant keyframe for each event. We assess the results through a blind taste test in which a group of 20 people rated the quality of the summaries.
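The pipeline above (features, event clustering, keyframe selection) can be sketched as follows. The toy 1-D values stand in for CNN feature vectors, and the gap threshold and centroid-distance keyframe rule are illustrative assumptions, not the method's actual parameters.

```python
# Sketch: segment a photostream into events and pick one keyframe per event.

def segment_events(features, gap=5.0):
    """Start a new event whenever consecutive features differ by more than `gap`."""
    events, current = [], [0]
    for i in range(1, len(features)):
        if abs(features[i] - features[i - 1]) > gap:
            events.append(current)
            current = []
        current.append(i)
    events.append(current)
    return events

def pick_keyframe(features, event):
    """Keyframe = frame closest to the event's mean feature (its 'centroid')."""
    centroid = sum(features[i] for i in event) / len(event)
    return min(event, key=lambda i: abs(features[i] - centroid))

feats = [1.0, 1.2, 0.9, 20.0, 21.0, 19.5, 40.0]
events = segment_events(feats)
summary = [pick_keyframe(feats, ev) for ev in events]
```

With real CNN descriptors, the absolute difference would be replaced by a distance in feature space, but the structure of the computation is the same.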
Roldan-Carlos, J.; Lux, M.; Giro, X.; Muñoz, P.; Anagnostopoulos, N. International Workshop on Content-Based Multimedia Indexing p. 1-4 DOI: 10.1109/CBMI.2015.7153622 Presentation's date: 2015-06-12 Presentation of work at congresses
With the advent of affordable multimedia smartphones, it has become common for people to take videos when they are at events. The larger the event, the larger the number of videos taken there, and the more videos get shared online. Searching in this mass of videos is a challenging topic. In this paper we present and discuss a prototype software for searching in such videos. We focus only on visual information, and we report on experiments based on a research data set. With a small study we show that our prototype demonstrates promising results by identifying the same scene in different videos taken from different angles, solely based on content-based image retrieval.
Roldan-Carlos, J.; Lux, M.; Giro, X.; Muñoz, P.; Anagnostopoulos, N. International Workshop on Content-Based Multimedia Indexing p. 1-6 DOI: 10.1109/CBMI.2015.7153618 Presentation's date: 2015-06-12 Presentation of work at congresses
In endoscopic procedures, surgeons work with live video streams from the inside of their subjects. A main source of documentation for these procedures is still frames from the video, identified and captured during the surgery. However, with growing demands and technical means, the streams are saved to storage servers and the surgeons need to retrieve parts of the videos on demand. In this submission we present a demo application for video retrieval based on visual features and late fusion, which lets surgeons re-find shots taken during the procedure.
Metric Access Methods (MAMs) are indexing techniques which allow working in generic metric spaces. Therefore, MAMs are especially useful for Content-Based Image Retrieval systems based on features which use non-Lp norms as similarity measures. MAMs naturally allow the design of image browsers due to their inherent hierarchical structure. The Hierarchical Cellular Tree (HCT), a MAM-based indexing technique, provides the starting point of our work. In this paper, we describe some limitations detected in the original formulation of the HCT and propose some modifications to both the index building and the search algorithm. First, the covering radius, which is defined as the distance from the representative to the furthest element in a node, may not cover all the elements belonging to the node's subtree. Therefore, we propose to redefine the covering radius as the distance from the representative to the furthest element in the node's subtree. This new definition is essential to guarantee a correct construction of the HCT. Second, the proposed Progressive Query retrieval scheme can be redesigned to perform the nearest neighbor operation in a more efficient way. We propose a new retrieval scheme which takes advantage of the benefits of the search algorithm used in the index building. Furthermore, while the evaluation of the HCT in the original work was only subjective, we propose an objective evaluation based on two aspects which are crucial in any approximate search algorithm: retrieval time and retrieval accuracy. Finally, we illustrate the usefulness of the proposal by presenting some actual applications.
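The redefinition of the covering radius can be illustrated on a toy tree: the original definition only considers the elements stored in the node itself, while the corrected one recurses over the whole subtree. The node layout and scalar distances below are a deliberate simplification of the actual HCT structure.

```python
# Simplified node: a representative, the elements stored at this level,
# and child nodes (the subtree). Distances are plain absolute differences.

class Node:
    def __init__(self, representative, elements, children=()):
        self.representative = representative
        self.elements = elements
        self.children = list(children)

def dist(a, b):
    return abs(a - b)

def radius_node_only(node):
    """Original HCT definition: furthest element *in the node itself*."""
    return max(dist(node.representative, e) for e in node.elements)

def subtree_elements(node):
    out = list(node.elements)
    for child in node.children:
        out.extend(subtree_elements(child))
    return out

def radius_subtree(node):
    """Proposed definition: furthest element in the *whole subtree*."""
    return max(dist(node.representative, e) for e in subtree_elements(node))

leaf = Node(representative=12, elements=[12, 15])
root = Node(representative=10, elements=[10, 11], children=[leaf])
```

On this example the node-only radius of the root is 1, while the subtree radius is 5: an element of the leaf falls outside the original covering radius, which is exactly the inconsistency the redefinition removes.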
Carlier, A.; Charvillat, V.; Salvador, A.; Giro, X.; Marques, O. ACM international workshop on Crowdsourcing for multimedia p. 53-56 DOI: 10.1145/2660114.2660125 Presentation's date: 2014-11-07 Presentation of work at congresses
This paper introduces Click'n'Cut, a novel web tool for interactive object segmentation designed for crowdsourcing tasks. Click'n'Cut combines bounding boxes and clicks generated by workers to obtain accurate object segmentations. These segmentations are created by combining precomputed object candidates in a light computational fashion that allows an immediate response from the interface. Click'n'Cut has been tested with a crowdsourcing campaign to annotate images from publicly available datasets. Results are competitive with state-of-the-art approaches, especially in terms of the time needed to converge to a high-quality segmentation.
Mohedano, E.; Healy, G.; McGuinness, K.; Giro, X.; O'Connor, N.; Smeaton, A. ACM International Conference on Multimedia p. 417-426 DOI: 10.1145/2647868.2654896 Presentation's date: 2014-11-06 Presentation of work at congresses
This paper explores the potential of brain-computer interfaces in segmenting objects from images. Our approach is centered around designing an effective method for displaying the image parts to the users such that they generate measurable brain reactions. When an image region, specifically a block of pixels, is displayed we estimate the probability of the block containing the object of interest using a score based on EEG activity. After several such blocks are displayed, the resulting probability map is binarized and combined with the GrabCut algorithm to segment the image into object and background regions. This study shows that BCI and simple EEG analysis are useful in locating object boundaries in images.
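The step that turns per-block EEG scores into a segmentation seed can be sketched as a simple thresholding that produces a GrabCut-style trimap. The thresholds and labels below are illustrative assumptions, not the values used in the paper.

```python
# Turn a grid of per-block "object" probabilities (from EEG scores) into a
# trimap-like seed: confident foreground, confident background, and unknown.

FG, BG, UNKNOWN = "fg", "bg", "?"

def binarize_probability_map(prob_map, fg_thresh=0.7, bg_thresh=0.3):
    """Blocks above fg_thresh seed the object, below bg_thresh the background;
    everything in between is left for the segmentation algorithm to decide."""
    return [[FG if p >= fg_thresh else BG if p <= bg_thresh else UNKNOWN
             for p in row]
            for row in prob_map]

prob_map = [
    [0.1, 0.2, 0.1],
    [0.2, 0.9, 0.5],
    [0.1, 0.8, 0.2],
]
seed = binarize_probability_map(prob_map)
```

In the full system, the confident labels would be passed as hard constraints to GrabCut, which then resolves the uncertain blocks using color statistics.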
This document presents the contribution of the UPC team to the Social Event Detection (SED) Subtask 1 in MediaEval 2014. This contribution extends the solution tested in the previous year with a better optimization of the parameters that determine the clustering algorithm, and by introducing an additional pass that considers the merges of all pairs of mini-clusters generated during the first two passes. Our proposal also addresses the problem of incomplete metadata by generating additional textual tags based on geolocation and natural language processing techniques.
This paper presents a graphical environment for the annotation of still images that works both at the global and local scales. At the global scale, each image can be tagged with positive, negative and neutral labels referred to a semantic class from an ontology. These annotations can be used to train and evaluate an image classifier. A finer annotation at a local scale is also available for interactive segmentation of objects. This process is formulated as a selection of regions from a precomputed hierarchical partition called Binary Partition Tree. Three different semi-supervised methods have been presented and evaluated: bounding boxes, scribbles and hierarchical navigation. The implemented Java source code is published under a free software license.
The popularisation of photo storage on the cloud has opened new opportunities and challenges for the organisation and extension of photo collections. This paper presents a light computational solution for clustering web photos based on social events. The proposal first over-segments each user's photo collection based on temporal cues, as previously proposed in PhotoTOC. In a second stage, the resulting mini-clusters are merged based on contextual metadata such as geolocation, keywords and user IDs. Results indicate that, although temporal cues are very relevant for event clustering, robust solutions should also consider all these additional features.
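The two-stage scheme (temporal over-segmentation followed by metadata-driven merging) can be sketched as follows. The merge criterion below, any shared tag between consecutive mini-clusters, is an illustrative simplification of the actual combination of geolocation, keywords and user IDs.

```python
# Stage 1: over-segment photos by time gaps; Stage 2: merge mini-clusters
# that share contextual metadata (here: at least one common tag).

def oversegment(photos, max_gap=3600):
    """Split a time-ordered photo list wherever the gap exceeds max_gap seconds."""
    clusters, current = [], [photos[0]]
    for p in photos[1:]:
        if p["time"] - current[-1]["time"] > max_gap:
            clusters.append(current)
            current = []
        current.append(p)
    clusters.append(current)
    return clusters

def merge_by_metadata(clusters):
    """Merge consecutive mini-clusters that share at least one tag."""
    merged = [clusters[0]]
    for c in clusters[1:]:
        tags_prev = set().union(*(p["tags"] for p in merged[-1]))
        tags_curr = set().union(*(p["tags"] for p in c))
        if tags_prev & tags_curr:
            merged[-1] = merged[-1] + c
        else:
            merged.append(c)
    return merged

photos = [
    {"time": 0,      "tags": {"wedding"}},
    {"time": 60,     "tags": {"wedding", "cake"}},
    {"time": 50000,  "tags": {"cake"}},    # far in time, but same event
    {"time": 200000, "tags": {"hiking"}},  # a different event
]
events = merge_by_metadata(oversegment(photos))
```

The example shows why the second stage matters: the third photo is separated from the first two by the temporal over-segmentation, but the shared tag merges it back into the same event.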
Salvador, A.; Carlier, A.; Giro, X.; Marques, O.; Charvillat, V. ACM international workshop on Crowdsourcing for multimedia p. 15-20 DOI: 10.1145/2506364.2506367 Presentation's date: 2013-10-22 Presentation of work at congresses
We introduce a new algorithm for image segmentation based on crowdsourcing through a game: Ask'nSeek. The game provides information on the objects of an image, in the form of clicks that are either on the object or on the background. These logs are then used to determine the best segmentation of an object among a set of candidates generated by the state-of-the-art CPMC algorithm. We also introduce a simulator that allows the generation of game logs and therefore gives insight into the number of games needed per image to obtain an acceptable segmentation.
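The candidate selection step can be sketched as scoring each candidate mask against the game logs: on-object clicks should fall inside the mask and background clicks outside it. The scoring rule below is an illustrative assumption; the actual system ranks CPMC candidates.

```python
# Score candidate masks (sets of pixel coordinates) against game clicks:
# +1 for every on-object click inside the mask and every background click
# outside it; the best-scoring candidate is chosen as the segmentation.

def score(mask, object_clicks, background_clicks):
    hits = sum(1 for c in object_clicks if c in mask)
    hits += sum(1 for c in background_clicks if c not in mask)
    return hits

def best_candidate(candidates, object_clicks, background_clicks):
    return max(candidates,
               key=lambda m: score(m, object_clicks, background_clicks))

cand_a = {(1, 1), (1, 2), (2, 1), (2, 2)}           # small square
cand_b = {(1, 1), (1, 2), (2, 1), (2, 2), (3, 3)}   # square plus a stray pixel
obj_clicks = [(1, 1), (2, 2)]
bg_clicks = [(3, 3), (0, 0)]
winner = best_candidate([cand_a, cand_b], obj_clicks, bg_clicks)
```

Here the stray pixel in the second candidate swallows a background click, so the clean square wins; more games supply more clicks and make this ranking more reliable.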
These working notes present the contribution of the UPC team to the Social Event Detection (SED) task in MediaEval 2013. The proposal extends the previous PhotoTOC work in the domain of shared collections of photographs stored in cloud services. An initial over-segmentation of the photo collection is later refined by merging pairs of similar clusters.
These working notes present the contribution of the UPC team to the Hyperlinking sub-task of the Search and Hyperlinking Task in MediaEval 2013. Our contribution explores the potential of a solution based only on visual cues. In particular, every automatically generated shot is represented by a keyframe. The linking between video segments is based on the visual similarity of the keyframes they contain. Visual similarity is assessed with the intersection of bag-of-features histograms generated with the SURF descriptor.
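The similarity measure used for linking can be sketched as a plain histogram intersection: the sum of element-wise minima of two bag-of-features histograms. The toy 4-bin histograms below stand in for real SURF-based bag-of-features vectors.

```python
# Histogram intersection: sum of the element-wise minima of two
# (normalised) bag-of-features histograms. Higher = more similar.

def histogram_intersection(h1, h2):
    return sum(min(a, b) for a, b in zip(h1, h2))

# Toy 4-bin histograms standing in for SURF bag-of-features vectors.
shot_a = [0.5, 0.3, 0.1, 0.1]
shot_b = [0.4, 0.4, 0.1, 0.1]
shot_c = [0.0, 0.1, 0.3, 0.6]

sim_ab = histogram_intersection(shot_a, shot_b)
sim_ac = histogram_intersection(shot_a, shot_c)
```

Shots whose keyframes share visual words score high and are linked; for L1-normalised histograms the score ranges from 0 (disjoint) to 1 (identical).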
Ventura, C.; Giro, X.; Vilaplana, V.; Farre, D.; Carasusan, E. International Workshop on Content-Based Multimedia Indexing p. 29-34 DOI: 10.1109/CBMI.2013.6576548 Presentation's date: 2013-06 Presentation of work at congresses
This paper addresses the problem of video summarization through an automatic selection of a single representative keyframe. The proposed solution is based on the mutual reinforcement paradigm, where a keyframe is selected thanks to its highest and most frequent similarity to the rest of considered frames. Two variations of the algorithm are explored: a first one where only frames within the same video are used (intra-clip mode) and a second one where the decision also depends on the previously selected keyframes of related videos (inter-clip mode). These two algorithms were evaluated by a set of professional documentalists from a broadcaster's archive, and results concluded that the proposed techniques outperform the semi-manual solution adopted so far in the company.
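The intra-clip mode can be sketched as a power iteration on the frame-to-frame similarity matrix, selecting the frame with the highest stationary score. This is a simplified reading of the mutual reinforcement paradigm, and the similarity matrix below is a toy example.

```python
# Mutual reinforcement sketch: iterate scores s <- S @ s on the similarity
# matrix S until convergence; the frame with the top score is the keyframe.

def keyframe_by_mutual_reinforcement(S, iters=50):
    n = len(S)
    s = [1.0 / n] * n
    for _ in range(iters):
        s = [sum(S[i][j] * s[j] for j in range(n)) for i in range(n)]
        norm = sum(s)
        s = [x / norm for x in s]   # keep scores bounded between iterations
    return max(range(n), key=lambda i: s[i])

# Toy similarity matrix: frame 1 is highly similar to both frames 0 and 2.
S = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.8],
    [0.1, 0.8, 1.0],
]
best = keyframe_by_mutual_reinforcement(S)
```

A frame accumulates a high score when it is strongly and frequently similar to the other frames, which matches the selection criterion described in the abstract; the inter-clip mode would extend S with keyframes from related videos.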
This thesis addresses the problem of visual object retrieval, where a user formulates a query to an image database by providing one or multiple examples of an object of interest. The presented techniques aim both at finding those images in the database that contain the object as well as locating the object in the image and segmenting it from the background.
Every considered image, both the ones used as queries and the ones contained in the target database, is represented as a Binary Partition Tree (BPT), the hierarchy of regions previously proposed by Salembier and Garrido (2000). This data structure offers multiple opportunities and challenges when applied to the object retrieval problem.
A first application of BPTs appears during the formulation of the query, when the user must interactively segment the query object from the background. Firstly, the BPT can assist in adjusting an initial marker, such as a scribble or bounding box, to the object contours. Secondly, the BPT can also define a navigation path for the user to adjust an initial selection to the appropriate spatial scale.
The hierarchical structure of the BPT is also exploited to extract a new type of visual words named Hierarchical Bag of Regions (HBoR). Each region defined in the BPT is described with a feature vector that combines a soft quantization on a visual codebook with an efficient bottom-up computation through the BPT. These descriptors allow the definition of a novel feature space, the Parts Space, where each object is located according to the parts that compose it.
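The efficient bottom-up computation can be sketched as follows: once each leaf region has a soft-assignment histogram over the visual codebook, every parent's histogram is simply the element-wise sum of its children's, so the whole tree is described in one pass. The tree and histogram values below are toy examples.

```python
# Bottom-up descriptor propagation on a Binary Partition Tree: a parent
# region's bag-of-words histogram is the element-wise sum of its children's.

def propagate(tree, leaf_hists):
    """tree maps parent -> (left_child, right_child); leaves carry histograms."""
    hists = dict(leaf_hists)

    def hist(node):
        if node not in hists:
            left, right = tree[node]
            hists[node] = [a + b for a, b in zip(hist(left), hist(right))]
        return hists[node]

    for parent in tree:
        hist(parent)
    return hists

# Toy BPT: regions A and B merge into AB; AB and C merge into the root.
tree = {"AB": ("A", "B"), "root": ("AB", "C")}
leaves = {"A": [1, 0, 2], "B": [0, 3, 1], "C": [2, 2, 0]}
hists = propagate(tree, leaves)
```

Because each internal descriptor is computed from its two children rather than from raw pixels, the cost of describing all regions in the hierarchy stays linear in the number of nodes.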
HBoR descriptors have been applied to two scenarios for object retrieval, both of them solved by considering the decomposition of the objects in parts. In the first scenario, the query is formulated with a single object exemplar which is to be matched with each BPT in the target database. The matching problem is solved in two stages: an initial top-down one that assumes that the hierarchy from the query is respected in the target BPT, and a second bottom-up one that relaxes this condition and considers region merges which are not in the target BPT.
The second scenario where HBoR descriptors are applied considers a query composed of several visual objects. In this case, the provided exemplars are considered as a training set to build a model of the query concept. This model is composed of two levels, a first one where each part is modelled and detected separately, and a second one that characterises the combinations of parts that describe the complete object. The analysis process exploits the hierarchical nature of the BPT by using a novel classifier that drives an efficient top-down analysis of the target BPTs.
Ventura, C.; Martos, M.; Giro, X.; Vilaplana, V.; Marques, F. International Conference on MultiMedia Modeling p. 652-654 DOI: 10.1007/978-3-642-27355-1_67 Presentation's date: 2012-01-06 Presentation of work at congresses
Butko, T.; Canton-Ferrer, C.; Segura, C.; Giro, X.; Nadeu, C.; Hernando, J.; Casas, J. Eurasip journal on advances in signal processing Vol. 2011, p. 1-11 DOI: 10.1155/2011/485738 Date of publication: 2011-03-15 Journal article
Carcel, E.; Martos, M.; Giro, X.; Marques, F. International Workshop on Computational Intelligence for Multimedia Understanding p. 172-182 DOI: 10.1007/978-3-642-32436-9_15 Presentation's date: 2011 Presentation of work at congresses
Giro, X.; Ventura, C.; Pont, J.; Cortés, S.; Marques, F. ACM International Conference On Image And Video Retrieval p. 358-365 DOI: 10.1145/1816041.1816093 Presentation's date: 2010-07-06 Presentation of work at congresses
This paper presents the system architecture of a Content-Based Image Retrieval system implemented as a web service. The proposed solution is composed of two parts: a client running a graphical user interface for query formulation, and a server where the search engine explores an image repository. The separation of the user interface and the search engine follows a Software as a Service (SaaS) model, a type of cloud computing design where a single core system is online and available to authorized clients. The proposed architecture follows the REST software architectural style and the HTTP protocol for communications, two solutions that, combined with metadata coded in RDF, make the proposed system ready for its integration in the semantic web. User queries are formulated by visual examples through a graphical interface, and content is remotely accessed also through HTTP communication. The visual descriptors and similarity measures implemented in this work are mostly defined in the MPEG-7 standard, while textual metadata is coded according to the Dublin Core specifications.
It is not easy to find what we are looking for. Automated search systems are machines that must interpret the information entered by the user in order to execute a query and retrieve the desired information. The role of this "translator" is essential in the process; therefore, a good graphical user interface (GUI) is decisive for the success of the search. We present GOS (Graphic Object Searcher), an application that uses a visual search system to retrieve images similar to another image used as an example in the query. The audiovisual industry is especially interested in the development of this kind of content management tool, which should facilitate its daily work.
This paper describes the integration of two new services aimed at assisting in the retrieval of video content from a Multimedia Asset Manager (MAM). The first tool suggests tags after an initial textual query, and the second ranks the keyframes of the retrieved assets according to their visual similarity. Both applications were implemented as web services that are accessed from a Rich Internet Application via REST calls.