Understanding humans has always been a central goal of computer vision. Early work focused on simple objectives, such as detecting the position of people in images. As research progressed, far more complex tasks were undertaken. Human detection, for instance, gave way to 2D and 3D pose estimation, where the task is to localize the different body parts (head, torso, knees, arms, and so on) in the image or in space. Human attributes also became a major source of interest, since they allow individuals and their properties, such as gender or age, to be recognized. Later, attention shifted to recognizing the action being performed, work that builds on earlier research on pose estimation and attribute classification. Current research tackles even higher-level questions, such as predicting the motivations behind human behavior or identifying a person's clothing size from a photograph. In this thesis we develop a hierarchy of tools that spans this whole range of problems, from low-level feature descriptors to high-level conditional random field models for fashion understanding, all aimed at improving the understanding of humans from monocular RGB images. Building these high-level models critically depends on a battery of robust and reliable low- and mid-level cues. To this end, we propose two novel low-level descriptors: one based on the theory of heat diffusion on images, and another that uses a convolutional neural network to learn discriminative representations of image patches.
We also introduce several generative low-level models to represent human pose: in particular, a discrete model based on a directed acyclic graph and a continuous model consisting of clusters of poses on a Riemannian manifold. As mid-level cues we propose two human pose estimation algorithms: one that estimates 3D pose from a noisy estimate in the image plane, and another that estimates 2D and 3D pose simultaneously. Finally, we build high-level models on top of low- and mid-level cues for understanding people from images. Specifically, we focus on two different tasks in the fashion domain: semantic segmentation of clothing, and predicting garment fit from images with metadata in order to give the user fashion advice. In summary, extracting knowledge from images containing humans requires high-level models that integrate mid- and low-level cues. In general, the critical point for obtaining reliable results is the use and understanding of strong features. The main contribution of this thesis is a variety of low-, mid- and high-level algorithms for human-centric images, which can be integrated into high-level models to better understand humans from photographs, as well as to address garment-fit problems.
Local features have become an essential tool in visual recognition. Much of the progress in computer vision over the past decade has built on simple, local representations such as SIFT or HOG. SIFT in particular shifted the paradigm in feature representation. Subsequent works have often focused on improving either computational efficiency or invariance properties.
This thesis belongs to the latter group. Invariance is a particularly relevant aspect when working with dense features, i.e. features computed at every pixel. The traditional approach to sparse matching is to rely on stable interest points, such as corners, where scale and orientation can be reliably estimated, enforcing invariance; dense features, by definition, must be computed at arbitrary image points. Dense features have been shown to outperform sparse matching techniques in many recognition problems, and form the bulk of our work.
In this thesis we present strategies to enhance low-level, local features with mid-level, global cues. We devise techniques to construct better features, and use them to handle complex ambiguities, occlusions and background changes. To deal with ambiguities, we explore the use of motion to enforce temporal consistency with optical flow priors. We also introduce a novel technique to exploit segmentation cues, and use it to extract features invariant to background variability. For this, we downplay image measurements most likely to belong to a region different from that where the descriptor is computed. In both cases we follow the same strategy: we incorporate mid-level, "big picture" information into the construction of local features, and proceed to use them in the same manner as we would the baseline features.
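The background-suppression idea can be sketched as follows: weight each local measurement by how likely its pixel is to share the region of the descriptor's center, as given by a soft segmentation mask, before pooling into a histogram. The function below is a toy illustration under that assumption, with hypothetical names and a plain orientation histogram standing in for SIFT; it is not the thesis implementation.

```python
import numpy as np

def masked_descriptor(patch_grad_mag, patch_grad_ori, soft_mask, n_bins=8):
    """Toy background-suppressed orientation histogram.

    patch_grad_mag / patch_grad_ori: gradient magnitude and orientation
    (radians) over a patch. soft_mask: per-pixel probability of belonging
    to the same region as the patch center, e.g. from a soft segmentation.
    """
    # Downplay measurements unlikely to share the center's region:
    # weight each gradient sample by its mask affinity with the center.
    center = soft_mask[soft_mask.shape[0] // 2, soft_mask.shape[1] // 2]
    affinity = 1.0 - np.abs(soft_mask - center)
    weights = patch_grad_mag * affinity
    # Accumulate a plain orientation histogram from the weighted samples.
    bins = (patch_grad_ori / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.bincount(bins.ravel(), weights=weights.ravel(),
                       minlength=n_bins)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist
```

With a binary mask, gradients on the "wrong" side of a segmentation boundary are zeroed out entirely, so the descriptor ignores background changes; with soft masks the suppression is graded.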
We apply these techniques to different feature representations, including SIFT and HOG, and use them to address canonical vision problems such as stereo and object detection, demonstrating that the introduction of global cues yields consistent improvements. We prioritize solutions that are simple, general, and efficient.
Our main contributions are as follows:
(a) An approach to dense stereo reconstruction with spatiotemporal features, which unlike existing works remains applicable to wide baselines.
(b) A technique to exploit segmentation cues to construct dense descriptors invariant to background variability, such as occlusions or background motion.
(c) A technique to integrate bottom-up segmentation with recognition efficiently, amenable to sliding window detectors.
Serradell, E.; Amável, M.; Sznitman, R.; Kybic, J.; Moreno-Noguer, F.; Fua, P. IEEE transactions on pattern analysis and machine intelligence Vol. 37, num. 3, p. 625-638 DOI: 10.1109/TPAMI.2014.2343235 Date of publication: 2015 Journal article
We present a new approach for matching sets of branching curvilinear structures that form graphs embedded in R^2 or R^3 and may be subject to deformations. Unlike earlier methods, ours does not rely on local appearance similarity, nor does it require a good initial alignment. Furthermore, it can cope with non-linear deformations, topological differences, and partial graphs. To handle arbitrary non-linear deformations, we use Gaussian Processes to represent the geometrical mapping relating the two graphs. In the absence of appearance information, we iteratively establish correspondences between points, update the mapping accordingly, and use it to estimate where to find the most likely correspondences that will be used in the next step. To make the computation tractable for large graphs, the set of new potential matches considered at each iteration is not selected at random, as in many RANSAC-based algorithms. Instead, we introduce a so-called Active Testing Search strategy that performs a priority search to favor the most likely matches and speed up the process. We demonstrate the effectiveness of our approach first on synthetic cases and then on angiography data, retinal fundus images, and microscopy image stacks acquired at very different resolutions.
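The priority-search idea can be illustrated with a toy best-first matcher: partial correspondence hypotheses sit in a heap keyed by a geometric consistency score, and the most promising hypothesis is always expanded first. The sketch below is our simplification (the paper additionally maintains a Gaussian Process mapping, omitted here); it scores a hypothesis by the spread of the translations it implies, so it only handles near-rigid displacements.

```python
import heapq
import numpy as np

def best_first_match(pts_a, pts_b, max_expansions=1000):
    """Best-first search over point correspondences (toy Active-Testing-
    style priority search). Returns, for each point of pts_a, the index
    of its match in pts_b, or None if the search budget is exhausted."""
    n = len(pts_a)
    heap = [(0.0, ())]  # (consistency cost, tuple of matched indices)
    expansions = 0
    while heap and expansions < max_expansions:
        cost, assigned = heapq.heappop(heap)
        k = len(assigned)
        if k == n:
            return list(assigned)
        expansions += 1
        for j in range(len(pts_b)):
            if j in assigned:
                continue
            new = assigned + (j,)
            # Cost: variance of the translations implied by the matches
            # so far; a consistent rigid shift gives (near-)zero cost.
            t = pts_b[list(new)] - pts_a[:k + 1]
            heapq.heappush(heap, (float(np.var(t, axis=0).sum()), new))
    return None
```

Because the heap always pops the lowest-cost hypothesis, consistent matches are explored long before the exponentially many inconsistent ones, which is the essence of the priority search.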
Ramisa, A.; Alenyà, G.; Moreno-Noguer, F.; Torras, C. Engineering applications of artificial intelligence Vol. 35, p. 246-258 DOI: 10.1016/j.engappai.2014.06.025 Date of publication: 2014 Journal article
Robotic handling of textile objects in household environments is an emerging application that has recently received considerable attention thanks to the development of domestic robots. Most current approaches follow a multiple re-grasp strategy for this purpose, in which clothes are sequentially grasped from different points until one of them yields a desired configuration.
In this work we propose a vision-based method, built on the Bag of Visual Words approach, that combines appearance and 3D information to detect parts suitable for grasping in clothes, even when they are highly wrinkled.
We also contribute a new, annotated, garment part dataset that can be used for benchmarking classification, part detection, and segmentation algorithms. The dataset is used to evaluate our approach and several state-of-the-art 3D descriptors for the task of garment part detection. Results indicate that appearance is a reliable source of information, but that augmenting it with 3D information can help the method perform better with new clothing items.
We introduce LETHA (Learning on Easy data, Test on Hard), a new learning paradigm consisting of building strong priors from high quality training data, and combining them with discriminative machine learning to deal with low-quality test data. Our main contribution is an implementation of that concept for pose estimation. We first automatically build a 3D model of the object of interest from high-definition images, and devise from it a pose-indexed feature extraction scheme. We then train a single classifier to process these feature vectors. Given a low quality test image, we visit many hypothetical poses, extract features consistently and evaluate the response of the classifier. Since this process uses locations recorded during learning, it does not require matching points anymore. We use a boosting procedure to train this classifier common to all poses, which is able to deal with missing features, due in this context to self-occlusion. Our results demonstrate that the method combines the strengths of global image representations, discriminative even for very tiny images, and the robustness to occlusions of approaches based on local feature point descriptors.
We present a robust and efficient method for estimating the pose of a camera. The proposed method assumes prior knowledge of a 3D model of the environment, and compares a new input image only against a small set of similar images previously selected by a Bag of Visual Words algorithm. This avoids the high computational cost of matching the 2D points of the input image against all the 3D points of a complex model, which in our case contains more than 100,000 points. The pose is then estimated from these 2D-to-3D correspondences using a novel PnP algorithm that performs outlier rejection without resorting to RANSAC, and which is 10 to 100 times faster than methods that rely on it.
We present a novel approach for learning a finite mixture model on a Riemannian manifold, where Euclidean metrics are not applicable and one needs to resort to geodesic distances consistent with the manifold geometry. For this purpose, we draw inspiration from a variant of the expectation-maximization algorithm that uses a minimum message length criterion to automatically estimate the optimal number of components from multivariate data lying in a Euclidean space. In order to use this approach on Riemannian manifolds, we propose a formulation in which each component is defined on a different tangent space, thus avoiding the loss of accuracy produced when linearizing the manifold with a single tangent space. Our approach can be applied to any type of manifold for which it is possible to estimate its tangent space. In particular, we show results on synthetic examples of a sphere and a quadric surface, and on a large and complex dataset of human poses, where the proposed model is used as a regression tool for hypothesizing the geometry of occluded parts of the body.
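The tangent-space construction relies on the manifold's logarithm and exponential maps. For the unit sphere these have closed forms, which the sketch below uses to compute an intrinsic (Karcher) mean, the kind of per-component operation that lets each mixture component live in its own tangent space. Function names are ours; this is an illustration, not the paper's code.

```python
import numpy as np

def sphere_log(mu, x):
    """Log map on the unit sphere: lift x to the tangent space at mu."""
    c = np.clip(np.dot(mu, x), -1.0, 1.0)
    theta = np.arccos(c)
    if theta < 1e-12:
        return np.zeros_like(x)
    v = x - c * mu                      # component of x orthogonal to mu
    return theta * v / np.linalg.norm(v)

def sphere_exp(mu, v):
    """Exp map: project a tangent vector v at mu back onto the sphere."""
    t = np.linalg.norm(v)
    if t < 1e-12:
        return mu.copy()
    return np.cos(t) * mu + np.sin(t) * v / t

def karcher_mean(points, iters=20):
    """Intrinsic mean: average in the tangent space, map back, repeat."""
    mu = points[0] / np.linalg.norm(points[0])
    for _ in range(iters):
        v = np.mean([sphere_log(mu, p) for p in points], axis=0)
        mu = sphere_exp(mu, v)
    return mu
```

Linearizing around each component's own mean in this way keeps the log-map distortion small locally, which is exactly what a single global tangent space cannot guarantee.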
Villamizar, M.A.; Sanfeliu, A.; Moreno-Noguer, F. IEEE International Conference on Robotics and Automation p. 4996-5003 DOI: 10.1109/ICRA.2014.6907591 Presentation's date: 2014 Presentation of work at congresses
We present a method for efficiently detecting natural landmarks that can handle scenes with highly repetitive patterns and targets whose appearance changes progressively. At the core of our approach lies a Random Ferns classifier that models the posterior probabilities of different views of the target using multiple independent Ferns, each containing features at particular positions of the target. A Shannon entropy measure is used to pick the most informative locations for these features, minimizing the number of Ferns while maximizing their discriminative power, and thus allowing robust detection at low computational cost. In addition, after an offline initialization, new incoming detections are used to update the posterior probabilities on the fly and adapt to appearance changes caused by shadows or occluding objects. All these virtues make the proposed detector appropriate for UAV navigation. Besides synthetic experiments that demonstrate the theoretical benefits of our formulation, we show applications for detecting landing areas in regions with highly repetitive patterns, and specific objects in the presence of cast shadows or sudden camera motions.
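A minimal sketch of the Random Ferns machinery may help: each fern turns a handful of pairwise pixel comparisons into a leaf index, class-conditional leaf histograms are learned with Laplace smoothing, and ferns combine as a semi-naive Bayes classifier by summing log-posteriors. The entropy-driven feature placement and online updates described above are omitted, and all names are hypothetical.

```python
import numpy as np

def fern_index(patch, pairs):
    """Evaluate one fern: each (p, q) pixel pair yields one binary test."""
    bits = [1 if patch[p] > patch[q] else 0 for p, q in pairs]
    return int("".join(map(str, bits)), 2)

def train_ferns(patches, labels, pairs_per_fern, n_classes):
    """Per-fern class-conditional leaf histograms, Laplace-smoothed."""
    ferns = []
    for pairs in pairs_per_fern:
        counts = np.ones((n_classes, 2 ** len(pairs)))  # Laplace prior
        for patch, y in zip(patches, labels):
            counts[y, fern_index(patch, pairs)] += 1
        ferns.append((pairs, counts / counts.sum(axis=1, keepdims=True)))
    return ferns

def classify(patch, ferns, n_classes):
    """Ferns are treated as independent: their log-posteriors add up."""
    logp = np.zeros(n_classes)
    for pairs, probs in ferns:
        logp += np.log(probs[:, fern_index(patch, pairs)])
    return int(np.argmax(logp))
```

The appeal of ferns over full trees is that the leaf index is computed with a fixed, tiny set of comparisons, so both evaluation and the on-the-fly histogram updates mentioned above are extremely cheap.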
Amor, A.; Ruiz, A.; Moreno-Noguer, F.; Sanfeliu, A. IEEE International Conference on Robotics and Automation p. 2595-2601 DOI: 10.1109/ICRA.2014.6907231 Presentation's date: 2014 Presentation of work at congresses
We present a real-time method for pose estimation of objects from a UAV, using visual marks placed on non-planar surfaces. It is designed to overcome the constraints of small aerial robots, such as slow CPUs, low-resolution cameras, and image deformations due to distortions introduced by the lens or by viewpoint changes during flight. The method consists of shape registration from contours extracted from an image. Instead of working with dense image patches or corresponding image features, we optimize a geometric alignment cost computed directly from the raw polygonal representations of the observed regions using efficient clipping algorithms. Moreover, instead of performing 2D image processing operations, the optimization works in the polygon representation space, allowing real-time projective matching. Deformation modes are easily included in the optimization scheme, allowing accurate registration of different markers attached to curved surfaces using a single deformable prototype.
As a result, the method achieves precise object pose estimation in real time, which is very important for interactive UAV tasks such as short-distance surveillance or bar assembly. We describe the main algorithmic components of the method and present experiments where it yields an average position error of less than 5 mm at a distance of 0.7 m, using a 19 mm x 19 mm visual mark. Finally, we compare these results with current state-of-the-art computer vision systems.
We propose a real-time and accurate solution to the Perspective-n-Point (PnP) problem (estimating the pose of a calibrated camera from n 3D-to-2D point correspondences) that exploits the fact that, in practice, not all 2D features are detected with the same accuracy. Assuming a model of these feature uncertainties is known in advance, we reformulate the PnP problem as a maximum likelihood minimization, approximated by an unconstrained Sampson error function that naturally penalizes the noisiest correspondences. The advantages of this approach are thoroughly demonstrated in synthetic experiments where feature uncertainties are exactly known.
Pre-estimating feature uncertainties in real experiments is not easy, though. In this paper we model feature uncertainty as 2D Gaussian distributions representing the sensitivity of the 2D feature detectors to different camera viewpoints. When using these noise models with our PnP formulation we still obtain promising pose estimation results that outperform the most recent approaches.
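The weighting principle can be made concrete with the maximum-likelihood objective itself: each reprojection residual is scaled by the inverse of that feature's 2D noise covariance (a Mahalanobis distance), so uncertain detections contribute less. The sketch below evaluates this cost for a given pose under a standard pinhole model with intrinsics K; the paper minimizes an unconstrained Sampson approximation of such an objective, which is not reproduced here.

```python
import numpy as np

def weighted_reproj_cost(R, t, pts3d, pts2d, covs, K):
    """Mahalanobis reprojection cost for a pose (R, t).

    pts3d: Nx3 model points; pts2d: Nx2 detections; covs: per-feature
    2x2 noise covariances; K: 3x3 camera intrinsics."""
    cost = 0.0
    for X, u, S in zip(pts3d, pts2d, covs):
        x = K @ (R @ X + t)          # project with the pinhole model
        proj = x[:2] / x[2]
        r = u - proj                 # 2D reprojection residual
        # Inverse-covariance weighting: noisy features count for less.
        cost += float(r @ np.linalg.inv(S) @ r)
    return cost
```

Note how scaling a feature's covariance by 4 divides its contribution to the cost by 4, which is exactly the behavior that makes the estimator robust to anisotropic, feature-dependent noise.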
Tsogkas, S.; Kokkinos, I.; Trulls, E.; Sanfeliu, A.; Moreno-Noguer, F. IEEE Conference on Computer Vision and Pattern Recognition p. 168-175 DOI: 10.1109/CVPR.2014.29 Presentation's date: 2014 Presentation of work at congresses
In this work we propose a technique to combine bottom-up segmentation, coming in the form of SLIC superpixels, with sliding window detectors, such as Deformable Part Models (DPMs).
The merit of our approach lies in ‘cleaning up’ the low-level HOG features by exploiting the spatial support of SLIC superpixels; this can be understood as using segmentation to split the feature variation into object-specific and background changes. Rather than committing to a single segmentation we use a large pool of SLIC superpixels and combine them in a scale-, position- and object-dependent manner to build soft segmentation masks. The segmentation masks can be computed fast enough to repeat this process over every candidate window, during training and detection, for both the root and part filters of DPMs.
We use these masks to construct enhanced, background-invariant features to train DPMs. We test our approach on the PASCAL VOC 2007, outperforming the standard DPM in 17 out of 20 classes, yielding an average increase of 1.7% AP. Additionally, we demonstrate the robustness of this approach, extending it to dense SIFT descriptors for large displacement optical flow.
We propose a real-time, accurate solution to the Perspective-n-Point (PnP) problem that is robust to outliers. The main advantages of our solution are twofold: first, it integrates outlier rejection within the pose estimation pipeline with negligible computational overhead; and second, it scales to arbitrarily large numbers of correspondences. Given a set of 3D-to-2D matches, we formulate the pose estimation problem as a low-rank homogeneous system whose solution lies in its 1D null space. Outlier correspondences are those rows of the linear system which perturb the null space, and are progressively detected by projecting them on an iteratively estimated solution of the null space. Since our outlier removal process is based on an algebraic criterion which does not require computing the full pose and reprojecting all 3D points back onto the image plane at each step, we achieve speed gains of more than 100× compared to RANSAC strategies. An extensive experimental evaluation shows that our solution yields accurate results in situations with up to 50% of outliers, and can process more than 1000 correspondences in less than 5 ms.
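The algebraic rejection loop can be sketched on a generic homogeneous system A x = 0: estimate the null vector of the current inlier rows by SVD, measure each row's perturbation of it as |a_i · x|, and drop the worst rows. The thresholds and loop structure below are our own illustrative choices, not the paper's.

```python
import numpy as np

def nullspace_outlier_rejection(A, n_rounds=10, thresh=1e-6):
    """Iteratively estimate the 1D null space of A's inlier rows.

    Returns the null vector x and the surviving row indices. Rows whose
    algebraic residual |a_i . x| is large relative to the median are
    treated as outliers and removed."""
    inliers = np.arange(A.shape[0])
    for _ in range(n_rounds):
        _, _, vt = np.linalg.svd(A[inliers])
        x = vt[-1]                    # right singular vector, smallest sv
        res = np.abs(A[inliers] @ x)  # per-row algebraic residual
        keep = res < max(thresh, 3 * np.median(res))
        if keep.all():
            break
        inliers = inliers[keep]
    return x, inliers
```

The key point, as in the abstract, is that no pose is computed and no 3D point is reprojected inside the loop; everything happens on the linear system itself.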
Simo, E.; Quattoni, A.J.; Torras, C.; Moreno-Noguer, F. IEEE Conference on Computer Vision and Pattern Recognition p. 3634-3641 DOI: 10.1109/CVPR.2013.466 Presentation's date: 2013-06-23 Presentation of work at congresses
We introduce a novel approach to automatically recover 3D human pose from a single image. Most previous work follows a pipelined approach: first, a set of 2D features such as edges, joints or silhouettes is detected in the image, and then these observations are used to infer the 3D pose. Solving these two problems separately may lead to erroneous 3D poses when the feature detector performs poorly. In this paper, we address this issue by jointly solving the 2D detection and 3D inference problems. For this purpose, we propose a Bayesian framework that integrates a generative model based on latent variables with discriminative 2D part detectors based on HOGs, and perform inference using evolutionary algorithms. Experiments on real data demonstrate competitive results, and the ability of our methodology to provide accurate 2D and 3D pose estimates even when the 2D detectors are inaccurate.
Peñate, A.; Andrade-Cetto, J.; Moreno-Noguer, F. IEEE transactions on pattern analysis and machine intelligence Vol. 35, num. 10, p. 2387-2400 DOI: 10.1109/TPAMI.2013.36 Date of publication: 2013 Journal article
We propose a novel approach for the estimation of the pose and focal length of a camera from a set of 3D-to-2D point correspondences. Our method compares favorably to competing approaches in that it is both more accurate than existing closed form solutions, as well as faster and also more accurate than iterative ones. Our approach is inspired by the EPnP algorithm, a recent O(n) solution for the calibrated case. Yet we show that considering the focal length as an additional unknown renders the linearization and relinearization techniques of the original approach no longer valid, especially with large amounts of noise. We present new methodologies to circumvent this limitation, termed exhaustive linearization and exhaustive relinearization, which perform a systematic exploration of the solution space in closed form. The method is evaluated on both real and synthetic data, and our results show that besides producing precise focal length estimation, the retrieved camera pose is almost as accurate as the one computed using the EPnP, which assumes a calibrated camera.
Recovering the 3D shape of deformable surfaces from single images is known to be a highly ambiguous problem because many different shapes may have very similar projections. This is commonly addressed by restricting the set of possible shapes to linear combinations of deformation modes and by imposing additional geometric constraints. Unfortunately, because image measurements are noisy, such constraints do not always guarantee that the correct shape will be recovered. To overcome this limitation, we introduce a stochastic sampling approach to efficiently explore the set of solutions of an objective function based on point correspondences. This allows us to propose a small set of ambiguous candidate 3D shapes and then use additional image information to choose the best one. As a proof of concept, we use either motion or shading cues to this end and show that we can handle a complex objective function without having to solve a difficult non-linear minimization problem. The advantages of our method are demonstrated on a variety of problems including both real and synthetic data.
Amável, M.; Sznitman, R.; Serradell, E.; Kybic, J.; Moreno-Noguer, F.; Fua, P. International Conference on Information Processing in Medical Imaging p. 572-583 DOI: 10.1007/978-3-642-38868-2_48 Presentation's date: 2013 Presentation of work at congresses
We present a general approach for solving the point-cloud matching problem for the case of mildly nonlinear transformations. Our method quickly finds a coarse approximation of the solution by exploring a reduced set of partial matches using an approach we refer to as Active Testing Search (ATS). We apply the method to the registration of graph structures by matching their branching points. It is based solely on the geometric position of the points; no additional information is used, nor is knowledge of an initial alignment required. In a second stage, we use dynamic programming to refine the solution. We tested our algorithm on angiography, retinal fundus, and neuronal data gathered using electron and light microscopy. We show that our method solves cases not solved by most approaches, and is faster than the remaining ones.
Simultaneously recovering the camera pose and correspondences between a set of 2D-image and 3D-model points is a difficult problem, especially when the 2D-3D matches cannot be established based on appearance only. The problem becomes even more challenging when input images are acquired with an uncalibrated camera with varying zoom, which yields strong ambiguities between translation and focal length. We present a solution to this problem using only geometrical information. Our approach owes its robustness to an initial stage in which the joint pose and focal length solution space is split into several Gaussian regions. At runtime, each of these regions is explored using a hypothesize-and-test approach, in which the potential number of 2D-3D matches is progressively reduced using informed search through Kalman updates, iteratively refining the pose and focal length parameters. The technique is exhaustive but efficient, significantly improving previous methods in terms of robustness to outliers and noise.
Kokkinos, I.; Trulls, E.; Sanfeliu, A.; Moreno-Noguer, F. IEEE Conference on Computer Vision and Pattern Recognition p. 2890-2897 DOI: 10.1109/CVPR.2013.372 Presentation's date: 2013 Presentation of work at congresses
In this work we exploit segmentation to construct appearance descriptors that can robustly deal with occlusion and background changes. For this, we downplay measurements coming from areas that are unlikely to belong to the same region as the descriptor’s center, as suggested by soft segmentation masks. Our treatment is applicable to any image point, i.e. dense, and its computational overhead is in the order of a few seconds. We integrate this idea with Dense SIFT, and also with Dense Scale and Rotation Invariant Descriptors (SID), delivering descriptors that are densely computable, invariant to scaling and rotation, and robust to background changes. We apply our approach to standard benchmarks on large displacement motion estimation using SIFT-flow and wide-baseline stereo, systematically demonstrating that the introduction of segmentation yields clear improvements.
Ramisa, A.; Alenyà, G.; Moreno-Noguer, F.; Torras, C. IEEE/RSJ International Conference on Intelligent Robots and Systems p. 824-830 DOI: 10.1109/IROS.2013.6696446 Presentation's date: 2013 Presentation of work at congresses
Most current depth sensors provide 2.5D range images in which depth values are assigned to a rectangular 2D array. In this paper we take advantage of this structured information to build an efficient shape descriptor which is about two orders of magnitude faster than competing approaches, while showing similar performance in several tasks involving deformable object recognition. Given a 2D patch surrounding a point and its associated depth values, we build the descriptor for that point based on the cumulative distances between the surface normals within the patch and a discrete set of normal directions. This processing is made very efficient using integral images, even allowing descriptors to be computed for every range image pixel in a few seconds. The discriminative power of our descriptor, dubbed FINDDD, is evaluated in three different scenarios: recognition of specific cloth wrinkles, instance recognition from geometry alone, and detection of reliable and informed grasping points.
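The integral-image trick behind this efficiency can be sketched as follows. Here, for illustration, similarity (clipped dot product) to each discrete direction replaces the paper's cumulative-distance formulation; one integral image per bin lets any rectangular patch histogram be read off with four lookups per bin:

```python
import numpy as np

def normal_bin_integrals(normals, directions):
    """Per-bin integral images for a FINDDD-style descriptor.

    normals    : (H, W, 3) unit surface normals of a range image.
    directions : (K, 3) discrete set of unit normal directions.
    Returns (H+1, W+1, K) integral images of each pixel's
    similarity to each direction.
    """
    sim = np.clip(np.einsum('hwc,kc->hwk', normals, directions), 0, None)
    ii = np.zeros((sim.shape[0] + 1, sim.shape[1] + 1, sim.shape[2]))
    ii[1:, 1:] = sim.cumsum(0).cumsum(1)
    return ii

def patch_histogram(ii, r0, c0, r1, c1):
    """Histogram over rows r0..r1-1, cols c0..c1-1 (4 lookups per bin)."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

# a flat range image: every normal points along +z
normals = np.zeros((3, 4, 3))
normals[..., 2] = 1.0
directions = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
ii = normal_bin_integrals(normals, directions)
hist = patch_histogram(ii, 0, 0, 2, 3)
```

After the cumulative sums are built once, the descriptor cost per patch is independent of the patch size, which is what makes dense per-pixel computation feasible.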
Garrell, A.; Villamizar, M.A.; Moreno-Noguer, F.; Sanfeliu, A. IEEE International Symposium on Robot and Human Interactive Communication p. 107-113 DOI: 10.1109/ROMAN.2013.6628463 Presentation's date: 2013 Presentation of work at congresses
During the last decade, there has been a growing interest in making autonomous social robots able to interact with people. However, there are still many open issues regarding the social capabilities that robots should have in order to perform these interactions more naturally. In this paper we present the results of several experiments conducted at the Barcelona Robot Lab on the campus of the “Universitat Politècnica de Catalunya”, in which we have analyzed different important aspects of the interaction between a mobile robot and non-trained human volunteers. First, we have proposed different robot behaviors to approach a person and create an engagement with him/her. In order to perform this task we have provided the robot with several perception and action capabilities, such as that of detecting people, planning an approach and verbally communicating its intention to initiate a conversation. Once the initial engagement has been created, we have developed further communication skills in order to let people assist the robot and improve its face recognition system. After this assisted and online learning stage, the robot becomes able to detect people under severely changing conditions, which, in turn, improves the number and the manner in which subsequent human-robot interactions are performed.
Serradell, E.; Glowacki, P.; Kybic, J.; Moreno-Noguer, F.; Fua, P. IEEE Computer Society conference on computer vision and pattern recognition. Proceedings Vol. 2012, p. 996-1003 DOI: 10.1109/CVPR.2012.6247776 Date of publication: 2012 Journal article
We present a new approach to matching graphs embedded in R2 or R3. Unlike earlier methods, our approach does not rely on the similarity of local appearance features, does not require an initial alignment, can handle partial matches, and can cope with non-linear deformations and topological differences. To handle arbitrary non-linear deformations, we represent them as Gaussian Processes. In the absence of appearance information, we iteratively establish correspondences between graph nodes, update the structure accordingly, and use the current mapping estimate to find the most likely correspondences that will be used in the next iteration. This makes the computation tractable. We demonstrate the effectiveness of our approach first on synthetic cases and then on angiography data, retinal fundus images, and microscopy image stacks acquired at very different resolutions.
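The Gaussian Process deformation model above can be sketched with plain GP regression on the displacement field implied by the current node correspondences; the RBF kernel, length scale and noise level are illustrative assumptions:

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel between two point sets."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale ** 2)

def gp_deformation(src, dst, query, length_scale=1.0, noise=1e-4):
    """Predict where unmatched nodes map, given matched pairs.

    src, dst : (N, D) matched node coordinates in the two graphs.
    query    : (M, D) source nodes to map.
    Models the displacement field dst - src as a GP and returns
    the posterior mean at the query points.
    """
    K = rbf_kernel(src, src, length_scale) + noise * np.eye(len(src))
    Ks = rbf_kernel(query, src, length_scale)
    alpha = np.linalg.solve(K, dst - src)
    return query + Ks @ alpha

# four matched corners displaced by (1, 0); predict an interior node
src = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
dst = src + np.array([1.0, 0.0])
pred = gp_deformation(src, dst, np.array([[1.0, 1.0]]), length_scale=5.0)
```

Each iteration of the matching would refit this mapping with the updated correspondence set, then use the predicted positions to score the next round of candidate matches.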
Grasping highly deformable objects, like textiles, is an emerging area of research that involves both perception and manipulation abilities. As new techniques appear, it becomes essential to design strategies to compare them. However, this is not an easy task, since the large state-space of textile objects explodes when coupled with the variability of grippers, robotic hands and robot arms performing the manipulation task. This high variability makes it very difficult to design experiments to evaluate the performance of a system in a repeatable way and compare it to others. We propose a framework to allow the comparison of different grasping methods for textile objects.
Instead of measuring each component separately, we therefore propose a methodology to explicitly measure the vision-manipulation correlation by taking into account the throughput of the actions. Perceptions of deformable objects should be grouped into different clusters, and the different grasping actions available should be tested for each perception type to obtain the action-perception success ratio. This characterization potentially allows the comparison of very different systems in terms of the cost of performing each action. We will also show that this categorization is useful in the manipulation planning of widely useful actions.
Simo, E.; Ramisa, A.; Torras, C.; Alenyà, G.; Moreno-Noguer, F. IEEE Conference on Computer Vision and Pattern Recognition p. 2673-2680 DOI: 10.1109/CVPR.2012.6247988 Presentation's date: 2012 Presentation of work at congresses
Markerless 3D human pose detection from a single image is a severely underconstrained problem because different 3D poses can have similar image projections. In order to handle this ambiguity, current approaches rely on prior shape models that can only be correctly adjusted if 2D image features are accurately detected. Unfortunately, although current 2D part detector algorithms have shown promising results, they are not yet accurate enough to guarantee a complete disambiguation of the 3D inferred shape.
In this paper, we introduce a novel approach for estimating 3D human pose even when observations are noisy. We propose a stochastic sampling strategy to propagate the noise from the image plane to the shape space. This provides a set of ambiguous 3D shapes, which are virtually indistinguishable from their image projections. Disambiguation is then achieved by imposing kinematic constraints that guarantee the resulting pose resembles a 3D human shape. We validate the method on a variety of situations in which state-of-the-art 2D detectors yield either inaccurate estimations or partly miss some of the body parts.
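The sampling strategy can be illustrated with a linear shape basis and an orthographic camera, both simplifying assumptions chosen for the sketch: each noisy resampling of the 2D detections yields one plausible 3D shape, and the resulting set captures the ambiguity that the kinematic constraints must later resolve.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ambiguous_shapes(obs_2d, mean_shape, basis, sigma, n_samples=200):
    """Propagate 2D detection noise to a linear shape space.

    obs_2d     : (P, 2) noisy 2D part detections.
    mean_shape : (P, 3) mean 3D shape.
    basis      : (K, P, 3) deformation modes.
    sigma      : image-noise standard deviation.
    Assumes orthographic projection onto the xy plane; returns
    (n_samples, P, 3) candidate shapes consistent with the noise.
    """
    K = basis.shape[0]
    A = basis[:, :, :2].reshape(K, -1).T            # (2P, K)
    shapes = []
    for _ in range(n_samples):
        z = obs_2d + rng.normal(0, sigma, obs_2d.shape)
        b = (z - mean_shape[:, :2]).ravel()
        w, *_ = np.linalg.lstsq(A, b, rcond=None)   # modal weights
        shapes.append(mean_shape + np.tensordot(w, basis, axes=1))
    return np.array(shapes)

# toy basis with one mode; with sigma = 0 every sample is identical
mean_shape = np.zeros((3, 3))
basis = np.array([[[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, -1.0]]])
obs = (mean_shape + 2 * basis[0])[:, :2]
shapes = sample_ambiguous_shapes(obs, mean_shape, basis, sigma=0.0, n_samples=5)
```

With sigma > 0, the depth coordinates of the samples spread out while their projections remain close to the detections, which is exactly the ambiguity described above.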
This paper studies the use of temporal consistency to match appearance descriptors and handle complex ambiguities when computing dynamic depth maps from stereo. Previous attempts have designed 3D descriptors over the spacetime volume and have been mostly used for monocular action recognition, as they cannot deal with perspective changes. Our approach is based on a state-of-the-art 2D dense appearance descriptor which we extend in time by means of optical flow priors, and can be applied to wide-baseline stereo setups. The basic idea behind our approach is to capture the changes around a feature point in time instead of trying to describe the spatiotemporal volume. We demonstrate its effectiveness on very ambiguous synthetic video sequences with ground truth data, as well as real sequences.
We present an algorithm for geometric matching of graphs embedded in 2D or 3D space. It is applicable for registering any graph-like structures appearing in biomedical images, such as blood vessels, pulmonary bronchi, nerve fibers, or dendritic arbors. Our approach does not rely on the similarity of local appearance features, so it is suitable for multimodal registration with a large difference in appearance. Unlike earlier methods, the algorithm uses edge shape, does not require an initial pose estimate, can handle partial matches, and can cope with nonlinear deformations and topological differences.
The matching consists of two steps. First, we find an affine transform that roughly aligns the graphs by exploring the set of all consistent correspondences between the nodes. This can be done at an acceptably low computational expense by using parameter uncertainties for pruning, backtracking as needed. Parameter uncertainties are updated in a Kalman-like scheme with each match.
In the second step we allow for a nonlinear part of the deformation, modeled as a Gaussian Process. Short sequences of edges are grouped into superedges, which are then matched between graphs. This allows for topological differences.
A maximum consistent set of superedge matches is found using a dedicated branch-and-bound solver, which is over 100 times faster than a standard linear programming approach. Geometrical and topological consistency of candidate matches is determined in a fast hierarchical manner.
We demonstrate the effectiveness of our technique at registering angiography and retinal fundus images, as well as neural image stacks.
Ramisa, A.; Alenyà, G.; Moreno-Noguer, F.; Torras, C. IEEE International Conference on Robotics and Automation p. 1703-1708 DOI: 10.1109/ICRA.2012.6225045 Presentation's date: 2012 Presentation of work at congresses
Detecting grasping points is a key problem in cloth manipulation. Most current approaches follow a multiple re-grasp strategy for this purpose, in which clothes are sequentially grasped from different points until one of them leads to a desired configuration. In this paper, by contrast, we circumvent the need for multiple re-graspings by building a robust detector that identifies the grasping points, generally in one single step, even when clothes are highly wrinkled.
In order to handle the large variability a deformed cloth may have, we build a Bag of Features based detector that combines appearance and 3D geometry features. An image is scanned using a sliding window with a linear classifier, and the candidate windows are refined using a non-linear SVM and a “grasp goodness” criterion to select the best grasping point.
We demonstrate our approach detecting collars in deformed polo shirts, using a Kinect camera. Experimental results show a good performance of the proposed method not only in identifying the same trained textile object part under severe deformations and occlusions, but also the corresponding part in other clothes, exhibiting a degree of generalization.
We present an Online Random Ferns (ORFs) classifier that progressively learns and builds enhanced models of object appearances. During the learning process, we allow the human intervention to assist the classifier and discard false positive training samples. The amount of human intervention is minimized and integrated within the online learning, such that in a few seconds, complex object appearances can be learned. After the assisted learning stage, the classifier is able to detect the object under severe changing conditions. The system runs at a few frames per second, and has been validated for face and object detection tasks on a mobile robot platform. We show that with minimal human assistance we are able to build a detector robust to viewpoint changes, partial occlusions, varying lighting and cluttered backgrounds.
The ARCAS project proposes the development and experimental validation of the first cooperative free-flying robot system for assembly and structure construction. The project will pave the way for a large number of applications including the building of platforms for evacuation of people or landing aircraft, the inspection and maintenance of facilities and the construction of structures in inaccessible sites and in space.
The detailed scientific and technological objectives are:
1) New methods for motion control of a free-flying robot with a mounted manipulator in contact with a grasped object, as well as for coordinated control of multiple cooperating flying robots with manipulators in contact with the same object (e.g. for precise placement or joint manipulation)
2) New flying robot perception methods to model, identify and recognize the scenario and to be used for guidance in the assembly operation, including fast generation of 3D models, aerial 3D SLAM, 3D tracking and cooperative perception
3) New methods for cooperative assembly planning and structure construction by means of multiple flying robots, with application to inspection and maintenance activities
4) Strategies for operator assistance, including visual and force feedback, in manipulation tasks involving multiple cooperating flying robots
The above methods and technologies will be integrated in the ARCAS cooperative flying robot system, which will be validated in the following scenarios: a) an indoor testbed with quadrotors, b) an outdoor scenario with helicopters, and c) free-flying simulation using multiple robot arms.
The project will be implemented by a high-quality consortium whose partners have already demonstrated cooperative transportation by aerial robots as well as high-performance cooperative ground manipulation.
The team has the ability to produce for the first time challenging technological demonstrations with a high potential for generation of industrial products upon project completion.
Villamizar, M.A.; Moreno-Noguer, F.; Andrade-Cetto, J.; Sanfeliu, A. Iberian conference on pattern recognition and image analysis p. 67-75 DOI: 10.1007/978-3-642-21257-4_9 Presentation's date: 2011 Presentation of work at congresses
We present an experimental evaluation of Boosted Random Ferns in terms of detection performance and training data. We show that adding an iterative bootstrapping phase during the learning of the object classifier increases its detection rates, given that additional positive and negative samples are collected (bootstrapped) for retraining the boosted classifier. After each bootstrapping iteration, the learning algorithm concentrates on computing more discriminative and robust features (Random Ferns), since the bootstrapped samples extend the training data with more difficult images.
The resulting classifier has been validated on two different object datasets, yielding successful detection rates in spite of challenging image conditions such as lighting changes, mild occlusions and cluttered backgrounds.
Perception and manipulation of rigid objects have received a lot of attention, and several solutions have been proposed. In contrast, dealing with deformable objects is a relatively new and challenging task because they are more complex to model, their state is difficult to determine, and self-occlusions are common and hard to estimate. In this paper we present our progress and results in the perception of deformable objects, both using conventional RGB cameras and active sensing strategies by means of depth cameras. We provide insights in two different areas of application: grasping of textiles and plant leaf modelling.
We present an algorithm to simultaneously recover non-rigid shape and camera poses from point correspondences between a reference shape and a sequence of input images. The key novel contribution of our approach is in bringing the tools of the probabilistic SLAM methodology from a rigid to a deformable domain. Under the assumption that the shape may be represented as a weighted sum of deformation modes, we show that the problem of estimating the modal weights along with the camera poses may be probabilistically formulated as a maximum a posteriori estimate and solved using an iterative least squares optimization. An extensive evaluation on synthetic and real data shows that our approach has several significant advantages over current approaches, such as performing robustly under large amounts of noise and outliers, and requiring neither tracking points over the whole sequence nor initializations close to the ground-truth solution.
Villamizar, M.A.; Grabner, H.; Andrade-Cetto, J.; Sanfeliu, A.; Van Gool, L.; Moreno-Noguer, F. British Machine Vision Conference p. 20.1 DOI: 10.5244/C.25.20 Presentation's date: 2011 Presentation of work at congresses
We propose an efficient method for object localization and 3D pose estimation. A two-step approach is used. In the first step, a pose estimator is evaluated on the input images in order to estimate potential object locations and poses. These candidates are then validated, in the second step, by the corresponding pose-specific classifier. The result is a detection approach that avoids the inherent and expensive cost of testing the complete set of specific classifiers over the entire image. A further speedup is achieved by feature sharing. Features are computed only once and are then used for evaluating the pose estimator and all specific classifiers. The proposed method has been validated on two public datasets for the problem of detecting cars under several views. The results show that the proposed approach yields high detection rates while maintaining efficiency.
Ramisa, A.; Alenyà, G.; Moreno-Noguer, F.; Torras, C. International Conference of the Catalan Association for Artificial Intelligence p. 199-207 DOI: 10.3233/978-1-60750-842-7-199 Presentation's date: 2011 Presentation of work at congresses
In this paper we address the problem of finding an initial good grasping point for the robotic manipulation of textile objects lying on a flat surface. Given as input a point cloud of the cloth acquired with a 3D camera, we propose choosing as grasping points those that maximize a new measure of wrinkledness, computed from the distribution of normal directions over local neighborhoods. Real grasping experiments using a robotic arm are performed, showing that the proposed measure leads to promising results.
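One simple way to realize such a measure, not necessarily the paper's exact definition, is the length of the mean normal vector over a neighborhood: aligned normals indicate a flat region, while dispersed normals indicate a wrinkled one worth grasping.

```python
import numpy as np

def wrinkledness(normals):
    """Dispersion of unit normals in a local neighborhood.

    normals : (N, 3) unit normals around a candidate grasp point.
    Returns a value in [0, 1]: 0 for a perfectly flat region
    (all normals aligned), approaching 1 as directions spread out.
    This resultant-length statistic is one plausible way to
    summarize the distribution of normal directions.
    """
    return 1.0 - np.linalg.norm(normals.mean(axis=0))

flat = np.tile([0.0, 0.0, 1.0], (10, 1))
wrinkled = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, -1.0],
                     [1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]])
w_flat, w_wrinkled = wrinkledness(flat), wrinkledness(wrinkled)
```

Scanning this score over local neighborhoods of the point cloud and picking the maximum yields the kind of grasping-point candidates the paper evaluates.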
Recent advances in 3D shape recognition have shown that kernels based on diffusion geometry can be effectively used to describe local features of deforming surfaces. In this paper, we introduce a new framework that allows using these kernels on 2D local patches, yielding a novel feature point descriptor that is both invariant to non-rigid image deformations and illumination changes.
In order to build the descriptor, 2D image patches are embedded as 3D surfaces, by multiplying the intensity level by an arbitrarily large and constant weight that favors anisotropic diffusion and retains the gradient magnitude information. Patches are then described in terms of a heat kernel signature, which is made invariant to intensity changes, rotation and scaling. The resulting feature point descriptor is proven to be significantly more discriminative than state-of-the-art ones, even those which are specifically designed for describing non-rigid image deformations.
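The heat kernel signature at the core of the descriptor can be sketched on a small graph standing in for the embedded patch: with eigenpairs (λᵢ, φᵢ) of the graph Laplacian, HKS(x, t) = Σᵢ exp(−λᵢ t) φᵢ(x)². The tiny graph here is purely illustrative; in the paper the surface comes from the intensity-weighted patch embedding.

```python
import numpy as np

def heat_kernel_signature(W, idx, times):
    """HKS of one node of a graph approximating the embedded patch.

    W     : (N, N) symmetric adjacency weights.
    idx   : node at which to evaluate the signature.
    times : iterable of diffusion times t.
    """
    L = np.diag(W.sum(axis=1)) - W          # combinatorial graph Laplacian
    lam, phi = np.linalg.eigh(L)            # eigenvalues and eigenvectors
    return np.array([np.sum(np.exp(-lam * t) * phi[idx] ** 2)
                     for t in times])

# a 3-node path graph: heat concentrated at node 0 diffuses away
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
hks = heat_kernel_signature(W, 0, [0.0, 100.0])
```

At t = 0 the signature equals 1 (no diffusion has occurred), and for large t it converges to 1/N on a connected graph; the intermediate decay profile is what discriminates local geometry.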
Simo, E.; Moreno-Noguer, F.; Perez-Gracia, A. ASME International Design Engineering Technical Conferences p. 377-386 DOI: 10.1115/DETC2011-47818 Presentation's date: 2011 Presentation of work at congresses
In this paper, we explore the idea of designing non-anthropomorphic multi-fingered robotic hands for tasks that replicate the motion of the human hand. Taking as input data a finite set of rigid-body positions for the five fingertips, we develop a method to perform dimensional synthesis for a kinematic chain with a tree structure, with five branches that share three common joints. We state the forward kinematics equations of relative displacements for each serial chain expressed as dual quaternions, and solve for up to five chains simultaneously to reach a number of positions along the hand trajectory. This is done using a hybrid global numerical solver that integrates a genetic algorithm and a Levenberg-Marquardt local optimizer. Although the number of candidate solutions in this problem is very high, the use of the genetic algorithm allows us to perform an exhaustive exploration of the solution space to obtain a set of solutions. We can then choose some of the solutions based on the specific task to perform. Note that these designs match the task exactly while generally having a finger design radically different from that of the human hand.
Serradell, E.; Romero, A.; Leta, R.; Gatta, C.; Moreno-Noguer, F. International Conference on Computer Vision p. 850-857 DOI: 10.1109/ICCV.2011.6126325 Presentation's date: 2011 Presentation of work at congresses
We present a novel approach to simultaneously reconstruct the 3D structure of a non-rigid coronary tree and estimate point correspondences between an input X-ray image and a reference 3D shape. At the core of our approach lies an optimization scheme that iteratively fits a generative 3D model of increasing complexity and guides the matching process. As a result, and in contrast to existing approaches that assume rigidity or quasi-rigidity of the structure, our method is able to retrieve large non-linear deformations even when the input data is corrupted by the presence of noise and partial occlusions. We extensively evaluate our approach under synthetic and real data and demonstrate a remarkable improvement compared to state-of-the-art.
The GARNICS project aims at 3D sensing of plant growth and building perceptual representations for learning the links to actions of a robot gardener. Plants are complex, self-changing systems whose complexity increases over time. Actions performed on plants (like watering) will have strongly delayed effects. Thus, monitoring and controlling plants is a difficult perception-action problem requiring advanced predictive cognitive properties, which so far can only be provided by experienced human gardeners. Sensing and control of a plant's actual properties, i.e. its phenotype, is relevant to e.g. seed production and plant breeders. We address plant sensing and control by combining active vision with appropriate perceptual representations, which are essential for cognitive interactions. Core ingredients for these representations are channel representations, dynamic graphs and cause-effect couples (CECs). Channel representations are a wavelet-like, biologically motivated information representation, which can be generalized coherently using group theory. Using these representations, plant models, represented by dynamic graphs, will be acquired, and by interacting with a human gardener the system will be taught the different cause-effect relations resulting from possible treatments. Employing decision making and planning processes via CECs, our robot gardener will then be able to choose from its learned repertoire the appropriate actions for optimal plant growth. In this way we will arrive at an adaptive, interactive cognitive system, which will be implemented and tested in an industrially relevant plant-phenotyping application.
Sánchez, J.; Ostlund, J.; Fua, P.; Moreno-Noguer, F. IEEE Conference on Computer Vision and Pattern Recognition p. 1189-1196 DOI: 10.1109/CVPR.2010.5539831 Presentation's date: 2010 Presentation of work at congresses
Recent works have shown that 3D shape of non-rigid surfaces can be accurately retrieved from a single image given a set of 3D-to-2D correspondences between that image and another one for which the shape is known.
However, existing approaches assume that such correspondences can be readily established, which is not necessarily true when large deformations produce significant appearance changes between the input and the reference images. Furthermore, it is either assumed that the pose of the camera is known, or the estimated solution is pose-ambiguous. In this paper we relax all these assumptions and, given a set of 3D and 2D unmatched points, we present an approach to simultaneously solve their correspondences, compute the camera pose and retrieve the shape of the surface in the input image. This is achieved by introducing weak priors on the pose and shape that we model as Gaussian Mixtures. By combining them into a Kalman filter we can progressively reduce the number of 2D candidates that can potentially be matched to each 3D point, while pose and shape are refined. This lets us perform a complete and efficient exploration of the solution space and retain the best solution.
Recovering the 3D shape of deformable surfaces from single images is difficult because many different shapes have very similar projections. This is commonly addressed by restricting the set of possible shapes to linear combinations of deformation modes and by imposing additional geometric constraints. Unfortunately, because image measurements are noisy, such constraints do not always guarantee that the correct shape will be recovered. To overcome this limitation, we introduce an efficient approach to exploring the set of solutions of an objective function based on point correspondences and to proposing a small set of candidate 3D shapes. This allows the use of additional image information to choose the best one. As a proof of concept, we use either motion or shading cues to this end and show that we can handle a complex objective function without having to solve a difficult non-linear minimization problem.
Serradell, E.; Özuysal, M.; Lepetit, V.; Fua, P.; Moreno-Noguer, F. European Conference on Computer Vision p. 58-72 DOI: 10.1007/978-3-642-15558-1_5 Presentation's date: 2010 Presentation of work at congresses
The homography between a pair of images is typically computed from the correspondence of keypoints, which are established by using image descriptors. When these descriptors are not reliable, either because of repetitive patterns or large amounts of clutter, additional priors need to be considered. The Blind PnP algorithm makes use of geometric priors to guide the search for matches while computing camera pose. Inspired by this, we propose a novel approach for homography estimation that combines geometric priors with appearance priors of ambiguous descriptors. More specifically, for each point we retain its best candidates according to appearance. We then prune the set of potential matches by iteratively shrinking the regions of the image that are consistent with the geometric prior. We can then successfully compute homographies between pairs of images containing highly repetitive patterns, even under oblique viewing conditions.
Villamizar, M.A.; Moreno-Noguer, F.; Andrade-Cetto, J.; Sanfeliu, A. IEEE Conference on Computer Vision and Pattern Recognition p. 1038-1045 DOI: 10.1109/CVPR.2010.5540104 Presentation's date: 2010 Presentation of work at congresses
We present a new approach for building an efficient and robust classifier for the two class problem, that localizes objects that may appear in the image under different orientations. In contrast to other works that address this problem using multiple classifiers, each one specialized for a specific orientation, we propose a simple two-step approach with an estimation stage and a classification stage. The estimator yields an initial set of potential object poses that are then validated by the classifier. This methodology allows reducing the time complexity of the algorithm while classification results remain high.
The classifier we use in both stages is based on a boosted combination of Random Ferns over local histograms of oriented gradients (HOGs), which we compute during a preprocessing step. Both the use of supervised learning and working on the gradient space makes our approach robust while being efficient at run-time. We show these properties by thorough testing on standard databases and on a new database made of motorbikes under planar rotations, and with challenging conditions such as cluttered backgrounds, changing illumination conditions and partial occlusions.
Villamizar, M.A.; Moreno-Noguer, F.; Andrade-Cetto, J.; Sanfeliu, A. International Conference on Pattern Recognition p. 388-391 DOI: 10.1109/ICPR.2010.103 Presentation's date: 2010 Presentation of work at congresses
We propose a new algorithm for detecting multiple object categories that exploits the fact that different categories may share common features, but with different geometric distributions. This yields an efficient detector which, in contrast to existing approaches, considerably reduces the computation cost at runtime, where the feature computation step is traditionally the most expensive. More specifically, at the learning stage we compute common features by applying the same Random Ferns over the Histograms of Oriented Gradients on the training images. We then apply a boosting step to build discriminative weak classifiers, and learn the specific geometric distribution of the Random Ferns for each class. At runtime, only a few Random Ferns have to be densely computed over each input image, and their geometric distribution allows performing the detection.
The proposed method has been validated in public datasets achieving competitive detection results, which are comparable with state-of-the-art methods that use specific features per class.
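The evaluation of a single Random Fern can be sketched as a handful of binary tests on the shared feature vector, combined into an integer index into a per-class probability table. Pairwise-difference tests on pooled HOG bins are one common choice, assumed here for illustration:

```python
import numpy as np

def fern_index(features, pairs, thresholds):
    """Evaluate one Random Fern on a feature vector.

    features   : 1-D feature vector (e.g. pooled HOG bins).
    pairs      : (F, 2) indices of the features each binary test compares.
    thresholds : (F,) comparison thresholds.
    Returns an integer in [0, 2**F) indexing a lookup table of
    class-conditional probabilities learned during boosting.
    """
    bits = (features[pairs[:, 0]] - features[pairs[:, 1]]) > thresholds
    return int(bits.dot(1 << np.arange(len(bits))))

features = np.array([0.9, 0.1, 0.5, 0.5])
pairs = np.array([[0, 1], [2, 3]])
thresholds = np.array([0.0, 0.0])
idx = fern_index(features, pairs, thresholds)
```

Because the same fern responses are shared across categories, the binary tests are computed once per image location; only the learned lookup tables and geometric distributions differ per class.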
We propose a non-iterative solution to the PnP problem (the estimation of the pose of a calibrated camera from n 3D-to-2D point correspondences) whose computational complexity grows linearly with n. This is in contrast to state-of-the-art methods that are O(n⁵) or even O(n⁸), without being more accurate. Our method is applicable for all n ≥ 4 and handles properly both planar and non-planar configurations. Our central idea is to express the n 3D points as a weighted sum of four virtual control points. The problem then reduces to estimating the coordinates of these control points in the camera referential, which can be done in O(n) time by expressing these coordinates as a weighted sum of the eigenvectors of a 12 × 12 matrix and solving a small constant number of quadratic equations to pick the right weights. Furthermore, if maximal precision is required, the output of the closed-form solution can be used to initialize a Gauss-Newton scheme, which improves accuracy with a negligible amount of additional time. The advantages of our method are demonstrated by thorough testing on both synthetic and real data.
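The control-point parameterization at the heart of this method can be sketched as follows: take the centroid and principal directions of the 3D points as the four control points, then solve a small homogeneous system for each point's weights. Because the weights are invariant to rigid transforms, estimating the control points in the camera frame recovers all n points at once.

```python
import numpy as np

def barycentric_weights(points):
    """Express n 3D points as weighted sums of 4 control points.

    Control points: the centroid plus the three principal
    directions of the data. Returns (control, alphas) with
    points == alphas @ control and each weight row summing to 1.
    """
    c = points.mean(axis=0)
    _, _, Vt = np.linalg.svd(points - c, full_matrices=False)
    control = np.vstack([c, c + Vt])                 # (4, 3) control points
    # homogeneous 4x4 solve for the weights of every point at once
    C = np.hstack([control, np.ones((4, 1))]).T      # (4, 4)
    P = np.hstack([points, np.ones((len(points), 1))]).T
    alphas = np.linalg.solve(C, P).T                 # (n, 4)
    return control, alphas

rng = np.random.default_rng(1)
points = rng.normal(size=(10, 3))
control, alphas = barycentric_weights(points)
```

In the full algorithm, the same alphas link the (unknown) camera-frame control points to the image projections, yielding the 12 × 12 eigensystem mentioned above.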
We present a new model for people guidance in urban settings using several mobile robots, which overcomes the limitations of existing approaches that are either tailored to tightly bounded environments or based on unrealistic human behaviors. Although the robots' motion is controlled by means of a standard particle filter formulation, the novelty of our approach resides in how the environment and the human and robot motions are modeled. In particular, we define a “Discrete-Time-Motion” model which, on the one hand, represents the environment by means of a potential field, making it appropriate for dealing with open areas, and, on the other hand, provides motion models for people and robots that respond to realistic situations; for instance, human behaviors such as “leaving the group” are considered.