Dense descriptors are becoming increasingly popular in a host of tasks, such as dense image correspondence, bag-of-words image classification, and label transfer. However, the extraction of descriptors on generic image points, rather than selecting geometric features, requires rethinking how to achieve invariance to nuisance parameters. In this work we pursue invariance to occlusions and background changes by introducing segmentation information within dense feature construction. The core idea is to use the segmentation cues to downplay the features coming from image areas that are unlikely to belong to the same region as the feature point. We show how to integrate this idea with dense SIFT, as well as with the dense scale- and rotation-invariant descriptor (SID). We thereby deliver dense descriptors that are invariant to background changes, rotation, and/or scaling. We explore the merit of our technique in conjunction with large displacement motion estimation and wide-baseline stereo, and demonstrate that exploiting segmentation information yields clear improvements.
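The core weighting idea can be illustrated with a minimal numpy sketch (purely illustrative, not the paper's actual pipeline; the function names and the 0.1 attenuation floor are assumptions): pixels unlikely to share the feature point's segment contribute less to the pooled descriptor.

```python
import numpy as np

def soft_region_weights(seg_labels, center):
    """Probability-like weights: 1 for pixels sharing the center's segment,
    a small floor elsewhere, so foreign-region measurements are downplayed
    rather than erased."""
    same = (seg_labels == seg_labels[center]).astype(float)
    return 0.1 + 0.9 * same

def weighted_orientation_histogram(grad_mag, grad_ori, weights, n_bins=8):
    """Pool gradient magnitudes into an orientation histogram, attenuated by
    the segmentation weights (a single SIFT-like cell, without the spatial
    subdivision of the full descriptor)."""
    bins = np.floor(grad_ori / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.zeros(n_bins)
    np.add.at(hist, bins.ravel(), (grad_mag * weights).ravel())
    n = np.linalg.norm(hist)
    return hist / n if n > 0 else hist
```

The descriptor built this way is used exactly like its unweighted counterpart, which is what makes the scheme compatible with dense SIFT and SID alike.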
We present a novel approach for feature correspondence and multiple structure discovery in computer vision. In contrast to existing methods, we exploit the fact that point sets on the same structure usually lie close to each other, thus forming clusters in the image. Given a pair of input images, we initially extract points of interest and build hierarchical representations by agglomerative clustering. We then solve a maximum weighted clique problem to find the set of corresponding clusters with the maximum number of inliers, representing the multiple structures at the correct scales. Our method is parameter-free and only needs two sets of points along with their tentative correspondences, making it extremely easy to use. We demonstrate the effectiveness of our method in multiple-structure fitting experiments on both publicly available and in-house datasets. As shown in the experiments, our approach finds a higher number of structures containing fewer outliers than state-of-the-art methods.
Recent advances in 3D shape analysis and recognition have shown that heat diffusion theory can be effectively used to describe local features of deforming and scaling surfaces. In this paper, we show how this description can be used to characterize 2D image patches, and introduce DaLI, a novel feature point descriptor with high resilience to non-rigid image transformations and illumination changes. In order to build the descriptor, 2D image patches are initially treated as 3D surfaces. Patches are then described in terms of a heat kernel signature, which captures both local and global information, and shows a high degree of invariance to non-linear image warps. In addition, by further applying a logarithmic sampling and a Fourier transform, invariance to photometric changes is achieved. Finally, the descriptor is compacted by mapping it onto a low dimensional subspace computed using Principal Component Analysis, allowing for an efficient matching. A thorough experimental validation demonstrates that DaLI is significantly more discriminative and robust to illuminations changes and image transformations than state of the art descriptors, even those specifically designed to describe non-rigid deformations.
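The heat-kernel part of this construction can be sketched in a few lines of numpy (an illustrative toy, not the DaLI implementation; the grid-graph construction and `beta` are assumptions, and the log-polar sampling and Fourier stages are omitted): treat the patch as a weighted grid graph and evaluate k_t(x, x) from the eigendecomposition of its Laplacian.

```python
import numpy as np

def patch_laplacian(patch, beta=10.0):
    """Graph Laplacian of a 4-connected grid over the patch; edge weights
    decay with intensity difference, so the graph roughly follows the
    intensity surface."""
    h, w = patch.shape
    n = h * w
    W = np.zeros((n, n))
    idx = lambda r, c: r * w + c
    for r in range(h):
        for c in range(w):
            for dr, dc in ((0, 1), (1, 0)):
                r2, c2 = r + dr, c + dc
                if r2 < h and c2 < w:
                    wgt = np.exp(-beta * (patch[r, c] - patch[r2, c2]) ** 2)
                    W[idx(r, c), idx(r2, c2)] = W[idx(r2, c2), idx(r, c)] = wgt
    return np.diag(W.sum(1)) - W

def heat_kernel_signature(patch, times=(0.1, 1.0, 10.0)):
    """HKS per pixel: k_t(x, x) = sum_i exp(-lambda_i * t) * phi_i(x)^2."""
    lam, phi = np.linalg.eigh(patch_laplacian(patch))
    return np.stack([(np.exp(-lam * t) * phi ** 2).sum(1) for t in times], axis=1)
```

Small diffusion times capture local structure and large times global structure, which is why a multi-scale signature of this kind is both discriminative and deformation-tolerant.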
Understanding humans has always been a fundamental goal of computer vision. Early works focused on simple objectives, such as detecting the position of individuals in images. As research progressed, much more complex tasks were undertaken. For example, human detection gave way to 2D and 3D pose estimation, where the task is to identify the location, in the image or in space, of the different body parts, e.g. head, torso, knees, arms. Human attributes also became a great source of interest, as they allow recognizing individuals and their properties, such as gender or age. Later, attention shifted to recognizing the action being performed. All these works build on previous research on pose estimation and attribute classification. Currently, research addresses even higher-level questions, such as predicting the motivations of human behavior or identifying a person's clothing size from a photograph.
In this thesis we develop a hierarchy of tools covering this whole range of problems, from low-level feature descriptors to high-level conditional random field models for fashion understanding, all aimed at improving the understanding of humans from monocular RGB images. To build these high-level models it is crucial to have a battery of robust and reliable low- and mid-level cues. In this spirit, we propose two novel low-level descriptors: one based on the theory of heat diffusion on images, and another that uses a convolutional neural network to learn discriminative representations of image patches.
We also introduce different generative low-level models to represent human pose: in particular, we present a discrete model based on a directed acyclic graph, and a continuous model consisting of clusters of poses on a Riemannian manifold. As mid-level cues we propose two human pose estimation algorithms: one that estimates 3D pose from an imprecise estimate in the image plane, and another that estimates 2D and 3D pose simultaneously. Finally, we build high-level models upon low- and mid-level cues for understanding people from images. Specifically, we focus on two different tasks in the fashion domain: semantic segmentation of clothing, and predicting garment fit from images with metadata in order to give fashion advice to the user.
In summary, extracting knowledge from images of humans requires building high-level models that integrate mid- and low-level cues. In general, the critical point for obtaining reliable results is the use and understanding of strong features. The fundamental contribution of this thesis is a variety of low-, mid- and high-level algorithms for human-centric images that can be integrated into high-level models, both to better understand humans from photographs and to address problems such as garment fit.
In this paper we tackle the problem of clothing parsing: Our goal is to segment and classify different garments a person is wearing. We frame the problem as the one of inference in a pose-aware Conditional Random Field (CRF) which exploits appearance, figure/ground segmentation, shape and location priors for each garment as well as similarities between segments, and symmetries between different human body parts. We demonstrate the effectiveness of our approach on the Fashionista dataset and show that we can obtain a significant improvement over the state-of-the-art.
Local features have become an essential tool in visual recognition. Much of the progress in computer vision over the past decade has built on simple, local representations such as SIFT or HOG. SIFT in particular shifted the paradigm in feature representation. Subsequent works have often focused on improving either computational efficiency, or invariance properties.
This thesis belongs to the latter group. Invariance is a particularly relevant aspect if we intend to work with dense features. The traditional approach to sparse matching is to rely on stable interest points, such as corners, where scale and orientation can be reliably estimated, enforcing invariance; dense features need to be computed on arbitrary points. Dense features have been shown to outperform sparse matching techniques in many recognition problems, and form the bulk of our work.
In this thesis we present strategies to enhance low-level, local features with mid-level, global cues. We devise techniques to construct better features, and use them to handle complex ambiguities, occlusions and background changes. To deal with ambiguities, we explore the use of motion to enforce temporal consistency with optical flow priors. We also introduce a novel technique to exploit segmentation cues, and use it to extract features invariant to background variability. For this, we downplay image measurements most likely to belong to a region different from that where the descriptor is computed. In both cases we follow the same strategy: we incorporate mid-level, "big picture" information into the construction of local features, and proceed to use them in the same manner as we would the baseline features.
We apply these techniques to different feature representations, including SIFT and HOG, and use them to address canonical vision problems such as stereo and object detection, demonstrating that the introduction of global cues yields consistent improvements. We prioritize solutions that are simple, general, and efficient.
Our main contributions are as follows:
(a) An approach to dense stereo reconstruction with spatiotemporal features, which unlike existing works remains applicable to wide baselines.
(b) A technique to exploit segmentation cues to construct dense descriptors invariant to background variability, such as occlusions or background motion.
(c) A technique to integrate bottom-up segmentation with recognition efficiently, amenable to sliding window detectors.
Serradell, E.; Amável, M.; Sznitman, R.; Kybic, J.; Moreno-Noguer, F. IEEE transactions on pattern analysis and machine intelligence Vol. 37, num. 3, p. 625-638 DOI: 10.1109/TPAMI.2014.2343235 Date of publication: 2015 Journal article
We present a new approach for matching sets of branching curvilinear structures that form graphs embedded in R^2 or R^3 and may be subject to deformations. Unlike earlier methods, ours does not rely on local appearance similarity, nor does it require a good initial alignment. Furthermore, it can cope with non-linear deformations, topological differences, and partial graphs. To handle arbitrary non-linear deformations, we use Gaussian Processes to represent the geometrical mapping relating the two graphs. In the absence of appearance information, we iteratively establish correspondences between points, update the mapping accordingly, and use it to estimate where to find the most likely correspondences that will be used in the next step. To make the computation tractable for large graphs, the set of new potential matches considered at each iteration is not selected at random, as in many RANSAC-based algorithms. Instead, we introduce a so-called Active Testing Search strategy that performs a priority search to favor the most likely matches and speed up the process. We demonstrate the effectiveness of our approach first on synthetic cases, and then on angiography data, retinal fundus images, and microscopy image stacks acquired at very different resolutions.
Villamizar, M.A.; Garrell, A.; Sanfeliu, A.; Moreno-Noguer, F. Iberian conference on pattern recognition and image analysis p. 496-504 DOI: 10.1007/978-3-319-19390-8_56 Presentation's date: 2015 Presentation of work at congresses
In this paper, we present an object recognition approach that additionally allows discovering intra-class modalities exhibiting highly correlated visual information. Unlike more conventional approaches based on computing multiple specialized classifiers, the proposed approach combines a single classifier, Boosted Random Ferns (BRFs), with probabilistic Latent Semantic Analysis (pLSA) in order to recognize an object class and to automatically find the most prominent intra-class appearance modalities (clusters) through tree-structured visual words. The proposed approach has been validated in synthetic and real experiments, where we show that the method is able to recognize objects with multiple appearance modalities.
In this paper, we propose a sequential solution to simultaneously estimate camera pose and non-rigid 3D shape from a monocular video. In contrast to most existing approaches that rely on global representations of the shape, we model the object at a local level, as an ensemble of particles, each governed by Newton's second law of motion. This dynamic model is incorporated into a bundle adjustment framework, in combination with simple regularization components that ensure temporal and spatial consistency of the estimated shape and camera poses. The resulting approach is both efficient and robust to several artifacts, such as noisy and missing data or sudden camera motions, and does not require any training data at all. Validation is performed on a variety of real video sequences, including articulated and non-rigid motion, both for continuous and discontinuous shapes. Our system is shown to perform comparably to competing batch methods, which are computationally expensive, and shows remarkable improvement over sequential ones.
While recent approaches have shown that it is possible to do template matching by exhaustively scanning the parameter space, the resulting algorithms are still quite demanding. In this paper we alleviate their computational load by proposing an efficient approach for predicting the matchability of a template before matching is actually performed, avoiding large amounts of unnecessary computation. We learn the matchability of templates using dense convolutional neural network descriptors that do not require ad-hoc criteria to characterize a template. By using deep learning descriptions of patches we are able to predict matchability over the whole image quite reliably. We also show that no scene-specific training data is required to solve problems like panorama stitching, which usually require data from the scene in question. Due to the highly parallelizable nature of this task, we offer an efficient technique with a negligible computational cost at test time.
Rubio, A.; Villamizar, M.A.; Ferraz, L.; Peñate, A.; Ramisa, A.; Simo, E.; Sanfeliu, A.; Moreno-Noguer, F. IEEE International Conference on Robotics and Automation p. 1397-1402 DOI: 10.1109/ICRA.2015.7139372 Presentation's date: 2015 Presentation of work at congresses
We propose a robust and efficient method to estimate the pose of a camera with respect to complex 3D textured models of the environment that can potentially contain more than 100,000 points. To tackle this problem we follow a top-down approach where we combine high-level deep network classifiers with low-level geometric approaches to come up with a solution that is fast, robust, and accurate. Given an input image, we initially use a pre-trained deep network to compute a rough estimate of the camera pose. This initial estimate constrains the number of 3D model points that can be seen from the camera viewpoint. We then establish 3D-to-2D correspondences between these potentially visible points of the model and the 2D detected image features. Accurate pose estimation is finally obtained from these 3D-to-2D correspondences using a novel PnP algorithm that rejects outliers without the need for a RANSAC strategy, and which is between 10 and 100 times faster than other methods that use it. Two real experiments dealing with very large and complex 3D models demonstrate the effectiveness of the approach.
Simo, E.; Trulls, E.; Ferraz, L.; Kokkinos, I.; Fua, P.; Moreno-Noguer, F. International Conference on Computer Vision p. 118-126 DOI: 10.1109/ICCV.2015.22 Presentation's date: 2015 Presentation of work at congresses
Deep learning has revolutionized image-level tasks such as classification, but patch-level tasks, such as correspondence, still rely on hand-crafted features, e.g. SIFT. In this paper we use Convolutional Neural Networks (CNNs) to learn discriminative patch representations, and in particular train a Siamese network with pairs of (non-)corresponding patches. We deal with the large number of potential pairs through a combination of stochastic sampling of the training set and an aggressive mining strategy biased towards patches that are hard to classify. By using the L2 distance during both training and testing we develop 128-D descriptors whose Euclidean distances reflect patch similarity, and which can be used as a drop-in replacement for any task involving SIFT. We demonstrate consistent performance gains over the state of the art, and generalize well against scaling and rotation, perspective transformation, non-rigid deformation, and illumination changes. Our descriptors are efficient to compute, amenable to modern GPUs, and publicly available.
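The mining-plus-L2 idea can be illustrated with a small numpy sketch (a toy stand-in for the actual CNN training loop; the function names and the margin value are assumptions): for each anchor, find the closest non-matching descriptor and penalize matches that are not closer than it.

```python
import numpy as np

def mine_hard_negatives(desc_a, desc_b):
    """For each anchor in desc_a, the true match is the same row of desc_b.
    The hardest negative is the closest non-matching descriptor in desc_b.
    Returns squared L2 distances to the positive and hardest negative."""
    d2 = ((desc_a[:, None, :] - desc_b[None, :, :]) ** 2).sum(-1)
    pos = np.diag(d2).copy()
    np.fill_diagonal(d2, np.inf)  # exclude the true match from mining
    neg = d2.min(1)
    return pos, neg

def hinge_loss(pos, neg, margin=1.0):
    """Contrastive-style objective: pull matches together, push the mined
    hard negatives at least `margin` away (in squared L2 here)."""
    return np.maximum(0.0, margin + pos - neg).mean()
```

Biasing the loss towards these mined hard pairs is what lets training focus on the small fraction of patch pairs that are actually difficult to classify.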
Ramisa, A.; Wang, J.; Lu, Y.; Dellandrea, E.; Moreno-Noguer, F.; Gaizauskas, R. Joint SIGDAT Conference on Empirical Methods in Natural Language Processing p. 214-220 Presentation's date: 2015 Presentation of work at congresses
We investigate the role that geometric, textual and visual features play in the task of predicting a preposition that links two visual entities depicted in an image. The task is an important part of the subsequent process of generating image descriptions. We explore the prediction of prepositions for a pair of entities, both in the case when the labels of such entities are known and unknown. In all situations we found clear evidence that all three features contribute to the prediction task.
The automatic generation of image captions has received considerable attention. The problem of evaluating caption generation systems, though, has not been explored as much. We propose a novel evaluation approach based on comparing the underlying visual semantics of the candidate and ground-truth captions. With this goal in mind we have defined a semantic representation for visually descriptive language and have augmented a subset of the Flickr-8K dataset with semantic annotations. Our evaluation metric (BAST) can be used not only to compare systems but also to do error analysis and get a better understanding of the type of mistakes a system makes. To compute BAST we need to predict the semantic representation for the automatically generated captions. We use the Flickr-ST dataset to train classifiers that predict STs so that evaluation can be fully automated.
Ramisa, A.; Alenyà, G.; Moreno-Noguer, F.; Torras, C. Engineering applications of artificial intelligence Vol. 35, p. 246-258 DOI: 10.1016/j.engappai.2014.06.025 Date of publication: 2014 Journal article
Robotic handling of textile objects in household environments is an emerging application that has recently received considerable attention thanks to the development of domestic robots. Most current approaches follow a multiple re-grasp strategy for this purpose, in which clothes are sequentially grasped from different points until one of them yields a desired configuration.
In this work we propose a vision-based method, built on the Bag of Visual Words approach, that combines appearance and 3D information to detect parts suitable for grasping in clothes, even when they are highly wrinkled.
We also contribute a new, annotated, garment part dataset that can be used for benchmarking classification, part detection, and segmentation algorithms. The dataset is used to evaluate our approach and several state-of-the-art 3D descriptors for the task of garment part detection. Results indicate that appearance is a reliable source of information, but that augmenting it with 3D information can help the method perform better with new clothing items.
We present a novel approach for learning a finite mixture model on a Riemannian manifold, where Euclidean metrics are not applicable and one must resort to geodesic distances consistent with the manifold geometry. For this purpose, we draw inspiration from a variant of the expectation-maximization algorithm that uses a minimum message length criterion to automatically estimate the optimal number of components from multivariate data lying in a Euclidean space. In order to use this approach on Riemannian manifolds, we propose a formulation in which each component is defined on a different tangent space, thus avoiding the loss of accuracy produced when linearizing the manifold with a single tangent space. Our approach can be applied to any manifold for which the tangent space can be estimated. In particular, we show results on synthetic examples of a sphere and a quadric surface, and on a large and complex dataset of human poses, where the proposed model is used as a regression tool for hypothesizing the geometry of occluded parts of the body.
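The tangent-space machinery this formulation relies on can be sketched for the unit sphere (a hedged illustration, not the paper's mixture code; the Karcher-mean iteration shown is the standard intrinsic-mean update that such per-component estimates build on):

```python
import numpy as np

def sphere_log(mu, x):
    """Log map on the unit sphere: lift x to the tangent space at mu."""
    c = np.clip(np.dot(mu, x), -1.0, 1.0)
    theta = np.arccos(c)
    if theta < 1e-12:
        return np.zeros_like(x)
    v = x - c * mu  # component of x orthogonal to mu
    return theta * v / np.linalg.norm(v)

def sphere_exp(mu, v):
    """Exp map: project a tangent vector at mu back onto the sphere."""
    t = np.linalg.norm(v)
    if t < 1e-12:
        return mu.copy()
    return np.cos(t) * mu + np.sin(t) * v / t

def karcher_mean(points, iters=20):
    """Intrinsic mean: average in the tangent space at the current estimate,
    then map back through the exp map."""
    mu = points[0] / np.linalg.norm(points[0])
    for _ in range(iters):
        v = np.mean([sphere_log(mu, p) for p in points], axis=0)
        mu = sphere_exp(mu, v)
    return mu
```

Anchoring each mixture component at its own tangent point keeps the linearization error local, which is the motivation for the per-component tangent spaces described above.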
We propose a real-time and accurate solution to the Perspective-n-Point (PnP) problem (estimating the pose of a calibrated camera from n 3D-to-2D point correspondences) that exploits the fact that, in practice, not all 2D feature positions are estimated with the same accuracy. Assuming a model of such feature uncertainties is known in advance, we reformulate the PnP problem as a maximum likelihood minimization approximated by an unconstrained Sampson error function, which naturally penalizes the noisiest correspondences. The advantages of this approach are thoroughly demonstrated in synthetic experiments where feature uncertainties are exactly known.
Pre-estimating feature uncertainties in real experiments is not easy, though. In this paper we model feature uncertainty as 2D Gaussian distributions representing the sensitivity of the 2D feature detectors to different camera viewpoints. When using these noise models with our PnP formulation we still obtain promising pose estimation results that outperform the most recent approaches.
Villamizar, M.A.; Sanfeliu, A.; Moreno-Noguer, F. IEEE International Conference on Robotics and Automation p. 4996-5003 DOI: 10.1109/ICRA.2014.6907591 Presentation's date: 2014 Presentation of work at congresses
We present a method for efficiently detecting natural landmarks that can handle scenes with highly repetitive patterns and targets that progressively change their appearance. At the core of our approach lies a Random Ferns classifier that models the posterior probabilities of different views of the target using multiple, independent Ferns, each containing features at particular positions of the target. A Shannon entropy measure is used to pick the most informative locations of these features. This minimizes the number of Ferns while maximizing their discriminative power, thus allowing robust detection at low computational cost. In addition, after offline initialization, new incoming detections are used to update the posterior probabilities on the fly and adapt to appearance changes that can occur due to the presence of shadows or occluding objects. All these virtues make the proposed detector appropriate for UAV navigation. Besides synthetic experiments that demonstrate the theoretical benefits of our formulation, we show applications for detecting landing areas in regions with highly repetitive patterns, and specific objects under cast shadows or sudden camera motions.
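The entropy-driven feature selection can be sketched as follows (an illustrative toy, not the actual Ferns implementation; the information-gain criterion shown is one standard way to score binary features against class labels):

```python
import numpy as np

def feature_information_gain(responses, labels):
    """Information gain of each binary feature about the class label.
    responses: (n_samples, n_features) array in {0, 1}; labels: (n_samples,)."""
    def entropy(p):
        p = p[p > 0]
        return -(p * np.log2(p)).sum()
    classes = np.unique(labels)
    prior = np.array([(labels == c).mean() for c in classes])
    h_y = entropy(prior)
    gains = np.zeros(responses.shape[1])
    for j in range(responses.shape[1]):
        cond_h = 0.0
        for b in (0, 1):
            mask = responses[:, j] == b
            if mask.any():
                cond = np.array([(labels[mask] == c).mean() for c in classes])
                cond_h += mask.mean() * entropy(cond)
        gains[j] = h_y - cond_h  # H(Y) - H(Y | feature j)
    return gains

def select_features(responses, labels, k):
    """Keep the k most informative feature locations (highest gain)."""
    return np.argsort(-feature_information_gain(responses, labels))[:k]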
Amor, A.; Ruiz, A.; Moreno-Noguer, F.; Sanfeliu, A. IEEE International Conference on Robotics and Automation p. 2595-2601 DOI: 10.1109/ICRA.2014.6907231 Presentation's date: 2014 Presentation of work at congresses
We present a real-time method for pose estimation of objects from a UAV, using visual marks placed on non-planar surfaces. It is designed to overcome the constraints of small aerial robots, such as slow CPUs, low-resolution cameras, and image deformations due to distortions introduced by the lens or by viewpoint changes produced during flight. The method consists of shape registration from contours extracted from an image. Instead of working with dense image patches or corresponding image features, we optimize a geometric alignment cost computed directly from the raw polygonal representations of the observed regions using efficient clipping algorithms. Moreover, instead of performing 2D image processing operations, the optimization is carried out in the polygon representation space, allowing real-time projective matching. Deformation modes are easily included in the optimization scheme, allowing an accurate registration of different markers attached to curved surfaces using a single deformable prototype.
As a result, the method achieves accurate object pose estimation in real time, which is very important for interactive UAV tasks, for example short-distance surveillance or bar assembly. We describe the main algorithmic components of the method and present experiments where our method yields an average position error of less than 5 mm at a distance of 0.7 m, using a 19 mm × 19 mm visual mark. Finally, we compare these results with current state-of-the-art computer vision systems.
Tsogkas, S.; Kokkinos, I.; Trulls, E.; Sanfeliu, A.; Moreno-Noguer, F. IEEE Conference on Computer Vision and Pattern Recognition p. 168-175 DOI: 10.1109/CVPR.2014.29 Presentation's date: 2014 Presentation of work at congresses
In this work we propose a technique to combine bottom-up segmentation, coming in the form of SLIC superpixels, with sliding window detectors, such as Deformable Part Models (DPMs).
The merit of our approach lies in ‘cleaning up’ the low-level HOG features by exploiting the spatial support of SLIC superpixels; this can be understood as using segmentation to split the feature variation into object-specific and background changes. Rather than committing to a single segmentation we use a large pool of SLIC superpixels and combine them in a scale-, position- and object-dependent manner to build soft segmentation masks. The segmentation masks can be computed fast enough to repeat this process over every candidate window, during training and detection, for both the root and part filters of DPMs.
We use these masks to construct enhanced, background-invariant features to train DPMs. We test our approach on the PASCAL VOC 2007, outperforming the standard DPM in 17 out of 20 classes, yielding an average increase of 1.7% AP. Additionally, we demonstrate the robustness of this approach, extending it to dense SIFT descriptors for large displacement optical flow.
We propose a real-time solution to the Perspective-n-Point (PnP) problem that is accurate and robust to outliers. The main advantages of our solution are twofold: first, it integrates outlier rejection within the pose estimation pipeline with negligible computational overhead; and second, it scales to an arbitrarily large number of correspondences. Given a set of 3D-to-2D matches, we formulate the pose estimation problem as a low-rank homogeneous system whose solution lies in its 1D null space. Outlier correspondences are those rows of the linear system which perturb the null space, and are progressively detected by projecting them on an iteratively estimated solution of the null space. Since our outlier removal process is based on an algebraic criterion which does not require computing the full pose and reprojecting all 3D points back onto the image plane at each step, we achieve speed gains of more than 100× compared to RANSAC strategies. An extensive experimental evaluation shows that our solution yields accurate results in situations with up to 50% outliers, and can process more than 1,000 correspondences in less than 5 ms.
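The null-space outlier rejection can be illustrated with a short numerical sketch; this is a generic version of the algebraic idea (estimate the 1D null space, score each row by its residual, keep the best rows and re-estimate), not the paper's exact linear system, and the function name and schedule are assumptions.

```python
import numpy as np

def nullspace_outlier_rejection(A, n_iters=10, keep_frac=0.8):
    """Iteratively estimate the 1D null space of a low-rank system A,
    discarding the rows (correspondences) that perturb it most.

    At each iteration the null vector is taken as the smallest right
    singular vector of the surviving rows, each row is scored by its
    algebraic residual |A_i @ v|, and only the best-scoring fraction
    survives to the next round. No pose computation or reprojection
    of 3D points is needed at any step.
    """
    idx = np.arange(A.shape[0])
    for _ in range(n_iters):
        _, _, Vt = np.linalg.svd(A[idx], full_matrices=False)
        v = Vt[-1]                       # current null-space estimate
        res = np.abs(A[idx] @ v)         # algebraic residual per row
        keep = np.argsort(res)[: max(4, int(keep_frac * len(idx)))]
        idx = idx[np.sort(keep)]
    return v, idx
```

On a synthetic system whose inlier rows are exactly orthogonal to a known vector, the surviving rows converge to inliers only and the estimated null vector aligns with the true one.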
This paper presents a robust and efficient method for estimating the pose of a camera. The proposed method assumes prior knowledge of a 3D model of the environment, and compares a new input image only against a small set of similar images previously selected by a Bag of Visual Words algorithm. This avoids the high computational cost of matching the 2D points of the input image against all the 3D points of a complex model, which in our case contains more than 100,000 points. The pose is then estimated from these 2D-3D correspondences using a novel PnP algorithm that performs outlier rejection without resorting to RANSAC, and which is 10 to 100 times faster than methods that use it.
We introduce LETHA (Learning on Easy data, Test on Hard), a new learning paradigm consisting of building strong priors from high-quality training data, and combining them with discriminative machine learning to deal with low-quality test data. Our main contribution is an implementation of that concept for pose estimation. We first automatically build a 3D model of the object of interest from high-definition images, and devise from it a pose-indexed feature extraction scheme. We then train a single classifier to process these feature vectors. Given a low quality test image, we visit many hypothetical poses, extract features consistently and evaluate the response of the classifier. Since this process uses locations recorded during learning, it does not require matching points anymore. We use a boosting procedure to train this classifier common to all poses, which is able to deal with missing features, due in this context to self-occlusion. Our results demonstrate that the method combines the strengths of global image representations, discriminative even for very tiny images, and the robustness to occlusions of approaches based on local feature point descriptors.
Simo, E.; Quattoni, A.J.; Torras, C.; Moreno-Noguer, F. IEEE Conference on Computer Vision and Pattern Recognition p. 3634-3641 DOI: 10.1109/CVPR.2013.466 Presentation's date: 2013-06-23 Presentation of work at congresses
We introduce a novel approach to automatically recover 3D human pose from a single image. Most previous work follows a pipelined approach: initially, a set of 2D features such as edges, joints or silhouettes are detected in the image, and then these observations are used to infer the 3D pose. Solving these two problems separately may lead to erroneous 3D poses when the feature detector has performed poorly. In this paper, we address this issue by jointly solving both the 2D detection and the 3D inference problems. For this purpose, we propose a Bayesian framework that integrates a generative model based on latent variables and discriminative 2D part detectors based on HOGs, and performs inference using evolutionary algorithms. Experiments on real data demonstrate competitive results, and the ability of our methodology to provide accurate 2D and 3D pose estimations even when the 2D detectors are inaccurate.
Peñate, A.; Andrade-Cetto, J.; Moreno-Noguer, F. IEEE transactions on pattern analysis and machine intelligence Vol. 35, num. 10, p. 2387-2400 DOI: 10.1109/TPAMI.2013.36 Date of publication: 2013 Journal article
We propose a novel approach for the estimation of the pose and focal length of a camera from a set of 3D-to-2D point correspondences. Our method compares favorably to competing approaches in that it is both more accurate than existing closed-form solutions, as well as faster and also more accurate than iterative ones. Our approach is inspired by the EPnP algorithm, a recent O(n) solution for the calibrated case. Yet we show that considering the focal length as an additional unknown renders the linearization and relinearization techniques of the original approach no longer valid, especially with large amounts of noise. We present new methodologies to circumvent this limitation, termed exhaustive linearization and exhaustive relinearization, which perform a systematic exploration of the solution space in closed form. The method is evaluated on both real and synthetic data, and our results show that besides producing precise focal length estimation, the retrieved camera pose is almost as accurate as the one computed using the EPnP, which assumes a calibrated camera.
Recovering the 3D shape of deformable surfaces from single images is known to be a highly ambiguous problem because many different shapes may have very similar projections. This is commonly addressed by restricting the set of possible shapes to linear combinations of deformation modes and by imposing additional geometric constraints. Unfortunately, because image measurements are noisy, such constraints do not always guarantee that the correct shape will be recovered. To overcome this limitation, we introduce a stochastic sampling approach to efficiently explore the set of solutions of an objective function based on point correspondences. This allows us to propose a small set of ambiguous candidate 3D shapes and then use additional image information to choose the best one. As a proof of concept, we use either motion or shading cues to this end and show that we can handle a complex objective function without having to solve a difficult non-linear minimization problem. The advantages of our method are demonstrated on a variety of problems including both real and synthetic data.
Amável, M.; Sznitman, R.; Serradell, E.; Kybic, J.; Moreno-Noguer, F.; Fua, P. International Conference on Information Processing in Medical Imaging p. 572-583 DOI: 10.1007/978-3-642-38868-2_48 Presentation's date: 2013 Presentation of work at congresses
We present a general approach for solving the point-cloud matching problem for the case of mildly nonlinear transformations. Our method quickly finds a coarse approximation of the solution by exploring a reduced set of partial matches using an approach we refer to as Active Testing Search (ATS). We apply the method to the registration of graph structures by branching-point matching. It is based solely on the geometric positions of the points; no additional information is used, nor is knowledge of an initial alignment required. In a second stage, we use dynamic programming to refine the solution. We tested our algorithm on angiography, retinal fundus, and neuronal data gathered using electron and light microscopy. We show that our method solves cases not solved by most approaches, and is faster than the remaining ones.
Ramisa, A.; Alenyà, G.; Moreno-Noguer, F.; Torras, C. IEEE/RSJ International Conference on Intelligent Robots and Systems p. 824-830 DOI: 10.1109/IROS.2013.6696446 Presentation's date: 2013 Presentation of work at congresses
Most current depth sensors provide 2.5D range images in which depth values are assigned to a rectangular 2D array. In this paper we take advantage of this structured information to build an efficient shape descriptor which is about two orders of magnitude faster than competing approaches, while showing similar performance in several tasks involving deformable object recognition. Given a 2D patch surrounding a point and its associated depth values, we build the descriptor for that point based on the cumulative distances between its normals and a discrete set of normal directions. This processing is made very efficient using integral images, even allowing descriptors to be computed for every range image pixel in a few seconds. The discriminative power of our descriptor, dubbed FINDDD, is evaluated in three different scenarios: recognition of specific cloth wrinkles, instance recognition from geometry alone, and detection of reliable and informed grasping points.
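The integral-image trick behind this kind of descriptor can be sketched generically. This is not the FINDDD implementation; the voting scheme (clipped cosine similarity to each discrete normal direction) and the function names are simplifying assumptions. The key point is that after one pass building per-bin integral images, the histogram of any patch costs only four lookups per bin, regardless of patch size.

```python
import numpy as np

def normal_histogram_descriptor(normals, centers, patch=9):
    """Per-pixel shape descriptor for a range image, from its normals.

    normals: (H, W, 3) unit normals; centers: (n_bins, 3) discrete
    normal directions. For each bin, a per-pixel vote (clipped cosine
    similarity to the bin centre) is accumulated into an integral
    image, so summing votes over any patch is O(1) per bin.
    Returns a function desc(y, x) -> normalized histogram.
    """
    H, W, _ = normals.shape
    votes = np.einsum('hwc,bc->hwb', normals, centers)  # similarity per bin
    votes = np.clip(votes, 0, None)
    ii = votes.cumsum(axis=0).cumsum(axis=1)            # one integral image per bin
    r = patch // 2

    def desc(y, x):
        # patch sum via 4 integral-image lookups per bin
        y0, y1 = max(y - r, 0), min(y + r, H - 1)
        x0, x1 = max(x - r, 0), min(x + r, W - 1)
        s = ii[y1, x1].copy()
        if y0 > 0:
            s -= ii[y0 - 1, x1]
        if x0 > 0:
            s -= ii[y1, x0 - 1]
        if y0 > 0 and x0 > 0:
            s += ii[y0 - 1, x0 - 1]
        return s / s.sum()
    return desc
```

Because the integral images are shared across all query points, computing descriptors for every pixel amortizes to near-constant cost per pixel, which is consistent with the speed claim in the abstract.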
Garrell, A.; Villamizar, M.A.; Moreno-Noguer, F.; Sanfeliu, A. IEEE International Symposium on Robot and Human Interactive Communication p. 107-113 DOI: 10.1109/ROMAN.2013.6628463 Presentation's date: 2013 Presentation of work at congresses
During the last decade, there has been a growing interest in making autonomous social robots able to interact with people. However, there are still many open issues regarding the social capabilities that robots should have in order to perform these interactions more naturally. In this paper we present the results of several experiments conducted at the Barcelona Robot Lab on the campus of the “Universitat Politècnica de Catalunya”, in which we have analyzed different important aspects of the interaction between a mobile robot and non-trained human volunteers. First, we have proposed different robot behaviors to approach a person and create an engagement with him/her. In order to perform this task we have provided the robot with several perception and action capabilities, such as detecting people, planning an approach and verbally communicating its intention to initiate a conversation. Once the initial engagement has been created, we have developed further communication skills in order to let people assist the robot and improve its face recognition system. After this assisted and online learning stage, the robot becomes able to detect people under severely changing conditions, which, in turn, enhances the number and quality of subsequent human-robot interactions.
Kokkinos, I.; Trulls, E.; Sanfeliu, A.; Moreno-Noguer, F. IEEE Conference on Computer Vision and Pattern Recognition p. 2890-2897 DOI: 10.1109/CVPR.2013.372 Presentation's date: 2013 Presentation of work at congresses
In this work we exploit segmentation to construct appearance descriptors that can robustly deal with occlusion and background changes. For this, we downplay measurements coming from areas that are unlikely to belong to the same region as the descriptor’s center, as suggested by soft segmentation masks. Our treatment is applicable to any image point, i.e. dense, and its computational overhead is in the order of a few seconds. We integrate this idea with Dense SIFT, and also with Dense Scale and Rotation Invariant Descriptors (SID), delivering descriptors that are densely computable, invariant to scaling and rotation, and robust to background changes. We apply our approach to standard benchmarks on large displacement motion estimation using SIFT-flow and wide-baseline stereo, systematically demonstrating that the introduction of segmentation yields clear improvements.
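The downplaying of measurements from other regions can be sketched on a single orientation histogram. This is an illustrative simplification, not the paper's descriptor: the affinity rule (pixels whose soft-mask value is close to the centre's value count fully, others are suppressed) and all names are assumptions.

```python
import numpy as np

def mask_weighted_patch_descriptor(grad_mag, grad_ori, soft_mask, center, n_bins=8):
    """Orientation histogram of a patch in which each pixel's gradient
    contribution is downweighted by how unlikely it is to share the
    centre pixel's region, as given by a soft segmentation mask in [0, 1].
    """
    cy, cx = center
    # affinity: 1 where the mask agrees with the centre, 0 where it disagrees
    affinity = 1.0 - np.abs(soft_mask - soft_mask[cy, cx])
    bins = (grad_ori / (2 * np.pi) * n_bins).astype(int) % n_bins
    hist = np.zeros(n_bins)
    np.add.at(hist, bins.ravel(), (grad_mag * affinity).ravel())
    n = np.linalg.norm(hist)
    return hist / n if n > 0 else hist
```

With a mask that separates foreground from background, the histogram keeps only the gradient structure on the centre's side, so a background change behind the point leaves the descriptor essentially untouched.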
Simultaneously recovering the camera pose and correspondences between a set of 2D-image and 3D-model points is a difficult problem, especially when the 2D-3D matches cannot be established based on appearance only. The problem becomes even more challenging when input images are acquired with an uncalibrated camera with varying zoom, which yields strong ambiguities between translation and focal length. We present a solution to this problem using only geometrical information. Our approach owes its robustness to an initial stage in which the joint pose and focal length solution space is split into several Gaussian regions. At runtime, each of these regions is explored using a hypothesize-and-test approach, in which the potential number of 2D-3D matches is progressively reduced using informed search through Kalman updates, iteratively refining the pose and focal length parameters. The technique is exhaustive but efficient, significantly improving on previous methods in terms of robustness to outliers and noise.
Serradell, E.; Glowacki, P.; Kybic, J.; Moreno-Noguer, F.; Fua, P. IEEE Computer Society conference on computer vision and pattern recognition. Proceedings Vol. 2012, p. 996-1003 DOI: 10.1109/CVPR.2012.6247776 Date of publication: 2012 Journal article
We present a new approach to matching graphs embedded in R2 or R3. Unlike earlier methods, our approach does not rely on the similarity of local appearance features, does not require an initial alignment, can handle partial matches, and can cope with non-linear deformations and topological differences. To handle arbitrary non-linear deformations, we represent them as Gaussian Processes. In the absence of appearance information, we iteratively establish correspondences between graph nodes, update the structure accordingly, and use the current mapping estimate to find the most likely correspondences that will be used in the next iteration. This makes the computation tractable. We demonstrate the effectiveness of our approach first on synthetic cases and then on angiography data, retinal fundus images, and microscopy image stacks acquired at very different resolutions.
Ramisa, A.; Alenyà, G.; Moreno-Noguer, F.; Torras, C. IEEE International Conference on Robotics and Automation p. 1703-1708 DOI: 10.1109/ICRA.2012.6225045 Presentation's date: 2012 Presentation of work at congresses
Detecting grasping points is a key problem in cloth manipulation. Most current approaches follow a multiple re-grasp strategy for this purpose, in which clothes are sequentially grasped from different points until one of them yields a desired configuration. In this paper, by contrast, we circumvent the need for multiple re-graspings by building a robust detector that identifies the grasping points, generally in one single step, even when clothes are highly wrinkled.
In order to handle the large variability a deformed cloth may have, we build a Bag-of-Features-based detector that combines appearance and 3D geometry features. An image is scanned using a sliding window with a linear classifier, and the candidate windows are refined using a non-linear SVM and a “grasp goodness” criterion to select the best grasping point.
We demonstrate our approach by detecting collars in deformed polo shirts, using a Kinect camera. Experimental results show good performance of the proposed method, not only in identifying the same trained textile object part under severe deformations and occlusions, but also the corresponding part in other clothes, exhibiting a degree of generalization.
This paper studies the use of temporal consistency to match appearance descriptors and handle complex ambiguities when computing dynamic depth maps from stereo. Previous attempts designed 3D descriptors over the spacetime volume, which have mostly been used for monocular action recognition, as they cannot deal with perspective changes. Our approach is based on a state-of-the-art 2D dense appearance descriptor which we extend in time by means of optical flow priors, and can be applied to wide-baseline stereo setups. The basic idea behind our approach is to capture the changes around a feature point in time instead of trying to describe the spatiotemporal volume. We demonstrate its effectiveness on very ambiguous synthetic video sequences with ground truth data, as well as real sequences.
Simo, E.; Ramisa, A.; Torras, C.; Alenyà, G.; Moreno-Noguer, F. IEEE Conference on Computer Vision and Pattern Recognition p. 2673-2680 DOI: 10.1109/CVPR.2012.6247988 Presentation's date: 2012 Presentation of work at congresses
Markerless 3D human pose detection from a single image is a severely underconstrained problem because different 3D poses can have similar image projections. In order to handle this ambiguity, current approaches rely on prior shape models that can only be correctly adjusted if 2D image features are accurately detected. Unfortunately, although current 2D part detector algorithms have shown promising results, they are not yet accurate enough to guarantee a complete disambiguation of the 3D inferred shape.
In this paper, we introduce a novel approach for estimating 3D human pose even when observations are noisy. We propose a stochastic sampling strategy to propagate the noise from the image plane to the shape space. This provides a set of ambiguous 3D shapes, which are virtually indistinguishable from their image projections. Disambiguation is then achieved by imposing kinematic constraints that guarantee the resulting pose resembles a 3D human shape. We validate the method on a variety of situations in which state-of-the-art 2D detectors yield either inaccurate estimations or partly miss some of the body parts.
We present an algorithm for geometric matching of graphs embedded in 2D or 3D space. It is applicable for registering any graph-like structures appearing in biomedical images, such as blood vessels, pulmonary bronchi, nerve fibers, or dendritic arbors. Our approach does not rely on the similarity of local appearance features, so it is suitable for multimodal registration with a large difference in appearance. Unlike earlier methods, the algorithm uses edge shape, does not require an initial pose estimate, can handle partial matches, and can cope with nonlinear deformations and topological differences.
The matching consists of two steps. First, we find an affine transform that roughly aligns the graphs by exploring the set of all consistent correspondences between the nodes. This can be done at an acceptably low computational expense by using parameter uncertainties for pruning, backtracking as needed. Parameter uncertainties are updated in a Kalman-like scheme with each match.
In the second step we allow for a nonlinear part of the deformation, modeled as a Gaussian Process. Short sequences of edges are grouped into superedges, which are then matched between graphs. This allows for topological differences.
A maximum consistent set of superedge matches is found using a dedicated branch-and-bound solver, which is over 100 times faster than a standard linear programming approach. Geometrical and topological consistency of candidate matches is determined in a fast hierarchical manner.
We demonstrate the effectiveness of our technique at registering angiography and retinal fundus images, as well as neural image stacks.
Grasping highly deformable objects, like textiles, is an emerging area of research that involves both perception and manipulation abilities. As new techniques appear, it becomes essential to design strategies to compare them. However, this is not an easy task, since the large state-space of textile objects explodes when coupled with the variability of grippers, robotic hands and robot arms performing the manipulation task. This high variability makes it very difficult to design experiments to evaluate the performance of a system in a repeatable way and compare it to others. We propose a framework to allow the comparison of different grasping methods for textile objects.
Instead of measuring each component separately, we propose a methodology to explicitly measure the vision-manipulation correlation by taking into account the throughput of the actions. Perceptions of deformable objects should be grouped into different clusters, and the different grasping actions available should be tested for each perception type to obtain the action-perception success ratio. This characterization potentially allows very different systems to be compared in terms of the cost of performing each action. We will also show that this categorization is useful in the manipulation planning of widely useful actions.
We present an Online Random Ferns (ORFs) classifier that progressively learns and builds enhanced models of object appearances. During the learning process, we allow the human intervention to assist the classifier and discard false positive training samples. The amount of human intervention is minimized and integrated within the online learning, such that in a few seconds, complex object appearances can be learned. After the assisted learning stage, the classifier is able to detect the object under severe changing conditions. The system runs at a few frames per second, and has been validated for face and object detection tasks on a mobile robot platform. We show that with minimal human assistance we are able to build a detector robust to viewpoint changes, partial occlusions, varying lighting and cluttered backgrounds.
The ARCAS project proposes the development and experimental validation of the first cooperative free-flying robot system for assembly and structure construction. The project will pave the way for a large number of applications, including the building of platforms for the evacuation of people or the landing of aircraft, the inspection and maintenance of facilities, and the construction of structures in inaccessible sites and in space.
The detailed scientific and technological objectives are:
1) New methods for motion control of a free-flying robot with a mounted manipulator in contact with a grasped object, as well as for coordinated control of multiple cooperating flying robots with manipulators in contact with the same object (e.g. for precise placement or joint manipulation)
2) New flying robot perception methods to model, identify and recognize the scenario and to be used for guidance in the assembly operation, including fast generation of 3D models, aerial 3D SLAM, 3D tracking and cooperative perception
3) New methods for cooperative assembly planning and structure construction by means of multiple flying robots, with application to inspection and maintenance activities
4) Strategies for operator assistance, including visual and force feedback, in manipulation tasks involving multiple cooperating flying robots
The above methods and technologies will be integrated in the ARCAS cooperative flying robot system, which will be validated in the following scenarios: a) an indoor testbed with quadrotors, b) an outdoor scenario with helicopters, and c) free-flying simulation using multiple robot arms.
The project will be implemented by a high-quality consortium whose partners have already demonstrated cooperative transportation by aerial robots as well as high-performance cooperative ground manipulation. The team has the ability to produce, for the first time, challenging technological demonstrations with a high potential for the generation of industrial products upon project completion.
Villamizar, M.A.; Grabner, H.; Andrade-Cetto, J.; Sanfeliu, A.; Van Gool, L.; Moreno-Noguer, F. British Machine Vision Conference p. 20.1 DOI: 10.5244/C.25.20 Presentation's date: 2011 Presentation of work at congresses
We propose an efficient method for object localization and 3D pose estimation. A two-step approach is used. In the first step, a pose estimator is evaluated on the input images in order to estimate potential object locations and poses. These candidates are then validated, in the second step, by the corresponding pose-specific classifier. The result is a detection approach that avoids the inherent and expensive cost of testing the complete set of specific classifiers over the entire image. A further speedup is achieved by feature sharing: features are computed only once and are then used for evaluating the pose estimator and all specific classifiers. The proposed method has been validated on two public datasets for the problem of detecting cars under several views. The results show that the proposed approach yields high detection rates while remaining efficient.
Ramisa, A.; Alenyà, G.; Moreno-Noguer, F.; Torras, C. International Conference of the Catalan Association for Artificial Intelligence p. 199-207 DOI: 10.3233/978-1-60750-842-7-199 Presentation's date: 2011 Presentation of work at congresses
In this paper we address the problem of finding an initial good grasping point for the robotic manipulation of textile objects lying on a flat surface. Given as input a point cloud of the cloth acquired with a 3D camera, we propose choosing as grasping points those that maximize a new measure of wrinkledness, computed from the distribution of normal directions over local neighborhoods. Real grasping experiments using a robotic arm are performed, showing that the proposed measure leads to promising results.
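One simple way to score the disagreement of normal directions in a neighbourhood is via the length of their mean vector; this sketch illustrates the idea only, and its exact scoring rule and function name are assumptions, not the paper's definition of the wrinkledness measure.

```python
import numpy as np

def wrinkledness(normals):
    """Wrinkledness score of a neighbourhood of unit normals.

    normals: (..., 3) array of unit vectors. On a flat region all
    normals agree and the mean vector has length 1 (score 0); the more
    the normal directions disagree, the shorter the mean vector and
    the closer the score is to 1.
    """
    mean = normals.reshape(-1, 3).mean(axis=0)
    return 1.0 - np.linalg.norm(mean)
```

A grasping point would then be chosen where this score, computed over local point-cloud neighbourhoods, is maximal.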
Simo, E.; Moreno-Noguer, F.; Perez-Gracia, A. ASME International Design Engineering Technical Conferences p. 377-386 DOI: 10.1115/DETC2011-47818 Presentation's date: 2011 Presentation of work at congresses
In this paper, we explore the idea of designing non-anthropomorphic multi-fingered robotic hands for tasks that replicate the motion of the human hand. Taking as input data a finite set of rigid-body positions for the five fingertips, we develop a method to perform dimensional synthesis for a kinematic chain with a tree structure, with five branches that share three common joints. We state the forward kinematics equations of relative displacements for each serial chain expressed as dual quaternions, and solve for up to five chains simultaneously to reach a number of positions along the hand trajectory. This is done using a hybrid global numerical solver that integrates a genetic algorithm and a Levenberg-Marquardt local optimizer. Although the number of candidate solutions in this problem is very high, the use of the genetic algorithm allows us to perform an exhaustive exploration of the solution space to obtain a set of solutions. We can then choose some of the solutions based on the specific task to perform. Note that these designs match the task exactly while generally having a finger design radically different from that of the human hand.
Serradell, E.; Romero, A.; Leta, R.; Gatta, C.; Moreno-Noguer, F. International Conference on Computer Vision p. 850-857 DOI: 10.1109/ICCV.2011.6126325 Presentation's date: 2011 Presentation of work at congresses
We present a novel approach to simultaneously reconstruct the 3D structure of a non-rigid coronary tree and estimate point correspondences between an input X-ray image and a reference 3D shape. At the core of our approach lies an optimization scheme that iteratively fits a generative 3D model of increasing complexity and guides the matching process. As a result, and in contrast to existing approaches that assume rigidity or quasi-rigidity of the structure, our method is able to retrieve large non-linear deformations even when the input data is corrupted by the presence of noise and partial occlusions. We extensively evaluate our approach under synthetic and real data and demonstrate a remarkable improvement compared to state-of-the-art.
Perception and manipulation of rigid objects has received a lot of attention, and several solutions have been proposed. In contrast, dealing with deformable objects is a relatively new and challenging task, because they are more complex to model, their state is difficult to determine, and self-occlusions are common and hard to estimate. In this paper we present our progress in the perception of deformable objects, both using conventional RGB cameras and active sensing strategies by means of depth cameras. We provide insights into two different areas of application: grasping of textiles and plant leaf modelling.
Villamizar, M.A.; Moreno-Noguer, F.; Andrade-Cetto, J.; Sanfeliu, A. Iberian conference on pattern recognition and image analysis p. 67-75 DOI: 10.1007/978-3-642-21257-4_9 Presentation's date: 2011 Presentation of work at congresses
We present an experimental evaluation of Boosted Random Ferns in terms of detection performance and training data. We show that adding an iterative bootstrapping phase during the learning of the object classifier increases its detection rates, given that additional positive and negative samples are collected (bootstrapped) for retraining the boosted classifier. After each bootstrapping iteration, the learning algorithm concentrates on computing more discriminative and robust features (Random Ferns), since the bootstrapped samples extend the training data with more difficult images.
The resulting classifier has been validated on two different object datasets, yielding successful detection rates in spite of challenging image conditions such as lighting changes, mild occlusions and cluttered backgrounds.
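The bootstrapping loop described above follows the standard hard-negative mining pattern, which can be sketched generically. This is not the Boosted Random Ferns trainer; `train` and `score` stand in for any classifier's fit and confidence functions, and the schedule parameters are assumptions.

```python
import numpy as np

def bootstrap_training(train, score, pos, neg_pool, n_rounds=3, n_hard=50):
    """Iterative bootstrapping: after each training round, the negatives
    the current classifier scores highest (its false positives) are
    added to the training set, so later rounds focus on more difficult
    examples.

    train(X, y) -> model ; score(model, X) -> confidence per sample.
    """
    X_neg = neg_pool[:n_hard]                      # initial easy negatives
    model = None
    for _ in range(n_rounds):
        X = np.vstack([pos, X_neg])
        y = np.r_[np.ones(len(pos)), np.zeros(len(X_neg))]
        model = train(X, y)
        s = score(model, neg_pool)
        hard = neg_pool[np.argsort(s)[::-1][:n_hard]]  # top-scoring negatives
        X_neg = np.vstack([X_neg, hard])               # bootstrap them
    return model
```

Even with a trivial nearest-mean classifier, the second round already separates hard negatives that the first round confused with positives.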
We present an algorithm to simultaneously recover non-rigid shape and camera poses from point correspondences between a reference shape and a sequence of input images. The key novel contribution of our approach is in bringing the tools of the probabilistic SLAM methodology from a rigid to a deformable domain. Under the assumption that the shape may be represented as a weighted sum of deformation modes, we show that the problem of estimating the modal weights along with the camera poses may be probabilistically formulated as a maximum a posteriori estimate and solved using an iterative least squares optimization. An extensive evaluation on synthetic and real data shows that our approach has several significant advantages over current approaches, such as performing robustly under large amounts of noise and outliers, and requiring neither tracking points over the whole sequence nor initializations close to the ground-truth solution.
Recent advances in 3D shape recognition have shown that kernels based on diffusion geometry can be effectively used to describe local features of deforming surfaces. In this paper, we introduce a new framework that allows using these kernels on 2D local patches, yielding a novel feature point descriptor that is both invariant to non-rigid image deformations and illumination changes.
In order to build the descriptor, 2D image patches are embedded as 3D surfaces, by multiplying the intensity level by an arbitrarily large and constant weight that favors anisotropic diffusion and retains the gradient magnitude information. Patches are then described in terms of a heat kernel signature, which is made invariant to intensity changes, rotation and scaling. The resulting feature point descriptor is proven to be significantly more discriminative than state-of-the-art ones, even those specifically designed for describing non-rigid image deformations.
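The patch-as-surface embedding and the heat kernel signature can be sketched on a small grid. This is a bare-bones illustration under several assumptions (4-neighbour graph Laplacian with Gaussian edge weights, and none of the intensity/rotation/scale normalizations the descriptor applies), not the paper's construction.

```python
import numpy as np

def patch_hks(intensity, alpha=100.0, times=(0.1, 1.0, 10.0)):
    """Heat kernel signature of every pixel of a 2D patch embedded as
    the 3D surface (x, y, alpha * intensity); a large alpha makes
    diffusion anisotropic and preserves gradient information.

    Builds a 4-neighbour graph Laplacian with Gaussian weights on the
    embedded 3D distances, then evaluates
    HKS(p, t) = sum_k exp(-lambda_k t) phi_k(p)^2.
    """
    H, W = intensity.shape
    ys, xs = np.mgrid[0:H, 0:W]
    P = np.stack([xs.ravel(), ys.ravel(), alpha * intensity.ravel()], axis=1)
    n = H * W
    Wm = np.zeros((n, n))
    for i in range(n):
        y, x = divmod(i, W)
        for dy, dx in ((0, 1), (1, 0)):        # right and down neighbours
            yy, xx = y + dy, x + dx
            if yy < H and xx < W:
                j = yy * W + xx
                w = np.exp(-np.sum((P[i] - P[j]) ** 2))
                Wm[i, j] = Wm[j, i] = w
    L = np.diag(Wm.sum(axis=1)) - Wm           # unnormalized graph Laplacian
    lam, phi = np.linalg.eigh(L)
    hks = np.stack([(np.exp(-lam * t) * phi ** 2).sum(axis=1) for t in times],
                   axis=1)
    return hks.reshape(H, W, len(times))
```

Edges with a large intensity jump get near-zero weight, so heat flows along, rather than across, image gradients, which is the anisotropic behaviour the embedding is meant to induce.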