In this paper, we simultaneously estimate camera pose and non-rigid 3D shape from a monocular video, using a sequential solution that combines local and global representations. We model the object as an ensemble of particles, each governed by the linear equation of Newton's second law of motion. This dynamic model is incorporated into a bundle adjustment framework, in combination with simple regularization components that ensure temporal and spatial consistency. The resulting approach allows us to sequentially estimate shape and camera poses while progressively learning a global low-rank model of the shape, which is fed back into the optimization scheme and thus introduces global constraints. The overall combination of local (physical) and global (statistical) constraints yields a solution that is both efficient and robust to several artifacts, such as noisy and missing data or sudden camera motions, without requiring any training data at all. Validation is done in a variety of real application domains, including articulated and non-rigid motion, both for continuous and discontinuous shapes. Our on-line methodology yields significantly more accurate reconstructions than competing sequential approaches, and is even comparable to the more computationally demanding batch methods.
The final publication is available at link.springer.com
This paper describes two sequential methods for recovering the camera pose together with the 3D shape of highly deformable surfaces from a monocular video. The non-rigid 3D shape is modeled as a linear combination of mode shapes with time-varying weights that define the shape at each frame and are estimated on-the-fly. The low-rank constraint is combined with standard smoothness priors to optimize the model parameters over a sliding window of image frames. We propose to obtain a physics-based shape basis from the initial frames of the video and use it to encode the time-varying shape along the sequence, reducing the problem from trilinear to bilinear. To this end, the 3D shape is discretized by means of a soup of elastic triangular finite elements, to which we apply a force balance equation. This equation is solved using modal analysis via a simple eigenvalue problem to obtain a shape basis that encodes the modes of deformation. Even though this strategy can be applied in a wide variety of scenarios, the computational load can become prohibitive when the observations are denser. We avoid this limitation by proposing two efficient coarse-to-fine approaches that allow us to easily deal with dense 3D surfaces. This results in a scalable solution that estimates a small number of parameters per frame and could potentially run in real time. We show results on both synthetic and real videos with ground truth 3D data, while robustly dealing with artifacts such as noise and missing data.
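As a rough illustration of the low-rank model mentioned in this abstract, the sketch below composes a per-frame shape as a rest shape plus a weighted sum of mode shapes. The function name `compose_shape`, the inclusion of a rest shape, and the toy two-point data are assumptions made for illustration; they are not taken from the paper, which obtains its modes from a finite-element modal analysis.

```python
from typing import List, Sequence

Point = List[float]   # one 3D point [x, y, z]
Shape = List[Point]   # a shape is a list of 3D points


def compose_shape(rest: Shape, modes: Sequence[Shape], weights: Sequence[float]) -> Shape:
    """Return rest + sum_k weights[k] * modes[k], evaluated point by point."""
    if len(modes) != len(weights):
        raise ValueError("one weight per mode is required")
    shape = [list(p) for p in rest]  # copy so the rest shape is left untouched
    for mode, w in zip(modes, weights):
        for i, displacement in enumerate(mode):
            for c in range(3):
                shape[i][c] += w * displacement[c]
    return shape


# Hypothetical toy data: two points and two deformation modes.
rest_shape = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
mode_shapes = [
    [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]],  # second point moves along y
    [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]],  # second point moves along z
]
frame_weights = [0.5, -0.25]  # weights for a single frame

frame_shape = compose_shape(rest_shape, mode_shapes, frame_weights)
```

With the basis fixed in advance, only the small weight vector (plus the camera pose) remains to be estimated per frame, which is the sense in which such a model keeps the number of per-frame parameters small.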
Pumarola, A.; Vakhitov, A.; Agudo, A.; Sanfeliu, A.; Moreno-Noguer, F. IEEE International Conference on Robotics and Automation p. 4503-4508 DOI: 10.1109/ICRA.2017.7989522 Presentation date: 2017 Conference paper presentation
Low-textured scenes are well known to be one of the main Achilles heels of geometric computer vision algorithms relying on point correspondences, and in particular of visual SLAM. Yet, there are many environments in which, despite being low-textured, one can still reliably estimate line-based geometric primitives, for instance in city and indoor scenes, or in the so-called "Manhattan worlds", where structured edges are predominant. In this paper we propose a solution to handle these situations. Specifically, we build upon ORB-SLAM, presumably the current state-of-the-art solution in terms of both accuracy and efficiency, and extend its formulation to simultaneously handle both point and line correspondences. We propose a solution that works even when most of the points have vanished from the input images and that, interestingly, can be initialized solely from the detection of line correspondences in three consecutive frames. We thoroughly evaluate our approach and the new initialization strategy on the TUM RGB-D benchmark and demonstrate that the use of lines not only improves the performance of the original ORB-SLAM solution in poorly textured frames, but also systematically improves it in sequence frames combining points and lines, without compromising efficiency.
The most standard approach to resolving the inherent ambiguities of the non-rigid structure from motion problem is to use low-rank models that approximate deforming shapes by a linear combination of rigid bases. These models are typically global, i.e., each shape basis contributes equally to all points of the surface. While this approach has been shown to be effective for representing smooth deformations, its performance degrades for surfaces composed of various regions, each following a different deformation rule. Piecewise methods attempt to capture this type of behavior by locally modeling surface patches, although they subsequently require enforcing global constraints to assemble the patches back together. In this paper we propose an approach that combines the best of global and local models: it locally considers low-rank models but, by construction, does not need to impose global constraints to guarantee local patch continuity. We achieve this with a simple expectation maximization strategy that, besides learning global shape bases, locally adapts their contribution to each specific surface region. Furthermore, as a side contribution, in order to split the surface into different local patches, we propose a novel physically-based mesh segmentation approach that obeys an energy criterion. The complete framework is evaluated on both synthetic and real datasets, and shows improved performance over competing methods.
Agudo, A.; Moreno-Noguer, F.; Calvo, B.; Montiel, J.M.M. IEEE Transactions on Pattern Analysis and Machine Intelligence Vol. 38, no. 5, p. 979-994 DOI: 10.1109/TPAMI.2015.2469293 Publication date: 2016-05-01 Journal article
We propose a new approach to simultaneously recover camera pose and 3D shape of non-rigid and potentially extensible surfaces from a monocular image sequence. For this purpose, we make use of the Extended Kalman Filter based Simultaneous Localization And Mapping (EKF-SLAM) formulation, a Bayesian optimization framework traditionally used in mobile robotics for estimating camera pose and reconstructing rigid scenarios. In order to extend the problem to a deformable domain, we represent the object's surface mechanics by means of Navier's equations, which are solved using a Finite Element Method (FEM). With these main ingredients, we can further model the material's stretching, allowing us to go a step further than most current techniques, which are typically constrained to surfaces undergoing isometric deformations. We extensively validate our approach in both real and synthetic experiments, and demonstrate its advantages with respect to competing methods. More specifically, we show that, besides simultaneously retrieving camera pose and non-rigid shape, our approach is adequate for both isometric and extensible surfaces, requires neither batch processing of all the frames nor tracking points over the whole sequence, and runs at several frames per second.
This paper describes a real-time sequential method to simultaneously recover the camera motion and the 3D shape of deformable objects from a calibrated monocular video. For this purpose, we consider the Navier-Cauchy equations used in 3D linear elasticity, solved by finite elements, to model the time-varying shape per frame. These equations are embedded in an extended Kalman filter, resulting in a sequential Bayesian estimation approach. We represent the shape, with unknown material properties, as a combination of elastic elements whose nodal points correspond to salient points in the image. The global rigidity of the shape is encoded by a stiffness matrix, computed after assembling each of these elements. With this piecewise model, we can linearly relate the 3D displacements to the 3D acting forces that cause the object deformation, which are assumed to be normally distributed. While standard finite-element-method techniques require imposing boundary conditions to solve the resulting linear system, in this work we eliminate this requirement by modeling the compliance matrix with a generalized pseudoinverse that enforces a pre-fixed rank. Our framework also ensures surface continuity without the need for a post-processing step to stitch all the piecewise reconstructions into a global smooth shape. We present experimental results using both synthetic and real videos for different scenarios ranging from isometric to elastic deformations. We also show the consistency of the estimation with respect to 3D ground truth data, include several experiments assessing robustness against artifacts and, finally, provide an experimental validation of our performance in real time at frame rate for small maps.
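To make the remark about boundary conditions more concrete, the sketch below assembles a global stiffness matrix from element contributions and shows that a rigid translation produces no force, so the unconstrained matrix is singular and cannot be inverted directly. It is a deliberately simplified 1D chain of springs, not the 3D elastic elements of the paper; the names `assemble_chain_stiffness` and `apply_stiffness` and the stiffness values are assumptions for illustration only.

```python
from typing import List

Matrix = List[List[float]]


def assemble_chain_stiffness(stiffnesses: List[float]) -> Matrix:
    """Assemble the global stiffness matrix of springs connected in series.

    Spring e links node e and node e + 1 and contributes the 2x2 element
    matrix k * [[1, -1], [-1, 1]] to the corresponding global entries.
    """
    n_nodes = len(stiffnesses) + 1
    K = [[0.0] * n_nodes for _ in range(n_nodes)]
    for e, k in enumerate(stiffnesses):
        K[e][e] += k
        K[e][e + 1] -= k
        K[e + 1][e] -= k
        K[e + 1][e + 1] += k
    return K


def apply_stiffness(K: Matrix, displacements: List[float]) -> List[float]:
    """Return the nodal forces f = K u for the given nodal displacements u."""
    return [sum(k_ij * u_j for k_ij, u_j in zip(row, displacements)) for row in K]


# Hypothetical chain: three nodes joined by two springs.
K = assemble_chain_stiffness([2.0, 3.0])

# Moving every node by the same amount stretches no spring, so f = K u = 0.
# A non-zero u with K u = 0 means K has no ordinary inverse.
rigid_translation = [1.0, 1.0, 1.0]
forces = apply_stiffness(K, rigid_translation)
```

Classical FEM removes this null space by fixing some nodes; the abstract instead describes replacing the inverse with a rank-constrained generalized pseudoinverse, which this sketch does not implement.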
In recent years, there has been a growing interest in tackling the Non-Rigid Structure from Motion (NRSfM) problem, where the shape of a deformable object and the pose of a moving camera are simultaneously estimated from a monocular video sequence. Existing solutions are limited to single objects and continuous, smoothly changing sequences. In this paper we extend NRSfM to a multi-instance domain, in which the images do not need to have temporal consistency, allowing, for instance, the joint reconstruction of the faces of multiple persons from an unordered list of images. For this purpose, we present a new formulation of the problem based on a dual low-rank shape representation that simultaneously captures the between- and within-individual deformations. The parameters of this model are learned using a variant of probabilistic linear discriminant analysis that requires consecutive batches of expectation and maximization steps. The resulting approach estimates the 3D deformable shape and pose of multiple instances from only 2D point observations on a collection of images, without requiring pre-trained 3D data, and is shown to be robust to noisy measurements and missing points. We provide quantitative and qualitative evaluation on both synthetic and real data, and show consistent benefits compared to the current state of the art.
Agudo, A.; Montiel, J.M.M.; Calvo, B.; Moreno-Noguer, F. IEEE Winter Conference on Applications of Computer Vision p. 1 DOI: 10.1109/WACV.2016.7477725 Presentation date: 2016 Conference paper presentation
This paper describes an on-line approach for estimating non-rigid shape and camera pose from monocular video sequences. We assume an initial estimate of the shape at rest to be given and represented by a triangulated mesh, which is encoded by a matrix of the distances between every pair of vertices. By applying spectral analysis to this matrix, we are then able to compute a low-dimensional shape basis that, in contrast to standard approaches, has a very direct physical interpretation and requires a much smaller number of modes to span a large variety of deformations, for both inextensible and extensible configurations. Based on this low-rank model, we then sequentially retrieve both camera motion and non-rigid shape in each image, optimizing the model parameters with bundle adjustment over a sliding window of image frames. Since the number of these parameters is small, especially when considering physical priors, our approach may potentially achieve real-time performance. Experimental results on real videos for different scenarios demonstrate remarkable robustness to artifacts such as missing and noisy observations.
In this paper, we propose a sequential solution to simultaneously estimate camera pose and non-rigid 3D shape from a monocular video. In contrast to most existing approaches, which rely on global representations of the shape, we model the object at a local level, as an ensemble of particles, each governed by the linear equation of Newton's second law of motion. This dynamic model is incorporated into a bundle adjustment framework, in combination with simple regularization components that ensure temporal and spatial consistency of the estimated shape and camera poses. The resulting approach is both efficient and robust to several artifacts, such as noisy and missing data or sudden camera motions, and does not require any training data at all. Validation is done on a variety of real video sequences, including articulated and non-rigid motion, both for continuous and discontinuous shapes. Our system is shown to perform comparably to competing, computationally expensive batch methods and shows remarkable improvement with respect to the sequential ones.
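The particle model in this abstract can be pictured with a small sketch: if the acceleration in f = m a is approximated by a second-order finite difference of positions, the relation between consecutive positions and the acting force becomes linear. The discretization, the function name `predict_next_position`, and all numeric values below are assumptions chosen for illustration, not the paper's exact formulation.

```python
from typing import List

Vec = List[float]  # a 3D vector [x, y, z]


def predict_next_position(prev: Vec, curr: Vec, force: Vec, mass: float, dt: float) -> Vec:
    """Predict x_{t+1} from f = m * a using a finite-difference acceleration.

    With a ~= (x_{t+1} - 2 x_t + x_{t-1}) / dt**2, solving for x_{t+1} gives
    x_{t+1} = 2 x_t - x_{t-1} + (f / m) * dt**2.
    """
    if mass <= 0.0 or dt <= 0.0:
        raise ValueError("mass and dt must be positive")
    return [2.0 * c - p + (f / mass) * dt * dt for p, c, f in zip(prev, curr, force)]


# Hypothetical particle moving one unit per frame along x.
previous = [0.0, 0.0, 0.0]
current = [1.0, 0.0, 0.0]

# No force: the particle keeps its velocity.
free = predict_next_position(previous, current, [0.0, 0.0, 0.0], mass=1.0, dt=1.0)

# A force along y bends the trajectory while the motion along x continues.
pushed = predict_next_position(previous, current, [0.0, 2.0, 0.0], mass=2.0, dt=1.0)
```

Because the prediction is linear in the positions and the force, a term of this kind can be added to a least-squares objective such as bundle adjustment without introducing extra nonlinearity.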
In this paper, we address the problem of simultaneously recovering the 3D shape and pose of a deformable and potentially elastic object from 2D motion. This is a highly ambiguous problem typically tackled by using low-rank shape and trajectory constraints. We show that formulating the problem in terms of a low-rank space of the forces that induce the deformation allows for a better physical interpretation of the resulting priors and a more accurate representation of the actual object's behavior. However, this comes at the price of having to estimate the elastic model of the object in addition to force and pose. For this, we use an Expectation Maximization strategy, where each of these parameters is successively learned within partial M-steps, while robustly dealing with missing observations. We thoroughly validate the approach on both mocap and real sequences, showing more accurate 3D reconstructions than state-of-the-art methods, and additionally providing an estimate of the full elastic model with no a priori information.