Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks (Devlin et al., 2019). Most of the existing approaches rely on a randomly initialized classifier on top of such networks. We argue that this fine-tuning procedure is sub-optimal as the pre-trained model has no prior on the specific classifier labels, while it might have already learned an intrinsic textual representation of the task. In this paper, we introduce a new scoring method that casts a plausibility ranking task in a full-text format and leverages the masked language modeling head tuned during the pre-training phase. We study commonsense reasoning tasks where the model must rank a set of hypotheses given a premise, focusing on the COPA (Gordon et al., 2012), Swag (Zellers et al., 2018), HellaSwag (Zellers et al., 2019) and CommonsenseQA (Talmor et al., 2019) datasets. By exploiting our scoring method without fine-tuning, we are able to produce strong baselines (e.g. 80% test accuracy on COPA) that are comparable to supervised approaches. Moreover, when fine-tuning directly on the proposed scoring function, we show that our method provides a much more stable training phase across random restarts (e.g ×10 standard deviation reduction on COPA test accuracy) and requires less annotated data than the standard classifier approach to reach equivalent performances.
Abstract-Robust wide baseline pose estimation is an essential step in the deployment of smart camera networks. In this work, we highlight some current limitations of conventional strategies for relative pose estimation in difficult urban scenes. Then we propose a solution which relies on an adaptive search of corresponding interest points in synchronized video streams which allows us to converge robustly towards a high-quality solution. The experiments are performed using a manually annotated ground truth of a large scale scene exhibiting significant depth and perspective variation, uniform areas, repetitive patterns and homogeneous dynamic elements. The results show a fast and robust convergence of the solution, and a significant improvement, compared to single image based alternatives, of the RMSE of ground truth matches, and of the maximum absolute error.
Robust wide baseline pose estimation is an essential step in the deployment of smart camera networks. In this work, we highlight some current limitations of conventional strategies for relative pose estimation in difficult urban scenes. Then, we propose a solution which relies on an adaptive search of corresponding interest points in synchronized video streams which allows us to converge robustly toward a high-quality solution. The core idea of our algorithm is to build across the image space a nonstationary mapping of the local pose estimation uncertainty, based on the spatial distribution of interest points. Subsequently, the mapping guides the selection of new observations from the video stream in order to prioritize the coverage of areas of high uncertainty. With an additional step in the initial stage, the proposed algorithm may also be used for refining an existing pose estimation based on the video data; this mode allows for performing a data-driven self-calibration task for stereo rigs for which accuracy is critical, such as onboard medical or vehicular systems. We validate our method on three different datasets which cover typical scenarios in pose estimation. The results show a fast and robust convergence of the solution, with a significant improvement, compared to single image-based alternatives, of the RMSE of ground-truth matches, and of the maximum absolute error.
This paper introduces an innovative approach for handling 2D compound hypotheses within the Belief Function Theory framework. We propose a polygon-based generic representation which relies on polygon clipping operators. This approach allows us to account in the computational cost for the precision of the representation independently of the cardinality of the discernment frame. For the BBA combination and decision making, we propose efficient algorithms which rely on hashes for fast lookup, and on a topological ordering of the focal elements within a directed acyclic graph encoding their interconnections. Additionally, an implementation of the functionalities proposed in this paper is provided as an open source library. Experimental results on a pedestrian localization problem are reported. The experiments show that the solution is accurate and that it fully benefits from the scalability of the 2D search space granularity provided by our representation.
Detection and tracking of pedestrians in vast crowded areas is a complex problem addressed actively by the computer vision community. Proposed algorithms should ideally tackle issues of accuracy and speed at the same time. Lengthy computation times for high-quality optimizationbased algorithms relying on multiple sensors make them impractical to use on long and detailed sequences. Hence, an efficient acceleration scheme, which preserves the overall accuracy, is vital to be considered. In the current work, we iterate various steps taken to accelerate a multi-camera pedestrian detection algorithm formulated as an optimization of a height map with local scene geometry constraints. The work is performed using the NVIDIA CUDA framework which allows us to efficiently utilize GPU processors and optimize the various memory accesses. The final results show more than 1000x speedup on real data frames. With respect to preserving the output accuracy, we achieve an accelerated output which is more than 99.9% in agreement with the original results.
This paper addresses the problem of head detection in crowded environments. Our detection is based entirely on the geometric consistency across cameras with overlapping fields of view, and no additional learning process is required. We propose a fully unsupervised method for inferring scene and camera geometry, in contrast to existing algorithms which require specific calibration procedures. Moreover, we avoid relying on the presence of body parts other than heads or on background subtraction, which have limited effectiveness under heavy clutter. We cast the head detection problem as a stereo MRF-based optimization of a dense pedestrian height map, and we introduce a constraint which aligns the height gradient according to the vertical vanishing point direction. We validate the method in an outdoor setting with varying pedestrian density levels. With only three views, our approach is able to detect simultaneously tens of heavily occluded pedestrians across a large, homogeneous area.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
hi@scite.ai
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.