Recently, several approaches have explored the detection and classification of objects in videos to perform Zero-Shot Action Recognition (ZSAR) with remarkable results. In these methods, class-object relationships are used to associate visual patterns with the semantic side information because these relationships also tend to appear in texts; word-vector methods therefore reflect them in their latent representations. Inspired by these methods and by video captioning's ability to describe events not only with a set of objects but also with contextual information, we propose a method in which video captioning models, called observers, provide different and complementary descriptive sentences. We demonstrate that representing videos with descriptive sentences instead of deep features is viable for ZSAR and naturally alleviates the domain adaptation problem, as we reached state-of-the-art (SOTA) performance on the UCF101 dataset and competitive performance on HMDB51 without using their training sets. We also demonstrate that word vectors are unsuitable for building the semantic embedding space of our descriptions. Thus, we propose to represent the classes with sentences extracted from documents acquired with Internet search engines, without any human evaluation of description quality. Lastly, we build a shared semantic space employing BERT-based embedders pre-trained on the paraphrasing task over multiple text datasets. We show that this pre-training is essential for bridging the semantic gap. Projection onto this space is straightforward for both types of information, visual and semantic, because both are sentences, enabling classification with the nearest-neighbour rule in this shared space. Our code is available at https://github.com/valterlej/zsarcap.
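The classification step described above can be sketched as follows. This is a minimal illustration, assuming the video's descriptive sentences and the class sentences have already been embedded by a paraphrase-trained BERT embedder; the class names and vectors here are toy stand-ins, not values from the paper.

```python
import numpy as np

# Toy stand-ins for sentence embeddings in the shared semantic space.
class_names = ["archery", "basketball"]
class_embeddings = np.array([
    [0.9, 0.1, 0.0],   # embedding of sentences describing "archery"
    [0.1, 0.9, 0.2],   # embedding of sentences describing "basketball"
])

def classify(video_embedding, class_embeddings, class_names):
    """Nearest-neighbour rule with cosine similarity in the shared space."""
    v = video_embedding / np.linalg.norm(video_embedding)
    c = class_embeddings / np.linalg.norm(class_embeddings, axis=1, keepdims=True)
    sims = c @ v  # cosine similarity of the video to each class
    return class_names[int(np.argmax(sims))]

# Embedding of the observers' descriptive sentences for one test video.
video = np.array([0.85, 0.15, 0.05])
print(classify(video, class_embeddings, class_names))  # archery
```

Because both sides of the comparison are sentences projected by the same embedder, no cross-domain alignment step is needed before the nearest-neighbour search.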
We introduce a method to learn unsupervised semantic visual information based on the premise that complex events (e.g., minutes long) can be decomposed into simpler events (e.g., a few seconds long), and that these simple events are shared across several complex events. We split a long video into short frame sequences and extract their latent representations with three-dimensional convolutional neural networks. A clustering method groups these representations into a visual codebook, so that a long video is represented by a sequence of integers given by the cluster labels. A dense representation is then learned by encoding the co-occurrence probability matrix of the codebook entries. We demonstrate how this representation improves the dense video captioning task in a scenario with only visual features. As a result, we are able to replace the audio signal in the Bi-Modal Transformer (BMT) method and produce temporal proposals with comparable performance. Furthermore, concatenating the visual signal with our descriptor in a vanilla transformer achieves state-of-the-art captioning performance among methods that use only visual features, and competitive performance against multi-modal methods. Our code is available at https://github.com/valterlej/dvcusi.
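The codebook pipeline above can be sketched end to end. This is an illustrative toy version: the clip features are random vectors standing in for 3D-CNN outputs, the codebook size is arbitrary, and the co-occurrence statistics are computed over adjacent labels only.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for 3D-CNN features: each short frame sequence yields one
# feature vector; here, toy 64-d features for 500 clips.
clip_features = rng.normal(size=(500, 64))

# 1) Build the visual codebook by clustering clip features.
codebook_size = 8
kmeans = KMeans(n_clusters=codebook_size, n_init=10, random_state=0).fit(clip_features)

# 2) A long video becomes a sequence of codebook labels (integers).
video = rng.normal(size=(30, 64))   # 30 consecutive short clips
labels = kmeans.predict(video)

# 3) Dense representation: co-occurrence counts of adjacent codebook
#    entries, normalised row-wise into a probability matrix.
cooc = np.zeros((codebook_size, codebook_size))
for a, b in zip(labels[:-1], labels[1:]):
    cooc[a, b] += 1
row_sums = cooc.sum(axis=1, keepdims=True)
cooc_prob = np.divide(cooc, row_sums, out=np.zeros_like(cooc), where=row_sums > 0)
```

In the actual method the dense descriptor is learned from such co-occurrence statistics rather than used raw; the sketch only shows how the integer codebook sequence and its co-occurrence matrix arise.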
Adolescence is a phase marked by major transformations involving biological, psychological, social, and cultural aspects. Accordingly, adolescents are vulnerable to sexually transmitted diseases, among them HPV. An increase in the prevalence of HPV-associated head and neck cancer has been observed in young adults. The goal of this research was therefore to develop a smartphone application aimed at adolescents as a strategy for preventing head and neck cancer caused by HPV. To this end, participatory workshops on the relationship between head and neck cancer and HPV were held with 90 high-school adolescents aged 16 to 18. The application's content was designed around topics raised by the students and was validated by experts, with agreement above 85%. Appearance validation was performed by the adolescents themselves, with the same level of agreement.
This paper presents an approach that uses probabilistic logic reasoning to compute subjective interestingness scores for classification rules. In the proposed approach, domain knowledge is represented as a probabilistic logic program that encodes information from experts and statistical reports. Interestingness scores are computed by a procedure that uses linear programming to reason about the probabilities of interest, providing a mechanism to calculate probability-based subjective interestingness scores. A sample application illustrates the use of the described approach.
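The LP-based probabilistic reasoning can be sketched on a toy fragment. The propositions and probability values below are illustrative assumptions, not taken from the paper: given marginal probabilities for two propositions A and B (as might be encoded from expert or statistical knowledge), linear programming over the possible-world probabilities yields tight bounds on a query probability such as P(A and B).

```python
import numpy as np
from scipy.optimize import linprog

# Variables: probabilities of the four possible worlds over (A, B),
# in the order (A,B), (A,~B), (~A,B), (~A,~B).
A_eq = np.array([
    [1, 1, 1, 1],   # world probabilities sum to 1
    [1, 1, 0, 0],   # P(A) = 0.7  (assumed domain knowledge)
    [1, 0, 1, 0],   # P(B) = 0.5  (assumed domain knowledge)
])
b_eq = np.array([1.0, 0.7, 0.5])

# Query: tight bounds on P(A and B), i.e. the probability of world (A,B).
c = np.array([1.0, 0.0, 0.0, 0.0])
bounds = [(0, 1)] * 4
lo = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds).fun        # minimise
hi = -linprog(-c, A_eq=A_eq, b_eq=b_eq, bounds=bounds).fun      # maximise
print(round(lo, 3), round(hi, 3))  # 0.2 0.5
```

Interval bounds like these, rather than point probabilities, are what make a probability-based interestingness score computable when the knowledge base only partially constrains the distribution.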