Connecting Vision and Language plays an essential role in Generative Intelligence. For this reason, large research efforts have been devoted to image captioning, i.e., describing images with syntactically and semantically meaningful sentences. Since 2015, the task has generally been addressed with pipelines composed of a visual encoder and a language model for text generation. Over these years, both components have evolved considerably through the exploitation of object regions and attributes, the introduction of multi-modal connections, fully-attentive approaches, and BERT-like early-fusion strategies. However, despite these impressive results, research in image captioning has not yet reached a conclusive answer. This work aims to provide a comprehensive overview of image captioning approaches, from visual encoding and text generation to training strategies, datasets, and evaluation metrics. In this respect, we quantitatively compare many relevant state-of-the-art approaches to identify the most impactful technical innovations in architectures and training strategies. Moreover, many variants of the problem and its open challenges are discussed. The final goal of this work is to serve as a tool for understanding the existing literature and highlighting future directions for a research area where Computer Vision and Natural Language Processing can find an optimal synergy.
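To make the encoder-decoder pipeline concrete, here is a minimal PyTorch sketch of the classic recipe the survey covers: pooled CNN features initialize an LSTM language model that generates the caption token by token with teacher forcing. All names and dimensions (CaptioningModel, feat_dim=2048, etc.) are illustrative assumptions, not any specific model from the literature.

```python
import torch
import torch.nn as nn

class CaptioningModel(nn.Module):
    """Minimal encoder-decoder captioner: pooled CNN features initialize
    an LSTM that predicts the caption one token at a time."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, hidden_dim)  # image features -> initial state
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, tokens):
        # feats: (B, feat_dim) pooled visual features; tokens: (B, T) input token ids
        h0 = torch.tanh(self.feat_proj(feats)).unsqueeze(0)  # (1, B, hidden_dim)
        c0 = torch.zeros_like(h0)
        hidden, _ = self.lstm(self.embed(tokens), (h0, c0))
        return self.out(hidden)                              # (B, T, vocab_size) logits

# toy forward pass with teacher forcing
model = CaptioningModel(vocab_size=10000)
feats = torch.randn(4, 2048)             # e.g. global-average-pooled ResNet features
caps = torch.randint(0, 10000, (4, 12))  # ground-truth caption token ids
logits = model(feats, caps[:, :-1])      # inputs are all tokens but the last
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 10000),
                             caps[:, 1:].reshape(-1))  # targets shifted by one
```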
Imaging the cosmic 21 cm signal will map out the first billion years of our Universe. The resulting 3D lightcone (LC) will encode the properties of the unseen first galaxies and physical cosmology. Here, we build on previous work using neural networks (NNs) to infer astrophysical parameters directly from 21 cm LC images. We introduce recurrent neural networks (RNNs), capable of efficiently characterizing the evolution along the redshift axis of 21 cm LC images. Using a large database of simulated cosmic 21 cm LCs, we compare the relative performance in parameter estimation of different network architectures. These include two types of RNNs, which differ in their complexity, as well as a more traditional convolutional neural network (CNN). For the ideal case of no instrumental effects, our simplest and easiest-to-train RNN performs the best, with a mean squared parameter estimation error (MSE) that is lower by a factor of ≳2 compared with the other architectures studied here, and a factor of ≳8 lower than the previously studied CNN. We also corrupt the cosmic signal by adding noise expected from a 1000 h integration with the Square Kilometre Array, as well as excising a foreground-contaminated ‘horizon wedge’. Parameter prediction errors increase when the NNs are trained on these contaminated LC images, though recovery is still good even in the most pessimistic case (with R² ≳ 0.5–0.95). However, we find no notable differences in performance between network architectures on the contaminated images. We argue this is due to the size of our data set, highlighting the need for larger data sets and/or better data augmentation in order to maximize the potential of NNs in 21 cm parameter estimation.
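As a rough illustration of how an RNN can exploit the redshift axis of a lightcone, the sketch below encodes each 2D sky slice with a shared small CNN, runs an LSTM over the resulting sequence of slice features, and regresses the astrophysical parameters from the final hidden state. This is a hypothetical minimal architecture under assumed shapes (n_params=4, 64×64 slices), not one of the networks benchmarked in the paper.

```python
import torch
import torch.nn as nn

class LightconeRNN(nn.Module):
    """Sketch of an RNN regressor for 21 cm lightcones: a CNN shared across
    redshift slices encodes each 2D slice, an LSTM summarizes the slice
    sequence along redshift, and a linear head regresses the parameters."""
    def __init__(self, n_params=4, hidden_dim=128):
        super().__init__()
        self.slice_encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),   # -> (B*Z, 32)
        )
        self.rnn = nn.LSTM(32, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_params)

    def forward(self, lc):
        # lc: (B, Z, H, W) lightcone, Z = number of redshift slices
        B, Z, H, W = lc.shape
        feats = self.slice_encoder(lc.reshape(B * Z, 1, H, W)).reshape(B, Z, -1)
        _, (h, _) = self.rnn(feats)                  # h: (1, B, hidden_dim)
        return self.head(h[-1])                      # (B, n_params)

model = LightconeRNN()
pred = model(torch.randn(2, 30, 64, 64))  # toy lightcone: 30 slices of 64x64 pixels
```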
The semantic segmentation of underwater imagery is an important step in the ecological analysis of coral habitats. To date, scientists produce fine-scale area annotations manually, an exceptionally time-consuming task that could be efficiently automated by modern CNNs. This paper extends our previous work presented at the 3DUW’19 conference, outlining the workflow for the automated annotation of imagery from the first step of dataset preparation to the last step of prediction reassembly. In particular, we propose an ecologically inspired strategy for efficient dataset partitioning, an over-sampling methodology targeted at ortho-imagery, and a score-fusion strategy. We also investigate the use of different loss functions in the optimization of a DeepLab V3+ model, to mitigate the class-imbalance problem and improve prediction accuracy on coral instance boundaries. The experimental results demonstrate the effectiveness of the ecologically inspired split in improving model performance, and quantify the advantages and limitations of the proposed over-sampling strategy. The extensive comparison of the loss functions yields numerous insights into the segmentation task; the Focal Tversky loss, typically used in medical imaging (but not in remote sensing), turns out to be the most convenient choice. By improving the accuracy of automated ortho-image processing, the results presented here promise to meet the fundamental challenge of increasing the spatial and temporal scale of coral reef research, giving researchers greater predictive ability to better manage coral reef resilience in the context of a changing environment.
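For reference, the Focal Tversky loss builds on the Tversky index TI = TP / (TP + αFN + βFP) and adds a focusing exponent, FTL = (1 − TI)^γ, so that hard, low-overlap examples dominate the gradient. Below is a minimal PyTorch sketch for the binary case; the default α, β, and γ are common values from the loss's original literature, not necessarily those used in this paper, and the exact parameterization of the exponent varies across papers.

```python
import torch

def focal_tversky_loss(probs, targets, alpha=0.7, beta=0.3, gamma=0.75, eps=1e-6):
    """Focal Tversky loss for binary segmentation.
    probs, targets: (B, H, W) tensors with values in [0, 1].
    alpha weights false negatives, beta false positives; gamma < 1
    increases the penalty on hard (low Tversky index) examples."""
    probs = probs.reshape(probs.shape[0], -1)
    targets = targets.reshape(targets.shape[0], -1)
    tp = (probs * targets).sum(dim=1)
    fn = ((1 - probs) * targets).sum(dim=1)
    fp = (probs * (1 - targets)).sum(dim=1)
    tversky = (tp + eps) / (tp + alpha * fn + beta * fp + eps)
    return ((1 - tversky) ** gamma).mean()

# toy usage
probs = torch.rand(2, 64, 64)                     # sigmoid outputs of a segmentation net
targets = (torch.rand(2, 64, 64) > 0.5).float()   # binary ground-truth masks
loss = focal_tversky_loss(probs, targets)
```

Setting α > β penalizes false negatives more than false positives, which is the usual choice for small, under-represented classes such as thin coral boundaries.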
Grapevine winter pruning is a complex task that requires skilled workers to execute correctly. This complexity is also why the task is time-consuming. Considering that the operation takes about 80–120 hours/ha to complete, and is therefore even more critical in large-size vineyards, an automated system can help speed up the process. To this end, this paper presents a novel multidisciplinary approach that tackles this challenging task in two stages: first, object segmentation is performed on grapevine images and used to create a representative model of the grapevine plant; second, a set of potential pruning points is generated from this plant representation. We describe (a) a methodology for data acquisition and annotation, (b) the fine-tuning of a neural network for grapevine segmentation, (c) an image-processing-based method for creating the representative model of grapevines, starting from the inferred segmentation, and (d) the detection and localization of potential pruning points, based on the plant model, which is a simplification of the grapevine structure. With this approach, we are able to identify a significant set of potential pruning points on the canes that can be used, with further selection, to derive the final set of real pruning points.
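The plant-model and pruning-point steps rely on classical image processing. As a purely hypothetical sketch of one such step (not the authors' actual method), the code below skeletonizes a binary cane mask inferred by the segmentation network and flags skeleton branch points, where a spur meets a cane, as candidate pruning locations, using scikit-image and SciPy.

```python
import numpy as np
from skimage.morphology import skeletonize
from scipy.ndimage import convolve

def candidate_pruning_points(cane_mask):
    """Illustrative candidate extraction from a binary cane mask:
    skeletonize the mask, then flag skeleton pixels with 3+ skeleton
    neighbours (branch points) as candidate pruning locations."""
    skeleton = skeletonize(cane_mask.astype(bool))
    # count the 8-connected skeleton neighbours of every pixel
    kernel = np.array([[1, 1, 1], [1, 0, 1], [1, 1, 1]])
    neighbours = convolve(skeleton.astype(np.uint8), kernel, mode='constant')
    branch_points = skeleton & (neighbours >= 3)
    return np.argwhere(branch_points)   # (row, col) coordinates

# toy example: a cross-shaped "cane" with one branch point at the centre
mask = np.zeros((50, 50), dtype=np.uint8)
mask[10:40, 24:27] = 1
mask[24:27, 10:40] = 1
print(candidate_pruning_points(mask))
```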