Jiebo Luo scite author profile

In classic pattern recognition problems, classes are mutually exclusive by definition. Classification errors occur when the classes overlap in the feature space. We examine a different situation, occurring when the classes are, by definition, not mutually exclusive. Such problems arise in semantic scene and document classification and in medical diagnosis. We present a framework to handle such problems and apply it to the problem of semantic scene classification, where a natural scene may contain multiple objects such that the scene can be described by multiple class labels (e.g., a field scene with a mountain in the background). Such a problem poses challenges to the classic pattern recognition paradigm and demands a different treatment. We discuss approaches for training and testing in this scenario and introduce new metrics for evaluating individual examples, class recall and precision, and overall accuracy. Experiments show that our methods are suitable for scene classification; furthermore, our work appears to generalize to other classification problems of the same nature.
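To make the evaluation problem in this abstract concrete, here is a minimal sketch of how multi-label predictions might be scored per example and per class. The abstract does not define the metrics it introduces, so the Jaccard-style example score and the function names below are assumptions chosen for illustration, not the paper's definitions.

```python
from typing import List, Set, Tuple


def example_score(true: Set[str], pred: Set[str]) -> float:
    """Jaccard overlap |true & pred| / |true | pred| for one example (assumed metric)."""
    union = true | pred
    if not union:
        return 1.0  # both label sets empty: treat as a perfect match
    return len(true & pred) / len(union)


def class_precision_recall(
    truths: List[Set[str]], preds: List[Set[str]], label: str
) -> Tuple[float, float]:
    """Precision and recall for a single class over a multi-label test set."""
    tp = sum(1 for t, p in zip(truths, preds) if label in t and label in p)
    n_pred = sum(1 for p in preds if label in p)
    n_true = sum(1 for t in truths if label in t)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    return precision, recall


# Hypothetical scenes: the first one carries two labels at once.
truths = [{"field", "mountain"}, {"beach"}, {"urban"}]
preds = [{"field"}, {"beach", "sunset"}, {"urban"}]

scores = [example_score(t, p) for t, p in zip(truths, preds)]
overall = sum(scores) / len(scores)
```

A partially correct prediction (for example, "field" without "mountain") earns partial credit here instead of counting as a plain error, which is the kind of distinction a single-label accuracy figure cannot express.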

DOTA: A Large-Scale Dataset for Object Detection in Aerial Images

Xia et al. 2018

Object detection is an important and challenging problem in computer vision. Although the past decade has witnessed major advances in object detection in natural scenes, such successes have been slow to reach aerial imagery, not only because of the huge variation in the scale, orientation and shape of the object instances on the earth's surface, but also due to the scarcity of well-annotated datasets of objects in aerial scenes. To advance object detection research in Earth Vision, also known as Earth Observation and Remote Sensing, we introduce a large-scale Dataset for Object deTection in Aerial images (DOTA). To this end, we collect 2806 aerial images from different sensors and platforms. Each image is about 4000 × 4000 pixels in size and contains objects exhibiting a wide variety of scales, orientations, and shapes. These DOTA images are then annotated by experts in aerial image interpretation using 15 common object categories. The fully annotated DOTA images contain 188,282 instances, each of which is labeled by an arbitrary (8 d.o.f.) quadrilateral. To build a baseline for object detection in Earth Vision, we evaluate state-of-the-art object detection algorithms on DOTA. Experiments demonstrate that DOTA represents real Earth Vision applications well and is quite challenging. * DOTA dataset is available at
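The "arbitrary (8 d.o.f.) quadrilateral" mentioned above can be read as four (x, y) vertices, eight numbers in total. The sketch below assumes that in-memory representation (it does not reflect the dataset's actual file format) and compares the quadrilateral's area with that of its axis-aligned bounding box, which hints at why oriented annotations fit rotated objects more tightly.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def quad_area(quad: List[Point]) -> float:
    """Area of a simple quadrilateral via the shoelace formula (vertices in order)."""
    twice_area = 0.0
    for k in range(4):
        x1, y1 = quad[k]
        x2, y2 = quad[(k + 1) % 4]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def bounding_box(quad: List[Point]) -> Tuple[float, float, float, float]:
    """Axis-aligned box (x_min, y_min, x_max, y_max) enclosing the quadrilateral."""
    xs = [p[0] for p in quad]
    ys = [p[1] for p in quad]
    return min(xs), min(ys), max(xs), max(ys)


# Hypothetical object rotated by 45 degrees (a diamond shape).
quad = [(2.0, 0.0), (4.0, 2.0), (2.0, 4.0), (0.0, 2.0)]
x_min, y_min, x_max, y_max = bounding_box(quad)
box_area = (x_max - x_min) * (y_max - y_min)
fill_ratio = quad_area(quad) / box_area
```

For this diamond the quadrilateral covers only half of its axis-aligned box, so a horizontal box would include as much background as object.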

Image Captioning with Semantic Attention

et al. 2016

Automatically generating a natural language description of an image has attracted interest recently, both because of its importance in practical applications and because it connects two major artificial intelligence fields: computer vision and natural language processing. Existing approaches are either top-down, which start from a gist of an image and convert it into words, or bottom-up, which come up with words describing various aspects of an image and then combine them. In this paper, we propose a new algorithm that combines both approaches through a model of semantic attention. Our algorithm learns to selectively attend to semantic concept proposals and fuse them into hidden states and outputs of recurrent neural networks. The selection and fusion form a feedback loop connecting the top-down and bottom-up computation. We evaluate our algorithm on two public benchmarks: Microsoft COCO and Flickr30K. Experimental results show that our algorithm consistently and significantly outperforms the state-of-the-art approaches across different evaluation metrics.

Recognizing realistic actions from videos “in the wild”

2009

Learning Multi-attention Convolutional Neural Network for Fine-Grained Image Recognition

et al. 2017

Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language

Zhang, Peng, et al. 2020, AAAI

We address the problem of retrieving a specific moment from an untrimmed video by a query sentence. This is a challenging problem because a target moment may take place in relation to other temporal moments in the untrimmed video. Existing methods cannot tackle this challenge well since they consider temporal moments individually and neglect the temporal dependencies. In this paper, we model the temporal relations between video moments by a two-dimensional map, where one dimension indicates the starting time of a moment and the other indicates the end time. This 2D temporal map can cover diverse video moments with different lengths, while representing their adjacent relations. Based on the 2D map, we propose a Temporal Adjacent Network (2D-TAN), a single-shot framework for moment localization. It is capable of encoding the adjacent temporal relation, while learning discriminative features for matching video moments with referring expressions. We evaluate the proposed 2D-TAN on three challenging benchmarks, i.e., Charades-STA, ActivityNet Captions, and TACoS, where our 2D-TAN outperforms the state-of-the-art.
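The two-dimensional map described in this abstract can be pictured with a small indexing sketch: cell (start, end) stands for the moment covering clips start through end, so only cells with start <= end are meaningful. The code below illustrates that indexing only; it assumes a video split into equal-length clips, uses hypothetical function names, and contains no network or learned features, so it is not the authors' implementation.

```python
from typing import List, Tuple


def build_moment_map(num_clips: int) -> List[List[bool]]:
    """valid[start][end] is True when clips start..end (inclusive) form a moment."""
    return [[start <= end for end in range(num_clips)] for start in range(num_clips)]


def temporal_iou(a: Tuple[int, int], b: Tuple[int, int]) -> float:
    """IoU of two intervals given as (start, end) in clip units, end exclusive."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def best_moment(num_clips: int, target: Tuple[int, int]) -> Tuple[Tuple[int, int], float]:
    """Scan every valid cell and return the one that overlaps the target most."""
    valid = build_moment_map(num_clips)
    best_cell, best_iou = (0, 0), -1.0
    for start in range(num_clips):
        for end in range(num_clips):
            if not valid[start][end]:
                continue  # lower triangle: end before start, not a moment
            iou = temporal_iou((start, end + 1), target)
            if iou > best_iou:
                best_cell, best_iou = (start, end), iou
    return best_cell, best_iou


moment_map = build_moment_map(3)
cell, iou = best_moment(4, (1, 3))  # hypothetical target covering clips 1 and 2
```

Neighbouring cells in the upper triangle differ by one clip at the start or the end, which is the sense in which the map places overlapping and adjacent moments next to each other.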

Learning from Noisy Labels with Distillation

et al. 2017

The ability to learn from noisy labels is very useful in many visual recognition tasks, as a vast amount of data with noisy labels is relatively easy to obtain. Traditionally, label noise has been treated as statistical outliers, and techniques such as importance re-weighting and bootstrapping have been proposed to alleviate the problem. According to our observation, real-world noisy labels exhibit multi-mode characteristics, as the true labels do, rather than behaving like independent random outliers. In this work, we propose a unified distillation framework to use "side" information, including a small clean dataset and label relations in a knowledge graph, to "hedge the risk" of learning from noisy labels. Unlike the traditional approaches evaluated on simulated label noise, we propose a suite of new benchmark datasets, in the Sports, Species and Artifacts domains, to evaluate the task of learning from noisy labels in the practical setting. The empirical study demonstrates the effectiveness of our proposed method in all the domains.

Multimodal Fusion with Recurrent Neural Networks for Rumor Detection on Microblogs

et al. 2017

Jiebo Luo

Learning multi-label scene classification
