In this paper, we present Hierarchical Graph Network (HGN) for multi-hop question answering. To aggregate clues from scattered texts across multiple paragraphs, a hierarchical graph is created by constructing nodes at different levels of granularity (question, paragraphs, sentences, entities), the representations of which are initialized with pre-trained contextual encoders. Given this hierarchical graph, the initial node representations are updated through graph propagation, and multi-hop reasoning is performed by traversing the graph edges for each subsequent sub-task (e.g., paragraph selection, supporting-fact extraction, answer prediction). By weaving heterogeneous nodes into a single unified graph, this hierarchical differentiation of node granularity enables HGN to support different question answering sub-tasks simultaneously. Experiments on the HotpotQA benchmark demonstrate that the proposed model achieves a new state of the art, outperforming existing multi-hop QA approaches.
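As a rough illustration of the pipeline this abstract describes, here is a minimal sketch of hierarchical graph propagation. The node counts, edge layout, and single attention-weighted update are illustrative assumptions rather than the paper's implementation, and random vectors stand in for the pre-trained encoder outputs.

```python
# A minimal sketch (not HGN's code) of hierarchical graph propagation.
# Node features would come from a pre-trained contextual encoder; here
# they are random placeholders. Edges and sizes are illustrative.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 16                                    # hidden size (placeholder)

# Node inventory: 1 question, 2 paragraphs, 4 sentences, 3 entities.
n_nodes = 1 + 2 + 4 + 3
h = torch.randn(n_nodes, d)               # stand-in for encoder outputs

# Hierarchical edges (question->paragraph, paragraph->sentence,
# sentence->entity), made symmetric so information flows both ways.
edges = [(0, 1), (0, 2),                   # question -> paragraphs
         (1, 3), (1, 4), (2, 5), (2, 6),   # paragraphs -> sentences
         (3, 7), (4, 8), (6, 9)]           # sentences -> entities
A = torch.zeros(n_nodes, n_nodes)
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
A += torch.eye(n_nodes)                    # self-loops

# One round of attention-weighted propagation (GAT-flavored).
W = torch.nn.Linear(d, d, bias=False)
scores = (W(h) @ h.T) / d ** 0.5           # pairwise compatibility
scores = scores.masked_fill(A == 0, float('-inf'))
alpha = F.softmax(scores, dim=-1)          # attend only over neighbors
h_updated = F.relu(alpha @ W(h))           # updated node representations

# Each sub-task (paragraph selection, supporting-fact extraction,
# answer prediction) would read off its own node slice:
para_repr = h_updated[1:3]                 # paragraph nodes
sent_repr = h_updated[3:7]                 # sentence nodes
print(para_repr.shape, sent_repr.shape)
```

The point of the hierarchy is visible in the slicing at the end: because nodes of every granularity live in one graph, each sub-task can be trained against its own slice of the same propagated representations.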
In the primate visual system, area V4 is located in the ventral pathway and is traditionally thought to be involved in processing color and form information. However, little is known about its functional role in processing motion information. Using intrinsic signal optical imaging over large fields of view in V1, V2, and V4, we mapped direction-of-motion responses in anesthetized macaques. We found that V4 contains direction-preferring domains that are preferentially activated by stimuli moving in one direction. These direction-preferring domains normally occupy several restricted regions of V4 and tend to overlap with orientation- and color-preferring domains. Single-cell recordings targeting these direction-preferring domains also revealed clustering, as well as a columnar organization, of V4 direction-selective neurons. These data suggest that, in contrast to the classical view, motion information is also processed in ventral pathway regions such as area V4.
Multimodal pre-training has propelled great advancement in vision-and-language research. These large-scale pre-trained models, although successful, suffer from slow inference speed due to the enormous computational cost of cross-modal attention in the Transformer architecture. When applied to real-life applications, such latency and computation demand severely deter the practical use of pre-trained models. In this paper, we study image-text retrieval (ITR), the most mature scenario of V+L applications, which had been widely studied even before the emergence of recent pre-trained models. We propose a simple yet highly effective approach, LightningDOT, that accelerates the inference of ITR by thousands of times without sacrificing accuracy. LightningDOT removes the time-consuming cross-modal attention by pre-training on three novel learning objectives, extracting feature indexes offline, and employing instant dot-product matching with further re-ranking, which significantly speeds up the retrieval process. In fact, LightningDOT achieves a new state of the art across multiple ITR benchmarks such as Flickr30k, COCO and Multi30K, outperforming existing pre-trained models that consume 1000× more computational hours.
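The two-stage recipe the abstract describes (offline feature indexing, instant dot-product matching, then re-ranking a short candidate list) can be sketched as follows. The encoders are stubbed with random vectors and the re-ranker is a placeholder, so this is a sketch of the retrieval pattern under stated assumptions, not LightningDOT's actual code.

```python
# A minimal sketch (assumed, not LightningDOT's code) of two-stage
# retrieval: image features are pre-computed offline, queries are matched
# by a single dot product, and only top candidates are re-ranked.
import numpy as np

rng = np.random.default_rng(0)
d, n_images = 128, 10_000

# Offline: encode every image once and store the index. In practice these
# would come from the pre-trained image encoder; random here.
image_index = rng.standard_normal((n_images, d)).astype(np.float32)
image_index /= np.linalg.norm(image_index, axis=1, keepdims=True)

def retrieve(query_vec: np.ndarray, top_k: int = 20) -> np.ndarray:
    """Stage 1: instant dot-product matching against the offline index."""
    scores = image_index @ query_vec           # one matrix-vector product
    return np.argsort(-scores)[:top_k]         # top-k candidate images

def rerank(query_vec: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """Stage 2 (stub): a slower scorer over the short list only. Here it
    rescores with the same dot product; in the real system this is where
    a full cross-attention model would run."""
    scores = image_index[candidates] @ query_vec
    return candidates[np.argsort(-scores)]

# Online: encode the text query once, then retrieve + re-rank.
query = rng.standard_normal(d).astype(np.float32)
query /= np.linalg.norm(query)
print(rerank(query, retrieve(query))[:5])
```

The speedup comes from the asymmetry between the stages: cross-modal attention scales with the number of images per query, whereas the dot-product stage is a single matrix-vector product over a precomputed index, with the expensive model confined to a fixed-size candidate list.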
Existing language model compression methods mostly use a simple L2 loss to distill knowledge in the intermediate representations of a large BERT model to a smaller one. Although widely used, this objective by design assumes that all the dimensions of hidden representations are independent, failing to capture important structural knowledge in the intermediate layers of the teacher network. To achieve better distillation efficacy, we propose Contrastive Distillation on Intermediate Representations (CoDIR), a principled knowledge distillation framework where the student is trained to distill knowledge through the intermediate layers of the teacher via a contrastive objective. By learning to distinguish a positive sample from a large set of negative samples, CoDIR facilitates the student's exploitation of the rich information in the teacher's hidden layers. CoDIR can be readily applied to compress large-scale language models in both the pre-training and fine-tuning stages, and achieves superb performance on the GLUE benchmark, outperforming state-of-the-art compression methods.
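A contrastive objective of the kind described here can be sketched InfoNCE-style: the student's intermediate representation for an input should score higher against the teacher's representation of the same input than against those of other inputs. The sketch below assumes in-batch negatives and an illustrative projection head (the abstract mentions a large set of negatives); the dimensions and temperature are placeholder values, not CoDIR's configuration.

```python
# A minimal sketch (an assumption, not CoDIR's released code) of a
# contrastive objective over intermediate representations.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, d_teacher, d_student, tau = 8, 768, 312, 0.07

# Stand-ins for intermediate-layer outputs (e.g., mean-pooled over tokens).
teacher_h = torch.randn(batch, d_teacher)
student_h = torch.randn(batch, d_student)

# Project both into a shared space before comparing (teacher and student
# hidden sizes typically differ; the projection heads are illustrative).
proj_t = torch.nn.Linear(d_teacher, 128)
proj_s = torch.nn.Linear(d_student, 128)
z_t = F.normalize(proj_t(teacher_h), dim=-1)
z_s = F.normalize(proj_s(student_h), dim=-1)

# Similarity of every student rep with every teacher rep; the diagonal
# holds the positive pairs, everything else in a row is a negative.
# (The paper uses a large negative set; in-batch negatives shown here.)
logits = (z_s @ z_t.T) / tau
labels = torch.arange(batch)               # positive index for each row
loss = F.cross_entropy(logits, labels)     # InfoNCE-style contrastive loss
print(loss.item())
```

Unlike a per-dimension L2 loss, this objective only constrains relative similarities between whole representations, which is what lets it capture structural relationships across samples rather than matching each hidden dimension independently.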
The ability to extract the shape of moving objects is fundamental to visual perception. However, where such computations are processed in the visual system is unknown. To address this question, we used intrinsic signal optical imaging in awake monkeys to examine cortical responses to perceptual contours defined by motion contrast (motion boundaries, MBs). We found that MB stimuli elicit a robust orientation response in area V2. Orientation maps derived from subtraction of orthogonal MB stimuli aligned well with the orientation maps obtained with luminance gratings (LGs). In contrast, area V1 responded well to LGs but exhibited a much weaker orientation response to MBs. We further show that V2 direction domains respond to motion contrast, a response required for the detection of MBs in V2. These results suggest that V2 represents MB information, an important prerequisite for shape recognition and figure-ground segregation.
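The map-subtraction step ("orientation maps derived from subtraction of orthogonal MB stimuli") is a pixel-wise differential computation; the toy sketch below illustrates the general idea on synthetic data and is not the study's analysis pipeline.

```python
# A toy sketch (illustrative only) of a differential orientation map:
# trial-averaged responses to one orientation minus the trial-averaged
# responses to the orthogonal orientation. All data here are synthetic.
import numpy as np

rng = np.random.default_rng(0)
h, w, n_trials = 64, 64, 20

# Fake trial stacks: cortical response images to 0-deg and 90-deg stimuli,
# with a small patch that genuinely prefers 0 deg.
resp_0 = rng.normal(0.0, 1.0, (n_trials, h, w))
resp_90 = rng.normal(0.0, 1.0, (n_trials, h, w))
resp_0[:, 20:30, 20:30] += 0.5             # simulated 0-deg-preferring domain

# Differential map: positive pixels prefer 0 deg, negative pixels prefer
# 90 deg, near-zero pixels show no orientation preference.
diff_map = resp_0.mean(axis=0) - resp_90.mean(axis=0)

# A z-score against trial-to-trial variability flags reliable domains.
pooled_sd = np.sqrt((resp_0.var(axis=0) + resp_90.var(axis=0)) / n_trials)
z_map = diff_map / pooled_sd
print("peak |z| inside simulated domain:", np.abs(z_map[20:30, 20:30]).max())
```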