Image captioning models typically follow an encoder-decoder architecture that uses abstract image feature vectors as input to the encoder. One of the most successful algorithms uses feature vectors extracted from the region proposals obtained from an object detector. In this work, we introduce the Object Relation Transformer, which builds upon this approach by explicitly incorporating information about the spatial relationships between the detected input objects through geometric attention. Quantitative and qualitative results demonstrate the importance of such geometric attention for image captioning, leading to improvements on all common captioning metrics on the MS-COCO dataset. Preprint. Under review.
We prove a version of the Cauchy-Davenport theorem for general linear maps. For subsets A, B of the finite field $\mathbb{F}_p$, the classical Cauchy-Davenport theorem gives a lower bound for the size of the sumset A + B in terms of the sizes of the sets A and B. Our theorem considers a general linear map L :
In this paper, we propose learning an embedding function for content-based image retrieval within the e-commerce domain using the triplet loss and an online sampling method that constructs triplets from within a minibatch. We compare our method to several strong baselines as well as recent works on the DeepFashion and Stanford Online Products datasets. Our approach significantly outperforms the state of the art on the DeepFashion dataset. With a modification to favor sampling minibatches from a single product category, the same approach demonstrates competitive results when compared to the state of the art on the Stanford Online Products dataset.
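To make the idea of building triplets inside a minibatch concrete, here is a minimal NumPy sketch. It assumes a "batch-hard" rule (for each anchor, the farthest same-label item and the nearest different-label item), which is one common online sampling strategy and not necessarily the one used in the paper. The function names, the Euclidean distance, and the margin value of 0.2 are illustrative assumptions.

```python
import numpy as np


def pairwise_sq_dists(emb):
    """Squared Euclidean distances between all rows of emb (shape [n, d])."""
    sq = np.sum(emb * emb, axis=1)
    d = sq[:, None] + sq[None, :] - 2.0 * emb @ emb.T
    return np.maximum(d, 0.0)  # clip tiny negatives caused by round-off


def batch_hard_triplet_loss(emb, labels, margin=0.2):
    """Mean hinge loss over anchors, using the hardest in-batch triplet.

    Assumption: batch-hard mining; margin=0.2 is an arbitrary example value.
    Anchors with no positive or no negative in the batch are skipped.
    """
    emb = np.asarray(emb, dtype=float)
    labels = np.asarray(labels)
    dist = np.sqrt(pairwise_sq_dists(emb))

    same = labels[:, None] == labels[None, :]
    pos_mask = same & ~np.eye(len(labels), dtype=bool)  # same label, not self
    neg_mask = ~same                                    # different label

    valid = pos_mask.any(axis=1) & neg_mask.any(axis=1)
    if not valid.any():
        return 0.0

    hardest_pos = np.where(pos_mask, dist, -np.inf).max(axis=1)
    hardest_neg = np.where(neg_mask, dist, np.inf).min(axis=1)
    losses = np.maximum(hardest_pos[valid] - hardest_neg[valid] + margin, 0.0)
    return float(losses.mean())


if __name__ == "__main__":
    points = [[0.0, 0.0], [0.0, 1.0], [3.0, 0.0], [3.0, 1.0]]
    # Labels that match the spatial clusters versus labels that cut across them.
    print(batch_hard_triplet_loss(points, [0, 0, 1, 1]))
    print(batch_hard_triplet_loss(points, [0, 1, 0, 1]))
```

Because the triplets are formed from whatever labels appear in the batch, the way minibatches are composed directly controls how hard the negatives are. Under this sketch, drawing a batch from a single product category would tend to yield closer, harder negatives.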
As an extension of Polya's classical result on random walks on the square grids ($\mathbb{Z}^d$), we consider a random walk where the steps, while still of unit length, point in different directions. We show that in dimensions at least 4, the return probability after $n$ steps is at most $n^{-d/2 - d/(d-2) + o(1)}$, which is sharp. The real surprise is in dimensions 2 and 3. In dimension 2, where the traditional grid walk is recurrent, our upper bound is $n^{-\omega(1)}$, which is much worse than in higher dimensions. In dimension 3, we prove an upper bound of order $n^{-4 + o(1)}$. We find a new conjecture concerning incidences between spheres and points in $\mathbb{R}^3$, which, if it holds, would improve the bound to $n^{-9/2 + o(1)}$, which is consistent with the $d \ge 4$ case. This conjecture resembles Szemerédi-Trotter type results and is of independent interest.
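The consistency of the conjectured three-dimensional bound with the higher-dimensional one can be checked by substituting $d = 3$ into the exponent stated for $d \ge 4$. This is a formal substitution only, since the theorem itself is stated for dimensions at least 4:

```latex
\[
  \left. -\frac{d}{2} - \frac{d}{d-2} \right|_{d=3}
  \;=\; -\frac{3}{2} - \frac{3}{1}
  \;=\; -\frac{9}{2}.
\]
```

The result matches the exponent in the conjectured bound $n^{-9/2 + o(1)}$.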