“…In order to help visually impaired individuals understand the meaning of GUI pages and components, researchers have attempted to use computational vision technology for GUI modeling and semantic understanding of GUI pages [16,19,28,41,44,46,51,60,70,131,136,142]. Schoop et al. [115] designed a novel system that models the perceived tappability of mobile UI elements with a vision-based deep neural network and helps provide design insights with dataset-level and instance-level explanations of model predictions.…”
Section: GUI Understanding and Intelligent
Mobile apps have become indispensable for accessing services and participating in daily life, including for low-vision users. Users with visual impairments rely on screen readers to read each screen aloud and understand which content can be operated. Screen readers read the hint-text attribute of a text input component to remind visually impaired users what to fill in. Unfortunately, based on our analysis of 4,501 Android apps with text inputs, over 76% of them are missing hint-text. These issues are mostly caused by developers' lack of awareness of visually impaired users. To overcome these challenges, we developed an LLM-based hint-text generation model called HintDroid, which analyzes the GUI information of input components and uses in-context learning to generate the hint-text. To ensure the quality of the generated hint-text, we further designed a feedback-based inspection mechanism that adjusts the hint-text when needed. Automated experiments demonstrate a high BLEU score, and a user study further confirms its usefulness. HintDroid can help not only visually impaired individuals but also sighted users understand the requirements of input components. HintDroid demo video: https://youtu.be/FWgfcctRbfI.
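The abstract describes a two-step pipeline: find input components that lack hint-text, then prompt an LLM with the surrounding GUI context using in-context learning. The sketch below is a minimal illustration of those two steps under assumed conventions; the hierarchy-dump format, the `hint` attribute, the few-shot examples, and the prompt wording are all hypothetical and not HintDroid's actual implementation.

```python
# Illustrative sketch (not the authors' code): flag EditText nodes without a hint
# in a GUI hierarchy dump, then build an in-context-learning prompt for hint generation.
import xml.etree.ElementTree as ET

# Hypothetical retrieved examples: (screen, component description, hint-text).
FEW_SHOT_EXAMPLES = [
    ("LoginActivity", "EditText below the label 'Email'", "Enter your email address"),
    ("SearchActivity", "EditText in the toolbar next to a magnifier icon", "Search for products"),
]

def find_missing_hints(ui_dump_xml: str):
    """Return text-input nodes that lack hint-text, assuming the dump exposes a 'hint' attribute."""
    root = ET.fromstring(ui_dump_xml)
    missing = []
    for node in root.iter("node"):
        if "EditText" in node.get("class", "") and not node.get("hint", "").strip():
            missing.append(node)
    return missing

def build_prompt(screen_name: str, component_desc: str) -> str:
    """Compose an in-context-learning prompt from the few-shot examples plus the target input."""
    lines = ["Generate a concise hint-text for the described text input component."]
    for screen, desc, hint in FEW_SHOT_EXAMPLES:
        lines.append(f"Screen: {screen}\nComponent: {desc}\nHint: {hint}")
    lines.append(f"Screen: {screen_name}\nComponent: {component_desc}\nHint:")
    return "\n\n".join(lines)
```

The generated hint would then go through the feedback-based inspection step described in the abstract before being surfaced to the screen reader.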
“…Web Navigation and Question-Answering. The web navigation task (Toyama et al. 2021; Yao et al. 2022; Burns et al. 2022) involves developing algorithms or models that enable automated agents to navigate and interact with websites on the Internet. There are some related datasets (Liu et al. 2018; Xu et al. 2021; Mazumder and Riva 2020; Yao et al. 2022; Deng et al. 2023; …).…”
Section: Related Work
“…Specifically, for Embodied AI datasets, we consider R2R (Anderson et al. 2018), REVERIE (Qi et al. 2020b) and EQA (Das et al. 2018), where the first two are widely used vision-and-language navigation (VLN) datasets while the last one is a well-known embodied question answering dataset. Regarding app-based datasets, we compare PixelHelp (Li et al. 2020), MoTIF (Burns et al. 2022) and META-GUI (Sun et al. 2022). As for websites, we consider seven datasets for a comprehensive comparison, including MiniWoB++ (Liu et al. 2018), RUSS (Xu et al. 2021), FLIN (Mazumder and Riva 2020), WebShop (Yao et al. 2022), MIND2WEB (Deng et al. 2023), WebQA (Chang et al. 2022), and ScreenQA (Hsiao et al. 2022).…”
Section: WebVLN-v1 Dataset Analysis: WebVLN-v1 Dataset vs. Related Data...
Vision-and-Language Navigation (VLN) aims to enable AI agents to accurately understand and follow natural language instructions to navigate through real-world environments, ultimately reaching specific target locations. We recognise a promising opportunity to extend VLN to a comparable navigation task that holds substantial significance in our daily lives, albeit within the virtual realm: navigating websites on the Internet. This paper proposes a new task named Vision-and-Language Navigation on Websites (WebVLN), where we use question-based instructions to train an agent, emulating how users naturally browse websites. Unlike the existing VLN task that only pays attention to vision and instruction (language), the WebVLN agent further considers underlying web-specific content such as HTML, which cannot be seen on the rendered web pages yet contains rich visual and textual information. Toward this goal, we contribute a dataset, WebVLN-v1, and introduce a novel approach called Website-aware VLN Network (WebVLN-Net), which is built upon the foundation of state-of-the-art VLN techniques. Experimental results show that WebVLN-Net outperforms current VLN and web-related navigation methods. We believe that the introduction of the new WebVLN task and its dataset will establish a new dimension within the VLN domain and contribute to the broader vision-and-language research community. Code is available at: https://github.com/WebVLN/WebVLN.
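To make the task setup concrete, the sketch below shows one way a WebVLN-style observation and action could be represented: the agent sees the rendered page, the underlying HTML, and a question-based instruction, and either follows a link or answers. The data structures, field names, and the keyword-overlap policy are assumptions for illustration only; they are a toy baseline, not the WebVLN-Net architecture.

```python
# Minimal interface sketch (my interpretation, not the paper's model): a WebVLN-style
# step pairs the rendered page with its HTML and picks an action given the question.
from dataclasses import dataclass
from typing import List

@dataclass
class WebObservation:
    screenshot_path: str        # rendered page image
    html: str                   # underlying web-specific content, partly invisible when rendered
    candidate_links: List[str]  # URLs the agent may follow next

@dataclass
class WebAction:
    kind: str   # "navigate" to a link, or "answer" the question and stop
    value: str  # target URL or answer text

def greedy_keyword_policy(obs: WebObservation, question: str) -> WebAction:
    """Toy baseline: follow the candidate link sharing the most words with the question;
    stop and answer once no candidate overlaps at all."""
    q_words = set(question.lower().split())
    scored = [(len(q_words & set(link.lower().replace("/", " ").split())), link)
              for link in obs.candidate_links]
    best_score, best_link = max(scored, default=(0, ""))
    if best_score == 0:
        return WebAction(kind="answer", value="<extract answer from obs.html here>")
    return WebAction(kind="navigate", value=best_link)
```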
“…MMC4 has not yet been used for pretraining or downstream applications. In mobile apps, the closest domain to webpages, there are two open-source datasets that contain all modalities (text, image, and structure): Rico (Deka et al., 2017) and MoTIF (Burns et al., 2022).…”
Webpages have been a rich, scalable resource for vision-language and language-only tasks. Yet only pieces of webpages are kept in existing datasets: image-caption pairs, long text articles, or raw HTML, never all in one place. As a result, webpage tasks have received little attention and structured image-text data has been left underused. To study multimodal webpage understanding, we introduce the Wikipedia Webpage suite (WikiWeb2M) containing 2M pages with all of the associated image, text, and structure data. We verify its utility on three generative tasks: page description generation, section summarization, and contextual image captioning. We design a novel attention mechanism, Prefix Global, which selects the most relevant image and text content as global tokens that attend to the rest of the webpage for context. By using page structure to select such tokens, it performs better than full attention with lower computational complexity. Extensive experiments show that the new data in WikiWeb2M improves task performance compared to prior work.
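As described above, Prefix Global lets a small set of selected tokens attend globally while the remaining tokens attend only locally. The sketch below builds such an attention mask; it is an illustrative reconstruction of that idea under assumed parameters (window size, which indices are global), not the paper's implementation.

```python
# Sketch of a prefix-global attention mask: selected tokens are global, the rest
# attend only within a local window. Illustrative reconstruction, not the paper's code.
import numpy as np

def prefix_global_mask(seq_len: int, global_idx, window: int = 4) -> np.ndarray:
    """Boolean [seq_len, seq_len] mask; True means position i may attend to position j."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    # Local band: every token sees its neighbours within the window.
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True
    # Global tokens (e.g. the selected page/section text and image tokens)
    # attend to all positions and are visible from all positions.
    g = np.array(sorted(global_idx))
    mask[g, :] = True
    mask[:, g] = True
    return mask

# Example: tokens 0-7 are the selected "prefix global" content in a 64-token sequence.
attn_mask = prefix_global_mask(64, global_idx=range(8), window=4)
```

Because only the prefix tokens attend everywhere, the cost grows roughly with the number of global tokens times the sequence length plus the local band, rather than quadratically as in full attention.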