“…In order to help visually impaired individuals understand the meaning of GUI pages and components, researchers have attempted to use computational vision technology for GUI modeling and semantic understanding of GUI pages [16,19,28,41,44,46,51,60,70,131,136,142]. Schoop et al. [115] designed a novel system that models the perceived tappability of mobile UI elements with a vision-based deep neural network and helps provide design insights with dataset-level and instance-level explanations of model predictions.…”
Section: GUI Understanding and Intelligent
Mobile apps have become indispensable for accessing services and participating in daily life, including for low-vision users. Users with visual impairments rely on screen readers to read each screen aloud and understand which content can be operated. Screen readers read the hint-text attribute of a text input component to remind visually impaired users what to fill in. Unfortunately, based on our analysis of 4,501 Android apps with text inputs, over 76% of them are missing hint-text. These issues are mostly caused by developers' lack of awareness of visually impaired users. To overcome these challenges, we developed an LLM-based hint-text generation model called HintDroid, which analyzes the GUI information of input components and uses in-context learning to generate the hint-text. To ensure the quality of the generated hint-text, we further designed a feedback-based inspection mechanism that adjusts the hint-text when needed. Automated experiments demonstrate a high BLEU score, and a user study further confirms its usefulness. HintDroid can help not only visually impaired individuals but also sighted users understand the requirements of input components. HintDroid demo video: https://youtu.be/FWgfcctRbfI.
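The abstract describes a two-step pipeline: find input components that lack hint-text, then prompt an LLM with the surrounding GUI context using in-context learning. The sketch below is a minimal illustration of those two steps under assumed conventions; the hierarchy-dump format, the `hint` attribute, the few-shot examples, and the prompt wording are all hypothetical and not HintDroid's actual implementation.

```python
# Illustrative sketch (not the authors' code): flag EditText nodes without a hint
# in a GUI hierarchy dump, then build an in-context-learning prompt for hint generation.
import xml.etree.ElementTree as ET

# Hypothetical retrieved examples: (screen, component description, hint-text).
FEW_SHOT_EXAMPLES = [
    ("LoginActivity", "EditText below the label 'Email'", "Enter your email address"),
    ("SearchActivity", "EditText in the toolbar next to a magnifier icon", "Search for products"),
]

def find_missing_hints(ui_dump_xml: str):
    """Return text-input nodes that lack hint-text, assuming the dump exposes a 'hint' attribute."""
    root = ET.fromstring(ui_dump_xml)
    missing = []
    for node in root.iter("node"):
        if "EditText" in node.get("class", "") and not node.get("hint", "").strip():
            missing.append(node)
    return missing

def build_prompt(screen_name: str, component_desc: str) -> str:
    """Compose an in-context-learning prompt from the few-shot examples plus the target input."""
    lines = ["Generate a concise hint-text for the described text input component."]
    for screen, desc, hint in FEW_SHOT_EXAMPLES:
        lines.append(f"Screen: {screen}\nComponent: {desc}\nHint: {hint}")
    lines.append(f"Screen: {screen_name}\nComponent: {component_desc}\nHint:")
    return "\n\n".join(lines)
```

The generated hint would then go through the feedback-based inspection step described in the abstract before being surfaced to the screen reader.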
“…Web Navigation and Question-Answering. The web navigation task (Toyama et al. 2021; Yao et al. 2022; Burns et al. 2022) involves developing algorithms or models that enable automated agents to navigate and interact with websites on the Internet. There are some related datasets (Liu et al. 2018; Xu et al. 2021; Mazumder and Riva 2020; Yao et al. 2022; Deng et al. 2023; …).…”
Section: Related Work
“…Specifically, for Embodied AI datasets, we consider R2R (Anderson et al. 2018), REVERIE (Qi et al. 2020b) and EQA (Das et al. 2018), where the first two are widely used vision-and-language navigation (VLN) datasets while the last one is a well-known embodied question answering dataset. Regarding app-based datasets, we compare PixelHelp (Li et al. 2020), MoTIF (Burns et al. 2022) and META-GUI (Sun et al. 2022). As for websites, we consider seven datasets for a comprehensive comparison, including MiniWoB++ (Liu et al. 2018), RUSS (Xu et al. 2021), FLIN (Mazumder and Riva 2020), WebShop (Yao et al. 2022), MIND2WEB (Deng et al. 2023), WebQA (Chang et al. 2022), and ScreenQA (Hsiao et al. 2022).…”
Section: WebVLN-v1 Dataset Analysis: WebVLN-v1 Dataset vs. Related Data...
Vision-and-Language Navigation (VLN) aims to enable AI agents to accurately understand and follow natural language instructions to navigate through real-world environments, ultimately reaching specific target locations. We recognise a promising opportunity to extend VLN to a comparable navigation task that holds substantial significance in our daily lives, albeit within the virtual realm: navigating websites on the Internet. This paper proposes a new task named Vision-and-Language Navigation on Websites (WebVLN), where we use question-based instructions to train an agent, emulating how users naturally browse websites. Unlike the existing VLN task that only pays attention to vision and instruction (language), the WebVLN agent further considers underlying web-specific content such as HTML, which cannot be seen on the rendered web pages yet contains rich visual and textual information. Toward this goal, we contribute a dataset, WebVLN-v1, and introduce a novel approach called Website-aware VLN Network (WebVLN-Net), which is built upon the foundation of state-of-the-art VLN techniques. Experimental results show that WebVLN-Net outperforms current VLN and web-related navigation methods. We believe that the introduction of the new WebVLN task and its dataset will establish a new dimension within the VLN domain and contribute to the broader vision-and-language research community. Code is available at: https://github.com/WebVLN/WebVLN.
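To make the task setup concrete, the sketch below shows one way a WebVLN-style observation and action could be represented: the agent sees the rendered page, the underlying HTML, and a question-based instruction, and either follows a link or answers. The data structures, field names, and the keyword-overlap policy are assumptions for illustration only; they are a toy baseline, not the WebVLN-Net architecture.

```python
# Minimal interface sketch (my interpretation, not the paper's model): a WebVLN-style
# step pairs the rendered page with its HTML and picks an action given the question.
from dataclasses import dataclass
from typing import List

@dataclass
class WebObservation:
    screenshot_path: str        # rendered page image
    html: str                   # underlying web-specific content, partly invisible when rendered
    candidate_links: List[str]  # URLs the agent may follow next

@dataclass
class WebAction:
    kind: str   # "navigate" to a link, or "answer" the question and stop
    value: str  # target URL or answer text

def greedy_keyword_policy(obs: WebObservation, question: str) -> WebAction:
    """Toy baseline: follow the candidate link sharing the most words with the question;
    stop and answer once no candidate overlaps at all."""
    q_words = set(question.lower().split())
    scored = [(len(q_words & set(link.lower().replace("/", " ").split())), link)
              for link in obs.candidate_links]
    best_score, best_link = max(scored, default=(0, ""))
    if best_score == 0:
        return WebAction(kind="answer", value="<extract answer from obs.html here>")
    return WebAction(kind="navigate", value=best_link)
```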
“…MMC4 has not yet been used for pretraining or downstream applications. In mobile apps, the closest domain to webpages, there are two open-source datasets that contain all modalities (text, image, and structure): Rico (Deka et al., 2017) and MoTIF (Burns et al., 2022).…”
Webpages have been a rich, scalable resource for vision-language and language-only tasks. Yet only pieces of webpages are kept in existing datasets: image-caption pairs, long text articles, or raw HTML, never all in one place. As a result, webpage tasks have received little attention and structured image-text data has been left underused. To study multimodal webpage understanding, we introduce the Wikipedia Webpage suite (WikiWeb2M) containing 2M pages with all of the associated image, text, and structure data. We verify its utility on three generative tasks: page description generation, section summarization, and contextual image captioning. We design a novel attention mechanism, Prefix Global, which selects the most relevant image and text content as global tokens that attend to the rest of the webpage for context. By using page structure to select such tokens, it performs better than full attention with lower computational complexity. Extensive experiments show that the new data in WikiWeb2M improves task performance compared to prior work.
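As described above, Prefix Global lets a small set of selected tokens attend globally while the remaining tokens attend only locally. The sketch below builds such an attention mask; it is an illustrative reconstruction of that idea under assumed parameters (window size, which indices are global), not the paper's implementation.

```python
# Sketch of a prefix-global attention mask: selected tokens are global, the rest
# attend only within a local window. Illustrative reconstruction, not the paper's code.
import numpy as np

def prefix_global_mask(seq_len: int, global_idx, window: int = 4) -> np.ndarray:
    """Boolean [seq_len, seq_len] mask; True means position i may attend to position j."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    # Local band: every token sees its neighbours within the window.
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        mask[i, lo:hi] = True
    # Global tokens (e.g. the selected page/section text and image tokens)
    # attend to all positions and are visible from all positions.
    g = np.array(sorted(global_idx))
    mask[g, :] = True
    mask[:, g] = True
    return mask

# Example: tokens 0-7 are the selected "prefix global" content in a 64-token sequence.
attn_mask = prefix_global_mask(64, global_idx=range(8), window=4)
```

Because only the prefix tokens attend everywhere, the cost grows roughly with the number of global tokens times the sequence length plus the local band, rather than quadratically as in full attention.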