Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d18-1540
Mapping natural language commands to web elements

Abstract: The web provides a rich, open-domain environment with textual, structural, and spatial properties. We propose a new task for grounding language in this environment: given a natural language command (e.g., "click on the second article"), choose the correct element on the web page (e.g., a hyperlink or text box). We collected a dataset of over 50,000 commands that capture various phenomena such as functional references (e.g. "find who made this site"), relational reasoning (e.g. "article by john"), and visual re…
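The grounding task described in the abstract can be illustrated with a trivial lexical-overlap baseline. This sketch is not the paper's model: the element data, the `score` function, and the dictionary fields are illustrative assumptions, chosen only to show the input/output shape of the task (command in, page element out).

```python
# Illustrative baseline (NOT the paper's model): ground a natural
# language command to a page element by scoring token overlap between
# the command and each element's visible text.

def score(command, element_text):
    """Fraction of command tokens that also appear in the element's text."""
    cmd_tokens = set(command.lower().split())
    elem_tokens = set(element_text.lower().split())
    return len(cmd_tokens & elem_tokens) / max(len(cmd_tokens), 1)

def ground(command, elements):
    """Return the candidate element whose text best matches the command."""
    return max(elements, key=lambda e: score(command, e["text"]))

# Hypothetical flattened page: each element has a tag and its text.
elements = [
    {"tag": "a", "text": "Read the second article"},
    {"tag": "input", "text": "Search box"},
    {"tag": "a", "text": "About this site"},
]

best = ground("click on the second article", elements)
print(best["tag"], best["text"])  # the hyperlink wins on token overlap
```

A real system, as the abstract notes, must also exploit structural and spatial properties of the page (and functional or relational references), which pure text matching cannot capture.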

Cited by 28 publications (17 citation statements)
References 34 publications (32 reference statements)
“…Sikuli uses screenshots to refer to the GUI elements for automation [34]. Neural networks have been proposed to map high-level verbal descriptions of web elements (the text of the element, its graphical attributes, and its relative position to other elements on the page) to specific graphical elements [23,28]. Recently, we showed it is more accurate to use a neural network to first translate the natural-language description to a formal semantic representation, which is then used algorithmically to identify the element of interest in the target web page [33].…”
Section: PbD for Automation (mentioning)
confidence: 99%
“…Although the implementation of generating app GUI screenshot confirmations used in SOVITE, as described above, only applies to programming-by-demonstration instructable agents such as SUGILITE [35], PLOW [1], and VASTA [58], there are other feasible approaches for generating app GUI screenshot confirmations in other types of agents. For example, recent advances in machine learning have been shown to support directly matching natural language commands to specific GUI elements [52] and generating semantic labels for GUI elements from screenshots [13]. For agents that use web API calls to fulfill the task intents, it is also feasible to compare the agent API calls to the API calls made by apps by analyzing the code of the apps (e.g., CHABADA [20]), or to the network traffic collected from the apps (e.g., MobiPurpose [28]).…”
Section: Generating the App GUI Screenshot Confirmations (mentioning)
confidence: 99%
“…Some works have made early progress in this domain (Liu et al., 2018b; Deka et al., 2016), thanks to the availability of large datasets of GUIs like RICO (Deka et al., 2017). Recent reinforcement learning-based approaches and semantic parsing techniques have also shown promising results in learning models for navigating through GUIs for user-specified task objectives (Liu et al., 2018a; Pasupat et al., 2018). For ITL, an interesting future challenge is to combine these user-independent, domain-agnostic machine-learned models with the user's personalized instructions for a specific task.…”
Section: Extracting Task Semantics from GUIs (mentioning)
confidence: 99%