Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering 2022
DOI: 10.1145/3551349.3559555
Few-shot training LLMs for project-specific code-summarization

Cited by 53 publications (14 citation statements)
References 11 publications
“…That said, in the mutant-test predictions, both precision and recall drop significantly for both approaches; this suggests that training data containing project-specific vocabulary and methods contributes substantially to same-project performance. This is consistent with other results showing that projects have distinct vocabulary and style, making cross-project prediction difficult for many tasks [3,13]. Precision remains considerably higher than recall in the cross-project setting for both models.…”
Section: RQ1: Same-Project Performance (supporting)
confidence: 91%
“…However, cloud providers increasingly provide GPU access; recently, GitHub Actions announced plans to do the same for CI.³ Indeed, GPUs are becoming more broadly accessible, including via idle GPU time or services like Google Colab. Future testing approaches, and ML-for-SE applications generally, are thus increasingly realistic to deploy in practice.…”
Section: Limitations and Threats (mentioning)
confidence: 99%
“…Several studies have leveraged such models for various SE tasks, such as code generation [26,37,45], code repair [55,58,74], and code summarization [2,36,48]. In addition, such models (specifically, ChatGPT) combine conversational capabilities with code-related tasks, allowing programmers to interact with the model.…”
Section: Related Work (mentioning)
confidence: 99%
“…GitHub Copilot uses GPT-3 for automated code generation from natural language inputs [8]. Several researchers have addressed code generation [8], [36], docstring generation [8], [60], and code repair [61], [62] problems. Bareiß et al. [63] show how few-shot learning can be effective at (i) code mutation; (ii) test oracle generation from natural language documentation; and (iii) test case generation tasks.…”
Section: B. LLMs in Software Engineering (mentioning)
confidence: 99%
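The few-shot prompting referenced in these citation statements can be illustrated with a minimal prompt-assembly sketch. This is not the paper's implementation: the demonstration pairs, function names, and prompt layout below are hypothetical, and a real pipeline would send the resulting prompt to an LLM rather than just build the string.

```python
# Minimal sketch of few-shot prompt construction for code summarization.
# All example code/summary pairs are hypothetical placeholders.

def build_few_shot_prompt(examples, target_code):
    """Concatenate (code, summary) demonstration pairs from the same
    project, then append the target code with an open 'Summary:' slot
    for the model to complete."""
    parts = []
    for code, summary in examples:
        parts.append(f"Code:\n{code}\nSummary: {summary}\n")
    parts.append(f"Code:\n{target_code}\nSummary:")
    return "\n".join(parts)

# Hypothetical project-specific demonstrations (project-specific pairs
# are the point: they expose the project's vocabulary to the model).
examples = [
    ("def is_admin(user):\n    return user.role == 'admin'",
     "Checks whether the user has the admin role."),
    ("def close_session(s):\n    s.flush()\n    s.close()",
     "Flushes and closes the given session."),
]

prompt = build_few_shot_prompt(
    examples, "def get_name(u):\n    return u.name")
print(prompt)
```

Using same-project demonstrations rather than cross-project ones mirrors the finding quoted above that project-specific vocabulary contributes substantially to same-project performance.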