
Large Language Models for Software Engineering: A Systematic Literature Review

Large Language Models (LLMs) have significantly impacted numerous domains, including Software Engineering (SE). Many recent publications have explored LLMs applied to various SE tasks. Nevertheless, a comprehensive understanding of the application, effects, and possible limitations of LLMs on SE is still in its early stages. To bridge this gap, we conducted a systematic literature review (SLR) on LLM4SE, with a particular focus on understanding how LLMs can be exploited to optimize processes and outcomes. We select and analyze 395 research papers from January 2017 to January 2024 to answer four key research questions (RQs). In RQ1, we categorize different LLMs that have been employed in SE tasks, characterizing their distinctive features and uses. In RQ2, we analyze the methods used in data collection, preprocessing, and application, highlighting the role of well-curated datasets for successful LLM4SE implementations. RQ3 investigates the strategies employed to optimize and evaluate the performance of LLMs in SE. Finally, RQ4 examines the specific SE tasks where LLMs have shown success to date, illustrating their practical contributions to the field. From the answers to these RQs, we discuss the current state of the art and trends, identify gaps in existing research, and flag promising areas for future study. Our artifacts are publicly available at https://github.com/xinyi-hou/LLM4SE_SLR.

1. Introduction

In the field of language processing, traditional Language Models (LMs) have been foundational elements, establishing a basis for text generation and understanding (Moore and Lewis, 2010). Increased computational power, advanced machine learning techniques, and access to very large-scale data have driven the emergence of Large Language Models (LLMs) (Zan et al., 2023b; Zhao et al., 2023d). Equipped with expansive and diverse training data, these models have demonstrated an impressive ability to simulate human linguistic capabilities, leading to a sea change across multiple domains. With their capacity to learn from massive corpora and generate plausible text, LLMs are blurring the line between human and machine-produced language. They have provided researchers and engineers alike with a powerful tool to explore the complexity and richness of human communication, consequently sparking a transformational period in the field of language processing and beyond.

Software Engineering (SE) – a discipline focused on the development, implementation, and maintenance of software systems – is one of the areas reaping the benefits of the LLM revolution (Ma et al., 2023a). The utilization of LLMs in SE primarily stems from the insight that numerous SE challenges can be effectively reframed as data, code, or text analysis tasks (Wang et al., 2022a). Using LLMs to address these SE tasks has shown a wealth of potential breakthroughs (Xia and Zhang, 2023b; Tian et al., 2023b; Xia and Zhang, 2023a; Lajkó et al., 2022; Charalambous et al., 2023; Sobania et al., 2023; Cao et al., 2023; Zhang et al., 2020a). The applicability of LLMs is particularly pronounced in tasks such as code summarization (Wan et al., 2018), which involves producing an abstract natural-language description of a code snippet's functionality, as well as the generation of well-structured code (Yin and Neubig, 2017) and code artifacts like annotations (Liang and Zhu, 2018). Codex, an LLM with 12 billion parameters, has demonstrated the ability to solve 72.31% of complex Python programming challenges posed by humans (Chen et al., 2021b). GPT-4 (OpenAI, 2023b), an LLM from OpenAI, has demonstrated strong performance in several SE tasks, encompassing code writing, understanding, execution, and reasoning. It not only handles real-world applications and diverse coding challenges but also shows the ability to explain results in natural language and generate code from pseudocode (Bubeck et al., 2023).

Simultaneously, researchers have carried out a range of LLM-related research activities, producing a number of literature reviews and survey papers (Fan et al., 2023c; Chang et al., 2023; Yang et al., 2023b). Table 1 summarises some of these. However, these related studies have limitations. They either focus narrowly on a single SE scope, such as the application of LLMs in software testing (Wang et al., 2023c) and natural-language-to-code (NL2Code) tasks (Zan et al., 2023b), or they are primarily centered on Machine Learning (ML) or Deep Learning (DL) models (Wang et al., 2022a; Watson et al., 2022; Yang et al., 2022b), overlooking more advanced and recently emerged LLM applications, such as ChatGPT (OpenAI, 2022a), which are increasingly finding applications within the SE field (Tian et al., 2023b; White et al., 2023b; Lubowitz, 2023; Sridhara et al., 2023). Alternatively, they merely offer a preliminary exploration of the performance of LLMs in various SE tasks through empirical experiments (Ma et al., 2023a; Sridhara et al., 2023; Xu et al., 2022; Dou et al., 2023; Yuan et al., 2023a), or analyze existing partially relevant studies to reveal the challenges in this field (Fan et al., 2023b) without conducting a systematic literature survey. Furthermore, some works have investigated the application of Code LLMs in SE (Zhang et al., 2023b; Zheng et al., 2023a), yet have not fully considered general LLMs like ChatGPT and LLaMA (Touvron et al., 2023a), which are also widely applied to various SE tasks (Huang et al., 2023a; Shapkin et al., 2023; Pan et al., 2023a; Yan et al., 2023a). The integration of LLMs within SE is undoubtedly a complex endeavor, requiring key considerations including the choice of the right model, comprehension of the unique features of different LLMs, devising pre-training and fine-tuning strategies, handling of data, evaluation of outcomes, and surmounting implementation challenges (Zan et al., 2023b). Despite the burgeoning interest and ongoing explorations in the field, a detailed and systematic review of LLMs' application in SE has been notably absent from the current literature. This gap signifies a need for understanding the relationship between LLMs and SE. In response, our research aims to bridge this gap, providing valuable insights to the community.

Reference | Year | Scope of models | Scope of SE tasks | SLR | Time frame | # Collected Papers
Zhang et al. | 2023 | Code LLM | Automated program repair | | 2017-2023 | 185
Zheng et al. | 2023 | Code LLM | General SE scope | | 2021-2023 | 149
Fan et al. | 2023 | LLM | General SE scope | | - | Not specified
Zan et al. | 2023 | LLM (12M+) | NL2Code | | 2020-2023 | Not specified
Wang et al. | 2023 | LLM (117M+) | Software testing | | 2019-2023 | 52
Wang et al. | 2022 | ML, DL | General SE scope | | 2009-2020 | 1,209 (ML) + 358 (DL)
Yang et al. | 2022 | DL | General SE scope | | 2015-2020 | 250
Watson et al. | 2022 | DL | General SE scope | | 2009-2019 | 128
Our work | 2024 | LLM | General SE scope | | 2017-2024 | 395

“M” means million and “B” means billion. The numbers in parentheses indicate the parameter sizes of LLMs.

SLR stands for Systematic Literature Review. This column denotes whether the paper follows an SLR process.

ML and DL refer to Machine Learning and Deep Learning, respectively.

In this paper, we conduct an SLR on the utilization of LLMs in SE (LLM4SE). By mapping the current state-of-the-art, pinpointing the key strengths, weaknesses, and gaps in the existing LLM4SE literature, and proposing potential avenues for future research, our review aims to provide researchers and practitioners with a thorough guide to the convergence of LLMs and SE. We anticipate that our findings will be instrumental in guiding future inquiries and advancements in this rapidly evolving field. This work makes the following key contributions:

We are the first to present a comprehensive SLR on 395 papers published between January 2017 and January 2024 that focus on the use of LLM-based solutions to address SE challenges. We conducted a detailed analysis of the selected papers based on publication trends, distribution of publication venues, etc.

We have classified the LLMs utilized for the reported SE tasks and have provided a summary of the usage and trends of different LLM categories within the SE domain.

We describe the reported data processing stages, encompassing data collection, categorization, preprocessing, and representation.

We discuss optimizers used for LLM4SE tasks, including tuning techniques, prevalent prompt engineering techniques, and commonly employed evaluation metrics.

We describe the key applications of LLM4SE encompassing a diverse range of 85 specific SE tasks, grouped into six core SE activities – requirements engineering, software design, software development, software quality assurance, software maintenance, and software management.

We have summarised key challenges that using LLMs encounters within the SE field and have suggested several potential research directions for LLM4SE.

Section 2 presents our research questions (RQs) and elaborates on our SLR methodology. The succeeding Sections 3 to 6 are devoted to answering each of these RQs individually. Section 7 discusses the potential threats to the validity of our study. Section 8 discusses the challenges yet to be overcome when employing LLMs to solve SE tasks and highlights promising opportunities and directions for future research. Section 9 concludes the paper.

2. Approach

This SLR follows the methodology proposed by Kitchenham et al. (Kitchenham et al., 2007, 2022), which has been used in most other SE-related SLRs (Li et al., 2017; Wang et al., 2022a; Ramirez et al., 2018; Liu et al., 2022b). Following the guidelines provided by Kitchenham et al., our methodology included three main steps: planning the review (Sections 2.1 and 2.2), conducting the review (Sections 2.3 and 2.4), and analyzing the basic review results (Section 2.5).

2.1. Research Questions

To provide a comprehensive overview of the LLM4SE field, it is important to fully comprehend how these models are currently being applied in SE, the challenges they face, and their potential future research directions. We therefore provide an SLR of the application of LLMs to software engineering, aiming to answer the following research questions:

RQ1: What LLMs have been employed to date to solve SE tasks? RQ1 is designed to map out the landscape of LLMs applied in the field of SE. It seeks to identify and categorize the various LLM architectures—such as decoder-only, encoder-decoder, and encoder-only models—that have been leveraged to address diverse SE challenges. This RQ aims to provide a comprehensive overview of how these models are being utilized and the implications of their usage in this field.

RQ2: How are SE-related datasets collected, preprocessed, and used in LLMs? RQ2 delves into the methodologies behind the assembly, refinement, and application of datasets in the realm of LLMs for SE tasks. It aims to uncover the strategies for dataset collection, the criteria for dataset selection, and the preprocessing steps essential for making the data conducive for LLM training and application. Additionally, this question seeks to explore the types of data that are most prevalent in SE-related LLM research and how these data types influence the modeling and outcomes.

RQ3: What techniques are used to optimize and evaluate LLM4SE? RQ3 aims to explore the use of different optimization and evaluation techniques specific to LLMs in the context of SE. This includes an investigation into Parameter Efficient Fine-Tuning (PEFT) methods and various prompting techniques that are tailored to enhance LLM performance on SE tasks. Furthermore, this RQ aims to assess the range of evaluation metrics and methodologies employed to gauge the effectiveness and impact of LLMs in SE, providing insights into how these models are fine-tuned and assessed for their utility and efficiency.

RQ4: What SE tasks have been effectively addressed to date using LLM4SE? This RQ aims to identify the SE tasks that have been successfully tackled using LLMs, offering a detailed view of the application spectrum of LLMs in SE. It seeks to identify the specific tasks within SE, such as code generation and program repair, where LLMs have shown significant utility, and to explore the nature and scope of these applications.

2.2. Search Strategy

[Figure 1: Overview of the paper search and study selection process.]

As shown in Fig. 1, we employed the “Quasi-Gold Standard” (QGS) (Zhang et al., 2011) approach for the paper search. We conducted a manual search to identify a set of relevant studies and extracted a search string from them. This search string was then used to perform an automated search, and subsequently, a snowballing search was employed to further supplement the search results. This approach balances search efficiency with maximum coverage, minimizing the risk of omission. Subsequently, we employed a series of relatively strict filtering steps to obtain the most relevant studies. Specifically, we followed five steps to determine the relevance of the studies:

(1) Select publication venues for the manual search and digital databases for the automated search, ensuring coverage of all the selected venues.

(2) Establish the QGS: screen all manually searched papers and filter them by the inclusion/exclusion criteria (defined in Table 3).

(3) Subjectively define the search string based on domain knowledge.

(4) Conduct an automated search using the search string defined in Step (3).

(5) Conduct a snowballing search after performing study selection on the results of the manual and automated searches.

2.2.1. Search Items

During the manual search, we selected six of the top SE conferences and journals (i.e., ICSE, ESEC/FSE, ASE, ISSTA, TOSEM, and TSE, as shown in Table 2) and searched for papers that applied LLMs to SE tasks. We systematically crawled a list comprising 4,618 published papers from these top venues. Following automated scanning via scripts, we manually verified and identified 51 papers that were relevant to our research objectives. These 51 relevant papers formed the basis for constructing the Quasi-Gold Standard (QGS). Our search string combines two sets of keywords: one pertaining to SE tasks and the other related to LLMs. A paper that contains keywords from both sets is much more likely to be relevant to our study. The complete set of search keywords is as follows:

Keywords related to SE tasks: Software Engineering, Software Development, Program* (the * symbol serves as a wildcard, representing any character sequence; for example, “Program*” matches “Program”, “Programming”, “Programmer”, and so on), Software Testing, Software Mainten*, SE, Software Lifecycle, Software Design*, Code representation, Code generation, Code comment generation, Code search, Code localization, Code completion, Code summarization, Method name generation, Bug detection, Bug localization, Vulnerability detection, Testing techniques, Test case generation, Program analysis, Bug classification, Defect prediction, Program repair, Code clone detection, Bug report, Software quality evaluation, SATD detection, Code smell detection, Compiled-related, Code review, Software classification, Code classification, Code change, Incident detection, Requirement extraction, Requirement traceability, Requirement validation, Effort cost prediction, Mining GitHub/Github mining, Mining SO (Stack Overflow)/SO mining, Mining app/App mining, Mining tag/Tag mining, Developer-based mining

Keywords related to LLMs: LLM, Large Language Model*, Language Model*, LM, PLM, Pre-trained, Pre-training, Natural Language Processing, NLP, Machine Learning, ML, Deep Learning, DL, Artificial Intelligence, AI, Transformer, BERT, Codex, GPT, T5, Sequence Model*, Attention Model*, Transfer Learning, Neural Network*, ChatGPT, GPT-*

It is important to note that our list of LLM-related keywords includes terms such as Machine Learning and Deep Learning that are not necessarily specific to LLMs. We included them to avoid omitting relevant papers as much as possible, deliberately broadening the scope of the automated search.
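To illustrate how such a search string can be assembled, the sketch below joins each keyword group with OR and requires both groups with AND. The keyword lists are abbreviated here, and the exact wildcard and quoting syntax depends on each digital library, so this is an illustration rather than the exact query we submitted.

```python
# Sketch: combine the two keyword groups into a boolean search string of the
# form "(SE term 1 OR SE term 2 OR ...) AND (LLM term 1 OR LLM term 2 OR ...)".
# Keyword lists are truncated for illustration.
se_keywords = [
    "Software Engineering", "Software Development", "Program*",
    "Code generation", "Program repair", "Vulnerability detection",
]
llm_keywords = [
    "LLM", "Large Language Model*", "Pre-trained", "Transformer",
    "BERT", "Codex", "GPT", "ChatGPT",
]

def build_search_string(group_a, group_b):
    """Join each group with OR, then require both groups with AND."""
    clause_a = " OR ".join(f'"{kw}"' for kw in group_a)
    clause_b = " OR ".join(f'"{kw}"' for kw in group_b)
    return f"({clause_a}) AND ({clause_b})"

print(build_search_string(se_keywords, llm_keywords))
```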

Acronym | Venue
ASE | International Conference on Automated Software Engineering
ESEC/FSE | Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering
ICSE | International Conference on Software Engineering
ISSTA | International Symposium on Software Testing and Analysis
TOSEM | Transactions on Software Engineering and Methodology
TSE | Transactions on Software Engineering

2.2.2. Search Datasets

After determining the search string, we conducted an automated search across seven widely used databases, which together cover all published and latest papers. Given that the first paper on the Transformer architecture (Vaswani et al., 2017), which forms the basis for LLMs, was published in 2017, we focused our search on papers published from that year onward (the cut-off date for the paper collection process of this version is January 31, 2024). Two authors independently performed the automated search, and the search results from each database were merged and deduplicated. Specifically, we obtained 1,192 papers from IEEE Xplore, 10,445 papers from the ACM Digital Library, 62,290 papers from ScienceDirect, 42,166 papers from Web of Science, 85,671 papers from Springer, 9,966 papers from arXiv, and 4,035 papers from DBLP.
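Merging and deduplicating the records exported from several databases can be done, for instance, by keying on DOIs where available and on normalized titles otherwise. The sketch below assumes hypothetical record dictionaries with "title" and "doi" fields and is not the exact tooling used in this review.

```python
# Sketch: merge search results from several databases and drop duplicates.
# Records are assumed to be dicts with "title" and an optional "doi" field;
# real exports (BibTeX, CSV, RIS) would be parsed into this shape first.
def normalize(title):
    return "".join(ch.lower() for ch in title if ch.isalnum())

def deduplicate(record_lists):
    seen, merged = set(), []
    for records in record_lists:
        for rec in records:
            key = rec.get("doi") or normalize(rec["title"])
            if key not in seen:
                seen.add(key)
                merged.append(rec)
    return merged

ieee = [{"title": "A Study of LLMs for Code", "doi": "10.1/abc"}]
acm = [{"title": "A Study of LLMs for Code!", "doi": "10.1/abc"}]
print(len(deduplicate([ieee, acm])))  # 1 record remains after deduplication
```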

2.3. Study Selection

2.3.1. Study Inclusion and Exclusion Criteria

Inclusion criteria
1) The paper claims that an LLM is used.
2) The paper claims that the study involves an SE task.
3) The paper's full text is accessible.
Exclusion criteria
1) Short papers with fewer than 8 pages.
2) Duplicate papers or similar studies with different versions from the same authors.
3) Studies appearing in books, theses, monographs, keynotes, panels, or venues without a full peer-review process.
4) Tool demos and editorials.
5) The paper is published in a workshop or a doctoral symposium.
6) The paper is a grey publication, e.g., a technical report or thesis.
7) Non-English written literature.
8) The paper mentions the use of LLMs without describing the employed techniques.
9) The paper leverages SE methods to enhance LLMs, rather than focusing on using LLMs for SE tasks.

Based on our search strategy, we initially obtained 218,765 papers that potentially related to our research. Next, we evaluated the relevance of these papers using the inclusion and exclusion criteria shown in Table 3, so that the selected papers directly address our research questions. To ensure that these criteria were sufficiently objective and rational, we designed them following several state-of-the-art SLR papers (Naveed et al., 2024; Wang et al., 2022a; Watson et al., 2022; Yang et al., 2022b). The paper selection process, as illustrated in Fig. 1, consists of six phases. In the first phase, we conducted automated filtering to exclude papers with fewer than 8 pages (Bashroush et al., 2017; Wang et al., 2022a) (Exclusion criterion 1), reducing the number of papers to 80,611. In the second phase, we examined the titles, abstracts, and keywords of the papers to identify those containing relevant LLM-related keywords, where the keyword set was deliberately broadened with ML, DL, and other related terms to avoid missing relevant papers. The purpose of this phase is to narrow down the scope and retain papers directly related to LLMs (Inclusion criterion 1); papers filtered out in this phase are manually reviewed again in the fifth phase. Additionally, we excluded 448 papers written in languages other than English (Exclusion criterion 7). After the second phase, the number of papers was reduced to 5,078.

The third phase involves identifying the venues of the papers (Exclusion criterion 3). We extracted publication information such as “journal”, “URL”, “DOI”, and “series” to determine the publication sources. We chose to retain papers from arXiv published in 2023 and 2024, considering that this field is emerging and many works are still under submission. Although these papers have not undergone peer review, our quality assessment process eliminates low-quality papers. This step resulted in 1,172 papers.

In the fourth phase, we merged and deduplicated the remaining papers from the seven databases and the manually searched paper list (Exclusion criterion 2), resulting in 810 papers. We then reviewed the full texts of the papers and excluded 190 papers that were grey publications or were published in workshops or doctoral symposiums (Exclusion criteria 4, 5, 6). By further assessing the quality of the papers, we identified 382 papers directly relevant to our research. This phase primarily involved excluding papers that mentioned LLMs but did not directly apply them, such as papers that only discussed LLMs in future work or focused on evaluating the performance of LLM-enabled tools (Wang et al., 2023c) (Exclusion criterion 8). We retained systematic reviews, surveys, and other secondary studies at this stage and assessed their content during the quality assessment phase to determine their relevance to our research.
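The automated portion of this filtering (page count, language, and keyword screening of titles and abstracts) can be sketched as follows; the record fields and the abbreviated keyword list are illustrative assumptions, not the actual scripts used.

```python
# Sketch: automated filtering stages applied before manual review.
# Each record is assumed to carry "pages", "title", "abstract", and "language".
LLM_TERMS = ["llm", "large language model", "gpt", "bert", "codex",
             "transformer", "machine learning", "deep learning"]

def passes_automated_filters(rec):
    if rec["pages"] < 8:            # Exclusion criterion 1: short papers
        return False
    if rec["language"] != "en":     # Exclusion criterion 7: non-English
        return False
    text = (rec["title"] + " " + rec["abstract"]).lower()
    return any(term in text for term in LLM_TERMS)  # broad LLM keyword screen

papers = [
    {"pages": 12, "language": "en", "title": "Repairing bugs with GPT-4",
     "abstract": "We study automated program repair with an LLM."},
    {"pages": 4, "language": "en", "title": "A tool demo", "abstract": "..."},
]
print([p["title"] for p in papers if passes_automated_filters(p)])
```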

2.3.2. Study Quality Assessment

ID Quality Assessment Criteria
QAC1 Is the study relevant to SE tasks?
QAC2 Does the study utilize LLMs?
QAC3 Is the research not a secondary study, such as an SLR, review, or survey?
QAC4 Was the research published in a high-repute venue?
QAC5 Is there a clear motivation for the research?
QAC6 Does the study provide a clear description of the techniques used?
QAC7 Are the experimental setups, including experimental environments and dataset information, described in detail?
QAC8 Does the study clearly confirm the experimental findings?
QAC9 Are the key contributions and limitations of the study discussed?
QAC10 Does the study make a contribution to the academic or industrial community?

A well-crafted quality assessment can help to prevent biases introduced by low-quality studies and can indicate to readers where caution about conclusions should be drawn (Yang et al., 2021). We formulated ten Quality Assessment Criteria (QAC), as shown in Table 4. These aim to assess the relevance, clarity, validity, and significance of included papers. For QAC1-QAC3 we used a scoring system of -1, 0, 1 (irrelevant/unmet, partially relevant/met, relevant/fully met). These first three questions were applied to the 382 papers remaining in the fifth stage; if QAC1, QAC2, or QAC3 received a score of -1, there was no need to proceed with QAC4-QAC10, and the paper was excluded directly. QAC4-QAC10 assessed the content of the papers using a scoring system of 0, 1, 2, 3 (poor, fair, good, excellent). Finally, we calculated the total score of QAC4-QAC10 for each paper. For published papers, the maximum score for QAC4-QAC10 is 21 (3 × 7), and we retained papers with a score of 16.8 (21 × 0.8) or above. For unpublished papers on arXiv, the score for QAC4 is always 0, so the maximum score for QAC5-QAC10 is 18 (3 × 6), and we retained papers with a score of 14.4 (18 × 0.8) or above. After this quality assessment, we obtained a final set of 382 papers.
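The scoring and retention rule described above can be expressed compactly as follows; this is a minimal sketch of the stated thresholds (gate questions QAC1-QAC3, content scores QAC4-QAC10, and an 80% cut-off), not the actual assessment tooling.

```python
# Sketch: apply the quality assessment rules to one paper's scores.
# gate_scores: QAC1-QAC3, each in {-1, 0, 1}; content_scores: QAC4-QAC10, each in 0..3.
def retain_paper(gate_scores, content_scores, published=True):
    if min(gate_scores) == -1:              # failing QAC1-QAC3 excludes the paper
        return False
    if not published:
        content_scores = content_scores[1:]  # arXiv papers: QAC4 is always 0, so drop it
    max_score = 3 * len(content_scores)      # 21 for published papers, 18 for arXiv
    return sum(content_scores) >= 0.8 * max_score

print(retain_paper([1, 1, 1], [3, 3, 2, 2, 3, 2, 2]))         # total 17 >= 16.8 -> True
print(retain_paper([1, 1, 1], [0, 3, 2, 2, 3, 2, 2], False))  # total 14 < 14.4 -> False
```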

2.4. Snowballing Search

To identify any additional possibly relevant primary studies, we conducted a snowballing search. Snowballing refers to using the reference list of a paper (backward snowballing) or the citations to the paper (forward snowballing) to identify additional papers. Beyond simply scanning reference lists and citation counts, snowballing benefits from systematically examining where each paper is actually referenced and where it is cited.

Before conducting snowballing, a set of initial papers needs to be prepared. In this study, the initial paper list consists of the 382 papers remaining after the quality assessment. We performed forward and backward snowballing, which resulted in the collection of 3,964 and 9,610 papers, respectively. After initial deduplication, we were left with 5,152 papers. We then conducted the full study selection process on these 5,152 papers, including deduplicating them against the 382 papers in the initial list. As a result, we obtained an additional 13 papers.
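One round of backward and forward snowballing can be sketched as below, assuming hypothetical get_references and get_citations lookups (e.g., backed by a citation database); the returned candidates would still go through the full study selection process.

```python
# Sketch: one round of backward (references) and forward (citations) snowballing.
# get_references/get_citations are hypothetical lookups returning sets of paper IDs.
def snowball(initial_papers, get_references, get_citations):
    backward, forward = set(), set()
    for paper in initial_papers:
        backward |= get_references(paper)   # backward snowballing
        forward |= get_citations(paper)     # forward snowballing
    candidates = (backward | forward) - set(initial_papers)
    return candidates  # still subject to the full study selection process

refs = {"p1": {"a", "b"}, "p2": {"b", "c"}}
cites = {"p1": {"d"}, "p2": {"a", "e"}}
new = snowball(["p1", "p2"], lambda p: refs[p], lambda p: cites[p])
print(sorted(new))  # ['a', 'b', 'c', 'd', 'e']
```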

2.5. Data Extraction and Analysis


RQ | Data Item
1, 2, 3, 4 | The category of SE task
1, 2, 3, 4 | The category of LLM
1, 4 | Characteristics and applicability of LLMs
2 | The adopted data handling techniques
3 | The adopted weight training algorithms and optimizers
3 | The selected evaluation metrics
4 | The SE activity to which the SE task belongs
4 | The developed strategies and solutions

We finally obtained 395 relevant research papers after searching and snowballing. Fig. 2 presents an overview of the distribution of the included papers. As shown in Fig. 2 (a), 154 papers are published in peer-reviewed venues. ICSE is the most common of these venues, contributing 41 papers. Other venues with noteworthy contributions include TSE, ESEC/FSE, and TOSEM, contributing 14, 12, and 11 papers, respectively. Meanwhile, the remaining 241 papers are published on arXiv, an open-access platform that serves as a repository for scholarly articles. This finding is not surprising, since much new LLM4SE research is emerging rapidly and many works have just been completed and are likely still in the peer-review process. Despite the non-peer-reviewed nature of these papers, we performed a rigorous quality assessment on all collected papers to ensure the quality and validity of our findings. This approach allows us to include all high-quality and relevant publications while maintaining high research standards.

Fig. 2 (b) shows the temporal distribution of the included papers. The number of publications has grown rapidly since 2020. In 2020 and 2021, there were only 7 and 13 relevant papers, respectively. By 2022, however, the number of papers had increased dramatically to 56, and in 2023 alone, the number of published papers reached 273. Within just one month of 2024, 46 further relevant papers were published. This rapid growth demonstrates the growing research interest in the domain of LLM4SE.

To visualize the main content of our collection of papers, we generated a word cloud based on the abstracts of the 395 papers, as shown in Fig. 3. The most frequently occurring words include “code”, “LLM”, “language”, “model”, “large”, “task”, “software”, “generation”, “performance”, and “program”, clearly indicating the main themes explored in these papers. The terms “code” and “software” emphasize the core elements of software engineering, while “LLM”, “large”, “language”, and “model” denote the use of large language models in a variety of tasks. The terms “generation”, “task”, and “program” emphasize the use of LLMs for automatic code generation and other SE tasks. In addition, “performance” reflects the evaluation and assessment of the effectiveness of LLMs in SE applications. The word cloud provides further visual evidence that the literature we have collected is closely related to our research topic.
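Such a word cloud can be generated from the collected abstracts with, for example, the third-party wordcloud package; a minimal sketch with placeholder abstracts (not the exact script used to produce Fig. 3):

```python
# Sketch: build a word cloud from paper abstracts (illustrative input only).
# Requires the third-party packages `wordcloud` and `matplotlib`.
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

abstracts = [
    "Large language models for code generation ...",
    "An LLM-based approach to program repair ...",
]
text = " ".join(abstracts)
cloud = WordCloud(width=800, height=400, stopwords=STOPWORDS,
                  background_color="white").generate(text)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```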

We then conducted data extraction during the full-text review. This extraction phase collected all relevant data that would facilitate a comprehensive and insightful response to the RQs outlined in Section  2.1 . As depicted in Table  5 , we extracted data including the classification of SE tasks, their corresponding activities, as well as the category, characteristics, and applicability of the LLMs. With this collected data, we systematically analyzed the relevant aspects of LLM4SE.

3. RQ1: What LLMs have been employed to date to solve SE tasks?

3.1. Large Language Models (LLMs)

Pre-trained language models (PLMs) have demonstrated impressive capabilities in solving various NLP tasks (Kojima et al., 2022; Shanahan, 2022; Wei et al., 2022b; Zhao et al., 2023d). Researchers have observed that scaling up model sizes significantly enhances their capacity, leading to remarkable performance improvements when the parameter scale surpasses a certain threshold (Shanahan, 2022; Hoffmann et al., 2022; Taylor et al., 2022). The term “Large Language Model” (LLM) was introduced to distinguish language models based on their parameter size, specifically referring to large-sized PLMs (Zhao et al., 2023d). However, we note that the literature lacks a formal consensus on the minimum parameter scale for LLMs, as a model's capacity is intertwined with both data size and total compute (Wang et al., 2023c). In this paper, we adopt the LLM scope division and taxonomy introduced by Pan et al. (Pan et al., 2023b) and categorize the mainstream LLMs investigated in this study into three groups according to their architectures: encoder-only, encoder-decoder, and decoder-only LLMs. This taxonomy and the relevant models are shown in Fig. 4. We have included the LLMs used by each work and their parameter sizes (if declared in the paper) in our public repository: https://github.com/xinyi-hou/LLM4SE_SLR. Additionally, Table 6 summarizes the LLMs with different architectures suitable for different types of SE tasks.

[Figure 4: Taxonomy of mainstream LLMs with encoder-only, encoder-decoder, and decoder-only architectures.]

Model | Type | Example of SE tasks
Encoder-only | Understanding | Code understanding; Bug localization; Vulnerability detection
Encoder-decoder | Understanding and Generation | Code summarization; Code translation; Program repair
Decoder-only | Generation | Code generation; Code completion; Test case generation

Encoder-only LLMs. Encoder-only LLMs are a type of neural network architecture that utilizes only the encoder component of the model (Devlin et al., 2018). The encoder's function is to process and encode the input sentence into a hidden representation, capturing the relationships between words and the overall context of the sentence. Notable instances of encoder-only LLMs include BERT (Devlin et al., 2018) and its variants (Feng et al., 2020; Guo et al., 2020; Liu et al., 2019; Lan et al., 2019). As an example, BERT's structure, based on the Transformer's encoder architecture, has been referenced in 50 of our selected primary studies. Its distinctive bidirectional attention mechanism simultaneously considers the left and right context of each word during training. In the SE domain, other prominent models like CodeBERT (Feng et al., 2020), GraphCodeBERT (Guo et al., 2020), RoBERTa (Liu et al., 2019), and ALBERT (Lan et al., 2019) have been widely employed. Specialized models such as BERTOverflow (Tabassum et al., 2020) and CodeRetriever (Li et al., 2022b) have been specifically developed for SE applications. These models differ from BERT by leveraging program structure, introducing new pre-training tasks, or engaging new modalities, thereby improving the architecture's application to code-related tasks. For example, CodeBERT integrates a token prediction scheme to comprehend code by predicting subsequent tokens, enhancing its understanding of programming languages for tasks like code completion and bug detection (Feng et al., 2020). GraphCodeBERT introduces edge-type prediction, recognizing relationships between code elements as a graph. This enables GraphCodeBERT to leverage code structure, improving its effectiveness in tasks like code summarization and program analysis (Guo et al., 2020). Encoder-only LLMs have shown efficacy in tasks requiring a nuanced understanding of the entire sentence or code snippet. Examples include code review, bug report understanding, and named entity recognition pertaining to code entities (Pudari and Ernst, 2023; Sghaier and Sahraoui, 2023; Yang et al., 2022c; Arakelyan et al., 2023; Li et al., 2023i; Mukherjee and Hellendoorn, 2023).

Encoder-decoder LLMs. Encoder-decoder LLMs incorporate both encoder and decoder modules (Vaswani et al., 2017). The encoder ingests the input sentence and encodes it into a hidden space, effectively capturing the underlying structure and semantics. This hidden representation serves as an intermediary language, bridging the gap between diverse input and output formats. Conversely, the decoder utilizes this hidden space to generate the target output text, translating the abstract representation into concrete and contextually relevant expressions. Models such as PLBART (Ahmad et al., 2021), T5 (Raffel et al., 2020), and CodeT5 (Wang et al., 2021a) embody this architecture. Further advancements are evident in CodeT5+ (Wang et al., 2023e), while AlphaCode (Li et al., 2022a) and CoTexT (Phan et al., 2021) showcase the architecture's adaptability to various SE tasks. The encoder-decoder design offers flexible training strategies and is proficient in handling multifaceted tasks such as summarization, translation, and question answering. Within the field of SE, this ability has been successfully applied to tasks like code summarization (Al-Kaswan et al., 2023; Gu et al., 2022; Mastropaolo et al., 2021b). The encoder module's capacity to understand and represent both the structure and semantics of code is pivotal, allowing the decoder to translate this comprehension into concise, human-readable summaries.

Decoder-only LLMs. Decoder-only LLMs exclusively utilize the decoder module to generate the target output text, following a distinct training paradigm that emphasizes sequential prediction (Radford et al., 2018). Unlike the encoder-decoder architecture, where the encoder processes input text, the decoder-only architecture begins with an initial state and predicts subsequent tokens, gradually building the output text. This approach relies heavily on the model's ability to understand and anticipate language structure, syntax, and context. GPT-series models, such as GPT-1 (Radford et al., 2018), GPT-2 (Radford et al., 2019), GPT-3 (Brown et al., 2020), GPT-3.5 (OpenAI, 2022b), GPT-4 (OpenAI, 2023b), as well as their notable derivative, ChatGPT (OpenAI, 2022a) (a conversational agent built upon the GPT architecture, with GPT-3.5 and GPT-4 being specific versions of the architecture, each representing successive advancements), represent their major implementations. More specialized versions like CodeGPT (Lu et al., 2021), InstructGPT (Ouyang et al., 2022), Codex (Chen et al., 2021b), Copilot (GitHub, 2023) (an application built upon LLMs and tailored for coding tasks; for convenience, all subsequent references in this paper to LLMs and their applications, such as ChatGPT and Copilot, will collectively be referred to as LLMs), and others have been fine-tuned for specific tasks in SE. Open-source models like GPT-J (Wang and Komatsuzaki, 2021), GPT-Neo (Black et al., 2021), GPT-NeoX (Black et al., 2022), LLaMA (Touvron et al., 2023a), and Vicuna (Chiang et al., 2023) also follow this architecture. Decoder-only LLMs are usually more suitable for generation tasks, such as code generation and code completion. These models can generally perform downstream tasks from a few examples or simple instructions without adding prediction heads or fine-tuning, making them valuable tools in SE research. 2022 marked a surge in the development of decoder-only LLMs, a trend that gained further momentum in 2023, notably with the launch of commercial products by leading Internet companies. For example, Google launched Gemini (Google, 2023), Meta introduced LLaMA (Touvron et al., 2023a) and Llama 2 (Touvron et al., 2023b), and Anthropic unveiled Claude (Anthropic, 2023). Unlike LLMs such as GPT-4 and its derivative application ChatGPT, released by OpenAI, which were promptly integrated into SE tasks, these newer additions have not yet found widespread application within the SE field. Their potential remains largely unexplored, with opportunities for further assessment and utilization in specific tasks and challenges. The continued advancement of these models underscores the active exploration and innovation within decoder-only architectures.
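To make the three architecture families concrete, the sketch below loads one representative of each family via the Hugging Face transformers auto-classes; the model identifiers are illustrative choices, not a prescription from the surveyed papers.

```python
# Sketch: the three LLM architecture families map onto different auto-classes
# in Hugging Face transformers. Model names below are illustrative examples.
from transformers import (AutoModelForMaskedLM, AutoModelForSeq2SeqLM,
                          AutoModelForCausalLM, AutoTokenizer)

# Encoder-only (understanding tasks, e.g., bug localization, vulnerability detection)
encoder_only = AutoModelForMaskedLM.from_pretrained("microsoft/codebert-base")

# Encoder-decoder (understanding + generation, e.g., code summarization, translation)
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-base")

# Decoder-only (generation tasks, e.g., code generation and completion)
decoder_only = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = decoder_only.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```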

[Figure 5: Trends in the LLM architectures applied to SE tasks from 2020 to 2024.]

3.2. Trend Analysis

As shown in Fig. 5, over the span from 2020 to 2024, the architecture of LLMs has witnessed notable shifts in preference and application within SE tasks. The specific choices between decoder-only, encoder-decoder, and encoder-only structures have shaped the direction of research and solutions in the SE domain (Wong et al., 2023). This analysis explores trends in the adoption of these architectures over the years, reflecting the evolving dynamics of LLMs for SE tasks.

Evolution of LLM architectures in 2021. The year 2020 saw research papers predominantly concentrating on encoder-only LLMs for SE tasks, evidenced by a total of eight papers. Decoder-only LLMs and encoder-decoder LLMs were scarcely featured in that year's research. A marked change occurred in 2021. Out of 19 papers in 2021, nine were dedicated to decoder-only LLMs, constituting 47.37% of the research. Additionally, two papers, or 10.53%, focused on encoder-decoder LLMs. Encoder-only LLMs witnessed a slight decline, representing 42.1% of the field with eight papers. This rapid transition can be linked to the generative capability of decoder-only LLMs. Researchers (Laskar et al., 2023; Sadik et al., 2023; Sridhara et al., 2023) found that these models, e.g., the GPT series, requiring minimal fine-tuning, could produce not only syntactically correct but also functionally relevant code snippets. Their proficiency in grasping the context of code quickly made them a preferred choice.

Diversity of LLM architectures in 2022. The year 2022 saw a significant increase in diversity, with more varied LLM architectures finding representation. Out of a total of 142 papers, 73 centered on decoder-only LLMs, comprising 51.41% of the studies. Encoder-decoder LLMs made their presence known in 17 papers, accounting for 11.97%. Meanwhile, encoder-only LLMs remained strongly represented, with 52 papers capturing 36.62% of the research interest. This distribution suggests an exploration phase in which researchers were actively assessing and leveraging different architectures to suit varied needs and challenges. The substantial interest across different architectures underscores the field's richness, indicating that no single approach had yet become the definitive choice.

Dominance of the decoder-only architecture in 2023. 2023 signaled a strong shift towards decoder-only LLMs. An impressive 432 instances of utilizing decoder-only LLMs were recorded across 195 unique papers, reflecting that a single paper might employ multiple such models. These papers focusing on decoder-only LLMs constituted a significant 70.7% of the total research this year. In comparison, encoder-decoder LLMs were the subject of 85 papers, contributing 13.91%, while encoder-only LLMs appeared to stabilize, with 94 papers, representing 15.39% of the 2023 research landscape. This trend signifies a shift in focus and resources toward exploring and harnessing the decoder-only architecture as the primary approach in many current and future LLM4SE research and applications.

Exploration of the LLM architecture in 2024. The initial trends in January 2024 showcase the ongoing evolution of LLM architectures. Among the 120 papers examined, decoder-only LLMs continued to maintain a prominent position, with 77 papers dedicated to this architecture, constituting 64.17% of the research. Encoder-decoder LLMs appeared in 24 papers, representing 20% of the total, while encoder-only LLMs were featured in 19 papers, making up 15.83%. Although there is a slight decrease in the dominance of decoder-only architectures compared to the previous year, they still hold a central role. The persistent exploration of encoder-decoder and encoder-only architectures suggests an enduring interest in diverse configurations within the SE research community.

Criteria for LLM selection in SE tasks. The selection of an LLM for SE tasks should involve careful consideration rather than arbitrary choice. Key factors guiding this selection encompass the model's proficiency in understanding the context of code, its ability to generate relevant content, responsiveness to fine-tuning, and demonstrated performance on SE-specific benchmarks (Xie et al., 2023a; Li et al., 2023c, b). Given the stringent syntactical rules and functional requirements inherent to SE tasks, models capable of seamlessly integrating these complex aspects were typically favored.

Task-specific fine-tuning. A notable trend is the customization of LLMs for precise SE tasks (Izadi et al., 2022; Li et al., 2023i; Zhang et al., 2022c). By fine-tuning models with datasets tailored to specific functions such as bug detection or code review, researchers were able to achieve marked performance improvements (Ciborowska and Damevski, 2023; Kou et al., 2023a).
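A heavily simplified sketch of such task-specific fine-tuning is shown below, framing a bug-related classification task as sequence classification with the Hugging Face Trainer; the toy data, labels, and base checkpoint are placeholder assumptions rather than a setup reported in the primary studies.

```python
# Sketch: fine-tune an encoder-only model for a binary SE classification task
# (e.g., "is this change bug-related?"). Data below is a toy placeholder.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
import torch

texts = ["fix null pointer dereference in parser", "add logging to CLI"]
labels = [1, 0]  # 1 = bug-related, 0 = not (hypothetical labeling)

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2)

encodings = tokenizer(texts, truncation=True, padding=True)

class ToyDataset(torch.utils.data.Dataset):
    def __len__(self):
        return len(labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in encodings.items()}
        item["labels"] = torch.tensor(labels[idx])
        return item

args = TrainingArguments(output_dir="out", num_train_epochs=1,
                         per_device_train_batch_size=2)
Trainer(model=model, args=args, train_dataset=ToyDataset()).train()
```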

In conclusion, the evolution of LLMs for SE, transitioning from encoder-only to decoder-only architectures, highlights the field’s vibrancy and adaptability. This shift has fundamentally altered the approach to SE tasks, reflecting the ongoing innovation within the discipline.

RQ1 - Summary

4. RQ2: How are SE-related datasets collected, preprocessed, and used in LLMs?

Data plays a crucial role in the model training phase (Sun et al., 2022). First, data is collected to obtain the diversity and richness needed to ensure that the model can cope with different scenarios and situations. Second, data is classified to clarify the training objectives of the model and avoid confusion and misinformation. Preprocessing is then indispensable to clean and transform the data and improve its quality. Finally, data is formatted into a structure suitable for model processing, allowing the LLM to learn the data's features and patterns effectively. We analyze the reported processes of data collection, data classification, data preprocessing, and data representation in our selected primary studies on LLM4SE.

4.1. How are the datasets for training LLMs sourced?

Data is an indispensable and critical factor in training LLMs, determining the generalization ability, effectiveness, and performance of the models (Sun et al., 2022). Adequate, high-quality, and diverse data is critical to allow models to fully learn features and patterns, optimize parameters, and ensure reliability in validation and testing. We first investigate the methods used to obtain the datasets. By analyzing the methods of data collection, we divided the data sources into four categories: open-source datasets, collected datasets, constructed datasets, and industrial datasets. Open-source datasets (Chen et al., 2023c; Khakhar et al., 2023; Wang et al., 2023h; Zeng et al., 2022) refer to publicly accessible collections of data that are often disseminated through open-source platforms or repositories. An example is HumanEval (Chen et al., 2021b), which consists of 164 manually crafted Python problems, each accompanied by its respective unit tests. The open-source nature of these datasets ensures their credibility and allows for community-driven updates, making them a reliable resource for academic research. Collected datasets (Huang et al., 2018; Tian et al., 2023b; Sghaier and Sahraoui, 2023; Mastropaolo et al., 2022b) are those that researchers compile directly from a multitude of sources, including but not limited to major websites, forums, blogs, and social media platforms. For instance, researchers (Chan et al., 2023; Salza et al., 2022; Weyssow et al., 2023b; Yang et al., 2022c) often scrape data from Stack Overflow (Overflow, 2023) threads or GitHub (Github, 2023) issue comments to create a dataset tailored to their specific research questions. Constructed datasets (Ezzini et al., 2022; Koide et al., 2023; Kang et al., 2022; Zhang et al., 2022a) are specialized datasets that researchers create by modifying or augmenting collected datasets to better align with their specific research objectives. These modifications can be carried out through manual or semi-automatic methods and may include the generation of domain-specific test sets, annotated datasets, or synthetic data. For example, researchers often take a collected dataset of code snippets and manually annotate them with bug types to create a constructed dataset for studying automated program repair techniques (Fan et al., 2023a; Jin et al., 2023b; Wu et al., 2023a). Industrial datasets (Alhamed and Storer, 2022; Moharil and Sharma, 2022; Wang et al., 2020c) are those obtained from commercial or industrial entities and often contain proprietary business data, user behavior logs, and other sensitive information. These datasets are particularly valuable for research that aims to address real-world business scenarios. However, the acquisition of such datasets is often complicated by issues related to business confidentiality and data privacy. For example, in a collaborative effort with China Merchants Bank (CMB), Wang et al. (Wang et al., 2020c) were able to access 21 projects from CMB's repositories. Access to such data would likely require non-disclosure agreements and other legal safeguards to protect business interests. Each of these dataset types offers unique advantages and challenges, and the choice between them should be guided by the specific requirements and constraints of the research project at hand.

[Figure 6: Collection strategies for the datasets used in LLM4SE studies.]

Fig. 6 shows the collection strategies of LLM-related datasets. As can be seen from the figure, 235 studies used open-source datasets for training LLMs. One of the main reasons for using open-source datasets in LLM training is their authenticity and credibility. Open-source datasets usually contain real-world data collected from various sources (such as relevant studies that have already been conducted), which makes them highly reliable and representative of real-world scenarios. This helps LLMs learn from real examples to better understand real-world applications and improve their performance. Second, since LLMs are a topic that has only recently emerged, a lack of suitable training sets does exist. Therefore, researchers often collect data from sites such as Stack Overflow and GitHub and build datasets to make the data better suited to SE tasks. Among the 395 papers we studied, we discovered that merely six studies utilized industrial datasets. This suggests a potential misalignment between the properties of datasets used in academic research and those encountered in real-world industrial contexts. This divergence underscores the need for future research to investigate industrial datasets, thereby ensuring that LLMs are applicable and robust across both academic and industrial scenarios.

Note that some papers use multiple datasets that span different categories; e.g., Xu et al. (Xu et al., 2022) evaluated the performance of Codex, GPT-J, GPT-Neo, and other LLMs on SE tasks, and Mastropaolo et al. (Mastropaolo et al., 2021b) investigated the use of T5 in several code-related tasks such as fixing bugs and generating code comments. For different LLMs or different SE tasks, researchers may use different training datasets. On the other hand, some papers focus on exploring how existing LLMs (e.g., ChatGPT) are used in SE tasks (White et al., 2023b) and do not specify the dataset used for model training, as LLMs like ChatGPT often do not require users to prepare training data themselves for general usage scenarios.

4.2. What types of SE datasets have been used in existing LLM4SE studies?

Category | Data type (count) | Total
Text-based datasets | Programming tasks/problems (42), Prompts (33), SO (i.e., Stack Overflow) posts (12), Bug reports (11), Requirements documentation (9), APIs/API documentation (8), Q&A pairs (6), Vulnerability descriptions (4), Reviews (4), Logs (3), Methods (3), Project issues (3), Code comments (2), Theorems (2), Buggy text (1), Dockerfiles (1), Outage descriptions (1), Semantic merge conflicts (1), Site text (1), Software development tasks (1), User intents (1), Software specifications (1), User reviews (1) | 151
Code-based datasets | Source code (60), Bugs/Buggy code (16), Vulnerable source code (8), Patches (4), Code changes (3), Test suites/cases (3), Bug-fix pairs (2), Error code (2), Error-fix pairs (1), Flaky test cases (1), Identifiers (1), Labeled clone pairs (1), Packages (1) | 103
Graph-based datasets | GUI images (1) | 1
Software repository-based datasets | Code repository (9), Android apps (3), Issues and commits (3), Pull-requests (2), Industrial projects (1), Open-source projects (1), Web applications (1) | 20
Combined datasets | Programming tasks and test suites/cases (17), Source code and comments (12), Programming tasks and solutions (8), Source code and description (3), Code-text pairs (2), Source code and API usage sequences (2), Source code and test suites/cases (2), Bug report and test suites/cases (1), Buggy code and comments (1), Buggy code and solutions (1), Code files and summaries (1), Binary code and related annotations (1), Failing test code and error messages (1), Source code and Q&A pairs (1), Source code, methods, and logs (1), Vulnerable code and description (1) | 55

*See Appendix A for the full table including references.

Data types play a pivotal role in shaping the architecture and selection of LLMs, as they directly influence the extraction of implicit features and subsequent model decisions (Chan et al., 2023; Ghadhab et al., 2021; Yang et al., 2023f; Shi et al., 2022). The choice of data types can significantly impact the overall performance and generalization ability of the LLMs. We examine and classify the types of SE datasets employed in LLM4SE studies. By investigating the relationship between data types, model architectures, and performance, we seek to shed light on the critical role of data types in the success of LLM4SE applications.

Data type categorization. We classified the data types of all datasets into five categories: code-based, text-based, graph-based, software repository-based, and combined data types. Table 7 lists the specific data included in each of these categories, summarized from the 395 studies. Most of the studies used text-based datasets, accounting for a total of 151. The dominance of text-based datasets in training LLMs for SE tasks highlights the models' exceptional natural language processing capabilities. These LLMs excel in understanding and processing textual data, making them an ideal choice for tasks that involve code comprehension, bug fixing, code generation, and other text-oriented SE challenges. Their ability to process and learn from vast amounts of text data enables them to provide powerful insights and solutions for various SE applications.

The most prevalent type of data utilized in training LLMs for SE tasks is programming tasks/problems with 42 instances observed among the surveyed papers. This dominance can be attributed to the diverse and challenging nature of programming problems, which provide LLMs with opportunities to generalize knowledge and skills across various SE challenges, fostering a robust understanding of software concepts and enhancing performance across a wide range of tasks, including code generation, code completion, and code summarization, etc. Prompts follow closely behind programming tasks, with 33 instances observed in the surveyed papers, providing task-specific guidance to LLMs, serving as cues or instructions for the models, and helping them understand the context and requirements of SE tasks. This combination helps the models develop a robust understanding of software concepts and perform well in a wide range of tasks. There are also SO (i.e., Stack Overflow) posts (12), bug reports (11), etc., which are among the more numerous data types in text-based datasets.

The predominance of source code (60) as the most abundant data type in code-based datasets can be attributed to its fundamental role in SE. Source code serves as the foundation of any software project, containing the logic and instructions that define the program's behavior. Therefore, having a large volume of source code data is crucial for training LLMs to understand the intricacies of software development, enabling them to effectively generate, analyze, and comprehend code in various SE tasks. There are also common data types, such as bugs/buggy code (16) and patches (4), for program repair tasks. Additionally, vulnerable source code (8) is used for vulnerability detection tasks. Graph-based datasets are used in some research studies for SE tasks; e.g., Kolthoff et al. (Kolthoff et al., 2023) used a dataset composed of screenshots from Google Play Android applications to construct a graphical user interface (GUI) repository in their study on LLMs for the rapid prototyping task. These datasets represent code using graph structures, capturing relationships and dependencies between code components.

Software repository-based datasets are compilations of data extracted from version control systems, such as Git repositories, containing code, documentation, and related artifacts. This data includes Code repository (3), issues and commits (3), and so on. The data in software repositories can provide a wealth of information covering all aspects of the software development process, including code evolution history, records of issue fixes and feature improvements, code quality assessments, and so on. These data are valuable for studying behaviors and trends in the software development process, improving software quality and development efficiency, and evaluating the performance of software engineering techniques. Therefore, many studies have used software repository-based datasets for empirical analysis and model training.

Some studies employed combined datasets containing multiple data types. Among them, the most common type is “programming tasks and test suites/cases”. Other combinations of data types include “source code and comments”, “programming tasks and solutions”, “source code and description”, “code-text pairs”, etc.

4.3. How do data types influence the selection of data-preprocessing techniques?

For the training and application of LLMs, the raw dataset needs to be subjected to data processing to obtain a clean and suitable dataset for model training. The data processing steps  (Manh et al . , 2023 ; Lee et al . , 2022 ) involve operations such as data cleaning, noise removal, normalization, etc. To ensure consistency and quality of the data, different data types may require different processing methods to improve the performance and effectiveness of LLMs in SE tasks. In this section, we aim to detail the data preprocessing procedures for the two most used types of datasets, i.e., text-based datasets and code-based datasets.

Figure 7. The data preprocessing procedure for text-based datasets.

The data preprocessing procedure for text-based datasets. As displayed in Fig. 7, the preprocessing of text-based datasets consists of seven steps in total, with some differences from the code-based dataset preprocessing steps. The process begins with data extraction (Yang et al., 2023f; Ciborowska and Damevski, 2022; Ezzini et al., 2022; Ciborowska and Damevski, 2023), where relevant text is carefully extracted from SE documentation from a variety of sources, including bug reports (Ciborowska and Damevski, 2023), requirements documents (Kolthoff et al., 2023), code comments (Prenner and Robbes, 2021), and API documentation (Khan et al., 2021). This step ensures that the dataset captures diverse, task-specific textual information. After data extraction, the text is segmented and categorized according to the specific requirements of the research task. For example, the text can be segmented into sentences or further broken down into individual words as needed for analysis (He et al., 2023; Kou et al., 2023a). To ensure the quality and relevance of the dataset, substandard data deletion is performed to eliminate any invalid or irrelevant text. For example, the dataset used by Lee et al. (Lee et al., 2022) was constructed from bug reports, and in this deletion step the researchers filtered out bug reports with fewer than 15 words because such short texts contain too little contextual information. Next, preprocessing operations are performed on the text to standardize and clean it. Common preprocessing steps include removing certain symbols, stop words, and special characters (Rahmani et al., 2023; Wang et al., 2020c). This standardized form of text facilitates efficient processing by LLMs. To avoid introducing bias and redundancy into the dataset, researchers eliminate duplicate instances by removing any duplicate text samples (He et al., 2023; Kou et al., 2023a; Xu et al., 2022). This step enhances the diversity of the dataset and helps the model generalize better to new inputs. Data tokenization is a key step in preparing the text for LLMs (Luo et al., 2022): the text is split into smaller units, such as words or subwords, that LLMs can manage and process efficiently. Finally, the preprocessed dataset is partitioned into different subsets, usually including a training set, a validation set, and a test set.
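
To make these steps concrete, below is a minimal Python sketch of such a pipeline, assuming bug-report text as input; the 15-word filter mirrors the threshold reported by Lee et al., while the cleaning rules, split ratios, and function names are illustrative assumptions rather than a prescribed implementation. Tokenization itself is usually delegated to the target model’s own tokenizer.

```python
import random
import re


def preprocess_bug_reports(reports, min_words=15, seed=0):
    """Illustrative text-based preprocessing: clean, filter, deduplicate, and split."""
    cleaned, seen = [], set()
    for text in reports:
        text = re.sub(r"[^\w\s.,:;!?'\"()-]", " ", text)  # remove special characters (assumed rule)
        text = re.sub(r"\s+", " ", text).strip()          # normalize whitespace
        if len(text.split()) < min_words:                 # drop reports too short to carry context
            continue
        if text in seen:                                   # remove exact duplicates
            continue
        seen.add(text)
        cleaned.append(text)

    random.Random(seed).shuffle(cleaned)                   # shuffle before partitioning
    n = len(cleaned)
    train = cleaned[: int(0.8 * n)]                        # 80/10/10 split is an assumption
    valid = cleaned[int(0.8 * n): int(0.9 * n)]
    test = cleaned[int(0.9 * n):]
    return train, valid, test


train, valid, test = preprocess_bug_reports([
    "App crashes when opening the settings page on Android 12 after the latest update, stack trace attached below.",
    "Crash",  # filtered out: fewer than 15 words
])
```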

The data preprocessing procedure for code-based datasets. We now summarize the process of preprocessing a code-based dataset, which consists of seven steps. Fig. 8 describes the individual data processing steps in detail and gives examples. The first step is data extraction, which involves retrieving relevant code segments from different sources such as software repositories or version control systems (Kang et al., 2023b; Yang et al., 2023f). Depending on the requirements of the research task (Mastropaolo et al., 2021b; Yuan et al., 2023b), code segments can be extracted at different levels of granularity, ranging from individual methods and functions to entire source code files or even complete software projects. The next step is to remove any code segments that do not meet predefined criteria or quality standards (Li et al., 2021; Shi et al., 2022; Prenner and Robbes, 2021). This filtering process ensures that the extracted code is relevant to the specific SE task under study, eliminating incomplete or irrelevant code snippets. To avoid introducing bias and redundancy during model training, the third step involves removing duplicate instances (Zhao et al., 2021; Ciniselli et al., 2021; Xu et al., 2022). Any duplicate code instances are identified and removed from the dataset, increasing the diversity and uniqueness of the data. After the data extraction and filtering steps, the fourth step is data compilation: the extracted and filtered code segments are merged and compiled into a unified code dataset. This compilation simplifies data storage and access and facilitates subsequent analysis and model training (Chan et al., 2023; Mastropaolo et al., 2022a). The fifth step addresses invalid or non-executable code by removing data that cannot be compiled: any code segments that cannot be compiled or executed are removed from the dataset, ensuring that the remaining code instances are valid and usable during model training and evaluation. The sixth step is code representation, which consists of converting the code segments into a suitable representation that can be processed by the LLMs. This conversion can take different forms: token-based representation involves tokenizing the source or binary code into distinct tokens; tree-based representation parses the code into Abstract Syntax Trees (ASTs); and graph-based representation generates a Program Dependence Graph (PDG), encompassing Control Flow Graphs (CFGs) and Call Graphs (CGs). Finally, in the data segmentation step, the preprocessed dataset is partitioned into different subsets for training, validation, and testing (Ciniselli et al., 2021; Weyssow et al., 2023b). The training set is used to train the LLM, the validation set helps to tune the hyperparameters and optimize the model performance, and the testing set evaluates the model’s ability on unseen data. By following these seven preprocessing steps, researchers can create structured and standardized code-based datasets, facilitating the effective application of LLMs for a variety of SE tasks such as code completion, error detection, and code summarization.
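
As an illustration of several of these steps for Python code, the sketch below extracts function-level segments, drops unparsable files, filters trivial snippets, and deduplicates them. It is a simplified, assumption-laden example; real pipelines cover many languages, rely on compilers or language-specific parsers, and apply richer quality filters.

```python
import ast
import hashlib


def extract_functions(source: str):
    """Data extraction: pull function-level segments out of a Python source file."""
    tree = ast.parse(source)
    return [ast.get_source_segment(source, node)
            for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)]


def build_code_dataset(files, min_tokens=3):
    """Filter, deduplicate, and compile code segments into a unified dataset."""
    seen, dataset = set(), []
    for source in files:
        try:
            snippets = extract_functions(source)
        except SyntaxError:
            continue  # uncompilable data deletion: skip files that do not parse
        for snippet in snippets:
            if snippet is None or len(snippet.split()) < min_tokens:
                continue  # unqualified data deletion: drop trivial or incomplete segments
            digest = hashlib.sha1(snippet.encode()).hexdigest()
            if digest in seen:
                continue  # duplicated instance deletion
            seen.add(digest)
            dataset.append(snippet)
    return dataset  # data compilation: a single, unified list of code segments


dataset = build_code_dataset([
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a + b\n",   # duplicate, removed
    "def broken(:\n    pass\n",             # syntax error, removed
])
print(len(dataset))  # -> 1
```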

It is worth emphasizing that the order of these steps is not fixed and can be adjusted based on the specific research task and its associated requirements. Researchers need to carefully consider the objectives, characteristics of the dataset, and the desired outcomes when determining the optimal sequence for these preprocessing techniques.

4.4. What input formats are the datasets for LLM training converted to?

Once suitable datasets have been carefully chosen and clean data has been achieved through the preprocessing steps, the next critical aspect is the transformation of the data into appropriate formats that can effectively serve as inputs for LLMs. Table  8 shows four distinct data input types that emerged during the research: Token-based input, Tree/Graph-based input, Pixel-based input, and Hybrid-based input. We now detail each as follows:

Category | Input forms | Total
Token-based input | Text in tokens (150), Code in tokens (118), Code and text in tokens (78) | 347
Tree/Graph-based input | Code in tree structure (2), Code in graph structure (3) | 5
Pixel-based input | Pixel (1) | 1
Hybrid-based input | Hybrid input forms (2) | 2

Token-based input. Token-based input  (Ahmed et al . , 2024 ; Al-Kaswan et al . , 2023 ; Arakelyan et al . , 2023 ) involves representing code and text as sequences of tokens, which are smaller units like words or subwords. Text in tokens refers to the tokenization of textual data, such as documentation, bug reports, or requirements, enabling the LLMs to process and analyze natural language descriptions effectively. Code and text in tokens combine both code and its associated textual context, allowing the model to capture the relationships between code elements and their descriptions. Code in tokens refers to the representation of code snippets broken down into meaningful tokens, allowing the LLMs to understand programming language syntax and semantics at a fine-grained level.
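
As a small illustration of token-based input, the snippet below tokenizes a code fragment and its natural-language description with a pre-trained tokenizer from the HuggingFace transformers library; the choice of the CodeBERT checkpoint is an assumption made for the example.

```python
from transformers import AutoTokenizer

# Checkpoint chosen only for illustration.
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

code = "def is_even(n):\n    return n % 2 == 0"
text = "Check whether a number is even."

print(tokenizer.tokenize(text))                    # text in tokens
print(tokenizer.tokenize(code))                    # code in tokens
pair = tokenizer(text, code, return_tensors="pt")  # code and text in tokens (bimodal pair)
print(pair["input_ids"].shape)
```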

Tree/Graph-based input. Tree-based input  (Ma et al . , 2023a ; Ochs et al . , 2023 ; Zhang et al . , 2023j ) represents code as hierarchical tree structures, capturing the syntactic relationships between code elements. Each node in the tree represents a code element, and the edges represent the hierarchical nesting of control flow statements and other code structures. This form of input allows the LLMs to understand the code’s hierarchical structure and perform tasks like code completion and bug fixing. Graph-based input represents code as a graph structure, where nodes represent code elements and edges represent the relationships between them. Unlike trees, graphs allow more flexible and complex relationships between code elements, enabling the model to capture non-linear dependencies in the code. This form of input is used in tasks like code summarization and vulnerability detection by considering the code’s intricate relationships.
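
For instance, a tree-based view of a Python snippet can be obtained with the standard ast module, as in the illustrative sketch below; the (parent, child) node-type edges approximate the hierarchical structure that tree/graph-based inputs expose to a model.

```python
import ast

code = "def add(a, b):\n    if a > 0:\n        return a + b\n    return b"
tree = ast.parse(code)

# Enumerate (parent, child) node-type edges of the abstract syntax tree.
edges = [(type(parent).__name__, type(child).__name__)
         for parent in ast.walk(tree)
         for child in ast.iter_child_nodes(parent)]
print(edges)
# e.g., ('Module', 'FunctionDef'), ('FunctionDef', 'If'), ('If', 'Return'), ...
```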

Pixel-based input. Pixel-based input  (Nasir et al . , 2023 ) visualizes code as images, where each pixel represents a code element or token. This visual representation allows the LLMs to process and understand code through image-based learning. In this input form, LLMs learn from the visual patterns and structures in the code to perform tasks like code translation or generating code visualizations.

Hybrid-based input. Hybrid-based input  (Niu et al . , 2022 ) combines multiple modalities to provide LLMs with diverse perspectives for better code comprehension. For example, a hybrid input may combine code in tokens with visual representations of code, allowing the model to learn both from the fine-grained details in the tokenized code and from the overall visual structure of the code. This approach enhances the model’s ability to understand complex code patterns and improve performance in tasks such as code comprehension and code generation.

During our investigation of LLM-based models for SE tasks, we observed distinct trends in the usage of different input forms during the training process. Token-based input forms (text in tokens, code in tokens, and code and text in tokens) were the most prevalent, collectively constituting approximately 97.75% of the studies that explicitly state the input forms of LLMs (i.e., a total of 355 papers, as shown in Table 8). Specifically, code in tokens was widely adopted in 118 studies, accounting for approximately 33.24% of the total studies, demonstrating its popularity as a primary choice for representing code snippets. This approach allowed LLMs to grasp programming language syntax and semantics effectively, making it suitable for a wide range of code-related tasks. Similarly, text in tokens was utilized in 150 studies, comprising around 42.25% of the total studies. This input form allowed LLMs to process natural language descriptions, bug reports, and documentation with greater efficiency and accuracy. The popularity of token-based input forms underscores their significance in leveraging the power of LLMs for software engineering applications.

In contrast, tree/graph-based input forms, such as code in tree structure, were used in only five studies, making up approximately 1.4% of the total. Although less prevalent, this input type emerged as a promising choice to represent the hierarchical structure and syntactic relationships within code. Its adoption indicated an ongoing exploration of tree-based representations in specialized tasks, such as code completion and bug fixing.

Pixel-based input and hybrid-based input were even less common, appearing in only one and two instances, respectively (approximately 0.28% and 0.56% of the total). While their adoption rates were lower, these input forms presented intriguing possibilities for specific applications. Pixel-based input offered a unique visual representation of code, potentially advantageous for code translation tasks. Meanwhile, hybrid-based input, combining multiple modalities (e.g., code in tree structure and text in tokens in Niu et al.’s work (Niu et al., 2022)), showcased the potential for enhancing code comprehension tasks by offering diverse perspectives for the models to learn from.

In summary, the trends in input form usage reveal a strong preference for token-based input, demonstrating its versatility and effectiveness in various SE tasks. However, ongoing exploration of other input forms, such as tree/graph-based, pixel-based, and hybrid-based, suggests a dynamic and evolving landscape in the application of LLMs for SE, with potential for further innovation and improvement in specialized domains. Each of these input forms caters to specific characteristics of the SE tasks being addressed, enabling LLMs to perform effectively across a wide range of code-related applications with a more comprehensive understanding of the input data.

RQ2 - Summary

5. RQ3: What techniques are used to optimize and evaluate LLM4SE?

5.1. What tuning techniques are used to enhance the performance of LLMs in SE tasks?

Through surveying research related to LLM4SE, we found that while many general-purpose LLMs (e.g., ChatGPT) can be directly applied to software engineering tasks such as code generation (Dong et al., 2023b; Liu et al., 2023a; Yetiştiren et al., 2023), code summarization (Shi et al., 2023c; Sun et al., 2023b; Yang et al., 2023d), and program repair (Charalambous et al., 2023; Gao et al., 2023b; Xia and Zhang, 2023b) without fine-tuning, the hidden potential of LLMs often needs to be unlocked through tuning to be fully exploited. Specifically, this requires training LLMs with task-specific data so that they learn knowledge relevant to the task context and perform better. We observed that in 83 studies, LLMs were fine-tuned with full fine-tuning techniques to adapt to downstream SE tasks, the majority being BERT series models (Ciniselli et al., 2021; Ezzini et al., 2022; Fatima et al., 2022; Jesse et al., 2022; Kou et al., 2023a; Lee et al., 2022; Lin et al., 2021; Luo et al., 2022; Salza et al., 2022; Von der Mosel et al., 2022; Wang et al., 2022c; Wei et al., 2022a; Zhang et al., 2022a). Training these LLMs is expensive, requiring a large amount of computational resources and massive amounts of data. It is also costly to train and deploy the fine-tuned models separately for each downstream task, as the traditional fine-tuning approach copies a model and performs full-parameter fine-tuning for each downstream task (Cassano et al., 2023; Deng et al., 2023c; Ezzini et al., 2022; Izadi et al., 2022; Jesse et al., 2022; Lee et al., 2022).

To reduce this computational burden, some researchers have used In-Context Learning (ICL) (Gao et al., 2023b; Geng et al., 2024; Hu et al., 2024; Huang et al., 2023f; Jiang et al., 2023a), which feeds the model manually designed “prompts” and does not require updating model parameters at all, though it relies heavily on human prompt design. However, ICL operates only at inference time and does not involve learning task-specific parameters, which has been shown experimentally to give the model limited improvement on downstream tasks (Liu et al., 2022a). To address this problem, researchers have begun to apply Parameter-Efficient Fine-Tuning (PEFT) (Houlsby et al., 2019) techniques to LLMs. PEFT aims to improve the performance of pre-trained models on new tasks by fine-tuning only a small subset of parameters, thereby reducing the overall computational cost. This approach keeps the majority of the pre-trained model’s parameters in a fixed state, focusing fine-tuning efforts on a minimal yet impactful set of parameters (Weyssow et al., 2023a). Prior code intelligence research has demonstrated the capabilities of PEFT techniques, frequently revealing their superiority over full fine-tuning on a variety of tasks (Weyssow et al., 2023a). Four common PEFT techniques are Low-Rank Adaptation (LoRA) (Hu et al., 2021), prompt tuning (Lester et al., 2021), prefix tuning (Li and Liang, 2021), and adapter tuning (Houlsby et al., 2019). We now elaborate on each as follows:

Low-Rank Adaptation (LoRA). LoRA injects low-rank trainable matrices into the attention layers of the Transformer architecture to significantly reduce the number of trainable parameters. We observed that eight studies  (Arakelyan et al . , 2023 ; Lu et al . , 2023 ; Pan et al . , 2023c ; Shestov et al . , 2024 ; Shi et al . , 2023c ; Silva et al . , 2023 ; Wang et al . , 2023e ; Zhang et al . , 2023k ) utilized LoRA to enhance the performance of LLMs in SE tasks. For instance, Pan et al.   (Pan et al . , 2023c ) trained SteloCoder, specifically designed for translating multiple programming languages into Python code, which is based on the StarCoder LLM. LoRA technology was employed during the modification of the StarCoder model architecture to adjust the parameter count. Additionally, Silva et al.   (Silva et al . , 2023 ) applied LoRA to LLaMA, resulting in a highly effective “program repair adapter” for fixing bugs through fine-tuning.
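
A minimal sketch of LoRA-based tuning with the HuggingFace peft library is shown below; the base checkpoint, rank, and target modules are illustrative assumptions rather than settings taken from the surveyed studies.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Base checkpoint and hyperparameters are assumptions for illustration.
base = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")

lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # rank of the injected low-rank matrices
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the LoRA matrices are trainable; base weights stay frozen
# `model` can then be fine-tuned on task-specific SE data (e.g., buggy/fixed code pairs)
# with a standard training loop or the transformers Trainer.
```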

Prompt tuning. Prompt tuning involves appending learnable tokens to the model’s input, guiding it towards better task performance. This method keeps the model’s architecture unchanged, leveraging adaptable prompts to influence outputs without altering internal parameters. In the surveyed papers, three research works  (Lu et al . , 2023 ; Wang et al . , 2023e ; Zhu et al . , 2023 ) utilized prompt tuning. For instance, Zhu et al.   (Zhu et al . , 2023 ) proposed a method named AUMENA, which automates method naming tasks through context-aware prompt tuning.
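
The peft library also supports prompt tuning through learnable virtual tokens; the sketch below is a hypothetical configuration (the checkpoint, initialization text, and token count are assumptions), not the setup used in the cited studies.

```python
from transformers import AutoModelForCausalLM
from peft import PromptTuningConfig, PromptTuningInit, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-mono")  # illustrative checkpoint

prompt_cfg = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    prompt_tuning_init=PromptTuningInit.TEXT,
    prompt_tuning_init_text="Suggest a descriptive name for the following method:",
    num_virtual_tokens=20,
    tokenizer_name_or_path="Salesforce/codegen-350M-mono",
)

model = get_peft_model(base, prompt_cfg)
model.print_trainable_parameters()  # only the virtual prompt embeddings are updated during training
```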

Prefix tuning. Prefix tuning adapts pre-trained language models by adding trainable tokens not just to the input but also across internal layers, affecting the model’s intermediate representations. This approach modifies the model’s processing with minimal changes to its original parameters, allowing for task-specific customization. This technique was utilized in the following two studies: Lu et al.   (Lu et al . , 2023 ) fine-tuned LLaMA-Reviewer for automating code review, while Wang et al.   (Wang et al . , 2023e ) fine-tuned CodeT5+ for multiple downstream tasks such as code completion, code generation, and code search.

Adapter tuning. Adapter tuning adds small neural network modules to the original model, then fine-tuning them on specific tasks without altering the original model’s parameters. Agarwal et al.   (Agarwal et al . , 2024 ) fine-tuned LLMs using adapter tuning techniques to make them suitable for code representation tasks. Wang et al.   (Wang et al . , 2023a ) indicated that LLMs refined through adapter tuning perform exceptionally well in code search and code summarization tasks.

In addition to the above-mentioned tuning methods, other techniques have been used for tuning LLMs in the LLM4SE domain, such as Reinforcement Learning (RL)   (Islam and Najafirad, 2024 ; Islam et al . , 2024 ; Jain et al . , 2023a ; Steenhoek et al . , 2023 ; Yang et al . , 2023f ) , Supervised Fine Tuning (SFT)   (Dong et al . , 2023c ; Islam and Najafirad, 2024 ; Mastropaolo et al . , 2022b ; Steenhoek et al . , 2023 ; Yang et al . , 2023f ) , an unsupervised data augmentation method called syntax fine-tuning   (Qi et al . , 2023 ) , knowledge preservation fine-tuning   (Siddiq et al . , 2023a ) , and task-oriented fine-tuning   (Sun et al . , 2023b ) , etc.

5.2. What prompt engineering techniques are applied to improve the performance of LLMs in SE tasks?

Prompt engineering is a method of enhancing model performance by using task-specific instructions, known as prompts, without modifying the core model parameters. This approach enables LLMs to seamlessly integrate into downstream tasks solely based on the given prompts, guiding model behavior without the need to update model parameters  (Sahoo et al . , 2024 ) . Fig.  9 presents eight prompt engineering techniques currently applied in the LLM4SE domain.

Figure 9. Eight prompt engineering techniques applied in the LLM4SE domain.

Few-shot prompting. Few-shot prompting involves providing a limited number of examples or instructions to the model to perform a specific task. The model learns from these examples and generalizes to similar tasks with minimal training data. In the surveyed LLM4SE research, 88 studies utilized few-shot prompting (Ahmed et al., 2024; Feng and Chen, 2023; First et al., 2023; Geng et al., 2024; Kang et al., 2022; Wei et al., 2022a; Xu et al., 2024; Zhang et al., 2023g). For instance, Geng et al. (Geng et al., 2024) adopted an in-context learning paradigm and showed that supplying a suitable number of examples in the prompt significantly outperforms state-of-the-art supervised learning methods in generating comments with multiple intents.
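
As an illustration, a few-shot prompt for code summarization can be assembled by prepending a handful of input-output demonstrations to the query, as in the hypothetical sketch below (the examples and wording are assumptions, not prompts from the cited studies).

```python
EXAMPLES = [
    ("def add(a, b):\n    return a + b", "Return the sum of two numbers."),
    ("def is_even(n):\n    return n % 2 == 0", "Check whether a number is even."),
]


def build_few_shot_prompt(target_code: str) -> str:
    """Prepend a few demonstrations so the LLM can infer the task from examples."""
    parts = ["Summarize each function in one sentence.\n"]
    for code, summary in EXAMPLES:
        parts.append(f"Code:\n{code}\nSummary: {summary}\n")
    parts.append(f"Code:\n{target_code}\nSummary:")
    return "\n".join(parts)


print(build_few_shot_prompt("def square(x):\n    return x * x"))
```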

Zero-shot prompting. In zero-shot prompting  (Radford et al . , 2019 ) , the model is expected to perform a task without any explicit training on that task. Instead, it relies on the prompt provided during inference to generate the desired output. Following few-shot prompting in terms of usage frequency, 79 studies adopted zero-shot prompting  (Kou et al . , 2023a ; Li et al . , 2023e ; Luo et al . , 2022 ; Pei et al . , 2023 ; Schäfer et al . , 2023b ; Weyssow et al . , 2023b ; Wu et al . , 2023a ; Yan et al . , 2023a ) . For example, Li et al.   (Li et al . , 2023e ) introduced CodeEditor, a pre-trained model specifically designed for code editing, and demonstrated its effectiveness in automatic code editing under zero-shot settings.

Chain-of-Thought (CoT) prompting. Wei et al.   (Wei et al . , 2022b ) introduced a prompting technique called Chain-of-Thought (CoT), which involves each prompt building upon the preceding one, resulting in a coherent chain of reasoning that enhances the model’s ability to generate well-structured and thoughtful responses. Huang et al.   (Huang et al . , 2023g ) proposed a novel method leveraging the fault-tolerance and comprehension capabilities of pre-trained LLMs to generate Control Flow Graphs. This method involves a Chain-of-Thought (CoT) with four steps: structural hierarchy extraction, nested code block extraction, CFG generation for nested code blocks, and merging of CFGs for all nested code blocks. Tian et al.   (Tian and Chen, 2023 ) also introduced the first test case-driven code generation technique, named TCoT, to further enhance LLMs’ capabilities in code generation. Including the two studies mentioned earlier, a total of 18 studies applied CoT to improve LLMs’ performance in SE tasks  (Deng et al . , 2023d ; Feng and Chen, 2023 ; Huang et al . , 2023g , f ; Li et al . , 2023m , d ; Liu et al . , 2024b ; Mu et al . , 2023 ) .
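
To illustrate the idea, the sketch below paraphrases the four-step reasoning chain described above as a single CoT-style prompt for CFG generation; the exact wording is an assumption and not the prompt used by Huang et al.

```python
COT_TEMPLATE = """You are analyzing a piece of source code step by step.
Step 1: Extract the structural hierarchy of the code.
Step 2: Extract the nested code blocks.
Step 3: Generate a control flow graph (CFG) for each nested code block.
Step 4: Merge the per-block CFGs and output the final CFG as an edge list.

Code:
{code}

Work through each step explicitly before giving the final answer."""


def build_cfg_cot_prompt(code: str) -> str:
    return COT_TEMPLATE.format(code=code)


print(build_cfg_cot_prompt("def f(x):\n    if x > 0:\n        return x\n    return -x"))
```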

Automatic Prompt Engineer (APE). Inspired by classical program synthesis and human prompt engineering methods, Zhou et al.   (Zhou et al . , 2023b ) introduced an Automatic Prompt Engineer (APE) for automatic instruction generation and selection. APE is a system designed to automatically generate effective prompts for LLMs based on the desired task. It aims to simplify the process of prompt engineering by automating the generation of task-specific instructions. Sharing a similar concept of automated prompts, Sun et al.   (Sun et al . , 2023b ) proposed a new prompt learning framework called PromptCS. PromptCS trains a prompt agent that can generate continuous prompts to fully explore LLMs’ potential in code summarization tasks. Continuous prompts, generated under the guidance of LLMs, are easier for LLMs to comprehend compared to manually written discrete prompts.

Chain of Code (CoC) prompting. CoC prompting  (Li et al . , 2023g ) is similar to CoT prompting but is specifically tailored for programming tasks. It involves providing a sequence of prompts or code snippets to guide the model’s code generation process. Huang et al.   (Huang et al . , 2023a ) proposed CodeCoT, and Le et al.   (Le et al . , 2023 ) proposed CodeChain, both of which are reasoning frameworks that better guide LLMs in code generation.

Automatic Chain-of-Thought (Auto-CoT) prompting. Auto-CoT (Zhang et al., 2022d) is an automated version of CoT prompting where the sequence of prompts is generated automatically based on the input and desired task. Paranjape et al. (Paranjape et al., 2023) introduced ART (Automatic Reasoning and Tool-use), a framework for generating intermediate reasoning steps automatically. ART can select multi-step reasoning and tools from a task library based on given tasks at any time and has been experimentally proven effective in code tasks.

Modular-of-Thought (MoT) prompting. In code generation tasks, LLMs often generate solutions in the form of a single block of code, limiting their effectiveness in handling complex problems. To overcome this limitation, Li et al.   (Li et al . , 2023a ) proposed the Modular-of-Thought Coder (MoTCoder). They introduced a new MoT prompting optimization framework to facilitate task decomposition into logical subtasks and submodules. Experimental results demonstrate that MoTCoder significantly improves the modularity and correctness of solutions generated by LLMs in programming tasks.

Structured Chain-of-Thought (SCoT) prompting. Considering that source code contains rich structural information, Li et al.   (Li et al . , 2023d ) proposed SCoT prompting specifically for code generation tasks. Researchers enable LLMs to use program structure to construct CoTs (i.e., intermediate natural language reasoning steps) to obtain SCoTs. Then, LLMs generate the final code based on SCoTs. Compared to CoT prompts, SCoT prompts explicitly constrain LLMs to consider how to address requirements from the source code perspective. Evaluations across multiple benchmarks show that SCoT significantly enhances LLMs’ performance in code generation.

In addition to the eight prompting techniques mentioned above, we identified 76 studies where researchers, although not explicitly mentioning the application of any of the aforementioned prompting techniques, carefully designed prompts or proposed new prompt-based strategies to better apply LLMs to SE tasks. For instance, Ren et al. (Ren et al., 2023) proposed a code generation method based on knowledge-driven prompt chains. Li et al. (Li et al., 2023m) applied differential prompting to ChatGPT to better identify test cases that cause failures in buggy programs. Ahmed et al. (Ahmed et al., 2024) enhanced the performance of LLMs in code summarization tasks using automatic semantic augmentation prompts.

5.3. How are evaluation metrics utilized to assess the performance of LLM4SE tasks?

Evaluating the performance of LLM4SE is a crucial aspect of their development and deployment  (Kang et al . , 2022 ) . Benchmarking against existing datasets and using baselines are common practices to evaluate the effectiveness of LLMs  (Cassano et al . , 2023 ) . However, given the diversity of SE tasks, a single evaluation metric may not suffice to capture the model’s performance comprehensively. Thus, researchers often employ a range of evaluation metrics tailored to specific problem types  (Mastropaolo et al . , 2021b ; Niu et al . , 2022 ; Salza et al . , 2022 ) . We categorize the SE tasks summarized from 395 papers into four categories according to their addressed problem types, i.e., regression, classification, recommendation, and generation tasks, as displayed in Fig.  10 (b). The selection of evaluation metrics depends on the target problem types. For example, MAE (Mean Absolute Error) has been used for regression tasks  (Fu and Tantithamthavorn, 2022 ) . We summarize the most frequently used evaluation metrics for each task type.

Problem Type | Metric | Total
Regression | MAE (Mean Absolute Error) (1) | 1
Classification | Precision (35), Recall (34), F1-score (33), Accuracy (23), AUC (Area Under the ROC Curve) (9), ROC (Receiver Operating Characteristic) (4), FPR (False Positive Rate) (4), FNR (False Negative Rate) (3), MCC (Matthews Correlation Coefficient) (2) | 147
Recommendation | MRR (Mean Reciprocal Rank) (15), Precision/Precision@k (6), MAP/MAP@k (6), F-score/F-score@k (5), Recall/Recall@k (4), Accuracy (3) | 39
Generation | BLEU/BLEU-4/BLEU-DC (62), Pass@k (54), Accuracy/Accuracy@k (39), EM (Exact Match) (36), CodeBLEU (29), ROUGE/ROUGE-L (22), Precision (18), METEOR (16), Recall (15), F1-score (15), MRR (Mean Reciprocal Rank) (6), ES (Edit Similarity) (6), ED (Edit Distance) (5), MAR (Mean Average Ranking) (4), ChrF (3), CrystalBLEU (3), CodeBERTScore (2), MFR (Mean First Ranking) (1), PP (Perplexity) (1) | 338

*See Appendix D for the full table including references.

For classification tasks, the most commonly used metrics are Precision (Biswas et al., 2020; Chen et al., 2023a; Ezzini et al., 2022; Fatima et al., 2022; He et al., 2022), Recall (Biswas et al., 2020; Chen et al., 2023a; Ezzini et al., 2022; Fatima et al., 2022; He et al., 2022; Hey et al., 2020), and F1-score (Alhamed and Storer, 2022; Biswas et al., 2020; Chen et al., 2023a; Ezzini et al., 2022; Fatima et al., 2022; He et al., 2022), employed in 35, 34, and 33 studies, respectively. For example, in the study conducted by Khan et al. (Khan et al., 2021), F1-score is utilized to evaluate the performance of an automatic bug-fixing model. Similarly, Sharma et al. (Sharma et al., 2022) use Precision and Recall to assess the effectiveness of a transformer-based model for code summarization. These metrics are essential for evaluating the model’s ability to correctly classify code snippets (Fatima et al., 2022) or identify specific SE properties (Chen et al., 2023a).
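
These classification metrics can be computed directly with standard tooling; the sketch below uses scikit-learn on a small hypothetical set of vulnerability-detection labels.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0]  # hypothetical ground truth: vulnerable (1) vs. safe (0)
y_pred = [1, 0, 1, 0, 0, 1, 1]  # hypothetical model predictions

print("Precision:", precision_score(y_true, y_pred))
print("Recall:   ", recall_score(y_true, y_pred))
print("F1-score: ", f1_score(y_true, y_pred))
print("Accuracy: ", accuracy_score(y_true, y_pred))
```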

For recommendation tasks, MRR (Mean Reciprocal Rank) is the most frequent metric, used in 15 studies  (Ciborowska and Damevski, 2022 ; Izadi et al . , 2022 ; Li et al . , 2021 ; Lin et al . , 2021 ; Rahmani et al . , 2023 ; Salza et al . , 2022 ; Shi et al . , 2022 ; Wei et al . , 2022a ) . MRR is employed to measure the effectiveness of recommendation systems for code completion, as demonstrated in the study by Ciborowska et al.   (Ciborowska and Damevski, 2022 ) . Precision@k  (He et al . , 2023 ; Ciborowska and Damevski, 2022 ; Lin et al . , 2021 ; Zhu et al . , 2023 ) and F1-score@k  (He et al . , 2023 ; Lin et al . , 2021 ; Zhu et al . , 2022 , 2023 ) are also utilized in recommendation tasks, with 6 studies each. These metrics are used to evaluate the precision and F1-score of the recommended code snippets or code completions.
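
For reference, MRR averages the reciprocal rank of the first relevant item over all queries; a minimal implementation (with hypothetical API-recommendation data) is shown below.

```python
def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """ranked_lists[i]: ranked recommendations for query i; relevant_sets[i]: its correct answers."""
    total = 0.0
    for ranking, relevant in zip(ranked_lists, relevant_sets):
        for rank, item in enumerate(ranking, start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


# The correct API appears at rank 1 for the first query and rank 3 for the second:
# MRR = (1 + 1/3) / 2 ≈ 0.67
print(mean_reciprocal_rank([["requests.get", "urllib.request"],
                            ["os.path.join", "pathlib.Path", "os.path.abspath"]],
                           [{"requests.get"}, {"os.path.abspath"}]))
```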

In generation tasks, metrics like BLEU, along with its variants BLEU-4 and BLEU-DC  (Ahmed et al . , 2024 ; Al-Kaswan et al . , 2023 ; Arakelyan et al . , 2023 ; Chen et al . , 2022 ; Ciniselli et al . , 2021 ) , and Pass@k  (Bui et al . , 2023 ; Cassano et al . , 2023 ; Chen et al . , 2023c , 2021b ; Dibia et al . , 2022 ; Döderlein et al . , 2022 ) are the most commonly used, appearing in 62 and 54 studies, respectively. For instance, Wang et al.   (Wang et al . , 2023e ) employed BLEU to evaluate a code-to-code translation model. Pass@k is used in the research by Jiang et al.   (Jiang et al . , 2023a ) to assess code generation models, measuring the proportion of generated code snippets that match the reference solutions. Additionally, ROUGE/ROUGE-L  (Ahmed et al . , 2024 ; Al-Kaswan et al . , 2023 ; Gao et al . , 2023b ; Geng et al . , 2024 ; Li et al . , 2022f ; Mastropaolo et al . , 2021b , 2022b ; Niu et al . , 2022 ; Zan et al . , 2023a ; Li et al . , 2023b ) , METEOR  (Ahmed et al . , 2024 ; Al-Kaswan et al . , 2023 ; Chen et al . , 2022 ; Gao et al . , 2023b ; Niu et al . , 2022 ; Geng et al . , 2024 ) , EM (Exact Match)  (Al-Kaswan et al . , 2023 ; Gao et al . , 2023b ; Gupta et al . , 2023 ; Murali et al . , 2023 ; Wang et al . , 2023e ; Weyssow et al . , 2023b ; Ye et al . , 2023 ; Zhang et al . , 2023j ) , and ES (Edit Similarity)  (Liu et al . , 2023l ) are used in specific studies to evaluate the quality and accuracy of generated code or natural language code descriptions.
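
Among these, Pass@k is usually estimated with the unbiased estimator popularized by the Codex evaluation (Chen et al., 2021b): generate n samples per problem, count the c samples that pass the unit tests, and compute the probability that at least one of k drawn samples is correct. A minimal implementation:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n generated samples of which c are correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 20 samples per problem, 5 of them pass the tests.
print(pass_at_k(n=20, c=5, k=1))   # 0.25
print(pass_at_k(n=20, c=5, k=10))  # ~0.98
```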

RQ3 - Summary

6. RQ4: What SE tasks have been effectively addressed to date using LLM4SE?

6.1. What are the distributions of SE activities and problem types addressed to date with LLM4SE?

In this section, we provide a detailed analysis of the use of LLMs in different SE tasks. We summarize the reported SE tasks (Yang et al., 2022b) addressed with LLMs, following the six phases of the Software Development Life Cycle (SDLC), i.e., requirements engineering, software design, software development, software quality assurance, software maintenance, and software management. Fig. 10 (a) describes the distribution of LLMs across these six activities. Table 10 shows a detailed count of studies reporting specific SE tasks addressed with LLMs.

Figure 10. (a) Distribution of LLM usages across the six SDLC activities; (b) distribution of the problem types addressed.

The highest number of studies is observed in the software development domain, constituting approximately 56.65% of the total research volume . This underscores the primary focus to date on utilizing LLMs to enhance coding and development processes. Software maintenance tasks account for about 22.71% of the research share, highlighting the significance of LLMs in aiding software updates and improvements. The software quality assurance domain holds approximately 15.14% of the research proportion, indicating a growing interest in automating testing procedures. In contrast, requirements engineering and software design activities represent approximately 3.9% and 0.92% of the research share, respectively, suggesting relatively limited exploration so far in these areas. The software management domain has the least research representation, accounting for a tiny 0.69% proportion. This distribution underscores the vital focus on development and maintenance tasks while also indicating potential avenues for further research in testing, design, and management domains.

In our collection of LLM studies for SE tasks, we have classified them based on the type of problems they address (shown in Fig. 10 (b)). The distribution reveals that the majority of studies, about 70.97%, center around generation tasks, showcasing the significance of LLMs in producing code or text. Following this, around 21.61% of studies fall under classification tasks, indicating the relevance of LLMs in categorizing software elements. Additionally, roughly 6.77% of studies are related to recommendation tasks, demonstrating the utility of LLMs in suggesting solutions. Lastly, a smaller portion, around 0.65%, is allocated to regression tasks, reflecting the limited exploration of LLMs for predictive modeling. This distribution underscores the broad applicability of LLMs across different SE challenges, with a notable emphasis on code generation and classification tasks.

SE Activity | SE Task | Total
Requirements engineering | Anaphoric ambiguity treatment (4), Requirements classification (4), Requirement analysis and evaluation (2), Specification generation (2), Coreference detection (1), Requirements elicitation (1), Specification formalization (1), Traceability automation (1), Use cases generation (1) | 17
Software design | GUI retrieval (1), Rapid prototyping (1), Software specification synthesis (1), System design (1) | 4
Software development | Code generation (118), Code completion (22), Code summarization (21), Code search (12), Code translation (12), Code understanding (8), Program synthesis (6), API inference (5), API recommendation (5), Code editing (5), Code representation (3), Code comment generation (2), Method name generation (2), Code recommendation (2), Agile story point estimation (1), API documentation augment (1), API documentation smells (1), API entity and relation extraction (1), Data analysis (1), Fuzz driver generation (1), Control flow graph generation (1), Identifier normalization (1), Instruction generation (1), Type inference (1), Others (14) | 247
Software quality assurance | Vulnerability detection (18), Test generation (17), Bug localization (5), Verification (5), Testing automation (4), Fault localization (3), Defect detection (2), GUI testing (2), Static analysis (2), Binary taint analysis (1), Compiler fuzzing (1), Decompilation (1), Invariant prediction (1), Malicious code localization (1), Mobile app crash detection (1), Resource leak detection (1), Test prediction (1) | 66
Software maintenance | Program repair (35), Code clone detection (8), Code review (7), Debugging (4), Bug reproduction (3), Review/commit/code classification (3), Duplicate bug report detection (3), Logging (3), Log parsing (3), Sentiment analysis (3), Code revision (2), Vulnerability repair (2), API misuses repair (1), Bug prediction (1), Bug triage (1), Code coverage prediction (1), Code review explained (1), Code-review defects repair (1), Crash bug repair (1), Dockerfile repair (1), Incivility detection (1), Patch correctness prediction (1), Patch detection (1), Program merge conflicts repair (1), Rename refactoring (1), Tag recommendation (1), Technical debt payback (1), Traceability recovery (1), Web test repair (1), Type error repair (1), Others (5) | 99
Software management | Effort estimation (2), Software tool configuration (1) | 3

*See Appendix  E for the full table including references.

6.2. How are LLMs used in requirements engineering?

This section explores the utilization of LLMs in the domain of requirements engineering. It encompasses tasks such as anaphoric ambiguity treatment, requirements classification, coreference detection, requirements elicitation, and software traceability.

Anaphoric ambiguity treatment. Ambiguity in software requirements arises when a single reader can interpret a natural language (NL) requirement in multiple ways, or when different readers have varying understandings of the same requirement. Unclear and ambiguous NL software requirements can lead to suboptimal software artifacts during later development stages. Moharil  et al.   (Moharil and Sharma, 2023 ) and Ezzini  et al.   (Ezzini et al . , 2022 ) have empirically demonstrated the significant role of LLMs such as BERT and SpanBERT in effectively addressing anaphoric ambiguity. Sridhara et al.   (Sridhara et al . , 2023 ) revealed that ChatGPT excels in addressing anaphoric ambiguity in software requirements. Through researchers’ analysis of ten English requirement specifications  (Ezzini et al . , 2022 ) containing anaphora-related challenges, ChatGPT consistently demonstrated its remarkable capability to accurately identify antecedents. This empirical evidence emphasizes the valuable role ChatGPT can play in enhancing the clarity and precision of software requirements, ultimately contributing to more effective software development processes by reducing interpretational uncertainties.

Requirements classification. Requirements originate in NL documents and demand effective classification, especially for identifying particular classes of requirements, such as security-related ones, early in a project (Knauss et al., 2011; Li et al., 2014). Automated processing hinges on identifying these requisites. Categorizing requirements into functional (FR) or non-functional (NFR) ones, together with their quality constraints, benefits automated approaches (Li et al., 2014). Hey et al. (Hey et al., 2020) employ BERT for requirement classification, where it excels in categorizing both FR and NFR requirements using a fine-tuning transfer learning technique, outstripping traditional methods. Luo et al. (Luo et al., 2022) introduce a BERT-based software requirement classification method, demonstrating remarkable transferability and generalization, especially in zero-shot scenarios.

Requirements term identification. Moharil et al.   (Moharil and Sharma, 2022 ) propose a technique for identifying terms used in different contexts within the same domain or in interdisciplinary projects. Using BERT, which reads entire word sequences for deeper language understanding, and K-means clustering, they create and group vectors for each term in the corpora. The method has been validated on large Computer Science and multi-domain corpora comprising eight different fields.

Coreference detection. Requirements, authored by diverse stakeholders, continually evolve, leading to terminology differences and inconsistencies across domains. Entity coreference in Requirement Engineering (RE), where various expressions refer to the same real-world entity, can cause confusion and affect comprehensibility. Wang et al.   (Wang et al . , 2020c ) offer a novel application of the BERT model for coreference detection.

Traceability automation. Software and system traceability refers to the ability to establish and maintain relationships between software artifacts, such as requirements, design definitions, code, and test cases, for product querying and development support  (Rierson, 2017 ) . Lin et al.   (Lin et al . , 2021 ) found that T-BERT can effectively migrate knowledge from code search to NLA-PLA (i.e., Natural Language Artifacts to Programming Language Artifacts) traceability, even with limited training instances. It outperforms existing techniques in accuracy and can be adapted to different domains without intermediate training for each project, offering a promising step toward practical, trustworthy traceability.

Others. In addition to the four requirement engineering tasks detailed above, LLMs can also be applied to requirement analysis and evaluation  (Poudel et al . , 2023 ; Ronanki et al . , 2023 ) , specification generation  (Ma et al . , 2024a ; Xie et al . , 2023b ) , requirements elicitation  (White et al . , 2023b ) , specification formalization  (Endres et al . , 2023 ) , and use case generation  (Zhang et al . , 2024d ) .

6.3. How are LLMs used in software design?

GUI (Graphical User Interface) retrieval. Kolthoff et al.   (Kolthoff et al . , 2023 ) present the application of BERT in the task of GUI retrieval in SE. The authors fine-tune a BERT-based learning-to-rank (LTR) model for this task. GUIs, which are not standard well-structured text documents, present unique challenges for text-based ranking tasks. The BERT model is prepared by concatenating the natural language query and the GUI document text, and then this input is used to train different BERT-LTR models. The models are evaluated based on their performance in NL-based GUI ranking.

Rapid prototyping. Rapid prototyping enables developers to quickly visualize and iterate on software designs, thereby accelerating the development process and ensuring alignment with user needs. White et al.   (White et al . , 2023b ) investigate the role of LLMs in augmenting this process. The study introduces prompt design techniques, organized into patterns, providing a structured methodology to tackle prevalent challenges in LLM4SE. This research indicates that the realm of rapid prototyping stands to benefit from deeper integration with advanced machine learning techniques, thereby creating opportunities for additional research and refinement aimed at producing more intuitive and user-centric software designs.

Software specification synthesis. Software configuration is vital for system behavior, but managing configurations and specifications becomes complex with larger systems. Mandal et al.   (Mandal et al . , 2023 ) introduce SpecSyn, a framework using an LLM for automatic software specification synthesis from natural language sources. This end-to-end approach treats the task as a sequence-to-sequence learning problem, surpassing the previous state-of-the-art tool by 21% in F1 score, and can find specifications from both single and multiple sentences.

6.4. How are LLMs used in software development?

Our analysis identifies wide-ranging applications of LLMs for software development, encompassing tasks such as code generation, code completion, and code summarization.

Code generation. Code generation has long been a task of interest: there is extensive work on program synthesis using symbolic and neural approaches (Alur et al., 2013; Wu et al., 2023b). Recently, LLMs trained for text generation have demonstrated the ability to complete programs (Brown et al., 2020; Black et al., 2022). Since 2020, several code generation models have been trained or fine-tuned on programming language text (Nijkamp et al., 2022b; Chen et al., 2021b; Fried et al., 2022; Xu et al., 2022; Feng et al., 2020; Clement et al., 2020). Unlike traditional program synthesis techniques, neural language models can be conditioned on natural language (e.g., code annotations) as well as generate programming language text. Researchers have experimentally demonstrated that LLMs like GPT-4 (Bareiß et al., 2022; Liu et al., 2023k; Jiang et al., 2023c; Gilbert et al., 2023), GPT-2/GPT-3/GPT-3.5 (Azaria et al., 2023; Yetiştiren et al., 2023; Ke et al., 2023; Liu et al., 2023k; Nascimento et al., 2023; Li et al., 2023c; Wang et al., 2023h; Liu et al., 2023a; Dong et al., 2023b), the BERT series (Zeng et al., 2022; Lai et al., 2023), Codex (Dibia et al., 2022; Bareiß et al., 2022; Yu et al., 2023a; Chen et al., 2021b; Gupta et al., 2023; Madaan et al., 2022; Kuznia et al., 2022), CodeGen (Dibia et al., 2022; Jones and Steinhardt, 2022; Zan et al., 2022a), InCoder (Murali et al., 2023; Kou et al., 2023b; Liu et al., 2023k; Wang et al., 2022b), Copilot (Wu et al., 2023b), and CodeGeeX (Zheng et al., 2023c) play a key role in code generation. By pre-training on large-scale text data, these models learn rich linguistic knowledge and semantic representations that enable them to understand the meaning and structure of natural language. LLMs can automate code generation by converting natural language descriptions into code (Jiang et al., 2023a). These models generate program code from natural language descriptions, enhancing code-writing efficiency and accuracy. They show excellent performance in code completion, automatic code generation, and conversion of natural language annotations to code, providing software developers with powerful auxiliary tools and promoting further automation and intelligence in the code writing and development process.

Within the domain of LLMs applied to software development tasks, studies centered on code generation distinctly dominate the academic landscape. As reflected in Table  11 , the GPT series, particularly GPT-4, emerged as a key focus, with many more studies using them in the realm of code generation   (Dong et al . , 2023b ; Du et al . , 2023b ; Li et al . , 2023c ; Liu et al . , 2023k ) . Analyzing these studies, several noteworthy findings surface:

Programming thinking in LLMs. Techniques that evoke “programming thinking” within LLMs, such as the TIP (i.e., Thinking in Programming)  (Li et al . , 2023c ) methodology, have shown promising strides. By guiding LLMs to first craft a high-level code sketch before delving into detailed implementations, the synthesized code exhibits higher accuracy and robustness.

Class-level vs. Method-level generation. LLMs, while adept at method-level code generation, present varied performance metrics when tasked with class-level generation  (Du et al . , 2023b ) . This divergence underscores the evolving nature of challenges as the granularity of code synthesis shifts.

Expanding LLM capabilities. The next frontier in this discipline seems to lie in harmoniously integrating LLMs with established SE tools and practices. The emergence of frameworks like EvalPlus  (Dong et al . , 2023b ) indicates a trend towards enhancing the evaluation and accuracy of LLM-generated code, possibly ushering in an era where human developers and LLMs collaboratively craft software solutions.

Model | Baselines | Benchmarks | Metric | Date
GPT-3.5 | Codex, CodeGen, CodeGeeX, LLaMA, InCoder, PyCodeGPT, CodeParrot, GPT-2 | HumanEval, MBPP, MBCPP | Pass@k | May 11, 2023
GPT-4 | PaLM Coder, Codex, CodeGen-Mono, InCoder, CodeGeeX, AlphaCode | HumanEval, HumanEval-ET, MBPP, MBPP-ET | Pass@k | May 24, 2023
GPT-4 | GPT-3.5, StarCoder, CodeGen, CodeGen2, Vicuna, SantaCoder, InCoder, GPT-J, GPT-Neo, PolyCoder, StableLM | HumanEval, HumanEval+, HumanEval-mini | Pass@k | Jun 12, 2023
GPT-4 | GPT-3.5, WizardCoder, Instruct-StarCoder, SantaCoder, Instruct-CodeGen, CodeGeeX, InCoder, Vicuna, ChatGLM, PolyCoder | ClassEval, HumanEval | Pass@k | Aug 3, 2023

Code completion. Code completion is an assistive feature provided by many integrated development environments (IDEs) and code editors. Its purpose is to automatically display possible code suggestions or options as developers write code (Amann et al., 2016). This capability has been advanced by Language Models (LMs), evolving from n-gram and RNN models to transformer-based models like Copilot (GitHub, 2023) and CodeGPT (Judini, 2023), pre-trained on extensive code datasets. Recent LLMs, equipped with billions of parameters, excel at generating code snippets. These models are trained on vast amounts of natural language text, equipping them with powerful semantic understanding capabilities. In the context of code completion, LLMs such as Codex (Li et al., 2022e; Pearce et al., 2021; Döderlein et al., 2022; Chen et al., 2021b), the BERT series (Khan and Uddin, 2022), GitHub Copilot (Li et al., 2022e; Pudari and Ernst, 2023; Döderlein et al., 2022), CodeParrot (Li et al., 2022e; Xu et al., 2022), the GPT series (Xu et al., 2022; Ochs et al., 2023), T5 (Ciniselli et al., 2021), InCoder (Fried et al., 2022), PolyCoder (Xu et al., 2022), CodeGen (Ding et al., 2023; Dinh et al., 2023; Li et al., 2022e; Nijkamp et al., 2022a), and other LLMs (Izadi et al., 2022; Ochs et al., 2023) can generate accurate and intelligent code suggestions based on code context and syntax structures. They comprehend the developer’s intent, predict the next possible code snippet, and provide appropriate recommendations based on the context.

With the support of LLMs, code completion achieves significant improvements in efficiency and accuracy. Developers can save time by avoiding manual input of lengthy code and reducing the risk of code errors. LLMs also learn from extensive code repositories, acquiring knowledge and best practices to offer more intelligent and precise suggestions, aiding developers in better understanding and utilizing code  (Ciniselli et al . , 2021 ) . Additionally, these models can provide personalized code recommendations based on developers’ coding styles and preferences, further enhancing the effectiveness and user experience of code completion  (Liu et al . , 2023l ) .

Code summarization. Code summarization is a task that attempts to understand the code and automatically generate descriptions directly from the source code. It can also be viewed as an extended form of documentation. Successful code summarization not only facilitates the maintenance of source code  (Iyer et al . , 2016 ; Nguyen and Nguyen, 2017 ) but can also be used to improve the performance of code search using natural language queries  (Nie et al . , 2016 ; Yang et al . , 2016 ) and code classification  (Nguyen and Nguyen, 2017 ) . LLMs play a significant role in code summarization by analyzing code structures and contexts to generate informative natural language summaries. Specifically, LLMs such as Codex  (Gao et al . , 2023b ; Arakelyan et al . , 2023 ; Ahmed et al . , 2023 ) , CodeBERT  (Chen et al . , 2022 ; Gu et al . , 2022 ; Gao et al . , 2023b ) , and T5  (Mastropaolo et al . , 2021b , 2022b ) comprehend the functionality and logic of the code, producing easily understandable human language descriptions. For example, Arakelyan et al.   (Arakelyan et al . , 2023 ) rigorously evaluate the efficacy of CodeT5 and Codex across code generation and summarization tasks, shedding light on their performance under distribution shifts. It unveils practical adaptation techniques, underscoring Codex’s commendable performance. Additionally, the study demonstrates that while adapted models exhibit proficiency in code generation, their generality can present trade-offs in the context of code summarization. As a result, code summarization with the support of LLMs enhances code readability, improves software documentation quality, and accelerates code comprehension and collaboration among developers. This advanced approach to code summarization demonstrates great potential for automating and streamlining various aspects of software development in modern SE practices with the employment of LLMs.

Code search. Code search, or code retrieval, is the task of retrieving source code from a large code base, usually based on a user’s natural language query. Despite the success of earlier neural models in code search, such models are relatively shallow and cannot effectively learn from large amounts of data (Salza et al., 2022). In recent years, some bimodal pre-training models based on the BERT neural architecture have been proposed to capture semantic links between natural and programming languages (Feng et al., 2020; Guo et al., 2020; Roziere et al., 2021; Wang et al., 2021b), such as CodeBERT (Feng et al., 2020) and GraphCodeBERT (Guo et al., 2020). Bimodal pre-training models learn generic representations from large amounts of data in an unsupervised manner by designing pre-training goals. Salza et al. (Salza et al., 2022) explored the effectiveness of LLMs such as BERT (Salza et al., 2022) and RoBERTa (Chen et al., 2022) in understanding natural language and code semantics and enhancing code search and retrieval. These studies show that pre-training tasks alone may not be sufficient for code search, which emphasizes the need for a multimodal understanding of data (Shi et al., 2022), including both natural language and code. In addition, research has shown that the use of code generation models such as Codex (Li et al., 2022d) can enhance code retrieval by generating code snippets from natural language documents, thereby improving semantic similarity and obtaining state-of-the-art results on benchmark datasets.
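
As a simplified illustration of embedding-based code search, the sketch below ranks candidate snippets by the cosine similarity between mean-pooled CodeBERT embeddings of the query and the code; the checkpoint choice and pooling strategy are assumptions made for the example, not the setup of any particular study.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")  # illustrative checkpoint
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()


def embed(text: str) -> torch.Tensor:
    """Mean-pool the last hidden states into a single vector (assumed pooling strategy)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)


query = "read a file line by line"
candidates = [
    "def read_lines(path):\n    with open(path) as f:\n        return f.readlines()",
    "def add(a, b):\n    return a + b",
]

query_vec = embed(query)
ranked = sorted(candidates,
                key=lambda c: torch.cosine_similarity(query_vec, embed(c), dim=0).item(),
                reverse=True)
print(ranked[0])  # expected to surface the file-reading snippet first
```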

Code understanding. In contrast to code summarization, which focuses on automatically generating human-readable descriptions from source code, code understanding involves a deep analysis of source code to comprehend its logic, structure, functionality, and dependencies, as well as understanding the programming languages, frameworks, and libraries used  (Shen et al . , 2022 ) . LLMs can assist in code understanding by leveraging their powerful natural language processing capabilities to interpret code-related text, such as comments and documentation  (Wang et al . , 2023e ; Kanade et al . , 2020 ) . They aid developers in grasping code functionality, identifying dependencies, and generating relevant code documentation  (Shen et al . , 2022 ; Ma et al . , 2023a ) . Through their ability to comprehend both code and natural language, LLMs enhance the efficiency and accuracy of code understanding, empowering developers to maintain, optimize, and integrate code effectively  (Kanade et al . , 2020 ) .

Program synthesis. Program synthesis is the automated process of generating code that satisfies a given specification or set of constraints, emphasizing the derivation of functional properties of the code  (Chen et al . , 2017 , 2021a ; Manna and Waldinger, 1980 ; Srivastava et al . , 2010 ; Parisotto et al . , 2016 ) . It differs from code generation, which primarily translates higher-level representations into target code without necessarily deriving its functionality from scratch  (Siddiq et al . , 2023a ; Zhang et al . , 2023l ; Zheng et al . , 2023c ) . Several studies have demonstrated that LLMs can be used for program synthesis tasks. LLMs have a significant impact on program synthesis due to their advanced language understanding and generation capabilities. LLMs can effectively interpret natural language descriptions, code comments, and requirements, and then generate corresponding code snippets that fulfill the given specifications. This helps developers rapidly prototype code and automate repetitive coding tasks  (Kuznia et al . , 2022 ; Gandhi et al . , 2023 ) . When applied to program synthesis, LLMs enhance productivity and reduce the burden on developers by automating the code-writing process based on high-level input  (Jain et al . , 2022 ) . Their ability to understand the nuances of both natural language and programming languages makes them valuable tools in advancing the field of SE and streamlining the development lifecycle.

API recommendation. Several methods have been proposed to automate API (Application Programming Interface) recommendations  (Gu et al . , 2016 ; Huang et al . , 2018 ; Liu et al . , 2018 ; Nguyen et al . , 2016 ) , falling into two orthogonal approaches: information retrieval-based (IR-based) and neural-based. In this context, our focus is on the latter. Wei et al.   (Wei et al . , 2022a ) introduced CLEAR, an API recommendation method that employs the BERT sentence embedding model to represent queries, capturing continuous semantic information. Through contrast training, CLEAR enables BERT to learn precise semantic representations of queries, independent of their lexical content. Recently, Zhang et al.   (Zhang et al . , 2023k ) developed ToolCoder, which combines API search tools with existing models to aid in code generation and API selection. This approach involves an automated data annotation method using ChatGPT, adding tool usage information to the source code data, followed by fine-tuning the code generation model. During inference, an API search tool is integrated into the generation process, allowing the model to utilize the tool for suggestions when selecting APIs automatically.

API inference. The automated generation of application programming interface calls, known as API synthesis, plays a crucial role in bridging human intent with machine execution. In recent studies, Wang et al.   (Wang et al . , 2023d ) and Patil et al.   (Patil et al . , 2023 ) have both explored the potential of LLMs in this realm. Utilizing models like GPT-4 and LLaMA-based architectures, these researchers showcase the prowess of LLMs in generating accurate API calls and adapting to real-time documentation changes, effectively addressing challenges like hallucination and inaccurate input arguments. The integration of LLMs in API synthesis signifies a paradigm shift, promising enhanced accuracy, adaptability, and reliability in code generation. As illuminated by these studies, the future of API synthesis may be deeply anchored in advanced machine learning, heralding new research avenues and refinements for more seamless human-machine interactions.

Code representation. Code representation learning (also known as code embedding) aims to encode the code semantics into distributed vector representations and plays a key role in recent deep-learning-based models for code intelligence. Code representation can be used to support a variety of downstream tasks, such as code completion  (Raychev et al . , 2014 ) , code search  (Gu et al . , 2018 ; Wan et al . , 2019 ) , and code summarization  (Wan et al . , 2018 ; Zhang et al . , 2020a ) . Niu et al.   (Niu et al . , 2022 ) propose a novel sequence-to-sequence pre-training model that utilizes structural information from source code to enhance its representation learning. The model is trained on a large corpus of source code, which enables it to capture the complex patterns and dependencies inherent in programming languages. Wan et al.   (Wan et al . , 2022b ) show through their research that attention is highly consistent with the syntactic structure of the code, that pre-trained code language models can preserve the syntactic structure of the code in the intermediate representations of each converter layer, and that pre-trained code models have the ability to induce a syntactic tree of the code. These revelations suggest that incorporating the syntactic structure of the code into the pre-training process results in better code representations.

Code comment generation. Code comment generation, the automatic creation of comments for source code, serves to elucidate code functionality, implementation logic, and input-output details, thereby enhancing readability and maintainability  (Geng et al . , 2024 ) . As code complexity grows, manually crafting these comprehensive and accurate comments can become burdensome and prone to errors. Automation in this domain can markedly enhance the efficiency and quality of code documentation. LLMs such as Codex  (Geng et al . , 2024 ) and T5  (Mastropaolo et al . , 2021a ) have been effectively applied to code comment generation. These models are pre-trained on vast amounts of data and possess powerful natural language processing and semantic understanding capabilities. During comment generation, LLMs analyze the structure, semantics, and context of the source code to automatically generate high-quality comments that correspond to the code’s functionality and logic. Addressing the often observed disconnect between code evolution and its accompanying documentation, Mastropaolo et al.   (Mastropaolo et al . , 2021a ) explore the potential of LLMs, particularly the T5 architecture, in assisting developers with code comment completion. Their empirical study juxtaposes the performance of the T5 model against an n-gram model, revealing T5’s superior capabilities, though the n-gram model remains a competitive alternative. The research underscores the significance of open-source datasets for training and highlights the scant use of industrial datasets in current studies.

Method name generation. Method names significantly affect program comprehensibility, serving as a brief summary of the source code and indicating the developer’s intent  (Ko et al . , 2006 ) . The importance of method names in program comprehension is further evidenced by recent studies showing that some programmers even write down important method names to help them figure out the procedures of an application  (Roehm et al . , 2012 ) . Zhu et al.   (Zhu et al . , 2023 ) present AUMENA, a novel approach using the CodeT5 model for context-aware method naming in SE. AUMENA first learns the contextualized representation of programming and natural language, then leverages LLMs with prompt tuning to detect inconsistent method names and suggest accurate alternatives. This method avoids previous generate-then-compare consistency checking limitations, modeling the task as a two-class classification problem.

Agile story point estimation. Agile story point estimation, representing the total work needed to implement a product backlog item, is a complex task in agility. Story points are typically estimated by team consensus, using methods like plan poker and expert judgment, and considering factors like workload and complexity. However, subjective estimates may introduce uncertainty. Fu et al.   (Fu and Tantithamthavorn, 2022 ) present GPT2SP, a Transformer-based approach that overcomes limitations of a previous method called Deep-SE. Unlike Deep-SE, which restricts language models to known words within a trained project, GPT2SP employs a broader context, making it transferable across projects. GPT2SP’s performance is comparable to Deep-SE in within-repository evaluations and surpasses it in 62.5% of cases, with improvements ranging from 3% to 46% across various projects.

API documentation smell detection. APIs, vital for modern software development, are often accompanied by official documentation. Good documentation is key to proper API use, while poor quality can hinder adoption and negatively impact developers’ productivity  (Aghajani et al . , 2020 ; Robillard, 2009 ; Robillard and DeLine, 2011 ) . Khan et al.   (Khan et al . , 2021 ) identified five API documentation smells and presented a benchmark of 1,000 API documentation units containing the five smells found in the official API documentation. The authors developed classifiers to detect these odors, with BERT showing the best performance, demonstrating the potential of LLMs in automatically monitoring and warning about API documentation quality.

API entity and relation extraction. Extracting APIs and their semantic relationships from unstructured text (e.g., data from Stack Overflow) is a fundamental task in SE, but existing methods require labor-intensive manual rule creation or data labeling. Huang et al.   (Huang et al . , 2023d ) present an innovative approach, AERJE, that leverages LLMs for this task. AERJE consists of a BERT-based dynamic hint generator and a T5-based joint entity-relationship extractor, which together enable efficient extraction of API entities and relationships without manual effort. The approach achieved an F1 score of 96.51% for API entity extraction and 81.2% for API relationship extraction, offering a significant advancement over traditional methods.

Code recommendation. Zhou et al.   (Zhou et al . , 2019 ) pointed out that software developers tend to write similar code examples several times due to the need to implement similar features in different projects. Therefore, during the software development process, recommender systems can provide programmers with the most pertinent and high-quality examples written by other programmers, thus helping them to complete their tasks quickly and efficiently  (Di Rocco et al . , 2021 ) . Open-source projects and informal documentation are the two main sources of information that developers rely on to perform programming tasks. For example, open-source projects on GitHub provide code examples and code resources for various tasks. Rahmani et al.   (Rahmani et al . , 2023 ) introduce a methodology to improve code example recommendations for Java programming language on Stack Overflow using BERT and Query-Aware Locality-Sensitive Hashing (LSH). They employ BERT to convert code into numerical vectors and then apply two LSH variants, Random Hyperplane-based, and Query-Aware, to identify Approximate Nearest Neighbors (ANN).

Control flow graph generation. Control Flow Graphs (CFGs) are a cornerstone of SE that illustrate program behavior by showing sequences of statements and their execution order conditions  (Allen, 1970 ) . As a graphical representation of program behavior, CFGs are critical in many SE tasks, including code search  (Guo et al . , 2020 ; Chen et al . , 2019b ) , code clone detection  (Wang et al . , 2020a ; Hu et al . , 2018 ; Wei and Li, 2017 ) and code classification  (Wang et al . , 2020b ; Zhang et al . , 2019 ) . Huang et al.   (Huang et al . , 2023g ) presented a novel approach for generating behaviorally correct CFGs of statically typed partial code by leveraging the error-tolerant and understanding ability of LLMs. The approach involves a Chain of Thoughts (CoT) with four steps: structure hierarchy extraction, nested code block extraction, CFG generation of nested code blocks, and fusion of all nested code blocks’ CFGs  (Le-Cong et al . , 2022 ) . The CoT is broken down into an AI chain according to the single responsibility principle, along with effective prompt instructions. This results in superior node and edge coverage compared to traditional program analysis-based methods and the original CoT method.

Identifier normalization. Identifiers usually consist of multiple words, and a certain number of identifiers contain abbreviations  (Jiang et al . , 2020 ) . Consequently, the lexical meaning of identifiers and the overall functionality of source code written by one developer may be challenging for other developers to comprehend. In addition, the source code cannot match the vocabulary in other software artifacts described in natural language, thus invalidating some automated algorithms. Therefore, there is a strong need to normalize identifiers with the aim of aligning the vocabulary in identifiers with the natural language vocabulary in other software artifacts. Zhang et al.   (Zhang et al . , 2022a ) addressed this by introducing BEQAIN, an approach for identifier normalization. BEQAIN combines BERT with a Question and Answering (Q&A) system and Conditional Random Fields (CRF), treating identifier splitting as sequence labeling and abbreviation expansion as a Q&A task. It uses programming context to refine expansion results when multiple expansions are possible, aligning identifier vocabulary with natural language and enhancing software development comprehension and automation.

Type inference. Type inference, the automated process of determining data types in programming, plays a crucial role in enhancing readability, maintainability, and reducing runtime errors  (Hellendoorn et al . , 2018 ; Pierce and Turner, 2000 ) . TypeScript, with its unique blend of optional typing, presents a nuanced challenge, especially when navigating the vast landscape of user-defined types. Addressing this complexity, Jesse et al.   (Jesse et al . , 2022 ) introduced an approach that leverages the capabilities of a BERT-style pre-trained model. Their solution, DIVERSETYPER, adeptly infers types for user-defined classes and interfaces by uniquely correlating class and interface declarations with their respective usage contexts. Beyond merely filling the gaps of previous methodologies, DIVERSETYPER sets a new benchmark in type inference, especially for user-defined types.

Others. In addition to the 18 software development tasks detailed above, LLMs can also be applied to code translation  (Jana et al . , 2023 ; Pan et al . , 2023c , a ; Qi et al . , 2023 ; Yan et al . , 2023b ; Yang et al . , 2023g ) , code editing  (Bairi et al . , 2023 ; Gupta et al . , 2023 ; Li et al . , 2023e ; Moon et al . , 2023 ; Shypula et al . , 2023 ) , API documentation augment  (Yang et al . , 2023c ) , data analysis  (Cheng et al . , 2023 ) , fuzz driver generation  (Zhang et al . , 2023a ) , instruction generation  (Zhou et al . , 2023b ) .

6.5. How are LLMs used in software quality assurance?

Within the domain of software quality assurance, LLMs have emerged as valuable tools with diverse applications for various tasks, including vulnerability detection, test generation, bug localization, verification, test automation, etc.

Vulnerability detection. The number of software vulnerabilities is rapidly increasing, as shown by the vulnerability reports from Common Vulnerabilities and Exposures (CVEs)  (Anon, 2022 ) in recent years. As the number of vulnerabilities increases, there will be more possibilities for cybersecurity attacks, which can cause serious economic and social harm. Therefore, vulnerability detection is crucial to ensure the security of software systems and protect social and economic stability. Traditional static detection methods are based on static analysis and predefined matching rules, which rely on developers’ expertise and make it difficult to detect unknown vulnerabilities. With the assistance of LLMs  (Thapa et al . , 2022 ; Chan et al . , 2023 ; Chen et al . , 2023a ) , Tang et al.   (Tang et al . , 2023d ) introduced novel approaches using LLMs to enhance vulnerability detection. One of their proposed models, CSGVD, combines sequence and graph embedding for function-level vulnerability detection, outperforming other deep learning-based models on a real-world benchmark dataset. Their study also explores the application of CodeT5 for vulnerability detection, highlighting the importance of code-specific pre-training tasks.

Test generation. Test generation involves automating the process of creating test cases to evaluate the correctness and functionality of software applications. It encompasses various aspects, including test case generation  (Zhang et al . , 2023o ) , unit test generation  (Tang et al . , 2023c ; Yuan et al . , 2023b ; Schäfer et al . , 2023a ; Xie et al . , 2023a ; Siddiq et al . , 2023b ) , etc. LLM application in test generation offers several advantages, including the ability to automatically generate diverse test cases, improving test coverage  (Schäfer et al . , 2023a ; Siddiq et al . , 2023b ) and identifying potential defects  (Xie et al . , 2023a ) . LLMs can also assist in generating test cases based on natural language descriptions, fostering better collaboration between developers and testers. Additionally, they help identify areas lacking test coverage and suggest relevant test cases, ensuring comprehensive testing and reducing the risk of undiscovered issues  (Zhang et al . , 2023o ) . By enhancing test efficiency and effectiveness, LLMs contribute to producing more reliable and high-quality software products.

Bug localization. Bug localization refers to the process of identifying the specific source code files, functions, or lines of code that are responsible for a reported bug or software defect. Bug localization typically involves analyzing bug reports or issue descriptions provided by users or testers and correlating them with the relevant portions of the source code. This process can be challenging, especially in large and complex software projects, where codebases can contain thousands or even millions of lines of code. Traditional bug localization methods often rely on heuristics, code metrics, or stack trace analysis, which may not always provide precise results. Ciborowska et al.   (Ciborowska and Damevski, 2023 ) investigated data augmentation techniques to enhance bug localization models. They introduce a pipeline applying token-level operations such as dictionary replacement, insertion, random swapping, and deletion, along with paragraph-level back-translation to bug reports. By employing augmented data to train BERT-based models for bug localization, they demonstrate that these techniques can substantially expand the training data and boost the models’ performance.

Verification. Verification techniques, including prominent methods such as formal verification, hold a pivotal role in the domain of software quality assurance  (Charalambous et al . , 2023 ; Tihanyi et al . , 2023 ) . These techniques validate the correctness of software systems, improving their reliability and security against potential threats. Utilizing mathematical and logical principles in the verification process facilitates thorough error detection and correction before deployment, ensuring stable and secure performance in different operational contexts. Charalambous et al.   (Charalambous et al . , 2023 ) leverage LLMs, particularly the GPT-3.5, in the realm of formal verification. Their approach combines LLMs with bounded model checking (BMC) to automatically repair software based on formal methods, showcasing the model’s capability to understand intricate software structures and generate accurate repairs.

Test automation. Automated testing methodologies offer a comprehensive array of tools and strategies designed for the evaluation of software applications’ accuracy, reliability, and performance. These methodologies encompass various techniques, such as mutation testing  (Khanfir et al . , 2023 ) and fuzzing  (Deng et al . , 2023c , d ) . LLMs have been used for mutation testing, introducing faults to the codebase to assess the effectiveness of test suites in identifying and detecting errors  (Khanfir et al . , 2023 ) . Furthermore, LLMs can aid in fuzzing, generating valid and diverse input programs that help identify vulnerabilities and bugs, particularly in challenging domains like deep learning libraries  (Deng et al . , 2023c ) . By incorporating LLMs into test techniques, software engineers benefit from improved test coverage, reduced manual effort, and enhanced bug detection  (Deng et al . , 2023d ) , leading to more robust and reliable software systems.

Fault localization. Test suites typically include two types of test cases: pass-through test cases and fault-inducing test cases  (Li et al . , 2023l ) . In practice, there are far more pass test cases for faults than fault-inducing test cases, which hinders the effectiveness of program debugging. However, in practice, it is difficult to find fault-inducing test cases. This is because developers first need to find test inputs that trigger program faults, and the search space for such test inputs is huge  (Fraser et al . , 2015 ) . Moreover, developers need to build a test oracle to automatically detect program faults, and building a test oracle is often an undecidable problem  (Ibrahimzada et al . , 2022 ) . Li et al.   (Li et al . , 2023l ) investigated the application of ChatGPT to the task of finding fault-inducing test cases in SE. While recognizing ChatGPT’s potential, they initially observed suboptimal performance in pinpointing these cases, particularly when two versions of a program had similar syntax. The authors identified this as a weakness in ChatGPT’s ability to discern subtle code differences. To enhance its performance, they devised a novel approach blending ChatGPT with difference testing. Leveraging ChatGPT’s strength in inferring expected behavior from erroneous programs, they synthesized programs that amplified subtle code differences. The experimental results reveal that this approach greatly increases the probability of finding the correct fault-inducing test case.

Others. In addition to the six software quality assurance tasks detailed above, LLMs can also be applied to defect detection  (Sun et al . , 2023a ; Wong et al . , 2023 ) , GUI testing  (Yoon et al . , 2023 ; Liu et al . , 2023b ) , static analysis  (Hao et al . , 2023 ; Mohajer et al . , 2023 ) , binary taint analysis  (Liu et al . , 2023f ) , compiler fuzzing  (Quan et al . , 2023 ) , decompilation  (Xu et al . , 2023b ) , invariant prediction  (Pei et al . , 2023 ) , malicious code localization  (Sun et al . , 2023a ) , mobile app crash detection  (Liu et al . , 2023c ) , and resource leak detection  (Wang et al . , 2023g ) .

6.6. How are LLMs used in software maintenance?

Within the context of software maintenance, LLMs have been leveraged for bug prediction, program repair, code review, debugging, and an array of other activities.

Program repair. The goal of automated program repair (APR) is to automatically identify and fix bugs or defects in software  (Zhang et al . , 2023j ) . It involves leveraging automated techniques to analyze buggy code and generate correct patches to address the identified issues. LLMs, such as BERT  (Zhang et al . , 2023c ; Tian et al . , 2023a ) , CodeBERT  (Le-Cong et al . , 2023 ) , CodeT5  (Paul et al . , 2023a ) , Codex  (Fan et al . , 2022 ; Jin et al . , 2023b ; Wu et al . , 2023a ) , PLBART  (Paul et al . , 2023a ; Wu et al . , 2023a ) , T5  (Yuan et al . , 2022 ; Mastropaolo et al . , 2022b ) and GPT series  (Xia and Zhang, 2023b ; Tian et al . , 2023b ; Xia and Zhang, 2023a ; Lajkó et al . , 2022 ; Charalambous et al . , 2023 ; Sobania et al . , 2023 ; Cao et al . , 2023 ) , have shown effectiveness in generating syntactically correct and contextually relevant code. Leveraging LLMs for program repair can achieve competitive performance in generating patches for various types of bugs and defects  (Xia and Zhang, 2023b ) . These models can effectively capture the underlying semantics and dependencies in the code  (Charalambous et al . , 2023 ) , leading to the production of accurate and effective patches  (Zhang et al . , 2023c ; Xia and Zhang, 2023a ) . Moreover, LLMs can be fine-tuned on specific code repair datasets  (Mastropaolo et al . , 2022b ) , further improving their ability to generate high-quality patches for real-world software projects. The application of LLMs in program repair not only accelerates the bug-fixing process but also enables software developers to focus on more complex tasks, leading to enhanced software reliability and maintainability.

Model Baseline Benchmark Metric Date Reference

Codex

GPT-Neo, GPT-J, GPT-NeoX, CodeT5, InCoder

QuixBugs-Python and Java, Defects4J 1.2 and 2.0, ManyBugs

Correct / plausible patches

May 20, 2023 , )

Codex

CodeT5, CodeGen, PLBART, InCoder

Vul4J, VJBench,

Correct / plausible patches

May 29, 2023 , )

ChatGPT

Codex, CodeGen-16B, CodeGen-6B, CodeGen-2B, CodeGen-350M

QuixBugs-Python and Java

Correct / plausible patches

Jan 30, 2023 )

ChatGPT

Codex, CodeBERT, SelfAPR, RewardRepair, Recoder, TBar, CURE, CoCoNuT

QuixBugs-Python and Java, Defects4J 1.2 and 2.0

Correct fixes

Apr 1, 2023 )

In recent research, program repair has emerged as a prevalent application. Among the LLMs, as shown in Table  12 , Codex  (Wu et al . , 2023a ; Xia et al . , 2023 ) and ChatGPT  (Xia and Zhang, 2023a ) have particularly distinguished themselves in the program repair domain. ChatGPT edges ahead due to its inherent interactive design, enabling a continuous feedback loop that yields refined and contextually apt patches   (Xia and Zhang, 2023a , b ) . Such conversational dynamics, coupled with rigorous comparisons across diverse baselines, underscore its superior adaptability and efficiency.

Summarising several key findings from research on LLMs for program repair:

Interactive feedback. Incorporating an interactive feedback loop, as observed with ChatGPT, significantly augments the accuracy of program repair  (Xia and Zhang, 2023a ) . This dynamic interplay between patch generation and validation fosters a deeper understanding of the software’s semantics, leading to more effective repairs.

Domain-specific integration. Merging the capabilities of LLMs with domain-specific knowledge and techniques further enhances their performance. Customized prompts, project-specific fine-tuning, and leveraging SE techniques  (Xia et al . , 2023 ; Wang et al . , 2023c ) can dramatically elevate the efficacy of LLM-driven program repairs.

Comparative analysis. Rigorous evaluation against diverse baselines reveals the versatility and adaptability of LLMs, especially ChatGPT. This wide-ranging comparison not only establishes their superiority but also underscores areas for potential improvement  (Xia and Zhang, 2023b ) .

Code clone detection. Code clones are code samples that are identical to each other  (Baxter et al . , 1998 ; Karampatsis and Sutton, 2020 ) . These code samples can have structural or semantic equivalence  (Svajlenko et al . , 2014 ) . Sharma et al.   (Sharma et al . , 2022 ) investigate BERT’s application in code clone detection through an exploratory study. Analyzing BERT’s attention to code markers, they found that identifiers received higher attention, advocating their use in clone detection. This insight enhanced clone detection across all layers, and the implications extended beyond BERT. The researchers suggest that these findings could lead to the development of smaller models with performance akin to larger ones, thus mitigating computational accessibility issues.

Code review. Code review is a critical quality assurance practice used to inspect, assess, and validate the quality and consistency of software code  (Sghaier and Sahraoui, 2023 ) . Code review aims to identify potential errors, vulnerabilities, and code quality issues, while also improving code maintainability, readability, and scalability. LLMs like BERT  (Sghaier and Sahraoui, 2023 ) , ChatGPT  (Sridhara et al . , 2023 ) , and T5  (Tufano et al . , 2022 ; Li et al . , 2022f ) , trained on massive code repositories, possess the ability to understand and learn the semantics, structures, and contextual information of code  (Zhang et al . , 2022c ) . In the code review process, LLMs assist reviewers in comprehensively understanding code intent and implementation details, enabling more accurate detection of potential issues and errors. Moreover, these models can generate suggestions for code improvements and optimizations, providing valuable insights and guidance to reviewers. By combining the intelligence of LLMs with the expertise of human reviewers, code review becomes more efficient and precise, further enhancing software quality and reliability.

Debugging. Debugging targets identifying, locating, and resolving software defects or errors, commonly known as bugs. The debugging process involves scrutinizing the code, tracing the execution flow, and isolating the root cause of the problem to correct the error effectively. LLMs, such as BERT and other converter-based architectures, excel at utilizing contextual information and natural language understanding. In terms of debugging, LLMs can be used to simulate the scientific debugging process, such as AutoSD proposed by Kang et al.   (Kang et al . , 2023b ) . This model generates hypotheses about code problems and extracts relevant values to identify potential problems. In addition, the SELF-DEBUGGING method proposed by Chen et al.   (Chen et al . , 2023b ) enables LLM to debug its own generated code by learning a small number of presentations and explanations, which effectively improves the accuracy and sampling efficiency of code generation. Using LLMs in debugging not only improves fixing performance by generating competitive fixes but also provides insights into and explanations of the model’s decision-making process, making it an important tool for improving software quality and developer productivity.

Bug reproduction. Bug reports are crucial for software maintenance, allowing users to inform developers of problems encountered while using the software. Therefore, researchers have invested significant resources in automating error playback to speed up the software maintenance process. The success of current automated approaches depends heavily on the characteristics and quality of error reports, as they are limited by manually created schemas and predefined vocabularies. Inspired by the success of the LLMs in natural language understanding, Feng et al.   (Feng and Chen, 2023 ) propose AdbGPT, which utilizes natural language understanding and logical reasoning capabilities of the LLM to extract Steps to Reproduce (S2R) entities from bug reports and guide the bug replay process based on the current graphical user interface (GUI) state. The researchers describe how cue engineering, a small amount of learning, and thought chain reasoning can be utilized to leverage the knowledge of the LLM for automated error replay. This approach is significantly lightweight compared to traditional approaches, which utilize a single LLM to address both phases of S2R entity extraction and guided replay through novel hint engineering.

Duplicate bug report detection. In large software projects, multiple users may encounter and report the same or similar bugs independently, resulting in a proliferation of duplicate bug reports  (Isotani et al . , 2021 ) . Duplicate bug report detection involves analyzing the textual content of bug reports and comparing them to find similarities and redundancies. LLM models, such as BERT  (Isotani et al . , 2021 ) , ChatGPT  (Sridhara et al . , 2023 ) , and other transformer-based architectures, are well-suited for natural language understanding and contextual representation. When applied to this task, LLMs can effectively capture the semantic similarities between bug reports, even in cases with slight variations in language or phrasing. The utilization of LLMs in this context not only enhances efficiency in managing bug reports but also contributes to improving the overall software development and maintenance workflow, reducing redundancy, and ensuring prompt bug resolution  (Zhang et al . , 2023e ) .

Logging. Logging involves the systematic recording of events, messages, or information during the operation of a software application. It provides valuable information for understanding the behavior, performance, and potential problems of an application. Developers strategically insert logging statements throughout the code base to capture relevant data such as variable values, function calls, and error messages. These logs are an important tool for testing  (Chen et al . , 2018 , 2019a ) , debugging  (Satyanarayanan et al . , 1992 ) , monitoring  (Harty et al . , 2021 ; Hasselbring and van Hoorn, 2020 ) , and analyzing the behavior of software operations, helping developers identify and diagnose bugs, performance bottlenecks, and other critical issues. Mastropaolo et al.   (Mastropaolo et al . , 2022b ) introduce LANCE, a system for automatically generating and injecting full log statements into Java code using the T5 model. Sridhara et al.   (Sridhara et al . , 2023 ) present that ChatGPT performs well in the log summarization task, generating aggregated results that are better than the current state of the art.

Sentiment analysis. Sentiment analysis involves determining emotions in text data related to software products, such as user feedback or comments  (Guzman et al . , 2014 ; Jongeling et al . , 2015 ; Islam and Zibran, 2017 ) . The goal of sentiment analysis is to automatically classify the sentiment of the text as positive, negative, or neutral, providing valuable insights into how users perceive and react to software applications. Zhang et al.   (Zhang et al . , 2020b ) conducted a study comparing pre-trained Transformer models like BERT, RoBERTa, XLNet, and ALBERT with existing SA4SE tools across six datasets. The results show that the Transformer models outperformed previous tools by 6.5% to 35.6% in macro/micro-averaged F1-scores, albeit with a trade-off in runtime efficiency. However, this accuracy boost comes with some runtime costs, indicating that while Transformer models are less efficient than existing SA4SE approaches, their runtime cost is not prohibitively high.

Vulnerability repair. Vulnerability repair is the process of identifying and fixing security holes or weaknesses in software applications. Pearce et al.   (Pearce et al . , 2021 ) investigate how to use LLMs for software zero-point vulnerability remediation. The authors explore the challenges faced in designing hints to induce LLMs to generate fixed versions of insecure code. It shows that while the approach is promising, with LLMs capable of fixing 100% of synthetic and hand-created scenarios, a qualitative assessment of the model’s performance on a corpus of historical real-life examples reveals challenges in generating functionally correct code. It is concluded that despite the potential for future targeted LLM applications in this area, challenges remain. For a complete end-to-end system, the full system needs to be evaluated in conjunction with error localization and an improved testbed.

Bug prediction. Gomes et al.   (Gomes et al . , 2023 ) conduct a BERT and TF-IDF (Term Frequency-Inverted Document Frequency) application for long-lived bug prediction in Free/Libre Open-Source Software (FLOSS) study to compare their accuracy in predicting long-lived errors. The results show that BERT-based feature extraction consistently outperforms TF-IDF, demonstrating BERT’s ability to capture the semantic context in error reports. In addition, smaller BERT architectures also show competitive results, highlighting the effectiveness of LLMs in bug prediction. This approach promises to enable more accurate error detection in FLOSS projects and improve software quality and maintenance.

Bug triage. Bug triage is pivotal for effective issue management in large projects. It entails prioritizing bugs and assigning appropriate developers for resolution. While bug triage is straightforward for smaller projects, scalability brings complexity. Finding the right developers with the needed skills becomes intricate as bugs vary in expertise requirements. Some even demand combined skills, amplifying the intricacy. Lee et al.   (Lee et al . , 2022 ) introduce the Light Bug Triage framework (LBT-P). This innovative approach employs BERT to extract semantic information from bug reports. To surmount challenges with LLMs in bug triage, the researchers employ techniques like model compression, knowledge preservation fine-tuning, and a new loss function.

Program merge conflicts repair. Program merge conflicts repair addresses the challenges faced when integrating individual code changes, which can lead to textual or semantic inconsistencies. Zhang et al.   (Zhang et al . , 2022b ) explored the potential of using k-shot learning with LLMs like GPT-3 to automate this repair process. While these models showed promise in resolving semantic conflicts for Microsoft Edge, they didn’t fully replace the benefits of domain-specific languages for certain synthesis patterns.

Tag recommendation. Improper tagging in software Q&A sites can lead to redundancy and other issues such as tag explosion. He et al.   (He et al . , 2022 ) introduced PTM4Tag, a framework utilizing PLMs with a triplet architecture to recommend tags for posts. By separately modeling the title, description, and code snippets of posts, PTM4Tag was compared using five popular PLMs, including BERT, CodeBERT, etc. The SE-specialized CodeBERT showed the best performance, notably surpassing CNN-based methods. An ablation study revealed that while the title was crucial in tag prediction, using all post components achieved the optimal result.

Traceability recovery. Traceability recovery focuses on re-establishing lost or unclear connections between related software artifacts, thereby facilitating coherent software evolution and maintenance  (Gethers et al . , 2011 ) . While traditional methods have offered some solutions, the integration of LLMs has recently emerged as a promising avenue for enhancing the accuracy and efficiency of this task. Zhu et al.   (Zhu et al . , 2022 ) present TRACEFUN, a traceability link recovery framework enhanced with unlabeled data, serves as a testament to this potential, leveraging LLMs to bridge the gap between labeled and unlabeled data, thereby refining traceability link predictions.

Others. In addition to the 14 software maintenance tasks detailed above, LLMs can also be applied to review/commit/code classification  (Ghadhab et al . , 2021 ; Kou et al . , 2023a ; Yang et al . , 2022c ) , log parsing  (Liu et al . , 2024b ; Ma et al . , 2024b ; Yu et al . , 2023b ) , code revision  (Kabir et al . , 2023 ; Wadhwa et al . , 2023 ) , API misuses repair  (Zhang et al . , 2023h ) , Code coverage prediction  (Tufano et al . , 2023 ) , code review explained  (Widyasari et al . , 2023 ) , Code-Review defects repair  (Zhao et al . , 2023c ) , crash bug repair  (Du et al . , 2023a ) , dockerfile Repair  (Henkel et al . , 2021 ) , incivility detection  (Ferreira et al . , 2024 ) , patch correctness prediction  (Zhang et al . , 2024a ) , patch detection  (Tang et al . , 2023a ) , rename Refactoring  (Liu et al . , 2023i ) , technical debt payback  (Mastropaolo et al . , 2023a ) , web test repair  (Xu et al . , 2023a ) , type error repair  (Chow et al . , 2024 ) , etc.

6.7. How are LLMs used in software management?

Research papers describing the utilization of LLMs in software management are still limited.

Effort estimation. Effort estimation refers to the process of predicting the amount of time, resources, and manpower required to complete a software development project. Alhamed et al.   (Alhamed and Storer, 2022 ) conduct an evaluation of the application of BERT in the task of effort estimation for software maintenance. Their study underscores BERT’s potential to offer valuable insights and aid in the decision-making process while also highlighting the associated challenges and need for further investigation.

RQ4 - Summary

7. Threats to Validity

Paper search omission. One key limitation is the possibility of omitting relevant papers during the search process. When gathering papers related to LLM4SE tasks from various publishers, it is possible to miss some papers due to incomplete summarization of keywords for software engineering tasks or LLMs. To address this concern, we adopted a comprehensive approach, combining manual search, automated search, and snowballing techniques, to minimize the risk of missing relevant papers. For manual search, we systematically searched for LLM papers related to SE tasks in six top-tier SE venues and extracted authoritative and comprehensive SE tasks and LLM keywords from these sources. Using these constructed search strings, we conducted automated searches on seven widely used publisher platforms. Additionally, to further augment our search results, we employed both forward and backward snowballing.

Study selection bias. Another limitation is the potential study selection bias. We established inclusion and exclusion criteria to perform the initial selection of papers, followed by manual verification based on quality assessment criteria (QAC). This process involves a combination of automated and manual procedures. The automated selection process may result in mislabeling of papers due to incomplete or ambiguous information in their corresponding BibTeX records. To mitigate this issue, any papers that cannot be confidently excluded are temporarily retained for manual verification. However, the manual verification stage could be influenced by the subjective judgment and biases of the researchers, affecting the accuracy of the quality assessment of papers. To address these concerns, we invited two experienced reviewers in the fields of SE and LLM research to conduct a secondary review of the study selection results. This step aims to enhance the accuracy of our paper selection and minimize the likelihood of omission or misclassification. By using these measures, we strive to ensure that the selected papers are accurate and comprehensive, minimizing the impact of study selection bias and enhancing the reliability of our systematic literature review. We additionally provide a replication package 6 6 6 https://github.com/xinyi-hou/LLM4SE_SLR for others to view.

Empirical knowledge bias. This SLR, along with 395 relevant studies in the LLM4SE field, answers four RQs. This implies the need for manual analysis and understanding of each study. In this process, there may be biases introduced by subjective judgments and experiential knowledge. To minimize potential errors in this regard, we have made the following efforts. Firstly, in determining the RQs, as the first comprehensive overview of the LLM4SE field, we aim to provide a comprehensive interpretation of the current state and trends in this domain. Considering the commonality in AI4SE research, we referred to Yang et al.’s survey on DL4SE  (Yang et al . , 2022b ) during our RQ formulation. We finally decided to focus on LLM types, datasets, tuning, evaluation, and targeted SE tasks. Secondly, for the understanding and analysis of each study, to ensure accurate comprehension of paper details, before addressing each RQ, we extensively reviewed relevant literature to predefine the approximate categories and details for each RQ. For example, in RQ3, based on prior work  (Sahoo et al . , 2024 ; Weyssow et al . , 2023a ; Zhao et al . , 2023d ) , we identified differences between tuning techniques for LLMs and those commonly used in traditional machine learning, such as prompt engineering and PEFT.

8. Challenges and Opportunities

8.1. challenges, 8.1.1. challenges in llm applicability..

Model size and deployment. The size of LLMs has seen a marked increase over time, moving from GPT-1’s 117M parameters to GPT-2’s 1.5B, and further to GPT-3’s 175B parameters  (Yang et al . , 2023b ) . The billions and even trillions  (Moss, 2021 ) of parameters pose significant storage, memory, and computational challenges, which can hinder LLMs in resource-limited and real-time scenarios, especially when developers lack access to powerful GPUs or TPUs. CodeBERT  (Feng et al . , 2020 ) , a pre-trained model proposed in 2019, has a total of 125M parameters, resulting in a large model size of 476 MB. Recently proposed models like Codex  (Chen et al . , 2021b ) and CodeGen  (Nijkamp et al . , 2022a ) , have over 100 billion parameters and over 100 GB in size. The large sizes also require more computational resources. As pointed out by Hugging Face team  (Bekman, 2022 ) , training a 176B model (i.e., BLOOM  (Scao et al . , 2022 ) ) on 1.5 TB datasets consumes an estimated 1,082,880 GPU hours. Similarly, the training of the GPT-NeoX-20B model  (Black et al . , 2022 ) on the Pile dataset  (Gao et al . , 2020 ) , encompassing over 825 GiB of raw text data, requires the deployment of eight NVIDIA A100-SXM4-40GB GPUs. Each of these GPUs comes with a price tag of over 6,000 dollars  (Amazon, 2023b ) , and the training extends to 1,830 hours or approximately 76 days. Moreover, even training a relatively smaller model like the PolyCoder (2.7B)  (Xu et al . , 2022 ) , employing eight NVIDIA RTX 8000 GPUs on a single machine, demands a commitment of around 6 weeks. These examples illustrate the significant computational costs associated with training LLMs. These also have significant energy costs with predictions of massively increased energy usage by LLM-based platforms (Rillig et al . , 2023 ) . Fortunately, there are preliminary studies on reducing code models’ size and improving their efficiency. Shi et al.   (Shi et al . , 2023b ) use a genetic algorithm to compress CodeBERT into only 3 MB and reduce its response latency by more than 70%. Overall, the challenge of increasing model sizes and efficient deployment requires further attention from the communities.

Data dependency. In Section 4 , we provide a detailed analysis of the datasets used in 395 studies and the data preprocessing process, finding that LLMs rely heavily on a large number of different datasets for training and fine-tuning, posing the data dependency challenge. The quality, diversity, and quantity of data directly affect the performance and generalizability of the models. Given their size, LLMs often require large amounts of data to capture nuances, but obtaining such data can be challenging. Relying on limited or biased datasets may cause the model to inherit these biases, resulting in biased or inaccurate predictions. In addition, the domain-specific data required for fine-tuning can be a bottleneck. Due to the relatively short period of time since the emergence of LLM, such large-scale datasets are still relatively rare, especially in the SE domain. Another issue is the risk of benchmark data contamination, where training and test data overlaps could lead to inflated performance metrics  (Zhao et al . , 2021 ) . For instance, Brown et al.   (Brown et al . , 2020 ) discovered a code bug that prevented them from fully removing all overlapping data. They were unable to afford retraining and resorted to using “cleaned” variants of the benchmarks to mitigate the issue. Moreover, there are grave concerns around the inclusion of Personally Identifiable Information (PII) in pre-training corpora. Instances of PII, such as phone numbers and email addresses, have led to privacy leaks during the prompting process  (Kulkarni, 2021 ; El-Mhamdi et al . , 2023 ) .

Ambiguity in code generation. Ambiguity in code generation poses a significant challenge for LLMs in SE tasks. When code intent is unclear (e.g., multiple valid solutions exist), LLMs may struggle to produce accurate and contextually appropriate code. This can lead to syntactically correct but functionally incorrect code, impacting the reliability and effectiveness of LLM-based code generation. Addressing this issue requires exploring techniques to incorporate additional context, domain-specific knowledge, or multi-model ensembles to improve LLMs’ ability to handle ambiguity and generate precise code, ensuring their successful integration into real-world software development processes.

8.1.2. Challenges in LLM Generalizability

The generalizability of LLMs refers to the ability of these models to consistently and accurately perform tasks in different tasks, datasets, or domains outside their training environment. While LLMs are trained on massive amounts of data, ensuring extensive knowledge capture, their performance is sometimes problematic when confronted with specific or idiosyncratic tasks outside the scope of their training. This challenge is particularly evident in the SE domain, where we present the application of LLMs to 85 SE tasks in Section  6 . We observed that the context and semantics of code or documents vary greatly across projects, languages, or domains. Ensuring that the LLM generalizes well requires careful fine-tuning, validation on different datasets, and continuous feedback loops. Without these measures, models run the risk of over-adapting their training data, thus limiting their usefulness in a variety of real-world applications. Recent studies have shown that the LLMs cannot generalize their good performance to inputs after semantic-preserving transformations. For example, Yang et al.   (Yang et al . , 2022a ) show that the performance of CodeBERT on different tasks decreases significantly after substituting the variables’ names in the input.

8.1.3. Challenges in LLM Evaluation

We summarized key evaluation metrics used in different types of SE tasks according to four task types: regression, classification, recommendation, and generation (Section  6 ). We found that when applying LLMs in the software engineering domain, the methodology for evaluating the performance of the models is usually based on a set of predefined metrics. Unfortunately, these metrics (e.g., Accuracy, Recall, or F1-score), while useful in some cases, may not fully capture all the effects and impacts of a model in a given SE task. For example, a model may perform well in terms of accuracy but may fail in processing specific types of inputs or in some specific situations. In addition, these metrics may not capture certain qualitative aspects of the model, such as its interpretability, robustness, or sensitivity to specific types of errors. Some of the most recent studies on LLM4SE tasks  (Agrawal et al . , 2023 ; Hu et al . , 2023 ; Singla, 2023 ; Xu et al . , 2023b ; Yuan et al . , 2023a ; Zhang et al . , 2023l ) , in which researchers customized some evaluation metrics to assess the performance of models, also further illustrate the limitations of some of the widely used evaluation metrics in the field of LLM.

8.1.4. Challenges in LLM Interpretability, Trustworthiness, and Ethical Usage

Interpretability and trustworthiness are crucial aspects in the adoption of LLMs for SE tasks. The challenge lies in understanding the decision-making process of these models, as their black-box nature often makes it difficult to explain why or how a particular code snippet or recommendation is generated. Recent studies  (Yang et al . , 2023e ; Wan et al . , 2022a ; Li et al . , 2022c ) also show that LLM of code trained on low-quality datasets can have vulnerabilities (e.g., generating insecure code). The lack of interpretability and trustworthiness can lead to uncertainty and hesitation among developers, who may be hesitant to rely on LLM-generated code without a clear understanding of how it was derived. Establishing trust in LLMs requires efforts to develop techniques and tools that provide insights into the model’s internal workings and enable developers to comprehend the reasoning behind the generated outputs. Enhancing interpretability and trustworthiness can ultimately promote the widespread adoption of LLMs in SE, leading to more efficient and effective development practices. Many LLMs are not open and it is unclear what data they have been trained on, both quality and representativeness but also ownership of the source training data. This brings into question ownership of the derivative data, e.g., generated designs, code, or test cases. There is also potential for various adversarial attacks e.g. deliberately seeding LLMs with code vulnerabilities so that automatically generated code snippets have subtle but vulnerable aspects.

8.2. Opportunities

8.2.1. optimization of llm4se.

The advent of code-specialized LLMs in SE. The recent emergence of code-specialized LLMs, such as GitHub Copilot  (GitHub, 2023 ) , Amazon’s CodeWhisperer  (Amazon, 2023a ) , OpenAI Code Interpreter  (OpenAI, 2023a ) integrated into ChatGPT, and Code Llama  (Meta, 2023 ) from Meta’s Llama family, signals a transformative phase in LLM4SE. These specialized LLMs, fine-tuned on code-specific datasets, are not merely incremental improvements but paradigm shifts in code understanding, generation, and efficiency. They offer new avenues for automated coding, personalized developer assistance, enhanced code review, and quality assurance, among other tasks, setting the stage for groundbreaking advancements in the SE domain.

Influence and applications of ChatGPT. ChatGPT’s popularity in recent academic research, as evidenced by its large presence in our 395 analyzed papers, emphasizes its escalating influence and acceptance within academia. Researchers’ preference for ChatGPT over other LLMs and LLM-based applications since its release can be attributed to its computational efficiency, adaptability to various tasks, and potential cost-effectiveness  (Laskar et al . , 2023 ; Li et al . , 2023c ; Xia and Zhang, 2023a ) . Its applications extend beyond mere code efficiency and debugging, fostering a collaborative era in development. This paradigm shift signifies a broader move towards integrating advanced natural language understanding into conventional coding practices  (Laskar et al . , 2023 ; Ma et al . , 2023a ; Sadik et al . , 2023 ) . By thoughtfully analyzing these dynamics and trends, we can foresee the potential pathways for LLMs and LLM applications like ChatGPT in shaping more robust, efficient, and collaborative software development procedures. Such insights stand as a promising indication of the future revolutionary impact of LLMs on SE.

Performance enhancement from task-specific model training. The choice between leveraging commercially available pre-trained models like GPT-4 and building upon open-source frameworks such as Llama 2  (Touvron et al . , 2023b ) , Gemma  (Google, 2024 ) , and Mistral  (AI, 2023 ) provides a nuanced set of options for individual or organizational customization in specialized tasks. The distinction between these two approaches lies in the degree of control and customization. Pre-trained models like GPT-4 are generally not designed for large-scale retraining due to their proprietary nature, but they allow quick task-specific adaptations with limited data, thereby minimizing computational overhead. On the other hand, frameworks like LLaMA offer an open-source foundation for more extensive customization. While they come pre-trained, organizations often modify the source code and retrain these models on their own large-scale datasets to meet specialized requirements  (ymcui, 2023 ; hiyouga, 2023 ) . This process is computationally intensive, leading to greater resource allocation and cost, but affords the advantage of creating highly domain-specific models. Hence, the primary trade-off is between the ease of use and quick deployment offered by models like GPT-4, and the deep customization capabilities but higher computational demands associated with open-source frameworks like LLaMA.

Collaborative LLMs. From our review it is evident that LLMs have made significant strides in addressing various SE challenges. However, as the complexity of SE tasks continues to grow, there’s an emerging need for more sophisticated and tailored solutions. One promising direction is the concept of Collaborative LLMs. This approach involves integrating multiple LLMs  (Dong et al . , 2023b ; Zhao et al . , 2023b ) or combining LLMs with specialized machine-learning models  (Ezzini et al . , 2022 ; Zhang et al . , 2022a ) to enhance their efficacy for SE tasks. By harnessing the collective strengths of different models, we believe that the SE community can achieve more precise and efficient outcomes, from code completion to bug detection.

8.2.2. Expanding LLM’s NLP Capabilities in More SE Phases.

Integration of new input forms. In our analysis, we observed that the predominant input forms were code-based datasets and text-based datasets. However, there was a noticeable scarcity of graph-based datasets  (Kolthoff et al . , 2023 ) (Section  4 ). Leveraging new input forms of natural language, such as spoken language, diagrams, and multimodal inputs, presents an opportunity to enhance the LLMs’ ability to understand and process diverse user requirements. Integrating spoken language could improve interactions between developers and models, enabling more natural and context-rich communication. Diagrams can facilitate visual representations of code and requirements, offering a complementary perspective for code generation. Furthermore, multimodal inputs that combine text, audio, and visual cues could offer a more comprehensive context understanding, leading to more accurate and contextually appropriate code generation. Additionally, exploring graph-based datasets could be crucial for addressing complex code scenarios, as graphs capture the structural relationships and dependencies in code, allowing LLMs to better comprehend code interactions and dependencies.

Widening LLM applications across SE phases. We observed a pronounced emphasis on the application of LLMs in software development and maintenance. These areas have undoubtedly benefited from the capabilities of LLMs, leading to enhanced code completion  (Izadi et al . , 2022 ; Li et al . , 2022e ; Liu et al . , 2023l ) , bug detection  (Ciborowska and Damevski, 2023 ; Feng and Chen, 2023 ; Kang et al . , 2023b ) , and other related tasks. The current application of LLMs in requirements engineering, software design, and software management remains relatively sparse. This presents a significant opportunity: by expanding the use of LLMs to these under-explored areas, we can potentially improve how requirements are elicited, how software designs are conceptualized, and how projects are managed.

8.2.3. Enhancing LLMs’ Performance in Existing SE Tasks

Tackling domain-specific challenges. Many SE domains, including safety-critical systems and specific industries, suffer from a scarcity of open-source datasets, hindering the application of LLMs in these specialized areas. Future research can focus on creating domain-specific datasets and fine-tuning LLMs to cater to the unique challenges and intricacies of these fields (Biswas et al., 2020; Sun et al., 2023e). Collaboration with domain experts and practitioners is vital for curating relevant data. Fine-tuning LLMs on such data can enhance their effectiveness and ensure better alignment with the specific requirements of each domain, paving the way for LLMs to address real-world challenges (Bubeck et al., 2023) in diverse software engineering domains (Li et al., 2023l).

Establishing a comprehensive evaluation framework for LLM4SE. The need for a universal, yet adaptable, evaluation framework for LLM4SE is pressing for both the academic and industrial sectors. In academia, such a framework would enable streamlined assessments of LLM performance, efficacy, and limitations, serving as a benchmark to verify the models’ practical readiness. On the industrial side, collaborations with real-world development teams using this framework would yield empirical insights into LLMs’ utility, including their impact on productivity, code quality, and team collaboration, while also revealing challenges such as model biases, misinterpretation of code semantics, and context-specific limitations. Establishing this framework is critical for standardizing assessments and facilitating responsible LLM adoption in both academic research and practical applications (Biswas et al., 2020; Gong et al., 2023).
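As one concrete building block such a framework could standardize, the sketch below implements the widely used unbiased pass@k estimator for functional correctness (Chen et al., 2021b); a full framework would combine metrics of this kind with human-centered and process-level measures.

```python
# Minimal sketch: the unbiased pass@k estimator used in functional-correctness
# benchmarks (n samples per problem, c of which pass the tests, sampling budget k).
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:          # every size-k sample must contain at least one correct solution
        return 1.0
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# Example: 200 generations per problem, 13 pass the tests, report pass@10.
print(round(pass_at_k(200, 13, 10), 4))
```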

8.3. Roadmap

We provide a roadmap for future development in leveraging Large Language Models for Software Engineering (LLM4SE), with an additional high-level perspective that acknowledges the reciprocal relationship and emerging exploration of Software Engineering for Large Language Models (SE4LLM).

Automated coding, development and personalized developer assistance. The pursuit of automation in coding encompasses the auto-generation of code snippets, bug fixes, system optimization, and the creation of intelligent, personalized developer assistance that is context-aware and adaptable to individual needs. LLMs’ generative capabilities can be leveraged to help developers better understand requirements and generate syntactically and semantically correct code, thereby accelerating development cycles and improving software quality. Leveraging LLMs’ natural language processing capabilities to build context-aware tools allows for interaction with developers in a more intuitive and responsive manner. Additionally, fine-tuning LLMs for specific coding tasks and developer assistance can further enhance their accuracy and efficiency, tailoring the automation to the unique demands of different projects and individuals.

Advancing testing and analysis. The inclusion of LLMs in software testing opens up avenues for enhanced test case generation, bug classification, and defect prediction, thereby improving the precision and efficiency of the software testing process. For instance, LLMs can potentially be fine-tuned to a project’s specific requirements to generate customized test cases, which increases the likelihood of early detection of subtle bugs or security vulnerabilities. Furthermore, the integration of LLMs with traditional SE techniques, including both static and dynamic program analysis, presents a compelling direction for more rigorous code analysis. The potential for utilizing LLMs in formal analysis methodologies, including formal verification, is another area that merits investigation (Charalambous et al., 2023). These advancements not only facilitate the early discovery of complex errors but also lead to reduced development costs and quicker time-to-market, ultimately contributing to the robustness and reliability of software products.
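A lightweight way to pilot LLM-based test generation is to assemble a prompt directly from the source of the function under test, as in the sketch below. The prompt wording is illustrative and `query_llm` is a placeholder for any model backend; generated tests would still need to be executed in a sandbox and vetted before use.

```python
# Minimal sketch: building a unit-test-generation prompt from a function's source.
# The prompt format is illustrative; in practice the LLM output (pytest code)
# must be run in isolation and reviewed before being trusted.
import inspect

def make_test_prompt(func) -> str:
    source = inspect.getsource(func)
    return ("Write pytest unit tests for the following function. "
            "Cover boundary values and invalid inputs.\n\n" + source)

def slugify(text: str) -> str:
    """Example function under test."""
    return "-".join(text.lower().split())

prompt = make_test_prompt(slugify)
print(prompt)      # in practice: tests = query_llm(prompt); save to a file; run pytest
```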

Integrating programming knowledge into LLMs. One critical future direction lies in the integration of specialized code representation methods and programming domain knowledge into LLM4SE (Wan et al., 2022b; Ma et al., 2023b). This integration aims to enhance the capability of LLMs to generate code that is not only functionally accurate but also secure and compliant with programming standards. Leveraging advanced techniques in code embedding, syntax tree parsing, and semantic analysis could significantly refine the generation capabilities of LLMs. Moreover, embedding domain-specific rules and best practices into these models would enable them to auto-generate code that adheres to industry- or language-specific guidelines for security and style.
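A simple way to begin embedding such rules is to validate generated code after the fact, as in the sketch below, which checks syntactic validity and flags calls violating one illustrative security rule; production pipelines would encode organization- or language-specific guidelines rather than this toy rule set.

```python
# Minimal sketch: post-generation checks that inject programming knowledge into an
# LLM4SE pipeline -- syntactic validity plus one illustrative security rule
# (no eval/exec calls). Standard library only; the rule set is a placeholder.
import ast

BANNED_CALLS = {"eval", "exec"}

def check_generated_code(code: str) -> list[str]:
    """Return a list of violations; an empty list means the snippet passes."""
    try:
        tree = ast.parse(code)
    except SyntaxError as err:
        return [f"syntax error: {err}"]
    problems = []
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in BANNED_CALLS):
            problems.append(f"banned call '{node.func.id}' at line {node.lineno}")
    return problems

print(check_generated_code("x = eval(input())"))   # ["banned call 'eval' at line 1"]
```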

Enhanced code review and quality assurance. LLMs can transform the code review process by analyzing code context, performing intelligent comparisons, and offering insights that go beyond traditional automated review systems. Applying fine-tuned LLMs to code review can allow for more precise error detection and tailored feedback, offering a more nuanced understanding of code quality and potential improvements.
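A minimal starting point, sketched below, is to assemble a review request from a unified diff of the change using the standard library; the prompt wording is illustrative, and a deployed tool would add repository context, style guides, and the history of prior reviews.

```python
# Minimal sketch: preparing an LLM code-review request from two versions of a file.
# `difflib` is standard library; the prompt text and the downstream LLM call are placeholders.
import difflib

def review_prompt(old: str, new: str, filename: str) -> str:
    diff = "".join(difflib.unified_diff(
        old.splitlines(keepends=True), new.splitlines(keepends=True),
        fromfile=f"a/{filename}", tofile=f"b/{filename}"))
    return ("Review the following change. Point out correctness, security, and style "
            "issues, and suggest concrete improvements.\n\n" + diff)

old = "def area(r):\n    return 3.14 * r * r\n"
new = "import math\n\ndef area(r):\n    return math.pi * r ** 2\n"
print(review_prompt(old, new, "geometry.py"))
```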

Extracting insights from data mining. LLMs can play a critical role in mining insights from platforms such as GitHub, Stack Overflow, and app stores. Through application to tasks such as requirement extraction, traceability, validation, and various types of mining (tag, app, and developer-based), LLMs can provide valuable insights that inform development strategies and decision-making. By automating and enhancing these mining tasks, LLMs contribute to a deeper understanding of user needs, emerging trends, and the efficiency of development practices.

Empowering predictive analytics and decision support. Leveraging LLMs for effort cost prediction, software classification, code classification, incident detection, and software quality evaluation can strengthen data-driven insights and predictive analytics, empowering organizations to make informed decisions throughout the development lifecycle. LLMs’ ability to model and analyze vast amounts of data enables more accurate forecasts of project timelines, resource needs, and potential risks.

LLMs in software security. The growing adoption of LLM4SE presents both unparalleled opportunities and new challenges in the domain of software security. On the one hand, LLMs offer promising solutions for automated security audits, compliance verification, and vulnerability detection. These models can potentially be leveraged for automated code reviews that ensure compliance with industry standards and legal regulations, while also identifying potential security vulnerabilities (Ferrag et al., 2023; Ahmad et al., 2023; Feng and Chen, 2023; Pearce et al., 2023; Deng et al., 2023b; Happe and Cito, 2023). For instance, Ferrag et al. (Ferrag et al., 2023) showcased the efficacy of LLMs in cyber reasoning tasks related to software security. On the other hand, the usage of LLMs introduces novel security concerns: their complexity makes them susceptible to attacks, demanding novel strategies to fortify the models themselves (Wu et al., 2023e; Rao et al., 2023b; Elizondo, 2023; Ramly, 2023; Deng et al., 2023a; Liu et al., 2023d). As an example, Wu et al. (Wu et al., 2023e) delve into methods to secure LLMs against jailbreak attacks. An intriguing direction for future research lies in enabling LLMs to automatically identify and rectify their own vulnerabilities, for example by generating self-applied patches to their underlying code, thereby enhancing their inherent security rather than relying solely on application-layer restrictions. Given this landscape, future research should adopt a balanced approach: exploiting LLMs to automate and enhance existing software security protocols while concurrently developing techniques to secure the LLMs themselves. This dual focus is crucial for fully realizing the potential of LLMs in enhancing the security and compliance assurance of software systems.

Software Engineering for Large Language Models (SE4LLM). As the capabilities and complexities of LLMs continue to expand, there arises a reciprocal need for specialized SE practices tailored for the development, optimization, and maintenance of these models. SE4LLM encompasses a range of challenges and opportunities, including the design of scalable and maintainable architectures, the creation of efficient training algorithms, the development of rigorous testing frameworks for model robustness and fairness, and the implementation of ethical guidelines and compliance mechanisms. The convergence of SE with LLMs not only facilitates the growth of more sophisticated and adaptable models but also opens up new avenues for interdisciplinary research and innovation, bringing together the expertise of both the AI and SE communities. This aligns with a broader vision where SE practices become an integral part of the lifecycle of LLMs, ensuring their robustness, efficiency, and ethical alignment with societal values.

9. Conclusion

LLMs are bringing significant changes to the field of SE. The potential of these models to handle complex tasks can fundamentally reshape many SE practices and tools. In this SLR, we analyzed the emerging utilization of LLMs for software engineering, encompassing papers published since the inception of the first LLM (BERT). We examined the diverse LLMs that have been employed in SE tasks and explored their distinct features and applications (RQ1). We then investigated the processes involved in data collection, preprocessing, and usage, emphasizing the significant role well-curated datasets play in the successful application of LLMs to solve SE tasks (RQ2). Following this, we investigated the various strategies utilized to optimize and assess the performance of LLMs for SE tasks (RQ3). Lastly, we reviewed the wide range of SE tasks where LLMs have been applied to date, shedding light on the practical contributions LLMs have made (RQ4). We summarized key existing challenges of LLM4SE and provided a research roadmap outlining promising future research directions.

  • Agarwal et al . (2024) Mayank Agarwal, Yikang Shen, Bailin Wang, Yoon Kim, and Jie Chen. 2024. Structured Code Representations Enable Data-Efficient Adaptation of Code Language Models. arXiv preprint arXiv:2401.10716 (2024).
  • Aghajani et al . (2020) Emad Aghajani, Csaba Nagy, Mario Linares-Vásquez, Laura Moreno, Gabriele Bavota, Michele Lanza, and David C Shepherd. 2020. Software documentation: the practitioners’ perspective. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering . 590–601.
  • Agrawal et al . (2023) Lakshya Agrawal, Aditya Kanade, Navin Goyal, Shuvendu K Lahiri, and Sriram Rajamani. 2023. Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context. In Thirty-seventh Conference on Neural Information Processing Systems .
  • Ahmad et al . (2023) Baleegh Ahmad, Shailja Thakur, Benjamin Tan, Ramesh Karri, and Hammond Pearce. 2023. Fixing Hardware Security Bugs with Large Language Models. arXiv preprint arXiv:2302.01215 (2023).
  • Ahmad et al . (2021) Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified pre-training for program understanding and generation. arXiv preprint arXiv:2103.06333 (2021).
  • Ahmed et al . (2023) Toufique Ahmed, Kunal Suresh Pai, Premkumar Devanbu, and Earl T Barr. 2023. Improving Few-Shot Prompts with Relevant Static Analysis Products. arXiv preprint arXiv:2304.06815 (2023).
  • Ahmed et al . (2024) Toufique Ahmed, Kunal Suresh Pai, Premkumar Devanbu, and Earl T. Barr. 2024. Automatic Semantic Augmentation of Language Model Prompts (for Code Summarization). arXiv:2304.06815 [cs.SE]
  • AI (2023) Mistral AI. 2023. Mistral. https://mistral.ai/ .
  • Al-Kaswan et al . (2023) Ali Al-Kaswan, Toufique Ahmed, Maliheh Izadi, Anand Ashok Sawant, Premkumar Devanbu, and Arie van Deursen. 2023. Extending Source Code Pre-Trained Language Models to Summarise Decompiled Binaries. In 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 260–271.
  • Alam et al . (2023) Ajmain I Alam, Palash R Roy, Farouq Al-Omari, Chanchal K Roy, Banani Roy, and Kevin A Schneider. 2023. GPTCloneBench: A comprehensive benchmark of semantic clones and cross-language clones using GPT-3 model and SemanticCloneBench. In 2023 IEEE International Conference on Software Maintenance and Evolution (ICSME) . IEEE, 1–13.
  • Alhamed and Storer (2022) Mohammed Alhamed and Tim Storer. 2022. Evaluation of Context-Aware Language Models and Experts for Effort Estimation of Software Maintenance Issues. In 2022 IEEE International Conference on Software Maintenance and Evolution (ICSME) . IEEE, 129–138.
  • Allen (1970) Frances E Allen. 1970. Control flow analysis. ACM Sigplan Notices 5, 7 (1970), 1–19.
  • Alur et al . (2013) Rajeev Alur, Rastislav Bodik, Garvit Juniwal, Milo MK Martin, Mukund Raghothaman, Sanjit A Seshia, Rishabh Singh, Armando Solar-Lezama, Emina Torlak, and Abhishek Udupa. 2013. Syntax-guided synthesis . IEEE.
  • Amann et al . (2016) Sven Amann, Sebastian Proksch, Sarah Nadi, and Mira Mezini. 2016. A study of visual studio usage in practice. In 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER) , Vol. 1. IEEE, 124–134.
  • Amazon (2023a) Amazon. 2023a. Amazon CodeWhisperer. https://aws.amazon.com/cn/codewhisperer/ .
  • Amazon (2023b) Amazon. 2023b. NVIDIA Tesla A100 Ampere 40 GB Graphics Card - PCIe 4.0 - Dual Slot. https://www.amazon.com/NVIDIA-Tesla-A100-Ampere-Graphics/dp/B0BGZJ27SL .
  • Anon (2022) M Anon. 2022. National vulnerability database. https://www.nist.gov/programs-projects/national-vulnerability-database-nvd .
  • Anthropic (2023) Anthropic. 2023. Claude. https://www.anthropic.com/claude .
  • Arakelyan et al . (2023) Shushan Arakelyan, Rocktim Jyoti Das, Yi Mao, and Xiang Ren. 2023. Exploring Distributional Shifts in Large Language Models for Code Analysis. arXiv preprint arXiv:2303.09128 (2023).
  • Azaria et al . (2023) Amos Azaria, Rina Azoulay, and Shulamit Reches. 2023. ChatGPT is a Remarkable Tool–For Experts. arXiv preprint arXiv:2306.03102 (2023).
  • Bairi et al . (2023) Ramakrishna Bairi, Atharv Sonwane, Aditya Kanade, Arun Iyer, Suresh Parthasarathy, Sriram Rajamani, B Ashok, Shashank Shet, et al . 2023. Codeplan: Repository-level coding using llms and planning. arXiv preprint arXiv:2309.12499 (2023).
  • Bareiß et al . (2022) Patrick Bareiß, Beatriz Souza, Marcelo d’Amorim, and Michael Pradel. 2022. Code generation tools (almost) for free? a study of few-shot, pre-trained language models on code. arXiv preprint arXiv:2206.01335 (2022).
  • Bashroush et al . (2017) Rabih Bashroush, Muhammad Garba, Rick Rabiser, Iris Groher, and Goetz Botterweck. 2017. Case tool support for variability management in software product lines. ACM Computing Surveys (CSUR) 50, 1 (2017), 1–45.
  • Baxter et al . (1998) Ira D Baxter, Andrew Yahin, Leonardo Moura, Marcelo Sant’Anna, and Lorraine Bier. 1998. Clone detection using abstract syntax trees. In Proceedings. International Conference on Software Maintenance (Cat. No. 98CB36272) . IEEE, 368–377.
  • Bekman (2022) Stas Bekman. 2022. The Technology Behind BLOOM Training. https://huggingface.co/blog/bloom-megatron-deepspeed .
  • Biswas et al . (2020) Eeshita Biswas, Mehmet Efruz Karabulut, Lori Pollock, and K Vijay-Shanker. 2020. Achieving reliable sentiment analysis in the software engineering domain using bert. In 2020 IEEE International conference on software maintenance and evolution (ICSME) . IEEE, 162–173.
  • Black et al . (2022) Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, et al . 2022. Gpt-neox-20b: An open-source autoregressive language model. arXiv preprint arXiv:2204.06745 (2022).
  • Black et al . (2021) Sid Black, Gao Leo, Phil Wang, Connor Leahy, and Stella Biderman. 2021. GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow. https://doi.org/10.5281/zenodo.5297715
  • Brown et al . (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al . 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
  • Bubeck et al . (2023) Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al . 2023. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712 (2023).
  • Bui et al . (2023) Nghi DQ Bui, Hung Le, Yue Wang, Junnan Li, Akhilesh Deepak Gotmare, and Steven CH Hoi. 2023. CodeTF: One-stop Transformer Library for State-of-the-art Code LLM. arXiv preprint arXiv:2306.00029 (2023).
  • Buscemi (2023) Alessio Buscemi. 2023. A Comparative Study of Code Generation using ChatGPT 3.5 across 10 Programming Languages. arXiv preprint arXiv:2308.04477 (2023).
  • Cao et al . (2023) Jialun Cao, Meiziniu Li, Ming Wen, and Shing-chi Cheung. 2023. A study on prompt design, advantages and limitations of chatgpt for deep learning program repair. arXiv preprint arXiv:2304.08191 (2023).
  • Cassano et al . (2023) Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney, Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q Feldman, et al . 2023. MultiPL-E: a scalable and polyglot approach to benchmarking neural code generation. IEEE Transactions on Software Engineering (2023).
  • Chan et al . (2023) Aaron Chan, Anant Kharkar, Roshanak Zilouchian Moghaddam, Yevhen Mohylevskyy, Alec Helyar, Eslam Kamal, Mohamed Elkamhawy, and Neel Sundaresan. 2023. Transformer-based Vulnerability Detection in Code at EditTime: Zero-shot, Few-shot, or Fine-tuning? arXiv preprint arXiv:2306.01754 (2023).
  • Chang et al . (2023) Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Kaijie Zhu, Hao Chen, Linyi Yang, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al . 2023. A survey on evaluation of large language models. arXiv preprint arXiv:2307.03109 (2023).
  • Charalambous et al . (2023) Yiannis Charalambous, Norbert Tihanyi, Ridhi Jain, Youcheng Sun, Mohamed Amine Ferrag, and Lucas C Cordeiro. 2023. A New Era in Software Security: Towards Self-Healing Software via Large Language Models and Formal Verification. arXiv preprint arXiv:2305.14752 (2023).
  • Chen et al . (2023c) Angelica Chen, Jérémy Scheurer, Tomasz Korbak, Jon Ander Campos, Jun Shern Chan, Samuel R Bowman, Kyunghyun Cho, and Ethan Perez. 2023c. Improving code generation by training with natural language feedback. arXiv preprint arXiv:2303.16749 (2023).
  • Chen et al . (2018) Boyuan Chen, Jian Song, Peng Xu, Xing Hu, and Zhen Ming Jiang. 2018. An automated approach to estimating code coverage measures via execution logs. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering . 305–316.
  • Chen et al . (2022) Fuxiang Chen, Fatemeh H Fard, David Lo, and Timofey Bryksin. 2022. On the transferability of pre-trained language models for low-resource programming languages. In Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension . 401–412.
  • Chen et al . (2019a) Jinfu Chen, Weiyi Shang, Ahmed E Hassan, Yong Wang, and Jiangbin Lin. 2019a. An experience report of generating load tests using log-recovered workloads at varying granularities of user behaviour. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE) . IEEE, 669–681.
  • Chen et al . (2019b) Long Chen, Wei Ye, and Shikun Zhang. 2019b. Capturing source code semantics via tree-based convolution over API-enhanced AST. In Proceedings of the 16th ACM International Conference on Computing Frontiers . 174–182.
  • Chen et al . (2021b) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al . 2021b. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
  • Chen et al . (2023d) Meng Chen, Hongyu Zhang, Chengcheng Wan, Zhao Wei, Yong Xu, Juhong Wang, and Xiaodong Gu. 2023d. On the effectiveness of large language models in domain-specific code generation. arXiv preprint arXiv:2312.01639 (2023).
  • Chen et al . (2023b) Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023b. Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128 (2023).
  • Chen et al . (2017) Xinyun Chen, Chang Liu, and Dawn Song. 2017. Towards synthesizing complex programs from input-output examples. arXiv preprint arXiv:1706.01284 (2017).
  • Chen et al . (2021a) Xinyun Chen, Dawn Song, and Yuandong Tian. 2021a. Latent execution for neural program synthesis beyond domain-specific languages. Advances in Neural Information Processing Systems 34 (2021), 22196–22208.
  • Chen et al . (2023a) Yizheng Chen, Zhoujie Ding, Xinyun Chen, and David Wagner. 2023a. DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection. arXiv preprint arXiv:2304.00409 (2023).
  • Chen et al . (2024) Yujia Chen, Cuiyun Gao, Muyijie Zhu, Qing Liao, Yong Wang, and Guoai Xu. 2024. APIGen: Generative API Method Recommendation. arXiv preprint arXiv:2401.15843 (2024).
  • Cheng et al . (2023) Liying Cheng, Xingxuan Li, and Lidong Bing. 2023. Is GPT-4 a Good Data Analyst? arXiv preprint arXiv:2305.15038 (2023).
  • Chiang et al . (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al . 2023. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. https://vicuna.lmsys.org (accessed 14 April 2023).
  • Chochlov et al . (2022) Muslim Chochlov, Gul Aftab Ahmed, James Vincent Patten, Guoxian Lu, Wei Hou, David Gregg, and Jim Buckley. 2022. Using a Nearest-Neighbour, BERT-Based Approach for Scalable Clone Detection. In 2022 IEEE International Conference on Software Maintenance and Evolution (ICSME) . IEEE, 582–591.
  • Chow et al . (2024) Yiu Wai Chow, Luca Di Grazia, and Michael Pradel. 2024. PyTy: Repairing Static Type Errors in Python. arXiv preprint arXiv:2401.06619 (2024).
  • Ciborowska and Damevski (2022) Agnieszka Ciborowska and Kostadin Damevski. 2022. Fast changeset-based bug localization with BERT. In Proceedings of the 44th International Conference on Software Engineering . 946–957.
  • Ciborowska and Damevski (2023) Agnieszka Ciborowska and Kostadin Damevski. 2023. Too Few Bug Reports? Exploring Data Augmentation for Improved Changeset-based Bug Localization. arXiv preprint arXiv:2305.16430 (2023).
  • Ciniselli et al . (2021) Matteo Ciniselli, Nathan Cooper, Luca Pascarella, Antonio Mastropaolo, Emad Aghajani, Denys Poshyvanyk, Massimiliano Di Penta, and Gabriele Bavota. 2021. An empirical study on the usage of transformer models for code completion. IEEE Transactions on Software Engineering 48, 12 (2021), 4818–4837.
  • Clement et al . (2020) Colin B Clement, Dawn Drain, Jonathan Timcheck, Alexey Svyatkovskiy, and Neel Sundaresan. 2020. PyMT5: multi-mode translation of natural language and Python code with transformers. arXiv preprint arXiv:2010.03150 (2020).
  • Dakhel et al . (2023) Arghavan Moradi Dakhel, Amin Nikanjam, Vahid Majdinasab, Foutse Khomh, and Michel C Desmarais. 2023. Effective test generation using pre-trained large language models and mutation testing. arXiv preprint arXiv:2308.16557 (2023).
  • Deligiannis et al . (2023) Pantazis Deligiannis, Akash Lal, Nikita Mehrotra, and Aseem Rastogi. 2023. Fixing rust compilation errors using llms. arXiv preprint arXiv:2308.05177 (2023).
  • Deng et al . (2023a) Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying Zhang, Zefeng Li, Haoyu Wang, Tianwei Zhang, and Yang Liu. 2023a. Jailbreaker: Automated Jailbreak Across Multiple Large Language Model Chatbots. arXiv preprint arXiv:2307.08715 (2023).
  • Deng et al . (2023b) Gelei Deng, Yi Liu, Víctor Mayoral-Vilches, Peng Liu, Yuekang Li, Yuan Xu, Tianwei Zhang, Yang Liu, Martin Pinzger, and Stefan Rass. 2023b. PentestGPT: An LLM-empowered Automatic Penetration Testing Tool. arXiv preprint arXiv:2308.06782 (2023).
  • Deng et al . (2023c) Yinlin Deng, Chunqiu Steven Xia, Haoran Peng, Chenyuan Yang, and Lingming Zhang. 2023c. Large Language Models are Zero-Shot Fuzzers: Fuzzing Deep-Learning Libraries via Large Language Models. In Proceedings of the 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2023) .
  • Deng et al . (2023d) Yinlin Deng, Chunqiu Steven Xia, Chenyuan Yang, Shizhuo Dylan Zhang, Shujing Yang, and Lingming Zhang. 2023d. Large language models are edge-case fuzzers: Testing deep learning libraries via fuzzgpt. arXiv preprint arXiv:2304.02014 (2023).
  • Devlin et al . (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
  • Di Rocco et al . (2021) Juri Di Rocco, Davide Di Ruscio, Claudio Di Sipio, Phuong T Nguyen, and Riccardo Rubei. 2021. Development of recommendation systems for software engineering: the CROSSMINER experience. Empirical Software Engineering 26, 4 (2021), 69.
  • Dibia et al . (2022) Victor Dibia, Adam Fourney, Gagan Bansal, Forough Poursabzi-Sangdeh, Han Liu, and Saleema Amershi. 2022. Aligning Offline Metrics and Human Judgments of Value of AI-Pair Programmers. arXiv preprint arXiv:2210.16494 (2022).
  • Ding et al . (2023) Hantian Ding, Varun Kumar, Yuchen Tian, Zijian Wang, Rob Kwiatkowski, Xiaopeng Li, Murali Krishna Ramanathan, Baishakhi Ray, Parminder Bhatia, Sudipta Sengupta, et al . 2023. A static evaluation of code completion by large language models. arXiv preprint arXiv:2306.03203 (2023).
  • Dinh et al . (2023) Tuan Dinh, Jinman Zhao, Samson Tan, Renato Negrinho, Leonard Lausen, Sheng Zha, and George Karypis. 2023. Large Language Models of Code Fail at Completing Code with Potential Bugs. arXiv preprint arXiv:2306.03438 (2023).
  • Dinh et al . (2024) Tuan Dinh, Jinman Zhao, Samson Tan, Renato Negrinho, Leonard Lausen, Sheng Zha, and George Karypis. 2024. Large language models of code fail at completing code with potential bugs. Advances in Neural Information Processing Systems 36 (2024).
  • Döderlein et al . (2022) Jean-Baptiste Döderlein, Mathieu Acher, Djamel Eddine Khelladi, and Benoit Combemale. 2022. Piloting Copilot and Codex: Hot Temperature, Cold Prompts, or Black Magic? arXiv preprint arXiv:2210.14699 (2022).
  • Dong et al . (2023c) Guanting Dong, Hongyi Yuan, Keming Lu, Chengpeng Li, Mingfeng Xue, Dayiheng Liu, Wei Wang, Zheng Yuan, Chang Zhou, and Jingren Zhou. 2023c. How abilities in large language models are affected by supervised fine-tuning data composition. arXiv preprint arXiv:2310.05492 (2023).
  • Dong et al . (2023a) Yihong Dong, Jiazheng Ding, Xue Jiang, Ge Li, Zhuo Li, and Zhi Jin. 2023a. Codescore: Evaluating code generation by learning code execution. arXiv preprint arXiv:2301.09043 (2023).
  • Dong et al . (2023b) Yihong Dong, Xue Jiang, Zhi Jin, and Ge Li. 2023b. Self-collaboration Code Generation via ChatGPT. arXiv preprint arXiv:2304.07590 (2023).
  • Dou et al . (2023) Shihan Dou, Junjie Shan, Haoxiang Jia, Wenhao Deng, Zhiheng Xi, Wei He, Yueming Wu, Tao Gui, Yang Liu, and Xuanjing Huang. 2023. Towards Understanding the Capability of Large Language Models on Code Clone Detection: A Survey. arXiv preprint arXiv:2308.01191 (2023).
  • Du et al . (2023a) Xueying Du, Mingwei Liu, Juntao Li, Hanlin Wang, Xin Peng, and Yiling Lou. 2023a. Resolving Crash Bugs via Large Language Models: An Empirical Study. arXiv preprint arXiv:2312.10448 (2023).
  • Du et al . (2023b) Xueying Du, Mingwei Liu, Kaixin Wang, Hanlin Wang, Junwei Liu, Yixuan Chen, Jiayi Feng, Chaofeng Sha, Xin Peng, and Yiling Lou. 2023b. ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation. arXiv preprint arXiv:2308.01861 (2023).
  • Du and Yu (2023) Yali Du and Zhongxing Yu. 2023. Pre-training code representation with semantic flow graph for effective bug localization. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering . 579–591.
  • Eghbali and Pradel (2024) Aryaz Eghbali and Michael Pradel. 2024. De-Hallucinator: Iterative Grounding for LLM-Based Code Completion. arXiv preprint arXiv:2401.01701 (2024).
  • El-Hajjami et al . (2023) Abdelkarim El-Hajjami, Nicolas Fafin, and Camille Salinesi. 2023. Which AI Technique Is Better to Classify Requirements? An Experiment with SVM, LSTM, and ChatGPT. arXiv preprint arXiv:2311.11547 (2023).
  • El-Mhamdi et al . (2023) El-Mahdi El-Mhamdi, Sadegh Farhadkhani, Rachid Guerraoui, Nirupam Gupta, Lê-Nguyên Hoang, Rafael Pinot, Sébastien Rouault, and John Stephan. 2023. On the Impossible Safety of Large AI Models. arXiv:2209.15259 [cs.LG]
  • Elizondo (2023) Andre Elizondo. 2023. LangKit: Making Large Language Models Safe and Responsible. https://whylabs.ai/blog/posts/langkit-making-large-language-models-safe-and-responsible .
  • Endres et al . (2023) Madeline Endres, Sarah Fakhoury, Saikat Chakraborty, and Shuvendu K Lahiri. 2023. Formalizing Natural Language Intent into Program Specifications via Large Language Models. arXiv preprint arXiv:2310.01831 (2023).
  • Ezzini et al . (2022) Saad Ezzini, Sallam Abualhaija, Chetan Arora, and Mehrdad Sabetzadeh. 2022. Automated handling of anaphoric ambiguity in requirements: a multi-solution study. In Proceedings of the 44th International Conference on Software Engineering . 187–199.
  • Fakhoury et al . (2023) Sarah Fakhoury, Saikat Chakraborty, Madan Musuvathi, and Shuvendu K Lahiri. 2023. Towards Generating Functionally Correct Code Edits from Natural Language Issue Descriptions. arXiv preprint arXiv:2304.03816 (2023).
  • Fan et al . (2023b) Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M Zhang. 2023b. Large language models for software engineering: Survey and open problems. arXiv preprint arXiv:2310.03533 (2023).
  • Fan et al . (2024) Guodong Fan, Shizhan Chen, Cuiyun Gao, Jianmao Xiao, Tao Zhang, and Zhiyong Feng. 2024. Rapid: Zero-shot Domain Adaptation for Code Search with Pre-trained Models. ACM Transactions on Software Engineering and Methodology (2024).
  • Fan et al . (2023c) Wenqi Fan, Zihuai Zhao, Jiatong Li, Yunqing Liu, Xiaowei Mei, Yiqi Wang, Jiliang Tang, and Qing Li. 2023c. Recommender systems in the era of large language models (llms). arXiv preprint arXiv:2307.02046 (2023).
  • Fan et al . (2023a) Zhiyu Fan, Xiang Gao, Martin Mirchev, Abhik Roychoudhury, and Shin Hwei Tan. 2023a. Automated repair of programs from large language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE) . IEEE, 1469–1481.
  • Fan et al . (2022) Zhiyu Fan, Xiang Gao, Abhik Roychoudhury, and Shin Hwei Tan. 2022. Automated Repair of Programs from Large Language Models. arXiv preprint arXiv:2205.10583 (2022).
  • Fatima et al . (2022) Sakina Fatima, Taher A Ghaleb, and Lionel Briand. 2022. Flakify: A black-box, language model-based predictor for flaky tests. IEEE Transactions on Software Engineering (2022).
  • Feng and Chen (2023) Sidong Feng and Chunyang Chen. 2023. Prompting Is All Your Need: Automated Android Bug Replay with Large Language Models. arXiv preprint arXiv:2306.01987 (2023).
  • Feng et al . (2020) Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al . 2020. Codebert: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155 (2020).
  • Ferrag et al . (2023) Mohamed Amine Ferrag, Ammar Battah, Norbert Tihanyi, Merouane Debbah, Thierry Lestable, and Lucas C Cordeiro. 2023. SecureFalcon: The Next Cyber Reasoning System for Cyber Security. arXiv preprint arXiv:2307.06616 (2023).
  • Ferreira et al . (2024) Isabella Ferreira, Ahlaam Rafiq, and Jinghui Cheng. 2024. Incivility detection in open source code review and issue discussions. Journal of Systems and Software 209 (2024), 111935.
  • First et al . (2023) Emily First, Markus Rabe, Talia Ringer, and Yuriy Brun. 2023. Baldur: Whole-proof generation and repair with large language models. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering . 1229–1241.
  • Fraser et al . (2015) Gordon Fraser, Matt Staats, Phil McMinn, Andrea Arcuri, and Frank Padberg. 2015. Does automated unit test generation really help software testers? a controlled empirical study. ACM Transactions on Software Engineering and Methodology (TOSEM) 24, 4 (2015), 1–49.
  • Fried et al . (2022) Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer, and Mike Lewis. 2022. Incoder: A generative model for code infilling and synthesis. arXiv preprint arXiv:2204.05999 (2022).
  • Fu and Tantithamthavorn (2022) Michael Fu and Chakkrit Tantithamthavorn. 2022. GPT2SP: A transformer-based agile story point estimation approach. IEEE Transactions on Software Engineering 49, 2 (2022), 611–625.
  • Gandhi et al . (2023) Apurva Gandhi, Thong Q Nguyen, Huitian Jiao, Robert Steen, and Ameya Bhatawdekar. 2023. Natural Language Commanding via Program Synthesis. arXiv preprint arXiv:2306.03460 (2023).
  • Gao et al . (2020) Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al . 2020. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027 (2020).
  • Gao et al . (2024) Shuzheng Gao, Wenxin Mao, Cuiyun Gao, Li Li, Xing Hu, Xin Xia, and Michael R Lyu. 2024. Learning in the Wild: Towards Leveraging Unlabeled Data for Effectively Tuning Pre-trained Code Models. arXiv preprint arXiv:2401.01060 (2024).
  • Gao et al . (2023b) Shuzheng Gao, Xin-Cheng Wen, Cuiyun Gao, Wenxuan Wang, and Michael R Lyu. 2023b. Constructing Effective In-Context Demonstration for Code Intelligence Tasks: An Empirical Study. arXiv preprint arXiv:2304.07575 (2023).
  • Gao et al . (2023a) Zeyu Gao, Hao Wang, Yuchen Zhou, Wenyu Zhu, and Chao Zhang. 2023a. How Far Have We Gone in Vulnerability Detection Using Large Language Models. arXiv preprint arXiv:2311.12420 (2023).
  • Geng et al . (2024) Mingyang Geng, Shangwen Wang, Dezun Dong, Haotian Wang, Ge Li, Zhi Jin, Xiaoguang Mao, and Xiangke Liao. 2024. Large Language Models are Few-Shot Summarizers: Multi-Intent Comment Generation via In-Context Learning. (2024).
  • Gethers et al . (2011) Malcom Gethers, Rocco Oliveto, Denys Poshyvanyk, and Andrea De Lucia. 2011. On integrating orthogonal information retrieval methods to improve traceability recovery. In 2011 27th IEEE International Conference on Software Maintenance (ICSM) . IEEE, 133–142.
  • Ghadhab et al . (2021) Lobna Ghadhab, Ilyes Jenhani, Mohamed Wiem Mkaouer, and Montassar Ben Messaoud. 2021. Augmenting commit classification by using fine-grained source code changes and a pre-trained deep neural language model. Information and Software Technology 135 (2021), 106566.
  • Gilbert et al . (2023) Henry Gilbert, Michael Sandborn, Douglas C Schmidt, Jesse Spencer-Smith, and Jules White. 2023. Semantic Compression With Large Language Models. arXiv preprint arXiv:2304.12512 (2023).
  • Github (2023) Github. 2023. Github. https://github.com/ .
  • GitHub (2023) GitHub. 2023. Github copilot. https://copilot.github.com .
  • Gomes et al . (2023) Luiz Gomes, Ricardo da Silva Torres, and Mario Lúcio Côrtes. 2023. BERT-and TF-IDF-based feature extraction for long-lived bug prediction in FLOSS: a comparative study. Information and Software Technology 160 (2023), 107217.
  • Gong et al . (2023) Lina Gong, Jingxuan Zhang, Mingqiang Wei, Haoxiang Zhang, and Zhiqiu Huang. 2023. What is the intended usage context of this model? An exploratory study of pre-trained models on various model repositories. ACM Transactions on Software Engineering and Methodology 32, 3 (2023), 1–57.
  • Google (2023) Google. 2023. Gemini. https://gemini.google.com/ .
  • Google (2024) Google. 2024. Gemma. https://blog.google/technology/developers/gemma-open-models/ .
  • Grishina et al . (2023) Anastasiia Grishina, Max Hort, and Leon Moonen. 2023. The EarlyBIRD Catches the Bug: On Exploiting Early Layers of Encoder Models for More Efficient Code Classification. arXiv preprint arXiv:2305.04940 (2023).
  • Gu et al . (2022) Jian Gu, Pasquale Salza, and Harald C Gall. 2022. Assemble foundation models for automatic code summarization. In 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) . IEEE, 935–946.
  • Gu et al . (2018) Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. 2018. Deep code search. In Proceedings of the 40th International Conference on Software Engineering . 933–944.
  • Gu et al . (2016) Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, and Sunghun Kim. 2016. Deep API learning. In Proceedings of the 2016 24th ACM SIGSOFT international symposium on foundations of software engineering . 631–642.
  • Guo et al . (2020) Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, et al . 2020. Graphcodebert: Pre-training code representations with data flow. arXiv preprint arXiv:2009.08366 (2020).
  • Guo et al . (2023) Daya Guo, Canwen Xu, Nan Duan, Jian Yin, and Julian McAuley. 2023. LongCoder: A Long-Range Pre-trained Language Model for Code Completion. arXiv preprint arXiv:2306.14893 (2023).
  • Guo et al . (2024b) Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y Wu, YK Li, et al . 2024b. DeepSeek-Coder: When the Large Language Model Meets Programming–The Rise of Code Intelligence. arXiv preprint arXiv:2401.14196 (2024).
  • Guo et al . (2024a) Qi Guo, Junming Cao, Xiaofei Xie, Shangqing Liu, Xiaohong Li, Bihuan Chen, and Xin Peng. 2024a. Exploring the potential of chatgpt in automated code refinement: An empirical study. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering . 1–13.
  • Gupta et al . (2023) Priyanshu Gupta, Avishree Khare, Yasharth Bajpai, Saikat Chakraborty, Sumit Gulwani, Aditya Kanade, Arjun Radhakrishna, Gustavo Soares, and Ashish Tiwari. 2023. GrACE: Generation using Associated Code Edits. arXiv preprint arXiv:2305.14129 (2023).
  • Guzman et al . (2014) Emitza Guzman, David Azócar, and Yang Li. 2014. Sentiment analysis of commit comments in GitHub: an empirical study. In Proceedings of the 11th working conference on mining software repositories . 352–355.
  • Hajali and Budvytis (2023) Patrick Hajali and Ignas Budvytis. 2023. Function-constrained Program Synthesis. arXiv:2311.15500 [cs.LG]
  • Hao et al . (2023) Yu Hao, Weiteng Chen, Ziqiao Zhou, and Weidong Cui. 2023. E&V: Prompting Large Language Models to Perform Static Analysis by Pseudo-code Execution and Verification. arXiv preprint arXiv:2312.08477 (2023).
  • Happe and Cito (2023) Andreas Happe and Jürgen Cito. 2023. Getting pwn’d by AI: Penetration Testing with Large Language Models. arXiv preprint arXiv:2308.00121 (2023).
  • Harty et al . (2021) Julian Harty, Haonan Zhang, Lili Wei, Luca Pascarella, Mauricio Aniche, and Weiyi Shang. 2021. Logging practices with mobile analytics: An empirical study on firebase. In 2021 IEEE/ACM 8th International Conference on Mobile Software Engineering and Systems (MobileSoft) . IEEE, 56–60.
  • Hasselbring and van Hoorn (2020) Wilhelm Hasselbring and André van Hoorn. 2020. Kieker: A monitoring framework for software engineering research. Software Impacts 5 (2020), 100019.
  • He et al . (2023) Junda He, Zhou Xin, Bowen Xu, Ting Zhang, Kisub Kim, Zhou Yang, Ferdian Thung, Ivana Irsan, and David Lo. 2023. Representation Learning for Stack Overflow Posts: How Far are We? arXiv preprint arXiv:2303.06853 (2023).
  • He et al . (2022) Junda He, Bowen Xu, Zhou Yang, DongGyun Han, Chengran Yang, and David Lo. 2022. PTM4Tag: sharpening tag recommendation of stack overflow posts with pre-trained models. In Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension . 1–11.
  • Hellendoorn et al . (2018) Vincent J Hellendoorn, Christian Bird, Earl T Barr, and Miltiadis Allamanis. 2018. Deep learning type inference. In Proceedings of the 2018 26th acm joint meeting on european software engineering conference and symposium on the foundations of software engineering . 152–162.
  • Helmeczi et al . (2023) Robert Kraig Helmeczi, Mucahit Cevik, and Savas Yıldırım. 2023. Few-shot learning for sentence pair classification and its applications in software engineering. arXiv preprint arXiv:2306.08058 (2023).
  • Hendrycks et al . (2021) Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al . 2021. Measuring coding challenge competence with apps. arXiv preprint arXiv:2105.09938 (2021).
  • Henkel et al . (2021) Jordan Henkel, Denini Silva, Leopoldo Teixeira, Marcelo d’Amorim, and Thomas Reps. 2021. Shipwright: A human-in-the-loop system for dockerfile repair. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE) . IEEE, 1148–1160.
  • Hey et al . (2020) Tobias Hey, Jan Keim, Anne Koziolek, and Walter F Tichy. 2020. Norbert: Transfer learning for requirements classification. In 2020 IEEE 28th International Requirements Engineering Conference (RE) . IEEE, 169–179.
  • hiyouga (2023) hiyouga. 2023. LLaMA Efficient Tuning. https://github.com/hiyouga/LLaMA-Efficient-Tuning .
  • Hoffmann et al . (2022) Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al . 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556 (2022).
  • Hong et al . (2023) Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al . 2023. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352 (2023).
  • Houlsby et al . (2019) Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning . PMLR, 2790–2799.
  • Hu et al . (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021).
  • Hu et al . (2023) Jie Hu, Qian Zhang, and Heng Yin. 2023. Augmenting Greybox Fuzzing with Generative AI. arXiv preprint arXiv:2306.06782 (2023).
  • Hu et al . (2024) Xueyu Hu, Kun Kuang, Jiankai Sun, Hongxia Yang, and Fei Wu. 2024. Leveraging Print Debugging to Improve Code Generation in Large Language Models. arXiv preprint arXiv:2401.05319 (2024).
  • Hu et al . (2018) Xing Hu, Ge Li, Xin Xia, David Lo, and Zhi Jin. 2018. Deep code comment generation. In Proceedings of the 26th conference on program comprehension . 200–210.
  • Huang et al . (2023a) Dong Huang, Qingwen Bu, and Heming Cui. 2023a. Codecot and beyond: Learning to program and test like a developer. arXiv preprint arXiv:2308.08784 (2023).
  • Huang et al . (2023b) Dong Huang, Qingwen Bu, Jie M Zhang, Michael Luck, and Heming Cui. 2023b. AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation. arXiv preprint arXiv:2312.13010 (2023).
  • Huang et al . (2023c) Di Huang, Ziyuan Nan, Xing Hu, Pengwei Jin, Shaohui Peng, Yuanbo Wen, Rui Zhang, Zidong Du, Qi Guo, Yewen Pu, et al . 2023c. ANPL: Compiling Natural Programs with Interactive Decomposition. arXiv preprint arXiv:2305.18498 (2023).
  • Huang et al . (2023d) Qing Huang, Yanbang Sun, Zhenchang Xing, Min Yu, Xiwei Xu, and Qinghua Lu. 2023d. API Entity and Relation Joint Extraction from Text via Dynamic Prompt-tuned Language Model. arXiv preprint arXiv:2301.03987 (2023).
  • Huang et al . (2023e) Qing Huang, Yishun Wu, Zhenchang Xing, He Jiang, Yu Cheng, and Huan Jin. 2023e. Adaptive Intellect Unleashed: The Feasibility of Knowledge Transfer in Large Language Models. arXiv preprint arXiv:2308.04788 (2023).
  • Huang et al . (2018) Qiao Huang, Xin Xia, Zhenchang Xing, David Lo, and Xinyu Wang. 2018. API method recommendation without worrying about the task-API knowledge gap. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering . 293–304.
  • Huang et al . (2023f) Qing Huang, Jiahui Zhu, Zhenchang Xing, Huan Jin, Changjing Wang, and Xiwei Xu. 2023f. A Chain of AI-based Solutions for Resolving FQNs and Fixing Syntax Errors in Partial Code. arXiv preprint arXiv:2306.11981 (2023).
  • Huang et al . (2023g) Qing Huang, Zhou Zou, Zhenchang Xing, Zhenkang Zuo, Xiwei Xu, and Qinghua Lu. 2023g. AI Chain on Large Language Model for Unsupervised Control Flow Graph Generation for Statically-Typed Partial Code. arXiv preprint arXiv:2306.00757 (2023).
  • Huang et al . (2024) Yuchao Huang, Junjie Wang, Zhe Liu, Yawen Wang, Song Wang, Chunyang Chen, Yuanzhe Hu, and Qing Wang. 2024. Crashtranslator: Automatically reproducing mobile application crashes directly from stack trace. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering . 1–13.
  • Ibrahimzada et al . (2023) Ali Reza Ibrahimzada, Yang Chen, Ryan Rong, and Reyhaneh Jabbarvand. 2023. Automated Bug Generation in the era of Large Language Models. arXiv preprint arXiv:2310.02407 (2023).
  • Ibrahimzada et al . (2022) Ali Reza Ibrahimzada, Yigit Varli, Dilara Tekinoglu, and Reyhaneh Jabbarvand. 2022. Perfect is the enemy of test oracle. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering . 70–81.
  • Islam and Zibran (2017) Md Rakibul Islam and Minhaz F Zibran. 2017. Leveraging automated sentiment analysis in software engineering. In 2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR) . IEEE, 203–214.
  • Islam et al . (2024) Nafis Tanveer Islam, Joseph Khoury, Andrew Seong, Gonzalo De La Torre Parra, Elias Bou-Harb, and Peyman Najafirad. 2024. LLM-Powered Code Vulnerability Repair with Reinforcement Learning and Semantic Reward. arXiv preprint arXiv:2401.03374 (2024).
  • Islam and Najafirad (2024) Nafis Tanveer Islam and Peyman Najafirad. 2024. Code Security Vulnerability Repair Using Reinforcement Learning with Large Language Models. arXiv preprint arXiv:2401.07031 (2024).
  • Isotani et al . (2021) Haruna Isotani, Hironori Washizaki, Yoshiaki Fukazawa, Tsutomu Nomoto, Saori Ouji, and Shinobu Saito. 2021. Duplicate bug report detection by using sentence embedding and fine-tuning. In 2021 IEEE international conference on software maintenance and evolution (ICSME) . IEEE, 535–544.
  • Iyer et al . (2016) Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing source code using a neural attention model. In 54th Annual Meeting of the Association for Computational Linguistics 2016 . Association for Computational Linguistics, 2073–2083.
  • Izadi et al . (2022) Maliheh Izadi, Roberta Gismondi, and Georgios Gousios. 2022. Codefill: Multi-token code completion by jointly learning from structure and naming sequences. In Proceedings of the 44th International Conference on Software Engineering . 401–412.
  • Jain et al . (2023a) Abhinav Jain, Chima Adiole, Thomas Reps, Swarat Chaudhuri, and Chris Jermaine. 2023a. Coarse-Tuning Models of Code with Reinforcement Learning Feedback. (2023).
  • Jain et al . (2022) Naman Jain, Skanda Vaidyanath, Arun Iyer, Nagarajan Natarajan, Suresh Parthasarathy, Sriram Rajamani, and Rahul Sharma. 2022. Jigsaw: Large language models meet program synthesis. In Proceedings of the 44th International Conference on Software Engineering . 1219–1231.
  • Jain et al . (2023b) Naman Jain, Tianjun Zhang, Wei-Lin Chiang, Joseph E Gonzalez, Koushik Sen, and Ion Stoica. 2023b. Llm-assisted code cleaning for training accurate code generators. arXiv preprint arXiv:2311.14904 (2023).
  • Jana et al . (2023) Prithwish Jana, Piyush Jha, Haoyang Ju, Gautham Kishore, Aryan Mahajan, and Vijay Ganesh. 2023. Attention, Compilation, and Solver-based Symbolic Analysis are All You Need. arXiv preprint arXiv:2306.06755 (2023).
  • Jesse et al . (2022) Kevin Jesse, Premkumar T Devanbu, and Anand Sawant. 2022. Learning to predict user-defined types. IEEE Transactions on Software Engineering 49, 4 (2022), 1508–1522.
  • Ji et al . (2023) Zhenlan Ji, Pingchuan Ma, Zongjie Li, and Shuai Wang. 2023. Benchmarking and Explaining Large Language Model-based Code Generation: A Causality-Centric Approach. arXiv preprint arXiv:2310.06680 (2023).
  • Jiang et al . (2023b) Nan Jiang, Kevin Liu, Thibaud Lutellier, and Lin Tan. 2023b. Impact of code language models on automated program repair. arXiv preprint arXiv:2302.05020 (2023).
  • Jiang et al . (2023c) Shuyang Jiang, Yuhao Wang, and Yu Wang. 2023c. SelfEvolve: A Code Evolution Framework via Large Language Models. arXiv preprint arXiv:2306.02907 (2023).
  • Jiang et al . (2023a) Xue Jiang, Yihong Dong, Lecheng Wang, Qiwei Shang, and Ge Li. 2023a. Self-planning code generation with large language model. arXiv preprint arXiv:2303.06689 (2023).
  • Jiang et al . (2020) Yanjie Jiang, Hui Liu, Jiahao Jin, and Lu Zhang. 2020. Automated expansion of abbreviations based on semantic relation and transfer expansion. IEEE Transactions on Software Engineering 48, 2 (2020), 519–537.
  • Jimenez et al . (2023) Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. 2023. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770 (2023).
  • Jin et al . (2023b) Matthew Jin, Syed Shahriar, Michele Tufano, Xin Shi, Shuai Lu, Neel Sundaresan, and Alexey Svyatkovskiy. 2023b. Inferfix: End-to-end program repair with llms. arXiv preprint arXiv:2303.07263 (2023).
  • Jin et al . (2023c) Pengxiang Jin, Shenglin Zhang, Minghua Ma, Haozhe Li, Yu Kang, Liqun Li, Yudong Liu, Bo Qiao, Chaoyun Zhang, Pu Zhao, et al . 2023c. Assess and Summarize: Improve Outage Understanding with Large Language Models. arXiv preprint arXiv:2305.18084 (2023).
  • Jin et al . (2023a) Xin Jin, Jonathan Larson, Weiwei Yang, and Zhiqiang Lin. 2023a. Binary code summarization: Benchmarking chatgpt/gpt-4 and other large language models. arXiv preprint arXiv:2312.09601 (2023).
  • Jones and Steinhardt (2022) Erik Jones and Jacob Steinhardt. 2022. Capturing failures of large language models via human cognitive biases. Advances in Neural Information Processing Systems 35 (2022), 11785–11799.
  • Jongeling et al . (2015) Robbert Jongeling, Subhajit Datta, and Alexander Serebrenik. 2015. Choosing your weapons: On sentiment analysis tools for software engineering research. In 2015 IEEE international conference on software maintenance and evolution (ICSME) . IEEE, 531–535.
  • Judini (2023) Judini. 2023. The future of software development powered by AI. https://codegpt.co/ .
  • Kabir et al . (2024) Azmain Kabir, Shaowei Wang, Yuan Tian, Muhammad Asaduzzaman, Wenbin Zhang, et al . 2024. ZS4C: Zero-Shot Synthesis of Compilable Code for Incomplete Code Snippets using ChatGPT. arXiv preprint arXiv:2401.14279 (2024).
  • Kabir et al . (2023) Md Mahir Asef Kabir, Sk Adnan Hassan, Xiaoyin Wang, Ying Wang, Hai Yu, and Na Meng. 2023. An empirical study of ChatGPT-3.5 on question answering and code maintenance. arXiv preprint arXiv:2310.02104 (2023).
  • Kanade et al . (2020) Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, and Kensen Shi. 2020. Learning and evaluating contextual embedding of source code. In International conference on machine learning . PMLR, 5110–5121.
  • Kang et al . (2023a) Sungmin Kang, Gabin An, and Shin Yoo. 2023a. A preliminary evaluation of llm-based fault localization. arXiv preprint arXiv:2308.05487 (2023).
  • Kang et al . (2023b) Sungmin Kang, Bei Chen, Shin Yoo, and Jian-Guang Lou. 2023b. Explainable Automated Debugging via Large Language Model-driven Scientific Debugging. arXiv preprint arXiv:2304.02195 (2023).
  • Kang et al . (2023c) Sungmin Kang, Juyeon Yoon, Nargiz Askarbekkyzy, and Shin Yoo. 2023c. Evaluating Diverse Large Language Models for Automatic and General Bug Reproduction. arXiv preprint arXiv:2311.04532 (2023).
  • Kang et al . (2022) Sungmin Kang, Juyeon Yoon, and Shin Yoo. 2022. Large language models are few-shot testers: Exploring llm-based general bug reproduction. arXiv preprint arXiv:2209.11515 (2022).
  • Kannan (2023) Jai Kannan. 2023. Can LLMs Configure Software Tools. arXiv preprint arXiv:2312.06121 (2023).
  • Karampatsis and Sutton (2020) Rafael-Michael Karampatsis and Charles Sutton. 2020. Scelmo: Source code embeddings from language models. arXiv preprint arXiv:2004.13214 (2020).
  • Ke et al . (2023) Li Ke, Hong Sheng, Fu Cai, Zhang Yunhe, and Liu Ming. 2023. Discriminating Human-authored from ChatGPT-Generated Code Via Discernable Feature Analysis. arXiv:2306.14397 [cs.SE]
  • Khakhar et al . (2023) Adam Khakhar, Stephen Mell, and Osbert Bastani. 2023. PAC Prediction Sets for Large Language Models of Code. arXiv preprint arXiv:2302.08703 (2023).
  • Khan et al . (2021) Junaed Younus Khan, Md Tawkat Islam Khondaker, Gias Uddin, and Anindya Iqbal. 2021. Automatic detection of five api documentation smells: Practitioners’ perspectives. In 2021 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) . IEEE, 318–329.
  • Khan and Uddin (2022) Junaed Younus Khan and Gias Uddin. 2022. Automatic detection and analysis of technical debts in peer-review documentation of r packages. In 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) . IEEE, 765–776.
  • Khan et al . (2023a) Mohammad Abdullah Matin Khan, M Saiful Bari, Xuan Long Do, Weishi Wang, Md Rizwan Parvez, and Shafiq Joty. 2023a. xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval. arXiv preprint arXiv:2303.03004 (2023).
  • Khan et al . (2023b) Muhammad Fawad Akbar Khan, Max Ramsdell, Erik Falor, and Hamid Karimi. 2023b. Assessing the Promise and Pitfalls of ChatGPT for Automated Code Generation. arXiv preprint arXiv:2311.02640 (2023).
  • Khanfir et al . (2023) Ahmed Khanfir, Renzo Degiovanni, Mike Papadakis, and Yves Le Traon. 2023. Efficient Mutation Testing via Pre-Trained Language Models. arXiv preprint arXiv:2301.03543 (2023).
  • Khare et al . (2023) Avishree Khare, Saikat Dutta, Ziyang Li, Alaia Solko-Breslin, Rajeev Alur, and Mayur Naik. 2023. Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities. arXiv preprint arXiv:2311.16169 (2023).
  • Kirinuki and Tanno (2024) Hiroyuki Kirinuki and Haruto Tanno. 2024. ChatGPT and Human Synergy in Black-Box Testing: A Comparative Analysis. arXiv preprint arXiv:2401.13924 (2024).
  • Kitchenham et al . (2007) Barbara Kitchenham, Stuart Charters, et al . 2007. Guidelines for performing systematic literature reviews in software engineering.
  • Kitchenham et al . (2022) Barbara Kitchenham, Lech Madeyski, and David Budgen. 2022. SEGRESS: Software engineering guidelines for reporting secondary studies. IEEE Transactions on Software Engineering 49, 3 (2022), 1273–1298.
  • Knauss et al . (2011) Eric Knauss, Siv Houmb, Kurt Schneider, Shareeful Islam, and Jan Jürjens. 2011. Supporting requirements engineers in recognising security issues. In Requirements Engineering: Foundation for Software Quality: 17th International Working Conference, REFSQ 2011, Essen, Germany, March 28-30, 2011. Proceedings 17 . Springer, 4–18.
  • Ko et al . (2006) Amy J Ko, Brad A Myers, Michael J Coblenz, and Htet Htet Aung. 2006. An exploratory study of how developers seek, relate, and collect relevant information during software maintenance tasks. IEEE Transactions on software engineering 32, 12 (2006), 971–987.
  • Koide et al . (2023) Takashi Koide, Naoki Fukushi, Hiroki Nakano, and Daiki Chiba. 2023. Detecting Phishing Sites Using ChatGPT. arXiv preprint arXiv:2306.05816 (2023).
  • Kojima et al . (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in neural information processing systems 35 (2022), 22199–22213.
  • Kolthoff et al . (2023) Kristian Kolthoff, Christian Bartelt, and Simone Paolo Ponzetto. 2023. Data-driven prototyping via natural-language-based GUI retrieval. Automated Software Engineering 30, 1 (2023), 13.
  • Kou et al . (2023a) Bonan Kou, Muhao Chen, and Tianyi Zhang. 2023a. Automated Summarization of Stack Overflow Posts. arXiv preprint arXiv:2305.16680 (2023).
  • Kou et al . (2023b) Bonan Kou, Shengmai Chen, Zhijie Wang, Lei Ma, and Tianyi Zhang. 2023b. Is Model Attention Aligned with Human Attention? An Empirical Study on Large Language Models for Code Generation. arXiv preprint arXiv:2306.01220 (2023).
  • Kulkarni (2021) Amit Kulkarni. 2021. GitHub Copilot AI Is Leaking Functional API Keys. https://analyticsdrift.com/github-copilot-ai-is-leaking-functional-api-keys/.
  • Kuznia et al . (2022) Kirby Kuznia, Swaroop Mishra, Mihir Parmar, and Chitta Baral. 2022. Less is more: Summary of long instructions is better for program synthesis. arXiv preprint arXiv:2203.08597 (2022).
  • Lahiri et al . (2022) Shuvendu K Lahiri, Aaditya Naik, Georgios Sakkas, Piali Choudhury, Curtis von Veh, Madanlal Musuvathi, Jeevana Priya Inala, Chenglong Wang, and Jianfeng Gao. 2022. Interactive code generation via test-driven user-intent formalization. arXiv preprint arXiv:2208.05950 (2022).
  • Lai et al . (2023) Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. 2023. DS-1000: A natural and reliable benchmark for data science code generation. In International Conference on Machine Learning . PMLR, 18319–18345.
  • Lajkó et al . (2022) Márk Lajkó, Viktor Csuvik, and László Vidács. 2022. Towards JavaScript program repair with generative pre-trained transformer (GPT-2). In Proceedings of the Third International Workshop on Automated Program Repair . 61–68.
  • Lan et al . (2019) Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942 (2019).
  • Laskar et al . (2023) Md Tahmid Rahman Laskar, M Saiful Bari, Mizanur Rahman, Md Amran Hossen Bhuiyan, Shafiq Joty, and Jimmy Xiangji Huang. 2023. A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets. arXiv preprint arXiv:2305.18486 (2023).
  • Le et al . (2023) Hung Le, Hailin Chen, Amrita Saha, Akash Gokul, Doyen Sahoo, and Shafiq Joty. 2023. Codechain: Towards modular code generation through chain of self-revisions with representative sub-modules. arXiv preprint arXiv:2310.08992 (2023).
  • Le-Cong et al . (2022) Thanh Le-Cong, Hong Jin Kang, Truong Giang Nguyen, Stefanus Agus Haryono, David Lo, Xuan-Bach D Le, and Quyet Thang Huynh. 2022. Autopruner: transformer-based call graph pruning. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering . 520–532.
  • Le-Cong et al . (2023) Thanh Le-Cong, Duc-Minh Luong, Xuan Bach D Le, David Lo, Nhat-Hoa Tran, Bui Quang-Huy, and Quyet-Thang Huynh. 2023. Invalidator: Automated patch correctness assessment via semantic and syntactic reasoning. IEEE Transactions on Software Engineering (2023).
  • Lee et al . (2022) Jaehyung Lee, Kisun Han, and Hwanjo Yu. 2022. A Light Bug Triage Framework for Applying Large Pre-trained Language Model. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering . 1–11.
  • Lester et al . (2021) Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691 (2021).
  • Li et al . (2023g) Chengshu Li, Jacky Liang, Andy Zeng, Xinyun Chen, Karol Hausman, Dorsa Sadigh, Sergey Levine, Li Fei-Fei, Fei Xia, and Brian Ichter. 2023g. Chain of code: Reasoning with a language model-augmented code emulator. arXiv preprint arXiv:2312.04474 (2023).
  • Li et al . (2022d) Dong Li, Yelong Shen, Ruoming Jin, Yi Mao, Kuan Wang, and Weizhu Chen. 2022d. Generation-Augmented Query Expansion For Code Retrieval. arXiv preprint arXiv:2212.10692 (2022).
  • Li et al . (2014) Feng-Lin Li, Jennifer Horkoff, John Mylopoulos, Renata SS Guizzardi, Giancarlo Guizzardi, Alexander Borgida, and Lin Liu. 2014. Non-functional requirements as qualities, with a spice of ontology. In 2014 IEEE 22nd International Requirements Engineering Conference (RE) . IEEE, 293–302.
  • Li et al . (2024c) Haochen Li, Xin Zhou, and Zhiqi Shen. 2024c. Rewriting the Code: A Simple Method for Large Language Model Augmented Code Search. arXiv preprint arXiv:2401.04514 (2024).
  • Li et al . (2023a) Jingyao Li, Pengguang Chen, and Jiaya Jia. 2023a. MoTCoder: Elevating Large Language Models with Modular of Thought for Challenging Programming Tasks. arXiv preprint arXiv:2312.15960 (2023).
  • Li et al . (2021) Jingxuan Li, Rui Huang, Wei Li, Kai Yao, and Weiguo Tan. 2021. Toward less hidden cost of code completion with acceptance and ranking models. In 2021 IEEE International Conference on Software Maintenance and Evolution (ICSME) . IEEE, 195–205.
  • Li et al . (2023c) Jia Li, Ge Li, Yongmin Li, and Zhi Jin. 2023c. Enabling Programming Thinking in Large Language Models Toward Code Generation. arXiv preprint arXiv:2305.06599 (2023).
  • Li et al . (2023d) Jia Li, Ge Li, Yongmin Li, and Zhi Jin. 2023d. Structured chain-of-thought prompting for code generation. arXiv preprint arXiv:2305.06599 (2023).
  • Li et al . (2023e) Jia Li, Ge Li, Zhuo Li, Zhi Jin, Xing Hu, Kechi Zhang, and Zhiyi Fu. 2023e. Codeeditor: Learning to edit source code with pre-trained models. ACM Transactions on Software Engineering and Methodology 32, 6 (2023), 1–22.
  • Li et al . (2024a) Jia Li, Ge Li, Yunfei Zhao, Yongmin Li, Zhi Jin, Hao Zhu, Huanyu Liu, Kaibo Liu, Lecheng Wang, Zheng Fang, et al . 2024a. DevEval: Evaluating Code Generation in Practical Software Projects. arXiv preprint arXiv:2401.06401 (2024).
  • Li et al . (2022c) Jia Li, Zhuo Li, Huangzhao Zhang, Ge Li, Zhi Jin, Xing Hu, and Xin Xia. 2022c. Poison Attack and Defense on Deep Source Code Processing Models. arXiv preprint arXiv:2210.17029 (2022).
  • Li et al . (2017) Li Li, Tegawendé F Bissyandé, Mike Papadakis, Siegfried Rasthofer, Alexandre Bartel, Damien Octeau, Jacques Klein, and Yves Le Traon. 2017. Static analysis of android apps: A systematic literature review. Information and Software Technology 88 (2017), 67–95.
  • Li et al . (2022f) Lingwei Li, Li Yang, Huaxi Jiang, Jun Yan, Tiejian Luo, Zihan Hua, Geng Liang, and Chun Zuo. 2022f. AUGER: automatically generating review comments with pre-training models. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering . 1009–1021.
  • Li et al . (2023i) Peng Li, Tianxiang Sun, Qiong Tang, Hang Yan, Yuanbin Wu, Xuanjing Huang, and Xipeng Qiu. 2023i. CodeIE: Large Code Generation Models are Better Few-Shot Information Extractors. arXiv preprint arXiv:2305.05711 (2023).
  • Li et al . (2023l) Tsz-On Li, Wenxi Zong, Yibo Wang, Haoye Tian, Ying Wang, and Shing-Chi Cheung. 2023l. Finding Failure-Inducing Test Cases with ChatGPT. arXiv preprint arXiv:2304.11686 (2023).
  • Li et al . (2023m) Tsz-On Li, Wenxi Zong, Yibo Wang, Haoye Tian, Ying Wang, Shing-Chi Cheung, and Jeff Kramer. 2023m. Nuances are the key: Unlocking chatgpt to find failure-inducing tests with differential prompting. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE) . IEEE, 14–26.
  • Li et al . (2022b) Xiaonan Li, Yeyun Gong, Yelong Shen, Xipeng Qiu, Hang Zhang, Bolun Yao, Weizhen Qi, Daxin Jiang, Weizhu Chen, and Nan Duan. 2022b. CodeRetriever: A Large Scale Contrastive Pre-Training Method for Code Search. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing . 2898–2910.
  • Li and Liang (2021) Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190 (2021).
  • Li et al . (2023k) Xin-Ye Li, Jiang-Tian Xue, Zheng Xie, and Ming Li. 2023k. Think Outside the Code: Brainstorming Boosts Large Language Models in Code Generation. arXiv preprint arXiv:2305.10679 (2023).
  • Li et al . (2022a) Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al . 2022a. Competition-level code generation with alphacode. Science 378, 6624 (2022), 1092–1097.
  • Li et al . (2023b) Yichen Li, Yintong Huo, Zhihan Jiang, Renyi Zhong, Pinjia He, Yuxin Su, and Michael R Lyu. 2023b. Exploring the Effectiveness of LLMs in Automated Logging Generation: An Empirical Study. arXiv preprint arXiv:2307.05950 (2023).
  • Li et al . (2024b) Yue Li, Zhong Ren, Zhiqi Wang, Lanxin Yang, Liming Dong, Chenxing Zhong, and He Zhang. 2024b. Fine-SE: Integrating Semantic Features and Expert Features for Software Effort Estimation. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering . 1–12.
  • Li et al . (2023h) Youjia Li, Jianjun Shi, and Zheng Zhang. 2023h. A Novel Approach for Rapid Development Based on ChatGPT and Prompt Engineering. arXiv preprint arXiv:2312.13115 (2023).
  • Li et al . (2022g) Yao Li, Tao Zhang, Xiapu Luo, Haipeng Cai, Sen Fang, and Dawei Yuan. 2022g. Do Pre-trained Language Models Indeed Understand Software Engineering Tasks? arXiv preprint arXiv:2211.10623 (2022).
  • Li et al . (2023f) Zhihao Li, Chuanyi Li, Ze Tang, Wanhong Huang, Jidong Ge, Bin Luo, Vincent Ng, Ting Wang, Yucheng Hu, and Xiaopeng Zhang. 2023f. PTM-APIRec: Leveraging Pre-trained Models of Source Code in API Recommendation. ACM Transactions on Software Engineering and Methodology (2023).
  • Li et al . (2023j) Zongjie Li, Chaozheng Wang, Zhibo Liu, Haoxuan Wang, Dong Chen, Shuai Wang, and Cuiyun Gao. 2023j. Cctest: Testing and repairing code completion systems. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE) . IEEE, 1238–1250.
  • Li et al . (2022e) Zongjie Li, Chaozheng Wang, Zhibo Liu, Haoxuan Wang, Shuai Wang, and Cuiyun Gao. 2022e. CCTEST: Testing and Repairing Code Completion Systems. arXiv preprint arXiv:2208.08289 (2022).
  • Liang and Zhu (2018) Yuding Liang and Kenny Zhu. 2018. Automatic generation of text descriptive comments for code blocks. In Proceedings of the AAAI Conference on Artificial Intelligence , Vol. 32.
  • Lin et al . (2021) Jinfeng Lin, Yalin Liu, Qingkai Zeng, Meng Jiang, and Jane Cleland-Huang. 2021. Traceability transformed: Generating more accurate links with pre-trained bert models. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE) . IEEE, 324–335.
  • Lin et al . (2023) Yu-Chen Lin, Akhilesh Kumar, Wen-Liang Zhang, Norman Chang, Muhammad Zakir, Rucha Apte, Chao Wang, and Jyh-Shing Roger Jang. 2023. Applications of Large Language Models in Data Processing: Innovative Approaches to Segmenting and Renewing Information. arXiv preprint arXiv:2311.16267 (2023).
  • Liu et al . (2023a) Chao Liu, Xuanlin Bao, Hongyu Zhang, Neng Zhang, Haibo Hu, Xiaohong Zhang, and Meng Yan. 2023a. Improving ChatGPT Prompt for Code Generation. arXiv preprint arXiv:2305.08360 (2023).
  • Liu et al . (2020) Fang Liu, Ge Li, Yunfei Zhao, and Zhi Jin. 2020. Multi-task learning based pre-trained language model for code completion. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering . 473–485.
  • Liu et al . (2022a) Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal, and Colin A Raffel. 2022a. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems 35 (2022), 1950–1965.
  • Liu et al . (2023i) Hao Liu, Yanlin Wang, Zhao Wei, Yong Xu, Juhong Wang, Hui Li, and Rongrong Ji. 2023i. RefBERT: A Two-Stage Pre-trained Framework for Automatic Rename Refactoring. arXiv preprint arXiv:2305.17708 (2023).
  • Liu et al . (2023g) Jinrun Liu, Xinyu Tang, Linlin Li, Panpan Chen, and Yepang Liu. 2023g. Which is a better programming assistant? A comparative study between chatgpt and stack overflow. arXiv preprint arXiv:2308.13851 (2023).
  • Liu et al . (2023k) Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023k. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. arXiv preprint arXiv:2305.01210 (2023).
  • Liu et al . (2023f) Puzhuo Liu, Chengnian Sun, Yaowen Zheng, Xuan Feng, Chuan Qin, Yuncheng Wang, Zhi Li, and Limin Sun. 2023f. Harnessing the power of llm to support binary taint analysis. arXiv preprint arXiv:2310.08275 (2023).
  • Liu et al . (2023j) Shangqing Liu, Bozhi Wu, Xiaofei Xie, Guozhu Meng, and Yang Liu. 2023j. Contrabert: Enhancing code pre-trained models via contrastive learning. arXiv preprint arXiv:2301.09072 (2023).
  • Liu et al . (2023l) Tianyang Liu, Canwen Xu, and Julian McAuley. 2023l. RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems. arXiv preprint arXiv:2306.03091 (2023).
  • Liu et al . (2018) Xiaoyu Liu, LiGuo Huang, and Vincent Ng. 2018. Effective API recommendation without historical software repositories. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering . 282–292.
  • Liu et al . (2023d) Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu. 2023d. Prompt Injection attack against LLM-integrated Applications. arXiv preprint arXiv:2306.05499 (2023).
  • Liu et al . (2023e) Yue Liu, Thanh Le-Cong, Ratnadira Widyasari, Chakkrit Tantithamthavorn, Li Li, Xuan-Bach D Le, and David Lo. 2023e. Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues. arXiv preprint arXiv:2307.12596 (2023).
  • Liu et al . (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
  • Liu et al . (2022b) Yue Liu, Chakkrit Tantithamthavorn, Li Li, and Yepang Liu. 2022b. Deep learning for android malware defenses: a systematic literature review. Comput. Surveys 55, 8 (2022), 1–36.
  • Liu et al . (2024a) Yue Liu, Chakkrit Tantithamthavorn, Yonghui Liu, and Li Li. 2024a. On the Reliability and Explainability of Language Models for Program Generation. ACM Transactions on Software Engineering and Methodology (2024).
  • Liu et al . (2024b) Yilun Liu, Shimin Tao, Weibin Meng, Jingyu Wang, Wenbing Ma, Yanqing Zhao, Yuhang Chen, Hao Yang, Yanfei Jiang, and Xun Chen. 2024b. Interpretable Online Log Analysis Using Large Language Models with Prompt Strategies. arXiv preprint arXiv:2308.07610 (2024).
  • Liu et al . (2023b) Zhe Liu, Chunyang Chen, Junjie Wang, Xing Che, Yuekai Huang, Jun Hu, and Qing Wang. 2023b. Fill in the blank: Context-aware automated text input generation for mobile gui testing. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE) . IEEE, 1355–1367.
  • Liu et al . (2023c) Zhe Liu, Chunyang Chen, Junjie Wang, Mengzhuo Chen, Boyu Wu, Xing Che, Dandan Wang, and Qing Wang. 2023c. Testing the limits: Unusual text inputs generation for mobile app crash detection with large language model. arXiv preprint arXiv:2310.15657 (2023).
  • Liu et al . (2023h) Zhijie Liu, Yutian Tang, Xiapu Luo, Yuming Zhou, and Liang Feng Zhang. 2023h. No Need to Lift a Finger Anymore? Assessing the Quality of Code Generation by ChatGPT. arXiv preprint arXiv:2308.04838 (2023).
  • Lu et al . (2023) Junyi Lu, Lei Yu, Xiaojia Li, Li Yang, and Chun Zuo. 2023. LLaMA-Reviewer: Advancing Code Review Automation with Large Language Models through Parameter-Efficient Fine-Tuning. In 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE) . IEEE, 647–658.
  • Lu et al . (2021) Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al . 2021. Codexglue: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664 (2021).
  • Lubowitz (2023) James H Lubowitz. 2023. ChatGPT, an artificial intelligence chatbot, is impacting medical literature. Arthroscopy 39, 5 (2023), 1121–1122.
  • Luitel et al . (2023) Dipeeka Luitel, Shabnam Hassani, and Mehrdad Sabetzadeh. 2023. Improving Requirements Completeness: Automated Assistance through Large Language Models. arXiv preprint arXiv:2308.03784 (2023).
  • Luo et al . (2022) Xianchang Luo, Yinxing Xue, Zhenchang Xing, and Jiamou Sun. 2022. PRCBERT: Prompt Learning for Requirement Classification using BERT-based Pretrained Language Models. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering . 1–13.
  • Luo et al . (2023) Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. 2023. WizardCoder: Empowering Code Large Language Models with Evol-Instruct. arXiv preprint arXiv:2306.08568 (2023).
  • Ma et al . (2024a) Lezhi Ma, Shangqing Liu, Yi Li, Xiaofei Xie, and Lei Bu. 2024a. SpecGen: Automated Generation of Formal Program Specifications via Large Language Models. arXiv preprint arXiv:2401.08807 (2024).
  • Ma et al . (2024b) Lipeng Ma, Weidong Yang, Bo Xu, Sihang Jiang, Ben Fei, Jiaqing Liang, Mingjie Zhou, and Yanghua Xiao. 2024b. KnowLog: Knowledge Enhanced Pre-trained Language Model for Log Understanding. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering . 1–13.
  • Ma et al . (2023a) Wei Ma, Shangqing Liu, Wenhan Wang, Qiang Hu, Ye Liu, Cen Zhang, Liming Nie, and Yang Liu. 2023a. The Scope of ChatGPT in Software Engineering: A Thorough Investigation. arXiv preprint arXiv:2305.12138 (2023).
  • Ma et al . (2023b) Wei Ma, Mengjie Zhao, Xiaofei Xie, Qiang Hu, Shangqing Liu, Jie Zhang, Wenhan Wang, and Yang Liu. 2023b. Are Code Pre-trained Models Powerful to Learn Code Syntax and Semantics?
  • Madaan et al . (2022) Aman Madaan, Shuyan Zhou, Uri Alon, Yiming Yang, and Graham Neubig. 2022. Language models of code are few-shot commonsense learners. arXiv preprint arXiv:2210.07128 (2022).
  • Mandal et al . (2023) Shantanu Mandal, Adhrik Chethan, Vahid Janfaza, SM Mahmud, Todd A Anderson, Javier Turek, Jesmin Jahan Tithi, and Abdullah Muzahid. 2023. Large Language Models Based Automatic Synthesis of Software Specifications. arXiv preprint arXiv:2304.09181 (2023).
  • Manh et al . (2023) Dung Nguyen Manh, Nam Le Hai, Anh TV Dau, Anh Minh Nguyen, Khanh Nghiem, Jin Guo, and Nghi DQ Bui. 2023. The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation. arXiv preprint arXiv:2305.06156 (2023).
  • Manna and Waldinger (1980) Zohar Manna and Richard Waldinger. 1980. A deductive approach to program synthesis. ACM Transactions on Programming Languages and Systems (TOPLAS) 2, 1 (1980), 90–121.
  • Mao et al . (2023) Yuetian Mao, Chengcheng Wan, Yuze Jiang, and Xiaodong Gu. 2023. Self-supervised query reformulation for code search. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering . 363–374.
  • Mastropaolo et al . (2021a) Antonio Mastropaolo, Emad Aghajani, Luca Pascarella, and Gabriele Bavota. 2021a. An empirical study on code comment completion. In 2021 IEEE International Conference on Software Maintenance and Evolution (ICSME) . IEEE, 159–170.
  • Mastropaolo et al . (2022a) Antonio Mastropaolo, Nathan Cooper, David Nader Palacio, Simone Scalabrino, Denys Poshyvanyk, Rocco Oliveto, and Gabriele Bavota. 2022a. Using transfer learning for code-related tasks. IEEE Transactions on Software Engineering 49, 4 (2022), 1580–1598.
  • Mastropaolo et al . (2023a) Antonio Mastropaolo, Massimiliano Di Penta, and Gabriele Bavota. 2023a. Towards Automatically Addressing Self-Admitted Technical Debt: How Far Are We?. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE) . IEEE, 585–597.
  • Mastropaolo et al . (2022b) Antonio Mastropaolo, Luca Pascarella, and Gabriele Bavota. 2022b. Using deep learning to generate complete log statements. In Proceedings of the 44th International Conference on Software Engineering . 2279–2290.
  • Mastropaolo et al . (2023b) Antonio Mastropaolo, Luca Pascarella, Emanuela Guglielmi, Matteo Ciniselli, Simone Scalabrino, Rocco Oliveto, and Gabriele Bavota. 2023b. On the robustness of code generation techniques: An empirical study on github copilot. arXiv preprint arXiv:2302.00438 (2023).
  • Mastropaolo et al . (2021b) Antonio Mastropaolo, Simone Scalabrino, Nathan Cooper, David Nader Palacio, Denys Poshyvanyk, Rocco Oliveto, and Gabriele Bavota. 2021b. Studying the usage of text-to-text transfer transformer to support code-related tasks. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE) . IEEE, 336–347.
  • Meta (2023) Meta. 2023. Code Llama: Open Foundation Models for Code. https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/.
  • Mohajer et al . (2023) Mohammad Mahdi Mohajer, Reem Aleithan, Nima Shiri Harzevili, Moshi Wei, Alvine Boaye Belle, Hung Viet Pham, and Song Wang. 2023. SkipAnalyzer: An Embodied Agent for Code Analysis with Large Language Models. arXiv preprint arXiv:2310.18532 (2023).
  • Moharil and Sharma (2022) Ambarish Moharil and Arpit Sharma. 2022. Identification of intra-domain ambiguity using transformer-based machine learning. In Proceedings of the 1st International Workshop on Natural Language-based Software Engineering . 51–58.
  • Moharil and Sharma (2023) Ambarish Moharil and Arpit Sharma. 2023. TABASCO: A Transformer Based Contextualization Toolkit. Science of Computer Programming (2023), 102994.
  • Moon et al . (2023) Seungjun Moon, Yongho Song, Hyungjoo Chae, Dongjin Kang, Taeyoon Kwon, Kai Tzu-iunn Ong, Seung-won Hwang, and Jinyoung Yeo. 2023. Coffee: Boost your code llms by fixing bugs with feedback. arXiv preprint arXiv:2311.07215 (2023).
  • Moore and Lewis (2010) Robert C Moore and William Lewis. 2010. Intelligent selection of language model training data. In Proceedings of the ACL 2010 conference short papers . 220–224.
  • Moss (2021) Sebastian Moss. 2021. Google Brain unveils trillion-parameter AI language model, the largest yet. https://aibusiness.com/nlp/google-brain-unveils-trillion-parameter-ai-language-model-the-largest-yet.
  • Motger et al . (2024) Quim Motger, Alessio Miaschi, Felice Dell’Orletta, Xavier Franch, and Jordi Marco. 2024. T-FREX: A Transformer-based Feature Extraction Method from Mobile App Reviews. arXiv preprint arXiv:2401.03833 (2024).
  • Mu et al . (2023) Fangwen Mu, Lin Shi, Song Wang, Zhuohao Yu, Binquan Zhang, Chenxue Wang, Shichao Liu, and Qing Wang. 2023. ClarifyGPT: Empowering LLM-based Code Generation with Intention Clarification. arXiv preprint arXiv:2310.10996 (2023).
  • Mukherjee and Hellendoorn (2023) Manisha Mukherjee and Vincent J Hellendoorn. 2023. Stack Over-Flowing with Results: The Case for Domain-Specific Pre-Training Over One-Size-Fits-All Models. arXiv preprint arXiv:2306.03268 (2023).
  • Murali et al . (2023) Vijayaraghavan Murali, Chandra Maddila, Imad Ahmad, Michael Bolin, Daniel Cheng, Negar Ghorbani, Renuka Fernandez, and Nachiappan Nagappan. 2023. CodeCompose: A Large-Scale Industrial Deployment of AI-assisted Code Authoring. arXiv preprint arXiv:2305.12050 (2023).
  • Nam et al . (2023) Daye Nam, Andrew Macvean, Vincent Hellendoorn, Bogdan Vasilescu, and Brad Myers. 2023. In-IDE Generation-based Information Support with a Large Language Model. arXiv preprint arXiv:2307.08177 (2023).
  • Nascimento et al . (2023) Nathalia Nascimento, Paulo Alencar, and Donald Cowan. 2023. Comparing Software Developers with ChatGPT: An Empirical Investigation. arXiv preprint arXiv:2305.11837 (2023).
  • Nasir et al . (2023) Muhammad U Nasir, Sam Earle, Julian Togelius, Steven James, and Christopher Cleghorn. 2023. LLMatic: Neural Architecture Search via Large Language Models and Quality-Diversity Optimization. arXiv preprint arXiv:2306.01102 (2023).
  • Naveed et al . (2024) Hira Naveed, Chetan Arora, Hourieh Khalajzadeh, John Grundy, and Omar Haggag. 2024. Model driven engineering for machine learning components: A systematic literature review. Information and Software Technology (2024), 107423.
  • Nguyen et al . (2016) Anh Tuan Nguyen, Michael Hilton, Mihai Codoban, Hoan Anh Nguyen, Lily Mast, Eli Rademacher, Tien N Nguyen, and Danny Dig. 2016. API code recommendation using statistical learning from fine-grained changes. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering . 511–522.
  • Nguyen and Nguyen (2017) Anh Tuan Nguyen and Tien N Nguyen. 2017. Automatic categorization with deep neural network for open-source java projects. In 2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C) . IEEE, 164–166.
  • Nguyen et al . (2023) Phuong T Nguyen, Juri Di Rocco, Claudio Di Sipio, Riccardo Rubei, Davide Di Ruscio, and Massimiliano Di Penta. 2023. Is this Snippet Written by ChatGPT? An Empirical Study with a CodeBERT-Based Classifier. arXiv preprint arXiv:2307.09381 (2023).
  • Ni et al . (2023a) Ansong Ni, Srini Iyer, Dragomir Radev, Veselin Stoyanov, Wen-tau Yih, Sida Wang, and Xi Victoria Lin. 2023a. Lever: Learning to verify language-to-code generation with execution. In International Conference on Machine Learning . PMLR, 26106–26128.
  • Ni et al . (2023b) Ansong Ni, Pengcheng Yin, Yilun Zhao, Martin Riddell, Troy Feng, Rui Shen, Stephen Yin, Ye Liu, Semih Yavuz, Caiming Xiong, et al . 2023b. L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models. arXiv preprint arXiv:2309.17446 (2023).
  • Nichols et al . (2024) Daniel Nichols, Joshua H Davis, Zhaojun Xie, Arjun Rajaram, and Abhinav Bhatele. 2024. Can Large Language Models Write Parallel Code? arXiv preprint arXiv:2401.12554 (2024).
  • Nie et al . (2016) Liming Nie, He Jiang, Zhilei Ren, Zeyi Sun, and Xiaochen Li. 2016. Query expansion based on crowd knowledge for code search. IEEE Transactions on Services Computing 9, 5 (2016), 771–783.
  • Nijkamp et al . (2023) Erik Nijkamp, Hiroaki Hayashi, Caiming Xiong, Silvio Savarese, and Yingbo Zhou. 2023. Codegen2: Lessons for training llms on programming and natural languages. arXiv preprint arXiv:2305.02309 (2023).
  • Nijkamp et al . (2022a) Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022a. Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474 (2022).
  • Nijkamp et al . (2022b) Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022b. A conversational paradigm for program synthesis.
  • Niu et al . (2022) Changan Niu, Chuanyi Li, Vincent Ng, Jidong Ge, Liguo Huang, and Bin Luo. 2022. Spt-code: Sequence-to-sequence pre-training for learning source code representations. In Proceedings of the 44th International Conference on Software Engineering . 2006–2018.
  • Noever (2023) David Noever. 2023. Can large language models find and fix vulnerable software? arXiv preprint arXiv:2308.10345 (2023).
  • Ochs et al . (2023) Marcel Ochs, Krishna Narasimhan, and Mira Mezini. 2023. Evaluating and improving transformers pre-trained on ASTs for Code Completion. In 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) . IEEE, 834–844.
  • Olausson et al . (2023) Theo X Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, and Armando Solar-Lezama. 2023. Demystifying GPT Self-Repair for Code Generation. arXiv preprint arXiv:2306.09896 (2023).
  • OpenAI (2022a) OpenAI. 2022a. Chatgpt: Optimizing language models for dialogue. https://chat.openai.com.
  • OpenAI (2022b) OpenAI. 2022b. GPT-3.5. https://platform.openai.com/docs/models/gpt-3-5.
  • OpenAI (2023a) OpenAI. 2023a. Code Interpreter. https://openai.com/blog/chatgpt-plugins#code-interpreter.
  • OpenAI (2023b) OpenAI. 2023b. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774 (2023).
  • Ouyang et al . (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al . 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744.
  • Ouyang et al . (2023) Shuyin Ouyang, Jie M Zhang, Mark Harman, and Meng Wang. 2023. LLM is Like a Box of Chocolates: the Non-determinism of ChatGPT in Code Generation. arXiv preprint arXiv:2308.02828 (2023).
  • Overflow (2023) Stack Overflow. 2023. Stack Overflow. https://stackoverflow.com/.
  • Pan et al . (2023c) Jialing Pan, Adrien Sadé, Jin Kim, Eric Soriano, Guillem Sole, and Sylvain Flamant. 2023c. SteloCoder: a Decoder-Only LLM for Multi-Language to Python Code Translation. arXiv preprint arXiv:2310.15539 (2023).
  • Pan et al . (2023a) Rangeet Pan, Ali Reza Ibrahimzada, Rahul Krishna, Divya Sankar, Lambert Pouguem Wassi, Michele Merler, Boris Sobolev, Raju Pavuluri, Saurabh Sinha, and Reyhaneh Jabbarvand. 2023a. Understanding the Effectiveness of Large Language Models in Code Translation. arXiv preprint arXiv:2308.03109 (2023).
  • Pan et al . (2023b) Shirui Pan, Linhao Luo, Yufei Wang, Chen Chen, Jiapu Wang, and Xindong Wu. 2023b. Unifying Large Language Models and Knowledge Graphs: A Roadmap. arXiv preprint arXiv:2306.08302 (2023).
  • Paranjape et al . (2023) Bhargavi Paranjape, Scott Lundberg, Sameer Singh, Hannaneh Hajishirzi, Luke Zettlemoyer, and Marco Tulio Ribeiro. 2023. ART: Automatic multi-step reasoning and tool-use for large language models. arXiv preprint arXiv:2303.09014 (2023).
  • Parisotto et al . (2016) Emilio Parisotto, Abdel-rahman Mohamed, Rishabh Singh, Lihong Li, Dengyong Zhou, and Pushmeet Kohli. 2016. Neuro-symbolic program synthesis. arXiv preprint arXiv:1611.01855 (2016).
  • Patel et al . (2023) Arkil Patel, Siva Reddy, Dzmitry Bahdanau, and Pradeep Dasigi. 2023. Evaluating In-Context Learning of Libraries for Code Generation. arXiv preprint arXiv:2311.09635 (2023).
  • Patil et al . (2023) Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. 2023. Gorilla: Large language model connected with massive apis. arXiv preprint arXiv:2305.15334 (2023).
  • Paul et al . (2023a) Rishov Paul, Md Mohib Hossain, Masum Hasan, and Anindya Iqbal. 2023a. Automated Program Repair Based on Code Review: How do Pre-trained Transformer Models Perform? arXiv preprint arXiv:2304.07840 (2023).
  • Paul et al . (2023b) Rishov Paul, Md. Mohib Hossain, Mohammed Latif Siddiq, Masum Hasan, Anindya Iqbal, and Joanna C. S. Santos. 2023b. Enhancing Automated Program Repair through Fine-tuning and Prompt Engineering. arXiv preprint arXiv:2304.07840 (2023).
  • Pearce et al . (2021) Hammond Pearce, Benjamin Tan, Baleegh Ahmad, Ramesh Karri, and Brendan Dolan-Gavitt. 2021. Examining zero-shot vulnerability repair with large language models. arXiv preprint arXiv:2112.02125 (2021).
  • Pearce et al . (2023) Hammond Pearce, Benjamin Tan, Baleegh Ahmad, Ramesh Karri, and Brendan Dolan-Gavitt. 2023. Examining zero-shot vulnerability repair with large language models. In 2023 IEEE Symposium on Security and Privacy (SP) . IEEE, 2339–2356.
  • Pegolotti et al . (2023) Tommaso Pegolotti, Elias Frantar, Dan Alistarh, and Markus Püschel. 2023. QIGen: Generating Efficient Kernels for Quantized Inference on Large Language Models. arXiv preprint arXiv:2307.03738 (2023).
  • Pei et al . (2023) Kexin Pei, David Bieber, Kensen Shi, Charles Sutton, and Pengcheng Yin. 2023. Can Large Language Models Reason about Program Invariants? (2023).
  • Peng et al . (2024) Yun Peng, Shuzheng Gao, Cuiyun Gao, Yintong Huo, and Michael Lyu. 2024. Domain knowledge matters: Improving prompts with fix templates for repairing python type errors. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering . 1–13.
  • Phan et al . (2021) Long Phan, Hieu Tran, Daniel Le, Hieu Nguyen, James Anibal, Alec Peltekian, and Yanfang Ye. 2021. Cotext: Multi-task learning with code-text transformer. arXiv preprint arXiv:2105.08645 (2021).
  • Pierce and Turner (2000) Benjamin C Pierce and David N Turner. 2000. Local type inference. ACM Transactions on Programming Languages and Systems (TOPLAS) 22, 1 (2000), 1–44.
  • Piya and Sullivan (2023) Sanyogita Piya and Allison Sullivan. 2023. LLM4TDD: Best Practices for Test Driven Development Using Large Language Models. arXiv preprint arXiv:2312.04687 (2023).
  • Plein et al . (2023) Laura Plein, Wendkûuni C Ouédraogo, Jacques Klein, and Tegawendé F Bissyandé. 2023. Automatic generation of test cases based on bug reports: a feasibility study with large language models. arXiv preprint arXiv:2310.06320 (2023).
  • Poudel et al . (2023) Amrit Poudel, Jinfeng Lin, and Jane Cleland-Huang. 2023. Leveraging Transformer-based Language Models to Automate Requirements Satisfaction Assessment. arXiv preprint arXiv:2312.04463 (2023).
  • Prenner and Robbes (2021) Julian Aron Prenner and Romain Robbes. 2021. Making the most of small Software Engineering datasets with modern machine learning. IEEE Transactions on Software Engineering 48, 12 (2021), 5050–5067.
  • Pudari and Ernst (2023) Rohith Pudari and Neil A Ernst. 2023. From Copilot to Pilot: Towards AI Supported Software Development. arXiv preprint arXiv:2303.04142 (2023).
  • Qi et al . (2023) Mengnan Qi, Yufan Huang, Maoquan Wang, Yongqiang Yao, Zihan Liu, Bin Gu, Colin Clement, and Neel Sundaresan. 2023. SUT: Active Defects Probing for Transcompiler Models. arXiv preprint arXiv:2310.14209 (2023).
  • Qian et al . (2023) Chen Qian, Xin Cong, Cheng Yang, Weize Chen, Yusheng Su, Juyuan Xu, Zhiyuan Liu, and Maosong Sun. 2023. Communicative Agents for Software Development. arXiv preprint arXiv:2307.07924 (2023).
  • Quan et al . (2023) Vu Le Anh Quan, Chau Thuan Phat, Kiet Van Nguyen, Phan The Duy, and Van-Hau Pham. 2023. XGV-BERT: Leveraging Contextualized Language Model and Graph Neural Network for Efficient Software Vulnerability Detection. arXiv preprint arXiv:2309.14677 (2023).
  • Radford et al . (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al . 2018. Improving language understanding by generative pre-training. (2018).
  • Radford et al . (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al . 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
  • Raffel et al . (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21, 1 (2020), 5485–5551.
  • Rahmani et al . (2023) Sajjad Rahmani, AmirHossein Naghshzan, and Latifa Guerrouj. 2023. Improving Code Example Recommendations on Informal Documentation Using BERT and Query-Aware LSH: A Comparative Study. arXiv preprint arXiv:2305.03017 (2023).
  • Ramirez et al . (2018) Aurora Ramirez, Jose Raul Romero, and Christopher L Simons. 2018. A systematic review of interaction in search-based software engineering. IEEE Transactions on Software Engineering 45, 8 (2018), 760–781.
  • Ramly (2023) Sami Ramly. 2023. Preventing Abuse of LLMs’ Alignment Deficit by Injection Neutralization (PALADIN). https://medium.com/@SamiRamly/prompt-attacks-are-llm-jailbreaks-inevitable-f7848cc11122.
  • Rao et al . (2023b) Abhinav Rao, Sachin Vashistha, Atharva Naik, Somak Aditya, and Monojit Choudhury. 2023b. Tricking LLMs into Disobedience: Understanding, Analyzing, and Preventing Jailbreaks. arXiv preprint arXiv:2305.14965 (2023).
  • Rao et al . (2023a) Nikitha Rao, Jason Tsay, Kiran Kate, Vincent J Hellendoorn, and Martin Hirzel. 2023a. AI for Low-Code for AI. arXiv preprint arXiv:2305.20015 (2023).
  • Raychev et al . (2014) Veselin Raychev, Martin Vechev, and Eran Yahav. 2014. Code completion with statistical language models. In Proceedings of the 35th ACM SIGPLAN conference on programming language design and implementation . 419–428.
  • Ren et al . (2023) Xiaoxue Ren, Xinyuan Ye, Dehai Zhao, Zhenchang Xing, and Xiaohu Yang. 2023. From Misuse to Mastery: Enhancing Code Generation with Knowledge-Driven AI Chaining. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE) . IEEE, 976–987.
  • Ridnik et al . (2024) Tal Ridnik, Dedy Kredo, and Itamar Friedman. 2024. Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering. arXiv preprint arXiv:2401.08500 (2024).
  • Rierson (2017) Leanna Rierson. 2017. Developing safety-critical software: a practical guide for aviation software and DO-178C compliance . CRC Press.
  • Rillig et al . (2023) Matthias C Rillig, Marlene Ågerstrand, Mohan Bi, Kenneth A Gould, and Uli Sauerland. 2023. Risks and benefits of large language models for the environment. Environmental Science & Technology 57, 9 (2023), 3464–3466.
  • Robillard (2009) Martin P Robillard. 2009. What makes APIs hard to learn? Answers from developers. IEEE software 26, 6 (2009), 27–34.
  • Robillard and DeLine (2011) Martin P Robillard and Robert DeLine. 2011. A field study of API learning obstacles. Empirical Software Engineering 16 (2011), 703–732.
  • Roehm et al . (2012) Tobias Roehm, Rebecca Tiarks, Rainer Koschke, and Walid Maalej. 2012. How do professional developers comprehend software?. In 2012 34th International Conference on Software Engineering (ICSE) . IEEE, 255–265.
  • Ronanki et al . (2023) Krishna Ronanki, Beatriz Cabrero-Daniel, and Christian Berger. 2023. ChatGPT as a tool for User Story Quality Evaluation: Trustworthy Out of the Box? arXiv preprint arXiv:2306.12132 (2023).
  • Roziere et al . (2021) Baptiste Roziere, Marie-Anne Lachaux, Marc Szafraniec, and Guillaume Lample. 2021. Dobf: A deobfuscation pre-training objective for programming languages. arXiv preprint arXiv:2102.07492 (2021).
  • Ruiz et al . (2024) Fernando Vallecillos Ruiz, Anastasiia Grishina, Max Hort, and Leon Moonen. 2024. A Novel Approach for Automatic Program Repair using Round-Trip Translation with Large Language Models. arXiv preprint arXiv:2401.07994 (2024).
  • Saberi et al . (2023a) Iman Saberi, Fatemeh Fard, and Fuxiang Chen. 2023a. Multilingual Adapter-based Knowledge Aggregation on Code Summarization for Low-Resource Languages. arXiv preprint arXiv:2307.07854 (2023).
  • Saberi et al . (2023b) Iman Saberi, Fatemeh Fard, and Fuxiang Chen. 2023b. Utilization of Pre-trained Language Model for Adapter-based Knowledge Transfer in Software Engineering. arXiv preprint arXiv:2307.08540 (2023).
  • Sadik et al . (2023) Ahmed Sadik, Antonello Ceravola, Frank Joublin, and Jibesh Patra. 2023. Analysis of ChatGPT on Source Code. arXiv preprint arXiv:2306.00597 (2023).
  • Sahoo et al . (2024) Pranab Sahoo, Ayush Kumar Singh, Sriparna Saha, Vinija Jain, Samrat Mondal, and Aman Chadha. 2024. A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications. arXiv preprint arXiv:2402.07927 (2024).
  • Saieva et al . (2023) Anthony Saieva, Saikat Chakraborty, and Gail Kaiser. 2023. On Contrastive Learning of Semantic Similarity for Code to Code Search. arXiv preprint arXiv:2305.03843 (2023).
  • Sakib et al . (2023) Fardin Ahsan Sakib, Saadat Hasan Khan, and AHM Karim. 2023. Extending the Frontier of ChatGPT: Code Generation and Debugging. arXiv preprint arXiv:2307.08260 (2023).
  • Salza et al . (2022) Pasquale Salza, Christoph Schwizer, Jian Gu, and Harald C Gall. 2022. On the effectiveness of transfer learning for code search. IEEE Transactions on Software Engineering (2022).
  • Satyanarayanan et al . (1992) Mahadev Satyanarayanan, David C Steere, Masashi Kudo, and Hank Mashburn. 1992. Transparent logging as a technique for debugging complex distributed systems. In Proceedings of the 5th workshop on ACM SIGOPS European workshop: Models and paradigms for distributed systems structuring . 1–3.
  • Scao et al . (2022) Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al . 2022. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100 (2022).
  • Schäfer et al . (2023a) Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. 2023a. Adaptive test generation using a large language model. arXiv preprint arXiv:2302.06527 (2023).
  • Schäfer et al . (2023b) Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. 2023b. An empirical evaluation of using large language models for automated unit test generation. IEEE Transactions on Software Engineering (2023).
  • Schlag et al . (2023) Imanol Schlag, Sainbayar Sukhbaatar, Asli Celikyilmaz, Wen-tau Yih, Jason Weston, Jürgen Schmidhuber, and Xian Li. 2023. Large Language Model Programs. arXiv preprint arXiv:2305.05364 (2023).
  • Schroder (2023) Martin Schroder. 2023. AutoScrum: Automating Project Planning Using Large Language Models. arXiv preprint arXiv:2306.03197 (2023).
  • Sghaier and Sahraoui (2023) Oussama Ben Sghaier and Houari Sahraoui. 2023. A Multi-Step Learning Approach to Assist Code Review. In 2023 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) . IEEE, 450–460.
  • Shanahan (2022) Murray Shanahan. 2022. Talking about large language models. arXiv preprint arXiv:2212.03551 (2022).
  • Shapkin et al . (2023) Anton Shapkin, Denis Litvinov, and Timofey Bryksin. 2023. Entity-augmented code generation. arXiv preprint arXiv:2312.08976 (2023).
  • Sharma et al . (2022) Rishab Sharma, Fuxiang Chen, Fatemeh Fard, and David Lo. 2022. An exploratory study on code attention in BERT. In Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension . 437–448.
  • Shen et al . (2022) Da Shen, Xinyun Chen, Chenguang Wang, Koushik Sen, and Dawn Song. 2022. Benchmarking Language Models for Code Syntax Understanding. arXiv preprint arXiv:2210.14473 (2022).
  • Sheng et al . (2023) Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Re, Ion Stoica, and Ce Zhang. 2023. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU. (2023).
  • Shestov et al . (2024) Alexey Shestov, Anton Cheshkov, Rodion Levichev, Ravil Mussabayev, Pavel Zadorozhny, Evgeny Maslov, Chibirev Vadim, and Egor Bulychev. 2024. Finetuning Large Language Models for Vulnerability Detection. arXiv preprint arXiv:2401.17010 (2024).
  • Shi et al . (2023a) Ensheng Shi, Yanlin Wang, Hongyu Zhang, Lun Du, Shi Han, Dongmei Zhang, and Hongbin Sun. 2023a. Towards Efficient Fine-tuning of Pre-trained Code Models: An Experimental Study and Beyond. arXiv preprint arXiv:2304.05216 (2023).
  • Shi et al . (2023c) Ensheng Shi, Fengji Zhang, Yanlin Wang, Bei Chen, Lun Du, Hongyu Zhang, Shi Han, Dongmei Zhang, and Hongbin Sun. 2023c. SoTaNa: The Open-Source Software Development Assistant. arXiv preprint arXiv:2308.13416 (2023).
  • Shi et al . (2023b) Jieke Shi, Zhou Yang, Bowen Xu, Hong Jin Kang, and David Lo. 2023b. Compressing Pre-Trained Models of Code into 3 MB. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering (ASE ’22) . Association for Computing Machinery, New York, NY, USA, Article 24, 12 pages. https://doi.org/10.1145/3551349.3556964
  • Shi et al . (2022) Zejian Shi, Yun Xiong, Xiaolong Zhang, Yao Zhang, Shanshan Li, and Yangyong Zhu. 2022. Cross-Modal Contrastive Learning for Code Search. In 2022 IEEE International Conference on Software Maintenance and Evolution (ICSME) . IEEE, 94–105.
  • Shin et al . (2023a) Jiho Shin, Sepehr Hashtroudi, Hadi Hemmati, and Song Wang. 2023a. Domain Adaptation for Deep Unit Test Case Generation. arXiv e-prints (2023).
  • Shin et al . (2023b) Jiho Shin, Clark Tang, Tahmineh Mohati, Maleknaz Nayebi, Song Wang, and Hadi Hemmati. 2023b. Prompt Engineering or Fine Tuning: An Empirical Assessment of Large Language Models in Automated Software Engineering Tasks. arXiv preprint arXiv:2310.10508 (2023).
  • Shirafuji et al . (2023) Atsushi Shirafuji, Yutaka Watanobe, Takumi Ito, Makoto Morishita, Yuki Nakamura, Yusuke Oda, and Jun Suzuki. 2023. Exploring the Robustness of Large Language Models for Solving Programming Problems. arXiv preprint arXiv:2306.14583 (2023).
  • Shypula et al . (2023) Alexander Shypula, Aman Madaan, Yimeng Zeng, Uri Alon, Jacob Gardner, Milad Hashemi, Graham Neubig, Parthasarathy Ranganathan, Osbert Bastani, and Amir Yazdanbakhsh. 2023. Learning performance-improving code edits. arXiv preprint arXiv:2302.07867 (2023).
  • Siddiq et al . (2023a) Mohammed Latif Siddiq, Beatrice Casey, and Joanna Santos. 2023a. A Lightweight Framework for High-Quality Code Generation. arXiv preprint arXiv:2307.08220 (2023).
  • Siddiq et al . (2023b) Mohammed Latif Siddiq, Joanna Santos, Ridwanul Hasan Tanvir, Noshin Ulfat, Fahmid Al Rifat, and Vinicius Carvalho Lopes. 2023b. Exploring the Effectiveness of Large Language Models in Generating Unit Tests. arXiv preprint arXiv:2305.00418 (2023).
  • Silva et al . (2023) André Silva, Sen Fang, and Martin Monperrus. 2023. RepairLLaMA: Efficient Representations and Fine-Tuned Adapters for Program Repair. arXiv preprint arXiv:2312.15698 (2023).
  • Singla (2023) Adish Singla. 2023. Evaluating ChatGPT and GPT-4 for Visual Programming. arXiv preprint arXiv:2308.02522 (2023).
  • Sobania et al . (2023) Dominik Sobania, Martin Briesch, Carol Hanna, and Justyna Petke. 2023. An analysis of the automatic bug fixing performance of chatgpt. arXiv preprint arXiv:2301.08653 (2023).
  • Sridhara et al . (2023) Giriprasad Sridhara, Sourav Mazumdar, et al . 2023. ChatGPT: A Study on its Utility for Ubiquitous Software Engineering Tasks. arXiv preprint arXiv:2305.16837 (2023).
  • Srivastava et al . (2010) Saurabh Srivastava, Sumit Gulwani, and Jeffrey S Foster. 2010. From program verification to program synthesis. In Proceedings of the 37th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages . 313–326.
  • Steenhoek et al . (2024) Benjamin Steenhoek, Hongyang Gao, and Wei Le. 2024. Dataflow Analysis-Inspired Deep Learning for Efficient Vulnerability Detection. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering . 1–13.
  • Steenhoek et al . (2023) Benjamin Steenhoek, Michele Tufano, Neel Sundaresan, and Alexey Svyatkovskiy. 2023. Reinforcement Learning from Automatic Feedback for High-Quality Unit Test Generation. arXiv preprint arXiv:2310.02368 (2023).
  • Su et al . (2022) Hongjin Su, Jungo Kasai, Chen Henry Wu, Weijia Shi, Tianlu Wang, Jiayi Xin, Rui Zhang, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, et al . 2022. Selective annotation makes language models better few-shot learners. arXiv preprint arXiv:2209.01975 (2022).
  • Sun et al . (2023d) Chuyue Sun, Ying Sheng, Oded Padon, and Clark Barrett. 2023d. Clover: Closed-Loop Verifiable Code Generation. arXiv preprint arXiv:2310.17807 (2023).
  • Sun et al . (2023f) Jiamou Sun, Zhenchang Xing, Qinghua Lu, Xiwei Xu, Liming Zhu, Thong Hoang, and Dehai Zhao. 2023f. Silent Vulnerable Dependency Alert Prediction with Vulnerability Key Aspect Explanation. arXiv preprint arXiv:2302.07445 (2023).
  • Sun et al . (2023a) Tiezhu Sun, Kevin Allix, Kisub Kim, Xin Zhou, Dongsun Kim, David Lo, Tegawendé F Bissyandé, and Jacques Klein. 2023a. Dexbert: effective, task-agnostic and fine-grained representation learning of Android bytecode. IEEE Transactions on Software Engineering (2023).
  • Sun et al . (2023b) Weisong Sun, Chunrong Fang, Yudu You, Yuchen Chen, Yi Liu, Chong Wang, Jian Zhang, Quanjun Zhang, Hanwei Qian, Wei Zhao, et al . 2023b. A Prompt Learning Framework for Source Code Summarization. arXiv preprint arXiv:2312.16066 (2023).
  • Sun et al . (2023c) Weisong Sun, Chunrong Fang, Yudu You, Yun Miao, Yi Liu, Yuekang Li, Gelei Deng, Shenghan Huang, Yuchen Chen, Quanjun Zhang, et al . 2023c. Automatic Code Summarization via ChatGPT: How Far Are We? arXiv preprint arXiv:2305.12865 (2023).
  • Sun et al . (2024b) Yuqiang Sun, Daoyuan Wu, Yue Xue, Han Liu, Wei Ma, Lyuye Zhang, Miaolei Shi, and Yang Liu. 2024b. LLM4Vuln: A Unified Evaluation Framework for Decoupling and Enhancing LLMs’ Vulnerability Reasoning. arXiv preprint arXiv:2401.16185 (2024).
  • Sun et al . (2023e) Yuqiang Sun, Daoyuan Wu, Yue Xue, Han Liu, Haijun Wang, Zhengzi Xu, Xiaofei Xie, and Yang Liu. 2023e. When GPT Meets Program Analysis: Towards Intelligent Detection of Smart Contract Logic Vulnerabilities in GPTScan. arXiv preprint arXiv:2308.03314 (2023).
  • Sun et al . (2024a) Zhensu Sun, Xiaoning Du, Fu Song, Shangwen Wang, and Li Li. 2024a. When Neural Code Completion Models Size up the Situation: Attaining Cheaper and Faster Completion through Dynamic Model Inference. arXiv preprint arXiv:2401.09964 (2024).
  • Sun et al . (2022) Zhensu Sun, Li Li, Yan Liu, Xiaoning Du, and Li Li. 2022. On the importance of building high-quality training datasets for neural code search. In Proceedings of the 44th International Conference on Software Engineering . 1609–1620.
  • Svajlenko et al . (2014) Jeffrey Svajlenko, Judith F Islam, Iman Keivanloo, Chanchal K Roy, and Mohammad Mamun Mia. 2014. Towards a big data curated benchmark of inter-project code clones. In 2014 IEEE International Conference on Software Maintenance and Evolution . IEEE, 476–480.
  • Tabassum et al . (2020) Jeniya Tabassum, Mounica Maddela, Wei Xu, and Alan Ritter. 2020. Code and named entity recognition in stackoverflow. arXiv preprint arXiv:2005.01634 (2020).
  • Tan et al . (2023) Chee Wei Tan, Shangxin Guo, Man Fai Wong, and Ching Nam Hang. 2023. Copilot for Xcode: Exploring AI-Assisted Programming by Prompting Cloud-based Large Language Models. arXiv preprint arXiv:2307.14349 (2023).
  • Tang et al . (2023d) Wei Tang, Mingwei Tang, Minchao Ban, Ziguo Zhao, and Mingjun Feng. 2023d. CSGVD: A deep learning approach combining sequence and graph embedding for source code vulnerability detection. Journal of Systems and Software 199 (2023), 111623.
  • Tang et al . (2023a) Xunzhu Tang, Zhenghan Chen, Kisub Kim, Haoye Tian, Saad Ezzini, and Jacques Klein. 2023a. Just-in-Time Security Patch Detection–LLM At the Rescue for Data Augmentation. arXiv preprint arXiv:2312.01241 (2023).
  • Tang et al . (2023c) Yutian Tang, Zhijie Liu, Zhichao Zhou, and Xiapu Luo. 2023c. ChatGPT vs SBST: A Comparative Assessment of Unit Test Suite Generation. arXiv preprint arXiv:2307.00588 (2023).
  • Tang et al . (2023b) Ze Tang, Jidong Ge, Shangqing Liu, Tingwei Zhu, Tongtong Xu, Liguo Huang, and Bin Luo. 2023b. Domain Adaptive Code Completion via Language Models and Decoupled Domain Databases. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE) . IEEE, 421–433.
  • Tarassow (2023) Artur Tarassow. 2023. The potential of LLMs for coding with low-resource and domain-specific programming languages. arXiv preprint arXiv:2307.13018 (2023).
  • Taylor et al . (2022) Ross Taylor, Marcin Kardas, Guillem Cucurull, Thomas Scialom, Anthony Hartshorn, Elvis Saravia, Andrew Poulton, Viktor Kerkez, and Robert Stojnic. 2022. Galactica: A large language model for science. arXiv preprint arXiv:2211.09085 (2022).
  • Thakur et al . (2023) Shailja Thakur, Baleegh Ahmad, Hammond Pearce, Benjamin Tan, Brendan Dolan-Gavitt, Ramesh Karri, and Siddharth Garg. 2023. VeriGen: A Large Language Model for Verilog Code Generation. arXiv preprint arXiv:2308.00708 (2023).
  • Thapa et al . (2022) Chandra Thapa, Seung Ick Jang, Muhammad Ejaz Ahmed, Seyit Camtepe, Josef Pieprzyk, and Surya Nepal. 2022. Transformer-based language models for software vulnerability detection. In Proceedings of the 38th Annual Computer Security Applications Conference . 481–496.
  • Tian et al . (2020) Haoye Tian, Kui Liu, Abdoul Kader Kaboré, Anil Koyuncu, Li Li, Jacques Klein, and Tegawendé F Bissyandé. 2020. Evaluating representation learning of code changes for predicting patch correctness in program repair. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering . 981–992.
  • Tian et al . (2023a) Haoye Tian, Kui Liu, Yinghua Li, Abdoul Kader Kaboré, Anil Koyuncu, Andrew Habib, Li Li, Junhao Wen, Jacques Klein, and Tegawendé F Bissyandé. 2023a. The Best of Both Worlds: Combining Learned Embeddings with Engineered Features for Accurate Prediction of Correct Patches. ACM Transactions on Software Engineering and Methodology 32, 4 (2023), 1–34.
  • Tian et al . (2023b) Haoye Tian, Weiqi Lu, Tsz On Li, Xunzhu Tang, Shing-Chi Cheung, Jacques Klein, and Tegawendé F Bissyandé. 2023b. Is ChatGPT the Ultimate Programming Assistant–How far is it? arXiv preprint arXiv:2304.11938 (2023).
  • Tian et al . (2024) Runchu Tian, Yining Ye, Yujia Qin, Xin Cong, Yankai Lin, Zhiyuan Liu, and Maosong Sun. 2024. Debugbench: Evaluating debugging capability of large language models. arXiv preprint arXiv:2401.04621 (2024).
  • Tian and Chen (2023) Zhao Tian and Junjie Chen. 2023. Test-case-driven programming understanding in large language models for better code generation. arXiv preprint arXiv:2309.16120 (2023).
  • Tihanyi et al . (2023) Norbert Tihanyi, Tamas Bisztray, Ridhi Jain, Mohamed Amine Ferrag, Lucas C Cordeiro, and Vasileios Mavroeidis. 2023. The FormAI Dataset: Generative AI in Software Security Through the Lens of Formal Verification. arXiv preprint arXiv:2307.02192 (2023).
  • Touvron et al . (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al . 2023a. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
  • Touvron et al . (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al . 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
  • Tu et al . (2023) Haoxin Tu, Zhide Zhou, He Jiang, Imam Nur Bani Yusuf, Yuxian Li, and Lingxiao Jiang. 2023. LLM4CBI: Taming LLMs to Generate Effective Test Programs for Compiler Bug Isolation. arXiv preprint arXiv:2307.00593 (2023).
  • Tufano et al . (2023) Michele Tufano, Shubham Chandel, Anisha Agarwal, Neel Sundaresan, and Colin Clement. 2023. Predicting Code Coverage without Execution. arXiv preprint arXiv:2307.13383 (2023).
  • Tufano et al . (2022) Rosalia Tufano, Simone Masiero, Antonio Mastropaolo, Luca Pascarella, Denys Poshyvanyk, and Gabriele Bavota. 2022. Using pre-trained models to boost code review automation. In Proceedings of the 44th International Conference on Software Engineering . 2291–2302.
  • Vaswani et al . (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
  • Vikram et al . (2023) Vasudev Vikram, Caroline Lemieux, and Rohan Padhye. 2023. Can Large Language Models Write Good Property-Based Tests? arXiv preprint arXiv:2307.04346 (2023).
  • Von der Mosel et al . (2022) Julian Von der Mosel, Alexander Trautsch, and Steffen Herbold. 2022. On the validity of pre-trained transformers for natural language processing in the software engineering domain. IEEE Transactions on Software Engineering 49, 4 (2022), 1487–1507.
  • Wadhwa et al . (2023) Nalin Wadhwa, Jui Pradhan, Atharv Sonwane, Surya Prakash Sahu, Nagarajan Natarajan, Aditya Kanade, Suresh Parthasarathy, and Sriram Rajamani. 2023. Frustrated with code quality issues? llms can help! arXiv preprint arXiv:2309.12938 (2023).
  • Wan et al . (2019) Yao Wan, Jingdong Shu, Yulei Sui, Guandong Xu, Zhou Zhao, Jian Wu, and Philip Yu. 2019. Multi-modal attention network learning for semantic source code retrieval. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE) . IEEE, 13–25.
  • Wan et al . (2022a) Yao Wan, Shijie Zhang, Hongyu Zhang, Yulei Sui, Guandong Xu, Dezhong Yao, Hai Jin, and Lichao Sun. 2022a. You See What I Want You to See: Poisoning Vulnerabilities in Neural Code Search. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Singapore, Singapore) (ESEC/FSE 2022) . Association for Computing Machinery, New York, NY, USA, 1233–1245. https://doi.org/10.1145/3540250.3549153
  • Wan et al . (2022b) Yao Wan, Wei Zhao, Hongyu Zhang, Yulei Sui, Guandong Xu, and Hai Jin. 2022b. What do they capture? a structural analysis of pre-trained language models for source code. In Proceedings of the 44th International Conference on Software Engineering . 2377–2388.
  • Wan et al . (2018) Yao Wan, Zhou Zhao, Min Yang, Guandong Xu, Haochao Ying, Jian Wu, and Philip S Yu. 2018. Improving automatic source code summarization via deep reinforcement learning. In Proceedings of the 33rd ACM/IEEE international conference on automated software engineering . 397–407.
  • Wang and Komatsuzaki (2021) Ben Wang and Aran Komatsuzaki. 2021. GPT-J-6B: A 6 billion parameter autoregressive language model.
  • Wang et al . (2023g) Chong Wang, Jianan Liu, Xin Peng, Yang Liu, and Yiling Lou. 2023g. Boosting Static Resource Leak Detection via LLM-based Resource-Oriented Intention Inference. arXiv preprint arXiv:2311.04448 (2023).
  • Wang et al . (2024c) Chong Wang, Jian Zhang, Yebo Feng, Tianlin Li, Weisong Sun, Yang Liu, and Xin Peng. 2024c. Teaching Code LLMs to Use Autocompletion Tools in Repository-Level Code Generation. arXiv preprint arXiv:2401.06391 (2024).
  • Wang et al . (2023a) Deze Wang, Boxing Chen, Shanshan Li, Wei Luo, Shaoliang Peng, Wei Dong, and Xiangke Liao. 2023a. One Adapter for All Programming Languages? Adapter Tuning for Code Search and Summarization. arXiv preprint arXiv:2303.15822 (2023).
  • Wang et al . (2023c) Junjie Wang, Yuchao Huang, Chunyang Chen, Zhe Liu, Song Wang, and Qing Wang. 2023c. Software Testing with Large Language Model: Survey, Landscape, and Vision. arXiv preprint arXiv:2307.07221 (2023).
  • Wang et al . (2023h) Jian Wang, Shangqing Liu, Xiaofei Xie, and Yi Li. 2023h. Evaluating AIGC Detectors on Code Content. arXiv preprint arXiv:2304.05193 (2023).
  • Wang et al . (2024a) Shuai Wang, Liang Ding, Li Shen, Yong Luo, Bo Du, and Dacheng Tao. 2024a. OOP: Object-Oriented Programming Evaluation Benchmark for Large Language Models. arXiv preprint arXiv:2401.06628 (2024).
  • Wang et al . (2023b) Shangwen Wang, Mingyang Geng, Bo Lin, Zhensu Sun, Ming Wen, Yepang Liu, Li Li, Tegawendé F Bissyandé, and Xiaoguang Mao. 2023b. Natural Language to Code: How Far Are We?. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering . 375–387.
  • Wang et al . (2022a) Simin Wang, Liguo Huang, Amiao Gao, Jidong Ge, Tengfei Zhang, Haitao Feng, Ishna Satyarth, Ming Li, He Zhang, and Vincent Ng. 2022a. Machine/deep learning for software engineering: A systematic literature review. IEEE Transactions on Software Engineering 49, 3 (2022), 1188–1231.
  • Wang et al . (2023d) Shufan Wang, Sebastien Jean, Sailik Sengupta, James Gung, Nikolaos Pappas, and Yi Zhang. 2023d. Measuring and Mitigating Constraint Violations of In-Context Learning for Utterance-to-API Semantic Parsing. arXiv preprint arXiv:2305.15338 (2023).
  • Wang et al . (2022b) Shiqi Wang, Zheng Li, Haifeng Qian, Chenghao Yang, Zijian Wang, Mingyue Shang, Varun Kumar, Samson Tan, Baishakhi Ray, Parminder Bhatia, et al . 2022b. ReCode: Robustness Evaluation of Code Generation Models. arXiv preprint arXiv:2212.10264 (2022).
  • Wang et al . (2020a) Wenhan Wang, Ge Li, Bo Ma, Xin Xia, and Zhi Jin. 2020a. Detecting code clones with graph neural network and flow-augmented abstract syntax tree. In 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER) . IEEE, 261–271.
  • Wang et al . (2020b) Wenhan Wang, Ge Li, Sijie Shen, Xin Xia, and Zhi Jin. 2020b. Modular tree network for source code representation learning. ACM Transactions on Software Engineering and Methodology (TOSEM) 29, 4 (2020), 1–23.
  • Wang et al . (2023j) Weishi Wang, Yue Wang, Shafiq Joty, and Steven CH Hoi. 2023j. Rap-gen: Retrieval-augmented patch generation with codet5 for automatic program repair. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering . 146–158.
  • Wang et al . (2023i) Xingyao Wang, Hao Peng, Reyhaneh Jabbarvand, and Heng Ji. 2023i. LeTI: Learning to Generate from Textual Interactions. arXiv preprint arXiv:2305.10314 (2023).
  • Wang et al . (2021b) Xin Wang, Yasheng Wang, Fei Mi, Pingyi Zhou, Yao Wan, Xiao Liu, Li Li, Hao Wu, Jin Liu, and Xin Jiang. 2021b. Syncobert: Syntax-guided multi-modal contrastive pre-training for code representation. arXiv preprint arXiv:2108.04556 (2021).
  • Wang et al . (2024b) Yanlin Wang, Yanxian Huang, Daya Guo, Hongyu Zhang, and Zibin Zheng. 2024b. SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization. arXiv preprint arXiv:2401.14727 (2024).
  • Wang et al . (2023e) Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi DQ Bui, Junnan Li, and Steven CH Hoi. 2023e. Codet5+: Open code large language models for code understanding and generation. arXiv preprint arXiv:2305.07922 (2023).
  • Wang et al . (2020c) Yawen Wang, Lin Shi, Mingyang Li, Qing Wang, and Yun Yang. 2020c. A deep context-wise method for coreference detection in natural language requirements. In 2020 IEEE 28th International Requirements Engineering Conference (RE) . IEEE, 180–191.
  • Wang et al . (2022c) Yawen Wang, Junjie Wang, Hongyu Zhang, Xuran Ming, Lin Shi, and Qing Wang. 2022c. Where is your app frustrating users?. In Proceedings of the 44th International Conference on Software Engineering . 2427–2439.
  • Wang et al . (2021a) Yue Wang, Weishi Wang, Shafiq Joty, and Steven CH Hoi. 2021a. Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. arXiv preprint arXiv:2109.00859 (2021).
  • Wang et al . (2023f) Zejun Wang, Jia Li, Ge Li, and Zhi Jin. 2023f. ChatCoder: Chat-based Refine Requirement Improves LLMs’ Code Generation. arXiv preprint arXiv:2311.00272 (2023).
  • Watson et al . (2022) Cody Watson, Nathan Cooper, David Nader Palacio, Kevin Moran, and Denys Poshyvanyk. 2022. A systematic literature review on the use of deep learning in software engineering research. ACM Transactions on Software Engineering and Methodology (TOSEM) 31, 2 (2022), 1–58.
  • Wei and Li (2017) Huihui Wei and Ming Li. 2017. Supervised deep features for software functional clone detection by exploiting lexical and syntactical information in source code.. In IJCAI . 3034–3040.
  • Wei et al . (2022b) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al . 2022b. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35 (2022), 24824–24837.
  • Wei et al . (2022a) Moshi Wei, Nima Shiri Harzevili, Yuchao Huang, Junjie Wang, and Song Wang. 2022a. Clear: contrastive learning for api recommendation. In Proceedings of the 44th International Conference on Software Engineering . 376–387.
  • Wei et al . (2023a) Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. 2023a. Magicoder: Source code is all you need. arXiv preprint arXiv:2312.02120 (2023).
  • Wei et al . (2023b) Yuxiang Wei, Chunqiu Steven Xia, and Lingming Zhang. 2023b. Copiloting the copilots: Fusing large language models with completion engines for automated program repair. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering . 172–184.
  • Weyssow et al . (2023a) Martin Weyssow, Xin Zhou, Kisub Kim, David Lo, and Houari Sahraoui. 2023a. Exploring parameter-efficient fine-tuning techniques for code generation with large language models. arXiv preprint arXiv:2308.10462 (2023).
  • Weyssow et al . (2023b) Martin Weyssow, Xin Zhou, Kisub Kim, David Lo, and Houari Sahraoui. 2023b. On the Usage of Continual Learning for Out-of-Distribution Generalization in Pre-trained Language Models of Code. arXiv preprint arXiv:2305.04106 (2023).
  • White et al . (2023a) Jules White, Quchen Fu, Sam Hays, Michael Sandborn, Carlos Olea, Henry Gilbert, Ashraf Elnashar, Jesse Spencer-Smith, and Douglas C Schmidt. 2023a. A prompt pattern catalog to enhance prompt engineering with chatgpt. arXiv preprint arXiv:2302.11382 (2023).
  • White et al . (2023b) Jules White, Sam Hays, Quchen Fu, Jesse Spencer-Smith, and Douglas C Schmidt. 2023b. Chatgpt prompt patterns for improving code quality, refactoring, requirements elicitation, and software design. arXiv preprint arXiv:2303.07839 (2023).
  • Widjojo and Treude (2023) Patricia Widjojo and Christoph Treude. 2023. Addressing Compiler Errors: Stack Overflow or Large Language Models? arXiv preprint arXiv:2307.10793 (2023).
  • Widyasari et al . (2023) Ratnadira Widyasari, Ting Zhang, Abir Bouraffa, and David Lo. 2023. Explaining Explanation: An Empirical Study on Explanation in Code Reviews. arXiv preprint arXiv:2311.09020 (2023).
  • Wong et al . (2023) Man-Fai Wong, Shangxin Guo, Ching-Nam Hang, Siu-Wai Ho, and Chee-Wei Tan. 2023. Natural Language Generation and Understanding of Big Code for AI-Assisted Programming: A Review. Entropy 25, 6 (2023), 888.
  • Wu et al . (2024) Di Wu, Yang Feng, Hongyu Zhang, and Baowen Xu. 2024. Automatic recognizing relevant fragments of APIs using API references. Automated Software Engineering 31, 1 (2024), 3.
  • Wu et al . (2023d) Fangzhou Wu, Xiaogeng Liu, and Chaowei Xiao. 2023d. Deceptprompt: Exploiting llm-driven code generation via adversarial natural language instructions. arXiv preprint arXiv:2312.04730 (2023).
  • Wu et al . (2023e) Fangzhao Wu, Yueqi Xie, Jingwei Yi, Jiawei Shao, Justin Curl, Lingjuan Lyu, Qifeng Chen, and Xing Xie. 2023e. Defending ChatGPT against Jailbreak Attack via Self-Reminder. (2023).
  • Wu et al . (2023b) Tongshuang Wu, Kenneth Koedinger, et al . 2023b. Is AI the better programming partner? Human-Human Pair Programming vs. Human-AI pAIr Programming. arXiv preprint arXiv:2306.05153 (2023).
  • Wu et al . (2023a) Yi Wu, Nan Jiang, Hung Viet Pham, Thibaud Lutellier, Jordan Davis, Lin Tan, Petr Babkin, and Sameena Shah. 2023a. How Effective Are Neural Networks for Fixing Security Vulnerabilities. arXiv preprint arXiv:2305.18607 (2023).
  • Wu et al . (2023c) Yonghao Wu, Zheng Li, Jie M Zhang, Mike Papadakis, Mark Harman, and Yong Liu. 2023c. Large language models in fault localisation. arXiv preprint arXiv:2308.15276 (2023).
  • Xia et al . (2024) Chunqiu Steven Xia, Matteo Paltenghi, Jia Le Tian, Michael Pradel, and Lingming Zhang. 2024. Fuzz4all: Universal fuzzing with large language models. arXiv preprint arXiv:2308.04748 (2024).
  • Xia et al . (2022) Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. 2022. Practical program repair in the era of large pre-trained language models. arXiv preprint arXiv:2210.14179 (2022).
  • Xia et al . (2023) Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. 2023. Automated Program Repair in the Era of Large Pre-Trained Language Models. In Proceedings of the 45th International Conference on Software Engineering (ICSE ’23) . https://doi.org/10.1109/ICSE48619.2023.00129
  • Xia and Zhang (2023a) Chunqiu Steven Xia and Lingming Zhang. 2023a. Conversational automated program repair. arXiv preprint arXiv:2301.13246 (2023).
  • Xia and Zhang (2023b) Chunqiu Steven Xia and Lingming Zhang. 2023b. Keep the Conversation Going: Fixing 162 out of 337 bugs for $0.42 each using ChatGPT. arXiv preprint arXiv:2304.00385 (2023).
  • Xie et al . (2023b) Danning Xie, Byungwoo Yoo, Nan Jiang, Mijung Kim, Lin Tan, Xiangyu Zhang, and Judy S Lee. 2023b. Impact of Large Language Models on Generating Software Specifications. arXiv preprint arXiv:2306.03324 (2023).
  • Xie et al . (2023a) Zhuokui Xie, Yinghao Chen, Chen Zhi, Shuiguang Deng, and Jianwei Yin. 2023a. ChatUniTest: a ChatGPT-based automated unit test generation tool. arXiv preprint arXiv:2305.04764 (2023).
  • Xiong et al . (2023) Weimin Xiong, Yiwen Guo, and Hao Chen. 2023. The Program Testing Ability of Large Language Models for Code. arXiv preprint arXiv:2310.05727 (2023).
  • Xu et al . (2022) Frank F Xu, Uri Alon, Graham Neubig, and Vincent Josua Hellendoorn. 2022. A systematic evaluation of large language models of code. In Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming . 1–10.
  • Xu et al . (2024) Junjielong Xu, Ziang Cui, Yuan Zhao, Xu Zhang, Shilin He, Pinjia He, Liqun Li, Yu Kang, Qingwei Lin, Yingnong Dang, et al . 2024. UniLog: Automatic Logging via LLM and In-Context Learning. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering . 1–12.
  • Xu et al . (2023b) Xiangzhe Xu, Zhuo Zhang, Shiwei Feng, Yapeng Ye, Zian Su, Nan Jiang, Siyuan Cheng, Lin Tan, and Xiangyu Zhang. 2023b. LmPa: Improving Decompilation by Synergy of Large Language Model and Program Analysis. arXiv preprint arXiv:2306.02546 (2023).
  • Xu et al . (2023a) Zhuolin Xu, Yuanzhang Lin, Qiushi Li, and Shin Hwei Tan. 2023a. Guiding ChatGPT to Fix Web UI Tests via Explanation-Consistency Checking. arXiv preprint arXiv:2312.05778 (2023).
  • Yan et al . (2023a) Dapeng Yan, Zhipeng Gao, and Zhiming Liu. 2023a. A Closer Look at Different Difficulty Levels Code Generation Abilities of ChatGPT. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE) . IEEE, 1887–1898.
  • Yan et al . (2023b) Weixiang Yan, Yuchen Tian, Yunzhe Li, Qian Chen, and Wen Wang. 2023b. Codetransocean: A comprehensive multilingual benchmark for code translation. arXiv preprint arXiv:2310.04951 (2023).
  • Yang et al . (2024) Aidan ZH Yang, Claire Le Goues, Ruben Martins, and Vincent Hellendoorn. 2024. Large language models for test-free fault localization. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering . 1–12.
  • Yang et al . (2023a) Chenyuan Yang, Yinlin Deng, Runyu Lu, Jiayi Yao, Jiawei Liu, Reyhaneh Jabbarvand, and Lingming Zhang. 2023a. White-box compiler fuzzing empowered by large language models. arXiv preprint arXiv:2310.15991 (2023).
  • Yang et al . (2023c) Chengran Yang, Jiakun Liu, Bowen Xu, Christoph Treude, Yunbo Lyu, Ming Li, and David Lo. 2023c. APIDocBooster: An Extract-Then-Abstract Framework Leveraging Large Language Models for Augmenting API Documentation. arXiv preprint arXiv:2312.10934 (2023).
  • Yang et al . (2022c) Chengran Yang, Bowen Xu, Junaed Younus Khan, Gias Uddin, Donggyun Han, Zhou Yang, and David Lo. 2022c. Aspect-based api review classification: How far can pre-trained transformer model go?. In 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER) . IEEE, 385–395.
  • Yang et al . (2016) Di Yang, Aftab Hussain, and Cristina Videira Lopes. 2016. From query to usable code: an analysis of stack overflow code snippets. In Proceedings of the 13th International Conference on Mining Software Repositories . 391–402.
  • Yang et al . (2023f) Guang Yang, Yu Zhou, Xiang Chen, Xiangyu Zhang, Yiran Xu, Tingting Han, and Taolue Chen. 2023f. A Syntax-Guided Multi-Task Learning Approach for Turducken-Style Code Generation. arXiv preprint arXiv:2303.05061 (2023).
  • Yang et al . (2023g) Guang Yang, Yu Zhou, Xiangyu Zhang, Xiang Chen, Tingting Han, and Taolue Chen. 2023g. Assessing and Improving Syntactic Adversarial Robustness of Pre-trained Models for Code Translation. arXiv preprint arXiv:2310.18587 (2023).
  • Yang et al . (2023b) Jingfeng Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, Bing Yin, and Xia Hu. 2023b. Harnessing the power of llms in practice: A survey on chatgpt and beyond. arXiv preprint arXiv:2304.13712 (2023).
  • Yang et al . (2023d) Kang Yang, Xinjun Mao, Shangwen Wang, Tanghaoran Zhang, Bo Lin, Yanlin Wang, Yihao Qin, Zhang Zhang, and Xiaoguang Mao. 2023d. Enhancing Code Intelligence Tasks with ChatGPT. arXiv preprint arXiv:2312.15202 (2023).
  • Yang et al . (2021) Lanxin Yang, He Zhang, Haifeng Shen, Xin Huang, Xin Zhou, Guoping Rong, and Dong Shao. 2021. Quality assessment in systematic literature reviews: A software engineering perspective. Information and Software Technology 130 (2021), 106397.
  • Yang et al . (2022b) Yanming Yang, Xin Xia, David Lo, and John Grundy. 2022b. A survey on deep learning for software engineering. ACM Computing Surveys (CSUR) 54, 10s (2022), 1–73.
  • Yang et al . (2022a) Zhou Yang, Jieke Shi, Junda He, and David Lo. 2022a. Natural Attack for Pre-Trained Models of Code. In Proceedings of the 44th International Conference on Software Engineering (Pittsburgh, Pennsylvania) (ICSE ’22) . Association for Computing Machinery, New York, NY, USA, 1482–1493. https://doi.org/10.1145/3510003.3510146
  • Yang et al . (2023e) Zhou Yang, Bowen Xu, Jie M. Zhang, Hong Jin Kang, Jieke Shi, Junda He, and David Lo. 2023e. Stealthy Backdoor Attack for Code Models. https://doi.org/10.48550/ARXIV.2301.02496
  • Ye et al . (2023) Jiacheng Ye, Chengzu Li, Lingpeng Kong, and Tao Yu. 2023. Generating Data for Symbolic Language with Large Language Models. arXiv preprint arXiv:2305.13917 (2023).
  • Yen et al . (2023) Ryan Yen, Jiawen Zhu, Sangho Suh, Haijun Xia, and Jian Zhao. 2023. CoLadder: Supporting Programmers with Hierarchical Code Generation in Multi-Level Abstraction. arXiv preprint arXiv:2310.08699 (2023).
  • Yetiştiren et al . (2023) Burak Yetiştiren, Işık Özsoy, Miray Ayerdem, and Eray Tüzün. 2023. Evaluating the Code Quality of AI-Assisted Code Generation Tools: An Empirical Study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT. arXiv preprint arXiv:2304.10778 (2023).
  • Yin and Neubig (2017) Pengcheng Yin and Graham Neubig. 2017. A syntactic neural model for general-purpose code generation. arXiv preprint arXiv:1704.01696 (2017).
  • ymcui (2023) ymcui. 2023. Chinese LLaMA & Alpaca Large Language Models. https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/blob/main/README_EN.md .
  • Yoon et al . (2023) Juyeon Yoon, Robert Feldt, and Shin Yoo. 2023. Autonomous Large Language Model Agents Enabling Intent-Driven Mobile GUI Testing. arXiv preprint arXiv:2311.08649 (2023).
  • Yu et al . (2023a) Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Tao Xie, and Qianxiang Wang. 2023a. CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models. arXiv preprint arXiv:2302.00288 (2023).
  • Yu et al . (2023b) Siyu Yu, Yifan Wu, Zhijing Li, Pinjia He, Ningjiang Chen, and Changjian Liu. 2023b. Log Parsing with Generalization Ability under New Log Types. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering . 425–437.
  • Yuan et al . (2022) Wei Yuan, Quanjun Zhang, Tieke He, Chunrong Fang, Nguyen Quoc Viet Hung, Xiaodong Hao, and Hongzhi Yin. 2022. CIRCLE: Continual repair across programming languages. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis . 678–690.
  • Yuan et al . (2023a) Zhiqiang Yuan, Junwei Liu, Qiancheng Zi, Mingwei Liu, Xin Peng, and Yiling Lou. 2023a. Evaluating Instruction-Tuned Large Language Models on Code Comprehension and Generation. arXiv:2308.01240 [cs.CL]
  • Yuan et al . (2023b) Zhiqiang Yuan, Yiling Lou, Mingwei Liu, Shiji Ding, Kaixin Wang, Yixuan Chen, and Xin Peng. 2023b. No More Manual Tests? Evaluating and Improving ChatGPT for Unit Test Generation. arXiv preprint arXiv:2305.04207 (2023).
  • Zan et al . (2023a) Daoguang Zan, Bei Chen, Yongshun Gong, Junzhi Cao, Fengji Zhang, Bingchao Wu, Bei Guan, Yilong Yin, and Yongji Wang. 2023a. Private-library-oriented code generation with large language models. arXiv preprint arXiv:2307.15370 (2023).
  • Zan et al . (2022a) Daoguang Zan, Bei Chen, Zeqi Lin, Bei Guan, Yongji Wang, and Jian-Guang Lou. 2022a. When language model meets private library. arXiv preprint arXiv:2210.17236 (2022).
  • Zan et al . (2022b) Daoguang Zan, Bei Chen, Dejian Yang, Zeqi Lin, Minsu Kim, Bei Guan, Yongji Wang, Weizhu Chen, and Jian-Guang Lou. 2022b. CERT: Continual Pre-training on Sketches for Library-oriented Code Generation. arXiv preprint arXiv:2206.06888 (2022).
  • Zan et al . (2023b) Daoguang Zan, Bei Chen, Fengji Zhang, Dianjie Lu, Bingchao Wu, Bei Guan, Wang Yongji, and Jian-Guang Lou. 2023b. Large Language Models Meet NL2Code: A Survey. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) . 7443–7464.
  • Zelikman et al . (2023) Eric Zelikman, Eliana Lorch, Lester Mackey, and Adam Tauman Kalai. 2023. Self-taught optimizer (stop): Recursively self-improving code generation. arXiv preprint arXiv:2310.02304 (2023).
  • Zeng et al . (2022) Zhengran Zeng, Hanzhuo Tan, Haotian Zhang, Jing Li, Yuqun Zhang, and Lingming Zhang. 2022. An extensive study on pre-trained models for program understanding and generation. In Proceedings of the 31st ACM SIGSOFT international symposium on software testing and analysis . 39–51.
  • Zhang et al . (2023a) Cen Zhang, Mingqiang Bai, Yaowen Zheng, Yeting Li, Xiaofei Xie, Yuekang Li, Wei Ma, Limin Sun, and Yang Liu. 2023a. Understanding Large Language Model Based Fuzz Driver Generation. arXiv preprint arXiv:2307.12469 (2023).
  • Zhang et al . (2023m) Chenyuan Zhang, Hao Liu, Jiutian Zeng, Kejing Yang, Yuhong Li, and Hui Li. 2023m. Prompt-enhanced software vulnerability detection using chatgpt. arXiv preprint arXiv:2308.12697 (2023).
  • Zhang et al . (2011) He Zhang, Muhammad Ali Babar, and Paolo Tell. 2011. Identifying relevant studies in software engineering. Information and Software Technology 53, 6 (2011), 625–637.
  • Zhang et al . (2022a) Jingxuan Zhang, Siyuan Liu, Lina Gong, Haoxiang Zhang, Zhiqiu Huang, and He Jiang. 2022a. BEQAIN: An Effective and Efficient Identifier Normalization Approach With BERT and the Question Answering System. IEEE Transactions on Software Engineering (2022).
  • Zhang et al . (2022b) Jialu Zhang, Todd Mytkowicz, Mike Kaufman, Ruzica Piskac, and Shuvendu K Lahiri. 2022b. Using pre-trained language models to resolve textual and semantic merge conflicts (experience paper). In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis . 77–88.
  • Zhang et al . (2023n) Jiyang Zhang, Pengyu Nie, Junyi Jessy Li, and Milos Gligoric. 2023n. Multilingual Code Co-Evolution Using Large Language Models. arXiv preprint arXiv:2307.14991 (2023).
  • Zhang et al . (2022c) Jiyang Zhang, Sheena Panthaplackel, Pengyu Nie, Junyi Jessy Li, and Milos Gligoric. 2022c. Coditt5: Pretraining for source code and natural language editing. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering . 1–12.
  • Zhang et al . (2020a) Jian Zhang, Xu Wang, Hongyu Zhang, Hailong Sun, and Xudong Liu. 2020a. Retrieval-based neural source code summarization. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering . 1385–1397.
  • Zhang et al . (2019) Jian Zhang, Xu Wang, Hongyu Zhang, Hailong Sun, Kaixuan Wang, and Xudong Liu. 2019. A novel neural source code representation based on abstract syntax tree. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE) . IEEE, 783–794.
  • Zhang et al . (2023k) Kechi Zhang, Ge Li, Jia Li, Zhuo Li, and Zhi Jin. 2023k. ToolCoder: Teach Code Generation Models to use APIs with search tools. arXiv preprint arXiv:2305.04032 (2023).
  • Zhang et al . (2024b) Kechi Zhang, Jia Li, Ge Li, Xianjie Shi, and Zhi Jin. 2024b. CodeAgent: Enhancing Code Generation with Tool-Integrated Agent Systems for Real-World Repo-level Coding Challenges. arXiv preprint arXiv:2401.07339 (2024).
  • Zhang et al . (2023l) Kechi Zhang, Zhuo Li, Jia Li, Ge Li, and Zhi Jin. 2023l. Self-Edit: Fault-Aware Code Editor for Code Generation. arXiv preprint arXiv:2305.04087 (2023).
  • Zhang et al . (2023o) Kexun Zhang, Danqing Wang, Jingtao Xia, William Yang Wang, and Lei Li. 2023o. ALGO: Synthesizing Algorithmic Programs with Generated Oracle Verifiers. arXiv preprint arXiv:2305.14591 (2023).
  • Zhang et al . (2024c) Lichen Zhang, Shuai Lu, and Nan Duan. 2024c. Selene: Pioneering Automated Proof in Software Verification. arXiv preprint arXiv:2401.07663 (2024).
  • Zhang et al . (2023b) Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen. 2023b. A Survey of Learning-based Automated Program Repair. arXiv preprint arXiv:2301.03270 (2023).
  • Zhang et al . (2023c) Quanjun Zhang, Chunrong Fang, Weisong Sun, Yan Liu, Tieke He, Xiaodong Hao, and Zhenyu Chen. 2023c. Boosting Automated Patch Correctness Prediction via Pre-trained Language Model. arXiv preprint arXiv:2301.12453 (2023).
  • Zhang et al . (2024a) Quanjun Zhang, Chunrong Fang, Weisong Sun, Yan Liu, Tieke He, Xiaodong Hao, and Zhenyu Chen. 2024a. APPT: Boosting Automated Patch Correctness Prediction via Fine-tuning Pre-trained Models. IEEE Transactions on Software Engineering (2024).
  • Zhang et al . (2023d) Quanjun Zhang, Chunrong Fang, Tongke Zhang, Bowen Yu, Weisong Sun, and Zhenyu Chen. 2023d. Gamma: Revisiting template-based automated program repair via mask prediction. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE) . IEEE, 535–547.
  • Zhang et al . (2024d) Simiao Zhang, Jiaping Wang, Guoliang Dong, Jun Sun, Yueling Zhang, and Geguang Pu. 2024d. Experimenting a New Programming Practice with LLMs. arXiv preprint arXiv:2401.01062 (2024).
  • Zhang et al . (2023e) Ting Zhang, DongGyun Han, Venkatesh Vinayakarao, Ivana Clairine Irsan, Bowen Xu, Ferdian Thung, David Lo, and Lingxiao Jiang. 2023e. Duplicate bug report detection: How far are we? ACM Transactions on Software Engineering and Methodology 32, 4 (2023), 1–32.
  • Zhang et al . (2023f) Ting Zhang, Ivana Clairine Irsan, Ferdian Thung, and David Lo. 2023f. Cupid: Leveraging chatgpt for more accurate duplicate bug report detection. arXiv preprint arXiv:2308.10022 (2023).
  • Zhang et al . (2023g) Ting Zhang, Ivana Clairine Irsan, Ferdian Thung, and David Lo. 2023g. Revisiting sentiment analysis for software engineering in the era of large language models. arXiv preprint arXiv:2310.11113 (2023).
  • Zhang et al . (2023h) Ting Zhang, Ivana Clairine Irsan, Ferdian Thung, David Lo, Asankhaya Sharma, and Lingxiao Jiang. 2023h. Evaluating Pre-trained Language Models for Repairing API Misuses. arXiv preprint arXiv:2310.16390 (2023).
  • Zhang et al . (2020b) Ting Zhang, Bowen Xu, Ferdian Thung, Stefanus Agus Haryono, David Lo, and Lingxiao Jiang. 2020b. Sentiment analysis for software engineering: How far can pre-trained transformer models go?. In 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME) . IEEE, 70–80.
  • Zhang et al . (2023p) Tianyi Zhang, Tao Yu, Tatsunori Hashimoto, Mike Lewis, Wen-tau Yih, Daniel Fried, and Sida Wang. 2023p. Coder reviewer reranking for code generation. In International Conference on Machine Learning . PMLR, 41832–41846.
  • Zhang et al . (2023i) Yuwei Zhang, Zhi Jin, Ying Xing, and Ge Li. 2023i. STEAM: simulating the interactive behavior of programmers for automatic bug fixing. arXiv preprint arXiv:2308.14460 (2023).
  • Zhang et al . (2023j) Yuwei Zhang, Ge Li, Zhi Jin, and Ying Xing. 2023j. Neural Program Repair with Program Dependence Analysis and Effective Filter Mechanism. arXiv preprint arXiv:2305.09315 (2023).
  • Zhang et al . (2022d) Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. 2022d. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493 (2022).
  • Zhao et al . (2023a) Jianyu Zhao, Yuyang Rong, Yiwen Guo, Yifeng He, and Hao Chen. 2023a. Understanding Programs by Exploiting (Fuzzing) Test Cases. arXiv preprint arXiv:2305.13592 (2023).
  • Zhao et al . (2023d) Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al . 2023d. A survey of large language models. arXiv preprint arXiv:2303.18223 (2023).
  • Zhao et al . (2023b) Xu Zhao, Yuxi Xie, Kenji Kawaguchi, Junxian He, and Qizhe Xie. 2023b. Automatic Model Selection with Large Language Models for Reasoning. arXiv preprint arXiv:2305.14333 (2023).
  • Zhao et al . (2021) Yanjie Zhao, Li Li, Haoyu Wang, Haipeng Cai, Tegawendé F Bissyandé, Jacques Klein, and John Grundy. 2021. On the impact of sample duplication in machine-learning-based android malware detection. ACM Transactions on Software Engineering and Methodology (TOSEM) 30, 3 (2021), 1–38.
  • Zhao et al . (2023c) Zelin Zhao, Zhaogui Xu, Jialong Zhu, Peng Di, Yuan Yao, and Xiaoxing Ma. 2023c. The Right Prompts for the Job: Repair Code-Review Defects with Large Language Model. arXiv preprint arXiv:2312.17485 (2023).
  • Zheng et al . (2023c) Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Zihan Wang, Lei Shen, Andi Wang, Yang Li, et al . 2023c. Codegeex: A pre-trained model for code generation with multilingual evaluations on humaneval-x. arXiv preprint arXiv:2303.17568 (2023).
  • Zheng et al . (2023b) Wenqing Zheng, SP Sharan, Ajay Kumar Jaiswal, Kevin Wang, Yihan Xi, Dejia Xu, and Zhangyang Wang. 2023b. Outline, then details: Syntactically guided coarse-to-fine code generation. arXiv preprint arXiv:2305.00909 (2023).
  • Zheng et al . (2023a) Zibin Zheng, Kaiwen Ning, Yanlin Wang, Jingwen Zhang, Dewu Zheng, Mingxi Ye, and Jiachi Chen. 2023a. A survey of large language models for code: Evolution, benchmarking, and future trends. arXiv preprint arXiv:2311.10372 (2023).
  • Zhong and Wang (2023) Li Zhong and Zilong Wang. 2023. A study on robustness and reliability of large language model code generation. arXiv preprint arXiv:2308.10335 (2023).
  • Zhou et al . (2023a) Shuyan Zhou, Uri Alon, Sumit Agarwal, and Graham Neubig. 2023a. Codebertscore: Evaluating code generation with pretrained models of code. arXiv preprint arXiv:2302.05527 (2023).
  • Zhou et al . (2019) Shufan Zhou, Beijun Shen, and Hao Zhong. 2019. Lancer: Your code tell me what you need. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE) . IEEE, 1202–1205.
  • Zhou et al . (2023c) Wenxuan Zhou, Sheng Zhang, Yu Gu, Muhao Chen, and Hoifung Poon. 2023c. UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition. arXiv preprint arXiv:2308.03279 (2023).
  • Zhou et al . (2023b) Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2023b. Large Language Models Are Human-Level Prompt Engineers. arXiv preprint arXiv:2211.01910 (2023).
  • Zhu et al . (2023) Jie Zhu, Lingwei Li, Li Yang, Xiaoxiao Ma, and Chun Zuo. 2023. Automating Method Naming with Context-Aware Prompt-Tuning. arXiv preprint arXiv:2303.05771 (2023).
  • Zhu et al . (2022) Jianfei Zhu, Guanping Xiao, Zheng Zheng, and Yulei Sui. 2022. Enhancing Traceability Link Recovery with Unlabeled Data. In 2022 IEEE 33rd International Symposium on Software Reliability Engineering (ISSRE) . IEEE, 446–457.
  • Zhuo (2023) Terry Yue Zhuo. 2023. Large Language Models Are State-of-the-Art Evaluators of Code Generation. arXiv preprint arXiv:2304.14317 (2023).
  • Zhuo et al . (2023) Terry Yue Zhuo, Xiaoning Du, Zhenchang Xing, Jiamou Sun, Haowei Quan, Li Li, and Liming Zhu. 2023. Pop Quiz! Do Pre-trained Code Models Possess Knowledge of Correct API Names? arXiv preprint arXiv:2309.07804 (2023).

Appendix A Data Types

We classified the data types of all datasets into five categories: code-based, text-based, graph-based, software repository-based, and combined data types, as shown in Table 13.

Table 13. Data types of the datasets, grouped by category (number of studies per data type).

  • Text-based datasets: Programming tasks/problems (42), Prompts (33), SO (i.e., Stack Overflow) posts (12), Bug reports (11), Requirements documentation (9), APIs/API documentation (8), Q&A pairs (6), Vulnerability descriptions (4), Reviews (4), Logs (3), Methods (3), Project issues (3), Code comments (2), Theorems (2), Buggy text (1), Dockerfiles (1), Outage descriptions (1), Semantic merge conflicts (1), Site text (1), Software development tasks (1), User intents (1), Software specifications (1), User reviews (1)
  • Code-based datasets: Source code (60), Bugs/buggy code (16), Vulnerable source code (8), Patches (4), Code changes (3), Test suites/cases (3), Bug-fix pairs (2), Error code (2), Error-fix pairs (1), Flaky test cases (1), Identifiers (1), Labeled clone pairs (1), Packages (1)
  • Graph-based datasets: GUI images (1)
  • Software repository-based datasets: Code repositories (9), Android apps (3), Issues and commits (3), Pull-requests (2), Industrial projects (1), Open-source projects (1), Web applications (1)
  • Combined datasets: Programming tasks and test suites/cases (17), Source code and comments (12), Programming tasks and solutions (8), Source code and descriptions (3), Code-text pairs (2), Source code and API usage sequences (2), Source code and test suites/cases (2), Bug reports and test suites/cases (1), Buggy code and comments (1), Buggy code and solutions (1), Code files and summaries (1), Binary code and related annotations (1), Failing test code and error messages (1), Source code and Q&A pairs (1), Source code, methods, and logs (1), Vulnerable code and descriptions (1)
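To make the combined data types concrete, the sketch below shows how a "source code and comments" dataset might be represented and loaded. It is only an illustration: the JSONL schema (fields `code`, `docstring`, `language`) is a hypothetical, CodeSearchNet-style layout, not a format prescribed by any of the surveyed studies.

```python
import json
from dataclasses import dataclass
from typing import List

@dataclass
class CodeCommentPair:
    """One record of a 'source code and comments' (combined) dataset."""
    code: str       # function or snippet body
    comment: str    # natural-language docstring/comment paired with the code
    language: str   # programming language of the snippet

def load_pairs(path: str) -> List[CodeCommentPair]:
    """Read code-comment pairs from a JSONL file (one JSON object per line)."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            obj = json.loads(line)
            pairs.append(CodeCommentPair(code=obj["code"],
                                         comment=obj["docstring"],
                                         language=obj.get("language", "unknown")))
    return pairs
```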

Appendix B Input Forms

In LLM4SE research, data is often transformed into specific formats to be used as input for LLMs. Table 14 summarizes the four input forms, namely token-based input, tree/graph-based input, pixel-based input, and hybrid-based input, along with the number of studies that use each form.

Table 14. Input forms of the data fed to LLMs (number of studies per input form).

  • Token-based input: Text in tokens (150), Code in tokens (118), Code and text in tokens (78)
  • Tree/Graph-based input: Code in tree structure (2), Code in graph structure (3)
  • Pixel-based input: Pixel (1)
  • Hybrid-based input: Hybrid input forms (2)
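As an illustration of the dominant token-based input form, the sketch below serializes a code snippet into subword tokens with a pre-trained code tokenizer. It is a minimal sketch assuming the Hugging Face `transformers` library is available; the `Salesforce/codet5-base` checkpoint is just one example of a code-oriented tokenizer, not the one used by any particular surveyed study.

```python
from transformers import AutoTokenizer

# Load the tokenizer of a code-oriented pre-trained model.
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")

snippet = "def add(a, b):\n    return a + b"

# "Code in tokens": the snippet becomes a sequence of subword token ids.
encoding = tokenizer(snippet, truncation=True, max_length=512)
token_ids = encoding["input_ids"]

print(len(token_ids))                                   # number of subword tokens
print(tokenizer.convert_ids_to_tokens(token_ids)[:10])  # first few subwords
```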

Appendix C Prompt Engineering

Table 15 showcases eight prompt engineering techniques mentioned in the 395 studies: Few-shot prompting, Zero-shot prompting, CoT (Chain-of-Thought) prompting, APE (Automatic Prompt Engineer), CoC (Chain of Code) prompting, Auto-CoT (Automatic Chain-of-Thought) prompting, MoT (Modular-of-Thought) prompting, and SCoT (Structured Chain-of-Thought) prompting.

Table 15. Prompt engineering techniques (number of studies per technique).

  • Few-shot prompting (88)
  • Zero-shot prompting (79)
  • CoT (Chain-of-Thought) prompting (18)
  • APE (Automatic Prompt Engineer) (2)
  • CoC (Chain of Code) prompting (2)
  • Auto-CoT (Automatic Chain-of-Thought) prompting (1)
  • MoT (Modular-of-Thought) prompting (1)
  • SCoT (Structured Chain-of-Thought) prompting (1)
  • Others (76)
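The sketch below contrasts zero-shot, few-shot, and chain-of-thought prompts for a small bug-fixing task. The wording and the example functions are illustrative only; they are not taken from any of the surveyed studies, and real prompts are usually tuned per model and per task.

```python
buggy_code = "def is_even(n):\n    return n % 2 == 1"

# Zero-shot: the task description alone.
zero_shot = (
    "Fix the bug in the following Python function.\n\n"
    f"{buggy_code}\n\nFixed function:"
)

# Few-shot: one worked example precedes the actual query.
few_shot = (
    "Fix the bug in each Python function.\n\n"
    "Buggy:\ndef inc(x):\n    return x - 1\n"
    "Fixed:\ndef inc(x):\n    return x + 1\n\n"
    f"Buggy:\n{buggy_code}\nFixed:"
)

# Chain-of-thought: ask for step-by-step reasoning before the answer.
chain_of_thought = (
    "Fix the bug in the following Python function. First explain, step by step, "
    "what the function should do and where it goes wrong, then give the corrected code.\n\n"
    f"{buggy_code}"
)

print(zero_shot)
```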

Appendix D Evaluation Metrics

We categorize the types of tasks that LLMs address in SE into four categories: regression, classification, recommendation, and generation. Each task type has commonly used evaluation metrics, as shown in Table 16.

Table 16. Evaluation metrics, grouped by problem type (number of studies per metric).

  • Regression: MAE (Mean Absolute Error) (1)
  • Classification: Precision (35), Recall (34), F1-score (33), Accuracy (23), AUC (Area Under the ROC Curve) (9), ROC (Receiver Operating Characteristic) (4), FPR (False Positive Rate) (4), FNR (False Negative Rate) (3), MCC (Matthews Correlation Coefficient) (2)
  • Recommendation: MRR (Mean Reciprocal Rank) (15), Precision/Precision@k (6), MAP/MAP@k (6), F-score/F-score@k (5), Recall/Recall@k (4), Accuracy (3)
  • Generation: BLEU/BLEU-4/BLEU-DC (62), Pass@k (54), Accuracy/Accuracy@k (38), EM (Exact Match) (36), CodeBLEU (29), ROUGE/ROUGE-L (22), Precision (18), METEOR (16), Recall (15), F1-score (15), MRR (Mean Reciprocal Rank) (6), ES (Edit Similarity) (6), ED (Edit Distance) (5), MAR (Mean Average Ranking) (4), ChrF (3), CrystalBLEU (3), CodeBERTScore (2), MFR (Mean First Ranking) (1), PP (Perplexity) (1)
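For generation tasks, Pass@k is usually reported with the unbiased estimator popularised by code-generation benchmarks: for each problem, n samples are drawn and c of them pass the tests, and Pass@k estimates the probability that at least one of k randomly chosen samples is correct. The sketch below is a minimal implementation of that estimator; the (n, c) pairs are illustrative numbers, not results from any surveyed study.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k for one problem: n generated samples, c of which pass the tests."""
    if n - c < k:       # every size-k subset contains at least one correct sample
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Benchmark-level Pass@1: average the per-problem estimates.
samples_per_problem = [(20, 3), (20, 0), (20, 12)]   # illustrative (n, c) pairs
pass_at_1 = sum(pass_at_k(n, c, k=1) for n, c in samples_per_problem) / len(samples_per_problem)
print(round(pass_at_1, 3))
```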

Appendix E SE Tasks

According to the software development lifecycle, we have categorized the SE tasks mentioned in the 395 studies into six categories: requirements engineering, software design, software development, software quality assurance, software maintenance, and software management. Table 17 lists these tasks and the number of studies that apply LLMs to each.

Table 17. SE tasks addressed with LLMs, grouped by SE activity (number of studies per task).

  • Requirements engineering: Anaphoric ambiguity treatment (4), Requirements classification (4), Requirement analysis and evaluation (2), Specification generation (2), Coreference detection (1), Requirements elicitation (1), Specification formalization (1), Traceability automation (1), Use cases generation (1)
  • Software design: GUI retrieval (1), Rapid prototyping (1), Software specification synthesis (1), System design (1)
  • Software development: Code generation (118), Code completion (22), Code summarization (21), Code search (12), Code translation (12), Code understanding (8), API inference (5), Program synthesis (6), API recommendation (5), Code editing (5), Code representation (3), Code comment generation (2), Method name generation (2), Code recommendation (2), Agile story point estimation (1), API documentation augment (1), API documentation smells (1), API entity and relation extraction (1), Data analysis (1), Fuzz driver generation (1), Control flow graph generation (1), Identifier normalization (1), Instruction generation (1), Type inference (1), Others (14)
  • Software quality assurance: Vulnerability detection (18), Test generation (17), Bug localization (5), Verification (5), Testing automation (4), Fault localization (3), Defect detection (2), GUI testing (2), Static analysis (2), Binary taint analysis (1), Compiler fuzzing (1), Decompilation (1), Invariant prediction (1), Malicious code localization (1), Mobile app crash detection (1), Resource leak detection (1), Test prediction (1)
  • Software maintenance: Program repair (35), Code clone detection (8), Code review (7), Debugging (4), Bug reproduction (3), Review/commit/code classification (3), Duplicate bug report detection (3), Logging (3), Log parsing (3), Sentiment analysis (3), Code revision (2), Vulnerability repair (2), API misuses repair (1), Bug prediction (1), Bug triage (1), Code coverage prediction (1), Code review explained (1), Code-review defects repair (1), Crash bug repair (1), Dockerfile repair (1), Patch correctness prediction (1), Patch detection (1), Program merge conflicts repair (1), Rename refactoring (1), Tag recommendation (1), Technical debt payback (1), Traceability recovery (1), Web test repair (1), Type error repair (1), Others (5)
  • Software management: Effort estimation (2), Software tool configuration (1)

Analysing app reviews for software engineering: a systematic literature review

  • Open access
  • Published: 20 January 2022
  • Volume 27, article number 43 (2022)




  • Jacek Dąbrowski (ORCID: orcid.org/0000-0003-3392-0690),
  • Emmanuel Letier,
  • Anna Perini &
  • Angelo Susi

10k Accesses

28 Citations

1 Altmetric


A Correction to this article was published on 15 March 2022


App reviews found in app stores can provide critically valuable information to help software engineers understand user requirements and to design, debug, and evolve software products. Over the last ten years, a vast amount of research has been produced to study what useful information might be found in app reviews, and how to mine and organise such information as efficiently as possible. This paper presents a comprehensive survey of this research, covering 182 papers published between 2012 and 2020. This survey classifies app review analysis not only in terms of mined information and applied data mining techniques but also, and most importantly, in terms of supported software engineering activities. The survey also reports on the quality and results of empirical evaluation of existing techniques and identifies important avenues for further research. This survey can be of interest to researchers and commercial organisations developing app review analysis techniques and to software engineers considering the use of app review analysis.


1 Introduction

App stores have become important platforms for the distribution of software products. In 2020, the Google Play Store and Apple Store hosted over 5 million apps and were widely used for the discovery, purchase and update of software products (Clement 2020). The emergence of these app stores has had important effects on software engineering practices, notably by bridging the gap between developers and users, by increasing market transparency and by affecting release management (AlSubaihin et al. 2019). Martin et al. (2017) used the term 'app store analysis' to denote the emerging research using app store data for software engineering. Their survey identified the richness and diversity of research using app store data, notably for API analysis, feature analysis, release engineering, security and review analysis (Martin et al. 2017).

This paper focuses on analysing app reviews for software engineering. App reviews are textual feedback, associated with a star rating, that app users can provide to other app store users and app developers about their experience of an app (App Store 2021). Most reviews are at most 675 characters long (Pagano and Maalej 2013) and convey information on a variety of topics such as feature requests, bug reports or user opinions (Martin et al. 2017; Al-Hawari 2020). Analysing these reviews can benefit a range of software engineering activities. For example, in requirements engineering, analysing app reviews can help software engineers elicit new requirements about app features that users desire (Johann et al. 2017; Dąbrowski et al. 2020); in testing, app reviews can help in finding bugs (Maalej and Nabil 2015; Iacob et al. 2016; Shams et al. 2020) and in evaluating users' reactions to released beta versions of apps (Gao et al. 2019; AlSubaihin et al. 2019); during product evolution, analysing app reviews may help in identifying and prioritizing change requests (Villarroel et al. 2016; Gao et al. 2018b; Gao et al. 2019; Dąbrowski et al. 2020).

In recent years, scholars have also been studying on-line user feedback from other digital sources such as microblogs, e.g., Twitter (Guzman et al. 2017), on-line forums, e.g., Reddit (Khan et al. 2019), or issue tracking systems, e.g., JIRA (Nyamawe et al. 2019). Most research efforts, however, have focused on analysing app reviews (Lim et al. 2021). Presumably, the sheer volume of this data, its availability and its usefulness make app reviews unique and thus the most frequently studied type of on-line user feedback (Lim et al. 2021).

Significant research has been devoted to studying what relevant information can be found in app reviews; how that information can be analysed using manual and automatic approaches; and how it can help software engineers. However, this knowledge is scattered across the literature, and consequently there is no clear view of how app review analysis can support software engineering. The previous survey on app store data analysis (Martin et al. 2017) identified app review analysis as one important topic within the broader area of app store analysis but did not present a detailed, comprehensive analysis of app review analysis techniques. Other literature reviews focus on specific types of review analysis such as opinion mining (Genc-Nayebi and Abran 2017) and information extraction (Tavakoli et al. 2018; Noei and Lyons 2019), but they do not cover the whole range of research on analysing app reviews. In contrast, this paper provides a systematic literature review of the whole range of research on analysing app reviews, from the first paper published in 2012 up to the end of 2020. The paper's objectives are to:

identify and classify the range of app review analysis proposed in the literature;

identify the range of natural language processing and data mining techniques that support such analysis;

identify the range of software engineering activities that app review analysis can support;

report the methods and results of the empirical evaluation of app review analysis approaches.

To accomplish these objectives, we have conducted a systematic literature review following a well-defined methodology that identifies, evaluates, and interprets the relevant studies with respect to specific research questions (Kitchenham 2004 ). After a systematic selection and screening procedure, we ended up with a set of 182 papers, covering the period 2012 to 2020, that were carefully examined to answer the research questions.

The primary contributions of the study are: (i) synthesis of approaches and techniques for mining app reviews, (ii) new knowledge on how software engineering scenarios can be supported by mining app reviews, (iii) a summary of empirical evaluation of review mining approaches, and finally (iv) a study of literature growth patterns, gaps, and directions for future research.

2 Research Method

To conduct our systematic literature review, we followed the methodology proposed by Kitchenham ( 2004 ). We first defined research questions and prepared a review protocol, which guided our conduct of the review and the collection of data. We then performed the literature search and selection based on agreed criteria. The selected studies were read thoroughly, and data items as in Table  3 were collected using a data extraction form. Finally, we synthesized the results for reporting.

2.1 Research Questions

The primary aim of the study is to understand how analysing app reviews can support software engineering. Based on the objective, the following research questions have been derived:

RQ1: What are the different types of app review analyses?

RQ2: What techniques are used to realize app review analyses?

RQ3: What software engineering activities are claimed to be supported by analysing app reviews?

RQ4: How are app review analysis approaches empirically evaluated?

RQ5: How well do existing app review analysis approaches support software engineers?

The aim of RQ1 is to identify and classify the different types of app review analysis presented in the primary literature, where an app review analysis refers to a task of examining, transforming, or modeling data with the goal of discovering useful information. The aim of RQ2 is to identify the range of techniques used to realize the different types of app review analysis identified in RQ1, where a technique stands for a means of facilitating an app review analysis. The aim of RQ3 is to identify the range of software engineering activities that have been claimed to be supported by analysing app reviews, where a software engineering activity refers to a task performed along the software development life cycle (Bourque et al. 1999). The aim of RQ4 is to understand how primary studies obtain empirical evidence about the effectiveness and perceived quality of their review analysis approaches. The aim of RQ5 is to summarize the results of empirical studies on the effectiveness and user-perceived quality of different types of app review analysis.

2.2 Literature Search and Selection

We followed a systematic search and selection process to collect relevant literature published between January 2010 and December 2020. Figure 1 outlines the process as a PRISMA diagram; it illustrates the main steps of the process and their outcomes (the number of publications).

Figure 1: PRISMA diagram showing study search and selection.

The initial identification of publications was performed using keyword-based search on six major digital libraries: ACM Digital Library, IEEE Xplore Digital Library, Springer Link Online Library, Wiley Online Library and Elsevier Science Direct. We defined two search queries that we applied to both the metadata and the full text (when available) of the publications. To construct the first query, we looked at the content of several dozen publications analysing reviews for software engineering. We identified key terms that these papers share and used these terms to formulate a specific query.

figure e

To not omit other relevant papers not covered by this specific query, we formulated a general query based on phrases reflecting key concepts of our research objective:

figure f

The initial search via digital libraries resulted in 1,656 studies, of which 303 were duplicates. We screened the remaining 1,353 studies against the inclusion and exclusion criteria (see Table  1 ). To ensure the reliability of our screening process, the four authors of this paper independently classified a sample of 20 papers Footnote 5 (each paper was assigned to two authors). We then assessed their inter-rater agreement (Cohen’s Kappa = 0.9) (Viera and Garrett 2005 ).
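For illustration only, such an agreement score could be computed as in the following sketch; the screening labels (1 = include, 0 = exclude) are hypothetical, and scikit-learn's cohen_kappa_score is just one of several possible implementations:

```python
# Sketch: Cohen's Kappa for two raters screening the same sample of papers.
# The labels below are hypothetical; 1 = include, 0 = exclude.
from sklearn.metrics import cohen_kappa_score

rater_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0]
rater_b = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0]

kappa = cohen_kappa_score(rater_a, rater_b)
print(f"Cohen's Kappa: {kappa:.2f}")  # values above 0.8 indicate almost perfect agreement
```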

Owing to the conservative search strategy, the majority of the retrieved studies were found to be unrelated to the scope of the survey. We excluded 1,225 publications that did not meet the inclusion criteria. Subsequently, we complemented our search with two other strategies to find relevant papers that could have been missed in the initial search. We performed a manual issue-by-issue search of major conference proceedings and journals in software engineering in the period from January 2010 to December 2020. The searched journals and proceedings are listed in Table  2 . This step produced another 14 unique publications. Finally, we completed the search with a snowballing procedure, following the guidelines proposed by Wohlin ( 2014 ). We performed backward snowballing considering all the references from relevant studies found by the previous search strategies. Moreover, we conducted forward snowballing based on the 10 most cited papers. The snowballing procedure yielded an additional 40 relevant articles matching our inclusion criteria; we screened these papers against the same criteria based on title, abstract, and, when needed, full text. Accordingly, we ended up with 182 articles included in the survey.

2.3 Data Extraction

The first author created a data extraction form to collect detailed contents for each of the selected studies. The extracted data items were used to synthesize information from the primary studies and answer research questions RQ1-RQ5. Table  3 presents the data items the first author extracted:

Title, Author(s), Year, Venue, Citation (F1-F5) are used to identify the paper and its bibliographic information. For F5, we record the citation count for each paper according to Google Scholar as of 4 August 2021.

Review Analysis (F6) records the type of app review analysis (F6.1) (e.g. review classification), mined information (F6.2) (e.g. bug report) and supplementary description (F6.3).

Technique (F7) records what techniques are used to realize the analysis. We recorded the technique type (F7.1), e.g., machine learning, and its name (F7.2), e.g., Naïve Bayes.

Software Engineering Activity (F8) records the specific software engineering activities (e.g. requirements elicitation) mentioned in the paper as being supported by the proposed app review analysis method. We used a widely known taxonomy of software engineering phases and activities to identify and record these items (Bourque et al. 1999 ).

Justification (F9) records the paper’s explanation of how the app review analysis supports the software engineering activities. Some papers do not provide any justification.

Evaluation Objective (F10) records the general objective of the paper’s evaluation section (F10.1) (e.g. quantitative effectiveness, or user-perceived usefulness) and the type of evaluated app review analysis (F10.2).

Evaluation Procedure (F11) records the paper’s evaluation method and detailed evaluation steps.

Evaluation Metrics and Criteria (F12) records the quantitative metrics (e.g. precision and recall) and criteria (e.g. usability) used in the evaluation.

Evaluation Result (F13) records the result of empirical evaluation with respect to the evaluation metrics and criteria.

Annotated Dataset (F14) records information about the datasets used in the study. We recorded the app store from which reviews were collected (F14.1), e.g., Google Play, and the number of annotated reviews (F14.2).

Annotation Task (F15) records the task that human annotators performed when labeling a sample of app reviews, e.g., classifying reviews by discussed issue types.

Number of Annotators (F16) records the number of human annotators labeling app reviews for empirical evaluation.

Quality Measure (F17) records the measures used for assessing the reliability of the annotated dataset, e.g., Cohen’s Kappa.

Replication Package (F18) records whether a replication package is available. When one is available, we also recorded details about its content, such as the availability of an annotated dataset, the analysis method implementation, and experiment scripts. In addition to the reported information, we contacted the authors of primary studies to check the availability of the replication packages.

The reliability of data extraction was evaluated through inter- and intra-rater agreement (Ide and Pustejovsky 2017 ). The agreements were measured using percentage agreement on a recommended sample size (Graham et al. 2012 ; Bujang and Baharum 2017 ). To evaluate intra-rater agreement, the first author re-extracted data items from a random sample of 20% of the selected papers. An external assessor Footnote 6 then validated the extraction results between the first and second rounds and computed the percentage agreement. To evaluate inter-rater agreement, the assessor independently extracted data items from a new random sample of 10% of the selected papers; the first author and the assessor then compared their results and computed the agreement. The intra-rater agreement was 93% and the inter-rater agreement was 90%, indicating nearly complete agreement (Ide and Pustejovsky 2017 ).

2.4 Data Synthesis

Most data in our review are grounded in qualitative research. As found by other researchers, tabulating the data is useful for the aggregation, comparison, and synthesis of information (Kitchenham 2004 ). The data were thus stored in spreadsheets, manually reviewed, and interpreted to answer the research questions. Parts of the extracted data were synthesized using descriptive statistics.

We also used three classification schemas to group the collected information on app review analysis (F6), mining techniques (F7) and SE activity (F8). We constructed each schema following the same general procedure based on the content analysis method (Bauer 2007 ): the first author initially examined all the collected information of a specific data item type and then performed an iterative coding process. During the coding, each item was labeled with one of the categories identified in the literature or inferred from the collected data.

To create the schema of app review analyses, we adopted 5 categories proposed in a previous survey (Martin et al. 2017 ). As these categories were not exhaustive for the coding, we extended them with 14 additional categories: 7 categories from the taxonomy of mining tasks (Cannataro and Comito 2003 ) and 7 standard types of text analytics (Miner et al. 2012 ); we referred to the data and text mining areas as they have well-defined terminology for text analysis. We then merged semantically related categories and removed those unrelated to the domain of app review analysis. We extended the resulting list of 8 categories by adding the Recommendation category, abstracted from the remaining unlabelled data. With 9 categories, the first author performed the final coding. Table  7 , in the corresponding result section, presents the nine types of app review analyses.

The classification schema of mining techniques is informed by categories in a previous survey on intelligent mining techniques (Tavakoli et al. 2018 ) and the text analytics area (Miner et al. 2012 ; Singh 2021 ; Software 2021 ). We first identified 5 categories of mining techniques: 4 categories proposed in the previous survey (Tavakoli et al. 2018 ) and 1 category identified from text analytics, i.e., statistical analysis (Miner et al. 2012 ; Singh 2021 ; Software 2021 ). While coding, however, we excluded the feature extraction category, as it refers to an instance of the general information extraction task rather than a type of technique (Miner et al. 2012 ), and performed the final coding using the remaining 4 categories. The resulting mining technique categories can be found in Table  9 .

We derived the schema of SE activities based on the terminology from the software engineering body of knowledge (Bourque et al. 1999 ): we first identified 258 terms related to the main software engineering concepts and then selected 58 terms describing candidate activities for the coding process. While coding, we excluded 44 terms as they did not match any data items, and performed the final coding using the remaining 14 terms (henceforth, SE activities). Table  13 lists the resulting software engineering activities in the corresponding result section.

We validated the coding reliability of each schema using inter- and intra-rater agreement. We measured the reliability using percentage agreement on a recommended sample size (Graham et al. 2012 ; Bujang and Baharum 2017 ). To evaluate intra-rater agreement, the first author re-coded a random sample of 20% of the selected papers. The external assessor then checked the coding between the first and second rounds. To evaluate inter-rater agreement, both the first author and the assessor coded a new random sample of 10% of the papers and then cross-checked their results. The percentage intra- and inter-rater agreements were equal to or above 90% and 80%, respectively, for each schema, indicating very good quality (Ide and Pustejovsky 2017 ); Table  4 provides detailed statistics for the reliability evaluation.

The spreadsheets resulting from our data extraction and data grouping can be found in the supplementary material of this survey (Dąbrowski 2021 ).

3 Result Analysis

3.1 Demographics

Figure  2 shows the number of primary studies per year, including a breakdown by publication type (Journal, Conference, Workshop, and Book). The publication dates of the primary studies range from 2012 to 2020. Footnote 7 We observed that 53% of the primary studies were published in the last 3 years, indicating a growing interest in research on analyzing app reviews to support software engineering.

figure 2

Number of publications per year. The first papers on app review analysis were published in 2012

Figure  3 shows the distribution by venue type: 65% of papers were published in conferences, 23% in journals, 10% in workshops and 2% as book chapters. Table  5 lists the top ten venues in terms of the number of published papers. Footnote 8 The venues include the main conferences and journals in the software engineering community. Table  6 lists the twenty most cited papers in the field of app review analysis for software engineering and summarizes their contributions. These studies advanced the field in substantial ways or introduced influential ideas.

figure g

Pie chart showing the distribution of research papers per venue type in the period from 2010 to December 31, 2020

3.2 RQ1: App Review Analysis

In this section, we answer RQ1 (what are the different types of app review analysis?) based on the data extracted in F6 (review analyses). To answer the question, we grouped data items into one of nine general categories, each representing a different review analysis type (F6.1). We performed the grouping following the classification schema we had constructed for this study (see Section  2.4 ) and categories previously proposed in the context of app store analysis (Martin et al. 2017 ) as well as data and text mining (Cannataro and Comito 2003 ; Miner et al. 2012 ). Here, we focus on an abstract representation, because primary studies sometimes use slightly different terms to refer to the same type of analysis. Table  7 lists the different types of app review analyses and their prevalence in the literature.

3.2.1 Information Extraction

App reviews are unstructured text. Manually extracting relevant information from a large volume of reviews is not cost-effective (Vu et al. 2015a ). To address the problem, 56 of the primary studies (31%) proposed approaches facilitating information extraction. Formally, information extraction is the task of extracting specific (pre-specified) information from the content of a review; this information may concern app features (Guzman and Maalej 2014 ; Johann et al. 2017 ; Dąbrowski et al. 2020 ), qualities (Groen et al. 2017 ; Wang et al. 2020b ), problem reports and/or new feature requests (e.g., Iacob and Harrison 2013 ; Wang et al. 2017 ; Gao et al. 2019 ; Shams et al. 2020 ), opinions about favored or unfavored features (e.g., Guzman and Maalej 2014 ; Gu and Kim 2015 ; Vu et al. 2015a ; Li et al. 2017 ) as well as user stories (Guo and Singh 2020 ). Relevant information can be found at any location in a review. For instance, a problematic feature can be discussed in the middle of a sentence (Guzman and Maalej 2014 ; Williams et al. 2020 ), or a requested improvement can be expressed anywhere in a review (Gao et al. 2015 ; Guo and Singh 2020 ).

3.2.2 Classification

Classification consists of assigning predefined categories to reviews or textual snippets (e.g., sentences or phrases). Classification is by far the most common type of app review analysis found in the literature: 58% of publications describe techniques for classifying reviews. Classification can be used to separate informative reviews from those that are uninformative (e.g., Oh et al. 2013 ; Chen et al. 2014 ; Di Sorbo et al. 2016 ; Di Sorbo et al. 2020 ), spam (Chandy and Gu 2012 ) or fake (Martens and Maalej 2019b ). Informative reviews can be subsequently classified to detect user intentions (e.g., Maalej et al. 2016 ; Zhou et al. 2020 ) and discussion topics (e.g., Di Sorbo et al. 2017 ; van Vliet et al. 2020 ). User intentions include reporting an issue or requesting a new feature (Panichella et al. 2015 ; Panichella et al. 2016 ; Srisopha et al. 2020b ).

Discussion topics include a variety of concerns such as installation problems, user interface, or price (Mujahid et al. 2017 ; Ciurumelea et al. 2018 ; Williams et al. 2020 ); topics concerning user perception, e.g., rating, user experience or praise (Pagano and Maalej 2013 ; Li et al. 2020 ); or topics reporting different types of issues (Khalid 2013 ; McIlroy et al. 2016 ; Tao et al. 2020 ). For instance, review classification has been proposed to detect different types of usability and user experience issues (Bakiu and Guzman 2017 ; Alqahtani and Orji 2019 ), quality concerns (Mercado et al. 2016 ; Wen and Chen 2020 ) or different types of security and privacy issues (Cen et al. 2014 ; Tao et al. 2020 ). Similarly, app store feedback can be classified by the type of requirements it reports (Yang and Liang 2015 ; Deocadez et al. 2017a ; Lu and Liang 2017 ; Wang et al. 2018 ; Wang et al. 2018 ; Wen and Chen 2020 ). This could help distinguish reviews reporting functional requirements from those reporting non-functional requirements (Yang and Liang 2015 ; Deocadez et al. 2017a ; Wang et al. 2018 ; Wang et al. 2020b ), or distil non-functional requirements into fine-grained quality categories such as reliability, performance, or efficiency (Lu and Liang 2017 ; Wang et al. 2018 ). Another key use of the classification task is rationale mining; it involves detecting the types of argumentation and justification users describe in reviews when making certain decisions, e.g. about upgrading, installing, or switching apps (Kurtanović and Maalej 2017 ; Kurtanovic and Maalej 2018 ; Kunaefi and Aritsugi 2020 ).

3.2.3 Clustering

Clustering consists of organizing reviews, sentences, and/or snippets into groups (called a cluster) whose members share some similarity. Members in the same group are more similar (in some sense) to each other than to those in other groups. Unlike classification, clustering does not have predefined categories. Clustering is thus widely used as an exploratory analysis technique to infer topics commonly discussed by users (Pagano and Maalej 2013 ; Guzman et al. 2014 ; Guzman and Maalej 2014 ; Liu et al. 2018 ) and aggregate reviews containing semantically related information (Chen et al. 2014 ; Guzman et al. 2015 ; Palomba et al. 2017 ; Zhou et al. 2020 ). Clustering can be used for grouping reviews that request the same feature (Peng et al. 2016 ; Di Sorbo et al. 2016 ), report similar problems (Martin et al. 2015 ; Villarroel et al. 2016 ; Gao et al. 2018b ; Williams et al. 2020 ), or discuss a similar characteristic of the app (Vu et al. 2016 ; Chen et al. 2019 ; Xiao et al. 2020 ). The generated clusters might help software engineers synthesize information from a group of reviews referring to the same topics rather than examining each review individually (Fu et al. 2013 ; Gao et al. 2015 ; Wang et al. 2017 ; Hadi and Fard 2020 ).
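As an illustration only (not taken from any primary study), the sketch below groups reviews into topical clusters using TF-IDF vectors and k-means; the example reviews and the number of clusters are assumptions:

```python
# Sketch: clustering app reviews by textual similarity (hypothetical data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

reviews = [
    "The app crashes every time I open the camera",
    "Camera feature crashes on startup",
    "Please add a dark mode option",
    "Would love to see a dark theme",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(reviews)
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(vectors)

for review, label in zip(reviews, labels):
    print(label, review)  # reviews about the same concern should share a cluster id
```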

3.2.4 Search and Information Retrieval

Search and information retrieval concerns finding and tracing reviews (or their textual snippets) that match needed information. The task can be used to find reviews discussing a queried app feature (Vu et al. 2015a ; Vu et al. 2015b ; Dąbrowski et al. 2019 ), to obtain the most diverse user opinions in reviews (Guzman et al. 2015 ), or to trace what features described in the app description are discussed by users (Johann et al. 2017 ; Li et al. 2018 ). Information retrieval is also used to establish traceability links between app reviews and other software engineering artefacts (Palomba et al. 2015 ; Palomba et al. 2018 ), such as source code (Palomba et al. 2017 ; Zhou et al. 2020 ; Shams et al. 2020 ), stack traces (Pelloni et al. 2018 ), issues from tracking systems (Palomba et al. 2015 ; Noei et al. 2019 ), and warnings from static analysis tools (Wei et al. 2017 ) in order to locate problems in source code (Palomba et al. 2017 ; Ciurumelea et al. 2017 ; Grano et al. 2018 ), suggest potential changes (Palomba et al. 2015 ; Palomba et al. 2017 ), or to flag errors and bugs in an application under test (Wei et al. 2017 ). Such traceability links can also be detected between reviews and feedback from other sources such as Twitter, to study whether the same issues are discussed in both digital channels (Yadav and Fard 2020 ; Yadav et al. 2020 ; Oehri and Guzman 2020 ); or between reviews and goals in a goal model, to understand the extent to which an app satisfies the users’ goals (Liu et al. 2020 ; Gao et al. 2020 ).

Table  8 summarizes types of data that have been combined with app reviews using search and information retrieval; indicates the purpose of the analysis; and provides references to primary studies.

3.2.5 Sentiment Analysis

Sentiment analysis (also known as opinion mining) refers to the task of interpreting user emotions in app reviews. The task consists in detecting the sentiment polarity (i.e., positive, neutral, or negative) in a full review (Martens and Johann 2017 ; Martens and Maalej 2019a ; Srisopha et al. 2020c ), in a sentence (Guzman and Maalej 2014 ; Panichella et al. 2015 ; Panichella et al. 2016 ), or in a phrase (Gu and Kim 2015 ; Dąbrowski et al. 2020 ).

App reviews are a rich source of user opinions (Guzman and Maalej 2014 ; Malik et al. 2018 ; Masrury and Alamsyah 2019 ; Martens and Maalej 2019a ; Wen and Chen 2020 ). Mining these opinions involves identifying user sentiment about discussed topics (Gu and Kim 2015 ; Dąbrowski et al. 2020 ), features (Guzman and Maalej 2014 ; Gunaratnam and Wickramarachchi 2020 ) or software qualities (Bakiu and Guzman 2017 ; Masrury and Alamsyah 2019 ; Franzmann et al. 2020 ). These opinions can help software engineers understand how users perceive their app (Guzman and Maalej 2014 ; Gu and Kim 2015 ; Huebner et al. 2018 ; Franzmann et al. 2020 ), discover users’ requirements (Dąbrowski et al. 2019 ; Dalpiaz and Parente 2019 ) and preferences (Guzman and Maalej 2014 ; Bakiu and Guzman 2017 ; Malik et al. 2018 ; Nicolai et al. 2019 ), and identify factors influencing the sales and downloads of the app (Liang et al. 2015 ). Not surprisingly, knowing user opinions is an important information need developers seek to satisfy (Buse and Zimmermann 2012 ; Begel and Zimmermann 2014 ; Dąbrowski et al. 2020 ).
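For illustration, a minimal sketch of sentence-level polarity detection, assuming NLTK's VADER lexicon (only one of many sentiment analysers used in the literature) and hypothetical review sentences:

```python
# Sketch: sentence-level sentiment polarity with NLTK's VADER (hypothetical sentences).
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
analyzer = SentimentIntensityAnalyzer()

sentences = [
    "I love the new photo editor, it works great.",
    "The latest update keeps crashing and drains my battery.",
]

for sentence in sentences:
    compound = analyzer.polarity_scores(sentence)["compound"]
    polarity = "positive" if compound > 0.05 else "negative" if compound < -0.05 else "neutral"
    print(polarity, sentence)
```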

3.2.6 Content Analysis

Content analysis studies the presence of given words, themes, or concepts within app reviews.

For example, studies have analysed the relation between user ratings and the vocabulary and length of their reviews (Hoon et al. 2012 ; Vasa et al. 2012 ). Studies have shown that users discuss diverse topics in reviews (Pagano and Maalej 2013 ; Shams et al. 2020 ), such as app features, qualities (Williams and Mahmoud 2018 ; Franzmann et al. 2020 ), requirements (Wang et al. 2018 ; Wang et al. 2018 ) or issues (Khalid 2013 ; Alqahtani and Orji 2019 ; Kalaichelavan et al. 2020 ; Williams et al. 2020 ). For example, using content analysis, researchers analysed recurring types of issues reported by users (McIlroy et al. 2016 ; Wang et al. 2020a ; Shams et al. 2020 ), their distribution in reviews, as well as relations between app issue type and other information such as price and rating (Iacob et al. 2013b ; Hassan et al. 2018 ) or between issue type and code quality indicators (Di Sorbo et al. 2020 ). Interestingly, studies have pointed out that users’ perception of the same apps can vary per country (Srisopha et al. 2019 ), user gender (Guzman and Paredes Rojas 2019 ), development framework (Malavolta et al. 2015a ), and app store (Ali et al. 2017 ). Content analysis can also be beneficial for software engineers to understand whether cross-platform apps achieve consistency of users’ perceptions across different app stores (Hu et al. 2018 ; Hu et al. 2019 ), or whether hybrid development tools achieve their main purpose: delivering an app that is perceived similarly by users across platforms (Hu et al. 2019 ). Finally, studying the dialogue between users and developers has shown evidence that users are more likely to update their rating for an app as a result of a developer’s response to their reviews (McIlroy et al. 2015 ; Hassan et al. 2018 ).

3.2.7 Recommendation

The recommendation task aims to suggest a course of action that software engineers should follow. Several mining approaches, for instance (Chen et al. 2014 ; Villarroel et al. 2016 ; Scalabrino et al. 2019 ; Gao et al. 2020 ), have been proposed to recommend reviews that software engineers should investigate. These approaches typically assign priorities to a group of comments reporting the same bug (Gao et al. 2015 ; Man et al. 2016 ; Gao et al. 2018b ), or requesting the same modification or improvement (Villarroel et al. 2016 ; Keertipati et al. 2016 ; Scalabrino et al. 2019 ; Zhou et al. 2020 ). Such assigned priorities indicate the relative importance of the information that these reviews convey from the users’ perspective. Factors affecting the importance vary from the number of reviews in these groups (Chen et al. 2014 ; Zhou et al. 2020 ) to the influence of this feedback on app downloads (Tong et al. 2018 ) and the overall sentiment these comments convey (Licorish et al. 2017 ; Gunaratnam and Wickramarachchi 2020 ). In line with this direction, mining approaches have been elaborated to recommend feature refinement plans for the next release (Licorish et al. 2017 ; Zhang et al. 2019 ), to highlight static analysis warnings that developers should check (Wei et al. 2017 ), to recommend test cases triggering bugs (Shams et al. 2020 ), to indicate mobile devices that should be tested (Khalid et al. 2014 ), and to suggest reviews that developers should reply to (Greenheld et al. 2018 ; Gao et al. 2019 ; Srisopha et al. 2020c ); the approaches can analogously recommend responses for these reviews (Greenheld et al. 2018 ; Gao et al. 2019 ), stimulating users to upgrade their ratings or to revise feedback to be more positive (McIlroy et al. 2015 ; Vu et al. 2019 ).

3.2.8 Summarization

Review summarization aims to provide a concise and precise summary of one or more reviews. Review summarization can be performed based on common topics, user intentions, and user sentiment for each topic (e.g., Guzman and Maalej 2014 ; Ciurumelea et al. 2018 ; Liu et al. 2020 ). For example, Di Sorbo et al. ( 2016 , 2017 ) proposed summarizing thousands of app reviews by an interactive report that suggests to software engineers what maintenance tasks need to be performed (e.g., bug fixing or feature enhancement) with respect to specific topics discussed in reviews (e.g., UI improvements). Other review summarization techniques give developers a quick overview about users’ perception specific to core features of their apps (Iacob and Harrison 2013 ; Guzman and Maalej 2014 ; Xiao et al. 2020 ), software qualities (Ciurumelea et al. 2018 ), and/or main users’ concerns (Iacob et al. 2013a ; Iacob et al. 2016 ; Ciurumelea et al. 2017 ; Tao et al. 2020 ). With the addition of statistics, e.g., the number of reviews discussing each topic or requesting specific changes, such a summary can help developers to prioritize their work by focusing on the most important modifications (Ciurumelea et al. 2017 ). In addition, such a summary can be exported to other software management tools, e.g., GitHub, JIRA (Iacob et al. 2016 ), to generate new issue tickets and help in problem resolution (Phetrungnapha and Senivongse 2019 ).
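As an illustration only (assuming reviews already classified by an upstream step; the topics and reviews are hypothetical), even a simple aggregation can yield such a statistics-based summary:

```python
# Sketch: summarizing classified reviews into a per-topic report with counts.
from collections import Counter

# Hypothetical output of an upstream classification step: (topic, review) pairs.
classified_reviews = [
    ("bug: login", "login fails after update"),
    ("bug: login", "cannot sign in anymore"),
    ("feature request: dark mode", "please add dark mode"),
    ("bug: login", "stuck at the login screen"),
]

counts = Counter(topic for topic, _ in classified_reviews)
for topic, count in counts.most_common():
    print(f"{topic}: {count} review(s)")  # most frequently discussed concerns first
```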

3.2.9 Visualization

Visualization can aid developers in identifying patterns, trends and outliers, making it easier to interpret information mined from reviews (Guzman et al. 2014 ; Liu et al. 2020 ). To communicate information clearly and efficiently, review visualization uses tables, charts, and other graphical representations (Guzman et al. 2014 ; Maalej et al. 2016 ), accompanied by numerical data (Maalej et al. 2016 ; Bakiu and Guzman 2017 ). For example, Maalej et al. ( 2016 ) demonstrated that trend analysis of review type (e.g., bug report, feature request, user experience) over time can be used by software engineers as an overall indicator of the project’s health. Other studies proposed visualizing the dynamics of main themes discussed in reviews to identify emerging issues (Gao et al. 2015 ; Gao et al. 2015 ; Gao et al. 2018b ; Gao et al. 2019 ), or to show the issue distribution for an app across different app stores (Man et al. 2016 ). Simple statistics about these issues (e.g., ‘How many reviews reported specific issues?’) can give an overall idea about the main problems, in particular if compared against other apps (e.g., ‘Do users complain more about security of my app compared to similar apps?’). Similarly, analyzing the evolution of user opinions and bug reports about specific features can help software engineers monitor the health of these features and prioritize maintenance tasks (Vu et al. 2015a ; Vu et al. 2016 ; Bakiu and Guzman 2017 ; Shah et al. 2019c ). For instance, software engineers can analyse how often negative opinions emerge, for how long these opinions have been reported, and whether their frequency is rising or declining (Vu et al. 2015a ; Gu and Kim 2015 ; Tao et al. 2020 ). This information could provide developers with evidence of the relative importance of these opinions from a users’ perspective (Bakiu and Guzman 2017 ; Dąbrowski et al. 2019 ).
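For illustration only, a minimal sketch of such a trend visualization with matplotlib; the weekly counts and issue categories are hypothetical, not taken from any primary study:

```python
# Sketch: visualizing the weekly volume of issue-reporting reviews over time
# (hypothetical counts) to help spot emerging issues.
import matplotlib.pyplot as plt

weeks = ["W1", "W2", "W3", "W4", "W5", "W6"]
crash_reports = [3, 4, 2, 15, 18, 9]      # spike after a possibly flawed release
battery_complaints = [5, 4, 6, 5, 4, 5]   # stable background level

plt.plot(weeks, crash_reports, marker="o", label="crash reports")
plt.plot(weeks, battery_complaints, marker="s", label="battery complaints")
plt.xlabel("release week")
plt.ylabel("number of reviews")
plt.legend()
plt.title("Trend of issue-related reviews (hypothetical data)")
plt.show()
```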


3.3 RQ2: Mining Techniques

App review analyses (see Section  3.2 ) are realized using different text mining techniques. In this section, we address RQ2 (what techniques are used to realize app review analysis?) based on the data extracted in F7 (mining technique), which we grouped following the classification schema we had constructed for this study (see Section  2.4 ). The categories of this schema come from a survey on intelligent mining techniques and tools (Tavakoli et al. 2018 ) and the text analytics area (Miner et al. 2012 ; Singh 2021 ; Software 2021 ).

In answer to this question, we identified 4 broad categories of mining techniques: manual analysis (MA), natural language processing (NLP), machine learning (ML) and statistical analysis (SA). Table  9 lists the techniques and their prevalence in the literature. It can be observed that more than half of the studies employed NLP or ML, whereas MA and SA were present in 25% and 29% of the studies, respectively. Table  10 reports how many studies used a certain technique to realize a given type of app review analysis. We observe that NLP and ML are dominant for realizing app review analyses, except for Content Analysis, which is mostly performed using MA or SA techniques.

A single study frequently used the same type of technique for realizing several app review analyses (e.g., Clustering, Classification) Footnote 9 ; on the other hand, we also recorded that studies frequently combined techniques to perform a single app review analysis. Table  11 reports what combinations of techniques were used in the literature and how many studies used each combination for realizing a specific app review analysis. Footnote 10 The results indicate that NLP and ML were mostly combined for Classification; MA and SA were used together for Content Analysis; and NLP and SA were adopted for Information Extraction. The following sections discuss each type of technique.

3.3.1 Manual Analysis

Scholars have shown an interest in manual analysis of app reviews (Kurtanovic and Maalej 2018 ; van Vliet et al. 2020 ). The technique is used to facilitate Content Analysis, e.g., to understand topics users discuss (Pagano and Maalej 2013 ; Franzmann et al. 2020 ; Williams et al. 2020 ), and to develop a ground truth dataset for training and evaluating mining techniques (Kurtanović and Maalej 2017 ; Dąbrowski et al. 2020 ). Manual analysis typically takes the form of tagging a group of sample reviews with one or more meaningful tags (representing certain concepts). For example, tags might indicate types of user complaint (Khalid et al. 2015 ; Wang et al. 2020a ), features discussed in reviews (Maalej and Nabil 2015 ; Dąbrowski et al. 2020 ), or the sentiment users express (Sänger et al. 2016 ). To make replicable and valid inferences upon manual analysis, studies perform it in a systematic manner. Figure  4 illustrates the overall procedure of manual analysis. Scholars first formulate the analysis objective corresponding to the exploration of review content (e.g., understanding types of user complaints) or the development of ground truth (e.g., labelling types of user feedback). They then select the reviews to be analysed, and specify the unit of analysis (e.g., a review or a sentence). Next, one or more humans (called ‘coders’) follow a coding process to systematically annotate the reviews. A coder examines a sample of reviews and tags them with specific concepts. Unless these concepts are known in advance or coders agree about the tagging, the step is iterative. When, for example, new concepts are identified, coders examine once again all the previously tagged reviews and check whether they should also be tagged with the new concepts. Such iterations minimize the threat of human error when tagging the reviews. Once all the reviews are tagged, authors either analyse the findings or use the dataset to evaluate other mining techniques (Stanik et al. 2019 ; Williams et al. 2020 ; Dąbrowski et al. 2020 ).

figure 4

Figure showing the overall process of manual analysis

Manual analysis is time-consuming and requires vast human effort (Pagano and Maalej 2013 ; Guzman and Maalej 2014 ; van Vliet et al. 2020 ); a pilot study typically precedes the actual analysis (Sänger et al. 2016 ; Kurtanović and Maalej 2017 ; Dąbrowski et al. 2020 ); subsequently the actual tagging, focusing on a statistically representative sample of reviews, takes place (Khalid et al. 2015 ). For example, Guzman and Maalej ( 2014 ) involved seven coders who independently tagged 2800 randomly sampled user reviews. For each review, two coders independently tagged the type of user feedback, the features mentioned in the review and the sentiments associated with these features. The study reports that coders spent between 8 and 12.5 hours coding around 900 reviews.

3.3.2 Natural Language Processing

User-generated content of app reviews takes the form of text (Hoon et al. 2012 ; Vasa et al. 2012 ). Such text has plenty of linguistic structure intended for human consumption rather than for computers (Jurafsky and Martin 2009 ). The content must, therefore, undergo a good amount of natural language processing (NLP) before it can be used (Manning et al. 2008 ; Jurafsky and Martin 2009 ). Given this fact, it is not surprising that the majority of primary studies (62% of surveyed papers) adopt NLP techniques to support review analysis (see Section  3.2 ). At a high level, pre-processing can be simply seen as turning review content into a form that is analysable for a specific mining task (see Section  3.2 ). There are different ways to pre-process reviews including text normalization, cleaning and augmenting (Manning et al. 2008 ; Jurafsky and Martin 2009 ; Panichella et al. 2015 ; Gao et al. 2020 ). These pre-processing steps typically involve converting texts into lowercase (Fu et al. 2013 ; Sänger et al. 2016 ; Hadi and Fard 2020 ), breaking up a text into individual sentences (Lu and Liang 2017 ; Jha and Mahmoud 2017a ; Zhou et al. 2020 ), separating out words i.e., tokenization (Iacob et al. 2016 ; Palomba et al. 2017 ; Al-Hawari 2020 ), spelling correction (Palomba et al. 2017 ; Grano et al. 2018 ) as well as turning words into their base forms e.g., stemming or lemmatization (Maalej and Nabil 2015 ; Lu and Liang 2017 ; Panichella et al. 2015 ; Xiao 2019 ). Of course, not all the review content is meaningful (Guzman and Maalej 2014 ; Chen et al. 2014 ; Oehri and Guzman 2020 ). Some parts are noisy and obstruct text analysis (Palomba et al. 2015 ; Palomba et al. 2017 ; Gunaratnam and Wickramarachchi 2020 ). The content is thus cleaned by removing punctuation (Puspaningrum et al. 2018 ; Hu et al. 2019 ), filtering out noisy words like stop words (Johann et al. 2017 ; Ciurumelea et al. 2017 ; Gunaratnam and Wickramarachchi 2020 ), or non-English words (Palomba et al. 2015 ; Stanik et al. 2019 ). Such normalized and cleaned text tends to be augmented with additional information based on linguistic analysis e.g., part-of-speech tagging (PoS) (Puspaningrum et al. 2018 ; Zhang et al. 2019 ; Gunaratnam and Wickramarachchi 2020 ) or dependency parsing (Gu and Kim 2015 ; Liu et al. 2018 ; Song et al. 2020 ).
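To make these steps concrete, the following sketch (an illustration with NLTK, not a pipeline taken from any primary study; the review text is hypothetical) normalizes, cleans, and augments a review:

```python
# Sketch: typical review pre-processing steps (lowercasing, sentence splitting,
# tokenization, stop-word removal, stemming, PoS tagging) using NLTK.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

for resource in ("punkt", "stopwords", "averaged_perceptron_tagger"):
    nltk.download(resource, quiet=True)

review = "The app keeps crashing. Please fix the login screen!"
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

for sentence in nltk.sent_tokenize(review.lower()):      # normalization + sentence splitting
    tokens = nltk.word_tokenize(sentence)                 # tokenization
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]  # cleaning
    print(nltk.pos_tag(tokens))                           # augmenting with PoS tags
    print([stemmer.stem(t) for t in tokens])              # stemming to base forms
```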

A review can be modelled as a word sequence (Johann et al. 2017 ), bag-of-words (BoW) (Maalej and Nabil 2015 ) or in a vector space model (VSM) (Vu et al. 2015a ) to serve as input for other mining techniques. In particular, primary studies refer to NLP techniques for comparing text similarity (Vu et al. 2015b ; Wang et al. 2018 ), pattern matching (Groen et al. 2017 ; Johann et al. 2017 ; Song et al. 2020 ) and collocation finding (Guzman and Maalej 2014 ; Li et al. 2018 ; Dalpiaz and Parente 2019 ; Xiao et al. 2020 ).

Text similarity techniques (employed in 21 studies) determine how “close” two textual snippets (e.g., review sentences) are (Manning et al. 2008 ). These snippets, represented in VSM or BoW, are compared using similarity measures such as cosine similarity (Vu et al. 2015a ; Shams et al. 2020 ), the Dice similarity coefficient (Palomba et al. 2015 ; Zhou et al. 2020 ) or the Jaccard index (Iacob et al. 2016 ). These techniques support Search and Information Retrieval, e.g., to link reviews with issue reports from issue tracking systems (Noei et al. 2019 ), Recommendation, e.g., to recommend review responses based on old ones that have been posted to similar reviews (Greenheld et al. 2018 ), Clustering, e.g., to group semantically similar user opinions (Vu et al. 2016 ; Malgaonkar et al. 2020 ), and Content Analysis, e.g., to compare review content (Malavolta et al. 2015a ).
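A minimal sketch of such a comparison (hypothetical snippets; cosine similarity over TF-IDF vectors, which is only one of the measures mentioned above):

```python
# Sketch: measuring how "close" two review snippets are using cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

snippet_a = "the app crashes when uploading photos"
snippet_b = "uploading a photo makes the application crash"

vectors = TfidfVectorizer().fit_transform([snippet_a, snippet_b])
similarity = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"cosine similarity: {similarity:.2f}")  # closer to 1 means more similar text
```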

Pattern matching techniques (employed in 22 studies) localize parts of review text (or of its linguistic analysis) matching hand-crafted patterns. Such patterns can take many forms, such as regular expressions (Yang and Liang 2015 ; Groen et al. 2017 ; Uddin et al. 2020 ), PoS sequences (Vu et al. 2016 ; Johann et al. 2017 ), dependencies between words (Gu and Kim 2015 ; Peng et al. 2016 ; Di Sorbo et al. 2017 ; Srisopha et al. 2020c ) or simple keyword matching (Yang and Liang 2015 ; Maalej et al. 2016 ; Di Sorbo et al. 2017 ; Tao et al. 2020 ). The technique has been adopted in Information Extraction, e.g., to extract requirements from reviews (Yang and Liang 2015 ; Groen et al. 2017 ), Classification, e.g., to classify requirements into functional and non-functional (Yang and Liang 2015 ), and Summarization, e.g., to provide a bug report summary (Groen et al. 2017 ).
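For illustration, a sketch of simple keyword and regular-expression patterns for flagging feature requests; the patterns themselves are assumptions, not those used by the cited studies:

```python
# Sketch: flagging likely feature requests with hand-crafted regular expressions.
import re

# Hypothetical patterns; real studies use richer sets of keywords, PoS sequences,
# or dependency patterns.
FEATURE_REQUEST_PATTERNS = [
    r"\bplease (add|support|include)\b",
    r"\bwould be (nice|great) (if|to)\b",
    r"\bi wish\b",
]

def is_feature_request(review: str) -> bool:
    text = review.lower()
    return any(re.search(pattern, text) for pattern in FEATURE_REQUEST_PATTERNS)

print(is_feature_request("Please add an offline mode"))    # True
print(is_feature_request("Great app, works as expected"))  # False
```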

Collocation finding techniques are employed for Information Extraction, e.g., to extract features (Guzman and Maalej 2014 ; Xiao 2019 ) or issues (Gao et al. 2018b ) from reviews. Such collocations are phrases consisting of two or more words, where these words appear side-by-side in a given context more commonly than the word parts appear separately (Jurafsky and Martin 2009 ). The most common type of collocation detected in the primary studies is the bigram, i.e., two adjacent words (Guzman and Maalej 2014 ; Dalpiaz and Parente 2019 ). Co-occurrence alone may be insufficient, as phrases such as ’all the’ may co-occur frequently but are not meaningful. Hence, primary studies explore several methods to retain the most meaningful collocations, such as Pointwise Mutual Information (PMI) (Gao et al. 2018b ; Malgaonkar et al. 2020 ) and hypothesis testing (Jurafsky and Martin 2009 ; Guzman and Maalej 2014 ; Dąbrowski et al. 2020 ).
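As a sketch over a hypothetical mini-corpus, bigram collocations can be ranked by PMI with NLTK's collocation utilities:

```python
# Sketch: finding bigram collocations (candidate feature names) ranked by PMI.
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

corpus = (
    "dark mode would be great please add dark mode "
    "the photo editor is slow photo editor crashes often"
)
tokens = corpus.split()

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # keep bigrams that occur at least twice
print(finder.nbest(measures.pmi, 5))  # e.g. ('dark', 'mode'), ('photo', 'editor')
```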

3.3.3 Machine Learning

Overall, 108 of 182 primary studies (59%) reported the use of machine learning (ML) techniques to facilitate mining tasks and review analysis. Table  12 reports ten most commonly applied ML techniques. Most of them (i.e., 8 techniques) are supervised, whereas 2 of them are unsupervised (Bishop 2006 ). The widespread interest in ML techniques may be attributed to the fact that Clustering e.g., to group reviews discussing the same topics (Fu et al. 2013 ; Srisopha et al. 2020b ) and Classification e.g., to categorize user feedback based on user intention (Dhinakaran et al. 2018 ; Zhou et al. 2020 ), among the most common review analysis types (see Table  7 ), are mainly facilitated using ML. When looking at the whole spectrum of review analysis these ML techniques support, we have also recorded their use for Sentiment Analysis e.g., to identify feature-specific sentiment (Gu and Kim 2015 ), Recommendation e.g., to assign priorities to reviews reporting bugs (Villarroel et al. 2016 ) and Information Extraction e.g., to identify features (Sänger et al. 2017 ; Wang et al. 2020b ).

Scholars experimented with many textual and non-textual review properties Footnote 11 to make ML techniques work best (Maalej and Nabil 2015 ; Guzman et al. 2015 ). Choosing informative and independent properties is a crucial step to make these techniques effective (Bishop 2006 ; Maalej et al. 2016 ). Textual properties, for example, concern: text length, tense of text (Kurtanović and Maalej 2017 ; Kurtanovic and Maalej 2018 ), importance of words, e.g., tf-idf (Lu and Liang 2017 ; Williams et al. 2020 ), word sequences, e.g., n-grams (Maalej and Nabil 2015 ; Al-Hawari 2020 ), as well as linguistic analysis, e.g., dependency relationships (Shah et al. 2018 ). These properties are commonly combined with non-textual properties like user sentiment (Maalej et al. 2016 ; Srisopha et al. 2020a ), review rating (Kurtanović and Maalej 2017 ) or app category (Gao et al. 2019 ). We found that primary studies experiment with different properties (Maalej et al. 2016 ; Kurtanovic and Maalej 2018 ; Al-Hawari 2020 ).
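As an illustrative sketch only (hypothetical labelled data and labels; real studies use far larger annotated datasets and richer feature sets), a supervised classifier over tf-idf features might look as follows:

```python
# Sketch: training a Naive Bayes classifier to label reviews as bug reports or
# feature requests using tf-idf features (hypothetical training data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_reviews = [
    "the app crashes when I open my profile",
    "login fails after the last update",
    "please add support for dark mode",
    "it would be great to export data as csv",
]
train_labels = ["bug report", "bug report", "feature request", "feature request"]

classifier = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
classifier.fit(train_reviews, train_labels)

print(classifier.predict(["the camera freezes on startup"]))   # likely 'bug report'
print(classifier.predict(["could you add a widget please"]))   # likely 'feature request'
```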

3.3.4 Statistical Analysis

Statistical analysis is used in many papers to report research results (Martin et al. 2015 ; Sänger et al. 2016 ; Di Sorbo et al. 2020 ), demonstrate their significance (Vasa et al. 2012 ; Khalid et al. 2016 ), and draw conclusions about a large population of reviews by analysing a small portion of them (Pagano and Maalej 2013 ; Mercado et al. 2016 ; Wang et al. 2020a ). We observed an interest in the use of descriptive and inferential techniques for Content Analysis, e.g., Vasa et al. ( 2012 ), Pagano and Maalej ( 2013 ), Mercado et al. ( 2016 ), Guzman et al. ( 2018 ), and Wang et al. ( 2020a ). Summary statistics, box plots, and cumulative distribution charts help to gain understanding of review characteristics like their vocabulary size (Hoon et al. 2012 ; Vasa et al. 2012 ), issue type distribution (McIlroy et al. 2016 ; Hu et al. 2018 ; Williams et al. 2020 ), or the topics these reviews convey (Pagano and Maalej 2013 ; Srisopha and Alfayez 2018 ). Scholars employ different statistical tests to check their hypotheses (Khalid et al. 2016 ; Guzman and Paredes Rojas 2019 ; Franzmann et al. 2020 ), to examine relationships between review characteristics (Srisopha and Alfayez 2018 ; Guzman and Paredes Rojas 2019 ; Di Sorbo et al. 2020 ), and to study how sampling bias affects the validity of research results (Martin et al. 2015 ).

Guzman et al. ( 2018 ) and Guzman and Paredes Rojas ( 2019 ), for example, conducted an exploratory study investigating 919 reviews from eight countries. They studied how reviews written by male and female users differ in terms of content, sentiment, rating, timing, and length. The authors employed the Chi-square (e.g., content) and Mann-Whitney (e.g., rating) non-parametric tests for nominal and ordinal variables, respectively (Guzman and Paredes Rojas 2019 ). Srisopha and Alfayez ( 2018 ) studied whether a relationship exists between user satisfaction and the application’s internal quality characteristics. Using the Pearson correlation coefficient, the authors studied to what extent warnings reported by static code analysis tools correlate with different types of user feedback and the average user ratings. Similarly, another study employed the Mann-Whitney test to examine whether the densities of such warnings differ between apps with high and low ratings (Khalid et al. 2016 ).
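For illustration only, a minimal sketch of such a non-parametric comparison with SciPy; the rating samples and grouping are hypothetical, not data from the cited studies:

```python
# Sketch: Mann-Whitney U test comparing ratings between two groups of reviews
# (hypothetical data; e.g. reviews of low-rated vs. high-rated apps).
from scipy.stats import mannwhitneyu

ratings_group_a = [1, 2, 2, 3, 3, 4, 2, 1, 3, 2]
ratings_group_b = [3, 4, 4, 5, 3, 5, 4, 4, 5, 3]

statistic, p_value = mannwhitneyu(ratings_group_a, ratings_group_b, alternative="two-sided")
print(f"U = {statistic:.1f}, p = {p_value:.4f}")  # a small p suggests the distributions differ
```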


3.4 RQ3: Supporting Software Engineering

To answer RQ3 (what software engineering activities might be supported by analysing app reviews?), we used the data extracted in F8 (software engineering activity) and F9 (justification) as well as the classification schema of SE activities derived from the software engineering body of knowledge (see Section  2.4 ). Table  13 provides a mapping between primary studies and the SE activities that the studies claim to support Footnote 12 ; it also reports the number and percentage of the studies per activity. We can observe that primary studies motivated their approaches to support activities across different software engineering phases, including requirements (36%), maintenance (36%), testing (15%) and design (4%); 14 SE activities are supported in total; most research effort is focused on requirements elicitation (26%), requirements prioritization (10%), validation by users (11%), problem and modification analysis (23%), and requested modification prioritization (11%). We also recorded that 62 studies (34%) did not specify any SE activity that their approaches support.

To support the SE activities, primary studies used the 9 broad types of app review analysis we identified in answer to RQ1 (see Section  3.2 ). Table  14 shows how often a type of review analysis was used for a SE activity. Footnote 13 It can be observed that each SE activity was supported using multiple analyses; classification was the most commonly used one, and the only analysis motivated for all the activities. A further analysis of the results revealed that studies used the analyses in combination to mine useful information and support SE activities; we recorded 53 unique combinations, each composed of 1 to 5 types of analysis, with a median of 2. Table  15 lists combinations used in at least 2 primary studies. The following sections provide a thorough synthesis of how mining useful information from app reviews might support SE activities.

3.4.1 Requirements

Requirements engineering includes involving system users, obtaining their feedback and agreeing on the purpose of the software to be built (Maalej et al. 2016 ). It is therefore not surprising that review analysis has received much attention to support requirements engineering activities, including requirements elicitation, requirements classification, requirements prioritization and requirements specification (see Table  13 ).

Requirements Elicitation

In app reviews, users give feedback describing their experience with apps, expressing their satisfaction with software products and raising needs for improvements (Pagano and Maalej 2013 ; AlSubaihin et al. 2019 ). Software engineers can make use of the reviews to elicit new requirements (AlSubaihin et al. 2019 ; Dalpiaz and Parente 2019 ; Dąbrowski et al. 2019 ; 2020 ). For instance, they can employ opinion mining approaches to examine reviews talking negatively about app features (Guzman and Maalej 2014 ; Shah et al. 2016 ; Li et al. 2018 ; Shah et al. 2019c ; Liu et al. 2019 ; Dalpiaz and Parente 2019 ; Dąbrowski et al. 2019 ; 2020 ). This can help developers to understand user concerns about problematic features, and potentially help elicit new requirements (Johann et al. 2017 ; Dalpiaz and Parente 2019 ; Dąbrowski et al. 2019 ; 2020 ). Additionally, searching and retrieving user reviews that refer to a specific feature allows the engineers responsible for that feature to quickly identify what users have been saying about it (Li et al. 2018 ; Dąbrowski et al. 2019 ; Liu et al. 2019 ). In line with this direction, approaches have been proposed to classify reviews by their user intention (e.g., a reviewer requesting a new feature) (Iacob et al. 2013a ; Maalej and Nabil 2015 ; Maalej et al. 2016 ; Villarroel et al. 2016 ; Scalabrino et al. 2019 ; Song et al. 2020 ) and by the type of requirements these reviews formulate (e.g., functional or non-functional) (Yang and Liang 2015 ; Lu and Liang 2017 ; Al Kilani et al. 2019 ; Jha and Mahmoud 2019 ; Wen and Chen 2020 ). Such aggregated information can be further summarized and visualized to developers as a report of all the feature requests reported for an app (Iacob et al. 2013a ; Iacob et al. 2016 ; Di Sorbo et al. 2016 ; Di Sorbo et al. 2017 ; Ciurumelea et al. 2018 ; Liu et al. 2020 ).

Requirements Classification

User feedback can be classified in a number of dimensions (Bourque et al. 1999 ). Several studies classified user comments based on the types of requirements the feedback conveys (Yang and Liang 2015 ; Deocadez et al. 2017a ; Lu and Liang 2017 ; Groen et al. 2017 ; Wang et al. 2018 ; Wang et al. 2018 ; Jha and Mahmoud 2019 ; Wen and Chen 2020 ). These works typically classified the feedback into two broad categories: functional requirements (FRs) specifying the behavior of an app, and non-functional requirements (NFRs) describing the constraints and quality characteristics of the app. Classification at a further level of granularity has also been demonstrated (Lu and Liang 2017 ; Wang et al. 2018 ; Jha and Mahmoud 2019 ; Wen and Chen 2020 ; van Vliet et al. 2020 ): user feedback can be classified into the concrete quality characteristics it refers to, e.g., those defined by the ISO 25010 model (ISO/IEC 25010 2011), so that software engineers can analyse candidate requirements more efficiently.

Requirements Prioritization

Statistics about user opinions and requests can help prioritize software maintenance and evolution tasks (Pagano and Maalej 2013 ; Guzman and Maalej 2014 ; Maalej et al. 2016 ; Johann et al. 2017 ; Dąbrowski et al. 2019 ; 2020 ). Bugs and missing features that are more commonly reported can be prioritized over those less commonly reported (Villarroel et al. 2016 ; Kurtanović and Maalej 2017 ; Kurtanovic and Maalej 2018 ; Scalabrino et al. 2019 ; Di Sorbo et al. 2020 ). Users’ requests may not by themselves be sufficient for prioritization (one must also consider costs and the needs of other stakeholders) but can provide valuable evidence-based information to support prioritization (Maalej et al. 2016 ; Shah et al. 2019c ; Oehri and Guzman 2020 ).

Requirements Specification

Requirements specification consists in structuring and documenting detailed descriptions of the software’s required behaviour and quality properties (van Lamsweerde 2009 ). App reviews can serve for generating lightweight partial documentation of user requirements; they convey information about functional and non-functional requirements, usage scenarios and user experience (Pagano and Maalej 2013 ; Maalej et al. 2016 ; Maalej et al. 2016 ; Kurtanović and Maalej 2017 ; Kurtanovic and Maalej 2018 ; Williams et al. 2020 ). Software engineers can benefit from review mining approaches that turn this information into first drafts of software requirements specifications (SRS) or user stories (Pagano and Maalej 2013 ; Maalej et al. 2016 ; Maalej et al. 2016 ). These approaches can, for example, classify reviews by the type of requests users make (e.g., asking for new functions), summarise reviews referring to the same requests, and generate a provisional SRS based on the information. Such an SRS may list new functions that users require, recap scenarios in which these functions are used, and report statistics indicating the relative importance of the requirements, e.g., by the number of users requesting the functions (Maalej et al. 2016 ). Since users often justify their needs and opinions, the SRS may also document user rationales serving later for requirements negotiation or design decisions (Kurtanović and Maalej 2017 ; Kurtanovic and Maalej 2018 ).

3.4.2 Design

A few studies motivated app review analysis to assist software design activities: user interface (UI) design (Alqahtani and Orji 2019 ; Sharma and Bashir 2020 ; Franzmann et al. 2020 ) and capturing design rationale (Groen et al. 2017 ; Kurtanović and Maalej 2017 ; Kurtanovic and Maalej 2018 ; Jha and Mahmoud 2019 ; Kunaefi and Aritsugi 2020 ).

User Interface Design

The success of mobile applications depends substantially on user experience (AlSubaihin et al. 2019 ; Franzmann et al. 2020 ). For an app to be successful, software engineers should design the interface to match the experience, skills and needs of users (Bourque et al. 1999 ). Alqahtani and Orji performed a content analysis of user reviews to identify usability issues in mental health apps (Alqahtani and Orji 2019 ). They manually tagged 1,236 reviews with different types of usability issues for 106 apps from Apple’s App Store and Google Play. Poor design of the user interface was the second most frequently reported issue. It has been found that user-submitted content concerning the interface may provide valuable design recommendations on how to improve interface layout, boost readability and ease app navigation. UI/UX designers should therefore take advantage of the feedback; if addressed, it would likely increase user engagement with the apps and reduce the attrition rate (Franzmann et al. 2020 ).

Design Rationale Capture

Design rationale is essential for making the right design decisions and for evaluating architectural alternatives for a software system (Nuseibeh 2001 ; Burge et al. 2008 ). A few studies motivated their approaches to capture potential reasons for design decisions (Groen et al. 2017 ; Kurtanović and Maalej 2017 ; Kurtanovic and Maalej 2018 ; Jha and Mahmoud 2019 ; Kunaefi and Aritsugi 2020 ). Kurtanović and Maalej devised a grounded theory for gathering user rationale and evaluated different review classification approaches to mine the information from app reviews (Kurtanović and Maalej 2017 ; Kurtanovic and Maalej 2018 ). User justifications e.g., on problems they encounter or criteria they chose for app assessment (e.g., reliability or performance) can enrich documentation with new design rationale and guide design decisions. Similarly, user-reported NFR can convey architecturally significant requirements and serve as rationale behind an architecture decision (Nuseibeh 2001 ; Groen et al. 2017 ; Kunaefi and Aritsugi 2020 ). To capture such requirements, app reviews can be classified by quality characteristics users discuss (Nuseibeh 2001 ; Groen et al. 2017 )

3.4.3 Testing

App review analysis can be used to support various testing activities: validation by users (Iacob et al. 2013a ; Iacob et al. 2013b ; Iacob and Harrison 2013 ; Guzman et al. 2014 ; Guzman and Maalej 2014 ; Maalej and Nabil 2015 ; Gu and Kim 2015 ; Maalej et al. 2016 ; Bakiu and Guzman 2017 ; Ciurumelea et al. 2018 ; Durelli et al. 2018 ; Liu et al. 2018 ; Shah et al. 2019c ; Gao et al. 2019 ; AlSubaihin et al. 2019 ; Dąbrowski et al. 2020 ; Xiao et al. 2020 ), test documentation (Iacob et al. 2016 ; Grano et al. 2018 ; Pelloni et al. 2018 ), test design (Man et al. 2016 ; Maalej et al. 2016 ; Groen et al. 2017 ; Shams et al. 2020 ) and test prioritization (Khalid et al. 2014 ).

Validation by Users

Evaluating a software system with users usually involves expensive usability testing in a laboratory (Iacob et al. 2013a ) or acceptance testing performed in a formal manner (IEEE 1990 ). In the case of mobile apps, software engineers can exploit user feedback to assess user satisfaction (Fu et al. 2013 ; Iacob et al. 2013a ; Iacob et al. 2013b ; Gu and Kim 2015 ; Bakiu and Guzman 2017 ; Ciurumelea et al. 2018 ; Shah et al. 2019c ; Xiao 2019 ; Dąbrowski et al. 2020 ) and to identify any glitches with their products (Iacob et al. 2013a ; Maalej and Nabil 2015 ; Gu and Kim 2015 ; Maalej et al. 2016 ; Ciurumelea et al. 2018 ; AlSubaihin et al. 2019 ; Gao et al. 2019 ). A recent survey with practitioners has shown that developers release the alpha/beta version of their apps to test the general reaction of users and to discover bugs (AlSubaihin et al. 2019 ).

In line with this direction, several approaches have been proposed to mine user opinions (Guzman and Maalej 2014 ; Guzman et al. 2014 ; Gu and Kim 2015 ; Bakiu and Guzman 2017 ; Shah et al. 2019c ; Dąbrowski et al. 2020 ; Xiao et al. 2020 ) and to generate bug reports (Iacob et al. 2013a ; Maalej and Nabil 2015 ; Maalej et al. 2016 ; Man et al. 2016 ; Ciurumelea et al. 2018 ; Liu et al. 2018 ; Shah et al. 2019c ). Opinion mining approaches help to discover the most problematic features and to quantify the number of negative opinions. Knowing what features users praise or hate can give a developer a hint about user acceptance of these features (Bakiu and Guzman 2017 ; AlSubaihin et al. 2019 ; Dąbrowski et al. 2020 ). Assuming core features have been modified, the team may want to know how users react to these features so that they can fix any issues quickly and refine these features. Analogously, identifying and quantifying reported bugs within a given time frame can help a development team during beta testing before an official release (Iacob et al. 2013a ; Iacob et al. 2013b ; Ciurumelea et al. 2018 ; Gao et al. 2019 ; Shah et al. 2019c ). If the number of reported issues is unusually high, development teams can reschedule the release of a new version in order to refocus on quality management and testing (Maalej and Nabil 2015 ; Maalej et al. 2016 ).

Test Documentation

Test documentation can be partly supported by analysing app reviews (Iacob et al. 2016 ; Pelloni et al. 2018 ; Grano et al. 2018 ). Iacob et al. developed a tool that produces a summary of bugs reported in reviews, with a breakdown by app version and the features that these bugs refer to (Iacob et al. 2016 ). Such a summary can form the basis for later debugging the app and fixing the problems. User comments can also be integrated into mobile app testing tools (Pelloni et al. 2018 ; Grano et al. 2018 ). These tools generate a report of stack traces leading to an app crash (Pelloni et al. 2018 ; Grano et al. 2018 ). Analyzing this information to understand the root of the problems can often be unintuitive. In such cases, user comments can serve as a human-readable companion to the report; linked to a related stack trace, a user-written description of the problem can instantly guide testers to where the fault emerged (Pelloni et al. 2018 ; Grano et al. 2018 ).

Test Design

Analysing app reviews can support test case design (Man et al. 2016 ; Maalej et al. 2016 ; Groen et al. 2017 ; Shams et al. 2020 ). Analysing reported issues can help testers determine the app behavior, features, and functionality to be tested (Man et al. 2016 ). Reviews may describe a particular use of the software in which users encountered an unusual situation (e.g., crashing without informing users of what happened) or inform about the lack of support for users in finding a workaround (Maalej et al. 2016 ). Such information may help testers to design test cases capturing exceptions leading to a problem or to exercise new alternative scenarios other than those initially considered (Maalej et al. 2016 ; Groen et al. 2017 ; Shams et al. 2020 ). Additionally, identifying negative comments on quality characteristics can help in specifying acceptance criteria an app should hold (Groen et al. 2017 ). For example, user complaints about performance efficiency can indicate performance criteria for functions that are expected to finish faster or more smoothly (Groen et al. 2017 ).

Test Prioritization

Reviews and their ratings have been found to correlate with download rank, a key measure of an app’s success (Khalid et al. 2015 ; Martin et al. 2017 ). User complaints about specific issues can have a negative impact on the rating and, in turn, discourage users from downloading an app (Khalid et al. 2015 ). It has therefore been suggested to prioritize issue-related test cases based on the frequency and impact of these complaints (Khalid et al. 2015 ; Man et al. 2016 ). To address device-specific problems, a development team must test their apps on a large number of devices, which is inefficient and costly (Erfani et al. 2013 ). The problem can be partially ameliorated by selecting the devices, reported in reviews, that have the greatest impact on app ratings (Khalid et al. 2014 ). The strategy can be particularly useful for a team with limited resources that can only afford to buy a few devices; using it, they can determine the optimal set of devices to buy on which to test their app (Khalid et al. 2014 ).

3.4.4 Maintenance

In an attempt to support software maintenance, review analysis has been proposed for problem and modification analysis, requested modification prioritization, help desk and impact analysis (see Table  13 ).

Problem and Modification Analysis

Software engineers strive continuously to satisfy user needs and keep their app product competitive in the market (AlSubaihin et al. 2019 ). To this end, they can exploit approaches facilitating problem and modification analysis (Fu et al. 2013 ; Khalid 2013 ; Cen et al. 2014 ; Guzman et al. 2014 ; Gao et al. 2015 ; Gomez et al. 2015 ; Panichella et al. 2015 ; Gao et al. 2015 ; Palomba et al. 2015 ; Guzman et al. 2015 ; Khalid et al. 2015 ; Khalid et al. 2015b ; Malik and Shakshuki 2016 ; Vu et al. 2016 ; Di Sorbo et al. 2016 ; Iacob et al. 2016 ; Wei et al. 2017 ; Licorish et al. 2017 ; Johann et al. 2017 ; Bakiu and Guzman 2017 ; Deocadez et al. 2017b ; Wang et al. 2017 ; Palomba et al. 2017 ; Malik et al. 2018 ; Gao et al. 2018b ; Muñoz et al. 2018 ; Palomba et al. 2018 ; Pelloni et al. 2018 ; Tong et al. 2018 ; Dalpiaz and Parente 2019 ; Phetrungnapha and Senivongse 2019 ; Gao et al. 2019 ; Shah et al. 2019c ; AlSubaihin et al. 2019 ; Li et al. 2020 ; Hadi and Fard 2020 ; Zhou et al. 2020 ). The approaches detect user requests in app store feedback and classify them as problem reports and modification requests (Zhou et al. 2020 ). Fine-grained classification can be carried out too, for example, to detect specific issues like privacy (Khalid 2013 ; Cen et al. 2014 ; Tao et al. 2020 ) or concrete change requests like feature enhancements (Palomba et al. 2017 ; Al-Hawari 2020 ). Mining such information allows software engineers to determine and analyze user demands in a timely and efficient fashion (Gao et al. 2015 ; Wang et al. 2017 ; Gao et al. 2018b ; Gao et al. 2019 ; Guo and Singh 2020 ). By analysing the dynamics of reported problems over time, software engineers can immediately spot when a "hot issue" emerges and link it to a possibly flawed release (Fu et al. 2013 ; Guzman et al. 2014 ; Gao et al. 2015 ; Shah et al. 2019c ). Moreover, they can generate a summary of user demands to obtain interim documentation serving as change requests or problem reports (Iacob et al. 2016 ; Di Sorbo et al. 2016 ; Phetrungnapha and Senivongse 2019 ).
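
To make this concrete, the following minimal sketch illustrates the kind of supervised classification step that such approaches build on. The example reviews, labels, and model choice are illustrative assumptions rather than the setup of any primary study.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: review text labeled as a problem report or a modification request.
train_reviews = [
    "The app crashes every time I open the camera",
    "Please add a dark mode option",
    "Login fails after the latest update",
    "It would be great to export notes as PDF",
]
train_labels = ["problem report", "modification request",
                "problem report", "modification request"]

# TF-IDF features feeding a linear classifier; surveyed approaches use a variety
# of feature sets (n-grams, metadata, sentiment) and learners.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(train_reviews, train_labels)

# Classify an unseen review; with these toy data the prediction is only indicative.
print(clf.predict(["The app crashes when I open settings"]))
```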

Requested Modification Prioritization

App developers may receive hundreds or even thousands of reviews requesting modifications and reporting problems (Khalid 2013 ; Villarroel et al. 2016 ; Noei et al. 2019 ). It is therefore not a trivial task for developers to select those requests which should be addressed in the next release (Villarroel et al. 2016 ). As with requirements, developers can investigate statistics concerning these requests (e.g., how many people requested specific modifications), estimate their impact on perceived app quality (e.g., expressed as user rating) or analyze how these requests change over time (Gu and Kim 2015 ; Gao et al. 2015 ; Khalid et al. 2015 ; Man et al. 2016 ; Keertipati et al. 2016 ; Villarroel et al. 2016 ; Iacob et al. 2016 ; Licorish et al. 2017 ; Wei et al. 2017 ; Muñoz et al. 2018 ; Scalabrino et al. 2019 ; Dąbrowski et al. 2019 ; Hu et al. 2019 ; Noei et al. 2019 ; Noei et al. 2019 ; Oehri and Guzman 2020 ). Assuming developers have to decide which change to address first, they could select the one with the largest share of requests, or the one whose feedback drives down the app rating the most (Gu and Kim 2015 ; Dąbrowski et al. 2019 ; Di Sorbo et al. 2020 ). Similarly, a sharp growth in feedback reporting a specific problem (e.g., security and privacy) may suggest that the issue is harmful to users and should be resolved quickly.
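
As a rough illustration of such a prioritization heuristic, the sketch below ranks requested modifications by how often they are requested and how far the associated reviews pull down the average rating. The scoring formula and the data are assumptions made for illustration only, not a method proposed in the surveyed studies.

```python
from collections import defaultdict

# (request topic, star rating of the review) pairs, e.g. produced by an upstream
# classification or clustering step.
reviews = [
    ("dark mode", 3), ("dark mode", 4), ("dark mode", 2),
    ("crash on login", 1), ("crash on login", 1),
    ("offline sync", 2),
]

stats = defaultdict(lambda: {"count": 0, "rating_sum": 0})
for topic, rating in reviews:
    stats[topic]["count"] += 1
    stats[topic]["rating_sum"] += rating

def priority(topic):
    s = stats[topic]
    avg_rating = s["rating_sum"] / s["count"]
    # More requests and a lower average rating give a higher priority.
    return s["count"] * (5 - avg_rating)

for topic in sorted(stats, key=priority, reverse=True):
    print(f"{topic}: priority={priority(topic):.1f}")
```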

Help Desk

A help desk typically provides end-users with answers to their questions, resolves their problems, or assists in troubleshooting (Bourque et al. 1999 ). Analogously, app developers can respond to specific user reviews to answer users’ questions, to inform them that problems have been fixed, or to thank users for their kind remarks about apps (McIlroy et al. 2015 ; Hassan et al. 2018 ; Srisopha et al. 2020a ; Srisopha et al. 2020c ). Though the task is not traditionally included in the typical responsibilities of software engineers, user support and managing the product’s reputation on the app store are essential to an app’s success; they should be viewed as important activities in the software lifecycle. In fact, responding to reviews motivates app users to revise their feedback and ratings to be more positive (McIlroy et al. 2015 ). Some users even update their feedback to inform developers that the response solved their problems or to thank developers for their help (McIlroy et al. 2015 ; Hassan et al. 2018 ). Since responding to a large number of reviews can be time-consuming, developers can make use of approaches highlighting reviews that are more likely to require a response (Srisopha et al. 2020a ; Srisopha et al. 2020c ) and approaches generating automatic replies to these reviews (Greenheld et al. 2018 ; Hassan et al. 2018 ; Vu et al. 2019 ; Gao et al. 2019 ).

Impact Analysis

Review mining approaches help developers to discover modification requests posted in reviews; to identify app source code affected by these modifications (Zhou et al. 2020 ); and to estimate how implementing the modifications may impact users’ satisfaction (Palomba et al. 2015 ; Ciurumelea et al. 2017 ; Palomba et al. 2017 ; Palomba et al. 2018 ). The approaches typically cluster feedback requesting the same modifications (Ciurumelea et al. 2017 ; Palomba et al. 2017 ; Zhou et al. 2020 ), then search and retrieve links between review clusters and corresponding source code artefacts referring to the modifications (Palomba et al. 2015 ; Ciurumelea et al. 2017 ; Palomba et al. 2017 ; Palomba et al. 2018 ; Zhou et al. 2020 ). Such information can be useful for engineers both before and after a new release is issued. Software engineers can track which requests have (not) been implemented; monitor the proportion of reviews linked to software changes; and estimate the number of users affected by these changes. After the release has been issued, software engineers can also use the approaches to observe the gain or loss in average rating with respect to the implemented changes.


3.5 RQ4: Empirical Evaluation

To answer RQ4 (how are app review analysis approaches empirically evaluated), we used data items: F10 (evaluation objective), F11 (evaluation procedure), F12 (metrics and criteria), F14 (annotated datasets), F15 (annotation task), F16 (number of annotators), F17 (quality measure) and F18 (replication package). We found that 109 primary studies performed empirical evaluation of review mining approaches; 105 studies included evaluation of effectiveness and 23 of user-perceived quality.

3.5.1 Effectiveness Evaluation

A common procedure for effectiveness assessment consists of four steps: (i) formulate an evaluation objective, (ii) create an annotated dataset, (iii) apply the approach on the annotated dataset, and (iv) quantify the effectiveness. The evaluation objective refers to assessing the degree to which an approach can correctly perform a specific mining task or analysis (see Section  3.2 ). Human judgement is usually required to create the annotated dataset.

Primary studies involved humans performing the task manually on a sample of reviews and annotating the sample with correct solutions. Such an annotated dataset (called the “ground truth”) served as a baseline for evaluating the approach and quantifying the outcome.

Most studies provided a detailed description of how each step of their evaluation method was performed. Hence, we could record additional information:

Availability of Dataset and Tool

Most studies released neither their annotated datasets nor the tools they evaluated. Footnote 14 Table  16 provides an overview of the 23 annotated datasets that are publicly available, reporting the reference to the paper, a short description of the dataset, and its size in terms of number of reviews, whereas Table  17 presents 16 available tools, Footnote 15 providing the reference to the paper and a short description of the characteristics of the tool.

Evaluation Objective

Scholars evaluated the effectiveness of their app review mining approaches in performing: Classification, Clustering, Sentiment Analysis, Information Extraction, Searching and Information Retrieval, Recommendation and Summarization.

Annotation Procedure

The number of annotators labeling the same review sample (or a fragment of it) ranged from 1 to 5, with a median of 2 human annotators. Only 26 primary studies (25%) reported how the quality of their annotated datasets was measured. The three most common metrics for inter-rater agreement evaluation were Cohen’s Kappa (Pustejovsky and Stubbs 2012 ), Percentage Agreement (Hallgren 2012 ) and the Jaccard index (Manning et al. 2008 ). Percentage Agreement and Cohen’s Kappa were used to measure the quality of human annotation for Classification, Sentiment Analysis, or Feature Extraction; the Jaccard index was used to assess human agreement for the task of Searching and Information Retrieval; whereas Fleiss’ Kappa was used to assess the quality of manual Clustering. No study reported how agreement was measured when annotators performed Recommendation or Summarization tasks.
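
For illustration, the following minimal sketch shows how the two most common agreement measures, percentage agreement and Cohen’s Kappa, can be computed for two annotators; the labels below are invented and the library choice is an assumption.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels assigned to the same six reviews by two annotators.
annotator_1 = ["bug", "feature", "bug", "praise", "bug", "feature"]
annotator_2 = ["bug", "feature", "praise", "praise", "bug", "bug"]

# Percentage agreement: fraction of reviews on which both annotators agree.
percentage_agreement = sum(a == b for a, b in zip(annotator_1, annotator_2)) / len(annotator_1)

# Cohen's Kappa: agreement corrected for chance.
kappa = cohen_kappa_score(annotator_1, annotator_2)

print(f"Percentage agreement: {percentage_agreement:.2f}")
print(f"Cohen's kappa: {kappa:.2f}")
```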

Characteristics of Dataset

Most annotated datasets were created using reviews coming from Google Play and the Apple App Store (84% in total); the remaining datasets were created using reviews from the Amazon Appstore, BlackBerry App World, Huawei Store, Windows Phone Store and 360 Mobile Assistant. On average, an annotated dataset was prepared using 2,800 reviews collected from a single app store; the reviews were collected for 19 apps from 6 app categories. Table  18 provides a five-number summary with descriptive statistics about the datasets.
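
For clarity, a five-number summary reports the minimum, lower quartile, median, upper quartile, and maximum of a distribution; the sketch below computes it for a set of hypothetical dataset sizes (the numbers are invented for illustration).

```python
import numpy as np

# Hypothetical annotated-dataset sizes (number of reviews).
dataset_sizes = np.array([500, 1000, 1500, 2000, 2800, 3500, 7000, 12000])

# Five-number summary: min, Q1, median, Q3, max.
summary = np.percentile(dataset_sizes, [0, 25, 50, 75, 100])
print(dict(zip(["min", "q1", "median", "q3", "max"], summary)))
```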

Effectiveness Quantification

The three most common metrics used for assessing the effectiveness of app review mining approaches are precision, recall, and F1-measure (Manning et al. 2008 ). These metrics were employed for evaluating Classification, Clustering, Information Extraction, Searching and Information Retrieval, Sentiment Analysis, Recommendation and Summarization.
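
A minimal sketch of computing these metrics against a ground-truth annotation is shown below; the labels are invented and the macro-averaging choice is an assumption, as the surveyed studies report per-class or averaged scores in different ways.

```python
from sklearn.metrics import precision_recall_fscore_support

# Ground-truth labels from the annotated dataset vs. labels predicted by an approach.
ground_truth = ["bug", "bug", "feature", "other", "bug", "feature"]
predicted    = ["bug", "feature", "feature", "other", "bug", "bug"]

precision, recall, f1, _ = precision_recall_fscore_support(
    ground_truth, predicted, average="macro", zero_division=0
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```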

A few studies deviated from the common procedure outlined above and evaluated their review mining approaches without annotated datasets:

Eight studies asked annotators to assess the quality of output produced by their approaches, instead of creating an annotated dataset before applying the mining approach. This was practiced for evaluating Classification (Li et al. 2017 ), Clustering (Guzman and Maalej 2014 ; Vu et al. 2015a ; Palomba et al. 2017 ), Information Extraction (Johann et al. 2017 ; Li et al. 2017 ), Searching and Information Retrieval (Wei et al. 2017 ), and Recommendation (Shams et al. 2020 ).

Seven studies used other software artefacts as an evaluation baseline rather than creating an annotated dataset (Gao et al. 2015 ; Man et al. 2016 ; Gao et al. 2018b ; Uddin et al. 2020 ; Srisopha et al. 2020a ; Srisopha et al. 2020c ; Xiao et al. 2020 ). To evaluate Recommendation (e.g., determining priorities for reported issues), the studies compared recommended priorities for issues with the priorities reported in user forums or changelogs; to assess the quality of Clustering, the studies benchmarked the output of their approaches against topics from app changelogs; whereas to evaluate their approaches in recommending reviews that need to be responded to, the studies used information about reviews that developers had already responded to in app stores.

3.5.2 User Study

Twenty-three studies evaluated their review mining approaches through user studies (Guzman et al. 2014 ; Chen et al. 2014 ; Gu and Kim 2015 ; Guzman et al. 2015 ; Villarroel et al. 2016 ; Maalej et al. 2016 ; Panichella et al. 2016 ; Di Sorbo et al. 2016 ; Di Sorbo et al. 2017 ; Ciurumelea et al. 2017 ; Palomba et al. 2017 ; Ciurumelea et al. 2018 ; Greenheld et al. 2018 ; Liu et al. 2018 ; Gao et al. 2018b ; Dalpiaz and Parente 2019 ; Scalabrino et al. 2019 ; Liu et al. 2019 ; Zhou et al. 2020 ; Gao et al. 2020 ; Tao et al. 2020 ; Shams et al. 2020 ; Liu et al. 2020 ). The objective of these evaluations was to qualitatively assess how the approach and/or the analysis it facilitates is perceived by intended users (e.g., software engineers). Such an evaluation procedure typically consists of the following steps: (i) define an evaluation subject and assessment criteria, (ii) recruit participants, (iii) instruct participants to perform a task with an approach or a produced analysis, (iv) elicit participants’ opinions of the approach through questionnaires and/or interviews.

We looked in detail at how studies performed each of these steps. The extracted data yield the following insights:

Evaluation Subjects

User studies evaluated the following types of app review analyses: Clustering, Classification, Sentiment Analysis, Information Extraction, Search and Information Retrieval, Recommendation, Summarization, and Visualization.

Assessment Criteria

Five evaluation criteria were typically taken into account: 1) Usefulness denoting the quality of being applicable or having practical worth; 2) Accuracy indicating the ability of being correct; 3) Usability signifying the quality of being easy to use; 4) Efficiency indicating the capability of producing desired results with little or no human effort; and 5) Informativeness denoting the condition of being informative and instructive. Table  19 provides reference mapping of user studies with a breakdown of evaluation criteria and evaluated subjects.

Study Participants

The number of participants involved in a study ranged from 1 to 85, with a median of 9 participants. The participants included professionals, scientists and students; Table  20 details the types of participants taking part in user studies and provides references to the corresponding studies.

Evaluation Procedure

The participants were instructed either to perform a specific task with or without the use of the mining approach being evaluated, to review the outputs produced by the approach, or simply to trial the proposed approach without being given any specific task.


3.6 RQ5: Empirical Results

We answered RQ5 (how well do existing app review analysis approaches support software engineers) based on data item F13 (evaluation result). The data come from 87 studies reporting results of their empirical evaluations: effectiveness evaluations (83 studies) and user studies (18 studies). We synthesize results of these studies in the subsequent subsections.

3.6.1 Effectiveness Evaluation Results

The methodology that primary studies employed for effectiveness evaluation was too diverse to undertake a meta-analysis or other statistical synthesis methods (Higgins et al. 2019 ); these studies differed, for example, in their treatment (e.g., review mining approach), population (e.g., review dataset) or study design (e.g., annotation procedure). We thus employed the ‘summarizing effect estimates’ method (Higgins et al. 2019 ); Table  21 reports the magnitude and range of effectiveness results that primary studies reported for different review analyses, with a breakdown by mined information type. Footnote 16

Information Extraction

The effectiveness of extracting information from reviews depends on the type of mined information. Techniques for extracting features from reviews have the lowest performance, with a median precision of 58% (Guzman and Maalej 2014 ) and a median recall of 62% (Sänger et al. 2016 ); they also show the most diverging results, with precision varying from 21% to 84% (Shah et al. 2019a ; Gao et al. 2020 ). Techniques for extracting user requests and NFRs from reviews have higher performance, with a median precision above 90% (Iacob et al. 2016 ; Groen et al. 2017 ) and only small variations between techniques.

Classification

App reviews can be classified by the information types they contain, such as user requests, NFRs and issues. State-of-the-art review classification techniques have a median precision above 81% (Yang and Liang 2015 ; Lu and Liang 2017 ; Deshpande and Rokne 2018 ; Scoccia et al. 2018 ) and a median recall around 83% (Peng et al. 2016 ; Lu and Liang 2017 ; Scoccia et al. 2018 ; Nayebi et al. 2018 ; Jha and Mahmoud 2019 ).

Clustering

Studies have shown the accuracy of clustering semantically related reviews to be 83% (Vu et al. 2015a ); this result is in line with findings concerning the quality of review clustering, where authors reported a MojoFM of 80% (Villarroel et al. 2016 ; Scalabrino et al. 2019 ).

Search and Information Retrieval

Mining approaches showed effectiveness in retrieving reviews relevant to specific information needs. In particular, the results show that tracing information between reviews and issues in ticketing systems, and between reviews and source code, can be precise, with a median precision above 75% (Palomba et al. 2017 ; Palomba et al. 2018 ; Pelloni et al. 2018 ), and complete, with a median recall above 70% (Palomba et al. 2015 ; Palomba et al. 2018 ; Pelloni et al. 2018 ; Grano et al. 2018 ); linking reviews to goals in goal models has been achieved with a median precision of 85% and a median recall of 73% (Liu et al. 2020 ; Gao et al. 2020 ). Similarly, finding reviews related to specific features has been reported with a precision of 70% and a recall of 56% (Johann et al. 2017 ). The variability of the results, e.g., precision between 36% and 80% (Dąbrowski et al. 2019 ; Liu et al. 2019 ), however, may lead to inconclusive findings.

Sentiment Analysis

The overall sentiment of a review can be identified with an accuracy of 91% (Masrury and Alamsyah 2019 ). Identifying the sentiment of a review with respect to a specific app feature is less effective, with a median precision of 71% and a median recall of 67% (Bakiu and Guzman 2017 ; Dąbrowski et al. 2020 ).

Recommendation

Recommending priorities for user requests was reported with medium to high effectiveness: a median accuracy of 78% (Villarroel et al. 2016 ; Scalabrino et al. 2019 ) and a precision of 62% (Gao et al. 2015 ; Gao et al. 2018b ). Generating review responses was reported with a BLEU-4 Footnote 17 greater than 30% (Gao et al. 2019 ), which reflects human-understandable text.
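
For readers unfamiliar with the metric, the sketch below shows how BLEU-4 scores a generated review response against a reference response; the texts are invented and NLTK’s implementation is only one possible choice.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical developer-written reference response and generated candidate.
reference = "thanks for reporting the crash we fixed it in the latest update".split()
candidate = "thanks for reporting the issue it is fixed in the latest update".split()

bleu4 = sentence_bleu(
    [reference], candidate,
    weights=(0.25, 0.25, 0.25, 0.25),            # equal weight on 1- to 4-gram overlap
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-4: {bleu4:.2f}")
```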

Summarization

Mining techniques were reported to generate a compact description outlining the main themes present in reviews with a recall of 71% (Jha and Mahmoud 2018 ).

3.6.2 User Study Results

Twenty-three studies evaluated the user-perceived quality of review mining approaches. Table  22 provides a synthesis of the user study results that primary studies reported for different review analyses, with a breakdown by evaluation criterion.

Information Extraction

Extracting information from reviews, e.g., issue reports and user opinions, is useful for developers (Gao et al. 2018b ); it can help to elicit new requirements or prioritize development effort (Guzman et al. 2015 ; Dalpiaz and Parente 2019 ). In particular, machine learning techniques are able to identify issues with acceptable accuracy (Gao et al. 2018b ); feature extraction methods, by contrast, produce analyses that are too imprecise to be applicable in practice (Dalpiaz and Parente 2019 ).

Classification

Review classification showed its utility for identifying different user needs, e.g., feature requests or bug reports (Di Sorbo et al. 2016 ; Panichella et al. 2016 ; Maalej et al. 2016 ; Ciurumelea et al. 2017 ; Liu et al. 2018 ; Ciurumelea et al. 2018 ; Zhou et al. 2020 ). Such categorized feedback is informative and eases further manual review inspection (Ciurumelea et al. 2017 ; Liu et al. 2018 ; Dalpiaz and Parente 2019 ). Practitioners reported saving up to 75% of their time thanks to the analysis (Chen et al. 2014 ; Ciurumelea et al. 2017 ; Ciurumelea et al. 2018 ) and found its accuracy sufficient for practical application (Villarroel et al. 2016 ; Di Sorbo et al. 2016 ; Panichella et al. 2016 ; Scalabrino et al. 2019 ).

Clustering

Review clustering is convenient for grouping feedback conveying similar content, for example, reviews reporting the same feature request or discussing the same topic (Palomba et al. 2017 ; Zhou et al. 2020 ). Evaluated approaches can perform the analysis with a high level of precision and completeness (Palomba et al. 2017 ; Zhou et al. 2020 ).

Searching and Information Retrieval

Developers acknowledged the usefulness of linking reviews to the source code components to be changed (Palomba et al. 2017 ); performed manually, the task requires enormous effort and is highly error-prone.

Sentiment Analysis

Analyzing user opinions can help to identify problematic features and to prioritize development effort to improve these features (Guzman et al. 2015 ).

Recommendation

Project managers found recommending priorities of user requests useful for release planning (Villarroel et al. 2016 ; Scalabrino et al. 2019 ); it can support their decision-making with respect to the requirements and modifications that users wish to be addressed. Developers perceived an automatic review response system as more usable than the traditional mechanism (Greenheld et al. 2018 ); recommending reviews that require a response and suggesting responses to these reviews can reduce developers’ workload (Greenheld et al. 2018 ). Similarly, recommending goals that an app needs to satisfy is informative and may guide the app’s evolution (Gao et al. 2020 ), whereas suggesting test cases triggering bugs can help developers reproduce bug-related user reviews and save the cost of manual bug reproduction (Shams et al. 2020 ).

Summarization

A compact description outlining the most important review content is useful for developers in their software engineering activities (Di Sorbo et al. 2017 ; Liu et al. 2019 ; Tao et al. 2020 ); in particular, summaries conveying information about frequently discussed topics, user opinions, user requests and security issues. Presenting this information in tabular form is easy to read and expressive (Di Sorbo et al. 2016 ; Di Sorbo et al. 2017 ; Dalpiaz and Parente 2019 ). Such summaries are generated with sufficient accuracy to be used in practical scenarios (Di Sorbo et al. 2017 ; Tao et al. 2020 ); in fact, developers reported saving up to 50% of their time thanks to the analysis (Di Sorbo et al. 2016 ; Di Sorbo et al. 2017 ; Liu et al. 2019 ; Tao et al. 2020 ).

Visualization

Presenting trends of frequently discussed topics can inform developers about urgent issues, ‘hot features’, or popular user opinions (Guzman et al. 2014 ; Gao et al. 2018b ). A heat map illustrating feature-specific sentiment (i.e., user opinions) helps developers to understand users’ experience with these features (Gu and Kim 2015 ); it indicates which features users praise and which are problematic. Visualizing how user opinions change over time aids developers in examining users’ reactions, e.g., to newly implemented modifications of these features, and in understanding to what extent an app satisfies users’ goals (Liu et al. 2020 ).


4 Discussion

In this section, we highlight and discuss some of the findings from our study, summarize gaps in the literature, and point to directions for future research.

4.1 Mining App Reviews Is a Growing Research Area

Mining app reviews for software engineering is a relatively new research area. The first use of app reviews for software engineering purposes can be dated back to 2012. Nevertheless, the analysis of demographics has revealed that the research area increasingly attracts the attention of scholars. The number of papers published in this direction has grown substantially in the last three years. A recent survey on app store analysis found 45 papers relevant to app review analysis published up to 2015 (Martin et al. 2017 ). Our findings show that the number of published papers in the area had quadrupled by the end of 2020. The most frequent venues where scholars have published their work are high-quality software engineering conferences and journals (see Table  5 ). This implies not only that there is increasing effort in exploring the research direction, but also that the contributions of these efforts are relevant from a software engineering perspective; in fact, empirical evidence (RQ5) demonstrates that software engineers find mining app reviews useful in support of their SDLC activities: mining approaches can reduce their workload and provide knowledge that would be difficult to obtain manually. Like other work (Martin et al. 2017 ), we hypothesize that the factors driving research interest in the field include the increased popularity of mobile apps, easy access to user feedback on a scale not seen before, and a general interest in adopting data mining techniques for mining software repositories.

4.2 Software Engineering Goals and Use Cases

App reviews analysis has broad applications in software engineering (RQ3). It can be used to support a variety of activities in requirements, design, testing and maintenance (see Table 6). Researchers however do not always clearly describe the envisioned software engineering use cases for their techniques.

So far, research in this area has been driven mostly by the opportunity to apply ML techniques to app reviews. Most studies (61%) relate their approaches to potential software engineering activities, but they remain vague about the details of how they envision the techniques being used in practice. A greater focus on software engineering goals and use cases would increase the relevance and impact of app review analysis techniques. This systematic literature review includes a complete inventory of the already envisioned software engineering use cases for the various app review analysis techniques (RQ3). This inventory can provide the basis for a more detailed investigation of software engineering goals and use cases for app review analysis tools. This investigation will contribute to designing future app review analysis tools that best serve the needs of software engineers.

4.3 Need Of Reference Model For Review Mining Tools

A reference model of stakeholder goals, use cases and system architectures for review mining tools would help structure research efforts in this area and communicate how fitting review mining techniques together helps to address real stakeholders’ needs. In the future, scholars can elaborate such a model by generalizing existing review mining solutions, explaining how different components help to realize intended use cases and satisfy stakeholders’ goals. The model would also help researchers to identify and reuse common components in a typical architecture of review mining tools, as well as to explain the novelty and contribution of their work within that framework.

4.4 Small Size Of Evaluation Datasets

A great deal of effort has been made to evaluate the effectiveness of data mining techniques (RQ4). Primary studies, however, used evaluation datasets of small size (on average 2,800 reviews). This is a tiny portion of the user-submitted feedback in app stores. Popular mobile apps (like WhatsApp or Instagram) can receive more than 5,000 reviews per day, and more than one million reviews in a year (App Annie 2020 ). This is a significant threat to the validity of their results when trying to generalize them, e.g., (Ciurumelea et al. 2017 ; Deocadez et al. 2017a ; Dąbrowski et al. 2019 ). The problem is attributed to the substantial effort of manual review annotation; labeling 900 reviews can take up to 12.5 hours (Guzman and Maalej 2014 ). As none of the surveyed studies tried to tackle the problem, it opens an avenue for future research. Researchers may experiment with semi-automated data labeling techniques currently exploited to minimize the effort of preparing training datasets (Deocadez et al. 2017b ; Dhinakaran et al. 2018 ; Miller et al. 2020 ). Provided the problem is handled, scholars should still be mindful of sampling bias when curating datasets (Annis 2005 ). Techniques to ameliorate the latter problem, however, have been well studied in a recent study (Martin et al. 2015 ).

4.5 Replication Packages

Most papers did not make their review mining tools and evaluation datasets available (see Table  16 and Table  17 ). This hinders the replicability of these works as well as new comparative studies. Our survey contains a single replication study, and that study reported challenges in validating the results of the original work due to the absence of an annotated dataset and an insufficiently documented evaluation procedure (Shah et al. 2019a ). Future studies should provide replication packages, including evaluation datasets, procedures, and approaches, so that researchers will be able to validate existing works and confirm reported findings. This will also help in benchmarking approaches and provide a baseline for evaluating new approaches aimed at improving the performance of review mining techniques.

4.6 Impacts On Software Engineering Practice

It is not yet clear whether app review analysis techniques are already good enough to be useful in practice (RQ5). Identifying what performance the approaches should have to be useful for software engineers is an important open question (Berry 2017 ; 2018 ). Essentially, an approach facilitating review analysis should synthesize reviews so that the effort for further manual inspection of the outcomes of that analysis is negligible or at least manageable. Clearly, the effort depends on the scenario an approach aims to realize. In addition to evaluating review analysis tools in terms of ML performance metrics (e.g., precision and recall), it will become increasingly important to evaluate them in terms of software engineering concerns: Does it save time? Does it improve the quality of, for example, the requirements elicitation and prioritisation process? Evaluating techniques with respect to software engineering concerns is more difficult but necessary to ensure research efforts are aligned with real stakeholders’ goals. Such evaluation will involve a combination of quantitative and qualitative studies aimed at reducing our current uncertainty about the potential impacts of review mining techniques on software engineering activities.

4.7 Practitioners’ Requirements For App Review Mining Tools

Numerous tools have been developed in the context of app review analysis research; they satisfy requirements coming mainly from scholars rather than practitioners. We have recorded no research studying what features the tools should provide nor what goals they should satisfy. The current research is data-driven rather than goal-driven. The studies apply different types of app review analyses and techniques to mine information from app reviews without explicitly examining the practitioners’ perspective. It is not clear to what extent the tools satisfy real practitioners’ goals. Though existing user studies provide evidence that software practitioners find certain types of analyses valuable, e.g., Classification (Palomba et al. 2017 ), more systematic research is necessary in this direction to understand practitioners’ needs. Future research should plan to actively involve practitioners, for example via interview sessions or the analysis of their development practices, to understand why the tools are needed; what SE goals they want to satisfy with the tools; what features the tools should provide; and how the tools would be used in organizational settings. Such knowledge will help to understand the actual use case scenarios of the tools, and to identify whether there is a misalignment between what state-of-the-art tools offer and what practitioners actually need.

4.8 Verifying the Industrial Needs for App Review Analysis

Most studies motivated their mining approaches by the need to reduce the manual effort of app review analysis. Such a rationale seems reasonable in the context of popular apps (e.g., WhatsApp or Facebook Messenger) that are frequently commented on and receive hundreds or thousands of reviews per day. However, an average app receives 22 reviews per day (Pagano and Maalej 2013 ). It therefore seems legitimate to study the potential impact of app review analysis research on the app store industry, and to what extent the mining tools would be useful in industrial settings. Such a study could address the problem from multiple perspectives, e.g., which small, medium and large app development organizations are interested in app review mining tools? who in the organization would use the tools? is manual app review analysis ‘the real pain’ of practitioners? if so, how does ‘the pain’ manifest itself? are any tasks obstructed? is the problem generating additional costs? Answering these questions could help to understand who the actual beneficiaries of app review analysis research are, and what the size of that market is. Not only would it help to scope and justify future research directions, but it would also provide insights into commercializing this research.

4.9 Pay Attention to Efficiency and Scalability of Mining Tools

Primary studies are mostly focused on evaluating the effectiveness and perceived quality of their mining tools. We, however, recorded no study focused on assessing the efficiency and scalability of its tool; efficiency indicates how much time a tool takes to produce its outcomes, whereas scalability indicates how that time changes as the tool’s input grows. Efficiency and scalability are fundamental qualities of analytics tools (Talia 2019 ); app review mining tools are no exception. The number of reviews that an app receives can vary from a few to many thousands. Existing approaches, e.g., for feature extraction (Guzman and Maalej 2014 ) or app review classification (Maalej and Nabil 2015 ), rely on NLP and ML techniques that may be challenging to scale up (Analytics India Mag 2020 ). Future studies, therefore, should take efficiency and scalability into consideration when developing and evaluating their mining tools to demonstrate that the tools can be used in practical settings.
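
A simple way to approach such an assessment is to time the same mining step on inputs of growing size, as in the illustrative sketch below; the pipeline stand-in and data sizes are assumptions made for illustration only.

```python
import time
from sklearn.feature_extraction.text import TfidfVectorizer

def run_pipeline(reviews):
    # Stand-in for a full review mining tool; here just feature extraction.
    TfidfVectorizer().fit_transform(reviews)

base_reviews = ["the app crashes on startup", "please add dark mode"] * 500
for scale in (1, 10, 100):
    reviews = base_reviews * scale
    start = time.perf_counter()
    run_pipeline(reviews)
    # Observe how runtime grows with the number of reviews processed.
    print(f"{len(reviews):>7} reviews: {time.perf_counter() - start:.2f}s")
```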

4.10 The Problem of Training ML Techniques

Machine learning is the most frequent type of technique used for app review analysis (RQ2). Most of these techniques, however, are supervised and require a training dataset consisting of manually annotated reviews. Preparing a manually annotated dataset is time-consuming and often error-prone (Guzman and Maalej 2014 ). More importantly, such an annotated dataset might be domain- and time-specific; annotated reviews of one app might not be reusable for training a technique on the feedback of another app. Further, the dataset may be prone to data drift, a phenomenon in which the characteristics of app reviews change over time. In such cases, ML techniques must be periodically retrained with an up-to-date training dataset to maintain their predictive abilities (Explorium 2020 ). Recent studies have thus experimented with active learning (Dhinakaran et al. 2018 ) and semi-supervised techniques (Deocadez et al. 2017b ) to reduce the cost of annotating a large amount of data. More research is however needed to understand how many reviews should be annotated to prepare a training dataset when the techniques are used in industrial settings; how often such a dataset needs to be prepared; and whether practitioners would accept the cost of preparing it.
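
As an illustration of how active learning can reduce annotation effort, the sketch below implements a single uncertainty-sampling step: the current model selects the unlabeled review it is least sure about and passes it to a human annotator. The data, model, and query strategy are assumptions made for this sketch, not the setup of the cited studies.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# A small seed of labeled reviews and a pool of unlabeled ones (both invented).
labeled = [("app crashes on startup", "bug"), ("please add dark mode", "feature")]
unlabeled = ["freezes when I open settings", "would love pdf export", "cannot log in"]

vec = TfidfVectorizer().fit([text for text, _ in labeled] + unlabeled)
clf = LogisticRegression().fit(vec.transform([text for text, _ in labeled]),
                               [label for _, label in labeled])

# Uncertainty sampling: query the review with the lowest maximum class probability,
# then send it to a human annotator and retrain; repeat until the budget runs out.
probas = clf.predict_proba(vec.transform(unlabeled))
most_uncertain = unlabeled[int(np.argmin(np.max(probas, axis=1)))]
print("Next review to annotate:", most_uncertain)
```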

5 Threats to Validity

One of the main threats to the validity of this systematic literature review is incompleteness. The risk of this threat depends largely on the selected list of keywords forming the search queries. To decrease the risk of an incomplete keyword list, we used an iterative approach to keyword-list construction. We constructed two queries: one generic and one specific. The generic query was formed using keywords appearing in the index of terms of sample studies analysing app reviews for SE. The specific query was formed from a set of keywords representing the concepts of our research objective. As in any other literature survey, we are also prone to publication bias. To mitigate this threat, we complemented the digital library search with other strategies. We conducted an issue-by-issue search of top-level conferences and journals as well as performed backward and forward snowballing.

To ensure the quality and reliability of our study, we defined a systematic procedure for conducting our survey, including the research questions to answer, searching strategies, and selection criteria for determining the primary studies of interest. We conducted a pilot study to assess technical issues, such as the completeness of the data form, and usability issues, such as the clarity of the procedure instructions. The protocol was reviewed by a panel of researchers in addition to the authors of the study. It was then revised based on their critical feedback. Consequently, the selection of primary studies followed a strict protocol in accordance with well-founded guidelines (Kitchenham 2004 ; Kitchenham et al. 2004 ; Ralph et al. 2020 ).

Another threat to validity we would like to highlight is our subjectivity in the screening, data extraction and classification of the studied papers. To mitigate this threat, each step was performed by one coder, who was the first author of this paper. The step was then cross-checked by a second coder. Each step was validated on a randomly selected sample of 10% of the selected papers. The percentage inter-coder agreement reached for all the phases was equal to or higher than 80%, indicating high agreement between the authors (Ide and Pustejovsky 2017 ). In addition, intra-rater agreement was assessed. The first author re-coded a randomly selected sample of 20% of the studied papers. An external evaluator, who has no relationship with the research, then verified the agreement between the first and the second rounds. The percentage intra-coder agreement was higher than 90%, indicating near complete agreement (Ide and Pustejovsky 2017 ).

A similar threat concerns whether our taxonomies are reliable enough for analysing and classifying the extracted data. To mitigate this threat, we used an iterative content analysis method to continuously develop each taxonomy. New concepts that emerged when studying the papers were introduced into a taxonomy and changes were made accordingly. These taxonomies were discussed among all the authors, who agreed upon their final form.

6 Related Work

This review is not the first effort synthesizing knowledge from the literature analysing app reviews for SE (Martin et al. 2017 ; Genc-Nayebi and Abran 2017 ; Tavakoli et al. 2018 ; Noei and Lyons 2019 ). Our SLR, however, differs substantially from previous studies in the scope of the literature surveyed and the depth of our analysis. Table  23 shows the differences between our study and previous works along the dimensions we considered for the comparison. We grouped the dimensions into information related to study characteristics and topics surveyed in our study. The characteristics concern study type (i.e., systematic literature review or survey), time period covered and number of papers surveyed. The topics concern: Paper Demographics, App Reviews Analyses (RQ1), Mining Techniques (RQ2), Supporting Software Engineering (RQ3), Empirical Evaluation (RQ4) and Empirical Results (RQ5).

Martin et al. ( 2017 ) surveyed the literature with the aim of demonstrating a newly emerging research area, i.e., app store analysis for software engineering. The scope of their survey is much broader than that of our study, as it covers literature analyzing various types of app store data (e.g., API, rank of downloads, or price). Our work has a much narrower scope, focusing only on app review analysis, but studies the papers in greater depth in order to answer our five research questions.

Though the related survey also addresses RQ1, our study is more up-to-date and larger in scale, covering 182 papers. More importantly, most dimensions of our SLR, i.e., RQ2-RQ5, are missing from this other study.

Two other studies addressed our RQ2, but only partially, as they are narrower in scope (Genc-Nayebi and Abran 2017 ; Tavakoli et al. 2018 ). Tavakoli et al. ( 2018 ) surveyed the literature in the context of techniques and tools for mining app reviews. Similarly, Genc-Nayebi and Abran ( 2017 ) consolidated the literature to synthesize information on techniques for opinion mining. Our SLR addresses the dimension more broadly, rather than in the context of techniques for a specific review analysis or tool-supported approaches. We have made an effort to consolidate general knowledge on the techniques the literature employs for 9 broad types of review analyses. We also provide a mapping between the different review analyses and the techniques facilitating their realization.

Noei and Lyons ( 2019 ) summarized 21 papers analysing app reviews from Google Play. The authors provided an overview of each paper, briefly explaining the applications and mentioning their limitations. The surveyed papers were selected subjectively, rather than following a systematic search procedure. In contrast, our study is an SLR rather than a summary. Following a systematic procedure, we selected 182 studies that we carefully read and then synthesized to answer five research questions. The related work only marginally covers information for RQ1 and RQ2.

In summary, previous studies do not cover our research questions related to software engineering activities (RQ3) and empirical evaluations (RQ4 and RQ5). They partly cover our research questions RQ1 and RQ2, but on a smaller set of papers and in less detail.

7 Conclusion

In this paper, we presented a systematic literature review of the research on analysing app reviews for software engineering. Through a systematic search, we identified 182 relevant studies that we thoroughly examined to answer our research questions. The findings have revealed a growing interest in the research area. Research on analysing app reviews is published in the main software engineering conferences and journals, e.g., ICSE, TSE or EMSE, and the number of publications has tripled in the last four years. The research in this area will likely continue to gain importance as a consequence of increased interest in mobile app development.

This systematic literature review structures and organizes the knowledge on the different types of app review analyses as well as the data mining techniques used for their realization. With that knowledge, researchers and practitioners can understand what useful information can be found in app reviews, and how app review analysis can be facilitated at abstract and technical levels. More importantly, the literature review sheds new light on why mining app reviews can be useful; the findings identify 14 software engineering activities that have been the target of previous research on app review analysis. Important future research for app review analysis will involve developing a deeper understanding of the stakeholders’ goals and context for app review analysis tools in order to increase the applicability, relevance and value of these tools.

The findings have revealed that software engineers find mining approaches useful, with promising performance in generating different app review analyses. It however remains unclear to what extent these approaches are already good enough to be used in practice.

It will become increasingly important to evaluate them in terms of software engineering specific concerns: Does it improve the quality of, for example, the requirements elicitation and prioritization process? We also recommend that empirical evaluation continue to improve in scale and reproducibility. Research in this area is currently of inconsistent quality in terms of evaluation methods and the ability for the research to be reproduced. Future studies should share evaluation datasets and mining tools, allowing their experiments to be replicated. They should also pay more attention to the scalability and efficiency of their mining approaches.

In conclusion, this study helps to communicate knowledge on analyzing app reviews for software engineering purposes. We hope our effort will inspire scholars to advance the research area and assist them in positioning their new works.

Change history

15 March 2022

A Correction to this paper has been published: https://doi.org/10.1007/s10664-022-10135-4

We selected 2010 to be the initial period of our search as the earliest study of app store analysis had been reported that year (Martin et al. 2017 ).

A description of the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) method can be found in (Moher et al. 2009 ).

The first author conducted the entire literature search and selection process.

We identified papers from previous surveys on app store analysis (Martin et al. 2017 ).

We selected this number of studies to satisfy the sample size requirements for Cohen’s Kappa calculation (Bujang and Baharum 2017 ).

The assessor has an engineering background and experience with manual annotation; they have no relationship with this research.

No study was published in 2010 and 2011.

The complete list of venues can be found in supplementary material (Dąbrowski 2021 ).

The number of studies, in the rightmost column, is thus less than or equal to the sum of the row.

A single study could use a certain combination of techniques to facilitate multiple review analyses. The total number, on the right hand side, is thus less than the sum of a row.

We refer to a property as a concept denoting a feature in the machine learning domain.

It is worth noting that some papers fall into more than one category i.e., claim to support more than one activity. In such case, we assigned the study to all the claimed activities.

Table excludes papers that did not specify any SE activity; in case of papers supporting multiple SE activities, we assigned their facilitated analyses to all the claimed activities.

In addition to the information reported in the surveyed literature, we also contacted the authors of 105 primary studies to request replication packages.

The references to the tools and the datasets are available in the supplementary material (Dąbrowski 2021 )

No effectiveness evaluation was performed w.r.t. content analysis and visualization.

This metric quantifies the quality of generated text on a scale of 0% to 100%.

Abad ZSH, Sims SDV, Cheema A, Nasir MB, Harisinghani P (2017) Learn more, pay less! lessons learned from applying the wizard-of-oz technique for exploring mobile app requirements. In: 2017 IEEE 25th international requirements engineering conference workshops (REW). pp 132–138

Al-Hawari A, Najadat H, Shatnawi R (2020) Classification of application reviews into software maintenance tasks using data mining techniques. Softw Qual J. https://doi.org/10.1007/s11219-020-09529-8

Al Kilani N, Tailakh R, Hanani A (2019) Automatic classification of apps reviews for requirement engineering: Exploring the customers need from healthcare applications. In: 2019 sixth international conference on social networks analysis, management and security (SNAMS). pp 541–548

Ali M, Joorabchi ME, Mesbah A (2017) Same app, different app stores: A comparative study. In: Proceedings of the 4th international conference on mobile software engineering and systems, MOBILESoft ’17. IEEE Press, pp 79–90

Alqahtani F, Orji R (2019) Usability issues in mental health applications. In: Adjunct publication of the 27th conference on user modeling, adaptation and personalization, USA, UMAP’19 Adjunct. ACM, New York, pp 343–348

AlSubaihin A, Sarro F, Black S, Capra L, Harman M (2019) App store effects on software engineering practices. IEEE Trans Softw Eng :1–1

Analytics India Mag (2020) https://analyticsindiamag.com/challenges-of-implementing-natural-language-processing/ , Accessed: 2021-06-01

Annis DH (2005) Probability and statistics: The science of uncertainty, Michael J. Evans and Jeffrey S. Rosenthal. Am Stat 59:276–276


App Annie (2020) https://www.appannie.com/ , Accessed: 2020-07-01

App Store (2021) Ratings, Reviews, and Responses. https://developer.apple.com/app-store/ratings-and-reviews/ , Accessed: 2021-06-01

Bailey K, Nagappan M, Dig D (2019) Examining user-developer feedback loops in the ios app store. In: 52nd Hawaii international conference on system sciences, HICSS 2019, Grand Wailea, Maui, Hawaii, USA, January 8-11, 2019, pp 1–10

Bakiu E, Guzman E (2017) Which feature is unusable? detecting usability and user experience issues from user reviews. In: 2017 IEEE 25th international requirements engineering conference workshops (REW). pp 182–187

Bauer M (2007) Content analysis: An introduction to its methodology – by Klaus Krippendorff; From words to numbers: Narrative, data and social science – by Roberto Franzosi. https://doi.org/10.1111/j.1468-4446.2007.00153_10.x , vol 58, pp 329–331

Begel A, Zimmermann T (2014) Analyze this! 145 questions for data scientists in software engineering. In: 36th international conference on software engineering. pp 12–13

Berry D (2018) Keynote: Evaluation of NLP tools for hairy RE tasks. In: Joint proceedings of REFSQ-2018 workshops, doctoral symposium, live studies track, and poster track co-located with the 23rd international conference on requirements engineering: foundation for software quality (REFSQ 2018), Utrecht, The Netherlands, March 19, 2018

Berry DM (2017) Evaluation of tools for hairy requirements and software engineering tasks. In: IEEE 25th international requirements engineering conference workshops, RE 2017 Workshops, Lisbon, Portugal, September, 4-8, 2017, pp 284–291

Bishop CM (2006) Pattern recognition and machine learning (information science and statistics). Springer, Berlin


Bourque P, Dupuis R, Abran A, Moore J, Tripp L (1999) The guide to the software engineering body of knowledge. IEEE Softw 16:35–44


Bujang M, Baharum N (2017) Guidelines of the minimum sample size requirements for kappa agreement test. Epidemiol Biostat Public Health 14

Burge JE, Carroll JM, McCall R, Mistrk I (2008) Rationale-based software engineering, 1st edn. Springer Publishing Company, Incorporated, Berlin


Buse RPL, Zimmermann T (2012) Information needs for software development analytics. In: 34th international conference on software engineering. pp 987–996

Cannataro M, Comito C (2003) A data mining ontology for grid programming. In: Proc. 1st int. workshop on semantics in peer-to-peer and grid computing, in conjunction with WWW2003. pp 113–134

Carreño LVG, Winbladh K (2013) Analysis of user comments: An approach for software requirements evolution. In: Proceedings of the 2013 international conference on software engineering, ICSE ’13. IEEE Press, pp 582–591

Cen L, Si L, Li N, Jin H (2014) User comment analysis for android apps and cspi detection with comment expansion. In: Proceeding of the 1st international workshop on privacy-preserving IR (PIR). pp 25–30

Chandy R, Gu H (2012) Identifying spam in the ios app store. In: Proceedings of the 2nd Joint WICOW/airweb Workshop on Web Quality. ACM, pp 56–59

Chen N, Lin J, Hoi SCH, Xiao X, Zhang B (2014) Ar-miner: Mining informative reviews for developers from mobile app marketplace. In: Proceedings of the 36th International Conference on Software Engineering, ICSE 2014. ACM, New York, pp 767–778

Chen R, Wang Q, Xu W (2019) Mining user requirements to facilitate mobile app quality upgrades with big data. Electron Commer Res Appl 38:100889

Ciurumelea A, Schaufelbühl A, Panichella S, Gall HC (2017) Analyzing reviews and code of mobile apps for better release planning. In: 2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER). pp 91–102

Ciurumelea A, Panichella S, Gall HC (2018) Poster: Automated user reviews analyser. In: 2018 IEEE/ACM 40th international conference on software engineering: companion (ICSE-Companion). pp 317–318

Clement J (2020) Number of apps available in leading app stores as of 1st quarter 2020. https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/ , Accessed: 2020-07-01

Dalpiaz F, Parente M (2019) RE-SWOT: from user feedback to requirements via competitor analysis. In: Requirements engineering: foundation for software quality - 25th international working conference, REFSQ 2019, Essen, Germany, March 18-21, 2019, Proceedings. pp 55–70

Deocadez R, Harrison R, Rodriguez D (2017) Automatically classifying requirements from app stores: A preliminary study. In: 2017 IEEE 25th international requirements engineering conference workshops (REW). pp 367–371

Deocadez R, Harrison R, Rodriguez D (2017) Preliminary study on applying semi-supervised learning to app store analysis. In: Proceedings of the 21st international conference on evaluation and assessment in software engineering, EASE’17. ACM, New York, pp 320–323

Deshpande G, Rokne J (2018) User feedback from tweets vs app store reviews: An exploratory study of frequency, timing and content. In: 2018 5th international workshop on artificial intelligence for requirements engineering (AIRE). pp 15–21

Dhinakaran VT, Pulle R, Ajmeri N, Murukannaiah PK (2018) App review analysis via active learning: Reducing supervision effort without compromising classification accuracy. In: 2018 IEEE 26th international requirements engineering conference (RE). pp 170–181

Di Sorbo A, Panichella S, Alexandru CV, Shimagaki J, Visaggio CA, Canfora G, Gall HC (2016) What would users change in my app? summarizing app reviews for recommending software changes. In: Proceedings of the 2016 24th ACM SIGSOFT international symposium on foundations of software engineering, FSE 2016. ACM, New York, pp 499–510

Di Sorbo A, Panichella S, Alexandru CV, Visaggio CA, Canfora G (2017) Surf: Summarizer of user reviews feedback. In: Proceedings of the 39th international conference on software engineering companion, ICSE-C ’17. IEEE Press, pp 55–58

Di Sorbo A, Grano G, Aaron Visaggio C, Panichella S (2020) Investigating the criticality of user-reported issues through their relations with app rating. J Softw Evol Process 33(3):e2316. https://doi.org/10.1002/smr.2316


Dąbrowski J (2021) Supplementary material for system literature review: analysing app reviews for software engineering. https://github.com/jsdabrowski/SLR-SE/

Dąbrowski J, Letier E, Perini A, Susi A (2019) Finding and analyzing app reviews related to specific features: A research preview. In: Requirements engineering: foundation for software quality - 25th international working conference, REFSQ 2019, Essen, Germany, March 18-21, 2019, Proceedings. pp 183–189

Dąbrowski J, Letier E, Perini A, Susi A (2020) Mining user opinions to support requirement engineering: An empirical study. In: Dustdar S, Yu E, Salinesi C, Rieu D, Pant V (eds) Advanced information systems engineering - 32nd international conference, CAiSE 2020, Grenoble, France, June 8-12, 2020, Proceedings, Springer, Lecture Notes in Computer Science, vol 12127. pp 401–416. https://doi.org/10.1007/978-3-030-49435-3_25

Durelli VHS, Durelli RS, Endo AT, Cirilo E, Luiz W, Rocha L (2018) Please please me: Does the presence of test cases influence mobile app users’ satisfaction. In: Proceedings of the XXXII Brazilian symposium on software engineering, SBES ’18. ACM, New York, pp 132–141

Erfani M, Mesbah A, Kruchten P (2013) Real challenges in mobile app development. In: 2013 ACM/IEEE international symposium on empirical software engineering and measurement (ESEM). pp 15–24

Explorium (2020) Understanding and handling data and concept drift. https://www.explorium.ai/blog/understanding-and-handling-data-and-concept-drift/ , Accessed: 2021-06-01

Franzmann D, Eichner A, Holten R (2020) How mobile app design overhauls can be disastrous in terms of user perception: The case of snapchat. Trans Soc Comput 3(4). https://doi.org/10.1145/3409585

Fu B, Lin J, Li L, Faloutsos C, Hong J, Sadeh N (2013) Why people hate your app: Making sense of user feedback in a mobile app store. In: Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’13. ACM, New York, pp 1276–1284

Gao C, Wang B, He P, Zhu J, Zhou Y, Lyu MR (2015) Paid: prioritizing app issues for developers by tracking user reviews over versions. In: 2015 IEEE 26th international symposium on software reliability engineering (ISSRE). pp 35–45

Gao C, Xu H, Hu J, Zhou Y (2015) Ar-tracker: Track the dynamics of mobile apps via user review mining. In: 2015 IEEE symposium on service-oriented system engineering, SOSE ’15. pp 284–290

Gao C, Zeng J, Lo D, Lin CY, Lyu MR, King I (2018a) Infar: Insight extraction from app reviews. In: Proceedings of the 2018 26th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering, ESEC/FSE 2018. ACM, New York, pp 904–907

Gao C, Zeng J, Lyu MR, King I (2018b) Online app review analysis for identifying emerging issues. In: Proceedings of the 40th international conference on software engineering, ICSE ’18. ACM, New York, pp 48–58

Gao C, Zeng J, Xia X, Lo D, Lyu MR, King I (2019) Automating app review response generation. In: 2019 34th IEEE/ACM international conference on automated software engineering (ASE). pp 163–175

Gao C, Zheng W, Deng Y, Lo D, Zeng J, Lyu MR, King I (2019) Emerging app issue identification from user feedback: Experience on wechat. In: Proceedings of the 41st international conference on software engineering: software engineering in practice, ICSE-SEIP ’19. IEEE Press, pp 279–288

Gao S, Liu L, Liu Y, Liu H, Wang Y (2020) Updating the goal model with user reviews for the evolution of an app. J Softw Evol Process 32(8):e2257. https://doi.org/10.1002/smr.2257 . https://onlinelibrary.wiley.com/doi/abs/10.1002/smr.2257 , e2257 JSME-19-0105.R2

Genc-Nayebi N, Abran A (2017) A systematic literature review: Opinion mining studies from mobile app store user reviews. J Syst Softw 125:207–219

Gomez M, Rouvoy R, Monperrus M, Seinturier L (2015) A recommender system of buggy app checkers for app store moderators. In: 2nd ACM international conference on mobile software engineering and systems. IEEE

Goul M, Marjanovic O, Baxley S, Vizecky K (2012) Managing the enterprise business intelligence app store: Sentiment analysis supported requirements engineering. In: 2012 45th Hawaii international conference on system sciences. pp 4168–4177

Graham M, Milanowski AT, Miller J (2012) Measuring and promoting inter-rater agreement of teacher and principal performance ratings

Grano G, Di Sorbo A, Mercaldo F, Visaggio CA, Canfora G, Panichella S (2017) Android apps and user feedback: A dataset for software evolution and quality improvement. In: Proceedings of the 2nd ACM SIGSOFT international workshop on app market analytics, WAMA 2017. ACM, New York, pp 8–11

Grano G, Ciurumelea A, Panichella S, Palomba F, Gall HC (2018) Exploring the integration of user feedback in automated testing of android applications. In: 2018 IEEE 25th international conference on software analysis, evolution and reengineering (SANER). pp 72–83

Greenheld G, Savarimuthu BTR, Licorish SA (2018) Automating developers’ responses to app reviews. In: 2018 25th Australasian software engineering conference (ASWEC). pp 66–70

Groen EC, Kopczyńska S, Hauer MP, Krafft TD, Doerr J (2017) Users — the hidden software product quality experts?: A study on how app users report quality aspects in online reviews. In: 2017 IEEE 25th international requirements engineering conference (RE). pp 80–89

Gu X, Kim S (2015) "What parts of your apps are loved by users?" (T). In: 30th IEEE/ACM international conference on automated software engineering, ASE 2015, Lincoln, NE, USA, November 9-13, 2015. pp 760–770

Gunaratnam I, Wickramarachchi D (2020) Computational model for rating mobile applications based on feature extraction. In: 2020 2nd international conference on advancements in computing (ICAC). https://doi.org/10.1109/ICAC51239.2020.9357270 , vol 1, pp 180–185

Guo H, Singh MP (2020) Caspar: extracting and synthesizing user stories of problems from app reviews. In: 2020 IEEE/ACM 42nd international conference on software engineering (ICSE). pp 628–640

Guzman E, Maalej W (2014) How do users like this feature? a fine grained sentiment analysis of app reviews. In: 2014 IEEE 22nd international requirements engineering conference (RE). pp 153–162

Guzman E, Paredes Rojas A (2019) Gender and user feedback: An exploratory study. In: 2019 IEEE 27th international requirements engineering conference (RE). pp 381–385

Guzman E, Bhuvanagiri P, Bruegge B (2014) Fave: Visualizing user feedback for software evolution. In: 2014 Second IEEE working conference on software visualization. pp 167–171

Guzman E, Aly O, Bruegge B (2015) Retrieving diverse opinions from app reviews. In: 2015 ACM/IEEE international symposium on empirical software engineering and measurement (ESEM). pp 1–10

Guzman E, El-Halaby M, Bruegge B (2015) Ensemble methods for app review classification: An approach for software evolution. In: Proceedings of the 30th IEEE/ACM international conference on automated software engineering, ASE ’15. IEEE Press, pp 771–776

Guzman E, Ibrahim M, Glinz M (2017) A little bird told me: Mining tweets for requirements and software evolution. In: Moreira A, Arau̇jo J, Hayes J, Paech B (eds) 25th IEEE international requirements engineering conference, RE 2017, Lisbon, Portugal, September 4-8, 2017, IEEE Computer Society, pp 11–20. https://doi.org/10.1109/RE.2017.88

Guzman E, Oliveira L, Steiner Y, Wagner LC, Glinz M (2018) User feedback in the app store: A cross-cultural study. In: 2018 IEEE/ACM 40th international conference on software engineering: software engineering in society (ICSE-SEIS). pp 13–22

Ha E, Wagner D (2013) Do android users write about electric sheep? examining consumer reviews in google play. In: Consumer communications and networking conference (CCNC), 2013 IEEE. pp 149–157

Hadi MA, Fard FH (2020) Aobtm: Adaptive online biterm topic modeling for version sensitive short-texts analysis. In: 2020 IEEE international conference on software maintenance and evolution (ICSME). pp 593–604. https://doi.org/10.1109/ICSME46990.2020.00062

Hallgren K (2012) Computing inter-rater reliability for observational data: An overview and tutorial. Tutor Quant Methods Psychol 8:23–34

Hassan S, Bezemer C, Hassan AE (2018) Studying bad updates of top free-to-download apps in the google play store. IEEE Trans Softw Eng :1–1

Hassan S, Tantithamthavorn C, Bezemer C, Hassan AE (2018) Studying the dialogue between users and developers of free apps in the google play store. Empir Softw Eng 23(3):1275–1312

Higgins JP, Thomas J, Chandler J, Cumpston M, Li T, Page MJ, Welch VA (2019) Cochrane handbook for systematic reviews of interventions, 2nd edn. Wiley, Chichester

Book   Google Scholar  

Hoon L, Vasa R, Schneider JG, Mouzakis K (2012) A preliminary analysis of vocabulary in mobile app user reviews. In: Proceedings of the 24th Australian computer-human interaction conference. ACM, pp 245–248

Hoon L, Vasa R, Martino GY, Schneider JG, Mouzakis K (2013) Awesome! conveying satisfaction on the app store. In: Proceedings of the 25th Australian computer-human interaction conference: augmentation, application, innovation, collaboration, OzCHI ’13. ACM, New York, pp 229–232

Hoon L, Rodriguez-García M, Vasa R, Valencia-García R, Schneider JG (2016) App reviews: Breaking the user and developer language barrier. In: Trends and applications in software engineering, vol 405. Springer International Publishing, pp 223–233

Hu H, Bezemer C, Hassan AE (2018) Studying the consistency of star ratings and the complaints in 1 & 2-star user reviews for top free cross-platform android and ios apps. Empir Softw Eng 23(6):3442–3475

Hu H, Wang S, Bezemer C, Hassan AE (2019) Studying the consistency of star ratings and reviews of popular free hybrid android and ios apps. Empir Softw Eng 24(1):7–32

Huebner J, Frey RM, Ammendola C, Fleisch E, Ilic A (2018) What people like in mobile finance apps: An analysis of user reviews. In: Proceedings of the 17th international conference on mobile and ubiquitous multimedia, MUM 2018, Cairo, Egypt, November 25-28, 2018, pp 293–304

Iacob C, Harrison R (2013) Retrieving and analyzing mobile apps feature requests from online reviews. In: Proceedings of the 10th working conference on mining software repositories, IEEE Press. pp 41–44

Iacob C, Harrison R, Faily S (2013a) Online reviews as first class artifacts in mobile app development. In: Proceedings of the 5th international conference on mobile computing, applications, and services. MobiCASE ’13

Iacob C, Veerappa V, Harrison R (2013b) What are you complaining about?: A study of online reviews of mobile applications. In: Proceedings of the 27th international BCS human computer interaction conference. British Computer Society, pp 29:1–29:6

Iacob C, Faily S, Harrison R (2016) Maram: Tool support for mobile app review management. In: Proceedings of the 8th EAI international conference on mobile computing, applications and services, ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), MobiCASE’16. pp 42–50

Ide N, Pustejovsky J (eds) (2017) Handbook of linguistic annotation. Springer Netherlands, Dordrecht

IEEE (1990) IEEE standard glossary of software engineering terminology

ISO/IEC 25010 (2011) ISO/IEC 25010:2011, systems and software engineering — systems and software quality requirements and evaluation (SQuaRE) — system and software quality models

Jha N, Mahmoud A (2017a) MARC: A mobile application review classifier. In: Joint proceedings of REFSQ-2017 workshops, doctoral symposium, research method track, and poster track co-located with the 22nd international conference on requirements engineering: foundation for software quality (REFSQ 2017), Essen, Germany, February 27, 2017

Jha N, Mahmoud A (2017b) Mining user requirements from application store reviews using frame semantics. In: Requirements engineering: foundation for software quality - 23rd international working conference, REFSQ 2017, Essen, Germany, February 27 - March 2, 2017, Proceedings. pp 273–287

Jha N, Mahmoud A (2018) Using frame semantics for classifying and summarizing application store reviews. Empir Softw Eng 23(6):3734–3767

Jha N, Mahmoud A (2019) Mining non-functional requirements from app store reviews. Empir Softw Eng 24(6):3659–3695

Johann T, Stanik C, B AMA, Maalej W (2017) Safe: A simple approach for feature extraction from app descriptions and app reviews. In: 2017 IEEE 25th international requirements engineering conference (RE). pp 21–30

Jurafsky D, Martin JH (2009) Speech and language processing, 2nd edn. Prentice-Hall, Inc., Hoboken

Kalaichelavan K, Malik H, Husnu N, Sreenath S (2020) What do people complain about drone apps? a large-scale empirical study of google play store reviews. Procedia Comput Sci 170:547–554. https://doi.org/10.1016/j.procs.2020.03.124 . https://www.sciencedirect.com/science/article/pii/S1877050920305627 , the 11th International Conference on Ambient Systems, Networks and Technologies (ANT) / The 3rd International Conference on Emerging Data and Industry 4.0 (EDI40) / Affiliated Workshops

Keertipati S, Savarimuthu BTR, Licorish SA (2016) Approaches for prioritizing feature improvements extracted from app reviews. In: Proceedings of the 20th international conference on evaluation and assessment in software engineering, EASE ’16. ACM, New York

Khalid H (2013) On identifying user complaints of ios apps. In: Proceedings of the 2013 international conference on software engineering. IEEE Press, pp 1474-1476

Khalid H, Nagappan M, Shihab E, Hassan AE (2014) Prioritizing the devices to test your app on: a case study of android game apps. In: Proceedings of the 22nd ACM SIGSOFT international symposium on foundations of software engineering, (FSE-22), Hong Kong, China, November 16-22, 2014, pp 610–620

Khalid H, Shihab E, Nagappan M, Hassan AE (2015) What do mobile app users complain about? IEEE Softw 32(3):70–77

Khalid H, Nagappan M, Hassan AE (2016) Examining the relationship between findbugs warnings and app ratings. IEEE Softw 33(4):34–39

Khalid M, Asif M, Shehzaib U (2015a) Towards improving the quality of mobile app reviews. Int J Inf Technol Comput Sci (IJITCS) 7(10):35

Khalid M, Shehzaib U, Asif M (2015b) A case of mobile app reviews as a crowdsource. Int J Inf Eng Electron Bus (IJIEEB) 7(5):39

Khan J, Xie Y, Liu L, Wen L (2019) Analysis of requirements-related arguments in user forums. https://doi.org/10.1109/RE.2019.00018

Kitchenham BA (2004) Procedures for performing systematic reviews

Kitchenham BA, Dyba T, Jorgensen M (2004) Evidence-based software engineering. In: Proceedings of the 26th international conference on software engineering, ICSE ’04. IEEE Computer Society, pp 273–281

Kunaefi A, Aritsugi M (2020) Characterizing user decision based on argumentative reviews. In: 7th IEEE/ACM international conference on big data computing, applications and technologies, BDCAT 2020, Leicester, United Kingdom, December 7-10, 2020, IEEE. pp 161–170. https://doi.org/10.1109/BDCAT50828.2020.00002

Kurtanović Z, Maalej W (2017) Mining user rationale from software reviews. In: 2017 IEEE 25th international requirements engineering conference (RE). pp 61–70

Kurtanovic Z, Maalej W (2018) On user rationale in software engineering. Requir Eng 23(3):357–379

van Lamsweerde A (2009) Requirements engineering: from system goals to UML models to software specifications. Wiley, Hoboken

Li S, Guo J, Fan M, Lou JG, Zheng Q, Liu T (2020) Automated bug reproduction from user reviews for android applications. In: 2020 IEEE/ACM 42nd international conference on software engineering: software engineering in practice (ICSE-SEIP). pp 51–60

Li T, Zhang F, Wang D (2018) Automatic user preferences elicitation: A data-driven approach. In: Requirements engineering: foundation for software quality - 24th international working conference, REFSQ 2018, Utrecht, The Netherlands, March 19-22, 2018, Proceedings. pp 324–331

Li Y, Jia B, Guo Y, Chen X (2017) Mining user reviews for mobile app comparisons. Proc ACM Interact Mob Wearable Ubiquitous Technol 1(3)

Liang TP, Li X, Yang CT, Wang M (2015) What in consumer reviews affects the sales of mobile apps: A multifacet sentiment analysis approach. Int J Electron Commer 20(2):236–260

Licorish SA, Savarimuthu BTR, Keertipati S (2017) Attributes that predict which features to fix: Lessons for app store mining. In: Proceedings of the 21st international conference on evaluation and assessment in software engineering, EASE’17. ACM, New York, pp 108–117

Lim S, Henriksson A, Zdravkovic J (2021) Data-driven requirements elicitation: A systematic literature review. SN Comput Sci 2. https://doi.org/10.1007/s42979-020-00416-4

Liu Y, Liu L, Liu H, Wang X (2018) Analyzing reviews guided by app descriptions for the software development and evolution. J Softw Evol Process 30(12):e2112. e2112 JSME-17-0184.R2

Liu Y, Liu L, Liu H, Yin X (2019) App store mining for iterative domain analysis: Combine app descriptions with user reviews. Softw Pract Exper 49(6):1013–1040. sPE-19-0009.R1

Liu Y, Liu L, Liu H, Gao S (2020) Combining goal model with reviews for supporting the evolution of apps. IET Softw 14(1):39–49. https://doi.org/10.1049/iet-sen.2018.5192

Lu M, Liang P (2017) Automatic classification of non-functional requirements from augmented app user reviews. In: Proceedings of the 21st international conference on evaluation and assessment in software engineering, EASE’17. ACM, New York, pp 344–353

Maalej W, Nabil H (2015) Bug report, feature request, or simply praise? on automatically classifying app reviews. In: 2015 IEEE 23rd international requirements engineering conference (RE). pp 116–125

Maalej W, Kurtanovic Z, Nabil H, Stanik C (2016) On the automatic classification of app reviews. Requir Eng 21(3):311–331

Maalej W, Nayebi M, Johann T, Ruhe G (2016) Toward data-driven requirements engineering. IEEE Softw 33(1):48–54

Maalej W, Nayebi M, Ruhe G (2019) Data-driven requirements engineering: An update. In: Proceedings of the 41st international conference on software engineering: software engineering in practice, ICSE-SEIP ’19. IEEE Press, pp 289–290

Malavolta I, Ruberto S, Soru T, Terragni V (2015a) End users’ perception of hybrid mobile apps in the google play store. In: Proceedings of the 4th international conference on mobile services (MS). IEEE

Malavolta I, Ruberto S, Terragni V, Soru T (2015b) Hybrid mobile apps in the google play store: an exploratory investigation. In: Proceedings of the 2nd ACM international conference on mobile software engineering and systems, ACM

Malgaonkar S, Licorish SA, Savarimuthu BTR (2020) Towards automated taxonomy generation for grouping app reviews: A preliminary empirical study. In: Shepperd MJ, e Abreu FB, da Silva AR, Pérez-Castillo R (eds) Quality of information and communications technology - 13th international conference, QUATIC 2020, Faro, Portugal, September 9-11, 2020, Proceedings, Communications in Computer and Information Science, vol 1266. Springer, pp 120–134. https://doi.org/10.1007/978-3-030-58793-2\_10

Malik H, Shakshuki EM (2016) Mining collective opinions for comparison of mobile apps. Procedia Comput Sci 94:168–175. the 11th International Conference on Future Networks and Communications (FNC 2016) / The 13th International Conference on Mobile Systems and Pervasive Computing (MobiSPC 2016) / Affiliated Workshops

Malik H, Shakshuki EM, Yoo WS (2018) Comparing mobile apps by identifying ’hot’ features. Future Gener Computer Syst

Man Y, Gao C, Lyu MR, Jiang J (2016) Experience report: Understanding cross-platform app issues from user reviews. In: 2016 IEEE 27th international symposium on software reliability engineering (ISSRE). pp 138–149

Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press, Cambridge

Martens D, Johann T (2017) On the emotion of users in app reviews. In: Proceedings of the 2nd international workshop on emotion awareness in software engineering, SEmotion ’17. IEEE Press, pp 8–14

Martens D, Maalej W (2019) Release early, release often, and watch your users’ emotions: Lessons from emotional patterns. IEEE Softw 36(5):32–37

Martens D, Maalej W (2019) Towards understanding and detecting fake reviews in app stores. Empir Softw Eng 24(6):3316–3355

Martin W, Harman M, Jia Y, Sarro F, Zhang Y (2015) The app sampling problem for app store mining. In: Proceedings of the 12th working conference on mining software repositories, MSR ’15. IEEE Press, pp 123–133

Martin WJ, Sarro F, Jia Y, Zhang Y, Harman M (2017) A survey of app store analysis for software engineering. IEEE Trans Software Eng 43 (9):817–847

Masrury RA, Alamsyah A (2019) Analyzing tourism mobile applications perceived quality using sentiment analysis and topic modeling. In: 2019 7th international conference on information and communication technology (ICoICT). pp 1–6

McIlroy S, Shang W, Ali N, Hassan A (2015) Is it worth responding to reviews? a case study of the top free apps in the google play store. IEEE Software PP

McIlroy S, Ali N, Khalid H, Hassan AE (2016) Analyzing and automatically labelling the types of user issues that are raised in mobile app reviews. Empir Softw Eng 21(3):1067–1106

Mcilroy S, Shang W, Ali N, Hassan AE (2017) User reviews of top mobile apps in apple and google app stores. Commun ACM 60(11):62–67

Mercado IT, Munaiah N, Meneely A (2016) The impact of cross-platform development approaches for mobile applications from the user’s perspective. In: Proceedings of the international workshop on app market analytics, WAMA 2016. ACM, New York, pp 43-49

Miller B, Linder F, Mebane WR (2020) Active learning approaches for labeling text: Review and assessment of the performance of active learning approaches. Polit Anal :1–20

Miner G, Elder J, Hill T, Nisbet R, Delen D, Fast A (2012) Practical text mining and statistical analysis for non-structured text data applications, 1st edn. Academic Press, Cambridge

Moher D, Liberati A, Tetzlaff J, Altman D (2009) Preferred reporting items for systematic reviews and meta-analyses: the prisma statement. Br Med J 8:336–341

Mujahid S, Sierra G, Abdalkareem R, Shihab E, Shang W (2017) Examining user complaints of wearable apps: A case study on android wear. In: 2017 IEEE/ACM 4th international conference on mobile software engineering and systems (MOBILESoft). pp 96–99

Mujahid S, Sierra G, Abdalkareem R, Shihab E, Shang W (2018) An empirical study of android wear user complaints. Empir Softw Eng 23 (6):3476–3502

Muñoz S, Araque O, Llamas AF, Iglesias CA (2018) A cognitive agent for mining bugs reports, feature suggestions and sentiment in a mobile application store. In: 2018 4th international conference on big data innovations and applications (innovate-data). pp 17–24

Nagappan M, Shihab E Menzies T, Williams L, Zimmermann T (eds) (2016) Mobile app store analytics. Morgan Kaufmann, Boston

Nayebi M, Cho H, Farrahi H, Ruhe G (2017) App store mining is not enough. In: 2017 IEEE/ACM 39th international conference on software engineering companion (ICSE-C). pp 152–154

Nayebi M, Cho H, Ruhe G (2018) App store mining is not enough for app improvement. Empir Softw Eng 23(5):2764–2794

Nicolai M, Pascarella L, Palomba F, Bacchelli A (2019) Healthcare android apps: a tale of the customers’ perspective. In: Proceedings of the 3rd ACM SIGSOFT international workshop on app market analytics, WAMA@ESEC/SIGSOFT FSE 2019, Tallinn, Estonia, August 27, 2019, pp 33–39

Noei E, Lyons K (2019) A survey of utilizing user-reviews posted on google play store. In: Proceedings of the 29th annual international conference on computer science and software engineering, IBM Corp., USA, CASCON ’19. pp 54–63

Noei E, Da Costa DA, Zou Y (2018) Winning the app production rally. In: Proceedings of the 2018 26th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering, New York, NY, USA, ESEC/FSE 2018, pp 283–294

Noei E, Zhang F, Wang S, Zou Y (2019) Towards prioritizing user-related issue reports of mobile applications. Empir Softw Eng 24(4):1964–1996

Noei E, Zhang F, Zou Y (2019) Too many user-reviews, what should app developers look at first? IEEE Trans Softw Eng 1–1

Nuseibeh B (2001) Weaving together requirements and architectures. Computer 34(3):115–119

Nyamawe A, Liu H, Niu N, Umer Q, Niu Z (2019) Automated recommendation of software refactorings based on feature requests. pp 187–198. https://doi.org/10.1109/RE.2019.00029

Oehri E, Guzman E (2020) Same same but different: Finding similar user feedback across multiple platforms and languages. In: Breaux T D, Zisman A, Fricker S, Glinz M (eds) 28th IEEE international requirements engineering conference, RE 2020, Zurich, Switzerland, August 31 - September 4, 2020, IEEE. https://doi.org/10.1109/RE48521.2020.00017 , pp 44–54

Oh J, Kim D, Lee U, Lee JG, Song J (2013) Facilitating developer-user interactions with mobile app review digests. In: CHI ’13 extended abstracts on human factors in computing systems, CHI EA ’13. ACM, New York, pp 1809–1814

Pagano D, Maalej W (2013) User feedback in the appstore: An empirical study. In: 2013 21st IEEE international requirements engineering conference (RE). pp 125–134

Palomba F, Linares-Vásquez M, Bavota G, Oliveto R, Di Penta M, Poshyvanyk D, De Lucia A (2015) User reviews matter! tracking crowdsourced reviews to support evolution of successful apps. In: 2015 IEEE international conference on software maintenance and evolution (ICSME). pp 291–300

Palomba F, Salza P, Ciurumelea A, Panichella S, Gall H, Ferrucci F, De Lucia A (2017) Recommending and localizing change requests for mobile apps based on user reviews. In: Proceedings of the 39th international conference on software engineering, ICSE ’17. IEEE Press, pp 106–117

Palomba F, Linares-Vásquez M, Bavota G, Oliveto R, Penta MD, Poshyvanyk D, Lucia AD (2018) Crowdsourcing user reviews to support the evolution of mobile apps. J Syst Softw 137:143–162

Panichella S, Di Sorbo A, Guzman E, Visaggio CA, Canfora G, Gall HC (2015) How can i improve my app? classifying user reviews for software maintenance and evolution. In: 2015 IEEE international conference on software maintenance and evolution (ICSME). pp 281–290

Panichella S, Di Sorbo A, Guzman E, Visaggio CA, Canfora G, Gall HC (2016) Ardoc: App reviews development oriented classifier. In: Proceedings of the 2016 24th ACM SIGSOFT international symposium on foundations of software engineering, FSE 2016. ACM, New York, pp 1023–1027

Pelloni L, Grano G, Ciurumelea A, Panichella S, Palomba F, Gall HC (2018) Becloma: Augmenting stack traces with user review information. In: 2018 IEEE 25th international conference on software analysis, evolution and reengineering (SANER). pp 522–526

Peng Z, Wang J, He K, Tang M (2016) An approach of extracting feature requests from app reviews. In: Collaborate computing: networking, applications and worksharing - 12th international conference, CollaborateCom 2016, Beijing, China, November 10-11, 2016, Proceedings. pp 312–323

Phetrungnapha K, Senivongse T (2019) Classification of mobile application user reviews for generating tickets on issue tracking system. In: 2019 12th international conference on information communication technology and system (ICTS). pp 229–234

Puspaningrum A, Siahaan D, Fatichah C (2018) Mobile app review labeling using lda similarity and term frequency-inverse cluster frequency (tf-icf). In: 2018 10th international conference on information technology and electrical engineering (ICITEE). pp 365–370

Pustejovsky J, Stubbs A (2012) Natural language annotation for machine learning - a guide to corpus-building for applications. O’Reilly, Newton

Ralph P, Baltes S, Bianculli D, Dittrich Y, Felderer M, Feldt R, Filieri A, Furia CA, Graziotin D, He P, Hoda R, Juristo N, Kitchenham BA, Robbes R, Mėndez D, Molleri J, Spinellis D, Staron M, Stol K, Tamburri D, Torchiano M, Treude C, Turhan B, Vegas S (2020) ACM SIGSOFT empirical standards. arXiv: 2010.03525

Sänger M, Leser U, Kemmerer S, Adolphs P, Klinger R (2016) SCARE - the sentiment corpus of app reviews with fine-grained annotations in German. In: Proceedings of the tenth international conference on language resources and evaluation (LREC’16)

Sänger M, Leser U, Klinger R (2017) Fine-grained opinion mining from mobile app reviews with word embedding features. In: Natural language processing and information systems - 22nd international conference on applications of natural language to information systems, NLDB 2017, Liège, Belgium, June 21-23, 2017, Proceedings. pp 3–14

Scalabrino S, Bavota G, Russo B, Penta MD, Oliveto R (2019) Listening to the crowd for the release planning of mobile apps. IEEE Trans Softw Eng 45(1):68–86

Scoccia GL, Ruberto S, Malavolta I, Autili M, Inverardi P (2018) An investigation into android run-time permissions from the end users’ perspective. In: Proceedings of the 5th international conference on mobile software engineering and systems, MOBILESoft ’18. ACM, New York, pp 45–55

Shah FA, Sabanin Y, Pfahl D (2016) Feature-based evaluation of competing apps. In: Proceedings of the international workshop on app market analytics, WAMA 2016. ACM, New York, pp 15–21

Shah FA, Sirts K, Pfahl D (2018) Simplifying the classification of app reviews using only lexical features. In: Software Technologies - 13th International Conference, ICSOFT 2018, Porto, Portugal, July 26-28, 2018, Revised Selected Papers. pp 173–193

Shah FA, Sirts K, Pfahl D (2019a) Is the SAFE approach too simple for app feature extraction? A replication study. In: Requirements Engineering: Foundation for Software Quality - 25th International Working Conference, REFSQ 2019, Essen, Germany, March 18-21, 2019, Proceedings. pp 21–36

Shah FA, Sirts K, Pfahl D (2019b) Simulating the impact of annotation guidelines and annotated data on extracting app features from app reviews. International Conference on Software Technologies (ICSOFT, In

Shah FA, Sirts K, Pfahl D (2019c) Using app reviews for competitive analysis: Tool support. In: Proceedings of the 3rd ACM SIGSOFT international workshop on app market analytics, WAMA 2019. ACM, New York, pp 40-46

Shams RA, Hussain W, Oliver G, Nurwidyantoro A, Perera H, Whittle J (2020) Society-oriented applications development: Investigating users’ values from bangladeshi agriculture mobile applications. In: Proceedings of the ACM/IEEE 42nd international conference on software engineering: software engineering in society, ICSE-SEIS ’20. Association for Computing Machinery, New York, pp 53–62. https://doi.org/10.1145/3377815.3381382

Sharma T, Bashir MN (2020) Privacy apps for smartphones: An assessment of users’ preferences and limitations. In: Moallem A (ed) HCI for cybersecurity, privacy and trust - second international conference, HCI-CPT 2020, Held as Part of the 22nd HCI International Conference, HCII 2020, Copenhagen, Denmark, July 19-24, 2020, Proceedings, Springer, Lecture Notes in Computer Science, vol 12210. pp 533–546. https://doi.org/10.1007/978-3-030-50309-3_35

Simmons A, Hoon L (2016) Agree to disagree: on labelling helpful app reviews. In: Proceedings of the 28th Australian conference on computer-human interaction, OzCHI ’16. ACM, New York. pp 416–420

Singh V (2021) South Asian University - Department of Computer Science. http://www.sau.int/research-themes/text-analytics.html , Accessed: 2021-06-01

Software T (2021) What is text analytics? http://www.tibco.com/reference-center/what-is-text-analytics , Accessed: 2021-06-01

Song R, Li T, Ding Z (2020) Automatically identifying requirements-oriented reviews using a top-down feature extraction approach. In: 2020 27th Asia-Pacific software engineering conference (APSEC). pp 450–454. https://doi.org/10.1109/APSEC51365.2020.00054

Srisopha K, Alfayez R (2018) Software quality through the eyes of the end-user and static analysis tools: A study on android oss applications. In: Proceedings of the 1st international workshop on software qualities and their dependencies, SQUADE ’18. ACM, New York, pp 1–4

Srisopha K, Phonsom C, Lin K, Boehm B (2019) Same app, different countries: A preliminary user reviews study on most downloaded ios apps. In: 2019 IEEE international conference on software maintenance and evolution (ICSME). pp 76–80

Srisopha K, Link D, Swami D, Boehm B (2020a) Learning features that predict developer responses for ios app store reviews. In: Proceedings of the 14th ACM / IEEE international symposium on empirical software engineering and measurement (ESEM), ESEM ’20. Association for Computing Machinery, New York. https://doi.org/10.1145/3382494.3410686

Srisopha K, Phonsom C, Li M, Link D, Boehm B (2020b) On building an automatic identification of country-specific feature requests in mobile app reviews: Possibilities and challenges. In: Proceedings of the IEEE/ACM 42nd international conference on software engineering workshops, ICSEW’20. Association for Computing Machinery, New York, pp 494–498. https://doi.org/10.1145/3387940.3391492

Srisopha K, Swami D, Link D, Boehm B (2020c) How features in ios app store reviews can predict developer responses. In: Proceedings of the evaluation and assessment in software engineering, EASE ’20. Association for Computing Machinery, New York, pp 336–341. https://doi.org/10.1145/3383219.3383258

Stanik C, Haering M, Maalej W (2019) Classifying multilingual user feedback using traditional machine learning and deep learning. In: 2019 IEEE 27th international requirements engineering conference workshops (REW). pp 220–226

Sun D, Peng R (2015) A scenario model aggregation approach for mobile app requirements evolution based on user comments. In: Requirements engineering in the big data era, vol 558. Springer, Berlin, pp 75–91

Sun Z, Ji Z, Zhang P, Chen C, Qian X, Du X, Wan Q (2017) Automatic labeling of mobile apps by the type of psychological needs they satisfy. Telematics Inform 34(5):767–778

Talia D (2019) A view of programming scalable data analysis: from clouds to exascale. J Cloud Comput 8(1):4

Tao C, Guo H, Huang Z (2020) Identifying security issues for mobile applications based on user review summarization. Inform Softw Technol 122:106290. https://doi.org/10.1016/j.infsof.2020.106290 . https://www.sciencedirect.com/science/article/pii/S0950584920300409

Tavakoli M, Zhao L, Heydari A, Nenadić G (2018) Extracting useful software development information from mobile application reviews: A survey of intelligent mining techniques and tools. Expert Syst Appl 113:186–199

Tizard J, Rietz T, Blincoe K (2020) Voice of the users: A demographic study of software feedback behaviour. In: Breaux T D, Zisman A, Fricker S, Glinz M (eds) 28th IEEE international requirements engineering conference, RE 2020, Zurich, Switzerland, August 31 - September 4, 2020. IEEE, pp 55–65. https://doi.org/10.1109/RE48521.2020.00018

Tong G, Guo B, Yi O, Zhiwen Y (2018) Mining and analyzing user feedback from app reviews: An econometric approach. In: 2018 IEEE SmartWorld, Ubiquitous Intelligence Computing, Advanced Trusted Computing, Scalable Computing Communications, Cloud big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), pp 841–848

Uddin MDK, He Q, Han J, Chua C (2020) App competition matters: How to identify your competitor apps?. In: 2020 IEEE International Conference on Services Computing, SCC 2020, Beijing, China, November 7-11, 2020. IEEE, pp 370–377. https://doi.org/10.1109/SCC49832.2020.00055

Vasa R, Hoon L, Mouzakis K, Noguchi A (2012) A preliminary analysis of mobile app user reviews. In: Proceedings of the 24th Australian Computer-Human Interaction Conference, ACM. pp 241–244

Viera AJ, Garrett JM (2005) Understanding interobserver agreement: the kappa statistic. Fam Med 37(5):360–3

Villarroel L, Bavota G, Russo B, Oliveto R, Di Penta M (2016) Release planning of mobile apps based on user reviews. In: 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE). pp 14–24

van Vliet M, Groen EC, Dalpiaz F, Brinkkemper S (2020) Identifying and classifying user requirements in online feedback via crowdsourcing. In: Madhavji NH, Pasquale L, Ferrari A, Gnesi S (eds) Requirements engineering: foundation for software quality - 26th International Working Conference, REFSQ 2020, Pisa, Italy, March 24-27, 2020, Proceedings [REFSQ 2020 was postponed], Springer, Lecture Notes in Computer Science, vol 12045. pp 143–159. https://doi.org/10.1007/978-3-030-44429-7\_11

Vu PM, Nguyen TT, Pham HV, Nguyen TT (2015a) Mining user opinions in mobile app reviews: A keyword-based approach. In: Proceedings of the 30th IEEE/ACM International Conference on Automated Software Engineering, ASE ’15. IEEE Press, pp 749–459

Vu PM, Pham HV, Nguyen TT, Nguyen TT (2015b) Tool support for analyzing mobile app reviews. In: 30th IEEE/ACM International Conference on Automated Software Engineering, ASE 2015, Lincoln, NE, USA, November 9-13, 2015, pp 789–794

Vu PM, Pham HV, Nguyen TT, Nguyen TT (2016) Phrase-based extraction of user opinions in mobile app reviews. In: Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, ASE 2016, Singapore, September 3-7, 2016, pp 726–731

Vu PM, Nguyen TT, Nguyen TT (2019) Why do app reviews get responded: A preliminary study of the relationship between reviews and responses in mobile apps. In: Proceedings of the 2019 ACM Southeast Conference, ACM SE ’19. ACM, New York, pp 237–240

Wang C, Zhang F, Liang P, Daneva M, van Sinderen M (2018) Can app changelogs improve requirements classification from app reviews? an exploratory study. In: Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM ’18. ACM, New York

Wang H, Wang L, Wang H (2020a) Market-level analysis of government-backed covid-19 contact tracing apps. In: Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering Workshops, ASE ’20. Association for Computing Machinery, New York, pp 79–84. https://doi.org/10.1145/3417113.3422186

Wang S, Wang Z, Xu X, Sheng QZ (2017) App update patterns: How developers act on user reviews in mobile app stores. In: Service-oriented computing - 15th International Conference, ICSOC 2017, Malaga, Spain, November 13-16, 2017, Proceedings. pp 125–141

Wang T, Liang P, Lu M (2018) What aspects do non-functional requirements in app user reviews describe? an exploratory and comparative study. In: 2018 25th Asia-Pacific Software Engineering Conference (APSEC). pp 494–503

Wang Y, Wang H, Fang H (2017) Extracting user-reported mobile application defects from online reviews. In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW). pp 422–429

Wang Y, Zheng L, Li N (2020b) Rom: A requirement opinions mining method preliminary try based on software review data. In: Proceedings of the 2020 4th International Conference on Management Engineering, Software Engineering and Service Sciences, ICMSS 2020. Association for Computing Machinery, New York, pp 26-33. https://doi.org/10.1145/3380625.3380665

Wei L, Liu Y, Cheung SC (2017) Oasis: Prioritizing static analysis warnings for android apps based on app user reviews. In: Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, ESEC/FSE 2017. ACM, New York, pp 672–682

Weichbroth P, Baj-Rogowska A (2019) Do online reviews reveal mobile application usability and user experience? the case of whatsapp. In: 2019 Federated Conference on Computer Science and Information Systems (FedCSIS). pp 747–754

Wen P, Chen M (2020) A new analysis method for user reviews of mobile fitness apps. In: Kurosu M (ed) Human-computer interaction. human values and quality of life - thematic Area, HCI 2020, Held as Part of the 22nd International Conference, HCII 2020, Copenhagen, Denmark, July 19-24, 2020, Proceedings, Part III, Springer, Lecture Notes in Computer Science, vol 12183. pp 188–199. https://doi.org/10.1007/978-3-030-49065-2\_14

Williams G, Mahmoud A (2018) Modeling user concerns in the app store: A case study on the rise and fall of yik yak. In: 2018 IEEE 26th international requirements engineering conference (rE). pp 64–75

Williams G, Tushev M, Ebrahimi F, Mahmoud A (2020) Modeling user concerns in sharing economy: the case of food delivery apps. Autom Softw Eng 27(3):229–263. https://doi.org/10.1007/s10515-020-00274-7

Wohlin C (2014) Guidelines for snowballing in systematic literature studies and a replication in software engineering. In: Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering, EASE ’14, ACM, New York

Xiao J (2019) Ospaci: Online sentiment-preference analysis of user reviews for continues app improvement. In: Yangui S, Bouguettaya A, Xue X, Faci N, Gaaloul W, Yu Q, Zhou Z, Hernandez N, Nakagawa EY (eds) Service-oriented computing - ICSOC 2019 workshops - WESOACS, ASOCA, ISYCC, TBCE, and STRAPS, Toulouse, France, October 28-31, 2019, Revised Selected Papers, Springer, Lecture Notes in Computer Science, vol 12019. pp 273–279. https://doi.org/10.1007/978-3-030-45989-5_23

Xiao J, Chen S, He Q, Wu H, Feng Z, Xue X (2020) Detecting user significant intention via sentiment-preference correlation analysis for continuous app improvement. In: Kafeza E, Benatallah B, Martinelli F, Hacid H, Bouguettaya A, Motahari H (eds) Service-oriented computing - 18th International Conference, ICSOC 2020, Dubai, United Arab Emirates, December 14-17, 2020, Proceedings, Springer, Lecture Notes in Computer Science, vol 12571. pp 386–400. https://doi.org/10.1007/978-3-030-65310-1_27

Yadav A, Fard FH (2020) Semantic analysis of issues on google play and twitter. In: 2020 IEEE/ACM 42nd International Conference on Software Engineering: Companion Proceedings (ICSE-Companion). pp 308–309

Yadav A, Sharma R, Fard FH (2020) A semantic-based framework for analyzing app users’ feedback. In: Kontogiannis K, Khomh F, Chatzigeorgiou A, Fokaefs M, Zhou M (eds) 27th IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2020, London, ON, Canada, February 18-21, 2020. IEEE, pp 572–576. https://doi.org/10.1109/SANER48275.2020.9054843

Yang H, Liang P (2015) Identification and classification of requirements from app user reviews. In: The 27th International Conference on Software Engineering and Knowledge Engineering, SEKE 2015, Wyndham Pittsburgh University Center, Pittsburgh, PA, USA, July 6-8, 2015, pp 7–12

Zhang J, Wang Y, Xie T (2019) Software feature refinement prioritization based on online user review mining. Inf Softw Technol 108:30–34

Zhang L, Huang X, Jiang J, Hu Y (2017) Cslabel: An approach for labelling mobile app reviews. J Comput Sci Technol 32(6):1076–1089

Zhou Y, Su Y, Chen T, Huang Z, Gall HC, Panichella S (2020) User review-based change file localization for mobile applications. IEEE Trans Softw Eng :1–1. https://doi.org/10.1109/TSE.2020.2967383


Author information

Authors and Affiliations

University College London, London, UK

Jacek Dąbrowski & Emmanuel Letier

Fondazione Bruno Kessler, Trento, Italy

Jacek Dąbrowski, Anna Perini & Angelo Susi


Corresponding author

Correspondence to Jacek Dąbrowski.

Additional information

Communicated by: David Lo

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original online version of this article was revised: The surname of the first author, Jacek Dąbrowski, was misspelled throughout the online version of the article as “Dębrowski.” The surname, however, is correct in the PDF version.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Dąbrowski, J., Letier, E., Perini, A. et al. Analysing app reviews for software engineering: a systematic literature review. Empir Software Eng 27, 43 (2022). https://doi.org/10.1007/s10664-021-10065-7


Accepted: 05 October 2021

Published: 20 January 2022

DOI: https://doi.org/10.1007/s10664-021-10065-7


Keywords:
  • App store analysis
  • Mining app reviews
  • User feedback
  • Mining software repository
  • Software engineering
  • Systematic literature review

Leveraging LLMs for Efficient Topic Reviews


1. Introduction

  • A novel framework for semi-automatic literature review processes is proposed, utilizing the synergistic potential of LLMs and BERTopic; a minimal pipeline sketch is given after this list. The framework is tailored to enhance the depth and breadth of literature analyses, ensuring comprehensive coverage across scholarly databases.
  • A single case study within this framework is developed and presented. The case study focuses on a specific domain, demonstrating the applicability and robustness of the proposed method.
  • Specific metrics to assess the quality of the identified topics have been defined, contributing to the validation and continuous improvement of the framework and literature review process.
  • Through surveys and statistical tests such as Fleiss’ Kappa, the case study has been rigorously evaluated by subject-matter experts. This expert validation, along with the statistical analysis, underscores the effectiveness of the framework in extracting relevant and profound insights from extensive scientific literature.
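As a rough illustration of how these pieces fit together, the sketch below wires BERTopic to the components named later in Section 3.1 (sentence embeddings, UMAP dimensional reduction, HDBSCAN clustering, and the built-in c-TF-IDF topic representation). It is a minimal sketch under stated assumptions rather than the authors' exact configuration: the embedding model name, the UMAP/HDBSCAN parameters, and the placeholder document list are all illustrative.

```python
# Minimal BERTopic pipeline sketch (assumed configuration, not the paper's).
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN

# Placeholder corpus: in the study this would be the abstracts retrieved
# from the scholarly databases (e.g., a Scopus export with hundreds of items).
docs = [
    "Large language models can support systematic literature reviews.",
    "Topic modeling clusters related abstracts for screening.",
    # ... replace with the real list of abstracts
]

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")           # embedding
umap_model = UMAP(n_neighbors=15, n_components=5, metric="cosine")  # dimensional reduction
hdbscan_model = HDBSCAN(min_cluster_size=10, prediction_data=True)  # clustering

topic_model = BERTopic(
    embedding_model=embedding_model,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,  # c-TF-IDF feature extraction and topic representation are built in
)
topics, probabilities = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())
```

In practice, the c-TF-IDF keywords of each cluster can then be handed to an LLM to generate the custom topic names that the expert raters later evaluate.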

2. Literature Review

3. Framework
3.1. Core Components
3.1.1. Embedding
3.1.2. Dimensional Reduction
3.1.3. Clustering
3.1.4. Feature Extraction
3.1.5. Topic Representation
3.2. Parameter Settings and Configurations
3.3. Metrics

  • Similarity based on indexed keywords: In this study, cosine similarity over BERT embeddings is employed to compare Scopus-indexed keywords with the keywords generated by the KeyBERT tool [ 58 ]. The technique relies on the vector representation of texts, where each keyword is converted into a numerical vector in a high-dimensional space. The similarity between two texts is calculated from the angle between their vectors: the closer the angle is to zero, the higher the similarity. This approach allows for a quantitative and accurate comparison of keywords, facilitating the evaluation of the quality of the clustered topic generation process. Additionally, statistical tests are conducted to determine whether the observed similarity results are statistically significant, using a p-value threshold at a 95% confidence level. (A minimal sketch of this computation follows the list.)
  • Thematic coherence analysis: In our study, we assess the relevance of the clustered themes derived from the TR process using a survey administered to seven expert raters. The evaluation focuses on two key parameters, meaningfulness and importance, each rated as high, medium, or low. Meaningfulness pertains to the extent to which the AI-generated custom name for each topic cluster accurately and significantly represents concepts within the fields of SLR, ML, and NLP, reflecting the depth and relevance of these topics in the broader area of knowledge. Importance evaluates the perceived significance of each custom name, considering its impact, influence, or critical value within the research area. To quantify the agreement among raters regarding these evaluations, we employ Fleiss' Kappa (κ), a statistical measure of the reliability of agreement between a fixed number of raters assigning categorical ratings to a number of items. It is expressed as κ = (P̄ − P̄ₑ) / (1 − P̄ₑ), where P̄ is the observed proportion of agreement among raters and P̄ₑ is the hypothetical probability of chance agreement. A κ of 1 indicates perfect agreement, while a κ less than or equal to 0 suggests no agreement beyond chance. (A small worked example of this computation follows the list.)
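A minimal sketch of the keyword-similarity metric follows, assuming a sentence-transformers BERT-style encoder; the model name and the two keyword lists are hypothetical placeholders rather than the study's actual data.

```python
# Sketch: embed Scopus-indexed and KeyBERT-generated keywords with a
# BERT-style encoder and compare them via cosine similarity.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

scopus_keywords = ["systematic literature review", "topic modeling"]    # hypothetical
keybert_keywords = ["automated literature review", "topic modelling"]   # hypothetical

scopus_vecs = model.encode(scopus_keywords)
keybert_vecs = model.encode(keybert_keywords)

# Pairwise cosine similarities; values close to 1 indicate keywords with
# nearly identical meaning (an angle close to zero between their vectors).
print(cosine_similarity(scopus_vecs, keybert_vecs).round(3))
```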
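A small, self-contained implementation of Fleiss' Kappa following the formula above is sketched next; the rating counts are invented purely to show the computation and do not come from the study's survey.

```python
# Fleiss' kappa computed directly from the formula kappa = (P - Pe) / (1 - Pe),
# where P is the mean per-item agreement and Pe is the chance agreement.
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """counts[i, j] = number of raters who assigned item i to category j;
    every row must sum to the same number of raters."""
    n_items, _ = counts.shape
    n_raters = counts[0].sum()

    # Observed agreement: mean of the per-item agreement proportions.
    p_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()

    # Chance agreement from the marginal category proportions.
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_e_bar = np.square(p_j).sum()

    return (p_bar - p_e_bar) / (1 - p_e_bar)

# Hypothetical example: 7 raters grade 4 topic names as High / Medium / Low.
ratings = np.array([
    [5, 2, 0],
    [4, 2, 1],
    [6, 1, 0],
    [2, 3, 2],
])
print(round(fleiss_kappa(ratings), 3))
```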

3.4. Experiment

4. Results and Discussion
4.1. Results and Analysis
4.1.1. Visualization
4.1.2. Evaluation of Topic Generation Quality
4.1.3. Expert Analysis
4.2. Discussion
5. Conclusions and Future Work
Author Contributions
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations

Notation:
  • Cosine Similarity
  • Standard Deviation
  • P̄: Proportion of times that raters agree
  • Values representing proportions of ratings across the High, Medium, and Low categories
  • P̄ₑ: Hypothetical probability of chance agreement
The following abbreviations are used:
  • AI: Artificial intelligence
  • ATC: Automated Text Classification
  • BAAI: Beijing Academy of Artificial Intelligence
  • BERT: Bidirectional Encoder Representations from Transformers
  • c-TF-IDF: Class-based Term Frequency-Inverse Document Frequency
  • DTA: Diagnostic Test Accuracy
  • GPT: Generative Pre-trained Transformer
  • HDBSCAN: Hierarchical Density-Based Spatial Clustering of Applications with Noise
  • HFSRM: Hybrid Feature Selection Rule Measures
  • LDA: Latent Dirichlet Allocation
  • LIS: Library & Information Science
  • LLM: Large Language Model
  • MECCIR: Methodological Expectations of Campbell Collaboration Intervention Reviews
  • MeSH: Medical Subject Headings
  • ML: Machine Learning
  • MTEB: Massive Text Embedding Benchmark
  • NLP: Natural Language Processing
  • PCA: Principal Component Analysis
  • PRISMA: Preferred Reporting Items for Systematic reviews and Meta-Analyses
  • PLSA: Probabilistic Latent Semantic Analysis
  • SLR: Systematic Literature Review
  • TR: Topic Review
  • SGPT: GPT Sentence Embeddings for Semantic Search
  • SVM: Support Vector Machines
  • t-SNE: t-distributed Stochastic Neighbor Embedding
  • UMAP: Uniform Manifold Approximation and Projection
  • Sundaram, G.; Berleant, D. Automating systematic literature reviews with natural language processing and text mining: A systematic literature review. In Proceedings of the International Congress on Information and Communication Technology, London, UK, 20–23 February 2023; Springer: Singapore, 2023; pp. 73–92. [ Google Scholar ]
  • De la Torre-López, J.; Ramírez, A.; Romero, J.R. Artificial intelligence to automate the systematic review of scientific literature. Computing 2023 , 105 , 2171–2194. [ Google Scholar ] [ CrossRef ]
  • Moreno-Garcia, C.F.; Jayne, C.; Elyan, E.; Aceves-Martins, M. A novel application of machine learning and zero-shot classification methods for automated abstract screening in systematic reviews. Decis. Anal. J. 2023 , 6 , 100162. [ Google Scholar ] [ CrossRef ]
  • Adeva, J.G.; Atxa, J.P.; Carrillo, M.U.; Zengotitabengoa, E.A. Automatic text classification to support systematic reviews in medicine. Expert Syst. Appl. 2014 , 41 , 1498–1508. [ Google Scholar ] [ CrossRef ]
  • Fu, L.; Aliferis, C. Using content-based and bibliometric features for machine learning models to predict citation counts in the biomedical literature. Scientometrics 2010 , 85 , 257–270. [ Google Scholar ] [ CrossRef ]
  • Ali, Z.; Kefalas, P.; Muhammad, K.; Ali, B.; Imran, M. Deep learning in citation recommendation models survey. Expert Syst. Appl. 2020 , 162 , 113790. [ Google Scholar ] [ CrossRef ]
  • Larsen, K.R.; Hovorka, D.; Dennis, A.; West, J.D. Understanding the elephant: The discourse approach to boundary identification and corpus construction for theory review articles. J. Assoc. Inf. Syst. 2019 , 20 , 15. [ Google Scholar ] [ CrossRef ]
  • Kunnath, S.N.; Herrmannova, D.; Pride, D.; Knoth, P. A meta-analysis of semantic classification of citations. Quant. Sci. Stud. 2021 , 2 , 1170–1215. [ Google Scholar ] [ CrossRef ]
  • Nasar, Z.; Jaffry, S.W.; Malik, M.K. Information extraction from scientific articles: A survey. Scientometrics 2018 , 117 , 1931–1990. [ Google Scholar ] [ CrossRef ]
  • Wagner, G.; Lukyanenko, R.; Paré, G. Artificial intelligence and the conduct of literature reviews. J. Inf. Technol. 2022 , 37 , 209–226. [ Google Scholar ] [ CrossRef ]
  • Antons, D.; Breidbach, C.F.; Joshi, A.M.; Salge, T.O. Computational literature reviews: Method, algorithms, and roadmap. Organ. Res. Methods 2023 , 26 , 107–138. [ Google Scholar ] [ CrossRef ]
  • Da Silva Júnior, E.M.; Dutra, M.L. A roadmap toward the automatic composition of systematic literature reviews. Iberoam. J. Sci. Meas. Commun. 2021 , 1 , 1–22. [ Google Scholar ] [ CrossRef ]
  • Tauchert, C.; Bender, M.; Mesbah, N.; Buxmann, P. Towards an Integrative Approach for Automated Literature Reviews Using Machine Learning. In Proceedings of the 53rd Hawaii International Conference on System Sciences, HICSS 2020, Maui, HI, USA, 7–10 January 2020. [ Google Scholar ]
  • Grootendorst, M. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv 2022 , arXiv:2203.05794. [ Google Scholar ]
  • Garcia, J.; Villavicencio, G.; Altimiras, F.; Crawford, B.; Soto, R.; Minatogawa, V.; Franco, M.; Martínez-Muñoz, D.; Yepes, V. Machine learning techniques applied to construction: A hybrid bibliometric analysis of advances and future directions. Autom. Constr. 2022 , 142 , 104532. [ Google Scholar ] [ CrossRef ]
  • Hofmann, T. Unsupervised learning by probabilistic latent semantic analysis. Mach. Learn. 2001 , 42 , 177–196. [ Google Scholar ] [ CrossRef ]
  • Blei, D.; Ng, A.; Jordan, M. Latent dirichlet allocation. J. Mach. Learn. Res. 2003 , 3 , 993–1022. [ Google Scholar ]
  • Lee, D.; Seung, H. Learning the parts of objects by non-negative matrix factorization. Nature 1999 , 401 , 788–791. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Arora, S.; Ge, R.; Moitra, A. Learning topic models–going beyond SVD. In Proceedings of the 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science, New Brunswick, NJ, USA, 20–23 October 2012; pp. 1–10. [ Google Scholar ] [ CrossRef ]
  • Pourreza, M.; Ensan, F. Towards semantic-driven boolean query formalization for biomedical systematic literature reviews. Int. J. Med. Inform. 2023 , 170 , 104928. [ Google Scholar ] [ CrossRef ]
  • Scells, H.; Forbes, C.; Clark, J.; Koopman, B.; Zuccon, G. The Impact of Query Refinement on Systematic Review Literature Search: A Query Log Analysis. In Proceedings of the 2022 ACM SIGIR International Conference on Theory of Information Retrieval, Madrid, Spain, 11–12 July 2022; pp. 34–42. [ Google Scholar ]
  • O’Keefe, H.; Rankin, J.; Wallace, S.A.; Beyer, F. Investigation of text-mining methodologies to aid the construction of search strategies in systematic reviews of diagnostic test accuracy—A case study. Res. Synth. Methods 2023 , 14 , 79–98. [ Google Scholar ] [ CrossRef ]
  • Sutton, A.; O’Keefe, H.; Johnson, E.E.; Marshall, C. A mapping exercise using automated techniques to develop a search strategy to identify systematic review tools. Res. Synth. Methods 2023 , 14 , 874–881. [ Google Scholar ] [ CrossRef ]
  • Young, S.; Bethel, A.; Keenan, C.; Ghezzi-Kopel, K.; Moreton, E.; Pickup, D.; Premji, Z.A.; Rogers, M.; Viinholt, B.C. PROTOCOL: Searching and reporting in Campbell Collaboration systematic reviews: An assessment of current methods. Campbell Syst. Rev. 2021 , 17 , e1208. [ Google Scholar ] [ CrossRef ]
  • Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 statement: An updated guideline for reporting systematic reviews. BMJ 2021 , 372 , n71. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Almeida, H.; Meurs, M.J.; Kosseim, L.; Tsang, A. Data sampling and supervised learning for HIV literature screening. IEEE Trans. Nanobiosci. 2016 , 15 , 354–361. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Norman, C.; Leeflang, M.; Névéol, A. LIMSI@CLEF ehealth 2017 task 2: Logistic regression for automatic article ranking. In Proceedings of the CEUR Workshop Proceedings: Working Notes of CLEF 2019: Conference and Labs of the Evaluation Forum, Lugano, Switzerland, 9–12 September 2019. [ Google Scholar ]
  • Norman, C.R.; Leeflang, M.M.; Névéol, A. LIMSI@CLEF eHealth 2018 Task 2: Technology Assisted Reviews by Stacking Active and Static Learning. In Proceedings of the CLEF 2018—Working Notes of CLEF 2018 Conference and Labs of the Evaluation Forum, Avignon, France, 10–14 September 2018; Volume 2125, pp. 1–13. [ Google Scholar ]
  • van den Bulk, L.M.; Bouzembrak, Y.; Gavai, A.; Liu, N.; van den Heuvel, L.J.; Marvin, H.J. Automatic classification of literature in systematic reviews on food safety using machine learning. Curr. Res. Food Sci. 2022 , 5 , 84–95. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Torii, M.; Liu, H. Classifier Ensemble for Biomedical Document Retrieval. In LBM (Short Papers) ; CEUR-WS: Washington, DC, USA, 2007; pp. 5.1–5.17. [ Google Scholar ]
  • Qin, X.; Liu, J.; Wang, Y.; Liu, Y.; Deng, K.; Ma, Y.; Zou, K.; Li, L.; Sun, X. Natural language processing was effective in assisting rapid title and abstract screening when updating systematic reviews. J. Clin. Epidemiol. 2021 , 133 , 121–129. [ Google Scholar ] [ CrossRef ]
  • Tsubota, T.; Bollegala, D.; Zhao, Y.; Jin, Y.; Kozu, T. Improvement of intervention information detection for automated clinical literature screening during systematic review. J. Biomed. Inform. 2022 , 134 , 104185. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Van De Schoot, R.; De Bruin, J.; Schram, R.; Zahedi, P.; De Boer, J.; Weijdema, F.; Kramer, B.; Huijts, M.; Hoogerwerf, M.; Ferdinands, G.; et al. An open source machine learning framework for efficient and transparent systematic reviews. Nat. Mach. Intell. 2021 , 3 , 125–133. [ Google Scholar ] [ CrossRef ]
  • Ding, Q.; Ding, D.; Wang, Y.; Guan, C.; Ding, B. Unraveling the landscape of large language models: A systematic review and future perspectives. J. Electron. Bus. Digit. Econ. 2023 , 3 , 3–19. [ Google Scholar ] [ CrossRef ]
  • Guizzardi, S.; Colangelo, M.T.; Mirandola, P.; Galli, C. Modeling new trends in bone regeneration, using the BERTopic approach. Regen. Med. 2023 , 18 , 719–734. [ Google Scholar ] [ CrossRef ]
  • Chen, W.; Rabhi, F.; Liao, W.; Al-Qudah, I. Leveraging State-of-the-Art Topic Modeling for News Impact Analysis on Financial Markets: A Comparative Study. Electronics 2023 , 12 , 2605. [ Google Scholar ] [ CrossRef ]
  • Wang, Z.; Chen, J.; Chen, J.; Chen, H. Identifying interdisciplinary topics and their evolution based on BERTopic. In Scientometrics ; Springer: Berlin/Heidelberg, Germany, 2023; pp. 1–26. [ Google Scholar ]
  • Gan, L.; Yang, T.; Huang, Y.; Yang, B.; Luo, Y.Y.; Richard, L.W.C.; Guo, D. Experimental Comparison of Three Topic Modeling Methods with LDA, Top2Vec and BERTopic. In Artificial Intelligence and Robotics, Proceedings of the 8th International Symposium, ISAIR 2023, Beijing, China, 21–23 October 2023 ; Springer: Singapore, 2023; pp. 376–391. [ Google Scholar ]
  • Xiao, S.; Liu, Z.; Zhang, P.; Muennighoff, N. C-Pack: Packaged Resources to Advance General Chinese Embedding. arXiv 2023 , arXiv:2309.07597v2. [ Google Scholar ]
  • Kim, D.; Park, C.; Kim, S.; Lee, W.; Song, W.; Kim, Y.; Kim, H.; Kim, Y.; Lee, H.; Kim, J.; et al. SOLAR 10.7 B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling. arXiv 2023 , arXiv:2312.15166. [ Google Scholar ]
  • Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the Workshop at ICLR, Scottsdale, AZ, USA, 2–4 May 2013. [ Google Scholar ]
  • Pennington, J.; Socher, R.; Manning, C.D. Glove: Global Vectors for Word Representation. In Proceedings of the EMNLP, Doha, Qatar, 25–29 October 2014; Volume 14, pp. 1532–1543. [ Google Scholar ]
  • Muennighoff, N.; Tazi, N.; Magne, L.; Reimers, N. MTEB: Massive Text Embedding Benchmark. arXiv 2022 , arXiv:2210.07316. [ Google Scholar ] [ CrossRef ]
  • Muennighoff, N. SGPT: GPT Sentence Embeddings for Semantic Search. arXiv 2022 , arXiv:2202.08904. [ Google Scholar ]
  • McInnes, L.; Healy, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. J. Open Source Softw. 2018 , 3 , 861. [ Google Scholar ] [ CrossRef ]
  • Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008 , 9 , 2579–2605. [ Google Scholar ]
  • Tang, J.; Liu, J.; Zhang, M.; Mei, Q. Visualizing Large-scale and High-dimensional Data. In WWW ’16, Proceedings of the 25th International Conference on World Wide Web, Montréal, QC, Canada, 11–15 April 2016 ; International World Wide Web Conferences Steering Committee: Geneva, Switzerland, 2016. [ Google Scholar ] [ CrossRef ]
  • Campello, R.; Moulavi, D.; Sander, J. Density-Based Clustering Based on Hierarchical Density Estimates. In Advances in Knowledge Discovery and Data Mining, Proceedings of the 17th Pacific-Asia Conference, PAKDD 2013, Gold Coast, Australia, 14–17 April 2013 ; Proceedings, Part II; Springer: Berlin/Heidelberg, Germany, 2013; Volume 7819, pp. 160–172. [ Google Scholar ] [ CrossRef ]
  • Allaoui, M.; Kherfi, M.L.; Cheriet, A. Considerably Improving Clustering Algorithms Using UMAP Dimensionality Reduction Technique: A Comparative Study ; Springer: Berlin/Heidelberg, Germany, 2020; pp. 317–325. [ Google Scholar ]
  • García, J.; Leiva-Araos, A.; Diaz-Saavedra, E.; Moraga, P.; Pinto, H.; Yepes, V. Relevance of Machine Learning Techniques in Water Infrastructure Integrity and Quality: A Review Powered by Natural Language Processing. Appl. Sci. 2023 , 13 , 12497. [ Google Scholar ] [ CrossRef ]
  • Asyaky, M.S.; Mandala, R. Improving the Performance of HDBSCAN on Short Text Clustering by Using Word Embedding and UMAP. In Proceedings of the 2021 8th International Conference on Advanced Informatics: Concepts, Theory and Applications (ICAICTA), Bandung, Indonesia, 29–30 September 2021; pp. 1–6. [ Google Scholar ] [ CrossRef ]
  • Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching Word Vectors with Subword Information. Trans. Assoc. Comput. Linguist. 2016 , 5 , 135–146. [ Google Scholar ] [ CrossRef ]
  • Färber, M.; Steyer, A. Towards Full-Fledged Argument Search: A Framework for Extracting and Clustering Arguments from Unstructured Text. arXiv 2021 , arXiv:2112.00160. [ Google Scholar ] [ CrossRef ]
  • David, U.; Karabatak, M. Text Clustering of COVID-19 Vaccine Tweets. In Proceedings of the 2022 10th International Symposium on Digital Forensics and Security (ISDFS), Istanbul, Turkey, 6–7 June 2022; pp. 1–6. [ Google Scholar ] [ CrossRef ]
  • Gelar, T.; Sari, A.N. Bertopic and NER Stop Words for Topic Modeling on Agricultural Instructional Sentences. In Proceedings of the International Conference on Applied Science and Technology on Engineering Science 2023 (iCAST-ES 2023), Tarakan, Indonesia, 20–22 October 2024; Atlantis Press: Paris, France, 2024; pp. 129–140. [ Google Scholar ] [ CrossRef ]
  • Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. QLoRA: Efficient Finetuning of Quantized LLMs. arXiv 2023 , arXiv:2305.14314. [ Google Scholar ]
  • Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020 , arXiv:2005.14165. [ Google Scholar ] [ CrossRef ]
  • Grootendorst, M. KeyBERT: Minimal Keyword Extraction with BERT. 2020. Available online: https://zenodo.org/records/8388690 (accessed on 28 November 2023).
  • Landis, J.R.; Koch, G.G. The measurement of observer agreement for categorical data. Biometrics 1977 , 33 , 159–174. [ Google Scholar ] [ CrossRef ] [ PubMed ]
  • Gana Castillo, B.P. Leveraging LLMs for Efficient Topic Reviews. 2024. Available online: https://zenodo.org/records/13346401 (accessed on 19 August 2024).
  • Gana Castillo, B. Topic-Modeling-BERTopic-SOLAR. 2024. Available online: https://github.com/Zickbad/Topic-modeling-BERTopic-SOLAR (accessed on 19 August 2024).


Topic | Year | Scope | SLR Focus Task | Citation
Classic machine learning algorithms in text mining for abstract screening | 2007–2022 | Explores the use of SVMs and ensemble methods in distinguishing relevant citations, with a focus on overcoming class imbalance. | Select relevant studies, Assess study quality | [ , ]
Hybrid Feature Selection with supervised learning for automatic ranking of articles | 2016–2018 | Discusses the application of HFSRM and logistic regression classifiers in DTA reviews, significantly reducing screening time and effort. | Assess study quality, Analyze and interpret data | [ , , ]
Adherence of search methods to SLR frameworks | 2021 | Examines the quality and nature of search methods in Campbell systematic reviews, emphasizing adherence to MECCIR and PRISMA guidelines. | Develop review protocol, Conduct exhaustive literature search | [ ]
ML-Aided pipeline for screening in systematic reviews | 2021 | Proposes a machine learning-aided pipeline, ASReview, to significantly improve the efficiency and quality of systematic reviews and meta-analyses. | Select relevant studies, Extract data from selected studies | [ ]
Systematic reviews automation based on query log analysis | 2022 | Analyzes query logs from a specialized tool to reveal intuitive user behaviors in query formulation for medical systematic reviews. | Conduct exhaustive literature search | [ ]
Automatic query generation using pre-trained language models | 2023 | Discusses an automatic query generation approach using pre-trained language models, significantly surpassing traditional models in precision, recall, and F-measures. | Conduct exhaustive literature search | [ ]
Utility of Text-Mining Tools | 2023 | Assesses the utility of semi-automated text-mining tools in systematic reviews of diagnostic test accuracy, yielding additional relevant articles with varied precision. | Select relevant studies | [ ]
MeSH Term Identification | 2023 | Focuses on using MeSH term identification, highlighting the strategy's efficiency. | Conduct exhaustive literature search | [ ]
BERTopic applications in various domains | 2023 | BERTopic has been used across multiple domains including LLM research, medical research, financial sector, and interdisciplinary research, showing superiority in topic clustering, analysis, and identifying key research trends across diverse datasets. | Analyze and interpret data | [ , , , , ]
Our work | 2024 | Proposes an advanced systematic review process incorporating AI techniques for improved accuracy, speed, and comprehensiveness in literature reviews. | Semi-automatic exhaustive literature search and screening framework |

Topic | Custom Name | Count
6 | Automated Methods for Literature Screening in Medical Systematic Reviews | 399
29 | Biomedical Text Mining for Gene and Protein Extraction | 131
54 | Clinical Natural Language Processing for Medical Text Analysis | 67
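
The topic inventory above is the kind of output produced by a BERTopic-style pipeline of the sort referenced throughout this bibliography: documents are embedded, reduced with UMAP, clustered with HDBSCAN, described with class-based TF-IDF, and labeled with KeyBERT-style keywords. The snippet below is a minimal sketch of that flow, assuming the open-source bertopic, umap-learn, hdbscan, keybert, and sentence-transformers packages; the embedding model, hyperparameters, and the 20 Newsgroups corpus are illustrative stand-ins, not the configuration or data used in the study.

```python
# Minimal sketch of a BERTopic-style topic-modeling pipeline (sentence embeddings ->
# UMAP dimensionality reduction -> HDBSCAN clustering -> class-based TF-IDF topic
# descriptions). Model names, hyperparameters, and the corpus are illustrative only.
from bertopic import BERTopic
from hdbscan import HDBSCAN
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer
from sklearn.datasets import fetch_20newsgroups
from umap import UMAP

# Stand-in corpus; in a review setting these would be the retrieved titles/abstracts.
docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))["data"][:1000]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine", random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=10, metric="euclidean", prediction_data=True)

topic_model = BERTopic(
    embedding_model=embedder,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
)
topics, _ = topic_model.fit_transform(docs)

# Topic inventory with document counts, analogous to the Topic / Custom Name / Count table.
print(topic_model.get_topic_info()[["Topic", "Count", "Name"]].head(10))

# KeyBERT-style keyword extraction, one way to derive human-readable topic labels.
kw_model = KeyBERT(model=embedder)
sample_doc = next(d for d in docs if len(d.split()) > 50)
print(kw_model.extract_keywords(sample_doc, keyphrase_ngram_range=(1, 2), top_n=6))
```

Running UMAP before HDBSCAN is the usual design choice here, since density-based clustering degrades on raw high-dimensional embeddings; that combination is also the one examined in the Allaoui et al. and Asyaky and Mandala entries listed above.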

Statistic | Index Keywords: KeyBERT (6) | Index Keywords: Mean of Selected Topics | Author Keywords: KeyBERT (6) | Author Keywords: Mean of Selected Topics
Min | 0.4956 | 0.5333 | 0.3577 | 0.3071
Max | 0.8149 | 0.6940 | 0.8600 | 0.6847
Mean | 0.6956 | 0.6323 | 0.6971 | 0.5958
Median | 0.7007 | 0.6309 | 0.7064 | 0.6019
Std Dev | 0.0531 | 0.0245 | 0.0717 | 0.0432
Shapiro-Wilk p-value | 0.1909 | 0.2110 | 0.4551 | 0.2572
t-Test Paired p-value | 0.0135 (Index Keywords pair) | | 0.0224 (Author Keywords pair) |
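
The Shapiro-Wilk and paired t-test values in the table above follow a standard recipe: check each set of per-topic coherence scores for normality, then compare the two labeling conditions pairwise. The sketch below shows that procedure with SciPy; the score arrays are hypothetical stand-ins, not the study's data.

```python
# Minimal sketch: Shapiro-Wilk normality checks on each score set, then a paired
# t-test between the two labeling conditions. Arrays are hypothetical stand-ins.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
keybert_scores = rng.normal(loc=0.70, scale=0.05, size=30)          # e.g., KeyBERT (6) coherence per topic
selected_topic_scores = rng.normal(loc=0.63, scale=0.03, size=30)   # e.g., mean of selected topics

# Shapiro-Wilk: p > 0.05 gives no evidence against normality, supporting a paired t-test.
for name, scores in [("KeyBERT (6)", keybert_scores), ("Mean of selected topics", selected_topic_scores)]:
    w_stat, p = stats.shapiro(scores)
    print(f"Shapiro-Wilk {name}: W={w_stat:.4f}, p={p:.4f}")

# Paired t-test: the same topics are scored under both conditions, so samples are paired.
t_stat, p_value = stats.ttest_rel(keybert_scores, selected_topic_scores)
print(f"Paired t-test: t={t_stat:.4f}, p={p_value:.4f}")
```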

Context Parameter | Average Score | Fleiss' Kappa and Related Agreement Statistics
Meaningfulness | 2.423 | 0.65 / 0.42 / 0.54 / 0.0402
Importance | 2.355 | 0.68 / 0.40 / 0.44 / 0.0650
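
The agreement figures above (including Fleiss' kappa) summarize how consistently several human raters scored each topic for meaningfulness and importance. Below is a minimal sketch of how such a coefficient can be computed, assuming a hypothetical rating matrix and the fleiss_kappa implementation in statsmodels; it is not the authors' evaluation code.

```python
# Minimal sketch of an inter-rater agreement computation (Fleiss' kappa) using
# statsmodels. The rating matrix is hypothetical.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = rated items (topics), columns = raters, values = chosen category (e.g., 1-3 scale).
ratings = np.array([
    [3, 3, 2, 3],
    [2, 3, 2, 2],
    [3, 2, 3, 3],
    [1, 2, 2, 1],
    [3, 3, 3, 2],
    [2, 2, 1, 2],
])

# Convert raw ratings into an items x categories count table, then compute kappa.
count_table, _categories = aggregate_raters(ratings)
print(f"Fleiss' kappa: {fleiss_kappa(count_table, method='fleiss'):.3f}")
```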

Share and Cite

Gana, B.; Leiva-Araos, A.; Allende-Cid, H.; García, J. Leveraging LLMs for Efficient Topic Reviews. Appl. Sci. 2024, 14, 7675. https://doi.org/10.3390/app14177675


Quality assessment of systematic reviews in software engineering: a tertiary study


Bibliometrics & Citations

  • Sepúlveda, S.; Cravero, A.; Fonseca, G.; Antonelli, L. (2024). Systematic Review on Requirements Engineering in Quantum Computing: Insights and Future Directions. Electronics 13(15), 2989. https://doi.org/10.3390/electronics13152989
  • Lin, H.; Wang, C.; Sun, Y. (2024). How big five personality traits influence information sharing on social media: A meta analysis. PLOS ONE 19(6), e0303770. https://doi.org/10.1371/journal.pone.0303770
  • Okesola, M.; Okesola, J.; Ogunlana, O.; Afolabi, I. (2024). Quality assessment of systematic literature on uterine fibroids: a systematic review. F1000Research 11, 1050. https://doi.org/10.12688/f1000research.124879.2

Index Terms

Software and its engineering

Recommendations

Automated selection and quality assessment of primary studies: a systematic literature review.

Researchers use systematic literature reviews (SLRs) to synthesize existing evidence regarding a research topic. While being an important means to condense knowledge, conducting an SLR requires a large amount of time and effort. Consequently, ...

Systematic review in software engineering: where we are and where we should be going

In 2004 Kitchenham et al. first proposed the idea of evidence-based software engineering (EBSE). EBSE requires a systematic and unbiased method of aggregating empirical studies and has encouraged software engineering researches to undertake systematic ...

A critical appraisal tool for systematic literature reviews in software engineering

Context: Methodological research on systematic literature reviews (SLRs) in Software Engineering (SE) has so far focused on developing and evaluating guidelines for conducting systematic reviews. However, the support for quality assessment of ...


Author Tags

  • quality assessment
  • software engineering
  • systematic (literature) review

Article Metrics

  • 41 Total Citations
  • 1,280 Total Downloads
  • Downloads (Last 12 months) 197
  • Downloads (Last 6 weeks) 23
  • dos Santos, V.; Iwazaki, A.; Felizardo, K.; de Souza, É.; Nakagawa, E. (2024). Sustainable systematic literature reviews. Information and Software Technology, 107551. https://doi.org/10.1016/j.infsof.2024.107551
  • Tripathi, N.; Hietala, H.; Xu, Y.; Liyanage, R. (2024). Stakeholders collaborations, challenges and emerging concepts in digital twin ecosystems. Information and Software Technology 169, 107424. https://doi.org/10.1016/j.infsof.2024.107424
  • Hannousse, A.; Yahiouche, S.; Nait-Hamoud, M. (2024). Twenty-two years since revealing cross-site scripting attacks. Computer Science Review 52(C). https://doi.org/10.1016/j.cosrev.2024.100634
  • Orošnjak, M.; Štrbac, B.; Vulanović, S.; Runje, B.; Horvatić Novak, A.; Razumić, A. (2024). RCE (rationale–cogency–extent) criterion unravels features affecting citation impact of top-ranked systematic literature reviews: leaving the impression…is all you need. Scientometrics 129(3), 1891–1947. https://doi.org/10.1007/s11192-024-04935-2
  • Bolaños, F.; Salatino, A.; Osborne, F.; Motta, E. (2024). Artificial intelligence for literature reviews: opportunities and challenges. Artificial Intelligence Review 57(10). https://doi.org/10.1007/s10462-024-10902-3
  • Pauzi, Z.; Capiluppi, A. (2024). Beyond the Systematic: Forecasting Importance and Emergence of Research Areas in Applications of Software Traceability Using NLP. In: Evaluation of Novel Approaches to Software Engineering, 119–140. https://doi.org/10.1007/978-3-031-64182-4_6
  • van den Berg, C.; Eybers, S. (2024). Investigating Machine Learning Techniques Used for the Detection of Class Noise in Data: A Systematic Literature Review. In: Intelligent Computing, 128–147. https://doi.org/10.1007/978-3-031-62277-9_9


COMMENTS

  1. Systematic literature reviews in software engineering

    The impact of software engineering research on modern programming languages: Informal literature survey. No clear search criteria, no data extraction process. ACM Surv: J. Ma and J. V. Nickerson: 38(3), pp. 1-24: 2006: Hands-on, simulated and remote laboratories: a comparative literature review: Not a software engineering topic: ISESE: S ...

  2. Guidelines for performing Systematic Literature Reviews in Software

    The guidelines have been adapted to reflect the specific problems of software engineering research. The guidelines cover three phases of a systematic literature review: planning the review ...

  3. Performing systematic literature reviews in software engineering

    Context: Making best use of the growing number of empirical studies in Software Engineering, for making decisions and formulating research questions, requires the ability to construct an objective summary of available research evidence. Adopting a systematic approach to assessing and aggregating the outcomes from a set of empirical studies is also particularly important in Software Engineering ...

  4. Systematic literature reviews in software engineering

    4.4.1. Review topics and extent of evidence. Compared with our previous study [12], the 33 reviews discussed in this paper addressed a broader range of software engineering topics. There is no longer a preponderance of cost estimation studies and more general software engineering topics have been addressed.

  5. PDF Guidelines for performing Systematic Literature Reviews in Software

    literature reviews appropriate for software engineering researchers, including PhD students. A systematic literature review is a means of evaluating and interpreting all available research relevant to a particular research question, topic area, or phenomenon of interest. Systematic reviews aim to present a fair evaluation of a

  6. Systematic literature reviews in software engineering

    Lessons from applying the systematic literature review process within the software engineering domain. Journal of Systems and Software ... On the performance of hybrid search strategies for systematic literature reviews in software engineering. Information and Software Technology, Volume 123, 2020, Article 106294. Erica Mourão, …, Claes Wohlin.

  7. A systematic literature review of literature reviews in software

    S. Imtiaz, M. Bano, N. Ikram, M. Niazi, A tertiary study: experiences of conducting systematic literature reviews in software engineering, in: Presented at the Proceedings of International Conference on Evaluation and Assessment in Software Engineering, 2013.

  8. Software product line testing: a systematic literature review

    A Software Product Line (SPL) is a software development paradigm in which a family of software products shares a set of core assets. Testing has a vital role in both single-system development and SPL development in identifying potential faults by examining the behavior of a product or products, but it is especially challenging in SPL. There have been many research contributions in the SPL ...

  9. Performing systematic literature reviews in software engineering

    This article follows the guidelines described in [31] [32][33] for performing literature reviews in software engineering and scoping reviews [30]. Consequently, the main steps performed in this ...

  10. Evidence-Based Software Engineering and Systematic Reviews:

    Abstract. In the decade since the idea of adapting the evidence-based paradigm for software engineering was first proposed, it has become a major tool of empirical software engineering. Evidence-Based Software Engineering and Systematic Reviews provides a clear introduction to the use of an evidence-based model for software engineering research ...

  11. Performing systematic literature reviews in software engineering

    This tutorial is designed to provide an introduction to the role, form and processes involved in performing Systematic Literature Reviews, and to gain the knowledge needed to conduct systematic reviews of their own. Context: Making best use of the growing number of empirical studies in Software Engineering, for making decisions and formulating research questions, requires the ability to ...

  12. Systematic literature reviews in software engineering: Preliminary

    Systematic Literature Reviews (SLRs) have been gaining significant attention from software engineering researchers since 2004. Several researchers have reported their experiences of and lessons learned from applying systematic reviews to different subject matters in software engineering. However, there has been no attempt at independently exploring experiences and perceptions of the ...

  13. Systematic Literature Reviews

    Kitchenham et al. report 53 unique systematic literature reviews in software engineering being published between 2004 and 2008 [103, 104]. They conclude that there is a growth of the number of systematic literature reviews being published, and that the quality of the reviews tend to be increasing too. However, still there is large variation ...

  14. Machine/Deep Learning for Software Engineering: A Systematic Literature

    Since 2009, the deep learning revolution, which was triggered by the introduction of ImageNet, has stimulated the synergy between Software Engineering (SE) and Machine Learning (ML)/Deep Learning (DL). Meanwhile, critical reviews have emerged that suggest that ML/DL should be used cautiously. To improve the applicability and generalizability of ML/DL-related SE studies, we conducted a 12-year ...

  15. Systematic literature reviews in software engineering

    The recommended methodology for aggregating empirical studies is a systematic literature review (SLR) (see for example [4], [5], [6]). Kitchenham adapted the medical guidelines for SLRs to software engineering [7], and later updated them to include insights from sociology research [8]. SLRs are a means of aggregating knowledge about a software ...

  16. A Quality Assessment Instrument for Systematic Literature Reviews in

    literature review. Conclusion: It is concluded that the presented instrument may be helpful support for an appraiser in assessing the quality of SLRs in software engineering. Keywords: Systematic reviews, quality assessment, critical appraisal, AMSTAR 2, systematic literature review, tertiary study. 1. Introduction To establish evidence-based ...

  17. Evidence-Based Software Engineering and Systematic Literature Reviews

    In the presentation, the view is taken that although Evidence-based Software Engineering may be unproven, one aspect of the evidence-based paradigm is hard to ignore, that is: Systematic literature reviews. Systematic literature reviews aim to summarize research studies related to a specific research question in a way that is fair, rigorous, and ...

  18. (PDF) Systematic literature reviews in software engineering-A

    Background: In 2004 the concept of evidence-based software engineering (EBSE) was introduced at the ICSE04 conference. Aims: This study assesses the impact of systematic literature reviews (SLRs) which ...

  19. Large Language Models for Software Engineering: A Systematic Literature

    Software Engineering (SE) - a discipline focused on the development, implementation, and maintenance of software systems - is one of those areas reaping the benefits of the LLM revolution (Ma et al., 2023a). The utilization of LLMs in SE primarily emerges from an innovative perspective where numerous SE challenges can be effectively reframed into data, code, or text analysis tasks (Wang et ...

  20. Systematic Literature Review of Commercial Participation in Open Source

    Yuxia Zhang, Klaas-Jan Stol, Hui Liu, and Minghui Zhou. 2022. Corporate dominance in open source ecosystems: a case study of OpenStack. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software ...

  21. Grey Literature in Software Engineering: A critical review

    Overview of Grey Literature use in secondary studies of Software Engineering. At least 75% of the studies used Grey Literature to support answers to the RQs. Almost 50% of the Grey Literature in secondary studies are now unavailable. Few specific criteria were used to search and assess the Grey Literature quality.

  22. Analysing app reviews for software engineering: a systematic literature

    App reviews found in app stores can provide critically valuable information to help software engineers understand user requirements and to design, debug, and evolve software products. Over the last ten years, a vast amount of research has been produced to study what useful information might be found in app reviews, and how to mine and organise such information as efficiently as possible. This ...

  23. Accessibility of low-code approaches: A systematic literature review

    The software engineering research community has concentrated on the accessibility of software engineering products but has paid less attention to the accessibility of software engineering ... We evaluated and assessed the search engines employed in prior software engineering literature reviews, as documented in Maplesden et al., Shahin ...

  24. The need for multivocal literature reviews in software engineering

    The need for multivocal literature reviews in software engineering: complementing systematic literature reviews with grey literature. General and reference. Cross-computing tools and techniques. Empirical studies. Document types. Surveys and overviews. Recommendations.

  25. Leveraging LLMs for Efficient Topic Reviews

    This paper presents the topic review (TR), a novel semi-automatic framework designed to enhance the efficiency and accuracy of literature reviews. By leveraging the capabilities of large language models (LLMs), TR addresses the inefficiencies and error-proneness of traditional review methods, especially in rapidly evolving fields. The framework significantly improves literature review ...

  26. PDF Systematic literature reviews in software engineering A systematic

    Background: In 2004 the concept of evidence-based software engineering (EBSE) was introduced at the ICSE04 conference. Aims: This study assesses the impact of systematic literature reviews (SLRs) which are the recommended EBSE method for aggregating evidence. Method: We used the standard systematic literature review method employing a manual ...

  27. Quality assessment of systematic reviews in software engineering

    Context: The quality of an Systematic Literature Review (SLR) is as good as the quality of the reviewed papers. Hence, it is vital to rigorously assess the papers included in an SLR. There has been no tertiary study aimed at reporting the state of the practice of quality assessment used in SLRs in Software Engineering (SE).

  28. Quality Assessment in Systematic Literature Reviews: A Software

    Context: Quality Assessment (QA) of reviewed literature is paramount to a Systematic Literature Review (SLR) as the quality of conclusions completely depends on the quality of selected literature. A number of researchers in Software Engineering (SE) have developed a variety of QA instruments and also reported their challenges. We previously conducted a tertiary study on SLRs with QA from 2004 ...