Within-project and cross-project defect prediction based on model averaging


Dataset

The paper uses four widely used open software defect datasets: NASA, SoftLab, ReLink and AEEEM, comprising experimental data from 23 projects. The NASA dataset is the most commonly used software defect dataset and includes 12 projects: PC1, PC2, PC3, PC4, PC5, JM1, CM1, KC1, MC1, KC3, MC2, and MW1. The number of metric attributes varies from project to project, ranging from 22 to 40; these attributes capture characteristics such as size, readability, and complexity. The SoftLab dataset comes from the PROMISE repository, and the three projects selected in this paper are ar1, ar4, and ar6. Each project contains 30 metric features, including Halstead metrics and McCabe's cyclomatic complexity. The ReLink dataset includes three projects: Apache, Safe, and ZXing. Each project includes 27 complexity metrics, and the defect information was manually verified; this higher-quality defect data can improve defect prediction performance. The AEEEM dataset includes five projects: EQ, JDT, LC, ML, and PDE. Each project includes 62 metrics: 17 source code metrics, five defect metrics, five entropy-of-change metrics, 17 source code entropy metrics, and 17 source code churn metrics.

To ensure that our experimental evaluations adhere as closely as possible to the i.i.d. assumption, we employed stratified sampling techniques during data partitioning. This approach helps maintain consistent class distributions across training and testing sets, thereby minimizing potential biases and correlations that could violate the i.i.d. premise.
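A minimal sketch of such a stratified split is shown below, under the assumption of a scikit-learn-style workflow (the paper does not name the tooling used for partitioning); the data are synthetic stand-ins.

```python
# Minimal sketch of stratified partitioning (assumption: scikit-learn; synthetic data
# stands in for a real project with a low defect rate).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Imbalanced toy data mimicking a ~10% defect rate.
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# The defect rate is preserved (approximately) in both partitions.
print(y_train.mean(), y_test.mean())
```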

Table 1 shows the details of the experimental datasets, including the project name, sample size, number of metrics and defect rate. Table 1 also shows that some projects exhibit severe class imbalance: six projects have a defect rate below 10%, and the defect rates of MC1 and PC2 are approximately 2%.

Table 1 Descriptions of defect dataset.

The experiments address a binary classification problem, in which every prediction falls into one of four outcomes: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). The first two are correct predictions; a false positive misclassifies a negative instance as positive, and a false negative misclassifies a positive instance as negative. After the model makes its predictions, these four outcomes form the confusion matrix shown in Table 2.

Table 2 Confusion matrix.

To compare the actual prediction performance of each model fairly, the experiments use the F-measure (F1), which is widely used in binary classification and software defect prediction research, and the AUC as evaluation metrics. F1 is the weighted harmonic mean of precision and recall, where precision is the proportion of predicted positives that are truly positive and recall is the proportion of actual positives that are correctly identified. A larger F1 value generally indicates better model performance. The metrics are calculated as follows:

Precision: Precision is the proportion of instances predicted as positive that are actually positive; in defect prediction, where the positive class typically denotes defective modules, it measures how many modules flagged as defective truly are. In general, the higher the precision, the better the prediction model.

$$\text{Precision} = \frac{\text{TP}}{\text{TP+FP}}$$

(9)

Recall: Recall is the proportion of actual positive instances that the model correctly identifies as positive. Recall is particularly important in software defect prediction: where the positive class denotes defective modules, a high recall means that most truly defective modules are detected by the model, indicating good performance. However, a very high recall often comes at the cost of lower precision.

$$\text{Recall} = \frac{\text{TP}}{{\text{TP}}+{\text{FN}}}$$

(10)

F1: F1 is a composite metric. As noted above, a higher recall often comes with lower precision; F1 is the harmonic mean of the two, so it summarizes this trade-off in a single indicator.

$$\text{F}1=2\times \frac{{\text{Precision}}\times {\text{Recall}}}{\text{Precision+Recall}}$$

AUC: The area under the receiver operating characteristic (ROC) curve, whose vertical axis is the true positive rate (TPR) and whose horizontal axis is the false positive rate (FPR). Each point on the curve corresponds to a classification threshold, with an associated TPR and FPR; the curve is obtained by sweeping the threshold of the prediction model. The larger the area under the ROC curve, the better the performance of the trained model.
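As an illustration only, the four metrics can be computed from a model's predicted probabilities with scikit-learn (an assumed tool; the labels and scores below are placeholders, not results from the experiments).

```python
# Placeholder labels and scores used to illustrate the metric definitions above.
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                    # 1 = positive class
y_score = np.array([0.9, 0.2, 0.6, 0.4, 0.3, 0.7, 0.8, 0.1])   # predicted probabilities
y_pred = (y_score >= 0.5).astype(int)                          # default 0.5 threshold

print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F1       :", f1_score(y_true, y_pred))          # harmonic mean of the two
print("AUC      :", roc_auc_score(y_true, y_score))    # threshold-independent
```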

Feature engineering and data preprocessing

To ensure the quality and relevance of the features used in our classification experiments, we performed comprehensive feature engineering and data preprocessing steps applicable to all algorithms and datasets. Below are the detailed steps and results; a consolidated code sketch of steps 1-4 follows the list.

  1. Handling missing values

    Missing values were present in several features across the datasets. Specifically, approximately 10% of features in the NASA dataset, 5% in AEEEM, 8% in ReLink, and 3% in SoftLab had missing values. These missing values were primarily found in metrics such as “LOC_TOTAL,” “Cyclomatic Complexity,” “Code Churn,” and “Number of Developers.” We addressed these missing values by imputing numerical features with the median value and categorical features with the mode value of the respective feature. This approach ensured that the datasets remained complete without introducing significant bias.

  2. Removing highly correlated features

    To eliminate redundancy and reduce multicollinearity, we conducted a Pearson correlation analysis on all features. Features with a correlation coefficient greater than 0.8 were considered highly correlated, and one feature from each correlated pair was removed. For example, in the NASA dataset, “LOC_CODE” was removed due to its high correlation with “LOC_TOTAL” (r = 0.92). Similarly, in the AEEEM dataset, “Code Churn” was removed as it was highly correlated with “Number of Code Clones” (r = 0.85). This step resulted in the removal of 10–15% of features across the datasets, ensuring that the remaining features were independent and informative.

  3. Feature scaling

    All numerical features were normalized using Min–Max scaling to ensure that they contributed equally to the model training process. This step was particularly important for datasets like NASA and AEEEM, where features such as “LOC_TOTAL” and “Cyclomatic Complexity” had significantly different scales. By scaling the features to a range of [0, 1], we ensured that no single feature dominated the model training due to its larger magnitude.

  4. Dimensionality reduction

    In cases where the feature space remained large after correlation-based feature selection, we applied Principal Component Analysis (PCA) to reduce dimensionality while retaining at least 95% of the variance. For example, in the AEEEM dataset, which initially had 62 features, PCA reduced the dimensionality to 15 principal components. Similarly, the NASA dataset was reduced from 38 features to 12 principal components, and the ReLink dataset from 27 features to 10 principal components. This step not only reduced computational complexity but also helped mitigate the risk of overfitting.

  5. Feature importance analysis

    To further understand the contribution of individual features, we analyzed feature importance using XGBoost and LightGBM. The top 10 most important features across all datasets included metrics such as “Cyclomatic Complexity,” “LOC_TOTAL,” “Code Churn,” and “Number of Developers.” These features consistently ranked high in importance, aligning with prior studies in software defect prediction. For instance, “Cyclomatic Complexity” was the most influential feature in the NASA dataset, while “Code Churn” dominated in the AEEEM dataset.
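The following consolidated sketch reproduces steps 1-4 on synthetic data. The thresholds (median imputation, removal at Pearson |r| > 0.8, scaling to [0, 1], and 95% retained variance) follow the text, while the library choices and data are illustrative assumptions; step 5 can be reproduced by inspecting the feature_importances_ attribute of a fitted XGBoost or LightGBM model.

```python
# Consolidated sketch of the preprocessing pipeline described above (assumed
# scikit-learn/pandas implementation; toy data replaces the real defect datasets).
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=500, n_features=30, random_state=0)
X = pd.DataFrame(X, columns=[f"metric_{i}" for i in range(30)])
X.iloc[::17, 3] = np.nan                              # inject a few missing values

# (1) Median imputation for numerical metrics.
X_imp = pd.DataFrame(SimpleImputer(strategy="median").fit_transform(X),
                     columns=X.columns)

# (2) Drop one feature from every pair with Pearson |r| > 0.8.
corr = X_imp.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.8).any()]
X_red = X_imp.drop(columns=to_drop)

# (3) Min-Max scaling to [0, 1].
X_scaled = MinMaxScaler().fit_transform(X_red)

# (4) PCA retaining at least 95% of the variance.
X_pca = PCA(n_components=0.95).fit_transform(X_scaled)
print(X_red.shape, "->", X_pca.shape)
```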

Within-project defect prediction

To evaluate the defect prediction performance of the model averaging (abbreviated as MA) algorithm, the paper uses the AdaBoost51, J4852, NaiveBayes53, SMO54, LogitBoost55, RandomForest42 and LMT56 algorithms as benchmark models in extensive experiments on the above four datasets. These seven machine learning models are trained with the default parameters of the Weka data mining toolkit, and model evaluation is performed with tenfold cross-validation. The model averaging algorithm (together with XGBoost and LightGBM) also uses tenfold cross-validation, and all results are obtained by repeating the training 20 times.

We utilized the default parameter values provided by the respective machine learning libraries to maintain consistency and fairness across all models. While hyperparameter tuning can potentially enhance model performance, it requires significant computational resources and time, especially when dealing with multiple models and large datasets. Previous research in software defect prediction has shown that default parameters often yield competitive results, serving as a reliable baseline for comparative studies. Therefore, we chose to employ default settings to ensure that our comparisons are based on standardized configurations.
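As a sketch of this protocol, the snippet below evaluates an equal-weight combination of XGBoost and LightGBM (soft voting over predicted probabilities, standing in for the MA combination; the paper's weighting may differ) with stratified tenfold cross-validation repeated 20 times, using default parameters throughout. The data and library choices are illustrative assumptions.

```python
# Sketch of the within-project evaluation protocol: default parameters, tenfold
# cross-validation, 20 repetitions. Soft voting approximates equal-weight model averaging.
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from xgboost import XGBClassifier

X, y = make_classification(n_samples=600, n_features=25, weights=[0.85, 0.15],
                           random_state=0)

ma = VotingClassifier(
    estimators=[("xgb", XGBClassifier(eval_metric="logloss")),
                ("lgb", LGBMClassifier())],
    voting="soft",            # average the base models' predicted probabilities
)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=20, random_state=0)
res = cross_validate(ma, X, y, scoring=["f1", "roc_auc"], cv=cv)
print("F1 :", res["test_f1"].mean())
print("AUC:", res["test_roc_auc"].mean())
```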

Traditional ensemble methods include RandomForest (Bagging), AdaBoost (Boosting), LogitBoost, and gradient boosting frameworks like LightGBM and XGBoost. We selected a range of baseline models, including both single classifiers (e.g., J48, NaiveBayes, SMO, LMT) and ensemble methods (e.g., AdaBoost, LogitBoost, RandomForest, XGBoost, LightGBM), to ensure a comprehensive evaluation of our MA approach. Ensemble methods like AdaBoost and LogitBoost provide strong benchmarks as they integrate multiple weak learners to form a robust classifier, similar to our MA method. This selection allows us to directly compare the performance and advantages of our MA approach against established ensemble techniques.

Table 3 presents the default parameter settings for all algorithms. The benchmark algorithms are introduced as follows:

  1. AdaBoost: AdaBoost is an iterative algorithm that combines different weak classifiers, trained on the training set, into a strong classifier;

  2. J48: J48 builds a decision tree top-down. It first selects an attribute for the root node and splits the dataset into subsets accordingly, then selects an attribute as the root of the subtree for each subset. This process continues until all samples are assigned to their correct classes, yielding the final decision tree.

  3. Naive Bayes: The Naive Bayes algorithm is one of the few probability-based classification algorithms in machine learning. It uses known prior probabilities and probabilistic inference to derive the probability of a sample belonging to each class, and selects the class with the highest probability as the prediction result.

  4. SMO: The SMO algorithm solves the optimization problem arising in support vector machine (SVM) training. It decomposes the large quadratic programming problem into quadratic programming subproblems with only two variables each and solves these subproblems repeatedly until all variables satisfy the KKT conditions.

  5. LogitBoost: LogitBoost repeatedly builds relatively simple weak classifiers on the sample dataset. In each round, misclassified samples are given larger weights so that more attention is paid to the samples that are difficult to distinguish; after multiple rounds, the weak classifiers are combined into a strong classifier, yielding a higher-precision prediction model.

  6. LMT: The LMT algorithm combines an ordinary decision tree with logistic regression. It determines the class by building a logistic regression model that takes all attributes of the corresponding subsample space as independent variables. LMT offers stronger classification performance than either ordinary decision tree models or logistic regression models.

  7. RandomForest: Random Forest performs classification or regression by constructing multiple decision trees. It introduces randomness during training by sampling both data instances and features, thereby generating a diverse set of trees. The final output is obtained by voting (for classification) or averaging (for regression) over the predictions of all the trees.

Table 3 Parameter settings for different algorithms.
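For readers who prefer Python to Weka, the sketch below instantiates rough scikit-learn analogues of some baselines with their default settings. This is an assumption for illustration only: the paper uses the Weka implementations, J48 is only approximated by an entropy-based decision tree, SMO by an SVC, and LogitBoost and LMT have no direct scikit-learn counterpart.

```python
# Rough scikit-learn analogues of several Weka baselines (illustrative, not the
# implementations used in the paper).
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

baselines = {
    "AdaBoost":     AdaBoostClassifier(),
    "J48 (approx)": DecisionTreeClassifier(criterion="entropy"),
    "NaiveBayes":   GaussianNB(),
    "SMO (approx)": SVC(kernel="linear", probability=True),
    "RandomForest": RandomForestClassifier(),
}
```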

In addition, we have compared our results with those reported in the literature to demonstrate the effectiveness of the MA approach. Our MA method consistently outperforms traditional ensemble methods, as well as advanced ensemble techniques such as XGBoost and LightGBM, across various datasets. These comparisons are detailed in Tables 4, 5, 6 and 7, where MA shows superior performance in terms of F1-score and AUC, highlighting its advantages over existing ensemble methods.

Table 4 Results of different methods on AEEEM dataset.
Table 5 Results of different methods on NASA dataset.
Table 6 Results of different methods on ReLink dataset.
Table 7 Results of different methods on SoftLab dataset.

Table 4 presents the test results of various algorithms on the AEEEM dataset. The results indicate that the model averaging algorithm achieves the highest AUC performance across the five projects and generally leads in prediction accuracy, with the only exception being a slightly inferior performance on the LC project. Moreover, its overall performance is markedly superior to that of the seven traditional machine learning models. Although the model averaging algorithm exhibits comparable performance to the XGBoost and LightGBM algorithms on certain metrics, it ranks first in 16 evaluation indicators, whereas XGBoost and LightGBM each outperform in only three indicators. These findings demonstrate that the model averaging algorithm delivers outstanding performance in defect prediction tasks on the AEEEM dataset.

Table 5 summarizes the test results of various algorithms on the NASA dataset. The data show that the model averaging algorithm achieves excellent prediction performance on several sub-datasets (e.g., PC1, PC2, PC3, PC4, KC3, MC2, and MW1), ranking first in a total of 28 evaluation indicators and tying for first in 12 projects. In addition, the performance of XGBoost and LightGBM is very similar, with LightGBM exhibiting a slight advantage by ranking first or tied for first in 9 and 12 evaluation indicators, respectively. Overall, the model averaging algorithm, XGBoost, and LightGBM all demonstrate significantly superior performance compared to the other seven traditional machine learning models.

Table 6 documents the test results of various algorithms on the ReLink dataset. The analysis indicates that LogitBoost delivers the best performance on this dataset, achieving optimal results in five evaluation indicators—marginally outperforming the model averaging algorithm and LightGBM, which each lead in three indicators. Although XGBoost does not achieve the highest score in any individual metric, its overall performance is comparable to that of LightGBM. In general, the model averaging algorithm exhibits a superior overall performance that aligns with ensemble learning principles and outperforms the other seven traditional machine learning models. Notably, on the Safe sub-dataset, despite LightGBM achieving a very high recall, its AUC falls below 0.5, indicating that the model is ineffective for this particular dataset.

Table 7 presents the test results of various algorithms on the SoftLab dataset. The results reveal that the model averaging method clearly demonstrates superior prediction performance, ranking first in eight evaluation indicators. It is followed by LightGBM, which ranks first in three indicators, while XGBoost achieves the best performance in only one indicator. Overall, the performance of the model averaging method significantly surpasses that of the other seven traditional machine learning models.

Cross-project defect prediction

This paper considers the problem of cross-project defect prediction for different projects in the same dataset; that is, the attribute and feature sets of the source project and target project are the same. To evaluate the performance of the methods proposed in the paper, four methods, meCom1741, TCA40, peUpMeCom43 and PCAEnsemble42 are used as the benchmark models:

  1. meCom17: meCom17 is an efficient method for cross-project software defect prediction, which improves the homogeneity of the training data and target data by normalizing the training data instead of the target data.

  2. TCA: The main motivation for TCA is that, although the observed domain characteristics differ, the source and target domains may share common latent factors. Both domains are projected into a new space that reveals these latent factors, which reduces the differences between the domains while preserving the original data structure.

  3. peUpMeCom: Pearson feature selection is introduced to address data redundancy, and metric compensation based on transfer learning is used to handle the large differences in data distribution between the source and target projects.

  4. PCAEnsemble: PCA is first used for dimensionality reduction, SMOTE is then used for oversampling, and finally random forest and XGBoost classifiers are combined into an ensemble defect prediction model.

To evaluate the effectiveness of the model averaging method, we conducted cross-project defect prediction experiments on the AEEEM and ReLink datasets. Taking the AEEEM dataset as an example, when EQ is selected as the source project and JDT as the target project, the setting is written as EQ => JDT, where the left side of => denotes the source project and the right side denotes the target project. Tables 8 and 9 show the experimental results of the four benchmark models and MA on the selected datasets. To improve the predictive performance of model averaging, we additionally apply PCA for feature dimensionality reduction. The model averaging method has a clear advantage in the F1 scores on both datasets, its AUC is close to that of peUpMeCom and PCAEnsemble, and its overall performance is significantly better than that of the other two benchmark algorithms. For JDT => ML on the AEEEM dataset, the F1 score of the model averaging algorithm MA is 0.39 higher than that of the second-best method, peUpMeCom, and its AUC is 0.07 higher. For Safe => ZXing on the ReLink dataset, the F1 score of MA is 0.28 higher than that of the second-best peUpMeCom, and its AUC is 0.03 higher.

Table 8 Results of different methods on AEEEM Dataset.
Table 9 Results of different methods on ReLink dataset.
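The sketch below illustrates the cross-project setting on synthetic stand-ins for a source project (e.g., EQ) and a target project (e.g., JDT) that share the same metric set: Min-Max scaling and PCA are fitted on the source project only, and the MA prediction is taken as the mean of the XGBoost and LightGBM defect probabilities. The exact preprocessing order and MA weights in the paper may differ.

```python
# Illustrative cross-project sketch (source => target) with PCA and equal-weight
# model averaging; data are synthetic, not the AEEEM/ReLink projects.
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.preprocessing import MinMaxScaler
from xgboost import XGBClassifier

X_src, y_src = make_classification(n_samples=400, n_features=30, weights=[0.8, 0.2],
                                   random_state=1)   # stand-in for the source project
X_tgt, y_tgt = make_classification(n_samples=300, n_features=30, weights=[0.85, 0.15],
                                   random_state=2)   # stand-in for the target project

scaler = MinMaxScaler().fit(X_src)
pca = PCA(n_components=0.95).fit(scaler.transform(X_src))
Z_src = pca.transform(scaler.transform(X_src))
Z_tgt = pca.transform(scaler.transform(X_tgt))

xgb = XGBClassifier(eval_metric="logloss").fit(Z_src, y_src)
lgb = LGBMClassifier().fit(Z_src, y_src)

# Model averaging: mean of the two base models' defect probabilities on the target.
proba = (xgb.predict_proba(Z_tgt)[:, 1] + lgb.predict_proba(Z_tgt)[:, 1]) / 2
print("F1 :", f1_score(y_tgt, (proba >= 0.5).astype(int)))
print("AUC:", roc_auc_score(y_tgt, proba))
```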

Compared with the benchmark methods, the model averaging method proposed in this paper yields a better defect prediction model when handling data distribution differences across domains. To further verify its overall performance, we conducted pairwise tests on the five projects in the AEEEM dataset, yielding 20 source-target combinations; the average F1 of the model averaging algorithm over these combinations was 0.824.
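The 20 combinations are simply the ordered source-target pairs over the five AEEEM projects:

```python
# Enumerate the 20 ordered source => target pairs over the five AEEEM projects.
from itertools import permutations

projects = ["EQ", "JDT", "LC", "ML", "PDE"]
pairs = list(permutations(projects, 2))
print(len(pairs))    # 20
print(pairs[:3])     # [('EQ', 'JDT'), ('EQ', 'LC'), ('EQ', 'ML')]
```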

Statistical significance analysis

To ensure that the performance improvements of the Model Averaging (MA) method over the baseline models are statistically significant, we performed the Wilcoxon signed-rank test on the F1-scores obtained from all datasets57. This non-parametric test is suitable for comparing paired samples without assuming a normal distribution of differences.
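A minimal sketch of this test with SciPy is shown below; the paired F1 vectors are placeholders rather than the paper's measured scores.

```python
# Wilcoxon signed-rank test on paired F1-scores (MA vs. one baseline).
import numpy as np
from scipy.stats import wilcoxon

f1_ma       = np.array([0.82, 0.79, 0.85, 0.77, 0.81, 0.84, 0.80, 0.78])
f1_baseline = np.array([0.75, 0.74, 0.80, 0.73, 0.78, 0.79, 0.76, 0.74])

stat, p_value = wilcoxon(f1_ma, f1_baseline)   # paired, non-parametric
print(f"W = {stat:.3f}, p = {p_value:.4f}")
```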

Table 10 presents the results of the Wilcoxon signed-rank tests, indicating whether the differences in F1-scores between the MA method and each baseline model are statistically significant.

Table 10 Wilcoxon signed-rank test results for F1-score comparisons.

The statistical analysis presented in Table 10 demonstrates that the Model Averaging (MA) method significantly outperforms several baseline models across multiple datasets. In particular, MA shows highly significant improvements over AdaBoost, J48, SMO, LogitBoost, LMT, and RandomForest—with p-values well below the 0.01 threshold—indicating that the superior performance of MA is not due to random chance and is consistently robust across different datasets.

In contrast, the comparisons between MA and NaiveBayes (p = 0.0574), MA and XGBoost (p = 0.0500), and MA and LightGBM (p = 0.0580) yield marginally significant results. This suggests that while the MA method tends to perform slightly better than these baselines, the differences in F1-scores are less pronounced.

Overall, the Wilcoxon signed-rank test confirms the robustness and effectiveness of the MA method in software defect prediction tasks. The significant enhancements in F1-scores across most baseline comparisons underscore MA’s potential as a superior predictive tool in the software engineering domain.

Model averaging (MA) method analysis

Our proposed model averaging (MA) method is based on the theory of ensemble learning. Ensemble learning integrates multiple learning models and improves the accuracy and stability of predictions by allowing these models to collectively vote on the final outcome. We apply this theory in our MA method by using XGBoost and LightGBM as our base models and averaging their prediction results to obtain the final prediction. The advantage of this method is that even if each individual model has its own weaknesses and biases, when the models are integrated, these weaknesses and biases tend to be mutually offset, thus making the overall prediction more accurate and stable.

At its core, the model averaging method, which employs different learning algorithms (XGBoost and LightGBM), introduces diversity into the solution. Diversity between models is an important factor in enhancing the predictive performance of ensemble learning. As each model may excel in different areas or scenarios, having a diverse set of models makes the ensemble more robust and accurate.

The test results for the AEEEM, NASA, ReLink, and SoftLab datasets demonstrate that our model averaging method outperforms the individual XGBoost and LightGBM algorithms and other traditional machine learning models. This is primarily because our model can fully exploit the advantages of both the XGBoost and LightGBM models while avoiding their respective drawbacks. For instance, while XGBoost is very powerful in handling sparse data and feature selection, it may fall short in dealing with noisy data and preventing overfitting. LightGBM, on the other hand, has a significant advantage in handling large-scale data and enhancing training efficiency, but it may not be as good as XGBoost in handling sparse data and feature selection. Through model averaging, our method can suppress the drawbacks of these models while preserving their strengths, thus achieving the best overall prediction results on most datasets.

For cross-project defect prediction, our MA method shows superior performance. This is likely because it can better handle differences between different projects. When the training data (source project) and testing data (target project) have different distributions, a single model might not generalize well. However, our MA method, by leveraging the power of multiple models, can adapt more effectively to these differences.
