Research on Data Resource Pricing Method Based on SSA-XGBoost Model

Jian YANG, Yajuan CHEN, Liwei CHANG, Yali LÜ

Journal of Systems Science and Information, 2025, 13(1): 116-136. DOI: 10.21078/JSSI-2024-0074

Abstract

Data pricing is a key link in promoting the efficient circulation of data in the market. However, existing methods are still insufficient in terms of pertinence, dynamism, and comprehensiveness. We therefore propose a data pricing prediction model based on XGBoost optimized by the sparrow search algorithm (SSA), aiming to provide a reference for pricing decisions in the data market. First, we crawled the data transaction information of Youedata.com and performed preprocessing operations on the dataset, including outlier handling, one-hot encoding, and logarithmic transformation. Second, we conducted exploratory data analysis to understand the distribution of the data and the correlations among features. We then used the LASSO algorithm to select features and constructed a data pricing prediction model based on SSA-XGBoost. Finally, we compared it with six machine learning models: LightGBM, GBDT, MLP, KNN, LR, and XGBoost. The experimental results show that, in terms of R-squared, the prediction results of the proposed SSA-XGBoost model exceed those of LightGBM, GBDT, MLP, KNN, LR, and XGBoost by 7.4%, 4.9%, 7.1%, 23.8%, 12.8%, and 2.3%, respectively, and are superior to state-of-the-art work. Furthermore, its results on the five indicators MSE, RMSE, MAE, MAPE, and RMSPE are better than those of the other models, showing higher stability.

Key words

data pricing / LASSO / SSA-XGBoost / machine learning

Jian YANG , Yajuan CHEN , Liwei CHANG , Yali LÜ. Research on Data Resource Pricing Method Based on SSA-XGBoost Model. Journal of Systems Science and Information, 2025, 13(1): 116-136 https://doi.org/10.21078/JSSI-2024-0074

1 Introduction

With the rapid development and widespread adoption of information technologies such as the Internet of Things, big data, and cloud computing, global data volume has shown exponential growth, and the importance of data has attracted much attention in the era of big data. However, the data field is naturally fragmented and monopolized, with mismatched supply and demand and data silos still hindering the realization of data value[1]. To facilitate the realization of data value, the efficient exchange and optimal allocation of data assets, and thereby data sharing, a mechanism can be established to transfer data usage rights to third parties so that data holders profit directly or indirectly. For instance, through data trading markets, data holders can sell data usage rights to organizations or institutions in need, which both meets the needs of big data applications and promotes the realization of data value. According to the "2023 China Data Transaction Market Research and Analysis Report", the market size of China's data trading industry is expected to continue to grow steadily, reaching 204.6 billion yuan by 2025 and 515.59 billion yuan by 2030, a compound annual growth rate of approximately 20.3% from 2025 to 2030. Over the next decade, the compound annual growth rate of China's data trading market will be significantly higher than the global level.
The key to establishing a data trading market lies in the reasonable pricing of data. Price is one of the essential elements of commodity exchange. Compared to traditional commodities, data products exhibit new characteristics, namely zero marginal cost, high fixed costs, and sunk costs[2], which make traditional pricing mechanisms unsuitable for them. An equitable data valuation framework is crucial for the robust evolution of the data trading marketplace, so designing a reasonable pricing mechanism for data resources has become an urgent issue. To this end, we propose a novel data pricing model based on a machine learning framework.
This paper makes the following key contributions. First, we crawled the real data transaction information of Youedata.com, performed preprocessing operations such as outlier removal and data-distribution adjustment on the samples, and used the LASSO algorithm for feature selection, providing the data foundation for the construction of the pricing model.
Second, we introduced a data pricing framework utilizing the SSA-XGBoost algorithm. By incorporating the sparrow search algorithm to fine-tune the parameters of XGBoost, we enhanced the predictive accuracy and robustness of the XGBoost model in price forecasting.
Furthermore, a comparative analysis was conducted between the SSA-XGBoost model and six baseline models. The experimental results of five evaluation indicators, such as MSE, verified the superior performance of the SSA-XGBoost model in price prediction.
The structure of this paper is as follows. Section 2 provides a review of related literature. Then the dataset is described, and data preprocessing and feature selection are performed in Section 3. The SSA-XGBoost model is constructed in Section 4. Section 5 presents and discusses the experimental findings. Section 6 concludes this paper and presents the next research agenda.

2 Literature Review

2.1 Data Pricing

The data pricing model is an important component of data pricing research. Many scholars have proposed models from different perspectives, forming several broad categories.
1) Pricing models based on data ontology and profit maximization. These methods focus on reflecting the real prices of different dimensions of the data itself. Yang, et al.[3] proposed a pricing framework based on quality dimensions, adjusting data quality dimensions according to users' willingness to pay to obtain quality scores for further pricing. Yu, et al.[4] approached the problem from the perspective of information amount, fully considering the scarcity of data. Focusing on the effective quantity and distribution of data rather than its content and quality, they established a functional relationship between information entropy and price, mapping entropy to price through a connection function. Yang, et al.[5] developed a nonlinear model focused on willingness to pay (WTP) by analyzing consumers' self-selection behaviors, employing a bi-level programming model to address the optimal pricing of personal privacy data.
2) Pricing models based on academic research and operational practice. Cong, et al.[6], from a data science and data market perspective, assessed the value of data by examining the predictive efficacy of machine learning algorithms and illustrated their findings with pertinent examples. Chen, et al.[7] designed a framework for a personal data market, considering in sequence personal privacy loss quantification, privacy compensation, and query pricing. Cai, et al.[8] proposed arbitrage-free query pricing based on tuple importance, proving the model's monotonicity and boundedness and thereby ensuring fairness and feasibility in pricing.
3) Pricing models based on economics and game theory. These methods focus on the comprehensive factors at play in specific market scenarios. Alorwu, et al.[9] constructed a personal health data pricing model based on second-price sealed auctions, promoting Pareto optimality between transaction parties. Cheng, et al.[10] proposed a blockchain-enhanced data market framework with cloud computing as an auxiliary, using a Stackelberg game to maximize the interests of market participants. Pandey, et al.[11] proposed a fair negotiation method, adopting the Rubinstein bargaining model to determine the price of data and the value of privacy loss, ensuring fair transactions.
In summary, scholars have made significant progress in data pricing research. However, current research predominantly focuses on mathematical model derivation; existing methods have limited applicability and inherent issues, and empirical research and econometric analysis specific to particular data trading markets are lacking. Therefore, it is necessary to explore a more scientific and widely applicable data pricing method.

2.2 XGBoost Model

Machine learning is capable of automatically analyzing large-scale data, processing rapidly changing and complex datasets, and providing efficient and accurate prediction and decision support. Among them, XGBoost, Gradient Boosting Decision Tree, Random Forest, Support Vector Machine Regression, and Multilayer Perceptron have been widely applied in various price prediction studies and have achieved good application effects[12-15].
We aim to utilize the XGBoost model to predict data resource prices, benefiting from its low computational complexity, rapid execution, and high precision[16]. XGBoost can be applied to classification, regression, and anomaly detection, and is widely used across fields. In the medical field, Budholiya, et al.[17] used a Bayesian-optimized XGBoost classifier to predict heart disease and compared its predictive performance with random forest and extra trees classifiers, showing that the proposed model performed better. Deng, et al.[18] introduced a technique that integrates XGBoost with MOGA for cancer classification; the empirical analysis indicated that XGBoost-MOGA outperformed previous state-of-the-art algorithms in F-score, accuracy, recall, and other evaluation metrics. In risk assessment, Ma, et al.[19] used XGBoost to assess flash flood risk and found that it outperformed the LSSVM model on several predictive evaluation metrics. Wang, et al.[20] compared XGBoost, K-nearest neighbors, and decision tree classifiers for personal credit risk assessment, concluding that XGBoost performed well in terms of accuracy and AUC. In price prediction, XGBoost has also performed outstandingly. Jabeur, et al.[21] compared the predictive performance of six models, including XGBoost, for gold prices; empirical analysis revealed that the XGBoost model outperformed the other machine learning models. Avanijaa, et al.[22] used XGBoost regression to predict house prices, aiding customers in deciding when and where to buy a house. Wu, et al.[23] introduced a PSO-optimized XGBoost model for forecasting electricity prices; comparisons with ARIMA, LSTM, SVM, RW, and plain XGBoost showed that the PSO-XGBoost model predicted better. Furthermore, XGBoost demonstrates excellent predictive capabilities in other types of prediction tasks and has a wide range of applications[24, 25].
It is particularly important to emphasize that hyperparameter optimization is a key link in improving the performance of XGBoost. Its performance can be significantly improved by advanced hyperparameter optimization techniques such as Bayesian optimization or swarm intelligence[26, 27], which systematically explore the hyperparameter space to find the configuration that maximizes the efficiency and effectiveness of the model.

3 Proposed Methodology

This study introduces an approach for enhancing XGBoost parameter optimization utilizing the sparrow search algorithm (SSA), with the goal of examining the effectiveness and applicability of machine learning models in predicting data resource prices. Firstly, we collected and preprocessed transaction information from youedata.com[28], which involved removing outliers, converting categorical variables into dummy variables, and adjusting the data distribution using logarithmic transformations. Then, we conducted exploratory analysis on the preprocessed data to further understand the correlations between data features and employed the lasso algorithm for feature selection. Finally, we constructed a data price prediction model based on the SSA-XGBoost algorithm and compared its predictive performance against LightGBM, GBDT, MLP, KNN, LR, and XGBoost using six evaluation metrics. Figure 1 shows the framework flowchart of our proposed machine learning-based data price prediction model.
Figure 1 The proposed flowchart for machine learning-based data pricing


3.1 Dataset Introduction

Youedata.com was launched in November 2016 as an online platform providing data intelligence services, including online transactions of big data products such as APIs and block data. The platform contains data product resources from 13 industries, including public data. At the time of collection, 6,445 data products were available, carrying 92 category labels.
This study collected 3,328 sample data points from the block data trading information on Youedata.com; after data completeness checks, 3,301 valid samples remained. The dataset mainly covers three major industry categories: industrial economics, healthcare and medicine, and scientific research and technology, with 17 sub-category labels. Among these, four labels (higher education, development tools, smart healthcare, and regional economics) carry no price information for their data products, and the labels "COVID-19" and "pandemic" have only four entries in total, so these labels were removed from the dataset. Ultimately, our dataset contains 11 product labels. Each entry includes the data name, product price, data volume, industry category, commodity label, sales volume, and quality scores for some data dimensions. Table 1 outlines the features included in the dataset and their descriptions.
Table 1 Description of features
No. Feature Description
1 Price The price at which data products are sold in data transactions
2 Data Volume The data scale of selling data products
3 Scarcity Score Scoring of the supply and demand differences of data resources
4 Consistency Score Scoring of data compliance with unified standards
5 Applicability Score Scoring of the benefits that data can bring
6 Structural Level Score Scoring of the organization and formatting degree of data
7 Data Quantity Score Scoring of the scale of selling data products
8 Redundancy Score Scoring of the degree of redundant information contained in data
9 Completeness Score Scoring of the information missing situation in the data
10 Timeliness Score Scoring of the time interval from data generation to use
11 Sales Volume The quantity of data product sales
12 Industry category Including Industrial economy, Healthcare and medicine, and Scientific research and technology.
13 Commodity label Scientific research and technology: Patents, Research data; Industrial economy: Statistical yearbook, Import and export, Supply and demand of agricultural products, Natural resources; Healthcare and medicine: Domestic pharmaceuticals, Imported drugs, Pharmaceutical bidding, Drug procurement, Pharmaceutical companies
Because this study focuses on pricing data products that do not yet have prices, sales volume is not used as an input indicator for the model. Commodity labels further subdivide the industry categories, so only the finer-grained commodity labels are selected as input indicators. In the end, we chose 11 features as input indicators of the model and the data product price as the output indicator.

3.2 Data Preprocessing

As shown in Table 2, we conducted a descriptive statistical analysis of the dataset features. The median of the price indicator is 4, meaning that half of the products in the crawled data have relatively low prices. The average sales volume is 26.94, indicating that sales volumes on Youedata.com are considerable. However, a joint look at the mean, variance, maximum, and minimum shows that the dataset contains outliers and that the value ranges of the features differ greatly. To reduce the negative impact of these variables on model construction, the data need to be preprocessed to improve data quality and the analysis results.
Table 2 Descriptive statistical analysis
Feature Mean Median Variance Skewness Min Max
Price 783.578 4 3512112 4.171 0 19421
Data Volume 98523.46 1024 1.09e+1 6.015 0.23046 5242880
Sales Volume 26.94 25 376.979 0.658 0 242
Rating 3.92 4 0.167 0.547 3 5
Scarcity Score 3.98 4 0.172 0.547 3 5
Consistency Score 4.3 4 0.181 0.357 3 5
Applicability Score 3.99 4 0.616 0.018 2 5
Structural Level Score 3.99 4 0.609 0.003 2 5
Data Quantity Score 3.98 4 0.616 0.008 2 5
Redundancy Score 4.05 4 0.624 0.012 2 5
Completeness Score 4.05 4 0.624 0.005 3 5
Timeliness Score 4.45 4 0.611 0.016 3 5
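
For reference, the statistics in Table 2 can be computed directly with pandas. The following is a minimal sketch; the file and column names are illustrative assumptions, not the authors' artifacts.

```python
import pandas as pd

# Reproduce the Table 2 descriptive statistics for the crawled dataset.
# "youedata_block_data.csv" is a hypothetical file name.
df = pd.read_csv("youedata_block_data.csv")

stats = pd.DataFrame({
    "Mean": df.mean(numeric_only=True),
    "Median": df.median(numeric_only=True),
    "Variance": df.var(numeric_only=True),
    "Skewness": df.skew(numeric_only=True),
    "Min": df.min(numeric_only=True),
    "Max": df.max(numeric_only=True),
})
print(stats.round(3))
```
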

3.2.1 Outlier Handling

Handling outliers in data is a crucial step in data preprocessing. By addressing outliers, models can more effectively cope with data variations and noise, enhancing the precision and dependability of data analysis and machine learning, which in turn supports more informed decision-making.
In many business scenarios, the price of goods is typically not zero. A price of zero may indicate special situations such as giveaways, test items, or samples, which do not reflect the normal market pricing of the goods. Including such data in predictive models can introduce bias and affect the model's predictive performance. Therefore, it is advisable to exclude samples with a price of zero.
We use box plots to analyze the dispersion of the dataset. Preliminary observation shows significant differences in price across product labels, so we plot box plots with "product label" and "price" as indicators. Excluding all outliers would reduce the generalizability of the model; hence, we exclude only extreme outliers and allow moderately deviating values to remain. Figure 2 shows the box plots of prices for some product labels.
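
A minimal sketch of this outlier-handling step is given below. The 3*IQR boxplot fence used for "extreme" outliers is a common convention and an assumption, since the paper does not state its exact fence; column names are likewise illustrative.

```python
import pandas as pd

def drop_extreme_outliers(df: pd.DataFrame, group: str = "commodity_label",
                          col: str = "price") -> pd.DataFrame:
    """Drop zero prices and extreme per-label price outliers."""
    # Zero prices denote giveaways, test items, or samples, not market prices.
    df = df[df[col] > 0]

    def fence(g: pd.DataFrame) -> pd.DataFrame:
        # 3*IQR fence: keep moderately deviating values, drop extreme ones.
        q1, q3 = g[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        return g[(g[col] >= q1 - 3 * iqr) & (g[col] <= q3 + 3 * iqr)]

    return df.groupby(group, group_keys=False).apply(fence)
```
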
Figure 2 Box plot of prices corresponding to different data product labels


3.2.2 Data Skewness Handling

From the skewness statistics in Table 2, it can be observed that the data volume and price features exhibit significant skewness, which can lead to heteroscedasticity in the regression analysis. We therefore applied a logarithmic transformation to the price distribution; after the transformation, the skewness of the prices is 0.314, significantly improving the right skewness. The distribution of the price data before and after the logarithmic transformation is shown in Figure 3. We likewise applied a logarithmic transformation to the data volume feature, yielding a skewness of 0.014. Handling the skewness allows the predictive model to be trained on a more balanced dataset, thereby ensuring the reliability of the model.
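
A minimal sketch of the transformation follows. Whether the authors used log or log1p is not stated; log1p is chosen here to tolerate values close to zero, and the file and column names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("youedata_block_data.csv")  # hypothetical file name
# Adjust the right-skewed distributions of price and data volume.
for col in ("price", "data_volume"):
    df[f"log_{col}"] = np.log1p(df[col])
    print(col, "skewness after transform:", round(df[f"log_{col}"].skew(), 3))
```
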
Figure 3 Price transformation


3.2.3 One-Hot Encoding

The commodity labels in our dataset are categorical and unordered, so one-hot encoding is required. We selected 11 categories of product labels, including patents, research data, and statistical yearbooks. For a given commodity label, a data point belonging to that category is assigned a value of 1; otherwise, it is assigned 0. After encoding, the commodity label expands into 11 binary features, and the dataset ultimately contains 21 features. For example, if a data point's commodity label is SDAP (i.e., supply and demand of agricultural products), the SDAP feature of that point becomes 1 after one-hot encoding, while the remaining 10 label features are 0.
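
A minimal sketch of this encoding with pandas, under the same illustrative column names as above:

```python
import pandas as pd

df = pd.read_csv("youedata_block_data.csv")  # hypothetical file name
# Expand the unordered commodity label (11 categories) into 11 binary columns,
# bringing the dataset to 21 features in total.
dummies = pd.get_dummies(df["commodity_label"], prefix="label", dtype=int)
df = pd.concat([df.drop(columns="commodity_label"), dummies], axis=1)
```
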

3.3 Exploratory Data Analysis

By conducting exploratory data analysis, we can gain insights into the fundamental properties of the data, evaluate its quality, and uncover relationships between different variables. Figure 4 shows the average prices corresponding to different data product labels. We can see significant differences in their average prices. Specifically, the average price of statistical yearbooks is 0.11, imported drugs is 0.33, natural resources is 6, research data is 12.73, and pharmaceutical companies is 12.79. These data products have relatively low prices. However, the prices for the labels "import and export" and "patents" are significantly higher than those of other labels, with average prices of 489.04 and 2245.02 respectively. Therefore, there is a correlation between data commodity labels and their prices.
Figure 4 Average prices of different product labels


Furthermore, research data and patents share the industry category of scientific research and technology, yet their average prices differ by 2232.29. Import and export and statistical yearbooks both belong to industrial economics, with an average price difference of 488.93. Imported drugs and drug procurement both pertain to healthcare and medicine, with an average price difference of 99.07. Hence, it is reasonable to choose the sub-labels of commodity labels, rather than the industry category itself, as feature inputs.
The correlation heatmap is constructed from the correlation coefficients between feature variables and provides a preliminary view of the relationships between the variables in the dataset. Because the dataset contains discrete variables, we use the Spearman correlation coefficient for the analysis.
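
A minimal sketch of how such a heatmap can be produced, assuming df is the preprocessed DataFrame from the sketches above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Spearman correlation heatmap of the preprocessed features (cf. Figure 5).
corr = df.corr(method="spearman", numeric_only=True)
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.tight_layout()
plt.show()
```
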
As shown in Figure 5, SDAP is negatively correlated with data volume (−0.66), indicating that as the data volume increases, the demand for SDAP decreases, possibly due to market saturation, data quality issues, or information overload. In addition, data volume is positively correlated with price (0.46): the larger the data volume, the higher the price. The correlation between patents and price is 0.56, meaning that data products with the patent label tend to be more expensive.
Figure 5 The correlation heatmap


Drawing a heatmap of the correlations provides an initial visual overview of the relationships between variables. This preliminary analysis helps in understanding the logical relationships among the variables, but it alone is not sufficient for the subsequent feature selection.

3.4 Feature Selection

This paper selects the LASSO algorithm[29] for feature selection. Its basic principle is to add an L1-norm penalty on the model coefficients, constraining the complexity of the model and shrinking unimportant regression coefficients to zero, thereby eliminating certain features and yielding a sparser, more efficient model. The formula is given below.
$$\Phi(\omega)=\sum_{i=1}^{k}\Big(y_i-\sum_{j=1}^{t}\omega_j x_{ij}\Big)^{2}+\xi\sum_{j=1}^{t}\lvert\omega_j\rvert,$$
(1)
where $t$ is the number of features in the dataset, $k$ the number of samples, $\omega$ the vector of regression coefficients, $x_{ij}$ the value of the $j$-th feature of sample $i$, $y_i$ the true product price of sample $i$, and $\xi$ a parameter controlling the strength of the regularization.
We used the LASSO feature selection method to screen the features and finally selected 19 of them, such as data volume, patents, and completeness score, while removing two: SDAP (i.e., supply and demand of agricultural products) and the applicability score, as shown in Figure 6.
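
A minimal sketch of this selection step with scikit-learn's LassoCV follows. X (the 21 features), y (the log-transformed price), and feature_names are assumed to come from the preprocessing above, and the cross-validated choice of the penalty strength ξ (scikit-learn's alpha) is an implementation assumption.

```python
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Standardize so the L1 penalty treats all features comparably.
X_std = StandardScaler().fit_transform(X)

# Cross-validated LASSO; features with zero coefficients are dropped.
lasso = LassoCV(cv=5, random_state=0).fit(X_std, y)
kept = [name for name, w in zip(feature_names, lasso.coef_) if abs(w) > 1e-8]
print(f"alpha={lasso.alpha_:.4f}, kept {len(kept)} features:", kept)
```
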
Figure 6 Lasso feature selection


4 Model Construction

4.1 XGBoost Algorithm

XGBoost is a robust machine learning technique known for its accuracy and ability to generalize well[30]. It enhances performance by integrating multiple weak models, like decision trees, and refines the overall model through the gradient boosting method. The regression problem in the XGBoost algorithm involves the following formulas and principles.
Assume a dataset $D=\{(x_i,y_i): i=1,2,\ldots,n,\ x_i\in\mathbb{R}^m,\ y_i\in\mathbb{R}\}$, where $n$ is the number of samples and each sample has $m$ features. Given $Z$ regression trees, $x_i$ represents the feature vector of the $i$-th data point, $f_z$ is one of the regression trees, and $F$ represents the function space containing the $Z$ trees. XGBoost generates the final output by aggregating the predictions of the regression trees, as shown in the following formula.
$$\hat{y}_i=\sum_{z=1}^{Z}f_z(x_i),\qquad f_z\in F.$$
(2)
The objective function of the XGBoost is shown as:
$$\mathrm{Objective}=\sum_{i=1}^{n}l(y_i,\hat{y}_i)+\sum_{z=1}^{Z}\Omega(f_z),$$
(3)
where $\hat{y}_i$ is the predicted value, $y_i$ is the true value, $l(y_i,\hat{y}_i)$ is the loss function, which describes the deviation of the prediction, and $\Omega(f_z)$ measures the complexity of the model. The objective function is minimized by expanding the loss with a second-order Taylor approximation, which simplifies the optimization and yields a close approximation of the objective, from which the optimal model is derived.
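
For concreteness, a baseline XGBoost regressor on the selected features might be set up as follows. The hyperparameter values are placeholders that the SSA search of Section 4.3 replaces, and X_train, y_train, X_test denote the training/test split described in Section 5.1.

```python
from xgboost import XGBRegressor

# Baseline XGBoost regressor; hyperparameters here are placeholder values.
model = XGBRegressor(
    n_estimators=300,
    learning_rate=0.1,
    max_depth=6,
    min_child_weight=1,
    objective="reg:squarederror",
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```
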

4.2 Sparrow Search Algorithm

The sparrow search algorithm is a swarm intelligence algorithm inspired by the foraging behavior of sparrows[31]. It mainly consists of two stages: the discoverer (producer) phase and the joiner (scrounger) phase. In the discoverer phase, the discoverers identify and occupy relatively good feeding sites. In the joiner phase, the joiners trail the discoverers and expand the exploration area to avoid convergence to a local optimum. In addition, a fraction of the population acts as alarmers that watch for danger.
In a $D$-dimensional search space with $N$ sparrows, the position of the $i$-th sparrow is denoted as $X_i=[x_{i,1},x_{i,2},\ldots,x_{i,d},\ldots,x_{i,D}]$.
1) The update of the producer position is shown in Formula (4).
$$X_{i,d}^{t+1}=\begin{cases}X_{i,d}^{t}\cdot\exp\!\Big(\dfrac{-i}{\alpha\cdot iter_{\max}}\Big), & R<ST,\\[4pt] X_{i,d}^{t}+Q\cdot L, & R\ge ST.\end{cases}$$
(4)
Let $t$ denote the current iteration and $iter_{\max}$ the maximum number of iterations. The parameter $\alpha$ is a random number uniformly distributed in $(0,1]$, $Q$ is a random number following a normal distribution, $L$ is a $1\times d$ vector, $R\in[0,1]$ is the warning value, and $ST\in[0.5,1]$ is the safety threshold. If $R<ST$, the search environment is secure, allowing the discoverer to conduct a broad search, which improves the population's fitness. If $R\ge ST$, a sparrow in the population has detected predators and raises an immediate alarm; the population then rapidly relocates to a safe zone to evade predation.
2) The update of the scrounger position is shown in Formula (5).
$$X_{i,d}^{t+1}=\begin{cases}Q\cdot\exp\!\Big(\dfrac{X_{\mathrm{worst}}^{t}-X_{i,d}^{t}}{i^{2}}\Big), & i>N/2,\\[4pt] X_{b}^{t+1}+\big|X_{i,d}^{t}-X_{b}^{t+1}\big|\cdot A^{+}\cdot L, & i\le N/2,\end{cases}$$
(5)
where $X_b$ and $X_{\mathrm{worst}}$ are the current global best and global worst positions found by the searchers, respectively, $A$ is a $1\times d$ matrix, and $A^{+}=A^{\mathrm{T}}(AA^{\mathrm{T}})^{-1}$. When $i>N/2$, the $i$-th sparrow has lower fitness and is in a very hungry state, and needs to fly elsewhere to forage for energy.
3) The update of the alerter position is shown in Formula (6).
$$X_{i,d}^{t+1}=\begin{cases}X_{i,d}^{t}+K\cdot\dfrac{\big|X_{i,d}^{t}-X_{\mathrm{worst}}^{t}\big|}{f_i-f_w+\epsilon}, & f_i=f_g,\\[4pt] X_{\mathrm{best}}^{t}+\beta\cdot\big|X_{i,d}^{t}-X_{\mathrm{best}}^{t}\big|, & f_i>f_g,\end{cases}$$
(6)
where $X_{\mathrm{best}}$ is the current global best position, $K$ denotes the sparrow's movement direction, $\beta$ is a step-control parameter governed by the Cauchy distribution, and $\epsilon$ is a small positive constant that prevents division by zero. The fitness value of the $i$-th sparrow is $f_i$, while $f_g$ and $f_w$ are the current global best (minimum) and worst fitness values in the sparrow population, respectively. If $f_i=f_g$, the sparrow in the middle of the population has sensed the threat and proactively approaches others of its kind to reduce its risk of predation.

4.3 SSA-XGBoost Model

Poorly chosen hyperparameters may cause unstable and inaccurate predictions, so an optimization technique is required to search for the globally optimal configuration. The sparrow search algorithm (SSA) can make the parameter selection in XGBoost's hyperparameter tuning more systematic, allowing optimal parameters to be used in data product price prediction while minimizing prediction errors. We focus on optimizing four parameters: n_estimators, learning_rate, max_depth, and min_child_weight, while using default values for the others. The fitness function is set as the mean squared error.
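
A minimal sketch of this fitness function, assuming a held-out validation split (X_tr, y_tr, X_val, y_val) that the paper does not specify:

```python
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

def fitness(position, X_tr, y_tr, X_val, y_val):
    """Fitness of one sparrow: validation MSE of XGBoost trained with the
    hyperparameters encoded by its 4-dimensional position
    (n_estimators, learning_rate, max_depth, min_child_weight)."""
    n_est, lr, depth, mcw = position
    model = XGBRegressor(
        n_estimators=int(round(n_est)),
        learning_rate=float(lr),
        max_depth=int(round(depth)),
        min_child_weight=float(mcw),
    )
    model.fit(X_tr, y_tr)
    return mean_squared_error(y_val, model.predict(X_val))
```
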
Algorithm 1: Pseudocode of SSA-XGBoost
Input: number of sparrows N, number of producers Ds, maximum iterations M, number of sparrows aware of danger Ts, safety threshold ST
Output: n_estimators, learning_rate, max_depth, min_child_weight
1 t ← 1;
2 Initialize the population {x1, x2, ..., xN} and evaluate the fitness (XGBoost validation MSE) of each sparrow;
3 while t < M do
4   Rank the sparrows by fitness and record the current best and worst individuals;
5   Update the positions of the Ds producers by Formula (4);
6   Update the positions of the scroungers by Formula (5);
7   Update the positions of the Ts danger-aware sparrows by Formula (6);
8   Re-evaluate the fitness values and update the global best position gbest;
9   t ← t + 1;
10 end
11 (n_estimators, learning_rate, max_depth, min_child_weight) ← gbest
Algorithm 1 outlines the pseudocode for optimizing the XGBoost model using the Sparrow Search Algorithm. The process consists of the following steps:
Step 1 Set up the initial parameters, including the SSA population size, the maximum iteration count, and the XGBoost parameters to be optimized together with their search ranges.
Step 2 Evaluate the fitness values and rank the sparrow population accordingly. Update the positions of producers, scroungers, and alarmers using Formulas (4), (5), and (6), respectively.
Step 3 Compare the fitness values of the new positions with the current best value and update the global best information.
Step 4 Verify if the iteration termination condition is met. If it is, produce the optimal sparrow position; if not, return to Step 2.
Step 5 Using the results of SSA optimization, establish the XGBoost data pricing prediction model by determining the number of estimators, learning rate, maximum depth, and minimum child weight.
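
The following is a compact sketch of the whole search loop under Steps 1 to 5. The position updates are simplified versions of Formulas (4) to (6); the search ranges, population size, and iteration budget are illustrative assumptions, and fitness is the function sketched above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed search ranges for (n_estimators, learning_rate, max_depth,
# min_child_weight); the paper does not report its exact bounds.
LB = np.array([50.0, 0.01, 3.0, 1.0])
UB = np.array([500.0, 0.30, 10.0, 10.0])
N, M, ST = 20, 30, 0.8        # population size, iterations, safety threshold
n_prod = N // 5               # number of producers (Ds)
n_alarm = N // 5              # number of danger-aware sparrows (Ts)

def evaluate(pop):
    return np.array([fitness(x, X_tr, y_tr, X_val, y_val) for x in pop])

X = rng.uniform(LB, UB, size=(N, 4))          # Step 1: initialize positions
fit = evaluate(X)
gbest, gbest_fit = X[np.argmin(fit)].copy(), fit.min()

for t in range(M):
    order = np.argsort(fit)                   # Step 2: rank by fitness (MSE)
    best, worst = X[order[0]].copy(), X[order[-1]].copy()
    R = rng.random()                          # warning value
    for rank, i in enumerate(order):
        if rank < n_prod:                     # producers, cf. Formula (4)
            if R < ST:
                X[i] = X[i] * np.exp(-(rank + 1) / (rng.uniform(1e-3, 1.0) * M))
            else:
                X[i] = X[i] + rng.normal() * np.ones(4)
        elif rank > N / 2:                    # hungry scroungers, cf. Formula (5)
            X[i] = rng.normal() * np.exp((worst - X[i]) / (rank + 1) ** 2)
        else:                                 # remaining scroungers follow the best
            X[i] = best + np.abs(X[i] - best) * rng.choice([-1.0, 1.0], size=4)
    # Alarmers, cf. Formula (6); simplified to its f_i > f_g branch with a
    # Cauchy-distributed step.
    for i in rng.choice(N, size=n_alarm, replace=False):
        X[i] = best + rng.standard_cauchy() * np.abs(X[i] - best)
    X = np.clip(X, LB, UB)                    # keep positions inside the bounds
    fit = evaluate(X)
    if fit.min() < gbest_fit:                 # Step 3: update the global best
        gbest, gbest_fit = X[np.argmin(fit)].copy(), fit.min()

# Steps 4-5: after the loop, gbest holds the selected hyperparameters.
n_estimators, learning_rate, max_depth, min_child_weight = gbest
```
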

5 Experimental Results

5.1 Experimental Setup and Evaluation Metrics

1) Experimental Setup
To evaluate the effectiveness of the algorithm, we randomly split the dataset into a 70% training set and a 30% test set. This partitioning is consistent with most previous studies and facilitates fair comparisons. Additionally, to verify the accuracy and effectiveness of the proposed SSA-XGBoost model in predicting data product prices, we introduced six models for comparison: LightGBM, GBDT, MLP, KNN, LR, and XGBoost. All reported results were obtained on a computer equipped with an Intel Core Ultra 9 185H, 32 GB RAM, and a 64-bit Windows 11 operating system, using popular Python packages including NumPy, Pandas, and Scikit-Learn.
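
A minimal sketch of the split, assuming scikit-learn and an arbitrary random seed (the paper does not report one):

```python
from sklearn.model_selection import train_test_split

# 70/30 random split; random_state=42 is an assumption for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```
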
2) Evaluation Metrics
Our evaluation entails a thorough analysis of the model's performance in terms of fit and prediction accuracy, employing various metrics such as MAE, RMSE, MSE, MAPE, RMSPE, and R2. Among these, MAE measures the absolute deviation between predicted and actual values, while RMSE and MSE indicate the magnitude of deviation. Additionally, MAPE and RMSPE reflect error percentages. Smaller values for these metrics indicate higher model accuracy. R2 represents the degree of data fitting, with a higher value indicating stronger explanatory power.
$$R^{2}=1-\frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^{2}}{\sum_{i=1}^{n}(y_i-\bar{y})^{2}},$$
(7)
$$\mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right|,$$
(8)
$$\mathrm{MSE}=\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^{2},$$
(9)
$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^{2}},$$
(10)
$$\mathrm{MAPE}=\frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i-\hat{y}_i}{y_i}\right|\times 100\%,$$
(11)
$$\mathrm{RMSPE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\Big(\frac{y_i-\hat{y}_i}{y_i}\Big)^{2}}.$$
(12)

5.2 Experimental Results

1) Comparative Analysis of Different Models
Table 3 shows the prediction accuracy results of the proposed model compared to the six baseline models. To facilitate a more intuitive comparative analysis, we plotted radar charts for the six metrics, as shown in Figure 7. In terms of MSE, RMSE, and MAE, the SSA-XGBoost model performs the best, with values lower than the non-optimized XGBoost by 0.083, 0.107, and 0.067, respectively, while KNN and LR models perform the worst. For the MAPE metric, the SSA-XGBoost value is 10.14%, which is 4.54%, 9.57%, 15.94%, and 19.48% lower than XGBoost, GBDT, LightGBM, and MLP, respectively. For the RMSPE metric, the SSA-XGBoost value is 24.87%, which is 9% and 22.95% lower than XGBoost and GBDT, respectively, with KNN having the highest value at 88.86%, and LR at 65.28%. For the R2, the values of all models range from 0.729 to 0.967, with SSA-XGBoost performing the best, being 0.023 higher than XGBoost and 0.049 higher than GBDT. Overall, the SSA-XGBoost model performs the best across all six evaluation metrics, making it the most effective in prediction. In contrast, KNN and LR models perform poorly across all metrics, MLP and LightGBM perform moderately, while the other models show relatively good fitting effects.
Table 3 Evaluation metric results
Model MSE RMSE MAE R2 MAPE RMSPE
SSA-XGBoost 0.112 0.335 0.172 0.967 10.14% 24.87%
XGBoost 0.195 0.442 0.239 0.944 14.68% 33.87%
GBDT 0.287 0.536 0.284 0.918 19.71% 47.82%
LightGBM 0.373 0.611 0.358 0.893 26.08% 54.54%
MLP 0.364 0.603 0.388 0.896 29.62% 53.76%
LR 0.564 0.751 0.602 0.839 47.72% 65.28%
KNN 0.947 0.973 0.692 0.729 56.94% 88.86%
Figure 7 Performance comparison of seven models with different evaluation indicators


2) Comparative Analysis with Existing Work
We compare the predictive performance of the proposed model with both Stacking[32] and Stacked-GBDT[33], as summarized in Table 4.
Table 4 Comparison of existing work
Relevant literature Model name MAE RMSE R2
Proposed Method SSA-XGBoost 0.172 0.335 0.967
[32] Stacking ensemble 0.593 0.824 0.839
[33] Stacked-GBDT 1265.419 2517.864 0.923
The results indicate that the SSA-XGBoost model demonstrates superior performance, exhibiting lower MAE and RMSE values compared to the traditional stacking approach, which are reduced by 0.421 and 0.489, respectively. The possible reason is that the dataset used in this study is derived from the real world, often containing noise and redundant features. Traditional Stacking ensemble models tend to overfit in such scenarios, while the SSA-XGBoost model effectively reduces the risk of overfitting by adaptively selecting features and adjusting parameters. It is worth noting that in terms of R2, SSA-XGBoost is 0.128 higher than the Stacking model, indicating a higher degree of fit in the relationship between explanatory variables and response variables. Due to significant differences in how the Stacked-GBDT model handles predictive variables compared to our method, its MAE and RMSE values are relatively high, so it is only compared with our model in terms of the R2. In this aspect, SSA-XGBoost is 0.044 higher than the Stacked-GBDT model. In summary, the SSA-XGBoost model shows significant improvements in MAE, RMSE, and R2, indicating that the model not only provides more accurate prediction results but also better reveals the underlying patterns in the data, offering stronger support for decision-making.
The prediction error plot of the proposed model is illustrated in Figure 8. It can be seen that the errors on the test set are close to zero, with a notable decrease in larger error values. This indicates that the proposed model is capable of accurate predictions in most cases, demonstrating high robustness and stability. This further illustrates that our proposed SSA-XGBoost model achieves good predictive performance.
Figure 8 Prediction error


3) Comparative Analysis of Feature Importance Across Different Models
Since the prediction of the KNN algorithm is based on the distance between instances rather than learning or modeling feature weights, it does not provide a ranking or assessment of feature importance. As a result, we assess the significance of features across the remaining six models. Figure 9 illustrates the ranking of feature importance for various models.
Figure 9 Feature importance ranking of different models


Figure 9 shows that there are significant differences in feature importance rankings among different models. To facilitate a more intuitive comparison, this paper selects the top eight features from six models for comparative analysis, as shown in Table 5. Specifically, we list the top eight features in terms of importance based on gain and split count for the LightGBM model.
Table 5 Feature importance ranking of different models
Index SSA-XGBoost XGBoost LightGBM-Split
1 Patents Patents Data volume
2 Import and export Statistical yearbook Scarcity Score
3 Research data Import and export Completeness Score
4 Statistical yearbook Research data Redundancy Score
5 Data volume Natural resources Consistency Score
6 Natural resources Pharmaceutical companies Structural Level Score
7 Pharmaceutical companies Data volume Data Quantity Score
8 Imported drugs Imported drugs Timeliness Score
Index LightGBM-Gain GBDT MLP LR
1 Patents Patents Data volume Statistical yearbook
2 Data volume Data volume Research data Research data
3 Research data Research data Statistical yearbook Import and export
4 Statistical yearbook Statistical yearbook Natural resources Patents
5 Pharmaceutical companies Import and export Pharmaceutical companies Natural resources
6 Natural resources Natural resources Imported drugs Drug procurement
7 Scarcity Score Pharmaceutical companies Import and export Imported drugs
8 Rating Rating Drug bidding Drug bidding
By comparing the feature importance rankings of LightGBM based on gain and split count, we found significant differences between the two rankings. Except for the data volume and scarcity score features, which are in the top eight in both rankings, all other features are completely different. Given that gain-based feature importance directly reflects the contribution of features to model performance improvement, we chose to compare LightGBM's gain-based feature importance with those of other models.
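
For reference, both importance types can be read off a fitted LightGBM model as sketched below, where lgbm is assumed to be an already fitted LGBMRegressor:

```python
# Gain- vs. split-based feature importance for a fitted LightGBM model
# (cf. Table 5); `lgbm` is assumed to be a fitted lightgbm.LGBMRegressor.
booster = lgbm.booster_
for kind in ("gain", "split"):
    imp = booster.feature_importance(importance_type=kind)
    names = booster.feature_name()
    top8 = sorted(zip(names, imp), key=lambda p: p[1], reverse=True)[:8]
    print(kind, top8)
```
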
Table 5 shows that the feature "patents" ranks first in the models SSA-XGBoost, XGBoost, LightGBM-Gain, and GBDT, while it ranks fourth in the LR model, indicating its high importance. "data volume" and "pharmaceutical companies" rank in the top eight in five models except for the LR model, with "data volume" generally ranking higher and "pharmaceutical companies" ranking lower. "statistical yearbook" and "research data" rank in the top four across all six models, while "import and export" ranks in the top eight in five models except for the GBDT model. "natural resources" ranks in the top eight across all six models but generally ranks lower. "imported drugs" rank in the top eight in four models, with other features appearing one to two times.
In summary, the key factors affecting the price of data products include data volume, patents, statistical yearbooks, and research data. Generally, the larger the data volume, the higher the price. Most of the other features important to the models are categorical, with only a few data-dimension scores among them. Since, in the data crawled for this paper, the dimension scores of data products are assigned by the trading platform rather than derived from user feedback, and each dimension has only three rating levels (3, 4, and 5), these scores cannot reflect the price of the data products well. The prominence of data product labels across all models therefore shows that the type of a data product largely determines its price.

6 Conclusion and Outlook

In this research, we develop a data pricing model utilizing SSA-XGBoost and conduct comparisons with six reference models as well as state-of-the-art work. The key findings are as follows:
1) By examining various evaluation metrics, we observe that the SSA-XGBoost model demonstrates superior prediction accuracy relative to other baseline models, enabling more precise forecasting of data product prices. In contrast to existing methodologies, the model introduced in this paper shows enhancements in MAE, RMSE, and R-squared metrics, reflecting improved predictive performance.
2) By comparing the feature importance of different models, we find that data volume and data category (i.e., commodity label) are the main factors affecting the price of data products, while the scores on different data dimensions rank near the bottom of the feature importance. This may be due to two reasons: firstly, the scores are provided by data trading platforms rather than by real users; secondly, the scoring scale has only three grades, so these categorical variables carry insufficient information and have weaker explanatory power.
In our future research agenda, we will employ more advanced machine learning algorithms to enhance the predictive accuracy and adaptability of pricing models, and will conduct more comprehensive data collection and more targeted data preprocessing. Beyond data ontology factors, user behavior and market factors can also be considered, ensuring a more comprehensive understanding of the pricing mechanism and making data pricing more rational.

References

[1] Dan L, Hao X J, Chen Y H. A review and comparative analysis of domestic and foreign research on big data pricing methods. Big Data Research, 2021, 7(6): 89-102.
[2] Pei J. A survey on data pricing: From economics to data science. IEEE Transactions on Knowledge and Data Engineering, 2020, 34(10): 4586-4608.
[3] Yang J, Zhao C, Xing C. Big data market optimization pricing model based on data quality. Complexity, 2019.
[4] Yu M, Wang J, Yan J, et al. Pricing information in smart grids: A quality-based data valuation paradigm. IEEE Transactions on Smart Grid, 2022, 13(5): 3735-3747.
[5] Yang J, Xing C. Personal data market optimization pricing model based on privacy level. Information, 2019, 10(4): 123.
[6] Cong Z, Luo X, Pei J, et al. Data pricing in machine learning pipelines. Knowledge and Information Systems, 2022, 64(6): 1417-1455.
[7] Chen X, Miao S, Wang Y. Differential privacy in personalized pricing with nonparametric demand models. Operations Research, 2023, 71(2): 581-602.
[8] Cai Z, Zheng X, Wang J, et al. Private data trading towards range counting queries in internet of things. IEEE Transactions on Mobile Computing, 2023, 22(8): 4881-4897.
[9] Alorwu A, van Berkel N, Visuri A, et al. Monetary valuation of personal health data in the wild. International Journal of Human-Computer Studies, 2024, 185: 103241.
[10] Cheng S, Ren T, Zhang H, et al. A Stackelberg game based framework for edge pricing and resource allocation in mobile edge computing. IEEE Internet of Things Journal, 2024, 11(11): 20514-20530.
[11] Pandey S R, Pinson P, Popovski P. Strategic coalition for data pricing in IoT data markets. IEEE Internet of Things Journal, 2024, 11(4): 6454-6468.
[12] Lin H, Chung J W, Lao Y, et al. Machine unlearning in gradient boosting decision trees. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023: 1374-1383.
[13] Rey-Blanco D, Zofío J L, González-Arias J. Improving hedonic housing price models by integrating optimal accessibility indices into regression and random forest analyses. Expert Systems with Applications, 2024, 235: 121059.
[14] Zheng J, Tian Y, Luo J, et al. A novel hybrid method based on kernel-free support vector regression for stock indices and price forecasting. Journal of the Operational Research Society, 2023, 74(3): 690-702.
[15] Zhu M, Xu H, Wang M, et al. Carbon price interval prediction method based on probability density recurrence network and interval multi-layer perceptron. Physica A: Statistical Mechanics and Its Applications, 2024, 636: 129543.
[16] Zhang L, Jánošík D. Enhanced short-term load forecasting with hybrid machine learning models: CatBoost and XGBoost approaches. Expert Systems with Applications, 2024, 241: 122686.
[17] Budholiya K, Shrivastava S K, Sharma V. An optimized XGBoost based diagnostic system for effective prediction of heart disease. Journal of King Saud University-Computer and Information Sciences, 2022, 34(7): 4514-4523.
[18] Deng X, Li M, Deng S, et al. Hybrid gene selection approach using XGBoost and multi-objective genetic algorithm for cancer classification. Medical & Biological Engineering & Computing, 2022, 60(3): 663-681.
[19] Ma M, Zhao G, He B, et al. XGBoost-based method for flash flood risk assessment. Journal of Hydrology, 2021, 598: 126382.
[20] Wang K, Li M, Cheng J, et al. Research on personal credit risk evaluation based on XGBoost. Procedia Computer Science, 2022, 199: 1128-1135.
[21] Jabeur S B, Mefteh-Wali S, Viviani J L. Forecasting gold price with the XGBoost algorithm and SHAP interaction values. Annals of Operations Research, 2024, 334(1): 679-699.
[22] Avanijaa J. Prediction of house price using XGBoost regression algorithm. Turkish Journal of Computer and Mathematics Education (TURCOMAT), 2021, 12(2): 2151-2155.
[23] Wu K, Chai Y, Zhang X, et al. Research on power price forecasting based on PSO-XGBoost. Electronics, 2022, 11(22): 3763.
[24] Zhao X, Li Q, Xue W, et al. Research on ultra-short-term load forecasting based on real-time electricity price and window-based XGBoost model. Energies, 2022, 15(19): 7367.
[25] Rui C, Bin L, Min L, et al. Predicting prices and analyzing features of online short-term rentals based on XGBoost. Data Analysis and Knowledge Discovery, 2021, 5(6): 51-65.
[26] Mao F, Chen M, Zhong K, et al. An XGBoost-assisted evolutionary algorithm for expensive multiobjective optimization problems. Information Sciences, 2024, 666: 120449.
[27] Yuan Y, Du J, Luo J, et al. Discrimination of missing data types in metabolomics data based on particle swarm optimization algorithm and XGBoost model. Scientific Reports, 2024, 14(1): 152.
[28] https://www.youedata.com/ (accessed on 8 March 2024).
[29] Wang X, Yang M, Li W. Efficient data reduction strategies for big data and high-dimensional LASSO regressions. arXiv preprint: 2401.11070, 2024.
[30] Yang J, Guan J. A heart disease prediction model based on feature optimization and SMOTE-XGBoost algorithm. Information, 2022, 13(10): 475.
[31] Li J, Chen J, Shi J. Evaluation of new sparrow search algorithms with sequential fusion of improvement strategies. Computers & Industrial Engineering, 2023, 182: 109425.
[32] Shen J X, Zhao X S. Research on data resource pricing method based on stacking multi-algorithm fusion model. Information Studies: Theory & Application, 2023, 46(1): 179-186.
[33] Shen J X, Zhao X S. Research on data resource value assessment method based on dynamic stacked-GBDT ensemble learning. Science and Technology Management Research, 2023, 43(1): 53-61.