Research on Data Resource Pricing Method Based on SSA-XGBoost Model

Jian YANG, Yajuan CHEN, Liwei CHANG, Yali LÜ

Journal of Systems Science and Information, 2025, 13(1): 116-136. DOI: 10.21078/JSSI-2024-0074

Abstract

Data pricing is a key link in promoting the efficient circulation of data in the market. However, existing methods are still insufficient in terms of pertinence, dynamism, and comprehensiveness. We therefore propose a data pricing prediction model based on XGBoost optimized by the sparrow search algorithm (SSA), aiming to provide a reference for pricing decisions in the data market. First, we crawled the data transaction information of Youedata.com and performed preprocessing operations on the dataset, including outlier handling, one-hot encoding, and logarithmic transformation. Second, we conducted exploratory data analysis to understand the distribution of the data and the correlations among features. We then used the LASSO algorithm to select features and constructed a data pricing prediction model based on SSA-XGBoost. Finally, we compared it with six machine learning models: LightGBM, GBDT, MLP, KNN, LR, and XGBoost. The experimental results show that, in terms of R-squared, the prediction results of the proposed SSA-XGBoost model exceed those of LightGBM, GBDT, MLP, KNN, LR, and XGBoost by 7.4%, 4.9%, 7.1%, 23.8%, 12.8%, and 2.3%, respectively, and are superior to state-of-the-art work. Furthermore, its results on the five indicators MSE, RMSE, MAE, MAPE, and RMSPE are better than those of the other models, showing higher stability.

Key words

data pricing / LASSO / SSA-XGBoost / machine learning

Jian YANG , Yajuan CHEN , Liwei CHANG , Yali LÜ. Research on Data Resource Pricing Method Based on SSA-XGBoost Model. Journal of Systems Science and Information, 2025, 13(1): 116-136 https://doi.org/10.21078/JSSI-2024-0074

1 Introduction

With the rapid development and widespread adoption of information technologies such as the Internet of Things, big data, and cloud computing, global data volume has shown exponential growth, and the importance of data has attracted much attention in the era of big data. However, the data field is naturally fragmented and monopolized, with mismatched supply and demand and data silos still hindering the realization of data value[1]. To facilitate the realization of data value, the efficient exchange and optimal allocation of data assets, and thereby data sharing, a mechanism can be established to transfer data usage rights to third parties so that data holders profit directly or indirectly. For instance, through data trading markets, data holders can sell data usage rights to organizations or institutions in need, which both meets the needs of big data applications and promotes the realization of data value. According to the "2023 China Data Transaction Market Research and Analysis Report", the market size of China's data trading industry is expected to continue to grow steadily, reaching 204.6 billion yuan by 2025 and 515.59 billion yuan by 2030, a compound annual growth rate of approximately 20.3% from 2025 to 2030. Over the next decade, the compound annual growth rate of China's data trading market will be significantly higher than the global level.
The key to establishing a data trading market lies in the reasonable pricing of data. Price is one of the essential elements of commodity exchange. Compared to traditional commodities, data products exhibit new characteristics, namely zero marginal cost, high fixed costs, and sunk costs[2], which make traditional pricing mechanisms unsuitable for them. An equitable data valuation framework is crucial for the robust evolution of the data trading marketplace, so designing a reasonable pricing mechanism for data resources has become an urgent issue. To this end, we propose a novel data pricing model based on a machine learning framework.
This paper makes the following key contributions. First, we crawled the real data transaction information of Youedata.com, performed preprocessing operations such as outlier removal and data-distribution adjustment on the samples, and used the LASSO algorithm for feature selection, providing the data foundation for the construction of the pricing model.
Second, we introduced a data pricing framework utilizing the SSA-XGBoost algorithm. By incorporating the sparrow search algorithm to fine-tune the parameters of XGBoost, we enhanced the predictive accuracy and robustness of the XGBoost model in price forecasting.
Furthermore, a comparative analysis was conducted between the SSA-XGBoost model and six baseline models. The experimental results of five evaluation indicators, such as MSE, verified the superior performance of the SSA-XGBoost model in price prediction.
The structure of this paper is as follows. Section 2 provides a review of related literature. Then the dataset is described, and data preprocessing and feature selection are performed in Section 3. The SSA-XGBoost model is constructed in Section 4. Section 5 presents and discusses the experimental findings. Section 6 concludes this paper and presents the next research agenda.

2 Literature Review

2.1 Data Pricing

The data pricing model is an important component of data pricing research. Many scholars have proposed models from different perspectives, forming several broad categories.
1) Pricing models based on data ontology and profit maximization. These methods focus on reflecting the real prices of different dimensions of the data itself. Yang, et al.[3] proposed a pricing framework based on quality dimensions, adjusting data quality dimensions according to users' willingness to pay to obtain quality scores for further pricing. Yu, et al.[4] approached the problem from the perspective of information amount, fully considering the scarcity of data. Focusing on the effective quantity and distribution of data rather than its content and quality, they established a functional relationship between information entropy and price, mapping entropy to price through a connection function. Yang, et al.[5] developed a nonlinear model focused on willingness to pay (WTP) by analyzing consumers' self-selection behaviors, employing a bi-level programming model to address the optimal pricing of personal privacy data.
2) Pricing models based on academic research and operational practice. Cong, et al.[6], from a data science and data market perspective, assessed the value of data by examining the predictive efficacy of machine learning algorithms and illustrated their findings with pertinent examples. Chen, et al.[7] designed a framework for a personal data market, considering in sequence personal privacy loss quantification, privacy compensation, and query pricing. Cai, et al.[8] proposed arbitrage-free query pricing based on tuple importance, proving the model's monotonicity and boundedness and thereby ensuring fairness and feasibility in pricing.
3) Pricing models based on economics and game theory. These methods focus on the comprehensive factors at play in specific market scenarios. Alorwu, et al.[9] constructed a personal health data pricing model based on second-price sealed auctions, promoting Pareto optimality between transaction parties. Cheng, et al.[10] proposed a blockchain-enhanced data market framework with cloud computing as an auxiliary, using a Stackelberg game to maximize the interests of market participants. Pandey, et al.[11] proposed a fair negotiation method, adopting the Rubinstein bargaining model to determine the price of data and the value of privacy loss, ensuring fair transactions.
In summary, scholars have made significant progress in data pricing research. However, current research predominantly focuses on mathematical model derivation; existing methods have limited applicability and inherent issues, and empirical research and econometric analysis specific to particular data trading markets are lacking. Therefore, it is necessary to explore a more scientific and widely applicable data pricing method.

2.2 XGBoost Model

Machine learning is capable of automatically analyzing large-scale data, processing rapidly changing and complex datasets, and providing efficient and accurate prediction and decision support. Among them, XGBoost, Gradient Boosting Decision Tree, Random Forest, Support Vector Machine Regression, and Multilayer Perceptron have been widely applied in various price prediction studies and have achieved good application effects[12-15].
We aim to utilize the XGBoost model to predict data resource prices, benefiting from its low computational complexity, rapid execution, and high precision[16]. XGBoost can be applied to classification, regression, and anomaly detection, and is widely used across fields. In the medical field, Budholiya, et al.[17] used a Bayesian-optimized XGBoost classifier to predict heart disease and compared its predictive performance with random forest and extra trees classifiers, showing that the proposed model performed better. Deng, et al.[18] introduced a technique that integrates XGBoost with MOGA for cancer classification; the empirical analysis indicated that XGBoost-MOGA outperformed previous state-of-the-art algorithms in F-score, accuracy, recall, and other evaluation metrics. In risk assessment, Ma, et al.[19] used XGBoost to assess flash flood risk and found that it outperformed the LSSVM model on several predictive evaluation metrics. Wang, et al.[20] compared XGBoost, K-nearest neighbors, and decision tree classifiers for personal credit risk assessment, concluding that XGBoost performed well in terms of accuracy and AUC. In price prediction, XGBoost has also performed outstandingly. Jabeur, et al.[21] compared the predictive performance of six models, including XGBoost, for gold prices; empirical analysis revealed that the XGBoost model outperformed the other machine learning models. Avanijaa, et al.[22] used XGBoost regression to predict house prices, aiding customers in deciding when and where to buy a house. Wu, et al.[23] introduced a PSO-optimized XGBoost model for forecasting electricity prices; comparisons with ARIMA, LSTM, SVM, RW, and plain XGBoost showed that the PSO-XGBoost model predicted better. Furthermore, XGBoost demonstrates excellent predictive capabilities in other types of prediction tasks and has a wide range of applications[24, 25].
It is particularly important to emphasize that hyperparameter optimization is a key link in improving the performance of XGBoost. Its performance can be significantly improved by advanced hyperparameter optimization techniques such as Bayesian optimization or swarm intelligence[26, 27], which systematically explore the hyperparameter space to find the configuration that maximizes the efficiency and effectiveness of the model.

3 Proposed Methodology

This study introduces an approach for enhancing XGBoost parameter optimization utilizing the sparrow search algorithm (SSA), with the goal of examining the effectiveness and applicability of machine learning models in predicting data resource prices. Firstly, we collected and preprocessed transaction information from youedata.com[28], which involved removing outliers, converting categorical variables into dummy variables, and adjusting the data distribution using logarithmic transformations. Then, we conducted exploratory analysis on the preprocessed data to further understand the correlations between data features and employed the lasso algorithm for feature selection. Finally, we constructed a data price prediction model based on the SSA-XGBoost algorithm and compared its predictive performance against LightGBM, GBDT, MLP, KNN, LR, and XGBoost using six evaluation metrics. Figure 1 shows the framework flowchart of our proposed machine learning-based data price prediction model.
Figure 1 The proposed flowchart for machine learning-based data pricing


3.1 Dataset Introduction

Youedata.com was launched in November 2016 as an online platform providing data intelligence services, including online transactions of big data products such as APIs and block data. The platform contains data product resources from 13 industries, including public data. At the time of collection, 6,445 data products were available, carrying 92 category labels.
This study collected 3,328 sample data points from the block data trading information on Youedata.com; after data completeness checks, 3,301 valid samples remained. The dataset mainly covers three major industry categories: industrial economics, healthcare and medicine, and scientific research and technology, with 17 sub-category labels. Among these, four labels (higher education, development tools, smart healthcare, and regional economics) carry no price information for their data products, and the labels "COVID-19" and "pandemic" have only four entries in total, so these labels were removed from the dataset. Ultimately, our dataset contains 11 product labels. Each entry includes the data name, product price, data volume, industry category, commodity label, sales volume, and quality scores for some data dimensions. Table 1 outlines the features included in the dataset and their descriptions.
Table 1 Description of features
No. Feature Description
1 Price The price at which data products are sold in data transactions
2 Data Volume The data scale of selling data products
3 Scarcity Score Scoring of the supply and demand differences of data resources
4 Consistency Score Scoring of data compliance with unified standards
5 Applicability Score Scoring of the benefits that data can bring
6 Structural Level Score Scoring of the organization and formatting degree of data
7 Data Quantity Score Scoring of the scale of selling data products
8 Redundancy Score Scoring of the degree of redundant information contained in data
9 Completeness Score Scoring of the information missing situation in the data
10 Timeliness Score Scoring of the time interval from data generation to use
11 Sales Volume The quantity of data product sales
12 Industry category Including Industrial economy, Healthcare and medicine, and Scientific research and technology.
13 Commodity label Scientific research and technology: Patents, Research data; Industrial economy: Statistical yearbook, Import and export, Supply and demand of agricultural products, Natural resources; Healthcare and medicine: Domestic pharmaceuticals, Imported drugs, Pharmaceutical bidding, Drug procurement, Pharmaceutical companies
Because this study focuses on pricing data products that do not yet have prices, sales volume is not used as an input indicator for the model. Commodity labels further subdivide the industry categories, so only the finer-grained commodity labels are selected as input indicators. In the end, we chose 11 features as input indicators of the model and the data product price as the output indicator.

3.2 Data Preprocessing

As shown in Table 2, we conducted a descriptive statistical analysis of the dataset features. The median of the price indicator is 4, meaning that half of the products in the crawled data have relatively low prices. The average sales volume is 26.94, indicating that sales volumes on Youedata.com are considerable. However, a joint look at the mean, variance, maximum, and minimum shows that the dataset contains outliers and that the value ranges of the features differ greatly. To reduce the negative impact of these variables on model construction, the data need to be preprocessed to improve data quality and the analysis results.
Table 2 Descriptive statistical analysis
Feature Mean Median Variance Skewness Min Max
Price 783.578 4 3512112 4.171 0 19421
Data Volume 98523.46 1024 1.09e+1 6.015 0.23046 5242880
Sales Volume 26.94 25 376.979 0.658 0 242
Rating 3.92 4 0.167 0.547 3 5
Scarcity Score 3.98 4 0.172 0.547 3 5
Consistency Score 4.3 4 0.181 0.357 3 5
Applicability Score 3.99 4 0.616 0.018 2 5
Structural Level Score 3.99 4 0.609 0.003 2 5
Data Quantity Score 3.98 4 0.616 0.008 2 5
Redundancy Score 4.05 4 0.624 0.012 2 5
Completeness Score 4.05 4 0.624 0.005 3 5
Timeliness Score 4.45 4 0.611 0.016 3 5
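
For reference, the statistics in Table 2 can be computed directly with pandas. The following is a minimal sketch; the file and column names are illustrative assumptions, not the authors' artifacts.

```python
import pandas as pd

# Reproduce the Table 2 descriptive statistics for the crawled dataset.
# "youedata_block_data.csv" is a hypothetical file name.
df = pd.read_csv("youedata_block_data.csv")

stats = pd.DataFrame({
    "Mean": df.mean(numeric_only=True),
    "Median": df.median(numeric_only=True),
    "Variance": df.var(numeric_only=True),
    "Skewness": df.skew(numeric_only=True),
    "Min": df.min(numeric_only=True),
    "Max": df.max(numeric_only=True),
})
print(stats.round(3))
```
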

3.2.1 Outlier Handling

Handling outliers in data is a crucial step in data preprocessing. By addressing outliers, models can more effectively cope with data variations and noise, enhancing the precision and dependability of data analysis and machine learning, which in turn supports more informed decision-making.
In many business scenarios, the price of goods is typically not zero. A price of zero may indicate special situations such as giveaways, test items, or samples, which do not reflect the normal market pricing of the goods. Including such data in predictive models can introduce bias and affect the model's predictive performance. Therefore, it is advisable to exclude samples with a price of zero.
We use box plots to analyze the dispersion of the dataset. Preliminary observation shows significant differences in price across product labels, so we plot box plots with "product label" and "price" as indicators. Excluding all outliers would reduce the generalizability of the model; hence, we exclude only extreme outliers and allow moderately deviating values to remain. Figure 2 shows the box plots of prices for some product labels.
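
A minimal sketch of this outlier-handling step is given below. The 3*IQR boxplot fence used for "extreme" outliers is a common convention and an assumption, since the paper does not state its exact fence; column names are likewise illustrative.

```python
import pandas as pd

def drop_extreme_outliers(df: pd.DataFrame, group: str = "commodity_label",
                          col: str = "price") -> pd.DataFrame:
    """Drop zero prices and extreme per-label price outliers."""
    # Zero prices denote giveaways, test items, or samples, not market prices.
    df = df[df[col] > 0]

    def fence(g: pd.DataFrame) -> pd.DataFrame:
        # 3*IQR fence: keep moderately deviating values, drop extreme ones.
        q1, q3 = g[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        return g[(g[col] >= q1 - 3 * iqr) & (g[col] <= q3 + 3 * iqr)]

    return df.groupby(group, group_keys=False).apply(fence)
```
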
Figure 2 Box plot of prices corresponding to different data product labels


3.2.2 Data Skewness Handling

From the skewness statistics in Table 2, it can be observed that the data volume and price features exhibit significant skewness, which can lead to heteroscedasticity in the regression analysis. We therefore applied a logarithmic transformation to the price distribution; after the transformation, the skewness of the prices is 0.314, significantly improving the right skewness. The distribution of the price data before and after the logarithmic transformation is shown in Figure 3. We likewise applied a logarithmic transformation to the data volume feature, yielding a skewness of 0.014. Handling the skewness allows the predictive model to be trained on a more balanced dataset, thereby ensuring the reliability of the model.
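
A minimal sketch of the transformation follows. Whether the authors used log or log1p is not stated; log1p is chosen here to tolerate values close to zero, and the file and column names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("youedata_block_data.csv")  # hypothetical file name
# Adjust the right-skewed distributions of price and data volume.
for col in ("price", "data_volume"):
    df[f"log_{col}"] = np.log1p(df[col])
    print(col, "skewness after transform:", round(df[f"log_{col}"].skew(), 3))
```
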
Figure 3 Price transformation


3.2.3 One-Hot Encoding

The commodity labels in our dataset are categorical and unordered, so one-hot encoding is required. We selected 11 categories of product labels, including patents, research data, and statistical yearbooks. For a given commodity label, a data point belonging to that category is assigned a value of 1; otherwise, it is assigned 0. After encoding, the commodity label expands into 11 binary features, and the dataset ultimately contains 21 features. For example, if a data point's commodity label is SDAP (i.e., supply and demand of agricultural products), the SDAP feature of that point becomes 1 after one-hot encoding, while the remaining 10 label features are 0.
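
A minimal sketch of this encoding with pandas, under the same illustrative column names as above:

```python
import pandas as pd

df = pd.read_csv("youedata_block_data.csv")  # hypothetical file name
# Expand the unordered commodity label (11 categories) into 11 binary columns,
# bringing the dataset to 21 features in total.
dummies = pd.get_dummies(df["commodity_label"], prefix="label", dtype=int)
df = pd.concat([df.drop(columns="commodity_label"), dummies], axis=1)
```
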

3.3 Exploratory Data Analysis

By conducting exploratory data analysis, we can gain insights into the fundamental properties of the data, evaluate its quality, and uncover relationships between different variables. Figure 4 shows the average prices corresponding to different data product labels. We can see significant differences in their average prices. Specifically, the average price of statistical yearbooks is 0.11, imported drugs is 0.33, natural resources is 6, research data is 12.73, and pharmaceutical companies is 12.79. These data products have relatively low prices. However, the prices for the labels "import and export" and "patents" are significantly higher than those of other labels, with average prices of 489.04 and 2245.02 respectively. Therefore, there is a correlation between data commodity labels and their prices.
Figure 4 Average prices of different product labels


Furthermore, research data and patents share the industry category of scientific research and technology, yet their average prices differ by 2232.29. Import and export and statistical yearbooks both belong to industrial economics, with an average price difference of 488.93. Imported drugs and drug procurement both pertain to healthcare and medicine, with an average price difference of 99.07. Hence, it is reasonable to choose the sub-labels of commodity labels, rather than the industry category itself, as feature inputs.
The correlation heatmap is constructed from the correlation coefficients between feature variables and provides a preliminary view of the relationships between the variables in the dataset. Because the dataset contains discrete variables, we use the Spearman correlation coefficient for the analysis.
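
A minimal sketch of how such a heatmap can be produced, assuming df is the preprocessed DataFrame from the sketches above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Spearman correlation heatmap of the preprocessed features (cf. Figure 5).
corr = df.corr(method="spearman", numeric_only=True)
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.tight_layout()
plt.show()
```
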
As shown in Figure 5, SDAP is negatively correlated with data volume (−0.66), indicating that as the data volume increases, the demand for SDAP decreases, possibly due to market saturation, data quality issues, or information overload. In addition, data volume is positively correlated with price (0.46): the larger the data volume, the higher the price. The correlation between patents and price is 0.56, meaning that data products with the patent label tend to be more expensive.
Figure 5 The correlation heatmap


Drawing a heatmap of the correlations provides an initial visual overview of the relationships between variables. This preliminary analysis helps in understanding the logical relationships among the variables, but it alone is not sufficient for the subsequent feature selection.

3.4 Feature Selection

This paper selects the LASSO algorithm[29] for feature selection. Its basic principle is to add an L1-norm penalty on the model coefficients, constraining the complexity of the model and shrinking unimportant regression coefficients to zero, thereby eliminating certain features and yielding a sparser, more efficient model. The formula is given below.
$$\Phi(\omega)=\sum_{i=1}^{k}\Big(y_i-\sum_{j=1}^{t}\omega_j x_{ij}\Big)^{2}+\xi\sum_{j=1}^{t}\lvert\omega_j\rvert,$$
(1)
where $t$ is the number of features in the dataset, $k$ the number of samples, $\omega$ the vector of regression coefficients, $x_{ij}$ the value of the $j$-th feature of sample $i$, $y_i$ the true product price of sample $i$, and $\xi$ a parameter controlling the strength of the regularization.
We used the LASSO feature selection method to screen the features and finally selected 19 of them, such as data volume, patents, and completeness score, while removing two: SDAP (i.e., supply and demand of agricultural products) and the applicability score, as shown in Figure 6.
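
A minimal sketch of this selection step with scikit-learn's LassoCV follows. X (the 21 features), y (the log-transformed price), and feature_names are assumed to come from the preprocessing above, and the cross-validated choice of the penalty strength ξ (scikit-learn's alpha) is an implementation assumption.

```python
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Standardize so the L1 penalty treats all features comparably.
X_std = StandardScaler().fit_transform(X)

# Cross-validated LASSO; features with zero coefficients are dropped.
lasso = LassoCV(cv=5, random_state=0).fit(X_std, y)
kept = [name for name, w in zip(feature_names, lasso.coef_) if abs(w) > 1e-8]
print(f"alpha={lasso.alpha_:.4f}, kept {len(kept)} features:", kept)
```
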
Figure 6 Lasso feature selection


4 Model Construction

4.1 XGBoost Algorithm

XGBoost is a robust machine learning technique known for its accuracy and ability to generalize well[30]. It enhances performance by integrating multiple weak models, like decision trees, and refines the overall model through the gradient boosting method. The regression problem in the XGBoost algorithm involves the following formulas and principles.
Assume a dataset $D=\{(x_i,y_i): i=1,2,\ldots,n,\ x_i\in\mathbb{R}^m,\ y_i\in\mathbb{R}\}$, where $n$ is the number of samples and each sample has $m$ features. Given $Z$ regression trees, $x_i$ represents the feature vector of the $i$-th data point, $f_z$ is one of the regression trees, and $F$ represents the function space containing the $Z$ trees. XGBoost generates the final output by aggregating the predictions of the regression trees, as shown in the following formula.
$$\hat{y}_i=\sum_{z=1}^{Z}f_z(x_i),\qquad f_z\in F.$$
(2)
The objective function of the XGBoost is shown as:
$$\mathrm{Objective}=\sum_{i=1}^{n}l(y_i,\hat{y}_i)+\sum_{z=1}^{Z}\Omega(f_z),$$
(3)
where $\hat{y}_i$ is the predicted value, $y_i$ is the true value, $l(y_i,\hat{y}_i)$ is the loss function, which describes the deviation of the prediction, and $\Omega(f_z)$ measures the complexity of the model. The objective function is minimized by expanding the loss with a second-order Taylor approximation, which simplifies the optimization and yields a close approximation of the objective, from which the optimal model is derived.
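
For concreteness, a baseline XGBoost regressor on the selected features might be set up as follows. The hyperparameter values are placeholders that the SSA search of Section 4.3 replaces, and X_train, y_train, X_test denote the training/test split described in Section 5.1.

```python
from xgboost import XGBRegressor

# Baseline XGBoost regressor; hyperparameters here are placeholder values.
model = XGBRegressor(
    n_estimators=300,
    learning_rate=0.1,
    max_depth=6,
    min_child_weight=1,
    objective="reg:squarederror",
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```
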

4.2 Sparrow Search Algorithm

The sparrow search algorithm is a swarm intelligence algorithm inspired by the foraging behavior of sparrows[31]. It mainly consists of two stages: the discoverer (producer) phase and the joiner (scrounger) phase. In the discoverer phase, the discoverers identify and occupy relatively good feeding sites. In the joiner phase, the joiners trail the discoverers and expand the exploration area to avoid convergence to a local optimum. In addition, a fraction of the population acts as alarmers that watch for danger.
In a $D$-dimensional search space with $N$ sparrows, the position of the $i$-th sparrow is denoted as $X_i=[x_{i,1},x_{i,2},\ldots,x_{i,d},\ldots,x_{i,D}]$.
1) The update of the producer position is shown in Formula (4).
$$X_{i,d}^{t+1}=\begin{cases}X_{i,d}^{t}\cdot\exp\!\Big(\dfrac{-i}{\alpha\cdot iter_{\max}}\Big), & R<ST,\\[4pt] X_{i,d}^{t}+Q\cdot L, & R\ge ST.\end{cases}$$
(4)
Let $t$ denote the current iteration and $iter_{\max}$ the maximum number of iterations. The parameter $\alpha$ is a random number uniformly distributed in $(0,1]$, $Q$ is a random number following a normal distribution, $L$ is a $1\times d$ vector, $R\in[0,1]$ is the warning value, and $ST\in[0.5,1]$ is the safety threshold. If $R<ST$, the search environment is secure, allowing the discoverer to conduct a broad search, which improves the population's fitness. If $R\ge ST$, a sparrow in the population has detected predators and raises an immediate alarm; the population then rapidly relocates to a safe zone to evade predation.
2) The update of the scrounger position is shown in Formula (5).
$$X_{i,d}^{t+1}=\begin{cases}Q\cdot\exp\!\Big(\dfrac{X_{\mathrm{worst}}^{t}-X_{i,d}^{t}}{i^{2}}\Big), & i>N/2,\\[4pt] X_{b}^{t+1}+\big|X_{i,d}^{t}-X_{b}^{t+1}\big|\cdot A^{+}\cdot L, & i\le N/2,\end{cases}$$
(5)
where $X_b$ and $X_{\mathrm{worst}}$ are the current global best and global worst positions found by the searchers, respectively, $A$ is a $1\times d$ matrix, and $A^{+}=A^{\mathrm{T}}(AA^{\mathrm{T}})^{-1}$. When $i>N/2$, the $i$-th sparrow has lower fitness and is in a very hungry state, and needs to fly elsewhere to forage for energy.
3) The update of the alerter position is shown in Formula (6).
$$X_{i,d}^{t+1}=\begin{cases}X_{i,d}^{t}+K\cdot\dfrac{\big|X_{i,d}^{t}-X_{\mathrm{worst}}^{t}\big|}{f_i-f_w+\epsilon}, & f_i=f_g,\\[4pt] X_{\mathrm{best}}^{t}+\beta\cdot\big|X_{i,d}^{t}-X_{\mathrm{best}}^{t}\big|, & f_i>f_g,\end{cases}$$
(6)
where $X_{\mathrm{best}}$ is the current global best position, $K$ denotes the sparrow's movement direction, $\beta$ is a step-control parameter governed by the Cauchy distribution, and $\epsilon$ is a small positive constant that prevents division by zero. The fitness value of the $i$-th sparrow is $f_i$, while $f_g$ and $f_w$ are the current global best (minimum) and worst fitness values in the sparrow population, respectively. If $f_i=f_g$, the sparrow in the middle of the population has sensed the threat and proactively approaches others of its kind to reduce its risk of predation.

4.3 SSA-XGBoost Model

Poorly chosen hyperparameters may cause unstable and inaccurate predictions, so an optimization technique is required to search for the globally optimal configuration. The sparrow search algorithm (SSA) can make the parameter selection in XGBoost's hyperparameter tuning more systematic, allowing optimal parameters to be used in data product price prediction while minimizing prediction errors. We focus on optimizing four parameters: n_estimators, learning_rate, max_depth, and min_child_weight, while using default values for the others. The fitness function is set as the mean squared error.
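
A minimal sketch of this fitness function, assuming a held-out validation split (X_tr, y_tr, X_val, y_val) that the paper does not specify:

```python
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

def fitness(position, X_tr, y_tr, X_val, y_val):
    """Fitness of one sparrow: validation MSE of XGBoost trained with the
    hyperparameters encoded by its 4-dimensional position
    (n_estimators, learning_rate, max_depth, min_child_weight)."""
    n_est, lr, depth, mcw = position
    model = XGBRegressor(
        n_estimators=int(round(n_est)),
        learning_rate=float(lr),
        max_depth=int(round(depth)),
        min_child_weight=float(mcw),
    )
    model.fit(X_tr, y_tr)
    return mean_squared_error(y_val, model.predict(X_val))
```
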
Algorithm 1: Pseudocode of SSA-XGBoost
Input: number of sparrows N, number of producers Ds, maximum iterations M, number of sparrows aware of danger Ts, safety threshold ST
Output: n_estimators, learning_rate, max_depth, min_child_weight
1 t ← 1;
2 Initialize the population {x1, x2, ..., xN} and evaluate the fitness (XGBoost validation MSE) of each sparrow;
3 while t < M do
4   Rank the sparrows by fitness and record the current best and worst individuals;
5   Update the positions of the Ds producers by Formula (4);
6   Update the positions of the scroungers by Formula (5);
7   Update the positions of the Ts danger-aware sparrows by Formula (6);
8   Re-evaluate the fitness values and update the global best position gbest;
9   t ← t + 1;
10 end
11 (n_estimators, learning_rate, max_depth, min_child_weight) ← gbest
Algorithm 1 outlines the pseudocode for optimizing the XGBoost model using the Sparrow Search Algorithm. The process consists of the following steps:
Step 1 Set up the initial parameters, including the SSA population size, the maximum iteration count, and the XGBoost parameters to be optimized together with their search ranges.
Step 2 Evaluate the fitness values and rank the sparrow population accordingly. Update the positions of producers, scroungers, and alarmers using Formulas (4), (5), and (6), respectively.
Step 3 Compare the fitness values of the new positions with the current best value and update the global best information.
Step 4 Verify if the iteration termination condition is met. If it is, produce the optimal sparrow position; if not, return to Step 2.
Step 5 Using the results of SSA optimization, establish the XGBoost data pricing prediction model by determining the number of estimators, learning rate, maximum depth, and minimum child weight.
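
The following is a compact sketch of the whole search loop under Steps 1 to 5. The position updates are simplified versions of Formulas (4) to (6); the search ranges, population size, and iteration budget are illustrative assumptions, and fitness is the function sketched above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed search ranges for (n_estimators, learning_rate, max_depth,
# min_child_weight); the paper does not report its exact bounds.
LB = np.array([50.0, 0.01, 3.0, 1.0])
UB = np.array([500.0, 0.30, 10.0, 10.0])
N, M, ST = 20, 30, 0.8        # population size, iterations, safety threshold
n_prod = N // 5               # number of producers (Ds)
n_alarm = N // 5              # number of danger-aware sparrows (Ts)

def evaluate(pop):
    return np.array([fitness(x, X_tr, y_tr, X_val, y_val) for x in pop])

X = rng.uniform(LB, UB, size=(N, 4))          # Step 1: initialize positions
fit = evaluate(X)
gbest, gbest_fit = X[np.argmin(fit)].copy(), fit.min()

for t in range(M):
    order = np.argsort(fit)                   # Step 2: rank by fitness (MSE)
    best, worst = X[order[0]].copy(), X[order[-1]].copy()
    R = rng.random()                          # warning value
    for rank, i in enumerate(order):
        if rank < n_prod:                     # producers, cf. Formula (4)
            if R < ST:
                X[i] = X[i] * np.exp(-(rank + 1) / (rng.uniform(1e-3, 1.0) * M))
            else:
                X[i] = X[i] + rng.normal() * np.ones(4)
        elif rank > N / 2:                    # hungry scroungers, cf. Formula (5)
            X[i] = rng.normal() * np.exp((worst - X[i]) / (rank + 1) ** 2)
        else:                                 # remaining scroungers follow the best
            X[i] = best + np.abs(X[i] - best) * rng.choice([-1.0, 1.0], size=4)
    # Alarmers, cf. Formula (6); simplified to its f_i > f_g branch with a
    # Cauchy-distributed step.
    for i in rng.choice(N, size=n_alarm, replace=False):
        X[i] = best + rng.standard_cauchy() * np.abs(X[i] - best)
    X = np.clip(X, LB, UB)                    # keep positions inside the bounds
    fit = evaluate(X)
    if fit.min() < gbest_fit:                 # Step 3: update the global best
        gbest, gbest_fit = X[np.argmin(fit)].copy(), fit.min()

# Steps 4-5: after the loop, gbest holds the selected hyperparameters.
n_estimators, learning_rate, max_depth, min_child_weight = gbest
```
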

5 Experimental Results

5.1 Experimental Setup and Evaluation Metrics

1) Experimental Setup
To evaluate the effectiveness of the algorithm, we randomly split the dataset into a 70% training set and a 30% test set. This partitioning is consistent with most previous studies and facilitates fair comparisons. Additionally, to verify the accuracy and effectiveness of the proposed SSA-XGBoost model in predicting data product prices, we introduced six models for comparison: LightGBM, GBDT, MLP, KNN, LR, and XGBoost. All reported results were obtained on a computer equipped with an Intel Core Ultra 9 185H, 32 GB RAM, and a 64-bit Windows 11 operating system, using popular Python packages including NumPy, Pandas, and Scikit-Learn.
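
A minimal sketch of the split, assuming scikit-learn and an arbitrary random seed (the paper does not report one):

```python
from sklearn.model_selection import train_test_split

# 70/30 random split; random_state=42 is an assumption for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```
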
2) Evaluation Metrics
Our evaluation entails a thorough analysis of the model's performance in terms of fit and prediction accuracy, employing various metrics such as MAE, RMSE, MSE, MAPE, RMSPE, and R2. Among these, MAE measures the absolute deviation between predicted and actual values, while RMSE and MSE indicate the magnitude of deviation. Additionally, MAPE and RMSPE reflect error percentages. Smaller values for these metrics indicate higher model accuracy. R2 represents the degree of data fitting, with a higher value indicating stronger explanatory power.
$$R^{2}=1-\frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^{2}}{\sum_{i=1}^{n}(y_i-\bar{y})^{2}},$$
(7)
$$\mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right|,$$
(8)
$$\mathrm{MSE}=\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^{2},$$
(9)
$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^{2}},$$
(10)
$$\mathrm{MAPE}=\frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i-\hat{y}_i}{y_i}\right|\times 100\%,$$
(11)
$$\mathrm{RMSPE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\Big(\frac{y_i-\hat{y}_i}{y_i}\Big)^{2}}.$$
(12)

5.2 Experimental Results

1) Comparative Analysis of Different Models
Table 3 shows the prediction accuracy results of the proposed model compared to the six baseline models. To facilitate a more intuitive comparative analysis, we plotted radar charts for the six metrics, as shown in Figure 7. In terms of MSE, RMSE, and MAE, the SSA-XGBoost model performs the best, with values lower than the non-optimized XGBoost by 0.083, 0.107, and 0.067, respectively, while KNN and LR models perform the worst. For the MAPE metric, the SSA-XGBoost value is 10.14%, which is 4.54%, 9.57%, 15.94%, and 19.48% lower than XGBoost, GBDT, LightGBM, and MLP, respectively. For the RMSPE metric, the SSA-XGBoost value is 24.87%, which is 9% and 22.95% lower than XGBoost and GBDT, respectively, with KNN having the highest value at 88.86%, and LR at 65.28%. For the R2, the values of all models range from 0.729 to 0.967, with SSA-XGBoost performing the best, being 0.023 higher than XGBoost and 0.049 higher than GBDT. Overall, the SSA-XGBoost model performs the best across all six evaluation metrics, making it the most effective in prediction. In contrast, KNN and LR models perform poorly across all metrics, MLP and LightGBM perform moderately, while the other models show relatively good fitting effects.
Table 3 Evaluation metric results
Model MSE RMSE MAE R2 MAPE RMSPE
SSA-XGBoost 0.112 0.335 0.172 0.967 10.14% 24.87%
XGBoost 0.195 0.442 0.239 0.944 14.68% 33.87%
GBDT 0.287 0.536 0.284 0.918 19.71% 47.82%
LightGBM 0.373 0.611 0.358 0.893 26.08% 54.54%
MLP 0.364 0.603 0.388 0.896 29.62% 53.76%
LR 0.564 0.751 0.602 0.839 47.72% 65.28%
KNN 0.947 0.973 0.692 0.729 56.94% 88.86%
Figure 7 Performance comparison of seven models with different evaluation indicators


2) Comparative Analysis with Existing Work
We compare the predictive performance of the proposed model with both Stacking[32] and Stacked-GBDT[33], as summarized in Table 4.
Table 4 Comparison of existing work
Relevant literature Model name MAE RMSE R2
Proposed Method SSA-XGBoost 0.172 0.335 0.967
[32] Stacking ensemble 0.593 0.824 0.839
[33] Stacked-GBDT 1265.419 2517.864 0.923
The results indicate that the SSA-XGBoost model demonstrates superior performance, exhibiting lower MAE and RMSE values compared to the traditional stacking approach, which are reduced by 0.421 and 0.489, respectively. The possible reason is that the dataset used in this study is derived from the real world, often containing noise and redundant features. Traditional Stacking ensemble models tend to overfit in such scenarios, while the SSA-XGBoost model effectively reduces the risk of overfitting by adaptively selecting features and adjusting parameters. It is worth noting that in terms of R2, SSA-XGBoost is 0.128 higher than the Stacking model, indicating a higher degree of fit in the relationship between explanatory variables and response variables. Due to significant differences in how the Stacked-GBDT model handles predictive variables compared to our method, its MAE and RMSE values are relatively high, so it is only compared with our model in terms of the R2. In this aspect, SSA-XGBoost is 0.044 higher than the Stacked-GBDT model. In summary, the SSA-XGBoost model shows significant improvements in MAE, RMSE, and R2, indicating that the model not only provides more accurate prediction results but also better reveals the underlying patterns in the data, offering stronger support for decision-making.
The prediction error plot of the proposed model is illustrated in Figure 8. It can be seen that the errors on the test set are close to zero, with a notable decrease in larger error values. This indicates that the proposed model is capable of accurate predictions in most cases, demonstrating high robustness and stability. This further illustrates that our proposed SSA-XGBoost model achieves good predictive performance.
Figure 8 Prediction error


3) Comparative Analysis of Feature Importance Across Different Models
Since the prediction of the KNN algorithm is based on the distance between instances rather than learning or modeling feature weights, it does not provide a ranking or assessment of feature importance. As a result, we assess the significance of features across the remaining six models. Figure 9 illustrates the ranking of feature importance for various models.
Figure 9 Feature importance ranking of different models


Figure 9 shows that there are significant differences in feature importance rankings among different models. To facilitate a more intuitive comparison, this paper selects the top eight features from six models for comparative analysis, as shown in Table 5. Specifically, we list the top eight features in terms of importance based on gain and split count for the LightGBM model.
Table 5 Feature importance ranking of different models
Index SSA-XGBoost XGBoost LightGBM-Split
1 Patents Patents Data volume
2 Import and export Statistical yearbook Scarcity Score
3 Research data Import and export Completeness Score
4 Statistical yearbook Research data Redundancy Score
5 Data volume Natural resources Consistency Score
6 Natural resources Pharmaceutical companies Structural Level Score
7 Pharmaceutical companies Data volume Data Quantity Score
8 Imported drugs Imported drugs Timeliness Score
Index LightGBM-Gain GBDT MLP LR
1 Patents Patents Data volume Statistical yearbook
2 Data volume Data volume Research data Research data
3 Research data Research data Statistical yearbook Import and export
4 Statistical yearbook Statistical yearbook Natural resources Patents
5 Pharmaceutical companies Import and export Pharmaceutical companies Natural resources
6 Natural resources Natural resources Imported drugs Drug procurement
7 Scarcity Score Pharmaceutical companies Import and export Imported drugs
8 Rating Rating Drug bidding Drug bidding
By comparing the feature importance rankings of LightGBM based on gain and split count, we found significant differences between the two rankings. Except for the data volume and scarcity score features, which are in the top eight in both rankings, all other features are completely different. Given that gain-based feature importance directly reflects the contribution of features to model performance improvement, we chose to compare LightGBM's gain-based feature importance with those of other models.
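
For reference, both importance types can be read off a fitted LightGBM model as sketched below, where lgbm is assumed to be an already fitted LGBMRegressor:

```python
# Gain- vs. split-based feature importance for a fitted LightGBM model
# (cf. Table 5); `lgbm` is assumed to be a fitted lightgbm.LGBMRegressor.
booster = lgbm.booster_
for kind in ("gain", "split"):
    imp = booster.feature_importance(importance_type=kind)
    names = booster.feature_name()
    top8 = sorted(zip(names, imp), key=lambda p: p[1], reverse=True)[:8]
    print(kind, top8)
```
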
Table 5 shows that the feature "patents" ranks first in the models SSA-XGBoost, XGBoost, LightGBM-Gain, and GBDT, while it ranks fourth in the LR model, indicating its high importance. "data volume" and "pharmaceutical companies" rank in the top eight in five models except for the LR model, with "data volume" generally ranking higher and "pharmaceutical companies" ranking lower. "statistical yearbook" and "research data" rank in the top four across all six models, while "import and export" ranks in the top eight in five models except for the GBDT model. "natural resources" ranks in the top eight across all six models but generally ranks lower. "imported drugs" rank in the top eight in four models, with other features appearing one to two times.
In summary, the key factors affecting the price of data products include data volume, patents, statistical yearbooks, and research data. Generally, the larger the data volume, the higher the price. Most of the other features important to the models are categorical, with only a few data-dimension scores among them. Since, in the data crawled for this paper, the dimension scores of data products are assigned by the trading platform rather than derived from user feedback, and each dimension has only three rating levels (3, 4, and 5), these scores cannot reflect the price of the data products well. The prominence of data product labels across all models therefore shows that the type of a data product largely determines its price.

6 Conclusion and Outlook

In this research, we develop a data pricing model utilizing SSA-XGBoost and conduct comparisons with six reference models as well as state-of-the-art work. The key findings are as follows:
1) By examining various evaluation metrics, we observe that the SSA-XGBoost model demonstrates superior prediction accuracy relative to other baseline models, enabling more precise forecasting of data product prices. In contrast to existing methodologies, the model introduced in this paper shows enhancements in MAE, RMSE, and R-squared metrics, reflecting improved predictive performance.
2) By comparing the feature importance of different models, we find that data volume and data category (i.e., commodity label) are the main factors affecting the price of data products, while the scores on different data dimensions rank near the bottom of the feature importance. This may be due to two reasons: firstly, the scores are provided by data trading platforms rather than by real users; secondly, the scoring scale has only three grades, so these categorical variables carry insufficient information and have weaker explanatory power.
In our future research agenda, we will employ more advanced machine learning algorithms to enhance the predictive accuracy and adaptability of pricing models, and will conduct more comprehensive data collection and more targeted data preprocessing. Beyond data ontology factors, user behavior and market factors can also be considered, ensuring a more comprehensive understanding of the pricing mechanism and making data pricing more rational.

References

[1] Dan L, Hao X J, Chen Y H. A review and comparative analysis of domestic and foreign research on big data pricing methods. Big Data Research, 2021, 7(6): 89-102.
[2] Pei J. A survey on data pricing: From economics to data science. IEEE Transactions on Knowledge and Data Engineering, 2020, 34(10): 4586-4608.
[3] Yang J, Zhao C, Xing C. Big data market optimization pricing model based on data quality. Complexity, 2019.
[4] Yu M, Wang J, Yan J, et al. Pricing information in smart grids: A quality-based data valuation paradigm. IEEE Transactions on Smart Grid, 2022, 13(5): 3735-3747.
[5] Yang J, Xing C. Personal data market optimization pricing model based on privacy level. Information, 2019, 10(4): 123.
[6] Cong Z, Luo X, Pei J, et al. Data pricing in machine learning pipelines. Knowledge and Information Systems, 2022, 64(6): 1417-1455.
[7] Chen X, Miao S, Wang Y. Differential privacy in personalized pricing with nonparametric demand models. Operations Research, 2023, 71(2): 581-602.
[8] Cai Z, Zheng X, Wang J, et al. Private data trading towards range counting queries in internet of things. IEEE Transactions on Mobile Computing, 2023, 22(8): 4881-4897.
[9] Alorwu A, van Berkel N, Visuri A, et al. Monetary valuation of personal health data in the wild. International Journal of Human-Computer Studies, 2024, 185: 103241.
[10] Cheng S, Ren T, Zhang H, et al. A Stackelberg game based framework for edge pricing and resource allocation in mobile edge computing. IEEE Internet of Things Journal, 2024, 11(11): 20514-20530.
[11] Pandey S R, Pinson P, Popovski P. Strategic coalition for data pricing in IoT data markets. IEEE Internet of Things Journal, 2024, 11(4): 6454-6468.
[12] Lin H, Chung J W, Lao Y, et al. Machine unlearning in gradient boosting decision trees. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2023: 1374-1383.
[13] Rey-Blanco D, Zofío J L, González-Arias J. Improving hedonic housing price models by integrating optimal accessibility indices into regression and random forest analyses. Expert Systems with Applications, 2024, 235: 121059.
[14] Zheng J, Tian Y, Luo J, et al. A novel hybrid method based on kernel-free support vector regression for stock indices and price forecasting. Journal of the Operational Research Society, 2023, 74(3): 690-702.
[15] Zhu M, Xu H, Wang M, et al. Carbon price interval prediction method based on probability density recurrence network and interval multi-layer perceptron. Physica A: Statistical Mechanics and Its Applications, 2024, 636: 129543.
[16] Zhang L, Jánošík D. Enhanced short-term load forecasting with hybrid machine learning models: CatBoost and XGBoost approaches. Expert Systems with Applications, 2024, 241: 122686.
[17] Budholiya K, Shrivastava S K, Sharma V. An optimized XGBoost based diagnostic system for effective prediction of heart disease. Journal of King Saud University-Computer and Information Sciences, 2022, 34(7): 4514-4523.
[18] Deng X, Li M, Deng S, et al. Hybrid gene selection approach using XGBoost and multi-objective genetic algorithm for cancer classification. Medical & Biological Engineering & Computing, 2022, 60(3): 663-681.
[19] Ma M, Zhao G, He B, et al. XGBoost-based method for flash flood risk assessment. Journal of Hydrology, 2021, 598: 126382.
[20] Wang K, Li M, Cheng J, et al. Research on personal credit risk evaluation based on XGBoost. Procedia Computer Science, 2022, 199: 1128-1135.
[21] Jabeur S B, Mefteh-Wali S, Viviani J L. Forecasting gold price with the XGBoost algorithm and SHAP interaction values. Annals of Operations Research, 2024, 334(1): 679-699.
[22] Avanijaa J. Prediction of house price using XGBoost regression algorithm. Turkish Journal of Computer and Mathematics Education (TURCOMAT), 2021, 12(2): 2151-2155.
[23] Wu K, Chai Y, Zhang X, et al. Research on power price forecasting based on PSO-XGBoost. Electronics, 2022, 11(22): 3763.
[24] Zhao X, Li Q, Xue W, et al. Research on ultra-short-term load forecasting based on real-time electricity price and window-based XGBoost model. Energies, 2022, 15(19): 7367.
[25] Rui C, Bin L, Min L, et al. Predicting prices and analyzing features of online short-term rentals based on XGBoost. Data Analysis and Knowledge Discovery, 2021, 5(6): 51-65.
[26] Mao F, Chen M, Zhong K, et al. An XGBoost-assisted evolutionary algorithm for expensive multiobjective optimization problems. Information Sciences, 2024, 666: 120449.
[27] Yuan Y, Du J, Luo J, et al. Discrimination of missing data types in metabolomics data based on particle swarm optimization algorithm and XGBoost model. Scientific Reports, 2024, 14(1): 152.
[28] https://www.youedata.com/ (accessed on 8 March 2024).
[29] Wang X, Yang M, Li W. Efficient data reduction strategies for big data and high-dimensional LASSO regressions. arXiv preprint: 2401.11070, 2024.
[30] Yang J, Guan J. A heart disease prediction model based on feature optimization and SMOTE-XGBoost algorithm. Information, 2022, 13(10): 475.
[31] Li J, Chen J, Shi J. Evaluation of new sparrow search algorithms with sequential fusion of improvement strategies. Computers & Industrial Engineering, 2023, 182: 109425.
[32] Shen J X, Zhao X S. Research on data resource pricing method based on stacking multi-algorithm fusion model. Information Studies: Theory & Application, 2023, 46(1): 179-186.
[33] Shen J X, Zhao X S. Research on data resource value assessment method based on dynamic stacked-GBDT ensemble learning. Science and Technology Management Research, 2023, 43(1): 53-61.