### Introduction to basic principles

Classical least squares estimation has been widely used because of its many excellent properties. However, with the development of electronic computing, accumulated experience with large-scale regression problems has shown that the results obtained by least squares estimation are sometimes very unsatisfactory. When the design matrix \(X\) is ill-conditioned, there is strong linear correlation between the column vectors of \(X\); that is, there is serious multicollinearity among the independent variables. In this case, the ordinary least squares estimates of the model parameters have excessively large variance, and the method performs very poorly.

To address the deterioration of ordinary least squares under multicollinearity, the American scholar Hoerl proposed an improved estimation method called ridge estimation in 1962; Hoerl and Kennard gave a systematic treatment in 1970^{33}. When there is multicollinearity between the independent variables, \(\left| X'X \right| \approx 0\). If we add a matrix \(kI\ (k > 0)\) to \(X'X\), the matrix \(X'X + kI\) is much farther from singularity than \(X'X\). To account for the differing dimensions of the variables, this article first standardizes the data; for convenience, the standardized design matrix is still denoted by \(X\). Equation (2) defines the ridge regression estimate of \(\beta\), where \(k\) is called the ridge parameter. Since \(X\) is assumed to be standardized, \(X'X\) is the sample correlation matrix of the independent variables. As an estimate of \(\beta\), \(\hat{\beta}(k)\) is more stable than the least squares estimate \(\hat{\beta}\). When \(k = 0\), the ridge estimate \(\hat{\beta}(0)\) reduces to the ordinary least squares estimate. Because the ridge parameter \(k\) is not unique, the ridge regression estimate \(\hat{\beta}(k)\) is actually a family of estimates of the regression parameter \(\beta\). Commonly used methods for selecting the ridge parameter \(k\) include the ridge trace method and the variance inflation factor method.

$$\hat{\beta}\left( k \right) = \left( X'X + kI \right)^{-1} X'y \tag{2}$$
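Equation (2) can be sketched directly in numpy. The function below is a minimal illustration, not the paper's code; the synthetic near-collinear data and all variable names are assumptions made for the example. It shows how the \(kI\) term stabilizes the estimate when \(X'X\) is nearly singular.

```python
import numpy as np

def ridge_estimate(X, y, k):
    """Ridge estimate of Eq. (2): beta(k) = (X'X + kI)^{-1} X'y.
    X is assumed to be standardized, as in the text; k = 0 gives OLS."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

# Two nearly collinear columns: the OLS estimate (k = 0) is unstable,
# while a small k > 0 shrinks the coefficients back to a sensible scale.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
X = np.column_stack([x1, x1 + 1e-6 * rng.normal(size=100)])
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=100)

beta_ols = ridge_estimate(X, y, k=0.0)    # huge, unstable coefficients
beta_ridge = ridge_estimate(X, y, k=0.2)  # stable, near [1.5, 1.5]
```

Both estimates predict \(y\) almost equally well, but only the ridge estimate has coefficients of a plausible magnitude, which is exactly the stability property described above.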

The XGBoost algorithm is based on ensemble learning, which combines multiple learning models so that the combined model has stronger generalization ability and achieves better modeling performance. XGBoost is an improvement on gradient-boosted decision trees and is composed of multiple decision trees built iteratively. XGBoost first builds CART (Classification and Regression Trees) models to predict the data set, and then integrates these trees into a new ensemble model. The model improves iteratively: the new tree generated in each iteration fits the residual of the previous ensemble. As the number of trees increases, the complexity of the ensemble model gradually increases until it approaches the complexity of the data itself, at which point training achieves the best results. Equation (3) is the XGBoost model, where \(f_t(x_i) = \omega_{q(x_i)}\) belongs to the space of CARTs, \(\omega_{q(x)}\) is the score assigned to sample \(x\), the model prediction is obtained by accumulation, \(q\) represents the structure of each tree, \(T\) is the number of trees, and each \(f_t\) corresponds to an independent tree structure \(q\) and leaf weights.

$$\hat{y}_i = \varphi\left( x_i \right) = \sum\limits_{t = 1}^{T} f_t\left( x_i \right) \tag{3}$$

The decision trees inside XGBoost are regression trees. For the squared loss function, each split node of the regression tree fits the residual; for a general loss function, it fits an approximation of the residual via the gradient. This is why XGBoost achieves high accuracy. Equations (4)–(7) describe the iterative residual-fitting process. In Eq. (7), \(\hat{y}_i^{(t-1)}\) is the predicted value of the \(i\)-th sample after \(t-1\) iterations, and \(\hat{y}_i^{(0)}\) is its initial value.

$$\hat{y}_i^{(0)} = 0 \tag{4}$$

$$\hat{y}_i^{(1)} = f_1\left( x_i \right) = \hat{y}_i^{(0)} + f_1\left( x_i \right) \tag{5}$$

$$\hat{y}_i^{(2)} = f_1\left( x_i \right) + f_2\left( x_i \right) = \hat{y}_i^{(1)} + f_2\left( x_i \right) \tag{6}$$

$$\hat{y}_i^{(t)} = \sum\limits_{k = 1}^{t} f_k\left( x_i \right) = \hat{y}_i^{(t-1)} + f_t\left( x_i \right) \tag{7}$$
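The iterative scheme of Eqs. (4)–(7) can be sketched as a plain residual-fitting loop. This is a simplified illustration, not the XGBoost implementation: it uses squared loss, scikit-learn decision trees as stand-ins for CARTs, and adds a shrinkage factor `lr` that the equations above omit; all data and names are assumptions for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(X, y, n_trees=50, lr=0.1, max_depth=3):
    """Additive training: start from y_hat^(0) = 0 (Eq. 4) and let each
    new tree f_t fit the residual of the current ensemble (Eq. 7)."""
    pred = np.zeros(len(y))                    # Eq. (4): initial prediction
    trees = []
    for _ in range(n_trees):
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, y - pred)                  # the tree fits the residual
        pred = pred + lr * tree.predict(X)     # Eq. (7), with shrinkage lr
        trees.append(tree)
    return trees, pred

# Synthetic smooth target to show the residual shrinking over iterations.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]
trees, pred = fit_boosted_trees(X, y)
```

Each pass through the loop is one application of Eq. (7): the ensemble prediction is the previous prediction plus the new tree's (shrunken) output.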

The objective function of the XGBoost algorithm, that is, the loss function of the \(t\)-th step (Eq. (8)), can be obtained from the iterative residual process. For a general loss function, XGBoost performs a second-order Taylor expansion in order to extract more gradient information, and removes constant terms at the same time, so that training by gradient descent works better. Equations (9) and (10) define \(g_i\) and \(h_i\), the first- and second-order derivatives of the loss with respect to the prediction.

$$f_{obj}^{(t)} = \sum\limits_{i = 1}^{n} l\left( y_i ,\hat{y}_i^{(t)} \right) + \sum\limits_{i = 1}^{t} \Omega\left( f_i \right) = \sum\limits_{i = 1}^{n} l\left( y_i ,\hat{y}_i^{(t-1)} + f_t\left( x_i \right) \right) + \Omega\left( f_t \right) + C \tag{8}$$

$$g_i = \partial_{\hat{y}_i^{(t-1)}} l\left( y_i ,\hat{y}_i^{(t-1)} \right) \tag{9}$$

$$h_i = \partial_{\hat{y}_i^{(t-1)}}^{2} l\left( y_i ,\hat{y}_i^{(t-1)} \right) \tag{10}$$
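For the concrete case of squared loss, the derivatives of Eqs. (9) and (10) have a closed form. The snippet below is a small sketch (the function name and sample values are illustrative, not from the paper) and checks the analytic gradient against a finite-difference approximation.

```python
import numpy as np

def grad_hess_squared_loss(y, y_pred):
    """g_i and h_i of Eqs. (9)-(10) for l(y, yhat) = (yhat - y)^2:
    g = dl/dyhat = 2 (yhat - y),  h = d^2 l / dyhat^2 = 2."""
    g = 2.0 * (y_pred - y)
    h = 2.0 * np.ones_like(y_pred)
    return g, h

y = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 1.0, 3.0])
g, h = grad_hess_squared_loss(y, y_pred)  # g = [1, -4, 0], h = [2, 2, 2]
```

For squared loss the Hessian is constant, which is why, as the text notes, the split nodes simply fit the residual in that case; a general loss makes \(h_i\) data-dependent.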

$$\Omega\left( f \right) = \gamma T + \frac{1}{2}\lambda \sum\limits_{j = 1}^{T} \omega_j^2 \tag{11}$$

Unlike some other algorithms, XGBoost adds a regularization term \(\Omega(f)\) (Eq. (11)) to prevent over-fitting and improve model accuracy. \(\Omega(f)\) measures the complexity of a tree: the smaller its value, the stronger the generalization ability of the tree. \(\omega_j\) is the weight on the \(j\)-th leaf node of the tree \(f\), \(T\) is the total number of leaf nodes in the tree, \(\gamma\) is the penalty on the number of leaves, and \(\lambda\) is the L2 regularization penalty on the leaf weights; both are user-defined parameters of the algorithm. The objective function can then be written as Eqs. (12)–(14), where \(I_j = \left\{ i \mid q(x_i) = j \right\}\) is the set of samples on the \(j\)-th leaf node^{28,34}.

$$f_{obj} = -\frac{1}{2}\sum\limits_{j = 1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T \tag{12}$$

$$G_j = \sum\limits_{i \in I_j} g_i \tag{13}$$

$$H_j = \sum\limits_{i \in I_j} h_i \tag{14}$$
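Equations (12)–(14) can be evaluated directly once each sample is assigned to a leaf. The sketch below is illustrative (the function, the toy gradients, and the leaf assignment are assumptions, not the paper's code); it also returns the optimal leaf weights \(\omega_j^* = -G_j / (H_j + \lambda)\) that attain the score in Eq. (12).

```python
import numpy as np

def structure_score(g, h, leaf_index, n_leaves, lam, gamma):
    """Eqs. (13)-(14): G_j and H_j sum g_i and h_i over the samples on
    leaf j; Eq. (12) then scores the whole tree structure."""
    G = np.array([g[leaf_index == j].sum() for j in range(n_leaves)])
    H = np.array([h[leaf_index == j].sum() for j in range(n_leaves)])
    f_obj = -0.5 * np.sum(G ** 2 / (H + lam)) + gamma * n_leaves
    w_opt = -G / (H + lam)   # optimal weight per leaf
    return f_obj, w_opt

g = np.array([1.0, -2.0, 0.5, 0.5])   # toy per-sample gradients
h = np.array([1.0, 1.0, 1.0, 1.0])    # toy per-sample Hessians
leaf_index = np.array([0, 0, 1, 1])   # q(x_i): the leaf each sample lands on
score, w = structure_score(g, h, leaf_index, n_leaves=2, lam=1.0, gamma=0.1)
```

A lower `score` indicates a better tree structure; XGBoost's split finding compares exactly this quantity before and after a candidate split.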

### Ridge regression model construction

Classical least squares estimation is often used to build pollutant concentration prediction models, and it can also yield the quantitative relationship between the various influencing factors and pollutant concentration^{15}. However, the factors affecting pollutant concentration are complicated, and the preceding correlation analysis shows significant correlation among them. If a multiple linear regression model were established directly, multicollinearity would make the model's regression coefficients very unstable and degrade its applicability. Ridge regression is often used to solve this multicollinearity problem. We take the national control point O_{3} concentration as the dependent variable and the pollutant concentrations and meteorological parameters measured at the self-built point as the independent variables, and establish a ridge regression model with the help of SPSS (Version 20.0, https://www.ibm.com/cn-zh/analytics/spss-statistics-software).

In this paper, the ridge trace method is used to select the independent variables introduced into the model and the ridge parameter \(k\). In Fig. 4, the abscissa is the value of the ridge parameter \(k\), and each curve is the standardized ridge regression coefficient of one variable. It can be seen that \(x_4\), \(x_6\), and \(x_{10}\) have relatively stable ridge regression coefficients with relatively small absolute values, indicating that these variables have little impact on the O_{3} concentration; they can be deleted in the actual modeling. In addition, although the standardized ridge regression coefficient of \(x_2\) is not small, it is very unstable and tends rapidly to zero as \(k\) increases. Variables whose ridge regression coefficients are unstable and oscillate rapidly toward zero can also be eliminated from the ridge regression model.

After selecting the independent variables of the ridge regression model, the next step is selecting the ridge parameter \(k\). We reduce the step size of the ridge parameter \(k\) to 0.02 and draw the ridge trace diagram of the remaining variables in Fig. 5. It can be seen that when \(k = 0.2\), the ridge trace of each variable is relatively stable and the coefficient of determination *R*^{2} is not reduced much, so \(k = 0.2\) is selected. Finally, with the help of SPSS, the selected variables and ridge parameter are used to fit the ridge regression model. Table 3 shows the unstandardized ridge regression equations for the six types of pollutants. Using these equations, the ridge regression prediction for the concentration of each pollutant can be obtained.
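The ridge trace itself is straightforward to compute: evaluate Eq. (2) on a grid of \(k\) values (step 0.02, as above) and inspect how each coefficient evolves. The snippet below is a self-contained sketch on synthetic standardized data; `X` and `y` are placeholders for the standardized self-built-point measurements, not the paper's data.

```python
import numpy as np

# Synthetic stand-in: three standardized predictors, two nearly collinear.
rng = np.random.default_rng(2)
z = rng.normal(size=(100, 3))
X = np.column_stack([z[:, 0], z[:, 0] + 0.01 * z[:, 1], z[:, 2]])
X = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize, as in the text
y = z[:, 0] + 0.5 * z[:, 2] + 0.1 * rng.normal(size=100)

# Ridge trace: coefficients of Eq. (2) over a grid of k with step 0.02.
ks = np.arange(0.0, 1.0 + 1e-9, 0.02)
trace = np.stack([
    np.linalg.solve(X.T @ X + k * np.eye(X.shape[1]), X.T @ y) for k in ks
])  # trace[m, j] is the coefficient of variable j at ridge parameter ks[m]
```

Plotting each column of `trace` against `ks` reproduces a diagram like Fig. 5; \(k\) is chosen where the curves flatten out while *R*^{2} remains acceptable.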

### RR-XGBoost model construction

The ridge regression model can predict pollutant concentration, and it also shows the quantitative influence of each input variable on the concentration. However, ridge regression captures only the linear relationships between variables; the nonlinear relationships between the various factors and pollutant concentration remain unmodeled. This study therefore uses the ridge regression predictions and the self-built point measurements as input, and the pollutant concentration monitored by the national control point as output, and applies the XGBoost algorithm to establish a prediction model for each pollutant. We call this model the RR-XGBoost model. Figure 6 is the flow diagram of the RR-XGBoost model.

Before constructing the RR-XGBoost model, all samples are first divided randomly into a training set and a test set at a ratio of 8:2 (the data sets of the other five pollutants are divided in the same way), and all data are normalized to the range [0, 1] based on experience^{29,34}. The modeling in this paper is implemented in Python; the simulation platform is PyCharm, and the grid search method (GSM) is used to find the optimal parameter combination.
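The data preparation step can be sketched with standard scikit-learn utilities. This is illustrative only: `X` and `y` stand in for one pollutant's feature matrix and the national-control-point concentrations, and the `random_state` is an assumption. The scaler is fitted on the training set alone so that no test-set statistics leak into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Placeholder data standing in for one pollutant's samples.
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 8))
y = rng.normal(size=500)

# Random 8:2 split, as described in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Normalize to [0, 1]; fit on the training set, then apply to the test set.
scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
```

Because the scaler is fitted only on training data, individual test-set values may fall slightly outside [0, 1]; that is expected and preferable to leaking test statistics.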

The XGBoost model has many parameters. Optimizing all of them would strain the computer's memory and greatly increase the optimization time, so this paper selects the following four main parameters for optimization: (i) the number of gradient boosted trees, n_estimators; larger values can improve accuracy but also increase memory use and training time, so the optimization range in this article is 100–300; (ii) the maximum tree depth of the base learners, max_depth, which is used to avoid overfitting, with a value range of 3–10; (iii) the learning rate, learning_rate, with a value range of 0.01–0.3; and (iv) the minimum sum of instance weight (hessian) needed in a child, min_child_weight, which, like max_depth, is used to avoid over-fitting, with a value range of 1–9. The four initial parameter values of the XGBoost model are set to 100, 6, 0.1, and 1. In addition, GSM requires a search step for each parameter during optimization (this article takes 10, 1, 0.01, and 1, respectively).
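The GSM step can be sketched with scikit-learn's `GridSearchCV`. The paper tunes `xgboost.XGBRegressor`, which accepts the parameter names listed above; to keep this sketch self-contained and fast, `GradientBoostingRegressor` stands in for XGBoost, `min_samples_leaf` roughly plays the role of `min_child_weight`, and the grid is deliberately tiny (the paper's full ranges are noted in the comments). The data and all names are assumptions for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Placeholder data standing in for the normalized pollutant samples.
rng = np.random.default_rng(4)
X = rng.uniform(-3, 3, size=(200, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.05 * rng.normal(size=200)

# Tiny illustrative grid; the paper searches n_estimators 100-300 (step 10),
# max_depth 3-10 (step 1), learning_rate 0.01-0.3 (step 0.01), and
# min_child_weight 1-9 (step 1) on the XGBoost model.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 4],
    "learning_rate": [0.05, 0.1],
    "min_samples_leaf": [1, 5],   # stand-in for min_child_weight
}
search = GridSearchCV(GradientBoostingRegressor(random_state=0),
                      param_grid, cv=3, scoring="r2")
search.fit(X, y)
best = search.best_params_   # the optimal parameter combination found by GSM
```

`GridSearchCV` exhaustively evaluates every combination in `param_grid` by cross-validation, which is exactly the GSM strategy; with the real XGBoost estimator the grid above would simply use `min_child_weight` in place of `min_samples_leaf`.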

Table 4 shows the parameters of the XGBoost model determined by the grid search method. To show the fitting effect of the RR-XGBoost model more intuitively, this paper plots the fit for O_{3} concentration in Fig. 7. The correlation coefficient between the true O_{3} concentration and the model's predicted concentration exceeds 0.95 on both the training set and the test set. In addition, the regression coefficients of the two fitted lines (training set and test set) are close to 1, indicating that the model performs well in predicting pollutant concentration. Figure 8 is the residual analysis diagram of the RR-XGBoost model. Most of the residuals are randomly distributed within [-40, 40], and the residual distribution histogram shows that they are centered around zero and approximately normally distributed as a whole.
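The quantities reported for Figs. 7 and 8 (the correlation coefficient and the residuals) can be computed in a few lines. The helper below is a sketch on synthetic stand-in data, not the paper's O_{3} results.

```python
import numpy as np

def evaluate_fit(y_true, y_pred):
    """Pearson correlation coefficient and residuals, as used in Figs. 7-8."""
    r = np.corrcoef(y_true, y_pred)[0, 1]
    resid = y_true - y_pred
    return r, resid

# Synthetic stand-in for the test-set concentrations and model predictions.
rng = np.random.default_rng(5)
y_true = rng.uniform(20, 180, size=200)
y_pred = y_true + rng.normal(0, 10, size=200)

r, resid = evaluate_fit(y_true, y_pred)
```

A histogram of `resid` (e.g. via `numpy.histogram` or matplotlib) then gives the residual distribution diagram of Fig. 8, on which the normality and zero-centering observations are based.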