1 Introduction
With the progress of biotechnology and related technologies, microarray technology now plays an important role in measuring the expression of thousands of genes from an experimental sample simultaneously, instead of working on a gene-by-gene basis
[1–3]. Biologists and medical scientists have also recognized that gene-set testing (e.g., judging where 2019-nCoV comes from by gene sequencing and testing) or pathway analysis is quite meaningful and necessary
[4, 5]. Due to the cost, the data available in public databases now consist mainly of expression data with between 10,000 and 55,000 probe sets for fewer than 10 samples; that is, the large amount of data generated by microarray technology is due mainly to the large number of genes represented on the array rather than to a large number of samples, while for each gene the number of RNA samples assayed is typically small
[6]. In addition, it is not uncommon to see fewer than 10 subjects per tumor, e.g., Lönnstedt
[7], Kendziorski, et al.
[8], Dong, et al.
[9].
As a consequence, we are faced with a "large p, small n" paradigm, where p is the total number of genes and n is the number of replications. This problem is quite common in high-dimensional data. The sample variances are unreliable due to the relatively small number of replications, and the commonly used statistical methods, such as the t test or F test, for detecting differentially expressed genes on a gene-by-gene basis have low power
[10]. Consequently, a growing body of literature pays attention to improving the variance estimation of high-dimensional small-sample data and the associated testing methods.
In terms of variance estimation, it is well known that improved variance estimation can substantially improve classification accuracy
[11, 12] and the power of tests for differential gene expression with high-dimensional, low-sample-size data. Hence, in order to improve variance estimation, a variety of approaches have been proposed: Bayesian estimators, regression estimators, and shrinkage estimators. The Bayesian method needs to specify a prior distribution and estimate its parameters, which makes it relatively complex; it can be found in Kendziorski, et al.
[8], Smyth
[13], Fox and Dimmic
[14], Opgen-Rhein and Strimmer
[15], Hwang and Liu
[16] and Ji, et al.
[17]. The second method of variance estimation, regression estimation, includes parametric and nonparametric approaches; it applies regression techniques to model the tendency of the variance to increase with the intensity in microarray data
[18]. However, the inconsistency of the resulting estimate is a drawback. Specifically, it can be found in Carroll and Wang
[19] and Wang, et al.
[20]. For the Shrinkage estimators, the idea originated from the James-Stein estimator of means
[21]. One of the earliest methods to stabilize the variance estimation was proposed by Tusher, et al.
To avoid the undue influence of small variance estimates, they proposed to estimate the standard deviation
by
in their SAM test, where
is a constant acting as a shrinkage factor. Applying the standard James-Stein shrinkage method to log transformed estimates of variances, Cui, et al.
[6] proposed a James-Stein type shrinkage estimator for variances. Then, Tong and Wang
[23] and Tong, et al.
[24] derived a family of shrinkage estimators
and their asymptotic properties for gene-specific variances raised to a fixed power
(nonzero) extending the idea from Cui, et al.
[6] to a more general setting under the Stein loss function:
and the squared loss function:
. These estimators borrow information across genes by shrinking each gene-specific variance estimator toward the bias-corrected geometric mean of the variance estimators for all genes. However, only an iterative algorithm is given; there is no explicit expression for the shrinkage estimators of the variances, and the computation is intensive.
Therefore, there is a need to overcome these deficiencies. In this paper, under a new loss function:
(Arsigny, et al.
[25]), we propose an optimal geometric shrinkage estimator for a power of the variance based on the James-Stein estimator, which not only provides an explicit expression for the gene-specific variances but also computes quickly. As we know, computing speed is very important in processing high-dimensional data.
In the context of hypothesis testing, many earlier tests modify the t test (or, equivalently, the F test) by shrinking only the variances or the standard errors, for example, the SAM test
[22], the test of Wright and Simon
[26], the moderated
test
[13], the multivariate MB-statistic
[27] and the
test
[6]. Later, Tong and Wang
[23] proposed a test similar to the
test and their idea has also been applied to discriminant analysis in Pang, et al.
[3]. The
test uses a shrinkage factor depending on the data, which seems more desirable. As we all know, the assumption that the variances are equal for all genes is unlikely to be true. Contemporaneously, the optimal discovery procedure (ODP) proposed in Storey
[28] and Storey, et al.
[29] was shown to be much more powerful than the above procedures. However, it is very computationally intensive. Later, Hwang and Liu
[16] proposed MAP tests, which maximize the average power with respect to some weight functions or prior distributions on the parameters and are more powerful than the t
test, but they need to shrink both the means and the variances.
In recent years, a number of approaches to improve Hotelling's
test have emerged for testing high-dimensional data. These approaches differ mainly in how the covariance matrix is handled and can be classified into the following three categories. The first category is the unscaled Hotelling's tests
[4, 12], in which the covariance matrix is removed from Hotelling's
statistic to avoid estimating the covariance matrix. The proposed tests were demonstrated to have better power than Hotelling's
test under the requirement of
and
being of the same order. The second category is the regularized Hotelling's tests
[12, 30], in which regularization is applied to the covariance matrix estimate to resolve the singularity problem. The third category is the diagonal Hotelling's tests
[31–36], in which the covariance matrix is assumed to be diagonal. Under this assumption, the singularity problem is circumvented, since a diagonal matrix with nonzero diagonal entries is always invertible, whether or not p
is larger than n
.
Here, we use the proposed optimal geometric shrinkage estimator for a power of the variance to construct the diagonal Hotelling's test statistics and compare them with other 'information-sharing' statistics, including the unscaled and regularized Hotelling's tests. The results show that the test statistics based on the proposed shrinkage estimator have the highest, or nearly the highest, power.
The rest of the paper is organized as follows. In Section 2, we study how to obtain the geometric shrinkage estimator for a power of the variance and give the relevant theorem. In Section 3, the performance of the proposed estimator is evaluated and compared with some existing shrinkage estimators. In Section 4, we show how to use the geometric shrinkage estimator for a power of the variance to construct shrinkage-based diagonal Hotelling's test statistics for testing differential expression of genes with different forms of covariance and compare them with some existing significance tests. The application is then illustrated with a real data case in Section 5. Finally, we conclude in Section 6 with a brief discussion.
2 Geometric Shrinkage Estimate
Let be the variance of , . Assume each is measured on independent samples and denote as the -th measurement on , and . It is well known that, for each , can be estimated by , where . To estimate , a geometric shrinkage estimator is proposed as
with some geometric shrinkage parameter , where is the target statistic and , are tuning parameters, which are chosen as follows: is an unbiased estimator of , and is an unbiased estimator of , where . We remark that two particularly interesting cases in (2.1) are and , which are the estimators of the variance and its inverse, respectively.
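The multiplicative idea behind such an estimator can be sketched in code: each naive gene-specific variance is pulled toward the geometric mean of all variances on the log scale. The fixed weight `alpha` and the omission of the bias-correction tuning constants are simplifying assumptions here; the paper derives the optimal shrinkage parameter analytically.

```python
import numpy as np

def geometric_shrinkage_variance(data, alpha=0.5):
    """Illustrative geometric (multiplicative) shrinkage of variances.

    Each gene's sample variance is shrunk toward the geometric mean of
    all gene-specific variances.  `alpha` is a fixed shrinkage weight in
    [0, 1] standing in for the paper's optimal, data-driven parameter:
    alpha=0 keeps the naive variances, alpha=1 returns the common target.

    data : (G, n) array, G genes measured on n independent samples.
    """
    s2 = data.var(axis=1, ddof=1)          # naive gene-specific variances
    target = np.exp(np.mean(np.log(s2)))   # geometric mean across genes
    # weighted product on the log scale = geometric shrinkage
    return target**alpha * s2**(1 - alpha)
```

Because the shrinkage is multiplicative, each shrunk variance lies between the naive estimate and the common target on the log scale, which is the "borrowing information across genes" effect described above.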
We assume that
,
,
. It follows from
[23] that, for any
,
,
where is the gamma function.
Under the framework of Wald's statistical decision theory, the optimal geometric shrinkage parameter is obtained by specifying a loss function. In this paper, the loss function is chosen as
where and its estimate are both positive definite matrices. Let
Then, the optimal geometric shrinkage parameter is
Remark 1 If for all and , then (4) is
which does not depend on , so does . In this case, the estimator of is
where is the optimal geometric shrinkage parameter for .
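For concreteness, a sketch of what such a loss looks like, under the assumption that it is the log-Euclidean metric of Arsigny, et al. [25] applied to positive definite (here diagonal) matrices:

```latex
% Assumed form, following the log-Euclidean framework of Arsigny, et al. [25]:
% for positive definite \Sigma and its estimate \hat{\Sigma},
L(\Sigma, \hat{\Sigma}) = \bigl\| \log \hat{\Sigma} - \log \Sigma \bigr\|_F^2,
% which, for diagonal matrices with entries \sigma_i^2 and \hat{\sigma}_i^2,
% reduces to a sum of squared log-variance differences:
L(\Sigma, \hat{\Sigma}) = \sum_i \bigl( \log \hat{\sigma}_i^2 - \log \sigma_i^2 \bigr)^2.
```

A loss of this form is natural for geometric shrinkage, since the estimator is a weighted product and therefore linear on the log scale.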
Theorem 1 Assume that , , . Then for any , ,
with
and
The likelihood-unbiased estimator of is
with
3 Simulations
We compare the performance with Tong and Wang
[23] under two cases. Case 1 is
, and Case 2 is
. We conduct simulations to evaluate the performance of the proposed geometric type shrinkage estimators
with
based on the percentage relative improvement in average loss (PRIAL), which is defined as
where
and
are the respective naive and proposed shrinkage estimators of
. A shrinkage estimator improves the naive estimator if its PRIAL is larger than
We also compare the performance of the proposed geometric type shrinkage estimators with that of the two estimators
and
proposed by Tong and Wang
[23]. Both of them assume
but the optimal shrinkage parameter
is obtained by minimizing the Stein loss function and
by minimizing the squared loss function. No explicit expression can be found for these optimal shrinkage parameters and a grid search method has to be used to obtain them under the classical asymptotic framework.
We set and simulated 100 random samples of size from normal distributions. In the -th simulation, for a given sample size , the true variance of each component of the sample was drawn from a Gamma distribution with shape parameter and rate parameter . That is, for each , we first drew observations from and then, for each , randomly selected observations from with randomly drawn from or . Denote these observations by . The four shrinkage estimators , , as defined above, and the naive estimator of were then calculated from these observations. The final PRIAL of each shrinkage estimator is estimated by
where .
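The simulation scheme just described can be sketched as follows. The Gamma parameters, the fixed shrinkage weight, and the use of a squared loss on the log scale are illustrative assumptions rather than the paper's exact settings (the paper uses the optimal, data-driven shrinkage parameter).

```python
import numpy as np

rng = np.random.default_rng(1)

def prial_simulation(G=1000, n=6, shape=2.0, rate=1.0, alpha=0.5, reps=100):
    """Monte Carlo sketch of a PRIAL comparison.

    True variances are drawn from Gamma(shape, rate), normal samples of
    size n are generated per gene, and the average log-scale squared loss
    of the naive variances is compared with that of a geometric shrinkage
    estimator with fixed weight `alpha` (an assumption; the optimal
    parameter is derived analytically in the paper).
    """
    loss_naive = loss_shrink = 0.0
    for _ in range(reps):
        sigma2 = rng.gamma(shape, 1.0 / rate, size=G)        # true variances
        X = rng.normal(scale=np.sqrt(sigma2)[:, None], size=(G, n))
        s2 = X.var(axis=1, ddof=1)                           # naive estimates
        target = np.exp(np.mean(np.log(s2)))                 # geometric mean
        shrunk = target**alpha * s2**(1 - alpha)
        loss_naive += np.mean((np.log(s2) - np.log(sigma2))**2)
        loss_shrink += np.mean((np.log(shrunk) - np.log(sigma2))**2)
    # PRIAL in percent: relative reduction in average loss vs. the naive estimator
    return 100.0 * (loss_naive - loss_shrink) / loss_naive
```

A positive return value indicates that the shrinkage estimator improves on the naive sample variance under this loss, mirroring the PRIAL criterion defined above.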
Figure 1 presents the plots of PRIAL against the sample size for the three estimators in the case considered in the simulations, for four different pairs of , when the gene expression levels are normally distributed. We can see from these figures that the proposed shrinkage estimator yields the largest improvement over the naive estimator for all sample sizes.
Figure 1 PRIALs of different estimators (1)
Figure 2 presents the corresponding plots of PRIAL against the sample size for the three estimators in the case , for four different pairs of , when the gene expression levels are normally distributed. Again, the proposed shrinkage estimator yields the largest improvement over the naive estimator for all sample sizes.
Figure 2 PRIALs of different estimators (2)
4 Application
In this section, we use the proposed estimator to construct the diagonal Hotelling's test for two-sample cases. According to Dong, et al.
[36], the shrinkage-based diagonal Hotelling's test for two-sample cases can be expressed as
where . Note that, when , the chi-squared distribution (SDchi) can be used as a good approximate null distribution for small , while the normal distribution (SDnor) is a good approximation for large under the null hypothesis. The approximate null distributions are, respectively,
where and are the mean and variance of the normal distribution, and is the degrees of freedom of the chi-squared distribution.
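A minimal sketch of such a shrinkage-based diagonal Hotelling's statistic for two samples, with the pooled per-gene variances replaced by geometric shrinkage estimates; the fixed weight `alpha` stands in for the optimal shrinkage parameter, and the approximate null distributions above are omitted.

```python
import numpy as np

def shrinkage_diagonal_hotelling(X, Y, alpha=0.5):
    """Illustrative shrinkage-based diagonal Hotelling's two-sample statistic.

    Only the diagonal of the covariance matrix is used, so no matrix
    inversion is needed; each pooled per-gene variance is shrunk toward
    the geometric mean of all pooled variances (alpha is a fixed
    illustration, not the paper's optimal parameter).

    X : (p, n1) array, Y : (p, n2) array (genes in rows).
    """
    n1, n2 = X.shape[1], Y.shape[1]
    diff = X.mean(axis=1) - Y.mean(axis=1)
    # pooled per-gene variance
    ss = (((X - X.mean(axis=1, keepdims=True))**2).sum(axis=1)
          + ((Y - Y.mean(axis=1, keepdims=True))**2).sum(axis=1))
    s2 = ss / (n1 + n2 - 2)
    target = np.exp(np.mean(np.log(s2)))
    shrunk = target**alpha * s2**(1 - alpha)
    # sum of squared standardized mean differences over the p genes
    return (n1 * n2 / (n1 + n2)) * np.sum(diff**2 / shrunk)
```

Because only per-gene variances appear in the denominator, the statistic is well defined even when the dimension p far exceeds the sample sizes, which is exactly the regime of interest here.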
Under different scenarios of
and
, simulation comparisons are performed with the existing competitors: the unscaled Hotelling's test (CHEN)
[4], the regularized Hotelling's test (RHT)
[12], and the diagonal Hotelling's tests (PCT)
[31] and (SRI)
[34]. In addition, we analyze a gene expression data set to illustrate the shrinkage-based diagonal Hotelling's test under the proposed inverse variance estimator.
4.1 Shrinkage-Based Diagonal Hotelling's Tests
In this subsection, we compare the shrinkage-based diagonal Hotelling's tests, including the one with the chi-squared null distribution (SDchi) and the one with the normal null distribution (SDnor), with the methods mentioned above.
In our simulation, we generate data from the multivariate normal distribution with a common covariance matrix . To assess the type I error rate, the data for both groups are generated from . To assess the power, one group of data is generated from and the other from , where for is randomly drawn from the scaled chi-squared distribution .
The structure of the common covariance matrix is , where and is the correlation matrix. We use the following block-diagonal matrix as the correlation matrix:
where is a matrix and . We consider the following settings for : , where for . Let denote this type of common covariance matrix. For , the correlation matrix has the first-order autoregressive (AR(1)) structure. In our simulation, we set from 5 to 60, and the effect sizes are and . For different correlations, we set . The type I error rate and power are obtained by running 1000 simulations under the settings of , and , where is the significance level. We first focus on the performance of all approaches for small sample sizes. The type I error rate and power are reported in Tables 1 and 2, respectively.
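The block-diagonal covariance construction can be sketched as follows; the block size, the Gamma draw for the gene-specific variances, and the helper name are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def block_ar1_covariance(p, rho, block_size=20, sigma2=None, seed=0):
    """Build a block-diagonal common covariance matrix Sigma = D^{1/2} R D^{1/2}.

    R is block diagonal with AR(1) correlation blocks R_ij = rho**|i-j|,
    and D holds gene-specific variances (drawn here from a Gamma
    distribution as an illustrative choice when not supplied).
    """
    rng = np.random.default_rng(seed)
    idx = np.arange(block_size)
    block = rho ** np.abs(idx[:, None] - idx[None, :])   # AR(1) block
    nblocks = int(np.ceil(p / block_size))
    R = np.zeros((p, p))
    for b in range(nblocks):
        lo, hi = b * block_size, min((b + 1) * block_size, p)
        R[lo:hi, lo:hi] = block[:hi - lo, :hi - lo]
    if sigma2 is None:
        sigma2 = rng.gamma(2.0, 1.0, size=p)             # illustrative variances
    d = np.sqrt(sigma2)
    return R * np.outer(d, d)
```

Since each AR(1) block is positive definite for |rho| < 1, the whole block-diagonal matrix is positive definite, so both the null and alternative samples above can be drawn from a valid multivariate normal distribution.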
Table 1 Type I error rates for under the null case
Correlation | | n | SDnor | SDchi | RHT | CHEN | SRI | PCT/wu |
0 | | 5 | 0.042 | 0.0405 | 0.061 | 0 | 0.3712 | 0.173 |
| | 6 | 0.0395 | 0.047 | 0.078 | 0 | 0.2852 | 0.2172 |
| | 7 | 0.045 | 0.0435 | 0.064 | 0 | 0.2282 | 0.108 |
| | 8 | 0.04 | 0.0445 | 0.076 | 0 | 0.193 | 0.08 |
| | 9 | 0.0435 | 0.041 | 0.063 | 0 | 0.1584 | 0.062 |
| | 10 | 0.0435 | 0.05 | 0.063 | 0 | 0.1372 | 0.046 |
| | 60 | 0.0365 | 0.0486 | 0.053 | 0.086 | 0.0582 | 0.043 |
0.3 | | 5 | 0.047 | 0.0435 | 0.08 | 0 | 0.3504 | 0.1512 |
| | 6 | 0.046 | 0.0445 | 0.056 | 0 | 0.2618 | 0.1512 |
| | 7 | 0.046 | 0.0515 | 0.071 | 0 | 0.2092 | 0.097 |
| | 8 | 0.05 | 0.0445 | 0.056 | 0 | 0.1772 | 0.069 |
| | 9 | 0.048 | 0.049 | 0.063 | 0 | 0.1586 | 0.073 |
| | 10 | 0.0475 | 0.055 | 0.054 | 0 | 0.1414 | 0.056 |
| | 60 | 0.043 | 0.0495 | 0.053 | 0.073 | 0.0626 | 0.051 |
0.5 | | 5 | 0.0615 | 0.061 | 0.074 | 0 | 0.3046 | 0.25 |
| | 6 | 0.0645 | 0.0655 | 0.062 | 0 | 0.227 | 0.151 |
| | 7 | 0.0685 | 0.0655 | 0.07 | 0 | 0.177 | 0.088 |
| | 8 | 0.0655 | 0.071 | 0.067 | 0 | 0.158 | 0.073 |
| | 9 | 0.057 | 0.075 | 0.067 | 0 | 0.135 | 0.068 |
| | 10 | 0.067 | 0.0835 | 0.056 | 0 | 0.1214 | 0.065 |
| | 60 | 0.056 | 0.0805 | 0.055 | 0.047 | 0.0566 | 0.038 |
Table 2 Powers for and under the alternative case
Correlation | | n | SDnor | SDchi | RHT | CHEN | SRI | PCT/wu |
0 | | 5 | 0.8845 | 0.734 | 0.804 | 0 | 0.995 | 0.855 |
| | 6 | 0.919 | 0.9165 | 0.934 | 0 | 0.997 | 0.973 |
| | 7 | 0.9865 | 0.9775 | 0.968 | 0 | 0.991 | 0.985 |
| | 8 | 0.9925 | 0.9905 | 0.954 | 0.005 | 0.997 | 0.982 |
| | 9 | 0.997 | 0.995 | 0.982 | 0.009 | 1 | 1 |
| | 10 | 1 | 1 | 0.99 | 0.11 | 0.999 | 1 |
| | 60 | 1 | 1 | 1 | 1 | 1 | 1 |
0.3 | | 5 | 0.8055 | 0.8175 | 0.812 | 0 | 0.957 | 0.964 |
| | 6 | 0.849 | 0.9825 | 0.946 | 0 | 0.954 | 0.959 |
| | 7 | 0.9585 | 0.981 | 0.948 | 0 | 0.994 | 0.914 |
| | 8 | 0.9935 | 0.993 | 0.938 | 0 | 0.996 | 0.982 |
| | 9 | 0.998 | 0.995 | 0.994 | 0.025 | 0.997 | 0.944 |
| | 10 | 0.9985 | 1 | 0.996 | 0.053 | 0.999 | 1 |
| | 60 | 1 | 1 | 1 | 1 | 1 | 1 |
0.5 | | 5 | 0.813 | 0.784 | 0.816 | 0 | 0.956 | 0.966 |
| | 6 | 0.936 | 0.878 | 0.888 | 0 | 0.957 | 0.936 |
| | 7 | 0.944 | 0.9415 | 0.934 | 0 | 0.984 | 0.915 |
| | 8 | 0.985 | 0.972 | 0.938 | 0 | 0.993 | 0.983 |
| | 9 | 0.997 | 0.999 | 0.988 | 0.013 | 0.999 | 0.992 |
| | 10 | 1 | 0.999 | 0.96 | 0.036 | 0.998 | 0.993 |
| | 60 | 1 | 1 | 1 | 1 | 1 | 1 |
From the results in Tables 1 and 2, we observe that the shrinkage-based diagonal Hotelling's tests outperform the other methods for different . When the correlation is weak, our methods control the type I error rate well and at the same time maintain high power. As the correlation increases, the type I error rate of our methods becomes higher but is still better than those of the other methods. PCT, SRI, and RHT have high type I error rates when the sample size is smaller than 10. CHEN accepts the null hypothesis too often, and its power is low. For higher correlations, these four approaches perform similarly.
In addition, we can see that the diagonal Hotelling's tests, including our shrinkage-based methods, have the highest ROC curves, while the unscaled Hotelling's test has the lowest. This demonstrates that, with limited numbers of observations, the diagonal Hotelling's tests are the best option.
The superiority of the shrinkage-based diagonal Hotelling's tests is also demonstrated in Figure 3 and Table 3. Figure 3 shows the plots of the ROC curves, and the respective AUC values are shown in Table 3. Here, we plot the ROC curves with FPR values ranging from 0 to 1, and the AUC values are calculated over the same range. Without loss of generality, we plot the ROC curves for in Figure 3 to assess the overall performance of all approaches for small sample sizes. The figure shows that our methods, SDchi and SDnor, have the largest AUC values and the highest ROC curves for all three correlations.
Figure 3 ROC curves for , and . "AR" represents
Table 3 AUC values for , and
Correlation | | SDnor | SDchi | RHT | CHEN | SRI | PCT |
0 | | 0.798 | 0.804 | 0.678 | 0.672 | 0.749 | 0.739 |
0.4 | | 0.776 | 0.761 | 0.704 | 0.668 | 0.698 | 0.687 |
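The ROC curves and AUC values can be computed from simulated test statistics as in the following sketch, where `null_stats` and `alt_stats` are statistics simulated under the null and the alternative for any of the compared tests.

```python
import numpy as np

def roc_auc(null_stats, alt_stats):
    """Compute ROC points and the AUC from simulated test statistics.

    FPR/TPR pairs are obtained by sweeping a rejection threshold over the
    pooled statistics; the AUC is computed via the Mann-Whitney identity
    AUC = P(statistic under the alternative > statistic under the null).
    """
    thresholds = np.sort(np.concatenate([null_stats, alt_stats]))[::-1]
    fpr = np.array([(null_stats > t).mean() for t in thresholds])
    tpr = np.array([(alt_stats > t).mean() for t in thresholds])
    auc = (alt_stats[:, None] > null_stats[None, :]).mean()
    return fpr, tpr, auc
```

Restricting the sweep to small FPR values, as done for the real data in Section 5, amounts to truncating the threshold grid on the null side.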
4.2 Real Data Analysis
As we all know, in many gene expression studies of cancers or tumors the number of samples is small relative to the number of probe sets, and this phenomenon is not uncommon. This type of data is typical high-dimensional small-sample data. Therefore, in this section we assess the performance of the shrinkage-based diagonal Hotelling's tests when they are applied to a real gene expression data set. We study the performance of each test method on glioma tumor data, which can be downloaded from GEO DataSets under accession number GDS1962.
There are four classes of data in this data set: One non-tumor class and three glioma classes. In total, the data include 54,613 probes and 180 samples. We use the non-tumor (NON) class and the astrocytomas (AS) class, with sample sizes of 23 and 26, respectively. Before applying each test method, we randomly select 50 genes from the genes, and then use the test statistics in this paper to compare the advantages and disadvantages of each test statistic. To illustrate how the FPR and TPR are calculated, we use the bootstrap method to select two groups from the non-tumor (NON) class to assess the type I error. For the TPR, we use the bootstrap method to take one group from the non-tumor (NON) class and one from the astrocytomas (AS) class to assess the power. For each group, the number of samples is 5. The FPR and TPR are calculated based on 1000 simulations. The ROC curves and AUC values are then provided in Figure 4 and Table 4, respectively.
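The bootstrap scheme just described can be sketched as follows; `stat_fn` and `threshold` are placeholders for any of the compared test statistics and its critical value, and the function name is an illustration rather than code from the paper.

```python
import numpy as np

def bootstrap_fpr_tpr(group_a, group_b, stat_fn, threshold,
                      n_boot=1000, subsample=5, seed=0):
    """Bootstrap estimates of FPR and TPR for a two-sample test.

    FPR: repeatedly draw two subsamples of size `subsample` from the
    same group (both null) and count rejections.
    TPR: draw one subsample from each group (alternative) and count
    rejections.  Groups are (p, n) arrays with genes in rows.
    """
    rng = np.random.default_rng(seed)

    def draw(g):
        cols = rng.choice(g.shape[1], size=subsample, replace=True)
        return g[:, cols]

    fpr = np.mean([stat_fn(draw(group_a), draw(group_a)) > threshold
                   for _ in range(n_boot)])
    tpr = np.mean([stat_fn(draw(group_a), draw(group_b)) > threshold
                   for _ in range(n_boot)])
    return fpr, tpr
```

Sweeping `threshold` over a grid and collecting the resulting (FPR, TPR) pairs yields the ROC curves reported for the glioma data.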
Figure 4 ROC curves for Glioma data set when and
Table 4 AUC values for and
Data set | SDnor | SDchi | RHT | CHEN | SRI | PCT |
Glioma | 0.0892 | 0.0902 | 0.0648 | 0.0016 | 0.0852 | 0.0861 |
Similar to Figure 3, the ROC curves in Figure 4 are generated with FPR values ranging from 0 to 0.1, and the AUC values are computed over the same range. In Figure 4 and Table 4, our proposed estimators give the highest ROC curves and the largest AUC values. That is, the diagonal Hotelling's tests using the new estimator perform better than the unscaled and regularized Hotelling's tests when the sample size is small. This illustrates the advantage of the shrinkage-based diagonal Hotelling's tests over the other test statistics in the case of a small sample size and small copy number changes at multiple probes.
5 Conclusion
In this paper, we consider the problem of variance estimation for high-dimensional small-sample populations, because the sample variance is an unreliable estimator with limited observations. Optimal geometric shrinkage estimators for a power of the variance are proposed under a loss function different from previous ones. The estimator for a power of the variance not only has an explicit expression but is also concise. In terms of the percentage relative improvement in average loss (PRIAL), it performs better than the two estimators
and
proposed by Tong and Wang
[23]. It can then be applied in diagonal Hotelling's tests for gene expression data; in addition, it has an advantage over the unscaled and regularized Hotelling's tests when the sample size is small. Nevertheless, real data may come from heavy-tailed or even discrete distributions, for example, RNA-seq data, which are obtained by next-generation sequencing technologies, have better coverage than microarray data, and have already been applied in medical science. The multivariate normal distribution is then not an ideal choice for characterizing such data, and new methods need to be developed for this situation.
Acknowledgements
The authors gratefully acknowledge the Editor and two anonymous referees for their insightful comments and helpful suggestions that led to a marked improvement of the article.
Appendix
Proof Assume that , , . For any , , we calculate the optimal shrinkage parameter as follows. First, note that
and
where is the digamma function, and is the trigamma function.
Second, when , we show that
Write
It is very easy to see that
By (18) we have
Now, we show that
A similar argument yields
Thus, (25) holds. By (22)–(25) we arrive at (20). According to (23),
By (26),
Plugging (28) into (27), we have
Combining (21) and (29), we arrive at (8). It is well known that is the maximum likelihood estimator of and is the maximum likelihood estimator of by the invariance ("hereditary") property of maximum likelihood estimation. It follows from a well-known fact that a maximum likelihood estimator is likelihood-unbiased. Hence, (11) is the likelihood-unbiased estimator of (8).