1 Introduction
We consider the following two-block separable convex optimization problem with linear equality constraints:
$$\min_{x \in \mathcal{X},\, y \in \mathcal{Y}}\ \theta_1(x) + \theta_2(y) \quad \text{s.t.}\quad Ax + By = b, \qquad (1)$$
where $\mathcal{X} \subseteq \mathbb{R}^{n_1}$ and $\mathcal{Y} \subseteq \mathbb{R}^{n_2}$ are closed convex sets, $A \in \mathbb{R}^{m \times n_1}$, $B \in \mathbb{R}^{m \times n_2}$, $b \in \mathbb{R}^{m}$, and $\theta_1$ and $\theta_2$ are convex functions (not necessarily smooth), while $\theta_1$ has its specific structure. In particular, we assume that there is a stochastic first-order oracle (SFO) for $\theta_1$, which returns an unbiased and bounded stochastic gradient $G(x, \xi)$ at $x$, where $\xi$ is a random variable whose distribution is supported on a set $\Xi$. Such an assumption is common in stochastic programming (SP), see, e.g., [1-3] and the references therein. In SP, the objective function is often given in the form of an expectation, i.e., $\theta_1(x) = \mathbb{E}_{\xi}[F(x, \xi)]$ for some function $F$ and random variable $\xi$, including the finite sum as a special case. In both cases (when, for the latter, the number of terms in the summation is large), obtaining the full function value or gradient information is impractical. Motivated by this, we need to design stochastic approximation [4] based algorithms to solve problem (1).
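For instance (an illustrative special case; the concrete sampling scheme is not fixed by the model itself), when $\xi$ is uniformly distributed over a finite index set $\{1, \ldots, n\}$, the expectation reduces to the finite sum
$$\theta_1(x) = \mathbb{E}_{\xi}\bigl[F(x, \xi)\bigr] = \frac{1}{n}\sum_{i=1}^{n} F(x, i),$$
and one call of the SFO may simply return $G(x, \xi) = \nabla_x F(x, i)$ for an index $i$ drawn uniformly at random, which is an unbiased estimator of $\nabla\theta_1(x)$ by construction.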
Problem (1) itself, as a linearly constrained convex optimization problem, is rich enough to model many optimization problems arising from various application fields, such as machine learning, image processing, and signal processing. In these fields, a typical scenario is that one of the functions represents a data-fidelity term and the other is a regularization term, see, e.g., [5] and the references therein. Without considering the specific structure, i.e., when the SFO assumption on $\theta_1$ is not needed in the model, a classical method for solving problem (1) is the alternating direction method of multipliers (ADMM). ADMM was originally proposed by Glowinski and Marrocco [6] and Gabay and Mercier [7]; it can be viewed as a Gauss-Seidel implementation of the augmented Lagrangian method [8] or as an application of the Douglas-Rachford splitting method to the dual problem of (1) [9]. For both convex and nonconvex problems, there are extensive studies on the theoretical properties of ADMM. In particular, for convex optimization problems, theoretical results on the convergence behavior are abundant, covering global convergence, sublinear convergence rates, and linear convergence rates, see, e.g., [9-15]. Recently, ADMM has also been studied for nonconvex models satisfying the well-known Kurdyka-Lojasiewicz (KL) inequality or similar properties, see, e.g., [16-19]. For a thorough understanding of recent developments of ADMM, one can refer to the survey [20].
However, as mentioned above, the gradient information of $\theta_1$ in (1) must be obtained through the SFO due to computational or other limitations, and hence the aforementioned ADMM does not directly apply. To tackle this problem, several stochastic ADMM-type algorithms have been proposed recently, see, e.g., [21-24]. Note that these works only consider the basic iterative scheme of ADMM. It is well known that updating the dual variable symmetrically, in a more flexible way, often improves algorithmic performance; this is the idea of the symmetric ADMM (or the Peaceman-Rachford splitting method applied to the dual of problem (1)), see, e.g., [25-28]. In this paper, we study symmetric ADMM in the stochastic setting. In particular, we propose a symmetric stochastic linearized ADMM (SSL-ADMM) for solving the two-block separable stochastic optimization problem (1) and analyze its worst-case convergence rate by means of the variational inequality framework. Moreover, we establish large-deviation properties of SSL-ADMM under certain light-tail assumptions. Numerical experiments on the graph-guided fused lasso problem demonstrate its promising performance compared with non-symmetric ADMM.
The rest of this paper is organized as follows. We introduce some fundamental preliminaries in Section 2. Convergence properties of the proposed algorithm are analyzed in Section 3. The high probability guarantees for objective error and constraint violation of the proposed algorithm are investigated in Section 4. In Section 5, numerical results are presented to indicate the promising efficiency of symmetrically updating dual variables in the stochastic setting. Finally, a summary is made in Section 6.
Notations  For two matrices $A$ and $B$, the ordering relation $A \succ B$ ($A \succeq B$) means that $A - B$ is positive definite (positive semidefinite). $I$ denotes the identity matrix. For a vector $x$, $\|x\|$ denotes its Euclidean norm; for a matrix $A$, $\|A\|$ denotes its spectral norm. For a symmetric matrix $G$, define $\|x\|_G^2 = x^{\top} G x$, and $\|x\|_G = \sqrt{x^{\top} G x}$ if $G \succeq 0$. $\mathbb{E}[\cdot]$ denotes the mathematical expectation of a random variable, and $\operatorname{Prob}(\cdot)$ denotes the probability of an event. $\partial$ and $\nabla$ denote the subdifferential and the gradient operator of a function, respectively. We also sometimes use $u$ and $w$ to denote the vectors $(x, y)$ and $(x, y, \lambda)$, respectively.
2 Preliminaries
In this section, we summarize some preliminaries that will be used in the later analysis. Let the Lagrangian function of problem (1) be
$$L(x, y, \lambda) = \theta_1(x) + \theta_2(y) - \lambda^{\top}(Ax + By - b),$$
defined on $\mathcal{X} \times \mathcal{Y} \times \mathbb{R}^{m}$. We call $(x^*, y^*, \lambda^*)$ a saddle point of $L$ if the following inequalities are satisfied:
$$L(x^*, y^*, \lambda) \le L(x^*, y^*, \lambda^*) \le L(x, y, \lambda^*), \quad \forall\, (x, y, \lambda) \in \mathcal{X} \times \mathcal{Y} \times \mathbb{R}^{m}.$$
Obviously, a saddle point $(x^*, y^*, \lambda^*)$ can be characterized by the following inequalities:
$$\begin{cases} \theta_1(x) - \theta_1(x^*) + (x - x^*)^{\top}(-A^{\top}\lambda^*) \ge 0, & \forall\, x \in \mathcal{X},\\ \theta_2(y) - \theta_2(y^*) + (y - y^*)^{\top}(-B^{\top}\lambda^*) \ge 0, & \forall\, y \in \mathcal{Y},\\ (\lambda - \lambda^*)^{\top}(Ax^* + By^* - b) \ge 0, & \forall\, \lambda \in \mathbb{R}^{m}. \end{cases}$$
Below we invoke two propositions: one characterizes the optimality condition of an optimization model by a variational inequality, and the other gives a large-deviation result for martingale-difference sequences.
Proposition 1  Let $\mathcal{Z}$ be a closed convex set and let $\theta(z)$ and $f(z)$ be convex functions, where $f$ is in addition differentiable. Assume that the solution set of the minimization problem $\min\{\theta(z) + f(z) \mid z \in \mathcal{Z}\}$ is nonempty. Then we have the assertion that
$$z^* \in \arg\min\{\theta(z) + f(z) \mid z \in \mathcal{Z}\}$$
if and only if
$$z^* \in \mathcal{Z}, \quad \theta(z) - \theta(z^*) + (z - z^*)^{\top} \nabla f(z^*) \ge 0, \quad \forall\, z \in \mathcal{Z}.$$
Proof  The proof can be found in [31].
Proposition 2  Let $\xi_1, \xi_2, \ldots$ be a sequence of independent identically distributed random variables, and let $\zeta_t = \zeta_t(\xi_{[t]})$ be deterministic Borel functions of $\xi_{[t]} = (\xi_1, \ldots, \xi_t)$ such that $\mathbb{E}[\zeta_t \mid \xi_{[t-1]}] = 0$ almost surely and $\mathbb{E}[\exp(\zeta_t^2/\sigma_t^2) \mid \xi_{[t-1]}] \le \exp(1)$ almost surely, where the $\sigma_t > 0$ are deterministic and $\mathbb{E}[\,\cdot \mid \xi_{[t-1]}]$ denotes the expectation conditional on $\xi_{[t-1]}$. Then, for any $\Omega \ge 0$,
$$\operatorname{Prob}\Bigl\{\sum_{t=1}^{N} \zeta_t > \Omega \sqrt{\textstyle\sum_{t=1}^{N} \sigma_t^2}\,\Bigr\} \le \exp(-\Omega^2/3).$$
Proof  The proof can be found in Lemma 4.1 on pages 116-117 of [32].
Hence, using Proposition 1 and assuming that the solution set of problem (1) is nonempty, solving (1) is equivalent to solving the following variational inequality problem: find $w^* = (x^*, y^*, \lambda^*) \in \mathcal{W} := \mathcal{X} \times \mathcal{Y} \times \mathbb{R}^{m}$ such that
$$\theta(u) - \theta(u^*) + (w - w^*)^{\top} F(w^*) \ge 0, \quad \forall\, w \in \mathcal{W},$$
where $u = (x, y)$, $\theta(u) = \theta_1(x) + \theta_2(y)$, and
$$F(w) = \begin{pmatrix} -A^{\top}\lambda \\ -B^{\top}\lambda \\ Ax + By - b \end{pmatrix}.$$
Variables with a superscript or subscript, such as $w^k = (x^k, y^k, \lambda^k)$, are denoted similarly. In addition, we define two auxiliary sequences for the convergence analysis. More specifically, for the sequence generated by SSL-ADMM in Section 3, let
Throughout the paper, we need the following assumptions:
Assumption
(ⅰ) The primal-dual solution set of problem (1) is nonempty.
(ⅱ) $\theta_1$ is differentiable, and its gradient satisfies the $L$-Lipschitz condition
$$\|\nabla\theta_1(x) - \nabla\theta_1(y)\| \le L\|x - y\|$$
for all $x, y$.
(ⅲ) For all $x \in \mathcal{X}$, a) $\mathbb{E}[G(x, \xi)] = \nabla\theta_1(x)$; b) $\mathbb{E}\bigl[\|G(x, \xi) - \nabla\theta_1(x)\|^2\bigr] \le \sigma^2$, where $\sigma > 0$ is some constant.
Under the second assumption, it holds that for all $x, y$,
$$\theta_1(y) \le \theta_1(x) + \nabla\theta_1(x)^{\top}(y - x) + \frac{L}{2}\|y - x\|^2.$$
A direct result of combining this property with convexity is shown in the following lemma.
Lemma 1  Suppose the function $\theta_1$ is convex and differentiable, and its gradient is $L$-Lipschitz continuous. Then for any $x, y, z$ we have
$$\theta_1(y) - \theta_1(z) \le \nabla\theta_1(x)^{\top}(y - z) + \frac{L}{2}\|y - x\|^2.$$
In addition, if $\theta_1$ is $\mu$-strongly convex, then for any $x, y, z$ we have
$$\theta_1(y) - \theta_1(z) \le \nabla\theta_1(x)^{\top}(y - z) + \frac{L}{2}\|y - x\|^2 - \frac{\mu}{2}\|x - z\|^2.$$
Proof  Since the gradient of $\theta_1$ is $L$-Lipschitz continuous, for any $x, y$ we have $\theta_1(y) \le \theta_1(x) + \nabla\theta_1(x)^{\top}(y - x) + \frac{L}{2}\|y - x\|^2$. Also, due to the convexity of $\theta_1$, we have for any $z$, $\theta_1(x) \le \theta_1(z) + \nabla\theta_1(x)^{\top}(x - z)$. Adding the above two inequalities, we get the first conclusion. If $\theta_1$ is $\mu$-strongly convex, then for any $z$, $\theta_1(x) \le \theta_1(z) + \nabla\theta_1(x)^{\top}(x - z) - \frac{\mu}{2}\|x - z\|^2$. Combining this inequality with the descent inequality above completes the proof.
3 Symmetric Stochastic Linearized ADMM
In this section, we present and analyze the iterative scheme of the proposed symmetric stochastic linearized ADMM, named SSL-ADMM.
Algorithm 1 is called SSL-ADMM for short. We give some remarks on this algorithm. SSL-ADMM is an ADMM-type algorithm, which alternates through one $x$-subproblem, an update of the multiplier, one $y$-subproblem, and another update of the multiplier. The algorithm is symmetric since the dual variable is updated twice, symmetrically, at each iteration. The algorithm is stochastic since at each iteration the SFO is called to obtain a stochastic gradient $G(x^k, \xi_k)$, which is an unbiased estimator of $\nabla\theta_1(x^k)$, the gradient of $\theta_1$ at $x^k$, and is bounded relative to $\nabla\theta_1(x^k)$ in expectation. The algorithm is linearized for the following two reasons: (i) the term $\langle G(x^k, \xi_k), x\rangle$ in the $x$-subproblem of SSL-ADMM is a stochastic version of the linearization of $\theta_1$; (ii) the $x$-subproblem and the $y$-subproblem are equipped with proximal terms $\frac{1}{2}\|x - x^k\|_{G_k^1}^2$ and $\frac{1}{2}\|y - y^k\|_{G_k^2}^2$, respectively, where $\{G_k^1\}$ and $\{G_k^2\}$ are two sequences of symmetric and positive definite matrices that can change with the iteration; with the choice $G_k^1 = \tau_k I - \beta A^{\top}A$, the quadratic term in the $x$-subproblem is linearized, and the same applies to the $y$-subproblem. Furthermore, for particular choices of the dual step sizes and the proximal matrices, SSL-ADMM reduces to the algorithm that appeared in earlier literature [21, 24]. Finally, the convergence region $\mathcal{D}$ of the dual step-size pair $(r, s)$ is the same as that in [27]; in particular, if $r = 0$, then $s \in (0, (1 + \sqrt{5})/2)$. Recently, Bai et al. [29, 30] studied stochastic ADMM algorithms for finite-sum optimization problems. The difference between our paper and theirs is that our goal is to consider the more general stochastic optimization problem (1), in which the random variable $\xi$ is not necessarily discrete. For example, in SP, while it is possible to approximate the function $\theta_1$ by the sample average approximation technique, the extra sample approximation error should then also be taken into account.
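To illustrate the effect of the linearization (a schematic instantiation with the notation introduced above; the precise subproblem format in Algorithm 1 may differ in details), suppose the $x$-subproblem is the proximal augmented-Lagrangian step in $x$ with $\theta_1$ replaced by the stochastic linear model $\langle G(x^k, \xi_k), x\rangle$. With $G_k^1 = \tau_k I - \beta A^{\top}A$ and $\tau_k I \succ \beta A^{\top}A$, all terms quadratic in $x$ cancel except $\frac{\tau_k}{2}\|x - x^k\|^2$, and the update collapses to a projected stochastic gradient step:
$$x^{k+1} = \mathcal{P}_{\mathcal{X}}\Bigl(x^k - \tfrac{1}{\tau_k}\bigl[G(x^k, \xi_k) - A^{\top}\lambda^k + \beta A^{\top}(Ax^k + By^k - b)\bigr]\Bigr),$$
where $\mathcal{P}_{\mathcal{X}}$ denotes the Euclidean projection onto $\mathcal{X}$. This is why such choices are said to "linearize" the quadratic term.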
Algorithm 1: Symmetric Stochastic Linearized ADMM (SSL-ADMM)
Initialize $(x^0, y^0, \lambda^0)$, the penalty parameter $\beta > 0$, the dual step sizes $(r, s) \in \mathcal{D}$, and two sequences of symmetric and positive semidefinite matrices $\{G_k^1\}$ and $\{G_k^2\}$.
for $k = 0, 1, 2, \ldots$
    Call the SFO to obtain a stochastic gradient $G(x^k, \xi_k)$;
    solve the $x$-subproblem (the augmented-Lagrangian step in $x$, with $\theta_1$ replaced by its stochastic linearization and with the proximal term $\frac{1}{2}\|x - x^k\|_{G_k^1}^2$) to obtain $x^{k+1}$;
    update the multiplier with dual step size $r$ to obtain $\lambda^{k+\frac{1}{2}}$;
    solve the $y$-subproblem (the augmented-Lagrangian step in $y$, with the proximal term $\frac{1}{2}\|y - y^k\|_{G_k^2}^2$) to obtain $y^{k+1}$;
    update the multiplier again with dual step size $s$ to obtain $\lambda^{k+1}$.
end
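To make the scheme concrete, below is a minimal Python sketch of one natural instantiation of SSL-ADMM, specialized to the setting of Section 5 where $\mathcal{X} = \mathbb{R}^{n}$, $\theta_2(y) = \nu\|y\|_1$, $B = -I$, and $b = 0$, and with a fully linearized $x$-step. The concrete subproblem forms, the constant step size `eta`, and the oracle interface `sfo` are illustrative assumptions rather than the exact updates analyzed below.

```python
import numpy as np

def ssl_admm(sfo, A, nu, beta=1.0, r=0.5, s=0.5, eta=0.1, n_iters=1000, rng=None):
    """Illustrative SSL-ADMM sketch for min_x E[F(x, xi)] + nu * ||A x||_1,
    rewritten as min theta1(x) + theta2(y) s.t. A x - y = 0 (B = -I, b = 0).

    sfo(x, rng) must return an unbiased stochastic gradient of theta1 at x."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    x, y, lam = np.zeros(n), np.zeros(m), np.zeros(m)
    x_avg, y_avg = np.zeros(n), np.zeros(m)
    soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)  # prox of t*||.||_1

    for k in range(n_iters):
        g = sfo(x, rng)                                   # stochastic gradient G(x^k, xi_k)
        # x-step: stochastic linearization of theta1 plus linearized quadratic term
        x = x - eta * (g - A.T @ lam + beta * A.T @ (A @ x - y))
        # first (symmetric) multiplier update, dual step size r
        lam = lam - r * beta * (A @ x - y)
        # y-step: exact proximal mapping of nu*||.||_1
        y = soft(A @ x - lam / beta, nu / beta)
        # second multiplier update, dual step size s
        lam = lam - s * beta * (A @ x - y)
        x_avg += x
        y_avg += y
    return x_avg / n_iters, y_avg / n_iters               # ergodic averages
```

Here `eta` plays the role of the proximal parameter induced by $G_k^1$; in the analysis it is either a pre-selected constant (general convex case) or diminishing with $k$ (strongly convex case).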
We now start to establish the convergence of SSL-ADMM. The next several lemmas serve to obtain an upper bound on the quantity appearing in the variational inequality characterization above; with such a bound, it is possible to estimate the worst-case convergence rate of SSL-ADMM.
Lemma 2 Let the sequence be generated by the SSL-ADMM and the associated be defined in (2). Then we have
where , similarly hereinafter, and
Proof  By Proposition 1, the optimality condition of the $x$-subproblem in SSL-ADMM is
Using the notation in (2), the above inequality can be rewritten as
And then using Lemma 1, we have
Similarly, the optimality condition of the $y$-subproblem in SSL-ADMM is
Using the notation of , . Hence
Substituting this equality into (6), we obtain
According to the definition of , we have
and it can be written as
Combining (5), (7), and (8), and using the notation of (4), the proof is completed.
Lemma 3 Let the sequence be generated by the SSL-ADMM and the associated be defined in (2). Then we have
where
Proof
Together with and , we prove the assertion of this lemma.
Noting that for the matrices defined in (4) and defined in (10), there is a matrix such that , where
It is easy to check for any , is positive semidefinite when the matrix is full column rank. In fact, to make positive semidefinite alone, it is sufficient for when other conditions are satisfied.
Lemma 4 Let the sequence be generated by the SSL-ADMM and the associated be defined in (2). Then we have
where .
Proof Using and , we have . Applying the identity , we obtain
The remaining task is to simplify the last two terms.
The proof is completed.
From this lemma, the quantity of interest can be written as the sum of two terms, one of which is suitable for a recursive operation and the other is a quadratic term, but the matrix involved is not necessarily positive semidefinite. Thus we need to analyze this quadratic term in detail. Since
As the following lemma shows, the last two terms of the above equality can be further analyzed.
Lemma 5 Assume that . Let the sequence be generated by the SSL-ADMM and the associated be defined in (2). Then we have
Proof It follows from the optimality condition of -subproblem for -th iteration that
Similarly, it follows from the optimality condition of -subproblem for -th iteration that
Adding these two inequalities and using
we obtain
This implies
On the other hand, we have
and
Combining (14), (15), and (16), we get the assertion of this lemma.
According to this lemma and using the Cauchy-Schwarz inequality, the term can be bounded as follows:
Now, combining (17), Lemma 2, and Lemma 4, we obtain the following main theorem. In this theorem, we take $G_k^1$ of the form $\tau_k I - \beta A^{\top}A$, which simplifies the system of linear equations in the $x$-subproblem. Of course, $G_k^2$ can take a similar form; in particular, with such a choice the $y$-subproblem reduces to the proximal mapping of $\theta_2$.
Theorem 1 Assume that . Let the sequence be generated by the SSL-ADMM and the associated be defined in (2), and
for some pre-selected integer . Choosing , where is a constant satisfying the ordering relation , then we have
Proof It is sufficient for using convexity of , , and .
Corollary 1 Assume that all the conditions in Theorem 1 hold, then SSL-ADMM has the following properties
where is a unit vector satisfying and the expectation is taken conditional on .
Proof  Let in (18), where , then the left-hand side of (18) is , which follows from
where the first equality follows from the definition of , and the second and last equalities hold due to and the choice of . On the other hand, substituting into the variational inequality associated with (1), we get . Hence, the left-hand side of (18) is greater than or equal to when letting , and (19) is obtained by taking expectation. Substituting into the variational inequality associated with (1), we can also get
i.e.,
Then by taking expectation, (20) is obtained.
Remark 1 (ⅰ) In Theorem 1 and Corollary 1, the proximal parameters are constant, and the iteration number $N$ needs to be selected in advance. In fact, the parameters can also vary with the number of iterations, e.g., diminishing with $k$. In this case, if the distance between and is bounded, i.e., for any , we can also obtain a worst-case convergence rate. The difference from the proofs of Theorem 1 and Corollary 1 lies in bounding the term , which is now bounded as follows.
(ⅱ) Corollary 1 reveals that the worst-case convergence rate of SSL-ADMM for solving general convex problems is $O(1/\sqrt{N})$, where $N$ is the iteration number.
At the end of this section, we assume that $\theta_1$ is $\mu$-strongly convex, i.e., $\theta_1(y) \ge \theta_1(x) + \nabla\theta_1(x)^{\top}(y - x) + \frac{\mu}{2}\|y - x\|^2$ for all $x, y$. With strong convexity, we can show not only that the objective function value gap and the constraint violation converge to zero in expectation, but also that the ergodic iterates of SSL-ADMM converge.
Theorem 2 Assume that . Let the sequence be generated by the SSL-ADMM and the associated be defined in (2), and
Choosing , where is a constant satisfying the ordering relation , then SSL-ADMM has the following properties
where is a unit vector satisfying ,
and the expectation is taken conditional on .
Proof  First, similarly to the proof of Lemma 2, using the $\mu$-strong convexity of $\theta_1$, we conclude that for any
Then using , Lemma 4, and (17), we get
Summing the above inequalities from to and dividing by , and then following the proof of Corollary 1, we prove the assertion of this theorem.
This theorem implies that, under the assumption that $\theta_1$ is strongly convex, the worst-case convergence rate of SSL-ADMM can be improved beyond $O(1/\sqrt{N})$ with the choice of diminishing step sizes. The following theorem shows the convergence of the ergodic iterates of SSL-ADMM, which is not covered in some earlier literature [21, 24]. Furthermore, if $\theta_2$ is also strongly convex, the assumption that $B$ has full column rank can be removed.
Theorem 3 Assume that . Let the sequence be generated by the SSL-ADMM, the associated be defined in (2), and
Choosing , where is a constant satisfying the ordering relation , and assuming that $B$ has full column rank with $\lambda_{\min}(B^{\top}B)$ denoting the minimum eigenvalue of $B^{\top}B$, we have
where the bounds for and are the same as in (21) and (22) respectively, and the expectation is taken conditional on .
Proof Since is a solution of (1), we have Hence, since is strongly convex and is convex, we have
and
Adding up (24) and (25), we get that is
On the other hand,
this implies and hence
Adding (26) and (27), using Jensen's inequality for a random variable , and taking expectation imply
The proof is completed.
4 High Probability Performance Analysis
In this section, we shall establish the large deviation properties of SSL-ADMM. By (19) and (20), and Markov's inequality, we have for any and that
and
However, these bounds are rather weak. In the following, we show that these high-probability bounds can be significantly improved by imposing a standard "light-tail" assumption, see, e.g., [1, 32]. Specifically, assume that for any $x \in \mathcal{X}$,
$$\mathbb{E}\bigl[\exp\bigl(\|G(x, \xi) - \nabla\theta_1(x)\|^2/\sigma^2\bigr)\bigr] \le \exp(1).$$
This assumption is slightly stronger than b) in Assumption (ⅲ), which can be explained by Jensen's inequality. For the further analysis, we also assume that $\mathcal{X}$ is bounded and denote its diameter by $D_{\mathcal{X}} := \sup_{x_1, x_2 \in \mathcal{X}} \|x_1 - x_2\|$. The following theorem gives high-probability bounds for the objective error and the constraint violation of SSL-ADMM.
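To see why the light-tail condition implies b) of Assumption (ⅲ) (a one-line check, stated for the variance bound as written above), apply Jensen's inequality to the convex function $\exp(\cdot)$:
$$\exp\Bigl(\mathbb{E}\bigl[\|G(x, \xi) - \nabla\theta_1(x)\|^2/\sigma^2\bigr]\Bigr) \le \mathbb{E}\Bigl[\exp\bigl(\|G(x, \xi) - \nabla\theta_1(x)\|^2/\sigma^2\bigr)\Bigr] \le \exp(1),$$
and take logarithms to obtain $\mathbb{E}\bigl[\|G(x, \xi) - \nabla\theta_1(x)\|^2\bigr] \le \sigma^2$.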
Theorem 4  Assume that all the conditions in Theorem 1 hold. Then SSL-ADMM has the following properties
where is a unit vector satisfying .
Proof  Let . Clearly, is a martingale-difference sequence. Moreover, it follows from the definition of and the light-tail assumption that
Now using Proposition 2 for the martingale-difference sequence, we have for any
Also, observe that by Jensen's inequality for the exponential function
whence, taking expectation,
It then follows from Markov's inequality that for any
Using (32) and (33) in (18) for , we conclude that
and
The result immediately follows from the above inequalities.
Remark 2 In view of Theorem 4, if we take , then we have
and
For the strongly convex case, using a similar derivation, the high-probability bound for the objective error and constraint violation of SSL-ADMM is
and
Observe that the convergence rate of the ergodic iterates of SSL-ADMM was obtained in Theorem 3; the corresponding high-probability bound can also be established, as follows:
where $N$ is the iteration number. In contrast with (28) and (29), the results in Theorem 4 are much finer.
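To make "much finer" concrete in schematic form (the constants here are generic placeholders, not those of the displays above): a Markov-type bound of the form $\operatorname{Prob}\{\text{error} > \varepsilon\} \le C/(\varepsilon\sqrt{N})$ yields confidence $1 - \delta$ only for accuracies $\varepsilon \ge C/(\delta\sqrt{N})$, so the guaranteed accuracy degrades linearly in $1/\delta$; by contrast, a light-tail bound whose failure probability decays like $\exp(-\Omega^2/3) + \exp(-\Omega)$ reaches confidence $1 - \delta$ already for $\Omega = O(\log(1/\delta))$, so the accuracy degrades only logarithmically in $1/\delta$.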
5 Preliminary Numerical Experiments
In this section, we report some numerical results on the following graph-guided fused lasso problem from statistical machine learning:
$$\min_{x}\ \frac{1}{n}\sum_{i=1}^{n} \log\bigl(1 + \exp(-l_i a_i^{\top} x)\bigr) + \nu \|A x\|_1,$$
where the first term is the logistic loss on the feature-label pairs $\{(a_i, l_i)\}_{i=1}^{n}$, $\nu$ is a given regularization parameter, and the matrix $A$ encodes the graph structure, which is obtained by sparse inverse covariance estimation [33]. By introducing another block variable $y$ and imposing the constraint $Ax - y = 0$, this problem is reformulated into the form of (1) with $\theta_1(x)$ the logistic loss, $\theta_2(y) = \nu\|y\|_1$, $B = -I$, and $b = 0$.
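As an illustration of the SFO used in this experiment (a minimal sketch; the batch size and sampling scheme of the actual experiments are not specified above), a mini-batch stochastic gradient of the logistic loss can be implemented as follows and plugged into the `ssl_admm` sketch from Section 3:

```python
import numpy as np

def make_logistic_sfo(features, labels, batch_size=32):
    """Return an SFO for theta1(x) = (1/n) * sum_i log(1 + exp(-l_i * a_i^T x)).

    features: (n, d) array of a_i; labels: (n,) array of l_i in {-1, +1}.
    The oracle samples a uniform mini-batch, so it returns an unbiased
    estimate of the full gradient.
    """
    n, d = features.shape

    def sfo(x, rng):
        idx = rng.integers(0, n, size=batch_size)     # uniform sampling with replacement
        a, l = features[idx], labels[idx]
        margins = l * (a @ x)
        coeff = -l / (1.0 + np.exp(margins))          # d/dm log(1+exp(-m)) = -1/(1+exp(m))
        return (a.T @ coeff) / batch_size

    return sfo
```

Here `features` and `labels` would hold one of the datasets in Table 1 (the variable names are hypothetical), and the graph matrix $A$ would come from the sparse inverse covariance estimation step.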
The datasets used in the numerical experiments are taken from the LIBSVM website and are summarized in Table 1. In our experiments, the regularization parameter $\nu$ and the penalty parameter $\beta$ are set to fixed values, the initial points are set to be uniformly random vectors, and the other parameters are chosen according to the conditions in the corollaries. We plot Opt_err, the maximum of the objective function value error and the constraint violation, versus the CPU time in seconds, where the approximate optimal objective function value is obtained by running a convergent ADMM-type algorithm for more than 10 minutes.
Figure 1 shows the performance of SSL-ADMM and of a generalized version of the algorithm in [21] (the algorithm in [21] is essentially a stochastic linearized ADMM; we therefore call its generalized version SLG-ADMM, and it is more efficient than the original version), which updates the dual variable only once at each iteration. The results indicate that the improvement obtained by symmetrically updating the dual variable also occurs in the stochastic setting. Since the purpose of this paper is to demonstrate that symmetrically updating the multipliers remains effective for stochastic optimization problems, we only compare SSL-ADMM with the algorithm in [21] (which itself was compared with other algorithms for stochastic optimization), and not with algorithms designed for deterministic finite-sum optimization problems, such as those in [29, 30].
Table 1 Real-world datasets and regularization parameters
Dataset | Number of samples | Dimensionality
a8a | 22696 | 123
a9a | 32561 | 123
ijcnn1 | 49990 | 22
w8a | 49749 | 300
Figure 1 Opt_err (vertical axis) vs the CPU time (s) (horizontal axis) for a8a, a9a, ijcnn1, and w8a, respectively
6 Summary
In this paper, we have analyzed the expected convergence rates and the large-deviation properties of a stochastic variant of the symmetric ADMM using the variational inequality framework, which keeps the proofs clear and compact. Numerical experiments on several real-world datasets demonstrate that symmetrically updating the dual variable can lead to an algorithmic improvement in the stochastic setting. When the model is deterministic and the SFO is not needed, the proposed algorithm reduces to a symmetric proximal ADMM, and the convergence region of the dual step-size pair $(r, s)$ is the same as that in the corresponding literature.