A Symmetric Linearized Alternating Direction Method of Multipliers for a Class of Stochastic Optimization Problems

Jia HU, Qimin HU

Journal of Systems Science and Information, 2023, Vol. 11, Issue 1: 58-77. DOI: 10.21078/JSSI-2023-058-20

Abstract

The alternating direction method of multipliers (ADMM) has received much attention in recent years due to various demands from machine learning and big data related optimization. In 2013, Ouyang et al. extended the ADMM to the stochastic setting for solving some stochastic optimization problems, inspired by the structural risk minimization principle. In this paper, we consider a stochastic variant of symmetric ADMM, named symmetric stochastic linearized ADMM (SSL-ADMM). In particular, using the framework of variational inequality, we analyze the convergence properties of SSL-ADMM. Moreover, we show that, with high probability, SSL-ADMM has an $O(\ln N\cdot N^{-1/2})$ constraint violation bound and objective error bound for convex problems, and an $O((\ln N)^2\cdot N^{-1})$ constraint violation bound and objective error bound for strongly convex problems, where $N$ is the iteration number. Symmetric ADMM can improve the algorithmic performance compared to classical ADMM, and numerical experiments for statistical machine learning show that such an improvement is also present in the stochastic setting.

Key words

alternating direction method of multipliers / stochastic approximation / expected convergence rate and high probability bound / convex optimization / machine learning

Cite this article

Jia HU , Qimin HU. A Symmetric Linearized Alternating Direction Method of Multipliers for a Class of Stochastic Optimization Problems. Journal of Systems Science and Information, 2023, 11(1): 58-77 https://doi.org/10.21078/JSSI-2023-058-20

1 Introduction

We consider the following two-block separable convex optimization problem with linear equality constraints:
$$\min\big\{\theta_1(x)+\theta_2(y)\ \big|\ Ax+By=b,\ x\in\mathcal{X},\ y\in\mathcal{Y}\big\},$$
(1)
where $A\in\mathbb{R}^{n\times n_1}$, $B\in\mathbb{R}^{n\times n_2}$, $b\in\mathbb{R}^n$, $\mathcal{X}\subseteq\mathbb{R}^{n_1}$ and $\mathcal{Y}\subseteq\mathbb{R}^{n_2}$ are closed convex sets, and $\theta_1:\mathbb{R}^{n_1}\to\mathbb{R}$ and $\theta_2:\mathbb{R}^{n_2}\to\mathbb{R}$ are convex functions (not necessarily smooth), while $\theta_1$ has a specific structure. In particular, we assume that there is a stochastic first-order oracle (SFO) for $\theta_1$, which returns an unbiased and bounded stochastic gradient $G(x,\xi)$ at $x$, where $\xi$ is a random variable whose distribution is supported on $\Xi\subseteq\mathbb{R}^d$. Such an assumption is common in stochastic programming (SP), see, e.g., [1-3] and the references therein. In SP, the objective function is often in the form of an expectation, i.e., $\theta_1(x)=\int_\Xi\Theta(x,\xi)\,dP(\xi)$ for some $\Theta$, $P$, and $\Xi$, including the finite sum as a special case. In both cases (when the number of terms in the summation is large for the latter case), getting the full function value or gradient information is impractical. Motivated by this, we need to design stochastic approximation[4] based algorithms to solve problem (1).
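To make the SFO assumption concrete, here is a small Python sketch (ours, not from the paper) of such an oracle for a finite-sum logistic instance of $\theta_1$: sampling one data point uniformly at random yields an unbiased estimate of $\nabla\theta_1(x)$, and its variance is bounded whenever the features are bounded.

```python
import numpy as np

def make_logistic_sfo(features, labels):
    """Hypothetical SFO for theta_1(x) = (1/m) * sum_i log(1 + exp(-labels[i] * features[i] @ x)).

    Returning the gradient of one uniformly sampled term gives an unbiased
    estimate of the full gradient of theta_1 at x."""
    m = features.shape[0]

    def sfo(x, rng):
        i = rng.integers(m)                       # xi ~ Uniform{0, ..., m-1}
        l_i, t_i = features[i], labels[i]
        margin = -t_i * (l_i @ x)
        # d/dx log(1 + exp(margin)) = -t_i * l_i * sigmoid(margin)
        return (-t_i * l_i) / (1.0 + np.exp(-margin))

    return sfo

# usage sketch: sfo = make_logistic_sfo(L, t); g = sfo(x, np.random.default_rng(0))
```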
Problem (1) itself, as a linearly constrained convex optimization problem, is rich enough to characterize many optimization problems arising from various application fields, such as machine learning, image processing, and signal processing. In these fields, a typical scenario is that one of the functions represents a data fidelity term and the other a regularization term, see, e.g., [5] and the references therein. Without considering the specific structure, i.e., when the assumption of an SFO is not needed in the model, a classical method for solving problem (1) is the alternating direction method of multipliers (ADMM). ADMM was originally proposed by Glowinski and Marrocco[6], and Gabay and Mercier[7]; it is a Gauss-Seidel implementation of the augmented Lagrangian method[8], or an application of the Douglas-Rachford splitting method to the dual problem of (1)[9]. For both convex and non-convex problems, there are extensive studies on the theoretical properties of ADMM. In particular, for convex optimization problems, theoretical results on convergence behavior are abundant, whether global convergence, sublinear convergence rate, or linear convergence rate, see, e.g., [9-15]. Recently, ADMM has been studied on nonconvex models satisfying the well-known Kurdyka-Lojasiewicz (KL) inequality or other similar properties, see, e.g., [16-19]. For a thorough understanding of some recent developments of ADMM, one can refer to the survey[20].
However, as mentioned before, the gradient information of $\theta_1$ in (1) must be obtained through the SFO due to computational or other limitations, and hence the aforementioned ADMM does not apply. To tackle this problem, some stochastic ADMM-type algorithms have been proposed recently, see, e.g., [21-24]. Note that in these works, only the basic iterative scheme of ADMM was considered. It is well known that updating the dual variable symmetrically, in a more flexible way, often improves the algorithmic performance; this is the idea of symmetric ADMM (or the Peaceman-Rachford splitting method applied to the dual of problem (1)), see, e.g., [25-28]. In this paper, we study symmetric ADMM in the stochastic setting. In particular, we propose a symmetric stochastic linearized ADMM (SSL-ADMM) for solving the two-block separable stochastic optimization problem (1) and analyze the corresponding worst-case convergence rate by means of the framework of variational inequality. Moreover, we establish the large-deviation properties of SSL-ADMM under certain light-tail assumptions. Also, numerical experiments on the graph-guided fused lasso problem demonstrate its promising performance compared to non-symmetric ADMM.
The rest of this paper is organized as follows. We introduce some fundamental preliminaries in Section 2. Convergence properties of the proposed algorithm are analyzed in Section 3. The high probability guarantees for objective error and constraint violation of the proposed algorithm are investigated in Section 4. In Section 5, numerical results are presented to indicate the promising efficiency of symmetrically updating dual variables in the stochastic setting. Finally, a summary is made in Section 6.
Notations For two matrices $A$ and $B$, the ordering relation $A\succ B$ ($A\succeq B$) means $A-B$ is positive definite (semidefinite). $I_m$ denotes the $m\times m$ identity matrix. For a vector $x$, $\|x\|$ denotes its Euclidean norm; for a matrix $X$, $\|X\|$ denotes its spectral norm. For any symmetric matrix $G$, define $\|x\|_G^2:=x^TGx$ and $\|x\|_G:=\sqrt{x^TGx}$ if $G\succeq 0$. $\mathbb{E}[\cdot]$ denotes the mathematical expectation of a random variable. $\Pr\{\cdot\}$ denotes the probability of an event. $\partial$ and $\nabla$ denote the subdifferential and gradient operator of a function, respectively. We also sometimes use $(x,y)$ and $(x,y,\lambda)$ to denote the vectors $(x^T,y^T)^T$ and $(x^T,y^T,\lambda^T)^T$, respectively.

2 Preliminaries

In this section, we summarize some preliminaries that will be used in the later analysis. Let the Lagrangian function of problem (1) be
$$L(x,y,\lambda)=\theta_1(x)+\theta_2(y)-\lambda^T(Ax+By-b),$$
defined on $\mathcal{X}\times\mathcal{Y}\times\mathbb{R}^n$. We call $(x^*,y^*,\lambda^*)\in\mathcal{X}\times\mathcal{Y}\times\mathbb{R}^n$ a saddle point of $L(x,y,\lambda)$ if the following inequalities are satisfied:
$$L(x^*,y^*,\lambda)\le L(x^*,y^*,\lambda^*)\le L(x,y,\lambda^*),\quad\forall\lambda\in\mathbb{R}^n,\ \forall x\in\mathcal{X},\ \forall y\in\mathcal{Y}.$$
Obviously, a saddle point $(x^*,y^*,\lambda^*)$ can be characterized by the following inequalities:
$$\begin{cases}x^*\in\mathcal{X}, & L(x,y^*,\lambda^*)-L(x^*,y^*,\lambda^*)\ge 0,\quad\forall x\in\mathcal{X},\\ y^*\in\mathcal{Y}, & L(x^*,y,\lambda^*)-L(x^*,y^*,\lambda^*)\ge 0,\quad\forall y\in\mathcal{Y},\\ \lambda^*\in\mathbb{R}^n, & L(x^*,y^*,\lambda^*)-L(x^*,y^*,\lambda)\ge 0,\quad\forall\lambda\in\mathbb{R}^n.\end{cases}$$
Below we invoke two propositions, one of which characterizes the optimality condition of an optimization model by a variational inequality and the other gives a result for the martingale-difference sequence.
Proposition 1 Let $\mathcal{X}\subseteq\mathbb{R}^n$ be a closed convex set and let $\theta(x):\mathbb{R}^n\to\mathbb{R}$ and $f(x):\mathbb{R}^n\to\mathbb{R}$ be convex functions. In addition, $f(x)$ is differentiable. Assuming that the solution set of the minimization problem $\min\{\theta(x)+f(x)\mid x\in\mathcal{X}\}$ is nonempty, we have the assertion that
$$x^*=\arg\min\{\theta(x)+f(x)\mid x\in\mathcal{X}\}$$
if and only if
$$x^*\in\mathcal{X},\quad\theta(x)-\theta(x^*)+(x-x^*)^T\nabla f(x^*)\ge 0,\quad\forall x\in\mathcal{X}.$$
Proof The proof can be found in [31].
Proposition 2 Let $\xi_{[t]}:=\{\xi_1,\xi_2,\cdots,\xi_t\}$ be a sequence of independent identically distributed random variables, and let $\zeta_t=\zeta_t(\xi_{[t]})$ be deterministic Borel functions of $\xi_{[t]}$ such that $\mathbb{E}_{|\xi_{[t-1]}}[\zeta_t]=0$ almost surely and $\mathbb{E}_{|\xi_{[t-1]}}[\exp\{\zeta_t^2/\sigma_t^2\}]\le\exp\{1\}$ almost surely, where $\sigma_t>0$ are deterministic and $\mathbb{E}_{|Y}[X]$ denotes the expectation of the random variable $X$ conditional on the random variable $Y$. Then
$$\forall\lambda\ge 0:\quad\Pr\Big\{\sum_{t=1}^N\zeta_t>\lambda\sqrt{\sum_{t=1}^N\sigma_t^2}\Big\}\le\exp\{-\lambda^2/3\}.$$
Proof The proof can be found in Lemma 4.1 on pages 116-117 of [32].
Hence, using Proposition 1 and provided that the solution set of problem (1) is nonempty, solving (1) is equivalent to solving the following variational inequality problem: Find $w^*=(x^*,y^*,\lambda^*)\in\Omega:=\mathcal{X}\times\mathcal{Y}\times\mathbb{R}^n$ such that
$$\theta(u)-\theta(u^*)+(w-w^*)^TF(w^*)\ge 0,\quad\forall w\in\Omega,$$
where
$$u=\begin{pmatrix}x\\y\end{pmatrix},\quad w=\begin{pmatrix}x\\y\\\lambda\end{pmatrix},\quad F(w)=\begin{pmatrix}-A^T\lambda\\-B^T\lambda\\Ax+By-b\end{pmatrix},\quad\text{and}\quad\theta(u)=\theta_1(x)+\theta_2(y).$$
The variables with superscripts or subscripts such as $u^k,w^k,\bar u_k,\bar w_k$ are defined similarly. In addition, we define two auxiliary sequences for the convergence analysis. More specifically, for the sequence $\{w^k\}$ generated by the SSL-ADMM in Section 3, let
$$\tilde w^k=\begin{pmatrix}\tilde x^k\\\tilde y^k\\\tilde\lambda^k\end{pmatrix}=\begin{pmatrix}x^{k+1}\\y^{k+1}\\\lambda^k-\beta(Ax^{k+1}+By^k-b)\end{pmatrix}\quad\text{and}\quad\tilde u^k=\begin{pmatrix}\tilde x^k\\\tilde y^k\end{pmatrix}.$$
(2)
Throughout the paper, we need the following assumptions:
Assumption
(i) The primal-dual solution set $\Omega^*$ of problem (1) is nonempty.
(ii) $\theta_1(x)$ is differentiable, and its gradient satisfies the $L$-Lipschitz condition
$$\|\nabla\theta_1(x_1)-\nabla\theta_1(x_2)\|\le L\|x_1-x_2\|$$
for all $x_1,x_2\in\mathcal{X}$.
(iii)
$$\text{a)}\ \mathbb{E}[G(x,\xi)]=\nabla\theta_1(x)\quad\text{and}\quad\text{b)}\ \mathbb{E}\big[\|G(x,\xi)-\nabla\theta_1(x)\|^2\big]\le\sigma^2,$$
where $\sigma>0$ is some constant.
Under the second assumption, it holds that for all $x,y\in\mathcal{X}$,
$$\theta_1(x)\le\theta_1(y)+(x-y)^T\nabla\theta_1(y)+\frac{L}{2}\|x-y\|^2.$$
A direct result of combining this property with convexity is shown in the following lemma.
Lemma 1 Suppose the function $f$ is convex and differentiable, and its gradient is $L$-Lipschitz continuous; then for any $x,y,z$ we have
$$(x-y)^T\nabla f(z)\le f(x)-f(y)+\frac{L}{2}\|y-z\|^2.$$
In addition, if $f$ is $\mu$-strongly convex, then for any $x,y,z$ we have
$$(x-y)^T\nabla f(z)\le f(x)-f(y)+\frac{L}{2}\|y-z\|^2-\frac{\mu}{2}\|x-z\|^2.$$
Proof Since the gradient of $f$ is $L$-Lipschitz continuous, for any $y,z$ we have $f(y)\le f(z)+(y-z)^T\nabla f(z)+\frac{L}{2}\|y-z\|^2$. Also, due to the convexity of $f$, we have for any $x,z$, $f(x)\ge f(z)+(x-z)^T\nabla f(z)$. Adding the above two inequalities, we get the conclusion. If $f$ is $\mu$-strongly convex, then for any $x,z$, $f(x)\ge f(z)+(x-z)^T\nabla f(z)+\frac{\mu}{2}\|x-z\|^2$. Then combine this inequality with $f(y)\le f(z)+(y-z)^T\nabla f(z)+\frac{L}{2}\|y-z\|^2$, and the proof is completed.

3 Symmetric Stochastic Linearized ADMM

In this section, we present and analyze the iterative scheme of the proposed symmetric stochastic linearized ADMM, named SSL-ADMM.
We give some remarks on Algorithm 1, called SSL-ADMM for short. SSL-ADMM is an ADMM-type algorithm which alternates through one $x$-subproblem, an update of the multipliers, one $y$-subproblem, and an update of the multipliers again. The algorithm is symmetric since the dual variable is updated twice, in a symmetric fashion, at each iteration. The algorithm is stochastic since at each iteration the SFO is called to obtain a stochastic gradient $G(x^k,\xi)$, which is an unbiased estimate of $g(x^k)$, the gradient of $\theta_1(x)$ at $x^k$, and is bounded relative to $g(x^k)$ in expectation. The algorithm is linearized due to the following two aspects: (i) The term $G(x^k,\xi)^T(x-x^k)$ in the $x$-subproblem of SSL-ADMM is a stochastic version of the linearization of $\theta_1$ at $x^k$. (ii) The $x$-subproblem and $y$-subproblem are augmented with the proximal terms $\frac12\|x-x^k\|_{G_{1,k}}^2$ and $\frac12\|y-y^k\|_{G_{2,k}}^2$ respectively, where $\{G_{1,k}\}$ and $\{G_{2,k}\}$ are two sequences of symmetric and positive definite matrices that can change with the iteration; with the choice $G_{2,k}\equiv\tau I_{n_2}-\beta B^TB$, $\tau>\beta\|B^TB\|$, the quadratic term in the $y$-subproblem is linearized. The same fact applies to the $x$-subproblem. Furthermore, when $G_{1,k}\equiv I_{n_1}$ or is of the form $\tau I_{n_1}-\beta A^TA$ with $\tau>0$, the term $\frac12\|y-y^k\|_{G_{2,k}}^2$ vanishes, and $r=0$, SSL-ADMM reduces to the algorithms that appeared in the earlier literature [21, 24]. Finally, the convergence region $\mathcal{D}$ of $(r,s)$ is the same as that in [27]. In particular, if $r=0$, then $s\in(0,\frac{1+\sqrt 5}{2}]$. Recently, Bai et al.[29, 30] studied stochastic ADMM algorithms for finite-sum optimization problems. The difference between our paper and theirs is that our goal is to consider the more general stochastic optimization problem (1), where the random variable $\xi$ is not necessarily discrete. For example, in SP, while it is possible to approximate the function $\theta_1$ by the sample average approximation technique, the extra sample approximation error would also have to be taken into account.
 Algorithm 1: Symmetric Stochastic Linearized ADMM (SSL-ADMM)
  Initialize $x^0\in\mathcal{X}$, $y^0\in\mathcal{Y}$, $\lambda^0$, $\beta$, $(r,s)\in\mathcal{D}$, and two sequences of symmetric and positive semidefinite matrices $\{G_{1,k}\}$ and $\{G_{2,k}\}$, where
   $\mathcal{D}=\{(r,s)\mid r+s>0,\ r\le 1,\ -r^2-s^2-rs+r+s+1\ge 0\}.$
  for $k=0,1,\cdots$
   Call the SFO to obtain $G(x^k,\xi)$;
   $x^{k+1}=\arg\min_{x\in\mathcal{X}}\big\{G(x^k,\xi)^T(x-x^k)-x^TA^T\lambda^k+\frac{\beta}{2}\|Ax+By^k-b\|^2+\frac12\|x-x^k\|_{G_{1,k}}^2\big\}$;
   $\lambda^{k+\frac12}=\lambda^k-r\beta(Ax^{k+1}+By^k-b)$;
   $y^{k+1}=\arg\min_{y\in\mathcal{Y}}\big\{\theta_2(y)-y^TB^T\lambda^{k+\frac12}+\frac{\beta}{2}\|Ax^{k+1}+By-b\|^2+\frac12\|y-y^k\|_{G_{2,k}}^2\big\}$;
   $\lambda^{k+1}=\lambda^{k+\frac12}-s\beta(Ax^{k+1}+By^{k+1}-b)$.
  end
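For concreteness, the following Python sketch (ours, not from the paper) implements one SSL-ADMM iteration under the linearizing choices discussed above, namely $G_{1,k}=\tau_kI_{n_1}-\beta A^TA$ and $G_{2,k}=\eta I_{n_2}-\beta B^TB$, which make both subproblems explicit; `sfo`, `prox_theta2`, and `project_X` are hypothetical user-supplied callables, and $\mathcal{Y}=\mathbb{R}^{n_2}$ is assumed so that the $y$-update is a plain proximal step.

```python
def ssl_admm_step(x, y, lam, A, B, b, beta, r, s, tau, eta,
                  sfo, prox_theta2, project_X, rng):
    """One SSL-ADMM iteration (NumPy arrays) under the choices
    G_{1,k} = tau*I - beta*A^T A and G_{2,k} = eta*I - beta*B^T B,
    which make both subproblems explicit."""
    g = sfo(x, rng)                                  # stochastic gradient G(x^k, xi)

    # x-update: projected step on the linearized augmented Lagrangian
    x_new = project_X(x - (g - A.T @ lam + beta * A.T @ (A @ x + B @ y - b)) / tau)

    # first (partial) dual update
    lam_half = lam - r * beta * (A @ x_new + B @ y - b)

    # y-update: proximal mapping of theta_2 (Y = R^{n_2} assumed)
    v = y - (-B.T @ lam_half + beta * B.T @ (A @ x_new + B @ y - b)) / eta
    y_new = prox_theta2(v, 1.0 / eta)

    # second dual update
    lam_new = lam_half - s * beta * (A @ x_new + B @ y_new - b)
    return x_new, y_new, lam_new
```

With $r=0$ and $s=1$ the two dual updates collapse into the classical single update, which roughly recovers the non-symmetric stochastic linearized ADMM of [21] used as the baseline in Section 5.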
We start to establish the convergence of SSL-ADMM. The next several lemmas aim to obtain an upper bound of $\theta(\tilde u^k)-\theta(u)+(\tilde w^k-w)^TF(\tilde w^k)$. With such a bound, it is possible to estimate the worst-case convergence rate of SSL-ADMM.
Lemma 2 Let the sequence $\{w^k\}$ be generated by the SSL-ADMM and the associated $\{\tilde w^k\}$ be defined in (2). Then we have
$$\theta(u)-\theta(\tilde u^k)+(w-\tilde w^k)^TF(\tilde w^k)\ge(w-\tilde w^k)^TQ_k(w^k-\tilde w^k)-(x-\tilde x^k)^T\delta_k-\frac{L}{2}\|x^k-\tilde x^k\|^2,\quad\forall w\in\Omega,$$
(3)
where $\delta_k=G(x^k,\xi)-\nabla\theta_1(x^k)$, similarly hereinafter, and
$$Q_k=\begin{pmatrix}G_{1,k}&0&0\\0&\beta B^TB+G_{2,k}&-rB^T\\0&-B&\frac{1}{\beta}I_n\end{pmatrix}.$$
(4)
Proof Due to Proposition 1, the optimality condition of the $x$-subproblem in SSL-ADMM is
$$(x-x^{k+1})^T\big(G(x^k,\xi)-A^T(\lambda^k-\beta(Ax^{k+1}+By^k-b))+G_{1,k}(x^{k+1}-x^k)\big)\ge 0,\quad\forall x\in\mathcal{X}.$$
Using the notation in (2), the above inequality can be rewritten as
$$(x-\tilde x^k)^T\big(\nabla\theta_1(x^k)+\delta_k-A^T\tilde\lambda^k+G_{1,k}(\tilde x^k-x^k)\big)\ge 0,\quad\forall x\in\mathcal{X}.$$
Then, using Lemma 1, we have
$$\theta_1(x)-\theta_1(\tilde x^k)+(x-\tilde x^k)^T(-A^T\tilde\lambda^k)\ge(x-\tilde x^k)^TG_{1,k}(x^k-\tilde x^k)-(x-\tilde x^k)^T\delta_k-\frac{L}{2}\|x^k-\tilde x^k\|^2,\quad\forall x\in\mathcal{X}.$$
(5)
Similarly, the optimality condition of the $y$-subproblem in SSL-ADMM is
$$\theta_2(y)-\theta_2(y^{k+1})+(y-y^{k+1})^T\big(-B^T\lambda^{k+\frac12}+\beta B^T(Ax^{k+1}+By^{k+1}-b)+G_{2,k}(y^{k+1}-y^k)\big)\ge 0,\quad\forall y\in\mathcal{Y}.$$
(6)
Using the notation of $\tilde\lambda^k$, $\lambda^{k+\frac12}=\lambda^k-r(\lambda^k-\tilde\lambda^k)=\tilde\lambda^k+(1-r)(\lambda^k-\tilde\lambda^k)$. Hence
$$-B^T\lambda^{k+\frac12}+\beta B^T(Ax^{k+1}+By^{k+1}-b)=-B^T\big(\tilde\lambda^k+(1-r)(\lambda^k-\tilde\lambda^k)\big)+B^T(\lambda^k-\tilde\lambda^k)+\beta B^TB(\tilde y^k-y^k)=-B^T\tilde\lambda^k+rB^T(\lambda^k-\tilde\lambda^k)+\beta B^TB(\tilde y^k-y^k).$$
Substituting this equality into (6), we obtain
$$\theta_2(y)-\theta_2(\tilde y^k)+(y-\tilde y^k)^T(-B^T\tilde\lambda^k)\ge(y-\tilde y^k)^T(\beta B^TB+G_{2,k})(y^k-\tilde y^k)-r(y-\tilde y^k)^TB^T(\lambda^k-\tilde\lambda^k).$$
(7)
According to the definition of $\tilde w^k$, we have
$$(A\tilde x^k+B\tilde y^k-b)-B(\tilde y^k-y^k)+\frac{1}{\beta}(\tilde\lambda^k-\lambda^k)=0,$$
and it can be written as
$$(\lambda-\tilde\lambda^k)^T\Big\{(A\tilde x^k+B\tilde y^k-b)-B(\tilde y^k-y^k)+\frac{1}{\beta}(\tilde\lambda^k-\lambda^k)\Big\}\ge 0,\quad\forall\lambda\in\mathbb{R}^n.$$
(8)
Combining (5), (7), and (8), and using the notation of (4), the proof is completed.
Lemma 3 Let the sequence $\{w^k\}$ be generated by the SSL-ADMM and the associated $\{\tilde w^k\}$ be defined in (2). Then we have
$$w^{k+1}=w^k-M(w^k-\tilde w^k),$$
(9)
where
$$M=\begin{pmatrix}I_{n_1}&0&0\\0&I_{n_2}&0\\0&-s\beta B&(r+s)I_n\end{pmatrix}.$$
(10)
Proof
$$\lambda^{k+1}=\lambda^{k+\frac12}-s\beta(Ax^{k+1}+By^{k+1}-b)=\lambda^k-r(\lambda^k-\tilde\lambda^k)-s\big(\beta(Ax^{k+1}+By^k-b)-\beta B(y^k-y^{k+1})\big)=\lambda^k-(r+s)(\lambda^k-\tilde\lambda^k)+s\beta B(y^k-\tilde y^k).$$
Together with $x^{k+1}=\tilde x^k$ and $y^{k+1}=\tilde y^k$, we prove the assertion of this lemma.
Note that for the matrices $Q_k$ defined in (4) and $M$ defined in (10), there is a matrix $H_k$ such that $Q_k=H_kM$, where
$$H_k=\begin{pmatrix}G_{1,k}&0&0\\0&\big(1-\frac{rs}{r+s}\big)\beta B^TB+G_{2,k}&-\frac{r}{r+s}B^T\\0&-\frac{r}{r+s}B&\frac{1}{\beta(r+s)}I_n\end{pmatrix}.$$
(11)
It is easy to check that for any $(r,s)\in\mathcal{D}$, $H_k$ is positive semidefinite when the matrix $B$ has full column rank. In fact, to make $H_k$ positive semidefinite alone, it is sufficient to have $G_{2,k}\succeq(r-1)\beta B^TB$ when the other conditions are satisfied.
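For completeness, the factorization $Q_k=H_kM$ can be verified directly by block multiplication:
$$H_kM=\begin{pmatrix}G_{1,k}&0&0\\0&\big(1-\frac{rs}{r+s}\big)\beta B^TB+G_{2,k}&-\frac{r}{r+s}B^T\\0&-\frac{r}{r+s}B&\frac{1}{\beta(r+s)}I_n\end{pmatrix}\begin{pmatrix}I_{n_1}&0&0\\0&I_{n_2}&0\\0&-s\beta B&(r+s)I_n\end{pmatrix}=\begin{pmatrix}G_{1,k}&0&0\\0&\beta B^TB+G_{2,k}&-rB^T\\0&-B&\frac{1}{\beta}I_n\end{pmatrix}=Q_k,$$
where the (2,2) block uses $\big(1-\frac{rs}{r+s}\big)\beta B^TB+\frac{rs}{r+s}\beta B^TB=\beta B^TB$ and the (3,2) block uses $-\frac{r}{r+s}B-\frac{s}{r+s}B=-B$.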
Lemma 4 Let the sequence $\{w^k\}$ be generated by the SSL-ADMM and the associated $\{\tilde w^k\}$ be defined in (2). Then we have
$$(w-\tilde w^k)^TQ_k(w^k-\tilde w^k)=\frac12\big(\|w-w^{k+1}\|_{H_k}^2-\|w-w^k\|_{H_k}^2\big)+\frac12(w^k-\tilde w^k)^TG(w^k-\tilde w^k),$$
(12)
where $G:=Q_k+Q_k^T-M^TH_kM$.
Proof Using $Q_k=H_kM$ and $w^{k+1}=w^k-M(w^k-\tilde w^k)$, we have $(w-\tilde w^k)^TQ_k(w^k-\tilde w^k)=(w-\tilde w^k)^TH_k(w^k-w^{k+1})$. Applying the identity $(a-b)^TH(c-d)=\frac12(\|a-d\|_H^2-\|a-c\|_H^2)+\frac12(\|c-b\|_H^2-\|d-b\|_H^2)$, we obtain
$$(w-\tilde w^k)^TH_k(w^k-w^{k+1})=\frac12\big(\|w-w^{k+1}\|_{H_k}^2-\|w-w^k\|_{H_k}^2\big)+\frac12\big(\|w^k-\tilde w^k\|_{H_k}^2-\|w^{k+1}-\tilde w^k\|_{H_k}^2\big).$$
The remaining task is to simplify the last two terms:
$$\|w^k-\tilde w^k\|_{H_k}^2-\|w^{k+1}-\tilde w^k\|_{H_k}^2=\|w^k-\tilde w^k\|_{H_k}^2-\|(w^{k+1}-w^k)+(w^k-\tilde w^k)\|_{H_k}^2=\|w^k-\tilde w^k\|_{H_k}^2-\|(I_{n_1+n_2+n}-M)(w^k-\tilde w^k)\|_{H_k}^2=(w^k-\tilde w^k)^T\big(H_k-(I_{n_1+n_2+n}-M)^TH_k(I_{n_1+n_2+n}-M)\big)(w^k-\tilde w^k)=(w^k-\tilde w^k)^T(Q_k+Q_k^T-M^TH_kM)(w^k-\tilde w^k).$$
The proof is completed.
From this lemma, $(w-\tilde w^k)^TQ_k(w^k-\tilde w^k)$ can be written as two terms, one of which is suitable for a recursive operation and the other is a quadratic term, but the matrix $G$ is not necessarily semidefinite. Thus we need to analyze this quadratic term in detail. Since
$$G=\begin{pmatrix}G_{1,k}&0&0\\0&(1-s)\beta B^TB+G_{2,k}&(s-1)B^T\\0&(s-1)B&\frac{2-r-s}{\beta}I_n\end{pmatrix},$$
$$(w^k-\tilde w^k)^TG(w^k-\tilde w^k)=\|x^k-x^{k+1}\|_{G_{1,k}}^2+\|y^k-y^{k+1}\|_{G_{2,k}}^2+(1-s)\beta\|B(y^k-y^{k+1})\|^2+\frac{2-r-s}{\beta}\|\lambda^k-\tilde\lambda^k\|^2+2(s-1)(\lambda^k-\tilde\lambda^k)^TB(y^k-y^{k+1}).$$
As the following lemma shows, the last two terms of the above equality can be further analyzed.
Lemma 5 Assume that $(r,s)\in\mathcal{D}$. Let the sequence $\{w^k\}$ be generated by the SSL-ADMM and the associated $\{\tilde w^k\}$ be defined in (2). Then we have
$$\frac{2-r-s}{\beta}\|\lambda^k-\tilde\lambda^k\|^2+2(s-1)(\lambda^k-\tilde\lambda^k)^TB(y^k-y^{k+1})\ge\frac{2(1-r)(1-s)}{1+r}\beta(Ax^k+By^k-b)^TB(y^k-y^{k+1})+\Big(s-r-\frac{2r(1-r)}{1+r}\Big)\beta\|B(y^k-y^{k+1})\|^2+(2-r-s)\beta\|Ax^{k+1}+By^{k+1}-b\|^2+\frac{2(1-r)\beta}{1+r}\Big(\|y^{k+1}-y^k\|_{G_{2,k}}^2-\frac12\|y^k-y^{k-1}\|_{G_{2,k-1}}^2-\frac12\|y^{k+1}-y^k\|_{G_{2,k-1}}^2\Big).$$
(13)
Proof It follows from the optimality condition of the $y$-subproblem at the $(k+1)$-th iteration that
$$\theta_2(y^k)-\theta_2(y^{k+1})+(y^k-y^{k+1})^T\big(-B^T\lambda^{k+\frac12}+\beta B^T(Ax^{k+1}+By^{k+1}-b)+G_{2,k}(y^{k+1}-y^k)\big)\ge 0.$$
Similarly, it follows from the optimality condition of the $y$-subproblem at the $k$-th iteration that
$$\theta_2(y^{k+1})-\theta_2(y^k)+(y^{k+1}-y^k)^T\big(-B^T\lambda^{k-\frac12}+\beta B^T(Ax^k+By^k-b)+G_{2,k-1}(y^k-y^{k-1})\big)\ge 0.$$
Adding these two inequalities and using
$$\lambda^{k-\frac12}-\lambda^{k+\frac12}=r\beta(Ax^{k+1}+By^{k+1}-b)+s\beta(Ax^k+By^k-b)+r\beta B(y^k-y^{k+1}),$$
we obtain
$$\big\{(r+1)\beta(Ax^{k+1}+By^{k+1}-b)+(s-1)\beta(Ax^k+By^k-b)+r\beta B(y^k-y^{k+1})\big\}^TB(y^k-y^{k+1})\ge(y^k-y^{k+1})^T\big(G_{2,k-1}(y^k-y^{k-1})-G_{2,k}(y^{k+1}-y^k)\big).$$
This implies
$$(Ax^{k+1}+By^{k+1}-b)^TB(y^k-y^{k+1})\ge\frac{1-s}{1+r}(Ax^k+By^k-b)^TB(y^k-y^{k+1})-\frac{r}{1+r}\|B(y^k-y^{k+1})\|^2+\frac{\beta}{1+r}\Big(\|y^{k+1}-y^k\|_{G_{2,k}}^2-\frac12\|y^k-y^{k-1}\|_{G_{2,k-1}}^2-\frac12\|y^{k+1}-y^k\|_{G_{2,k-1}}^2\Big).$$
(14)
On the other hand, we have
$$2(s-1)(\lambda^k-\tilde\lambda^k)^TB(y^k-y^{k+1})=2(s-1)\beta(Ax^{k+1}+By^k-b)^TB(y^k-y^{k+1})=2(s-1)\big\{\beta(Ax^{k+1}+By^{k+1}-b)^TB(y^k-y^{k+1})+\beta\|B(y^k-y^{k+1})\|^2\big\}$$
(15)
and
$$\frac{2-r-s}{\beta}\|\lambda^k-\tilde\lambda^k\|^2=(2-r-s)\beta\|Ax^{k+1}+By^k-b\|^2=(2-r-s)\beta\|(Ax^{k+1}+By^{k+1}-b)+B(y^k-y^{k+1})\|^2=(2-r-s)\beta\|Ax^{k+1}+By^{k+1}-b\|^2+(2-r-s)\beta\|B(y^k-y^{k+1})\|^2+2(2-r-s)\beta(Ax^{k+1}+By^{k+1}-b)^TB(y^k-y^{k+1}).$$
(16)
Combining (14), (15), and (16), we get the assertion of this lemma.
According to this lemma and using the Cauchy-Schwarz inequality, the term $(w^k-\tilde w^k)^TG(w^k-\tilde w^k)$ can be bounded as follows:
$$(w^k-\tilde w^k)^TG(w^k-\tilde w^k)\ge\Big(2-r-s-\frac{(1-s)^2}{1+r}\Big)\beta\|Ax^{k+1}+By^{k+1}-b\|^2+\|x^k-x^{k+1}\|_{G_{1,k}}^2+\|y^k-y^{k+1}\|_{G_{2,k}}^2+\frac{2(1-r)\beta}{1+r}\Big(\|y^{k+1}-y^k\|_{G_{2,k}}^2-\frac12\|y^k-y^{k-1}\|_{G_{2,k-1}}^2-\frac12\|y^{k+1}-y^k\|_{G_{2,k-1}}^2\Big)+\frac{(1-s)^2}{1+r}\beta\big(\|Ax^{k+1}+By^{k+1}-b\|^2-\|Ax^k+By^k-b\|^2\big).$$
(17)
Now, combining (17), Lemma 2, and Lemma 4, we obtain the following main theorem. In this theorem, we take $G_{1,k}$ of the form $\tau_kI_{n_1}-\beta A^TA$, $\tau_k>0$, which simplifies the system of linear equations in the $x$-subproblem, and $G_{2,k}\equiv G_2$. Of course, $G_2$ can also take a similar form to $G_{1,k}$. In particular, if $G_2=\eta I_{n_2}-\beta B^TB$ with $\eta\ge\beta\|B^TB\|$, then the $y$-subproblem reduces to the proximal mapping of $\theta_2$.
Theorem 1 Assume that $(r,s)\in\mathcal{D}$. Let the sequence $\{w^k\}$ be generated by the SSL-ADMM and the associated $\{\tilde w^k\}$ be defined in (2), and let
$$\bar w_N=\frac{1}{N}\sum_{t=1}^N\tilde w^t$$
for some pre-selected integer $N$. Choosing $\tau_k\equiv\sqrt N+M$, where $M$ is a constant satisfying the ordering relation $MI_{n_1}\succeq LI_{n_1}+\beta A^TA$, we have
$$\theta(\bar u_N)-\theta(u)+(\bar w_N-w)^TF(w)\le\frac{1}{2N}\|w^1-w\|_{H_1}^2+\frac{(1-r)\beta}{2(1+r)N}\|y^1-y^0\|_{G_2}^2+\frac{(1-s)^2\beta}{2(1+r)N}\|Ax^1+By^1-b\|^2+\frac{1}{N}\sum_{t=1}^N(x-x^t)^T\delta_t+\frac{1}{2N\sqrt N}\sum_{t=1}^N\|\delta_t\|^2.$$
(18)
Proof It suffices to use the convexity of $\theta$, the inequality $(x^k-x^{k+1})^T\delta_k\le\frac{\sqrt N}{2}\|x^k-x^{k+1}\|^2+\frac{1}{2\sqrt N}\|\delta_k\|^2$, and the identity $(\tilde w^k-w)^TF(w)=(\tilde w^k-w)^TF(\tilde w^k)$.
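In slightly more detail (our sketch of the omitted step, under the choice $\tau_k\equiv\sqrt N+M$): since $\tilde x^t=x^{t+1}$,
$$(x-\tilde x^t)^T\delta_t=(x-x^t)^T\delta_t+(x^t-x^{t+1})^T\delta_t\le(x-x^t)^T\delta_t+\frac{\sqrt N}{2}\|x^t-x^{t+1}\|^2+\frac{1}{2\sqrt N}\|\delta_t\|^2,$$
and the term $\frac{\sqrt N}{2}\|x^t-x^{t+1}\|^2$, together with $\frac{L}{2}\|x^t-\tilde x^t\|^2$ from Lemma 2, is absorbed by $\frac12\|x^t-x^{t+1}\|_{G_{1,t}}^2$ in (17), because $G_{1,t}=\tau_tI_{n_1}-\beta A^TA\succeq(\sqrt N+L)I_{n_1}$; summing over $t=1,\cdots,N$, dividing by $N$, and using the convexity of $\theta$ then yields (18).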
Corollary 1 Assume that all the conditions in Theorem 1 hold; then SSL-ADMM has the following properties:
(i)
$$\mathbb{E}[\|A\bar x_N+B\bar y_N-b\|]\le\frac{1}{2N}\|w^1-(x^*,y^*,\lambda^*+e)\|_{H_1}^2+\frac{(1-r)\beta}{2(1+r)N}\|y^1-y^0\|_{G_2}^2+\frac{(1-s)^2\beta}{2(1+r)N}\|Ax^1+By^1-b\|^2+\frac{\sigma^2}{2\sqrt N},$$
(19)
(ii)
$$\mathbb{E}[\theta(\bar u_N)-\theta(u^*)]\le(\|\lambda^*\|+1)\Big(\frac{1}{2N}\|w^1-(x^*,y^*,\lambda^*+e)\|_{H_1}^2+\frac{(1-r)\beta}{2(1+r)N}\|y^1-y^0\|_{G_2}^2\Big)+(\|\lambda^*\|+1)\Big(\frac{(1-s)^2\beta}{2(1+r)N}\|Ax^1+By^1-b\|^2+\frac{\sigma^2}{2\sqrt N}\Big),$$
(20)
where $e$ is a unit vector satisfying $e^T(A\bar x_N+B\bar y_N-b)=-\|A\bar x_N+B\bar y_N-b\|$ and the expectation is taken conditional on $w^1$.
Proof Let $w=(x^*,y^*,\lambda)$ in (18), where $\lambda=\lambda^*+e$; then the left-hand side of (18) is $\theta(\bar u_N)-\theta(u^*)-(\lambda^*)^T(A\bar x_N+B\bar y_N-b)+\|A\bar x_N+B\bar y_N-b\|$, which follows from
$$(\bar w_N-w)^TF(w)=(\bar x_N-x^*)^T(-A^T\lambda)+(\bar y_N-y^*)^T(-B^T\lambda)+(\bar\lambda_N-\lambda)^T(Ax^*+By^*-b)=\lambda^T(Ax^*+By^*-b)-\lambda^T(A\bar x_N+B\bar y_N-b)=-(\lambda^*)^T(A\bar x_N+B\bar y_N-b)+\|A\bar x_N+B\bar y_N-b\|,$$
where the first equality follows from the definition of $F$, and the second and last equalities hold due to $Ax^*+By^*-b=0$ and the choice of $\lambda$. On the other hand, substituting $w=\bar w_N$ into the variational inequality associated with (1), we get $\theta(\bar u_N)-\theta(u^*)-(\lambda^*)^T(A\bar x_N+B\bar y_N-b)\ge 0$. Hence, the left-hand side of (18) is greater than or equal to $\|A\bar x_N+B\bar y_N-b\|$ when letting $w=(x^*,y^*,\lambda^*+e)$, and (19) is obtained by taking expectation. With the same choice of $w$, we can also get
$$\theta(\bar u_N)-\theta(u^*)+(\bar w_N-w)^TF(w)=\theta(\bar u_N)-\theta(u^*)-\lambda^T(A\bar x_N+B\bar y_N-b)\ge\theta(\bar u_N)-\theta(u^*)-\|\lambda\|\|A\bar x_N+B\bar y_N-b\|,$$
i.e.,
$$\theta(\bar u_N)-\theta(u^*)\le\theta(\bar u_N)-\theta(u^*)+(\bar w_N-w)^TF(w)+\|\lambda\|\|A\bar x_N+B\bar y_N-b\|.$$
Then, by taking expectation, (20) is obtained.
Remark 1 (i) In Theorem 1 and Corollary 1, the $\tau_k$'s are constant, and $N$ needs to be selected in advance. In fact, $\tau_k$ can also vary with the number of iterations, e.g., $\tau_k=\sqrt k+M$. In this case, if the distance between $w^k$ and $w^*$ is bounded, i.e., $\|w^k-w^*\|^2\le R^2$ for any $k$, we can also obtain a worst-case convergence rate. The only difference from the proof idea of Theorem 1 and Corollary 1 lies in bounding the term $\sum_{t=0}^k\big(\|x^t-x^*\|_{G_{1,t}}^2-\|x^{t+1}-x^*\|_{G_{1,t}}^2\big)$, which is now bounded as follows:
$$\sum_{t=0}^k\big(\|x^t-x^*\|_{G_{1,t}}^2-\|x^{t+1}-x^*\|_{G_{1,t}}^2\big)\le M\|x^0-x^*\|^2+\sum_{i=0}^{k-1}(\tau_{i+1}-\tau_i)\|x^{i+1}-x^*\|^2-\|x^{k+1}-x^*\|_{G_{1,k}}^2\le\Big(M+\sum_{i=0}^{k-1}(\tau_{i+1}-\tau_i)\Big)R^2=(M+\sqrt k)R^2.$$
(ii) Corollary 1 reveals that the worst-case convergence rate of SSL-ADMM for solving general convex problems is $O(1/\sqrt N)$, where $N$ is the iteration number.
At the end of this section, we assume that $\theta_1$ is $\mu$-strongly convex, i.e., $\theta_1(x)\ge\theta_1(y)+\langle\nabla\theta_1(y),x-y\rangle+\frac{\mu}{2}\|x-y\|^2$ with $\mu>0$ for all $x,y\in\mathcal{X}$. With strong convexity, we can show not only that the objective error and constraint violation converge to zero in expectation, but also the convergence of the ergodic iterates of SSL-ADMM.
Theorem 2 Assume that $(r,s)\in\mathcal{D}$. Let the sequence $\{w^k\}$ be generated by the SSL-ADMM and the associated $\{\tilde w^k\}$ be defined in (2), and let
$$\bar w_k=\frac{1}{k}\sum_{t=1}^k\tilde w^t.$$
Choosing $\tau_k=\mu(k+1)+M$, where $M$ is a constant satisfying the ordering relation $MI_{n_1}\succeq LI_{n_1}+\beta A^TA$, SSL-ADMM has the following properties:
(i)
$$\mathbb{E}[\|A\bar x_k+B\bar y_k-b\|]\le\frac{1}{2k}\|(y^1,\lambda^1)-(y^*,\lambda^*+e)\|_{H_{1;2\times2}}^2+\frac{(1-r)\beta}{2(1+r)k}\|y^1-y^0\|_{G_2}^2+\frac{(1-s)^2\beta}{2(1+r)k}\|Ax^1+By^1-b\|^2+\frac{\mu}{2k}\|x^1-x^*\|^2+\frac{(1+\ln k)\sigma^2}{2\mu k},$$
(21)
(ii)
$$\mathbb{E}[\theta(\bar u_k)-\theta(u^*)]\le(\|\lambda^*\|+1)\Big(\frac{1}{2k}\|(y^1,\lambda^1)-(y^*,\lambda^*+e)\|_{H_{1;2\times2}}^2+\frac{(1-r)\beta}{2(1+r)k}\|y^1-y^0\|_{G_2}^2\Big)+(\|\lambda^*\|+1)\Big(\frac{(1-s)^2\beta}{2(1+r)k}\|Ax^1+By^1-b\|^2+\frac{\mu}{2k}\|x^1-x^*\|^2+\frac{(1+\ln k)\sigma^2}{2\mu k}\Big),$$
(22)
where $e$ is a unit vector satisfying $e^T(A\bar x_k+B\bar y_k-b)=-\|A\bar x_k+B\bar y_k-b\|$,
$$H_{1;2\times2}=\begin{pmatrix}\big(1-\frac{rs}{r+s}\big)\beta B^TB+G_2&-\frac{r}{r+s}B^T\\-\frac{r}{r+s}B&\frac{1}{\beta(r+s)}I_n\end{pmatrix},$$
and the expectation is taken conditional on $w^1$.
Proof First, similar to the proof of Lemma 2, using the $\mu$-strong convexity of $\theta_1$, we conclude that
$$\theta(u)-\theta(\tilde u^k)+(w-\tilde w^k)^TF(\tilde w^k)\ge(w-\tilde w^k)^TQ_k(w^k-\tilde w^k)-(x-\tilde x^k)^T\delta_k-\frac{L}{2}\|x^k-\tilde x^k\|^2+\frac{\mu}{2}\|x-x^k\|^2,\quad\forall w\in\Omega.$$
Then, using $Q_k=H_kM$, $(\tilde w^t-w)^TF(\tilde w^t)=(\tilde w^t-w)^TF(w)$, Lemma 4, and (17), we get
$$\theta(\tilde u^t)-\theta(u)+(\tilde w^t-w)^TF(w)\le\frac12\big(\|(y^t,\lambda^t)-(y,\lambda)\|_{H_{1;2\times2}}^2-\|(y^{t+1},\lambda^{t+1})-(y,\lambda)\|_{H_{1;2\times2}}^2\big)+(x-x^t)^T\delta_t+\frac{(1-s)^2\beta}{2(1+r)}\big(\|Ax^t+By^t-b\|^2-\|Ax^{t+1}+By^{t+1}-b\|^2\big)+\frac{1}{2\mu(t+1)}\|\delta_t\|^2+\frac{(1-r)\beta}{2(1+r)}\big(\|y^t-y^{t-1}\|_{G_2}^2-\|y^{t+1}-y^t\|_{G_2}^2\big)+\frac12\big(\mu t\|x^t-x\|^2-\mu(t+1)\|x^{t+1}-x\|^2\big).$$
Adding the above inequalities from $t=1$ to $k$, dividing by $k$, and then following the proof of Corollary 1, we prove the assertion of this theorem.
This theorem implies that, under the assumption that $\theta_1$ is strongly convex, the worst-case convergence rate of SSL-ADMM can be improved to $O((\ln k)/k)$ with the above choice of $\tau_k$, i.e., a diminishing step size. The following theorem shows the convergence of the ergodic iterates of SSL-ADMM, which is not covered in the earlier literature [21, 24]. Furthermore, if $\theta_2$ is also strongly convex, the assumption that $B$ has full column rank can be removed.
Theorem 3 Assume that $(r,s)\in\mathcal{D}$. Let the sequence $\{w^k\}$ be generated by the SSL-ADMM, the associated $\{\tilde w^k\}$ be defined in (2), and
$$\bar w_k=\frac{1}{k}\sum_{t=1}^k\tilde w^t.$$
Choosing $\tau_k=\mu(k+1)+M$, where $M$ is a constant satisfying the ordering relation $MI_{n_1}\succeq LI_{n_1}+\beta A^TA$, and assuming $B$ has full column rank with $\lambda_{\min}$ denoting the minimum eigenvalue of $B^TB$, we have
$$\mathbb{E}[\|\bar x_k-x^*\|+\|\bar y_k-y^*\|]\le\Big(1+\frac{\|A\|}{\sqrt{\lambda_{\min}}}\Big)\sqrt{\frac{2}{\mu}\big(\mathbb{E}[\theta(\bar u_k)-\theta(u^*)]+\|\lambda^*\|\,\mathbb{E}[\|A\bar x_k+B\bar y_k-b\|]\big)}+\frac{1}{\sqrt{\lambda_{\min}}}\mathbb{E}[\|A\bar x_k+B\bar y_k-b\|],$$
(23)
where the bounds for $\mathbb{E}[\|A\bar x_k+B\bar y_k-b\|]$ and $\mathbb{E}[\theta(\bar u_k)-\theta(u^*)]$ are the same as in (21) and (22), respectively, and the expectation is taken conditional on $w^1$.
Proof Since $(x^*,y^*,\lambda^*)$ is a solution of (1), we have $A^T\lambda^*=\nabla\theta_1(x^*)$ and $B^T\lambda^*\in\partial\theta_2(y^*)$. Hence, since $\theta_1$ is strongly convex and $\theta_2$ is convex, we have
$$\theta_1(\bar x_k)\ge\theta_1(x^*)+(\lambda^*)^T(A\bar x_k-Ax^*)+\frac{\mu}{2}\|\bar x_k-x^*\|^2$$
(24)
and
$$\theta_2(\bar y_k)\ge\theta_2(y^*)+(\lambda^*)^T(B\bar y_k-By^*).$$
(25)
Adding up (24) and (25), we get $\theta(\bar u_k)\ge\theta(u^*)+(\lambda^*)^T(A\bar x_k+B\bar y_k-b)+\frac{\mu}{2}\|\bar x_k-x^*\|^2$, that is,
$$\|\bar x_k-x^*\|\le\sqrt{\frac{2}{\mu}\big(\theta(\bar u_k)-\theta(u^*)-(\lambda^*)^T(A\bar x_k+B\bar y_k-b)\big)}\le\sqrt{\frac{2}{\mu}\big(\theta(\bar u_k)-\theta(u^*)+\|\lambda^*\|\|A\bar x_k+B\bar y_k-b\|\big)}.$$
(26)
On the other hand,
$$\|A\bar x_k+B\bar y_k-b\|=\|A(\bar x_k-x^*)+B(\bar y_k-y^*)\|\ge\|B(\bar y_k-y^*)\|-\|A\|\|\bar x_k-x^*\|,$$
which implies $\|B(\bar y_k-y^*)\|\le\|A\|\|\bar x_k-x^*\|+\|A\bar x_k+B\bar y_k-b\|$ and hence
$$\|\bar y_k-y^*\|\le\frac{\|A\|}{\sqrt{\lambda_{\min}}}\|\bar x_k-x^*\|+\frac{1}{\sqrt{\lambda_{\min}}}\|A\bar x_k+B\bar y_k-b\|.$$
(27)
Adding (26) and (27), using Jensen's inequality $\mathbb{E}[X^{\frac12}]\le(\mathbb{E}[X])^{\frac12}$ for a nonnegative random variable $X$, and taking expectation imply
$$\mathbb{E}[\|\bar x_k-x^*\|+\|\bar y_k-y^*\|]\le\Big(1+\frac{\|A\|}{\sqrt{\lambda_{\min}}}\Big)\sqrt{\frac{2}{\mu}\mathbb{E}\big[\theta(\bar u_k)-\theta(u^*)+\|\lambda^*\|\|A\bar x_k+B\bar y_k-b\|\big]}+\frac{1}{\sqrt{\lambda_{\min}}}\mathbb{E}[\|A\bar x_k+B\bar y_k-b\|].$$
The proof is completed.

4 High Probability Performance Analysis

In this section, we shall establish the large-deviation properties of SSL-ADMM. By (19) and (20) and Markov's inequality, we have, for any $\varepsilon_1>0$ and $\varepsilon_2>0$, that
$$\Pr\Big\{\|A\bar x_N+B\bar y_N-b\|\le\varepsilon_1\Big(\frac{1}{2N}\|w^1-(x^*,y^*,\lambda^*+e)\|_{H_1}^2+\frac{(1-r)\beta}{2(1+r)N}\|y^1-y^0\|_{G_2}^2+\frac{(1-s)^2\beta}{2(1+r)N}\|Ax^1+By^1-b\|^2+\frac{\sigma^2}{2\sqrt N}\Big)\Big\}\ge1-\frac{1}{\varepsilon_1}$$
(28)
and
$$\Pr\Big\{\theta(\bar u_N)-\theta(u^*)\le\varepsilon_2\Big((\|\lambda^*\|+1)\Big(\frac{1}{2N}\|w^1-(x^*,y^*,\lambda^*+e)\|_{H_1}^2+\frac{(1-r)\beta}{2(1+r)N}\|y^1-y^0\|_{G_2}^2\Big)+(\|\lambda^*\|+1)\Big(\frac{(1-s)^2\beta}{2(1+r)N}\|Ax^1+By^1-b\|^2+\frac{\sigma^2}{2\sqrt N}\Big)\Big)\Big\}\ge1-\frac{1}{\varepsilon_2}.$$
(29)
However, these bounds are rather weak. In the following, we show that these high probability bounds can be significantly improved by imposing a standard "light-tail" assumption, see, e.g., [1, 32]. Specifically, assume that for any $x\in\mathcal{X}$,
$$\mathbb{E}\big[\exp\{\|G(x,\xi)-\nabla\theta_1(x)\|^2/\sigma^2\}\big]\le\exp\{1\}.$$
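As a quick check, this condition implies b) in Assumption (iii): by Jensen's inequality applied to the convex function $\exp\{\cdot\}$,
$$\exp\big\{\mathbb{E}\big[\|G(x,\xi)-\nabla\theta_1(x)\|^2\big]/\sigma^2\big\}\le\mathbb{E}\big[\exp\{\|G(x,\xi)-\nabla\theta_1(x)\|^2/\sigma^2\}\big]\le\exp\{1\},$$
so taking logarithms gives $\mathbb{E}[\|G(x,\xi)-\nabla\theta_1(x)\|^2]\le\sigma^2$.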
This light-tail assumption is therefore a little stronger than b) in Assumption (iii). For the further analysis, we assume that $\mathcal{X}$ is bounded and denote its diameter by $D_X:=\max_{x_1,x_2\in\mathcal{X}}\|x_1-x_2\|$. The following theorem gives the high probability bounds for the objective error and constraint violation of SSL-ADMM.
Theorem 4 Assume that all the conditions in Theorem 1 hold; then SSL-ADMM has the following properties:
(i)
$$\Pr\Big\{\|A\bar x_N+B\bar y_N-b\|\le\frac{1}{2N}\|w^1-(x^*,y^*,\lambda^*+e)\|_{H_1}^2+\frac{(1-r)\beta}{2(1+r)N}\|y^1-y^0\|_{G_2}^2+\frac{(1-s)^2\beta}{2(1+r)N}\|Ax^1+By^1-b\|^2+\frac{\Theta D_X\sigma}{\sqrt N}+\frac{1}{2\sqrt N}(1+\Theta)\sigma^2\Big\}\ge1-\exp\{-\Theta^2/3\}-\exp\{-\Theta\},$$
(30)
(ii)
$$\Pr\Big\{\theta(\bar u_N)-\theta(u^*)\le(\|\lambda^*\|+1)\Big(\frac{1}{2N}\|w^1-(x^*,y^*,\lambda^*+e)\|_{H_1}^2+\frac{(1-r)\beta}{2(1+r)N}\|y^1-y^0\|_{G_2}^2\Big)+(\|\lambda^*\|+1)\Big(\frac{(1-s)^2\beta}{2(1+r)N}\|Ax^1+By^1-b\|^2+\frac{\Theta D_X\sigma}{\sqrt N}+\frac{1}{2\sqrt N}(1+\Theta)\sigma^2\Big)\Big\}\ge1-\exp\{-\Theta^2/3\}-\exp\{-\Theta\},$$
(31)
where $e$ is a unit vector satisfying $e^T(A\bar x_N+B\bar y_N-b)=-\|A\bar x_N+B\bar y_N-b\|$.
Proof Let $\zeta_t=\frac{1}{N}(x^*-x^t)^T\delta_t$. Clearly, $\{\zeta_t\}_{t\ge1}$ is a martingale-difference sequence. Moreover, it follows from the definition of $D_X$ and the light-tail assumption that
$$\mathbb{E}\big[\exp\{\zeta_t^2/(\tfrac{1}{N}D_X\sigma)^2\}\big]\le\mathbb{E}\big[\exp\{(\tfrac{1}{N}D_X\|\delta_t\|)^2/(\tfrac{1}{N}D_X\sigma)^2\}\big]\le\exp\{1\}.$$
Now, using Proposition 2 for the martingale-difference sequence, we have for any $\Theta\ge0$
$$\Pr\Big\{\sum_{t=1}^N\zeta_t>\frac{\Theta D_X\sigma}{\sqrt N}\Big\}\le\exp\{-\Theta^2/3\}.$$
(32)
Also, observe that by Jensen's inequality for the exponential function,
$$\exp\Big\{\frac{1}{N}\sum_{t=1}^N(\|\delta_t\|^2/\sigma^2)\Big\}\le\frac{1}{N}\sum_{t=1}^N\exp\{\|\delta_t\|^2/\sigma^2\},$$
whence, taking expectation,
$$\mathbb{E}\Big[\exp\Big\{\frac{1}{N}\sum_{t=1}^N\|\delta_t\|^2/\sigma^2\Big\}\Big]\le\frac{1}{N}\sum_{t=1}^N\mathbb{E}\big[\exp\{\|\delta_t\|^2/\sigma^2\}\big]\le\exp\{1\}.$$
It then follows from Markov's inequality that for any $\Theta\ge0$
$$\Pr\Big\{\frac{1}{N}\sum_{t=1}^N\|\delta_t\|^2\ge(1+\Theta)\sigma^2\Big\}\le\exp\{-\Theta\}.$$
(33)
Using (32) and (33) in (18) for $w=(x^*,y^*,\lambda^*+e)$, we conclude that
$$\Pr\Big\{\|A\bar x_N+B\bar y_N-b\|>\frac{1}{2N}\|w^1-(x^*,y^*,\lambda^*+e)\|_{H_1}^2+\frac{(1-r)\beta}{2(1+r)N}\|y^1-y^0\|_{G_2}^2+\frac{(1-s)^2\beta}{2(1+r)N}\|Ax^1+By^1-b\|^2+\frac{\Theta D_X\sigma}{\sqrt N}+\frac{1}{2\sqrt N}(1+\Theta)\sigma^2\Big\}\le\exp\{-\Theta^2/3\}+\exp\{-\Theta\}$$
(34)
and
$$\Pr\Big\{\theta(\bar u_N)-\theta(u^*)>(\|\lambda^*\|+1)\Big(\frac{1}{2N}\|w^1-(x^*,y^*,\lambda^*+e)\|_{H_1}^2+\frac{(1-r)\beta}{2(1+r)N}\|y^1-y^0\|_{G_2}^2\Big)+(\|\lambda^*\|+1)\Big(\frac{(1-s)^2\beta}{2(1+r)N}\|Ax^1+By^1-b\|^2+\frac{\Theta D_X\sigma}{\sqrt N}+\frac{1}{2\sqrt N}(1+\Theta)\sigma^2\Big)\Big\}\le\exp\{-\Theta^2/3\}+\exp\{-\Theta\}.$$
(35)
The result immediately follows from the above inequalities.
Remark 2 In view of Theorem 4, if we take $\Theta=\ln N$, then (noting that $\exp\{-(\ln N)^2/3\}\le N^{-2/3}$ for $N\ge e^2$) we have
$$\Pr\Big\{\|A\bar x_N+B\bar y_N-b\|\le O\Big(\frac{\ln N}{\sqrt N}\Big)\Big\}\ge1-\frac{1}{N^{2/3}}-\frac{1}{N}$$
and
$$\Pr\Big\{\theta(\bar u_N)-\theta(u^*)\le O\Big(\frac{\ln N}{\sqrt N}\Big)\Big\}\ge1-\frac{1}{N^{2/3}}-\frac{1}{N}.$$
For the strongly convex case, using a similar derivation, the high probability bounds for the objective error and constraint violation of SSL-ADMM are
$$\Pr\Big\{\|A\bar x_N+B\bar y_N-b\|\le O\Big(\frac{(\ln N)^2}{N}\Big)\Big\}\ge1-\frac{1}{N^{2/3}}-\frac{1}{N}$$
and
$$\Pr\Big\{\theta(\bar u_N)-\theta(u^*)\le O\Big(\frac{(\ln N)^2}{N}\Big)\Big\}\ge1-\frac{1}{N^{2/3}}-\frac{1}{N}.$$
Recall that the convergence rate of the ergodic iterates of SSL-ADMM was obtained in Theorem 3. The corresponding high probability bound can also be established, namely
$$\Pr\Big\{\|\bar x_N-x^*\|+\|\bar y_N-y^*\|\le O\Big(\frac{\ln N}{\sqrt N}\Big)\Big\}\ge1-\frac{1}{N^{2/3}}-\frac{1}{N},$$
where $N$ is the iteration number. In contrast to (28) and (29), the results in Theorem 4 are much sharper.

5 Preliminary Numerical Experiments

In this section, we report some numerical results on the following graph-guided fused lasso problem from statistical machine learning:
$$\min_x\ \mathbb{E}_\xi[f_\xi(x)]+\mu\|Ax\|_1,$$
where $f_\xi(x)=\log(1+\exp(-t\,l^Tx))$ is the logistic loss function on the feature-label pair $\xi=(l,t)\in\mathbb{R}^d\times\{+1,-1\}$, $\mu$ is a given regularization parameter, and $A=[G;I]$, where $G$ is obtained by sparse inverse covariance estimation [33]. By introducing another block variable $y$ and imposing the constraint $Ax=y$, this problem is reformulated into the form of (1) with $\theta_1(x)=\mathbb{E}_\xi[f_\xi(x)]$, $\theta_2(y)=\mu\|y\|_1$, $B=-I$, and $b=0$.
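As an illustration of this reformulation (our sketch, not the authors' experiment code), the following Python snippet builds the two-block structure and the soft-thresholding proximal mapping of $\theta_2(y)=\mu\|y\|_1$; the sparsity pattern `G` from the graphical lasso and the data are assumed to be given.

```python
import numpy as np

def build_graph_guided_fused_lasso(G, d, mu):
    """Cast min_x E_xi[f_xi(x)] + mu*||A x||_1 into form (1):
       theta_1(x) = E_xi[f_xi(x)],  theta_2(y) = mu*||y||_1,
       constraint A x - y = 0, i.e. B = -I, b = 0, with A = [G; I]."""
    A = np.vstack([G, np.eye(d)])
    n = A.shape[0]
    B = -np.eye(n)
    b = np.zeros(n)

    def prox_theta2(v, stepsize):
        # prox of stepsize * mu * ||.||_1: componentwise soft-thresholding
        thr = stepsize * mu
        return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

    return A, B, b, prox_theta2
```

Combined with a logistic SFO as sketched in Section 1 and the iteration sketch in Section 3, this suffices to reproduce the general shape of the experiment; the parameter values are those reported below.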
The datasets used in the numerical experiments are taken from the LIBSVM website1 and are summarized in Table 1. In our experiments, the regularization parameter $\mu$ and the penalty parameter $\beta$ are set to $1\times10^{-5}$ and $1\times10^{-3}$, respectively; the initial points are set to be uniformly random vectors in $[-1,1]^d$; the other parameters are chosen according to the conditions in the corollaries. We plot Opt_err, the maximum of the objective function value error and the constraint violation, versus CPU time in seconds, where the approximate optimal objective function value is obtained by running some convergent ADMM-type algorithm for more than 10 minutes. Figure 1 shows the performance of SSL-ADMM and of a generalized version of the algorithm in [21] (the algorithm in [21] is indeed a stochastic linearized ADMM; we call its generalized version SLG-ADMM, and it is more efficient than the original version), which updates the dual variable only once at each iteration. The results indicate that the improvement from symmetrically updating the dual variable also occurs in the stochastic setting. Since our aim is to demonstrate that symmetrically updating the multipliers remains effective for stochastic optimization problems, we only compare SSL-ADMM with the algorithm in [21] (which was already compared there with other algorithms for stochastic optimization), and not with algorithms designed for deterministic finite-sum optimization problems, e.g., those in [29, 30].
Table 1 Real-world datasets and regularization parameters
Dataset Number of samples Dimensionality
a8a 22696 123
a9a 32561 123
ijcnn1 49990 22
w8a 49749 300
Figure 1 Opt_err (vertical axis) vs the CPU time (s) (horizontal axis) for a8a, a9a, ijcnn1, and w8a, respectively


6 Summary

In this paper, we analyze the expected convergence rates and the large-deviation properties of a stochastic variant of symmetric ADMM using the variational inequality framework, which keeps the analysis transparent. Numerical experiments on some real-world datasets demonstrate that symmetrically updating the dual variable can lead to an algorithmic improvement in the stochastic setting. When the model is deterministic and the SFO is not needed, our proposed algorithm reduces to a symmetric proximal ADMM, and the convergence region of $(r,s)$ is the same as that in the corresponding literature.

References

1
Nemirovski A, Juditsky A, Lan G, et al. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 2009, 19 (4): 1574- 1609.
2
Lan G. An optimal method for stochastic composite optimization. Mathematical Programming, 2012, 133 (1): 365- 397.
3
Ghadimi S, Lan G, Zhang H. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 2016, 155 (1): 267- 305.
4
Robbins H, Monro S. A stochastic approximation method. The Annals of Mathematical Statistics, 1951, 22 (3): 400- 407.
5
Boyd S, Parikh N, Chu E, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 2011, 3 (1): 1- 122.
6
Glowinski R, Marroco A. Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité d'une classe de problèmes de Dirichlet non linéaires. Revue Française d'Automatique, Informatique, Recherche Opérationnelle. Analyse Numérique, 1975, 9 (2): 41- 76.
7
Gabay D, Mercier B. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 1976, 2 (1): 17- 40.
8
Glowinski R. On alternating direction methods of multipliers: A historical perspective. Modeling, Simulation and Optimization for Science and Technology, Springer, Dordrecht, 2014: 59-82.
9
Eckstein J, Bertsekas D P. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 1992, 55 (1): 293- 318.
10
He B, Yuan X. On the O(1/n) convergence rate of the Douglas-Rachford alternating direction method. SIAM Journal on Numerical Analysis, 2012, 50 (2): 700- 709.
11
Monteiro R D C, Svaiter B F. Iteration-complexity of block-decomposition algorithms and the alternating direction method of multipliers. SIAM Journal on Optimization, 2013, 23 (1): 475- 507.
12
He B, Yuan X. On non-ergodic convergence rate of Douglas-Rachford alternating direction method of multipliers. Numerische Mathematik, 2015, 130 (3): 567- 577.
13
Deng W, Yin W. On the global and linear convergence of the generalized alternating direction method of multipliers. Journal of Scientific Computing, 2016, 66 (3): 889- 916.
14
Yang W H, Han D. Linear convergence of the alternating direction method of multipliers for a class of convex optimization problems. SIAM Journal on Numerical Analysis, 2016, 54 (2): 625- 640.
15
Han D, Sun D, Zhang L. Linear rate convergence of the alternating direction method of multipliers for convex composite programming. Mathematics of Operations Research, 2018, 43 (2): 622- 637.
16
Li G, Pong T K. Global convergence of splitting methods for nonconvex composite optimization. SIAM Journal on Optimization, 2015, 25 (4): 2434- 2460.
17
Wang Y, Yin W, Zeng J. Global convergence of ADMM in nonconvex nonsmooth optimization. Journal of Scientific Computing, 2019, 78 (1): 29- 63.
18
Jiang B, Lin T, Ma S, et al. Structured nonconvex and nonsmooth optimization: algorithms and iteration complexity analysis. Computational Optimization and Applications, 2019, 72 (1): 115- 157.
19
Zhang J, Luo Z Q. A proximal alternating direction method of multiplier for linearly constrained nonconvex minimization. SIAM Journal on Optimization, 2020, 30 (3): 2272- 2302.
20
Han D R. A survey on some recent developments of alternating direction method of multipliers. Journal of the Operations Research Society of China, 2022, 10 (1): 1- 52.
21
Ouyang H, He N, Tran L, et al. Stochastic alternating direction method of multipliers. International Conference on Machine Learning, 2013, 80- 88.
22
Suzuki T. Dual averaging and proximal gradient descent for online alternating direction multiplier method. International Conference on Machine Learning, 2013, 392- 400.
23
Zhao P, Yang J, Zhang T, et al. Adaptive stochastic alternating direction method of multipliers. International Conference on Machine Learning, 2015, 69- 77.
24
Gao X, Jiang B, Zhang S. On the information-adaptive variants of the ADMM: An iteration complexity perspective. Journal of Scientific Computing, 2018, 76 (1): 327- 363.
25
He B, Liu H, Wang Z, et al. A strictly contractive Peaceman-Rachford splitting method for convex programming. SIAM Journal on Optimization, 2014, 24 (3): 1011- 1040.
26
He B, Ma F, Yuan X. Convergence study on the symmetric version of ADMM with larger step sizes. SIAM Journal on Imaging Sciences, 2016, 9 (3): 1467- 1501.
27
Bai J, Li J, Xu F, et al. Generalized symmetric ADMM for separable convex optimization. Computational Optimization and Applications, 2018, 70 (1): 129- 170.
28
Lions P L, Mercier B. Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 1979, 16 (6): 964- 979.
29
Bai J, Hager W W, Zhang H. An inexact accelerated stochastic ADMM for separable convex optimization. Computational Optimization and Applications, 2022, 81 (2): 479- 518.
30
Bai J, Han D, Sun H, et al. Convergence analysis of an inexact accelerated stochastic ADMM with larger stepsizes. CSIAM Transactions on Applied Mathematics, 2022, 3 (3): 448- 479.
31
He B S. On the convergence properties of alternating direction method of multipliers. Numerical Mathematics, 2017, 39, 81- 96.
32
Lan G. First-order and stochastic optimization methods for machine learning, Springer, New York, 2020.
33
Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 2008, 9 (3): 432- 441.

Funding

National Natural Science Foundation of China(61662036)