In this blog, we revisit the convergence analysis of first-order algorithms for minimization and minimax optimization problems. Within the classical oracle model framework, we review state-of-the-art upper and lower bound results in various settings, aiming to identify gaps in existing research. Motivated by the rapid development of applications in machine learning and operations research, we further highlight recent works that revise the classical settings of optimization algorithm analysis.
In this blog, we review the complexity bounds of (stochastic) first-order methods in optimization.
Regarding the problem structure, we consider minimization problems:
\[\min_{x\in\mathcal{X}}\ f(x),\]and minimax optimization problems of the following forms:
\[\min_{x\in\mathcal{X}}\ \left[f(x)\triangleq \max_{y\in\mathcal{Y}}\ g(x,y)\right],\]where $\mathcal{X}$ and $\mathcal{Y}$ are convex sets.
Based on the stochasticity of the objective, we divide our discussion into three cases: deterministic, finite-sum, and stochastic optimization.
Finite-sum and stochastic optimization problems might appear similar, particularly when $n$ is large. Indeed, if $f_i(x) = f(x; \xi_i)$ where the $\xi_i$ are independently and identically distributed across all $i$, the finite-sum problem can be seen as an empirical counterpart of stochastic optimization. Finite-sum problems typically arise in statistics and learning as empirical risk minimization, corresponding to an offline setting, i.e., one can access a dataset of $n$ sample points. In contrast, stochastic optimization often pertains to an online setting, i.e., one queries an oracle to obtain samples from the population distribution $\mathcal{D}$. The primary distinction between methods for these two settings centers on access to the total objective function $f(x)$: such access is typically unavailable in stochastic optimization, unlike in finite-sum problems. Consequently, algorithms that rely on access to $f(x)$, such as the classical stochastic variance reduced gradient (SVRG) algorithm, are not directly applicable in the purely stochastic setting.
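To make the offline/online distinction concrete, here is a minimal SVRG sketch on a toy least-squares finite sum (an illustrative implementation of our own, with hyperparameters chosen for the toy problem, not code from any referenced paper): the full-gradient snapshot computed at each epoch is precisely the access to $f(x)$ that the online setting lacks.

```python
import numpy as np

def svrg(grad_i, x0, n, n_epochs=30, lr=0.05, seed=1):
    """Minimal SVRG sketch for f(x) = (1/n) * sum_i f_i(x).

    The full gradient at the snapshot point requires evaluating all n
    components -- the step that is unavailable in the online setting.
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(n_epochs):
        snapshot = x.copy()
        full_grad = np.mean([grad_i(snapshot, i) for i in range(n)], axis=0)
        for _ in range(n):
            i = rng.integers(n)
            # variance-reduced stochastic gradient estimate
            g = grad_i(x, i) - grad_i(snapshot, i) + full_grad
            x -= lr * g
    return x

# Toy finite sum: f_i(x) = 0.5 * (a_i^T x - b_i)^2 with b = A x_star
rng = np.random.default_rng(0)
n, d = 50, 5
A = rng.normal(size=(n, d))
x_star = rng.normal(size=d)
b = A @ x_star
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]
x_hat = svrg(grad_i, np.zeros(d), n)
```

In an online setting, only the inner loop is implementable; the `full_grad` line would require infinitely many samples.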
Based on the convexity of the objective $f(x)$, we categorize our discussions on the minimization problem into strongly convex (SC), convex (C), and nonconvex (NC) cases. For minimax problems, depending on the convexity of $g(\cdot,y)$ for a given $y$ and the concavity of $g(x,\cdot)$ for a given $x$, we review results for combinations such as strongly convex-strongly concave (SC-SC), convex-concave (C-C), nonconvex-strongly concave (NC-SC), and other variations.
This blog summarizes complexity results of state-of-the-art (SOTA) first-order optimization algorithms. There are several great works providing comprehensive reviews of optimization algorithms from different perspectives, besides many well-known textbooks and course materials like the one from Stephen Boyd.
Intuitively, an upper complexity bound states how many samples/iterations it takes an algorithm to reach a certain accuracy, such as $\epsilon$-optimality; thus upper complexity bounds are algorithm-specific. A lower complexity bound characterizes how many samples/iterations the best algorithm (within some algorithm class) needs, at minimum, to reach a certain accuracy on the worst-case function within some function class; thus lower complexity bounds are stated for a pair of an algorithm class and a function class. Since computing gradients or generating samples requires effort, optimization uses the notion of an oracle to represent these costs.
To formally characterize complexity, we use the classical oracle complexity model framework
The oracle complexity model consists of the following components:
Algorithm class $\mathcal{A}$: e.g., a common algorithm class studied in the optimization literature is the class of linear-span algorithms, which covers various gradient-based methods. The algorithm interacts with an oracle $\mathbb{O}$ to decide the next query point. A linear-span algorithm requires that the next query point lie in the linear span of all past information:
\[x^{t+1}\in\mathrm{Span}\left\{x^0,\cdots,x^t;\mathbb{O}(f,x^0),\cdots,\mathbb{O}(f,x^t)\right\}.\]The efficiency of algorithms is quantified by the oracle complexity: for an algorithm $\mathtt{A}\in\mathcal{A}(\mathbb{O})$ interacting with an oracle $\mathbb{O}$, an instance $f\in\mathcal{F}$, and the corresponding measurement $\mathcal{M}$, we define
\[T_{\epsilon}(f,\mathtt{A})\triangleq\inf\left\{T\in\mathbb{N}~|~\mathcal{M}(x^T)\leq\epsilon\right\}\]as the minimum number of oracle calls $\mathtt{A}$ makes to reach convergence. Given an algorithm $\mathtt{A}$, its upper complexity bound for solving a specific function class $\mathcal{F}$ is defined as
\[\mathrm{UB}_\epsilon(\mathcal{F};\mathtt{A}) \triangleq \underset{f\in\mathcal{F}}{\sup}\ T_{\epsilon}(f,\mathtt{A}).\]One main stream of optimization research designs algorithms with better (smaller) upper complexity bounds, i.e., decreasing $\mathrm{UB}_\epsilon(\mathcal{F};\cdot)$ for a specific class of functions. Another stream focuses on understanding the performance limit in terms of worst-case complexity, i.e., the lower complexity bound (LB) of a class of algorithms using the information from a class of oracles on a class of functions under certain settings, which can be written as:
\[\mathrm{LB}_\epsilon(\mathcal{F},\mathcal{A},\mathbb{O}) \triangleq \underset{\mathtt{A}\in{\mathcal{A}(\mathbb{O})}}{\inf}\ \underset{f\in\mathcal{F}}{\sup}\ T_{\epsilon}(f,\mathtt{A}).\]As the figure above suggests, a common goal in optimization algorithm complexity studies is to find the optimal algorithm $\mathtt{A}^\star$ in a given setting, which means its upper bound matches with the lower bound of the algorithm class for a class of functions using certain oracles, i.e.,
\[\mathrm{UB}_\epsilon(\mathcal{F};\mathtt{A}^\star)\asymp\mathrm{LB}_\epsilon(\mathcal{F},\mathcal{A},\mathbb{O}).\]In this note, we will focus on first-order algorithms in various optimization problem settings, trying to summarize the state-of-the-art (SOTA) UB and LB results to identify the gaps in existing research and discuss new trends.
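As a toy numerical illustration of the definition of $T_{\epsilon}(f,\mathtt{A})$ above (the instance, measurement, and step size are our own choices, not from the blog), one can count the gradient-oracle calls that plain gradient descent needs on a strongly convex quadratic:

```python
import numpy as np

def T_eps(step, measure, x0, eps, max_T=100_000):
    """Smallest T with M(x^T) <= eps for one concrete (f, A) pair."""
    x = x0.copy()
    for t in range(max_T + 1):
        if measure(x) <= eps:
            return t
        x = step(x)
    return None  # did not converge within max_T oracle calls

# Instance: f(x) = 0.5 * x^T diag(mu..L) x with minimum value 0.
# Algorithm: gradient descent with step size 1/L (one oracle call per step).
mu, L, d = 1.0, 100.0, 20
diag = np.linspace(mu, L, d)
f = lambda x: 0.5 * np.sum(diag * x**2)    # optimality-gap measurement
gd_step = lambda x: x - (diag * x) / L
x0 = np.ones(d)

Ts = [T_eps(gd_step, f, x0, eps) for eps in (1e-2, 1e-4, 1e-6)]
print(Ts)  # grows roughly linearly in log(1/eps)
```

Gradient descent exhibits the $\mathcal{O}(\kappa\log\frac{1}{\epsilon})$ behavior here; an optimal accelerated method would improve the $\kappa$ factor to $\sqrt{\kappa}$, matching the lower bound in the table below.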
Throughout the blog, we focus on first-order optimization. There are also many works on zeroth-order optimization
Some other popular algorithms like proximal algorithms
where the proximal operator requires solving a subproblem exactly. Access to such a subproblem solver can be regarded as a different kind of oracle in algorithm design. Similarly, algorithms like the alternating direction method of multipliers (ADMM)
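As a concrete example of such a subproblem, the proximal operator of the $\ell_1$ norm admits a closed-form solution (soft-thresholding), so the subproblem oracle happens to be cheap in this case; a minimal sketch:

```python
import numpy as np

def prox_l1(v, lam):
    """prox_{lam * ||.||_1}(v) = argmin_x lam*||x||_1 + 0.5*||x - v||^2,
    solved coordinate-wise in closed form by soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

print(prox_l1(np.array([3.0, -0.5, 1.2]), 1.0))
```

For operators without closed forms, each prox call hides an inner optimization whose cost the outer iteration count does not reflect, which is why comparisons across oracle types require care.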
We also do not cover methods like the conditional gradient method (or Frank–Wolfe algorithm)
To analyze the convergence of optimization algorithms, the literature often requires some other regularity conditions like Lipschitz smoothness, Lipschitz continuity, unbiased gradient estimator, and bounded variance. Interested readers may refer to these nice handbooks
For convenience, we summarize some of the notations commonly used in the tables below.
Duality Gap (for minimax optimization): the primal-dual gap of a given point $(\hat{x},\hat{y})$, defined as $G_f(\hat{x},\hat{y})\triangleq\max_y f(\hat{x},y)-\min_x f(x,\hat{y})$.
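For intuition, the duality gap of the toy bilinear game $f(x,y)=xy$ on $[-1,1]^2$ (an illustrative example of our own) can be approximated by a grid search over the domain:

```python
import numpy as np

def duality_gap(f, x_hat, y_hat, grid):
    """G_f(x_hat, y_hat) = max_y f(x_hat, y) - min_x f(x, y_hat),
    with the max/min approximated over a finite grid of the domain."""
    return max(f(x_hat, y) for y in grid) - min(f(x, y_hat) for x in grid)

f = lambda x, y: x * y                     # bilinear game on [-1, 1]^2
grid = np.linspace(-1.0, 1.0, 201)
print(duality_gap(f, 0.0, 0.0, grid))      # saddle point: gap is 0
print(duality_gap(f, 0.5, -0.5, grid))     # non-optimal point: positive gap
```

The gap vanishes exactly at the saddle point $(0,0)$ and is positive elsewhere, which is why it serves as a convergence measure for minimax algorithms.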
As mentioned above, we categorize the discussion based on the problem type, stochasticity, and function structure. For convenience, we divide the presentation into the following cases:
Minimization Problems:
Minimax Problems: based on the convexity-concavity combination of the two components, we consider the following cases:
We present the lower and upper bound results in the tables below.
Problem Type | Measure | Lower Bound | Upper Bound | Reference (LB) | Reference (UB) |
---|---|---|---|---|---|
$L$-Smooth Convex | Optimality gap | $\Omega \left( \sqrt{L \epsilon^{-1}} \right)$ | $\checkmark$ | | |
$L$-Smooth $\mu$-SC | Optimality gap | $\Omega \left( \sqrt{\kappa} \log \frac{1}{\epsilon} \right)$ | $\checkmark$ | | |
NS $G$-Lip Cont. Convex | Optimality gap | $\Omega (G^2 \epsilon^{-2})$ | $\checkmark$ | | |
NS $G$-Lip Cont. $\mu$-SC | Optimality gap | $\Omega (G^2 (\mu \epsilon)^{-1})$ | $\checkmark$ | | |
$L$-Smooth Convex (function case) | Stationarity | $\Omega \left( \sqrt{\Delta L }\epsilon^{-1} \right)$ | $\checkmark$ (within logarithmic) | | |
$L$-Smooth Convex (domain case) | Stationarity | $\Omega \left( \sqrt{D L}\epsilon^{-\frac{1}{2}} \right)$ | $\checkmark$ | | |
$L$-Smooth NC | Stationarity | $\Omega (\Delta L \epsilon^{-2})$ | $\checkmark$ | | |
NS $G$-Lip Cont. $\rho$-WC | Near-stationarity | Unknown | $\mathcal{O}(\epsilon^{-4})$ | / | |
$L$-Smooth $\mu$-PL | Optimality gap | $\Omega \left( \kappa \log \frac{1}{\epsilon} \right)$ | $\checkmark$ | | |
Remark:
Problem Type | Measure | Lower Bound | Upper Bound | Reference (LB) | Reference (UB) |
---|---|---|---|---|---|
FS $L$-IS $\mu$-SC | Point Distance | $\Omega \left( (n + \sqrt{\kappa n}) \log \frac{1}{\epsilon} \right)$ | $\checkmark$ | | |
FS $L$-AS $\mu$-SC | Optimality gap | $\Omega \left( n + n^{\frac{3}{4}} \sqrt{\kappa} \log \frac{\Delta}{\epsilon} \right)$ | $\checkmark$ | | |
FS $L$-IS C | Optimality gap | $\Omega \left( n + D \sqrt{n L \epsilon^{-1}} \right)$ | $\checkmark$ (within logarithmic) | | |
FS $L$-AS C | Optimality gap | $\Omega \left( n + n^{\frac{3}{4}} D \sqrt{L \epsilon^{-1}} \right)$ | $\checkmark$ | | |
FS $L$-IS NC | Stationarity | $\Omega \left( \Delta L \epsilon^{-2} \right)$ | $\mathcal{O} \left( \sqrt{n}\Delta L \epsilon^{-2} \right)$ | | |
FS $L$-AS NC | Stationarity | $\Omega \left( \sqrt{n} \Delta L \epsilon^{-2} \right)$ | $\checkmark$ | | |
Stoc $L$-S $\mu$-SC | Optimality gap | $\Omega (\sqrt{\kappa}\log\frac{1}{\epsilon}+\frac{\sigma^2}{\mu\epsilon})$ | $\mathcal{O} (\sqrt{\frac{L}{\epsilon}}+\frac{\sigma^2}{\mu\epsilon})$ | | |
Stoc $L$-S C | Optimality gap | $\Omega (\sqrt{\frac{L}{\epsilon}}+\frac{\sigma^2}{\epsilon^2})$ | $\checkmark$ | | |
Stoc NS $\mu$-SC | Optimality gap | $\Omega (\epsilon^{-1})$ | $\checkmark$ | | |
Stoc NS C | Optimality gap | $\Omega (\epsilon^{-2})$ | $\checkmark$ | | |
Stoc $L$-S $\mu$-SC | Stationarity | $\Omega \left(\sqrt{\frac{L}{\epsilon}}+\frac{\sigma^2}{\epsilon^2}\right)$ | $\checkmark$ (within logarithmic) | | |
Stoc $L$-S C | Stationarity | $\Omega \left( \frac{\sqrt{L}}{\epsilon} + \frac{\sigma^2}{\epsilon^2} \log \frac{1}{\epsilon} \right)$ | $\checkmark$ (within logarithmic) | | |
Stoc $L$-SS NC | Stationarity | $\Omega \left( \Delta \sigma \epsilon^{-4} \right)$ | $\checkmark$ | | |
Stoc $L$-AS NC | Stationarity | $\Omega \left( \Delta L \sigma \epsilon^{-3} + \sigma^2 \epsilon^{-2} \right)$ | $\checkmark$ | | |
Stoc NS $L$-Lip $\rho$-WC | Near-stationarity | Unknown | $\mathcal{O} (\epsilon^{-4})$ | / | |
Remark:
Problem Type | Measure | Lower Bound | Upper Bound | Reference (LB) | Reference (UB) |
---|---|---|---|---|---|
SC-SC, bilinear | Duality Gap | $\Omega(\sqrt{\kappa_x \kappa_y} \log \frac{1}{\epsilon})$ | $\checkmark$ | | |
SC-SC, general | Duality Gap | $\Omega(\sqrt{\kappa_x \kappa_y} \log \frac{1}{\epsilon})$ | $\checkmark$ (within logarithmic) | | |
SC-C, bilinear, NS | Duality Gap | $\Omega(\sqrt{\kappa_x} / \epsilon)$ | $\mathcal{O}(\kappa_x^2 / \sqrt{\epsilon})$ | | |
SC-C, general | Duality Gap | $\Omega(D \sqrt{L \kappa_x} / \epsilon)$ | $\tilde{\mathcal{O}}(D \sqrt{L \kappa_x} / \epsilon)$ | | |
C-SC, bilinear | Duality Gap | Unknown | $\mathcal{O}(\log \frac{1}{\epsilon})$ | / | |
C-C, bilinear, NS | Duality Gap | $\Omega(L / \epsilon)$ | $\checkmark$ | | |
C-C, general | Duality Gap | $\Omega(L D^2 / \epsilon)$ | $\checkmark$ | | |
C-C, general | Stationarity | $\Omega(L D / \epsilon)$ | $\checkmark$ | | |
PL-PL | Duality Gap | Unknown | $\mathcal{O}(\kappa^3\log \frac{1}{\epsilon})$ | / | |
Remark:
Problem Type | Measure | LB | UB | Reference (LB) | Reference (UB) |
---|---|---|---|---|---|
SC-SC, FS | Duality Gap | $\Omega\left((n + \kappa) \log \frac{1}{\epsilon}\right)$ | $\checkmark$ | | |
SC-SC, Stoc, SS | Duality Gap | $\Omega(\epsilon^{-1})$ | $\checkmark$ | / | |
SC-SC, Stoc, NS | Duality Gap | $\Omega(\epsilon^{-1})$ | $\checkmark$ | / | |
SC-SC, Stoc | Stationarity | $\Omega(\sigma^2\epsilon^{-2}+\kappa)$ | $\checkmark$ | | |
SC-C, FS | Duality Gap | $\Omega\left(n + \sqrt{n L / \epsilon}\right)$ | $\tilde{\mathcal{O}}(\sqrt{n L / \epsilon})$ | / | |
C-C, FS | Duality Gap | $\Omega(n + L / \epsilon)$ | $\mathcal{O}(\sqrt{n}/\epsilon)$ | | |
C-C, Stoc, SS | Duality Gap | $\Omega(\epsilon^{-2})$ | $\checkmark$ | / | |
C-C, Stoc, NS | Duality Gap | $\Omega(\epsilon^{-2})$ | $\checkmark$ | / | |
C-C, Stoc | Stationarity | $\Omega(\sigma^2\epsilon^{-2}+LD\epsilon^{-1})$ | $\checkmark$ | | |
PL-PL, Stoc | Duality Gap | Unknown | $\mathcal{O}(\kappa^5\epsilon^{-1})$ | / | |
Remark:
Type | Measure | LB | UB | Reference (LB) | Reference (UB) |
---|---|---|---|---|---|
NC-SC, Deter | Primal Stationarity | $\Omega(\sqrt{\kappa}\Delta \mathcal{L} \epsilon^{-2})$ | $\checkmark$ | | |
NC-C, Deter | Near-Stationarity | Unknown | $\mathcal{O}(\Delta L^2 \epsilon^{-3} \log^2 \frac{1}{\epsilon})$ | / | |
WC-C, Deter | Near-Stationarity | Unknown | $\mathcal{O}(\epsilon^{-6})$ | / | |
NC-PL, Deter | Primal Stationarity | Unknown | $\mathcal{O}(\kappa\epsilon^{-2})$ | / | |
Remark:
Type | Measure | LB | UB | Reference (LB) | Reference (UB) |
---|---|---|---|---|---|
NC-SC, FS, AS | Primal Stationarity | $\Omega\left(n+\sqrt{n\kappa}\Delta L\epsilon^{-2}\right)$ | $\mathcal{O}\left((n+n^{3/4}\sqrt{\kappa})\Delta L\epsilon^{-2}\right)$ | | |
NC-C, FS, IS | Near-stationarity | Unknown | $\mathcal{O}\left(n^{3/4}L^2D\Delta\epsilon^{-3}\right)$ | / | |
NC-SC, Stoc, SS | Primal Stationarity | $\Omega\left(\kappa^{1/3}\Delta L\epsilon^{-4}\right)$ | $\mathcal{O}\left(\kappa\Delta L\epsilon^{-4}\right)$ | | |
NC-SC, Stoc, IS | Primal Stationarity | Unknown | $\mathcal{O}\left(\kappa^2\Delta L\epsilon^{-3}\right)$ | / | |
NC-C, Stoc, SS | Near-stationarity | Unknown | $\mathcal{O}\left(L^3\epsilon^{-6}\right)$ | / | |
NC-PL, Stoc | Primal Stationarity | Unknown | $\mathcal{O}(\kappa^2\epsilon^{-4})$ | / | |
WC-SC, Stoc, NS | Near-Stationarity | Unknown | $\mathcal{O}(\epsilon^{-4})$ | / | |
Remark:
The section above summarizes the upper and lower bounds of the oracle complexity for finding an $\epsilon$-optimal solution or $\epsilon$-stationary point for minimization and minimax problems. Clearly, this is not the end of the story. There are more and more optimization problems arising from various applications like machine learning and operations research
In the aforementioned discussion, we only considered minimization and minimax problems. There are also many other important optimization problems with different structures, for example:
Bilevel Optimization
Bilevel optimization covers minimax optimization as a special case. Over the past seven years, bilevel optimization has become increasingly popular due to its applications in machine learning. Common approaches for solving bilevel optimization problems include:
Starting from
Subsequent developments have focused on fully first-order methods for solving bilevel optimization
Compositional Stochastic Optimization
Conditional Stochastic Optimization
Conditional stochastic optimization differs from compositional stochastic optimization mainly in the conditional expectation inside $f$. This requires one to sample from the conditional distribution of $\eta$ given $\xi$, while compositional optimization only needs samples from the marginal distribution of $\eta$. It appears widely in machine learning and causality when the randomness admits a two-level hierarchical structure.
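Conditional stochastic optimization is commonly written as $\min_x \mathbb{E}_{\xi}\,f_\xi\big(\mathbb{E}_{\eta\mid\xi}\,g_\eta(x,\xi)\big)$ (our notation); when $f_\xi$ is nonlinear, a plug-in estimator built from a few inner samples is biased. A toy sketch (model of our own choosing) quantifies this:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy instance (our own): F(x) = E_xi[ f( E_{eta|xi}[x + eta] ) ] with
# f(u) = u^2 and eta | xi ~ N(0, 1), so the inner mean is x and F(x) = x^2.
# Replacing the inner conditional expectation by an m-sample average
# biases the plug-in value by Var/m = 1/m, because f is nonlinear.
def plugin_estimate(x, m, n_outer=50_000):
    eta = rng.normal(size=(n_outer, m))
    inner = x + eta.mean(axis=1)   # noisy estimate of the inner expectation
    return np.mean(inner**2)       # outer nonlinearity f(u) = u^2

for m in (1, 10, 100):
    print(m, plugin_estimate(1.0, m) - 1.0)  # bias ~ 1/m
```

This inner-sample bias is the basic reason why the sample complexity of conditional stochastic optimization differs from that of classical stochastic optimization.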
Performative Prediction (or Decision-Dependent Stochastic Optimization)
Note that this problem diverges from classical stochastic optimization because the distribution of $\xi$ depends on the decision variable $x$. Such dependency often disrupts the convexity of $F$, even if $f$ is convex with respect to $x$. In practical scenarios where the randomness can be decoupled from the decision variable, as in $\xi = g(x) + \eta$, the problem can be simplified to a classical stochastic optimization framework. This presents a trade-off: One can either impose additional modeling assumptions to revert to a classical approach or tackle the computational complexities inherent in such performative prediction problems. Practically, it is advisable to explore the specific structure of the problem to determine if it can be restructured into classical stochastic optimization.
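A toy sketch of this setting (the location-family model below is our illustrative assumption): with $f(x;\xi) = (x-\xi)^2$ and $\xi \sim \mathcal{N}(ax, 1)$, i.e., $\xi = ax + \eta$, repeatedly minimizing the risk under the current distribution converges to the performatively stable point $x = 0$ whenever $|a| < 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
a = 0.5  # strength of the decision-dependence: xi ~ N(a * x, 1)

def repeated_risk_minimization(x0, n_rounds=30, n_samples=50_000):
    """Each round: sample xi from the distribution induced by the current
    decision, then minimize the empirical risk of E[(x - xi)^2], whose
    minimizer is simply the sample mean."""
    x = x0
    for _ in range(n_rounds):
        xi = a * x + rng.normal(size=n_samples)
        x = xi.mean()
    return x

print(repeated_risk_minimization(10.0))  # converges near 0
```

Each round is a classical stochastic optimization problem under a frozen distribution; the decision-dependence only enters between rounds, which is what makes this retraining heuristic tractable.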
Contextual Stochastic Optimization
Contextual stochastic optimization aims to leverage side information $Z$ to facilitate decision-making. The goal is to find a policy $\pi$ that maps a context $z$ to a decision $x$. Thus the performance measure is
\[\mathbb{E}_z(F(\pi(z);z) - F(\pi^*(z);z))\]or
\[\mathbb{E}_z\|\pi(z) - \pi^*(z)\|^2.\]The challenges in solving such problems come from the fact that usually the available samples are only $(z,\xi)$ pairs, i.e., one does not have access to multiple samples of $\xi$ from the conditional distribution. As a result, one usually needs to first estimate $F(x;z)$ via nonparametric statistics techniques like $k$-nearest neighbors and kernel regression, or via reparametrization tricks followed by a regression. Both could suffer from the curse of dimensionality when the dimension of $z$ is large.
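A minimal sketch of the $k$-nearest-neighbor plug-in policy under a toy model of our own choosing ($\xi \mid z \sim \mathcal{N}(2z, 1)$ and $F(x;z) = \mathbb{E}[(x-\xi)^2 \mid z]$, so $\pi^*(z) = \mathbb{E}[\xi \mid z] = 2z$), using only one $\xi$ per observed context:

```python
import numpy as np

rng = np.random.default_rng(0)

# Offline data: (z, xi) pairs with only ONE xi per context z.
n = 5000
z_data = rng.uniform(-1.0, 1.0, size=n)
xi_data = 2.0 * z_data + rng.normal(size=n)

def knn_policy(z, k=100):
    """Estimate pi*(z) = E[xi | z] by averaging xi over the k contexts
    nearest to z -- a nonparametric plug-in policy."""
    idx = np.argsort(np.abs(z_data - z))[:k]
    return xi_data[idx].mean()

print(knn_policy(0.5))   # approximately 2 * 0.5 = 1.0
```

In higher-dimensional contexts, the neighborhoods needed to gather $k$ samples grow quickly, which is exactly the curse of dimensionality mentioned above.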
Distributionally Robust Optimization
where $U_r(Q)$ refers to an uncertainty set containing a family of distributions around a nominal distribution $Q$ measured by some distance between distributions with radius $r$.
Since most deep learning problems are nonconvex, a vast amount of literature focuses on finding a (generalized) stationary point of the original optimization problem, but practice often shows that one can efficiently find global optima for various structured nonconvex problems. In fact, there is a line of research tailored to the global landscape of structured nonconvex optimization. For example, in neural network training, the interpolation condition holds for some overparameterized neural networks
Regarding such a mismatch between theory and practice, one reason may be the coarse assumptions the community applies in theoretical analysis, which cannot effectively characterize the landscape of practical objectives. Here we briefly summarize a few structures arising in recent works that try to bridge the gap between practice and theory:
Hidden convexity means that the original nonconvex optimization problem admits a convex reformulation via a change of variables. It appears in operations research
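A classical illustration in the spirit of geometric programming (our example, not from the blog): a posynomial objective is nonconvex in $x > 0$, but the change of variables $x_i = e^{u_i}$ makes it convex:

```latex
\min_{x > 0}\ x_1 x_2^{-1} + x_2 x_3
\quad\xrightarrow{\ x_i = e^{u_i}\ }\quad
\min_{u \in \mathbb{R}^3}\ e^{u_1 - u_2} + e^{u_2 + u_3},
```

where the reformulated objective is a sum of exponentials of affine functions of $u$, hence convex; posynomial constraints transform the same way.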
Another stream considers Polyak-Łojasiewicz (PL) or Kurdyka-Łojasiewicz (KL) type of conditions, or other gradient dominance conditions
With numerical experiments disclosing structures in objective functions, some works proposed new assumptions that drive the algorithm design and corresponding theoretical analysis, which in turn reveals acceleration in empirical findings. For example,
Another noteworthy work is
For lower bounds, we adopt the so-called optimization-based lower bounds proved via zero-chain arguments
Note that the dimension $d_\epsilon$ depends on the accuracy; in particular, $d_\epsilon$ increases as $\epsilon$ decreases. For a function class with a fixed dimension $d$ that is independent of the accuracy, the dependence on $\epsilon$ becomes loose, especially when $d$ is small. In other words, for a $d$-dimensional function class and a given accuracy $\epsilon>0$, the upper bounds on the complexity of first-order methods given in the tables still hold, yet the lower bounds become loose, which can lead to a mismatch between upper and lower bounds. This leads to a fundamental question:
How to prove lower bounds of first-order methods for any given $d$-dimensional function classes?
Such a question has been partially addressed for stochastic optimization using information theoretic-based lower bounds
The oracle complexity model mainly focuses on worst-case instances in the function class, which may be far from practical instances. The derived complexities can thus be so conservative and vacuous that they do not match practice well, as the figure below illustrates.
Recently, there appeared some works that go beyond the classical oracle model in evaluating the optimization algorithm efficiency, for example:
Another recent series of works
Update: Regarding the open problem above, after we submitted the draft for review, there appeared a new work on arXiv
In this post, we reviewed SOTA complexity upper and lower bounds of first-order algorithms for minimization and minimax problems in various settings. The summary identifies gaps in existing research, which shed light on open questions regarding accelerated algorithm design and the investigation of performance limits. Under the oracle framework, one should be careful when claiming that one algorithm is better than another, and double-check whether the comparison is fair in terms of settings such as the function class, oracle information, and algorithm class definition.
Regarding the rapid development and interdisciplinary applications in areas like machine learning and operations research, we revisited several recent works that go beyond the classical research flow in the optimization community. These works advocate a paradigm shift: besides pursuing an elegant and unified theory that covers all cases, we should sometimes avoid "Maslow's hammer", focus on concrete applications first, identify their unique structure, and correspondingly design algorithms tailored to these problems, which in turn will benefit practice. Such instance-driven patterns may help the optimization community devise a theory that fits practice better.
While we have aimed to provide a thorough and balanced summary of existing complexity results for first-order methods, we acknowledge the possibility of overlooking certain relevant works, subtle technical conditions, or potential inaccuracies in interpreting the literature. Readers who identify those issues are warmly encouraged to send emails to bring them to our attention. Constructive feedback, corrections, and suggestions are highly appreciated.
We thank the insightful suggestions from two anonymous reviewers and Prof. Benjamin Grimmer.