Avoid Overclaims - Summary of Complexity Bounds for Algorithms in Minimization and Minimax Optimization

In this blog, we revisit the convergence analysis of first-order algorithms for minimization and minimax optimization problems. Within the classical oracle model framework, we review state-of-the-art upper and lower bound results in various settings, aiming to identify gaps in existing research. Motivated by the rapid development of applications like machine learning and operations research, we further highlight some recent works that revisit the classical settings of optimization algorithm research.

Introduction

In this blog, we review the complexity bounds of (stochastic) first-order methods in optimization.

Regarding the problem structure, we consider minimization problems:

\[\min_{x\in\mathcal{X}}\ f(x),\]

and minimax optimization problems of the following forms:

\[\min_{x\in\mathcal{X}}\ \left[f(x)\triangleq \max_{y\in\mathcal{Y}}\ g(x,y)\right],\]

where $\mathcal{X}$ and $\mathcal{Y}$ are convex sets.

Based on the stochasticity, we divide our discussions into three cases:

\[\text{(deterministic)}\qquad\min_{x\in\mathcal{X}}\ f(x).\]
\[\text{(finite-sum)}\qquad\min_{x\in\mathcal{X}}\ f(x)\triangleq\frac{1}{n}\sum_{i=1}^n f_i(x).\]
\[\text{(stochastic)}\qquad\min_{x\in\mathcal{X}}\ f(x)\triangleq\mathbb{E}_{\xi\sim\mathcal{D}}[f(x;\xi)].\]

Finite-sum and stochastic optimization problems might appear similar, particularly when $n$ is large. Indeed, if $f_i(x) = f(x; \xi_i)$ where the $\xi_i$ are independently and identically distributed, the finite-sum problem can be seen as an empirical counterpart of the stochastic one. In such scenarios, finite-sum problems typically arise in statistics and learning as empirical risk minimization, corresponding to an offline setting, i.e., one can access a dataset of $n$ sample points. In contrast, stochastic optimization often pertains to an online setting, i.e., one can query an oracle to obtain samples from the population distribution $\mathcal{D}$. The primary distinction between methods for these two problems centers on access to the total objective function $f(x)$: such access is typically unavailable in stochastic optimization, unlike in finite-sum problems. Consequently, algorithms that rely on access to $f(x)$, such as the classical stochastic variance reduced gradient (SVRG) algorithm, cannot be directly applied in purely stochastic settings.
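To make the distinction concrete, here is a minimal sketch (our own toy one-dimensional quadratic finite sum; all names and constants are illustrative) of SVRG's snapshot step, whose full-gradient computation requires access to every component $f_i$ and is therefore unavailable under a purely stochastic oracle:

```python
import random

# Toy finite-sum objective: f(x) = (1/n) * sum_i 0.5 * (x - a_i)^2,
# whose minimizer is the mean of the a_i.
random.seed(0)
a = [random.gauss(0.0, 1.0) for _ in range(100)]
n = len(a)

def grad_i(x, i):           # component gradient: one IFO call
    return x - a[i]

def full_grad(x):           # needs ALL components: offline/finite-sum only
    return sum(grad_i(x, i) for i in range(n)) / n

def svrg(x, epochs=20, m=50, lr=0.1):
    for _ in range(epochs):
        mu = full_grad(x)   # snapshot full gradient: impossible with a
        snap = x            # purely stochastic (online) oracle
        for _ in range(m):
            i = random.randrange(n)
            # variance-reduced gradient estimate
            g = grad_i(x, i) - grad_i(snap, i) + mu
            x -= lr * g
    return x

x_star = sum(a) / n
assert abs(svrg(0.0) - x_star) < 1e-3
```

The inner update only needs single-component (IFO) queries, but the snapshot line is a full pass over the data, which only the offline setting permits.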

Based on the convexity of the objective $f(x)$, we categorize our discussions on the minimization problem into strongly convex (SC), convex (C), and nonconvex (NC) cases. For minimax problems, depending on the convexity of $g(\cdot,y)$ for a given $y$ and the concavity of $g(x,\cdot)$ for a given $x$, we review results for combinations such as strongly convex-strongly concave (SC-SC), convex-concave (C-C), nonconvex-strongly concave (NC-SC), and other variations.

Literature

This blog summarizes complexity results of state-of-the-art (SOTA) first-order optimization algorithms. There are several great works that comprehensively review optimization algorithms from different perspectives. Besides many well-known textbooks and course materials like the one from Stephen Boyd, perhaps the most influential is the blog post by Sebastian Ruder, which has received more than 10k citations according to Google Scholar. That post reviews the algorithmic design of gradient descent (GD), stochastic gradient descent (SGD), and their variants, especially those widely used in the machine learning community such as AdaGrad and Adam. Several monographs have reviewed optimization algorithms in various settings; the page by Ju Sun is a popular repository tracking research efforts on achieving global optimality in nonconvex optimization (latest update: Dec 11, 2021). The review by Ruoyu Sun further specializes the survey of optimization algorithms to the context of deep learning. A recent survey by Danilova et al. revisits algorithm design and complexity analysis for nonconvex optimization; to some extent, it is the closest to our blog post. Our blog aims to serve as an easily accessible tool for optimizers to check SOTA theoretical convergence rates from both the upper and lower bound perspectives.


Framework: Oracle Complexity Model

Intuitively, an upper complexity bound states how many samples/iterations an algorithm takes to reach a certain accuracy, such as $\epsilon$-optimality; thus, upper complexity bounds are algorithm-specific. A lower complexity bound characterizes how many samples/iterations the best algorithm (within some algorithm class) needs, at minimum, to reach a certain accuracy on the worst-case function within some function class; thus, lower complexity bounds are stated for a pair of algorithm class and function class. Since computing gradients or generating samples requires effort, we use an oracle to abstract these costs in optimization.

To formally characterize complexity, we use the classical oracle complexity model framework. Feel free to jump directly to the summary tables; the formalism below is mainly needed to state lower bounds properly.

Framework Setup

The oracle complexity model consists of the following components:

Oracle Complexity Framework (adapted from Prof. Yangyang Xu's Slides)

The efficiency of algorithms is quantified by the oracle complexity: for an algorithm $\mathtt{A}\in\mathcal{A}(\mathbb{O})$ interacting with an oracle $\mathbb{O}$, an instance $f\in\mathcal{F}$, and the corresponding measurement $\mathcal{M}$, we define

\[T_{\epsilon}(f,\mathtt{A})\triangleq\inf\left\{T\in\mathbb{N}~|~\mathcal{M}(x^T)\leq\epsilon\right\}\]

as the minimum number of oracle calls $\mathtt{A}$ makes to reach convergence. Given an algorithm $\mathtt{A}$, its upper complexity bound for a specific function class $\mathcal{F}$ is defined as

\[\mathrm{UB}_\epsilon(\mathcal{F};\mathtt{A}) \triangleq \underset{f\in\mathcal{F}}{\sup}\ T_{\epsilon}(f,\mathtt{A}).\]

One mainstream of optimization research designs algorithms with better (smaller) upper complexity bounds, i.e., decreasing $\mathrm{UB}_\epsilon(\mathcal{F};\cdot)$ for a specific class of functions. Another stream focuses on understanding the performance limit in terms of worst-case complexity, i.e., the lower complexity bound (LB) of a class of algorithms using the information from a class of oracles on a class of functions under certain settings:

\[\mathrm{LB}_\epsilon(\mathcal{F},\mathcal{A},\mathbb{O}) \triangleq \underset{\mathtt{A}\in{\mathcal{A}(\mathbb{O})}}{\inf}\ \underset{f\in\mathcal{F}}{\sup}\ T_{\epsilon}(f,\mathtt{A}).\]
Illustration of Upper and Lower Complexity Bounds

As the figure above suggests, a common goal in optimization algorithm complexity studies is to find the optimal algorithm $\mathtt{A}^\star$ in a given setting, meaning its upper bound matches the lower bound of the algorithm class for a class of functions using certain oracles, i.e.,

\[\mathrm{UB}_\epsilon(\mathcal{F};\mathtt{A}^\star)\asymp\mathrm{LB}_\epsilon(\mathcal{F},\mathcal{A},\mathbb{O}).\]
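To make these definitions concrete, one can count oracle calls explicitly. The sketch below (a toy instance of our own; `CountingOracle`, the step size, and the accuracy are illustrative choices) measures $T_{\epsilon}(f,\mathtt{A})$ for plain gradient descent on a smooth strongly convex quadratic:

```python
class CountingOracle:
    """First-order oracle that records the number of gradient queries."""
    def __init__(self, grad):
        self.grad, self.calls = grad, 0
    def __call__(self, x):
        self.calls += 1
        return self.grad(x)

def T_eps(f, oracle, x0, lr, eps, f_star=0.0, max_calls=10**6):
    # Smallest number of oracle calls after which the measurement
    # M(x^T) = f(x^T) - f* drops below eps (algorithm: gradient descent).
    x = x0
    while f(x) - f_star > eps:
        if oracle.calls >= max_calls:
            return None          # accuracy not reached within budget
        x = x - lr * oracle(x)
    return oracle.calls

# f(x) = x^2 is L-smooth and mu-strongly convex with L = mu = 2.
f = lambda x: x * x
oracle = CountingOracle(lambda x: 2.0 * x)
assert T_eps(f, oracle, x0=1.0, lr=0.5, eps=1e-6) == 1  # lr = 1/L jumps to 0
```

An upper bound is then the supremum of this count over the function class, while a lower bound additionally takes the infimum over all admissible algorithms.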

In this note, we will focus on first-order algorithms in various optimization problem settings, trying to summarize the state-of-the-art (SOTA) UB and LB results to identify the gaps in existing research and discuss new trends.

What We Do Not Cover

Throughout the blog, we focus on first-order optimization. There are also many works on zeroth-order and higher-order optimization. The key difference lies in the oracle information. For example, second-order methods (e.g., Newton’s method) have access to Hessian information. With such finer information, second-order methods generally attain better complexities than first-order methods, which has also been characterized in theory. Of course, obtaining higher-order information is much more costly, and thus the per-iteration computational cost is usually higher.

Some other popular algorithms, like proximal algorithms, are not discussed. One prominent example is the proximal point algorithm (PPA), based on the proximal operator:

\[x^{t+1}=\text{prox}_{\lambda f}(x^t)\triangleq\underset{x}{\arg\min}\left\{f(x)+\frac{1}{2\lambda}\|x-x^t\|^2\right\},\]

where the proximal operator requires solving a subproblem exactly. Solving such a subproblem can be regarded as a new kind of oracle in algorithm design. Similarly, algorithms like the alternating direction method of multipliers (ADMM), which also involve subproblems, are not discussed.
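As a small illustration (the function and step size are our choices), $f(x)=|x|$ admits a closed-form proximal operator, the soft-thresholding map, so the PPA iteration can be simulated directly:

```python
def prox_abs(v, lam):
    # Closed-form proximal operator of f(x) = |x|:
    # prox_{lam*f}(v) = argmin_x { |x| + (1/(2*lam)) * (x - v)^2 },
    # i.e. soft-thresholding.
    if v > lam:
        return v - lam
    if v < -lam:
        return v + lam
    return 0.0

def ppa(x, lam=0.3, steps=10):
    # Proximal point algorithm: each iteration solves a subproblem exactly,
    # which the oracle model would count as one (stronger) oracle call.
    for _ in range(steps):
        x = prox_abs(x, lam)
    return x

assert ppa(1.0) == 0.0   # reaches the minimizer of |x| in a few steps
```

For a general $f$ without a closed-form prox, each step hides an inner optimization problem, which is why PPA-type methods fall outside the plain first-order oracle model considered here.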

We also do not cover methods like the conditional gradient method (i.e., the Frank–Wolfe algorithm), which instead requires a linear minimization oracle (LMO) in the algorithm design, allowing it to avoid potentially expensive projection steps.
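A minimal sketch of the LMO idea (our own toy instance: a quadratic over the probability simplex, whose LMO is simply a coordinate-wise minimum; the step-size schedule is the classical one):

```python
def lmo_simplex(g):
    # Linear minimization oracle over the probability simplex:
    # argmin_{s in simplex} <g, s> is the vertex at the smallest entry of g.
    i = min(range(len(g)), key=lambda j: g[j])
    return [1.0 if j == i else 0.0 for j in range(len(g))]

def frank_wolfe(b, steps=2000):
    # Minimize f(x) = 0.5 * ||x - b||^2 over the simplex, projection-free.
    x = [1.0, 0.0, 0.0]                     # start at a vertex
    for t in range(steps):
        grad = [xi - bi for xi, bi in zip(x, b)]
        s = lmo_simplex(grad)               # one LMO call per iteration
        gamma = 2.0 / (t + 2.0)             # classical step-size schedule
        x = [(1 - gamma) * xi + gamma * si for xi, si in zip(x, s)]
    return x

b = [0.6, 0.3, 0.1]                         # the optimum lies in the simplex
x = frank_wolfe(b)
assert max(abs(xi - bi) for xi, bi in zip(x, b)) < 0.1
```

Every iterate is a convex combination of simplex vertices, so feasibility is maintained without any projection, at the price of the slower $\mathcal{O}(1/t)$ rate.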


Notations

To analyze the convergence of optimization algorithms, the literature often requires additional regularity conditions like Lipschitz smoothness, Lipschitz continuity, unbiased gradient estimators, and bounded variance. Interested readers may refer to standard handbooks for detailed definitions.

For convenience, we summarize some of the notations commonly used in the tables below.


Summary of Results

As mentioned above, we categorize the discussion based on the problem, stochasticity, and function structure. For convenience, we divide the presentation into the following cases:

Minimization Problems:

  1. Deterministic optimization
  2. Finite-sum and stochastic optimization

Minimax Problems: based on the convexity combination of the two components, we consider the following cases:

  1. SC-SC/SC-C/C-C deterministic minimax optimization
  2. SC-SC/SC-C/C-C finite-sum and stochastic minimax optimization
  3. NC-SC/NC-C deterministic minimax optimization
  4. NC-SC/NC-C finite-sum and stochastic minimax optimization

We present the lower and upper bound results in the tables below. Given the extensive body of literature on this topic, it is possible that some relevant references have been inadvertently overlooked; we welcome any comments and discussions.

Case 1-1: Deterministic Minimization

Problem Type Measure Lower Bound Upper Bound Reference (LB) Reference (UB) (Note: we may not always cite the work that first proposed a result; rather, we may select a reference with a clearer presentation. Readers are encouraged to check the references therein for the original works.)
$L$-Smooth Convex Optimality gap $\Omega \left( \sqrt{L \epsilon^{-1}} \right)$ $\checkmark$ , Theorem 2.1.7 , Theorem 2.2.2
$L$-Smooth $\mu$-SC Optimality gap $\Omega \left( \sqrt{\kappa} \log \frac{1}{\epsilon} \right)$ $\checkmark$ , Theorem 2.1.13 , Theorem 2.2.2
NS $G$-Lip Cont. Convex Optimality gap $\Omega (G^2 \epsilon^{-2})$ $\checkmark$ , Theorem 3.2.1 , Theorem 3.2.2
NS $G$-Lip Cont. $\mu$-SC Optimality gap $\Omega (G^2 (\mu \epsilon)^{-1})$ $\checkmark$ , Theorem 3.2.5 , Theorem 3.9 (the algorithm design therein requires projection)
$L$-Smooth Convex (function case) Stationarity $\Omega \left( \sqrt{\Delta L }\epsilon^{-1} \right)$ $\checkmark$ (within logarithmic) , Theorem 1 , Appendix A.1
$L$-Smooth Convex (domain case) Stationarity $\Omega \left( \sqrt{D L}\epsilon^{-\frac{1}{2}} \right)$ $\checkmark$ , Theorem 1 Section 6.5
$L$-Smooth NC Stationarity $\Omega (\Delta L \epsilon^{-2})$ $\checkmark$ , Theorem 1 , Theorem 10.15
NS $G$-Lip Cont. $\rho$-WC Near-stationarity Unknown $\mathcal{O}(\epsilon^{-4})$ / , Corollary 2.2
$L$-Smooth $\mu$-PL Optimality gap $\Omega \left( \kappa \log \frac{1}{\epsilon} \right)$ $\checkmark$ , Theorem 3 , Theorem 1

Remark:

  1. References:
  2. $\kappa\triangleq L/\mu\geq 1$ is called the condition number, which can be very large in many applications, e.g., the optimal regularization parameter choice in statistical learning can lead to $\kappa=\Omega(\sqrt{n})$ where $n$ is the sample size.
  3. The PL condition is a popular assumption in nonconvex optimization, generalizing strong convexity. Based on the summary above, both smooth strongly convex and smooth PL optimization problems have established optimal complexities (i.e., UB matches LB). However, the LB in the PL case is strictly larger than that of the SC case. Thus, in terms of worst-case complexity, the PL case is “strictly harder” than the strongly convex case.
  4. The $L$-Smooth convex setting with stationarity measurement is divided into two cases: the “function case” assumes a bounded initial optimality gap $f(x_0)-f(x^\star)\leq \Delta$, and the “domain case” assumes a bounded initialization $\|x_0-x^\star\|\leq D$.
  5. For the $L$-Smooth Convex (function case) setting, (Theorem 6.1) provided an $\mathcal{O}\left( \Delta L \epsilon^{-1} \right)$ upper bound, which avoids the logarithmic factor, albeit with worse dependence on $\Delta$ and $L$.
  6. The optimality gap and stationarity measurements are, in fact, closely related; their duality has been discussed in the literature.
  7. Exact Matching: In fact, for some settings, some works have further refined the bounds and shown that the lower and upper bounds are exactly the same, i.e., the two bounds match in both order and constants, so big-O notation is unnecessary in these cases. In detail:
    • In the smooth convex case, (Corollary 4) derived the minimax risk, which exactly matches the upper bound derived in (Theorem 2).
    • In the nonsmooth convex case, as discussed in , the best upper bound already exactly matches the lower bound derived in (Theorem 2).
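The PL remark above can be checked numerically on the classical example $f(x)=x^2+3\sin^2(x)$, which is nonconvex (its second derivative is negative near $x=\pi/2$) yet satisfies the PL inequality, so gradient descent still converges linearly to the global minimum (a toy sketch; the starting point and step size are our choices):

```python
import math

# Classical PL-but-nonconvex example: f(x) = x^2 + 3*sin(x)^2.
# It is nonconvex (f''(x) = 2 + 6*cos(2x) < 0 near x = pi/2) but satisfies
# the PL inequality, so gradient descent converges linearly to f* = 0.
f = lambda x: x * x + 3.0 * math.sin(x) ** 2
df = lambda x: 2.0 * x + 3.0 * math.sin(2.0 * x)

x, L = 2.0, 8.0            # |f''| <= 2 + 6 = 8, so step size 1/L is safe
for _ in range(100):
    x -= df(x) / L

assert f(x) < 1e-8         # fast convergence despite nonconvexity
```

No convexity is used here: the PL inequality alone drives the linear decrease of the optimality gap, matching the $\kappa\log\frac{1}{\epsilon}$ row in the table.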

Case 1-2: Finite-sum and Stochastic Minimization

Problem Type Measure Lower Bound Upper Bound Reference (LB) Reference (UB)
FS $L$-IS $\mu$-SC Point Distance $\Omega \left( (n + \sqrt{\kappa n}) \log \frac{1}{\epsilon} \right)$ $\checkmark$ , Corollary 3 , Corollary 2
FS $L$-AS $\mu$-SC Optimality gap $\Omega \left( n + n^{\frac{3}{4}} \sqrt{\kappa} \log \frac{\Delta}{\epsilon} \right)$ $\checkmark$ , Theorem 3.5 , Section 5
FS $L$-IS C Optimality gap $\Omega \left( n + D \sqrt{n L \epsilon^{-1}} \right)$ $\checkmark$ (within logarithmic) , Theorem 7 (the lower bound analysis extends the oracle class with an additional proximal oracle) , Corollary 3.7
FS $L$-AS C Optimality gap $\Omega \left( n + n^{\frac{3}{4}} D \sqrt{L \epsilon^{-1}} \right)$ $\checkmark$ , Theorem 4.2 , Section 5
FS $L$-IS NC Stationarity $\Omega \left( \Delta L \epsilon^{-2} \right)$ $\mathcal{O} \left( \sqrt{n}\Delta L \epsilon^{-2} \right)$ , Theorem 4.7 , Theorem 1
FS $L$-AS NC Stationarity $\Omega \left( \sqrt{n} \Delta L \epsilon^{-2} \right)$ $\checkmark$ , Theorem 4.5 , Theorem 2, 3
Stoc $L$-S $\mu$-SC Optimality gap $\Omega (\sqrt{\kappa}\log\frac{1}{\epsilon}+\frac{\sigma^2}{\mu\epsilon})$ $\mathcal{O} (\sqrt{\frac{L}{\epsilon}}+\frac{\sigma^2}{\mu\epsilon})$ , Equation 1.3 , Proposition 9
Stoc $L$-S C Optimality gap $\Omega (\sqrt{\frac{L}{\epsilon}}+\frac{\sigma^2}{\epsilon^2})$ $\checkmark$ , Equation 6 , Corollary 1
Stoc NS $\mu$-SC Optimality gap $\Omega (\epsilon^{-1})$ $\checkmark$ , Theorem 2 , Section 2.1
Stoc NS C Optimality gap $\Omega (\epsilon^{-2})$ $\checkmark$ , Theorem 1 , Section 2.2
Stoc $L$-S $\mu$-SC Stationarity $\Omega \left(\sqrt{\frac{L}{\epsilon}}+\frac{\sigma^2}{\epsilon^2}\right)$ $\checkmark$ (within logarithmic) , Equation 10 , Theorem 3
Stoc $L$-S C Stationarity $\Omega \left( \frac{\sqrt{L}}{\epsilon} + \frac{\sigma^2}{\epsilon^2} \log \frac{1}{\epsilon} \right)$ $\checkmark$ (within logarithmic) , Theorem 2 , Corollary 1
Stoc $L$-SS NC Stationarity $\Omega \left( \Delta \sigma \epsilon^{-4} \right)$ $\checkmark$ , Theorem 1 , Corollary 2.2
Stoc $L$-AS NC Stationarity $\Omega \left( \Delta \sigma \epsilon^{-3} \right)$ $\checkmark$ , Theorem 2 , Theorem 1
Stoc NS $L$-Lip $\rho$-WC Near-stationarity Unknown $\mathcal{O} (\epsilon^{-4})$ / , Theorem 2.1

Remark:

  1. References:
  2. Here $n$ corresponds to the number of component functions $f_i$, and $\kappa\triangleq L/\mu$ is the condition number, $\sigma^2$ corresponds to the variance of gradient estimator.
  3. For the finite-sum $L$-IS $\mu$-SC case, later work considered more general randomized algorithm and oracle class settings and derived an $\Omega \left( n + \sqrt{\kappa n} \log \frac{1}{\epsilon} \right)$ lower bound; a matching upper bound has also been proposed.
  4. For IFO/SFO-based algorithms, we only consider the case where all oracle queries are independent; shuffling-based algorithm analyses, which rely on without-replacement sampling, are therefore not directly applicable here.

Case 2-1: (S)C-(S)C Deterministic Minimax Optimization

Problem Type Measure Lower Bound Upper Bound Reference (LB) Reference (UB)
SC-SC, bilinear Duality Gap $\Omega(\sqrt{\kappa_x \kappa_y} \log \frac{1}{\epsilon})$ $\checkmark$ , Theorem 3.5 , Theorem 5
SC-SC, general Duality Gap $\Omega(\sqrt{\kappa_x \kappa_y} \log \frac{1}{\epsilon})$ $\checkmark$ (within logarithmic) , Theorem 4.5 , Theorem 3
SC-C, bilinear, NS Duality Gap $\Omega(\sqrt{\kappa_x} / \epsilon)$ $\mathcal{O}(\kappa_x^2 / \sqrt{\epsilon})$ , Theorem 10 , Theorem 2
SC-C, general Duality Gap $\Omega(D \sqrt{L \kappa_x} / \epsilon)$ $\tilde{\mathcal{O}}(D \sqrt{L \kappa_x} / \epsilon)$ , Theorem 2 , Section 3.2
C-SC, bilinear Duality Gap Unknown $\mathcal{O}(\log \frac{1}{\epsilon})$ / , Theorem 3.1
C-C, bilinear, NS Duality Gap $\Omega(L / \epsilon)$ $\checkmark$ , Theorem 9 , Theorem 1
C-C, general Duality Gap $\Omega(L D^2 / \epsilon)$ $\checkmark$ , Theorem 3 , Theorem 4.1
C-C, general Stationarity $\Omega(L D / \epsilon)$ $\checkmark$ , Corollary 3 , Corollary 2
PL-PL Duality Gap Unknown $\mathcal{O}(\kappa^3\log \frac{1}{\epsilon})$ / , Theorem 3.2

Remark:

  1. References:
  2. Here $\kappa_x$ and $\kappa_y$ correspond to the condition numbers of the $x$ and $y$ components, respectively. A more refined discussion regarding the different structures of $x$, $y$, and their coupling can be found in the references therein.

Case 2-2: (S)C-(S)C Finite-sum and Stochastic Minimax Optimization

Problem Type Measure LB UB Reference (LB) Reference (UB)
SC-SC, FS Duality Gap $\Omega\left((n + \kappa) \log \frac{1}{\epsilon}\right)$ $\checkmark$ , Theorem 1 , Theorem 1
SC-SC, Stoc, SS Duality Gap $\Omega(\epsilon^{-1})$ $\checkmark$ / , Theorem 5
SC-SC, Stoc, NS Duality Gap $\Omega(\epsilon^{-1})$ $\checkmark$ / , Theorem 1
SC-SC, Stoc Stationarity $\Omega(\sigma^2\epsilon^{-2}+\kappa)$ $\checkmark$ , Theorem 6.1 , Theorem 4.1
SC-C, FS Duality Gap $\Omega\left(n + \sqrt{n L / \epsilon}\right)$ $\tilde{\mathcal{O}}(\sqrt{n L / \epsilon})$ / , Section 3.2
C-C, FS Duality Gap $\Omega(n + L / \epsilon)$ $\mathcal{O}(\sqrt{n}/\epsilon)$ , Theorem 3 , Corollary 2.1
C-C, Stoc, SS Duality Gap $\Omega(\epsilon^{-2})$ $\checkmark$ / , Corollary 1
C-C, Stoc, NS Duality Gap $\Omega(\epsilon^{-2})$ $\checkmark$ / , Lemma 3.1
C-C, Stoc Stationarity $\Omega(\sigma^2\epsilon^{-2}+LD\epsilon^{-1})$ $\checkmark$ , Theorem 6.2 , Theorem 4.2
PL-PL, Stoc Duality Gap Unknown $\mathcal{O}(\kappa^5\epsilon^{-1})$ / , Theorem 3.3

Remark:

  1. References:

Case 2-3: NC-(S)C Deterministic Minimax Optimization

Type Measure LB UB Reference (LB) Reference (UB)
NC-SC, Deter Primal Stationarity $\Omega(\sqrt{\kappa}\Delta \mathcal{L} \epsilon^{-2})$ $\checkmark$ , Theorem 3.1 , Theorem 4.1
NC-C, Deter Near-Stationarity Unknown $\mathcal{O}(\Delta L^2 \epsilon^{-3} \log^2 \frac{1}{\epsilon})$ / , Corollary A.8
WC-C, Deter Near-Stationarity Unknown $\mathcal{O}(\epsilon^{-6})$ / , Theorem 3.7
NC-PL, Deter Primal Stationarity Unknown $\mathcal{O}(\kappa\epsilon^{-2})$ / , Corollary 4.1

Remark:

  1. References:
  2. Some other works also studied the above problems in terms of function stationarity (i.e., the gradient norm of $g$, rather than that of the primal function $f$). It has been shown that function stationarity and primal stationarity are transferable with mild effort; thus, we do not present those results separately.

Case 2-4: NC-(S)C Finite-sum and Stochastic Minimax Optimization

Type Measure LB UB Reference (LB) Reference (UB)
NC-SC, FS, AS Primal Stationarity $\Omega\left(n+\sqrt{n\kappa}\Delta L\epsilon^{-2}\right)$ $\mathcal{O}\left((n+n^{3/4}\sqrt{\kappa})\Delta L\epsilon^{-2}\right)$ , Theorem 3.2 , Section 4.2
NC-C, FS, IS Near-stationarity Unknown $\mathcal{O}\left(n^{3/4}L^2D\Delta\epsilon^{-3}\right)$ / , Corollary 4.3
NC-SC, Stoc, SS Primal Stationarity $\Omega\left(\kappa^{1/3}\Delta L\epsilon^{-4}\right)$ $\mathcal{O}\left(\kappa\Delta L\epsilon^{-4}\right)$ , Theorem 2 , Theorem 3
NC-SC, Stoc, IS Primal Stationarity Unknown $\mathcal{O}\left(\kappa^2\Delta L\epsilon^{-3}\right)$ / , Theorem 4
NC-C, Stoc, SS Near-stationarity Unknown $\mathcal{O}\left(L^3\epsilon^{-6}\right)$ / , Theorem 6
NC-PL, Stoc Primal Stationarity Unknown $\mathcal{O}(\kappa^2\epsilon^{-4})$ / , Corollary 4.1
WC-SC, Stoc, NS Near-Stationarity Unknown $\mathcal{O}(\epsilon^{-4})$ / , Theorem 2

Remark:

  1. References:

What is Next?

The section above summarizes the upper and lower bounds on the oracle complexity of finding $\epsilon$-optimal solutions or $\epsilon$-stationary points for minimization and minimax problems. Clearly, this is not the end of the story. More and more optimization problems arise from applications like machine learning and operations research, which come with more involved problem structure and complicated landscape characteristics. We also note that the above summary corresponds to worst-case upper and lower bounds in theory. Sometimes (or often), it is hard to explain algorithm behavior in practice using existing theory; e.g., it has been shown that variance reduction may be ineffective in accelerating the training of deep learning models, contrasting with classical convergence theory. Below, we discuss potentially interesting next steps.

Richer Problem Structure

In the aforementioned discussion, we only considered minimization and minimax problems. There are also many other important optimization problems with different structures, for example, stochastic compositional optimization:

\[\min_{x \in \mathcal{X}} F(x) =\mathbb{E}_{\xi}\left[f\left(\mathbb{E}_{\eta}\left[g(x;\eta)\right];\xi\right)\right].\]
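A quick numerical check (a toy instance of our own: $g(x;\eta)=x+\eta$ with $\eta\sim\mathcal{N}(0,1)$ and $f(u)=u^2$, so that $F(x)=x^2$) illustrates why such nested expectations are harder than classical stochastic optimization: the naive plug-in estimator of the objective is biased.

```python
import random

# Compositional objective F(x) = f(E[g(x; eta)]) with a nonlinear outer f:
# the naive plug-in estimator f(g(x; eta)) is biased.
# Toy instance: g(x; eta) = x + eta, eta ~ N(0, 1), f(u) = u^2 => F(x) = x^2.
random.seed(0)
x, N = 1.0, 200_000

naive = sum((x + random.gauss(0, 1)) ** 2 for _ in range(N)) / N
true_F = x ** 2

# E[(x + eta)^2] = x^2 + Var(eta) = x^2 + 1: a bias of one full unit,
# which no amount of averaging single samples removes.
assert abs(naive - true_F - 1.0) < 0.05
```

This bias is why compositional problems require dedicated estimators and algorithms rather than plain SGD, and why their complexity theory is developed separately.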

Landscape Analysis

Since most deep learning problems are nonconvex, a vast amount of literature focuses on finding a (generalized) stationary point of the original optimization problem, yet practice often shows that global optimality can be found efficiently for various structured nonconvex problems. In fact, there is a line of research tailored to the global landscape of structured nonconvex optimization. For example, in neural network training, the interpolation condition holds for some overparameterized neural networks; it has also been observed that low-rank structures naturally emerge in the weight matrices during training.

Regarding such a mismatch between theory and practice, one reason may be the coarse assumptions the community applies in theoretical analyses, which cannot effectively characterize the landscape of objectives. Here we briefly summarize a few structures arising in recent works that try to bridge the gap between practice and theory:

Unified Lower Bounds

For lower bounds, the literature commonly adopts so-called optimization-based lower bounds proved via zero-chain arguments. Such lower bounds take the following form: for any given accuracy $\epsilon>0$, there exists a hard instance $f:\mathbb{R}^{d_\epsilon}\rightarrow\mathbb{R}$ in the function class $\mathcal{F}$, where the dimension $d_\epsilon$ depends on the accuracy $\epsilon$, such that it takes at least $\mathrm{poly}(\epsilon^{-1})$ oracle calls to find an $\epsilon$-optimal solution or $\epsilon$-stationary point.
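The zero-chain mechanism can be illustrated with Nesterov's classical chain-like quadratic (a sketch; the dimension and step size are our choices): starting from the origin, each gradient query can extend the set of "discovered" coordinates by at most one, which is exactly what forces the hard instance's dimension to grow as $\epsilon$ shrinks.

```python
def grad_chain(x):
    # Gradient of Nesterov's "chain" function
    # f(x) = 0.5*(x_1^2 + sum_i (x_{i+1} - x_i)^2 + x_d^2) - x_1.
    # Its tridiagonal structure is the heart of zero-chain arguments.
    d = len(x)
    g = [0.0] * d
    for i in range(d):
        g[i] = 2.0 * x[i]
        if i > 0:
            g[i] -= x[i - 1]
        if i < d - 1:
            g[i] -= x[i + 1]
    g[0] -= 1.0
    return g

d = 20
x = [0.0] * d                     # first-order methods start at the origin
for t in range(1, 11):
    g = grad_chain(x)
    x = [xi - 0.25 * gi for xi, gi in zip(x, g)]
    support = sum(1 for xi in x if xi != 0.0)
    # after t gradient calls, at most t coordinates have been "discovered"
    assert support <= t
```

Since any method whose iterates stay in the span of past gradients activates at most one new coordinate per query, driving the error below $\epsilon$ requires reaching deep into the chain, yielding the iteration lower bound.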

Note that the dimension $d_\epsilon$ depends on the accuracy; in particular, $d_\epsilon$ increases as $\epsilon$ decreases. For a function class with a fixed dimension $d$, independent of the accuracy, the dependence on $\epsilon$ becomes loose, especially when $d$ is small. In other words, for a $d$-dimensional function class and a given accuracy $\epsilon>0$, the upper bounds on the complexity of first-order methods given in the tables still hold, yet the lower bounds become loose, which can leave a mismatch between upper and lower bounds. This leads to a fundamental question:

How to prove lower bounds of first-order methods for any given $d$-dimensional function classes?

Such a question has been partially addressed for stochastic optimization using information-theoretic lower bounds, and for deterministic one-dimensional optimization:

Beyond Classical Oracle Model

The oracle complexity model mainly focuses on worst-case instances in the function class, which may be far from practical instances. The derived complexities can thus be so conservative that they fail to match practice well, as the figure below illustrates.

Gap Between General Worst-Case Complexity and Instance-Level Complexity Analysis

Recently, some works have appeared that go beyond the classical oracle model in evaluating the efficiency of optimization algorithms, for example:


Conclusion

In this post, we reviewed SOTA complexity upper and lower bounds of first-order algorithms for minimization and minimax problems in various settings. The summary identifies gaps in existing research, shedding light on open questions regarding accelerated algorithm design and the investigation of performance limits. Under the oracle framework, one should be careful when claiming that one algorithm is better than another, and double-check whether the comparison is fair in terms of the function class, oracle information, and algorithm class definition.

Given the rapid development and interdisciplinary applications in areas like machine learning and operations research, we also revisited several recent works that go beyond the classical research flow of the optimization community. These works advocate a paradigm shift: besides pursuing an elegant, unified theory that tries to cover all cases, we should sometimes avoid “Maslow’s hammer”, focus on concrete applications first, identify their unique structure, and correspondingly design algorithms tailored to these problems, which in turn benefits practice. Such instance-driven patterns may help the optimization community devise theory that better fits practice.

While we have aimed to provide a thorough and balanced summary of existing complexity results for first-order methods, we acknowledge the possibility of overlooking relevant works, subtle technical conditions, or potential inaccuracies in interpreting the literature. Readers who spot such issues are warmly encouraged to email us and bring them to our attention. Constructive feedback, corrections, and suggestions are highly appreciated.

Acknowledgement

We thank the insightful suggestions from two anonymous reviewers and Prof. Benjamin Grimmer.

For attribution in academic contexts, please cite this work as
        PLACEHOLDER FOR ACADEMIC ATTRIBUTION
  
BibTeX citation
        PLACEHOLDER FOR BIBTEX