

Poster

High-Probability Bound for Non-Smooth Non-Convex Stochastic Optimization with Heavy Tails

Langqi Liu · Yibo Wang · Lijun Zhang


Abstract: Recently, Cutkosky et al. introduced the online-to-non-convex framework, which uses online learning methods to solve non-smooth non-convex optimization problems. However, their results remain unsatisfactory because (i) they only provide an in-expectation guarantee, so the solution produced in a particular run could be much worse than the expected performance, and (ii) their method relies on the bounded-variance assumption on gradient distributions, which can be impractical in many deep learning problems. To address these limitations, in this paper we relax the assumption on gradients to allow heavy-tailed distributions with finite $\mathfrak{p}$-th moments for some $\mathfrak{p}\in(1,2]$, and propose a novel algorithm that uses $\tilde{\mathcal{O}}(\epsilon^{-\frac{2\mathfrak{p}-1}{\mathfrak{p}-1}}\delta^{-1}\log(1/q))$ gradient queries to identify a $(\delta,\epsilon)$-stationary point with probability $1-q$. The key idea is to first incorporate the gradient clipping technique into the online-to-non-convex framework to produce a sequence of points whose averaged gradient norm is no greater than $\epsilon$ with high probability. We then propose a validation method to identify one $(\delta,\epsilon)$-stationary point among these candidates. When gradient distributions have bounded variance, taking $\mathfrak{p}=2$ reduces our result to $\tilde{\mathcal{O}}(\epsilon^{-3}\delta^{-1}\log(1/q))$, which improves the existing high-probability bound of $\tilde{\mathcal{O}}(\epsilon^{-4}\delta^{-1}\log(1/q))$. We also recover the previous bound of $\tilde{\mathcal{O}}(\epsilon^{-\frac{3\mathfrak{p}-2}{\mathfrak{p}-1}}\log(1/q))$ when the objective is smooth.
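
The sketch below is a minimal illustration (not the authors' algorithm) of the two ingredients named in the abstract: clipping heavy-tailed stochastic gradients before feeding them to an online-learning-style update to produce candidate points, and a validation pass that re-estimates gradient norms at the candidates and returns the best one. The online learner here is plain gradient descent, and all names (clip_threshold, inner step counts, the toy objective, the Student-t noise) are illustrative assumptions.

```python
# Minimal sketch of "clipped stochastic gradients + candidate generation + validation".
# This is NOT the paper's algorithm; it only illustrates the high-level idea.
import numpy as np


def clip(g, tau):
    """Standard gradient clipping: scale g so that its norm is at most tau."""
    norm = np.linalg.norm(g)
    return g if norm <= tau else g * (tau / norm)


def candidate_points(x0, stoch_grad, T, eta, tau, rng):
    """Run T steps with clipped stochastic gradients and record the iterates.
    The update rule is plain gradient descent, chosen only for simplicity."""
    x = x0.copy()
    candidates = []
    for _ in range(T):
        g = clip(stoch_grad(x, rng), tau)
        x = x - eta * g
        candidates.append(x.copy())
    return candidates


def validate(candidates, stoch_grad, m, tau, rng):
    """Pick the candidate whose averaged clipped-gradient norm over m fresh
    samples is smallest -- a stand-in for the validation step in the abstract."""
    best_x, best_score = None, np.inf
    for x in candidates:
        avg = np.mean([np.linalg.norm(clip(stoch_grad(x, rng), tau))
                       for _ in range(m)])
        if avg < best_score:
            best_x, best_score = x, avg
    return best_x, best_score


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy non-convex objective f(x) = sum(x_i^2 - cos(3 x_i)); gradient 2x + 3 sin(3x).
    grad = lambda x: 2 * x + 3 * np.sin(3 * x)
    # Heavy-tailed noise: Student-t with 1.5 degrees of freedom, which has finite
    # p-th moments only for p < 1.5 (so variance is infinite).
    stoch_grad = lambda x, rng: grad(x) + rng.standard_t(df=1.5, size=x.shape)

    x0 = np.full(10, 2.0)
    cands = candidate_points(x0, stoch_grad, T=2000, eta=0.01, tau=5.0, rng=rng)
    x_hat, score = validate(cands[::100], stoch_grad, m=50, tau=5.0, rng=rng)
    print("validated averaged clipped-gradient norm:", round(score, 3))
```

In this toy setting, clipping keeps the occasional huge noise realizations from destabilizing the updates, and the validation pass guards against unluckily returning a poor iterate from a single run, which mirrors the role these two components play in the high-probability analysis described above.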
