Title: Stochastic subgradient method
Course: Convex Optimization II
Institution: Stanford University
Format: lecture slides
Stochastic Subgradient Method
• noisy unbiased subgradient
• stochastic subgradient method
• convergence proof
• stochastic programming
• expected value of convex function
• on-line learning and adaptive signal processing
EE364b, Stanford University
Noisy unbiased subgradient

• random vector g̃ ∈ Rⁿ is a noisy unbiased subgradient for f : Rⁿ → R at x if for all z

    f(z) ≥ f(x) + (E g̃)ᵀ(z − x)

  i.e., g = E g̃ ∈ ∂f(x)
• same as g̃ = g + v, where g ∈ ∂f(x), E v = 0
• v can represent error in computing g, measurement noise, Monte Carlo sampling error, etc.
• if x is also random, g̃ is a noisy unbiased subgradient of f at x if for all z

    f(z) ≥ f(x) + E(g̃ | x)ᵀ(z − x)

  holds almost surely
• same as E(g̃ | x) ∈ ∂f(x) (a.s.)
Stochastic subgradient method

stochastic subgradient method is the subgradient method, using noisy unbiased subgradients

    x(k+1) = x(k) − αk g̃(k)

• x(k) is the kth iterate
• g̃(k) is any noisy unbiased subgradient of (convex) f at x(k), i.e., E(g̃(k) | x(k)) = g(k) ∈ ∂f(x(k))
• αk > 0 is the kth step size
• define fbest(k) = min{f(x(1)), …, f(x(k))}
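A minimal sketch of this update loop in Python; the ℓ1 objective, noise model, and step-size rule below are illustrative choices, not from the slides:

```python
import numpy as np

def stochastic_subgradient(f, noisy_subgrad, x0, steps):
    """Run x(k+1) = x(k) - alpha_k * g~(k) and track f_best."""
    x = np.asarray(x0, dtype=float)
    f_best = f(x)
    for k in range(1, steps + 1):
        alpha = 1.0 / k                    # square-summable, not summable
        x = x - alpha * noisy_subgrad(x)
        f_best = min(f_best, f(x))         # fbest(k) = min over iterates so far
    return x, f_best

# toy problem: f(x) = ||x||_1, exact subgradient sign(x) plus zero-mean noise
rng = np.random.default_rng(0)
f = lambda x: np.abs(x).sum()
g_noisy = lambda x: np.sign(x) + 0.1 * rng.standard_normal(x.shape)

x_final, f_best = stochastic_subgradient(f, g_noisy, np.ones(5), 2000)
```

Note that only `f_best` is monotone; individual iterates can (and do) move uphill because of the noise.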
Assumptions

• f⋆ = inf_x f(x) > −∞, with f(x⋆) = f⋆
• E ‖g̃(k)‖₂² ≤ G² for all k
• E ‖x(1) − x⋆‖₂² ≤ R² (can take = here)
• step sizes are square-summable but not summable:

    αk ≥ 0,    Σ_{k=1}^∞ αk² = ‖α‖₂² < ∞,    Σ_{k=1}^∞ αk = ∞

these assumptions are stronger than needed, just to simplify proofs
Convergence results

• convergence in expectation: lim_{k→∞} E fbest(k) = f⋆
• convergence in probability: for any ε > 0, lim_{k→∞} Prob(fbest(k) ≥ f⋆ + ε) = 0
• almost sure convergence: lim_{k→∞} fbest(k) = f⋆ a.s. (we won't show this)
Convergence proof

key quantity: expected Euclidean distance squared to the optimal set

    E(‖x(k+1) − x⋆‖₂² | x(k))
      = E(‖x(k) − αk g̃(k) − x⋆‖₂² | x(k))
      = ‖x(k) − x⋆‖₂² − 2αk E(g̃(k) | x(k))ᵀ(x(k) − x⋆) + αk² E(‖g̃(k)‖₂² | x(k))
      ≤ ‖x(k) − x⋆‖₂² − 2αk (f(x(k)) − f⋆) + αk² E(‖g̃(k)‖₂² | x(k))

using E(g̃(k) | x(k)) ∈ ∂f(x(k))
now take expectation:

    E ‖x(k+1) − x⋆‖₂² ≤ E ‖x(k) − x⋆‖₂² − 2αk (E f(x(k)) − f⋆) + αk² E ‖g̃(k)‖₂²

apply recursively, and use E ‖g̃(k)‖₂² ≤ G² to get

    E ‖x(k+1) − x⋆‖₂² ≤ E ‖x(1) − x⋆‖₂² − 2 Σ_{i=1}^k αi (E f(x(i)) − f⋆) + G² Σ_{i=1}^k αi²

and so

    min_{i=1,…,k} (E f(x(i)) − f⋆) ≤ (R² + G² ‖α‖₂²) / (2 Σ_{i=1}^k αi)
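As a quick numerical sanity check on this bound, with αi = 1/i the sum of squares converges while the harmonic sum diverges, so the right-hand side goes to zero; R = G = 1 below are illustrative constants:

```python
# evaluate (R^2 + G^2 * sum alpha_i^2) / (2 * sum alpha_i) for alpha_i = 1/i
R, G = 1.0, 1.0  # illustrative values for the constants in the assumptions

def rhs_bound(k):
    alphas = [1.0 / i for i in range(1, k + 1)]
    return (R**2 + G**2 * sum(a * a for a in alphas)) / (2 * sum(alphas))

print(rhs_bound(10), rhs_bound(1000), rhs_bound(100000))
```

The decrease is slow (the denominator grows only like log k), which matches the slow convergence seen in the examples that follow.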
• we conclude min_{i=1,…,k} E f(x(i)) → f⋆
• Jensen's inequality and concavity of minimum yield

    E fbest(k) = E min_{i=1,…,k} f(x(i)) ≤ min_{i=1,…,k} E f(x(i))

  so E fbest(k) → f⋆ (convergence in expectation)
• Markov's inequality: for ε > 0,

    Prob(fbest(k) − f⋆ ≥ ε) ≤ E(fbest(k) − f⋆) / ε

  righthand side goes to zero, so we get convergence in probability
Example: piecewise linear minimization

    minimize f(x) = max_{i=1,…,m} (aiᵀx + bi)

we use the stochastic subgradient method with noisy subgradient

    g̃(k) = g(k) + v(k),    g(k) ∈ ∂f(x(k))

v(k) independent zero-mean random variables
problem instance: n = 20 variables, m = 100 terms, f⋆ ≈ 1.1, αk = 1/k; v(k) are IID N(0, 0.5I) (25% noise since ‖g‖ ≈ 4.5)
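This experiment can be reproduced in outline on a randomly generated instance; the A, b, and seed below are illustrative, while the dimensions, noise level, and step size follow the slide:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 20, 100
A = rng.standard_normal((m, n))        # rows a_i (illustrative instance)
b = rng.standard_normal(m)

f = lambda x: (A @ x + b).max()        # piecewise linear objective

x = np.zeros(n)
f0 = f(x)
f_best = f0
for k in range(1, 5001):
    i = int(np.argmax(A @ x + b))      # active term: a_i is a subgradient
    v = np.sqrt(0.5) * rng.standard_normal(n)   # v(k) ~ N(0, 0.5 I)
    x = x - (1.0 / k) * (A[i] + v)     # alpha_k = 1/k
    f_best = min(f_best, f(x))
```

Averaging `f_best - f_star` over many such runs (with f_star computed by, e.g., an LP solver) reproduces the mean/std plot below.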
[figure] average and one standard deviation of fbest(k) − f⋆ over 100 realizations, versus k = 0, …, 5000 (vertical log scale, 10⁰ down to 10⁻³)
[figure] empirical distributions of fbest(k) − f⋆ at k = 250, k = 1000, and k = 5000 (histograms on a log scale from 10⁻³ to 10⁰)
Stochastic programming

    minimize    E f0(x, ω)
    subject to  E fi(x, ω) ≤ 0,    i = 1, …, m

if fi(x, ω) is convex in x for each ω, problem is convex

'certainty-equivalent' problem

    minimize    f0(x, E ω)
    subject to  fi(x, E ω) ≤ 0,    i = 1, …, m

(if fi(x, ω) is convex in ω, gives a lower bound on optimal value of stochastic problem)
Variations

• in place of E fi(x, ω) ≤ 0 (constraint holds in expectation) can use
  – E fi(x, ω)₊ ≤ ε (LHS is expected violation)
  – E (max_i fi(x, ω)₊) ≤ ε (LHS is expected worst violation)
• unfortunately, chance constraint Prob(fi(x, ω) ≤ 0) ≥ η is convex only in a few special cases
Expected value of convex function

suppose F(x, w) is convex in x for each w, and G(x, w) ∈ ∂x F(x, w)

• f(x) = E F(x, w) = ∫ F(x, w) p(w) dw is convex
• a subgradient of f at x is g = E G(x, w) = ∫ G(x, w) p(w) dw ∈ ∂f(x)
• a noisy unbiased subgradient of f at x is

    g̃ = (1/M) Σ_{i=1}^M G(x, wi)

  where w1, …, wM are M independent samples (Monte Carlo)
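The Monte Carlo estimator can be sketched for F(x, w) = max_i(aiᵀx + bi) with random (A, b); the sampling distribution below is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 5, 10
A_bar = rng.standard_normal((m, n))    # hypothetical means of the a_i
b_bar = rng.standard_normal(m)

def noisy_subgrad(x, M=100):
    """g~ = (1/M) sum_i G(x, w_i): average subgradients over M samples."""
    g = np.zeros(n)
    for _ in range(M):
        A = A_bar + rng.standard_normal((m, n))  # sample a_i ~ N(abar_i, I)
        b = b_bar + rng.standard_normal(m)
        i = int(np.argmax(A @ x + b))            # subgradient of the max: a_i
        g += A[i]
    return g / M

g_tilde = noisy_subgrad(np.zeros(n))
```

Unbiasedness holds for any M ≥ 1; larger M only reduces the variance of g̃, which is the trade-off explored in the M = 1, 10, 100, 1000 comparison below.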
Example: expected value of piecewise linear function

    minimize f(x) = E max_{i=1,…,m} (aiᵀx + bi)

where ai and bi are random

evaluate noisy subgradient using the Monte Carlo method with M samples, and run the stochastic subgradient method; compare to:

• certainty equivalent: minimize fce(x) = max_{i=1,…,m} (E aiᵀx + E bi)
• heuristic: minimize fheur(x) = max_{i=1,…,m} (E aiᵀx + E bi + λ‖x‖₂)
problem instance: n = 20, m = 100, ai ∼ N(āi, 5I), b ∼ N(b̄, 5I), ‖ai‖₂ ≈ 5, ‖b‖₂ ≈ 10; xstoch computed using M = 100

[figure] histograms of f(xce), f(xheur), and f(xstoch), with values roughly between 1 and 3
f⋆ ≈ 1.34, estimated by running the method with M = 1000 for a long time

[figure] fbest(k) − f̂⋆ versus k = 0, …, 2000 for M = 1, 10, 100, 1000 (vertical log scale, 10⁰ down to 10⁻³)
On-line learning and adaptive signal processing

• (x, y) ∈ Rⁿ × R have some joint distribution
• find weight vector w ∈ Rⁿ for which wᵀx is a good estimator of y
• choose w to minimize the expected value of a convex loss function l:

    J(w) = E l(wᵀx − y)

  – l(u) = u²: mean-square error
  – l(u) = |u|: mean-absolute error
• at each step (e.g., time sample), we are given a sample (x(k), y(k)) from the distribution
noisy unbiased subgradient of J at w(k), based on sample (x(k+1), y(k+1)):

    g̃(k) = l′(w(k)ᵀx(k+1) − y(k+1)) x(k+1)

where l′ is the derivative (or a subgradient) of l

on-line algorithm:

    w(k+1) = w(k) − αk l′(w(k)ᵀx(k+1) − y(k+1)) x(k+1)

• for l(u) = u², gives the LMS (least mean-square) algorithm
• for l(u) = |u|, gives the sign algorithm
• w(k)ᵀx(k+1) − y(k+1) is the prediction error
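A minimal sketch of the sign algorithm (l(u) = |u|, so l′(u) = sign(u)) on synthetic data; the target weight vector and noise level below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10
w_true = rng.standard_normal(n)        # hypothetical target weights

w = np.zeros(n)
err0 = np.linalg.norm(w - w_true)
for k in range(1, 3001):
    x = rng.standard_normal(n)                   # fresh sample (x, y)
    y = w_true @ x + 0.1 * rng.standard_normal()
    err = w @ x - y                              # prediction error
    w = w - (1.0 / k) * np.sign(err) * x         # sign algorithm, alpha_k = 1/k
```

Replacing `np.sign(err)` with `2 * err` (i.e., l′ for l(u) = u²) turns this into the LMS recursion.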
Example: Mean-absolute error minimization

problem instance: n = 10, (x, y) ∼ N(0, Σ), Σ random with E(y²) ≈ 12, αk = 1/k

[figure] prediction error versus k = 0, …, 300 (values roughly between −40 and 40)
[figure] empirical distribution of prediction error for w⋆ (over 1000 samples), roughly between −2 and 2