Title: Stochastic subgradient method
Course: Convex Optimization II
Institution: Stanford University
Format: lecture slides
Stochastic Subgradient Method
• noisy unbiased subgradient
• stochastic subgradient method
• convergence proof
• stochastic programming
• expected value of convex function
• on-line learning and adaptive signal processing
EE364b, Stanford University
Noisy unbiased subgradient

• random vector g̃ ∈ Rⁿ is a noisy unbiased subgradient for f : Rⁿ → R at x if for all z

    f(z) ≥ f(x) + (E g̃)ᵀ(z − x)

  i.e., g = E g̃ ∈ ∂f(x)
• same as g̃ = g + v, where g ∈ ∂f(x), E v = 0
• v can represent error in computing g, measurement noise, Monte Carlo sampling error, etc.
• if x is also random, g̃ is a noisy unbiased subgradient of f at x if for all z

    f(z) ≥ f(x) + E(g̃ | x)ᵀ(z − x)

  holds almost surely
• same as E(g̃ | x) ∈ ∂f(x) (a.s.)
Stochastic subgradient method

stochastic subgradient method is the subgradient method, using noisy unbiased subgradients

    x(k+1) = x(k) − αk g̃(k)

• x(k) is the kth iterate
• g̃(k) is any noisy unbiased subgradient of (convex) f at x(k), i.e., E(g̃(k) | x(k)) = g(k) ∈ ∂f(x(k))
• αk > 0 is the kth step size
• define fbest(k) = min{f(x(1)), …, f(x(k))}
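A minimal sketch of this update loop in Python; the ℓ1 objective, noise model, and step-size rule below are illustrative choices, not from the slides:

```python
import numpy as np

def stochastic_subgradient(f, noisy_subgrad, x0, steps):
    """Run x(k+1) = x(k) - alpha_k * g~(k) and track f_best."""
    x = np.asarray(x0, dtype=float)
    f_best = f(x)
    for k in range(1, steps + 1):
        alpha = 1.0 / k                    # square-summable, not summable
        x = x - alpha * noisy_subgrad(x)
        f_best = min(f_best, f(x))         # fbest(k) = min over iterates so far
    return x, f_best

# toy problem: f(x) = ||x||_1, exact subgradient sign(x) plus zero-mean noise
rng = np.random.default_rng(0)
f = lambda x: np.abs(x).sum()
g_noisy = lambda x: np.sign(x) + 0.1 * rng.standard_normal(x.shape)

x_final, f_best = stochastic_subgradient(f, g_noisy, np.ones(5), 2000)
```

Note that only `f_best` is monotone; individual iterates can (and do) move uphill because of the noise.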
Assumptions

• f⋆ = inf_x f(x) > −∞, with f(x⋆) = f⋆
• E ‖g̃(k)‖₂² ≤ G² for all k
• E ‖x(1) − x⋆‖₂² ≤ R² (can take = here)
• step sizes are square-summable but not summable:

    αk ≥ 0,    Σ_{k=1}^∞ αk² = ‖α‖₂² < ∞,    Σ_{k=1}^∞ αk = ∞

these assumptions are stronger than needed, just to simplify proofs
Convergence results

• convergence in expectation: lim_{k→∞} E fbest(k) = f⋆
• convergence in probability: for any ε > 0, lim_{k→∞} Prob(fbest(k) ≥ f⋆ + ε) = 0
• almost sure convergence: lim_{k→∞} fbest(k) = f⋆ a.s. (we won't show this)
Convergence proof

key quantity: expected Euclidean distance squared to the optimal set

    E(‖x(k+1) − x⋆‖₂² | x(k))
      = E(‖x(k) − αk g̃(k) − x⋆‖₂² | x(k))
      = ‖x(k) − x⋆‖₂² − 2αk E(g̃(k) | x(k))ᵀ(x(k) − x⋆) + αk² E(‖g̃(k)‖₂² | x(k))
      ≤ ‖x(k) − x⋆‖₂² − 2αk (f(x(k)) − f⋆) + αk² E(‖g̃(k)‖₂² | x(k))

using E(g̃(k) | x(k)) ∈ ∂f(x(k))
now take expectation:

    E ‖x(k+1) − x⋆‖₂² ≤ E ‖x(k) − x⋆‖₂² − 2αk (E f(x(k)) − f⋆) + αk² E ‖g̃(k)‖₂²

apply recursively, and use E ‖g̃(k)‖₂² ≤ G² to get

    E ‖x(k+1) − x⋆‖₂² ≤ E ‖x(1) − x⋆‖₂² − 2 Σ_{i=1}^k αi (E f(x(i)) − f⋆) + G² Σ_{i=1}^k αi²

and so

    min_{i=1,…,k} (E f(x(i)) − f⋆) ≤ (R² + G² ‖α‖₂²) / (2 Σ_{i=1}^k αi)
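As a quick numerical sanity check on this bound, with αi = 1/i the sum of squares converges while the harmonic sum diverges, so the right-hand side goes to zero; R = G = 1 below are illustrative constants:

```python
# evaluate (R^2 + G^2 * sum alpha_i^2) / (2 * sum alpha_i) for alpha_i = 1/i
R, G = 1.0, 1.0  # illustrative values for the constants in the assumptions

def rhs_bound(k):
    alphas = [1.0 / i for i in range(1, k + 1)]
    return (R**2 + G**2 * sum(a * a for a in alphas)) / (2 * sum(alphas))

print(rhs_bound(10), rhs_bound(1000), rhs_bound(100000))
```

The decrease is slow (the denominator grows only like log k), which matches the slow convergence seen in the examples that follow.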
• we conclude min_{i=1,…,k} E f(x(i)) → f⋆
• Jensen's inequality and concavity of minimum yield

    E fbest(k) = E min_{i=1,…,k} f(x(i)) ≤ min_{i=1,…,k} E f(x(i))

  so E fbest(k) → f⋆ (convergence in expectation)
• Markov's inequality: for ε > 0,

    Prob(fbest(k) − f⋆ ≥ ε) ≤ E(fbest(k) − f⋆) / ε

  righthand side goes to zero, so we get convergence in probability
Example: piecewise linear minimization

    minimize f(x) = max_{i=1,…,m} (aiᵀx + bi)

we use the stochastic subgradient method with noisy subgradient

    g̃(k) = g(k) + v(k),    g(k) ∈ ∂f(x(k))

v(k) independent zero-mean random variables
problem instance: n = 20 variables, m = 100 terms, f⋆ ≈ 1.1, αk = 1/k; v(k) are IID N(0, 0.5I) (25% noise since ‖g‖ ≈ 4.5)
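This experiment can be reproduced in outline on a randomly generated instance; the A, b, and seed below are illustrative, while the dimensions, noise level, and step size follow the slide:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 20, 100
A = rng.standard_normal((m, n))        # rows a_i (illustrative instance)
b = rng.standard_normal(m)

f = lambda x: (A @ x + b).max()        # piecewise linear objective

x = np.zeros(n)
f0 = f(x)
f_best = f0
for k in range(1, 5001):
    i = int(np.argmax(A @ x + b))      # active term: a_i is a subgradient
    v = np.sqrt(0.5) * rng.standard_normal(n)   # v(k) ~ N(0, 0.5 I)
    x = x - (1.0 / k) * (A[i] + v)     # alpha_k = 1/k
    f_best = min(f_best, f(x))
```

Averaging `f_best - f_star` over many such runs (with f_star computed by, e.g., an LP solver) reproduces the mean/std plot below.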
[figure] average and one standard deviation of fbest(k) − f⋆ over 100 realizations, versus k = 0, …, 5000 (vertical log scale, 10⁰ down to 10⁻³)
[figure] empirical distributions of fbest(k) − f⋆ at k = 250, k = 1000, and k = 5000 (histograms on a log scale from 10⁻³ to 10⁰)
Stochastic programming

    minimize    E f0(x, ω)
    subject to  E fi(x, ω) ≤ 0,    i = 1, …, m

if fi(x, ω) is convex in x for each ω, problem is convex

'certainty-equivalent' problem

    minimize    f0(x, E ω)
    subject to  fi(x, E ω) ≤ 0,    i = 1, …, m

(if fi(x, ω) is convex in ω, gives a lower bound on optimal value of stochastic problem)
Variations

• in place of E fi(x, ω) ≤ 0 (constraint holds in expectation) can use
  – E fi(x, ω)₊ ≤ ε (LHS is expected violation)
  – E (max_i fi(x, ω)₊) ≤ ε (LHS is expected worst violation)
• unfortunately, chance constraint Prob(fi(x, ω) ≤ 0) ≥ η is convex only in a few special cases
Expected value of convex function

suppose F(x, w) is convex in x for each w, and G(x, w) ∈ ∂x F(x, w)

• f(x) = E F(x, w) = ∫ F(x, w) p(w) dw is convex
• a subgradient of f at x is g = E G(x, w) = ∫ G(x, w) p(w) dw ∈ ∂f(x)
• a noisy unbiased subgradient of f at x is

    g̃ = (1/M) Σ_{i=1}^M G(x, wi)

  where w1, …, wM are M independent samples (Monte Carlo)
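The Monte Carlo estimator can be sketched for F(x, w) = max_i(aiᵀx + bi) with random (A, b); the sampling distribution below is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 5, 10
A_bar = rng.standard_normal((m, n))    # hypothetical means of the a_i
b_bar = rng.standard_normal(m)

def noisy_subgrad(x, M=100):
    """g~ = (1/M) sum_i G(x, w_i): average subgradients over M samples."""
    g = np.zeros(n)
    for _ in range(M):
        A = A_bar + rng.standard_normal((m, n))  # sample a_i ~ N(abar_i, I)
        b = b_bar + rng.standard_normal(m)
        i = int(np.argmax(A @ x + b))            # subgradient of the max: a_i
        g += A[i]
    return g / M

g_tilde = noisy_subgrad(np.zeros(n))
```

Unbiasedness holds for any M ≥ 1; larger M only reduces the variance of g̃, which is the trade-off explored in the M = 1, 10, 100, 1000 comparison below.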
Example: expected value of piecewise linear function

    minimize f(x) = E max_{i=1,…,m} (aiᵀx + bi)

where ai and bi are random

evaluate noisy subgradient using the Monte Carlo method with M samples, and run the stochastic subgradient method; compare to:

• certainty equivalent: minimize fce(x) = max_{i=1,…,m} (E aiᵀx + E bi)
• heuristic: minimize fheur(x) = max_{i=1,…,m} (E aiᵀx + E bi + λ‖x‖₂)
problem instance: n = 20, m = 100, ai ∼ N(āi, 5I), b ∼ N(b̄, 5I), ‖ai‖₂ ≈ 5, ‖b‖₂ ≈ 10; xstoch computed using M = 100

[figure] histograms of f(xce), f(xheur), and f(xstoch), with values roughly between 1 and 3
f⋆ ≈ 1.34, estimated by running the method with M = 1000 for a long time

[figure] fbest(k) − f̂⋆ versus k = 0, …, 2000 for M = 1, 10, 100, 1000 (vertical log scale, 10⁰ down to 10⁻³)
On-line learning and adaptive signal processing

• (x, y) ∈ Rⁿ × R have some joint distribution
• find weight vector w ∈ Rⁿ for which wᵀx is a good estimator of y
• choose w to minimize the expected value of a convex loss function l:

    J(w) = E l(wᵀx − y)

  – l(u) = u²: mean-square error
  – l(u) = |u|: mean-absolute error
• at each step (e.g., time sample), we are given a sample (x(k), y(k)) from the distribution
noisy unbiased subgradient of J at w(k), based on sample (x(k+1), y(k+1)):

    g̃(k) = l′(w(k)ᵀx(k+1) − y(k+1)) x(k+1)

where l′ is the derivative (or a subgradient) of l

on-line algorithm:

    w(k+1) = w(k) − αk l′(w(k)ᵀx(k+1) − y(k+1)) x(k+1)

• for l(u) = u², gives the LMS (least mean-square) algorithm
• for l(u) = |u|, gives the sign algorithm
• w(k)ᵀx(k+1) − y(k+1) is the prediction error
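A minimal sketch of the sign algorithm (l(u) = |u|, so l′(u) = sign(u)) on synthetic data; the target weight vector and noise level below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10
w_true = rng.standard_normal(n)        # hypothetical target weights

w = np.zeros(n)
err0 = np.linalg.norm(w - w_true)
for k in range(1, 3001):
    x = rng.standard_normal(n)                   # fresh sample (x, y)
    y = w_true @ x + 0.1 * rng.standard_normal()
    err = w @ x - y                              # prediction error
    w = w - (1.0 / k) * np.sign(err) * x         # sign algorithm, alpha_k = 1/k
```

Replacing `np.sign(err)` with `2 * err` (i.e., l′ for l(u) = u²) turns this into the LMS recursion.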
Example: Mean-absolute error minimization

problem instance: n = 10, (x, y) ∼ N(0, Σ), Σ random with E(y²) ≈ 12, αk = 1/k

[figure] prediction error versus k = 0, …, 300 (values roughly between −40 and 40)
[figure] empirical distribution of prediction error for w⋆ (over 1000 samples), roughly between −2 and 2