7. Proximal Policy Optimization Algorithms

๋…ผ๋ฌธ Arxiv: https://arxiv.org/abs/1707.06347

๋†€๋ž๊ฒŒ๋„ ์ €๋„์ด๋‚˜ ํ•™ํšŒ์— ์—†๋‹คโ€ฆOpenAI๋ผ์„œ ๊ทธ๋Ÿฐ๊ฐ€

Abstract

PPO๋Š” Policy Gradient (PG)์˜ ํ•˜๋‚˜์ด์ง€๋งŒ data๋ฅผ samplingํ•˜๋Š” ๋Œ€์‹  stochatic gradient ascent( SGD)๋ฅผ ํ†ตํ•ด โ€œsurrogateโ€๋ผ๋Š” ๋ชฉ์  ํ•จ์ˆ˜๋ฅผ ์ตœ์ ํ™”ํ•œ๋‹ค.

๊ธฐ์กด์˜ PG๋Š” data sampling์„ ํ•  ๋•Œ๋งˆ๋‹ค ํ•œ ๋ฒˆ์”ฉ gradient๋ฅผ ์ด๋™ํ•˜์ง€๋งŒ, PPO๋Š” multiple epochs of minibatch update๊ฐ€ ๊ฐ€๋Šฅํ•˜๋‹ค.

Gradient Descent(GD)๋Š” ํ•œ epoch๋‹น ํ•œ ๋ฒˆ minibatch update(=iteration)์„ ํ•œ๋‹ค. SGD๋Š” epochs * minibatch ๋งŒํผ iteration ํ•œ๋‹ค. (Reference)

PPO๋Š” ์•ž์„œ ๋‚˜์˜จ trust region policy opitmization(TRPO)์˜ ์žฅ์  ์ค‘ ์ผ๋ถ€๋ฅผ ๊ฐ€์ง€์ง€๋งŒ ํ›จ์”ฌ ์ ์šฉํ•˜๊ธฐ ์‰ฝ๊ณ  ๋„๋ฆฌ ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•˜๋‹ค.

Introduction

17๋…„๋„ ๋‹น์‹œ์—๋Š” deep Q-learning, vanila PG, TPRO, [2023-11-21-PolicyGradient2|natural policy gradient] ๋“ฑ์ด ๋งŽ์ด ์—ฐ๊ตฌ๋˜์—ˆ๋‹ค. ๊ฐ์ž์˜ ๋‹จ์ ๋“ค์ด ์žˆ์ง€๋งŒ TPRO๋Š” data efficiencyํ•˜๊ณ  ์•ˆ์ •์ ์ธ ์„ฑ๋Šฅ์„ ๋ณด์˜€์ง€๋งŒ, ๋ณต์žกํ•˜๊ณ  ๋…ธ์ด์ฆˆ์— ์ทจ์•ฝํ•œ ๋‹จ์ ์ด ์žˆ์—ˆ๋‹ค.

๋ณธ ๋…ผ๋ฌธ์€ TPRO์˜ ์žฅ์ ์„ ๊ฐ€์ง€๋ฉด์„œ๋„ first-order ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•œ PPO ๋ฐฉ์‹์„ ์ œ์•ˆํ•˜์˜€๋‹ค.

PPO ๋ฐฉ์‹์—๋Š” ์ƒˆ๋กœ์šด ๋ชฉ์  ํ•จ์ˆ˜๊ฐ€ ๋„์ž…๋˜์—ˆ๋Š”๋ฐ clipped probability ratios๋ฅผ ๊ฐ€์ง€๋„๋ก ํ•˜์—ฌ ์„ฑ๋Šฅ์˜ ํ•˜ํ•œ์„ ๊ฐ€์ง€๋„๋ก ํ•˜์˜€๋‹ค. ๋˜ํ•œ, sampled data์— ๋Œ€ํ•ด ์—ฌ๋Ÿฌ epoch์˜ ์ตœ์ ํ™”๋ฅผ ์ˆ˜ํ–‰ํ•œ๋‹ค. (Stochastic)

Background

Policy Gradient

๊ธฐ์กด์˜ PG๋Š” ๋‹ค๋ฅธ action๋ณด๋‹ค ๋‚˜์€ action์„ ์ฐพ๋Š” advantager๋ฅผ ํ†ตํ•ด ๋” ๋‚˜์€ reward๋ฅผ ์ฐพ๋„๋ก ํ•˜์˜€์œผ๋‚˜, step size๊ฐ€ ์ž‘์€ ๊ฒฝ์šฐ์—๋Š” ํ•™์Šต์ด ๋„ˆ๋ฌด ๋А๋ฆฌ๊ณ  ํฌ๋ฉด ๋ณ€ํ™” ํญ์ด ํฌ๋‹ค๋Š” ๋‹จ์ ์ด ์žˆ์—ˆ๋‹ค.

Trust Region Methods

TRPO์—์„œ๋Š” โ€œsurrogateโ€๋ผ๊ณ  ํ•˜๋Š” ๋ชฉ์ ํ•จ์ˆ˜๊ฐ€ ์žˆ๋‹ค.

๋Š” KL divergence๊ฐ€ threshold ์ดํ•˜์—ฌ์•ผ ํ•˜๋ฏ€๋กœ Policy update์˜ ํฌ๊ธฐ์— ๋Œ€ํ•œ ์ œํ•œ์ด๋‹ค.

์—ฌ๊ธฐ์„œ TRPO์˜ ๋ชฉ์ ํ•จ์ˆ˜๋Š” vanilla PG์˜ Loss function์—์„œ ์ค‘์š”์ƒ˜ํ”Œ๋ง์„ ํ†ตํ•ด ๋ฐ”๋€ ๊ฒƒ์ด๋‹ค.

๋ฌธ์ œ๋Š” ์ด ๋ฌธ์ œ๊ฐ€ ํ’€๊ธฐ ์–ด๋ ต๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. constraint์— ๋Œ€ํ•ด์„œ๋Š” quadratic approximation์ด ํ•„์š”ํ•˜๊ณ , objective์— ๋Œ€ํ•ด์„œ๋Š” linear approximation์ด ํ•„์š”ํ•˜๋‹ค. ์ดํ›„ conjugate gradient algorithm์„ ํ†ตํ•ด ๊ทผ์‚ฌํ•˜๊ฒŒ ํ’€ ์ˆ˜ ์žˆ๋‹ค.

์ด ๋Œ€์‹  ์•„๋ž˜์™€ ๊ฐ™์ด constraint ๋Œ€์‹  penalty ๋ฐฉ์‹์œผ๋กœ ์‹์„ ๋ฐ”๊ฟ€ ์ˆ˜ ์žˆ๋‹ค.

์ˆ˜์‹์— ๋Œ€ํ•ด ์ด๋Ÿฌํ•œ ํŽ˜๋„ํ‹ฐ๋Š” ๋Œ€์‹  state์— ๋Œ€ํ•ด KL์„ ๊ณ„์‚ฐํ•˜๋ฏ€๋กœ policy์˜ ์„ฑ๋Šฅ์— ๋Œ€ํ•œ lower bound๋ฅผ ๋งŒ๋“ค๊ฒŒ ๋œ๋‹ค.

๋Œ€์‹  ๋ฌธ์ œ์ ์€ ์ ์ ˆํ•œ ๋ฅผ ์ฐพ๋Š” ๊ฒƒ์ด ์–ด๋ ต๋‹ค.

๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” ์ ์ ˆํ•œ ๋ฅผ ์ฐพ๊ณ  SGD๋ฅผ ํ†ตํ•ด penalize objective function (3)์„ ์ตœ์ ํ™”ํ•˜๋Š” ๊ฒƒ์œผ๋กœ๋Š” ์ถฉ๋ถ„ํ•˜์ง€ ์•Š์Œ์„ ๋ณด์ด๊ณ  ์ถ”๊ฐ€์ ์ธ ์ˆ˜์ •์„ ํ•˜์˜€๋‹ค.

Clipped Surrogate Objective

์ˆ˜์‹ (1)์—์„œ ๋ณด์ธ ratio๋ฅผ the probability ratio ์œผ๋กœ ๋‘”๋‹ค. ๋”ฐ๋ผ์„œ ์ด๋‹ค.

๊ธฐ์กด TRPO์—์„œ์˜ ์ˆ˜์‹ (2)์€ constraint์—†์ด๋Š” ๋งค์šฐ ํฐ policy update ์ด๋‹ค. Why?

๊ทธ๋ž˜์„œ ๋ณธ ๋…ผ๋ฌธ์€ ๋ฅผ ์ตœ๋Œ€ํ•œ 1๋ณด๋‹ค ํฌ๊ฒŒ ๋งŒ๋“œ๋Š” ๋ฐฉ์‹์œผ๋กœ ์ˆ˜์‹์„ ์ˆ˜์ •ํ•˜์˜€๋‹ค.

์•ˆ์— ์žˆ๋Š” ์ฒซ ๋ฒˆ์งธ ์ธ์ž๋Š” ๊ธฐ์กด TRPO์˜ ๋ชฉ์ ํ•จ์ˆ˜์™€ ๊ฐ™๋‹ค. ๋‘ ๋ฒˆ์งธ ์ธ์ž๋Š” probability ratio๋ฅผ hyperparamter ์œผ๋กœ clipping ํ•œ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์ด ๋‘˜ ์ค‘์˜ ์ตœ์†Œ๊ฐ’์„ ์ทจํ•œ๋‹ค.

์ด๊ฒƒ์— ๋Œ€ํ•œ ํ‰๊ท ์„ ๊ตฌํ•จ์œผ๋กœ์จ ๊ธฐ์กด TRPO ๋ชฉ์ ํ•จ์ˆ˜(unclipped objective)์˜ ํ•˜ํ•œ์„ ์–ป๊ฒŒ ๋œ๋‹ค. ์ด๋Ÿฌํ•œ ๋ฐฉ์‹์€ ๊ฒฐ๊ตญ ๋ชฉ์ ํ•จ์ˆ˜๋ฅผ ๊ฐœ์„ ํ•˜๋Š” ๊ฒƒ๋“ค์€ ์ œ์™ธํ•˜๊ณ , ๋” ์•ˆ์ข‹๊ฒŒ ๋งŒ๋“œ๋Š” ๋ฐฉํ–ฅ์„ ํฌํ•จํ•˜๊ฒŒ ๋œ๋‹ค.

advantager ๊ฐ€ ์–‘์ˆ˜์ธ ๊ฒฝ์šฐ์—๋Š” ํ‰๊ท ๋ณด๋‹ค ์ข‹์€ action ๋ผ๋Š” ์˜๋ฏธ๋กœ ๊ฐ€ ์ปค์ง€๊ณ  ์ด๋ฅผ ์œผ๋กœ ์ œํ•œํ•˜๊ณ , ์Œ์ˆ˜์ธ ๊ฒฝ์šฐ์—๋Š” ์ž‘์•„์ง€๋Š” ๋ฅผ ์œผ๋กœ ์ œํ•œํ•œ๋‹ค.

์œ„ Figure 2๋Š” ๋ชฉ์ ํ•จ์ˆ˜๊ฐ€ policy update์— ๋”ฐ๋ผ ์–ด๋–ป๊ฒŒ ๋ณ€ํ•˜๋Š”์ง€ ๋ณด์—ฌ์ฃผ๋Š”๋ฐ, (5) ๊ฐ€ ๊ธฐ์กด TRPO์˜ ์˜ ํ•˜ํ•œ์ด ๋จ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค

Adaptive KL Penalty Coefficient

์ด์ „ ์„น์…˜์—์„œ์˜ clipped surrogate objective์— ๋”ํ•ด ์‹ (4)์˜ ๋ชฉ์ ํ•จ์ˆ˜์— ํŽ˜๋„ํ‹ฐ๋ฅผ ๊ฐ€ํ•˜๋Š” ๋ฐฉ์‹๋„ ์ œ์•ˆ๋˜์—ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ๋งค policy update ๋งˆ๋‹ค ์›ํ•˜๋Š” KL divergence ๊ฐ’์„ ์–ด๋А์ •๋„ ๋„๋‹ฌํ•˜๋„๋ก ๋งŒ๋“ค์–ด์ค€๋‹ค.

๋น„๋ก ์ด์ „์˜ clipped surrogate objective ๋ฐฉ์‹๋ณด๋‹ค ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ณด์ด์ง€๋Š” ์•Š์•˜๋‹ค.

๋งค policy update ๋งˆ๋‹ค ์•„๋ž˜์— ๋Œ€ํ•ด ์ˆ˜ํ–‰ํ•œ๋‹ค.

  • SGD๋ฅผ ํ†ตํ•ด KL-penalized objective๋ฅผ ์ตœ์ ํ™”ํ•œ๋‹ค.

  • ๋ฅผ ๊ณ„์‚ฐํ•œ๋‹ค.

๊ฒฐ๊ตญ ์ด ๋ฐฉ์‹์€ ๊ธฐ์กด TRPO ์ตœ์ ํ™” ๋ฌธ์ œ๋ฅผ ํŽ˜๋„ํ‹ฐ ๋ฐฉ์‹์œผ๋กœ ๋ฐ”๊พผ ๊ฒƒ์—์„œ ๋ฅผ ์ ์ ˆํžˆ ์ฐพ๋Š” ๋ฐฉ์‹์„ ์ œ์•ˆํ•œ ๊ฒƒ์œผ๋กœ ๋ณด์ธ๋‹ค.

1.5์™€ 2 ์—ฐ์‚ฐ์€ heuristicํ•˜๊ฒŒ ์ •ํ•ด์ง„ ๊ฒƒ์ด๋‚˜ ์ด ๋ณ€์ˆ˜๋“ค์— ํฌ๊ฒŒ ๋ฏผ๊ฐํ•˜์ง€ ์•Š๊ณ  ์˜ ์ดˆ๊ธฐ๊ฐ’๋„ ๋ฐ”๋กœ ๋ณ€ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๊ฐ’ ์„ค์ •์— ํฐ ๋ถ€๋‹ด์ด ์—†๋‹ค.

Algorithm

๋ณธ ๋…ผ๋ฌธ์˜ ํฐ ์žฅ์  ์ค‘ ํ•˜๋‚˜๋Š” ๊ธฐ์กด vanila policy gradient์—์„œ ๋ช‡๊ฐ€์ง€์˜ ์ฝ”๋“œ ์ˆ˜์ •์œผ๋กœ ์‰ฝ๊ฒŒ ์ ์šฉ์ด ๊ฐ€๋Šฅํ•˜๋‹ค๋Š” ๊ฒƒ์ด๋‹ค.

ํ•˜๋‚˜๋Š” ๋Œ€์‹  ๋˜๋Š” ์„ ์ ์šฉํ•˜๋Š” ๊ฒƒ์ด๊ณ  ๋‹ค๋ฅธ ํ•˜๋‚˜๋Š” ์—ฌ๋Ÿฌ step์˜ SGD๋ฅผ ์ˆ˜ํ–‰ํ•˜๋„๋ก ๋ฐ”๊ฟ”์ฃผ๋ฉด ๋œ๋‹ค.

PPO๋Š” ๊ณ ์ •๋œ ๊ถค์ ์˜ ์š”์†Œ๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค. ๋งค ๋ฐ˜๋ณต๋งˆ๋‹ค ๊ฐ N ๊ฐœ์˜ ๋ณ‘๋ ฌ์ ์ธ actor๋“ค์ด T ์‹œ๊ฐ„๋งŒํผ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ˆ˜์ง‘ํ•œ๋‹ค. ๊ทธ๋ฆฌ๊ณ  surrogate loss๋ฅผ NT ๊ฐœ์˜ data๋กœ ๊ตฌ์„ฑํ•˜์—ฌ K epoch๋งŒํผ minibatch SGD ๋ฐฉ์‹์œผ๋กœ ์ตœ์ ํ™”ํ•œ๋‹ค.

variance-reduced advantage-function estimator๋ฅผ ๊ณ„์‚ฐํ•˜๊ธฐ ์œ„ํ•ด ํ•™์Šต๋œ state-value function ๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค.

policy function๊ณผ value function ๊ฐ„์˜ parameter๋ฅผ ๊ณต์œ ํ•˜๋Š” NN ๊ตฌ์กฐ์—์„œ PPO๋Š” loss function์—์„œ policy surrogate์™€ value function error term์ด ๊ฒฐํ•ฉ๋œ ๊ฒƒ์„ ์‚ฌ์šฉํ•ด์•ผ ํ•œ๋‹ค.

์„ ํ–‰ ์—ฐ๊ตฌ์—์„œ entropy bonus๋ฅผ ์ถ”๊ฐ€ํ•œ ๋ฐฉ์‹๋„ ๋”ํ•ด ์•„๋ž˜์™€ ๊ฐ™์ด ๋ชฉ์  ํ•จ์ˆ˜๋ฅผ ๊ตฌ์„ฑํ•จ.

๋Š” ๊ณ„์ˆ˜์ด๊ณ , ๊ฐ€ entorpy bonus์— ํ•ด๋‹นํ•œ๋‹ค. ๋Š” squared-error loss ์ด๋‹ค.

์„ ํ–‰ ์—ฐ๊ตฌ์—์„œ RNN์— ์ ํ•ฉํ•˜๊ณ  T ์‹œ๊ฐ„ ๋งŒํผ policy๋ฅผ ์ˆ˜ํ–‰ํ•˜์—ฌ ๋ฐ์ดํ„ฐ๋ฅผ ์ˆ˜์ง‘ํ•˜๋Š” ๋ฐฉ์‹์„ ์ ์šฉํ•˜์˜€๋Š”๋ฐ, ์ด ๋•Œ์˜ advantage estimator๋Š” ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

step์ด ์ž‘์œผ๋ฉด ํŽธํ–ฅ์ด ํฌ๊ณ  ๋ถ„์‚ฐ์ด ๋‚ฎ์ง€๋งŒ, step์ด ํฐ ๊ฒฝ์šฐ ํŽธํ–ฅ์€ ์ž‘์ง€๋งŒ ๋ถ„์‚ฐ์ด ์ปค์ง„๋‹ค. ๋”ฐ๋ผ์„œ Generalized Advnatage Estimation ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•˜์˜€๋‹ค.

์œ„ ์ˆ˜์‹ (7)์„ ์ผ๋ฐ˜ํ™”ํ•˜์—ฌ ๋กœ ๋‘๊ณ  ์•„๋ž˜์˜ truncated version of generalized advantage estimation์„ ์‚ฌ์šฉํ•˜์˜€๋‹ค.

๋…ผ๋ฌธ ์š”์•ฝ ๋!

๊ฒฐ๊ตญ PPO๋Š” ํ˜„์žฌ์™€ ์ด์ „ policy์˜ ๋น„์œจ๊ณผ clipping ๋ฐฉ์‹์„ ํ†ตํ•ด ํ•™์Šต์˜ ์•ˆ์ •์„ฑ์„ ๋†’์ด๋ฉด์„œ ๋„ˆ๋ฌด ํฐ policy update๋ฅผ ํ”ผํ•˜๋„๋ก ํ•˜์˜€๋‹ค.

References