3. Policy Gradient

Q. Why do we use the action-value function rather than the state-value function in model-free reinforcement learning?

Q. ์œ„์—์„œ Action-value function Q๋ฅผ ์—…๋ฐ์ดํŠธ ํ•  ๋•Œ total discounted return Gt๋ฅผ Q๋ฅผ ์ด์šฉํ•ด ๊ทผ์‚ฌํ•œ๋‹ค.

Both the state-value function V and Q are functions of the expected future return; the difference is whether it is the return obtainable from a given state, or from a given state and a given action.
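
Written out (standard definitions; $\gamma$ is the discount factor):

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$$

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[G_t \mid S_t = s\right], \qquad Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[G_t \mid S_t = s, A_t = a\right]$$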

์ด ๋•Œ, V์—์„œ ์ƒ˜ํ”Œ๋งํ•  ์ˆ˜ ์—†์„๊นŒ? ์™œ model-free์—์„œ๋Š” V๋ณด๋‹ค Q๊ฐ€ ํ•„์š”ํ• ๊นŒ?

A. A model-free approach can learn the state-value function V for a given policy, but the learned V alone cannot be used to select actions.

๊ทธ ์ด์œ ๋Š” action์— ๋”ฐ๋ผ ๋‹ค์Œ์— ์–ด๋–ค state๊ฐ€ ๋ฐœ์ƒํ• ์ง€ ์˜ˆ์ธกํ•  ๋ฐฉ๋ฒ•์ด ์—†๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ์ด๋Š” ๋ชจ๋‘ action ๊ฐ’์„ ์‚ฌ์šฉํ•ด์•ผ ํ•˜๋Š” ๋ชฌํ…Œ์นด๋ฅผ๋กœ(Monte-Carlo) ์ œ์–ด, SARSA ๋ฐ Q-๋Ÿฌ๋‹๊ณผ ๊ฐ™์€ ๊ฒฝ์šฐ์—์„œ ๋ฌธ์ œ๋‹ค. Actor-Critic ๋ฐฉ๋ฒ•์—์„œ์™€ ๊ฐ™์ด policy gradient์™€ ๊ฒฐํ•ฉํ•˜๋ฉด ๋ณ„๋„์˜ ์ œ์–ด ๊ฐ€๋Šฅํ•œ policy์„ ์ œ๊ณตํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋ฌธ์ œ๊ฐ€ ๋˜์ง€ ์•Š๋Š”๋‹ค.

RL Agent Taxonomy

RL ์•Œ๊ณ ๋ฆฌ์ฆ˜์—๋Š” ์•„๋ž˜์™€ ๊ฐ™์€ ์ข…๋ฅ˜๊ฐ€ ์žˆ๋‹ค.

  • Policy gradients: directly differentiate the objective $J(\theta)$ with respect to the policy parameters $\theta$
    • REINFORCE
    • Natural policy gradient
    • Trust Region Policy Optimization (TRPO)
  • Value-based RL: estimate the value function or Q-function of the optimal policy
    • Q-learning, DQN
    • Temporal Difference Learning
    • Fitted value iteration
  • Actor-critic RL: estimate the value function or Q-function of the current policy and use it to improve the policy
    • Asynchronous advantage actor-critic (A3C)
    • Soft Actor-Critic (SAC), DDPG
  • Model-based RL: estimate the transition model and use it for planning or for policy improvement

Model-based algorithms

There are several options for how to improve the policy:

  • Model๋ฅผ ์ด์šฉํ•œ๋‹ค(policy X)
    • ๊ฒฝ๋กœ ์ตœ์ ํ™” ํ˜น์€ ์ตœ์  ์ œ์–ดํ•œ๋‹ค
    • Discrete action spaces ์—์„œ planningํ•œ๋‹ค.( e.g. Monte Carlo tree search)
  • Policy์˜ gradient๋กœ ์—ญ์ „ํŒŒ(backpropagate)ํ•˜๊ธฐ
    • ์—ฐ์†์ ์ธ ์ตœ์ ํ™” ๋ฐฉ์‹์„ ์‚ฌ์šฉ
    • numerically unstable ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์•ฝ๊ฐ„์˜ trick์ด ํ•„์š”ํ•˜๋‹ค
  • Value function์„ ํ•™์Šตํ•˜๊ธฐ ์œ„ํ•ด model์„ ์‚ฌ์šฉํ•œ๋‹ค
    • Dynamic programming
    • Generate simulated experience for model-free learner

Question

  • ์œ„์˜ ๋‚ด์šฉ๋“ค์ด ์–ด๋–ค ๋œป์ธ์ง€ ์ œ๋Œ€๋กœ ์ดํ•ด ๋ชปํ•จ

Model-based RL์€ ๋ฐ์ดํ„ฐ๋กœ๋ถ€ํ„ฐ MDP๋ฅผ ์ถ”์ •ํ•œ๋‹ค. ๋”ฐ๋ผ์„œ ๋ฐ์ดํ„ฐ๋ฅผ ํšจ์œจ์ ์œผ๋กœ ์‚ฌ์šฉํ•˜์ง€๋งŒ ๋ณต์žกํ•œ(high-demensional) state space์—์„œ๋Š” ์ข‹์ง€ ์•Š๋‹ค.

Value function-based algorithms

  • Learnt value function
  • Implicit policy (e.g. ε-greedy)
  • Examples
    • Q-learning, DQN
    • Temporal Difference Learning
    • Fitted value iteration

Direct policy gradients

Policy based์—๋Š”

  • No value function
  • Learnt policy
  • Examples
    • REINFORCE
    • Natural policy gradient
    • Trust Region Policy Optimization (TRPO)

Actor-critic: value functions + policy gradients

  • Learnt value function
  • Learnt policy
  • Examples
    • Asynchronous advantage actor-critic (A3C)
    • Soft Actor-Critic (SAC), DDPG

์™œ ์ด๋ ‡๊ฒŒ ๋งŽ์€ ์•Œ๊ณ ๋ฆฌ์ฆ˜๋“ค์ด ์žˆ๋Š” ๊ฑธ๊นŒ?

  1. Different tradeoffs

    Because problems differ, the method that works best differs from problem to problem.

    • Sample efficiency: how many samples are needed to obtain a good policy
      • Depends on whether the algorithm is off-policy or on-policy.
      • Off-policy: the policy can be improved without generating new samples.
      • On-policy: new samples must be generated every time the policy changes.
      • This can be compared across algorithms.
      • The more sample-efficient a method is, the more assumptions it needs (and it can still take more wall-clock time, e.g. to fit a good model).
    • Stability & ease of use
  2. Different assumptions

    ๋ฌธ์ œ์— ๋Œ€ํ•œ ๊ฐ€์ •๋“ค์ด ๋‹ค๋ฅด๋‹ค.

    • Stochastic or deterministic?
    • Continuous or discrete?
    • Episodic or infinite horizon?
  3. Different things are easy or hard in different settings

    ๊ฐ๊ฐ์ด ๊ฐ€์ง„ ์žฅ๋‹จ์ (1๋ฒˆ์ด๋ž‘ ์œ ์‚ฌ)

    • Easier to represent the policy?
    • Easier to represent the model?

On-Policy Control with SARSA

SARSA is short for State-Action-Reward-State-Action; it is an on-policy TD control method.

Policy improvement์—๋Š” - greedy ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•œ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด Action์„ ์„ ํƒํ•˜๊ณ  ์ด๋ฅผ ๋ฐ˜๋ณตํ•œ๋‹ค.

Off-Policy Control with Q-Learning

When the behaviour and target policies are not both the policy being improved, i.e. the target policy is improved while a separate behaviour policy generates the data, the method is off-policy.

Q-learning์€ SARSA์™€ ์œ ์‚ฌํ•˜์ง€๋งŒ policy๋ฅผ ์ด์šฉํ•ด ๋‹ค์Œ action์„ ์„ ํƒํ•˜๊ธฐ ๋ณด๋‹ค greedyํ•œ ๋ฐฉ์‹์„ ์ทจํ•œ๋‹ค.

์œ„์˜ SARSA pseudocode์™€ ๋น„๊ตํ•ด๋ณด๋ฉด Action์„ updateํ•˜์ง€ ์•Š๋Š”๋‹ค. ๋Œ€์‹  policy๋กœ๋ถ€ํ„ฐ ๊ณ„์† Action์„ ์„ ํƒํ•œ๋‹ค.

Policy Gradient

The difficulty with solving large MDPs is that there are too many states and actions to keep in memory, and learning a value for each state individually is far too slow.

To address this:

  • Estimate the value function with function approximation.
  • Generalize from observed states to unseen states.
  • Update the parameters $w$ using MC or TD learning (see the sketch after this list).
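
For example, a single TD(0) step with a linear approximator $\hat{v}(s; w) = w^{\top} x(s)$, where `features(s)` is a hypothetical state-to-feature-vector map:

```python
import numpy as np

def td0_linear_update(w, features, s, r, s_next, done, alpha=0.01, gamma=0.99):
    """One semi-gradient TD(0) update of w for v_hat(s) = w . x(s)."""
    x = features(s)
    v = w @ x
    v_next = 0.0 if done else w @ features(s_next)
    td_error = r + gamma * v_next - v
    # Moving w along x(s) also changes v_hat for every state with
    # similar features -- this is how approximation generalizes.
    return w + alpha * td_error * x
```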

Function approximation์—๋Š” ์—ฌ๋Ÿฌ ๋ฐฉ๋ฒ•์ด ์žˆ๋‹ค.

  • Linear combinations of features
  • Neural network
  • Decision tree
  • Nearest neighbour
  • Fourier / wavelet basis

Restricting attention to differentiable approximators leaves linear combinations of features, neural networks, and Gaussian processes.

Advantages of Policy-Based RL

We have briefly covered policy-based RL above. Its advantages and disadvantages are as follows.

Advantages

  • Better convergence properties
  • Effective in high-dimensional or continuous action spaces
  • Can learn stochastic policies (e.g. in rock-paper-scissors the optimal policy is uniformly random, which no deterministic policy can represent)

Disadvantages

  • Likely to converge to a local rather than the global optimum
  • Policy evaluation is typically inefficient and high variance

Policy Optimization

Policy-based RL์€ ์ตœ์ ํ™” ๋ฌธ์ œ์ด๋‹ค(an optimization problem)

There are several families of methods for solving it.

Methods that do not use the gradient include:

  • Hill climbing
  • Simplex / amoeba / Nelder Mead
  • Genetic algorithms

Using the gradient is usually more efficient:

  • Gradient ascent
  • Conjugate gradient
  • Quasi-Newton
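
As a sketch of the gradient-based route, a minimal REINFORCE-style update for a softmax policy over hypothetical state features `features(s)`; it ascends $\nabla_\theta J(\theta) \approx \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t$ estimated from one sampled episode:

```python
import numpy as np

def softmax_policy(theta, x):
    # π_θ(a | s): softmax over per-action scores theta[a] . x.
    logits = theta @ x
    p = np.exp(logits - logits.max())
    return p / p.sum()

def reinforce_update(theta, episode, features, alpha=0.01, gamma=0.99):
    """One gradient-ascent step on a single episode of (s, a, r) tuples."""
    G = 0.0
    grad = np.zeros_like(theta)
    for s, a, r in reversed(episode):
        G = r + gamma * G                # return G_t from time t onward
        x = features(s)
        p = softmax_policy(theta, x)
        # ∇_θ log π_θ(a|s) for a softmax policy: rows are (1{a'=a} - π(a'|s)) x
        grad_log_pi = -np.outer(p, x)
        grad_log_pi[a] += x
        grad += G * grad_log_pi
    return theta + alpha * grad          # gradient ascent on J(θ)
```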