
Reinforcement Learning-1: Basic Concepts

Basic Concepts

A grid-world example

  • State: The status of the agent with respect to the environment.
    • In the grid-world example, the agent's position is the state.
  • State Space: The set of all states is the state space.
  • Action: For each state, there are some possible actions.
    • In the grid-world example, the moves the agent can take at each step (e.g., moving up or down) are the actions.
  • Action Space: The set of all actions available at a state is the action space.
    The action space depends on the current state.
  • State Transition: When taking an action, the agent may move from one state to another. Such a process is called state transition.
    \[
    s_1 \xrightarrow{a_2} s_2
    \]
  • Tabular representation: State transitions can be listed in a table, but a table can only describe deterministic cases.
  • State transition probability: Conditional probabilities are needed to describe stochastic cases (see the sampling sketch below).
    \[
    p(s_2 | s_1, a_2) = 0.4 \\
    p(s_1 | s_1, a_2) = 0.6
    \]
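
A minimal sketch of how such a stochastic transition can be represented and sampled in code. The state/action names and probabilities below simply mirror the example above and are purely illustrative, not part of any particular RL library:

```python
import random

# Transition probabilities p(s' | s, a), mirroring the example above:
# from s1, taking a2 reaches s2 with probability 0.4 and stays in s1 with probability 0.6.
transition_probs = {
    ("s1", "a2"): {"s2": 0.4, "s1": 0.6},
}

def sample_next_state(state, action):
    """Sample s' ~ p(. | s, a) from the table above."""
    dist = transition_probs[(state, action)]
    states, probs = zip(*dist.items())
    return random.choices(states, weights=probs, k=1)[0]

print(sample_next_state("s1", "a2"))  # "s2" about 40% of the time, "s1" about 60%
```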

  • Policy: A policy tells the agent what actions to take at a state.
    • Following a policy generates a trajectory.
    • A stochastic policy is again expressed as a conditional probability.
      \[
      \pi(a_1 | s_1) = 0.5
      \]
    • The tabular representation of a policy can describe both deterministic and stochastic policies.
    • When programming, these probabilities are realized by drawing random numbers (see the policy-sampling sketch after this list).
  • Reward: A real number we get after taking an action.
    • A positive reward represents encouragement; a negative reward represents punishment.
    • The reward can be understood as a human-machine interface: it is how we tell the agent what behavior we want.
    • The tabular representation of a reward can only describe deterministic rewards; in general, the conditional probability $p(r|s,a)$ is needed, e.g. a deterministic reward written in this notation:
      \[
      p(r = -1 | s_1, a_1) = 1
      \]
    • Studying hard should earn a reward, but how much reward is obtained is not certain (rewards can be stochastic).
    • The reward encourages the action taken, not the next state reached. (It emphasizes the process rather than the result.)
  • Trajectory: A trajectory is a state-action-reward chain.
    \[
    s_1 \xrightarrow{a_2, r=0} s_2 \xrightarrow{a_3, r=0} s_5 \xrightarrow{a_3, r=0} s_8 \xrightarrow{a_2, r=1} s_9
    \]
  • Return: The sum of all rewards collected along a trajectory is the return.
    • It is used to evaluate how good a policy is.
  • Discounted Return: the sum of rewards weighted by powers of the discount rate (see the discounted-return sketch after this list).
    \[
    \mathrm{Discounted\ Return} = \sum_{t=0}^{\infty} \gamma^{t} r_{t+1}
    \]
    • $\gamma \in [0, 1)$ is called the discount rate. A smaller $\gamma$ tends to make the agent more short-sighted (it cares more about near-term rewards); a larger $\gamma$ tends to make it more far-sighted (it cares more about rewards far in the future).
  • Episode: When interacting with the environment following a policy, the agent may stop at some terminal states. The resulting trajectory is called an episode (or a trial).
    • An episode is a finite trajectory. (Tasks whose trajectories terminate are called episodic tasks; tasks that go on forever are continuing tasks.)
    • There are two ways to convert episodic tasks into continuing tasks:
      • First, treat the terminal state as an absorbing state: its action space only contains staying in place, and its reward is always 0.
      • Second, treat the terminal state as a normal state: the agent may leave it again and receives a positive reward every time it re-enters it.
      • The second approach encourages more active exploration and tends to give better learning results.
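
Policy-sampling sketch: as noted above, when programming, a probability such as $\pi(a_1|s_1)=0.5$ is realized by drawing random numbers. A minimal sketch with illustrative state/action names, assuming nothing beyond the example policy given earlier:

```python
import random

# A stochastic policy pi(a | s): at s1, take a1 or a2 with probability 0.5 each.
policy = {
    "s1": {"a1": 0.5, "a2": 0.5},
}

def sample_action(state):
    """Sample a ~ pi(. | s) by drawing a random number."""
    dist = policy[state]
    actions, probs = zip(*dist.items())
    return random.choices(actions, weights=probs, k=1)[0]

# Repeated sampling approximates the probabilities: roughly half a1, half a2.
counts = {"a1": 0, "a2": 0}
for _ in range(10000):
    counts[sample_action("s1")] += 1
print(counts)
```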
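
Discounted-return sketch: the discounted return of a finite trajectory can be computed directly from its reward sequence. A small sketch using the rewards $0, 0, 0, 1$ from the example trajectory above and an illustrative $\gamma = 0.9$:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute sum_t gamma^t * r_{t+1} over a finite reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Rewards along the example trajectory s1 -> s2 -> s5 -> s8 -> s9.
rewards = [0, 0, 0, 1]
print(discounted_return(rewards))        # 0.9**3 = 0.729
print(discounted_return(rewards, 0.5))   # 0.5**3 = 0.125, a more short-sighted agent
```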

Markov Decision Process (MDP)

Key Elements

  • Sets:
    • Set of States: $\mathcal{S}$
    • Set of Actions: $\mathcal{A}(s)$, associated with each state $s\in \mathcal{S}$.
    • Set of Rewards: $\mathcal{R}(s,a)$
  • Probability Distribution:
    • State transition probability: $p(s'|s,a)$
    • Reward probability: $p(r|s,a)$
  • Policy: $\pi(a|s)$
  • Markov Property: the memoryless property, i.e., the next state and reward depend only on the current state and action, not on the earlier history.
    \[
    \begin{aligned}
    p(s_{t+1}|a_{t+1}, s_{t},\dots,a_1,s_0) &= p(s_{t+1}|a_{t+1}, s_{t}) \\
    p(r_{t+1}|a_{t+1}, s_{t},\dots,a_1,s_0) &= p(r_{t+1}|a_{t+1}, s_{t})
    \end{aligned}
    \]
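
Putting these elements together, a minimal sketch of one agent-environment interaction step: the environment is fully specified by $p(s'|s,a)$ and $p(r|s,a)$, the agent by $\pi(a|s)$, and the Markov property shows up in the fact that the step function needs only the current state. All names and numbers below are illustrative placeholders, not part of any library.

```python
import random

# Agent: policy pi(a | s).
policy = {"s1": {"a2": 1.0}, "s2": {"a3": 1.0}}

# Environment: p(s' | s, a) and p(r | s, a) fully specify the dynamics.
transition_probs = {("s1", "a2"): {"s2": 0.4, "s1": 0.6},
                    ("s2", "a3"): {"s2": 1.0}}
reward_probs = {("s1", "a2"): {-1: 1.0},
                ("s2", "a3"): {0: 1.0}}

def sample(dist):
    """Draw one outcome from a {value: probability} dictionary."""
    keys, probs = zip(*dist.items())
    return random.choices(keys, weights=probs, k=1)[0]

def step(state):
    """One interaction step; only the current state is needed (Markov property)."""
    action = sample(policy[state])
    next_state = sample(transition_probs[(state, action)])
    reward = sample(reward_probs[(state, action)])
    return action, reward, next_state

# Generate a short state-action-reward chain (a trajectory).
s = "s1"
for _ in range(3):
    a, r, s_next = step(s)
    print(f"{s} --({a}, r={r})--> {s_next}")
    s = s_next
```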