
Reinforcement Learning-1: Basic Concepts

Basic Concepts

A grid-world example

  • State: The status of the agent with respect to the environment.
    • In the grid-world example, the agent's position is the state.
  • State Space: The set of all states is the state space.
  • Action: For each state, there are some possible actions.
    • In the grid-world example, the moves the agent can take at each step (e.g., moving up or down) are the actions.
  • Action Space: The set of all actions available at a state is the action space.
    The action space depends on the current state.
  • State Transition: When taking an action, the agent may move from one state to another. Such a process is called state transition.
    \[
    s_1 \xrightarrow{a_2} s_2
    \]
  • Tabular representation: State transitions can be listed in a table, but a table can only describe deterministic cases.
  • State transition probability: Conditional probabilities are needed to describe stochastic cases (see the sampling sketch below).
    \[
    p(s_2 | s_1, a_2) = 0.4 \\
    p(s_1 | s_1, a_2) = 0.6
    \]
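
A minimal sketch of how such a stochastic transition can be represented and sampled in code. The state/action names and probabilities below simply mirror the example above and are purely illustrative, not part of any particular RL library:

```python
import random

# Transition probabilities p(s' | s, a), mirroring the example above:
# from s1, taking a2 reaches s2 with probability 0.4 and stays in s1 with probability 0.6.
transition_probs = {
    ("s1", "a2"): {"s2": 0.4, "s1": 0.6},
}

def sample_next_state(state, action):
    """Sample s' ~ p(. | s, a) from the table above."""
    dist = transition_probs[(state, action)]
    states, probs = zip(*dist.items())
    return random.choices(states, weights=probs, k=1)[0]

print(sample_next_state("s1", "a2"))  # "s2" about 40% of the time, "s1" about 60%
```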

  • Policy: A policy tells the agent what actions to take at a state.
    • Following a policy generates a trajectory.
    • A stochastic policy is again expressed as a conditional probability.
      \[
      \pi(a_1 | s_1) = 0.5
      \]
    • The tabular representation of a policy can describe both deterministic and stochastic policies.
    • When programming, these probabilities are realized by drawing random numbers (see the policy-sampling sketch after this list).
  • Reward: A real number we get after taking an action.
    • A positive reward represents encouragement; a negative reward represents punishment.
    • The reward can be understood as a human-machine interface: it is how we tell the agent what behavior we want.
    • The tabular representation of a reward can only describe deterministic rewards; in general, the conditional probability $p(r|s,a)$ is needed, e.g. a deterministic reward written in this notation:
      \[
      p(r = -1 | s_1, a_1) = 1
      \]
    • Studying hard should earn a reward, but how much reward is obtained is not certain (rewards can be stochastic).
    • The reward encourages the action taken, not the next state reached. (It emphasizes the process rather than the result.)
  • Trajectory: A trajectory is a state-action-reward chain.
    \[
    s_1 \xrightarrow{a_2, r=0} s_2 \xrightarrow{a_3, r=0} s_5 \xrightarrow{a_3, r=0} s_8 \xrightarrow{a_2, r=1} s_9
    \]
  • Return: The sum of all rewards collected along a trajectory is the return.
    • It is used to evaluate how good a policy is.
  • Discounted Return: the sum of rewards weighted by powers of the discount rate (see the discounted-return sketch after this list).
    \[
    \mathrm{Discounted\ Return} = \sum_{t=0}^{\infty} \gamma^{t} r_{t+1}
    \]
    • $\gamma \in [0, 1)$ is called the discount rate. A smaller $\gamma$ tends to make the agent more short-sighted (it cares more about near-term rewards); a larger $\gamma$ tends to make it more far-sighted (it cares more about rewards far in the future).
  • Episode: When interacting with the environment following a policy, the agent may stop at some terminal states. The resulting trajectory is called an episode (or a trial).
    • An episode is a finite trajectory. (Tasks whose trajectories terminate are called episodic tasks; tasks that go on forever are continuing tasks.)
    • There are two ways to convert episodic tasks into continuing tasks:
      • First, treat the terminal state as an absorbing state: its action space only contains staying in place, and its reward is always 0.
      • Second, treat the terminal state as a normal state: the agent may leave it again and receives a positive reward every time it re-enters it.
      • The second approach encourages more active exploration and tends to give better learning results.
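
Policy-sampling sketch: as noted above, when programming, a probability such as $\pi(a_1|s_1)=0.5$ is realized by drawing random numbers. A minimal sketch with illustrative state/action names, assuming nothing beyond the example policy given earlier:

```python
import random

# A stochastic policy pi(a | s): at s1, take a1 or a2 with probability 0.5 each.
policy = {
    "s1": {"a1": 0.5, "a2": 0.5},
}

def sample_action(state):
    """Sample a ~ pi(. | s) by drawing a random number."""
    dist = policy[state]
    actions, probs = zip(*dist.items())
    return random.choices(actions, weights=probs, k=1)[0]

# Repeated sampling approximates the probabilities: roughly half a1, half a2.
counts = {"a1": 0, "a2": 0}
for _ in range(10000):
    counts[sample_action("s1")] += 1
print(counts)
```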
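
Discounted-return sketch: the discounted return of a finite trajectory can be computed directly from its reward sequence. A small sketch using the rewards $0, 0, 0, 1$ from the example trajectory above and an illustrative $\gamma = 0.9$:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute sum_t gamma^t * r_{t+1} over a finite reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Rewards along the example trajectory s1 -> s2 -> s5 -> s8 -> s9.
rewards = [0, 0, 0, 1]
print(discounted_return(rewards))        # 0.9**3 = 0.729
print(discounted_return(rewards, 0.5))   # 0.5**3 = 0.125, a more short-sighted agent
```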

Markov Decision Process (MDP)

Key Elements

  • Sets:
    • Set of States: $\mathcal{S}$
    • Set of Actions: $\mathcal{A}(s)$, associated with each state $s\in \mathcal{S}$.
    • Set of Rewards: $\mathcal{R}(s,a)$
  • Probability Distribution:
    • State transition probability: $p(s'|s,a)$
    • Reward probability: $p(r|s,a)$
  • Policy: $\pi(a|s)$
  • Markov Property: the memoryless property, i.e., the next state and reward depend only on the current state and action, not on the earlier history.
    \[
    \begin{aligned}
    p(s_{t+1}|a_{t+1}, s_{t},\dots,a_1,s_0) &= p(s_{t+1}|a_{t+1}, s_{t}) \\
    p(r_{t+1}|a_{t+1}, s_{t},\dots,a_1,s_0) &= p(r_{t+1}|a_{t+1}, s_{t})
    \end{aligned}
    \]
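
Putting these elements together, a minimal sketch of one agent-environment interaction step: the environment is fully specified by $p(s'|s,a)$ and $p(r|s,a)$, the agent by $\pi(a|s)$, and the Markov property shows up in the fact that the step function needs only the current state. All names and numbers below are illustrative placeholders, not part of any library.

```python
import random

# Agent: policy pi(a | s).
policy = {"s1": {"a2": 1.0}, "s2": {"a3": 1.0}}

# Environment: p(s' | s, a) and p(r | s, a) fully specify the dynamics.
transition_probs = {("s1", "a2"): {"s2": 0.4, "s1": 0.6},
                    ("s2", "a3"): {"s2": 1.0}}
reward_probs = {("s1", "a2"): {-1: 1.0},
                ("s2", "a3"): {0: 1.0}}

def sample(dist):
    """Draw one outcome from a {value: probability} dictionary."""
    keys, probs = zip(*dist.items())
    return random.choices(keys, weights=probs, k=1)[0]

def step(state):
    """One interaction step; only the current state is needed (Markov property)."""
    action = sample(policy[state])
    next_state = sample(transition_probs[(state, action)])
    reward = sample(reward_probs[(state, action)])
    return action, reward, next_state

# Generate a short state-action-reward chain (a trajectory).
s = "s1"
for _ in range(3):
    a, r, s_next = step(s)
    print(f"{s} --({a}, r={r})--> {s_next}")
    s = s_next
```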