Basic Concepts
A grid-world example
- State: The status of the agent with respect to the environment.
- In the grid-world example, the agent's location is the state.
- State Space: the set of all states is called the state space.
- Action: For each state, there are some possible actions.
- In the grid-world example, the moves the agent can take at each step (e.g., move up, move down) are its actions.
- Action Space: the set of all actions is called the action space.
- The action space depends on the current state.
- State Transition: When taking an action, the agent may move from one state to another. Such a process is called state transition.
\[
s_1 \xrightarrow{a_2} s_2
\]
- Tabular representation: state transitions can be listed in a table, but a table can only express deterministic cases; see the sketch below.
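A tabular transition model can be written down directly as a lookup table. Below is a minimal Python sketch; the grid layout, the state names s1–s9, and the action names a1–a5 are illustrative assumptions, not something fixed by the definition.

```python
# Hypothetical 3x3 grid world: states s1..s9 (row-major), actions
# a1..a5 = up, right, down, left, stay. A deterministic transition
# table maps each (state, action) pair to exactly one next state.
transition_table = {
    ("s1", "a1"): "s1",  # moving up from s1 hits the boundary: stay at s1
    ("s1", "a2"): "s2",  # moving right from s1 leads to s2
    ("s1", "a3"): "s4",  # moving down from s1 leads to s4
    # ... entries for all remaining (state, action) pairs
}

def next_state(s, a):
    """Deterministic state transition: exactly one next state per (s, a)."""
    return transition_table[(s, a)]

print(next_state("s1", "a2"))  # -> s2
```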
- State transition probability: conditional probabilities are needed to express stochastic cases.
\[
\begin{aligned}
p(s_2 | s_1, a_2) &= 0.4 \\
p(s_1 | s_1, a_2) &= 0.6
\end{aligned}
\]
- Policy: A policy tells the agent what actions to take at a state.
- Given a policy, a trajectory can be generated.
- A stochastic policy is again expressed with conditional probabilities.
\[
\pi(a_1 | s_1) = 0.5
\]
- Tabular representation of a policy can describe both deterministic and stochastic policies.
- In code, these probabilities are realized by sampling with random numbers, as in the sketch below.
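A minimal sketch of that idea using numpy: sample an action from $\pi(a|s)$ and a next state from $p(s'|s,a)$. The probability values come from the examples above; placing the remaining policy mass at s1 on a2 is my assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stochastic policy pi(a | s): a probability distribution over actions.
policy = {
    "s1": {"a1": 0.5, "a2": 0.5},          # pi(a1|s1) = 0.5 as above; rest on a2
}

# Stochastic state transition p(s' | s, a).
transition_prob = {
    ("s1", "a2"): {"s2": 0.4, "s1": 0.6},  # p(s2|s1,a2) = 0.4, p(s1|s1,a2) = 0.6
}

def sample_action(s):
    """Draw an action a ~ pi(. | s)."""
    actions, probs = zip(*policy[s].items())
    return rng.choice(actions, p=probs)

def sample_next_state(s, a):
    """Draw the next state s' ~ p(. | s, a)."""
    states, probs = zip(*transition_prob[(s, a)].items())
    return rng.choice(states, p=probs)

print(sample_action("s1"))            # "a1" or "a2", each with prob 0.5
print(sample_next_state("s1", "a2"))  # "s2" with prob 0.4, "s1" with prob 0.6
```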
- Reward: A real number we get after taking an action.
- A positive reward encourages the corresponding behavior; a negative reward punishes it.
- Reward can be understood as an interface for human-machine interaction: it is how we tell the agent what we want it to do.
- Tabular representation of a reward can only describe deterministic rewards.
\[
p(r = -1 | s_1, a_1) = 1
\]
- For example, studying hard should bring a reward, but how large that reward is may be uncertain, so rewards can also be stochastic.
- The reward encourages the action taken, not the next state reached (the process matters, not just the outcome); both cases appear in the sketch below.
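A small sketch of reward sampling that covers the deterministic table above as a special case; the stochastic reward values under (s1, a2) are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Reward probability p(r | s, a). A deterministic (tabular) reward is just
# the special case where a single value has probability 1.
reward_prob = {
    ("s1", "a1"): {-1: 1.0},         # p(r = -1 | s1, a1) = 1 (deterministic)
    ("s1", "a2"): {0: 0.7, 1: 0.3},  # stochastic reward (assumed values)
}

def sample_reward(s, a):
    """Draw a reward r ~ p(. | s, a)."""
    rewards, probs = zip(*reward_prob[(s, a)].items())
    return rng.choice(rewards, p=probs)

print(sample_reward("s1", "a1"))  # always -1
print(sample_reward("s1", "a2"))  # 0 with prob 0.7, 1 with prob 0.3
```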
- Trajectory: A trajectory is a state-action-reward chain.
\[
s_1 \xrightarrow{a_2, r=0} s_2 \xrightarrow{a_3, r=0} s_5 \xrightarrow{a_3, r=0} s_8 \xrightarrow{a_2, r=1} s_9
\]
- Return: the sum of all rewards collected along a trajectory.
- The return is used to evaluate whether a policy is good or bad.
- Discounted Return:
\[
\mathrm{Discounted\ Return} = \sum_{i} \gamma^{i} r_{i}
\]
- $\gamma \in [0, 1)$ is called the discount rate. A smaller $\gamma$ makes the agent more short-sighted (near-term rewards dominate); a larger $\gamma$ makes it more far-sighted. A small computation sketch follows below.
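The return and discounted return of a concrete trajectory can be computed directly from its reward sequence. A minimal sketch, using the rewards of the trajectory shown earlier:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute sum_i gamma**i * r_i for a finite reward sequence."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Rewards collected along the trajectory s1 -> s2 -> s5 -> s8 -> s9 above.
rewards = [0, 0, 0, 1]

print(discounted_return(rewards, gamma=1.0))  # plain return: 1
print(discounted_return(rewards, gamma=0.9))  # 0.9**3 = 0.729
```

With a smaller $\gamma$ the final reward contributes less (e.g. $0.5^3 = 0.125$), which is the short-sighted behavior mentioned above.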
- Episode: When interacting with the environment following a policy, the agent may stop at some terminal states. The resulting trajectory is called an episode (or a trial).
- An episode is a finite trajectory. (Tasks whose trajectories terminate are called episodic tasks; tasks that go on forever are continuing tasks.)
- There are two ways to convert an episodic task into a continuing task:
- The first makes the terminal state absorbing: its action space only contains staying in place, and the reward there is always 0.
- The second treats the terminal state as an ordinary state: the agent may leave it again and receives a positive reward whenever it re-enters.
- The second approach encourages more active exploration and tends to give better learning results; a rollout sketch follows below.
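A sketch of generating one episode by following a policy until a terminal state, with the first conversion (absorbing terminal state with zero reward) built into the environment model. The step function, the target state s9, and the toy dynamics are assumptions for illustration:

```python
TERMINAL = "s9"  # assumed target/terminal state of the grid world

def step(s, a):
    """Toy environment model returning (next_state, reward).
    The terminal state is made absorbing: every action keeps the agent
    there with reward 0, so the task can also be run as a continuing one."""
    if s == TERMINAL:
        return s, 0
    # ... in a real model, sample from p(s'|s,a) and p(r|s,a) here
    return TERMINAL, 1  # toy dynamics: any action reaches the target

def rollout(policy, s0, max_steps=100):
    """Generate one episode: a finite state-action-reward chain."""
    s, episode = s0, []
    for _ in range(max_steps):
        a = policy(s)
        s_next, r = step(s, a)
        episode.append((s, a, r))
        if s_next == TERMINAL:  # stop recording once the terminal state is reached
            break
        s = s_next
    return episode

print(rollout(lambda s: "a2", "s1"))  # e.g. [("s1", "a2", 1)]
```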
Markov Decision Process (MDP)
Key Elements
- Sets:
- Set of States: $\mathcal{S}$
- Set of Actions: $\mathcal{A}(s)$ is associated with state $s \in \mathcal{S}$.
- Set of Rewards: $\mathcal{R}(s,a)$
- Probability Distribution:
- State transition probability: $p(s'|s,a)$
- Reward probability: $p(r|s,a)$
- Policy: $\pi(a|s)$
- Markov Property: memoryless property
\[
\begin{aligned}
p(s_{t+1}|a_{t+1}, s_{t},\dots,a_1,s_0) &= p(s_{t+1}|a_{t+1}, s_{t}) \\
p(r_{t+1}|a_{t+1}, s_{t},\dots,a_1,s_0) &= p(r_{t+1}|a_{t+1}, s_{t})
\end{aligned}
\]
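Putting the key elements together, the MDP can be sketched as a small container, and the Markov property shows up in the fact that one step of simulation needs only the current state and the chosen action, never the earlier history. All names and the toy numbers below are assumptions:

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple
import numpy as np

@dataclass
class MDP:
    """The key elements of an MDP (illustrative container, not a fixed API)."""
    states: Set[str]                                        # S
    actions: Dict[str, Set[str]]                            # A(s) for each s
    p_transition: Dict[Tuple[str, str], Dict[str, float]]   # p(s' | s, a)
    p_reward: Dict[Tuple[str, str], Dict[float, float]]     # p(r | s, a)
    policy: Dict[str, Dict[str, float]]                     # pi(a | s)

def step(mdp, s, rng):
    """One Markov step: every distribution is conditioned only on the
    current (s, a), not on any earlier states or actions (memoryless)."""
    a_vals, a_probs = zip(*mdp.policy[s].items())
    a = rng.choice(a_vals, p=a_probs)
    s_vals, s_probs = zip(*mdp.p_transition[(s, a)].items())
    s_next = rng.choice(s_vals, p=s_probs)
    r_vals, r_probs = zip(*mdp.p_reward[(s, a)].items())
    r = rng.choice(r_vals, p=r_probs)
    return str(a), str(s_next), float(r)

# Tiny two-state example just to show the pieces fitting together.
toy = MDP(
    states={"s1", "s2"},
    actions={"s1": {"a2"}, "s2": {"a5"}},
    p_transition={("s1", "a2"): {"s2": 1.0}, ("s2", "a5"): {"s2": 1.0}},
    p_reward={("s1", "a2"): {0.0: 1.0}, ("s2", "a5"): {1.0: 1.0}},
    policy={"s1": {"a2": 1.0}, "s2": {"a5": 1.0}},
)
print(step(toy, "s1", np.random.default_rng(0)))  # -> ('a2', 's2', 0.0)
```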