Chapter 4: Value-based Methods for Deep RL
4.1 Q-Learning
\[
Q^* (s,a) = (\mathcal{B} Q^*) (s, a), \tag{4.1}
\]
\[
(\mathcal{B} K) (s, a) = \sum\limits_{s^{\prime} \in S} T(s, a, s^{\prime}) \left( R(s, a, s^{\prime}) + \gamma \max\limits_{a^{\prime} \in \mathcal{A}} K(s^{\prime}, a^{\prime}) \right), \tag{4.2}
\]
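The Bellman optimality operator in Eq. 4.2 can be iterated numerically; by the contraction property, repeated application converges to the fixed point \(Q^*\) of Eq. 4.1. A minimal sketch on a hypothetical 2-state, 2-action MDP (the tensors `T`, `R` and the value of `gamma` are illustrative placeholders, not from the text):

```python
import numpy as np

n_states, n_actions = 2, 2
gamma = 0.9
# T[s, a, s'] : transition probabilities (each row sums to 1)
T = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.9, 0.1]]])
# R[s, a, s'] : rewards (constant here, for illustration)
R = np.ones((n_states, n_actions, n_states))

def bellman_operator(K):
    """(B K)(s,a) = sum_s' T(s,a,s') * (R(s,a,s') + gamma * max_a' K(s',a'))."""
    return np.einsum("ijk,ijk->ij", T, R + gamma * K.max(axis=1)[None, None, :])

# Iterating the operator converges to the fixed point Q* of Eq. 4.1.
Q = np.zeros((n_states, n_actions))
for _ in range(500):
    Q = bellman_operator(Q)
```

With constant reward 1, the fixed point is \(1/(1-\gamma) = 10\) for every state-action pair, which the iteration reaches to machine precision.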
4.2 Fitted Q-Learning
\[
Y^Q_k = r + \gamma \max\limits_{a^{\prime} \in \mathcal{A}} Q(s^{\prime}, a^{\prime}; \theta_k), \tag{4.3}
\]
\[
\mathrm{L}_{DQN} = \left(Q(s, a; \theta_k) - Y^Q_k\right)^2, \tag{4.4}
\]
\[
\theta_{k+1} = \theta_k + \alpha \left(Y^Q_k - Q(s, a; \theta_k)\right) \nabla_{\theta_k} Q(s, a; \theta_k), \tag{4.5}
\]
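Eqs. 4.3 to 4.5 amount to one gradient step toward a bootstrapped target. A minimal sketch with a linear Q-function \(Q(s,a;\theta) = \theta_a^\top \phi(s)\); the feature vectors, transition, and step sizes below are illustrative placeholders:

```python
import numpy as np

gamma, alpha = 0.99, 0.1
n_actions, n_features = 2, 3
theta = np.zeros((n_actions, n_features))  # parameters theta_k

def q_values(phi_s, theta):
    return theta @ phi_s  # one Q-value per action

# One observed transition <s, a, r, s'>, in feature form.
phi_s  = np.array([1.0, 0.0, 0.5])
a, r   = 0, 1.0
phi_s2 = np.array([0.0, 1.0, 0.5])

# Target Y^Q_k (Eq. 4.3), then a gradient step on the squared loss
# (Eqs. 4.4-4.5); for a linear Q, grad_theta Q(s,a) w.r.t. theta[a] is phi(s).
y = r + gamma * q_values(phi_s2, theta).max()
td_error = y - q_values(phi_s, theta)[a]
theta[a] += alpha * td_error * phi_s
```

Note the target itself depends on \(\theta_k\), which is what makes this a semi-gradient (bootstrapped) update rather than ordinary supervised regression.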
4.3 Deep Q-Networks
Deep Q-Networks suppress learning instability with the following two techniques:
- Replay Memory: transitions are collected with an \(\epsilon\)-greedy policy, and the memory retains all information from the most recent \(N_{replay} \in \mathbb{N}\) time steps. Updates are then performed on a set of tuples \(\langle s, a, r, s^{\prime} \rangle\) sampled from the memory, called a mini-batch. This technique lets updates cover a wider portion of the state-action space. Compared with updating on a single tuple, a mini-batch has lower variance, which both permits larger parameter updates and lends itself to efficient parallelization.
- Target Network: the target \(Y^Q_k\) is computed with a separate network whose parameters \(\theta_k^{-}\) are held fixed and only periodically copied from \(\theta_k\). Keeping the target quasi-stationary between synchronizations suppresses the oscillations that arise when the target moves with every update.
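The two stabilizers can be sketched together; the buffer size, batch size, sync period `C`, and the dictionary stand-ins for the networks are all illustrative placeholders:

```python
import random
from collections import deque

N_replay, batch_size, C = 1000, 4, 100
replay = deque(maxlen=N_replay)   # oldest tuples are discarded automatically

theta = {"w": 0.0}                # stand-in for the online network parameters
theta_minus = dict(theta)         # target network: a frozen copy of theta

for step in range(10):
    # Store a transition <s, a, r, s'> (dummy values here).
    replay.append((step, 0, 1.0, step + 1))
    if len(replay) >= batch_size:
        batch = random.sample(replay, batch_size)  # decorrelated mini-batch
        # ... compute targets with theta_minus, update theta on the batch ...
    if step % C == 0:
        theta_minus = dict(theta)  # periodic hard sync of the target network
```

Sampling uniformly from the deque breaks the temporal correlation between consecutive transitions, and the `maxlen` bound enforces the \(N_{replay}\) window.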
4.4 Double DQN/DDQN
\[
Y_{k}^{DDQN} = r + \gamma Q(s^{\prime}, \operatorname{\arg\max}\limits_{a \in \mathcal{A}} Q(s^{\prime}, a; \theta_k); \theta_k^{-}), \tag{4.6}
\]
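In Eq. 4.6 the online network \(\theta_k\) selects the action while the target network \(\theta_k^{-}\) evaluates it. A minimal sketch with illustrative Q-values standing in for the two networks' outputs at \(s^{\prime}\):

```python
import numpy as np

gamma, r = 0.99, 1.0
q_online = np.array([2.0, 5.0, 1.0])   # Q(s', a; theta_k) for each action
q_target = np.array([1.5, 3.0, 4.0])   # Q(s', a; theta_k^-) for each action

a_star = np.argmax(q_online)            # arg max_a Q(s', a; theta_k)
y_ddqn = r + gamma * q_target[a_star]   # Eq. 4.6: evaluate a_star with theta_k^-

# Plain DQN would instead use max_a Q(s', a; theta_k^-) = 4.0 here, which
# tends to overestimate; decoupling selection from evaluation reduces this bias.
```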
4.5 Dueling Network Architecture
\[\begin{align}
Q(s, a; \theta^{(1)}, \theta^{(2)}, \theta^{(3)}) &= V(s; \theta^{(1)}, \theta^{(3)})\\
&+ \left(A(s, a; \theta^{(1)}, \theta^{(2)}) - \max\limits_{a^{\prime} \in \mathcal{A}} A(s, a^{\prime}; \theta^{(1)}, \theta^{(2)})\right), \tag{4.7}
\end{align}\]
\[\begin{align}
Q(s, a; \theta^{(1)}, \theta^{(2)}, \theta^{(3)}) &= V(s; \theta^{(1)}, \theta^{(3)})\\
&+ \left(A(s, a; \theta^{(1)}, \theta^{(2)}) - \frac{1}{\lvert \mathcal{A} \rvert}\sum\limits_{a^{\prime} \in \mathcal{A}} A(s, a^{\prime}; \theta^{(1)}, \theta^{(2)})\right), \tag{4.8}
\end{align}\]
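The two aggregation rules in Eqs. 4.7 and 4.8 differ only in the baseline subtracted from the advantage stream. A minimal sketch with illustrative numbers standing in for the value and advantage stream outputs:

```python
import numpy as np

V = 2.0                        # state value V(s) from the value stream
A = np.array([1.0, 3.0, 2.0])  # advantages A(s, a) from the advantage stream

# Eq. 4.7: subtract the max advantage, so that Q(s, a*) = V(s)
# for the greedy action a*.
q_max = V + (A - A.max())

# Eq. 4.8: subtract the mean advantage instead; Q loses its exact
# identification with V, but the learning is more stable.
q_mean = V + (A - A.mean())
```

Both variants make the decomposition identifiable: without a baseline, a constant could shift freely between \(V\) and \(A\) while leaving \(Q\) unchanged.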
4.6 Distributional DQN
4.7 Multi-step Learning
\[
Y^{Q,n}_k = \sum_{t=0}^{n-1} \gamma^t r_t + \gamma^n \max_{a^{\prime}\in\mathcal{A}} Q(s_n,a^{\prime};\theta_k), \tag{4.10}
\]
\[
Y^{Q,n}_k = \sum_{i=0}^{n-1} \lambda_i \left( \sum_{t=0}^{i} \gamma^t r_t + \gamma^{i+1} \max_{a^{\prime}\in\mathcal{A}} Q(s_{i+1},a^{\prime};\theta_k) \right), \tag{4.11}
\]
where the weights satisfy \(\sum_i \lambda_i = 1\).
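Eq. 4.11 is a convex combination of the \(i\)-step targets of Eq. 4.10 for \(i = 1, \dots, n\). A minimal sketch; the rewards, the Q-values at each \(s_t\), and the \(\lambda_i\) weights are illustrative placeholders:

```python
import numpy as np

gamma, n = 0.9, 3
rewards = np.array([1.0, 0.0, 2.0])        # r_0 ... r_{n-1} along the trajectory
q_next = np.array([[3.0, 1.0]] * (n + 1))  # Q(s_t, a'; theta_k) for each step t

def n_step_target(i):
    """sum_{t=0}^{i} gamma^t r_t + gamma^{i+1} max_a' Q(s_{i+1}, a'; theta_k)."""
    discounted = sum(gamma**t * rewards[t] for t in range(i + 1))
    return discounted + gamma**(i + 1) * q_next[i + 1].max()

y_n = n_step_target(n - 1)                 # Eq. 4.10 with n = 3

lam = np.array([0.5, 0.3, 0.2])            # weights with sum_i lambda_i = 1
y_lambda = sum(lam[i] * n_step_target(i) for i in range(n))  # Eq. 4.11
```

Larger \(n\) propagates reward information faster but bootstraps from a state further away, trading lower bias for higher variance; the \(\lambda\)-weighting interpolates between these targets.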