Data-Mining-Recitation2

Monte Carlo Approximation techniques

Evolution Strategies, Policy Gradient

Policy optimization can be done through gradient ascent:

\[\nabla_\theta \mathbb{E}_{\epsilon\sim\mathcal{N}(0, I)} F(\theta + \sigma \epsilon) = \frac{1}{\sigma}\mathbb{E}_{\epsilon\sim\mathcal{N}(0, I)} [ \epsilon F(\theta + \sigma\epsilon) ]\]
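In practice the expectation is approximated by Monte Carlo: sample \(n\) perturbations \(\epsilon_i\sim\mathcal{N}(0, I)\) and take a gradient-ascent step with learning rate \(\alpha\) (a sketch of the resulting update; the identity itself is proved below):

\[\theta \leftarrow \theta + \alpha\, \frac{1}{n\sigma} \sum_{i=1}^{n} \epsilon_i\, F(\theta + \sigma\epsilon_i), \qquad \epsilon_i \sim \mathcal{N}(0, I)\]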

Proof:

Some references:
Evolution Strategies; Evolution Strategies as a Scalable Alternative to Reinforcement Learning
Zhihu answer on OpenAI ES
CSDN derivation

\[ \begin{aligned} &\nabla_\theta \mathbb{E}_{\epsilon\sim\mathcal{N}(0, I)} F(\theta + \sigma\epsilon) \\ &= \nabla_\theta \int_{\epsilon} p(\epsilon) F(\theta + \sigma\epsilon) d\epsilon & \scriptstyle{\text{; Gaussian }p(\epsilon)=(2\pi)^{-\frac{n}{2}} \exp(-\frac{1}{2}\epsilon^\top\epsilon)} \\ &= \int_{\epsilon} p(\epsilon) \nabla_\theta F(\theta + \sigma\epsilon) d\epsilon & \\ &= \frac{1}{\sigma}\int_{\epsilon} p(\epsilon) \nabla_\epsilon F(\theta + \sigma\epsilon) d\epsilon & \scriptstyle{\text{; chain rule: } \nabla_\theta F(\theta+\sigma\epsilon) = \frac{1}{\sigma}\nabla_\epsilon F(\theta+\sigma\epsilon)} \\ &= -\frac{1}{\sigma}\int_{\epsilon} \nabla_\epsilon p(\epsilon) F(\theta + \sigma\epsilon) d\epsilon & \scriptstyle{\text{; integration by parts; } p(\epsilon)F(\theta+\sigma\epsilon)\to 0 \text{ as } \|\epsilon\|\to\infty} \\ &= -\frac{1}{\sigma}\int_{\epsilon} p(\epsilon) \nabla_\epsilon \log p(\epsilon) F(\theta + \sigma\epsilon) d\epsilon & \scriptstyle{\text{; log-likelihood trick: } \nabla_\epsilon p(\epsilon) = p(\epsilon)\nabla_\epsilon \log p(\epsilon)}\\ &= -\frac{1}{\sigma}\mathbb{E}_{\epsilon\sim\mathcal{N}(0, I)} [ \nabla_\epsilon \big(-\frac{1}{2}\epsilon^\top\epsilon\big) F(\theta + \sigma\epsilon) ] & \scriptstyle{\text{; } \log p(\epsilon) = -\frac{1}{2}\epsilon^\top\epsilon + \text{const}} \\ &= -\frac{1}{\sigma}\mathbb{E}_{\epsilon\sim\mathcal{N}(0, I)} [ (-\epsilon) F(\theta + \sigma\epsilon) ] & \\ &= \frac{1}{\sigma}\mathbb{E}_{\epsilon\sim\mathcal{N}(0, I)} [ \epsilon F(\theta + \sigma\epsilon) ] & \end{aligned} \]
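A minimal NumPy sketch of this estimator used for gradient ascent; the quadratic objective F, the mean-subtraction baseline, and all hyperparameters are illustrative assumptions rather than anything from the references above:

```python
import numpy as np

def es_gradient(F, theta, sigma, n, rng):
    """Monte Carlo estimate of (1/sigma) * E_eps[ eps * F(theta + sigma*eps) ]."""
    eps = rng.standard_normal((n, theta.size))                # eps_i ~ N(0, I)
    returns = np.array([F(theta + sigma * e) for e in eps])   # F(theta + sigma * eps_i)
    returns = returns - returns.mean()                        # mean baseline (variance reduction; an extra choice, not required by the identity)
    return (eps * returns[:, None]).mean(axis=0) / sigma

# Illustrative objective (an assumption): maximize F(x) = -||x - 3||^2, optimum at x = 3.
F = lambda x: -np.sum((x - 3.0) ** 2)

rng = np.random.default_rng(0)
theta, sigma, alpha = np.zeros(5), 0.1, 0.05
for _ in range(500):                                          # gradient ascent with the estimated gradient
    theta = theta + alpha * es_gradient(F, theta, sigma, n=200, rng=rng)
print(np.round(theta, 2))                                     # each entry should end up close to 3.0
```

Note that the estimator only evaluates F; it never differentiates it, which is what makes ES applicable to non-differentiable objectives.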

The Log-Likelihood Trick
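
The step labelled "log-likelihood trick" above uses the identity \(\nabla p = p \nabla \log p\). In its general score-function form, with \(p_\theta\) a density parameterized by \(\theta\) and \(f\) any function (no differentiability of \(f\) required):

\[\nabla_\theta \mathbb{E}_{x\sim p_\theta}[f(x)] = \int f(x)\, \nabla_\theta p_\theta(x)\, dx = \int f(x)\, p_\theta(x)\, \nabla_\theta \log p_\theta(x)\, dx = \mathbb{E}_{x\sim p_\theta}\big[f(x)\, \nabla_\theta \log p_\theta(x)\big]\]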

The Reparameterization Trick
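
An alternative way to differentiate through the same expectation: write the sample as a deterministic, differentiable function of parameter-free noise, here \(x = \theta + \sigma\epsilon\) with \(\epsilon\sim\mathcal{N}(0, I)\), so the gradient can move inside the expectation. Unlike the log-likelihood trick, this requires \(F\) to be differentiable:

\[\nabla_\theta \mathbb{E}_{x\sim\mathcal{N}(\theta,\sigma^2 I)} F(x) = \nabla_\theta \mathbb{E}_{\epsilon\sim\mathcal{N}(0, I)} F(\theta + \sigma\epsilon) = \mathbb{E}_{\epsilon\sim\mathcal{N}(0, I)} [\nabla_\theta F(\theta + \sigma\epsilon)]\]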