Evolution Strategies
RL Policy Optimization
State, Action
Policy: deterministic / stochastic
Function \(F: \mathbb{R}^d \to \mathbb{R}\): maps a policy-parameter vector to a scalar (total) reward.
Transitions are a black box, decided by the environment. Dense rewards are easier to learn from than sparse rewards. Simulators (e.g., physics engines) are used to obtain the transitions.
Now assume the environment is a black box: we only get partial observations, and the environment itself can be randomized.
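As a concrete picture of this blackbox, here is a minimal rollout sketch; the gymnasium-style API and the linear deterministic policy are my assumptions for illustration, not part of the notes.

```python
import numpy as np

def rollout_reward(theta, env, horizon=1000):
    """Blackbox F: policy parameters -> scalar episodic reward.
    Assumes a gymnasium-style env and a linear deterministic policy."""
    obs, _ = env.reset()
    a_dim = env.action_space.shape[0]
    o_dim = env.observation_space.shape[0]
    W = theta.reshape(a_dim, o_dim)       # flat params -> weight matrix
    total = 0.0
    for _ in range(horizon):
        obs, reward, terminated, truncated, _ = env.step(W @ obs)
        total += reward
        if terminated or truncated:       # transitions decided by the env
            break
    return total
```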
RFD: Random Finite Differences, replacing backprop
\(F: \mathbb{R}^d \to \mathbb{R}\)
Towards smooth relaxations
\[\max_{\mu\in P(\mathbb{R}^d)} {\mathbb{E}_{\theta \sim \mu}{[F(\theta)]}}\]
Gaussian smoothing
\[\max_{\theta\in \mathbb{R}^d}{J(\theta)} = \mathbb{E}_{\phi \sim N(\theta, \sigma^2I)}{[F(\phi)]}\]
\[\nabla J(\theta) = \frac1\sigma \mathbb{E}_{\epsilon \sim N(0, I)}{[F(\theta+\sigma\epsilon)\epsilon]}\]
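This gradient identity is the score-function (log-derivative) trick applied to the Gaussian density \(p(\phi;\theta) = N(\theta, \sigma^2 I)\), with the substitution \(\phi = \theta + \sigma\epsilon\):
\[\nabla J(\theta) = \mathbb{E}_{\phi}\big[F(\phi)\,\nabla_\theta \log p(\phi;\theta)\big] = \mathbb{E}_{\phi}\Big[F(\phi)\,\frac{\phi - \theta}{\sigma^2}\Big] = \frac1\sigma \mathbb{E}_{\epsilon \sim N(0, I)}\big[F(\theta+\sigma\epsilon)\,\epsilon\big]\]
Note that no gradient of \(F\) appears, so \(F\) can stay a black box.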
Salimans et al. (2017)
Gradient sensing as MC estimation
ES-style vanilla gradient estimator: \[\hat{\nabla}_N J(\theta) = \frac{1}{N\sigma}\sum_{i=1}^{N} F(\theta+\sigma\epsilon_i)\,\epsilon_i\]
Gradient estimator with antithetic pairs:
\[\hat{\nabla}_N J(\theta) = \frac{1}{2N\sigma}\sum_{i=1}^{N} \big(F(\theta+\sigma\epsilon_i) - F(\theta-\sigma\epsilon_i)\big)\,\epsilon_i\]
FD-style, randomized:
\[\hat{\nabla}_N J(\theta) = \frac{1}{N\sigma}\sum_{i=1}^{N} \big(F(\theta+\sigma\epsilon_i) - F(\theta)\big)\,\epsilon_i\]
- All three estimators are unbiased.
- None uniformly dominates the others in MSE (see the numpy sketch below).
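A minimal numpy sketch of all three estimators side by side; \(F\) is any blackbox reward, e.g. the rollout sketch above, and the function name is mine.

```python
import numpy as np

def es_gradients(F, theta, sigma=0.1, N=100, rng=np.random.default_rng(0)):
    """Vanilla, antithetic, and forward-FD Monte Carlo estimators of the
    gradient of the Gaussian smoothing J(theta) = E[F(theta + sigma*eps)]."""
    d = theta.shape[0]
    eps = rng.standard_normal((N, d))
    f_plus = np.array([F(theta + sigma * e) for e in eps])
    f_minus = np.array([F(theta - sigma * e) for e in eps])
    f0 = F(theta)
    vanilla = (f_plus @ eps) / (N * sigma)
    antithetic = ((f_plus - f_minus) @ eps) / (2 * N * sigma)
    forward_fd = ((f_plus - f0) @ eps) / (N * sigma)
    return vanilla, antithetic, forward_fd
```

The antithetic form cancels the baseline term \(F(\theta)\) exactly (along with all even-order terms of the local expansion), which typically reduces variance.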
It can be shown that the orthogonal (structured) gradient estimator \(\hat{\nabla}_N^{\mathrm{ort}}J(\theta)\) is unbiased and yields lower MSE than the unstructured estimator; a construction sketch follows.
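One common construction of orthogonal perturbations: orthogonalize a Gaussian block via QR, then redraw the row norms so each row keeps the Gaussian marginal. A sketch, assuming \(N \le d\):

```python
import numpy as np

def orthogonal_perturbations(N, d, rng=np.random.default_rng(0)):
    """Rows are exactly orthogonal; each row's norm is redrawn from the
    norm distribution of a d-dimensional Gaussian, so each row is still
    marginally N(0, I). Assumes N <= d."""
    G = rng.standard_normal((d, N))
    Q, _ = np.linalg.qr(G)                # d x N, orthonormal columns
    norms = np.linalg.norm(rng.standard_normal((N, d)), axis=1)
    return Q.T * norms[:, None]           # N x d, orthogonal rows
```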
Coupled antithetic pairs for MC estimation
\[\hat{\nabla}_N J(\theta) = \frac{1}{2N\sigma}\sum_{i=1}^{N}\big(F(\theta+\sigma\epsilon_i)\,\epsilon_i - F(\theta-\sigma\epsilon_i')\,\epsilon_i'\big)\]
\(\epsilon_i'\) shares the direction of \(\epsilon_i\) but has inverted length.
More complex couplings are also possible.
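A sketch of one such coupling, reading "inverse length" as a same-direction partner with reciprocal norm; the exact coupling used in the referenced construction may differ.

```python
import numpy as np

def coupled_pair(eps):
    """Antithetic partner with the same direction but reciprocal length,
    i.e. ||eps_prime|| = 1 / ||eps|| (one reading of 'inverse length')."""
    eps_prime = eps / np.linalg.norm(eps) ** 2
    return eps, eps_prime
```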
RBO: Robust Blackbox Optimization
\[\hat{\nabla}_{RBO}^{\mathrm{ort}}J(\theta) = \operatorname*{arg\,min}_{v\in \mathbb{R}^d} \frac{1}{2N} \|y-Zv\|^p_p + \alpha \|v\|^q_q\]
where the rows of \(Z\) are the perturbations and \(y\) collects the corresponding finite differences of \(F\).
Perturbations can be re-used across iterations.
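A minimal sketch of the regression view for the smooth special case \(p = q = 2\) (ridge regression), which admits a closed form; the general \(\ell_p/\ell_q\) case needs a convex solver.

```python
import numpy as np

def rbo_gradient(F, theta, sigma=0.1, N=100, alpha=0.01,
                 rng=np.random.default_rng(0)):
    """Regression-based gradient estimate (p = q = 2 case).
    Rows of Z are the perturbations, y the forward reward differences."""
    d = theta.shape[0]
    Z = sigma * rng.standard_normal((N, d))
    y = np.array([F(theta + z) for z in Z]) - F(theta)
    # Normal equations of  argmin_v ||y - Zv||^2 / (2N) + alpha ||v||^2
    A = Z.T @ Z / N + 2 * alpha * np.eye(d)
    return np.linalg.solve(A, Z.T @ y / N)
```

Because the gradient comes from a regularized regression rather than a plain average, the same measurements can be re-used near the current iterate, and with \(p = 1\) the fit tolerates a fraction of arbitrarily corrupted measurements, which is the point of the noisy-measurement results below.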
Orthogonal exploration also makes gradient sensing efficient for structured (compact) neural networks.
Structured matrices: a Toeplitz matrix is \(n\times n\) but has only \(2n-1\) free parameters, since it is constant along each diagonal.
Matrix-vector products cost \(O(n\log n)\) via the FFT (sketch below).
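The fast product comes from embedding the Toeplitz matrix into a \((2n-1)\times(2n-1)\) circulant and multiplying via the FFT; a sketch:

```python
import numpy as np

def toeplitz_matvec(c, r, x):
    """O(n log n) product of the n x n Toeplitz matrix (first column c,
    first row r, with r[0] == c[0]) and a vector x, via circulant
    embedding + FFT."""
    n = len(c)
    col = np.concatenate([c, r[1:][::-1]])      # circulant's first column
    x_pad = np.concatenate([x, np.zeros(n - 1)])
    y = np.fft.ifft(np.fft.fft(col) * np.fft.fft(x_pad))
    return y[:n].real

# Sanity check against the dense product.
from scipy.linalg import toeplitz
rng = np.random.default_rng(0)
c, r, x = (rng.standard_normal(5) for _ in range(3))
r[0] = c[0]
assert np.allclose(toeplitz(c, r) @ x, toeplitz_matvec(c, r, x))
```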
RBO & noisy measurements
Even with measurement noise at level 0.4, the MC estimator still reaches the same accuracy as in the noiseless case.
[Neuroevolution of Self-Interpretable Agents]