Evolutional Strategy

RL Policies Optimization

State, Action

Policy: Deterministic / Randomized

Function \(F: \mathbb{R}^d \to \mathbb{R}\), reward from vector to scalar.

Transition is a blackbox: decided by the environment. Dense reward better than sparse reward. Simulators are used to get the transitions (Physics engine?).

Now assume that the environment is blackbox. We get partial observation. Environment can be randomized.

RFD: Fancy Finite difference replacing backprop

Random finite difference

\(F: \mathbb{R}^d \to \mathbb{R}\)

Towards smooth relations

\[\max_{\mu\in P(\mathbb{R}^d)} {\mathbb{E}_{\theta \sim \mu}{[F(\theta)]}}\]

Gaussian smoothings

\[\max_{\theta\in \mathbb{R}^d}{J(\theta)} = \mathbb{E}_{\phi \sim N(\theta, \sigma^2I)}{[F(\phi)]}\]

\[\nabla J(\theta) = \frac1\sigma \mathbb{E}_{\epsilon \sim N(0, I)}{[F(\theta+\sigma\epsilon)\epsilon]}\]


Gradient sensing as a MC estimation

ES-style vanilla gradient estimator: \[\hat{\nabla}_NJ(\theta) = \frac{1}{N\sigma}\sum{F(\theta+\sigma\epsilon_i)\epsilon_i}\]

gradient estimator with antithetic pairs

\[\hat{\nabla}_NJ(\theta) =\frac{1}{2N\sigma}\sum{(F(\theta+\sigma\epsilon_i)\epsilon_i - F(\theta-\sigma\epsilon_i)\epsilon_i)} \]

FD style, randomized

\[\hat{\nabla}_NJ(\theta) =\frac{1}{N\sigma}\sum{(F(\theta+\sigma\epsilon_i)\epsilon_i - F(\theta)\epsilon_i)}\]

  • all 3 estimators are unbiased
  • none dominates

It can be proved that the orthogonal(structured) gradient estimator \(\hat{\nabla}_N^{ort}J(\theta)\) is unbiased and yields lower MSE than the unstructured estimator.

Coupled anithetic pairs for MC estimation

\[\hat{\nabla}_NJ(\theta) =\frac{1}{2N\sigma}\sum{(F(\theta+\sigma\epsilon_i)\epsilon_i - F(\theta-\sigma\epsilon_i')\epsilon_i')} \]

\(\epsilon_i'\) inverse length.

Or more complex formulation.

RBO-Robust Blackbox Optimization

\[\hat{\nabla}_{RBO}^{ort}J(\theta) = \text{arg} \min_{v\in R^d} \frac1{2N} \|y-Zv\|^p_p + \alpha \|v\|^q_q\]

re-used perturbations

Efficiency of the orthogonal exploration for gradient sensing - structured NNs.

Structurized Matrices: Toeplitz, \(n\times n\) with \(2n-1\) parameters. constant diagonal values.

\(O(n\log n)\) complexity

RBO & noisy measurements

under noise 0.4, the MC still manages to reach the same accuracy

[Neuroevolution of self-interpretation]