MinPPO
These are development notes for the MinPPO project in this repo.
Testing
- A hidden layer size of 256 shows learning progress (the loss is based on state.q[2])
- Setting std to zero makes the rewards NaN. Why? I wonder if there NEEDS to be randomization in the environment (see the first sketch after this list).
- Is the ctrl cost what's giving the NaNs? Interesting?
- It is unrelated to randomization of the environment; I think it's gradient related.
- The first things to become NaN seem to be the actor loss and the scores; after that, everything becomes NaN.
- Fixed the entropy epsilon. Hope this works now (a sketch of that kind of fix is below).
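
One plausible mechanism behind the std-of-zero NaNs, consistent with the notes above: with std = 0 the Gaussian log-density divides by zero, so the log-probs, and with them the PPO ratio and actor loss, go NaN; once NaNs reach the parameters they show up in the actions and therefore in the ctrl cost and rewards too. The snippet below is a minimal sketch of that failure, assuming a diagonal-Gaussian policy in a JAX setup; the variable names are illustrative, not MinPPO's.

```python
# Sketch only: why std == 0 breaks a diagonal-Gaussian policy's log-probs.
# Assumes JAX; names (mean, std, action) are illustrative, not MinPPO's.
import jax.numpy as jnp
from jax.scipy.stats import norm

mean = jnp.array(0.3)
std = jnp.array(0.0)   # the problematic setting from the note above
action = jnp.array(0.5)

# logpdf = -0.5 * ((a - mu) / std)**2 - log(std) - 0.5 * log(2*pi)
# With std == 0 the first term is -inf and -log(std) is +inf, so the sum is NaN.
log_prob = norm.logpdf(action, loc=mean, scale=std)
print(log_prob)  # nan -> propagates into the PPO ratio / actor loss
```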
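
The entropy-epsilon fix mentioned in the last bullet might look roughly like the sketch below; this is a hedged guess, not the exact MinPPO change, assuming the entropy bonus uses the closed-form Gaussian entropy. ENTROPY_EPS is a hypothetical name and value.

```python
# Sketch of an entropy-epsilon guard (assumed form, not the exact MinPPO change):
# keep the variance inside the log strictly positive so the entropy bonus
# stays finite even if the learned std collapses to zero.
import jax.numpy as jnp

ENTROPY_EPS = 1e-6  # hypothetical name and value

def gaussian_entropy(std):
    # Closed-form Gaussian entropy 0.5 * log(2*pi*e*sigma^2), with an epsilon
    # added to sigma^2 so log(0) can never occur.
    return 0.5 * jnp.log(2.0 * jnp.pi * jnp.e * (std**2 + ENTROPY_EPS))

print(gaussian_entropy(jnp.array(0.0)))  # finite instead of -inf
print(gaussian_entropy(jnp.array(0.5)))  # essentially unchanged for normal stds
```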