behavior

(This note is based on "Learning to Score Behaviors for Guided Policy Optimization". I am trying to expand on and clarify some of the algorithms presented there. More content will be added to this note in the future!) The core question: what is the right measure of similarity between two policies acting on the same underlying MDP, and how can we devise algorithms that leverage this information for RL?...