Currently, people use reward functions, demonstrations, or preferences as learning feedback. Here we instead investigate using examples (i.e., successful states). With examples, we alleviate the potential suboptimality of demonstrations while removing the need to define a reward function. The trade-off is that we have to explore effectively to achieve successes!
Exploration can be dealt with using hierarchical approaches (e.g., scheduled auxiliary control and options), but we found that the improved state-action coverage also introduces more out-of-distribution data, resulting in Q-value divergence, similar to my other work... We ended up penalizing Q-values that exceed the realizable expected return and saw improved performance.
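For intuition, here is a minimal sketch (not the paper's implementation) of what such a penalty could look like in a standard TD critic update: if rewards are bounded success indicators, no policy can realize a discounted return above R_MAX / (1 - GAMMA), so Q estimates beyond that bound are pushed back down. The names q_net, target_q_net, batch, and the weight LAMBDA are hypothetical placeholders.

```python
# Minimal sketch of adding a value penalty to a one-step TD critic loss.
# Assumes a success-indicator reward bounded by R_MAX, so the discounted
# return can never exceed Q_MAX = R_MAX / (1 - GAMMA).
import torch
import torch.nn.functional as F

GAMMA = 0.99
R_MAX = 1.0                      # assumed per-step reward upper bound
Q_MAX = R_MAX / (1.0 - GAMMA)    # highest realizable expected return
LAMBDA = 1.0                     # penalty weight (hypothetical)

def critic_loss(q_net, target_q_net, batch):
    s, a, r, s_next, done = batch

    # Standard one-step TD target from a target network.
    with torch.no_grad():
        next_q = target_q_net(s_next).max(dim=-1).values
        td_target = r + GAMMA * (1.0 - done) * next_q

    q = q_net(s).gather(-1, a.unsqueeze(-1)).squeeze(-1)
    td_loss = F.mse_loss(q, td_target)

    # Penalize only the portion of Q that exceeds the realizable return,
    # which counteracts divergence on out-of-distribution state-actions.
    overestimate = F.relu(q - Q_MAX)
    penalty = (overestimate ** 2).mean()

    return td_loss + LAMBDA * penalty
```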
I think there remain many open questions about how we should penalize Q-values and about how Q-functions truly improve policy learning.
If you're interested, my collaborator Trevor Ablett will be presenting this at the Conference on Robot Learning (CoRL) MRM-D workshop! He will also present my RL work at the same workshop!
Just show me what you want, and I'll figure it out!
Designing good reward functions for robotic reinforcement learning (RL) is hard. Trevor Ablett will present VPACE: our work on Fast RL without Rewards or Demonstrations via Auxiliary Task Examples at the #CoRL MRM-D workshop this Saturday!
Project Site: https://lnkd.in/e2cDpD9y
Full Paper: https://lnkd.in/eThy_ec9