Rohan Tangri
May 4, 2021


Indeed it is! Note that we use the mask flags to cut the sum off where an episode ends: we don't want the next episode's discounted rewards to leak into the advantage calculation for a step in the current episode.

I appreciate the code doesn't mirror the equation exactly, but to code it efficiently we start from the end and calculate backwards :)
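As a rough illustration of the backward pass, here is a minimal sketch. It assumes a GAE-style advantage, and the names `compute_advantages`, `rewards`, `values`, `masks`, `gamma` and `lam` are hypothetical, not taken from the article's code. It also assumes `masks[t]` is 0 when the episode ended at step `t` and 1 otherwise, and that `values` carries one extra bootstrap entry at the end.

```python
def compute_advantages(rewards, values, masks, gamma=0.99, lam=0.95):
    """Backward-pass advantage estimate (GAE-style sketch, assumed form).

    rewards: list of floats, length T
    values:  list of floats, length T + 1 (last entry is the bootstrap value)
    masks:   list of floats, length T; 0.0 if the episode ended at step t, else 1.0
    """
    advantages = [0.0] * len(rewards)
    running = 0.0  # discounted sum accumulated from later steps
    for t in reversed(range(len(rewards))):
        # One-step TD error; the mask drops the next state's value
        # when that state belongs to a new episode.
        delta = rewards[t] + gamma * values[t + 1] * masks[t] - values[t]
        # The mask also resets the running sum at an episode boundary,
        # so nothing from the next episode is carried backwards.
        running = delta + gamma * lam * masks[t] * running
        advantages[t] = running
    return advantages
```

For example, with all rewards equal to 1, all values equal to 0, `masks = [1, 0, 1, 1]`, `gamma=0.5` and `lam=1.0`, this returns `[1.5, 1.0, 1.5, 1.0]`: the step at index 1 ends its episode, so it sees none of the rewards that follow.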


Written by Rohan Tangri

AI PhD @ Imperial College London
