A family of methods that estimate a value function, or improve a policy, by averaging the returns observed over sample episodes generated by executing the current policy in the environment.
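As a rough illustration of the averaging step, the sketch below estimates state values from already-collected episodes using the first-visit variant, which is one common choice and not necessarily the one intended here. The function name `first_visit_mc`, the `(state, reward)` episode format, and the discount parameter `gamma` are assumptions made for this example.

```python
from collections import defaultdict


def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) by averaging first-visit returns over sampled episodes.

    Hypothetical format: each episode is a list of (state, reward) pairs,
    where reward is the reward received after leaving that state.
    """
    returns_sum = defaultdict(float)
    visit_count = defaultdict(int)

    for episode in episodes:
        # Work backward to get the discounted return following each step.
        g = 0.0
        returns = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            _, reward = episode[t]
            g = reward + gamma * g
            returns[t] = g

        # Count only the first visit to each state within this episode.
        seen = set()
        for t, (state, _) in enumerate(episode):
            if state in seen:
                continue
            seen.add(state)
            returns_sum[state] += returns[t]
            visit_count[state] += 1

    # The value estimate is the mean of the recorded returns per state.
    return {s: returns_sum[s] / visit_count[s] for s in returns_sum}


# Three short hand-written episodes standing in for sampled experience.
episodes = [
    [("A", 0.0), ("B", 1.0)],
    [("A", 0.0), ("B", 0.0)],
    [("B", 1.0)],
]
values = first_visit_mc(episodes, gamma=1.0)
```

In practice the episodes would come from running the current policy in the environment; here they are fixed by hand so the averaging is easy to follow. More episodes give a less noisy average.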