TD3
Twin Delayed Deep Deterministic Policy Gradient (Twin Delayed DDPG): an off-policy RL algorithm that addresses DDPG's Q-value overestimation bias by (1) training two Q-networks and using the minimum of their estimates when forming the learning target, (2) updating the policy less frequently than the Q-networks, and (3) smoothing the target policy by adding clipped noise to the target action. TD3 generally trains more stably than DDPG and is a standard baseline for continuous-control robot learning tasks.
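The three mechanisms can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names (`td3_target`, `should_update_policy`), the callables `target_actor`, `target_q1`, and `target_q2`, and the default values (`gamma=0.99`, `noise_std=0.2`, `noise_clip=0.5`, `policy_delay=2`) are assumptions chosen for the example.

```python
import numpy as np


def td3_target(reward, done, next_state, target_actor, target_q1, target_q2,
               rng, gamma=0.99, noise_std=0.2, noise_clip=0.5, max_action=1.0):
    """Bootstrapped Q-learning target used to train both critics.

    All names and default values here are illustrative assumptions.
    """
    next_action = target_actor(next_state)
    # (3) Target policy smoothing: perturb the target action with clipped
    # Gaussian noise, then clip back into the valid action range.
    noise = np.clip(rng.normal(0.0, noise_std, size=next_action.shape),
                    -noise_clip, noise_clip)
    next_action = np.clip(next_action + noise, -max_action, max_action)
    # (1) Twin critics: take the element-wise minimum of the two target
    # Q-estimates to counteract overestimation.
    q_min = np.minimum(target_q1(next_state, next_action),
                       target_q2(next_state, next_action))
    # No bootstrapping past terminal transitions (done == 1).
    return reward + gamma * (1.0 - done) * q_min


def should_update_policy(step, policy_delay=2):
    """(2) Delayed updates: update the actor only every `policy_delay` critic steps."""
    return step % policy_delay == 0
```

In a training loop under these assumptions, both critics would be regressed toward `td3_target(...)` on every step, while the actor and the target networks would be updated only on steps where `should_update_policy(step)` is true.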