"Reinforcement learning (RL) is an area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize the notion of cumulative reward."
A type of machine learning in which an agent learns to make decisions by receiving rewards and penalties for the actions it takes in an environment.
Markov Decision Processes: A formal framework for modeling sequential decision-making problems, where the outcome of an action depends on both the action and the current state of the system, and the next state depends only on that state and action (the Markov property).
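A minimal sketch of what an MDP looks like as plain data, for reference in the entries below; the states, actions, probabilities, and rewards here are made up purely for illustration.

```python
# Toy MDP (illustrative values only).
# P[state][action] is a list of (probability, next_state, reward) tuples.
P = {
    "s0": {
        "stay": [(1.0, "s0", 0.0)],
        "go":   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    },
    "s1": {
        "stay": [(1.0, "s1", 0.0)],
        "go":   [(1.0, "s0", 5.0)],
    },
}
gamma = 0.9  # discount factor
```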
Q-Learning: A model-free, online algorithm for learning the optimal policy in a Markov Decision Process by iteratively updating an estimate of the expected future reward of each state-action pair.
Policy Gradient Methods: A class of reinforcement learning algorithms that directly optimize the policy function without estimating the value function.
Value Iteration: A dynamic programming algorithm that computes the optimal value function and policy for a given Markov Decision Process by iteratively updating the value function.
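As a sketch of the idea, the loop below repeatedly applies the Bellman optimality backup until the value function stops changing, then reads off a greedy policy. It reuses the toy `P` and `gamma` from the MDP sketch above.

```python
# Value iteration on the toy MDP above (assumes P and gamma are defined as in that sketch).
def value_iteration(P, gamma, theta=1e-8):
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup: best expected one-step return.
            q = [sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s]]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Greedy policy extracted from the converged value function.
    policy = {s: max(P[s], key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]))
              for s in P}
    return V, policy
```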
Monte Carlo Methods: A family of methods that estimate the value function or policy by averaging over sample episodes generated by executing the current policy in the environment.
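A minimal first-visit Monte Carlo sketch of that averaging, assuming `episodes` is a list of trajectories where each trajectory is a list of (state, reward) pairs collected under the current policy.

```python
from collections import defaultdict

def mc_value_estimate(episodes, gamma=0.9):
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        # Walk the episode backwards, accumulating the discounted return G.
        for t in reversed(range(len(episode))):
            state, reward = episode[t]
            G = reward + gamma * G
            # First-visit check: only record G for the first occurrence of the state.
            if state not in [s for s, _ in episode[:t]]:
                returns[state].append(G)
    # The value estimate is the average return observed from each state.
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```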
Temporal Difference Learning: A family of methods that update the value estimate toward a target formed from the observed reward plus the current estimate of the next state's value (bootstrapping), rather than waiting for a complete episode return.
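The simplest member of this family, TD(0), can be sketched as a one-line update; `V` is assumed to be a dict mapping states to value estimates.

```python
# TD(0) value update: move V[s] toward the bootstrapped target r + gamma * V[s_next].
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    target = r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])
    return V
```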
Actor-Critic Methods: A class of algorithms that combine a policy learning component ("actor") with a value function learning component ("critic").
Exploration-Exploitation Tradeoff: The balance between acting on the currently best known policy ("exploitation") and trying out new actions in order to gather more information about the environment ("exploration").
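Epsilon-greedy action selection is one common, simple way to strike this balance; the sketch below assumes `Q` is a dict keyed by (state, action).

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                      # explore
    return max(actions, key=lambda a: Q[(state, a)])       # exploit
```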
Deep Reinforcement Learning: An approach that uses deep neural networks as function approximators within reinforcement learning, enabling policies that can handle high-dimensional inputs and complex decision-making problems.
Model-Based Reinforcement Learning: A class of reinforcement learning algorithms that make use of a learned or pre-defined model of the environment to optimize the policy or value function.
Q-Learning: A model-free reinforcement learning algorithm for finding the optimal action-selection policy. It works by iteratively improving an estimate of the action-value function toward the optimal action-value function, using the Bellman optimality equation as its update rule.
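A single Q-learning update can be sketched as follows, assuming `Q` is a dict keyed by (state, action); the max over next actions in the target is what makes the algorithm off-policy.

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # Target bootstraps from the best next action, regardless of which action the policy would take.
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    target = r + gamma * best_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q
```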
SARSA: A model-free, on-policy reinforcement learning algorithm whose name stands for State-Action-Reward-State-Action. It estimates the action-value function of the policy being followed: the agent observes the current state, takes an action under the current policy, observes the reward and the resulting next state, chooses the next action from the same policy, and updates its estimate from that full transition.
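The corresponding update, sketched with the same dict-keyed `Q` as above; unlike Q-learning, the target uses the action actually chosen by the current policy, which is what makes SARSA on-policy.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # Target bootstraps from the action a_next that the current policy actually selected.
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q
```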
Actor-Critic Methods: A class of methods that combines a value-based component (the critic, as in Q-learning or TD learning) with a policy-based component (the actor, as in Monte Carlo policy gradients); the critic's value estimates reduce the variance of the policy updates and speed up learning.
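One step of a simple one-step actor-critic update, sketched with a tabular critic `V` and a softmax actor over per-(state, action) preferences `H`; the names and structure are illustrative, not a specific library's API.

```python
import math

def softmax_policy(H, s, actions):
    prefs = [math.exp(H[(s, a)]) for a in actions]
    total = sum(prefs)
    return [p / total for p in prefs]

def actor_critic_update(H, V, s, a, r, s_next, actions,
                        alpha_actor=0.1, alpha_critic=0.1, gamma=0.9):
    # Critic: the TD error measures how much better or worse the outcome was than expected.
    delta = r + gamma * V[s_next] - V[s]
    V[s] += alpha_critic * delta
    # Actor: raise the preference of the taken action in proportion to the TD error,
    # following the gradient of the log-softmax policy.
    probs = softmax_policy(H, s, actions)
    for i, a2 in enumerate(actions):
        grad = (1.0 if a2 == a else 0.0) - probs[i]
        H[(s, a2)] += alpha_actor * delta * grad
    return H, V
```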
Policy Gradients: A method of reinforcement learning in which the algorithm directly learns a parameterized policy mapping each state to an action (or a distribution over actions), adjusting the parameters in the direction that increases the expected reward.
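A REINFORCE-style (Monte Carlo policy gradient) update for a tabular softmax policy, as a sketch; `episode` is assumed to be a list of (state, action, reward) tuples and `H` a dict of action preferences.

```python
import math

def reinforce_update(H, episode, actions, alpha=0.01, gamma=0.99):
    G = 0.0
    # Walk backwards so G accumulates the discounted return from each step onward.
    for t in reversed(range(len(episode))):
        s, a, r = episode[t]
        G = r + gamma * G
        prefs = [math.exp(H[(s, a2)]) for a2 in actions]
        total = sum(prefs)
        for i, a2 in enumerate(actions):
            # Gradient of log softmax: 1{a2 == a} - pi(a2 | s).
            grad = (1.0 if a2 == a else 0.0) - prefs[i] / total
            H[(s, a2)] += alpha * G * grad
    return H
```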
Model-based RL: A type of reinforcement learning that uses a model of the environment's dynamics for planning or policy optimization, rather than relying purely on trial-and-error interaction. The model can be learned from data or designed by humans.
Model-Free RL: A type of reinforcement learning in which the agent learns the optimal behavior by interacting directly with the environment, without building or using a model of its dynamics.
Evolutionary Methods: A family of approaches that use selection and genetic operators to evolve a population of agents (or policies) over time; the best-performing agents are kept and varied to produce the next generation.
Multi-Agent Reinforcement Learning: A type of reinforcement learning in which multiple agents interact with each other in the same environment. The agents can learn to cooperate, compete, or do both simultaneously.
Deep Reinforcement Learning: A type of reinforcement learning that uses neural networks to approximate the value of taking an action in a given state (or the policy itself), instead of using a lookup table as in traditional tabular RL.
Inverse Reinforcement Learning: An area of reinforcement learning concerned with inferring the reward function from demonstrated (expert) behavior, and then deriving a policy that is optimal under the inferred reward.
Hierarchical Reinforcement Learning: A type of reinforcement learning that decomposes a task into a hierarchy of sub-tasks to make learning more efficient. Each sub-task can be learned independently, and the learned policies are combined to solve the larger task.
Meta-Reinforcement Learning: A type of reinforcement learning that involves learning how to learn: the agent adapts its learning process so it can quickly handle new tasks it has never encountered before.
Adversarial Reinforcement Learning: A form of reinforcement learning in which the environment is adversarial, so the agent must learn to act well against an opponent.
Continuous Reinforcement Learning: A type of reinforcement learning that deals with continuous state and action spaces, where the agent must learn a control policy for a continuous (often continuous-time) system.
Contextual Bandits: A form of online learning in which the agent chooses an action based on an observed context (state) and receives a reward for that action only. It is a stateless problem: the chosen action does not influence a next state, so there are no sequential transitions to learn.
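A minimal epsilon-greedy contextual bandit sketch: the learner picks an arm given a context, observes a reward for that arm only, and never sees a next state. The class and field names are illustrative.

```python
import random
from collections import defaultdict

class EpsilonGreedyContextualBandit:
    def __init__(self, arms, epsilon=0.1):
        self.arms = arms
        self.epsilon = epsilon
        self.counts = defaultdict(int)    # pulls per (context, arm)
        self.values = defaultdict(float)  # running mean reward per (context, arm)

    def choose(self, context):
        if random.random() < self.epsilon:
            return random.choice(self.arms)                       # explore
        return max(self.arms, key=lambda a: self.values[(context, a)])  # exploit

    def update(self, context, arm, reward):
        key = (context, arm)
        self.counts[key] += 1
        # Incremental mean update of the estimated reward for this (context, arm).
        self.values[key] += (reward - self.values[key]) / self.counts[key]
```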
"Reinforcement learning differs from supervised learning in not needing labelled input/output pairs to be presented, and in not needing sub-optimal actions to be explicitly corrected."
"Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning."
"Instead, the focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge)."
"The environment is typically stated in the form of a Markov decision process (MDP), because many reinforcement learning algorithms for this context use dynamic programming techniques."
"The main difference between the classical dynamic programming methods and reinforcement learning algorithms is that the latter do not assume knowledge of an exact mathematical model of the MDP and they target large MDPs where exact methods become infeasible."