Markov Decision Process (MDP)

Mohit Uniyal

Machine Learning

Imagine a robot navigating through a maze, making decisions at each step on whether to move left, right, or forward, while considering the obstacles and its goal of finding the exit. Each decision the robot makes is part of a Markov Decision Process (MDP) — a mathematical framework used to model decision-making in environments where outcomes are partly random and partly under the control of a decision-maker.

This article will explore the fundamentals of Markov Decision Processes in Machine Learning, covering their terminology, key components, and mathematical equations that govern how decisions are made in dynamic environments.

Markov Decision Process Terminology

In a Markov Decision Process (MDP), the agent and its interactions with the environment are central to decision-making. These interactions follow a well-defined structure, where key concepts such as states, actions, and policies come into play.

1. Agent: The agent is the decision-maker in an MDP. It interacts with the environment, making choices based on the current situation (state) to achieve its goals. The agent’s primary objective is to choose actions that maximize long-term rewards by following a particular policy. In the context of our earlier example, the robot navigating through a maze is the agent.

2. Environment: The environment represents everything outside the agent’s control. It consists of all possible states and the rules that govern how the states change based on the agent’s actions. The environment provides feedback to the agent by returning new states and rewards after each action. For example, in the robot’s maze, the walls, obstacles, and exit form the environment, determining how the robot moves and whether its decisions lead to success.

3. State (S): A state represents the current situation or configuration of the environment at any given time. The state provides the agent with all the necessary information needed to make a decision. In MDPs, the next state is dependent only on the current state and the action taken, embodying the Markov Property. For instance, in our robot maze example, the state might consist of the robot’s current position and its proximity to the walls and exit.

4. Action (A): An action is a decision made by the agent based on its current state. The action determines how the agent interacts with the environment and influences the transition from one state to another. Actions can vary widely depending on the context. For the robot, possible actions include moving left, right, forward, or backward.

5. Policy (π): A policy defines the strategy or rule that the agent follows to choose actions based on its current state. A deterministic policy specifies a particular action for each state, while a stochastic policy selects actions based on a probability distribution. The policy governs how the agent behaves to achieve its goal, such as finding the maze’s exit efficiently. The agent’s objective is to follow a policy that maximizes cumulative rewards over time.

Together, these concepts—agent, environment, state, action, and policy—define how decisions are made and how outcomes unfold in a Markov Decision Process. The agent’s ability to interact with its environment, choose appropriate actions, and follow an effective policy is what ultimately determines success in an MDP.

What Is the Markov Property?

The Markov Property is a fundamental concept in Markov Decision Processes (MDPs) that defines how future states depend on current states. According to the Markov Property, the probability of transitioning to the next state depends only on the current state and the action taken, and not on the sequence of events or states that preceded it. This concept is often described as “memorylessness” because it implies that the system does not need to remember past states to predict future states.

Mathematically, the Markov Property is expressed as:

$$P(S_{t+1} \mid S_t, A_t) = P(S_{t+1} \mid S_1, A_1, S_2, A_2, \dots, S_t, A_t)$$

Real-World Example

Consider a chess game. In each turn, the player’s decision (action) depends solely on the current position of the pieces on the board (state) at that moment, without needing to consider how the pieces arrived at their current positions. This satisfies the Markov Property, where the next move depends only on the present configuration of the game, not on the previous moves.

In an MDP, this property simplifies the decision-making process, as the agent only needs to consider the current state to determine the best action, without requiring a complete history of past states.

Markov Process Explained

A Markov Process (also known as a Markov Chain) is a mathematical model used to describe systems that undergo transitions from one state to another in a probabilistic manner, following the Markov Property. In a Markov Process, the system’s state evolves over time based on transition probabilities, but the key characteristic is that the future state depends only on the current state, not the past states.

A Markov Process is defined by two components:

  1. State Space (S): A set of possible states the system can be in.
  2. Transition Probability Matrix (P): A matrix that represents the probabilities of moving from one state to another.

Transition Probability Matrix

In a Markov Process, the transitions between states are governed by a transition probability matrix. This matrix provides the probability of moving from one state to another. For example, if the system can be in states $S = \{s_1, s_2, \dots, s_n\}$, then the matrix $P$ contains elements $P_{ij} = P(s_j \mid s_i)$, representing the probability of transitioning from state $s_i$ to state $s_j$.

Each row in the matrix represents the probabilities of transitioning from one state to all other possible states.

Stationary Distribution

A stationary distribution is a probability distribution over states that remains unchanged as the system evolves over time. In a Markov Process, the system may reach a steady-state where the probabilities of being in different states remain constant. This happens when the state transitions balance out over time, and no matter where the process starts, it eventually converges to the same distribution.

Real-World Example

Consider a weather system with two states: sunny and rainy. The system follows a Markov Process where the probability of tomorrow’s weather depends only on today’s weather. For instance:

  • If it’s sunny today, there’s a 70% chance it will remain sunny tomorrow and a 30% chance it will be rainy.
  • If it’s rainy today, there’s a 60% chance it will remain rainy tomorrow and a 40% chance it will be sunny.

These probabilities form the transition matrix for this weather system. Over time, this system may reach a stationary distribution where, regardless of the initial weather conditions, the long-term probabilities of sunny and rainy days stabilize.
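To make this concrete, here is a minimal Python sketch (the state ordering, starting distribution, and convergence tolerance are choices made for this example, not part of the article) that encodes the weather chain as a transition matrix and iterates it until the distribution stops changing:

```python
import numpy as np

# Transition matrix: rows are today's weather, columns are tomorrow's.
# State order is assumed to be [sunny, rainy].
P = np.array([
    [0.7, 0.3],   # sunny today -> 70% sunny, 30% rainy tomorrow
    [0.4, 0.6],   # rainy today -> 40% sunny, 60% rainy tomorrow
])

dist = np.array([1.0, 0.0])   # arbitrary starting point: a sunny day
for _ in range(1000):
    new_dist = dist @ P       # one step of the Markov chain
    if np.allclose(new_dist, dist, atol=1e-12):
        break                 # the distribution has stopped changing
    dist = new_dist

print("Stationary distribution [P(sunny), P(rainy)]:", dist)
```

With these transition probabilities the chain settles at roughly 57% sunny and 43% rainy days, regardless of the starting state.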

Markov Reward Process (MRP)

A Markov Reward Process (MRP) is an extension of the basic Markov Process that introduces the concept of rewards. In an MRP, as the system transitions from one state to another, it receives a reward associated with each state or transition. This framework is useful for problems where an agent not only needs to model transitions between states but also needs to evaluate the long-term rewards associated with those transitions.

An MRP is defined by two additional components:

  1. Reward Function (R): This function assigns a numerical reward to each state or transition. The reward reflects the immediate benefit (or cost) the agent receives upon entering a state.
  2. Discount Factor (γ): The discount factor is a value between 0 and 1 that determines how future rewards are valued relative to immediate rewards. A lower discount factor places more importance on immediate rewards, while a higher discount factor emphasizes future rewards.

Connection to MDP

An MRP is essentially a special case of an MDP where the actions are predetermined or fixed. In an MRP, the agent passively experiences transitions and rewards without the need to make decisions. However, understanding MRPs is crucial because they form the basis for reasoning about long-term rewards, a key concept in MDPs.

Return (G_t) Equation

The return is the total reward accumulated over time by the agent as it moves through the environment. The return at time step $t$, denoted by $G_t$, is the sum of rewards the agent will receive starting from time step $t$, discounted by a factor $γ$ to account for the importance of future rewards:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

This equation means that the return $G_t$ is the sum of all future rewards, each discounted by $γ^k$ according to how many steps in the future it arrives. The discount factor ensures that rewards received sooner are valued more highly than those received later.
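As a small illustrative sketch (the reward sequence and discount factor below are arbitrary values chosen for the example), the return can be computed by folding the rewards backwards:

```python
def discounted_return(rewards, gamma):
    """Return G_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ...
    for a finite list of future rewards."""
    g = 0.0
    for r in reversed(rewards):   # fold backwards so each step adds one discount factor
        g = r + gamma * g
    return g

# Example: three future rewards with gamma = 0.9.
print(discounted_return([1.0, 0.0, 5.0], gamma=0.9))   # 1 + 0.9*0 + 0.81*5 = 5.05
```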

Markov Decision Process (MDP)

A Markov Decision Process (MDP) is a framework that extends the Markov Reward Process (MRP) by incorporating actions and decision-making into the system. In an MDP, an agent interacts with its environment by choosing actions that influence both the transitions between states and the rewards received. The agent’s objective is to maximize the long-term cumulative rewards by making optimal decisions.

An MDP is defined by the following components:

  1. State (S): The current situation or configuration of the environment.
  2. Action (A): The decision or choice made by the agent at each state.
  3. Transition Probability (P): The probability of moving from one state to another after taking a particular action.
  4. Reward (R): The immediate reward received after transitioning to a new state.
  5. Policy (π): A strategy or rule that defines the actions an agent takes in each state. The goal is to find an optimal policy that maximizes the long-term rewards.

Decision-Making Process

In an MDP, the agent interacts with the environment by taking actions based on its current state. The environment responds by transitioning to a new state according to the transition probability and delivering a reward. The agent then updates its understanding of the environment and selects the next action based on a policy (π).

At each step, the agent aims to make decisions that maximize its expected return, where the return is the sum of future rewards, discounted by a factor $γ$ to emphasize near-term rewards over distant rewards. This decision-making process allows the agent to balance immediate and long-term benefits.

Return (G_t)

The return $G_t$ in an MDP, as in the MRP, is the total discounted sum of future rewards that the agent can expect to accumulate starting from time $t$:

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

The agent’s goal is to maximize this return by following a policy that selects actions leading to the most favorable states and rewards over time.

Discount (γ)

The discount factor $γ$ plays a critical role in Markov Decision Processes (MDPs) by determining how future rewards are valued relative to immediate rewards. The value of $γ$ is always between 0 and 1, and it directly influences the agent’s behavior by controlling how much importance is placed on rewards that occur later in time.

Impact on Long-Term Rewards

The discount factor $γ$ affects the way the agent considers future rewards when making decisions. A higher discount factor (closer to 1) means that the agent values future rewards almost as much as immediate rewards, leading the agent to focus more on long-term strategies. Conversely, a lower discount factor (closer to 0) means the agent prioritizes short-term rewards and places less value on future outcomes.

The return $G_t$ at time $t$ is calculated as the sum of future rewards, discounted by $γ$ at each time step:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$$

If $γ=1$, the agent values future rewards as much as immediate rewards, encouraging long-term planning. If $γ=0$, the agent considers only the immediate reward and ignores any future benefits.

Choosing the Discount Factor

The choice of $γ$ depends on the specific problem and how far into the future the agent needs to plan. For example:

  • Short-Term Focus: In cases where short-term rewards are more critical (e.g., day trading in finance), a lower $γ$ is more appropriate.
  • Long-Term Planning: In scenarios where the agent needs to consider long-term goals (e.g., a robot optimizing energy consumption over a long period), a higher $γ$ is preferred.

Policy (π)

In a Markov Decision Process (MDP), a policy (π) is a strategy that defines the actions an agent should take in each state to maximize its long-term rewards. The policy is essentially a function that maps states to actions, guiding the agent’s behavior as it navigates the environment. The goal of the agent is to find an optimal policy that maximizes the expected return over time.

Types of Policies

There are two main types of policies in MDPs:

  1. Deterministic Policy: A deterministic policy specifies a single action for each state. For every state $s$, the policy $π(s)$ prescribes a specific action $a$.
  2. Stochastic Policy: A stochastic policy assigns probabilities to each possible action in a given state. Instead of always selecting the same action, the agent chooses an action based on a probability distribution over actions. This is represented as $π(a∣s)$, which gives the probability of taking action $a$ in state $s$. Both forms are illustrated in the sketch below.
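As a rough sketch of how the two kinds of policy can be represented in code (the state and action names here are invented for illustration):

```python
import random

# Deterministic policy: each state maps to exactly one action.
deterministic_policy = {
    "s1": "left",
    "s2": "forward",
}

# Stochastic policy: each state maps to a distribution pi(a|s) over actions.
stochastic_policy = {
    "s1": {"left": 0.8, "right": 0.2},
    "s2": {"forward": 0.5, "backward": 0.5},
}

def act(policy, state):
    """Return an action: directly for a deterministic policy, by sampling otherwise."""
    choice = policy[state]
    if isinstance(choice, dict):                      # stochastic case
        actions, probs = zip(*choice.items())
        return random.choices(actions, weights=probs, k=1)[0]
    return choice                                     # deterministic case

print(act(deterministic_policy, "s1"))   # always "left"
print(act(stochastic_policy, "s1"))      # "left" about 80% of the time
```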

Optimal Policy

An optimal policy $π^*$ is the policy that maximizes the expected return (total discounted rewards) from any starting state. The optimal policy ensures that the agent makes decisions that lead to the highest cumulative rewards over time. Finding the optimal policy is the ultimate objective in many MDP problems.

Policy Optimization

The process of finding the optimal policy involves evaluating different policies and improving them based on the outcomes. Techniques such as policy iteration and value iteration are commonly used in reinforcement learning to compute the best policy over time.

Value Functions

In a Markov Decision Process (MDP), value functions are critical for determining the quality of states and actions in terms of expected long-term rewards. Value functions help the agent evaluate how good a particular state or state-action pair is under a given policy. The two most important types of value functions are the state-value function and the action-value function.

1. State-Value Function (V)

The state-value function $V(s)$ represents the expected return (total discounted reward) starting from state $s$ and following a particular policy $π$. This function estimates how valuable it is for the agent to be in a specific state, considering the rewards the agent can expect to receive in the future if it follows the policy $π$.

The state-value function for a given policy $π$ is defined as:

$$V^\pi(s) = \mathbb{E}_\pi\left[ G_t \mid S_t = s \right] = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right]$$

Where:

  • $G_t$ is the total return starting from state $s$,
  • $γ$ is the discount factor,
  • $R_{t+k+1}$​ is the reward received at each time step,
  • $π$ is the policy followed.

This function helps the agent evaluate how valuable each state is in the long run under policy $π$.

2. Action-Value Function (Q)

The action-value function $Q(s,a)$ (also called the Q-value) represents the expected return starting from state $s$, taking action $a$, and then following the policy $π$. It evaluates how good it is to take a particular action in a particular state, considering the rewards that the agent can expect to receive in the future if it follows policy $π$ after taking action $a$.

The action-value function for a given policy $π$ is defined as:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[ G_t \mid S_t = s, A_t = a \right] = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a \right]$$

The action-value function $Q(s,a)$ allows the agent to compare the potential rewards of different actions in the same state, helping it make more informed decisions.

Relationship Between Value Functions

The state-value function $V(s)$ and action-value function $Q(s,a)$ are closely related. The state-value function can be expressed in terms of the action-value function as follows:

$$V^\pi(s) = \sum_{a} \pi(a \mid s)\, Q^\pi(s, a)$$

This equation shows that the value of a state is the weighted average of the values of all possible actions in that state, weighted by the policy’s probability of taking each action.
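This relationship is easy to verify numerically; the policy probabilities and Q-values below are arbitrary numbers picked purely for illustration:

```python
import numpy as np

pi_s = np.array([0.2, 0.5, 0.3])   # pi(a|s) over three actions for one state s
q_s = np.array([1.0, 4.0, -2.0])   # Q^pi(s, a) for the same state and actions

v_s = np.dot(pi_s, q_s)            # V^pi(s) = sum_a pi(a|s) * Q^pi(s, a)
print(v_s)                         # 0.2*1 + 0.5*4 + 0.3*(-2) = 1.6
```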

Bellman Equations

The Bellman Equations are central to solving Markov Decision Processes (MDPs), as they provide a recursive definition for value functions. The Bellman equations decompose the value of a state or state-action pair into immediate rewards and the value of future states. These equations are fundamental for policy evaluation and optimization in MDPs.

There are two primary forms of Bellman equations: one for the state-value function $V(s)$ and another for the action-value function $Q(s,a)$.

1. Bellman Equation for State-Value Function

The Bellman equation for the state-value function $V(s)$ expresses the value of a state as the immediate reward plus the discounted value of the next state, averaged over all possible actions and transitions.

The equation for the state-value function is:

$$V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^\pi(s') \right]$$

Where:

  • $V^π(s)$ is the value of state $s$ under policy $π$,
  • $π(a∣s)$ is the probability of taking action $a$ in state $s$ according to policy $π$,
  • $P(s' \mid s, a)$ is the probability of transitioning to state $s'$ from state $s$ after taking action $a$,
  • $R(s, a, s')$ is the reward for taking action $a$ in state $s$ and moving to state $s'$,
  • $γ$ is the discount factor, determining the importance of future rewards.

2. Bellman Equation for Action-Value Function

The Bellman equation for the action-value function $Q(s,a)$ expresses the value of taking action $a$ in state $s$ as the immediate reward plus the discounted value of the next state, given that action $a$ was chosen.

The equation for the action-value function is:

$$Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s') Q^\pi(s', a') \right]$$

Where:

  • $Q^π(s,a)$ is the value of taking action $a$ in state $s$ under policy $π$,
  • $P(s' \mid s, a)$ is the probability of transitioning to state $s'$ from state $s$ after taking action $a$,
  • $R(s, a, s')$ is the reward received for transitioning from state $s$ to state $s'$ after taking action $a$,
  • $γ$ is the discount factor,
  • $\sum_{a'} \pi(a' \mid s') Q^\pi(s', a')$ is the expected future reward from following the policy $π$ in the next state $s'$.
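To ground these two equations, the sketch below evaluates a fixed policy on a tiny two-state MDP by repeatedly applying the Bellman equation for $V^π$, and then recovers $Q^π$ from the converged values. Every number in the toy model (transitions, rewards, policy, discount) is an assumption made up for this example, and rewards are simplified to depend only on the state and action:

```python
states, actions = [0, 1], [0, 1]
gamma = 0.9

# P[s][a] is a list of (next_state, probability); R[s][a] is the reward for (s, a).
P = {0: {0: [(0, 0.5), (1, 0.5)], 1: [(1, 1.0)]},
     1: {0: [(0, 1.0)],           1: [(1, 1.0)]}}
R = {0: {0: 1.0, 1: 0.0},
     1: {0: 2.0, 1: 0.5}}

pi = {0: [0.5, 0.5], 1: [1.0, 0.0]}   # a fixed stochastic policy pi(a|s)

# Iterative policy evaluation:
# V(s) <- sum_a pi(a|s) * sum_s' P(s'|s,a) * [R(s,a) + gamma * V(s')]
V = {s: 0.0 for s in states}
for _ in range(500):
    V = {s: sum(pi[s][a] * sum(p * (R[s][a] + gamma * V[s2]) for s2, p in P[s][a])
                for a in actions)
         for s in states}

# Q^pi(s, a) = sum_s' P(s'|s,a) * [R(s,a) + gamma * V^pi(s')]
Q = {(s, a): sum(p * (R[s][a] + gamma * V[s2]) for s2, p in P[s][a])
     for s in states for a in actions}

print("V:", V)
print("Q:", Q)
```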

Solving MDPs

Solving a Markov Decision Process (MDP) involves finding an optimal policy $π^*$ that maximizes the expected cumulative reward (return) over time. There are several approaches to solving MDPs, each of which relies on value functions and the recursive structure of the Bellman equations. The two most common methods are policy iteration and value iteration, both of which leverage dynamic programming techniques to converge on the optimal policy.

1. Policy Iteration

Policy Iteration is an iterative method that alternates between evaluating a policy and improving it until the optimal policy is found.

Steps of Policy Iteration:

  1. Policy Evaluation: In this step, the algorithm computes the value of each state $V^π(s)$ under a given policy $π$. This is done by solving the Bellman equation for the current policy.
  2. Policy Improvement: Once the state-value function $V^π(s)$ is known, the policy is updated by choosing the action in each state that maximizes the expected return. The new policy is determined using the action-value function $Q(s,a)$:

$$\pi'(s) = \arg\max_a Q(s, a)$$

This step generates a new policy that is expected to yield higher returns than the previous one.

These two steps—evaluating and improving the policy—are repeated until the policy converges to the optimal policy $π^*$, at which point the algorithm terminates.
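A minimal, self-contained policy iteration sketch is shown below. The three-state model, its rewards, and the discount factor are all invented for illustration; the evaluation step solves the Bellman equations as a linear system, and the improvement step acts greedily with respect to $Q(s,a)$:

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9

# P[a, s, s'] = probability of moving from s to s' under action a (each row sums to 1).
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],   # action 0
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],   # action 1
])
# R[s, a] = expected immediate reward for taking action a in state s.
R = np.array([[0.0, 1.0],
              [0.0, 2.0],
              [1.0, 5.0]])

policy = np.zeros(n_states, dtype=int)   # start from an arbitrary deterministic policy

for _ in range(100):
    # Policy evaluation: solve the Bellman equations (I - gamma * P_pi) V = R_pi exactly.
    P_pi = P[policy, np.arange(n_states)]        # next-state distributions under the policy
    R_pi = R[np.arange(n_states), policy]        # rewards under the policy
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

    # Policy improvement: act greedily with respect to Q(s, a).
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    new_policy = np.argmax(Q, axis=1)

    if np.array_equal(new_policy, policy):       # no change -> optimal policy found
        break
    policy = new_policy

print("Optimal policy:", policy)
print("State values:  ", np.round(V, 2))
```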

2. Value Iteration

Value Iteration is another dynamic programming approach that focuses on directly updating the value function without explicitly evaluating policies.

Steps of Value Iteration:

  • Bellman Update: In value iteration, the state-value function $V(s)$ is updated iteratively using the Bellman optimality equation:

$$V(s) \leftarrow \max_a \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V(s') \right]$$

    This equation computes the maximum expected return for each state by choosing the best action.
  • Policy Extraction: Once the state-value function converges, the optimal policy can be derived by choosing, in each state, the action that maximizes the expected return.

Value iteration updates the value of each state based on the best possible action at every step, ensuring convergence to the optimal state-value function and policy simultaneously.
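The following sketch applies value iteration to a similar invented toy model (all numbers are assumptions for the example), repeating the Bellman optimality update until the values stop changing and then extracting a greedy policy:

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9

# P[a, s, s'] = probability of moving from s to s' under action a.
P = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],   # action 0
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]],   # action 1
])
# R[s, a] = expected immediate reward.
R = np.array([[0.0, 0.5],
              [0.0, 1.0],
              [2.0, 0.0]])

V = np.zeros(n_states)
for _ in range(10_000):
    # Bellman optimality update: V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) * V(s') ]
    V_new = (R + gamma * np.einsum("ast,t->sa", P, V)).max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:   # values have converged
        V = V_new
        break
    V = V_new

# Policy extraction: choose the greedy action with respect to the converged values.
policy = np.argmax(R + gamma * np.einsum("ast,t->sa", P, V), axis=1)
print("Optimal values:", np.round(V, 2))
print("Greedy policy: ", policy)
```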

Other Methods

  • Q-learning: A reinforcement learning algorithm that learns the action-value function $Q(s,a)$ without needing a model of the environment (a minimal tabular sketch appears after this list).
  • Monte Carlo Methods: These methods estimate value functions based on averaging returns from multiple episodes or simulations, especially useful in environments where the transition probabilities are unknown.
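Below is a rough sketch of the tabular Q-learning update rule. Because the article does not specify an environment, a small randomly generated one stands in for the unknown dynamics, and the learning rate, discount factor, and exploration rate are assumptions chosen for the example:

```python
import random
from collections import defaultdict

n_states, n_actions = 4, 2
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount, exploration rate

# A made-up deterministic toy environment; the agent only interacts with it through step().
random.seed(0)
T = {(s, a): random.randrange(n_states) for s in range(n_states) for a in range(n_actions)}
Rw = {(s, a): random.uniform(-1, 1) for s in range(n_states) for a in range(n_actions)}

def step(state, action):
    """Sample the next state and reward from the (hidden) environment."""
    return T[(state, action)], Rw[(state, action)]

Q = defaultdict(float)   # tabular Q-values, Q[(s, a)], initialised to zero

s = 0
for _ in range(5000):
    # Epsilon-greedy action selection.
    if random.random() < epsilon:
        a = random.randrange(n_actions)
    else:
        a = max(range(n_actions), key=lambda act: Q[(s, act)])

    s_next, r = step(s, a)

    # Q-learning update: Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    best_next = max(Q[(s_next, a2)] for a2 in range(n_actions))
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    s = s_next

print({k: round(v, 2) for k, v in dict(Q).items()})
```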

Examples of the Markov Decision Process

Markov Decision Processes (MDPs) are powerful tools for modeling decision-making in environments with uncertainty. They have a wide range of applications across various domains, where agents must make a sequence of decisions to optimize long-term outcomes. Below are some of the key applications of MDPs:

1. Robotics

  • In robotics, MDPs are used to model the decision-making process of robots as they interact with dynamic environments. Robots must make decisions at each time step, such as how to move, what actions to take, and how to respond to obstacles or changes in their environment.
  • Example: An autonomous warehouse robot may use an MDP to decide the optimal path for picking up items, avoiding obstacles, and minimizing travel time. By balancing immediate rewards (quick access to items) with long-term goals (overall efficiency), the robot can make better decisions.

2. Reinforcement Learning

  • MDPs form the foundation for many reinforcement learning algorithms, where an agent learns to make decisions by interacting with its environment. In reinforcement learning, the agent seeks to maximize the cumulative reward by learning an optimal policy over time.
  • Example: Game-playing AI systems, such as those used in AlphaGo or video game bots, use MDPs to evaluate possible future moves and determine the best strategy for winning the game.

3. Healthcare

  • MDPs are applied in healthcare to model decision-making processes for treatment planning and patient care. Doctors and healthcare systems can use MDPs to determine optimal treatment policies based on patient states and expected outcomes.
  • Example: In cancer treatment, MDPs can help optimize radiation therapy by modeling the patient’s health state, treatment effects, and long-term outcomes. The goal is to balance immediate treatment effectiveness with long-term quality of life, considering the uncertainty of the treatment’s effects.

4. Financial Modeling

  • MDPs are used in finance to model investment decisions, trading strategies, and portfolio management. Investors use MDPs to optimize their decision-making over time, considering the uncertainties in stock prices, market trends, and returns.
  • Example: An investor might use an MDP to model decisions on when to buy or sell stocks, with the goal of maximizing their overall return while managing risk. The states could represent different market conditions, and actions could include buying, selling, or holding assets.

5. Operations Research

  • In operations research, MDPs are employed to optimize resource allocation, inventory management, and supply chain operations. MDPs help companies make decisions that balance immediate costs with long-term efficiency.
  • Example: A company may use an MDP to manage inventory levels, determining when to order new stock and how much to order based on fluctuating demand and storage costs. The goal is to minimize holding costs while ensuring enough inventory is available to meet demand.

6. Autonomous Driving

  • MDPs play a key role in autonomous vehicle decision-making, helping self-driving cars navigate complex environments while optimizing for safety, fuel efficiency, and travel time.
  • Example: An autonomous car uses an MDP to decide whether to speed up, slow down, change lanes, or make turns at intersections, based on the current traffic conditions, road layout, and obstacles. The policy helps the car reach its destination efficiently while avoiding collisions.

7. Energy Management

  • MDPs are used in energy management systems to optimize the use of resources, such as electricity or fuel, while considering fluctuating demand, costs, and supply constraints.
  • Example: An energy grid might use an MDP to decide how to distribute electricity from various power plants, considering factors like cost, demand, and weather conditions. The goal is to minimize costs while ensuring a stable and reliable energy supply.

8. Natural Language Processing (NLP)

  • MDPs are also applied in NLP tasks such as dialogue systems, where an AI must decide the best response to user queries while maximizing user satisfaction.
  • Example: In a customer service chatbot, MDPs help the system decide how to respond to user inquiries, considering the current conversation state and the long-term goal of resolving the issue.

Conclusion

The Markov Decision Process (MDP) is a powerful framework for modeling decision-making in environments where outcomes are influenced by both the agent’s actions and random events. By breaking down problems into states, actions, rewards, and transitions, MDPs provide a structured way to find optimal policies that maximize long-term rewards.

Whether applied in robotics, healthcare, finance, or autonomous driving, MDPs offer valuable insights into how agents can navigate uncertainty and make informed decisions. The use of value functions and Bellman equations enables efficient computation of optimal policies, allowing agents to balance immediate rewards with future gains.

As AI continues to evolve, the principles behind MDPs will remain crucial for solving increasingly complex problems that require intelligent, adaptive decision-making. Understanding MDPs provides a foundation for mastering reinforcement learning and other advanced topics in artificial intelligence.