The Data Scientist's Odyssey

The organized chaos of all things data science, machine learning, and social science


Markov Decision Processes - The Framework For Decision-Making Under Uncertainty

How do we optimize decision making in uncertain environments?

Finite Markov Decision Processes (MDPs) are fundamental to reinforcement learning (RL) and are an extension of Markov chains, systems in which the next state depends only on the current state. MDPs extend Markov chains by adding actions and rewards, turning a stochastic process into a stochastic decision process. They are useful for studying optimization problems solved with dynamic programming, a method for problems with optimal substructure, that is, where the optimal solution to a problem can be found by combining optimal solutions to its subproblems. In this post, we'll go over the basics of MDPs and how they are used in RL.
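To make the Markov property concrete, here's a minimal sketch in Python (the two-state chain and its probabilities are made up for illustration) of sampling a trajectory from a Markov chain, where the distribution of the next state depends only on the current state:

```python
import numpy as np

# Hypothetical two-state Markov chain: 0 = "sunny", 1 = "rainy".
# Row i is the distribution over the next state given current state i.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

rng = np.random.default_rng(0)
state = 0
trajectory = [state]
for _ in range(10):
    # The next state depends only on the current state (the Markov property).
    state = rng.choice(2, p=P[state])
    trajectory.append(state)

print(trajectory)
```

An MDP would replace the single transition matrix with one transition distribution per action and attach a reward to each transition.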

MDPs involve evaluative feedback, as with the k-armed bandits, but also an associative aspect: actions influence not just immediate rewards but also subsequent situations (or states). This leads to delayed rewards and the need to trade off immediate and delayed rewards, because the state the agent finds itself in can affect future rewards. In this sense, the RL agent is trying to learn how both its environment and its actions map onto value. In k-armed bandits, the goal is to estimate the value $q_{*}(a)$ of each action $a$; with MDPs, the goal is to estimate the value $q_{*}(s,a)$ of each action $a$ in each state $s$, or the value $v_{*}(s)$ of each state given optimal action selections. It's important to note that MDPs are a mathematically idealized form of the RL problem.
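To make the associative aspect concrete, here is a tiny illustrative sketch (the states and actions are made up) of the difference in what gets estimated: a bandit keeps one value estimate per action, while an MDP keeps one per state-action pair:

```python
# k-armed bandit: value estimates are indexed by action alone, q(a).
q_bandit = {a: 0.0 for a in range(3)}

# MDP: value estimates are indexed by (state, action) pairs, q(s, a).
states, actions = ["s0", "s1"], range(3)
q_mdp = {(s, a): 0.0 for s in states for a in actions}
```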

The agent-environment interface is the foundation of MDPs. The agent, the learner and decision maker, interacts with the environment, which presents new situations and gives rise to rewards that the agent seeks to maximize over time through its actions. The agent and environment interact at discrete time steps $t=0,1,2,3,...$: at each time step, the agent receives a representation of the environment's state, $S_t \in S$, and from that selects an action, $A_t \in A(s)$. One time step later, the agent receives a reward $R_{t+1} \in R$ and finds itself in a new state $S_{t+1}$. The state space can be anything, including stock returns, the score of a game, or the position of a robot. A common diagram used to represent this, as well as RL in general, is the following:

A Markov Decision Process involves an agent interacting with an environment. At discrete time steps, the agent takes an action and then receives a reward and state in the next time step. The process repeats.
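The loop in that diagram maps almost directly onto code. Below is a minimal sketch assuming a hypothetical `env` with Gym-style `reset` and `step` methods and a hypothetical `agent` with `select_action` and `observe` methods; none of these names come from the MDP formalism itself:

```python
def run_episode(env, agent, max_steps=100):
    """One pass through the agent-environment loop: S_t -> A_t -> R_{t+1}, S_{t+1}."""
    state = env.reset()                               # S_0
    for t in range(max_steps):
        action = agent.select_action(state)           # A_t, chosen from A(S_t)
        next_state, reward, done = env.step(action)   # R_{t+1}, S_{t+1}
        agent.observe(state, action, reward, next_state)  # where learning would happen
        state = next_state
        if done:
            break
```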

In a finite MDP, the sets of states, actions, and rewards ($S, A, R$) all have a finite number of elements. In this case, the random variables $R_t$ and $S_t$ have well-defined discrete probability distributions that depend only on the preceding state and action. The state-transition probabilities can be computed as:

$$p(s'|s,a) \coloneqq Pr(S_t=s'|S_{t-1}=s,A_{t-1}=a) = \sum_{r \in R}p(s',r|s,a)$$

and the expected rewards for state-action pairs can be computed as:

$$r(s,a) \coloneqq \mathbf{E}[R_{t}|S_{t-1}=s,A_{t-1}=a] = \sum_{r \in R}r\sum_{s' \in S}p(s',r|s,a)$$
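Both quantities are just sums over the four-argument dynamics function $p(s',r|s,a)$. Here is a small sketch, with a made-up dynamics table stored as a dictionary keyed by $(s,a)$:

```python
# Hypothetical dynamics p(s', r | s, a), stored as
# {(s, a): {(s_next, r): probability}}; probabilities for each (s, a) sum to 1.
dynamics = {
    ("s0", "a0"): {("s0", 0.0): 0.7, ("s1", 1.0): 0.3},
    ("s1", "a0"): {("s0", 0.0): 0.4, ("s1", 1.0): 0.6},
}

def transition_prob(s_next, s, a):
    """p(s' | s, a) = sum over rewards r of p(s', r | s, a)."""
    return sum(p for (sp, r), p in dynamics[(s, a)].items() if sp == s_next)

def expected_reward(s, a):
    """r(s, a) = sum over s', r of r * p(s', r | s, a)."""
    return sum(r * p for (sp, r), p in dynamics[(s, a)].items())

print(transition_prob("s1", "s0", "a0"))  # 0.3
print(expected_reward("s0", "a0"))        # 0.3
```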

The general rule is that whatever cannot be changed by the agent is considered part of its environment. In particular, the reward computation is always considered external to the agent because it defines the task facing the agent and is beyond its ability to change arbitrarily. The agent-environment boundary represents the limit of the agent's absolute control, not of its knowledge.

Sources

Sutton, R. S., Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed.). MIT Press Ltd.

White, M., White, A. (n.d.). Reinforcement Learning Specialization [MOOC]. Coursera.

Markov Chains. (n.d.). Brilliant Math & Science Wiki. https://brilliant.org/wiki/markov-chains/

Wikipedia Contributors. (2019, May 20). Markov decision process. Wikipedia; Wikimedia Foundation. https://en.wikipedia.org/wiki/Markov_decision_process

About
Hello! Name's Eddie. I finished my PhD in experimental fusion physics in 2019 and have been working as an operations research scientist for a few years. I enjoy all things science and hope to understand my world better by reading the best sources and summarizing the information to the best of my ability for all to enjoy. I hope you enjoy my blog!