Bayesian Network for an MDP

Agent’s Goal:

To find a policy that maximizes the expected total reward it will receive. (one of many possible goals in the RL literature)

Objective Function

$$J: \Pi \rightarrow \mathbb{R}$$ $J$ is not a random variable. It is sometimes denoted $\rho$.

$\Pi$ (uppercase pi) denotes the set of all policies. $$J(\pi)=\mathbb{E}\left[\sum_{t=0}^{\infty} R_{t} \mid \pi\right]$$

Wrong. Will correct it soon!

Elements of $\Pi$ are policies, not events, so strictly speaking we cannot write this as a conditional expectation. The notation is shorthand for "the actions are generated by following $\pi$."
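To make that shorthand concrete, here is a minimal Monte Carlo sketch of estimating $J(\pi)$: "conditioning on $\pi$" operationally means averaging returns over trajectories generated by following $\pi$. The episodic, Gym-style environment interface (`reset()` and `step(action)` returning `(next_state, reward, done)`) and the `policy` function are assumptions for illustration, not part of these notes.

```python
def estimate_J(env, policy, num_episodes=1000):
    """Monte Carlo estimate of J(pi) = E[ sum_t R_t | pi ].

    Assumes an episodic environment (episodes terminate) so the
    undiscounted sum of rewards is finite for every trajectory.
    """
    returns = []
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        total_reward = 0.0
        while not done:
            action = policy(state)                    # actions generated by pi
            state, reward, done = env.step(action)    # assumed interface
            total_reward += reward                    # accumulate R_t
        returns.append(total_reward)
    return sum(returns) / len(returns)                # sample mean approximates the expectation
```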

Reward Discount $\gamma \in [0,1]$

$$J(\pi)=\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^t R_{t} \mid \pi\right]$$

$\gamma < 1$ means that rewards that occur later are worth less to the agent. One purpose is to prevent $J$ from being infinite (when $\gamma < 1$):

$$\sum_{t=0}^{\infty} \gamma^{t}=\frac{1}{1-\gamma}$$ By this geometric series, if $|R_t| \le R_{\max}$ for all $t$, then $J(\pi)$ lies in the range $\left[ -\frac{R_{\max}}{1-\gamma}, \frac{R_{\max}}{1-\gamma} \right]$.
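A quick numerical check of this bound (a sketch, assuming rewards are bounded by $R_{\max}$ in absolute value): a constant reward stream $R_t = R_{\max}$ attains the geometric-series limit $R_{\max}/(1-\gamma)$.

```python
gamma = 0.9
R_max = 1.0

# Discounted return of the constant stream R_t = R_max, truncated at T steps.
T = 1000
truncated_return = sum(gamma**t * R_max for t in range(T))

# Closed-form limit from the geometric series.
closed_form_bound = R_max / (1 - gamma)

print(truncated_return)   # approaches 10.0 as T grows
print(closed_form_bound)  # 10.0
```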


Questions

  1. Why is $\gamma$ part of the environment (the MDP), rather than something the agent chooses as part of its objective function? — Because it is part of the problem specification.
