## Utility

If I am slightly risk averse, which lottery do I prefer in each of the following pairs?

[0.5,$900;0.5,$800] or [0.1,$8750;0.9,$0]

[0.6,$100;0.4,$90] or [0.6001,$100;0.3999,$90]

[0.5,$110;0.5,$90] or [0.5001,$90;0.4999,$150]
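One way to reason about these pairs is to compare each lottery's expected monetary value with its expected utility under some concave utility function. The sketch below is only illustrative: the helper names and the power utility `payoff ** 0.9` are assumptions chosen to stand in for "slightly risk averse", not something given in the problem.

```python
# A lottery [p1,$a; p2,$b] is written as a list of (probability, payoff) pairs.

def expected_value(lottery):
    """Expected monetary value of a lottery."""
    return sum(p * payoff for p, payoff in lottery)

def expected_utility(lottery, utility):
    """Expected utility of a lottery under the given utility function."""
    return sum(p * utility(payoff) for p, payoff in lottery)

def mildly_risk_averse(payoff, exponent=0.9):
    # ASSUMPTION: a mildly concave power utility; any concave function
    # would model risk aversion, and the exponent is arbitrary.
    return payoff ** exponent

pairs = [
    ([(0.5, 900), (0.5, 800)], [(0.1, 8750), (0.9, 0)]),
    ([(0.6, 100), (0.4, 90)], [(0.6001, 100), (0.3999, 90)]),
    ([(0.5, 110), (0.5, 90)], [(0.5001, 90), (0.4999, 150)]),
]

for first, second in pairs:
    for lottery in (first, second):
        print(lottery,
              "EV =", round(expected_value(lottery), 4),
              "EU =", round(expected_utility(lottery, mildly_risk_averse), 4))
    print()
```

A more strongly concave utility (a smaller exponent, say) can change which lottery wins in a pair where the riskier option has the higher expected value.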

## Decisions

This section follows the flow of the oil example given in class.

### Max Expected Utility

Let $P(x) = 0.2$ for a Boolean Random Variable X.

Assume that you have to make a decision D=1 or 2, leading to different utilities, both functions of X:

$U(D=1,x)= 400\,$ and $U(D=1,\neg x)= 2\,$

$U(D=2,x)= 20\,$ and $U(D=2,\neg x)= 100\,$

Compute the Expected Utilities and state what choice you would make.
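The computation can be sketched in a few lines of Python. The dictionary layout and the function name are assumptions made for illustration; the numbers are the ones given above.

```python
P_X = 0.2  # P(x); P(not x) is 1 - P_X

# UTILITY[d][x_value] is U(D=d, X=x_value)
UTILITY = {
    1: {True: 400, False: 2},
    2: {True: 20, False: 100},
}

def expected_utility(decision, p_x=P_X):
    """E[U(D=decision, X)] = P(x) U(D, x) + P(not x) U(D, not x)."""
    u = UTILITY[decision]
    return p_x * u[True] + (1 - p_x) * u[False]

expected = {d: expected_utility(d) for d in UTILITY}
best_decision = max(expected, key=expected.get)  # maximum-expected-utility choice
print(expected, "choose D =", best_decision)
```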

### Posterior Expected Utilities

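Suppose that X (Boolean: true or false) influences another Random Variable, Y (three-valued: 1, 2 or 3), in the following way. The probabilities for $Y=3$ are not listed; they follow from the fact that each conditional distribution must sum to 1.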

$P(Y=1|x) = 0.2\,$

$P(Y=2|x) = 0.4\,$

$P(Y=1|\neg x) = 0.6\,$

$P(Y=2|\neg x) = 0.3\,$

You will be computing many probabilities in this and subsequent sections. You may either compute the joint distribution of X and Y and then sum out the values you need, or use Bayes' law directly.

Compute the posterior probabilities:

• $P(x|Y=1)\,$
• $P(x|Y=2)\,$
• $P(x|Y=3)\,$

Use these probabilities to compute the posterior expected utilities:

• $E(U(D=1,X|Y=1))\,$
• $E(U(D=2,X|Y=1))\,$
• $E(U(D=1,X|Y=2))\,$
• $E(U(D=2,X|Y=2))\,$
• $E(U(D=1,X|Y=3))\,$
• $E(U(D=2,X|Y=3))\,$

What choice would you make in each of the following cases, and what utility would you expect given that choice? You will use these three conditional expected utilities in the next section.

• Y=1
• Y=2
• Y=3
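The posterior computations above can be sketched via the joint distribution. This is a minimal illustration, not a prescribed method: the names are hypothetical, and the entries for $Y=3$ are assumed from normalization (0.4 given $x$, 0.1 given $\neg x$).

```python
P_X = 0.2
# P_Y_GIVEN_X[x_value][y]; the Y=3 entries are inferred so each row sums to 1.
P_Y_GIVEN_X = {
    True: {1: 0.2, 2: 0.4, 3: 0.4},
    False: {1: 0.6, 2: 0.3, 3: 0.1},
}
UTILITY = {1: {True: 400, False: 2}, 2: {True: 20, False: 100}}

def posterior_x(y):
    """P(x | Y=y) via Bayes' law: joint P(x, y) divided by P(y)."""
    joint_true = P_X * P_Y_GIVEN_X[True][y]
    joint_false = (1 - P_X) * P_Y_GIVEN_X[False][y]
    return joint_true / (joint_true + joint_false)

def posterior_expected_utility(decision, y):
    """Expected utility of a decision after observing Y=y."""
    p = posterior_x(y)
    return p * UTILITY[decision][True] + (1 - p) * UTILITY[decision][False]

for y in (1, 2, 3):
    utilities = {d: posterior_expected_utility(d, y) for d in UTILITY}
    choice = max(utilities, key=utilities.get)
    print(f"Y={y}: P(x|Y)={posterior_x(y):.4f}, utilities={utilities}, choose D={choice}")
```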

### Expected Value of Sample Information

Compute the following probabilities:

• $P(Y=1)\,$
• $P(Y=2)\,$
• $P(Y=3)\,$

What is the Expected Posterior Utility? To compute it, multiply (pair-wise) the probabilities you just computed by the conditional expected utilities you computed at the end of the previous section, then add up the products.

What is the Expected Value of Sample Information (EVSI)? It is the Expected Posterior Utility you just computed minus the Maximum Expected Utility computed in the Max Expected Utility subsection above.

This sequence of questions led you through the computations needed for EVSI.

• The computation of the Maximum Expected Utility (the first question) is valuable in its own right, not just in the context of the computation of EVSI. It is the general key to making correct decisions.
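Putting the whole sequence together, EVSI can be sketched as one function. The function name and argument layout are assumptions for illustration, and the $Y=3$ probabilities are again inferred from normalization.

```python
def evsi(p_x, p_y_given_x, utility):
    """Expected posterior utility minus the prior maximum expected utility."""
    prior = {True: p_x, False: 1 - p_x}

    # Maximum expected utility before observing Y.
    meu = max(sum(prior[x] * utility[d][x] for x in prior) for d in utility)

    expected_posterior = 0.0
    for y in p_y_given_x[True]:
        # P(Y=y), obtained by summing X out of the joint.
        p_y = sum(prior[x] * p_y_given_x[x][y] for x in prior)
        # Best decision once Y=y is known, using P(x | Y=y) = P(x, y) / P(y).
        best = max(
            sum(prior[x] * p_y_given_x[x][y] / p_y * utility[d][x] for x in prior)
            for d in utility
        )
        expected_posterior += p_y * best
    return expected_posterior - meu

P_Y_GIVEN_X = {
    True: {1: 0.2, 2: 0.4, 3: 0.4},   # Y=3 entry assumed from normalization
    False: {1: 0.6, 2: 0.3, 3: 0.1},  # Y=3 entry assumed from normalization
}
UTILITY = {1: {True: 400, False: 2}, 2: {True: 20, False: 100}}

print("EVSI =", evsi(0.2, P_Y_GIVEN_X, UTILITY))
```

Because the best decision is chosen separately for each observed value of Y, the result is never negative: observing Y can only help or leave the decision unchanged.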

## MDP's and Learning

### MDP's

Consider the following simple MDP:

<table border=1>
<tr> <td>A</td> <td>B</td> <td>C</td> </tr>
<tr> <td>D</td> <td>E</td> <td>F</td> </tr>
</table>

Assume that:

• There is a reward of 10 in state C
• There is a reward of 9 in state D
• C and D are absorbing
• Your available actions are Up, Down, Left and Right
• Use a discount ($\gamma$) of 0.9
• When you move, you go in the intended direction with probability 0.9; with probability 0.1 you magically and bizarrely teleport from wherever you are back to state A
• If you crash into a wall (try to go up when you are in state A, for example) you return to the state you were in.

Do two full (all states) iterations of each of the following:

• Value Iteration (start with U(s)=0 for all non-goal states and U(s) equal to the reward for all goal states, and use Jacobi updates. If you don't remember what that means: as you update state B, you use the un-updated values for the neighbors of B. In other words, use the old value for A, even though you have already computed the update for A.)
• Modified Policy Iteration (start with $\pi (s) = up$ for all states)

You might also want to satisfy yourself that you could do policy iteration too, but don't turn it in. Policy iteration would require access to a linear-equation solver.
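The Jacobi-style value iteration described above can be sketched as follows. Several details are assumptions that the problem statement leaves open: non-goal states have reward 0, the update is $U(s) \leftarrow \gamma \max_a \sum_{s'} P(s'|s,a)U(s')$, and the 0.1 teleport to A applies even when the intended move hits a wall. All names are hypothetical.

```python
GAMMA = 0.9
GRID = [["A", "B", "C"],
        ["D", "E", "F"]]
GOAL_REWARD = {"C": 10.0, "D": 9.0}  # absorbing states keep their reward as value
MOVES = {"Up": (-1, 0), "Down": (1, 0), "Left": (0, -1), "Right": (0, 1)}
POSITION = {name: (r, c) for r, row in enumerate(GRID) for c, name in enumerate(row)}

def intended_target(state, action):
    """State reached if the move succeeds; a wall crash leaves you in place."""
    r, c = POSITION[state]
    dr, dc = MOVES[action]
    nr, nc = r + dr, c + dc
    if 0 <= nr < len(GRID) and 0 <= nc < len(GRID[0]):
        return GRID[nr][nc]
    return state

def jacobi_sweep(u):
    """One full sweep that reads only the old values in u (Jacobi update)."""
    new_u = dict(u)
    for s in POSITION:
        if s in GOAL_REWARD:
            continue  # absorbing goal states are not updated
        # ASSUMPTION: 0.9 intended move, 0.1 teleport to A, reward 0 in s.
        best = max(0.9 * u[intended_target(s, a)] + 0.1 * u["A"] for a in MOVES)
        new_u[s] = GAMMA * best
    return new_u

u0 = {s: GOAL_REWARD.get(s, 0.0) for s in POSITION}
u1 = jacobi_sweep(u0)
u2 = jacobi_sweep(u1)
print(u1)
print(u2)
```

Writing the new values into a copy (`new_u`) rather than updating `u` in place is what makes this a Jacobi update; updating in place would use already-updated neighbors instead.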

<!–

### Q-Learning

Using the modified version of the example used in class, shown below

<table border=1>
<tr> <td>A</td> <td>B</td> <td>C</td> </tr>
</table>

where:

• $\alpha=0.5$ (not 1.0 as in class)
• $\gamma=0.9$
• Actions are up, down, right, left
• Transitions are non-deterministic
• C teleports to B
• Initialize the table to all 0.5
• Use the recommended form of the Q-value update $Q(a,s) \leftarrow \alpha[R(s,a,s')+\gamma \max_{a'}Q(a',s')]+(1-\alpha)[Q(a,s)]$
• Use $R(s,a,s')=1$ if $s'=C$; 0 otherwise
• Treat $Q(a,C) \equiv 0$ for all a (Remember that you do not learn Q values for C).
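A single application of the update rule above can be sketched like this. The table layout and function name are assumptions; the sketch only shows the arithmetic of one update and does not settle how the "C → B" step in the trace should be handled.

```python
ALPHA = 0.5
GAMMA = 0.9
ACTIONS = ("U", "D", "L", "R")

def q_update(q, state, action, reward, next_state):
    """Q(a,s) <- alpha [R + gamma max_a' Q(a',s')] + (1 - alpha) Q(a,s)."""
    if next_state == "C":
        best_next = 0.0  # Q(a, C) is treated as 0 and never learned
    else:
        best_next = max(q[(a, next_state)] for a in ACTIONS)
    q[(action, state)] = (ALPHA * (reward + GAMMA * best_next)
                          + (1 - ALPHA) * q[(action, state)])
    return q[(action, state)]

# Table initialized to 0.5 for every action in the learned states A and B.
q = {(a, s): 0.5 for a in ACTIONS for s in ("A", "B")}

# Example: action U taken in A, ending in A with reward 0.
first = q_update(q, "A", "U", 0.0, "A")
print(first)
```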

Compute the updates for the following trace:

<table border=1>
<tr> <th>Step</th> <th>State</th> <th>Action</th> <th>Result</th> <th>Reward</th> </tr>
<tr> <th>1</th> <td>A</td> <td>U</td> <td>A</td> <td>0.0</td> </tr>
<tr> <th>2</th> <td>A</td> <td>R</td> <td>B</td> <td>0.0</td> </tr>
<tr> <th>3</th> <td>B</td> <td>U</td> <td>A</td> <td>0.0</td> </tr>
<tr> <th>4</th> <td>A</td> <td>R</td> <td>B</td> <td>0.0</td> </tr>
<tr> <th>5</th> <td>B</td> <td>R</td> <td>C → B</td> <td>1.0</td> </tr>
<tr> <th>6</th> <td>B</td> <td>R</td> <td>B</td> <td>0.0</td> </tr>
</table>–>