Intro to AI

http://ai.berkeley.edu



Search



Constraint Satisfaction Problems



Adversarial Search

Types

  1. Zero-Sum Games

    Agents have opposite utilities

  2. General Games

    Agents have independent utilities

Strategy

  1. Minimax with Alpha-Beta Pruning

    MAX -> MIN -> MAX -> … -> Utility

    • alpha : the best value MAX can already guarantee on the path to the root (a MIN node may stop expanding once its value drops to alpha or below)
    • beta : the best value MIN can already guarantee on the path to the root (a MAX node may stop expanding once its value reaches beta or above)
    • Algo
      • If the state is a terminal state, return the state’s utility
      • If the next agent is MAX, return the max utility of its successors : max-value(state, alpha, beta)
      • If the next agent is MIN, return the min utility of its successors : min-value(state, alpha, beta)
    • Problem : like DFS, searching the whole game tree costs too much. So we can do depth-limited search (or iterative deepening, etc.) and replace terminal utilities with an evaluation function (see the sketch after the Expectimax item below)

      In practice, the evaluation function is a weighted linear sum of features.

  2. Expectimax Search

    MAX -> EXP -> MAX -> EXP -> … -> Utility

    Values now reflect average-case (expectimax) outcomes, not worst-case (minimax) outcomes.

    • Algo
      • If the state is a terminal state, return the state’s utility
      • If the next agent is MAX, return the max utility of its successors : max-value(state)
      • If the next agent is EXP, return the expected utility of its successors : exp-value(state)
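
As referenced above, here is a minimal Python sketch of both recursions. The `game` object (with `is_terminal`, `utility`, `evaluate`, `successors`, and `agent` methods) is a hypothetical interface invented for illustration, not part of the Berkeley code.

```python
import math

# Depth-limited minimax with alpha-beta pruning, and expectimax.
# `game.successors(s)` is assumed to yield (action, next_state) pairs and
# `game.agent(s)` to return "MAX", "MIN", or "EXP"; `evaluate` is the
# evaluation function used at the depth limit.

def alphabeta(game, state, depth, alpha=-math.inf, beta=math.inf):
    if game.is_terminal(state):
        return game.utility(state)
    if depth == 0:
        return game.evaluate(state)          # evaluation function replaces true utility
    if game.agent(state) == "MAX":
        value = -math.inf
        for _, child in game.successors(state):
            value = max(value, alphabeta(game, child, depth - 1, alpha, beta))
            if value >= beta:                # MIN above will never allow this branch
                return value
            alpha = max(alpha, value)
        return value
    else:  # MIN
        value = math.inf
        for _, child in game.successors(state):
            value = min(value, alphabeta(game, child, depth - 1, alpha, beta))
            if value <= alpha:               # MAX above will never allow this branch
                return value
            beta = min(beta, value)
        return value

def expectimax(game, state, depth):
    if game.is_terminal(state):
        return game.utility(state)
    if depth == 0:
        return game.evaluate(state)
    children = [child for _, child in game.successors(state)]
    if game.agent(state) == "MAX":
        return max(expectimax(game, child, depth - 1) for child in children)
    # EXP node: average over successors (uniform chance nodes assumed for simplicity)
    return sum(expectimax(game, child, depth - 1) for child in children) / len(children)
```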

Utility

Utility Theorem [Ramsey, 1931; von Neumann & Morgenstern, 1944]

Rational Preferences



Markov Decision Processes

| Problem | Goal | Technique |
| --- | --- | --- |
| Known MDP (offline) | Compute $V^*$, $Q^*$, $\pi^*$ | Value / policy iteration |
| Known MDP (offline) | Evaluate a fixed policy $\pi$ | Policy evaluation |
| Unknown MDP (model-based) | Compute $V^*$, $Q^*$, $\pi^*$ | Value / policy iteration on the approximated MDP |
| Unknown MDP (model-based) | Evaluate a fixed policy $\pi$ | Policy evaluation on the approximated MDP |
| Unknown MDP (model-free) | Compute $V^*$, $Q^*$, $\pi^*$ | Q-learning |
| Unknown MDP (model-free) | Evaluate a fixed policy $\pi$ | Value learning |

Markov Decision Processes

“Markov” generally means that given the present state, the future and the past are independent

Offline (T, R known)

state $s$ -> take action $a$ -> q-state $(s,a)$ -> transition $(s,a,s')$ -> state $s'$

Algo

Bellman updates

  1. Value Iteration

    Asynchronous Value Iteration :

    In standard (synchronous) value iteration we update every state in each iteration. In asynchronous value iteration, states are updated one at a time in any order ; in fact, any sequence of Bellman updates converges as long as every state is updated infinitely often.

    • Initialization $V_0(s) = 0$
    • Iteration : $V_{k+1}(s) = \max\limits_a \sum\limits_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V_k(s') \right]$ (a code sketch for value iteration and policy extraction follows this list)

  2. Policy Iteration

    Alternate policy evaluation and policy improvement until the policy converges

    • Policy Evaluation

      Compute the utility of states under a fixed policy $\pi$

      • Iteration : $V^{\pi}_{k+1}(s) = \sum\limits_{s'} T(s,\pi(s),s') \left[ R(s,\pi(s),s') + \gamma V^{\pi}_k(s') \right]$
      • Linear system solver : with the policy fixed, the Bellman equations are linear in the $V^{\pi}(s)$, so they can be solved exactly
    • Policy Improvement (Extraction) (one-step lookahead)

      Extract a better policy by doing a mini-expectimax (one-step lookahead) over the current values : $\pi_{new}(s) = \arg\max\limits_a \sum\limits_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^{\pi}(s') \right]$
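
As referenced above, a minimal sketch of value iteration and policy extraction. The `mdp` object with `states`, `actions(s)`, and `transitions(s, a)` yielding `(next_state, prob, reward)` triples is a hypothetical interface, and the discount is passed as `gamma`; none of these names come from the notes.

```python
def value_iteration(mdp, gamma=0.9, iterations=100):
    """Repeatedly apply the Bellman update V_{k+1}(s) = max_a sum_s' T [R + gamma V_k(s')]."""
    V = {s: 0.0 for s in mdp.states}                 # V_0(s) = 0
    for _ in range(iterations):
        V_new = {}
        for s in mdp.states:
            q_values = [
                sum(prob * (reward + gamma * V[s_next])
                    for s_next, prob, reward in mdp.transitions(s, a))
                for a in mdp.actions(s)
            ]
            V_new[s] = max(q_values) if q_values else 0.0   # terminal states keep 0
        V = V_new
    return V

def extract_policy(mdp, V, gamma=0.9):
    """One-step lookahead (policy improvement / extraction) from a value function."""
    policy = {}
    for s in mdp.states:
        actions = list(mdp.actions(s))
        if not actions:
            continue
        policy[s] = max(
            actions,
            key=lambda a: sum(prob * (reward + gamma * V[s_next])
                              for s_next, prob, reward in mdp.transitions(s, a)),
        )
    return policy
```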

Reinforcement Learning (T or R unknown)

  1. Direct Evaluation

    Compute values under a fixed policy $\pi$

    • Act according to $\pi$
    • Note rewards $R(s,a,s')$ when we experience $(s,a,s')$
    • Average samples

    Problems:

    • Each state is learned separately, so learning takes a long time
    • It wastes the information contained in the transitions between states (how states are connected)
  2. Temporal Difference Learning (Value learning)

    Compute values under a fixed policy $\pi$

    • Act according to $\pi$ to receive a sample ($a = \pi(s)$)

    • Update $V(s)$

      1. $\text{sample} = R(s,a,s') + \gamma V(s')$
      2. $V(s) = V(s) + \alpha (\text{sample} - V(s))$

    Problem:

    • We can't turn the learned values into a new policy : acting greedily would require a one-step lookahead over $T$ and $R$, which we don't have ; learning Q-values fixes this
  3. Q Learning

    Instead of computing $V(s)$, we compute $Q(s,a)$, the value of taking action $a$ in state $s$.

    • Act according to some policy (e.g. $\varepsilon$-greedy) to receive a sample $(s,a,s')$ ; Q-learning is off-policy, so it still converges to the optimal Q-values

    • Update $Q(s,a)$

      1. $\text{sample} = R(s,a,s') + \gamma \max\limits_{a'} Q(s',a')$
      2. $Q(s,a) = Q(s,a) + \alpha (\text{sample} - Q(s,a))$

    Problem:

    • We need to keep a table of all Q-values

    Solution:

    • We can describe a state (or q-state) with a vector of features and use a linear approximation $Q(s,a) = \sum\limits_i w_i f_i(s,a)$, updating weights instead of table entries : $w_i = w_i + \alpha (\text{sample} - Q(s,a)) f_i(s,a)$

      Often the feature-based policies that work well aren’t the ones that approximate V/Q best. Need to learn policies that maximize rewards, not the values that predict them.

  4. Exploration

    1. $\varepsilon$-greedy : act randomly with probability $\varepsilon$ and greedily otherwise ($\varepsilon$ is usually decreased over time) ; a Q-learning sketch with $\varepsilon$-greedy exploration follows this list

    2. Exploration functions

      Takes a value estimate u and a visit count n, and returns an optimistic utility

      $f(u, n) = u + \frac{k}{n}$

      • Regular Q-update : $\text{sample} = R(s,a,s') + \gamma \max\limits_{a'} Q(s',a')$
      • Modified Q-update : $\text{sample} = R(s,a,s') + \gamma \max\limits_{a'} f(Q(s',a'), N(s',a'))$
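
As referenced above, a minimal tabular Q-learning sketch with $\varepsilon$-greedy exploration. The `env` object with `reset()` and `step(s, a)` returning `(next_state, reward, done)` is a hypothetical interface for illustration.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning with epsilon-greedy exploration.
    Update: Q(s,a) <- Q(s,a) + alpha * (R + gamma * max_a' Q(s',a') - Q(s,a))."""
    Q = defaultdict(float)                      # Q[(s, a)], defaults to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, reward, done = env.step(s, a)
            # max over a' of Q(s', a'); terminal states contribute nothing
            best_next = 0.0 if done else max(Q[(s_next, a_)] for a_ in actions)
            sample = reward + gamma * best_next
            Q[(s, a)] += alpha * (sample - Q[(s, a)])
            s = s_next
    return Q
```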

Probability



Hidden Markov Model

X1 -> X2 -> X3 -> ...
|     |     |
E1    E2    E3

Forward Algorithm

Track the belief distribution $B(X_t) = P(X_t|e_{1:t})$ over time
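
In standard notation (a textbook statement of the recursion, not copied from these notes), each step alternates a passage-of-time update and an observation update:

$$B'(X_{t+1}) = \sum\limits_{x_t} P(X_{t+1}|x_t)\, B(x_t) \qquad \text{(passage of time)}$$

$$B(X_{t+1}) \propto P(e_{t+1}|X_{t+1})\, B'(X_{t+1}) \qquad \text{(observe, then renormalize)}$$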

Problem:

Viterbi Algorithm

For answering queries like the most likely explanation (the most probable sequence of hidden states given the evidence) :

Same passage-of-time and observation updates as the forward algorithm, but with max over $x_t$ instead of sum.

Particle Filtering

Use a set of particles (samples) to represent the probability distribution : elapse time by sampling each particle's transition, weight particles by the evidence likelihood, then resample in proportion to the weights.
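
A minimal sketch of one particle-filtering step under these assumptions: `transition_sample(x)` draws a successor state from $P(X_{t+1}|x_t)$ and `evidence_prob(e, x)` returns $P(e|x)$; both callables are hypothetical stand-ins for the HMM model.

```python
import random

def particle_filter_step(particles, evidence, transition_sample, evidence_prob):
    """One time step of particle filtering: elapse time, weight by evidence, resample."""
    # 1. Elapse time: move each particle by sampling the transition model
    moved = [transition_sample(x) for x in particles]
    # 2. Observe: weight each particle by the likelihood of the evidence
    weights = [evidence_prob(evidence, x) for x in moved]
    total = sum(weights)
    if total == 0:                      # all particles inconsistent with the evidence
        return [random.choice(moved) for _ in particles]   # degenerate fallback
    # 3. Resample: draw a new particle set in proportion to the weights
    return random.choices(moved, weights=weights, k=len(particles))
```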



Dynamic Bayes Nets




Bayes Nets


Definitions

Conditional Independence

Causal Chains

X -> Y -> Z

Common Cause

X <- Y -> Z

Common Effect

X -> Z <- Y

X -> P <- Y, with P -> ... -> Z (the triple is also activated when a descendant Z of the common effect P is observed)

General case

Are X and Y conditionally independent given evidence variables Z ?

Consider all (undirected) paths from X to Y. If no path is active, then X and Y are guaranteed to be conditionally independent given Z. If some path is active, independence is not guaranteed (they may still happen to be independent).

A path is active if each triple along it is active.

Topology limits distributions

Given some graph topology G, the graph structure guarantees certain (conditional) independences and only certain joint distributions can be encoded.

Inference

We want to know $P(Q|e_1,\cdots,e_k)$

where all variables $X_1,\cdots,X_n$ are divided into three types : the query variable $Q$, the evidence variables $E_1=e_1,\cdots,E_k=e_k$, and the hidden variables $H_1,\cdots,H_r$

So $P(Q|e_1,\cdots,e_k)=\sum\limits_{h_1,\cdots,h_r}P(Q,h_1,\cdots,h_r|e_1,\cdots,e_k)$

Example

R -> T -> L

We have $P(R), P(T|R), P(L|T)$ and we want to know $P(L)$
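
Spelling out the query by summing out the hidden variables $R$ and $T$ (a standard enumeration step, consistent with the notation above):

$$P(L) = \sum\limits_{r,t} P(r)\,P(t|r)\,P(L|t) = \sum\limits_{t} P(L|t) \sum\limits_{r} P(r)\,P(t|r)$$

The second form pushes the sum over $r$ inwards; reusing such partial sums is exactly what variable elimination does below.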

General Variable Elimination

$P(Q|E_1=e_1,\cdots,E_k=e_k)$

The elimination ordering can greatly affect the size of the largest factor. And for poly-trees (directed graph with no undirected cycles) we can always find an efficient ordering. For other graphs, we can try to choose a set of variables such that if removed only a polytree remains.

General Example


Given $P(B), P(E), P(A|B,E), P(j|A), P(m|A)$, we want to know $P(B|j,m)$

  1. Choose A
    • Join A

      $P(A|B,E), P(j|A), P(m|A) \rightarrow P(j,m,A|B,E)$

    • Eliminate A

      $P(j,m,A|B,E) \rightarrow P(j,m|B,E)$

  2. Choose E
    • Join E

      $P(E), P(j,m|B,E) \rightarrow P(j,m,E|B)$

    • Eliminate E

      $P(j,m,E|B) \rightarrow P(j,m|B)$

  3. Choose B
    • Join B

      $P(B), P(j,m|B) \rightarrow P(j,m,B)$

    • Normalize

      $P(j,m,B) \rightarrow P(B|j,m)$
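
A minimal sketch of the two factor operations used in the steps above (join, then eliminate), restricted to boolean variables; the `Factor` class and its dictionary-based table are my own illustrative representation, not from the notes.

```python
import itertools

class Factor:
    def __init__(self, variables, table):
        self.variables = variables    # tuple of variable names
        self.table = table            # dict: assignment tuple (aligned with variables) -> value

def join(f1, f2):
    """Pointwise product of two factors over the union of their variables."""
    variables = list(f1.variables) + [v for v in f2.variables if v not in f1.variables]
    table = {}
    for assignment in itertools.product([True, False], repeat=len(variables)):
        env = dict(zip(variables, assignment))
        p1 = f1.table[tuple(env[v] for v in f1.variables)]
        p2 = f2.table[tuple(env[v] for v in f2.variables)]
        table[assignment] = p1 * p2
    return Factor(tuple(variables), table)

def eliminate(f, var):
    """Sum out `var` from factor f."""
    keep = tuple(v for v in f.variables if v != var)
    idx = f.variables.index(var)
    table = {}
    for assignment, p in f.table.items():
        reduced = tuple(a for i, a in enumerate(assignment) if i != idx)
        table[reduced] = table.get(reduced, 0.0) + p
    return Factor(keep, table)

# For the alarm example: join the factors mentioning A, eliminate A,
# then repeat for E and B, and finally normalize P(j, m, B) into P(B | j, m).
```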

Sampling

Why sampling ?

Prior Sampling

Rejection Sampling

If there is evidence, we reject samples that are inconsistent with it.

If the evidence is unlikely, we reject lots of samples, which is wasteful.

Likelihood Weighting

Fix evidence variables and sample the rest

Evidence influences the choice of downstream variables, but not upstream ones : upstream variables are still sampled as in prior sampling, so they are not more likely to take values consistent with the evidence.

The weights can be very small, and the sum of the weights over all samples indicates how many "effective" samples were obtained, so we want the weights to be high.
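
A minimal likelihood-weighting sketch. It assumes the net is given in topological order as `(variable, parents, cpt)` entries, where `cpt` maps a tuple of parent values to $P(X{=}\text{true}\mid\text{parents})$; this boolean-only representation is an assumption made for illustration.

```python
import random

def likelihood_weighting_sample(net, evidence):
    """Draw one weighted sample: evidence variables are fixed, others are sampled
    from P(X | parents); the weight is the product of the evidence likelihoods."""
    sample, weight = {}, 1.0
    for var, parents, cpt in net:                      # net is in topological order
        p_true = cpt[tuple(sample[p] for p in parents)]
        if var in evidence:
            value = evidence[var]
            weight *= p_true if value else (1.0 - p_true)
        else:
            value = random.random() < p_true
        sample[var] = value
    return sample, weight

def query(net, evidence, query_var, n=10000):
    """Estimate P(query_var = true | evidence) from n weighted samples."""
    num = den = 0.0
    for _ in range(n):
        sample, w = likelihood_weighting_sample(net, evidence)
        den += w
        if sample[query_var]:
            num += w
    return num / den if den > 0 else float("nan")
```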

Gibbs Sampling

Fix the evidence variables, start from an arbitrary assignment of the non-evidence variables, and repeatedly resample one non-evidence variable at a time conditioned on the current values of all the others. Gibbs sampling is a special case of a more general family of methods called Markov chain Monte Carlo (MCMC); Metropolis-Hastings is one of the more famous MCMC methods.

Value of Information

MEU = Maximum Expected Utility

VPI = Value of Perfect Information

VPI Properties



POMDP

Partially Observable Markov Decision Process

The agent cannot directly observe the underlying state.



Naïve Bayes

Classification

The idea is that each feature depends only on the class, i.e. the features are conditionally independent given the class $Y$.

To use Naïve Bayes, we compute $P(Y|f_1,\cdots,f_n) \propto P(Y,f_1,\cdots,f_n) = P(Y)\prod\limits_i P(f_i|Y)$.

Naïve Bayes for Text

Feature $W_i$ is the word at position $i$ and we assume that all positions share the same conditional probability $P(W|Y)$

To estimate a conditional probability $P(x|y)$ from training counts, use relative frequencies with Laplace smoothing (strength $k$) to avoid zero probabilities : $P_{LAP,k}(x|y) = \frac{c(x,y)+k}{c(y)+k|X|}$ (with $k=0$ this is the maximum-likelihood estimate).
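
A minimal bag-of-words Naïve Bayes sketch with Laplace smoothing, scoring in log space to avoid underflow; the training-data format (a list of `(words, label)` pairs) is an illustrative assumption.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, k=1.0):
    """docs: list of (words, label). Returns log-priors, log-likelihoods, vocabulary."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)          # word_counts[label][word]
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)
    log_prior = {y: math.log(c / len(docs)) for y, c in label_counts.items()}
    log_lik = {}
    for y in label_counts:
        total = sum(word_counts[y].values())
        # Laplace smoothing: (count + k) / (total + k * |vocab|)
        log_lik[y] = {w: math.log((word_counts[y][w] + k) / (total + k * len(vocab)))
                      for w in vocab}
    return log_prior, log_lik, vocab

def classify(words, log_prior, log_lik, vocab):
    """argmax_y log P(y) + sum_i log P(w_i | y), ignoring out-of-vocabulary words."""
    def score(y):
        return log_prior[y] + sum(log_lik[y][w] for w in words if w in vocab)
    return max(log_prior, key=score)
```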



Perceptrons

Linear, Binary

Use $\text{sgn}(w^Tf(x))$ to classify an instance ; on a mistake, update $w = w + y^* f(x)$, where $y^* \in \{-1,+1\}$ is the true label

Linear, Multiclass

There is one weight vector $w_y$ per class, and we use $\arg\max\limits_y {w_y}^Tf(x)$ to classify an instance ; on a mistake, lower the wrongly predicted class $\hat{y}$ and raise the correct class $y^*$ : $w_{\hat{y}} = w_{\hat{y}} - f(x)$, $w_{y^*} = w_{y^*} + f(x)$
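
A minimal multiclass perceptron training sketch; representing features as dictionaries from feature names to values is an illustrative choice, not from the notes.

```python
from collections import defaultdict

def dot(w, f):
    return sum(w[k] * v for k, v in f.items())

def train_multiclass_perceptron(data, labels, epochs=10):
    """data: list of (features, true_label). One weight vector w_y per class."""
    weights = {y: defaultdict(float) for y in labels}
    for _ in range(epochs):
        for f, y_star in data:
            y_hat = max(labels, key=lambda y: dot(weights[y], f))   # current prediction
            if y_hat != y_star:                                     # mistake: update
                for k, v in f.items():
                    weights[y_star][k] += v                         # raise correct class
                    weights[y_hat][k] -= v                          # lower predicted class
    return weights
```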

MIRA (Margin Infused Relaxed Algorithm)

MIRA makes the minimum-size weight update that still fixes the current mistake, which helps prevent the overfitting caused by overly large perceptron updates.

On a mistake (true class $y^*$, predicted class $y$, old weights $w'$), try to minimize the change to the weights

$\min\limits_{w} \frac{1}{2}\sum\limits_{z} \|w_z - w'_z\|^2$

such that

$w_{y^*}^T f(x) \ge w_y^T f(x) + 1$

If we consider a change of the same form as the perceptron update, scaled by a step size $\gamma$ :

$w_{y^*} = w'_{y^*} + \gamma f(x), \qquad w_y = w'_y - \gamma f(x)$

the problem becomes minimizing $\gamma^2 \|f(x)\|^2$

such that

$\left(w'_{y^*} + \gamma f(x)\right)^T f(x) \ge \left(w'_y - \gamma f(x)\right)^T f(x) + 1$

and so, taking the constraint with equality,

$\gamma = \frac{\left(w'_y - w'_{y^*}\right)^T f(x) + 1}{2\, f(x)^T f(x)}$

In practice, it's bad to make updates that are too large, so we cap the maximum possible value of $\gamma$ with some constant $C$ :

$\gamma = \min\left(C,\ \frac{\left(w'_y - w'_{y^*}\right)^T f(x) + 1}{2\, f(x)^T f(x)}\right)$

SVM (Support Vector Machines)

Minimize the weight norm (equivalently, maximize the margin) :

$\min\limits_{w} \frac{1}{2}\|w\|^2$

such that

$\forall i,\ \forall y :\ w_{y_i^*}^T f(x_i) \ge w_y^T f(x_i) + 1$

The goal is to maximize the margin over all instances at once, unlike MIRA, which optimizes over one instance at a time and tries to minimize the size of the change.

This algorithm (with kernels) can be used for online learning via the primal/dual formulation.

Bordes, Antoine, et al. "Fast kernel classifiers with online and active learning." Journal of Machine Learning Research 6 (2005): 1579-1619.

Kernel Trick

For a perceptron, the weight is in fact a linear combination of the training features $f(x_i)$ : $w_y = \sum\limits_i \alpha_{i,y} f(x_i)$

So $w_y^T f(x) = \sum\limits_i \alpha_{i,y} f(x_i)^Tf(x) = \sum\limits_i \alpha_{i,y} K(x_i,x)$

Kernels : common examples include the linear kernel $K(x,x') = x^T x'$, the polynomial kernel $K(x,x') = (x^T x' + 1)^d$, and the Gaussian (RBF) kernel $K(x,x') = \exp(-\|x-x'\|^2 / 2\sigma^2)$.
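
A minimal kernelized (dual) perceptron sketch following the identity above: store the coefficients $\alpha_{i,y}$ instead of $w_y$ and score with kernel evaluations. The data layout and the example `rbf_kernel` are illustrative assumptions.

```python
import math

def train_kernel_perceptron(xs, ys, labels, kernel, epochs=10):
    """Dual multiclass perceptron: w_y = sum_i alpha[i][y] * f(x_i), so
    score(y, x) = sum_i alpha[i][y] * K(x_i, x)."""
    alpha = [{y: 0.0 for y in labels} for _ in xs]
    for _ in range(epochs):
        for j, (x, y_star) in enumerate(zip(xs, ys)):
            scores = {y: sum(alpha[i][y] * kernel(xs[i], x) for i in range(len(xs)))
                      for y in labels}
            y_hat = max(labels, key=scores.get)
            if y_hat != y_star:                 # mistake: adjust dual coefficients
                alpha[j][y_star] += 1.0         # equivalent to w_{y*} += f(x_j)
                alpha[j][y_hat] -= 1.0          # equivalent to w_{y_hat} -= f(x_j)
    return alpha

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel for feature vectors given as lists of floats."""
    return math.exp(-sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / (2 * sigma ** 2))
```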