Update week12.ipynb

mhjensen · mhjensen · commit 1aa30989acf4 · 2026-04-16T08:50:31.000+02:00
diff --git a/doc/pub/week12/ipynb/week12.ipynb b/doc/pub/week12/ipynb/week12.ipynb
@@ -1356,15 +1356,7 @@
     "editable": true
    },
    "source": [
-    "Other choices are possible, but they all correspond to multiplying\n",
-    "$A_{i\\rightarrow j}$ and $A_{j\\rightarrow i}$ by the same constant\n",
-    "smaller than unity. The penalty function method uses just such\n",
-    "a factor to compensate for $p_i$ that are evaluated stochastically\n",
-    "and are therefore noisy.\n",
-    "\n",
-    "Having chosen the acceptance probabilities, we have guaranteed that\n",
-    "if the  $p_i^{(n)}$ has equilibrated, that is if it is equal to $p_i$,\n",
-    "it will remain equilibrated."
+    "Other choices are possible, but they all correspond to multiplying\n$A_{i\\rightarrow j}$ and $A_{j\\rightarrow i}$ by the same constant\nsmaller than unity.\n\nHaving chosen the acceptance probabilities, we have guaranteed that\nif $p_i^{(n)}$ has equilibrated, that is, if it equals $p_i$,\nit will remain equilibrated.\n\nThe Metropolis algorithm gives us a correct recipe for sampling\nfrom any distribution we can evaluate pointwise up to normalisation,\nat the cost of potentially rejecting proposed moves. **Gibbs sampling** is a\nrejection-free alternative that exploits the conditional structure\nof the target distribution, and it is the algorithm that drives\ntraining of Boltzmann machines.\n"
    ]
   },
   {
@@ -1374,23 +1366,112 @@
     "editable": true
    },
    "source": [
-    "## Gibbs sampling\n",
-    "\n",
-    "An efficient way if performing the sampling is through the use of\n",
-    "Gibbs sampling. The latter uses the conditional probability instead of\n",
-    "the full probability as done in the Metropolis algorithm.\n",
-    "\n",
-    "Gibbs sampling is useful for sampling from high-dimensional\n",
-    "distributions where single-variable conditional distributions are\n",
-    "known.\n",
-    "\n",
-    "For example, say it is too expensive to sample from $p(x_0, x_1, x_2,\n",
-    "..., x_d)$. With Gibbs sampling, we initialize all variables to\n",
-    "arbitrary values. Then while taking each sample, we also iterate\n",
-    "through the dimensions and replace its value with a sample from the\n",
-    "univariate conditional distribution. For example we can update $x_1$\n",
-    "using $p(x_1 \\mid x_0, x_2, ..., x_d)$, which is easy to sample over\n",
-    "because it is  only one dimension."
+    "## Gibbs Sampling\n\nAn efficient way of performing the sampling is through the use of\nGibbs sampling. The latter uses the conditional probability instead of\nthe full probability as done in the Metropolis algorithm.\n\nGibbs sampling is the standard Monte Carlo engine behind\nRestricted Boltzmann Machine (RBM) training, where the gradient of the\nnegative log-likelihood requires the **negative phase**\n\n$$\n\\nabla_{\\boldsymbol{\\Theta}}\\log Z(\\boldsymbol{\\Theta})\n= \\mathbb{E}_{p(\\boldsymbol{x};\\boldsymbol{\\Theta})}\n  \\bigl[\\nabla_{\\boldsymbol{\\Theta}}\\log f(\\boldsymbol{x};\\boldsymbol{\\Theta})\\bigr].\n$$\n\nHere $p(\\boldsymbol{x};\\boldsymbol{\\Theta}) = f(\\boldsymbol{x};\\boldsymbol{\\Theta})/Z(\\boldsymbol{\\Theta})$\nis the model distribution. Because $Z$ is intractable, this expectation must be approximated by\nMonte Carlo samples from the model distribution.\n\nThe following cells develop Gibbs sampling systematically:\nthe mathematical setting, the update rule, invariance, detailed\nbalance, the Metropolis connection, the Ising model, and convergence.\nA concrete bivariate-Gaussian illustration follows the theory.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## The Sampling Problem\n\nSuppose we want to sample from a probability distribution\n$\\pi(x_1,\\dots,x_d)$ on a high-dimensional space.\n\nTypical situations include:\n- a Boltzmann distribution in statistical mechanics,\n- a posterior distribution in Bayesian inference,\n- a latent-variable model in machine learning (e.g. an RBM).\n\nIn many important cases $\\pi$ is known only up to normalisation and\ndirect sampling is difficult, but the **conditional distributions\nare tractable**. This is exactly the setting where Gibbs sampling\nbecomes useful.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Target Distribution and the Partition Function\n\nThe target distribution often has the form\n\n$$\n\\pi(\\boldsymbol{x}) = \\frac{1}{Z}\\,f(\\boldsymbol{x}),\n\\qquad\nZ = \\int f(\\boldsymbol{x})\\,d\\boldsymbol{x}\n\\;\\text{or}\\;\nZ = \\sum_{\\boldsymbol{x}} f(\\boldsymbol{x}).\n$$\n\nIn statistical physics this is the **Boltzmann distribution**:\n\n$$\n\\pi(\\boldsymbol{x}) = \\frac{1}{Z}\\,e^{-\\beta E(\\boldsymbol{x})},\n$$\n\nwhere $E(\\boldsymbol{x})$ is the energy, $\\beta = 1/(k_B T)$, and\n$Z$ is the partition function. The difficulty is that $Z$ is usually\ntoo expensive to compute explicitly. Gibbs sampling lets us draw samples\nfrom $\\pi$ **without ever computing $Z$**.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Why Conditional Distributions Help\n\nEven when $\\pi(x_1,\\dots,x_d)$ is hard to sample from directly,\nthe **conditional distributions**\n\n$$\n\\pi(x_i \\mid x_1,\\dots,x_{i-1},x_{i+1},\\dots,x_d)\n$$\n\nmay be easy to sample. The core idea of Gibbs sampling is:\n\n> Sample one coordinate at a time from its **exact** conditional\n> distribution, while keeping all other coordinates fixed.\n\nThis creates a Markov chain whose stationary distribution is\n$\\pi$, and no rejection step is ever needed.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Conditional Probability: Formal Definition\n\nFor two variables $X, Y$ the conditional distribution is\n\n$$\n\\pi(x\\mid y) = \\frac{\\pi(x,y)}{\\pi_Y(y)},\n\\qquad\n\\pi_Y(y) = \\int \\pi(x,y)\\,dx.\n$$\n\nIn $d$ dimensions, the conditional for coordinate $i$ given all\nother coordinates $\\boldsymbol{x}_{-i}$ is\n\n$$\n\\pi(x_i \\mid \\boldsymbol{x}_{-i})\n= \\frac{\\pi(x_i, \\boldsymbol{x}_{-i})}{\\int \\pi(x_i', \\boldsymbol{x}_{-i})\\,dx_i'}.\n$$\n\nNote that $Z$ cancels in this ratio, which is why Gibbs sampling\ndoes not require $Z$.\n"
+   ]
+  },
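The cancellation of $Z$ stated in the cell above can be checked numerically. The sketch below is an illustration only and is not part of the notebook: it assumes a small discrete target on a 3 x 4 grid with made-up weights `f`, and the helper name `conditional_x1` is hypothetical.

```python
import numpy as np

# Hypothetical unnormalised weights f(x1, x2) on a 3 x 4 grid (illustrative values).
rng = np.random.default_rng(0)
f = rng.uniform(0.1, 2.0, size=(3, 4))

def conditional_x1(weights, x2):
    """Return pi(x1 | x2) from joint weights; any global constant cancels."""
    column = weights[:, x2]
    return column / column.sum()

Z = f.sum()  # computed here only to make the comparison below
from_unnormalised = conditional_x1(f, x2=2)
from_normalised = conditional_x1(f / Z, x2=2)

# Both routes give the same conditional distribution.
print(np.allclose(from_unnormalised, from_normalised))
```

The conditional is a ratio of weights that share the same $\boldsymbol{x}_{-i}$, so dividing the whole table by $Z$ changes nothing.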
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## The Gibbs Sampling Update Rule\n\nLet the current state be\n$\\boldsymbol{x}^{(t)} = (x_1^{(t)},\\dots,x_d^{(t)})$.\n\nOne **full Gibbs sweep** updates coordinates sequentially:\n\n$$\nx_1^{(t+1)} \\sim \\pi(x_1 \\mid x_2^{(t)},\\dots,x_d^{(t)}),\n$$\n$$\nx_2^{(t+1)} \\sim \\pi(x_2 \\mid x_1^{(t+1)},x_3^{(t)},\\dots,x_d^{(t)}),\n$$\n$$\n\\vdots\n$$\n$$\nx_d^{(t+1)} \\sim \\pi(x_d \\mid x_1^{(t+1)},\\dots,x_{d-1}^{(t+1)}).\n$$\n\nEach coordinate is updated using the **most current** values of all\nother coordinates (systematic-scan). The random-scan variant chooses\na coordinate uniformly at random at each step.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Two-Variable Example\n\nFor a two-variable target $\\pi(x,y)$, one Gibbs iteration is:\n\n$$\nx^{(t+1)} \\sim \\pi(x \\mid y^{(t)}),\n\\qquad\ny^{(t+1)} \\sim \\pi(y \\mid x^{(t+1)}).\n$$\n\nThis captures the essential structure:\n- each update is **exact** conditional sampling,\n- the resulting process is **Markovian**,\n- the full joint distribution $\\pi$ is preserved.\n\nThe bivariate Gaussian implemented in the code cells below\nfollows exactly this two-variable scheme.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Invariance of the Target Distribution\n\n**Claim:** $\\pi$ is invariant under each single-coordinate Gibbs update.\n\n**Proof.** A single-coordinate update for variable $i$ leaves all\n$\\boldsymbol{x}_{-i}$ fixed and resamples $x_i$ from its conditional,\nso its transition kernel is\n$K_i(\\boldsymbol{x},\\,d\\boldsymbol{x}') = \\pi(dx'_i \\mid \\boldsymbol{x}_{-i})\\,\\delta_{\\boldsymbol{x}_{-i}}(d\\boldsymbol{x}'_{-i})$.\nUsing the factorisation\n$\\pi(d\\boldsymbol{x}) = \\pi(dx_i \\mid \\boldsymbol{x}_{-i})\\,\\pi_{-i}(d\\boldsymbol{x}_{-i})$:\n\n$$\n\\int \\pi(d\\boldsymbol{x})\\,K_i(\\boldsymbol{x},\\,d\\boldsymbol{x}')\n= \\underbrace{\\int \\pi(dx_i \\mid \\boldsymbol{x}'_{-i})}_{=\\;1}\n  \\cdot\\; \\pi(dx'_i \\mid \\boldsymbol{x}'_{-i})\\,\\pi_{-i}(d\\boldsymbol{x}'_{-i})\n= \\pi(d\\boldsymbol{x}').\n$$\n\nSince each coordinate update $K_i$ preserves $\\pi$, so does the\nfull sweep $K = K_d \\circ \\cdots \\circ K_1$. $\\square$\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Detailed Balance\n\nEach single-coordinate Gibbs update satisfies **detailed balance** with\nrespect to $\\pi$:\n\n$$\n\\pi(\\boldsymbol{x})\\,K_i(\\boldsymbol{x},\\boldsymbol{x}')\n= \\pi(\\boldsymbol{x}')\\,K_i(\\boldsymbol{x}',\\boldsymbol{x}).\n$$\n\nSince only coordinate $i$ changes, $\\boldsymbol{x}_{-i} = \\boldsymbol{x}'_{-i}$.\nBoth sides equal\n\n$$\n\\pi_{-i}(\\boldsymbol{x}_{-i})\n\\,\\pi(x_i \\mid \\boldsymbol{x}_{-i})\n\\,\\pi(x'_i \\mid \\boldsymbol{x}_{-i}),\n$$\n\nwhich is symmetric in $x_i \\leftrightarrow x'_i$.\nThe composed systematic-scan sweep still preserves $\\pi$, but it is in\ngeneral not reversible as a whole.\n\n**Physical meaning:** in equilibrium, the probability flux from\n$\\boldsymbol{x}$ to $\\boldsymbol{x}'$ exactly balances the reverse flux.\n"
+   ]
+  },
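Invariance and detailed balance for a single-coordinate update can be verified by brute force on a small discrete target. The following is a minimal sketch under stated assumptions, not code from the notebook: the joint table `pi` holds made-up values on a 3 x 4 grid, and `K1` is the explicit transition matrix of the update that resamples $x_1$ given $x_2$.

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2 = 3, 4
pi = rng.uniform(0.1, 1.0, size=(n1, n2))
pi /= pi.sum()                                   # joint pi(x1, x2)

# cond_x1[x1, x2] = pi(x1 | x2)
cond_x1 = pi / pi.sum(axis=0, keepdims=True)

# K1[(x1, x2), (x1', x2')] = pi(x1' | x2) if x2' == x2, else 0
K1 = np.zeros((n1, n2, n1, n2))
for x2 in range(n2):
    K1[:, x2, :, x2] = cond_x1[:, x2][None, :]

K = K1.reshape(n1 * n2, n1 * n2)                 # flatten states to one index
p = pi.reshape(-1)

rows_sum_to_one = np.allclose(K.sum(axis=1), 1.0)
invariant = np.allclose(p @ K, p)                # pi K = pi
flux = p[:, None] * K                            # pi(x) K(x, x')
reversible = np.allclose(flux, flux.T)           # detailed balance
print(rows_sum_to_one, invariant, reversible)
```

The same construction for the update of $x_2$ gives a second kernel; multiplying the two matrices yields the full sweep, which can be checked for invariance in the same way.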
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Gibbs Sampling as a Special Case of Metropolis-Hastings\n\nRecall the Metropolis-Hastings acceptance probability:\n\n$$\n\\alpha(\\boldsymbol{x},\\boldsymbol{x}')\n= \\min\\!\\left(1,\\,\n  \\frac{\\pi(\\boldsymbol{x}')\\,q(\\boldsymbol{x}\\mid\\boldsymbol{x}')}{\\pi(\\boldsymbol{x})\\,q(\\boldsymbol{x}'\\mid\\boldsymbol{x})}\n  \\right).\n$$\n\nIn Gibbs sampling the **proposal is the conditional distribution itself**:\n$q(x'_i \\mid \\boldsymbol{x}) = \\pi(x'_i \\mid \\boldsymbol{x}_{-i})$.\nSubstituting, the acceptance ratio becomes exactly 1, so **every\nproposed move is accepted**.\n\nGibbs sampling is thus a **rejection-free** special case of\nMetropolis-Hastings. An acceptance rate of one does not by itself\nguarantee fast mixing, because the individual moves can still be small.\n"
+   ]
+  },
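The substitution mentioned in the cell above can be written out explicitly. This is a worked step added for illustration, using the factorisation $\pi(\boldsymbol{x}) = \pi(x_i \mid \boldsymbol{x}_{-i})\,\pi_{-i}(\boldsymbol{x}_{-i})$ and the fact that a Gibbs move leaves $\boldsymbol{x}'_{-i} = \boldsymbol{x}_{-i}$:

```latex
\frac{\pi(\boldsymbol{x}')\,q(\boldsymbol{x}\mid\boldsymbol{x}')}
     {\pi(\boldsymbol{x})\,q(\boldsymbol{x}'\mid\boldsymbol{x})}
= \frac{\pi(x'_i \mid \boldsymbol{x}_{-i})\,\pi_{-i}(\boldsymbol{x}_{-i})
        \;\pi(x_i \mid \boldsymbol{x}_{-i})}
       {\pi(x_i \mid \boldsymbol{x}_{-i})\,\pi_{-i}(\boldsymbol{x}_{-i})
        \;\pi(x'_i \mid \boldsymbol{x}_{-i})}
= 1 .
```

Every factor in the numerator reappears in the denominator, so the minimum in the acceptance probability is always attained at 1.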
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Connection to the Boltzmann Distribution\n\nFor the Boltzmann distribution\n$\\pi(\\boldsymbol{x}) = \\frac{1}{Z}e^{-\\beta E(\\boldsymbol{x})}$,\nthe conditional for coordinate $x_i$ is\n\n$$\n\\pi(x_i \\mid \\boldsymbol{x}_{-i})\n= \\frac{e^{-\\beta E(x_i,\\,\\boldsymbol{x}_{-i})}}{\\displaystyle\\sum_{x_i'} e^{-\\beta E(x_i',\\,\\boldsymbol{x}_{-i})}}.\n$$\n\nThis is the **local Boltzmann law**: update one degree of freedom in\nthe frozen field of its neighbours. The partition function $Z$ cancels.\n\nThis is precisely the structure exploited in RBM training: the\nconditionals\n$p(h_j=1\\mid\\boldsymbol{x}) = \\sigma(b_j + \\boldsymbol{x}^T\\boldsymbol{w}_{*j})$\nare Gibbs updates on the joint Boltzmann distribution of the RBM.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Physical Example: The Ising Model\n\nFor Ising spins $s_i \\in \\{-1,+1\\}$ with energy\n$E(\\boldsymbol{s}) = -\\sum_{\\langle i,j\\rangle} J_{ij}\\,s_i s_j - \\sum_i h_i\\,s_i$,\nonly terms involving spin $i$ matter for the conditional.\n\nDefining the **local field** $H_i^{\\mathrm{loc}} = \\sum_{j \\sim i} J_{ij}\\,s_j + h_i$:\n\n$$\n\\mathbb{P}(s_i = +1 \\mid \\boldsymbol{s}_{-i})\n= \\frac{1}{1 + e^{-2\\beta H_i^{\\mathrm{loc}}}},\n\\qquad\n\\mathbb{P}(s_i = -1 \\mid \\boldsymbol{s}_{-i})\n= \\frac{1}{1 + e^{+2\\beta H_i^{\\mathrm{loc}}}}.\n$$\n\nThis **sigmoid (logistic) form** is directly analogous to the RBM\nhidden-unit conditional\n$p(h_j=1\\mid\\boldsymbol{x}) = \\sigma(b_j + \\boldsymbol{x}^T\\boldsymbol{w}_{*j})$.\n"
+   ]
+  },
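The Ising conditional above translates directly into a heat-bath (Gibbs) sweep. The sketch below is illustrative and not taken from the notebook. It assumes a square lattice with periodic boundaries, nearest-neighbour couplings, and a uniform coupling `J` and field `h` in place of the general $J_{ij}$ and $h_i$; the function name `gibbs_sweep_ising` and all parameter values are hypothetical.

```python
import numpy as np

def gibbs_sweep_ising(spins, beta, J=1.0, h=0.0, rng=None):
    """One systematic-scan Gibbs sweep over an L x L periodic lattice (in place)."""
    rng = np.random.default_rng() if rng is None else rng
    L = spins.shape[0]
    for i in range(L):
        for j in range(L):
            # Sum over the four nearest neighbours with periodic wrap-around.
            neighbours = (spins[(i + 1) % L, j] + spins[(i - 1) % L, j]
                          + spins[i, (j + 1) % L] + spins[i, (j - 1) % L])
            local_field = J * neighbours + h
            # P(s_ij = +1 | rest) = 1 / (1 + exp(-2 beta H_loc))
            p_up = 1.0 / (1.0 + np.exp(-2.0 * beta * local_field))
            spins[i, j] = 1 if rng.random() < p_up else -1
    return spins

rng = np.random.default_rng(0)
spins = rng.choice(np.array([-1, 1]), size=(16, 16))   # random initial state
for _ in range(50):
    gibbs_sweep_ising(spins, beta=0.6, rng=rng)
magnetisation = spins.mean()
print(magnetisation)
```

No energy difference is accepted or rejected here: each spin is simply redrawn from its exact conditional given the current neighbours.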
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Continuous Example: Exact Conditionals of the Bivariate Gaussian\n\nLet $(X,Y)$ have the bivariate Gaussian distribution with means\n$(\\mu_X,\\mu_Y)$, variances $(\\sigma_X^2,\\sigma_Y^2)$, and\ncorrelation $\\rho$. The exact conditionals are:\n\n$$\nX\\mid Y=y \\;\\sim\\;\n\\mathcal{N}\\!\\left(\n\\mu_X + \\rho\\frac{\\sigma_X}{\\sigma_Y}(y-\\mu_Y),\\;\n\\sigma_X^2(1-\\rho^2)\n\\right),\n$$\n$$\nY\\mid X=x \\;\\sim\\;\n\\mathcal{N}\\!\\left(\n\\mu_Y + \\rho\\frac{\\sigma_Y}{\\sigma_X}(x-\\mu_X),\\;\n\\sigma_Y^2(1-\\rho^2)\n\\right).\n$$\n\nGibbs sampling alternates exact draws from these two Gaussians.\nThe code below uses $\\mu_X=\\mu_Y=0$, $\\sigma_X=\\sigma_Y=1$, $\\rho=0.5$.\n"
+   ]
+  },
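For the standardised case named in the cell above ($\mu_X=\mu_Y=0$, $\sigma_X=\sigma_Y=1$, $\rho=0.5$) the two conditionals reduce to $\mathcal{N}(\rho y,\,1-\rho^2)$ and $\mathcal{N}(\rho x,\,1-\rho^2)$. The following sketch alternates between them. It is written independently of the notebook's own code cells, which this diff does not show; the function name, seed, chain length, and burn-in are assumptions chosen for illustration.

```python
import numpy as np

def gibbs_bivariate_gaussian(n_samples, rho=0.5, x0=0.0, y0=0.0, seed=0):
    """Gibbs chain for a zero-mean, unit-variance bivariate Gaussian."""
    rng = np.random.default_rng(seed)
    cond_std = np.sqrt(1.0 - rho**2)      # std of both conditionals
    samples = np.empty((n_samples, 2))
    x, y = x0, y0
    for t in range(n_samples):
        x = rng.normal(rho * y, cond_std)  # x ~ pi(x | y)
        y = rng.normal(rho * x, cond_std)  # y ~ pi(y | x), using the new x
        samples[t] = (x, y)
    return samples

samples = gibbs_bivariate_gaussian(20_000)
burn_in = 1_000                            # hypothetical burn-in length
kept = samples[burn_in:]
sample_mean = kept.mean(axis=0)
sample_corr = np.corrcoef(kept.T)[0, 1]
print(sample_mean, sample_corr)
```

The sample mean and correlation are ergodic averages along the chain, so they should approach $(0, 0)$ and $\rho$ as the chain grows.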
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Convergence\n\nInvariance alone does not guarantee convergence. The chain must also be:\n- **irreducible**: every state can be reached from any other state,\n- **aperiodic**: the chain does not cycle,\n- **positive recurrent**: expected return time to any state is finite.\n\nUnder these conditions:\n\n$$\n\\lim_{t\\to\\infty}\\mathbb{P}(\\boldsymbol{X}^{(t)}\\in A) = \\pi(A).\n$$\n\nThe **ergodic theorem** gives the practical estimate:\n\n$$\n\\frac{1}{N}\\sum_{t=1}^N f(\\boldsymbol{X}^{(t)})\n\\xrightarrow{N\\to\\infty}\n\\mathbb{E}_{\\pi}[f]\n\\quad\\text{almost surely.}\n$$\n\nThis is why Monte Carlo averages along a Gibbs chain converge\nto the true expectation under $\\pi$.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## When Gibbs Sampling Can Be Slow\n\nEven though the algorithm is simple, convergence may be slow if:\n- variables are **strongly correlated** (small effective steps),\n- the target distribution is **multimodal** (chain can get trapped),\n- there is **critical slowing down** near a phase transition.\n\nStandard remedies:\n- **Block Gibbs**: update groups of correlated variables together,\n- **Overrelaxation**: propose reflections past the conditional mean,\n- **Parallel tempering**: couple chains at different temperatures.\n\nIn RBM training, **Contrastive Divergence (CD-$k$)** truncates the\nGibbs chain to just $k$ steps, accepting the resulting bias for\nmuch faster parameter updates. This is developed in the next section.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Summary: Why Gibbs Sampling Works\n\nThe logic in six steps:\n1. Identify target $\\pi(x_1,\\dots,x_d)$.\n2. Find tractable conditionals $\\pi(x_i \\mid \\boldsymbol{x}_{-i})$,\n   computable even when $Z$ is not.\n3. Define a Markov chain by sequentially sampling from these conditionals.\n4. **Invariance**: each coordinate update preserves $\\pi$\n   ($Z$ cancels in every conditional).\n5. **Detailed balance**: each update is reversible w.r.t. $\\pi$.\n6. **Ergodicity**: under mild conditions the chain converges to $\\pi$.\n\n**Physical picture:** Gibbs sampling equilibrates one degree of\nfreedom at a time in the frozen field of its neighbours, the same\nlocal relaxation that brings a spin system to thermal equilibrium.\n\nThis is the theoretical foundation on which Contrastive Divergence\nand the entire training procedure for Boltzmann machines rest.\n"
    ]
   },
   {
@@ -1400,15 +1481,7 @@
     "editable": true
    },
    "source": [
-    "## Understanding Gibbs samoling\n",
-    "\n",
-    "This part is best seen with the jupyter-notebook.\n",
-    "These notes have been adapted from <https://www.inf.ed.ac.uk/teaching/courses/mlpr/2017/>\n",
-    "\n",
-    "To illustrate Gibbs\n",
-    "sampling we will sample several points and compare them with those\n",
-    "generated from a known distribution, in our case the well-known\n",
-    "two-dimensional Gaussian defined as"
+    "## Numerical Illustration: Bivariate Gaussian\n\nThese notes have been adapted from\n<https://www.inf.ed.ac.uk/teaching/courses/mlpr/2017/>\n\nAfter the mathematical development above, we illustrate\nGibbs sampling concretely on the bivariate Gaussian\n\n$$\np(a, b) = \\mathcal{N}\\!\\left(\n\\begin{bmatrix}a\\\\b\\end{bmatrix};\\,\n\\begin{bmatrix}0\\\\0\\end{bmatrix},\\,\n\\begin{bmatrix}1 & 0.5\\\\0.5 & 1\\end{bmatrix}\n\\right).\n$$\n\nWe compare samples drawn from the true joint distribution with\nsamples produced by alternating between the two univariate conditionals.\n"
    ]
   },
   {
@@ -1597,7 +1670,7 @@
     "editable": true
    },
    "source": [
-    "With this we cna set up the conditionals for this problem"
+    "With this we can set up the conditionals for this problem.\n"
    ]
   },
   {
@@ -1746,6 +1819,13 @@
     "plt.show()"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## From Gibbs Sampling to Contrastive Divergence\n\nThe bivariate Gaussian example above illustrates the core mechanics\nof Gibbs sampling. In the context of Restricted Boltzmann Machines\nthe same idea applies to the **joint Boltzmann distribution**\n\n$$\nP_{\\rm rbm}(\\boldsymbol{x},\\boldsymbol{h};\\boldsymbol{\\Theta})\n= \\frac{1}{Z(\\boldsymbol{\\Theta})}\\,\n  e^{\\,\\boldsymbol{a}^T\\boldsymbol{x} + \\boldsymbol{b}^T\\boldsymbol{h}\n     + \\boldsymbol{x}^T\\boldsymbol{W}\\boldsymbol{h}}.\n$$\n\nBecause of the bipartite structure (no within-layer connections),\nthe conditionals factorise completely:\n\n$$\np(\\boldsymbol{h}\\mid\\boldsymbol{x}) = \\prod_j p(h_j\\mid\\boldsymbol{x}),\n\\qquad\np(\\boldsymbol{x}\\mid\\boldsymbol{h}) = \\prod_i p(x_i\\mid\\boldsymbol{h}),\n$$\n\nso an entire layer can be sampled **in one parallel Gibbs step**.\nThis is what makes RBM training computationally tractable.\n\n**Contrastive Divergence (CD-$k$)** runs exactly $k$ alternating Gibbs\nsteps starting from a training example to approximate the negative phase.\nWith $\\mathcal{L}$ denoting the log-likelihood, the weight gradient is\napproximated as\n\n$$\n\\nabla_{w_{ij}}\\mathcal{L}\n\\approx \\langle x_i h_j\\rangle_{\\rm data}\n         - \\langle x_i h_j\\rangle_{k\\text{-step Gibbs}}.\n$$\n\nThe RBM theory and implementations follow in the next section.\n"
+   ]
+  },
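The CD-$k$ estimate in the cell above can be sketched in a few lines of NumPy. This is an assumption-laden illustration rather than the notebook's implementation: it takes both layers to be binary, so that $p(x_i=1\mid\boldsymbol{h}) = \sigma(a_i + \sum_j w_{ij}h_j)$ mirrors the hidden-unit conditional, and it uses hidden-unit probabilities instead of sampled states when forming the averages, which is a common variance-reduction choice. The function name `cd_k_weight_gradient` and the toy data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd_k_weight_gradient(x_data, W, a, b, k=1, rng=None):
    """CD-k estimate of <x_i h_j>_data - <x_i h_j>_{k-step Gibbs}.

    x_data: (n, n_visible) binary batch; W: (n_visible, n_hidden);
    a: (n_visible,) visible biases; b: (n_hidden,) hidden biases.
    """
    rng = np.random.default_rng() if rng is None else rng
    ph_data = sigmoid(b + x_data @ W)          # p(h_j = 1 | x) on the data
    x = x_data
    for _ in range(k):
        # Sample the whole hidden layer in parallel, then the visible layer.
        h = (rng.random(ph_data.shape) < sigmoid(b + x @ W)).astype(float)
        x = (rng.random(x_data.shape) < sigmoid(a + h @ W.T)).astype(float)
    ph_model = sigmoid(b + x @ W)              # p(h_j = 1 | x) after k steps
    n = x_data.shape[0]
    return (x_data.T @ ph_data - x.T @ ph_model) / n

rng = np.random.default_rng(0)
x_data = rng.integers(0, 2, size=(8, 6)).astype(float)   # toy binary batch
W = 0.01 * rng.standard_normal((6, 3))
a = np.zeros(6)
b = np.zeros(3)
grad_W = cd_k_weight_gradient(x_data, W, a, b, k=1, rng=rng)
print(grad_W.shape)
```

A gradient-ascent step on the log-likelihood would then add a small multiple of `grad_W` to `W`; the bias gradients follow the same pattern with $x_i$ and $h_j$ alone.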
   {
    "cell_type": "markdown",
    "id": "624b8b1b",
@@ -5751,4 +5831,4 @@
  },
  "nbformat": 4,
  "nbformat_minor": 5
-}
+}