Ivy Fan-Chiang - The Normal Distribution: A Derivation through Multi-Variable Calculus

    Posted 2023/08/29

    The normal distribution or Gaussian distribution (informally known as a bell curve) is the most widely used continuous probability distribution, with many applications in statistics and science. Its importance comes from the central limit theorem, which establishes that the distribution of a sum (or sample mean) of many independent and identically distributed random variables tends toward a normal distribution as the sample size grows. This property makes the distribution commonly used to represent random variables whose distributions are not known.

    As such, the distribution is commonly taught as a tool in many courses without a full explanation of its properties and derivation. For me, the distribution was recently used in my SYDE162 Human Factors in Design course for calculating the distribution of anthropometric dimensions using z-scores, but no explanation of what the distribution is or of its properties was presented in the course. However, the distribution and many of its properties are easy to understand, and the Herschel-Maxwell derivation can be followed with a basic understanding of multi-variable differentiation and integration.

    Basic Properties

    Continuous probability distributions like the normal distribution have a cumulative distribution function (CDF) and a probability density function (PDF). In this post I will be focusing on the probability density function, which is the “bell curve” function most people use; the cumulative distribution function can be found by integrating the probability density function we will be deriving.

    The probability density function describes how likely a random outcome is, with outcomes near the mean being more likely and outcomes further away from the mean being less likely. This creates a function that looks like a “bell curve”.

    An example of this behaviour is the sum of dice rolls. When rolling two dice, a sum of 7 is more likely than the extreme values of 2 and 12, and graphing the probabilities of each outcome from 2 to 12 gives a graph that vaguely resembles a normal distribution. If you increase the number of dice and redraw the graph, the distribution looks closer and closer to the normal “bell curve” as the number of dice grows. This tendency to approach the probability density function of the normal distribution as sample size increases is called the central limit theorem, and it can be seen in many applications, which is why this distribution and its probability density function are so important.
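    The dice example above is easy to verify computationally. This short Python sketch (my own illustration, not from the original post) enumerates every outcome of rolling $n$ fair dice and tabulates the probability of each sum:

    ```python
    from itertools import product

    def sum_distribution(num_dice, sides=6):
        """Probability of each possible sum when rolling num_dice fair dice."""
        counts = {}
        for roll in product(range(1, sides + 1), repeat=num_dice):
            s = sum(roll)
            counts[s] = counts.get(s, 0) + 1
        total = sides ** num_dice
        return {s: c / total for s, c in sorted(counts.items())}

    # Two dice: a sum of 7 is the most likely outcome, 2 and 12 the least.
    two = sum_distribution(2)
    print(two[7])  # 6/36 ≈ 0.1667
    print(two[2])  # 1/36 ≈ 0.0278
    ```

    Plotting `sum_distribution(n)` for growing `n` shows the histogram smoothing out into the familiar bell shape.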

    Probability density functions (in general for any probability distribution) have a few properties that are important:

    1. The probability density function is integrable over its domain (otherwise our distribution would not have a cumulative distribution function and property 4 would not work)

    2. The probability density function is non-negative everywhere in its domain, $p(x)\geq 0$ (the probability of an outcome is never negative)

    3. The area under the curve of the probability density function (the area between the curve and the $x$-axis) is $1$: $\int_{-\infty}^\infty p(x)\,dx = 1$

    4. The probability that a random value $X$ lies in the interval $[a,b]$ is given by the area under the curve over that interval, $P(X \in [a,b]) = \int_a^b p(x)\,dx$

    This last property is what probability density functions are most commonly used for: calculating the probability that a random variable takes a value within a certain range. The normal distribution specifically has a couple more properties that are useful and will be used in the derivation:

    1. The probability density function is symmetric around the point $x=\mu$, which is the mean of the distribution (and also the median and mode).
    2. The PDF is unimodal (has a single peak); this means that the first derivative (instantaneous rate of change) is positive at $x<\mu$, negative at $x>\mu$, and zero at $x=\mu$
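    The area and interval properties above can be checked numerically for any candidate density. Here is a rough Python sketch using the exponential distribution $p(x)=e^{-x}$ on $[0,\infty)$ as an arbitrary example of my own choosing:

    ```python
    import math

    def p(x):
        """Example PDF: exponential distribution with rate 1 (arbitrary choice)."""
        return math.exp(-x) if x >= 0 else 0.0

    def integrate(f, a, b, n=100_000):
        """Simple midpoint-rule numerical integration."""
        h = (b - a) / n
        return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

    # Area property: total area under the PDF is 1 (integrate far enough out).
    print(integrate(p, 0, 50))   # ≈ 1.0

    # Interval property: P(X in [1, 2]) is the area over that interval.
    print(integrate(p, 1, 2))    # ≈ e^-1 - e^-2 ≈ 0.2325
    ```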

    The Herschel-Maxwell Theorem and the Cartesian Dartboard Analogy

    Earlier, I explained that we will derive the normal probability density function by following the Herschel-Maxwell derivation. This derivation uses the Herschel-Maxwell theorem, which states that if the probability distribution of a vector in $\mathbb R^n$ is unchanged by (or independent of) rotation and the individual vector components are independently randomly distributed, then the components of the vector are all normally distributed in an identical manner. Following this theorem, we will derive the probability distribution in a basic two-dimensional case before reducing to one dimension, as this makes some of the later integration easier.

    A good analogy of this theorem and a two-dimensional normal distribution is throwing darts at the origin of a two-dimensional Cartesian plane. The darts are aimed at the origin, but random errors in the throw produce different results. Under this analogy, we can assume that small errors are more likely than large errors, errors of equal distance from the origin are equally likely (errors are independent of orientation/rotation), and errors in the two axes are independent and normally distributed.
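    The dartboard analogy can be simulated directly. Assuming (as the theorem concludes) independent normal errors on each axis, a quick Monte Carlo sketch in Python shows that small errors dominate and that the error distribution looks the same in every direction:

    ```python
    import math
    import random

    random.seed(0)
    # Simulate dart throws: independent normal errors on each axis
    # (this is the model the Herschel-Maxwell theorem characterizes).
    darts = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100_000)]

    # Small errors are more likely than large errors:
    near = sum(1 for x, y in darts if math.hypot(x, y) < 1)
    far = sum(1 for x, y in darts if math.hypot(x, y) > 2)
    print(near > far)  # True

    # Errors of equal distance are equally likely in every direction:
    # the angle is roughly uniform, so each quadrant gets about a quarter of the darts.
    quadrants = [0] * 4
    for x, y in darts:
        angle = math.atan2(y, x) % (2 * math.pi)
        quadrants[int(angle // (math.pi / 2))] += 1
    print(quadrants)  # four counts, each close to 25,000
    ```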

    Finding the Basic Shape of the Probability Density Function

    Under the assumptions of independent axes and independence from rotation, we can start with this equation, which equates probabilities in Cartesian coordinates to probabilities in polar coordinates:

    $$ g(r) = p(x)\,p(y) $$

    Here, $p(x)$ represents the probability density function on the $x$-axis, $p(y)$ represents the function on the $y$-axis, and $g(r)$ represents the probability density with respect to distance from the origin (or radius). We can then partially differentiate the equation with respect to $\theta$, or rotation around the origin, to get the following:

    $$ 0 = p(x)\frac{dp(y)}{d\theta} + p(y)\frac{dp(x)}{d\theta} $$

    Observe that the probability with respect to radius, $g(r)$, is independent of orientation and as such differentiates to $0$, while the product rule was applied on the right side. We can then substitute $x=r\cos\theta$ and $y=r\sin\theta$ to convert Cartesian coordinates into polar coordinates on the right side and then separate the variables:

    $$
    \begin{align*}
    0 &= p(x)\frac{dp(r\sin\theta)}{d\theta} + p(y)\frac{dp(r\cos\theta)}{d\theta} \\
    0 &= p(x)p'(y)(r\cos\theta) + p(y)p'(x)(-r\sin\theta) \\
    p(y)\,p'(x)\,(r\sin\theta) &= p(x)\,p'(y)\,(r\cos\theta) \\
    y\,p(y)\,p'(x) &= x\,p(x)\,p'(y) \\
    \frac{p'(x)}{x\,p(x)} &= \frac{p'(y)}{y\,p(y)}
    \end{align*}
    $$

    For this equation to hold for any independent $x$ and $y$ values, each side must equal the same constant, since the left side depends only on $x$ and the right side only on $y$. This unknown constant can be written as $C$:

    $$ \frac{p'(x)}{x\,p(x)} = \frac{p'(y)}{y\,p(y)} = C $$

    We can then use this equation to solve for the basic form of the normal distribution’s probability density function:

    $$
    \begin{align*}
    \frac{p'(x)}{x\,p(x)} &= C \\
    \frac{p'(x)}{p(x)} &= Cx \\
    \int \frac{p'(x)}{p(x)}\,dx &= \int Cx\,dx \\
    \ln(p(x)) &= \frac{Cx^2}{2}+c \\
    p(x) &= e^{\frac{Cx^2}{2}+c} \\
    p(x) &= Ae^{\frac{C}{2}x^2}
    \end{align*}
    $$

    However, there are two conditions that need to be met for this to be a valid probability density function. First, we are assuming large deviations from the mean (or large errors in the dartboard analogy) are less likely than small errors, and the area under the curve must converge to $1$. This condition makes the unknown constant $C$ negative; if it were positive, large deviations would be more likely and the area would diverge to $\infty$. The second condition is that the probability density function must be non-negative; this holds because the coefficient $A=e^c$ is positive and the exponential function it scales is always positive. Writing $C=-k$ with $k>0$, we can rewrite the basic form of the function into the following:

    $$ p(x) = Ae^{-\frac{k}{2}x^2}, \qquad A, k > 0 $$
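    As a sanity check on this form, we can verify numerically that $p(x)=Ae^{-\frac{k}{2}x^2}$ satisfies the separated differential equation $\frac{p'(x)}{x\,p(x)}=C$ with $C=-k$. A quick Python sketch with arbitrarily chosen constants (my illustration, not part of the derivation):

    ```python
    import math

    A, k = 0.5, 2.0  # arbitrary positive constants for illustration

    def p(x):
        """Candidate solution p(x) = A * e^(-k/2 * x^2)."""
        return A * math.exp(-k / 2 * x * x)

    def dp(x, h=1e-6):
        """Central-difference numerical derivative of p."""
        return (p(x + h) - p(x - h)) / (2 * h)

    # p'(x) / (x p(x)) should equal the constant C = -k at every x != 0:
    for x in (0.3, 1.0, 2.5):
        print(round(dp(x) / (x * p(x)), 4))  # -2.0 each time
    ```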

    Finding the Coefficient $A$

    The value of $A$ in this basic function form can be found by using the area property of probability density functions (the area under the curve is equal to $1$):

    $$
    \begin{align*}
    \int_{-\infty}^\infty Ae^{-\frac{k}{2}x^2}\,dx &= 1 \\
    A\int_{-\infty}^\infty e^{-\frac{k}{2}x^2}\,dx &= 1 \\
    \int_{-\infty}^\infty e^{-\frac{k}{2}x^2}\,dx &= \frac{1}{A}
    \end{align*}
    $$

    The function in its current form is even (symmetric about the $y$-axis), so the integral can be rewritten like this:

    $$
    \begin{align*}
    2\int_0^\infty e^{-\frac{k}{2}x^2}\,dx &= \frac{1}{A} \\
    \int_0^\infty e^{-\frac{k}{2}x^2}\,dx &= \frac{1}{2A}
    \end{align*}
    $$

    It is difficult to integrate a Gaussian function ($e^{-x^2}$) directly, so we will bring back the second dimension from earlier (which looks identical due to symmetry) to make integration easier by rewriting the equation as a double integral:

    $$
    \begin{align*}
    \left(\int_0^\infty e^{-\frac{k}{2}x^2}\,dx\right)\left(\int_0^\infty e^{-\frac{k}{2}y^2}\,dy\right) &= \left(\frac{1}{2A}\right)\left(\frac{1}{2A}\right) \\
    \int_0^\infty \int_0^\infty e^{-\frac{k}{2}x^2}\cdot e^{-\frac{k}{2}y^2}\,dy\,dx &= \frac{1}{4A^2} \\
    \int_0^\infty \int_0^\infty e^{-\frac{k}{2}(x^2+y^2)}\,dy\,dx &= \frac{1}{4A^2}
    \end{align*}
    $$

    The integrand of this double integral depends only on the distance from the origin, so we can convert it from Cartesian coordinates to polar coordinates. Note the appearance of $r$ as the Jacobian determinant from this change of coordinates. Following this change of coordinates, we can use standard integration techniques to solve for $A$:

    $$
    \begin{align*}
    \int_0^{\frac{\pi}{2}} \int_0^\infty e^{-\frac{k}{2}r^2}r\,dr\,d\theta &= \frac{1}{4A^2} \\
    \int_0^{\frac{\pi}{2}} d\theta \cdot \int_0^\infty e^{-\frac{k}{2}r^2}r\,dr &= \frac{1}{4A^2} \\
    \Big[\theta\Big]^{\theta=\frac{\pi}{2}}_{\theta=0} \cdot \int_0^\infty e^{-\frac{k}{2}r^2}r\,dr &= \frac{1}{4A^2} \\
    \frac{\pi}{2}\cdot\int_0^\infty e^{-\frac{k}{2}r^2}r\,dr &= \frac{1}{4A^2} \\
    \int_0^\infty e^{-\frac{k}{2}r^2}r\,dr &= \frac{1}{2\pi A^2}
    \end{align*}
    $$

    This single improper integral can be evaluated with a $u$-substitution of $u=-\frac{k}{2}r^2$ ($du = -kr\,dr$), allowing the rest of the equation to be solved:

    $$
    \begin{align*}
    -\frac{1}{k}\int_0^{-\infty} e^{u}\,du &= \frac{1}{2\pi A^2} \qquad\text{($u$-substitution)} \\
    -\int_0^{-\infty} e^{u}\,du &= \frac{k}{2\pi A^2} \\
    -\lim_{t\rightarrow-\infty}\int_0^t e^{u}\,du &= \frac{k}{2\pi A^2} \qquad\text{(rewriting the improper integral as a limit)} \\
    -\lim_{t\rightarrow-\infty}\Big[e^u\Big]^{u=t}_{u=0} &= \frac{k}{2\pi A^2} \\
    -\lim_{t\rightarrow-\infty}\left(e^t-e^0\right) &= \frac{k}{2\pi A^2} \\
    -\lim_{t\rightarrow\infty}\left(\frac{1}{e^t}-1\right) &= \frac{k}{2\pi A^2} \\
    -\left(0-1\right) &= \frac{k}{2\pi A^2} \\
    1 &= \frac{k}{2\pi A^2} \\
    A^2 &= \frac{k}{2\pi} \\
    A &= \sqrt{\frac{k}{2\pi}}
    \end{align*}
    $$

    Note that $A\neq-\sqrt{\frac{k}{2\pi}}$ due to the condition that $A$ must be positive that we deduced above. This gives us a probability density function of the form $p(x)=\sqrt{\frac{k}{2\pi}}\,e^{-\frac{k}{2}x^2}$, with $k$ being the last value we need to solve for.
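    We can sanity-check this value of $A$ numerically: for any $k>0$, the function $\sqrt{\frac{k}{2\pi}}\,e^{-\frac{k}{2}x^2}$ should integrate to $1$. A rough midpoint-rule check in Python (my illustration):

    ```python
    import math

    def area(k, n=200_000, lim=20.0):
        """Midpoint-rule integral of sqrt(k/(2*pi)) * e^(-k/2 * x^2) over [-lim, lim]."""
        A = math.sqrt(k / (2 * math.pi))
        h = 2 * lim / n
        return sum(
            A * math.exp(-k / 2 * (-lim + (i + 0.5) * h) ** 2) for i in range(n)
        ) * h

    for k in (0.5, 1.0, 4.0):
        print(round(area(k), 5))  # 1.0 for every k > 0
    ```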

    Finding the Value of $k$

    To solve for the value of $k$, we have to bring in the variance of the probability distribution. The variance is the square of the standard deviation and is defined as $\sigma^2 = \int_{-\infty}^{\infty}(x-\mu)^2\,p(x)\,dx$. The mean of the distribution ($\mu$) is defined by the integral $\int_{-\infty}^\infty x\,p(x)\,dx$. Since the probability density function we found above is even, the integrand $x\,p(x)$ is odd and this integral evaluates to $0$, so our mean in this case is $0$. We can substitute this value and our probability density function into the integral for variance to solve for $k$:

    $$
    \begin{align*}
    \int_{-\infty}^{\infty}(x-\mu)^2\,p(x)\,dx &= \sigma^2 \\
    \int_{-\infty}^{\infty}(x-0)^2\,p(x)\,dx &= \sigma^2 \\
    \int_{-\infty}^{\infty}x^2\cdot\sqrt{\frac{k}{2\pi}}\,e^{-\frac{k}{2}x^2}\,dx &= \sigma^2 \\
    2\sqrt{\frac{k}{2\pi}}\int_0^{\infty}x^2\cdot e^{-\frac{k}{2}x^2}\,dx &= \sigma^2 \qquad\text{(simplify even integral)} \\
    2\sqrt{\frac{k}{2\pi}}\int_0^{\infty}x\cdot xe^{-\frac{k}{2}x^2}\,dx &= \sigma^2
    \end{align*}
    $$

    This integral can then be solved with integration by parts, with $u=x$ and $dv=xe^{-\frac{k}{2}x^2}\,dx$:

    $$
    \begin{align*}
    2\sqrt{\frac{k}{2\pi}}\left(\lim_{t\rightarrow\infty}\left[x\cdot\frac{-1}{k}\,e^{-\frac{k}{2}x^2}\right]_{x=0}^{x=t}-\int_0^{\infty}\frac{-1}{k}e^{-\frac{k}{2}x^2}\,dx\right) &= \sigma^2 \\
    2\sqrt{\frac{k}{2\pi}}\left(\frac{-1}{k}\lim_{t\rightarrow\infty}\left[\frac{x}{e^{\frac{k}{2}x^2}}\right]_{x=0}^{x=t}+\frac{1}{k}\int_0^{\infty}e^{-\frac{k}{2}x^2}\,dx\right) &= \sigma^2 \\
    2\sqrt{\frac{k}{2\pi}}\left(\frac{-1}{k}\lim_{t\rightarrow\infty}\left(\frac{t}{e^{\frac{k}{2}t^2}}-\frac{0}{e^{\frac{k}{2}0^2}}\right) + \frac{1}{k}\int_0^{\infty}e^{-\frac{k}{2}x^2}\,dx\right) &= \sigma^2 \\
    2\sqrt{\frac{k}{2\pi}}\left(\frac{-1}{k}\lim_{t\rightarrow\infty}\left(\frac{1}{e^{\frac{k}{2}t^2}\cdot{k}t}\right) + \frac{1}{k}\int_0^{\infty}e^{-\frac{k}{2}x^2}\,dx\right) &= \sigma^2 \qquad\text{(L'Hôpital's rule)} \\
    2\sqrt{\frac{k}{2\pi}}\cdot\frac{1}{k}\int_0^{\infty}e^{-\frac{k}{2}x^2}\,dx &= \sigma^2
    \end{align*}
    $$

    We have already evaluated this final integral above when finding the value of $A$:

    $$
    \begin{align*}
    2\sqrt{\frac{k}{2\pi}}\cdot\frac{1}{2k}\int_{-\infty}^{\infty}e^{-\frac{k}{2}x^2}\,dx &= \sigma^2 \qquad(\mathbb R^+\rightarrow\mathbb R) \\
    2\sqrt{\frac{k}{2\pi}}\cdot\frac{1}{2k}\cdot\frac{1}{A} &= \sigma^2 \\
    2\frac{\sqrt{k}}{\sqrt{2\pi}}\cdot\frac{1}{2k}\cdot\frac{\sqrt{2\pi}}{\sqrt{k}} &= \sigma^2 \\
    \frac{1}{k} &= \sigma^2 \\
    k &= \frac{1}{\sigma^2}
    \end{align*}
    $$

    The Normal Probability Density Function

    We can plug in this value of $k$ to find the normal probability density function:

    $$
    \begin{align*}
    p(x) &= \sqrt{\frac{k}{2\pi}}\,e^{-\frac{k}{2}x^2} \\
    p(x) &= \sqrt{\frac{1}{2\pi\sigma^2}}\,e^{-\frac{1}{2\sigma^2}x^2} \\
    p(x) &= \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{1}{2}\left(\frac{x}{\sigma}\right)^2}
    \end{align*}
    $$

    This form of the normal probability density function has a single constant $\sigma$, the desired standard deviation of the distribution, and has a mean ($\mu$) of $0$. However, most applications of this distribution don’t have a mean of $0$, so we can horizontally translate the function so that the peak of the “bell curve” lies on the desired mean:

    $$ p(x)=\frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} $$
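    As a final check, we can numerically recover the mean and variance implied by this density and confirm they equal $\mu$ and $\sigma^2$. A Python sketch with arbitrarily chosen parameters (my illustration):

    ```python
    import math

    def pdf(x, mu, sigma):
        """The derived normal probability density function."""
        return (1 / (sigma * math.sqrt(2 * math.pi))) * math.exp(
            -0.5 * ((x - mu) / sigma) ** 2
        )

    def moments(mu, sigma, n=200_000, span=12.0):
        """Recover the mean and variance from the PDF by midpoint-rule integration."""
        a, b = mu - span * sigma, mu + span * sigma
        h = (b - a) / n
        xs = [a + (i + 0.5) * h for i in range(n)]
        mean = sum(x * pdf(x, mu, sigma) for x in xs) * h
        var = sum((x - mean) ** 2 * pdf(x, mu, sigma) for x in xs) * h
        return mean, var

    mean, var = moments(mu=3.0, sigma=1.5)
    print(round(mean, 4), round(var, 4))  # 3.0 and 2.25 (= σ²)
    ```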

    Personal Thoughts

    I first worked on a presentation about this derivation of the normal distribution with a group of friends back in my Grade 12 Data Management course in high school as a final project. That presentation and this post were loosely based on a dartboard-style derivation written by Dan Teague, which I did not fully understand back in high school because some of it was beyond my limited high-school calculus knowledge. After finishing my Calculus 1 and 2 courses (SYDE111 & SYDE112) and seeing more applications of the distribution during my studies at the University of Waterloo, I was inspired to revisit the derivation and hopefully write it out in a more comprehensive form than the original.

    I am not a mathematician or educator, so there may be errors in the post above or potential improvements I can make for clarity. If you spot any errors or improvements that can be made, feel free to send me an email!

    3Blue1Brown also recently covered the central limit theorem in a series of videos that I found very interesting, which demonstrate why the normal probability distribution is so useful.

    I haven’t posted anything to my blog in a while so hopefully this marks the beginning of more technical writing posts related to math, programming, and information security!