Review of Basic Probability

JC Mathematics, University Mathematics

So I understand that I lost many readers with the Sampling uploads. It is a bit difficult, given the intensive use of notation and the need for some statistics background. So here I'll review a bit of basic probability. The contents here will be basic, and will involve some H2 Mathematics Statistics too.

Discrete Random Variables

Continuous Random Variables

Conditional Expectations and Variances

Multivariate Distributions

Multivariate Normal Distribution

The following are in relation to Finance.

Introduction to Martingales

Introduction to Brownian Motion

Geometric Brownian Motion

Introduction to Stochastic Calculus

Introduction to Martingales


This is a rather important topic for anyone interested in doing Finance.
Let's look at the definition first.
A martingale is a random process \{X_n : 0 \le n < \infty \} with respect to the information filtration F_n and the probability distribution P if
1. \mathbb{E}^P [|X_n|] < \infty for all n \ge 0
2. \mathbb{E}^P [X_{n+m} | F_n] = X_n for all n, m \ge 0

Martingales are used widely; one example is to model fair games, so they have a rich history in the modelling of gambling problems. If you google "martingale", you will get an image related to a horse, because the term started with horse-betting.

We define a submartingale by replacing condition 2 above with
\mathbb{E}^P [X_{n+m} | F_n] \ge X_n for all n, m \ge 0
and a supermartingale with
\mathbb{E}^P [X_{n+m} | F_n] \le X_n for all n, m \ge 0.
Take note that a martingale is both a submartingale and a supermartingale. In layman's terms, a submartingale means the player expects more as time progresses, and vice versa for a supermartingale.

Let us try to construct a Martingale from a Random Walk now.
Let S_n := \sum_{i=1}^n X_i be a random walk where the X_i’s are IID with mean \mu.
Let M_n := S_n - n \mu. Then M_n is a martingale, writing \mathbb{E}_n [\cdot] for the conditional expectation \mathbb{E}[\cdot \, | F_n], because:
\mathbb{E}_n [M_{n+m}]
= \mathbb{E}_n [\sum_{i=1}^{n+m} X_i - (n+m) \mu]
= \mathbb{E}_n [\sum_{i=1}^{n+m} X_i] - (n+m) \mu since expectation distributes linearly
= \sum_{i=1}^n X_i + \mathbb{E}_n [\sum_{i=n+1}^{n+m} X_i] - (n+m) \mu since the first n terms are known at time n
= \sum_{i=1}^n X_i + m \mu - (n+m) \mu = M_n
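We can check this identity numerically: fix the first n steps (the information in F_n) and average M_{n+m} over many continuations of the walk. Here the X_i's are taken to be \mu plus a fair \pm 1 coin, purely as an illustrative choice of IID distribution:

```python
import random

random.seed(0)

def martingale_check(n=5, m=5, trials=200_000, mu=0.3):
    # X_i = mu + (fair +-1 coin), so the X_i are IID with mean mu
    step = lambda: mu + random.choice([-1, 1])
    # Fix one realisation of the first n steps: this plays the role of F_n
    S_n = sum(step() for _ in range(n))
    M_n = S_n - n * mu
    # Estimate E_n[M_{n+m}] by averaging over continuations of the walk
    total = 0.0
    for _ in range(trials):
        S_extra = sum(step() for _ in range(m))
        total += (S_n + S_extra) - (n + m) * mu
    return M_n, total / trials

M_n, estimate = martingale_check()
print(M_n, estimate)  # the estimate should be very close to M_n
```

The extra m steps average out to m\mu, which is exactly what cancels in the derivation above.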

So what will a martingale betting strategy look like?
Here, we let X_1, X_2, \ldots be IID random variables with P(X_i = 1) = P(X_i = -1) = \frac{1}{2}. We can imagine X_i to represent the result of a coin-flipping game where
– the player wins $1 if the coin comes up heads, that is, X_i = 1
– the player loses $1 if the coin comes up tails, that is, X_i = -1

Consider now a doubling strategy where we keep doubling the bet until we eventually win. Once we win, we stop, and our initial bet is $1.
The first thing we note is that the size of the bet on the n^{th} play is 2^{n-1}, assuming we are still playing at time n. Let W_n denote the total winnings after n coin tosses, with W_0 = 0. Then W_n is a martingale!

To see this, let us first show that W_n \in \{ 1, -2^n + 1 \} for all n \ge 1.
Suppose we win for the first time on the n^{th} bet. Then
W_n = -(1 + 2 + \ldots + 2^{n-2}) + 2^{n-1}
= -(2^{n-1} - 1) + 2^{n-1} = 1
If we have not yet won after n bets, then
W_n = -(1 + 2 + \ldots + 2^{n-1}) = -2^n + 1
Finally, to show W_n is a martingale, we just need to show \mathbb{E}[W_{n+1} | W_n] = W_n, which can be proved easily using iterated expectations.
For case 1, W_n = 1, then P(W_{n+1} = 1 | W_n = 1) = 1, so \mathbb{E}[W_{n+1} | W_n = 1] = 1 = W_n
For case 2, W_n = -2^n + 1, meaning we bet 2^n on the (n+1)^{th} toss, so W_{n+1} \in \{ 1, -2^{n+1} + 1 \}. Since
P(W_{n+1} = 1 | W_n = -2^n + 1) = \frac{1}{2}, and
P(W_{n+1} = -2^{n+1} + 1 | W_n = -2^n + 1) = \frac{1}{2},
\mathbb{E}[W_{n+1} | W_n = -2^n + 1] = \frac{1}{2} \times 1 + \frac{1}{2} \times (-2^{n+1} + 1) = -2^n + 1 = W_n
Thus, we have shown that \mathbb{E}[W_{n+1} | W_n] = W_n.
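The two cases above are easy to confirm by simulating the doubling strategy itself; a minimal sketch (the function name and the 10-toss cap are my own choices):

```python
import random

random.seed(1)

def doubling_strategy(max_tosses):
    """Bet on fair coin flips, starting at $1 and doubling after each loss;
    stop betting after the first win. Returns wealth W_n after each toss."""
    wealth, bet, won = 0, 1, False
    path = []
    for _ in range(max_tosses):
        if not won:
            if random.random() < 0.5:  # heads: win the current bet and stop
                wealth += bet
                won = True
            else:                      # tails: lose the bet and double it
                wealth -= bet
                bet *= 2
        path.append(wealth)
    return path

# W_n is always either 1 (already won) or -2**n + 1 (n straight losses)
for _ in range(1_000):
    for n, w in enumerate(doubling_strategy(10), start=1):
        assert w in (1, -2**n + 1)
```

Every simulated path lands in exactly the two-point set derived above, whichever toss the first win arrives on.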

To take what we have learnt a step further, let's look briefly at Polya's Urn.
Consider an urn which contains red balls and green balls. Initially there is just one green ball and one red ball in the urn.
At each time step, a ball is chosen randomly from the urn:
If the ball is red, then it is returned to the urn with an additional red ball.
If the ball is green, then it is returned to the urn with an additional green ball.
Let X_n denote the number of red balls in the urn after n draws. Then
P(X_{n+1} = k+1 | X_n = k) = \frac{k}{n + 2}
P(X_{n+1} = k | X_n = k) = \frac{n + 2 - k}{n + 2}
We can show that M_n := \frac{X_n}{n+2} is a martingale.
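A quick sanity check by simulation: since M_n is a martingale, \mathbb{E}[M_n] should equal M_0 = 1/2 for every n. This sketch estimates that expectation for a few values of n (function name and trial count are illustrative):

```python
import random

random.seed(2)

def polya_red_count(n_draws):
    # Start with 1 red and 1 green ball; add a ball of the drawn colour
    red, total = 1, 2
    for _ in range(n_draws):
        if random.random() < red / total:
            red += 1
        total += 1
    return red  # X_n: number of red balls after n_draws draws

# M_n = X_n / (n + 2) is a martingale, so E[M_n] = M_0 = 1/2 for every n
trials = 100_000
estimates = {n: sum(polya_red_count(n) / (n + 2) for _ in range(trials)) / trials
             for n in (1, 5, 20)}
print(estimates)  # each value should be close to 0.5
```

Even though individual urns drift towards mostly red or mostly green, the expected red fraction stays pinned at one half.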

Multivariate Distributions


Let X = (X_1 \ldots X_n)^T be an n-dimensional vector of random variables.
For all x = (x_1, \ldots, x_n) \in \mathbb{R}^n, the joint cumulative distribution function (CDF) of X is
F_X (x_1, \ldots, x_n) = P(X_1 \le x_1, \ldots, X_n \le x_n),
and the marginal CDF of X_i satisfies
F_{X_i}(x_i) = F_X (\infty, \ldots, \infty, x_i, \infty, \ldots, \infty)

It is straightforward to generalise this to joint marginal distributions; for example, the joint marginal distribution of X_i and X_j satisfies
F_{X_i, X_j}(x_i, x_j) = F_X (\infty, \ldots, x_i, \ldots, x_j, \ldots, \infty).
If X has a joint probability density function (PDF) f_X (\bullet), then
F_X (x_1, \ldots, x_n) = \int_{-\infty}^{x_1} \ldots \int_{-\infty}^{x_n} f_X (u_1, \ldots, u_n) \, du_1 \ldots du_n

If X_1 = (X_1 \ldots X_k)^T and X_2 = (X_{k+1} \ldots X_n)^T is a partition of X, then the conditional CDF of X_2 given X_1 satisfies
F_{X_2|X_1} (x_2 | x_1) = P(X_2 \le x_2 | X_1 = x_1).
If X has a PDF, f_X (\bullet), then the conditional PDF of X_2 given X_1 satisfies
f_{X_2 | X_1} (x_2 | x_1) = \frac{f_X (x)}{f_{X_1}(x_1)} = \frac{f_{X_1 | X_2}(x_1 | x_2) f_{X_2}(x_2)}{f_{X_1}(x_1)}

and the conditional CDF is then given by
F_{X_2 | X_1}(x_2 | x_1) = \int_{- \infty}^{x_{k+1}} \ldots \int_{- \infty}^{x_n} \frac{f_X (x_1, \ldots , x_k, u_{k+1}, \ldots , u_n)}{f_{X_1}(x_1)} \, du_{k+1} \ldots du_n
where f_{X_1}(\bullet) is the joint marginal PDF of X_1, which is given by
f_{X_1} (x_1, \ldots , x_k) = \int_{- \infty}^{\infty} \ldots \int_{- \infty}^{\infty} f_X (x_1, \ldots , x_k, u_{k+1}, \ldots , u_n) \, du_{k+1} \ldots du_n

We next look at independence, which is something H2 Mathematics students can easily relate to.

Here, we say the collection X is independent if the joint CDF factors into the product of the marginal CDFs, so that
F_X (x_1, \ldots , x_n) = F_{X_1} (x_1) \ldots F_{X_n}(x_n)

If X has a PDF, f_X(\bullet), then independence implies that the PDF also factorises into the product of the marginal PDFs, so that
f_X(x) = f_{X_1} (x_1) \ldots f_{X_n}(x_n).

Using the above results, we have that if X_1 and X_2 are independent, then
f_{X_2|X_1}(x_2 | x_1) = \frac{f_X (x)}{f_{X_1}(x_1)} = \frac{f_{X_1}(x_1) f_{X_2}(x_2)}{f_{X_1}(x_1)} = f_{X_2}(x_2)
This tells us that having information about X_1 tells us nothing about X_2.

Let’s continue to look further on the implications of independence.

Let X and Y be independent random variables. Then for any events A and B, P(X \in A, Y \in B) = P(X \in A) P(Y \in B)

We can check this with
P(X \in A, Y \in B)
= \mathbb{E}[1_{X \in A} 1_{Y \in B}]
= \mathbb{E}[1_{X \in A}] \mathbb{E}[1_{Y \in B}] by independence
= P(X \in A) P(Y \in B)

In general, if X_1, \ldots, X_n are independent random variables then
\mathbb{E}[f_1 (X_1) f_2(X_2) \ldots f_n(X_n)] = \mathbb{E}[f_1(X_1)] \mathbb{E}[f_2(X_2)] \ldots \mathbb{E}[f_n(X_n)]

Moreover, random variables can also be conditionally independent. For example, we say that X and Y are conditionally independent given Z if
\mathbb{E}[f(X)g(Y)|Z] = \mathbb{E}[f(X)|Z]\mathbb{E}[g(Y)|Z].
The above will be used in the Gaussian copula model for pricing of collateralised debt obligation (CDO).


We let D_i be the event that the i^{th} bond in a portfolio defaults. It is not reasonable to assume that the D_i's are independent, but they may be conditionally independent given a common factor Z, so that P(D_1, \ldots, D_n | Z) = P(D_1 | Z) \ldots P(D_n | Z) is often easy to compute.
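To make this concrete, here is a toy two-state version of the idea (the factor values and probabilities are invented for illustration; the Gaussian copula model replaces the two-state Z with a normal factor):

```python
# Z is a common factor ("state of the economy"); given Z, each bond
# defaults independently with probability p[Z].
p = {"good": 0.1, "bad": 0.5}    # P(D_i | Z = z), hypothetical numbers
pz = {"good": 0.5, "bad": 0.5}   # P(Z = z)

# Conditional independence: P(D_1, D_2 | Z = z) = P(D_1 | z) * P(D_2 | z)
joint_given_z = {z: p[z] ** 2 for z in p}

# Unconditionally, however, the defaults are positively correlated:
p_d1 = sum(pz[z] * p[z] for z in p)          # P(D_1) = 0.3
p_joint = sum(pz[z] * p[z] ** 2 for z in p)  # P(D_1, D_2) = 0.13
print(p_joint, p_d1 ** 2)  # 0.13 > 0.09, so D_1 and D_2 are NOT independent
```

Conditioning on Z makes the joint default probability a simple product, while marginalising Z back out produces the default clustering that CDO pricing cares about.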

Lastly, we consider the mean and covariance.
The mean vector of X is given by \mathbb{E}[X] := (\mathbb{E}[X_1] \ldots \mathbb{E}[X_n])^T
and the covariance matrix of X satisfies
\Sigma := \mathrm{Cov}(X) := \mathbb{E}[(X - \mathbb{E}[X])(X - \mathbb{E}[X])^T], so the (i,j)^{th} element of \Sigma is simply the covariance of X_i and X_j.
The covariance matrix is symmetric, its diagonal elements satisfy \Sigma_{i,i} \ge 0, and it is positive semi-definite, so that x^T \Sigma x \ge 0 for all x \in \mathbb{R}^n.
The correlation matrix \rho(X) has (i,j)^{th} element \rho_{ij} := \mathrm{Corr}(X_i, X_j). It is also symmetric, positive semi-definite, and has 1's along the diagonal.

For any matrix A \in \mathbb{R}^{k \times n} and vector a \in \mathbb{R}^k,
\mathbb{E}[AX + a] = A \mathbb{E}[X] + a (expectation distributes linearly)
\mathrm{Cov}(AX + a) = A \, \mathrm{Cov}(X) A^T
\Rightarrow \mathrm{Var}(aX + bY) = a^2 \mathrm{Var}(X) + b^2 \mathrm{Var}(Y) + 2ab \, \mathrm{Cov}(X, Y)
Recall that if X and Y are independent, then \mathrm{Cov}(X, Y) = 0, but the converse is not true in general.
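The identity \mathrm{Cov}(AX + a) = A \, \mathrm{Cov}(X) A^T actually holds exactly for sample covariances too, which makes it easy to check numerically; a small sketch with an arbitrary 2x2 matrix A (all numbers are my own choices):

```python
import random

random.seed(3)

# Sample some 2-D data and check Cov(AX + a) = A Cov(X) A^T
data = [(random.gauss(0, 1), random.gauss(0, 2)) for _ in range(500)]
A = [[1.0, 2.0], [0.5, -1.0]]
a = [3.0, -4.0]

def sample_cov(pts):
    # Population-style (divide by n) 2x2 sample covariance matrix
    n = len(pts)
    mx = sum(p[0] for p in pts) / n
    my = sum(p[1] for p in pts) / n
    cxx = sum((p[0] - mx) ** 2 for p in pts) / n
    cyy = sum((p[1] - my) ** 2 for p in pts) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in pts) / n
    return [[cxx, cxy], [cxy, cyy]]

transformed = [(A[0][0]*x + A[0][1]*y + a[0], A[1][0]*x + A[1][1]*y + a[1])
               for x, y in data]

S = sample_cov(data)
# A S A^T, written out entrywise for the 2x2 case
ASAT = [[sum(A[i][k] * S[k][l] * A[j][l] for k in range(2) for l in range(2))
         for j in range(2)] for i in range(2)]
T = sample_cov(transformed)
print(ASAT)
print(T)  # the two matrices agree up to floating-point error
```

Note that the shift a drops out entirely, exactly as the formula says: constants do not affect covariance.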

Conditional Expectations and Variances


Here we look at an important concept that extends from Bayes' Theorem, which we discussed briefly.

The conditional expectation identity says \mathbb{E}[X] = \mathbb{E}[\mathbb{E}[X|Y]].
The conditional variance identity says \mathrm{Var}(X) = \mathrm{Var}(\mathbb{E}[X|Y]) + \mathbb{E}[\mathrm{Var}(X|Y)].

Here both \mathbb{E}[X|Y] and \mathrm{Var}(X|Y) are functions of Y and are therefore random variables themselves.

With this, we start by considering a random sum of random variables. Let W = X_1 + X_2 + \ldots + X_N, where the X_i's are IID with mean \mu_x and variance {\sigma_x}^2, and where N is also a random variable, independent of the X_i's.

Then
\mathbb{E}[W] = \mathbb{E} [\mathbb{E}[\sum_{i=1}^N X_i | N]]
= \mathbb{E} [N \mu_x]
= \mu_x \mathbb{E}[N]

and
\mathrm{Var}(W) = \mathrm{Var}(\mathbb{E}[W|N]) + \mathbb{E}[\mathrm{Var}(W|N)]
= \mathrm{Var}(\mu_x N) + \mathbb{E}[N {\sigma_x}^2]
= {\mu_x}^2 \mathrm{Var}(N) + {\sigma_x}^2 \mathbb{E}[N]
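We can verify both formulas by simulation. In this sketch N is taken to be uniform on \{0, \ldots, 4\} (my choice, so that \mathbb{E}[N] = \mathrm{Var}(N) = 2) and the X_i's are normal:

```python
import random

random.seed(4)

mu_x, sigma_x = 1.5, 0.5
trials = 200_000

samples = []
for _ in range(trials):
    N = random.randrange(5)  # N uniform on {0,...,4}: E[N] = 2, Var(N) = 2
    samples.append(sum(random.gauss(mu_x, sigma_x) for _ in range(N)))

mean = sum(samples) / trials
var = sum((w - mean) ** 2 for w in samples) / trials
print(mean, var)
# Theory: E[W]   = mu_x E[N]                       = 1.5 * 2 = 3.0
#         Var(W) = mu_x^2 Var(N) + sigma_x^2 E[N]  = 2.25*2 + 0.25*2 = 5.0
```

Both estimates should land close to the theoretical 3.0 and 5.0.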

Continuous Random Variables


We look at the definitions first.

A continuous random variable, X, has a probability density function (PDF), f(\bullet) if f(x) \ge 0 and for all events A
P(X \in A) = \int_A f(y) dy
The CDF and PDF are related by F(x) = \int_{-\infty}^x f(y) dy
It is good to know that, for small \epsilon, we have P(X \in (x - \frac{\epsilon}{2}, x + \frac{\epsilon}{2})) \approx \epsilon f(x)

If X has a normal distribution, X \sim N(\mu, {\sigma}^2), then f(x) = \frac{1}{\sqrt{2 \pi {\sigma}^2}} e^{-\frac{(x - \mu)^2}{2 {\sigma}^2}}, with \mathbb{E}[X] = \mu and \mathrm{Var}(X) = {\sigma}^2

[Figure: the log-normal distribution]

We also have the log-normal distribution: X \sim LN(\mu, {\sigma}^2) if \log(X) \sim N(\mu, {\sigma}^2). Here, \mathbb{E}[X] = e^{\mu + \frac{{\sigma}^2}{2}} and \mathrm{Var}(X) = e^{2\mu + {\sigma}^2}(e^{{\sigma}^2} - 1). The log-normal distribution is very important in financial applications; for starters, the Black-Scholes equation.
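These two moment formulas are easy to confirm by simulating the exponential of a normal; the parameter values here are arbitrary:

```python
import math
import random

random.seed(5)

mu, sigma = 0.0, 0.5
n = 200_000
# X = exp(Z) with Z ~ N(mu, sigma^2) gives X ~ LN(mu, sigma^2)
xs = [math.exp(random.gauss(mu, sigma)) for _ in range(n)]

mean = sum(xs) / n
var = sum((x - mean) ** 2 for x in xs) / n

theory_mean = math.exp(mu + sigma**2 / 2)
theory_var = math.exp(2*mu + sigma**2) * (math.exp(sigma**2) - 1)
print(mean, theory_mean)  # both near 1.133
print(var, theory_var)    # both near 0.365
```

Note that \mathbb{E}[X] exceeds e^{\mu} = 1: exponentiating a symmetric distribution skews it to the right.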

Discrete Random Variables


As usual, definitions first.

The cumulative distribution function (CDF), F(\bullet) of a random variable, X, is defined by
F(x) := P(X \le x)

A discrete random variable, X, has probability mass function (PMF), p(\bullet), if p(x) \ge 0 and for all events A we have P(X \in A) = \sum_{x \in A} p(x)

The expected value of a discrete random variable, X is given by
\mathbb{E} [X] := \sum_i x_i p(x_i)

The variance of a random variable, X, is defined as \mathrm{Var}(X) := \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - \mathbb{E}[X]^2


Now we can look at some examples of discrete random variables.

Binomial Distribution

We say that X has a binomial distribution, that is, X \sim Bin(n, p), if P(X = r) = {n \choose r} p^r (1-p)^{n-r}
For example, X can represent the number of heads in n independent coin tosses, where p = P(head). We have that the mean \mathbb{E}[X] = np and variance Var(X) = np(1-p)

The special case of the binomial with a single trial (n = 1) is called the Bernoulli distribution.

I'll give a simple illustration of the binomial model used in finance; it was also used in financial engineering in its early days.
Suppose a fund manager outperforms the market in a given year with probability p and underperforms the market with probability 1 - p. She has a track record of 10 years and has outperformed the market in 8 out of those 10 years. We also assume that performance in any one year is independent of performance in other years.
From this illustration, we note that there are only two outcomes each year: she outperforms or she underperforms. We can let X be the number of outperforming years. Assuming the fund manager has no skill, X \sim Bin(10, \frac{1}{2}), and we can compute P(X \ge 8) to find the probability that a skill-less manager outperforms in at least 8 out of 10 years.
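The tail probability is a one-liner; assuming a fair coin (no skill), P(X \ge 8) works out as follows:

```python
from math import comb

# P(X >= 8) for X ~ Bin(10, 1/2): the chance a skill-less manager
# outperforms in at least 8 of 10 years
p = sum(comb(10, r) for r in range(8, 11)) / 2**10
print(p)  # 56/1024 = 0.0546875
```

So even a manager with zero skill has roughly a 5.5% chance of such a track record, which is why a single impressive record is weak evidence of skill.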
An extension here will be to consider there are M fund managers instead of 1 now.

Poisson Distribution

We say that X has a Po(\lambda) distribution if
P(X = r) = \frac{{\lambda}^r e^{-\lambda}}{r!}
with \mathbb{E}[X] = \lambda and \mathrm{Var}(X) = \lambda.

Next, we look at Bayes' Theorem, also known as "conditional probability" in H2 Mathematics.

Let A and B be two events for which P(B) \neq 0. Then

P(A | B)
= \frac{P(A \cap B)}{P(B)}
= \frac{P(B | A) P(A)}{\sum_j P(B | A_j) P(A_j)}

where the A_j's form a partition of the sample space.
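As a worked instance of the partition form of Bayes' Theorem (all numbers here are purely illustrative): let A_1 = "the manager is skilled", A_2 = "the manager is not skilled", and B = "the manager outperforms this year".

```python
# Prior over the partition {A_1, A_2} and the likelihoods P(B | A_j);
# the numbers are made up for illustration
prior = {"skilled": 0.2, "unskilled": 0.8}
likelihood = {"skilled": 0.7, "unskilled": 0.5}  # P(B | A_j)

# Denominator: total probability P(B) = sum_j P(B | A_j) P(A_j)
p_b = sum(likelihood[a] * prior[a] for a in prior)
# Bayes' theorem: P(A_1 | B) = P(B | A_1) P(A_1) / P(B)
posterior = likelihood["skilled"] * prior["skilled"] / p_b
print(p_b, posterior)  # 0.54 and 0.14/0.54, about 0.259
```

One year of outperformance lifts the probability of skill from 0.2 to about 0.26: evidence moves the prior, but only modestly.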

Game Theory #2


To illustrate another example similar to the last post, suppose ALL the A-Level candidates meet up in the town square. Prior to meeting up, they were all told by the head examiner that if everyone gets the same grade, everyone gets an A; otherwise, the highest scorer gets an A while the rest fail. One of the students then suggests during the meet-up: why don't they all fail the exam together and score zero marks? That would guarantee everyone an A. Suppose you are one of the candidates there: will you be an angel and follow along, or a devil that deviates and scores 1 mark?

This is similar in nature to what we saw previously with the Prisoner's Dilemma; the only difference is that we have many more participants to consider. After all, there is no telling an angel from a devil. So will you be an angel or a devil?

To properly solve this, we need to define a few more things.

A strategy s_i \in S_i is a strictly dominant strategy for player i in game G = [I, \{S_i\}, \{u_i(\bullet)\}] if for all s'_i \neq s_i, we have u_i (s_i, s_{-i}) > u_i (s'_i, s_{-i}) for all s_{-i} \in S_{-i}

A strategy s_i \in S_i is a strictly dominated strategy for player i in game G = [I, \{S_i\}, \{u_i(\bullet)\}] if there exists another strategy s'_i \in S_i such that for all s_{-i} \in S_{-i}, we have u_i(s'_i, s_{-i}) > u_i (s_i, s_{-i}).

Okay, so let us look at these definitions closely first, before you go "meh, sets?!" These are definitions, and definitions tend to be dry. Let me begin by saying that the index i refers to the player (1 or 2, for instance), u refers to utility (payoff), and s represents the various strategies that the player plays. Now that the variables are defined, are you starting to see how the definitions make sense? The prime (') denotes an alternative strategy.
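The strictly-dominated-strategy definition translates directly into code. A sketch using the prisoner's dilemma payoffs from the previous post (jail years negated so that larger is better; the helper name is my own invention):

```python
def strictly_dominated(payoff, s, strategies_i, strategies_others):
    """True if some alternative strategy beats s against EVERY choice
    the other players can make (the definition of strict dominance)."""
    return any(
        all(payoff[(alt, s_o)] > payoff[(s, s_o)] for s_o in strategies_others)
        for alt in strategies_i if alt != s
    )

# Prisoner's dilemma payoffs for player 1: u_1[(own move, other's move)],
# expressed as negated years of jail so that bigger means better
u1 = {("rat", "rat"): -5, ("rat", "silent"): 0,
      ("silent", "rat"): -7, ("silent", "silent"): -2}

moves = ["rat", "silent"]
print(strictly_dominated(u1, "silent", moves, moves))  # True
print(strictly_dominated(u1, "rat", moves, moves))     # False
```

Staying silent is strictly dominated by ratting, which is exactly why rational prisoners rat on each other.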

Next, a decision maker is rational if the strategy she chooses is at least as good as any other available strategy. Clearly, a rational player maximises her (expected) payoff.

1. In the prisoner's dilemma we saw, rational prisoners will rat on each other.
2. In the angel-vs-devil students' dilemma we saw, rational students will still do their best.

Of course, these two cases differ in the number of people participating, which brings us to the next focus: iterated deletion of strictly dominated strategies.

Here, we assume that rationality is common knowledge, that is, every player is rational. Thus, as a decision maker, a player will never play a strictly dominated strategy (since it does not maximise his payoff).

Before I go further, I should clarify that payoff is different from income. Consider a dictator allocating an amount of money between himself and another player: the dictator might be altruistic or inequality-averse (since payoffs are based on his beliefs) and thus maximises his utility (payoff), not his income. And we remind ourselves again that a rational player will want to maximise his utility.

So next time, we will look at iterated deletion of strictly dominated strategies.

Click here for Game Theory #1

Game Theory #1


Over the past two months, I’ve been reading up on Game Theory extensively. It has always been a subject of interest for me since undergraduate days. And I’m glad that I found time to read a bit on it. Students that are keen, can also follow the open courseware provided by Yale University for free.

Firstly, this is going to be slightly mathematical (I’m after all trained in Mathematics) and will require some statistics. So let us start by considering the most rudimentary case of Prisoner’s Dilemma, to provide a simple introduction on what follows.

Suppose we have two prisoners and a police officer. Both prisoners are facing a two-year sentence. But the police officer now offers them a deal: if a prisoner cooperates and rats out the other, he will go scot-free while his partner faces seven years of jail. We assume that the police officer offers the same deal to both prisoners independently. If both choose to rat, both of them will face five years of jail.

So here, we are facing a situation of strategic interdependence in the sense that each individual’s payoff (jail sentence) depends not only on her own actions, but also on the action of the other individual. As a result, an individual’s best action may depend on what she expects the other player to do. Given this game, what will you do if you’re a prisoner?

Before looking at the game, let us properly define the following:

Dominant Strategy: A strategy that is always the best for a player no matter what other players do.

Dominated Strategy: A strategy that generates worse payoffs than some other strategy for any choices made by other players.

In other words, if you have a dominant strategy, you should use it, while you should never use a dominated strategy.

Back to the game: we shall set up a simple payoff matrix here (entries are years of jail for prisoner 1 and prisoner 2 respectively, so fewer is better) and then analyse the game from there.

                      P2 stays silent    P2 rats
P1 stays silent           (2, 2)          (7, 0)
P1 rats                   (0, 7)          (5, 5)

So what will I do? I'll just rat on the other prisoner. Yes, let us be honest and rational. After all, looking at the table, and assuming he goes through a similar train of thought, he will probably rat on me. I should do the same to protect my self-interest.

Now, any idea why this is called a dilemma? Because each player has a dominant strategy here, and the result is Pareto inefficient (a socially inefficient outcome).

There isn't a right or wrong answer here; it depends entirely on what sort of equilibrium we are looking at, and, of course, what kind of people you are dealing with deep down.

I hope this provides a simple introduction to Game Theory. We will look more into various games, like auctions and business models, as we progress. We will also consider some limitations, like imperfect knowledge. I hope this sparks some interest in the subject and gradually shows you how Game Theory deepens our understanding of a lot of real-world phenomena and problems!

Bayesian vs Frequentist Statistics


Some of you will probably have heard of these two terms and wonder what the difference is. Both are statistical approaches/inference methods. Personally, I am a Bayesian and use mostly Bayesian methods in my work.

Firstly, the frequentist approach is a much simpler model and does not consider prior knowledge. This is a very stark difference that we should note before looking at a better definition.

To illustrate it, I’ll borrow an example that I saw online. So I lost my phone somewhere at home and I can use the “find my phone” app to locate it. Once I activate the app, the phone will start to beep. So which area should I search?

Now, frequentist reasoning considers probability as derived from long-run frequency distributions, with the underlying parameters held constant during the process. Going back to the example, I have the ability to identify the area the beep is coming from, and upon hearing the beep, I infer the area that I must search.

Next, Bayesian reasoning considers probability to be constantly updated as new information is realised. Parameters are unknown and described probabilistically. Going back to the example again: apart from the ability to identify the area the beep is coming from, I also have prior knowledge of where I have misplaced my phone before. With the aid of the beep and prior knowledge of my clumsiness, I will be able to identify which area to search now. Here, you should intuitively realise that we have a better chance of finding the phone faster. 🙂

The blog post I borrowed this example from does an excellent job of conveying the intuition behind frequentist and Bayesian reasoning.

In A-Level Statistics, the closest contact students have with Bayesian statistics is conditional probability, whereby we understand that the probability of an event changes given new information.

Bayesian statistics has a branch called Bayesian estimation which, in the context of what we have discussed, treats probability as something constantly updated with current information. Examples include interest rate trends and hedging. This is used widely in the real world today. Of course, to do this branch of statistics, one has to be really well acquainted with statistical distributions.
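A minimal sketch of what such updating looks like, using the standard Beta-Bernoulli conjugate pair (the prior choice and the data are purely illustrative):

```python
# Bayesian updating of a Beta(a, b) prior on the probability p of a
# "success" (e.g. a day the interest rate moves up). Beta is conjugate to
# the Bernoulli likelihood, so updating is just counting successes/failures.
def update(a, b, successes, failures):
    return a + successes, b + failures

a, b = 1, 1                # Beta(1, 1): a uniform (uninformative) prior on p
a, b = update(a, b, 7, 3)  # observe 7 successes and 3 failures

posterior_mean = a / (a + b)
print(posterior_mean)  # 8/12, about 0.667
```

Each new observation simply shifts the posterior; this is the "constantly updated with current information" idea in its simplest concrete form.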

What are stochastic and deterministic processes?


During class, I occasionally mention to students that the calculus you learn is deterministic, and they go "uh huh". So what is non-deterministic?! I'll attempt to give some simple definitions and examples, and also touch on stochastic processes.

First, a deterministic process is one where, given a particular state and an action, there is only one possible successor state to move on to. That is what we do at A-Levels, since we often don't consider any randomness in our work.

Second, a non-deterministic process is one where, given a particular state and an action, there is a set of possible successor states to move on to.

Lastly, a stochastic process is one that, to begin with, has a probability of being in a particular state, and has a finite number of states it can transition to, with transition probabilities that add up to one.

A deterministic process is one where the present state completely determines the future state. If I make a (riskless) investment of $10,000 at 10% interest, compounded annually, then in one year’s time I will have $11,000, in two years’ time I will have $12,100, and so forth.

A stochastic process, on the other hand, is one where there is some randomness or uncertainty involved. If I buy $10,000 worth of X-company stock today, how much money will it be worth in one year's time? In two years' time? Who knows? This is a stochastic process.
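The contrast between the two processes above can be seen in a few lines of code (the ±20% yearly return band for the stock is an invented toy model, not a calibrated one):

```python
import random

random.seed(7)

# Deterministic: $10,000 at 10% interest, compounded annually.
# The present state completely determines the future state.
value = 10_000.0
for year in range(2):
    value *= 1.10
print(value)  # about 12100 after two years, every time

# Stochastic: the same $10,000 in a stock whose yearly return is random
# (here a hypothetical uniform +-20% band around the same 10% drift)
paths = []
for _ in range(5):
    v = 10_000.0
    for year in range(2):
        v *= 1.10 + random.uniform(-0.20, 0.20)
    paths.append(round(v, 2))
print(paths)  # five different outcomes from identical starting conditions
```

Running the deterministic block again always gives the same answer; running the stochastic block again (with a different seed) gives a different set of outcomes, which is exactly the distinction being drawn.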