Entropy: Part I (Hidden Information)

My housemates have asked me, numerous times:

But what really is entropy??? Is it chaos? Is it information? Does it grow infinitely? Can one reverse it like they do in Tenet? Would I go back in time then?

I thought it was about time to explain a few things that are definitely true from a physics perspective, and then let them use that knowledge in whatever way suits them in real life.

See, understanding entropy is a bit like understanding a major historical event: a single viewpoint is at best limiting and at worst misleading. So I have decided to explain the concept from many different angles, with different examples, in a series of blog posts.

In this first post we take the perspective of Claude Shannon, “the father of information theory”. So we will be talking about dice, rainy weather in the UK, and logarithms…

Entropy as Hidden Information

Imagine that you are taking a flight to somewhere in the UK. You have not bothered to check the weather forecast in advance. You step out of the terminal doors, and outside, pouring rain and a gloomy sky greet you, as if you were in a noir detective novel. Are you surprised? Well, no: the weather in England famously sucks. Now imagine that, instead, you are greeted by salsa music and bathed in a warm shower of sun under a smiling blue sky. You would be plenty surprised if that happened.

So, wouldn’t it be nice if you could somehow quantify your level of surprise, make it precise? That’s exactly what Claude Shannon, one of the greatest minds of the 20th century, set out to do. He imagined a situation with $ n $ possible outcomes, labelled $ i = 1, 2, \dots, n $. Further, he imagined that we know the probability for each outcome to occur; call these probabilities $ p_i $. Each probability is a number between $ 0 $ and $ 1 $, where $ 0 $ represents absolute certainty that the outcome will not occur, and $ 1 $ represents absolute certainty that it will. We call the collection of all the listed outcomes and their respective probabilities the system.

  • Take for example our weather situation at the terminal in Britain. We can imagine two possible outcomes: it rains, say with average yearly probability $ p_1 = 0.6 = 60\% $, or it doesn’t rain with probability $ p_2 = 0.4 = 40\%$.
  • Another example can be a fair die. When you roll it, one of the six sides will show up, making a total of six possible outcomes. Since we assume that the die is fair, the probability for each outcome is the same and equal to $ 1 /6 $.

Notice that for the general case of $ n $ possible outcomes, the set of all probabilities has a peculiar property:

$$\tag{1}p_1 + p_2 + \dots + p_n = \sum_{i=1}^{n}p_i = 1 \,,\label{eq:sum_to_one}$$

meaning that with absolute certainty at least one of the possible outcomes will occur (this doesn’t tell us which one, of course).
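If you like seeing things in code, here is a minimal Python sketch of the two example systems above, just to make the bookkeeping concrete. The dictionary names are purely illustrative and the numbers are the made-up figures from the text, not real climate data; the snippet simply checks that condition (1) holds for both systems.

```python
import math

# The two example systems, written as outcome -> probability maps.
# The numbers are the illustrative figures from the text, not real climate data.
britain_weather = {"rain": 0.6, "no rain": 0.4}
fair_die = {side: 1 / 6 for side in range(1, 7)}

# Equation (1): the probabilities of any system must sum to 1.
for name, system in [("Britain weather", britain_weather), ("fair die", fair_die)]:
    total = sum(system.values())
    assert math.isclose(total, 1.0), f"{name} is not a valid system"
    print(f"{name}: probabilities sum to {total:.3f}")
```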

We are ready to start quantifying our “level of surprise”. Shannon thought of a “surprise” as the amount of information that is initially hidden from you but gets revealed when you observe a particular outcome.

  • For example, when you step out of the terminal into the soaking rain, you learn a very unpleasant piece of information, but you are not very surprised, so the information gain from observing this particular outcome should be small. Conversely, if it is sunny and the birds are singing, you are surprised, so the information gain should be higher.
  • For the fair die example, the probability for each side to show is the same, so it makes sense for the information gain of any of the outcomes to be the same.

These examples show that the information gain from a particular event should be thought of as a function of the probability of that event. Let’s call this quantity $ I(p_i) $, which is to be read: “the information gain from observing outcome $ i $”. Shannon thought deeply and came up with four sensible requirements that this function should obey:

  1. $I(p_i)$ decreases as $p_i$ increases: exactly as in our rain vs. sun example.
  2. $I(p_i) \geq 0$: you can gain zero information from observing an outcome (if it occurs with probability $1$), but you cannot gain negative information; that would be as if observing the outcome made you learn less than not observing it at all.
  3. $I(1) = 0$: outcomes which occur with absolute certainty carry no information.
  4. $I(p_i p_j) = I(p_i) + I(p_j)$. If $p_i$ and $p_j$ are the probabilities of two independent outcomes $i$ and $j$, the probability of both of them happening is $p_i p_j$. But they are independent, so the information gain from observing both should just be the sum of the individual information gains.

The last requirement can be a little tricky to understand, so let’s give an example. Imagine that as soon as you walk out of the terminal, you also roll your die. Presumably, the die has nothing to do with the weather gods of Britain, so in the case that it rains and you roll a $ 5 $ (a combined outcome with total probability $ 0.6 \times \frac{1}{6} $), you learn from this combined observation the same information that you would have learned if you had rolled a $ 5 $ while still on the airplane and only later observed that it rains.

Based on these four properties, Shannon found a function $I(p_i)$ that satisfies them all (he also showed that, up to the choice of base of the logarithm, it is the only smooth function that does):

$$\tag{2}I(p_i) = \, - \log(p_i) \,.$$

To better understand this formula, it is useful to plot the function, shown in Figure 1.

Figure 1 Plot of the information gain as a function of the probability.

Property 1 is clearly satisfied. Property 2 is satisfied for all values of the probability that you will actually plug into the function; remember that $ 0 \leq p_i \leq 1 $. Property 3 is also obviously satisfied. Finally, property 4 cannot be read directly off the plot, but it is a generic property of logarithms that essentially follows from

$$\tag{3}e^x e^y = e^{x+y} \implies \log(x) + \log(y) = \log(xy) \,.$$

One funny thing that you might notice from the plot is that $ I(p) \to \infty $ as $ p \to 0 $. That is nothing to worry about; it actually makes perfect sense. Observing an outcome that is guaranteed not to happen should carry quite a lot of surprise: an infinite amount of surprise!
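If you prefer to check the four requirements numerically rather than read them off the plot, here is a minimal Python sketch. It uses the natural logarithm, consistent with the numbers quoted later in this post, and the helper name `information_gain` is just an illustrative choice of mine.

```python
import math

def information_gain(p: float) -> float:
    """Shannon's information gain I(p) = -log(p), using the natural logarithm."""
    return -math.log(p)

# Property 1: I decreases as p increases (rain is less surprising than sun).
assert information_gain(0.6) < information_gain(0.4)

# Property 2: I(p) >= 0 for every probability 0 < p <= 1.
assert all(information_gain(p) >= 0 for p in [0.01, 0.4, 0.6, 1.0])

# Property 3: an outcome that is certain carries no information.
assert information_gain(1.0) == 0.0

# Property 4: independent outcomes add up, e.g. "it rains AND the die shows 5".
p_rain, p_five = 0.6, 1 / 6
assert math.isclose(
    information_gain(p_rain * p_five),
    information_gain(p_rain) + information_gain(p_five),
)
```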

Now that we have explained the information gain from observing a single outcome, we are finally ready to introduce the concept of entropy. We have a pool of information gains, one for each outcome $ i $, depicted in Figure 2.

Figure 2 The collection of all the information gains that one would obtain if one were to observe the particular outcomes.

Entropy is some sort of average of these information gains; it doesn’t aim to tell us anything about a particular outcome and the information it would carry if it were observed. Rather, entropy is designed to give us an overall feel for how complicated the whole system is. It is a feature of the system that one can calculate and work with without ever actually observing the system! It is a measure of how much information, on average, you would get out of the system if you were to observe it. So, how exactly should we perform this average? One naive suggestion is to simply sum all the information gains and divide by the total number of outcomes:

$$\tag{4}\frac{ I(p_1)+I(p_2)+\dots+I(p_n) }{ n } \,.$$

However, that doesn’t quite give us what we wanted. We wanted to know how large the information gain would be, on average, if one were to make an observation. Outcomes that occur rarely are also observed rarely, so their (large) information gains should count less in the average. The correct thing to do is therefore a weighted sum: each information gain is first multiplied by the probability of observing that outcome, and then these products are added up. In all its grand beauty, we present the entropy formula

$$\tag{5}S = p_1 I(p_1) + p_2 I(p_2) + \dots + p_n I(p_n) = \sum_{i=1}^{n} p_i I(p_i) = \, - \sum_{i=1}^{n}p_i \log(p_i) \,. \label{eq:entropy}$$

Notice that the entropy formula doesn’t tell you that you will for sure obtain an amount of information $ S $ when you observe the system. When you do make an observation, you gain information $ I(p_i) $, depending on whichever outcome $ i $ you stumble upon, and your “level of surprise” might be higher or lower than the averaged “level of surprise”, which is what $ S $ represents. Let’s work out a few explicit examples to get a feel for how entropy works.

  • Going back to the rainy/sunny terminal, the entropy is laughably easy to compute:
    $$\tag{6}S_{\text{Britain}} = - 0.4 \log(0.4) - 0.6 \log(0.6) \approx 0.673 \,.\label{eq:entropy_britain}$$

    Notice that if the probabilities were reversed, that is, if the probability of rain were $ 0.4 $ and of sun $ 0.6 $, you would get exactly the same answer for the entropy! Now, suppose instead that you were landing in Spain. There the chance of rain is minuscule, especially in summer, so more realistically the chance of sun is something like $ 0.9 $ and the chance of rain $ 0.1 $. Computing the entropy in that case, we get

    $$\tag{7}S_{\text{Spain}} = -0.1 \log(0.1) - 0.9 \log(0.9) \approx 0.325 \,.$$

    We get a smaller entropy compared to (\ref{eq:entropy_britain}). That is because there is more “certainty” in the system. The entropy of a system with only two possible outcomes is maximal when their probabilities are equal:

    $$\tag{8}S_{\text{max}} = -0.5 \log(0.5) - 0.5 \log(0.5) \approx 0.693 \,.$$

    Any other arrangement of the two probabilities will yield a smaller answer.

  • Let’s compute the entropy of the fair die that we also discussed. We get
    $$\tag{9}S_{6-\text{die}} = \, - \frac{1}{6} \log\left( \frac{1}{6}\right) - \dots - \frac{1}{6} \log\left( \frac{1}{6}\right) = - \frac{6}{6} \log\left( \frac{1}{6}\right) = \log(6) \approx 1.792 \,.$$

    What if, instead, we have a $ 20 $-sided fair die, like the ones used in D&D? The entropy in that case is

    $$\tag{10}S_{20-\text{die}} = \, - \frac{20}{20} \log\left( \frac{1}{20}\right) = \log(20) \approx 2.996 \,.$$
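If you want to verify these numbers yourself, here is a minimal Python sketch of formula (5), again using the natural logarithm to match the values quoted above; the helper name `entropy` is just an illustrative choice.

```python
import math

def entropy(probabilities) -> float:
    """Equation (5): S = -sum_i p_i log(p_i), with the convention 0 * log(0) = 0."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)

print(f"Britain (0.6 rain / 0.4 sun):  {entropy([0.6, 0.4]):.3f}")   # ~0.673
print(f"Spain   (0.1 rain / 0.9 sun):  {entropy([0.1, 0.9]):.3f}")   # ~0.325
print(f"Two outcomes, 50/50 (maximum): {entropy([0.5, 0.5]):.3f}")   # ~0.693
print(f"Fair 6-sided die:              {entropy([1/6] * 6):.3f}")    # ~1.792
print(f"Fair 20-sided die:             {entropy([1/20] * 20):.3f}")  # ~2.996
```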

From these examples we learn a general lesson: the closer the probabilities are to each other, and the more possible outcomes there are, the higher the entropy. That is, the more moving parts the system has, and the more evenly distributed the probabilities, the higher the entropy. That is why people sometimes refer to entropy as a “measure of chaos”, but in my humble opinion this way of thinking is at best vague and at worst misleading. What entropy really means is:

On average, how much hidden information lies within the system. That is: on average, what the information gain would be if the system were observed.

The more stringent readers among you have probably noticed that all is nice and good with the entropy formula (\ref{eq:entropy}), provided that someone has told you in advance what all the possible outcomes are and precisely what their probabilities are. Without this a priori data, you cannot compute the entropy! This is indeed a big issue, because often in life you don’t even know the set of all possible outcomes, let alone the precise chance of each of them occurring. Indeed, much of the art of working with entropy is to deduce this data from first principles (possibly with some justified approximations), or to perform many experiments on multiple copies of the system in order to collect the necessary data.

To conclude, let me mention one simplifying assumption, a special case of the entropy formula, that shows up very often in practice. If all $ n $ outcomes are equally likely, then, since the probabilities have to sum to $ 1 $ (see equation (\ref{eq:sum_to_one})), each probability is $ \frac{1}{n} $. The entropy formula then simplifies to

$$\tag{11}S = - \sum_{i=1}^{n} \frac{1}{n} \log\left( \frac{1}{n}\right) = \, - \frac{n}{n} \log \left( \frac{1}{n}\right) = -\log\left( \frac{1}{n}\right) = \log(n) \,.\label{eq:boltzmann}$$

That is: if there are $n$ possible outcomes and they are all equally likely, the entropy is just the logarithm of the total number of possible outcomes. This formulation is useful when you have no principled reason to say that one outcome is more likely than another, so you just assume the simplest possible thing: they are all equally likely. It often gives very good results for systems with many possible outcomes. Formula (\ref{eq:boltzmann}) is inscribed on the tombstone of Ludwig Boltzmann, Figure 3, a rebel who understood the statistical origin of entropy back when others were still doubting whether atoms exist. He was truly ahead of his time.
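As one last sanity check, in the same spirit as the sketches above, the general formula (5) applied to a uniform distribution indeed collapses to $\log(n)$, as in (\ref{eq:boltzmann}):

```python
import math

def entropy(probabilities) -> float:
    """Equation (5): S = -sum_i p_i log(p_i)."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)

# Equation (11): for n equally likely outcomes, the entropy reduces to log(n).
for n in [2, 6, 20, 100]:
    uniform = [1 / n] * n
    assert math.isclose(entropy(uniform), math.log(n))
    print(f"n = {n:3d}: S = {entropy(uniform):.3f} = log({n})")
```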

Figure 3 Boltzmann’s tombstone in Vienna, Austria. The inscribed formula looks slightly different from (\ref{eq:boltzmann}), but that is simply because he used the letter $W$ to denote the total number of possible outcomes and, since he was doing physics, introduced a numerical constant $k = 1.380649 \times 10^{-23} \, \text{m}^2 \, \text{kg} \, \text{s}^{-2} \, \text{K}^{-1}$, which nowadays carries his name and makes sure that entropy has the correct physical units.