Selection
Selection is the process of searching a space of states to identify a target state. Sometimes the target state is known explicitly; other times it is defined implicitly by some property. For example, in biology, the state space may be the space of sequences of amino acids, with the target state being the sequence that folds to yield a protein with a desired property. Similarly, for a neural network the state space may be the space of all parameter values of the network, with the target state being the set of parameters that allows the network to perform well at a particular task. In both instances, the target state is defined implicitly through some external mechanism that evaluates a given state's capacity to satisfy the desired property. For neural networks, this mechanism is a loss function, which evaluates the performance of the corresponding network on a set of training data. In biology, the mechanism is natural selection. Natural selection posits that more desirable states are selected because they have some competitive advantage that allows them to survive while other states die out. Thus natural selection is often paraphrased as "survival of the fittest". However, these mechanisms do not explain the arrival of the fittest. In both cases we observe that the selection process yields states close to the desired target despite the state space being incredibly vast. How can a loss function or natural selection traverse this space, in a reasonable amount of time, and identify the target state? Indeed, such mechanisms alone are not powerful enough to perform selection in these enormous spaces. Instead, there is some other process, namely random variation, that restricts the state space so that these external mechanisms can efficiently identify states close to the target state.
Consider the space of all possible phrases of 17 characters, where each character is a capital letter or a space. In this case, we can explicitly define a target state, say THE TYPING MONKEY. That is, the target state is one particular phrase out of all possible 17-character phrases, where there are 27 characters to choose from at each position. More specifically, the target state is one out of 27 to the power of 17 possibilities, so even in this relatively simple example, the number of possible states is vast. To put the enormity of this state space into perspective, suppose that you had a grain of sand for each phrase in this state space. Then you would have enough grains of sand to give each person in Australia a planet Earth's worth of sand.
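To make the number concrete, the size of the state space can be computed directly. A minimal sketch in Python (the variable names are purely illustrative):

```python
# Size of the state space: phrases of 17 characters drawn from a
# 27-symbol alphabet (26 capital letters plus the space).
alphabet_size = 27
phrase_length = 17

num_states = alphabet_size ** phrase_length
print(f"{num_states:,}")  # 2,153,693,963,075,557,766,310,747 -- roughly 2 x 10^24
```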
We will use this specific selection problem, together with the typing monkey thought experiment, to explore the selection process. Generally speaking, the typing monkey thought experiment supposes that there is a monkey randomly pushing the keys of a typewriter for an infinite amount of time. The output of the monkey's typing is used to conduct the selection process. By formulating the typing monkey thought experiment in different ways we can make sense of how evolution identifies good states, and why neural networks find states that have strong generalisation capabilities.
Borel's Typing Monkey
Borel's typing monkey is a canonical formulation of the thought experiment. Here the typing monkey randomly pushes keys to generate a state, which in our example is a sequence of 17 characters. Eventually, after enough typing, one would expect the monkey to type out the phrase THE TYPING MONKEY and thus solve the selection problem. This is because, although enormous, the state space is still finite, and so with an infinite amount of time the monkey will eventually type out the target phrase. Indeed, one can show formally that the monkey will, with probability one, eventually type out the target phrase.
Borel's typing monkey is conducting what is referred to as single-step selection. That is, the monkey randomly generates a candidate state and then checks whether the state matches the target state. If the candidate state matches the target state then the selection problem is deemed solved; otherwise, the monkey types out another random candidate state.
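A minimal sketch of single-step selection, assuming a uniform random choice over the 27-character alphabet; the function names and the attempt budget are my own choices for illustration:

```python
import random
import string

ALPHABET = string.ascii_uppercase + " "  # 27 possible keys
TARGET = "THE TYPING MONKEY"

def random_phrase(length=len(TARGET)):
    """The monkey types each of the 17 characters uniformly at random."""
    return "".join(random.choice(ALPHABET) for _ in range(length))

def single_step_selection(max_attempts=1_000_000):
    """Generate independent random phrases until one matches the target.

    With 27**17 possible phrases, a match within any practical number of
    attempts is essentially impossible, so this loop is expected to
    exhaust its budget and return None.
    """
    for attempt in range(1, max_attempts + 1):
        if random_phrase() == TARGET:
            return attempt
    return None

print(single_step_selection())  # almost certainly prints None
```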
Note that single-step selection can only be applied to selection problems with an explicitly defined target state. Therefore, it is insufficient as an explanation for selection problems observed in the real world, such as protein synthesis, as often the target state is unknown.
Even though the selection problem we are considering has an explicitly defined target state, THE TYPING MONKEY, this does not stop us from introducing an external mechanism that can guide the selection process by determining which states are close to the target state. Indeed, this is a necessary step to ensure that selection problems in such large spaces are solved in a reasonable amount of time. Richard Dawkins's typing monkey experiment does precisely this.
Richard Dawkins's Typing Monkey
In The Blind Watchmaker, Richard Dawkins provides an alternative version of Borel's typing monkey to illustrate the concept of cumulative selection. In cumulative selection, as in single-step selection, the monkey randomly types an initial phrase of 17 characters. If the phrase typed out by the monkey is the target phrase, then the monkey has solved the problem. However, it will more than likely not be, and so the monkey proceeds by retyping the phrase it has just written. In doing so, random errors occur, but the monkey persists and continues to duplicate its original random sequence of characters. Eventually, the duplicated phrase that best matches THE TYPING MONKEY is chosen. For example, the best phrase may be taken to be the one that matches the target phrase in the greatest number of character positions. The monkey then proceeds to duplicate this best phrase, with random errors still occurring. After the monkey has duplicated the phrase sufficiently many times, the phrase that best resembles THE TYPING MONKEY is again chosen, and the process repeats.
It is not immediately obvious that cumulative selection should be more efficient than single-step selection; however, we can observe the efficiency gain with the following specific example. Suppose that at the n-th iteration of cumulative selection, the phrase deemed to resemble THE TYPING MONKEY the most is the one that matches the target in the n-th character. For example, suppose the monkey initially generates FD KFDJOPET HF BB. The monkey will then duplicate this phrase until an error causes the initial letter F to change to a T. Once this has occurred, the monkey will duplicate the resulting phrase until an error causes the second character to change to an H, and so on. Note that after the n-th step the first n characters are fixed and the monkey is effectively only varying the remaining 17 - n characters. It follows that to solve the selection problem, the monkey only has to match one character out of twenty-seven a total of seventeen times, which is significantly easier than matching all seventeen characters in one go.
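The following is a minimal sketch of cumulative selection in the spirit of Dawkins's weasel program. The error rate, the number of copies per generation, and the scoring rule (count of characters matching in position) are illustrative assumptions, not values taken from the text above.

```python
import random
import string

ALPHABET = string.ascii_uppercase + " "
TARGET = "THE TYPING MONKEY"

def score(phrase):
    """Number of characters matching the target in the correct position."""
    return sum(a == b for a, b in zip(phrase, TARGET))

def mutate(phrase, error_rate=0.05):
    """Duplicate the phrase, mis-typing each character with a small probability."""
    return "".join(
        random.choice(ALPHABET) if random.random() < error_rate else c
        for c in phrase
    )

def cumulative_selection(copies_per_generation=100):
    """Repeatedly duplicate the current phrase and keep the best duplicate."""
    phrase = "".join(random.choice(ALPHABET) for _ in range(len(TARGET)))
    generations = 0
    while phrase != TARGET:
        duplicates = [mutate(phrase) for _ in range(copies_per_generation)]
        phrase = max(duplicates, key=score)
        generations += 1
    return generations

print(cumulative_selection())  # typically a few hundred generations, not 27**17 attempts
```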
This method of selection is significantly quicker than single-step selection; however, this is to be expected, as it relies on an external mechanism for determining which states are better than others. For biological systems, natural selection provides a rough guide as to which states most resemble the target state. However, it is not perfect. For example, natural selection may get trapped in locally good states that are not globally good. To illustrate what I mean, suppose that all possible states lie along a mountain range with deep troughs and various peaks, where states in the troughs resemble the target state less than states near the peaks. The random initial state likely lies in one of these deep troughs. Through the duplication process, the states can only move in small steps along this mountain range; however, over time natural selection will push the state up the mountain range, as relatively higher states hold an advantage that sees them survive more frequently than relatively lower states. Eventually, natural selection will push the state to a peak in the mountain range. Once at this peak, from the perspective of natural selection, there are no nearby states that are better, and so it must be the target state. What natural selection cannot see is that in the distance there is an even higher peak. To reach that higher peak, the states must descend into a trough, which natural selection will not permit. In this sense, natural selection is short-sighted.
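To make the mountain-range picture concrete, here is a toy sketch of an uphill-only search on a one-dimensional landscape with a low peak and a higher one; the particular fitness function, starting region, and step size are invented purely for illustration.

```python
import random

def fitness(x):
    """A toy mountain range: a local peak of height 3 near x = 2
    and a higher global peak of height 6 near x = 8."""
    return max(3 - abs(x - 2), 6 - abs(x - 8), 0)

def hill_climb(x, step=0.1, iterations=2000):
    """Accept a neighbouring state only if it is at least as fit; never descend."""
    for _ in range(iterations):
        candidate = x + random.choice([-step, step])
        if fitness(candidate) >= fitness(x):
            x = candidate
    return x

start = random.uniform(0.0, 3.0)    # an initial state in the basin of the lower peak
print(round(hill_climb(start), 1))  # ends near 2; the higher peak at 8 is never reached
```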
One can imagine that the enormous state spaces present in biology will be littered with these local peaks. However, what we observe in nature is that evolution manages to reach globally high peaks. Similarly, neural networks have been shown to arrive, through training, at global optima in the parameter space. Therefore, something additional is required to explain these phenomena. That is why we now turn to the algorithmic typing monkey, which leverages random variation to restrict the possibility space, allowing mechanisms such as natural selection to identify the global optimum.
The Algorithmic Typing Monkey
In the context of biology, the search for sequences of amino acids occurs in a state space immensely larger than the one we are considering. Even though it is theoretically possible to identify the target state through single-step selection, this would happen on a timescale well beyond that of biological organisms. Therefore, the single-step approach cannot be the one a biological system takes to conduct selection. The reason Borel's typing monkey takes an extraordinary amount of time to identify the target state is that it supposes each state is equally likely; that is, each state is equally likely to be the target state. In the real world, where there is context, this is not true. For example, in the context of the English language, the phrase THE PLAYING MONKEY is more likely to be the target state than EUOMJA JQI BAUTER. Removing the assumption that all states are equally likely is the goal of the algorithmic typing monkey.
Moreover, we saw that cumulative selection is short-sighted and so cannot provide a complete explanation for observed phenomena such as neural network generalisation. The algorithmic typing monkey is formulated to explain something that cumulative selection cannot: namely, why fit states, that is, states near the global optimum, should emerge.
Consider the states ABABABABABABABABAB and AQAYUPPFAJZKAHLNJO. Despite both being the same length, it is reasonable to say that the first state is simpler than the second, but how can we quantify this? One way is to consider the length of the program required to generate the state. For example, the first state is generated by the following program:
Type A
Type B
Repeat steps 1 and 2 a further eight times.
For the second state, however, there is no obvious pattern that can be exploited to generate the state, so essentially the only program that could generate it is the following.
Type A
Type Q
Type A
…
Type O.
The state ABABABABABABABABAB requires a program of three steps, whereas the second state requires a program of eighteen steps. It is in this sense that we can say quantitatively that the state ABABABABABABABABAB is simpler than the state AQAYUPPFAJZKAHLNJO. This notion of a program generating states is formalised by Kolmogorov complexity, which defines the complexity of a state as the length of the shortest program that can generate it.
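Kolmogorov complexity itself cannot be computed exactly (a point we return to below), but the length of a compressed encoding gives a crude, computable upper bound that captures the same intuition. A small sketch, using longer strings of the same form because a general-purpose compressor's fixed overhead would swamp the difference on 18-character inputs:

```python
import random
import string
import zlib

# Patterned versus patternless strings of the same length.
patterned = "AB" * 500                                                   # ABAB... (1000 characters)
patternless = "".join(random.choice(string.ascii_uppercase) for _ in range(1000))

# Compressed length acts as a rough stand-in for "length of a generating program".
print(len(zlib.compress(patterned.encode())))    # a couple of dozen bytes
print(len(zlib.compress(patternless.encode())))  # several hundred bytes
```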
We can use the notion of Kolmogorov complexity to restrict the state space to simpler states. Why do we want to restrict the state space to simpler states? Simpler states, by definition, encode some symmetry or pattern of the environment. Therefore, their properties are robust and are likely to be more desirable than the properties of complex states, which are perhaps more volatile. For example, in the biological context, consider sequences of amino acids that fold to give proteins with specific properties. Sequences of amino acids with no high-level structure will probably fold to give proteins with unstable properties, which is not beneficial for an organism. Moreover, simpler states can be stored more effectively in the DNA of an organism, and so simpler states are more energy-efficient.
It is not feasible to compute Kolmogorov complexity directly: it can be shown that no algorithm exists which, given an arbitrary state, can output its Kolmogorov complexity. Instead, we can use random variation to implicitly filter out complex states. The algorithmic typing monkey does precisely this. Instead of typing the characters of a state directly, the algorithmic typing monkey types the program that generates the state. Consequently, as the program for the state ABABABABABABABABAB is shorter, it is more likely to appear as the output of a program written by the monkey than the state AQAYUPPFAJZKAHLNJO, which requires a longer program.
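The following sketch simulates this in a toy setting of my own construction (it is not a formal model): the monkey types random programs in a tiny two-instruction language, where each token either types a character or repeats everything typed so far, and typing stops with a fixed probability after each token. Tallying the outputs shows that strings generated by short programs, such as ABAB, turn up noticeably more often than patternless strings of the same length, such as ABBA.

```python
import random
from collections import Counter

ALPHABET = "AB"
STOP_PROB = 1 / 3  # after each token the monkey stops typing with this probability

def random_program():
    """The monkey types a random program: each token is either a character
    to type, or '*', meaning 'repeat everything typed so far'."""
    tokens = []
    while True:
        tokens.append(random.choice(ALPHABET + "*"))
        if random.random() < STOP_PROB:
            return tokens

def run(tokens, max_len=64):
    """Execute the program to obtain the state it generates."""
    output = ""
    for t in tokens:
        output = output + output if t == "*" else output + t
        if len(output) > max_len:
            break
    return output

# Tally the states produced by many random programs, keeping only length-4 outputs.
counts = Counter()
for _ in range(200_000):
    out = run(random_program())
    if len(out) == 4:
        counts[out] += 1

# States with short programs (AAAA, BBBB, ABAB, BABA can all be written with
# three tokens, e.g. A, B, *) appear noticeably more often than patternless
# states like ABBA, which need one token per character.
print(counts.most_common(6))
```

Because the monkey stops with a fixed probability after each token, longer programs are exponentially less likely to be typed than shorter ones, which is what skews the outputs towards simple states.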
In the biological context, random variation is provided by the random errors induced when copying the sequences that serve as programs for generating proteins. For neural networks, random variation is introduced by the stochastic gradient descent learning algorithm, which learns parameters that in turn program functions. We have already explored why simpler states are desirable in the biological context. In the context of neural networks, simpler states correspond to sets of parameters that define a relatively simple function, that is, a function that does not merely interpolate the data points in the training data but rather learns high-level representations of them. This is desirable, as it means that the corresponding network is robust and generalises well to unseen data.
Due to random variation, states with simpler programs are encountered most frequently, and so the search is concentrated around the simple states. In turn, the selection problem becomes much more efficient, as random variation filters out the complex states that are unlikely to be the desired target state.