LLMs and Their Chains-of-Thought
Are LLMs memorising our reasoning patterns, or do they have the capacity to generate their own?
Reasoning
When we humans solve relatively challenging tasks, we like to think that we are reasoning through the steps needed to complete them. It is difficult to gauge how we deduce those steps: are we replaying the steps we observed someone else take when they tackled a similar task, or are we generating these steps using our intuition?
Nowadays, we are all intimately connected through social media and transport, and so we are heavily influenced by a host of different individuals. It is therefore plausible that the majority of our perceived reasoning is copied from others. However, there is still so much variety and individuality in the world that the challenges we encounter likely differ in nuanced ways from those of others. Therefore, it is probably still the case that we are also generating novel patterns of reasoning. Indeed, humans did not evolve reasoning capabilities by observing an alien species; ultimately, we started without this ability and slowly learned it, and thus at least some non-negligible amount of our mental processing ought to be novel generation.
Large language models, on the other hand, are trained on data that contains countless instances of reasoning, and in deployment we observe them performing similar patterns of reasoning. However, the question remains as to whether they are merely contextualising previously observed patterns of reasoning, or whether they are forming their own novel patterns of reasoning.
We can provide some thoughts on this question by considering the storage capacity of these large language models and the memory necessary to memorise algorithmic patterns.
Estimation
Model Capacity
Current large language models have many parameters, with upper-bound estimates for the OpenAI models at around 1 trillion parameters. However, the true figure is probably lower, and it has been shown on open-source models that capabilities are largely unaffected by compression techniques that remove up to 90% of a model's parameters. Therefore, for our estimates, we will use an upper bound on the effective parameter count of 200 billion. Supposing that these parameters are stored in float32, it follows that the effective storage capacity of these models is around 800 billion bytes.
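For readers who like to check the arithmetic, here is a minimal sketch of this capacity estimate in Python, using the assumed figures above (200 billion effective parameters, 4 bytes per float32 parameter):

```python
# Rough upper bound on the effective storage capacity of a large language model.
effective_parameters = 200e9   # assumed upper bound on effective parameter count
bytes_per_parameter = 4        # float32 occupies 4 bytes per parameter

capacity_bytes = effective_parameters * bytes_per_parameter
print(f"Effective storage capacity: {capacity_bytes:.1e} bytes")  # -> 8.0e+11 bytes
```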
Algorithmic Storage Requirements
Now we concern ourselves with estimating the amount of memory required to memorise a reasoning trajectory, where a reasoning trajectory refers to a framework that can be contextualised to solve an encountered task. For example, consider the challenge of opening a door. We all know the rough steps, or reasoning trajectory, required to complete this action; however, when we encounter a door in practice, we have to contextualise these steps so that they are applicable. We will suppose that this reasoning trajectory is formulated at a conscious level of granularity, as otherwise one could continually reduce the sequence of steps into finer and finer detail.
1. Approach the door to within arm's length.
2. Extend your arm and grab the handle with your hand.
3. Unlock the door handle if necessary.
4. Manipulate the handle to release the door catch.
5. Contract your arm so as to pivot the door around its hinges.
6. Proceed through the open doorway.
7. Manoeuvre your arm and body so as to return the door to the closed position.
8. Release your hand from the handle in a way that re-engages the catch.
9. Lock the door handle if required.
10. Proceed away from the door.
OpenAI's new o1 model explicitly undergoes chain-of-thought reasoning to construct a similar set of steps that it can implement to complete a task prompted by the user, which in this setting acts as the context. However, understandably, we do not have access to the raw set of steps the model uses, as these could be quickly reverse-engineered by a competitor or adversary.
Opening a door is a rather simple reasoning strategy. It is probably reasonable to assume, then, that most of our general reasoning strategies can be summarised with around 10 to 100 lines of pseudo-code. Humans can certainly execute longer reasoning strategies; however, these are probably compositions of several general reasoning strategies, and we have probably memorised relatively few of them, so we do not consider them in this estimation.
Now, supposing that each line contains around 50 extended ASCII characters, each occupying one byte, it follows that, on average, a human reasoning trajectory would require around 500 to 5000 bytes of memory to store. Of course, here we are assuming that these trajectories are stored in something akin to natural language, at least in terms of memory requirements.
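To make this concrete, here is the same estimate as a short Python sketch, assuming one byte per extended ASCII character, 50 characters per line, and 10 to 100 lines per trajectory:

```python
# Estimated memory needed to store one reasoning trajectory as natural language.
bytes_per_char = 1                 # extended ASCII: one byte per character
chars_per_line = 50                # assumed average line length
lines_per_trajectory = (10, 100)   # assumed range of pseudo-code lines

trajectory_bytes = tuple(n * chars_per_line * bytes_per_char for n in lines_per_trajectory)
print(f"Memory per trajectory: {trajectory_bytes[0]} to {trajectory_bytes[1]} bytes")  # 500 to 5000
```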
Consequences
Assuming that these large language models use all of their storage to memorise these reasoning trajectories, it follows that they can memorise around 160 million to 1.6 billion reasoning trajectories. As mentioned before, these reasoning trajectories need to be contextualised to be useful. Hence, it cannot be the case that all of a large language model's storage is used for memorising these trajectories; it must use some of it, probably a significant amount, to store knowledge.
Suppose that humans also store these reasoning trajectories as lines of natural language. Then, given that the human brain has a capacity of around 2.5 million billion bytes, 1.6 billion reasoning trajectories, at the minimum size of 500 bytes each, would occupy only around 0.032% of a human's memory. Combined with the fact that no two humans share an identical set of reasoning trajectories, it is likely that, to interact effectively with users, large language models would have to memorise significantly more than 1.6 billion reasoning trajectories. This estimate even leaves room for the fact that the model does not have to memorise every trajectory, as some are inherently physical.
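Putting the two estimates together, here is a short Python sketch of the consequences described above, again using the assumed figures (800 billion bytes of model capacity, 500 to 5000 bytes per trajectory, and roughly 2.5 million billion bytes of human memory):

```python
# How many trajectories the model could memorise, and how much human memory they would occupy.
model_capacity_bytes = 800e9    # effective model storage estimated earlier
trajectory_bytes = (500, 5000)  # estimated size range of one trajectory
human_capacity_bytes = 2.5e15   # commonly cited estimate of human memory capacity

max_trajectories = model_capacity_bytes / trajectory_bytes[0]  # ~1.6 billion
min_trajectories = model_capacity_bytes / trajectory_bytes[1]  # ~160 million
human_fraction = max_trajectories * trajectory_bytes[0] / human_capacity_bytes

print(f"Memorisable trajectories: {min_trajectories:.1e} to {max_trajectories:.1e}")
print(f"Fraction of human memory: {human_fraction:.3%}")  # ~0.032%
```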
It is perhaps more likely, then, that these models have a mechanism for generating reasoning patterns in a way that covers the vast space of human trajectories more efficiently than simply memorising those observed in the training data. Of course, these methods of generating reasoning trajectories are flawed, leading to some poor behaviours. It is the task of those developing these models to ensure that the methods the models learn for generating these reasoning trajectories are as robust as those we use to generate our own.