Sections on games and the prisoner’s dilemma are inspired by [1].
Games
Collaboration and interactions are an inherent part of society. Our actions and the actions of others shape our everyday lives. Whether we are aware of it or not, the actions of others have significant effects on our actions. Indeed, the actions of others can make us act rather irrationally.
We can investigate the dynamics of interactions and determine the optimal means by which to interact by formalising interactions as games. An optimal strategy for a game will be one that either benefits an individual over an adversary or mutually benefits all participants equally. As with all attempts at modelling real-world systems, there are limitations. More specifically, a provably optimal strategy for a game may not remain optimal when the game is re-contextualised in the real world. This is because human psychology is demonstrably selfish, and thus globally equal rewards may not be viewed as such from the perspective of an individual. For example, the victim of an altercation often exaggerates the harm inflicted upon them by the perpetrator, whereas the perpetrator is likely to downplay the harm they caused and offer justifications for their actions. Indeed, in one experiment individuals were asked to shock each other with the same strength as the shock they had just received. Unsurprisingly, the participants shocked each other with increasingly greater power, eventually far exceeding the strength of the initial shock.
Zero-Sum Games
A zero-sum game is one where the gain of one group of participants comes at the expense of the others. More specifically, the gain of the group is equal to the losses of the others. In this sense, the amount of reward in the game stays constant. For example, tic-tac-toe is a zero-sum game: a player wins tic-tac-toe only if the other player loses. Similarly, chess is a zero-sum game. However, the stock market is not a zero-sum game. In this game, a trader is trying to increase their capital by investing in prosperous stocks. As stocks can increase without other stocks decreasing by the same amount, the game is not zero-sum.
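The definition above can be checked mechanically: a game is zero-sum if the players' rewards cancel out in every outcome. The following is a minimal sketch of that check. The scoring of tic-tac-toe (+1, -1, 0) and the trader returns are hypothetical numbers chosen purely for illustration.

```python
def is_zero_sum(payoffs):
    """Return True if the players' payoffs sum to zero in every outcome."""
    return all(sum(outcome) == 0 for outcome in payoffs.values())


# Hypothetical scoring for tic-tac-toe: +1 for a win, -1 for a loss, 0 for a draw.
tic_tac_toe = {
    "first player wins": (1, -1),
    "second player wins": (-1, 1),
    "draw": (0, 0),
}

# Hypothetical returns for two traders holding different stocks.
# Both can gain at once, so the payoffs need not cancel out.
stock_market = {
    "both stocks rise": (5, 3),
    "one rises, one falls": (5, -2),
}

print(is_zero_sum(tic_tac_toe))   # True
print(is_zero_sum(stock_market))  # False
```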
Zero-sum games are relatively easy to investigate as often the factors influencing the reward of the game are internal, which means that they can be tracked. In tic-tac-toe and chess, the factors affecting the reward of each player are the available moves on the board, which can be tracked. However, in the stock market, many external factors can influence the price of a stock, which makes it inherently harder to analyse. Moreover, these external factors mean that the game is not zero-sum as rewards can filter into the game.
Positive-Sum Games
A positive-sum game is one where participants can potentially gain a reward without the other participants being negatively affected. Indeed, the stock market is an example of a positive-sum game. Positive-sum games are more common in the real world since many external factors can contribute to one’s perception of reward. Sometimes a zero-sum game contextualised into the real world becomes a positive-sum game. For example, suppose you are playing tic-tac-toe against a child who is still trying to grasp the strategies of the game. Despite being able to win against the child, you may be incentivised to let them win. The pleasure they'll experience by beating you will surpass the pleasure you'll experience if you win. Moreover, their display of pleasure at beating you may cause you to feel some pleasure. Hence, this game of tic-tac-toe is no longer a zero-sum game. Indeed, human psychology has introduced an external reward system that breaks the conservation of reward present in tic-tac-toe.
The Prisoner's Dilemma
The prisoner’s dilemma is a positive-sum game and is set up as follows. Two prisoners are held in custody separately and interrogated by the detaining officers. If one prisoner testifies against the other whilst the other remains loyal, then the defecting prisoner is set free whilst the other is imprisoned for 10 years. If both prisoners remain loyal to each other, then they are each sentenced to 6 months. If both prisoners testify against each other, then they each receive a sentence of 6 years.
The reason the prisoner’s dilemma is a positive-sum game is that if both remain loyal to each other then they both receive a reduced sentence; that is, they are both rewarded.
The subtlety of the prisoner’s dilemma is that, despite both prisoners being able to reap rewards if they remain loyal, it turns out that self-interested prisoners will defect. From the point of view of an individual prisoner, it is better to defect than to cooperate, whatever the other prisoner does. If they defect, then they will go free if the other remains loyal, and they will be sentenced to 6 years, rather than 10 years, if the other also defects.
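The reasoning above can be verified directly from the sentences in the setup. The sketch below encodes the sentence (in years) for each pair of choices and computes the best response to each possible choice of the other prisoner. The names SENTENCE, best_response and total_years are illustrative and do not come from any source.

```python
# Sentence in years for (my_move, other_move), taken from the setup above.
# "C" = remain loyal (cooperate), "D" = testify (defect).
SENTENCE = {
    ("C", "C"): 0.5,
    ("C", "D"): 10,
    ("D", "C"): 0,
    ("D", "D"): 6,
}


def best_response(other_move):
    """Return the move that minimises my own sentence, given the other's move."""
    return min("CD", key=lambda my_move: SENTENCE[(my_move, other_move)])


# Defecting is the best response whatever the other prisoner does.
print(best_response("C"))  # D
print(best_response("D"))  # D

# Combined years served by both prisoners for each pair of moves.
total_years = {
    (a, b): SENTENCE[(a, b)] + SENTENCE[(b, a)] for a in "CD" for b in "CD"
}
print(min(total_years, key=total_years.get))  # ('C', 'C')
```

Each prisoner's individually best move is to defect, yet the combined sentence is smallest (1 year in total) when both remain loyal and largest (12 years) when both defect.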
The Iterated Prisoner's Dilemma
The iterated prisoner's dilemma is an augmentation of the prisoner's dilemma where the prisoners interact repeatedly to accumulate rewards. This variation of the game more closely resembles interactions between individuals in the real world, and can thus be used as a framework to study human interaction.
Tit-for-Tat
Tit-for-tat is a general strategy for confronting situations in the style of the iterated prisoner's dilemma. The tit-for-tat strategy says that on the first move you should always cooperate with the other player, and after that you should cooperate only if they cooperated on the previous move; otherwise you should defect. The intuition behind this strategy is to set up a friendly correspondence with the other player. With your first move of cooperation, they are less incentivised to defect based on their individualistic perspective. However, the unconditional generosity of the tit-for-tat strategy runs out after the first move. After the first move, the other individual must earn your cooperation. Thus, the tit-for-tat strategy avoids one’s kindness being taken for granted.
Moreover, the pattern of the tit-for-tat strategy is easy to identify. The opposing player can quickly understand that their generosity will be reciprocated, and is therefore incentivised to be generous.
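To make the strategy concrete, here is a minimal sketch of an iterated game that scores each round with the sentences from the prisoner's dilemma above (fewer years is better). The function names, the always_defect opponent and the choice of 10 rounds are assumptions made for illustration.

```python
# Sentence in years for (my_move, other_move); "C" = cooperate, "D" = defect.
SENTENCE = {
    ("C", "C"): 0.5,
    ("C", "D"): 10,
    ("D", "C"): 0,
    ("D", "D"): 6,
}


def tit_for_tat(my_history, other_history):
    """Cooperate on the first move, then copy the other player's last move."""
    return "C" if not other_history else other_history[-1]


def always_defect(my_history, other_history):
    """A hypothetical opponent that never cooperates."""
    return "D"


def play(strategy_a, strategy_b, rounds=10):
    """Play repeated rounds and return the total years accumulated by each player."""
    hist_a, hist_b = [], []
    years_a = years_b = 0
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        years_a += SENTENCE[(move_a, move_b)]
        years_b += SENTENCE[(move_b, move_a)]
        hist_a.append(move_a)
        hist_b.append(move_b)
    return years_a, years_b


print(play(tit_for_tat, tit_for_tat))      # (5.0, 5.0)
print(play(tit_for_tat, always_defect))    # (64, 54)
print(play(always_defect, always_defect))  # (60, 60)
```

Two tit-for-tat players cooperate throughout. Against a player who always defects, tit-for-tat is exploited only once, on the first move, and retaliates from then on.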
Generous Tit-for-Tat
The tit-for-tat strategy has been shown empirically, in simulations, to be a relatively effective strategy for confronting these types of games. However, there is a flaw in the tit-for-tat strategy that arises when it is applied in the real world. More specifically, if two players are employing the tit-for-tat strategy, then once one player defects both players are caught in a cycle of retaliatory defections. This is not ideal, as in the real world there are many factors which may cause a player to defect without any spiteful intent.
The generous tit-for-tat strategy aims to rectify this issue. With the generous tit-for-tat strategy, the player sometimes gives the other player the benefit of the doubt and cooperates when the other has previously defected. This helps to break the cycle of defecting and provides an opportunity for the players to set up a collaborative relationship.
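The effect of forgiveness can be seen in a small simulation. The sketch below assumes that generosity is modelled as a fixed probability of cooperating after the other player has defected. That probability, the single accidental defection ("slip") by player A, and all names are assumptions made for illustration. A forgiveness of 0 reduces to plain tit-for-tat.

```python
import random


def generous_tit_for_tat(other_history, forgiveness, rng):
    """Cooperate after cooperation; after a defection, forgive with some probability."""
    if not other_history or other_history[-1] == "C":
        return "C"
    return "C" if rng.random() < forgiveness else "D"


def play_with_one_slip(forgiveness, rounds=20, slip_round=2, seed=0):
    """Two identical players; player A defects once by accident in slip_round."""
    rng = random.Random(seed)
    hist_a, hist_b = [], []
    for r in range(rounds):
        move_a = generous_tit_for_tat(hist_b, forgiveness, rng)
        move_b = generous_tit_for_tat(hist_a, forgiveness, rng)
        if r == slip_round:
            move_a = "D"  # accidental defection, not a strategic choice
        hist_a.append(move_a)
        hist_b.append(move_b)
    return "".join(hist_a), "".join(hist_b)


# Plain tit-for-tat (no forgiveness): the slip echoes back and forth indefinitely.
print(play_with_one_slip(0.0, rounds=8))  # ('CCDCDCDC', 'CCCDCDCD')

# Always forgiving: cooperation is restored on the very next round.
print(play_with_one_slip(1.0, rounds=8))  # ('CCDCCCCC', 'CCCCCCCC')
```

With an intermediate forgiveness value, the retaliation continues for a random number of rounds until one player happens to forgive, after which both players cooperate again.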
Contrite Tit-for-Tat
There is also a flaw in the generous tit-for-tat strategy if there are players who will always defect. If a player always defects, then a generous tit-for-tat player will frequently give that player the benefit of the doubt and receive nothing in return for their generosity. Unfortunately, in the real world, overly generous players can be exploited. However, the majority of the time your generosity will be acknowledged and rewarded, even if not immediately.
The contrite tit-for-tat strategy is more selective in where it allocates generosity. More specifically, a contrite tit-for-tat player will remember their own decisions, and will be cooperative with another player only if the mutual defection of a previous interaction was caused by an error or misunderstanding on their own part. However, if the mutual defection was caused by the other player, then a contrite tit-for-tat player will forever defect against that player.
Consequently, interacting players who are both employing a contrite tit-for-tat strategy can reconcile any previous differences and settle into a collaborative relationship.
Is the World Zero-Sum?
I think a lot of the contention in the world is linked to the puzzle as to whether the world as a system is zero-sum or positive-sum. At a physical level, there are conservation laws, such as the conservation of energy, momentum and so forth. Similarly, in the mathematics of optimisation, there are theorems colloquially known as no free lunch theorems, which indicate that, averaged over all possible problems, no optimisation algorithm outperforms any other, including random guessing [2]. From this, it is common to use the phrase no free lunch colloquially to indicate that any gain is counterbalanced by a drawback.
Hence, at a societal level, the prevailing view has leaned toward the world being zero-sum. However, these conservation laws and no free lunch theorems apply to the rational worlds of physics and mathematics. We have seen that when games are contextualised into the real world they can potentially lose their conservative properties.
Those who think of the world as zero-sum may hold the view that billionaires have only amassed their wealth by capitalising on the detriment of others, rather than accumulating their wealth in a positive-sum manner. Similarly, the debate around the dangers of artificial intelligence systems can be grounded in this conundrum. On the one hand, those cautious of powerful AI systems think that the rise of these systems will result in the demise of humans. On the other hand, those accelerating AI capabilities research are focused on the benefits of AI systems, which they believe will outweigh the risks. It is certainly the case that the world is not entirely zero-sum, as otherwise there would be no room for progress and we would still be stuck in the Stone Age. However, it is also the case that many gather rewards by exploiting others for their rewards.
In the case of the prisoner's dilemma, we saw that there was an opportunity for the prisoners to capitalise on the positive-sum nature of the game; however, their incentives motivated them otherwise. In the setting of the prisoner's dilemma, this individualistic strategy is not optimal. For games set in the real world, however, it may be the case that not always taking the positive-sum reward is a good thing, because in the real world positive rewards are inherently subjective. One person's positive reward may not be so positive for another. Moreover, human psychology can be contradictory and can optimise for short-term gratification, so seeking to always obtain the positive reward may not be all it seems to be.
For example, reinforcement learning is a process that trains a machine learning model to complete a certain task by rewarding behaviour that resembles the task and discouraging behaviour that deviates from the task. To quantify this, the practitioner constructs a reward function that takes in the action of the model and gives it a reward based on how good its action was. In simple settings, where there are few external factors, this may work fine, as a reward function may be constructed that precisely details the desired task. However, in a more complex environment, determining a useful reward function becomes difficult as there are multiple external factors to consider.
With the reward function set, a machine learning model relentlessly optimises its behaviour to seek positive rewards. However, as we noted, the reward function may not be able to fully capture complex tasks, so over-optimising the reward function may lead the machine learning model to exhibit behaviour that deviates from the intended behaviour. In some cases, these deviations may be significant and inhibit the model from being used for the desired tasks. For example, in an example reported by DeepMind, researchers tried to teach a virtual robot to pick up a ball by rewarding actions that progressively got it closer to picking up the ball. After some time the robot seemed to be learning how to pick up the ball. However, it turned out the robot was only making the correct gestures in the space in front of the ball. Indeed, from the perspective of the reward system, it seemed as though the robot was completing the intended task; in reality, however, it was not [3].
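The gap between the reward signal and the intended task can be illustrated with a toy example. The sketch below is not a reinforcement learning implementation and does not reproduce the experiment in [3]. The actions and their proxy reward values are hypothetical numbers, chosen so that the proxy slightly favours the wrong behaviour.

```python
# Hypothetical actions, each with (proxy_reward, task_completed).
# The proxy reward scores how much the behaviour looks like picking up the ball.
ACTIONS = {
    "grasp the ball": (0.9, True),
    "gesture in front of the ball": (1.0, False),
    "do nothing": (0.0, False),
}


def greedy_policy(actions):
    """Pick the action with the highest proxy reward, ignoring the true task."""
    return max(actions, key=lambda name: actions[name][0])


chosen = greedy_policy(ACTIONS)
proxy_reward, task_completed = ACTIONS[chosen]

print(chosen)          # gesture in front of the ball
print(task_completed)  # False
```

An optimiser that sees only the proxy reward has no reason to prefer the action that actually completes the task.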
The real world has zero-sum and positive-sum aspects. One must strive for rewards to foster the benefits that innovations bring. However, one must operate with a sense of caution so that potential consequences do not suffocate the benefits. Moreover, one must maintain an awareness to ensure that the rewards they are striving for are indeed positive, and not a proxy reward signal.
Bibliography
[1] Pinker, S. The Better Angels of Our Nature.
[2] Wikipedia contributors. (2024, January 15). No free lunch theorem. In Wikipedia, The Free Encyclopedia. Retrieved 05:44, February 17, 2024, from https://en.wikipedia.org/w/index.php?title=No_free_lunch_theorem&oldid=1195771929
[3] DeepMind. Specification gaming: the flip side of AI ingenuity. https://deepmind.google/discover/blog/specification-gaming-the-flip-side-of-ai-ingenuity/