Exploring Collaboration Mechanisms for LLM Agents:
A Social Psychology View

Jintian Zhang*, Xin Xu*, Ningyu Zhang, Ruibo Liu, Bryan Hooi, Shumin Deng

Zhejiang University · National University of Singapore, NUS-NCS Joint Lab · Google DeepMind

*Equal Contribution · Corresponding Author
"What magical trick makes us intelligent? The trick is that there is no trick. The power of intelligence stems from our vast diversity, not from any single, perfect principle."
—— Marvin Minsky, The Society of Mind, p. 308

Figure 1: An example of the chess move validity task. Given the previous moves of a chess game, agents are required to predict a valid next move for a specified piece.

Abstract

As Natural Language Processing (NLP) systems are increasingly employed in intricate social environments, a pressing query emerges: Can these NLP systems mirror human-esque collaborative intelligence, in a multi-agent society consisting of multiple large language models (LLMs)? This paper probes the collaboration mechanisms among contemporary NLP systems by melding practical experiments with theoretical insights. We fabricate four unique 'societies' comprised of LLM agents, where each agent is characterized by a specific 'trait' (easy-going or overconfident) and engages in collaboration with a distinct 'thinking pattern' (debate or reflection). Through evaluating these multi-agent societies on three benchmark datasets, we discern that certain collaborative strategies not only outshine previous top-tier approaches, but also optimize efficiency (using fewer API tokens). Moreover, our results further illustrate that LLM agents manifest human-like social behaviors, such as conformity and consensus reaching, mirroring foundational social psychology theories. In conclusion, we integrate insights from social psychology to contextualize the collaboration of LLM agents, inspiring further investigations into the collaboration mechanism for LLMs.



Simulation Setup


Figure 2: The overview of machine society simulation. Multiple agents with different traits make up diverse machine societies. These agents engage in debate or self-reflection across multiple rounds to complete tasks.






This figure presents the definition of an individual agent.

First, we define the traits of an agent, designing two fundamental and contrasting personalities: easy-going and overconfident. The advantage of an overconfident personality is the ability to concentrate resources on significant tasks without spending time on communication, a trait often seen in startups where a single shareholder holds a majority stake. The advantage of an easy-going personality, on the other hand, is the ability to recognize and correct one's own mistakes. To make this concept more vivid, we depict agents as jigsaw puzzles: a puzzle with no missing pieces symbolizes overconfidence, since such an agent is impervious to external influence, whereas a puzzle with missing pieces represents the easy-going trait, signaling openness to others' opinions and fostering better collaboration.

We then define the thinking patterns of an agent during problem-solving, conceptualizing two primary methods: debate and reflection. In simple terms, the debate pattern involves acquiring opinions from others, akin to a debate competition; the reflection pattern relies solely on oneself, much as individuals are isolated from others' answers during an exam. In our paper, 'p0' denotes debate and 'p1' denotes reflection. As a mnemonic, the digit '0' resembles an open mouth, symbolizing engagement in debate, while the digit '1' resembles a closed mouth, indicative of self-reflection.
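To make these definitions concrete, below is a minimal Python sketch of how a trait and a per-round thinking pattern might be combined into an agent's instructions. The template strings and the helper name build_prompt are illustrative assumptions, not the paper's actual prompts.

      # Illustrative sketch (not the paper's actual prompts): compose an
      # agent's instructions from a trait (easy-going / overconfident) and
      # a per-round thinking pattern (p0 = debate, p1 = reflection).

      TRAIT_PROMPTS = {
          "easy-going": ("You are open-minded and willing to revise your "
                         "answer when others give convincing arguments."),
          "overconfident": ("You are confident in your own reasoning and "
                            "rarely change your answer because of others."),
      }

      PATTERN_PROMPTS = {
          "p0": ("Debate: here are the other agents' latest answers: "
                 "{peer_answers}. Consider them, then give your answer."),
          "p1": ("Reflection: re-examine your previous answer: {own_answer}. "
                 "Check it step by step, then give your final answer."),
      }

      def build_prompt(trait: str, pattern: str, question: str, **ctx) -> str:
          """Assemble one agent's per-round prompt (hypothetical format)."""
          return "\n".join([
              TRAIT_PROMPTS[trait],
              f"Question: {question}",
              PATTERN_PROMPTS[pattern].format(**ctx),
          ])

      print(build_prompt("easy-going", "p0", "What is 17 * 24?",
                         peer_answers="[408, 398]"))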


Third Slide


This figure illustrates the definition of a society. A society is composed of multiple agents. In the main experiment, we set the number of agents to three: partly to facilitate decision-making (the minority yields to the majority) and partly to keep the combinatorial diversity of societies manageable. We likewise set the number of collaboration rounds to three.

Because agents differ in traits, various types of societies exist. With three agents, each either easy-going or overconfident, there are four distinct society compositions (unordered combinations with repetition): all easy-going, two easy-going and one overconfident, one easy-going and two overconfident, and all overconfident.

Given the diversity in agents' thinking patterns, each round of collaboration presents a unique set of combinations. We term the specific array of thinking patterns chosen by the agents in a round the 'collaborative strategy'. Our classification hinges on whether all agents in a round adopt identical thinking patterns: the main experiment focuses on collaborative strategies where every agent in a round uses the same pattern, while mixed variations are examined in our ablation studies. With one shared pattern (p0 or p1) per round over three rounds, the main experiment covers 2^3 = 8 distinct collaborative strategies (p0p0p0, p0p0p1, ..., p1p1p1).
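As a sanity check on these counts, here is a short, illustrative Python snippet that enumerates the four society compositions and the eight uniform collaborative strategies; it mirrors the combinatorics described above rather than any code from the paper.

      from itertools import combinations_with_replacement, product

      # 3 agents, each easy-going or overconfident; order within a society
      # does not matter, so compositions are combinations with repetition.
      societies = list(combinations_with_replacement(
          ["easy-going", "overconfident"], 3))
      print(len(societies), societies)    # 4 compositions

      # 3 rounds; in the main experiment all agents share one thinking
      # pattern per round (p0 = debate, p1 = reflection): 2^3 strategies.
      strategies = ["".join(p) for p in product(["p0", "p1"], repeat=3)]
      print(len(strategies), strategies)  # 8: p0p0p0, p0p0p1, ..., p1p1p1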







Datasets

We conduct a rigorous evaluation of the reasoning and decision-making capabilities of various machine societies across three distinct tasks, utilizing diverse collaborative strategies:

  • High School Multiple-Choice. Leveraging the MMLU dataset, where problems span high school subjects such as statistics, mathematics, computer science, biology, chemistry, and physics, agents are required to identify the correct answer among four multiple-choice options. Our evaluation set consists of 50 randomly selected questions from this dataset.
  • Math. Drawing from the MATH dataset, a repository of math problems sourced from competitive events and expressed in LaTeX, we assess model proficiency in advanced mathematical and scientific reasoning. The dataset divides problems into five graded difficulty levels, and for our evaluation we randomly chose 50 cases from Levels 3 to 5.
  • Chess Move Validity. Utilizing the dataset from the chess state tracking task within the comprehensive BIG-Bench benchmark, a sequence of chess moves in UCI notation is provided, and agents are required to predict a legitimate next move for a specified chess piece; a minimal check of what counts as 'legitimate' is sketched below.
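To illustrate what counts as a valid move, the sketch below uses the open-source python-chess library to replay a UCI move sequence and test a candidate next move. This illustrates the task format only; it is not the paper's evaluation code, and the function name is_valid_next_move is our own.

      import chess  # pip install python-chess

      def is_valid_next_move(history_uci, candidate_uci):
          """Replay a game from UCI moves, then check whether the
          candidate move is legal in the resulting position."""
          board = chess.Board()
          for mv in history_uci:
              board.push_uci(mv)  # raises ValueError on an illegal move
          return chess.Move.from_uci(candidate_uci) in board.legal_moves

      # After 1. e4 e5 2. Nf3, the knight move g8f6 is legal for Black,
      # but e5e4 is not (the e4 square is occupied by a white pawn).
      print(is_valid_next_move(["e2e4", "e7e5", "g1f3"], "g8f6"))  # True
      print(is_valid_next_move(["e2e4", "e7e5", "g1f3"], "e5e4"))  # False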


Evaluation Metric

To enhance result reliability, we report average accuracy (denoted 'Acc') and its standard deviation across five trials. Since our experiments exhibit substantial standard deviations, we additionally introduce a WIN-TIE metric (denoted 'W-T'), counting the trials (out of five) in which a strategy's accuracy matches or surpasses the continuous-debate baseline. We also measure the average token cost consumed (denoted 'Cost').

Note that both Cost and W-T come in two variants, depending on the axis of aggregation. The strategy-level 'Cost' is the average number of tokens consumed across all societies under a given strategy, whereas the society-level 'Cost' is the average number of tokens a single society expends across all strategies. Likewise, the strategy-level 'W-T' counts the non-loss trials of the current strategy against strategy p0p0p0 across all four societies (value range 0 to 20, i.e., 4 societies x 5 trials), while the society-level 'W-T' counts the non-loss trials of one society across the seven non-baseline strategies relative to p0p0p0 (value range 0 to 35, i.e., 7 strategies x 5 trials).
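To make the aggregation explicit, here is a minimal numpy sketch of both W-T variants; the array layout acc[strategy, society, trial] and the random placeholder data are assumptions for illustration only.

      import numpy as np

      # acc[strategy, society, trial]: accuracies of 8 strategies in 4
      # societies over 5 trials (random placeholder data).
      rng = np.random.default_rng(0)
      acc = rng.uniform(0.4, 0.9, size=(8, 4, 5))
      BASELINE = 0  # index of the continuous-debate strategy p0p0p0

      non_loss = acc >= acc[BASELINE]  # trial-wise comparison with p0p0p0

      # Strategy-level W-T: non-losses of one strategy over all societies
      # and trials (maximum 4 * 5 = 20).
      wt_per_strategy = non_loss.sum(axis=(1, 2))

      # Society-level W-T: non-losses of one society over the seven
      # non-baseline strategies and all trials (maximum 7 * 5 = 35).
      wt_per_society = non_loss[1:].sum(axis=(0, 2))

      print(wt_per_strategy)  # shape (8,); the baseline entry is trivially 20
      print(wt_per_society)   # shape (4,), each value in [0, 35]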





The above details the settings and motivations of the social simulation, together with the datasets and evaluation metrics.




Main Results

Figures: main results of the collaborative strategies across the four machine societies on the three datasets; see the takeaways below for a summary.

Conformity


Figure 6: Variation of answer correctness in the situation of conformity, under 3-round collaboration, on ChatGPT, where conformity brings about benefits: Ratio(False→True + True→True) > Ratio(True→False + False→False); conformity brings about detriments: Ratio(False→True + True→True) < Ratio(True→False + False→False).



Figure 28: Variation of answer correctness in the situation of conformity, using LLaMA2-13B-chat, where conformity brings about benefits: Ratio(False→True + True→True) > Ratio(True→False + False→False); conformity brings about detriments: Ratio(False→True + True→True) < Ratio(True→False + False→False).

Figure 37: Variation of answer correctness in the situation of conformity, using LLaMA2-70B-chat, where conformity brings about benefits: Ratio(False→True + True→True) > Ratio(True→False + False→False); conformity brings about detriments: Ratio(False→True + True→True) < Ratio(True→False + False→False).


Figure 51: Variation of answer correctness in the situation of conformity, using Qwen 72B, where conformity brings about benefits: Ratio(False→True + True→True) > Ratio(True→False + False→False); conformity brings about detriments: Ratio(False→True + True→True) < Ratio(True→False + False→False).


Figure 65: Variation of answer correctness in the situation of conformity, using Mixtral-8x7B, where conformity brings about benefits: Ratio(False→True + True→True) > Ratio(True→False + False→False); conformity brings about detriments: Ratio(False→True + True→True) < Ratio(True→False + False→False).

For conformity, we focus solely on agents actively engaged in debate in a given round, disregarding those in reflection. Let \(a_{i,j}\) denote the answer of the i-th agent at the j-th round. For the k-th agent at the j-th round, if \(\mathrm{Frequency}(\{a_{i,j-1} \mid i \in [1, n]\}) = a_{k,j}\), we identify this as an occurrence of conformity by agent k at the j-th round, where \(\mathrm{Frequency}(\cdot)\) returns the most frequently given answer (instances where every answer occurs only once are treated as nonconformity). Additionally, we categorize the correctness of answers both before and after conformity into four cases, with 'True' denoting correct and 'False' denoting incorrect.
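The sketch below implements this definition in Python with assumed data structures (lists of per-agent answers plus a gold answer): it finds the most frequent answer of the previous round and classifies each conforming agent by the correctness of its answers before and after.

      from collections import Counter

      def majority(answers):
          """Most frequent answer; None if every answer occurs only once
          (such cases are treated as nonconformity)."""
          ans, freq = Counter(answers).most_common(1)[0]
          return ans if freq > 1 else None

      def conformity_events(prev_round, curr_round, gold):
          """For each agent k whose a_{k,j} equals the majority answer of
          round j-1, record the (before, after) correctness pair."""
          maj = majority(prev_round)
          events = []
          for prev, curr in zip(prev_round, curr_round):
              if maj is not None and curr == maj:
                  events.append((prev == gold, curr == gold))
          return events

      # Toy example: 3 agents, gold answer "B". The round-1 majority is "A",
      # so agent 2 flips from True to False by conforming.
      print(conformity_events(["A", "B", "A"], ["A", "A", "A"], gold="B"))
      # [(False, False), (True, False), (False, False)]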

We classify the phenomenon of conformity into four distinct categories based on how answers change. The rationale is that conformity within human societies acts as a double-edged sword, and its benefits or drawbacks are best assessed by their outcomes. Two (admittedly imperfect) illustrations: at a red traffic light, one individual jaywalks and others follow suit; this conformity is detrimental. Conversely, a student taking an exam surrounded by high-achieving peers glances at their answers, finds they match, and keeps the shared answer (or, if they differ, switches to it); the official answer turns out to align with it, making this form of conformity advantageous. (This is merely an example for illustration purposes; cheating is unethical, and we certainly do not condone it.)


Consensus


Figure 7: Average quantity of consensus clusters (i.e., unique answers among multiple agents) under different rounds of collaboration with 3-round collaborative strategies, using ChatGPT. The smaller the quantity of consensus clusters, the easier it is to reach a consensus. Round 0 is equivalent to self-consistency. More details are in Appendix G.1.



Figure 29: Average quantity of consensus clusters (i.e., unique answers among multiple agents) under different rounds of collaboration with 3-round collaborative strategies, on LLaMA2-13B-chat. The smaller the quantity of consensus clusters, the easier it is to reach a consensus. Round 0 is equivalent to self-consistency.

Figure 38: Average quantity of consensus clusters (i.e., unique answers among multiple agents) under different rounds of collaboration with 3-round collaborative strategies, on LLaMA2-70B-chat. The smaller the quantity of consensus clusters, the easier it is to reach a consensus. Round 0 is equivalent to self-consistency.

Figure 52: Average quantity of consensus clusters (i.e., unique answers among multiple agents) under different rounds of collaboration with 3-round collaborative strategies, using Qwen 72B. The smaller the quantity of consensus clusters, the easier it is to reach a consensus. Round 0 is equivalent to self-consistency.

Figure 66: Average quantity of consensus clusters (i.e., unique answers among multiple agents) under different rounds of collaboration with 3-round collaborative strategies, using Mixtral-8x7B. The smaller the quantity of consensus clusters, the easier it is to reach a consensus. Round 0 is equivalent to self-consistency.

For consensus, we examine how the number of distinct answers (i.e., consensus clusters) evolves as the rounds of collaboration increase. Let \(a_{i,j}\) denote the answer of the i-th agent at the j-th round. For the j-th round, the number of consensus clusters is defined as \(\left|\mathrm{Set}(\{a_{i,j} \mid i \in [1, n]\})\right|\), where \(\left|\mathrm{Set}(\cdot)\right|\) counts the distinct answers. Here we gather and analyze the overall performance of the various societies.
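A minimal sketch of this measurement, assuming answers are stored as one list of per-agent answers per round:

      def consensus_clusters(answers_by_round):
          """Number of distinct answers (consensus clusters) per round.
          answers_by_round[j][i] = answer of agent i at round j."""
          return [len(set(round_answers)) for round_answers in answers_by_round]

      # Toy 3-agent trajectory; round 0 is the pre-collaboration answer
      # (equivalent to self-consistency), and the agents gradually agree.
      rounds = [["A", "B", "C"],
                ["A", "B", "B"],
                ["B", "B", "B"]]
      print(consensus_clusters(rounds))  # [3, 2, 1] -> consensus reached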


Takeaways

On collaborative strategies:

  • 1. Starting or dominating multi-agent collaboration with debate yields relatively optimal outcomes.
  • 2. A purely reflective strategy such as p1p1p1 generally performs worst.
  • 3. For difficult tasks, debate combined with continuous reflection is superior; for simple tasks, self-consistency or reflection suffices.

On machine society settings:

  • 1. Surprisingly, "overconfident" agents lose that trait in groups!
  • 2. Setting the number of agents to 3 is generally advantageous in both performance and cost.
  • 3. Setting the number of collaboration rounds to 3 is likewise suitable, being both effective and efficient.
  • 4. Employing uniform thinking patterns across all agents within a round enhances efficacy.
  • 5. Scaling up the number of agents is better than scaling up the number of collaboration rounds.



On social behaviors in collaboration:

  • 1. Collaboration in a group is generally effective, especially for tackling difficult tasks.
  • 2. Collaboration widely leads to conformity, which can be either beneficial or harmful to performance.
  • 3. As the number of rounds increases, the benefits of conformity decrease while its detriments increase.
  • 4. A totally easy-going society is more likely to reach a consensus; debate helps consensus reaching while reflection impedes it.

BibTeX


      @article{Multi-Agent_Collaboration_SocialPsychology,
        author       = {Jintian Zhang and
                        Xin Xu and
                        Ningyu Zhang and
                        Ruibo Liu and
                        Bryan Hooi and
                        Shumin Deng},
        title        = {Exploring Collaboration Mechanisms for {LLM} Agents: {A} Social Psychology View},
        journal      = {CoRR},
        volume       = {abs/2310.02124},
        year         = {2023},
        url          = {https://doi.org/10.48550/arXiv.2310.02124},
        doi          = {10.48550/ARXIV.2310.02124}
      }      

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.