🔥🔥🔥 Awesome-LLM-Ensemble
"Harnessing Multiple Large Language Models: A Survey on LLM Ensemble" (IJCAI Survey 2026)

Zhijun Chen, Xiaodong Lu, Jingzheng Li, Pengpeng Chen, Zhuoran Li, Kai Sun, Yuankai Luo, Qianren Mao, Ming Li, Likang Xiao, Dingqi Yang, Xiao Huang, Yikun Ban, Hailong Sun, Philip S. Yu

If you like our project, please give it a star ⭐ to show your support! Thank you :)

📣 Notices

🔥🔥🔥 This is a collection of papers on LLM Ensemble.

[2026-05] Accepted by IJCAI Survey 2026!

🔥🔥🔥 [2026-04] We have updated our arXiv paper with a new version! Stay tuned for our journal-style paper in the coming months.

[Always] [Add your papers in this repo] Thank you to the authors of all papers that have cited our survey.
We will add all related citing papers to this GitHub repo, in a timely manner, to help increase the visibility of your contributions.

[Always] [Maintain] We will update this list frequently!
If you find any missing or new papers, please don't hesitate to contact us or open a pull request.

🍀 Citation

If you find this survey useful, please consider citing our paper:

@article{chen2025harnessing,
  title={Harnessing Multiple Large Language Models: A Survey on LLM Ensemble},
  author={Chen, Zhijun and Lu, Xiaodong and Li, Jingzheng and Chen, Pengpeng and Li, Zhuoran and Sun, Kai and Luo, Yuankai and Mao, Qianren and Li, Ming and Xiao, Likang and Yang, Dingqi and Huang, Xiao and Ban, Yikun and Sun, Hailong and Yu, Philip S},
  journal={arXiv preprint arXiv:2502.18036},
  year={2025}
}

Contents

1. LLM Ensemble and Taxonomy

1.1 LLM Ensemble

Paper Abstract:

LLM Ensemble, which involves the comprehensive use of multiple large language models (LLMs), each aimed at handling user queries during the downstream inference, to benefit from their individual strengths, has gained substantial attention recently. The widespread availability of LLMs, coupled with their varying strengths and out-of-the-box usability, has profoundly advanced the field of LLM Ensemble. This paper presents the first systematic review of recent developments in LLM Ensemble. First, we introduce our taxonomy of LLM Ensemble and discuss several related research problems. Then, we provide a more in-depth classification of methods under the broad categories of "ensemble-before-inference, ensemble-during-inference, ensemble-after-inference", and review all relevant methods. Finally, we introduce related benchmarks and applications, summarize existing studies, and suggest several future research directions. A curated list of papers on LLM Ensemble is available at https://github.com/junchenzhi/Awesome-LLM-Ensemble.

1.2 Taxonomy

Figure 1: Illustration of LLM Ensemble Taxonomy. (Note that for the (b) ensemble-during-inference paradigm, there is also a process-level ensemble approach that is not represented in the figure, mainly because this approach is instantiated by a single method.)

Figure 2: Taxonomy of All LLM Ensemble Methods. (Please note that this figure may not be fully updated to include all the papers listed below.)

(a) Ensemble before inference.
Since the ensemble-before-inference methods require routing a query to the most suitable LLM before LLM inference, the core of such methods lies in predicting the utility of candidate models for a given query under certain preferences (e.g., performance or cost). Based on how they formulate the utility of candidate LLMs, we divide existing methods into two categories:
- (a1) Discrete utility methods, which discretize model utility into categorical labels;
- (a2) Continuous utility methods model LLM utility as real-valued variables, such as response length or performance scores. This formulation enables a fine-grained characterization of model behavior, capturing subtle performance differences obscured by categorical definitions.
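The two routing formulations above can be contrasted with a minimal sketch. It is an illustration only, not the method of any paper listed here: the `Candidate` fields, the `cost_weight` trade-off, and the `quality_threshold` used to discretize utility are all hypothetical, and the per-query quality predictions are assumed to come from some router that has already been trained.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Candidate:
    name: str
    predicted_quality: float  # hypothetical router prediction for this query, in [0, 1]
    cost: float               # hypothetical cost of answering this query


def route_continuous(candidates: Sequence[Candidate], cost_weight: float = 0.5) -> Candidate:
    """Continuous utility: score each model with a real value and take the best."""
    return max(candidates, key=lambda c: c.predicted_quality - cost_weight * c.cost)


def route_discrete(candidates: Sequence[Candidate], quality_threshold: float = 0.7) -> Candidate:
    """Discrete utility: label each model 'good enough' or not, then pick the
    cheapest model labeled 'good enough'. If none qualifies, fall back to the
    model with the highest predicted quality."""
    good_enough = [c for c in candidates if c.predicted_quality >= quality_threshold]
    if good_enough:
        return min(good_enough, key=lambda c: c.cost)
    return max(candidates, key=lambda c: c.predicted_quality)


if __name__ == "__main__":
    pool = [Candidate("small", 0.72, 0.1), Candidate("large", 0.90, 1.0)]
    print(route_continuous(pool).name)
    print(route_discrete(pool).name)
```

In this toy setup the continuous score can tell the two models apart by an exact margin, while the discrete labels only say whether each model clears the threshold.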
(b) Ensemble during inference.
As the most granular form of ensemble among the three broad categories, this type of approach encompasses:
- (b1) Token-level ensemble methods, which integrate the token-level outputs of multiple models at the finest granularity of decoding;
- (b2) Span-level ensemble methods, which conduct ensemble at the level of a sequence fragment (e.g., a span of four words);
- (b3) Process-level ensemble methods, which select the optimal reasoning process step-by-step within the reasoning chain for a given complex reasoning task. Note that for these ensemble-during-inference methods, the aggregated text segments will be concatenated with the previous text and fed again to models.
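As a rough illustration of the token-level case and of the "concatenate and feed again" loop noted above, the sketch below averages next-token distributions from several models and appends the winning token to the shared prefix. It assumes all models share one vocabulary, which is a simplification, and the toy models and function names are hypothetical stand-ins for real LLM calls.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Sequence

# A "model" here is any callable mapping a token prefix to a next-token distribution.
Model = Callable[[Sequence[str]], Dict[str, float]]


def ensemble_next_token(models: Sequence[Model], prefix: Sequence[str],
                        weights: Optional[Sequence[float]] = None) -> str:
    """Weighted average of the models' next-token probabilities, then argmax."""
    if not models:
        raise ValueError("at least one model is required")
    if weights is None:
        weights = [1.0 / len(models)] * len(models)
    combined: Dict[str, float] = defaultdict(float)
    for model, weight in zip(models, weights):
        for token, prob in model(prefix).items():
            combined[token] += weight * prob
    return max(combined, key=combined.get)


def ensemble_decode(models: Sequence[Model], prompt: Sequence[str],
                    max_tokens: int = 8, eos: str = "<eos>") -> List[str]:
    """Greedy decoding: each aggregated token is appended to the prefix and fed back."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        token = ensemble_next_token(models, tokens)
        if token == eos:
            break
        tokens.append(token)
    return tokens


# Toy models (hypothetical): they disagree on the first generated token only.
def toy_model_a(prefix: Sequence[str]) -> Dict[str, float]:
    return {"cat": 0.6, "dog": 0.4} if len(prefix) == 1 else {"<eos>": 1.0}


def toy_model_b(prefix: Sequence[str]) -> Dict[str, float]:
    return {"cat": 0.1, "dog": 0.9} if len(prefix) == 1 else {"<eos>": 1.0}


if __name__ == "__main__":
    print(ensemble_decode([toy_model_a, toy_model_b], ["the"]))
```

A span-level variant would follow the same loop but aggregate over candidate fragments instead of single tokens.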
(c) Ensemble after inference.
These methods can be classified into two categories:
- (c1) Non-cascade methods, which perform ensemble using multiple complete responses contributed by all LLM candidates;
- (c2) Cascade methods, which consider both performance and inference costs, progressively reasoning through a chain of LLM candidates largely sorted by model size to find the most suitable inference response.
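The difference between the two after-inference styles can be sketched as follows. This is a simplified illustration under stated assumptions: majority voting stands in for the many non-cascade aggregation schemes, and the cascade assumes each model returns a hypothetical `(answer, confidence)` pair and that a fixed `confidence_threshold` decides when to stop. Neither function reproduces a specific paper's method.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

# Hypothetical model interface for the cascade: query -> (answer, confidence in [0, 1]).
ScoredModel = Callable[[str], Tuple[str, float]]


def majority_vote(responses: Sequence[str]) -> str:
    """Non-cascade: every model has already answered; return the most common answer.
    Ties go to the answer that appeared first."""
    if not responses:
        raise ValueError("at least one response is required")
    return Counter(responses).most_common(1)[0][0]


def cascade(models: Sequence[ScoredModel], query: str,
            confidence_threshold: float = 0.8) -> Tuple[str, int]:
    """Cascade: query models from cheapest to most expensive and stop at the first
    answer whose confidence clears the threshold. Returns (answer, models_called);
    if no model is confident enough, the last model's answer is returned."""
    if not models:
        raise ValueError("at least one model is required")
    answer, calls = "", 0
    for model in models:
        answer, confidence = model(query)
        calls += 1
        if confidence >= confidence_threshold:
            break
    return answer, calls


if __name__ == "__main__":
    print(majority_vote(["Paris", "Lyon", "Paris"]))
```

The cascade saves cost by skipping larger models on queries the smaller ones already handle confidently, whereas the non-cascade path always pays for every model.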

2. Papers

2.1 Ensemble Before Inference

Figure 3: Summary analysis of the key attributes of ensemble-before-inference methods. (Please note that this figure may not be fully updated to include all the papers listed below.)

2.1.1 (a,1) Discrete utility methods

Date	Name	Title	Paper/Github
2025-10	`DiSRouter`	DISROUTER: Distributed Self-Routing for LLM Selections	-
2025-06	`TagRouter`	TAGROUTER: Learning Route to LLMs through Tags for Open-Domain Text Generation Tasks	-
2025-06	`Router-R1`	Router-R1: Teaching LLMs Multi-Round Routing and Aggregation via Reinforcement Learning
2025-06	`RadialRouter`	RadialRouter: Structured Representation for Efficient and Robust Large Language Models Routing	-
2025-05	`RTR`	Route to Reason: Adaptive Routing for LLM and Reasoning Strategy Selection
2024-12	`Bench-CoE`	Bench-CoE: a Framework for Collaboration of Experts from Benchmark
2024-10	`GraphRouter`	GraphRouter: A Graph-based Router for LLM Selections
2024-09	`Eagle`	Eagle: Efficient Training-Free Router for Multi-LLM Inference	-
2024-08	`SelectLLM`	SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models	-
2024-06	`RouteLLM`	RouteLLM: Learning to Route LLMs with Preference Data
2024-05	`LLM Routing Lessons`	Harnessing the Power of Multiple Minds: Lessons Learned from LLM Routing
2024-04	`Hybrid-LLM`	Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing	-
2024-03	`ETR`	An Expert is Worth One Token: Synergizing Multiple Expert LLMs as Generalist via Expert Token Routing
2024-01	`Routoo`	Routoo: Learning to Route to Large Language Models Effectively	-
2024	`RouterDC`	RouterDC: Query-Based Router by Dual Contrastive Learning for Assembling Large Language Models
2023-11	`ZOOTER`	Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models	-
2023-08	`FORC`	Fly-Swat or Cannon? Cost-Effective Language Model Choice via Meta-Modeling
2023	`Benchmark Routing`	LLM Routing with Benchmark Datasets	-

2.1.2 (a,2) Continuous utility methods

Date	Name	Title	Paper/Github
2025-10	`WebRouter`	WebRouter: Query-specific Router via Variational Information Bottleneck for Cost-sensitive Web Agent	-
2025-10	`LLMRank`	LLMRank: Understanding LLM Strengths for Model Routing	-
2025-05	`Avengers`	The Avengers: A Simple Recipe for Uniting Smaller Language Models to Challenge Proprietary Giants
2025-05	`InferenceDynamics`	InferenceDynamics: Efficient Routing Across LLMs through Structured Capability and Knowledge Profiling	-
2025-05	`kNN Router`	Rethinking Predictive Modeling for LLM Routing: When Simple kNN Beats Complex Learned Routers	-
2025	`RELM`	Co-optimizing Recommendation and Evaluation for LLM Selection	-
2025-02	`LLM Bandit`	LLM Bandit: Cost-Efficient LLM Generation via Preference-Conditioned Dynamic Routing	-
2024-12	`PickLLM`	PickLLM: Context-Aware RL-Assisted Large Language Model Routing	-
2024-08	`TO-Router`	TensorOpera Router: A Multi-Model Router for Efficient LLM Inference	-
2024-07	`MetaLLM`	MetaLLM: A High-performant and Cost-efficient Dynamic Framework for Wrapping LLMs
2024-06	`HomoRouter`	Query Routing for Homogeneous Tools: An Instantiation in the RAG Scenario	-
2024-01	`Blending`	Blending Is All You Need: Cheaper, Better Alternative to Trillion-Parameters LLM	-

2.2 Ensemble During Inference

Figure 4: Summary analysis of the key attributes of ensemble-during-inference methods. (Please note that this figure may not be fully updated to include all the papers listed below.)

2.2.1 (b,1) Token-Level Ensemble

Date	Name	Title	Paper/Github
2025-10	`SAFE`	When to Ensemble: Identifying Token-Level Points for Stable and Fast LLM Ensembling	-
2025-10	`CoRe`	Harnessing Consistency for Robust Test-Time LLM Ensemble	-
2025-05	`Transformer Copilot`	Transformer Copilot: Learning from The Mistake Log in LLM Fine-tuning
2025-02	`ABE`	Token-level Ensembling of Models with Different Vocabularies
2025-02	`CITER`	CITER: Collaborative Inference for Efficient Large Language Model Decoding with Token-Level Routing
2024-10	`UniTe`	Determine-Then-Ensemble: Necessity of Top-k Union for Large Language Model Ensembling	-
2024-06	`GaC`	Breaking the Ceiling of the LLM Community by Treating Token Generation as a Classification for Ensembling
2024-04	`DeePEn`	Ensemble Learning for Heterogeneous Large Language Models with Deep Parallel Collaboration
2024-04	`PackLLM`	Pack of LLMs: Model Fusion at Test-Time via Perplexity Optimization
2024-04	`EVA`	Bridging the Gap between Different Vocabularies for LLM Ensemble
2024-02	`-`	Purifying large language models by ensembling a small language model	-

2.2.2 (b,2) Span-Level Ensemble

Date	Name	Title	Paper/Github
2025-06	`RLAE`	RLAE: Reinforcement Learning-Assisted Ensemble for LLMs	-
2025-02	`Speculative Ensemble`	Speculative Ensemble: Fast Large Language Model Ensemble via Speculation
2024-12	`SpecFuse`	SpecFuse: Ensembling Large Language Models via Next-Segment Prediction	-
2024-09	`SweetSpan`	Hit the Sweet Spot! Span-Level Ensemble for Large Language Models	-
2024-07	`Cool-Fusion`	Cool-Fusion: Fuse Large Language Models without Training	-

2.2.3 (b,3) Process-Level Ensemble

Date	Name	Title	Paper/Github
2025-11	`CBS`	Collaborative Beam Search: Enhancing LLM Reasoning via Collective Consensus	-
2024-12	`LE-MCTS`	Ensembling Large Language Models with Process Reward-Guided Tree Search for Better Complex Reasoning	-

2.3 Ensemble After Inference

Figure 5: Summary analysis of the key attributes of ensemble-after-inference methods. (Please note that this figure may not be fully updated to include all the papers listed below.)

2.3.1 (c,1) Non Cascade

Date	Name	Title	Paper/Github
2025-12	`LLM-PeerReview`	Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process
2025-10	`LLMartini`	LLMartini: Seamless and Interactive Leveraging of Multiple LLMs through Comparison and Composition	-
2025-10	`-`	Beyond Consensus: Mitigating the Agreeableness Bias in LLM Judge Evaluations
2025-10	`OW/ISP`	Beyond Majority Voting: LLM Aggregation by Leveraging Higher-Order Information	-
2025-09	`FLAME`	Explainable Fault Localization for Programming Assignments via LLM-Guided Annotation
2025-09	`CARGO`	CARGO: A Framework for Confidence-Aware Routing of Large Language Models	-
2025-07	`LENS`	LENS: Learning Ensemble Confidence from Neural States for Multi-LLM Answer Integration	-
2025-05	`EL4NER`	EL4NER: Ensemble Learning for Named Entity Recognition via Multiple Small-Parameter Large Language Models	-
2025-03	`Symbolic-MoE`	Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning
2025-01	`DFPE`	DFPE: A Diverse Fingerprint Ensemble for Enhancing LLM Performance
2025-01	`DMoA`	Balancing Act: Diversity and Consistency in Large Language Model Ensembles	-
2024-12	`Smoothie`	Smoothie: Label Free Language Model Routing
2024-10	`LLM-Forest`	LLM-Forest: Ensemble Learning of LLMs with Graph-Augmented Prompts for Data Imputation
2024-10	`LLM-TOPLA`	LLM-TOPLA: Efficient LLM Ensemble by Maximising Diversity
2024-10	`MLKF`	Two Heads are Better than One: Zero-shot Cognitive Reasoning via Multi-LLM Knowledge Fusion	-
2024-08	`URG`	URG: A Unified Ranking and Generation Method for Ensembling Language Models	-
2024-02	`Agent-Forest`	More Agents Is All You Need
2023-06	`LLM-Blender`	LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion
2023-05	`MoRE`	Getting MoRE out of Mixture of Language Model Reasoning Experts

2.3.2 (c,2) Cascade

Date	Name	Title	Paper/Github
2025-12	`RoBoN`	RoBoN: Routed Online Best-of-n for Test-Time Scaling with Multiple LLMs
2025-09	`-`	Semantic Agreement Enables Efficient Open-Ended LLM Cascades	-
2025-04	`EMAFusion™`	EMAFusion™: A Self-Optimizing System for Seamless LLM Selection and Integration	-
2025-04	`ModelSwitch`	Do We Truly Need So Many Samples? Multi-LLM Repeated Sampling Efficiently Scales Test-Time Compute
2024-12	`DER`	Dynamic Ensemble Reasoning for LLM Experts	-
2024-10	`Cascade Routing`	A Unified Approach to Routing and Cascading for LLMs
2024-04	`-`	Language Model Cascades: Token-level uncertainty and beyond	-
2023-10	`AutoMix`	AutoMix: Automatically Mixing Language Models
2023-10	`neural caching`	Cache & Distil: Optimising API Calls to Large Language Models
2023-10	`-`	Large Language Model Cascades with Mixture of Thoughts Representations for Cost-efficient Reasoning
2023-10	`EcoAssistant`	EcoAssistant: Using LLM Assistant More Affordably and Accurately
2023-05	`FrugalGPT`	FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance	-
2023-01	`-`	When Does Confidence-Based Cascade Deferral Suffice?	-
2022-10	`Model Cascading`	Model Cascading: Towards Jointly Improving Efficiency and Accuracy of NLP Systems	-

2.4 Others: Benchmarks, Applications, Systems and Related Surveys

2.4.1 Benchmarks

Date	Name	Title
2026-01	`-`	LLMRouterBench: A Massive Benchmark and Unified Framework for LLM Routing
2025-12	`-`	Scoring, Reasoning, and Selecting the Best! Ensembling Large Language Models via a Peer-Review Process
2025-09	`-`	RouterArena: An Open Platform for Comprehensive Comparison of LLM Routers
2025-07	`-`	FusionFactory: Fusing LLM Capabilities with Multi-LLM Log Data
2025-03	`RouterEval`	RouterEval: A Comprehensive Benchmark for Routing LLMs to Explore Model-level Scaling Up in LLMs
2024-03	`RouterBench`	RouterBench: A Benchmark for Multi-LLM Routing System
2023-06	`MixInstruct`	LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion

2.4.2 Applications

Beyond the methods presented before, the concept of LLM Ensemble has found applications in a variety of more specialized tasks and domains. Here we give some examples:

Date	Name	Title	Paper/Github
2025-09	`FLAME`	Explainable Fault Localization for Programming Assignments via LLM-Guided Annotation
2025-05	`Expert Orchestration`	Beyond Monoliths: Expert Orchestration for More Capable, Democratic, and Safe Large Language Models	-
2025-04	`Consensus Entropy`	Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR	-
2024-11	`BWRS`	Bayesian Calibration of Win Rate Estimation with LLM Evaluators
2024-06	`FuseGen`	FuseGen: PLM Fusion for Data-generation based Zero-shot Learning
2024-05	`-`	PromptMind Team at MEDIQA-CORR 2024: Improving Clinical Text Correction with Error Categorization and LLM Ensembles	-
2024-02	`-`	LLM-Ensemble: Optimal Large Language Model Ensemble Method for E-commerce Product Attribute Value Extraction	-
2023-11	`-`	On Preserving the Knowledge of Long Clinical Texts	-
2023-10	`Ensemble-Instruct`	Ensemble-Instruct: Generating Instruction-Tuning Data with a Heterogeneous Mixture of LMs

2.4.3 Systems

Date	Name	Title	Paper/Github
2025-10	`LLMartini`	LLMartini: Seamless and Interactive Leveraging of Multiple LLMs through Comparison and Composition	-

2.4.4 Related Surveys

Date	Title	Paper/Github
2026-02	Dynamic Model Routing and Cascading for Efficient LLM Inference: A Survey	-
2025-07	Toward Edge General Intelligence with Multiple-Large Language Model (Multi-LLM): Architecture, Trust, and Orchestration	-
2025-06	Towards Efficient Multi-LLM Inference: Characterization and Analysis of LLM Routing and Hierarchical Techniques	-
2025-05	A Survey on Collaborative Mechanisms Between Large and Small Language Models	-
2025-03	A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well
2025-02	Doing More with Less – Implementing Routing Strategies in Large Language Model-Based Systems: An Extended Survey	-
2025-02	Doing More with Less: A Survey on Routing Strategies for Resource Optimisation in Large Language Model-Based Systems	-
2024-08	Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities
2024-08	A Survey on Model MoErging: Recycling and Routing Among Specialized Experts for Collaborative Learning	-
2024-07	Merge, Ensemble, and Cooperate! A Survey on Collaborative Strategies in the Era of Large Language Models	-
2023-09	Deep Model Fusion: A Survey	-
2023-02	A comprehensive review on ensemble deep learning: Opportunities and challenges	-

3 Others: Some public implementations of the LLM Ensemble methods

Date	Title	Github
2025	Ensemble-Hub

4 Others: Some other related interesting papers

Here we briefly list some related papers, which were either discovered by us or suggested to this repository by their authors. They mainly focus on LLM Collaboration.

4.1 Test-Time Scaling

Date	Name	Title	Paper/Github
2025-10	`-`	Stable LLM Ensemble: Interaction between Example Representativeness and Diversity	-

4.2 LLM Collaboration and Others

Date	Name	Title	Paper/Github
2025-02	`Heter-MAD`	If Multi-Agent Debate is the Answer, What is the Question?	-
2025-03	`GENOME`	Nature-Inspired Population-Based Evolution of Large Language Models
2024-10	`LLM-Forest`	LLM-Forest: Ensemble Learning of LLMs with Graph-Augmented Prompts for Data Imputation
2025-08	`SLC`	Small-Large Collaboration: Training-efficient Concept Personalization for Large VLM
2025-09	`Best-of-∞`	Best-of-∞ -- Asymptotic Performance of Test-Time LLM Ensembling
2025-09	`MoT`	Mixture of Thoughts: Learning to Aggregate What Experts Think, Not Just What They Say
2025-10	`ColMAD`	Towards Scalable Oversight with Collaborative Multi-Agent Debate in Error Detection	-
2025-10	`AdCo`	Adaptive Coopetition: Leveraging Coarse Verifier Signals for Resilient Multi-Agent LLM Reasoning
2025-10	`-`	Harmonizing Diverse Models: A Layer-wise Merging Strategy for Consistent Generation	-
2025-12	`CogER`	Beyond Fast and Slow: Cognitive-Inspired Elastic Reasoning for Large Language Models	-

5 Summarization

Figure 6: Summary analysis of the key attributes of LLM Ensemble approaches.
