Benchmarking Agentic Workflow Generation

Shuofei Qiao♠*, Runnan Fang♠*, Zhisong Qiu♠*, Xiaobin Wang, Ningyu Zhang♠†, Yong Jiang♢†, Pengjun Xie, Fei Huang, Huajun Chen♠†

♠Zhejiang University  ♢Alibaba Group
*Equal contribution  †Corresponding Author
"If you can't describe what you are doing as a process, you don't know what you're doing."
—— W. Edwards Deming

Figure 1: Workflow and its application.

Abstract

Large Language Models (LLMs), with their exceptional ability to handle a wide range of tasks, have driven significant advancements in tackling reasoning and planning tasks, in which decomposing complex problems into executable workflows is a crucial step. Existing workflow evaluation frameworks either focus solely on holistic performance or suffer from limitations such as restricted scenario coverage, simplistic workflow structures, and lax evaluation standards. To this end, we introduce WORFBENCH, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures. Additionally, we present WORFEVAL, a systematic evaluation protocol utilizing subsequence and subgraph matching algorithms to accurately quantify the LLM agent's workflow generation capabilities. Through comprehensive evaluations across different types of LLMs, we discover distinct gaps between the sequence planning capabilities and graph planning capabilities of LLM agents, with even GPT-4 exhibiting a gap of around 15%. We also train two open-source models and evaluate their generalization abilities on held-out tasks. Furthermore, we observe that the generated workflows can enhance downstream tasks, enabling superior performance with less inference time.



WorFBench

Overview

Figure 2: The overview framework of our WORFBENCH. Sector 1 is the benchmark construction, where we first synthesize the node chain and then the workflow graph. Sector 2 is our data filtering process. Sector 3 describes the algorithms in WORFEVAL that evaluate the predicted workflow of LLM agents. Sector 4 is a detailed data point of our WORFBENCH. Note that each node in this figure is uniquely identified by its color. Numbers on the nodes represent their indexes in the gold chain. Nodes matched with the gold chain or graph are circled in Sector 3.
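
To make the evaluation protocol concrete, below is a minimal Python sketch of the kind of subsequence (chain) and subgraph matching that WORFEVAL performs. The node-alignment strategy, tie-breaking, and exact scoring details are simplifying assumptions for illustration, not the official implementation.

# Minimal sketch of subsequence (chain) and subgraph matching in the spirit of
# WORFEVAL. Node matching and scoring details are simplified assumptions,
# not the official implementation.
from typing import Dict, List, Tuple


def lcs_length(pred: List[str], gold: List[str]) -> int:
    """Longest common subsequence between predicted and gold node chains."""
    dp = [[0] * (len(gold) + 1) for _ in range(len(pred) + 1)]
    for i, p in enumerate(pred, 1):
        for j, g in enumerate(gold, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if p == g else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(pred)][len(gold)]


def f1(matched: int, n_pred: int, n_gold: int) -> float:
    if matched == 0:
        return 0.0
    precision, recall = matched / n_pred, matched / n_gold
    return 2 * precision * recall / (precision + recall)


def f1_chain(pred_nodes: List[str], gold_chain: List[str]) -> float:
    """Chain score: LCS-based F1 over node sequences."""
    return f1(lcs_length(pred_nodes, gold_chain), len(pred_nodes), len(gold_chain))


def f1_graph(pred_edges: List[Tuple[str, str]],
             gold_edges: List[Tuple[str, str]],
             node_map: Dict[str, str]) -> float:
    """Graph score: F1 over gold edges recovered under a node alignment.

    node_map maps predicted node ids to gold node ids (e.g. derived from the
    chain alignment above); an edge counts as matched if both of its mapped
    endpoints form an edge in the gold graph.
    """
    mapped = {(node_map.get(u), node_map.get(v)) for u, v in pred_edges}
    matched = sum(1 for e in gold_edges if e in mapped)
    return f1(matched, len(pred_edges), len(gold_edges))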



Experiment Results

Main Results

Table 1: Main Results. We evaluate all the models with identical carefully designed instructions and two-shot examples. We categorize the models by whether they are open-source and by their scale. The best results for each category are marked in bold, and the second-best results are marked with underline.



Analysis

First Slide

Figure 3: Performance Distribution of GPT-4. The distribution of f1_chain with respect to the number of nodes and of f1_graph with respect to the number of edges.


We analyze the performance of GPT-4 across different numbers of nodes and edges in the workflow. As the number of nodes and edges increases, both the f1_chain and f1_graph performance of GPT-4 tend to decline, with occasional brief spikes likely caused by uneven sample distribution. Therefore, for complex planning tasks with more planning steps, the performance of GPT-4 is unsatisfying for both linear planning and graph planning, let alone that of other models. This is clearly inadequate for many complex real-world scenarios, which is why many agent architectures currently remain only at the theoretical level.




Second Slide

Table 2: Generalization Results of fine-tuned (FT) models on held-out tasks compared to baselines.


We evaluate the trained models' capabilities on both held-in and held-out tasks. While these models demonstrate strong performance on Seal-Tools, their advantages are not as pronounced as on held-in tasks, with even untrained 7B models achieving approximately 74%. On more complex tasks such as InterCodeSQL, the trained models only slightly outperform smaller models (7B and 13B). This indicates that, while they excel in held-in scenarios, their generalization to held-out tasks, particularly embodied tasks, remains constrained. It suggests that structured workflow planning cannot be mastered solely by fitting large amounts of data.




Third Slide

Figure 4: Error Statistics.


Through meticulous manual checks and categorization, we identify four kinds of typical errors: 1) Granularity. The decomposition of subtasks does not meet the minimum executable granularity. 2) Explicitness. The summary of subtasks is overly vague. 3) Graph. The subtask is correct, but the graph structure is incorrect. 4) Format. The output does not adhere to the specified text format.






The Role of Workflow for Agent Planning

Enhance End-To-End Performance

First Slide

Table 3: End-to-end Performance augmented by workflow as prior knowledge.



Workflow as Structured Prior Knowledge. Workflows can serve as structured prior knowledge to guide LLM agents in planning, especially in environments where agents would otherwise lack such knowledge and typically rely on trial and error. By inputting the workflow along with the task, GPT-4, Llama-3.1-8B, and Qwen-2-72B all show improved performance, as seen in Table 3, with greater benefits in more complex scenarios such as ALFWorld. The findings also suggest a "weak-guide-strong" paradigm, where a smaller model with specific environmental knowledge can effectively supervise the planning of a larger, more general model.
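
As a concrete illustration, here is a hedged sketch of how a workflow might be supplied alongside the task as prior knowledge. The prompt wording, the example ALFWorld-style task, and the call_llm helper are illustrative assumptions, not the paper's actual prompts.

# Illustrative sketch of prepending a (possibly smaller-model-generated)
# workflow to the task prompt as structured prior knowledge. The prompt
# wording and the call_llm helper are assumptions for illustration.
from typing import List


def build_workflow_prompt(task: str, workflow_nodes: List[str]) -> str:
    steps = "\n".join(f"{i + 1}. {node}" for i, node in enumerate(workflow_nodes))
    return (
        f"Task: {task}\n\n"
        "You may use the following workflow as prior knowledge about how to\n"
        "decompose and order the subtasks. Follow it when helpful, but adapt\n"
        "to the actual environment feedback.\n"
        f"Workflow:\n{steps}\n\n"
        "Now act step by step to complete the task."
    )


# Example usage with a hypothetical ALFWorld-style task:
prompt = build_workflow_prompt(
    task="Put a clean mug on the coffee table.",
    workflow_nodes=[
        "find a mug",
        "take the mug",
        "clean the mug in the sink",
        "go to the coffee table",
        "put the mug on the coffee table",
    ],
)
# response = call_llm(prompt)  # hypothetical LLM call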




Second Slide

Figure 5: Relative Function Call Accuracy of workflow-augmented Qwen-2-7B (Qwen-2-7B+W) on StableToolBench compared with various baselines.


Workflow as CoT Augmentation. Chain-of-Thought (CoT) enhances LLM reasoning, but its long-context nature can lead to errors, especially in multi-step planning. Our workflow, in which each node corresponds to a function call, helps agents stay focused by generating CoT at each step and retrieving only the APIs relevant to that step. This process improves function invocation accuracy, as demonstrated by comparisons with ToolLlama and other baselines on StableToolBench.
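
A minimal sketch of this node-wise CoT-plus-retrieval loop is given below. The helper names generate_cot, retrieve_apis, and call_function are placeholders for illustration, not the actual interface used in the paper.

# Sketch of using a graph workflow to structure CoT-augmented function calling:
# for each workflow node, generate a short chain of thought, retrieve candidate
# APIs for that node only, then issue the call. All helpers are placeholders.
from typing import Callable, List


def execute_workflow(nodes: List[str],
                     generate_cot: Callable[[str, List[str]], str],
                     retrieve_apis: Callable[[str], List[str]],
                     call_function: Callable[[str, str], str]) -> List[str]:
    observations: List[str] = []
    for node in nodes:
        # CoT is produced per node, so the context stays short and focused.
        thought = generate_cot(node, observations)
        # Retrieval is conditioned on the current subtask, not the whole task.
        candidate_apis = retrieve_apis(node)
        api = candidate_apis[0] if candidate_apis else "finish"
        observations.append(call_function(api, thought))
    return observations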




Reduce End-To-End Inference-Time

First Slide

Figure 6: Average Task Execution Time of linear ToolLlama and parallel ToolLlama.


Parallel Planning Steps. In graph-structured workflows, nodes without dependencies can be executed in parallel, reducing task completion time compared to linear execution. Analysis on StableToolBench shows that identifying the longest path (the critical path) in the workflow graph helps optimize execution, cutting the average task completion time by one-fifth to one-third across various tests. This parallelization not only speeds up inference but also alleviates the long-context issues of multi-step tasks, improving task quality.
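
The timing argument can be made concrete with a small sketch: with unit-cost nodes, linear execution takes as many steps as there are nodes, while parallel execution is bounded by the length of the critical path. The graph encoding and unit node costs below are simplifying assumptions.

# Minimal sketch: compare linear execution time (number of nodes) with
# parallel execution time (length of the critical path) for a DAG workflow.
from functools import lru_cache
from typing import Dict, List


def critical_path_length(edges: Dict[str, List[str]], nodes: List[str]) -> int:
    """Longest path (in node count) through a DAG: the parallel makespan when
    every node takes one unit of time and independent nodes run concurrently."""

    @lru_cache(maxsize=None)
    def longest_from(node: str) -> int:
        return 1 + max((longest_from(nxt) for nxt in edges.get(node, [])), default=0)

    return max(longest_from(n) for n in nodes)


# Example: a 6-node workflow whose two branches can run in parallel.
nodes = ["n1", "n2", "n3", "n4", "n5", "n6"]
edges = {"n1": ["n2", "n3"], "n2": ["n4"], "n3": ["n5"], "n4": ["n6"], "n5": ["n6"]}

linear_time = len(nodes)                             # 6 steps executed one by one
parallel_time = critical_path_length(edges, nodes)   # 4 steps along the critical path
print(linear_time, parallel_time)                    # -> 6 4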




Second Slide

Table 4: Average Planning Steps.


Shorten Planning Steps. Workflows not only reduce inference time through parallel execution but also decrease the planning steps required for LLM agents. When lacking prior environmental knowledge, agents typically rely on random trial-and-error, which can introduce irrelevant information and hinder performance. By incorporating workflow knowledge, the agent's actions become more purposeful, significantly reducing unnecessary planning steps, as shown in the quantitative analysis in Table 4.

BibTeX


@misc{qiao2024benchmarkingagenticworkflowgeneration,
    title={Benchmarking Agentic Workflow Generation}, 
    author={Shuofei Qiao and Runnan Fang and Zhisong Qiu and Xiaobin Wang and Ningyu Zhang and Yong Jiang and Pengjun Xie and Fei Huang and Huajun Chen},
    year={2024},
    eprint={2410.07869},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2410.07869}, 
}

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.