Instruction tuning has gained increasing attention and emerged as a crucial technique for enhancing the capabilities of Large Language Models (LLMs), bridging the gap between the next-word prediction objective of LLMs and human preferences. To construct a high-quality instruction dataset, many instruction processing approaches have been proposed, aiming to achieve a delicate balance between data quantity and data quality. Nevertheless, due to inconsistencies that persist among various instruction processing methods, there is no standard implementation framework available to the community, which hinders practitioners from further developing and advancing the field. To facilitate instruction processing research, we present EasyInstruct, an easy-to-use instruction processing framework for LLMs, which modularizes instruction generation, selection, and prompting, while also considering their combination and interaction.
EasyInstruct is a Python package that serves as an easy-to-use instruction processing framework for Large Language Models (LLMs) such as GPT-4, LLaMA, and ChatGLM in research experiments. It modularizes instruction generation, selection, and prompting, while also considering their combination and interaction.
- APIs & Engines: standardizes the instruction execution process, enabling the execution of instruction prompts on specific LLM API services or locally deployed LLMs.
- Generators: streamlines the instruction generation process, enabling automated generation of instruction data based on chat data, corpus, or knowledge graphs.
- Selectors: standardizes the instruction selection process, enabling the extraction of high-quality instruction datasets from raw, unprocessed instruction data.
- Prompts: standardizes the instruction prompting process.
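As a low-code illustration of how these modules fit together, the sketch below chains a generator with a selector. The class names and arguments are assumptions based on the package's modular design and may differ from the released API.

```python
# Hedged sketch of low-code usage; class names and arguments are assumptions
# and may differ from the actual EasyInstruct API.
from easyinstruct import SelfInstructGenerator, GPTScoreSelector
from easyinstruct.utils.api import set_openai_key

# Configure the API key used by the underlying engine.
set_openai_key("YOUR-OPENAI-API-KEY")

# Generators: produce new instruction data from seed data.
generator = SelfInstructGenerator(num_instructions_to_generate=10)
generator.generate()

# Selectors: filter the raw instructions down to a high-quality subset.
selector = GPTScoreSelector(source_file_path="generations.jsonl")
selector.process()
```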
The instruction generation methods implemented in Generators are categorized into three groups based on their respective seed data sources: chat data, corpus, and knowledge graphs. The evaluation metrics in Selectors are divided into two categories based on the principle of their implementation: statistics-based and LM-based. We detail the components of the Generators and Selectors modules in the table below:
The framework is designed to cater to users with varying levels of expertise, providing a user-friendly experience ranging from code-free execution to low-code customization and advanced customization options.
We provide two ways for users to quickly get started with EasyInstruct. You can either use the shell script or the Gradio app based on your specific needs.
Step 1: Prepare a configuration file. Users can easily configure the parameters of EasyInstruct in a YAML-style file, or simply use the default parameters in the configuration files we provide. Following is an example of the configuration file for Self-Instruct:
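(The field names below are a minimal, illustrative sketch and are assumptions on our part; consult the default configuration files shipped with EasyInstruct for the exact schema.)

```yaml
# Illustrative Self-Instruct configuration; key names are assumptions and may
# differ from the defaults shipped with EasyInstruct.
generator:
  SelfInstructGenerator:
    seed_tasks_path: data/seed_tasks.jsonl   # seed instruction/chat data
    target_dir: data/generations/            # where generated data is written
    data_format: alpaca                      # output format of the dataset
    num_instructions_to_generate: 100        # size of the generated set
    engine: gpt-3.5-turbo                    # LLM API engine used for generation
```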
Step 2: Run the shell script. Users should first specify the configuration file and provide their own OpenAI API key, then run the following shell script to launch the instruction generation or selection process.
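A hypothetical invocation is sketched below; the actual script path and flags in the repository may differ.

```bash
# Hypothetical launch command; the script name and flags are assumptions and
# may differ from those shipped in the EasyInstruct repository.
export OPENAI_API_KEY="sk-..."               # your own OpenAI API key
config_file="configs/self_instruct.yaml"     # the YAML configuration prepared in Step 1
python demo/run.py --config "$config_file"
```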
We provide a Gradio app for users to quickly get started with EasyInstruct. Users can choose to launch the Gradio app locally on their own machines or alternatively, they can try the hosted Gradio app that we provide on HuggingFace Spaces.
In experiments, we mainly consider four instruction datasets as follows: (a) self_instruct_5k is constructed by employing the Self-Instruct method to distill instruction data from text-davinci-003; (b) alpaca_data_5k is randomly sampled from the Alpaca dataset; (c) evol_instruct_5k is constructed by employing the Evol-Instruct method; (d) easyinstruct_5k is collected by integrating the three instruction datasets above and applying multiple Selectors in EasyInstruct to extract a high-quality instruction dataset.
To conduct experiments on the effect of instruction datasets, we adopt a LLaMA2 (7B) model and fine-tune all our models with LoRA, using the data format proposed in Alpaca. The evaluation is conducted by comparing the generated results of the different fine-tuned models on the AlpacaFarm evaluation set. Following AlpacaFarm, for each comparison we employ ChatGPT as the evaluator to automatically compare two outputs from different models and label which one it prefers, reporting the win rate as the evaluation metric.
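A minimal sketch of this pairwise, ChatGPT-as-judge evaluation is shown below; it is not the authors' exact evaluation code, and the prompt wording and model name are assumptions.

```python
# Sketch of ChatGPT-based pairwise comparison for win-rate evaluation.
# Assumes OPENAI_API_KEY is set; prompt wording and model name are illustrative.
from openai import OpenAI

client = OpenAI()

def judge(instruction: str, output_a: str, output_b: str) -> str:
    """Ask ChatGPT which of two model outputs better follows the instruction."""
    prompt = (
        "Compare two responses to the same instruction.\n"
        f"Instruction: {instruction}\n\n"
        f"Response A: {output_a}\n\nResponse B: {output_b}\n\n"
        "Answer with a single letter, A or B, indicating the better response."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

def win_rate(pairs):
    """Fraction of comparisons in which model A's output is preferred.

    `pairs` is a list of (instruction, output_a, output_b) tuples.
    """
    wins = sum(judge(i, a, b).startswith("A") for i, a, b in pairs)
    return wins / len(pairs)
```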
Instruction Diversity. To study the diversity of the instruction datasets considered in our experiments, we identify the verb-noun structure in the generated instructions and plot the top 20 most prevalent root verbs and their top 4 direct nouns in the figure below. Overall, we see a wide range of intents and textual formats within these instructions.
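This kind of verb-noun analysis can be reproduced with a dependency parser; the sketch below uses spaCy and is not the authors' exact analysis script.

```python
# Sketch of the verb-noun diversity analysis: parse each instruction with spaCy
# and collect the root verb together with its direct-object noun.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def root_verb_and_object(instruction):
    """Return (root verb lemma, direct-object lemma) of an instruction, if any."""
    doc = nlp(instruction)
    for token in doc:
        if token.dep_ == "ROOT" and token.pos_ == "VERB":
            dobj = next((c.lemma_ for c in token.children if c.dep_ == "dobj"), None)
            return token.lemma_, dobj
    return None, None

instructions = [
    "Write a short story about a robot learning to paint.",
    "Summarize the following article in two sentences.",
]  # in practice, the full instruction dataset

counter = Counter()
for inst in instructions:
    verb, noun = root_verb_and_object(inst)
    if verb and noun:
        counter[(verb, noun)] += 1

print(counter.most_common(20))  # most prevalent (root verb, direct noun) pairs
```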
Main Results. We compare the generated outputs from models fine-tuned separately on the four instruction datasets with the outputs from the base version of the LLaMA2 (7B) model on the AlpacaFarm evaluation set. As depicted in the figure below, there are improvements in the win rate metric for all the settings. Moreover, the model performs optimally under the easyinstruct_5k setting, indicating the importance of a rich instruction selection strategy.
Case Study. To conduct a qualitative evaluation of EasyInstruct, we sample several instruction examples selected by the Selectors module in easyinstruct_5k for the case study.
We also attach the corresponding evaluation scores for each of these instruction examples, as shown in the table below.
We observe that the selected instructions often possess fluent language and meticulous logic.
@article{ou2024easyinstruct,
title={EasyInstruct: An Easy-to-use Instruction Processing Framework for Large Language Models},
author={Ou, Yixin and Zhang, Ningyu and Gui, Honghao and Xu, Ziwen and Qiao, Shuofei and Bi, Zhen and Chen, Huajun},
journal={arXiv preprint arXiv:2402.03049},
year={2024}
}