💻 Usage Instructions & Steps to reproduce
We structure the code available in this replication package based on the stages involved in the LLM-based annotation process.
🤖 LLM-based annotation
The `llm_annotation` folder contains the code used to generate the LLM-based annotations.
There are two main scripts:
`create_assistant.py` is used to create a new assistant with a particular provider and model. This script defines a common system prompt shared across all agents, using the `data/guidelines.txt` file as its basis (a minimal sketch of this step is shown after this list).
`annotate_emotions.py` is used to annotate a set of reviews with emotions using a previously created assistant. This script also validates the output format, computes some common metrics for cost-efficiency analysis, and generates the output file.
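For reference, the sketch below shows how an assistant can be created from the guidelines file with the OpenAI Python SDK. This is a minimal illustration, not the exact code of `create_assistant_openai.py`; the assistant name is a hypothetical choice.

```python
# Minimal sketch (assumptions: openai Python SDK, OPENAI_API_KEY set in the
# environment, and a hypothetical assistant name).
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

with open("data/guidelines.txt", encoding="utf-8") as f:
    guidelines = f.read()

assistant = client.beta.assistants.create(
    name="emotion-annotator",   # hypothetical name
    instructions=guidelines,    # annotation guidelines as the system prompt
    model="gpt-4o",
)
print(assistant.id)  # the ID is later used to run the annotation
```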
Our research includes LLM-based annotation experiments with 3 LLMs: GPT-4o, Mistral Large 2, and Gemini 2.0 Flash. To illustrate the usage of the code, in this README we refer to the code execution for generating annotations using GPT-4o. However, full code is provided for all LLMs.
🔑 Step 1: Add your API key
If you haven't done this already, add your API key to the `.env` file in the root folder. For instance, for OpenAI, you can add the following:
OPENAI_API_KEY=sk-proj-...
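The scripts can then read this key from the environment. A minimal sketch, assuming the `python-dotenv` package is used to load the `.env` file (this package choice is an assumption):

```python
# Minimal sketch, assuming python-dotenv loads the key from .env (an assumption).
import os
from dotenv import load_dotenv

load_dotenv()  # reads variables from the .env file in the working directory
api_key = os.getenv("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set; check your .env file")
```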
🛠️ Step 2: Create an assistant
Create an assistant using the `create_assistant.py` script. For instance, for GPT-4o, you can run the following command:
python ./code/llm_annotation/create_assistant_openai.py --guidelines ./data/guidelines.txt --model gpt-4o
This will create an assistant loading the `data/guidelines.txt` file and using the GPT-4o model.
📝 Step 3: Annotate emotions
Annotate emotions using the `annotate_emotions.py` script. For instance, for GPT-4o, you can run the following command on a small subset of 100 reviews from the ground truth as an example:
python ./code/llm_annotation/annotate_emotions_openai.py --input ./data/ground-truth-small.xlsx --output ./data/annotations/llm/temperature-00/ --batch_size 10 --model gpt-4o --temperature 0 --sleep_time 10
To annotate the whole dataset, run the following command (IMPORTANT: this will take more than 60 minutes due to OpenAI, Mistral, and Gemini API response times!):
python ./code/llm_annotation/annotate_emotions_openai.py --input ./data/ground-truth.xlsx --output ./data/annotations/llm/temperature-00/ --batch_size 10 --model gpt-4o --temperature 0 --sleep_time 10
Parameters include:
input: path to the input file containing the set of reviews to annotate (e.g., `data/ground-truth.xlsx`).
output: path to the output folder where annotations will be saved (e.g., `data/annotations/llm/temperature-00/`).
batch_size: number of reviews to annotate for each user request (e.g., 10).
model: model to use for the annotation (e.g., `gpt-4o`).
temperature: temperature for the model responses (e.g., 0).
sleep_time: time to wait between batches, in seconds (e.g., 10).
This will annotate the emotions using the assistant created in the previous step, creating a new file with the same format as the `data/ground-truth.xlsx` file.
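To make the batch_size and sleep_time parameters concrete, the sketch below outlines a batching loop of the kind the annotation script runs. It is an illustration only: the `review` handling, the `annotate_batch` placeholder, and the loop structure are assumptions, not the actual implementation.

```python
# Illustrative batching loop for --batch_size and --sleep_time.
# The annotate_batch() placeholder and the "emotions" column are assumptions.
import time
import pandas as pd

def annotate_batch(batch: pd.DataFrame) -> pd.DataFrame:
    # Placeholder for the assistant call; the real script queries the LLM here.
    out = batch.copy()
    out["emotions"] = ""  # filled in by the LLM in the real script
    return out

def annotate_in_batches(input_file: str, batch_size: int = 10, sleep_time: int = 10) -> pd.DataFrame:
    reviews = pd.read_excel(input_file)
    results = []
    for start in range(0, len(reviews), batch_size):
        batch = reviews.iloc[start:start + batch_size]
        results.append(annotate_batch(batch))  # one request per batch of reviews
        time.sleep(sleep_time)                 # pause between batches to respect rate limits
    return pd.concat(results, ignore_index=True)
```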
🔄 Data processing
In this stage, we reorganize all annotation files into iterations and consolidate the agreement between multiple annotators or LLM runs. This logic serves both human and LLM annotations. Parameters can be updated to include more annotators or LLM runs.
✂️ Step 4: Split annotations into iterations
We split the annotations into iterations based on the number of annotators or LLM runs. For instance, for GPT-4o (run 0), we can run the following command:
python code/data_processing/split_annotations.py --input_file data/annotations/llm/temperature-00/gpt-4o-0-annotations.xlsx --output_dir data/annotations/iterations/
This facilitates the Kappa and agreement analyses in alignment with each human annotation iteration.
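As a rough illustration of what this split could look like, the sketch below writes one file per iteration, assuming the annotations carry an iteration identifier column; the column name `iteration` and the output naming are assumptions, and the actual split_annotations.py may use a different criterion.

```python
# Rough illustration only: write one file per iteration.
# The "iteration" column and output file naming are assumptions.
import pandas as pd

df = pd.read_excel("data/annotations/llm/temperature-00/gpt-4o-0-annotations.xlsx")
for iteration, group in df.groupby("iteration"):
    group.to_excel(
        f"data/annotations/iterations/gpt-4o-0_iteration-{iteration}.xlsx",
        index=False,
    )
```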
🤝 Step 5: Analyse agreement
We consolidate the agreement between multiple annotators or LLM runs. For instance, for GPT-4o, we can run the following command to use the run from Step 3 (run 0) and three additional annotations (runs 1, 2, and 3) already available in the replication package (NOTE: we simplify the process to speed up the analysis and avoid delays in annotation):
python code/evaluation/agreement.py --input-folder data/annotations/iterations/ --output-folder data/agreements/ --annotators gpt-4o-0 gpt-4o-1 gpt-4o-2 gpt-4o-3
To replicate our original study, run the following:
python code/evaluation/agreement.py --input-folder data/annotations/iterations/ --output-folder data/agreements/ --annotators gpt-4o-1 gpt-4o-2 gpt-4o-3
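For intuition on what consolidation can mean here, the sketch below uses simple per-emotion majority voting across runs. Majority voting and the emotion column names are illustrative assumptions, not necessarily the rule implemented in agreement.py.

```python
# Illustrative consolidation by per-emotion majority vote across annotators/runs.
# The voting rule and the emotion column names are assumptions.
import pandas as pd

EMOTIONS = ["joy", "sadness", "anger", "fear", "surprise", "disgust"]  # assumed label set

def consolidate(files):
    runs = [pd.read_excel(f) for f in files]
    agreed = runs[0].drop(columns=EMOTIONS, errors="ignore").copy()
    for emotion in EMOTIONS:
        votes = sum(run[emotion].astype(int) for run in runs)
        agreed[emotion] = (votes > len(runs) / 2).astype(int)  # 1 if most runs assign it
    return agreed
```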
📊 Evaluation
After consolidating agreements, we can evaluate both the Cohen's Kappa agreement and the correctness of the LLM-based annotations with respect to the human annotations. Our code allows any combination of annotators and LLM runs.
📈 Step 6: Emotion statistics
We evaluate the statistics of the emotions in the annotations, including emotion frequency, distribution, and correlation between emotions. For instance, for GPT-4o and the example in this README file, we can run the following command:
python code/evaluation/emotion_statistics.py --input-file data/agreements/agreement_gpt-4o-0-gpt-4o-1-gpt-4o-2-gpt-4o-3.xlsx --output-dir data/evaluation/statistics/gpt-4o-0123
To replicate our original study, run the following:
python code/evaluation/emotion_statistics.py --input-file data/agreements/agreement_gpt-4o-1-gpt-4o-2-gpt-4o-3.xlsx --output-dir data/evaluation/statistics/gpt-4o
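As a rough outline of what these statistics involve, the sketch below computes per-emotion frequencies, their relative distribution, and the correlation matrix between emotions; the emotion column names are assumptions.

```python
# Sketch of the reported statistics: per-emotion frequency, distribution, and
# correlation between emotions. The emotion column names are assumptions.
import pandas as pd

EMOTIONS = ["joy", "sadness", "anger", "fear", "surprise", "disgust"]

df = pd.read_excel("data/agreements/agreement_gpt-4o-1-gpt-4o-2-gpt-4o-3.xlsx")
frequency = df[EMOTIONS].sum()          # number of reviews labelled with each emotion
distribution = frequency / len(df)      # relative frequency per emotion
correlation = df[EMOTIONS].corr()       # pairwise correlation between emotion labels
print(frequency, distribution, correlation, sep="\n\n")
```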
⚖️ Step 7: Cohen's Kappa pairwise agreement
We measure the average pairwise Cohen's Kappa agreement between annotators or LLM runs. For instance, for GPT-4o and the example in this README file, we can run the following command:
python code/evaluation/kappa.py --input_folder data/annotations/iterations/ --output_folder data/evaluation/kappa/ --annotators gpt-4o-0,gpt-4o-1,gpt-4o-2,gpt-4o-3
To replicate our original study, run the following:
python code/evaluation/kappa.py --input_folder data/annotations/iterations/ --output_folder data/evaluation/kappa/ --annotators gpt-4o-1,gpt-4o-2,gpt-4o-3 --exclude 0,1,2
In our analysis, we exclude iterations 0, 1 and 2 as they were used for guidelines refinement.
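For reference, average pairwise Cohen's Kappa over a set of annotators can be computed as sketched below with scikit-learn. The per-iteration file layout and the single emotion column are assumptions; the actual kappa.py may aggregate per emotion and per iteration differently.

```python
# Minimal sketch of average pairwise Cohen's Kappa for one emotion label.
# The file layout and the "joy" column are assumptions; kappa.py may differ.
from itertools import combinations

import pandas as pd
from sklearn.metrics import cohen_kappa_score

annotators = ["gpt-4o-1", "gpt-4o-2", "gpt-4o-3"]
labels = {
    a: pd.read_excel(f"data/annotations/iterations/{a}.xlsx")["joy"]  # hypothetical path/column
    for a in annotators
}

scores = [cohen_kappa_score(labels[a], labels[b]) for a, b in combinations(annotators, 2)]
print(sum(scores) / len(scores))  # average pairwise Kappa
```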
✅ Step 8: LLM-based annotation correctness
We measure the correctness (accuracy, precision, recall, and F1 score) between a set of annotated reviews and a given ground truth. For instance, for GPT-4o agreement and the example in this README file, we can run the following command:
python code/evaluation/correctness.py --ground_truth data/ground-truth.xlsx --predictions data/agreements/agreement_gpt-4o-0-gpt-4o-1-gpt-4o-2-gpt-4o-3.xlsx --output_dir data/evaluation/correctness/gpt-4o
To replicate our original study, run the following:
python code/evaluation/correctness.py --ground_truth data/ground-truth.xlsx --predictions data/agreements/agreement_gpt-4o-1-gpt-4o-2-gpt-4o-3.xlsx --output_dir data/evaluation/correctness/gpt-4o
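For reference, these correctness metrics can be computed against the ground truth as sketched below with scikit-learn, here shown for a single emotion column; the column name is an assumption, and the actual correctness.py may report metrics per emotion and aggregated.

```python
# Minimal sketch of correctness metrics (accuracy, precision, recall, F1) for
# one emotion. The "joy" column is an assumption; correctness.py may differ.
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

truth = pd.read_excel("data/ground-truth.xlsx")
pred = pd.read_excel("data/agreements/agreement_gpt-4o-1-gpt-4o-2-gpt-4o-3.xlsx")

y_true, y_pred = truth["joy"], pred["joy"]  # hypothetical emotion column
precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
print(accuracy_score(y_true, y_pred), precision, recall, f1)
```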
📝 Step 9: Check results
After completing these steps, you will be able to check all generated artefacts, including:
LLM annotations: available at `data/annotations/llm/`
Agreement between LLM annotations and humans: available at `data/evaluation/kappa/`
Correctness of LLM annotations with respect to the human agreement: available at `data/evaluation/correctness/`
📜 License