Dwain Barnes

Turning Llama 3.2 Into a Legal Q&A Pro: Fine-Tuning a Model for UK Legislation - Part 2




In the first part of this series, I shared how I pretrained Meta’s Llama 3.2 model to focus on UK legislation. The goal? To create a model that could understand legal texts and interpret them effectively. But pretraining was only step one. Now, I’ve taken it further by teaching the model how to answer user questions with a custom Q&A dataset—and the results have been pretty exciting.

Here’s what I did, step by step, and how you can access everything I built.


 

Step 1: Generating a Q&A Dataset on UK Legislation

The first challenge was to create a dataset that contained realistic question-and-answer pairs based on UK legislation. I wrote a Python script to generate these pairs using GPT-4o-mini. Why GPT-4o-mini? It's excellent at creating relevant, context-aware outputs, making it perfect for bootstrapping a dataset like this. It's also cheap: generating the whole dataset cost around £3.50.


Here’s the script I used:


from openai import OpenAI
import time
from math import ceil

# Initialize the OpenAI client
client = OpenAI(api_key="YOUR OPENAI API KEY HERE")

def generate_qa_batch(batch_number, batch_size):
    # Define the legal areas to cover
    legal_areas = [
        "Constitutional and Administrative Law - covering topics like parliamentary sovereignty, judicial review, and human rights",
        "Criminal Law - including elements of crimes, defenses, and criminal procedure",
        "Contract Law - focusing on formation, terms, breach, and remedies",
        "Tort Law - covering negligence, nuisance, and other civil wrongs",
        "Employment Law - including contracts, discrimination, and workplace rights",
        "Company Law - covering corporate structure, director duties, and shareholder rights",
        "Family Law - including marriage, divorce, child custody, and domestic violence",
        "Intellectual Property Law - covering patents, trademarks, copyright, and design rights"
    ]
    
    # Calculate which legal areas to focus on for this batch
    areas_per_batch = min(batch_size, len(legal_areas))
    selected_areas = legal_areas[(batch_number * areas_per_batch) % len(legal_areas):] + legal_areas[:(batch_number * areas_per_batch) % len(legal_areas)]
    selected_areas = selected_areas[:areas_per_batch]
    
    area_prompt = "\n".join(f"- {area}" for area in selected_areas)
    
    prompt = f"""
    Generate {batch_size} unique question-and-answer pairs about UK legislation, focusing on the following areas of law:
    
    {area_prompt}
    
    Each answer must be detailed and at least two paragraphs long, including:
    - Specific references to relevant legislation and statutes
    - Important case law and precedents
    - Recent amendments or changes
    - Practical implications and real-world applications
    - Common misconceptions or areas of complexity
    
    Number each Q&A pair explicitly as {batch_number * batch_size + 1} through {batch_number * batch_size + batch_size}.
    
    Format each pair like this example:
    
    {batch_number * batch_size + 1}. Q: What are the key elements and defenses in UK criminal law regarding self-defense?
    A: Self-defense in UK criminal law is governed primarily by the Common Law and clarified through Section 76 of the Criminal Justice and Immigration Act 2008. The law establishes that a person may use such force as is reasonable in the circumstances for self-defense, defense of another, defense of property, or prevention of crime. The test for reasonableness is both subjective and objective: the defendant must have honestly believed force was necessary (subjective test) and the force used must be proportionate to the threat as the defendant perceived it (objective test).

    Key case law, including R v Williams (1987) and R v Owino (1996), has established that the defendant's actions must be judged based on the circumstances as they honestly believed them to be at the time, even if their belief was mistaken. However, as demonstrated in R v Clegg (1995), the force used must still be proportionate to the perceived threat. Recent developments, including the Crime and Courts Act 2013, have provided additional protection for householders, allowing them to use disproportionate (but not grossly disproportionate) force when defending themselves against intruders in their homes.
    """
    
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You are a legal expert specializing in UK legislation. Provide detailed, comprehensive answers that include specific references to legislation, case law, practical implications, and relevant examples. Each answer must be at least two paragraphs long."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.7,
            max_tokens=3000
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print(f"Error in batch {batch_number + 1}: {str(e)}")
        return None

def generate_qa_pairs():
    try:
        total_pairs = int(input("Enter the total number of Q&A pairs to generate: "))
        batch_size = 5
        num_batches = ceil(total_pairs / batch_size)
        
        print(f"\nGenerating {total_pairs} Q&A pairs in {num_batches} batches...")
        
        all_qa_pairs = []
        successful_pairs = 0
        
        for batch_num in range(num_batches):
            remaining_pairs = min(batch_size, total_pairs - (batch_num * batch_size))
            
            print(f"\nProcessing batch {batch_num + 1}/{num_batches} (generating {remaining_pairs} pairs)...")
            
            batch_content = generate_qa_batch(batch_num, remaining_pairs)
            
            if batch_content:
                all_qa_pairs.append(batch_content)
                successful_pairs += remaining_pairs
                print(f"Batch {batch_num + 1} completed successfully")
            else:
                print(f"Batch {batch_num + 1} failed")
            
            if batch_num < num_batches - 1:
                print("Waiting 5 seconds before next batch...")
                time.sleep(5)
        
        output_file_path = 'uk_legal_qa_pairs.txt'
        with open(output_file_path, 'w', encoding='utf-8') as file:
            output_content = '\n\n'.join(all_qa_pairs)
            file.write(output_content)
            
        print(f"\nGeneration complete!")
        print(f"Successfully generated {successful_pairs} Q&A pairs")
        print(f"Output written to {output_file_path}")
        
        print("\nFirst 500 characters of output:")
        print(output_content[:500] + "...")
                
    except Exception as e:
        print(f"\nAn error occurred: {str(e)}")
        import traceback
        print("\nFull error traceback:")
        print(traceback.format_exc())

if __name__ == "__main__":
    generate_qa_pairs()

You can run the script and input how many question-and-answer pairs you would like; I recommend generating them across separate sessions rather than in one big run. You will need to add your own OpenAI API key, and be aware that the API calls cost money.




The outputs looked like this.
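The raw text follows the numbered `N. Q: ... / A: ...` pattern requested in the prompt. Here's a minimal sketch of that shape (placeholder content for illustration, not actual model output):

```python
import re

# Illustration only: the shape of the raw text the script saves,
# not real GPT-4o-mini output.
sample = """1. Q: What statute governs self-defence in UK criminal law?
A: Self-defence is governed primarily by the common law and Section 76
of the Criminal Justice and Immigration Act 2008...

2. Q: How is a contract formed under English law?
A: A contract requires offer, acceptance, consideration and an
intention to create legal relations..."""

# Each pair starts with "<number>. Q:" -- the same structure the
# conversion script in Step 2 relies on to split questions out.
questions = re.findall(r"^\d+\.\s*Q:\s*(.+)$", sample, flags=re.MULTILINE)
print(questions)
```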


 

Step 2: Formatting the Dataset in Alpaca Style

Once I had my Q&A pairs, I needed to format them into Alpaca style, which is perfect for training instruction-following models. This involves converting each question into an "instruction" field and the answer into an "output" field.
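Each entry ends up as a three-field record: the question becomes `"instruction"`, `"input"` stays empty (there is no extra context), and the answer becomes `"output"`. A minimal example of the target shape, with the answer shortened for illustration:

```python
import json

# One record in Alpaca format. The empty "input" field is part of the
# convention: it is used for instructions that need extra context.
record = {
    "instruction": "What are the key elements of self-defence in UK criminal law?",
    "input": "",
    "output": "Self-defence is governed primarily by the common law and "
              "Section 76 of the Criminal Justice and Immigration Act 2008...",
}

print(json.dumps(record, indent=2, ensure_ascii=False))
```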

Here’s the script I used for that:

import json
import re

def convert_to_alpaca_format(input_file, output_file, debug=False):
    alpaca_data = []
    skipped = []

    try:
        with open(input_file, 'r', encoding='utf-8') as file:
            lines = file.readlines()
    except UnicodeDecodeError as e:
        raise Exception("Error reading the file. Ensure it's encoded in UTF-8.") from e

    current_instruction = ""
    current_output = ""

    for line in lines:
        line = line.strip()

        # Skip markers like "### 761."
        if re.match(r"^###\s*\d+\.", line):
            continue

        # Detect a question line (starts or contains "Q:")
        if "Q:" in line:
            # Save the previous Q&A if it exists
            if current_instruction and current_output:
                alpaca_data.append({
                    "instruction": current_instruction.strip(),
                    "input": "",
                    "output": current_output.strip()
                })
                current_instruction = ""
                current_output = ""

            # Extract the question after "Q:"
            current_instruction = line.split("Q:", 1)[1].strip()
        
        # Detect an answer line (starts with "A:")
        elif line.startswith("A:"):
            # Start a new answer
            current_output = line[2:].strip()
        elif current_output:
            # Append continuation lines to the current output
            current_output += " " + line.strip()
        else:
            # If the line doesn't fit the Q&A structure, log it for debugging
            if debug:
                skipped.append(line)

    # Append the last Q&A
    if current_instruction and current_output:
        alpaca_data.append({
            "instruction": current_instruction.strip(),
            "input": "",
            "output": current_output.strip()
        })

    # Save the output
    try:
        with open(output_file, 'w', encoding='utf-8') as out_file:
            json.dump(alpaca_data, out_file, indent=2, ensure_ascii=False)
    except Exception as e:
        raise Exception("Error writing to output file.") from e

    if debug:
        with open("skipped_lines.txt", 'w', encoding='utf-8') as debug_file:
            debug_file.write("\n".join(skipped))

    print(f"Conversion complete. Processed {len(alpaca_data)} Q&A pairs.")
    if debug:
        print(f"{len(skipped)} lines were skipped. Check 'skipped_lines.txt' for details.")

# Example usage
convert_to_alpaca_format('combined.txt', 'alpaca_formatted.json', debug=True)

This structured format made the dataset compatible with the fine-tuning process later.


 

Step 3: Cleaning and Deduplicating

One of the not-so-glamorous but super-important steps was cleaning the dataset. Duplicate entries can skew training and hurt the model’s performance, so I wrote a deduplication script to filter them out:

import json

def remove_duplicates(alpaca_file, cleaned_file):
    try:
        with open(alpaca_file, 'r', encoding='utf-8') as file:
            data = json.load(file)
    except Exception as e:
        raise Exception("Error reading the file. Ensure it's a valid JSON.") from e

    seen_questions = set()
    seen_answers = set()
    cleaned_data = []

    for entry in data:
        question = entry.get("instruction", "").strip()
        answer = entry.get("output", "").strip()

        # Check if the question or answer is a duplicate
        if question in seen_questions or answer in seen_answers:
            continue

        # If unique, add to the cleaned data and track seen questions and answers
        cleaned_data.append(entry)
        seen_questions.add(question)
        seen_answers.add(answer)

    # Save the cleaned data to a new file
    try:
        with open(cleaned_file, 'w', encoding='utf-8') as file:
            json.dump(cleaned_data, file, indent=2, ensure_ascii=False)
    except Exception as e:
        raise Exception("Error writing to the cleaned file.") from e

    print(f"Duplicates removed. Cleaned data saved to {cleaned_file}.")
    print(f"Original entries: {len(data)}. Cleaned entries: {len(cleaned_data)}.")

# Example usage
remove_duplicates('test.json', 'alpaca_cleaned_test.json')

Step 4: Fine-Tuning the Model

Once the dataset was ready, I fine-tuned the model using Unsloth, an efficient framework for fine-tuning large language models. I set up the training to focus on my cleaned Alpaca-formatted dataset. Please see my other blogs for how to set that up.
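I won't repeat the full Unsloth setup here (see my other blogs), but the key data-preparation step is rendering each Alpaca record into a single training string. Below is a sketch of the widely used Alpaca prompt template; the exact wording and the `eos_token` placeholder are conventions assumed for illustration, and in real training you would use your tokenizer's actual EOS token:

```python
# Common Alpaca-style prompt template (convention, not specific to
# this project). str.format ignores the unused "input" key on
# records where it is empty.
ALPACA_TEMPLATE = """Below is an instruction that describes a task. \
Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
{output}"""

def format_example(example, eos_token="</s>"):
    # eos_token is a placeholder; use tokenizer.eos_token in practice
    # so the model learns where a response should stop.
    return ALPACA_TEMPLATE.format(**example) + eos_token

example = {
    "instruction": "What is parliamentary sovereignty?",
    "input": "",
    "output": "Parliamentary sovereignty means Parliament is the supreme "
              "legal authority in the UK...",
}
print(format_example(example))
```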


 

Step 5: Access the Model, Dataset, and Code

Everything I built for this project is freely available, so you can explore or even replicate it for your own use cases.



 

What’s Next?

With this fine-tuned model, answering UK law-related questions has become a much more streamlined process. But I’m not stopping here. Next, I want to expand the dataset to include more edge cases and maybe even dive into case law. There’s so much potential to make this tool even better, not just for legal professionals but for anyone trying to navigate the complexities of UK law.

If you give the model a try, let me know how it works for you. I’d love to hear your feedback!
