So, what if you want your personal AI to be just that - a very close facsimile of a human collaborator? Perhaps the best person you can think of to be that companion is a trusted friend, colleague, mentor or famous figure: an AI that expresses itself and responds in deeply human ways, one that personifies someone's tone of voice and decision making and responds to you in a specific and personal way.

To train an LLM on the personality of another person, you might need a large corpus of reference material to work from - typically, examples of a person's written work do a fantastic job when trying to replicate those characteristics in a synthetic response. But if you don't have access or rights to someone else's intellectual property, then turning inward is a valid place to start - after all, for some of us, the leap from internal dialogue to a conversation with our digital selves isn't really that far...

But how do you get the AI to think and respond like you do? I'll short-cut to the answer on this one: you can get the majority of the way to a close likeness for tone of voice through detailed system prompts (there's a short sketch of this after the lists below). But decision making and phrasing when responding requires a level of training. So I thought I'd give fine-tuning a model a go.

**When is fine-tuning a model a good idea?**

1. **You need the model to learn specific behaviours, formats, or styles**
   - You want it to adopt a **consistent tone of voice**, structure, or domain-specific style.
   - Example: Training a customer support bot to write in your company's exact tone or response format.
2. **You have a clearly defined and repeated task that generic models don't perform well on**
   - And few-shot prompting or RAG doesn't get you close enough.
   - Example: Legal contract summarisation in your firm's preferred format with specific clause weighting.
3. **You have access to high-quality, labelled data for your use case**
   - At least a few thousand good examples, especially for small/efficient fine-tuning (LoRA, QLoRA, etc.).
   - Example: Historical support tickets matched to resolved replies, annotated legal texts, or high-quality technical writing examples.
4. **You need the model to perform offline or at the edge**
   - You can't depend on APIs and want a locally running, domain-specialised model.
5. **Your use case is not easily solved by prompt engineering, embeddings, or RAG**
   - If you've exhausted in-context learning and still can't hit accuracy, coherence, or response style requirements.
6. **You want model privacy or customisation that vendors can't offer**
   - Example: You're embedding sensitive company IP or strategy into the model's base reasoning.

**Fine-Tuning Is Not a Good Idea When…**

1. **You're trying to teach the model factual knowledge or documents**
   - Use **RAG (Retrieval-Augmented Generation)** instead - it's cheaper, faster, and more flexible.
2. **You only have a few examples**
   - Try **prompt engineering**, **few-shot prompting**, or **custom instructions** before going nuclear.
3. **You just want to change the output format or tone occasionally**
   - Use prompt templates or system messages instead. Much more efficient.
4. **You're trying to make the model better at general tasks**
   - You're unlikely to outperform the base model's training unless you have _massive_ compute and data.
5. **You haven't fully explored lighter-weight methods like adapters or instruction tuning**
   - Full fine-tuning is expensive. Methods like LoRA or SFT on a smaller model might be enough.
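As a point of reference, here's what the "detailed system prompt" route mentioned earlier looks like in practice. This is a minimal, hypothetical sketch using the OpenAI Python client; the model name and the persona text are placeholders, and any chat-capable API would work the same way.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A detailed persona prompt gets you most of the way to a likeness in tone.
# The persona below is entirely made up for illustration.
PERSONA = """You are writing as Sam (a hypothetical persona).
Voice: warm, direct, lightly self-deprecating. Short sentences.
Habits: opens with a question, uses concrete examples, avoids jargon.
Never claim experiences Sam hasn't had."""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        {"role": "system", "content": PERSONA},
        {"role": "user", "content": "Explain why long-term planning matters in AI governance."},
    ],
)
print(response.choices[0].message.content)
```

A prompt like this gets surprisingly close on tone, but it can't change how the model decides what to say - which is where fine-tuning comes in.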
Clearly fine-tuning a model isn't going to be the silver bullet for a personal AI - but I wanted to gain some understanding of the fine-tuning process and see if I could create a 'personal model': one that I could run locally, giving a closer degree of likeness that I could combine with RAG and other techniques. To do so I set up an experiment of controlled questions, gathered a set of representative documents I had written, and began to explore the possibilities of the LoRA method.

Low-Rank Adaptation (LoRA) trains a small set of new parameters from training data that can be used to extend a base model, without changing the underlying weights. This makes training fast - only the small additional parameters are learned - and keeps the general knowledge of the base model 'whole'. That's a good fit for fine-tuning a personal likeness, as the objective was simply to overlay likeness and tone of voice, adding some new knowledge without losing general capabilities.

The training data is typically question-and-answer 'pairs': examples of what you might prompt an LLM with, together with a model answer. To create the training data, I deployed AI again to do the heavy lifting - feeding example documents into a project context and asking the LLM to create a JSON file of 'pairs' for the express purpose of training an LLM to respond like myself, using the example format of the training tool. Doing so saved hours of data preparation and ensured a smooth upload of the data to the tools. Example:

"instruction": "What is Cathedral Thinking and why does it matter?",
"response": "Cathedral Thinking refers to long-term strategic planning that spans generations. It's essential in AI governance to ensure sustainability beyond immediate trends.",
"context": "From 'Cathedral Thinking'"

Training requires some heavy compute power, so I opted to try a couple of different methods. First, Hugging Face: Hugging Face have an 'AutoTrain' app that runs in their 'Spaces' capability - think of this as deploying a tool into virtualised computing space that you rent, letting you spin up powerful and expensive GPUs for an affordable fraction of time. Since each 'space' is somewhat dedicated to a single tool and step in the method, I ended up with a series of sequential spaces and scripts (which also required some AI-assisted learning and development) to create a training pipeline.

This iterative training journey took me through multiple phases, each with its own model base, methodology, and lessons learned. Here's a look at how the training evolved over time:

### **Training Infrastructure & Pipeline Setup**

Training a model - even a small one - requires significant compute. To keep things cost-effective, I explored two parallel routes: the Hugging Face AutoTrain pipeline and OpenAI's fine-tuning interface. Hugging Face's Spaces feature allowed me to rent high-powered GPUs on demand, building a modular and reusable training pipeline:

1. Generate training pairs (JSON) from my writing using Claude.
2. Train LoRA adapters using Hugging Face AutoTrain.
3. Merge adapters with the base model using a custom script.
4. Convert the merged model to GGUF format for local deployment.
5. Install and test the model in MSTY (my personal AI interface).

This setup gave me flexibility to swap datasets, adjust parameters, and compare outputs quickly across iterations. The sketches below give a rough illustration of what steps 1 to 4 look like in practice.
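First, the training pairs (step 1). This is a minimal sketch of how generated pairs could be collected into a JSONL file ready for upload. The field names mirror the example above; the exact column names a training tool expects depend on its configuration, so treat this as a format illustration rather than a definitive AutoTrain schema.

```python
import json

# Hypothetical pairs in the same shape as the example above.
pairs = [
    {
        "instruction": "What is Cathedral Thinking and why does it matter?",
        "response": (
            "Cathedral Thinking refers to long-term strategic planning that spans "
            "generations. It's essential in AI governance to ensure sustainability "
            "beyond immediate trends."
        ),
        "context": "From 'Cathedral Thinking'",
    },
    # ...hundreds more pairs generated from my own writing...
]

# Most fine-tuning tools accept JSON Lines: one JSON object per line.
with open("training_pairs.jsonl", "w", encoding="utf-8") as f:
    for pair in pairs:
        f.write(json.dumps(pair, ensure_ascii=False) + "\n")
```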
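Step 2 is where the LoRA adapters are trained. AutoTrain hides most of this behind its interface, but the underlying idea looks roughly like the following peft/transformers sketch; the rank, dropout, and target modules shown are illustrative values, not the exact settings I used.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "mistralai/Mistral-7B-v0.1"  # illustrative base model; Phase 1 used a smaller model
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# LoRA injects small trainable matrices alongside the frozen base weights.
lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update (illustrative)
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.1,                      # dropout on the adapter path
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base model
# ...a standard training loop or Trainer over the instruction/response pairs goes here...
```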
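Steps 3 and 4 were the custom part of the pipeline: merging the trained adapters back into the base weights and converting the result for local use. Here's a rough sketch of that merge script, assuming the adapter was saved to a local directory (the paths are hypothetical); the GGUF conversion itself happens afterwards with llama.cpp's conversion tooling.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "mistralai/Mistral-7B-v0.1"   # illustrative base model
adapter_dir = "./lora-adapter"           # output of the AutoTrain run (hypothetical path)
output_dir = "./merged-model"

base = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Load the LoRA adapter on top of the base model, then fold it into the weights.
model = PeftModel.from_pretrained(base, adapter_dir)
merged = model.merge_and_unload()

merged.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

# Step 4 (outside Python): convert the merged checkpoint to GGUF for local tools
# like MSTY, e.g. using llama.cpp's convert_hf_to_gguf.py pointed at output_dir.
```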
### **Fine-Tuning Challenges**

As with any machine learning task, fine-tuning presented common pitfalls:

- **Overfitting** - Smaller datasets and high epoch counts led the model to memorise the training data too closely, echoing key terms like "MSTY" disproportionately.
- **Underfitting** - Insufficient training or too low a learning rate meant the model failed to internalise style or structure meaningfully.
- **Catastrophic Forgetting** - Without careful adapter use, some experiments caused the model to forget general language skills in favour of narrow behavioural mimicry.

These issues shaped not only the training settings, but also how I curated and expanded the dataset to create a more realistic simulation of how I think and write.

### **Phase 1: SMOL 1.7B (LoRA Fine-Tuning via Hugging Face)**

I began with the lightweight SMOL 1.7B model using Hugging Face's AutoTrain tool and the Low-Rank Adaptation (LoRA) method. The first training attempt showed clear signs of overfitting - repeating specific concepts like "MSTY" excessively, and mixing up key ideas due to the limited breadth of the dataset. The second iteration broadened the dataset, but resulted in underfitting - the model lost grip on the key content and tone entirely.

A useful insight emerged: while tone could be shaped somewhat through prompt tuning, deeper behavioural imitation would require more consistent, high-quality training data with sufficient variation.

### **Phase 2: Mistral 7B (Adapter Fine-Tuning, Custom Merge)**

Switching to Mistral 7B gave the model more expressive capacity. I refined the dataset further, focusing on capturing tone, reasoning structure, and journaling workflows. This phase also saw improved LoRA settings (e.g. increased dropout, higher r values) and a better training pipeline, now split into modular scripts for each phase.

After resolving technical friction around tokenizer loading and adapter merging, I was able to generate a GGUF version of the merged model for local use in MSTY. This version saw noticeable gains in output tone and reasoning structure, particularly when combined with a retrieval-augmented generation (RAG) layer. The fine-tuning gave it "me-like" phrasing, while RAG kept the facts grounded.

### **Phase 3: GPT-4o Mini (OpenAI Fine-Tuning)**

Next, I tested OpenAI's own fine-tuning interface using their GPT-4o Mini model. The training process was smoother, with a more integrated UI for uploading training data and configuring training runs. Using the same dataset, the initial results were clean, polished, and stylistically close - but often too concise. The second run, with adjusted settings and longer-form prompts in the dataset, produced much better results. Responses aligned more strongly with my voice, and the journaling answers began to reflect structure and thought patterns more effectively.

However, hallucinations still occurred - especially for terms like "Euformia" which had only partial presence in the training data. The solution here is contrastive examples and clarification within the training dataset itself.

### **Lessons Learned**

1. **Style Transfer ≠ Knowledge Transfer**
   - Fine-tuning excels at shaping tone and phrasing, but struggles with factual grounding; RAG remains essential.
2. **The Model Learns What You Feed It**
   - If the training set is overly concise, the model will be too. If you want structured articles, train it on structured articles.
3. **Merging Is a Skill**
   - Adapter training via Hugging Face is powerful but fiddly; managing tokenizers, merging, and format conversion requires care.
4. **Fine-Tuning Alone Isn't Enough**
   - The best results come from combining fine-tuning (for tone and reasoning) with RAG (for factual relevance) and base model prompting (for form).

In the end, what I built wasn't perfect, but the experiment was meaningful. It revealed the need to layer multiple approaches: fine-tuning for character and personality, RAG to think with context, and prompt engineering to shape structure and form. Each method has its place. Together, they create something more coherent and expressive: a personalised thinking partner that genuinely reflects how I work, write, and reason.