Fine-Tuning with OpenAI: Process and Requirements


Fine-tuning an OpenAI model means customizing a pre-trained model (such as GPT-4) with your own data via OpenAI’s API. OpenAI has improved this process significantly, making it more developer-friendly. Fine-tuning can turn a generic GPT into a domain expert that speaks your company’s language, but doing it right takes careful preparation, say experts from Belitsoft, a custom software development company. Here’s exactly how the process works and what results startups see.

Belitsoft offers full-cycle generative AI implementation services, including choosing AI model architecture (RAG or LLM), configuring infrastructure (cloud or on-premises), fine-tuning models with industry-specific data, integrating AI software with other organizational systems, testing, and deployment.

Fine-Tuning Process Overview

OpenAI’s documentation lays out three main steps to create a fine-tuned model:

  1. Prepare Training Data

Assemble a file of examples that demonstrate the behavior you want from the model. For GPT-style models, these are typically prompt-completion pairs (for example: prompt: “Question: Explain X… Answer:” completion: “(the ideal answer)”). OpenAI also allows fine-tuning with conversation-style data (user messages paired with desired assistant responses). The data should be in JSONL format and adhere to OpenAI’s guidelines: each example must stay within the token limit and must not contain disallowed content.
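To make the format concrete, here is a minimal sketch of building a chat-format JSONL file in Python. The company name, questions, and answers are invented for illustration:

```python
import json

# Two illustrative training examples in OpenAI's chat format.
# The company name, questions, and answers are made up for this sketch.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a support agent for Acme Corp."},
            {"role": "user", "content": "How do I reset my password?"},
            {"role": "assistant", "content": "Open Settings > Security and choose 'Reset password'. A reset link will be emailed to you."},
        ]
    },
    {
        "messages": [
            {"role": "system", "content": "You are a support agent for Acme Corp."},
            {"role": "user", "content": "Can I export my invoices?"},
            {"role": "assistant", "content": "Yes. Go to Billing > Invoices and click 'Export CSV'."},
        ]
    },
]

# JSONL means one JSON object per line
with open("training_data.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```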

  2. Upload and Fine-Tune via API

Using OpenAI’s API or command-line tool, you upload the training file and kick off a fine-tuning job. You specify the base model and some parameters (like the number of training epochs).
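A minimal sketch of these two calls with the official Python SDK; the base model name is an example, so check OpenAI’s documentation for the options currently available for fine-tuning:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1) Upload the training file
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune",
)

# 2) Start the fine-tuning job on a base model
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",   # example base model
    hyperparameters={"n_epochs": 3},  # optional; OpenAI picks defaults otherwise
)
print(job.id, job.status)
```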

OpenAI then performs the fine-tuning on their servers. During this process, OpenAI passes the training data through safety filters – they use their Moderation API and even GPT-4-based moderators to detect unsafe content in the training data that conflicts with their policies. (For example, if your dataset included hate speech examples, those would likely be flagged or rejected.) This step ensures the fine-tuned model doesn’t learn toxic or disallowed behaviors. Fine-tuning jobs can take anywhere from a few minutes (for small datasets) to several hours for larger ones. You get an email or notification when the job is done.

  3. Use the Fine-Tuned Model

Once the job completes, the fine-tuned model is assigned a name and becomes available via the OpenAI API. You can now make requests to it just as you would to the base model, and it responds in the style or with the knowledge from your training data. OpenAI keeps the model in your account, and, importantly, the fine-tuned model is private to you: it is not used to update OpenAI’s base models, and other users cannot access it. OpenAI has stated that data submitted for fine-tuning belongs to the customer and is not used to train OpenAI’s broader models, which is a key assurance for enterprises worried about data confidentiality.
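Calling the model looks exactly like a normal chat completion; only the model name changes. The "ft:" model ID below is hypothetical (the real one is returned when your job finishes):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    # "ft:" model IDs are returned when the job finishes; this one is invented
    model="ft:gpt-4o-mini-2024-07-18:acme-corp::abc123",
    messages=[{"role": "user", "content": "How do I reset my password?"}],
)
print(response.choices[0].message.content)
```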

How Much Data Do You Need to Fine-Tune GPT

Volume of Data. Fine-tuning generally needs at least hundreds of examples to be effective, though far fewer than training from scratch. OpenAI observed that with fewer than 100 examples, you can start seeing benefits over prompt engineering. Each time you double the number of training examples, model quality tends to improve roughly linearly (up to a point of diminishing returns). For example, OpenAI’s research showed that even 50 examples can significantly boost performance on a specific task, and going from 50 to 100, 200, and 400 examples yields steady gains. Many real-world fine-tuning cases use a few thousand examples: Viable fine-tuned on thousands of customer feedback snippets to reach high accuracy, and Keeper Tax kept adding 500 examples every week to continuously improve. There is no hard minimum enforced by the API (technically, even a dozen examples can fine-tune a model), but qualitatively, more high-quality data means a better model.

Quality and Format. The data should closely represent the inputs and outputs you expect in deployment. If you’re building a chatbot, training on actual chat logs of good conversations is ideal. Data should be cleaned of errors, because the model latches onto patterns in the training data (including typos or incorrect answers). OpenAI’s fine-tuning endpoint expects data in a conversation format (with roles like user/system/assistant) if you want to tune behavioral aspects like tone or format. There is also a limit on each training example’s length: each example must fit within the model’s maximum context length, measured in tokens. Very large fine-tuning jobs (many millions of tokens) can be expensive.
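A rough pre-upload length check is easy to script with the tiktoken library. This is a sketch: chat formatting adds a few extra tokens per message, and the per-example limit here is an assumption you should replace with the documented limit for your base model:

```python
import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by many GPT-4-era models
MAX_EXAMPLE_TOKENS = 65536  # assumption: look up the actual limit for your base model

with open("training_data.jsonl") as f:
    for i, line in enumerate(f, start=1):
        example = json.loads(line)
        text = " ".join(m["content"] for m in example["messages"])
        n_tokens = len(enc.encode(text))  # rough estimate of the example's length
        if n_tokens > MAX_EXAMPLE_TOKENS:
            print(f"Example {i} is too long: {n_tokens} tokens")
```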

Rate Limits. Fine-tuned models have the same throughput limits as the base model for your account: if your account allows X requests per minute on the base model, your fine-tuned version shares that same quota. So you don’t get extra capacity by fine-tuning; it’s about quality, not throughput. For very high-volume use, you may need to request rate limit increases or use OpenAI’s enterprise arrangements.

Safety and Moderation. An important aspect is that OpenAI attempts to ensure fine-tuning doesn’t produce an unsafe model. They run automated checks on the training data, and the job fails if you try to fine-tune with content that violates their policies (e.g., extremism or sexually explicit content involving minors). Moreover, the fine-tuned model inherits the base model’s safety mitigations. For example, if you fine-tune GPT to answer questions in a specific domain, it should still refuse disallowed requests (like instructions to do something harmful) just as the base model would; OpenAI confirms that the default safety features are preserved. However, there is a risk: if your fine-tuning data itself contains biased or toxic language, the model can learn that style. So you must curate training data carefully from a safety perspective. OpenAI’s monitoring helps mitigate this by flagging problematic training inputs.
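You can also pre-screen your own dataset with the same Moderation API before uploading, so problems surface on your side first. OpenAI runs its own server-side checks regardless; this is just a sketch of a local pre-check:

```python
import json
from openai import OpenAI

client = OpenAI()

# Pre-screen each training example with the Moderation API
with open("training_data.jsonl") as f:
    for i, line in enumerate(f, start=1):
        example = json.loads(line)
        text = " ".join(m["content"] for m in example["messages"])
        result = client.moderations.create(input=text)
        if result.results[0].flagged:
            print(f"Example {i} flagged:", result.results[0].categories)
```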

What Startups Gain From Fine-Tuning

OpenAI and its community have reported numerous success stories highlighting what fine-tuning enables.

Improved Prompt Adherence and Format Customization. Fine-tuning can make the model follow instructions much more strictly than zero-shot or few-shot prompting. For instance, OpenAI noted you can have a model that always responds in German if that behavior is in your fine-tuning data. This was harder to achieve with prompt engineering alone: a fine-tuned model effectively has those instructions baked in. Tests with GPT fine-tuning showed that teams could reduce prompt size by up to 90% by moving instructions into the model rather than sending them with each request. Shorter prompts not only cut cost but also reduce the chance the model ignores instructions (since they’re internalized). The Indeed case above is a concrete example: by fine-tuning, they got the model to produce the desired output without needing lengthy, example-laden prompts each time.

Cost and Latency Reduction. A side effect of shorter prompts and more relevant responses is lower usage cost and faster responses. OpenAI pointed out that fine-tuning can save costs at runtime because you can omit large system prompts or few-shot examples. Across thousands of requests, that adds up to significant savings. Latency improves too (there are fewer tokens to process). Many startups fine-tune specifically to optimize inference efficiency at scale. For example, an e-commerce chatbot fine-tuned to answer product questions may not need a long context prompt each time, making each chat snappier and cheaper to serve. This enables scaling up user volume without a proportional cost increase.

Domain-Specific Accuracy Gains. Perhaps the biggest draw is that fine-tuning can inject proprietary or domain-specific knowledge that the base model lacks. OpenAI’s base models are trained on broad internet data, which may not include, say, your company’s internal wiki or the latest technical documentation in a niche field. By fine-tuning on a custom dataset, the model’s effective knowledge or skill in that area improves. We saw this with Harvey: base GPT-4 is a great general reasoner, but it didn’t “know” the entirety of case law. After Harvey fine-tuned (and even did intermediate training) on legal texts, the model could cite legal precedents and use terminology far better, leading to attorneys strongly preferring the fine-tuned model’s answers over the original’s. Another example: SK Telecom (SKT) worked with OpenAI to fine-tune GPT-4 on telecom customer service data. They reported a 35% improvement in conversation summarization quality and 33% better intent recognition compared to base GPT-4 for their Korean telecom domain. Customer satisfaction scores went up as a result (from 3.6 to 4.5 out of 5). These are significant jumps in performance that can be directly attributed to fine-tuning with domain-specific examples. Such improvements often mean the difference between a model that’s “okay” for an enterprise use case and one that’s production-grade.

Extended Functionality. Fine-tuning can also extend a model’s capabilities somewhat. While you cannot increase the model’s fundamental size or memory, you can chain it with retrieval-augmented generation (RAG) and fine-tune it to make better use of the retrieved data. OpenAI has mentioned that fine-tuning combined with techniques like RAG is a powerful approach.
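The combination is straightforward in code: retrieve relevant passages, then pass them to the fine-tuned model as context. In this sketch, retrieve() is a placeholder for your own document search, and the fine-tuned model ID is hypothetical:

```python
from openai import OpenAI

client = OpenAI()

def retrieve(question: str) -> str:
    """Placeholder: swap in your own search over internal documents
    (e.g., a vector database lookup)."""
    return "...relevant passages from your knowledge base..."

def answer_with_rag(question: str) -> str:
    # Fetch supporting context, then let the fine-tuned model answer with it
    context = retrieve(question)
    response = client.chat.completions.create(
        model="ft:gpt-4o-mini-2024-07-18:acme-corp::abc123",  # hypothetical model ID
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```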

Process in Practice – What a Startup Does. A startup doesn’t just fine-tune once and call it a day. They usually do several rounds. First, they run an initial fine-tuning with the data they have. Then, they test how well the model performs. If it’s not good enough, they go back, fix or expand their dataset, and fine-tune again.

OpenAI now gives tools to help manage this process.

  • Epoch checkpointing. This means that after each training cycle (epoch), the system saves that model state. If things start going wrong (like overfitting), you can roll back to an earlier checkpoint instead of starting over.
  • Validation metrics. While fine-tuning, you can see metrics that show whether the model is improving or not. This helps you pick the best version, rather than guessing.
  • Weights & Biases (W&B) is a popular tool developers use to track and visualize all their training runs, hyperparameters, and results. OpenAI integrated with W&B, so startups can easily compare different fine-tuning runs and see which one performed best (a short sketch of checking job events and attaching W&B follows this list).
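A sketch of what this monitoring looks like through the Python SDK. The job and file IDs are hypothetical, and the W&B integration block assumes that integration is enabled for your OpenAI organization:

```python
from openai import OpenAI

client = OpenAI()
job_id = "ftjob-abc123"  # hypothetical ID returned when you created the job

# Validation metrics and progress messages surface as job events
events = client.fine_tuning.jobs.list_events(fine_tuning_job_id=job_id, limit=10)
for event in events.data:
    print(event.message)

# Attaching a Weights & Biases project at job creation
# (assumes the W&B integration is enabled for your OpenAI organization)
job = client.fine_tuning.jobs.create(
    training_file="file-abc123",  # hypothetical uploaded file ID
    model="gpt-4o-mini-2024-07-18",
    integrations=[{"type": "wandb", "wandb": {"project": "my-finetunes"}}],
)
```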

When the fine-tuned model is ready, you just use the same API calls you used before — but now you point to your custom model name. Everything else stays the same. 

If you’re looking for expertise in LLM development projects, bespoke LLM training, or AI chatbot development, Belitsoft’s engineers are ready to discuss your project requirements. 
