I'm interested to know if anyone is using fine-tuning to train a model on proprietary or in-house codebases and documentation.
RAG solutions seem to have their limitations, and fine-tuning might be a more effective approach.
How much effort is required to turn code into something one can use for fine-tuning?
I’ve actually found the opposite. At work, we went from a fine-tuned model to a RAG system for internal and external documentation and a generic coding-focused model for code.
Fine-tuning against in-house code seems like a small gain over a base model and search. It’s unlikely your code is so unique, special, and big that it’s hard to get results from a base model. You’ll be pinned to a certain version of a certain model, and you won’t be able to upgrade to future models nearly as quickly. And of course you’re also fighting time: every commit changes the code, so you’d have to continually fine-tune to keep up.
A RAG model might still struggle with a super vague question like “where does foo call bar with baz set”, but it’s unlikely fine-tuning would handle that either. This is where static code search by symbols really should be used.
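For Python code, even the stdlib gets you most of the way there. A minimal sketch of that kind of symbol search (not any particular tool's implementation; foo/bar are just the placeholders from the question above):

```python
# Minimal sketch: answer "where does foo call bar" with static symbol
# search instead of embeddings, using only the stdlib.
import ast
import pathlib

def find_calls(root, caller, callee):
    """Yield (file, line) where a function named `caller` calls `callee`."""
    for path in pathlib.Path(root).rglob("*.py"):
        tree = ast.parse(path.read_text(), filename=str(path))
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef) and node.name == caller:
                for sub in ast.walk(node):
                    if (isinstance(sub, ast.Call)
                            and isinstance(sub.func, ast.Name)
                            and sub.func.id == callee):
                        yield path, sub.lineno

for path, line in find_calls("src", "foo", "bar"):
    print(f"{path}:{line}")
```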
There are frameworks for graph-based RAG that mix both approaches. One LLM encodes info as a knowledge graph, gradually building up an ontology. Another LLM is used to query this knowledge graph by emitting speculative queries. As the database grows, the second LLM is fine-tuned again and again with example queries using the ontology the first LLM came up with.
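A rough sketch of that loop (not any particular framework; the `llm` function is a stand-in for whatever completion endpoint you use, and the "speculative query" is reduced to picking a seed node):

```python
# Hypothetical sketch of the two-LLM graph-RAG loop described above.
import json
import networkx as nx

def llm(prompt):  # placeholder: call your model of choice here
    raise NotImplementedError

graph = nx.DiGraph()  # the knowledge graph / ontology being built up

def ingest(doc):
    # LLM #1 turns raw text into (subject, relation, object) triples.
    triples = json.loads(llm("Extract [subject, relation, object] triples "
                             "as a JSON list of lists:\n" + doc))
    for s, r, o in triples:
        graph.add_edge(s, o, relation=r)

def ask(question):
    # LLM #2 emits a speculative query: here, just picking a seed node.
    seed = llm("Known entities: %s\nName the one most relevant to: %s"
               % (list(graph.nodes), question)).strip()
    facts = [(u, d["relation"], v) for u, v, d in graph.edges(seed, data=True)]
    return llm("Answer from these facts only: %s\nQ: %s" % (facts, question))
```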
Would you mind naming some of the frameworks?
RAG definitely is helpful! Fine-tuning imo is extremely powerful, but it's still relative alchemy - technically GPT-4, Claude, any large model is a finetune of a base model! Reasoning finetuning is also very powerful!
Tbh the hardest part is the lifecycle - i.e. new data, updating, serving, etc. That seems to be the biggest issue.
Is anyone having success with iteratively feeding chunks of code (or other documents) to an LLM for search? I understand 'haystack' issues with LLMs are quite bad, but RAG is quite bad too, and a lot of that haystack research seems to involve feeding in very large contexts.
Well, why not both? If you've already got a tuned model why not use RAG on that to get even better results? It already knows the big picture, it just needs the details so it doesn't have to hallucinate them.
Yes RAG combined is pretty cool! Fyi I'm planning to add optimized RAG directly into unsloth as well!
> I'm interested to know if anyone is using fine-tuning to train a model on proprietary or in-house codebases and documentation.
I've done it. 1/2 the team thought it was great 20% of the time, 1/2 the team hated it from day 0. I used roughly 500K lines of code.
> How much effort is required to turn code into something one can use for fine-tuning?
Very little to moderate: less than 200 lines of Python, using Qwen FIM, HF, llama.cpp, and the llama.cpp code extension.
> RAG solutions seem to have their limitations, and fine-tuning might be a more effective approach.
The only problem either way is keeping the information up to date, RAG just adds more cost to the inference process (which at my dev speed is pretty important).
> How much effort is required to turn code into something one can use for fine-tuning?
Fine-tuning with "fill in the middle" (FIM) means taking a file, cutting out some text in the middle, and asking the model to guess what was there - there is a Hugging Face example that will have you doing it in an hour or less. Your ops team saying "No, you can't literally copy all the code to a single folder" is probably the biggest hurdle (advise them you'll do it in CI, and then they can stand up a FIM training endpoint that accepts a CSV - pretty easy).
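For reference, a single FIM sample is just the file sliced into three pieces and wrapped in sentinel tokens. A minimal sketch, assuming Qwen-style FIM tokens (check your model card for the exact ones):

```python
# Sketch of building one FIM training sample from a source file.
import random

def make_fim_example(code):
    a, b = sorted(random.sample(range(len(code)), 2))
    prefix, middle, suffix = code[:a], code[a:b], code[b:]
    # The model is trained to produce `middle` after the fim_middle token.
    return ("<|fim_prefix|>" + prefix +
            "<|fim_suffix|>" + suffix +
            "<|fim_middle|>" + middle)
```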
Oh fill in the middle is definitely smart especially for codebases!!
Love unsloth btw, use it for some other stuff at work, GRPO stuff was fun :)
I know it's coming but "mUlTi GpU PlZ" :pleading: <3
I would like to see more knowledgeable people with experience talk about this.
Is it just a matter of assembling Q/A pairs like “What’s class X?” → “class X { … }”?
Do you really need to do this training on the base model, which means you have to fine-tune chat behavior back onto it afterward?
How does this work?
I've not done fine tuning on code bases but I have done other fine tuning.
You will generally get better results when you fine-tune the base model on your data.
Since you still want to use it with the chat template in the end, you fine-tune the base model with the chat template with your specific data.
From there you'll have a lora that knows your data alright, but still doesn't really work for chatting.
You take that lora, merge it with the base model. Let's call this the stage model.
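In code, the stage step is just a peft merge. A sketch with placeholder paths:

```python
# Merge the data lora into the base model to produce the "stage model".
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("./base-model")
stage = PeftModel.from_pretrained(base, "./data-lora")  # lora from the step above
stage = stage.merge_and_unload()  # bake the lora into the weights
stage.save_pretrained("./stage-model")
```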
Then you use mergekit to merge the base model with both the stage model and the chat model. I used the TIES merge method in the past. Now you have your final model.
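A sketch of that merge as a mergekit TIES config driven from Python - the density/weight values are illustrative, not what I actually used, and paths are placeholders:

```python
# Write a mergekit TIES config and run it via mergekit's CLI.
import pathlib
import subprocess

config = """\
merge_method: ties
base_model: ./base-model
models:
  - model: ./stage-model
    parameters: {density: 0.5, weight: 0.5}
  - model: ./chat-model
    parameters: {density: 0.5, weight: 0.5}
dtype: float16
"""
pathlib.Path("ties.yml").write_text(config)
subprocess.run(["mergekit-yaml", "ties.yml", "./final-model"], check=True)
```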
I use vLLM for inference, and needed access to multiple fine tunes on only a single set of hardware. So from that point I go and take the base model and my final model and extract a new lora. I also take the base model and chat model and extract another lora for that. Then I load up vLLM with the base model and as many of the fine tune loras I need + the chat lora.
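The serving side looks roughly like this (paths and lora names are placeholders; note vLLM applies one lora per request, so you pick the adapter per call):

```python
# Serve one base model with multiple extracted loras in vLLM.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(model="./base-model", enable_lora=True, max_loras=4)
out = llm.generate(
    ["How does the billing service retry failed charges?"],
    SamplingParams(max_tokens=256),
    lora_request=LoRARequest("final-lora", 1, "./loras/final"),
)
print(out[0].outputs[0].text)
```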
The only time this hasn't worked is when the chat model adds a bunch of new tokens on top of the base model. If I remember right, there was an issue with that.
This has worked well for me in the past.
Yes!! The trick is the merging of model weights!!
Thank you, this was a great explanation!
Very welcome, I wish you luck!
Yes, QA pairs do work - I found training on your dataset concatenated with general datasets works well!
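Something like this with HF datasets (the general set here is just an example, and both sets need the same columns before concatenating):

```python
# Mix an in-house QA set with a general instruct set for finetuning.
from datasets import load_dataset, concatenate_datasets

mine = load_dataset("json", data_files="internal_qa.jsonl", split="train")
general = load_dataset("yahma/alpaca-cleaned", split="train")
train = concatenate_datasets([mine, general]).shuffle(seed=42)
```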
Generally it is recommended that you fine-tune if you want to shape the output style: if you want it to only output JSON, or just output JSDoc, etc. To add knowledge, RAG is a better idea and easier to keep updated as the code changes. If RAG does not give back good results, that's a problem with the retrieval part, which can be measured and improved. We're building documentation + RAG systems for enterprise codebases and this is what we've seen work best.
We are, at Scribe[1]. We do it to make sense of knowledge workflows on computers and predict the next step in the process (our software points out where in the DOM a user might need to interact next). We fine-tune with tons of JSON data and DOM data. I’m sure doing it with code is no more complicated.
[1] https://scribehow.com/library/scribe-agent
We see a lot of this in large orgs! The main issue imo is actually the selection of chat templates - there's a lot of people who use a template for finetuning then totally forget to use it at inference.
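The boring fix is to route both training text and inference prompts through the same apply_chat_template call so they can't drift (a sketch; the model name is just an example):

```python
# Use one chat template for both training data and inference prompts.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
msgs = [{"role": "user", "content": "What does class FooBar do?"}]

# Training: include the target answer and render the full conversation.
train_text = tok.apply_chat_template(
    msgs + [{"role": "assistant", "content": "It batches billing retries."}],
    tokenize=False,
)
# Inference: same template, plus the generation prompt for the assistant turn.
infer_text = tok.apply_chat_template(msgs, tokenize=False,
                                     add_generation_prompt=True)
```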
A lot of financial, legal, and health companies do fine-tuning! Reasoning finetuning via GRPO is also very powerful since you don't need any CoT data in between - just inputs and outputs!
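A minimal GRPO sketch with trl, assuming just prompts plus a programmatic reward (the dataset, model, and reward function here are illustrative):

```python
# GRPO needs only prompts and a reward on completions - no CoT labels.
import json
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def reward_valid_json(completions, **kwargs):
    # Reward 1.0 if the completion parses as JSON, else 0.0.
    scores = []
    for c in completions:
        try:
            json.loads(c)
            scores.append(1.0)
        except (TypeError, ValueError):
            scores.append(0.0)
    return scores

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_valid_json,
    args=GRPOConfig(output_dir="grpo-out"),
    train_dataset=load_dataset("trl-lib/tldr", split="train"),
)
trainer.train()
```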