They test two fine-tuning tasks in the article: reliable output formatting and custom tone. These are two tasks (reliable output formatting in particular) that are regularly advertised as areas where fine-tuning an LLM should work. The goal is not to change what the LLM knows, but to change how it communicates what it knows. In theory, the user wants to leverage the LLM's existing knowledge base, just delivered in an output format that is more useful to them.
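For reference, teaching an output format this way comes down to the training data. A minimal sketch of what such a dataset might look like, assuming OpenAI's documented chat fine-tuning format (the `{"messages": [...]}` JSONL shape is real; the example content and filename are invented for illustration):

```python
import json

# Hypothetical examples teaching the model to always answer in a fixed
# JSON schema. The {"messages": [...]} structure follows OpenAI's chat
# fine-tuning format; the actual content here is made up.
examples = [
    {
        "messages": [
            {"role": "system",
             "content": 'Reply only with JSON: {"answer": ..., "confidence": ...}'},
            {"role": "user", "content": "What is the capital of France?"},
            {"role": "assistant",
             "content": '{"answer": "Paris", "confidence": "high"}'},
        ]
    },
]

# Fine-tuning uploads expect JSONL: one JSON object per line.
with open("formatting_dataset.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

Nothing about the model's knowledge is touched; every example just demonstrates the wrapper the answer should arrive in.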

The hard question, IMO, is when it makes sense to fine-tune an LLM to update its knowledge, and how much data that takes. I have not seen anyone show a real example of success here, and I wonder whether it is nearly as difficult as training the LLM from scratch or whether it is a feasible fine-tuning use case.

Is this something like short-term vs. long-term memory? The context window is the LLM's short-term memory: you can tell it to do things or quickly define something, and it "learns" very quickly, even from a single example or sentence, but it forgets immediately once the work is done. Fine-tuning, by contrast, commits the knowledge into its weights and gives it a "deeper" understanding? The cost is that it takes more effort and energy to do so?

If so, say in the future we have an LLM with a 100K-token context window, plus a subsystem that notices when some piece of knowledge keeps being repeated in the context and then stores it for fine-tuning while the LLM is not doing inference. Basically a mirror of the way we humans work? Is that possible? An LLM that constantly improves and can adapt to new knowledge?

One thing I love to see here, mainly because I was going to make it myself, is the chart on latency.

GPT-4 is practically unusable unless you spend $10k+ a month and have an enterprise account.

No real end user wants to wait 20-40 seconds for a response, only to find it was 80% of what they wanted.

I'm surprised that so few samples made such a big difference. Perhaps the learning rate is jacked up? Otherwise, the last couple of batches during regular training would set the overall model tone.
Does the majority of the "power" of these fine-tunes lie in the dataset? So once GPT-4 fine-tuning is out (this fall), could they in theory just reuse the exact same datasets for a GPT-4 fine-tune?
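If the training-file format carries over, reusing a dataset would in principle just mean pointing a new job at a different base model. A hedged sketch (the file ID is a placeholder, the field names mirror OpenAI's fine-tuning job parameters, and whether GPT-4 fine-tuning will accept the identical file is an assumption):

```python
def finetune_job_params(base_model, training_file_id, n_epochs=3):
    """Build an OpenAI-style fine-tuning job request body (sketch only)."""
    return {
        "model": base_model,
        "training_file": training_file_id,
        "hyperparameters": {"n_epochs": n_epochs},
    }

# "file-abc123" is a placeholder for an already-uploaded dataset.
job_gpt35 = finetune_job_params("gpt-3.5-turbo", "file-abc123")
# Hypothetical: same file, different base model, once GPT-4 fine-tuning opens.
job_gpt4 = finetune_job_params("gpt-4", "file-abc123")
```

Only the base model differs between the two requests; the dataset and hyperparameters carry over unchanged, which is exactly the "power lives in the dataset" scenario.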