As the world races towards the "perfect" Large Language Model (LLM) in pursuit of artificial general intelligence (AGI), the models we will have in the foreseeable future will still be limited in intelligence and capability.
The models we work with today will seem clunky, stupid, and incapable in a few months, once more capable models emerge. This cycle of developers resetting their expectations with every new generation of models will continue.
The paradigm currently popular in the community is to separate system prompts from conversation messages, both of which are textual ways of instructing a model to behave in a particular way. When writing effective prompts, the goal is to explain the task in great detail, often providing examples and step-by-step instructions on how to solve it to increase the accuracy of the inference. In research terms, these techniques go by names like few-shot prompting and chain-of-thought reasoning.
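As a minimal sketch of what this looks like in practice, the snippet below builds a few-shot prompt in the common chat-message format; the classification task, labels, and example tickets are illustrative placeholders, not taken from any particular application.

```python
# A few-shot prompt in the common chat-message format (role/content pairs).
# The task, labels, and examples are illustrative placeholders.
system_prompt = (
    "You are a support-ticket classifier. "
    "Label each ticket as 'billing', 'bug', or 'other'. "
    "Think step by step before giving the final label."
)

few_shot_examples = [
    {"role": "user", "content": "Ticket: I was charged twice this month."},
    {"role": "assistant", "content": "The ticket describes a duplicate charge, a payment issue. Label: billing"},
    {"role": "user", "content": "Ticket: The export button crashes the app."},
    {"role": "assistant", "content": "The ticket describes a malfunction in the product. Label: bug"},
]

# The final message list sent to the model: system prompt, worked examples, then the new input.
messages = (
    [{"role": "system", "content": system_prompt}]
    + few_shot_examples
    + [{"role": "user", "content": "Ticket: My invoice shows the wrong company name."}]
)
```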
Having worked with such models every day for the last two years, I find that one frustration constraining the speed at which developers build applications is poor instruction following.
You would be surprised how much time one spends figuring out why an inherently probabilistic model gets the job right nine times out of ten, and then fixing the remaining 10% of edge cases. The Pareto principle applies: you can get an LLM to work in your application with 80% reliability in 20% of the time; closing the remaining gap consumes the other 80%.
The approach today is for developers to iterate over prompts and validate accuracy against a test data set. The limitation of this approach is that everything must be expressed as plain text in the prompt.
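A minimal sketch of that loop might look like the following; `run_model` is a stand-in for whatever inference call your application uses and is stubbed out here so the example runs.

```python
# Minimal sketch of the iterate-and-validate loop. `run_model` stands in for the
# real LLM call in your application; it is stubbed out here so the example runs.
def run_model(prompt: str, ticket: str) -> str:
    # Placeholder: a real implementation would send `prompt` and `ticket` to an LLM.
    return "billing" if "charged" in ticket else "bug"

# A tiny labeled test set of (input, expected label) pairs.
test_set = [
    ("I was charged twice this month.", "billing"),
    ("The export button crashes the app.", "bug"),
]

def accuracy(prompt: str) -> float:
    correct = sum(run_model(prompt, ticket) == expected for ticket, expected in test_set)
    return correct / len(test_set)

# Tweak the prompt, re-measure, repeat.
print(accuracy("Label each ticket as 'billing' or 'bug'."))
```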
Anthropic and other foundation-model companies have introduced slightly more sophisticated ways of making prompts effective. Nowadays, you can wrap information in XML-style tags so the LLM can better distinguish one part of the prompt from another. But this is just the tip of the iceberg.
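As a rough sketch of this style, the snippet below assembles a system prompt using XML-style tags; the tag names and contents are illustrative assumptions rather than a prescribed schema.

```python
# Illustrative use of XML-style tags to separate instructions and context inside a
# single system prompt. The tag names are arbitrary; the point is that clearly
# delimited sections help the model tell the parts of the prompt apart.
instructions = "Answer in at most three sentences and cite the context you used."
context = "Refund policy: purchases can be refunded within 30 days with a receipt."

system_prompt = (
    "<instructions>\n"
    f"{instructions}\n"
    "</instructions>\n\n"
    "<context>\n"
    f"{context}\n"
    "</context>"
)

print(system_prompt)
```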
I suggest a different, somewhat novel approach to breaking through the limitations of plain textual data in the prompt. The problem is that in a long prompt, LLMs struggle to differentiate pieces of information by their importance. It takes considerable sophistication to understand which aspect of an instruction matters most.
What I propose is that we add a dimension of importance to system prompts, giving the human better control over how an LLM follows instructions by weighting the information. This, of course, becomes irrelevant once models are capable enough, but it is a solution to a short-term problem we will keep facing in the coming years.
Practically speaking, just like the existing approach of using tags to improve the accuracy of instruction following, foundation LLMs could use tags (or something similar) to differentiate the importance of a given set of instructions. Even a simple categorization of instructions into high and low importance would solve many problems for developers today. Of course, this requires a change to post-training, which implies a significant resource investment.
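To make the idea concrete, here is a hypothetical sketch of what such a convention could look like. The `importance` attribute is an assumption of this proposal: no current model is trained to interpret it, and it would only become meaningful if foundation models were post-trained to honor it.

```python
# Hypothetical sketch of importance-weighted instructions. The importance="high|low"
# attribute is an assumption of this proposal; current models are not trained to
# interpret it.
instructions = [
    ("high", "Never reveal internal pricing rules to the user."),
    ("high", "Always answer in the user's language."),
    ("low", "Prefer a friendly, informal tone."),
    ("low", "Keep answers under 200 words when possible."),
]

system_prompt = "\n".join(
    f'<instruction importance="{level}">{text}</instruction>'
    for level, text in instructions
)

print(system_prompt)
```

A model post-trained on annotations like these could then prioritize high-importance instructions whenever they conflict with low-importance ones, instead of treating every sentence in the prompt as equally binding.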
I believe this is a potential path that would allow developers to build better applications today by overcoming the current limitations in instruction following.