English as an API

2023-08-18T19:00:00.000Z

Is it possible to write code that reliably turns each of these strings of text

Set the power to a hundred milliwatts.
Set power to one hundred milliwatts.
Set power to point one watts.
Change power to zero point one watts.

and other English sentences of a similar but unknown form into JSON that parses the same as this

{
    "power": 100
}

and handles other known parameters of a hypothetical system?

No.

But with a language model, few-shot learning, and a slightly different output format we can get close, and then use traditional software and UI techniques to handle the inaccuracies.

For this discussion I'm going to cut off any metaphysical reasoning about what Large Language Models (LLMs) are doing and over-simplify: all you need to consider here is that an LLM is a very good auto-complete engine, similar to the kind you probably have on your phone's keyboard.

N-shot Learning

The examples I have today all use something called "few-shot learning" without fine-tuning for this application.

Fine-tuning is a form of training a machine learning model that differs from the initial training in that it uses a dataset tailored to a specific application. Both fine-tuning and outright training modify the underlying weights of the model; for an LLM, they permanently alter the probability model of the "auto-completer" in a way that affects every prompt given to it later.

N-shot learning, whether zero-shot, one-shot or few-shot, involves feeding some number of examples of expected completions as input to a model so that a pattern is induced in the model and the most likely next completion continues the pattern as opposed to breaking the pattern. These examples are part of the prompt. This does not change the underlying "weights" of the model.

While it's dead simple to get an LLM to do tasks with few-shot learning, there's probably a low limit on the number of examples you can feed it and have it take into account accurately.
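
To make that concrete, here's a minimal Rust sketch of what few-shot prompting amounts to mechanically: the examples are nothing more than text placed ahead of the query. The function and the example pairs are my own illustration, not code from the demo.

// A minimal sketch of few-shot prompting: the "teaching" is nothing more
// than example text placed ahead of the query. The model's weights never change.
fn build_few_shot_prompt(examples: &[(&str, &str)], query: &str) -> String {
    let mut prompt = String::new();
    for (input, completion) in examples {
        prompt.push_str(input);
        prompt.push('\n');
        prompt.push_str(completion);
        prompt.push('\n');
    }
    // The query is left "unanswered" so that the most likely continuation is
    // a completion in the same format as the examples above it.
    prompt.push_str(query);
    prompt.push('\n');
    prompt
}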

What Model Was Used and Why

The model I'm using to demonstrate this is VMware's fine-tuned version of Meta's Llama 2 transformer model, specifically the version with 7 billion parameters.

The first and most important reason for using this model is that it can be run locally. Models, or systems which include models, like each version of ChatGPT, can have their weights replaced without notice; the replacement could be a better model, but it could also be one that breaks a given use case. Relying on a remote service also precludes use cases where the user's computer is not, or cannot be, connected to the Internet.

The second consideration is that Hugging Face, a kind of GitHub for machine learning models, maintains a leaderboard of LLMs assessed against a set of benchmarks. At the time, models based on Llama 2 were scoring near the top of the leaderboard. However, if you look at the parameter counts (to the extent you can decipher the various naming schemes), you're likely to notice that the ranking is heavily influenced by the number of parameters, with more parameters scoring higher. This brings us to the third reason.

The third reason is that language models take up a huge amount of memory. I already noted that the model I'm using has (about) 7 billion parameters. You should also know that this model stores its parameters as 16-bit floating point numbers, binary16 in IEEE 754 parlance. That means each parameter takes up 2 bytes, or twice as many gigabytes as there are billions of parameters. This is the smallest model in the Llama 2 series, which comes in 7 billion, 13 billion and 70 billion parameter versions. There are tools for "quantizing" LLMs, lossily compressing the parameters down to as few as 4 bits each. I didn't attempt that for this test and won't need it for this 7 billion parameter model, but it is something to keep in mind for the future.

Even without quantization, a model of this size is fairly usable on a modern laptop; mine has 32 GiB of RAM, for example. For most of the latest discrete Nvidia consumer GPUs, though, it's pushing the limit even if nothing else is using the GPU's memory, with the exception of the 24 GB RTX 4090.
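
To make the arithmetic concrete, here's a quick back-of-the-envelope calculation. The parameter counts are the nominal Llama 2 sizes, and real model files are somewhat larger than this because they carry more than just the weights.

// Back-of-the-envelope memory for the weights alone: parameters times bytes
// per parameter. Real model files are somewhat larger than this.
fn weight_gigabytes(parameters: u64, bits_per_parameter: u64) -> f64 {
    (parameters * bits_per_parameter / 8) as f64 / 1e9
}

fn main() {
    let billion: u64 = 1_000_000_000;
    for params in [7 * billion, 13 * billion, 70 * billion] {
        println!(
            "{}B parameters: {:.1} GB at 16 bits, {:.1} GB quantized to 4 bits",
            params / billion,
            weight_gigabytes(params, 16), // e.g. 7B -> ~14 GB
            weight_gigabytes(params, 4),  // e.g. 7B -> ~3.5 GB
        );
    }
}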

If you're choosing a model for your own experiments you should also consider the license for the model and what data it was trained on. For example, this model was fine-tuned by VMware for taking and responding to instructions. You will also want to consider whether you want to use a model that was trained on pirated works; for example, the original Llama paper listed a training dataset called Books3, which is a set of pirated books.

Demonstration

You can download the repository for this demonstration here.

As part of the demo two other repositories are downloaded as submodules. The first is llama.cpp, which provides a tool for converting PyTorch models to GGML models. The second is the actual Llama 2 model.

The first step in running the demo is converting the PyTorch model to a GGML model with llama.cpp. Then an initial prompt is fed into the model, and the session state after that prompt is saved so that later inference requests can resume from that point efficiently.
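
The exact calls depend on the inference library you use, so the sketch below only shows the shape of that flow, with a hypothetical Session interface standing in for the real thing: feed the expensive example prompt once, snapshot the session, and restore the snapshot for each later command.

// A hypothetical session interface standing in for whatever your inference
// library actually provides; the names and signatures here are made up.
trait Session: Sized {
    fn feed(&mut self, text: &str);                       // tokenize and ingest text
    fn complete(&mut self, max_tokens: usize) -> String;  // generate a continuation
    fn snapshot(&self) -> Vec<u8>;                        // serialize the session state
    fn restore(snapshot: &[u8]) -> Self;                  // rebuild a session from a snapshot
}

// Pay for the long example prompt exactly once...
fn prime<S: Session>(mut session: S, initial_prompt: &str) -> Vec<u8> {
    session.feed(initial_prompt);
    session.snapshot()
}

// ...then every later command resumes from that saved state instead of
// re-processing all of the examples.
fn run_command<S: Session>(snapshot: &[u8], command: &str) -> String {
    let mut session = S::restore(snapshot);
    session.feed(command);
    session.complete(8)
}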

Note that the format used for the initial prompt is shown below. It starts with a line of English text describing one of the fictional system parameters (power, duration or interval), followed by either a spelled-out non-negative number and a unit, or some informal description that maps to a specific value, like "off" meaning 0 milliseconds. That line is followed by a line in all lowercase noting which parameter was set and the value it was set to, in either milliwatts or milliseconds. This pattern alternates between English descriptions and a simple, consistent, machine-parseable format.

Set the power to fifty milliwatts.
power 50
Set power to one hundred milliwatts.
power 100
Set power to point two five watts.
power 250
Change power to zero point five watts.
power 500
Set the duration to ten milliseconds.
duration 10
Set duration to two hundred and fifty milliseconds.
duration 250
Set duration to zero point one seconds.
duration 100
Change duration to zero point four seconds.
duration 400
Set the interval to five hundred milliseconds.
interval 500
Set interval to point three five seconds.
interval 350
Set interval to point eight five seconds.
interval 850
Turn off the interval.
interval 0
Use the shortest interval.
interval 10

As part of the demo you'll see the whole initial prompt printed out, followed by the individual prompts for inference, in order to be explicit about the state leading into any given inference. After all the inference prompts have generated output there will be a report about each inference. The most important elements of that report are discussed below.

Typical output looks something like this:

Inference stats:
InferenceResult {
    elapsed: 10.121904283s,
    prompt: "Turn interval to two hundred fifty milliseconds.\n",
    inferred_text: "interval 250",
    inferred_parameter_set: Ok(
        Interval(
            250,
        ),
    ),
    inference_statistics: InferenceStats {
        feed_prompt_duration: 6.615519305s,
        prompt_tokens: 205,
        predict_duration: 10.121688579s,
        predict_tokens: 211,
    },
}

The newline \n is fed in as part of the prompt so that the language model doesn't have a chance to continue the English line and has to move on to generating the machine-parseable line. The inferred_parameter_set field indicates that the command was successfully parsed (Ok) and that the plan of action is to set the system's interval parameter to 250 milliseconds.
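
Parsing that line back out of the generated text is deliberately trivial. Here's a sketch of what that parsing can look like; the enum mirrors the names in the report above, though the demo's actual implementation may differ in its details.

// Turn the model's "parameter value" line into a typed plan of action. The
// names mirror the report above; the demo's actual code may differ in detail.
#[derive(Debug, PartialEq)]
enum ParameterSet {
    Power(u32),    // milliwatts
    Duration(u32), // milliseconds
    Interval(u32), // milliseconds
}

#[derive(Debug, PartialEq)]
struct ParameterSetConversion; // the generated text didn't follow the format

fn parse_parameter_set(line: &str) -> Result<ParameterSet, ParameterSetConversion> {
    let mut parts = line.split_whitespace();
    let name = parts.next().ok_or(ParameterSetConversion)?;
    let value: u32 = parts
        .next()
        .ok_or(ParameterSetConversion)?
        .parse()
        .map_err(|_| ParameterSetConversion)?;
    // Anything after "name value" means the format wasn't followed.
    if parts.next().is_some() {
        return Err(ParameterSetConversion);
    }
    match name {
        "power" => Ok(ParameterSet::Power(value)),
        "duration" => Ok(ParameterSet::Duration(value)),
        "interval" => Ok(ParameterSet::Interval(value)),
        _ => Err(ParameterSetConversion),
    }
}

An output like "power on" fails at the number parse, which is the Err(ParameterSetConversion) case you'll see shortly.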

But because LLMs are stochastic in nature, even when well-tuned they don't always do exactly what you want.

If you run this demo enough times you'll get output where the language model attempts to issue further commands in English to itself (but is cut off, for reasons explained later). In this case you'll note that a valid output is still contained within the larger erroneous output and could be recovered with traditional programming techniques.

    inferred_text: "interval 250\nInterval is",

You might also encounter cases where the language model decides to do something somewhat novel with the format and includes a parameter value that isn't a number. This is a case where the rest of the program should catch the nonsense output and reject it as unrecoverable. That doesn't mean you can't re-seed the pseudo-random number generator and attempt to regenerate the output, if inference is fast enough; a sketch of that retry loop follows the excerpt below.

    prompt: "Set the power to one hundred milliwatts.\n",
    inferred_text: "power on\nSet the duration to ten",
    inferred_parameter_set: Err(
        ParameterSetConversion,
    ),
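
Here's a rough sketch of that retry idea. Both closures are hypothetical stand-ins: infer would re-run the cached session with the given sampler seed, and accept would be whatever validation the rest of the program performs.

// Retry a failed inference with a different seed. Both closures are
// hypothetical stand-ins: `infer` re-runs the cached session with the given
// sampler seed, `accept` is whatever validation the rest of the program does.
fn infer_with_retries(
    mut infer: impl FnMut(u64) -> String,
    accept: impl Fn(&str) -> bool,
    max_attempts: u64,
) -> Option<String> {
    for seed in 0..max_attempts {
        let text = infer(seed);
        if accept(text.as_str()) {
            return Some(text);
        }
    }
    None // give up and ask the user to rephrase instead
}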

The most insidious inference errata, or what I'll call "inferrata", are outputs that follow the format exactly: if you didn't know the prompt or the valid values for the plan of action, they would be indistinguishable from a correct translation.

    prompt: "Set the duration to three hundred thirty milliseconds.\n",
    inferred_text: "duration 130",
    inferred_parameter_set: Ok(
        Duration(
            130,
        ),
    ),

Determining what to do with a stream of plans of action in which incorrect plans are mixed with correct ones requires more traditional engineering, risk management and user interface design.

Rejecting Inferrata

Since a component which cannot always operate correctly, even in principle, sits between the user and their task, there should also be some consideration of whether an unreliable or "fuzzy" controller should be able to make a potentially final or hazardous decision on its own.

Automatically Rejecting Inferrata

To know whether an action a fuzzy controller is taking is something that needs the permission of a human, we first need to know what decision it's actually trying to make. In some cases we can use traditional programming techniques to recover a meaningful decision from generated output with minor issues.

If you know the longest output you can expect from the LLM, put a limit on how many tokens it can produce. Without such a limit the LLM chooses when to stop producing text, and if it generates output that convinces it to start writing prose it can go on for a long time, keeping your processing elements, and whatever system your user is interacting with, busy for quite some time. If the LLM does get into that prose-writing state, a sufficiently short token limit will cut it off in time. If your program has to invoke some method to request each token in turn, you can also stop as soon as your expected output format unambiguously ends, in addition to setting a token limit.
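
What that looks like in code depends on your inference library. As a sketch, assuming the library reports each generated token through a callback (a common API shape; the names here are hypothetical):

// Cut generation off early, assuming the inference library reports each
// generated token through a callback (a common API shape; names are hypothetical).
enum Feedback {
    Continue,
    Halt,
}

struct StopEarly {
    generated: String,
    tokens_seen: usize,
    max_tokens: usize,
}

impl StopEarly {
    fn on_token(&mut self, token: &str) -> Feedback {
        self.generated.push_str(token);
        self.tokens_seen += 1;
        // Our format is a single line, so the first newline unambiguously ends it.
        if self.generated.contains('\n') {
            return Feedback::Halt;
        }
        // Failsafe: never let the model ramble past a hard token budget.
        if self.tokens_seen >= self.max_tokens {
            return Feedback::Halt;
        }
        Feedback::Continue
    }
}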

If you don't have a way to stop as soon as your format unambiguously ends, you'll also want to use an output format that allows you to cut off extra output that can't be part of the command. For example, if your format only permits a single line of output and the LLM starts generating a second line of text, you can split the first line from the rest of the output and use only the first line for parsing.
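
In Rust that cut is a one-liner:

// Keep only the line that can contain a command and discard any rambling.
fn first_line(output: &str) -> &str {
    output.lines().next().unwrap_or("")
}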

Use formats that don't lend themselves to "rambling" in a valid output format. For example, I experimented with an output format in JSON where the output would look like {"power":100}, and the LLM would produce something like {"power":100, "duration":250, and so on, presumably because a JSON object with just one key-value pair is unusual and once it places a comma it keeps going.

Use an output format which is trivial to parse with a hand-written program. You could choose from the universe of existing serialization formats that have text representations and throw an existing parser at the output, but a general-purpose serialization format will be more structured than you need. Structured output would be nice for programmers, since it requires little work on their part, but it puts more work on the LLM than is necessary. We're using the LLM to transform natural language input into any format that can be easily parsed; balancing parentheses, square brackets, curly braces and so on is not what it's here for.

The last method I have for you is that once you've parsed the LLM's output, you should run specific checks on what you've parsed. If you expect an argument to be an unsigned number, parse it as one, or parse it as a number and then check that it isn't negative. You should also check that the argument is in a sensible range and isn't setting a parameter to twice what the system is even capable of, or to some value you don't allow your system to operate at. And if you're setting multiple parameters simultaneously, check all of the values together before setting each one in turn, only to find out one of them wasn't permissible.
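
As a sketch, with an enum of the same shape as the parsing sketch earlier and limits that are entirely made up for this hypothetical system:

// Range checks after parsing. The enum has the same shape as the parsing
// sketch earlier and every limit here is made up for this hypothetical system.
enum ParameterSet {
    Power(u32),    // milliwatts
    Duration(u32), // milliseconds
    Interval(u32), // milliseconds
}

fn within_limits(plan: &ParameterSet) -> bool {
    match plan {
        // Say the hardware tops out at 500 mW...
        ParameterSet::Power(mw) => *mw <= 500,
        // ...pulses no longer than one second...
        ParameterSet::Duration(ms) => *ms <= 1_000,
        // ...and the interval is either off (0) or between 10 ms and 2 s.
        ParameterSet::Interval(ms) => *ms == 0 || (*ms >= 10 && *ms <= 2_000),
    }
}

// When several parameters arrive together, validate all of them before
// applying any, so a bad value can't leave the system half-configured.
fn within_limits_all(plans: &[ParameterSet]) -> bool {
    plans.iter().all(within_limits)
}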

Manually Rejecting Inferrata

Everything so far has ignored the role of the user; we just assumed the input prompt came from nowhere and only affected the internals of some piece of software. But now that we've transformed that prompt into a plan of action, someone needs to decide whether that plan is appropriate before executing it.

This decision process is necessary because even if a fuzzy controller were perfectly aligned with the goals of its user, it doesn't have the depth of sensory experience a human user ought to have to know whether a piece of software is at imminent risk of causing harm. And while modern software systems often put software or firmware between the user and the tool that can cause harm, these are usually the simplest possible mechanisms for conveying permission: a button on a GUI, a footswitch closing a circuit and so on.

But not every plan of action in a given system necessarily needs explicit permission from the user. A fuzzy controller might produce plans of action which can't cause harm alongside plans which might. A plan of action that can't cause harm can be carried out without human input, even if it was inferred erroneously and what the user wanted was some other plan of action.

Fuzzy Controller Safety Tree

This tree structure may serve as a helpful rule-of-thumb for when a plan of action requires human permission. These are not hard-and-fast rules and may vary for your application.

The meaning of "harm", "expensive", "side-effect" and "easy" will vary from application to application. This is particularly difficult to talk about in the abstract when a plan of action has a side-effect that changes a parameter in a way that makes later misuse more hazardous, but doesn't immediately perform a potentially harmful action itself. In that circumstance you might choose, at one extreme, to require immediate permission, or at the other, to merely indicate what was changed. In that circumstance particularly, but for all the items discussed here, human discretion is still necessary.
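
As an illustration only, that rule-of-thumb might be encoded something like this; every field, branch and threshold is application-specific and invented for the example.

// An illustrative encoding of the rule-of-thumb. Every field, branch and
// threshold is application-specific and invented for this example.
struct PlanAssessment {
    can_cause_harm: bool,  // could carrying this out hurt someone or something?
    is_expensive: bool,    // costly to carry out or to reverse?
    has_side_effect: bool, // changes state beyond the immediate request?
    is_easy_to_undo: bool, // can the user trivially revert it?
}

enum Disposition {
    ExecuteSilently,
    ExecuteAndIndicate,
    AskForPermission,
}

fn disposition(plan: &PlanAssessment) -> Disposition {
    // Anything hazardous or expensive waits for an explicit human go-ahead.
    if plan.can_cause_harm || plan.is_expensive {
        return Disposition::AskForPermission;
    }
    // A side-effect that can't easily be undone also deserves a human decision.
    if plan.has_side_effect && !plan.is_easy_to_undo {
        return Disposition::AskForPermission;
    }
    // Harmless but state-changing: carry it out, but make the change visible.
    if plan.has_side_effect {
        return Disposition::ExecuteAndIndicate;
    }
    Disposition::ExecuteSilently
}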