Lexical Vectors

2024-07-29T04:41:06.000Z

Introduction

When I wrote about using RocksDB as a database through its lexically ordered key-value store, I only used keys made of tags consisting of programmer-controlled ASCII strings concatenated with fixed-width integers. And I'm sure you can get a lot of mileage out of mixing and matching what I talked about in that article. However, there are more complex cases which the intuitive approaches you might build on top of that don't cover. Later I'll introduce a serialization format which can be made to preserve the lexical ordering of a data structure. And I'll show how that serialization format can preserve lexical ordering in an arbitrarily deeply nested manner.

Possible Approaches

So let's assume that you have an ASCII string terminated by a null character '\0'. Turning this into a serialized form which preserves the lexical ordering of ASCII strings is as easy as casting each character to a byte. Deserialization just requires that either the string indicates its own ending or the byte container indicates where the string ends.

Null Terminators

Vector of Strings

Now let's say that we want to serialize a vector of ASCII strings terminated by a null character '\0'. For this method let's assume that external metadata noting the overall length of the serialized data exists so we don't read past the end.

If we wanted to store such a data structure we can just concatenate the strings in the vector and serialize the result as it is. When comparing two data structures where one is a prefix of the other, the shorter serialization is a byte prefix of the longer one, so the byte-level sorter orders it first due to its length, which matches the expected order. If any string within the vector is shorter than the one it's being compared against, the null character '\0' will ensure it's sorted earlier. And if there is a difference in an actual character then that affects the lexical order in the expected way.

Notably this works with UTF-8 encodings as well, as long as your collation order is the same as the order of the byte representation.
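Concretely, here is a minimal sketch in Rust (the names are hypothetical, not from any code referenced in this article) showing that a vector whose serialization is a prefix of another's sorts first at the byte level:

// Serialize a vector of strings by concatenating each string with its
// '\0' terminator; the raw bytes then compare in the expected order.
fn serialize(strings: &[&str]) -> Vec<u8> {
    let mut out = Vec::new();
    for s in strings {
        out.extend_from_slice(s.as_bytes());
        out.push(0); // null terminator marks the end of each string
    }
    out
}

fn main() {
    let a = serialize(&["app"]);
    let b = serialize(&["apple"]);
    // "app" ends with '\0' (0x00), which sorts before 'l' (0x6C),
    // so the shorter string sorts first, as expected.
    assert!(a < b);
}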

Vector of Vectors of Strings

Now let's try to extend the above approach. Consider serializing a vector of vectors of ASCII strings terminated by a null character '\0'.

With this change there's no way to reliably deserialize the structure. And that also means there is no way to impose a one-to-one mapping from data structures to lexical order. For example [["a"], ["b"]] and [["a", "b"]] would serialize to the same byte sequence ['a', '\0', 'b', '\0'] despite [["a"], ["b"]] sorting before [["a", "b"]].
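To make the collision concrete, here is a small sketch (reusing the hypothetical serialize helper from above) showing both structures flattening to identical bytes:

fn serialize(strings: &[&str]) -> Vec<u8> {
    let mut out = Vec::new();
    for s in strings {
        out.extend_from_slice(s.as_bytes());
        out.push(0); // '\0' terminator for each string
    }
    out
}

fn main() {
    // Flatten each outer vector by concatenating the serializations of its
    // inner vectors.
    let a: Vec<u8> = [["a"], ["b"]].iter().flat_map(|v| serialize(v)).collect();
    let b: Vec<u8> = [["a", "b"]].iter().flat_map(|v| serialize(v)).collect();
    assert_eq!(a, b); // both are ['a', 0, 'b', 0]
}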

So this idea is a no-go for general data structures.

Tagging Bits

Since we can't just add a terminator to the end of the leaf data structures to indicate when they end, how about inserting something throughout the data structure recursively to unambiguously indicate payload data and its termination?

Let's start with a vector of bytes. Before each bit of each byte, insert a 1-bit. This preserves the ability to deserialize the value and preserves the lexical order, since all serialized values have the inserted 1-bits at the same locations and no serialized value swaps how early it ends relative to other values that undergo the same tagging process. We can also add some fixed number of 0-bits to the end of the value. Since any longer value would have a 1-bit at that location, the value will still be lexicographically sorted earlier than other values which have it as a prefix. This also causes it to signal its own ending.

Now let's try an example again with just a string. So if we have a string "Hello" and turn it into characters we'd get ['H', 'e', 'l', 'l', 'o', '\0']. And splitting this conversion process up into tiny steps gets us this.

[0x48, 0x65, 0x6C, 0x6C, 0x6F, 0x00]
[0x4, 0x8, 0x6, 0x5, 0x6, 0xC, 0x6, 0xC, 0x6, 0xF, 0x0, 0x0]
[0b0100, 0b1000, 0b0110, 0b0101, 0b0110, 0b1100, 0b0110, 0b1100, 0b0110, 0b1111, 0b0000, 0b0000]
[0b1011_1010, 0b1110_1010, 0b1011_1110, 0b1011_1011, 0b1011_1110, 0b1111_1010, 0b1011_1110, 0b1111_1010, 0b1011_1110, 0b1111_1111, 0b1010_1010, 0b1010_1010]
[0xBA, 0xEA, 0xBE, 0xBB, 0xBE, 0xFA, 0xBE, 0xFA, 0xBE, 0xFF, 0xAA, 0xAA]

Then, choosing to append eight 0-bits as a byte to indicate the end of the encoded sequence, we get this.

[0xBA, 0xEA, 0xBE, 0xBB, 0xBE, 0xFA, 0xBE, 0xFA, 0xBE, 0xFF, 0xAA, 0xAA, 0x00]
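As a sanity check, here is a minimal sketch of this tag-bits encoding (hypothetical names) which reproduces the bytes above:

// Insert a 1-bit before every payload bit, append a terminating run of
// eight 0-bits, and pack the result back into bytes (MSB first).
fn tag_bits(payload: &[u8]) -> Vec<u8> {
    let mut bits = Vec::new();
    for &byte in payload {
        for i in (0..8).rev() {
            bits.push(true);                 // tag bit
            bits.push((byte >> i) & 1 == 1); // payload bit
        }
    }
    bits.extend(std::iter::repeat(false).take(8)); // terminating 0-bits
    bits.chunks(8)
        .map(|chunk| chunk.iter().fold(0u8, |acc, &b| (acc << 1) | b as u8))
        .collect()
}

fn main() {
    let encoded = tag_bits(b"Hello\0");
    assert_eq!(
        encoded,
        [0xBA, 0xEA, 0xBE, 0xBB, 0xBE, 0xFA, 0xBE, 0xFA, 0xBE, 0xFF, 0xAA, 0xAA, 0x00]
    );
}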

And if we wanted to store a vector of strings we could take each string encoded as a series of bytes as above, concatenate those encodings and then apply the same encoding again. The only 0-byte we'd see in that format is the one ending the top-level vector, but reversing the outer encoding by chopping off the terminating 0-byte and pulling out the 1-bit tags would leave us with a byte array delimited by 0-bytes which unambiguously mark the end of each encoded string.

Going back to the encoded vector of encoded strings, if we wanted a vector of those we could just concatenate the encoded vectors of encoded strings and apply the encoding once more to that overall byte sequence. So this scheme can serialize and deserialize any byte sequence while preserving the lexical ordering of the data it serializes.

This scheme works for any depth of encoding since it essentially hides the payload from the higher levels of encoding. That works for showing that lexical order can be composed in data structures during serialization. But it's incredibly impractical since every level of composition of a container type doubles the size of the underlying data.

Tagging Bytes

The obvious way to make the tagging approach more space efficient is to use fewer tags. If we can assume the serialized format is stored in octets then, instead of tagging bits, we can tag bytes.

With this approach you take all the unencoded bits and concatenate them together, prefix every chunk of 8 bits with a 1-bit and then re-chunk the result to store in bytes. Bits that are otherwise undefined are 0. A trailing 0-byte is added if the re-chunked sequence of bits has a length evenly divisible by eight. The patterns for various lengths are as follows. The 0b binary radix indicator has been removed for clarity and p indicates a payload bit.

00000000

1ppppppp p0000000

1ppppppp p1pppppp pp000000

1ppppppp p1pppppp pp1ppppp ppp00000

1ppppppp p1pppppp pp1ppppp ppp1pppp pppp0000

1ppppppp p1pppppp pp1ppppp ppp1pppp pppp1ppp ppppp000

1ppppppp p1pppppp pp1ppppp ppp1pppp pppp1ppp ppppp1pp pppppp00

1ppppppp p1pppppp pp1ppppp ppp1pppp pppp1ppp ppppp1pp pppppp1p ppppppp0

1ppppppp p1pppppp pp1ppppp ppp1pppp pppp1ppp ppppp1pp pppppp1p ppppppp1 pppppppp 00000000

And so on.

Since all unencoded payloads with the same length have the same encoded form in the listing above, and the payload bits stay in order, encoding won't affect comparison with other encoded payloads of the same length. And any payload which is a prefix of a longer payload will look like the longer payload up until the first 0-bit of the encoding container is encountered.

For example, [0] sorts before [0, 0], they encode to 1000_0000 0000_0000 and 1000_0000 0100_0000 0000_0000 respectively and the second byte would correctly sort the two payloads.
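A minimal sketch of this tag-bytes encoding (hypothetical names) which reproduces the two encodings above:

// Prefix every 8 payload bits with a 1-bit, then pad with 0-bits to the next
// byte boundary; if already on a boundary, add a full zero byte so the
// ending is still signalled.
fn tag_bytes(payload: &[u8]) -> Vec<u8> {
    let mut bits = Vec::new();
    for &byte in payload {
        bits.push(true); // tag bit announcing another 8 payload bits
        for i in (0..8).rev() {
            bits.push((byte >> i) & 1 == 1);
        }
    }
    let pad = 8 - bits.len() % 8;
    bits.extend(std::iter::repeat(false).take(pad));
    bits.chunks(8)
        .map(|chunk| chunk.iter().fold(0u8, |acc, &b| (acc << 1) | b as u8))
        .collect()
}

fn main() {
    // [0] and [0, 0] from the example above.
    assert_eq!(tag_bytes(&[0]), [0b1000_0000, 0b0000_0000]);
    assert_eq!(tag_bytes(&[0, 0]), [0b1000_0000, 0b0100_0000, 0b0000_0000]);
    assert!(tag_bytes(&[0]) < tag_bytes(&[0, 0]));
}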

This encoding asymptotically grows by a factor of 9/8 instead of 2 for every level of nesting, which is much better. But you could imagine that instead of tagging every byte you could instead tag every 2 bytes or 3 bytes or any number of bytes of your choosing for more space efficiency. However, this also requires that your underlying payload is sized so that it comes in multiples of 2 or 3 or however many bytes.

Lexical Vectors

So what if we took that last approach to an extreme and instead of tagging bits and bytes we directly encoded bit sequences in lexical order? We can do that by generating dictionaries of bit sequences up to a particular length.

Up to 0 bits:

[]

Up to 1 bit:

[]
[0]
[1]

Up to 2 bits:

[]
[0]
[0, 0]
[0, 1]
[1]
[1, 0]
[1, 1]

Up to 3 bits:

[]
[0]
[0, 0]
[0, 0, 0]
[0, 0, 1]
[0, 1]
[0, 1, 0]
[0, 1, 1]
[1]
[1, 0]
[1, 0, 0]
[1, 0, 1]
[1, 1]
[1, 1, 0]
[1, 1, 1]

And so on.

Note that the dictionary up to n bits is the (n - 1) bit dictionary with 2^n new entries. So this means that, since the 0 bit dictionary has one entry, the number of entries in the n bit dictionary can be calculated as 2^0 + 2^1 + ... + 2^n = 2^(n + 1) - 1.

And that means a code for the n bit dictionary requires at least n + 1 bits. So before we even know how long the payload might be we can tune the storage efficiency of the encoding: each chunk stores up to n payload bits in n + 1 code bits. That efficiency can be driven arbitrarily high for very long payloads. But if you're encoding a lot of very short payloads then choosing a dictionary with very long strings isn't efficient in practice. Also, while the asymptotic efficiency is tunable, the size still grows exponentially with the number of repeated encodings, by a constant factor of your choosing.

So for practical purposes let's split the difference by saying that payloads get split into chunks of n bits, with the last chunk potentially being shorter, and that chunks which aren't the maximal length indicate that the payload terminates there. And to make sure that payloads which divide evenly into the chosen number of bits get terminated, a single empty chunk must be appended to them. For example, these bit sequences get chunked for the 3 bit dictionary as follows.

[] => [[]]

[0] => [[0]]

[0, 0] => [[0, 0]]

[0, 0, 0] => [[0, 0, 0], []]

[0, 0, 0, 0] => [[0, 0, 0], [0]]

[0, 0, 0, 0, 0] => [[0, 0, 0], [0, 0]]

[0, 0, 0, 0, 0, 0] => [[0, 0, 0], [0, 0, 0], []]

[0, 0, 0, 0, 0, 0, 0] => [[0, 0, 0], [0, 0, 0], [0]]

And so on.
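A minimal sketch of that chunking rule (hypothetical names):

// Split a payload into chunks of at most `n` bits; when the payload divides
// evenly, append an empty chunk so the ending is still signalled.
fn chunk(bits: &[bool], n: usize) -> Vec<Vec<bool>> {
    let mut chunks: Vec<Vec<bool>> = bits.chunks(n).map(|c| c.to_vec()).collect();
    if bits.len() % n == 0 {
        chunks.push(Vec::new()); // terminating empty chunk
    }
    chunks
}

fn main() {
    // [0, 0, 0, 0] splits into [[0, 0, 0], [0]] for the 3 bit dictionary,
    // while [0, 0, 0] needs the trailing empty chunk: [[0, 0, 0], []].
    assert_eq!(chunk(&[false; 4], 3), vec![vec![false; 3], vec![false]]);
    assert_eq!(chunk(&[false; 3], 3), vec![vec![false; 3], vec![]]);
}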

Note that you can complicate this scheme by using varying chunk sizes if your payloads have a favorable length distribution. For example, instead of splitting the payload into chunks of sizes at most [7, 7, 7, 7, 7, ...] you can use [7, 7, 15, 31, 63, ...] as long as the encoder and decoder agree on the dictionary size being used. However, this extension isn't necessary to explain the principles and is an exercise left to the reader.

Encoding and Decoding

The first thing to note about the lexical vector format is that while the payload was broken up into chunks there wasn't a description of how those chunks are actually represented in the encoding. In order to maintain the lexical order in the encoding, the codes assigned to the entries in a given dictionary need to be strictly monotonically increasing, meaning that no two entries share a code and that the numerical value of the codes only increases for later entries in the dictionary.

The most straightforward assignment to use here is to just start at 0 and increment for each entry. This also happens to leave a leftover code when using binary codes that can be used as a non-representable entry past the end of the dictionary. That code without an entry would be invalid to try to decode but is convenient in the context of range-scanning.

So how do we encode and decode a chunk? The most direct way would be to generate the dictionary directly and use it as a lookup table. That would work but would get very impractical very quickly since the dictionary grows exponentially. So we have to determine some pattern that can be turned into a short algorithm instead.

Let's recreate the dictionary listings we did before, but treat each one like a binary tree, annotating every entry with the offsets to the entries which append either a 0 or a 1 to the end of the unencoded payload.

Up to 0 bits:

[] => none

Up to 1 bit:

0 = []  => +1, +2
1 = [0] => none
2 = [1] => none

Up to 2 bits:

0 = []     => +1, +4
1 = [0]    => +1, +2
2 = [0, 0] => none
3 = [0, 1] => none
4 = [1]    => +1, +2
5 = [1, 0] => none
6 = [1, 1] => none

Up to 3 bits:

 0 = []        => +1, +8
 1 = [0]       => +1, +4
 2 = [0, 0]    => +1, +2
 3 = [0, 0, 0] => none
 4 = [0, 0, 1] => none
 5 = [0, 1]    => +1, +2
 6 = [0, 1, 0] => none
 7 = [0, 1, 1] => none
 8 = [1]       => +1, +4
 9 = [1, 0]    => +1, +2
10 = [1, 0, 0] => none
11 = [1, 0, 1] => none
12 = [1, 1]    => +1, +2
13 = [1, 1, 0] => none
14 = [1, 1, 1] => none

So if you wanted to encode [1, 0, 1] using the 3 bit dictionary you would:

Start at the empty sequence [] with code 0.

Read the 1-bit and follow the +8 offset to code 8, which is [1].

Read the 0-bit and follow the +1 offset to code 9, which is [1, 0].

Read the 1-bit and follow the +2 offset to code 11, which is [1, 0, 1].

If you ran out of unencoded payload bits you would just stop on that entry since it already encodes exactly the payload input so far.

Suspiciously, all the offsets appear to be powers of 2. If you imagine that the dictionary is constructed inductively it becomes easy to see why. If you take an n bit dictionary and make a copy of it you go from 2^(n + 1) - 1 entries to 2^(n + 2) - 2 entries in total. If you prepend a 0-bit to all the entries in the original dictionary and prepend a 1-bit to all the entries in the copy you almost have the next larger dictionary. For the 1-bit prepended dictionary you also have to add the size of the original dictionary, 2^(n + 1) - 1, to the codes. You can then join the 0-bit and 1-bit dictionaries with the 0-bit dictionary before the 1-bit dictionary. But we're still missing an entry, the empty bit sequence, which goes at index 0 and causes the index of all the other entries to increase by 1 and means we have to add 1 to all the other codes. That results in the dictionary having 2^(n + 2) - 1 entries, which matches the formula for the size shown earlier but with n + 1 instead of n. Let's see what that would look like to build the 3 bit long dictionary from the 2 bit long dictionary, which has a size of 7.

         0 = []        => +1, +8
--------------------------------
0+1   =  1 = [0]       => +1, +4
1+1   =  2 = [0, 0]    => +1, +2
2+1   =  3 = [0, 0, 0] => none
3+1   =  4 = [0, 0, 1] => none
4+1   =  5 = [0, 1]    => +1, +2
5+1   =  6 = [0, 1, 0] => none
6+1   =  7 = [0, 1, 1] => none
--------------------------------
0+1+7 =  8 = [1]       => +1, +4
1+1+7 =  9 = [1, 0]    => +1, +2
2+1+7 = 10 = [1, 0, 0] => none
3+1+7 = 11 = [1, 0, 1] => none
4+1+7 = 12 = [1, 1]    => +1, +2
5+1+7 = 13 = [1, 1, 0] => none
6+1+7 = 14 = [1, 1, 1] => none

This means that to encode a sequence of bits, adding a 0-bit adds 1 to the code and adding a 1-bit adds 2 raised to the number of bits of space remaining, including the bit currently being encoded. So for a 3 bit long dictionary the sequence [1, 1, 1] starts with code 0 and adds 8, 4 and 2 respectively to sum to 14. And likewise the sequence [0, 0, 0] starts with code 0 and adds 1, 1 and 1 respectively to sum to 3.
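As a minimal sketch (hypothetical names), an encoder following those offsets could look like this:

// Encode a chunk of at most `n` bits as a single code in the n-bit
// dictionary: start at the empty sequence (code 0), add 2^remaining for a
// 1-bit and 1 for a 0-bit, where `remaining` counts the bits of space left
// including the bit being encoded.
fn encode(bits: &[bool], n: u32) -> u32 {
    assert!(bits.len() as u32 <= n, "chunk does not fit this dictionary");
    let mut code: u32 = 0;
    let mut remaining = n;
    for &bit in bits {
        code += if bit { 1 << remaining } else { 1 };
        remaining -= 1;
    }
    code
}

fn main() {
    // The examples above, using the 3 bit dictionary.
    assert_eq!(encode(&[true, false, true], 3), 11);
    assert_eq!(encode(&[true, true, true], 3), 14);
    assert_eq!(encode(&[false, false, false], 3), 3);
    assert_eq!(encode(&[], 3), 0);
}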

For decoding we'll be exploiting the same recursive structure of the dictionary. The first thing to note is that the sequences which start with a 1-bit all have codes at or above some power of 2 and all the sequences that don't are below it, regardless of the size of the dictionary. So if we use the 3 bit dictionary as our example, the sequences starting with a 1-bit have a code of 8 or greater and all other sequences have a code of 7 or less. In practice you can check if the sequence starts with a 1-bit with a bit mask or an inequality. If the code is the empty sequence (with a code of 0) or the undecodeable code (with a code of 15) then we're done and can emit the accumulated sequence so far or report that the code is not decodeable. If the check determines the sequence starts with a 1-bit then you append a 1-bit to the accumulated sequence and subtract 8 from the code, leaving a remaining sequence that is 1 bit shorter and a code in the range of the 2 bit dictionary. If the check determines the sequence doesn't start with a 1-bit then you append a 0-bit to the accumulated sequence and subtract 1 from the code, again leaving a remaining sequence 1 bit shorter and a code in the range of the 2 bit dictionary. Following these steps with progressively smaller dictionaries will eventually lead to the 0 bit dictionary and the empty sequence, at which point the algorithm reaches its base case and terminates.

This decoding procedure can also be used to simply measure the length of the sequence represented by the code instead of emitting the sequence itself. If you only emit the number of steps it took to get to the empty sequence or undecodeable code then you don't have to manipulate a scratch buffer.
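A matching decoder sketch (hypothetical names), which walks through progressively smaller dictionaries as described:

// Walk a code from the n-bit dictionary down through progressively smaller
// dictionaries, emitting one bit per step. Returns None for the undecodeable
// code. Returning `bits.len()` (or just counting steps) instead would give
// the length without touching a scratch buffer.
fn decode(mut code: u32, mut n: u32) -> Option<Vec<bool>> {
    let mut bits = Vec::new();
    loop {
        if code >= (1 << (n + 1)) - 1 {
            return None; // the leftover undecodeable code (or out of range)
        }
        if code == 0 {
            return Some(bits); // the empty sequence: nothing left to decode
        }
        if code >= (1 << n) {
            bits.push(true); // sequences starting with a 1-bit sit at or above 2^n
            code -= 1 << n;
        } else {
            bits.push(false); // otherwise the sequence starts with a 0-bit
            code -= 1;
        }
        n -= 1; // the rest of the code lives in the next smaller dictionary
    }
}

fn main() {
    // Code 11 in the 3 bit dictionary is [1, 0, 1]; code 15 is undecodeable.
    assert_eq!(decode(11, 3), Some(vec![true, false, true]));
    assert_eq!(decode(15, 3), None);
}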

Terminal Vectors

Going from a bit array of [1, 0] to [1, 0, 0] or [1, 0, 1] is a pretty concrete operation. But there is another operation on lexical vectors that is necessary to make use of range-scanning that's a bit more abstract. If you think of a bit array as a string "101" the regular expression which matches this string and every string which it prefixes would be "101.*". And for range scans being able to pick up exactly the bit sequences that match that regular expression is necessary.

Forward iteration through an ordered list (or in the case of RocksDB an ordered map) of bit sequences needs to determine when to terminate based on the latest acceptable entry which matches the prefix.

One approach would be to decode each payload and directly check if it's prefixed with the required bit sequence. This works but does more than is necessary if this comparison can be done directly on the encoded payload.

Another approach would be to form the last valid encoded payload which could match the prefix and compare entries against that. This won't work though because the last bit sequence which matches "101.*" is 1011111111111111111... with an infinite number of 1-bits at the end. And the encoding scheme doesn't handle that directly.

Yet another approach is to exploit the undecodeable code left over when we constructed the code dictionary. For every decodeable code there will then be a code which comes lexically after it and after every code it prefixes, even if it's the last decodeable code. What does that actually mean? Let's use "Z" to refer to this undecodeable code. If you have a code which decodes to "100" the bit sequence "101" comes immediately after every sequence which "100" prefixes. If you have a code which decodes to "10" the bit sequence "11" comes immediately after every sequence which "10" prefixes. But if you have a code which decodes to "1" there is no bit sequence which comes after every sequence it prefixes, so you'd use "Z" to represent the code which comes after every code prefixed by "1".

If that felt too abstract then there's a demonstration later with source code you can examine. But we'll continue with explaining how to use codes to bound ranges.

Figuring out the first code which isn't prefixed by a given code feels like it would be hopelessly complex, but we can use the same trick of manually finding the offsets from one entry to the entry with that relationship that we did for encoding. So if we calculate the offset from a prefix code to the next code that isn't prefixed by it, for various dictionary sizes, we get the following.

Up to 0 bits:

[] => +1

Up to 1 bit:

0 = []  => +3
1 = [0] => +1
2 = [1] => +1

Up to 2 bits:

0 = []     => +7
1 = [0]    => +3
2 = [0, 0] => +1
3 = [0, 1] => +1
4 = [1]    => +3
5 = [1, 0] => +1
6 = [1, 1] => +1

Up to 3 bits:

 0 = []        => +15
 1 = [0]       => +7
 2 = [0, 0]    => +3
 3 = [0, 0, 0] => +1
 4 = [0, 0, 1] => +1
 5 = [0, 1]    => +3
 6 = [0, 1, 0] => +1
 7 = [0, 1, 1] => +1
 8 = [1]       => +7
 9 = [1, 0]    => +3
10 = [1, 0, 0] => +1
11 = [1, 0, 1] => +1
12 = [1, 1]    => +3
13 = [1, 1, 0] => +1
14 = [1, 1, 1] => +1

And just like before there is a bifurcating recursive relationship in the dictionary's construction. The offset at each step, following that recursive construction, is 2^(b + 1) - 1 where b is the number of bits of space left in the sequence. And you can find the number of bits of space left in a sequence by taking the maximum number of bits that can fit in a chunk and subtracting the length of the sequence represented by the code, which can be measured as discussed earlier.
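A minimal sketch (hypothetical names) that measures the length of a code by the decoding walk and then applies that offset:

// Length of the sequence a decodeable code represents, found by walking the
// recursive dictionary structure as in the decoding section.
fn code_len(mut code: u32, n: u32) -> u32 {
    let (mut len, mut remaining) = (0, n);
    while code != 0 {
        if code >= (1 << remaining) {
            code -= 1 << remaining; // strip a leading 1-bit
        } else {
            code -= 1; // strip a leading 0-bit
        }
        len += 1;
        remaining = remaining.saturating_sub(1);
    }
    len
}

// First code that isn't prefixed by `code`: add 2^(b + 1) - 1, where b is
// the number of bits of space left. The result may be the undecodeable code.
fn prefix_successor(code: u32, n: u32) -> u32 {
    let b = n - code_len(code, n);
    code + (1 << (b + 1)) - 1
}

fn main() {
    // Matches the "up to 3 bits" table: [] => +15, [0] => +7, [0, 0] => +3.
    assert_eq!(prefix_successor(0, 3), 15);
    assert_eq!(prefix_successor(1, 3), 8);
    assert_eq!(prefix_successor(2, 3), 5);
}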

You may have noticed a strange property of the "Z" code. Earlier I noted that the last sequence as a string which matches "101.*" is 1011111111111111111... with an infinite number of 1-bits at the end. And if you think about it enough the first sequence to come after this lexicographically is "11". But if we were to choose cutoffs of the infinite string and partition the sequence up into 3 bit chunks we'd get a different result for each one.

"101" => ["101"] => ["11"]

"101111" => ["101", "111"] => ["101", "Z"]

"101111111" => ["101", "111", "111"] => ["101", "111", "Z"]

"101111111111" => ["101", "111", "111", "111"] => ["101", "111", "111", "Z"]

And each of these comes after the infinite yet decodeable sequence 1011111111111111111... which is a property we want, but it should be seen as strange that there would be multiple such sequences which divide the list of bit sequences in the same place. If we look at another example it should be apparent what's happening. Let's do the same exercise but with the finite sequence 0101.

"0101" => ["010", "1"] => ["010", "Z"]

And you can also determine that the first sequence to come after 0101 lexicographically is 011. From this you should recognize that the "Z" code represents a kind of arithmetic "rollover" in lexical sequences. So if a code sequence isn't just ["Z"] you can chop off the "Z" item and find the first code representing a sequence which isn't prefixed by the last chunk of bits in the new sequence. That step can be applied iteratively until the code sequence is just ["Z"] or the "Z" code disappears. Applying this to the earlier example, the "Z" would roll over like so.

"101111111111"                  # Start with this.
["101", "111", "111", "111"]    # Chunk the sequence.
["101", "111", "111", "Z"]      # Terminal vector.
["101", "111", "Z"]             # Roll over.
["101", "Z"]                    # Roll over.
["11"]                          # Roll over making "101" => "11".

Notably for sequences which are all 1-bits their terminal vector is always ["Z"] under this scheme.

You don't strictly have to apply this rollover behavior but it gives you a smaller, canonical, and more frequently decodeable representation.
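Continuing in the same vein, a minimal sketch of the rollover step (hypothetical names, repeating the helpers from the previous sketch so it stands alone):

fn code_len(mut code: u32, n: u32) -> u32 {
    let (mut len, mut remaining) = (0, n);
    while code != 0 {
        if code >= (1 << remaining) { code -= 1 << remaining; } else { code -= 1; }
        len += 1;
        remaining = remaining.saturating_sub(1);
    }
    len
}

fn prefix_successor(code: u32, n: u32) -> u32 {
    code + (1 << (n - code_len(code, n) + 1)) - 1
}

// While the vector ends in the undecodeable "Z" code and has something in
// front of it, drop the "Z" and replace the chunk before it with its prefix
// successor (which may itself be "Z" again).
fn roll_over(mut codes: Vec<u32>, n: u32) -> Vec<u32> {
    let z = (1 << (n + 1)) - 1;
    while codes.len() > 1 && *codes.last().unwrap() == z {
        codes.pop();
        let last = codes.pop().unwrap();
        codes.push(prefix_successor(last, n));
    }
    codes
}

fn main() {
    // "101111111111" chunks to the codes for ["101", "111", "111", "111"];
    // its terminal vector ["101", "111", "111", "Z"] rolls over to ["11"].
    assert_eq!(roll_over(vec![11, 14, 14, 15], 3), vec![12]); // 12 = [1, 1]
}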

Software Demonstration

If you still have uncertainties about anything described above hopefully this program will help resolve them. The program linked in the repository generates the dictionary of bit sequences up to 9 bits long, splits each sequence into 3 bit chunks and encodes them. Then several operations are run on it, the output of which is printed to the terminal as below. The results of these operations are also used to determine range bounds on the 9 bit sequences. The output is very long (remember that's 1023 entries) but here are some excerpts if you don't want to run the code.

Using 3 bit words and 4 bit codes.
""
sentence
  to_codes_vec   [0]
  to_bool_vec    [[]]
  to_strings_vec [""]
  to_length_vec  [0]
  to_string      ""
terminal
  to_codes_vec   [15]
  to_bool_vec    [Terminal]
  to_strings_vec ["Z"]
  to_length_vec  [0]
  to_string      "Z"
prefixes ["":"111111111"] precedes "(none)"

"0"
sentence
  to_codes_vec   [1]
  to_bool_vec    [[false]]
  to_strings_vec ["0"]
  to_length_vec  [1]
  to_string      "0"
terminal
  to_codes_vec   [8]
  to_bool_vec    [Regular([true])]
  to_strings_vec ["1"]
  to_length_vec  [1]
  to_string      "1"
prefixes ["0":"011111111"] precedes "1"

"00"
sentence
  to_codes_vec   [2]
  to_bool_vec    [[false, false]]
  to_strings_vec ["00"]
  to_length_vec  [2]
  to_string      "00"
terminal
  to_codes_vec   [5]
  to_bool_vec    [Regular([false, true])]
  to_strings_vec ["01"]
  to_length_vec  [2]
  to_string      "01"
prefixes ["00":"001111111"] precedes "01"
"010011111"
sentence
  to_codes_vec   [6, 7, 14, 0]
  to_bool_vec    [[false, true, false], [false, true, true], [true, true, true], []]
  to_strings_vec ["010", "011", "111", ""]
  to_length_vec  [3, 3, 3, 0]
  to_string      "010011111"
terminal
  to_codes_vec   [6, 8]
  to_bool_vec    [Regular([false, true, false]), Regular([true])]
  to_strings_vec ["010", "1"]
  to_length_vec  [3, 1]
  to_string      "0101"
prefixes ["010011111":"010011111"] precedes "0101"

"0101"
sentence
  to_codes_vec   [6, 8]
  to_bool_vec    [[false, true, false], [true]]
  to_strings_vec ["010", "1"]
  to_length_vec  [3, 1]
  to_string      "0101"
terminal
  to_codes_vec   [7]
  to_bool_vec    [Regular([false, true, true])]
  to_strings_vec ["011"]
  to_length_vec  [3]
  to_string      "011"
prefixes ["0101":"010111111"] precedes "011"

Addendum

Getting the Order Wrong

While a lexical vector will preserve the lexical order of the underlying data it won't automatically pick the right concrete representation of your abstract data. For example, if you created a dictionary of all the human-readable strings you care about, it can't pick the order those words should appear in within a dictionary consisting of just those words. In fact the order of such words is potentially dependent on the locale used to handle them. The example from Mozilla is reproduced here.

console.log("ä".localeCompare("z", "de")); // a negative value: in German, ä sorts before z
console.log("ä".localeCompare("z", "sv")); // a positive value: in Swedish, ä sorts after z

Total Order

Another thing to note up front is that lexical orders are total orders for the purpose of lexical vectors. This means that every lexical vector is always earlier than, later than or equal to any other lexical vector in a dictionary and that the ordering is transitive. You can never have two lexical vectors which are incomparable. For example, if you arrange nodes in a directed acyclic graph, or DAG, and the arrows between nodes represent their order, then that order may be only partial and you might not be able to determine the relative order of every pair of nodes. Another example is IEEE 754 floating point numbers: while most floating point numbers can be ordered in terms of increasing value, NaN, or "Not a Number", values do not have such an ordering. And sorting floating point numbers based on byte representation would result in increasing positive numbers followed by decreasing negative numbers due to the leading sign bit. So some custom mapping or filtering is probably called for when it comes to floating point numbers.

Signed 2's complement numbers also have a similar problem, unrelated to whether their order is total, in that their sorted byte representation would start at 0, increase, roll over to the most negative number and then continue to increase until -1. But they can be ordered by value in the byte representation by adding 2^(n - 1), where n is the number of bits in the number, before serializing and then adding it again after deserializing, ignoring rollover each time.
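For example, a minimal sketch for 32 bit integers (hypothetical names), where the wrapping add of 2^31 is the bias described above and the big-endian bytes are what you'd actually store:

// Bias a signed 32 bit integer by adding 2^31 (ignoring rollover) so that
// its big-endian byte representation sorts by numeric value.
fn to_ordered_bytes(x: i32) -> [u8; 4] {
    (x as u32).wrapping_add(1 << 31).to_be_bytes()
}

// Undo the bias by adding 2^31 again, ignoring rollover.
fn from_ordered_bytes(bytes: [u8; 4]) -> i32 {
    u32::from_be_bytes(bytes).wrapping_add(1 << 31) as i32
}

fn main() {
    assert!(to_ordered_bytes(-1) < to_ordered_bytes(0));
    assert!(to_ordered_bytes(i32::MIN) < to_ordered_bytes(i32::MAX));
    assert_eq!(from_ordered_bytes(to_ordered_bytes(-42)), -42);
}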

Encoding and Decoding in Larger Pieces

Right when I introduced lexical vectors I implicitly assumed that encoding and decoding would be done bit-by-bit (pun not intended). And encoding a single bit at a time is the most space efficient way to fill up a container of a given length, assuming you don't make use of the fact that larger encoder units (not chunks) don't need the entirety of an extra bit per chunk.

If you're encoding individual bits you could have 1, 3, 7, 15, and so on entries in your dictionary and with the undecodeable code that comes to 2, 4, 8, 16, and so on values or 1, 2, 3, 4, and so on bits to encode. But if you encode pairs of bits you could have 1, 5, 21, 85, and so on entries in your dictionary and with the undecodeable code that comes to 2, 6, 22, 86, and so on values or approximately 1, 2.58, 4.46, 6.43, and so on bits to encode. For the rest of this discussion I'm assuming that the pair of bits encoder with a 21 entry dictionary takes 5 bits to encode a chunk and not 9 bits to encode 2 chunks together.
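A small sketch (hypothetical names) confirming those dictionary sizes and the 5 bit figure for the pair-of-bits encoder:

// Number of dictionary entries for chunks of up to `c` units of `u` bits
// each: 1 + 2^u + 2^(2u) + ... + 2^(cu).
fn entries(u: u32, c: u32) -> u64 {
    (0..=c).map(|k| 1u64 << (u * k)).sum()
}

fn main() {
    // Single-bit units: 1, 3, 7, 15, ... entries.
    assert_eq!((0..4).map(|c| entries(1, c)).collect::<Vec<_>>(), [1, 3, 7, 15]);
    // Pairs of bits: 1, 5, 21, 85, ... entries.
    assert_eq!((0..4).map(|c| entries(2, c)).collect::<Vec<_>>(), [1, 5, 21, 85]);
    // 21 entries plus the undecodeable code means the largest code is 21,
    // which takes 5 bits to represent.
    assert_eq!(64 - entries(2, 2).leading_zeros(), 5);
}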

So with that assumption out of the way we can know that no matter what the encoding unit size or the chunk size the code will always be exactly 1 bit larger than the maximum number of bits that can fit in a chunk. And using that we can try various encoding unit sizes, chunk sizes and container sizes and determine the space efficiency.

Ideally you'd choose an encoding unit which evenly divides your payload since the ending isn't signalled if it isn't evenly divided. You can find the program I wrote to calculate various efficiencies here, but I only consider encoding units which divide and are divided by 8.

Handling larger encoding units is more tricky than for single bits but the principles for handling it are sufficiently demonstrated above.

Finally, with larger encoding units, fragmentation internal to the container becomes a concern. Where a single bit encoder could completely fill its container by choosing the right chunk size, with larger units you might not completely fill the container with chunks, and filling it with multiple chunks also means wasting that extra bit per chunk.