How to generate human-friendly identifiers

Call centre

"Can you please tell me the ID on your package?"

"Yes, of course... I-O-7-U-4-V... Hang on, it looks like 1-0-7-V-4-U"

Communicating IDs between humans is error prone, and we've all experienced this at some point with things like packages, orders and bank accounts, to name a few. One source of confusion is similar-looking symbols, like O and zero, or I and 1, especially when spelling out an ID or dealing with handwritten notes.

If you're a programmer or software architect, you have the power to do something about this. Let's find out what.

Base 32 comes to the rescue

Engineers have already thought about how to make IDs more human friendly, and have come up with several solutions. They typically go along the following lines:

  1. Minimising transcription errors by avoiding similar looking symbols;
  2. Using compact encodings to reduce the overall length of the ID.

One common solution is to generate identifiers that are base 32 encoded. The encoding is specified in RFC 4648 and uses the English alphabet (one case only), followed by the digits 2 to 7. The digits 0 and 1 are deliberately skipped, due to their similarity with the letters O and I. By using letters this encoding is also about 1.5 times more efficient compared to using only digits (base 10): each character carries 5 bits, versus roughly 3.3 bits for a decimal digit. So you also get IDs that are relatively compact.

Another benefit of base 32, this time for computers, is that strings are inherently URL safe (when the = padding symbol is omitted).
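To put the compactness in numbers, here is a small sketch that counts how many characters the largest value of a given bit size needs in a given base. The class and method names are made up for this illustration.

```java
import java.math.BigInteger;

// Hypothetical helper, for illustration only
final class IdLength {

    // Number of symbols needed to write the largest unsigned value
    // with the given number of bits in the given radix (2 to 36)
    static int symbolsNeeded(int bits, int radix) {
        BigInteger max = BigInteger.ONE.shiftLeft(bits).subtract(BigInteger.ONE);
        return max.toString(radix).length();
    }
}
```

For a 64-bit ID, symbolsNeeded(64, 10) gives 20 decimal digits, while symbolsNeeded(64, 32) gives 13 base 32 characters.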

This is what the base 32 character set looks like:

A   J   S   3
B   K   T   4
C   L   U   5
D   M   V   6
E   N   W   7
F   O   X   =
G   P   Y
H   Q   Z
I   R   2 

The = sign is used for padding, but for identifiers it is redundant and can therefore be safely omitted. You can check out the spec if you need more info.

Choose base 32 for IDs if you expect them to be copied or spelled out by people at some point, while keeping their size relatively compact (in case you need a large number space). The Connect2id server for example applies base 32 to the generated client identifiers. Companies that operate IdP and OAuth servers may need to answer support calls from client developers, so that's a good use case for base 32.

How to generate base 32 identifiers?

Here is one recipe:

  1. Estimate the maximum number of IDs that may be needed over the lifespan of the application. A long integer for example has 8 bytes, which means 2^64 possible values. This will then become the spec for your database key column.

  2. Create a secure random generator, to ensure the issued IDs cannot be easily predicted (guessed). This is a must for things such as tracking and order numbers. If you need to have IDs that have the unpredictability of an AES key, their size should be at least 16 bytes (see step 1).

  3. Every time you need a new ID, generate the required random bytes for it and pass them to a base 32 encoder. Most programming languages support this, either in the standard library or through 3rd party libraries. Make sure the = sign used for padding is stripped from the output string.

Java example:

import java.security.SecureRandom;
import org.apache.commons.codec.binary.Base32;

// Create a secure random generator (it's thread-safe)
SecureRandom sr = new SecureRandom();

// Allocate an array for 8 bytes
byte[] random = new byte[8];

// Generate the random bytes
sr.nextBytes(random);

// Create the encoded ID, strip any padding
String id = new Base32().encodeToString(random).replace("=", "");

Example base 32 encoded ID with 8 bytes:

XBZO4EWP5J5B4
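If pulling in Apache Commons Codec is not an option, the encoding is small enough to write by hand. Below is a minimal sketch that uses only the JDK. The Base32Ids class and its method names are hypothetical, and the encoder always produces unpadded output, so no = stripping is needed.

```java
import java.security.SecureRandom;

// Hypothetical helper, a sketch rather than a hardened implementation
final class Base32Ids {

    // The RFC 4648 base 32 alphabet
    private static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567";

    private static final SecureRandom RANDOM = new SecureRandom();

    // Encodes the bytes as unpadded base 32, 5 bits per output character
    static String encode(byte[] data) {
        StringBuilder sb = new StringBuilder((data.length * 8 + 4) / 5);
        int buffer = 0;   // bits read but not yet written out
        int bitsLeft = 0; // number of pending bits in the buffer
        for (byte b : data) {
            buffer = (buffer << 8) | (b & 0xFF);
            bitsLeft += 8;
            while (bitsLeft >= 5) {
                sb.append(ALPHABET.charAt((buffer >> (bitsLeft - 5)) & 0x1F));
                bitsLeft -= 5;
            }
        }
        if (bitsLeft > 0) {
            // Fill the last group with zero bits on the right
            sb.append(ALPHABET.charAt((buffer << (5 - bitsLeft)) & 0x1F));
        }
        return sb.toString();
    }

    // Generates a new random ID from the given number of bytes
    static String newId(int numBytes) {
        byte[] random = new byte[numBytes];
        RANDOM.nextBytes(random);
        return encode(random);
    }
}
```

Calling Base32Ids.newId(8) returns a 13 character ID, and Base32Ids.newId(16) a 26 character one.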

Base 32 encoded UUIDs

If you wonder how the standard UUID encoding compares with base 32 in terms of string size:

UUID (std format)                 : 38503690-0475-4c48-95ed-a3c9eaa2ac3a
UUID (Base 32)                    : HBIDNEAEOVGERFPNUPE6VIVMHI
UUID (Base 32, with extra dashes) : HBID-NEAE-OVGE-RFPN-UPE6-VIVM-HI
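The comparison above can be reproduced with a few lines of code. The sketch below treats the 16 UUID bytes as one 128-bit number, appends 2 zero bits to reach 26 groups of 5 bits (which is what RFC 4648 does for the final group), and reads the groups off one by one. The UuidBase32 class, its method names and the group size of 4 are choices made for this illustration.

```java
import java.math.BigInteger;
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical helper, for illustration only
final class UuidBase32 {

    private static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567";

    // Encodes the 128 bits of the UUID as 26 unpadded base 32 characters
    static String encode(UUID uuid) {
        byte[] bytes = ByteBuffer.allocate(16)
            .putLong(uuid.getMostSignificantBits())
            .putLong(uuid.getLeastSignificantBits())
            .array();
        // 128 bits plus 2 zero bits on the right = 130 bits = 26 groups of 5
        BigInteger value = new BigInteger(1, bytes).shiftLeft(2);
        StringBuilder sb = new StringBuilder(26);
        for (int i = 25; i >= 0; i--) {
            int index = value.shiftRight(i * 5).intValue() & 0x1F;
            sb.append(ALPHABET.charAt(index));
        }
        return sb.toString();
    }

    // Inserts a dash after every groupSize characters to ease reading
    static String group(String id, int groupSize) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < id.length(); i++) {
            if (i > 0 && i % groupSize == 0) {
                sb.append('-');
            }
            sb.append(id.charAt(i));
        }
        return sb.toString();
    }
}
```

The dashes carry no information, so they can be stripped again before the ID is decoded or looked up.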

Other base 32 variants

It's worth noting two alternative base 32 encodings that take the human factor one step further.

Z-base-32

The most notable feature of z-base-32 is the permutation of the symbol set in such a way that characters that are considered easier to read, write, speak and remember are made to occur more frequently. The number of potentially confusing characters is also further reduced.

To find out more about the design decisions behind z-base-32 take a look at its spec. It's quite an interesting and enjoyable read.

The z-base-32 alphabet:

y   e   o   a
b   j   t   3
n   k   1   4
d   m   u   5
r   c   w   h
f   p   i   7
g   q   s   6
8   x   z   9

Note that here we also have one case, but with preference for lower case, which is more pleasant to the eyes. The padding character is also entirely gone.
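If you already produce unpadded RFC 4648 strings, one way to experiment with this alphabet is a plain character substitution. The sketch below assumes the input encodes whole bytes and that the same 5-bit grouping applies, so that only the symbol table differs. The z-base-32 spec itself is defined over bit strings of arbitrary length, so treat this as an approximation and check it against the spec before relying on it. The ZBase32 class name is hypothetical.

```java
// Hypothetical helper, for illustration only
final class ZBase32 {

    private static final String RFC4648 = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567";

    // The z-base-32 alphabet, in index order 0 to 31
    private static final String ZBASE32 = "ybndrfg8ejkmcpqxot1uwisza345h769";

    // Maps each character of an unpadded RFC 4648 base 32 string
    // to the z-base-32 character with the same 5-bit value
    static String fromRfc4648(String id) {
        StringBuilder sb = new StringBuilder(id.length());
        for (char c : id.toCharArray()) {
            int index = RFC4648.indexOf(c);
            if (index < 0) {
                throw new IllegalArgumentException("Not a base 32 character: " + c);
            }
            sb.append(ZBASE32.charAt(index));
        }
        return sb.toString();
    }
}
```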

Base 32 by Douglas Crockford

Douglas Crockford, who made JSON popular, also had a go at base 32 encoding. The most notable feature of his version is the optional support for a check character at the end of the string, for simple error detection. This can be quite handy when dealing with important IDs, such as bank accounts. International Bank Account Numbers for example use 2 check digits to catch mistyped routing and account numbers.

Note that even if you go along with the vanilla base 32 encoding, you can still include a check, by pre-encoding it into the base 32 input.
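Here is one way this could look. The sketch appends a single check byte to the ID bytes before encoding, and recomputes it after decoding. The choice of a CRC-8 (polynomial x^8 + x^2 + x + 1, initial value 0) is an assumption made for this example, as are the CheckedId class and its method names. Any checksum that suits your error model will do.

```java
import java.util.Arrays;

// Hypothetical helper, a sketch rather than a hardened implementation
final class CheckedId {

    private static final String ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ234567";

    // Unpadded RFC 4648 base 32 encoding
    static String encode(byte[] data) {
        StringBuilder sb = new StringBuilder();
        int buffer = 0, bitsLeft = 0;
        for (byte b : data) {
            buffer = (buffer << 8) | (b & 0xFF);
            bitsLeft += 8;
            while (bitsLeft >= 5) {
                sb.append(ALPHABET.charAt((buffer >> (bitsLeft - 5)) & 0x1F));
                bitsLeft -= 5;
            }
        }
        if (bitsLeft > 0) {
            sb.append(ALPHABET.charAt((buffer << (5 - bitsLeft)) & 0x1F));
        }
        return sb.toString();
    }

    // Reverse of encode(), throws on characters outside the alphabet
    static byte[] decode(String s) {
        byte[] out = new byte[s.length() * 5 / 8];
        int buffer = 0, bitsLeft = 0, pos = 0;
        for (char c : s.toCharArray()) {
            int value = ALPHABET.indexOf(c);
            if (value < 0) {
                throw new IllegalArgumentException("Illegal character: " + c);
            }
            buffer = (buffer << 5) | value;
            bitsLeft += 5;
            if (bitsLeft >= 8) {
                out[pos++] = (byte) (buffer >> (bitsLeft - 8));
                bitsLeft -= 8;
            }
        }
        return out;
    }

    // CRC-8 over the first len bytes (polynomial 0x07, initial value 0)
    static byte crc8(byte[] data, int len) {
        int crc = 0;
        for (int i = 0; i < len; i++) {
            crc ^= data[i] & 0xFF;
            for (int bit = 0; bit < 8; bit++) {
                crc = (crc & 0x80) != 0 ? ((crc << 1) ^ 0x07) & 0xFF : (crc << 1) & 0xFF;
            }
        }
        return (byte) crc;
    }

    // Appends the check byte to the payload, then encodes both
    static String encodeWithCheck(byte[] payload) {
        byte[] input = Arrays.copyOf(payload, payload.length + 1);
        input[payload.length] = crc8(payload, payload.length);
        return encode(input);
    }

    // True if the ID decodes and its last byte matches the recomputed check
    static boolean isValid(String id) {
        try {
            byte[] bytes = decode(id);
            return bytes.length > 1
                && bytes[bytes.length - 1] == crc8(bytes, bytes.length - 1);
        } catch (IllegalArgumentException e) {
            return false;
        }
    }
}
```

The price is one extra byte, which makes an 8 byte ID 15 characters long instead of 13.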

The base 32 alphabet of Douglas Crockford:

0   8   G   R
1   9   H   S
2   A   J   T
3   B   K   V
4   C   M   W
5   D   N   X
6   E   P   Y
7   F   Q   Z
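To get a feel for how this works in practice, here is a small sketch that encodes a non-negative number with this alphabet. It follows two rules from Crockford's spec as I understand it: the check symbol is the value modulo 37, drawn from the alphabet extended with the 5 symbols * ~ $ = U, and the decoder reads O as 0 and I or L as 1, ignoring case and dashes. The Crockford32 class and its method names are made up for this illustration.

```java
import java.util.Locale;

// Hypothetical helper, for illustration only
final class Crockford32 {

    private static final String ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ";

    // Check symbols need 37 values, so 5 extra symbols are appended
    private static final String CHECK_ALPHABET = ALPHABET + "*~$=U";

    // Encodes a non-negative number, most significant digit first
    static String encode(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("Negative value: " + value);
        }
        if (value == 0) {
            return "0";
        }
        StringBuilder sb = new StringBuilder();
        for (long v = value; v > 0; v >>>= 5) {
            sb.append(ALPHABET.charAt((int) (v & 0x1F)));
        }
        return sb.reverse().toString();
    }

    // Encodes the number and appends the check symbol (value mod 37)
    static String encodeWithCheck(long value) {
        return encode(value) + CHECK_ALPHABET.charAt((int) (value % 37));
    }

    // Decodes a string without check symbol, forgiving common misreadings
    static long decode(String s) {
        String normalized = s.toUpperCase(Locale.ROOT)
            .replace("-", "")
            .replace('O', '0')
            .replace('I', '1')
            .replace('L', '1');
        long value = 0;
        for (char c : normalized.toCharArray()) {
            int digit = ALPHABET.indexOf(c);
            if (digit < 0) {
                throw new IllegalArgumentException("Illegal character: " + c);
            }
            value = (value << 5) | digit;
        }
        return value;
    }
}
```

With this sketch the number 1234 encodes to 16J, or 16JD with the check symbol, and a sloppy l6-j still decodes back to 1234.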

IDs are not just numbers

Remember this next time you design a data model. IDs often become part of the user experience, whether in tiny URLs or tracking orders. Fortunately, we have a tool to make this experience a bit more pleasant for people as well as a bit more efficient for the computer - so remember base 32 :)