Efficient ASCII bitset in Go
Tingfeng 6 min read December 31, 2022 [Go] #Go #data #structures #bitset #runeWhenever we want to check a string for prohibited ASCII characters, such as in username validation, a simple method is to use a set of ASCII characters.
While other languages like Python and JavaScript have built-in sets, Go does not.
Instead, the natural way to implement sets in Go is to use the built-in map. With Go map, sets of any comparable variable type can be created.
import "fmt"
set := make(map[byte]struct)
set[42] = struct
_, exists := set[42]
fmt.Println(exists) // expect: true
_, exists := set[99]
fmt.Println(exists) // expect: false
However, for sets of ASCII characters, the Go source code offers a faster solution, the asciiSet.
In this blog post, we shall build ASCIISet, an extension of asciiSet from the Go source code, from scratch. Afterwards, we shall compare its performance against traditional map based sets.
Introducing the ASCIISet
Instead of a map, an uint32
array of size 8 is defined as a type ASCIISet to store the set values. The set methods, Add()
, Contains()
and Remove()
are implemented.
import "unicode/utf8"
// ASCIISet is a 32-byte value, where each bit represents the presence of a
// given ASCII character in the set. The 128-bits of the lower 16 bytes,
// starting with the least-significant bit of the lowest word to the
// most-significant bit of the highest word, map to the full range of all
// 128 ASCII characters. The 128-bits of the upper 16 bytes will be zeroed,
// ensuring that any non-ASCII character will be reported as not in the set.
// This allocates a total of 32 bytes even though the upper half
// is unused to avoid bounds checks in ASCIISet.Contains.
type ASCIISet [8]uint32
// Add inserts character c into the set.
func (c byte)
// Contains reports whether c is inside the set.
func (c byte) bool
// Remove removes c from the set
//
// if c is not in the set, the set contents will remain unchanged.
func (c byte)
Here is how the characters @
and J
are stored in the ASCIISet, and how we can verify that ASCIISet contains the characters @
and J
(and nothing else).
var as ASCIISet
chars := "@J"
for _, c := range chars
fmt.Println(as.Contains('@')) // expect: true
fmt.Println(as.Contains('J')) // expect: true
fmt.Println(as.Contains('5')) // expect: false
How ASCIISet works
According to the Go documentation:
The 128-bits of the lower 16 bytes, starting with the least-significant bit of the lowest word to the most-significant bit of the highest word, map to the full range of all 128 ASCII characters.
In other words, each bit in the first 128 bits (lower 16 bytes) of ASCIISet represents one of the 128 ASCII characters. If a bit has a value of 1, the ASCII character it represents exists in the ASCIISet. If the value is 0, the ASCII character does not exist in the ASCIISet.
📝 This type of data structure is commonly known as a bitset
Single-character ASCIISets
Bit manipulation can be tricky to visualise.
Lets take a look at the single-character ASCIISet for each of the 128 ASCII characters.
var c byte
for c = 0; c < 128; c++
Output as follows
[1 0 0 0 0 0 0 0] // c = 0
[2 0 0 0 0 0 0 0] // c = 1
[4 0 0 0 0 0 0 0] // c = 2
...
[536870912 0 0 0 0 0 0 0] // c = 29
[1073741824 0 0 0 0 0 0 0] // c = 30
[2147483648 0 0 0 0 0 0 0] // c = 31
[0 1 0 0 0 0 0 0] // c = 32
[0 2 0 0 0 0 0 0] // c = 33
[0 4 0 0 0 0 0 0] // c = 34
...
[0 536870912 0 0 0 0 0 0] // c = 61
[0 1073741824 0 0 0 0 0 0] // c = 62
[0 2147483648 0 0 0 0 0 0] // c = 63
[0 0 1 0 0 0 0 0] // c = 64
[0 0 2 0 0 0 0 0] // c = 65
[0 0 4 0 0 0 0 0] // c = 66
...
[0 0 536870912 0 0 0 0 0] // c = 93
[0 0 1073741824 0 0 0 0 0] // c = 94
[0 0 2147483648 0 0 0 0 0] // c = 95
[0 0 0 1 0 0 0 0] // c = 96
[0 0 0 2 0 0 0 0] // c = 97
[0 0 0 4 0 0 0 0] // c = 98
...
[0 0 0 536870912 0 0 0 0] // c = 125
[0 0 0 1073741824 0 0 0 0] // c = 126
[0 0 0 2147483648 0 0 0 0] // c = 127
From left-to-right, each element is filled up with increasing powers of 2 (20 to 231), for each consecutive set of 32 ASCII characters (4 sets in total).
A bit for every character
Why powers of 2? Because in binary, powers of 2 have the form (0001, 0010, 0100, ...), which is useful when we want each bit in a number to represent the existence of something unique.
Therefore, the existence of each of the first 32 ASCII characters in ASCIISet can be represented by each bit of element 0.
[0b00000000000000000000000000000001 ...] // c = 0
[0b00000000000000000000000000000010 ...] // c = 1
[0b00000000000000000000000000000100 ...] // c = 2
...
[0b00100000000000000000000000000000 ...] // c = 29
[0b01000000000000000000000000000000 ...] // c = 30
[0b10000000000000000000000000000000 ...] // c = 31
In a similar manner, the second, third, and fourth sets of 32 ASCII characters are represented by the bits in elements 1, 2, and 3 of ASCIISet respectively.
In total, we have 32 + 32 + 32 + 32 = 128 bits representing each of the 128 ASCII characters, as stated by the documentation.
Add()
Now we are in a better position to understand the Add()
method.
// Add inserts character c into the set.
func (c byte)
c
is always an ASCII character, so it falls within the range [0,127].- This restricts
c/32
to the following values: [0,1,2,3] (i.e. indices of the first 4 elements of ASCIISet). as[c/32]
thus points to the set of 32 ASCII characters thatc
belongs to. Example: ifc = 34
→as[c/32] = 1
→c
belongs to the second set of 32 ASCII characters (out of 4 sets).- In
1 << (c % 32)
, 1 is right-shifted by c % 32 steps to the position of the bit representing characterc
- Finally, this
1 << (c % 32)
bitmask is OR-ed withas[c/32]
to storec
in the ASCIISet, by flipping the bit representing characterc
to 1. - Passing non-ASCII bytes like the British Pound £ to
Add()
will always have no effect, because of the conditionif c < utf8.RuneSelf
.
Here is how an empty ASCIISet changes as the 3 characters !, a, and d are added to it one-by-one
[0 0b00000000000000000000000000000000 0 0b00000000000000000000000000000000 ...] // empty
[0 0b00000000000000000000000000000010 0 0b00000000000000000000000000000000 ...] // ['!']
[0 0b00000000000000000000000000000010 0 0b00000000000000000000000000000010 ...] // ['!','a']
[0 0b00000000000000000000000000000010 0 0b00000000000000000000000000010010 ...] // ['!','a','d']
Contains()
// Contains reports whether c is inside the set.
func (c byte) bool
- Like before,
as[c/32]
points to the set of 32 ASCII characters thatc
belongs to, and the1 << (c % 32)
bitmask has 1 in the position of the bit representing characterc
. - We AND both expressions to check if the bit representing
c
in ASCIISet is 1. If so, the ASCII character exists, otherwise, it does not exist. - Passing non-ASCII bytes like the British Pound £ to
Contains()
will always return false. For example,as[c/32]
for £ evaluates toas[5]
, which is always 0 since onlyas[0]
toas[3]
are used to represent ASCII characters.
Remove()
// Remove removes c from the set
//
// if c is not in the set, the set contents will remain unchanged.
func (c byte)
- We can remove a character from the ASCIISet by inverting the usual
1 << (c % 32)
bitmask using NOT, then AND it withas[c/32]
to flip the bit representing characterc
to 0. - Passing non-ASCII bytes like the British Pound £ to
Remove()
will always have no effect, because the bitmask1 << (c % 32)
will never be applied to any ofas[0]
toas[3]
.
Bitmasking
Add()
, Contains()
, and Remove()
all apply the expression 1 << (c % 32)
as a mask over as[c/32]
using different bitwise operations. This action is called bitmasking.
Idempotence
Add()
, Contains()
, and Remove()
are idempotent; running the same method and arguments repeatedly on the same ASCIISet will lead to the same result.
Performance comparison
In Add()
, Contains()
, and Remove()
, the number of bitwise and arithmetic operations is not affected by input size. They are of O(1) time complexity, but so are Go map set operations, which use hash functions.
Here is a benchmark comparison between the ASCIISet and Go map set operations.
The benchmark measures time taken to fill up 10 sets with every 2nd ASCII character, to check all 10 sets for presence of every ASCII character, and to remove every ASCII character from all 10 sets.
ASCIISet vastly outperforms map as bitwise operations are less expensive than the hash function of map.
Conclusion
We have explored how bitwise operations can be used to implement fast bitsets for specialised use cases, like checking strings for prohibited ASCII characters.
Should we use ASCIISet over Go's built-in map for ASCII character sets? It depends whether the total execution time-savings is worth the cognitive load of an additional project dependency.
The source code for ASCIISet is available here.