Efficient ASCII bitset in Go

Tingfeng 6 min read December 31, 2022 [Go] #Go #data #structures #bitset #rune

Whenever we want to check a string for prohibited ASCII characters, such as in username validation, a simple method is to use a set of ASCII characters.

While other languages like Python and JavaScript have built-in sets, Go does not.

Instead, the natural way to implement sets in Go is to use the built-in map. With Go map, sets of any comparable variable type can be created.

import "fmt"

set := make(map[byte]struct{})

set[42] = struct{}{}

_, exists := set[42]
fmt.Println(exists) // expect: true

_, exists := set[99]
fmt.Println(exists) // expect: false

However, for sets of ASCII characters, the Go source code offers a faster solution, the asciiSet.

In this blog post, we shall build ASCIISet, an extension of asciiSet from the Go source code, from scratch. Afterwards, we shall compare its performance against traditional map based sets.

Introducing the ASCIISet

Instead of a map, an uint32 array of size 8 is defined as a type ASCIISet to store the set values. The set methods, Add(), Contains() and Remove() are implemented.

import "unicode/utf8"

// ASCIISet is a 32-byte value, where each bit represents the presence of a
// given ASCII character in the set. The 128-bits of the lower 16 bytes,
// starting with the least-significant bit of the lowest word to the
// most-significant bit of the highest word, map to the full range of all
// 128 ASCII characters. The 128-bits of the upper 16 bytes will be zeroed,
// ensuring that any non-ASCII character will be reported as not in the set.
// This allocates a total of 32 bytes even though the upper half
// is unused to avoid bounds checks in ASCIISet.Contains.
type ASCIISet [8]uint32

// Add inserts character c into the set.
func (as *ASCIISet) Add(c byte) {
    if c < utf8.RuneSelf { // ensure that c is an ASCII byte
        as[c/32] |= 1 << (c % 32)
    }
}

// Contains reports whether c is inside the set.
func (as *ASCIISet) Contains(c byte) bool {
    return (as[c/32] & (1 << (c % 32))) != 0
}

// Remove removes c from the set
//
// if c is not in the set, the set contents will remain unchanged.
func (as *ASCIISet) Remove(c byte) {
    as[c/32] &= ^(1 << (c % 32))
}

Here is how the characters @ and J are stored in the ASCIISet, and how we can verify that ASCIISet contains the characters @ and J (and nothing else).

var as ASCIISet
chars := "@J"
for _, c := range chars {
    as.Add(byte(c))
}

fmt.Println(as.Contains('@')) // expect: true
fmt.Println(as.Contains('J')) // expect: true
fmt.Println(as.Contains('5')) // expect: false

How ASCIISet works

According to the Go documentation:

The 128-bits of the lower 16 bytes, starting with the least-significant bit of the lowest word to the most-significant bit of the highest word, map to the full range of all 128 ASCII characters.

In other words, each bit in the first 128 bits (lower 16 bytes) of ASCIISet represents one of the 128 ASCII characters. If a bit has a value of 1, the ASCII character it represents exists in the ASCIISet. If the value is 0, the ASCII character does not exist in the ASCIISet.

📝 This type of data structure is commonly known as a bitset

Single-character ASCIISets

Bit manipulation can be tricky to visualise.

Lets take a look at the single-character ASCIISet for each of the 128 ASCII characters.

var c byte
for c = 0; c < 128; c++ {
    var as ASCIISet
    as.Add(c)
    fmt.Println(as)
}

Output as follows

[1 0 0 0 0 0 0 0] // c = 0
[2 0 0 0 0 0 0 0] // c = 1
[4 0 0 0 0 0 0 0] // c = 2
...
[536870912 0 0 0 0 0 0 0] // c = 29
[1073741824 0 0 0 0 0 0 0] // c = 30
[2147483648 0 0 0 0 0 0 0] // c = 31
[0 1 0 0 0 0 0 0] // c = 32
[0 2 0 0 0 0 0 0] // c = 33
[0 4 0 0 0 0 0 0] // c = 34
...
[0 536870912 0 0 0 0 0 0] // c = 61
[0 1073741824 0 0 0 0 0 0] // c = 62
[0 2147483648 0 0 0 0 0 0] // c = 63
[0 0 1 0 0 0 0 0] // c = 64
[0 0 2 0 0 0 0 0] // c = 65
[0 0 4 0 0 0 0 0] // c = 66
...
[0 0 536870912 0 0 0 0 0] // c = 93
[0 0 1073741824 0 0 0 0 0] // c = 94
[0 0 2147483648 0 0 0 0 0] // c = 95
[0 0 0 1 0 0 0 0] // c = 96
[0 0 0 2 0 0 0 0] // c = 97
[0 0 0 4 0 0 0 0] // c = 98
...
[0 0 0 536870912 0 0 0 0] // c = 125
[0 0 0 1073741824 0 0 0 0] // c = 126
[0 0 0 2147483648 0 0 0 0] // c = 127

From left-to-right, each element is filled up with increasing powers of 2 (2⁰ to 2³¹), for each consecutive set of 32 ASCII characters (4 sets in total).

A bit for every character

Why powers of 2? Because in binary, powers of 2 have the form (0001, 0010, 0100, ...), which is useful when we want each bit in a number to represent the existence of something unique.

Therefore, the existence of each of the first 32 ASCII characters in ASCIISet can be represented by each bit of element 0.

[0b00000000000000000000000000000001 ...] // c = 0
[0b00000000000000000000000000000010 ...] // c = 1
[0b00000000000000000000000000000100 ...] // c = 2
...
[0b00100000000000000000000000000000 ...] // c = 29
[0b01000000000000000000000000000000 ...] // c = 30
[0b10000000000000000000000000000000 ...] // c = 31

In a similar manner, the second, third, and fourth sets of 32 ASCII characters are represented by the bits in elements 1, 2, and 3 of ASCIISet respectively.

In total, we have 32 + 32 + 32 + 32 = 128 bits representing each of the 128 ASCII characters, as stated by the documentation.

Add()

Now we are in a better position to understand the Add() method.

// Add inserts character c into the set.
func (as *ASCIISet) Add(c byte) {
    if c < utf8.RuneSelf { // ensure that c is an ASCII byte
        as[c/32] |= 1 << (c % 32)
    }
}

c is always an ASCII character, so it falls within the range [0,127].
This restricts c/32 to the following values: [0,1,2,3] (i.e. indices of the first 4 elements of ASCIISet).
as[c/32] thus points to the set of 32 ASCII characters that c belongs to. Example: if c = 34 → as[c/32] = 1 → c belongs to the second set of 32 ASCII characters (out of 4 sets).
In 1 << (c % 32), 1 is right-shifted by c % 32 steps to the position of the bit representing character c
Finally, this 1 << (c % 32) bitmask is OR-ed with as[c/32] to store c in the ASCIISet, by flipping the bit representing character c to 1.
Passing non-ASCII bytes like the British Pound £ to Add() will always have no effect, because of the condition if c < utf8.RuneSelf.

Here is how an empty ASCIISet changes as the 3 characters !, a, and d are added to it one-by-one

[0 0b00000000000000000000000000000000 0 0b00000000000000000000000000000000 ...] // empty
[0 0b00000000000000000000000000000010 0 0b00000000000000000000000000000000 ...] // ['!']
[0 0b00000000000000000000000000000010 0 0b00000000000000000000000000000010 ...] // ['!','a']
[0 0b00000000000000000000000000000010 0 0b00000000000000000000000000010010 ...] // ['!','a','d']

Contains()

// Contains reports whether c is inside the set.
func (as *ASCIISet) Contains(c byte) bool {
    return (as[c/32] & (1 << (c % 32))) != 0
}

Like before, as[c/32] points to the set of 32 ASCII characters that c belongs to, and the 1 << (c % 32) bitmask has 1 in the position of the bit representing character c.
We AND both expressions to check if the bit representing c in ASCIISet is 1. If so, the ASCII character exists, otherwise, it does not exist.
Passing non-ASCII bytes like the British Pound £ to Contains() will always return false. For example, as[c/32] for £ evaluates to as[5], which is always 0 since only as[0] to as[3] are used to represent ASCII characters.

Remove()

// Remove removes c from the set
//
// if c is not in the set, the set contents will remain unchanged.
func (as *ASCIISet) Remove(c byte) {
    as[c/32] &= ^(1 << (c % 32))
}

We can remove a character from the ASCIISet by inverting the usual 1 << (c % 32) bitmask using NOT, then AND it with as[c/32] to flip the bit representing character c to 0.
Passing non-ASCII bytes like the British Pound £ to Remove() will always have no effect, because the bitmask 1 << (c % 32) will never be applied to any of as[0] to as[3].

Bitmasking

Add(), Contains(), and Remove() all apply the expression 1 << (c % 32) as a mask over as[c/32] using different bitwise operations. This action is called bitmasking.

Idempotence

Add(), Contains(), and Remove() are idempotent; running the same method and arguments repeatedly on the same ASCIISet will lead to the same result.

Performance comparison

In Add(), Contains(), and Remove(), the number of bitwise and arithmetic operations is not affected by input size. They are of O(1) time complexity, but so are Go map set operations, which use hash functions.

Here is a benchmark comparison between the ASCIISet and Go map set operations.

The benchmark measures time taken to fill up 10 sets with every 2nd ASCII character, to check all 10 sets for presence of every ASCII character, and to remove every ASCII character from all 10 sets.

go test -bench . -benchmem -cpu 1
goos: linux
goarch: amd64
pkg: github.com/elliotwutingfeng/asciiset
cpu: AMD Ryzen 7 5800X 8-Core Processor
BenchmarkASCIISet/ASCIISet_Add()                 1654597               752.9 ns/op             0 B/op          0 allocs/op
BenchmarkASCIISet/ASCIISet_Contains()            1934626               600.9 ns/op             0 B/op          0 allocs/op
BenchmarkASCIISet/ASCIISet_Remove()               756562              1589 ns/op               0 B/op          0 allocs/op
BenchmarkMapSet/map_Add                           125552              9618 ns/op               0 B/op          0 allocs/op
BenchmarkMapSet/map_Contains                       70255             16938 ns/op               0 B/op          0 allocs/op
BenchmarkMapSet/map_Remove                        485190              2484 ns/op               0 B/op          0 allocs/op
PASS
ok      github.com/elliotwutingfeng/asciiset    11.579s

ASCIISet vastly outperforms map as bitwise operations are less expensive than the hash function of map.

Conclusion

We have explored how bitwise operations can be used to implement fast bitsets for specialised use cases, like checking strings for prohibited ASCII characters.

Should we use ASCIISet over Go's built-in map for ASCII character sets? It depends whether the total execution time-savings is worth the cognitive load of an additional project dependency.

The source code for ASCIISet is available here.