Tagged Unions are actually quite sexy
I recently tried to update my Zig project from Zig 0.11.0 to 0.12.0. The Zig compiler ships with its own build system, and since Zig is still under development, that build API changes between releases. The 0.12.0 changes forced me to dive a bit deeper into Zig’s source code. I went through Build.zig and Step.zig. What caught my attention was the usage of Tagged Unions.
Tagged unions aren’t anything new. C has them… kind of. In C, tagged unions are more of a design pattern than a language feature.
An instance of a union is just a space in memory, with the size of the biggest variant listed in its definition.
union supported_values {
bool b;
uint8_t u8;
uint16_t u16;
uint32_t u32;
int8_t i8;
int16_t i16;
int32_t i32;
};
Syntactically it’s the same as a struct. The only difference is that all the fields start at the same memory address, which means only one field holds a meaningful value at a time.
It’s C - nothing stops you from writing into .u8 and then using .u32. Because the union supported_values val isn’t initialized, you might see printf printing garbage values for the bytes you never wrote. Only the first byte, the one written through .u8, is guaranteed to hold 0xA.
int main(int argc, char* argv[])
{
union supported_values val;
val.u8 = 0xA;
printf("0x%x\n", val.u32);
}
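The overlap can be checked directly. Here’s a minimal sketch, assuming a common ABI where uint32_t is 4 bytes: it zeroes the union first, so no indeterminate bytes are read, and then looks at the storage through an unsigned char pointer. The helper name first_byte_after_u8_write is made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

union supported_values {
    bool b;
    uint8_t u8;
    uint16_t u16;
    uint32_t u32;
    int8_t i8;
    int16_t i16;
    int32_t i32;
};

/* Hypothetical helper: writes 0x0A through .u8 and returns the first
 * byte of the union's storage. */
unsigned char first_byte_after_u8_write(void)
{
    union supported_values val;

    memset(&val, 0, sizeof val); /* avoid reading indeterminate bytes */
    val.u8 = 0x0A;

    /* Every member starts at the address of the union itself. */
    assert((void *)&val.u8 == (void *)&val);
    assert((void *)&val.u32 == (void *)&val);

    return *(const unsigned char *)&val;
}
```

The byte written through .u8 is the first byte of the storage regardless of endianness. What val.u32 would print afterwards does depend on endianness, which is one more reason not to read a member other than the one you wrote.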
The issue here is that if you want to do something more complex with those values, your code should be able to introspect on itself. It needs to know which of the union’s variants is currently in use. This can be solved by creating a tagged union:
enum value_type {
VAL_BOOL,
VAL_U8,
VAL_U16,
VAL_U32,
VAL_S8,
VAL_S16,
VAL_S32,
};
typedef struct value_tracker {
enum value_type type;
union supported_values value;
} value_tracker;
Here’s how you’d handle a tagged union like that:
int main(int argc, char *argv[]) {
struct value_tracker tagged_union_val =
(struct value_tracker){.type = VAL_U8, .value.u8 = 0xA};
switch (tagged_union_val.type) {
case VAL_BOOL:
printf("%s\n", tagged_union_val.value.b ? "TRUE" : "FALSE");
break;
case VAL_U8:
printf("%" PRIu8 "\n", tagged_union_val.value.u8);
break;
case VAL_U16:
printf("%" PRIu16 "\n", tagged_union_val.value.u16);
break;
case VAL_U32:
printf("%" PRIu32 "\n", tagged_union_val.value.u32);
break;
case VAL_S8:
printf("%" PRIi8 "\n", tagged_union_val.value.i8);
break;
case VAL_S16:
printf("%" PRIi16 "\n", tagged_union_val.value.i16);
break;
case VAL_S32:
printf("%" PRIi32 "\n", tagged_union_val.value.i32);
break;
default:
break;
}
}
Once again, it’s C, and you can make a mistake in this code and it’ll still compile and run. The enum and union coupling here is… loose. If you modify one, you have to remember to modify the other. You can mix them up and not notice.
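To make the looseness concrete: an initializer like { .type = VAL_U32, .value.u8 = 0xA } compiles without complaint. One way to narrow the gap, sketched below with a reduced set of types, is to pair the tag and the payload only inside small constructor functions and to read through an accessor that checks the tag. The names tracker_from_u8, tracker_from_u32 and tracker_get_u32 are hypothetical, and this is a convention, not something the compiler enforces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum value_type { VAL_BOOL, VAL_U8, VAL_U32 };

union supported_values {
    bool b;
    uint8_t u8;
    uint32_t u32;
};

struct value_tracker {
    enum value_type type;
    union supported_values value;
};

/* Hypothetical constructors: the only places where a tag is paired
 * with a payload. */
struct value_tracker tracker_from_u8(uint8_t v)
{
    return (struct value_tracker){ .type = VAL_U8, .value.u8 = v };
}

struct value_tracker tracker_from_u32(uint32_t v)
{
    return (struct value_tracker){ .type = VAL_U32, .value.u32 = v };
}

/* Hypothetical checked accessor: returns false and leaves *out alone
 * when the tag does not name the u32 member. */
bool tracker_get_u32(const struct value_tracker *t, uint32_t *out)
{
    assert(t != NULL && out != NULL);
    if (t->type != VAL_U32)
        return false;
    *out = t->value.u32;
    return true;
}
```

Nothing stops other code from building a struct value_tracker by hand, so the mismatch is still possible. It just stops being the path of least resistance.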
Languages like Zig, Odin or Hare implement tagged unions as a language feature, which means it’s impossible to mix those up.
I’ll come back to how the newer languages make tagged unions more robust, but let’s stay with C for a little bit. When I explored this topic, I had a perfect candidate for re-implementation using tagged unions.
At work, we log telemetry data. We have a separate sub-system for that. It handles boolean values, 8, 16, 32 and 64 bit integers, signed and unsigned, and 64 bit floating point values. The values are actually always stored in a uint64_t value, but the structure already has an enum field. The value is already tagged, but a certain amount of type safety has been thrown out the window, because the collected values always have to be packed into a uint64_t field.
typedef void (*get_value_cb)(void *);
struct value_tracker {
enum telemetry_value_type type;
get_value_cb value_cb;
// Precision is used for double.
uint8_t precision;
uint64_t value;
};
Notice that the get_value_cb function takes a void * parameter. That’s because whenever you create an instance of struct value_tracker you should give it a pointer to a function that promises to write the correct data type to that void * address.
Here you can see how one would use this structure.
static void fetch_gyro_x(void *data)
{
    *(double *)data = accelerometer_get_data()->gyro.x;
}

struct value_tracker val = {
    .type = VAL_DOUBLE,
    .value_cb = fetch_gyro_x,
    .value = 0,
};
Then, whenever you need to update the actual value, you’d iterate through a collection of those structures and call the value_cb of each:
void update_value(struct value_tracker *val_track)
{
    // Here we call the `fetch_gyro_x`.
    val_track->value_cb((void *)&val_track->value);
    switch (val_track->type) {
    ...
    // Here we type cast to uint32_t and save in the uint64_t.
    case VAL_U32: {
        uint32_t *p_val = (uint32_t *)&val_track->value;
        val_track->value = *p_val;
        break;
    }
    case VAL_DOUBLE: {
        double *p_val = (double *)&val_track->value;
        int64_t i_value = *p_val * pow(10, val_track->precision);
        val_track->value = i_value;
        break;
    }
    ...
    }
    ...
}
The val_track points to one of the values we track in our values collection. There’s a lot of trust between pieces of code. The value_cb gets the raw address of the uint64_t value and we trust it to write no more than 8 bytes into that address. Moreover, the developer is responsible for making sure that the type the value_cb writes and val_track->type are in sync. If you start throwing around void* you put the compiler in a very difficult position, where it isn’t able to type check. You’re on your own, buddy.
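The re-read in the VAL_U32 case is easier to appreciate with a small experiment. In the sketch below, a callback that writes only 4 bytes leaves the other 4 bytes of the uint64_t storage untouched, so the storage as a whole is not the value the callback meant. The names fetch_counter, raw_storage_after_fetch and normalized_after_fetch are hypothetical, and I’m using memcpy instead of the pointer casts from the snippet above to stay clear of aliasing questions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef void (*get_value_cb)(void *);

/* Hypothetical callback: writes a 4-byte value, as a VAL_U32 source would. */
void fetch_counter(void *data)
{
    uint32_t reading = 42;
    memcpy(data, &reading, sizeof reading);
}

/* Runs the callback on 8 bytes that still hold old contents and returns
 * the storage as-is: 4 fresh bytes next to 4 stale ones. */
uint64_t raw_storage_after_fetch(get_value_cb cb)
{
    uint64_t storage = UINT64_MAX; /* stale bits from an earlier update */

    assert(cb != NULL);
    cb(&storage);
    return storage;
}

/* Reads back only the 4 bytes the callback promised to write, then
 * widens them, which is what the VAL_U32 case does. */
uint64_t normalized_after_fetch(get_value_cb cb)
{
    uint64_t storage = UINT64_MAX;
    uint32_t narrow;

    assert(cb != NULL);
    cb(&storage);
    memcpy(&narrow, &storage, sizeof narrow);
    return narrow;
}
```

If the tag said VAL_U32 while the callback wrote a double, the same code would happily widen the first 4 bytes of that double. Nothing in the types would object.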
I have also left in the VAL_DOUBLE case. It shows the computation needed to store the double in a uint64_t. For our needs, converting the double into int64_t and then uint64_t was just wasted cycles, but that’s because of how we use this data further down our pipeline. We can all agree that this pointer casting dance is ugly and dangerous.
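For reference, the arithmetic in the VAL_DOUBLE case boils down to “multiply by 10^precision and truncate”. Here’s a minimal sketch of that round trip, assuming the consumer casts the uint64_t back to int64_t. The function names are made up, and repeated multiplication stands in for pow(10, precision) so that the example doesn’t need the math library.

```c
#include <assert.h>
#include <stdint.h>

/* Scales a double by 10^precision and truncates toward zero,
 * mirroring the VAL_DOUBLE case. */
int64_t to_fixed_point(double value, uint8_t precision)
{
    double scale = 1.0;

    assert(precision <= 18); /* 10^18 is the largest power of ten below INT64_MAX */
    for (uint8_t i = 0; i < precision; i++)
        scale *= 10.0;
    return (int64_t)(value * scale);
}

/* Stores the signed result in the unsigned field, as the tracker does. */
uint64_t pack_fixed_point(double value, uint8_t precision)
{
    return (uint64_t)to_fixed_point(value, precision);
}

/* Recovers the signed value on the consumer side. Before C23 this
 * conversion is implementation-defined for values above INT64_MAX;
 * GCC and Clang wrap as two's complement. */
int64_t unpack_fixed_point(uint64_t packed)
{
    return (int64_t)packed;
}
```

With a precision of 2, 1.25 is stored as 125 and -1.25 survives the trip through uint64_t as -125. Values that aren’t exactly representable in binary can land one unit low because of the truncation, so rounding instead of truncating may be the better choice, depending on what the pipeline expects.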
Could I improve this system by using tagged unions?
Let’s go back to the structure, which is used to store the values we’re tracking:
struct value_tracker {
enum telemetry_value_type type;
get_value_cb value_cb;
uint8_t precision;
uint64_t value;
};
It already carries the tag, and we know exactly which types we support.
union supported_values {
bool b;
uint8_t u8;
uint16_t u16;
uint32_t u32;
uint64_t u64;
int8_t i8;
int16_t i16;
int32_t i32;
int64_t i64;
double f64;
};
struct value_tracker {
enum telemetry_value_type type;
get_value_cb value_cb;
uint8_t precision;
union supported_values value;
};
Now, the value can be one of the types listed in the union supported_values, but nothing else. Its size didn’t change because the biggest members, uint64_t and double, are still 8 bytes each. The tagged union describes the system in a clearer way.
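If you’d rather have the compiler confirm the size claim than take my word for it, C11’s static_assert can pin it down. This is a sketch under the assumption of a typical ABI where double and uint64_t are both 8 bytes. The two tracker structs are simplified stand-ins with made-up names, keeping only the fields that matter for the comparison.

```c
#include <assert.h> /* static_assert (C11) */
#include <stdbool.h>
#include <stdint.h>

union supported_values {
    bool b;
    uint8_t u8;
    uint16_t u16;
    uint32_t u32;
    uint64_t u64;
    int8_t i8;
    int16_t i16;
    int32_t i32;
    int64_t i64;
    double f64;
};

/* Simplified stand-ins for the tracker before and after the change. */
struct tracker_with_u64 {
    uint8_t precision;
    uint64_t value;
};

struct tracker_with_union {
    uint8_t precision;
    union supported_values value;
};

/* Compilation fails here if swapping in the union changes the layout. */
static_assert(sizeof(union supported_values) == sizeof(uint64_t),
              "union should be no bigger than the old uint64_t field");
static_assert(sizeof(struct tracker_with_union) == sizeof(struct tracker_with_u64),
              "tracker size should not change");
```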
Now we can change how we fetch the value:
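// The callback type changes too: it now returns the union by value.
typedef union supported_values (*get_value_cb)(void);

static union supported_values fetch_gyro_x(void)
{
    return (union supported_values){ .f64 = accelerometer_get_data()->gyro.x };
}

struct value_tracker val = {
    .type = VAL_DOUBLE,
    .value_cb = fetch_gyro_x,
    .precision = 4,
    .value = { 0 },
};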
The value_cb becomes simpler. It returns the union, meaning that it returns, in an opaque way, one of the types listed under our union supported_values. The only way we know which one it is, is by looking at the .type field.
We don’t really need to switch on the .type field, the way we did before, in the update_value function. That function becomes extremely simple:
void update_value(struct value_tracker *val_track)
{
// Here we call the `fetch_gyro_x`.
val_track->value = val_track->value_cb();
}
The value_cb returns union supported_values, and the value is of type union supported_values. The type safety is still with us and the void * has been banished!
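Put together, the reworked pieces fit in one small file. The sketch below is a reduced version, not our production code: fake_gyro_x is a hypothetical stand-in for the accelerometer driver, the enum and union list only two types, and tracked_as_double is a made-up accessor showing that readers of the value still consult the tag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum telemetry_value_type { VAL_U32, VAL_DOUBLE };

union supported_values {
    uint32_t u32;
    double f64;
};

typedef union supported_values (*get_value_cb)(void);

struct value_tracker {
    enum telemetry_value_type type;
    get_value_cb value_cb;
    uint8_t precision;
    union supported_values value;
};

/* Hypothetical stand-in for accelerometer_get_data()->gyro.x. */
static double fake_gyro_x(void)
{
    return 0.125;
}

union supported_values fetch_gyro_x(void)
{
    return (union supported_values){ .f64 = fake_gyro_x() };
}

/* No void *, no casts: both sides of the assignment have the same type. */
void update_value(struct value_tracker *val_track)
{
    assert(val_track != NULL && val_track->value_cb != NULL);
    val_track->value = val_track->value_cb();
}

/* Hypothetical accessor: the tag is checked before the member is read. */
double tracked_as_double(const struct value_tracker *val_track)
{
    assert(val_track != NULL && val_track->type == VAL_DOUBLE);
    return val_track->value.f64;
}
```

A callback with the old void (*)(void *) signature no longer fits the value_cb field without the compiler complaining about an incompatible pointer type, which is exactly the kind of feedback the previous design couldn’t give.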
I was lucky to stumble on a piece of code that benefits from using tagged unions. C, however, doesn’t go as far as providing a safe and ergonomic way to use tagged unions - the enum and union coupling is too loose.
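C compilers do help a little on the enum side. If a switch over an enum has no default label, GCC and Clang warn (-Wswitch, part of -Wall) when an enumerator is left unhandled. A small sketch, with a made-up value_type_name function and a shortened enum:

```c
#include <assert.h>

enum value_type { VAL_BOOL, VAL_U8, VAL_U32 };

/* No default label on purpose: adding a new enumerator without a
 * matching case triggers -Wswitch under -Wall. */
const char *value_type_name(enum value_type type)
{
    switch (type) {
    case VAL_BOOL:
        return "bool";
    case VAL_U8:
        return "u8";
    case VAL_U32:
        return "u32";
    }

    /* Only reachable if type holds a value outside the enum. */
    assert(!"value_type out of range");
    return "unknown";
}
```

This is only a warning, and it only covers the switch. It says nothing about whether the union has a member for the new enumerator, so the coupling between the two stays a matter of discipline.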
In Zig, a tagged union would look like so:
const ComplexTypeTag = enum {
ok,
not_ok,
};
const ComplexType = union(ComplexTypeTag) {
ok: u8,
not_ok: void,
};
pub fn main() void {
var c = ComplexType{ .ok = 42 };
switch (c) {
ComplexTypeTag.ok => |*value| value.* += 1,
ComplexTypeTag.not_ok => unreachable,
}
}
But the nice part is that if you extend the enum ComplexTypeTag with a field maybe, then you’ll get a compile time error:
main.zig:8:21: error: enum field(s) missing in union
main.zig:6:5: note: field 'maybe' missing, declared here
main.zig:3:24: note: enum declared here (exit status 1)
If you extend the ComplexType union with a maybe: u8 field, you’ll also get a compile time error:
main.zig:11:5: error: no field named 'maybe' in enum 'main.ComplexTypeTag'
main.zig:3:24: note: enum declared here (exit status 1)
The enum and union have to stay in sync. Zig is looking out for you. If the enum isn’t reused in other parts of your code, you can be a bit more terse:
const ComplexType = union(enum) {
ok: u8,
not_ok: void,
};
pub fn main() void {
var c = ComplexType{ .ok = 42 };
switch (c) {
ComplexType.ok => |*value| value.* += 1,
ComplexType.not_ok => unreachable,
}
}
Zig leverages unions, and the ability to test which variant a union holds, to return “result or error” from functions. A function declared as fn mightReturnAValue() !u8 returns either a result of type u8 or an error. This specific dichotomy, between proper values and errors, is quite common in programming, and because of that languages provide ergonomic syntax for handling unions like these. Zig brings tools like try, catch, if (mightReturnAValue()) |value| {...} else |err| switch (err) {...}, errdefer, et al.
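To see what that syntax saves you, here’s a rough C rendition of the “result or error” idea as a hand-rolled tagged union. It’s only an analogy: it says nothing about how Zig represents !u8 internally, and parse_u8, struct u8_result and enum parse_error are names I made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum parse_error { PARSE_EMPTY, PARSE_NOT_A_DIGIT, PARSE_OVERFLOW };

struct u8_result {
    bool ok; /* the tag */
    union {
        uint8_t value;          /* meaningful when ok is true */
        enum parse_error error; /* meaningful when ok is false */
    } as;
};

/* Parses a decimal string into a uint8_t, or reports why it couldn't. */
struct u8_result parse_u8(const char *text)
{
    unsigned total = 0;

    assert(text != NULL);
    if (text[0] == '\0')
        return (struct u8_result){ .ok = false, .as.error = PARSE_EMPTY };

    for (size_t i = 0; text[i] != '\0'; i++) {
        if (text[i] < '0' || text[i] > '9')
            return (struct u8_result){ .ok = false, .as.error = PARSE_NOT_A_DIGIT };
        total = total * 10 + (unsigned)(text[i] - '0');
        if (total > UINT8_MAX)
            return (struct u8_result){ .ok = false, .as.error = PARSE_OVERFLOW };
    }
    return (struct u8_result){ .ok = true, .as.value = (uint8_t)total };
}
```

Every caller has to remember to check .ok before touching .as.value, and nothing complains if they don’t. In Zig the compiler doesn’t let you get at the u8 without going through try, catch or an if that deals with the error case.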
I know what you’re thinking.
I know very little about Hare, but it’s another language that leverages the power of unions. On top of the union keyword, it supports a simpler way of creating a union type.
type signed = (int | i8 | i16 | i32 | i64);
The Hare introduction shows an example of using the bufio::read_line function. The function signature:
fn read_line(h: io::handle) ([]u8 | io::EOF | io::error);
This function returns one of three distinct types. The first is a data result ([]u8), the second an “exception” result (io::EOF) and the third an error (io::error). That’s a nice example of logically grouping ideas using tagged unions. Hare provides the match keyword to control the flow depending on the variable’s type.
While I won’t analyze it, I’ll mention Odin, which also gladly leverages tagged unions.
Unions are definitely a tool one should feel comfortable with.
If you’d like to read more on this topic, I suggest:
- https://www.rfleury.com/p/the-codepath-combinatoric-explosion - Ryan Fleury on why unions/sum types aren’t that sexy
- https://en.wikipedia.org/wiki/Algebraic_data_type