Parsing JSON-RPC in C

Yeah you’re not convincing me that JSON-RPC is a real protocol <

Let’s deal with JSON-RPC in C! Something I don’t ever want to hear when sipping an iced coffee, during a sunny, June afternoon. Fortunately, it’s January and I haven’t seen the sun in a while! This article is sponsored by my lack of vitamin D!

I have been building my own Moonraker GUI client. Inspired by Mainsail and Fluidd but not a web application. I want to get more experience in immediate GUI paradigm and memory management through different types of allocators. I’ll write about that project soon, but today I’ll be focusing on a small part of it. That part is “parsing JSON-RPC messages using C, but honestly not fully even, just a part of them”.

The RPC part of JSON-RPC stands for Remote Procedure Call. Messages are just instructions to “call this method with this data on your side of our connection”. There’s however no protocol level distinction between the method and the data. method key is just a part of the same JSON that encodes the data. The order of key value pairs can’t be guaranteed. Your implementation can use a simple heuristic of “search for method string in the first 64 bytes of the payload” in order to avoid branching into JSON parsing code - it’s costly and at this point you’re interested in a very small chunk of the entire payload. You’re just trying to determine which part of your code will be able to handle the actual data. If you however want to be robust™ you’ll probably avoid this types of hacks and first pull a complete JSON-RPC message and then branch into full JSON parsing.

Moonraker’s notifications get sent to you, as the client, unprompted. The incoming message needs to be classified and routed to a piece of code that knows what to do with it. We need to identify the type of the notification received. Here’s an example notification payload:

{
    "jsonrpc": "2.0",
    "method": "notify_status_update",
    "params": [
        {
            "gcode_move": {
                "speed": 1500,
            },
            "toolhead": {
                "status": "Ready"
            }
        },
        578243.57824499
    ]
}

The unique type of the notification has been encoded by the value of the method key. At the time of writing this article, Moonraker defines 26 unique notifications. So in order to classify this message I need to first find the method string, which itself isn’t free, and then compare that key’s value against 26 unique strings. This happens for every notification the client receives. That’s a lot of work for such a simple task.

There’s 26 unique notification types and the smallest, 8 bit variable type can store 255 unique values. In other words, the notification type could be a single byte. Well, except that isn’t true because JSON is a text based protocol so a number 20 would be a 2 byte string 20, and the client’s code would need to parse it into a number. At least, even if we saturate the entire space of 8 bit variable, we’d have to deal with 1 to 3 characters. Not with an unspecified length string. Wouldn’t that be nice?! Wouldn’t it be nice if we didn’t use JSON for a wire protocol and instead used a protocol that makes it possible to write performant code?

At this point I have to mention that I’m solving it by using some pre-generated code. I’m using metadesk to do that. It’s a C library, which I’m using in my C code generator. Just so it’s clear: I have written a small C program that takes in a Metadesk syntax input file and outputs C code file, which I then include in my main project. Building and executing that generator is a step of my project’s build, which means that during normal development the code generation is unnoticeable.

Here’s my .mdesk file:

@enum
@names
mr_ntf: {
   "notify_gcode_response",
   "notify_status_update",
   "notify_klippy_ready",
   "notify_klippy_shutdown",
   "notify_klippy_disconnected",
   "notify_filelist_changed",
   "notify_update_response",
   "notify_update_refreshed",
   "notify_cpu_throttled",
   "notify_proc_stat_update",
   "notify_history_changed",
   "notify_user_created",
   "notify_user_deleted",
   "notify_user_logged_out",
   "notify_service_state_changed",
   "notify_job_queue_changed",
   "notify_button_event",
   "notify_announcement_update",
   "notify_announcement_dismissed",
   "notify_announcement_wake",
   "notify_sudo_alert",
   "notify_webcams_changed",
   "notify_active_spool_set",
   "notify_spoolman_status_changed",
   "notify_agent_event",
   "notify_sensor_update",
}

It describes a node named mr_ntf, tagged with two tags enum and names. Those tags have no specific meaning for Metadesk. It’s my C code generating program that can look those up and decide to act on them. The children of the mr_ntf are all the notifications’ IDs. Each notification will identify itself with one of those strings.

The output of my C generator is this source code:

typedef enum mr_ntf mr_ntf;
enum mr_ntf
{
  MR_NTF_NOTIFY_GCODE_RESPONSE,
  MR_NTF_NOTIFY_STATUS_UPDATE,
  MR_NTF_NOTIFY_KLIPPY_READY,
  MR_NTF_NOTIFY_KLIPPY_SHUTDOWN,
  MR_NTF_NOTIFY_KLIPPY_DISCONNECTED,
  MR_NTF_NOTIFY_FILELIST_CHANGED,
  MR_NTF_NOTIFY_UPDATE_RESPONSE,
  MR_NTF_NOTIFY_UPDATE_REFRESHED,
  MR_NTF_NOTIFY_CPU_THROTTLED,
  MR_NTF_NOTIFY_PROC_STAT_UPDATE,
  MR_NTF_NOTIFY_HISTORY_CHANGED,
  MR_NTF_NOTIFY_USER_CREATED,
  MR_NTF_NOTIFY_USER_DELETED,
  MR_NTF_NOTIFY_USER_LOGGED_OUT,
  MR_NTF_NOTIFY_SERVICE_STATE_CHANGED,
  MR_NTF_NOTIFY_JOB_QUEUE_CHANGED,
  MR_NTF_NOTIFY_BUTTON_EVENT,
  MR_NTF_NOTIFY_ANNOUNCEMENT_UPDATE,
  MR_NTF_NOTIFY_ANNOUNCEMENT_DISMISSED,
  MR_NTF_NOTIFY_ANNOUNCEMENT_WAKE,
  MR_NTF_NOTIFY_SUDO_ALERT,
  MR_NTF_NOTIFY_WEBCAMS_CHANGED,
  MR_NTF_NOTIFY_ACTIVE_SPOOL_SET,
  MR_NTF_NOTIFY_SPOOLMAN_STATUS_CHANGED,
  MR_NTF_NOTIFY_AGENT_EVENT,
  MR_NTF_NOTIFY_SENSOR_UPDATE,
  MR_NTF_COUNT
};
char *mr_ntf_labels[] =
{
  "notify_gcode_response",
  "notify_status_update",
  "notify_klippy_ready",
  "notify_klippy_shutdown",
  "notify_klippy_disconnected",
  "notify_filelist_changed",
  "notify_update_response",
  "notify_update_refreshed",
  "notify_cpu_throttled",
  "notify_proc_stat_update",
  "notify_history_changed",
  "notify_user_created",
  "notify_user_deleted",
  "notify_user_logged_out",
  "notify_service_state_changed",
  "notify_job_queue_changed",
  "notify_button_event",
  "notify_announcement_update",
  "notify_announcement_dismissed",
  "notify_announcement_wake",
  "notify_sudo_alert",
  "notify_webcams_changed",
  "notify_active_spool_set",
  "notify_spoolman_status_changed",
  "notify_agent_event",
  "notify_sensor_update"
};
#define MR_NTF_METHOD_LABEL(j) (mr_ntf_labels[j])

This gives me all the labels, now listed in an array, but also an enum which can act as a list of indices for the labels array. That gives me the ability to identify the incoming notification and propagate its data, labelled with an enum variant. Since both are generated from a single .mdesk file, they’ll stay in sync whenever I need to add something or need to reorder them.

But I still need to iterate over my collection of known labels and compare with the value behind the method key, on any notification I receive.

Here’s what I used to do for a bit (I’m using sj.h to parse JSON):

r = sj_reader(cursor, msg.len - (strlen(start_marker)));
obj = sj_read(&r);
sj_Value key, val;

ASSERT(obj.type == SJ_OBJECT);

while (sj_iter_object(&r, obj, &key, &val)) {
   if (strncmp(key.start, "method", key.end - key.start) == 0) {
      for (int i = 0; i < MR_NTF_COUNT; i++) {
         if (gb_strncmp(MR_NTF_METHOD_LABEL(i), val.start, val.end - val.start) == 0)
         {
            fprintf(stderr, "matched index: %d, matched label: %s\n", i, MR_NTF_METHOD_LABEL(i));

            // Act on the matched notification here.
         }
      }
  }
}

First I’m looking for a method key. When I find it, I need to iterate over all 26 notification labels and compare each with the value behind the method key. That’s 26 strncmps for a, on average, 20+ character string. That’s a lot of work. Unreasonable amount of work for step one of parsing a network protocol!

Again, wouldn’t it be nice if we could just label methods with 1 byte number and get away with ~26 single byte comparisons?!

I could hash those method labels which would get me from comparing strings to comparing two N byte numbers. A comparison like that needs a number on both sides of the comparison (duh!). Any time a message arrives I should find the method key and hash its value. Then I can iterate over the labels array, hashing each label, before comparing those two with == operator.

There’s one obvious optimization step here. I can offload a lot of work to the code generator. I have the list of all possible names of the notifications’ methods. I can precompute the hashes for those, which means that whenever I receive a notification, the only thing I need to do is to hash its method label and compare it against a table of compile time computed hashes.

I have extended my code generator, using the Metadesk bundled DJBX33A hashing algorithm, and here’s the result:

uint64_t mr_ntf_labels_hashes[] =
{
  4498715642734265293ULL,
  6455740154458924259ULL,
  11518241604245676746ULL,
  4558261855550409297ULL,
  11584199697906845480ULL,
  4141196533563477090ULL,
  5987017119843964302ULL,
  13104124201032801591ULL,
  16490482724205127038ULL,
  1833760163147166030ULL,
  15131595754982332920ULL,
  12034200136961631251ULL,
  12034200137752784306ULL,
  7582671519945366660ULL,
  8049819003972249495ULL,
  5631422414262810757ULL,
  11077662951613017882ULL,
  13738553566512168970ULL,
  13741106797930444236ULL,
  9583240546567924303ULL,
  3195522510840930223ULL,
  4677062460812568456ULL,
  4024478643145480208ULL,
  13024503810364638226ULL,
  13190305697511578925ULL,
  5717464191006905465ULL
};
#define MR_NTF_METHOD_LABEL_HASH(j) (mr_ntf_labels_hashes[j])

The source for this data is the same as for the labels array and the enum. The order and the length of this array corresponds to the labels array and the enum. Both arrays can be accessed with the same index. Same safety conditions apply, like for example:

if (i >= 0 && i < MR_NTF_COUNT) {
  printf("hash of %s is %llu\n", mr_ntf_labels[i], mr_ntf_labels_hashes[i]);
}

Now, using the same hashing algorithm, I hash the incoming method and try to match it with my array of hashes. The index at which the match succeeds can be translated to an enum variant or used with the labels array.

r = sj_reader(cursor, msg.len - (strlen(start_marker)));
obj = sj_read(&r);
sj_Value key, val;

ASSERT(obj.type == SJ_OBJECT);

while (sj_iter_object(&r, obj, &key, &val)) {
   if (strncmp(key.start, "method", key.end - key.start) == 0) {
      // Compute the hash of the value behind the 'method' key.
      uint64_t h = hashstr(val.start, val.end - val.start);
      for (int i = 0; i < MR_NTF_COUNT; i++) {
         if (MR_NTF_METHOD_LABEL_HASH(i) == h)
         {
            fprintf(stderr, "matched index: %d, matched label: %s\n", i, MR_NTF_METHOD_LABEL(i));

            // Act on the matched notification here.
         }
      }
  }
}

Before I wrap this article up, complaining about JSON-RPC, let’s see if the hash approach actually performs better.

I have generated some random data of 60 strings and computed their hashes.

char *random_data_labels[] =
{
  "doubtfully_middle_meh_nTNxIFLs",
  "agreeable_inventory_what_06dPo55z",
  "galoshes_hm_cleaner_8Usr29ng",
  ... ,
  "pension_boohoo_out_kPWsvYRa",
  "with_hence_sans_WjwugYG7"
};
#define RANDOM_DATA_LABEL(j) (random_data_labels[j])

uint64_t random_data_labels_hashes[] =
{
  257920638079657867ULL,
  16964662114189550377ULL,
  12254638687541731993ULL,
  ... ,
  17409308107150959363ULL,
  14851118142684434913ULL
};
#define RANDOM_DATA_LABEL_HASH(j) (random_data_labels_hashes[j])

#define RANDOM_DATA_LAST_LABEL_LEN (24U)

Below you’ll see the code I have profiled. I’m using rdtsc() timer to actually see how many ticks are spent on the entire run of the run function.

void run(void)
{
   // Create a profiling structure, which stores rdtsc samples for start and end of this scope.
   PROF_TIMEFUNCTION;

#ifdef HASH_SEARCH
   uint64_t h = hashstr(random_data_labels[RANDOM_DATA_COUNT - 1], RANDOM_DATA_LAST_LABEL_LEN);
#endif

   for (int j = 0; j < 100; j++) {
      for (int i = 0; i < RANDOM_DATA_COUNT; i++) {
#ifdef HASH_SEARCH
         if (RANDOM_DATA_LABEL_HASH(i) == h) {
            break;
         }
#else
         if (strncmp(RANDOM_DATA_LABEL(i), random_data_labels[RANDOM_DATA_COUNT - 1], RANDOM_DATA_LAST_LABEL_LEN) == 0) {
            break;
         }
#endif
      }
   }
}

The matching process is performed 100 times, the entire executable gets started 200 times and the results of those 200 runs get averaged.

The result for the strncmp matching is ~140000 ticks while for the hash comparing matching it’s around 65000 ticks. Hash approach seems to be twice as fast as strncmp.

All of this to identify an incoming message… I do not understand the need for protocols that emphasize human readability over efficiency and ease of parsing. Computers all over the world exchange these JSONs, all the time. Terabytes (peta?) of JSONs are travelling through the undersea, fiber optic cables, every hour. 99.99999% (probably way more) of those messages are never read by a human, because why would they be? It’s all just a waste of time and resources. Why wouldn’t we stick to sensible, binary protocols and build tooling which would quickly decode those protocols?

JSON based protocols blow.