Refactoring NymphRPC for zero-copy optimisation

When I originally wrote the code for what became NymphRPC [1], efficiency was not my foremost concern; reliable functionality was. Admittedly, as long as you only send a couple of bytes and short strings between client and server, the overhead of network transmission is very likely to mask many inefficiencies. That is, until you try to send large chunks of data.

The motivation for refactoring NymphRPC came during performance analysis of NymphCast [2] using Valgrind's Callgrind tool. NymphCast uses NymphRPC for all its network-based communication, including the streaming of media data between client and server. This involves sending the data in chunks of hundreds of kilobytes, which is where the constant copying of data strings in NymphRPC revealed itself as a major overhead.

Specifically, this showed up (on Linux) as many calls to __memcpy_avx_unaligned_erms, largely originating from within std::string. There were multiple causes: std::string instances were copied into a NymphString type, this data was copied again during message serialisation, and message data was copied repeatedly during deserialisation: first after receiving the message from the network socket, then again while parsing it.

Finally, the old NymphRPC API was designed such that all data was copied into the NymphRPC types. This added convenience, but at a fairly large performance cost, as seen.

Using a benchmark program created with the Catch2 benchmarking framework [3][4] – consisting of a NymphRPC client and server – the following measurements were obtained after compilation with Visual Studio 2019 (MSVC 16) at the -O2 optimisation level:

benchmark name                            samples    iterations          mean
uint32                                          20             1    178.387 us
double                                          20             1    138.282 us
array     1:                                    20             1    197.452 us
array     5:                                    20             1    198.407 us
array    10:                                    20             1    204.417 us
array   100:                                    20             1    512.027 us
array  1000:                                    20             1    3.08481 ms
array 10000:                                    20             1    32.8876 ms
blob       1:                                   20             1    188.677 us
blob      10:                                   20             1    141.712 us
blob     100:                                   20             1    174.832 us
blob    1000:                                   20             1    133.617 us
blob   10000:                                   20             1    211.097 us
blob  100000:                                   20             1    362.747 us
blob  200000:                                   20             1    1.35672 ms
blob  500000:                                   20             1    3.37874 ms
blob 1000000:                                   20             1    8.19277 ms

In order to reduce the number of calls to memcpy, it was decided to move to a zero-copy approach. This effectively means that no data is copied by NymphRPC unless it is absolutely necessary, or there is no significant difference between copying a value and taking its pointer address.

This involved changing the NymphRPC type system to still copy simple types (integers, floating point, boolean), but to only accept pointers to an std::string, character array, std::vector ('array' type) or std::map ('struct' type), with optional transfer of ownership to NymphRPC. Done this way, the original non-simple value is ideally allocated once (on the stack or heap) and copied only once, into the transfer buffer for the network socket. The serialisation itself is done into a pre-allocated buffer, avoiding the use of std::string altogether.

On the receiving end, the character buffer is filled with the received data, and the parsing routine creates pointer references to the non-simple types within it. The receiving application can then read straight from this buffer. In the case of NymphCast, this means that its internal ring buffer can copy blocks of data straight from the receive buffer with a single call to memcpy(), without any intermediate copying of the data.

Running the same benchmark (adapted for the new API) with the same compilation settings produces the following results:

benchmark name                            samples    iterations          mean
uint32                                          20             1    122.193 us
double                                          20             1    140.368 us
array     1:                                    20             1    173.963 us
array     5:                                    20             1    189.888 us
array    10:                                    20             1    220.653 us
array   100:                                    20             1    573.168 us
array  1000:                                    20             1    3.33472 ms
array 10000:                                    20             1    31.8041 ms
blob       1:                                   20             1    181.433 us
blob      10:                                   20             1    194.048 us
blob     100:                                   20             1    153.998 us
blob    1000:                                   20             1    174.073 us
blob   10000:                                   20             1    166.228 us
blob  100000:                                   20             1    240.223 us
blob  200000:                                   20             1    343.233 us
blob  500000:                                   20             1    716.233 us
blob 1000000:                                   20             1     2.0748 ms

Taking into account natural variation when running benchmarks (even with network data via localhost), there is no significant change for simple types, and arrays (std::vector) show no major change either. For the latter type, a further optimisation could streamline the determination of the total binary size of the types within the array, avoiding the use of a loop. This was a compromise solution during refactoring that may deserve revisiting in the future.

The most significant change can – as expected – be observed in the character strings ('blob'). Here entire milliseconds are shaved off the larger transfers, making for a roughly four-fold improvement. In the case of NymphCast, which uses 200 kB chunks, this means a reduction from about 1.4 milliseconds to 350 microseconds, or four times faster.

After integration of the new NymphRPC into NymphCast, this improvement was confirmed by a subsequent analysis with Callgrind: __memcpy_avx_unaligned_erms dropped from the top of the list of functions the application spent its time in to somewhere below the noise floor, to the point of being inconsequential. In actual usage of NymphCast, the improvements were noticeable as somewhat improved response times.

Further analysis would have to be performed to characterise the improvements in memory (heap and stack) usage, but both are presumed to be lower – along with CPU usage – due to the reduction in copies of the data and the CPU time no longer spent creating those copies.




[1] https://github.com/MayaPosch/NymphRPC/
[2] https://github.com/MayaPosch/NymphCast
[3] https://github.com/catchorg/Catch2
[4] https://github.com/MayaPosch/NymphRPC/blob/master/test/test_performance_nymphrpc_catch2.cpp
