Archive for the ‘programming’ Category

NymphRPC: my take on remote procedure call libraries

January 29, 2018

Recently I open-sourced NymphRPC [1], a Remote Procedure Call (RPC) library I have been working on over the past few months. In this article I would like to explain exactly why I felt it necessary to unleash yet another RPC library onto the world.

Over the past years I have had to deal quite a lot with one particular RPC library (Etch [2][3]) due to a commercial project. Etch is now defunct, but it spent some time languishing as an Apache project after Cisco open-sourced it in 2011, and it was picked up by BMW as an internal protocol for their infotainment systems [4].

During the course of this commercial project it quickly became clear to me that the Etch C library which I was using had lots of issues, including stability and general usability problems (such as calls to abort() without any recovery option whenever an internal assert failed). As the project progressed, I found myself faced with the choice to either debug this existing library, or reimplement it.

At this point the C-based library was around 45,000 lines of code (LoC), with countless concurrency-related and other issues which made Valgrind spit out very long log files, and which proved to be virtually impossible to diagnose and fix. Many attempts resulted in the runtime becoming more unstable in other places.

Thus it was that I made the decision to reimplement the Etch protocol from scratch in C++. Even though there was already an official C++ runtime, it was still in Beta and it too suffered from stability issues. After dismissing it as an option, I ran into the next problem: the undocumented Etch protocol. Beyond a high-level overview of runtime concepts, there was virtually no documentation for Etch or its protocol.


Fast-forward a few months and I had reverse-engineered the Etch protocol using countless Wireshark traces and implemented it in a lightweight C++-based runtime of around 2,000 LoC. Foregoing the ‘official’ runtime architecture, I had elected to model a basic message serialisation/deserialisation flow architecture instead. Another big change was foregoing the domain-specific language (DSL) which Etch uses to define the available methods.

The latter choice was primarily to avoid the complexity that comes with a DSL and compiler architecture which has to generate functioning code that then has to be compiled into the project in question. In the case of a medium-sized Etch-based project, this auto-generated code ended up adding another 15,000 LoC. With my runtime, functions are defined in code and registered with the runtime on start-up.

In the end this new runtime I wrote performed much better (faster, lower RAM usage) than the original runtime, but it left me wondering whether there was a better RPC library out there. Projects I looked at included Apache Thrift [5] and Google Protocol Buffers [6].

Sadly, both are quite similar to Etch in that they follow the same path of an interface definition DSL (IDL) plus auto-generated client/server code. Using them is still fairly involved and cumbersome. rpclib [7] probably comes closest, but it’s still very new and has made a number of design choices which do not appeal to me, including the lack of any type of parameter validation for methods being called.


Design choices I made in NymphRPC include an extremely compact binary protocol (about 2x more compact than the Etch protocol) which nevertheless allows for a wide range of types. I also added dynamic callbacks (settable by the client). To save one the trouble of defining each RPC method in both the client and the server, the client instead downloads the server’s API upon connecting to it.
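
To illustrate the idea of defining RPC methods in code rather than in a DSL, here is a minimal sketch of what server-side method registration can look like. All names and types below are illustrative only and not NymphRPC’s actual API; see the repository [1] for the real interface.

#include <string>
#include <vector>
#include <functional>
#include <map>

// Hypothetical types, for illustration only (not NymphRPC's actual API).
enum ValueType { INT32, STRING };

struct RpcValue { ValueType type; std::string str; int i32; };

using RpcCallback = std::function<RpcValue(const std::vector<RpcValue>&)>;

struct RpcMethod {
	std::vector<ValueType> parameters;	// Used to validate incoming calls.
	ValueType returnType;
	RpcCallback callback;
};

// The registry doubles as the API definition which a connecting client
// downloads, removing the need for any generated code on the client side.
std::map<std::string, RpcMethod> methodRegistry;

RpcValue helloFunction(const std::vector<RpcValue>& params) {
	return RpcValue { STRING, "Hello, " + params[0].str, 0 };
}

int main() {
	// Register the method on start-up: name, parameter types, return type.
	methodRegistry["helloFunction"] = { { STRING }, STRING, &helloFunction };
	// A real server would now listen for connections and serve this API.
	return 0;
}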

At this point NymphRPC is being used for my file revision control project, FLR [8], as the communication fabric between clients and server.

Performance and future

Even though the network code in NymphRPC is pretty robust – it is essentially the same code that currently runs on thousands of customer systems around the world, courtesy of the commercial project which originally inspired NymphRPC – the library itself is still very much under development.

The primary focus during the development of NymphRPC has been on features and stability. The next steps will be to expand upon those features, especially more robust validation and ease of exception handling, and to optimise the existing code.

Over the coming time I’ll be benchmarking [9] NymphRPC to see how it compares to other libraries, and optimise any bottlenecks that show up. Until then, I welcome anyone who wishes to play around with NymphRPC (and FLR) and provide feedback 🙂



Categories: C++, Networking, NymphRPC, Protocols, RPC

MQTTCute: a new MQTT desktop client

January 2, 2018

At the beginning of 2017 I was first introduced to the world of MQTT as part of a building monitoring and control project. While this was generally a positive experience, I felt rather frustrated with one area of this ecosystem: the lack of proper MQTT clients, whether mobile or desktop. The custom binary protocol that was used to communicate over MQTT with the sensor and control nodes also made those existing clients rather useless.

I would regularly have to resort to using Wireshark or tcpdump to check the MQTT traffic at the TCP level, or dump the received messages into a file and open it with a hex editor, just so that I could inspect the payloads being sent by the nodes and services. This was annoying enough, but even more annoying was that the system was intended to be fully AES-encrypted, with only mosquitto_pub and mosquitto_sub actually supporting TLS certificates.

As a result I had been itching to write my own MQTT client that would actually work in this scenario. Courtesy of getting laid off just before Christmas, I had some additional time to work on this new project, and after about a week of work I released the 0.1 Alpha version today on my GitHub account [1].

MQTTCute screenshot

Called MQTTCute, it’s written in C++ with Qt 5, and uses the Mosquitto client library for MQTT communication. Not surprisingly, it shares a fair bit of code with the Command & Control client for the BMaC system [2], which I also developed during the past year. I’ll be writing more on the BMaC project in the coming time.
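
As background for how such a client talks to a broker, here is a minimal sketch of a connect-and-subscribe flow using the libmosquitto C API, roughly the kind of foundation MQTTCute builds upon. The broker address and topic are placeholders and error handling is mostly omitted:

#include <mosquitto.h>
#include <cstdio>

// Called by libmosquitto whenever a message arrives on a subscribed topic.
void on_message(struct mosquitto* m, void* userdata, const struct mosquitto_message* msg) {
	// Payloads may be binary: print each byte as hex rather than as text.
	const unsigned char* p = static_cast<const unsigned char*>(msg->payload);
	printf("%s (%d bytes):", msg->topic, msg->payloadlen);
	for (int i = 0; i < msg->payloadlen; ++i) { printf(" %02x", p[i]); }
	printf("\n");
}

int main() {
	mosquitto_lib_init();
	struct mosquitto* m = mosquitto_new("mqtt-hex-demo", true, nullptr);
	mosquitto_message_callback_set(m, on_message);
	if (mosquitto_connect(m, "localhost", 1883, 60) != MOSQ_ERR_SUCCESS) { return 1; }
	mosquitto_subscribe(m, nullptr, "sensors/#", 0);
	mosquitto_loop_forever(m, -1, 1);	// Blocks while handling network traffic.
	mosquitto_destroy(m);
	mosquitto_lib_cleanup();
	return 0;
}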

With this first version of the MQTTCute client all basic functionality is present: connecting to an MQTT broker, publishing to and subscribing to topics, publishing binary messages, and viewing received messages in both text and hexadecimal format. Since an MDI interface is used, it should be possible to keep track of a large number of topics without too much trouble.

I’m hoping for feedback on this MQTT client, but regardless I’ll be implementing new features to improve the workflow and general options of the client. Hopefully someone beyond just myself will find it useful 🙂



Categories: C++, MQTT

On the merits of comment-driven development

April 9, 2017

The number of ‘correct’ ways to develop software is almost too large to count. While some methods are all-encompassing for a (multi-person) project, others apply mostly to the engineering of the code itself. This article is about a method in the latter category.

In recent years, all-encompassing methods that have become popular include ‘Waterfall’ and ‘Agile’. To ensure code quality, so-called ‘test-driven development’ (TDD) is often enforced.

Personally, I have been doing software development since I was about 7 years old, starting with QBasic (on MS-DOS) and trying out many languages before settling on C++ and VHDL as my preferred ones. What I have come to appreciate over these decades is a) starting with a plan, and b) writing out the code flow in comments first, before committing any lines of code to the structure.

Obvious things

I hope that I do not have to explain here why starting off with a set of requirements and a worked-out design is a good idea. Even though software is often easier to fix afterwards than a hardware-based design, the cost of a late refactoring can be higher than one may be willing – or can afford – to pay. And sometimes one’s software project is part of a satellite around a distant planet.

The same is true of documenting APIs, classes, methods and general application structure, as well as protocols. Not doing this seems like a brilliant idea until you’re the person cursing at some predecessor who figured that anyone could work out what they were thinking when they wrote that code.

Comment-driven development

As I mentioned earlier, I prefer to first put down comments before I start writing code. This is of course not true for trivial code or standard routines, but definitely for everything else. The main benefit of this for me is that it allows me to organise my thoughts and consider alternate approaches before committing myself to a certain architecture.

Recently I became aware of this style of developing being called ‘comment-driven development’, or CDD. While some seem to take this style as a bit of a joke, more and more people are taking it seriously these days.

When I look back at my old code from years ago, I really appreciate the comment blocks where I pen down my thoughts and considerations. Instead of reading the code itself, I can read these comments and use the code merely for illustrative purposes. This to me is another major benefit of CDD: it makes the source code truly self-documenting.

The steps of CDD can be summed up as follows (a brief illustration follows the list):

  1. Take the worked out requirements and design documents.
  2. Implement basic application structure.
  3. Fill in the skeleton classes and functions with comment blocks describing intent and considerations.
  4. Find any collisions and issues with assumptions made in one comment relative to another.
  5. Go back and fix the issues in the design and/or architecture which caused these results.
  6. Repeat steps 4 and 5 until everything looks right.
  7. Start implementing the first code blocks.
  8. When finding issues with the commentary, return to step 4.
  9. Perform tests (unit, integration, etc.). If an error in the commentary is found, go back to step 4; errors in the code implementation itself are simply fixed in place.
  10. Final integration, validation, documentation and delivery.
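
As a brief illustration of step 3, with an entirely hypothetical configuration parser as the example, a skeleton function under CDD might look like this before a single line of implementation code is written:

#include <string>
#include <utility>
#include <vector>

// --- PARSE CONFIG ---
// Intent: read the configuration file at 'path' and produce key/value pairs.
// Considerations:
// * The file may not exist on first start-up; fall back to built-in defaults
//   rather than treating this as an error.
// * Encoding is assumed to be UTF-8 throughout. This has to match the
//   assumption made in writeConfig(), or round-trips will corrupt values.
// TODO: decide whether duplicate keys are an error, or last-one-wins.
// FIXME: the assumed line-length limit conflicts with the design document.
std::vector<std::pair<std::string, std::string>> parseConfig(const std::string& path) {
	// 1. Open the file; on failure, return the defaults.
	// 2. Read line by line, skipping comments and blank lines.
	// 3. Split each line on the first '=' and trim whitespace.
	// 4. Return the accumulated pairs.
	return {};
}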

What is very valuable about CDD is that it forces one to consider the validity of the assumptions being made, or more simply put: whether one truly wants to write the code one was thinking of writing, or should reconsider some or all aspects of it. It also forms a valuable bridge between requirements and design documents on the one side, and source code, tests and documentation on the other.

CDD doesn’t supersede or seek to compete with Waterfall, Agile, TDD and the like, but instead complements any development process by providing an up-to-date view of the current status of a design and its implementation. When combined with a revision control system such as SVN or Git, one can thus track the changes to the design.

It’s also possible to add tags to the comments to indicate questions or known defects. Common here are the ‘TODO:’ and ‘FIXME:’ tags, which some editors also parse and display in an overview. With such a system it’s immediately clear at a glance which issues and questions remain.


One thing I often hear about comments is that one should not use them at all: the ‘working documentation’ for code is supposedly contained in the commit messages of the revision control system and kin, and comments are always out of date anyway.

The thing is that even the best commit messages I have seen cannot provide the in-context level of detail which comments can. Commit messages also rarely contain questions, remarks and the like, and where they do, it’s not easy to get a central overview of outstanding issues in an editor, even with access to the repository.

Out-of-date comments are merely a sign of a lack of discipline. CDD isn’t unlike TDD in that if one doesn’t maintain the comments or tests, the whole system stops working. As both systems are complementary, this isn’t too surprising.

A good reason to complement TDD with CDD is that it can drastically reduce the number of test cases: by strategically testing only those cases which came forward as ‘important’ during the CDD phase, only a limited number of unit tests is required, with integration testing covering the remaining cases. In short, CDD improves test planning.

Writing documentation is made infinitely easier with CDD, as all one has to do is take the commentary in each source and header file and turn it into a more standard documentation format. Both accuracy and completeness are improved.

Final thoughts

The above are mostly just my own thoughts and experiences on the subject of CDD. I do not claim that it is the be-all and end-all of CDD, just that it’s the form of CDD which I have used for years and which works really well for me.

I welcome constructive feedback and thoughts on the topic of CDD and other ways in which it can improve or support a project. If CDD is an integral part of your personal or even professional projects I would like to hear about it as well 🙂


Request dispatcher using C++ threads

November 30, 2016

A couple of weeks ago I found myself implementing a request dispatcher for an existing embedded C++ project. The basic concept was to have a static dispatcher class to which requests could be added from various sources. Worker threads would then pick up requests as they became available and process them.

Looking around for inspiration, I didn’t find anything which really appealed to me. The main issue with many existing designs was that they didn’t so much use a dispatcher as a request queue, with each worker thread responsible for polling the queue in order to wait for new requests. This seemed rather messy and error-prone to me.

The solution I ultimately settled on was a dispatcher which would spawn a specified number of worker threads, each of which would launch and add themselves to the dispatcher, where they would wait on a request using their own condition variable. An incoming request would be assigned to a waiting worker, after which its condition variable would be signalled. The worker thread would then process the request, finish it and add itself back to the dispatcher.

While I originally implemented the Dispatcher in C++03, using the POCO libraries for threading since C++11 support was absent on the target platform, I later reimplemented it in pure C++11 using its native threading API. Basic usage of this dispatcher looks like the following:

#include "dispatcher.h"
#include "request.h"

#include <iostream>
#include <string>
#include <csignal>
#include <thread>
#include <chrono>
#include <mutex>

using namespace std;

// Globals
sig_atomic_t signal_caught = 0;
mutex logMutex;

void sigint_handler(int sig) {
	signal_caught = 1;
}

// Logging function passed to each request. The mutex keeps the
// output of concurrent workers from interleaving.
void logFnc(string text) {
	lock_guard<mutex> lock(logMutex);
	cout << text << "\n";
}

int main() {
	signal(SIGINT, &sigint_handler);

	// Initialise the dispatcher with ten worker threads.
	Dispatcher::init(10);

	cout << "Initialised.\n";

	// Add fifty requests to the dispatcher.
	int cycles = 0;
	Request* rq = 0;
	while (!signal_caught && cycles < 50) {
		rq = new Request();
		rq->setValue(cycles++);
		rq->setOutput(&logFnc);
		Dispatcher::addRequest(rq);
	}

	// Wait for the requests to be processed, then stop the workers.
	this_thread::sleep_for(chrono::seconds(5));
	Dispatcher::stop();
	cout << "Clean-up done.\n";

	return 0;
}

Of note here are the sigint_handler() and logFnc() functions. The former is used in combination with signal() from the <csignal> header to allow one to interrupt the application, which is useful for, for example, a server application. In this particular implementation it’s unnecessary, but it serves as inspiration for a more complex application.

The logFnc() function is passed to each request, allowing the request to pass along any output from its processing. Here it’s just used to write safely to cout.

The main() function is rather uneventful: after installing the SIGINT handler, the Dispatcher is initialised with a total of ten worker threads. After this, a loop is entered during which a total of fifty Request instances are added to the Dispatcher. Next, we sleep for five seconds to allow all requests to be processed. Finally we stop the Dispatcher, which terminates all worker threads.

Moving on, the Dispatcher class definition looks as follows:

#pragma once

#include "abstract_request.h"
#include "worker.h"

#include <queue>
#include <mutex>
#include <thread>
#include <vector>

using namespace std;

class Dispatcher {
	static queue<AbstractRequest*> requests;
	static queue<Worker*> workers;
	static mutex requestsMutex;
	static mutex workersMutex;
	static vector<Worker*> allWorkers;
	static vector<thread*> threads;

public:
	static bool init(int workers);
	static bool stop();
	static void addRequest(AbstractRequest* request);
	static bool addWorker(Worker* worker);
};


As one can see, this is a fully static class. It offers the init(), stop(), addRequest() and addWorker() methods, which should be quite self-explanatory at this point. Moving on with the implementation:

#include "dispatcher.h"

#include <iostream>
using namespace std;

// Static initialisations.
queue<AbstractRequest*> Dispatcher::requests;
queue<Worker*> Dispatcher::workers;
mutex Dispatcher::requestsMutex;
mutex Dispatcher::workersMutex;
vector<Worker*> Dispatcher::allWorkers;
vector<thread*> Dispatcher::threads;

// --- INIT ---
// Start the number of requested worker threads.
bool Dispatcher::init(int workers) {
	thread* t = 0;
	Worker* w = 0;
	for (int i = 0; i < workers; ++i) {
		w = new Worker;
		allWorkers.push_back(w);
		t = new thread(&Worker::run, w);
		threads.push_back(t);
	}

	return true;
}

After the static initialisations the init() method creates as many threads with Worker instances as specified. Having at least one worker instance here would be beneficial, of course. Each thread and Worker pointer is pushed into its own vector for later use:

// --- STOP ---
// Terminate the worker threads and clean up.
bool Dispatcher::stop() {
	for (int i = 0; i < allWorkers.size(); ++i) {
		allWorkers[i]->stop();
	}
	cout << "Stopped workers.\n";
	for (int j = 0; j < threads.size(); ++j) {
		threads[j]->join();
		cout << "Joined threads.\n";
	}
	return true;
}

In the stop() method both vectors are used, first to tell each Worker instance to terminate, then to wait for each thread instance to join.

The real magic starts in the remaining two methods, however:

// --- ADD REQUEST ---
void Dispatcher::addRequest(AbstractRequest* request) {
	// Check whether there's a worker available in the workers queue, else add
	// the request to the requests queue.
	workersMutex.lock();
	if (!workers.empty()) {
		Worker* worker = workers.front();
		worker->setRequest(request);
		condition_variable* cv;
		worker->getCondition(cv);
		cv->notify_one();
		workers.pop();
		workersMutex.unlock();
	}
	else {
		workersMutex.unlock();
		requestsMutex.lock();
		requests.push(request);
		requestsMutex.unlock();
	}
}

// --- ADD WORKER ---
bool Dispatcher::addWorker(Worker* worker) {
	// If a request is waiting in the requests queue, assign it to the worker.
	// Else add the worker to the workers queue.
	// Returns true if the worker was added to the queue and has to wait for
	// its condition variable.
	bool wait = true;
	requestsMutex.lock();
	if (!requests.empty()) {
		AbstractRequest* request = requests.front();
		worker->setRequest(request);
		requests.pop();
		wait = false;
		requestsMutex.unlock();
	}
	else {
		requestsMutex.unlock();
		workersMutex.lock();
		workers.push(worker);
		workersMutex.unlock();
	}
	return wait;
}

The inline comments explain the basic code flow in each method. A request is assigned to a waiting worker thread if one is available, else the request is added to the queue. An incoming Worker has a request assigned to it if one is available, else the Worker is added to the queue. A waiting Worker has its condition variable signalled.

The next important class to look at is the abstract AbstractRequest class:

#pragma once

class AbstractRequest {
public:
	// Virtual destructor, so that derived requests can be deleted via a base pointer.
	virtual ~AbstractRequest() { }
	virtual void setValue(int value) = 0;
	virtual void process() = 0;
	virtual void finish() = 0;
};


This class defines the interface for each request type. Deriving from it, one can implement specific requests to fit whichever source they originate from and whatever response format they require. One can also add specific extra methods to the request implementation when needed:

#pragma once
#ifndef REQUEST_H
#define REQUEST_H

#include "abstract_request.h"

#include <string>

using namespace std;

typedef void (*logFunction)(string text);

class Request : public AbstractRequest {
	int value;
	logFunction outFnc;

public:
	void setValue(int value) { this->value = value; }
	void setOutput(logFunction fnc) { outFnc = fnc; }
	void process();
	void finish();
};

#endif


As one can see in the Request implementation which we employ in the usage example, we added an additional setOutput() method to the API, which allows one to assign a function pointer which is used for logging:

#include "request.h"

// --- PROCESS ---
void Request::process() {
	outFnc("Starting processing request " + std::to_string(value) + "...");
	// In a real application, the actual work would be done here.
}

// --- FINISH ---
void Request::finish() {
	outFnc("Finished request " + std::to_string(value));
}

This request implementation does not do anything in particular, merely writing via the function pointer when it’s told to process or finish the request. In an actual implementation one would do the processing here, or alternatively one could use a different AbstractRequest API and hand the request instance to another (static) class which handles the processing. This is heavily implementation-dependent, however.

That said, let’s finish by looking at the Worker class:

#pragma once
#ifndef WORKER_H
#define WORKER_H

#include "abstract_request.h"

#include <condition_variable>
#include <mutex>

using namespace std;

class Worker {
	condition_variable cv;
	mutex mtx;
	unique_lock<mutex> ulock;
	AbstractRequest* request;
	bool running;
	bool ready;

public:
	Worker() { running = true; ready = false; ulock = unique_lock<mutex>(mtx); }
	void run();
	void stop() { running = false; }
	void setRequest(AbstractRequest* request) { this->request = request; ready = true; }
	void getCondition(condition_variable* &cv);
};

#endif


The constructor is very basic, merely initialising some variables, including the mutex lock used by the condition variable. The run() method is the entry point for the thread running the Worker instance. The stop() method serves to terminate the loop in the run() method as we’ll see in a moment.

Finally the setRequest() method is used by the Dispatcher to set a new request, with getCondition() used to obtain a reference to the Worker’s condition variable so that it can be signalled.

#include "worker.h"
#include "dispatcher.h"

#include <chrono>

using namespace std;

// --- GET CONDITION ---
void Worker::getCondition(condition_variable* &cv) {
	cv = &(this->cv);
}

// --- RUN ---
// Runs the worker instance.
void Worker::run() {
	while (running) {
		if (ready) {
			// Execute the request.
			ready = false;
			request->process();
			request->finish();
		}

		// Add self to the Dispatcher queue and execute the next request or wait.
		if (Dispatcher::addWorker(this)) {
			// Use the ready loop to deal with spurious wake-ups.
			while (!ready && running) {
				if (cv.wait_for(ulock, chrono::seconds(1)) == cv_status::timeout) {
					// We timed out, but we keep waiting unless the worker is
					// stopped by the dispatcher.
				}
			}
		}
	}
}

The run() method immediately enters a while loop which checks the running variable. This boolean is not changed unless the stop() method is called. Since it’s a plain boolean, access is atomic in practice on most architectures, but for a production implementation one may wish to verify this for the relevant processor architecture, or use std::atomic<bool> (see the sketch below).

The other boolean variable (ready) indicates whether a request is waiting to be executed. If no request is waiting, the Worker adds itself to the Dispatcher. At this point two things can happen: a request is assigned to it (changing ready to true), or the Worker is added to the queue and has to wait. In the latter case the condition variable is waited upon until it is signalled or the timer expires.

Hereby the ready variable serves to deal with spurious wake-ups: unless a request has been set, the condition variable is simply waited upon again. Only if ready is changed to true, or running to false, will the waiting loop terminate. If it is the running variable which changed, the main loop terminates as well.
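
For a production implementation, both flags could be made explicitly thread-safe with std::atomic, removing any reliance on plain bool accesses being atomic on the target architecture. A minimal sketch of such a variant of the Worker internals:

#include <atomic>
#include <condition_variable>
#include <mutex>

class AbstractRequest;

// Variant of the Worker flags using std::atomic<bool>: loads and stores are
// guaranteed to be atomic on any conforming implementation, rather than
// relying on the behaviour of the target architecture.
class AtomicWorker {
	condition_variable cv;
	mutex mtx;
	AbstractRequest* request = nullptr;
	std::atomic<bool> running { true };
	std::atomic<bool> ready { false };

public:
	void stop() { running = false; cv.notify_one(); }
	void setRequest(AbstractRequest* req) {
		{
			lock_guard<mutex> lock(mtx);
			request = req;
			ready = true;
		}
		cv.notify_one();
	}
	void waitForRequest() {
		unique_lock<mutex> lock(mtx);
		// wait() re-checks the predicate on every wake-up, which also deals
		// with spurious wake-ups without an explicit timeout loop.
		cv.wait(lock, [this] { return ready.load() || !running.load(); });
	}
};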

Upon executing the demonstration application, we see the following output:

$ ./dispatcher_demo.exe
Starting processing request 0...
Starting processing request 2...
Finished request 2
Starting processing request 3...
Starting processing request 7...
Starting processing request 9...
Finished request 7
Starting processing request 10...
Finished request 10
Starting processing request 11...
Finished request 11
Starting processing request 12...
Finished request 12
Starting processing request 13...
Finished request 13
Starting processing request 14...
Finished request 14
Starting processing request 15...
Finished request 15
Starting processing request 16...
Finished request 16
Starting processing request 17...
Finished request 17
Finished request 3
Starting processing request 18...
Finished request 9
Starting processing request 19...
Finished request 18
Starting processing request 20...
Starting processing request 21...
Finished request 20
Starting processing request 4...
Finished request 21
Starting processing request 22...
Starting processing request 8...
Finished request 22
Finished request 4
Starting processing request 23...
Finished request 8
Finished request 23
Starting processing request 24...
Starting processing request 25...
Starting processing request 27...
Finished request 24
Starting processing request 26...
Finished request 27
Starting processing request 28...
Finished request 25
Finished request 28
Finished request 26
Starting processing request 29...
Starting processing request 31...
Starting processing request 30...
Starting processing request 32...
Finished request 32
Finished request 31
Starting processing request 33...
Starting processing request 1...
Finished request 33
Starting processing request 34...
Starting processing request 5...
Finished request 1
Finished request 0
Finished request 5
Starting processing request 36...
Starting processing request 37...
Finished request 36
Starting processing request 6...
Finished request 37
Finished request 6
Starting processing request 40...
Starting processing request 41...
Starting processing request 38...
Finished request 19
Starting processing request 39...
Finished request 38
Finished request 41
Starting processing request 43...
Starting processing request 35...
Starting processing request 44...
Finished request 29
Finished request 44
Finished request 35
Starting processing request 46...
Starting processing request 45...
Finished request 30
Finished request 45
Finished request 46
Starting processing request 48...
Starting processing request 49...
Finished request 48
Finished request 49
Finished request 34
Starting processing request 47...
Finished request 47
Finished request 39
Finished request 43
Finished request 40
Starting processing request 42...
Finished request 42
Stopped workers.
Joined threads.
Joined threads.
Joined threads.
Joined threads.
Joined threads.
Joined threads.
Joined threads.
Joined threads.
Joined threads.
Joined threads.
Clean-up done.

One can see that even though the requests are added in sequential order, the processing is performed in a decidedly asynchronous manner. This particular demonstration was compiled using GCC 5.3 (64-bit) under MSYS2 Bash on Windows 7, and run on an Intel i7 6700K system with 4 physical cores (8 logical cores with Hyper-Threading enabled).

As a bonus, here’s the Makefile used to compile the above code:

GCC := g++

OUTPUT := dispatcher_demo
SOURCES := $(wildcard *.cpp)
# On Linux, -pthread would also be needed here for std::thread support.
CCFLAGS := -std=c++11 -g3

all: $(OUTPUT)

$(OUTPUT):
	$(GCC) -o $(OUTPUT) $(SOURCES) $(CCFLAGS)

clean:
	rm $(OUTPUT)

.PHONY: all clean

I hope that this article was at least somewhat useful or informative to any who read it. Feel free to leave any questions and/or constructive feedback 🙂


Cerflet: like Servlets, but with more C++

A few months ago I wrote about the research I had been doing on multiple ways to write server-based web applications, using Java Servlets, FastCGI/C++ and Qt/C++. While this showed that C++-based applications tend to be faster than Java-based ones, it only looked at single-threaded, sequential requests.

While looking at ways to get proper concurrent performance out of a Servlet-like C++ implementation, I decided to take another look at the POCO C++ Libraries [1] and found that its HTTP server implementation uses a proper thread pool of worker threads, scaling well across many concurrent requests.

After spending a few hours putting a basic wrapper library together, I wrote the following ‘Hello World’ example code to demonstrate a basic HTTP Cerflet:

#include <httpcerflet.h>

#include <iostream>
#include <string>

using namespace std;

class HelloRequestHandler : public HTTPRequestHandler {
public:
	void handleRequest(HTTPServerRequest& request, HTTPServerResponse& response) {
		Application& app = Application::instance();
		app.logger().information("Request from " + request.clientAddress().toString());

		response.setContentType("text/html");
		std::ostream& ostr = response.send();
		ostr << "<!DOCTYPE html><html><head><title>Hello World</title></head>";
		ostr << "<body><p>Hello World!</p></body></html>";
	}
};

int main(int argc, char** argv) {
	// 0. Initialise: create Cerflet instance and set routing.
	HttpCerflet cf;
	RoutingMap map;
	map["/"] = &createInstance<HelloRequestHandler>;
	cf.routingMap(map);

	// 1. Start the server.
	return cf.run(argc, argv);
}

In the main() function we create a new HttpCerflet instance and a new RoutingMap. The latter contains the routes we wish to map to a handler, which in this case is the HelloRequestHandler. For the handler we take the address of the template function createInstance<>(), with our custom handler class as the template argument.

What this mapping does is that when a new request matches one of the keys of the RoutingMap, a fresh instance of the specified handler is created and pushed onto a waiting worker thread.
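
The factory function referenced by the map can be as simple as the following sketch; the actual Cerflet implementation may differ in its details:

// Hypothetical sketch of the createInstance<>() factory: its address is
// stored in the RoutingMap, allowing the server to create a fresh handler
// instance of the right type for each matched request.
template <typename T>
HTTPRequestHandler* createInstance() {
	return new T();
}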

The handler class itself derives from the HTTPRequestHandler class, a standard POCO Net class, and reimplements its handleRequest() method. This shows that Cerflet is more of a complement to POCO than an abstraction over it. The main goal of Cerflet is to hide some of the complexity and boilerplate of POCO’s HTTP server, allowing one to focus on writing the actual business logic.


As for performance, an ApacheBench benchmark was run with a concurrency of 5, for a total of 100,000 requests.

1. Java Servlet

Server Software:        Apache-Coyote/1.1
Server Hostname:
Server Port:            8080

Document Path:          /examples/servlets/servlet/HelloWorldExample
Document Length:        400 bytes

Concurrency Level:      5
Time taken for tests:   7.697 seconds
Complete requests:      100000
Failed requests:        0
Total transferred:      56200000 bytes
HTML transferred:       40000000 bytes
Requests per second:    12992.07 [#/sec] (mean)
Time per request:       0.385 [ms] (mean)
Time per request:       0.077 [ms] (mean, across all concurrent requests)
Transfer rate:          7130.42 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.2      0       1
Processing:     0    0   0.5      0      14
Waiting:        0    0   0.4      0      14
Total:          0    0   0.5      0      14

Percentage of the requests served within a certain time (ms)
  50%      0
  66%      1
  75%      1
  80%      1
  90%      1
  95%      1
  98%      1
  99%      1
 100%     14 (longest request)

2. Cerflet

Server Software:
Server Hostname:
Server Port:            9980

Document Path:          /
Document Length:        99 bytes

Concurrency Level:      5
Time taken for tests:   7.220 seconds
Complete requests:      100000
Failed requests:        0
Total transferred:      19900000 bytes
HTML transferred:       9900000 bytes
Requests per second:    13850.42 [#/sec] (mean)
Time per request:       0.361 [ms] (mean)
Time per request:       0.072 [ms] (mean, across all concurrent requests)
Transfer rate:          2691.63 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.2      0       1
Processing:     0    0   0.5      0      10
Waiting:        0    0   0.4      0      10
Total:          0    0   0.5      0      10

Percentage of the requests served within a certain time (ms)
  50%      0
  66%      0
  75%      1
  80%      1
  90%      1
  95%      1
  98%      1
  99%      1
 100%     10 (longest request)


In this benchmark, Cerflet is about 7% faster than the equivalent Tomcat-based Hello World example. Note that Cerflet logs each request to the console, which slows it down somewhat, while Tomcat does not. Cerflet’s Hello World example was compiled with the -Og optimisation level using 32-bit GCC 5.3 (on Windows, MSYS2). Version 1.6 of the POCO libraries was used, as obtained via MSYS2’s Pacman package manager.

For Tomcat, the binary distribution of version 8.0.30 as obtained via the official Apache site was used, with the server started manually using the provided startup.bat script. Both servers were run on a Windows 7 Ultimate x64 platform (Intel i7 6700K, 32 GB DDR4), with ApacheBench using the loopback device.


Without compensating for all the differences between the two examples and other potential factors, it is fair to say at this point that Servlets and Cerflets are roughly equivalent in terms of performance for a simple Hello World example. Likely Cerflets are slightly faster (5-10%), with more gains to be had via compiler optimisations (-O3).

The type of operations performed further along in the business logic will likely have the most influence on the overall performance difference between these two platforms. Cerflets do however show that C++-based server-side web applications are more than just a viable option, backed by a mature programming language (C++) and cross-platform libraries (POCO).

Cerflets as they exist today are reminiscent of Spring Boot Java applications, which also feature a built-in HTTP server and thus do not rely on a Servlet container (e.g. Tomcat). The advantage of Cerflets is however that they only depend on the POCO libraries (if not linked fully statically) and not on a central runtime (JVM). This significantly eases deployment.

The Cerflet project’s Github page [2] contains the example used here, ready to try out oneself, as well as the HTTP Cerflet implementation for use in new projects. Both feedback and contributions are welcome.



Categories: C++, Cerflet, HTTP

First look at servlet and FastCGI performance

January 5, 2016

As a primarily C++ developer who has also done a lot of web-related development (PHP, JSP, Java Servlets, etc.), one of the nagging questions I have had for years was the possible performance gain by moving away from interpreted languages towards native code for (server-based) web applications.

After the demise of the mainframe-and-terminals setup of the 1980s, the World Wide Web (WWW, or simply ‘web’) has been making a gradual return to that model, with web applications running on servers (‘mainframes’) serving content to web-browser-using clients (‘terminals’). As part of this, most processing power was located on the servers, with little processing power required client-side, at least until the advent of fancy UIs built in resource-heavy JavaScript on the client.

Even today, however, most of the processing is still done on servers, with a single server serving thousands of clients per day, hour, or even minute. It’s clear that saving even a second per client request on the server side can mean big savings. In light of this it is curious that most server-side processing is done either in interpreted languages via CGI or related mechanisms (Perl, PHP, ColdFusion, JavaScript, etc.), or in bytecode-based languages (C#, Java, VB.NET), instead of in highly optimised native code.

While I will not go too deeply into the performance differences between those different implementations in this article, I think that most reading this will at least be familiar with the performance delta between the first two groups mentioned. Interpreted languages in general tend to lag behind the pack on sheer performance metrics, due to the complexity of parsing a text-based source file, creating bytecode out of that and running this with the language’s runtime.

In this light, the more interesting comparison in my eyes is therefore that between the last two groups: bytecode-based and native code. To create a fair comparison, I will first have to completely understand how for example Java servlets are implemented and run by a servlet container such as Tomcat in order to create a fair comparison in native code.

As a start, I have however set up a range of examples which I then benchmarked using ApacheBench. The first example uses the ‘Hello World’ servlet example which is provided with Apache Tomcat 8.x. The second uses a basic responder C++ application connected using FastCGI to a Lighttpd server. The third and final example uses C++/Qt to implement a custom QTcpServer instance which does HTTP parsing and responds to queries using a basic REST-based API.

The host system is an Intel 6700K-based x86-64 system with 32 GB of RAM, running Windows 7 x64 Ultimate. The servlet example is used as-is, without modification, from the Apache distribution. The FastCGI C++ example is compiled using Mingw64 (GCC 5.3) with -O1. The Qt-based example is compiled using Mingw (GCC 4.9) from within Qt Creator, in debug mode.

All ApacheBench tests are run with 1,000 requests and a concurrency of 1, since no scaling will be tested until the scaling of servlets and their containers is better understood.

Next, the results:

1. Java servlet

Server Software:        Apache-Coyote/1.1
Server Hostname:
Server Port:            8080

Document Path:          /examples/servlets/servlet/HelloWorldExample
Document Length:        400 bytes

Concurrency Level:      1
Time taken for tests:   0.230 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      562000 bytes
HTML transferred:       400000 bytes
Requests per second:    4347.83 [#/sec] (mean)
Time per request:       0.230 [ms] (mean)
Time per request:       0.230 [ms] (mean, across all concurrent requests)
Transfer rate:          2386.21 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.8      0      10
Processing:     0    0   1.1      0      10
Waiting:        0    0   0.8      0      10
Total:          0    0   1.4      0      10

Percentage of the requests served within a certain time (ms)
  50%      0
  66%      0
  75%      0
  80%      0
  90%      0
  95%      0
  98%      0
  99%     10
 100%     10 (longest request)

2. FastCGI

Server Software:        LightTPD/1.4.35-1-IPv6
Server Hostname:
Server Port:            80

Document Path:          /cerflet/
Document Length:        146 bytes

Concurrency Level:      1
Time taken for tests:   26.531 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      307000 bytes
HTML transferred:       146000 bytes
Requests per second:    37.69 [#/sec] (mean)
Time per request:       26.531 [ms] (mean)
Time per request:       26.531 [ms] (mean, across all concurrent requests)
Transfer rate:          11.30 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:     0   27  11.0     30      50
Waiting:        0   26  11.0     30      40
Total:          0   27  11.0     30      50

Percentage of the requests served within a certain time (ms)
  50%     30
  66%     30
  75%     30
  80%     40
  90%     40
  95%     40
  98%     40
  99%     40
 100%     50 (longest request)

3. C++/Qt

Server Software:
Server Hostname:
Server Port:            8010

Document Path:          /greeting/
Document Length:        50 bytes

Concurrency Level:      1
Time taken for tests:   0.240 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      109000 bytes
HTML transferred:       50000 bytes
Requests per second:    4166.67 [#/sec] (mean)
Time per request:       0.240 [ms] (mean)
Time per request:       0.240 [ms] (mean, across all concurrent requests)
Transfer rate:          443.52 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.5      0      10
Processing:     0    0   1.2      0      10
Waiting:        0    0   0.9      0      10
Total:          0    0   1.3      0      10

Percentage of the requests served within a certain time (ms)
  50%      0
  66%      0
  75%      0
  80%      0
  90%      0
  95%      0
  98%      0
  99%     10
 100%     10 (longest request)


It should be noted here that to make the FastCGI example work, the original approach using the fcgi_stdio.h header as suggested by the FastCGI documentation had to be abandoned; instead the fcgiapp.h header and its functions were used. With the former approach the response times would get slower with each run, while with the latter they remained constant.

The FastCGI application ended up looking like this:

#include "include/fcgiapp.h"
#include <cstdlib>

int count;
FCGX_Request request;

void initialize() {
	int sock = FCGX_OpenSocket(":9000", 5000);
	if (sock < 0) {
		// fail: handle.
		exit(1);
	}

	FCGX_InitRequest(&request, sock, 0);
	count = 0;
}

int main() {
	/* Initialisation. */
	initialize();

	/* Response loop. */
	while (FCGX_Accept_r(&request) == 0) {
		FCGX_FPrintF(request.out, "Content-type: text/html\r\n"
			"\r\n"
			"<title>FastCGI Hello! (C, fcgiapp library)</title>"
			"<h1>FastCGI Hello! (C, fcgiapp library)</h1>"
			"Request number %d running on host <i>%s</i>\n",
			++count, FCGX_GetParam("SERVER_NAME", request.envp));
	}

	return 0;
}

Compared to the baseline values from the Tomcat servlet benchmark, the results of the FastCGI benchmark are downright disappointing, with each request taking roughly 30 ms or longer, where the servlet instance needed less than 1 ms, or 10 ms at most. Despite attempts to optimise the FastCGI example, significant bottlenecks appear to exist. Whether these are in the Lighttpd server, the mod_fcgi server module or the FastCGI library is hard to say at this point.

For the C++/Qt example one can unabashedly say that even with the hacked-together code which was used, this unoptimised code ran on par with the highly optimised production code of the Tomcat server and its servlet API. It should be noted that although this example used the Qt networking classes, it didn’t use Qt code for the actual socket communication beyond accepting the client connection.

Due to known issues with the QTcpSocket class on Windows, a custom drop-in replacement class was used instead, which interfaces directly with the Winsock2 (ws2_32) DLL using the standard Berkeley socket API. This class has been used in other projects before and is relatively stable at this point. How it compares performance-wise with the QTcpSocket class is currently unknown.
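
For context, here is a minimal sketch of the Berkeley-style calls which such a drop-in class wraps; on Windows these resolve to the Winsock2 (ws2_32) implementations. Error handling is omitted:

#include <winsock2.h>

int main() {
	// Winsock-specific initialisation; no equivalent is needed on POSIX systems.
	WSADATA wsa;
	WSAStartup(MAKEWORD(2, 2), &wsa);

	// Standard Berkeley socket API from here on: create, bind, listen, accept.
	SOCKET s = socket(AF_INET, SOCK_STREAM, 0);
	sockaddr_in addr = {};
	addr.sin_family = AF_INET;
	addr.sin_port = htons(8010);
	addr.sin_addr.s_addr = INADDR_ANY;
	bind(s, (sockaddr*) &addr, sizeof(addr));
	listen(s, 5);

	SOCKET client = accept(s, 0, 0);
	// ... read the HTTP request from 'client', write the response ...

	closesocket(client);
	closesocket(s);
	WSACleanup();
	return 0;
}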

Summarising, it seems at this point at the very least plausible that native code can outperform bytecode for web applications. More research has to be done into scaling methods, as well as into the performance characteristics of applications more complex than a simple ‘Hello World’.


A look at asm.js and the future with WebAssembly

December 30, 2015

Earlier this year the WebAssembly [1][2] project was announced, describing itself as “a new, portable, size- and load-time-efficient format suitable for compilation to the web”. It’s a W3C community group project, headed by representatives of all major browser developers. It follows similar efforts by Mozilla (asm.js) and Google (NaCl, Native Client) to create a bytecode format to run on browsers. This would complement the existing JavaScript runtimes, while adding much desired features like real multi-threading, local variables and a massive speed boost.

Since the late 90s I have used web-based technologies in both a hobby and professional fashion, observing how a mostly static web began to move towards adding as much scripting as possible to any page, first using Visual Basic Script, JavaScript, ActiveX and Java, then basically just JavaScript. This half-baked language grew from a quick addition by Netscape to keep up with the competition into the center point of the modern web, being forced into roles it was never meant or designed for.

Since that time JavaScript runtimes have become modern wonders of JIT VM implementations, minimising parsing times while dealing with JavaScript’s idiosyncrasies in such a way to maximise performance. There is however no denying that having a text-based scripting language is much slower than starting off with native code, or bytecode for that matter. This is the reasoning which underlies these efforts by Google and Mozilla to respectively use native code and JavaScript as assembly language.

While Google’s NaCl effort is a fairly straightforward implementation which allows a limited set of the native platform’s code (x86/x86-64, ARM or MIPS) to be executed in a sandboxed environment, Mozilla’s asm.js efforts aimed to use JavaScript as an intermediate (almost bytecode) language, providing a mapping from C/C++ code to a subset of JavaScript. The benefit of the asm.js approach is that it runs in virtually every modern browser, while NaCl requires extensive browser support.

With all of this in mind I decided to look at asm.js from the mindset of an experienced C/C++ developer to see just how far one can push this system.

The first hit by harsh reality comes when one realises that only a relatively small subset of C/C++ code can be compiled for an asm.js target. Things which do not work include threads, checks for endianness, and low-level features such as longjmp [3]. Beyond this, 64-bit integers are to be avoided, since JavaScript has no native type for them. The more unsettling limitations appear when one considers the list further [4]:

  • No access to a non-sandboxed filesystem (except when running inside node.js with the right configuration).
  • No main loop. JavaScript uses cooperative multi-tasking, so a script cannot run indefinitely or wait for input.
  • Networking is limited to WebSockets only, unless you bridge with the JavaScript side.
  • Function pointer handling is… hairy at best, broken at worst. [5]
  • Everything is compiled into one massive JavaScript file, including libc and other system libraries.

These limitations are due to the limitations of JavaScript and its runtime. JavaScript is a single-threaded, prototype-based language with no real concept of scoped variables and the like.

To circumvent the issue of not being able to have a main loop, the Emscripten toolchain [6] requires one to register a function to be called to simulate a loop, with the parameters allowing one to specify how often it should be called per second and whether to simulate an infinite loop:

void emscripten_set_main_loop(em_callback_func func, int fps, int simulate_infinite_loop)

Furthermore, one can tweak the behaviour of this simulated main loop with further commands, getting something which somewhat approaches a main loop as one would have in most applications [7]. If one intends to use the same code across multiple targets including asm.js, it rapidly becomes clear that one has to liberally use preprocessor statements to check for the Emscripten (emcc/em++) compiler, as detailed in the Emscripten documentation:

#ifdef __EMSCRIPTEN__
#include <emscripten.h>
#endif

// The "main loop" function.
void one_iter() {
	// process input
	// render to screen
}

int main() {
#ifdef __EMSCRIPTEN__
	emscripten_set_main_loop(one_iter, 60, 1);
#else
	while (1) {
		one_iter();
		// Delay to keep frame rate constant (using SDL)
	}
#endif
}

This is a rather minor adjustment, one may say. Depending on the application one wishes to port to asm.js it might be the worst of concessions one has to make, but the real struggle comes when one wants to do something beyond merely using the built-in LibSDL support to draw to the screen and handle local events.

Emscripten comes with a range of APIs which match or approximate standard desktop APIs, such as its audio support (via HTML5 audio) and networking (Berkeley socket API, with limitations). The latter is easily the most restricted: one of the things which one does not get with asm.js, and thus Emscripten, is access to raw sockets, including standard TCP/UDP sockets. Instead, WebSockets are all one gets.

What this means is that one can only use ‘TCP’, non-blocking sockets, and that all data sent and received by the asm.js application is encapsulated by the WebSocket protocol, meaning that any server or client one tries to communicate with also has to speak the WebSocket protocol. While WebRTC support has been added to the project in the past, that implementation is currently non-functional due to open issues. In short, networking with asm.js is quite crippled compared to what one is used to on most other platforms.

A related limitation concerns asynchronous behaviour, such as when one uses functions like sleep() in C. In plain JavaScript an approximation of this would simply block the JavaScript runtime [8]. An experimental feature in Emscripten allows one to approximate the intended behaviour; it is enabled by passing the ‘-s ASYNCIFY=1’ flag to the emcc/em++ compiler. This feature is crucial to make software like ncurses and similar UI-oriented software work correctly.
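
As a minimal sketch, assuming the application is compiled with the ‘-s ASYNCIFY=1’ flag: the emscripten_sleep() call below suspends the compiled code and yields control back to the browser’s event loop, instead of blocking the single JavaScript thread:

#include <emscripten.h>
#include <stdio.h>

int main() {
	for (int i = 0; i < 5; ++i) {
		printf("Tick %d\n", i);
		// With Asyncify enabled this suspends the compiled code for 1000 ms
		// while the browser event loop keeps running; without it, a blocking
		// sleep like this is not possible.
		emscripten_sleep(1000);
	}
	return 0;
}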

Originally I had set out to write a simple terminal application in asm.js, with remote server communication and ncurses functionality. After poking at and experimenting with various approaches, I found myself rather disappointed in the limitations of the asm.js platform. I realise that entire games and their engines have been ported to asm.js, and I do not say that it is impossible to do so, merely that it is a lot more work than one would assume at first glance. Sadly it’s not so much about merely switching compile targets from native to asm.js, but rather about tweaking one’s code to work with this eccentric and highly limited target platform.

After this run-in with asm.js, I figured that I might as well look at the state of WebAssembly (WASM). Despite it being announced half a year ago, the Minimum Viable Product (MVP) goal is not close to being reached, with the actual bytecode format for WASM and related design features still in flux. The most recent status update I found was on a Mozilla blog from earlier this month [9].

Even at this early stage, one can already tell that it is far more like Google’s NaCl than like asm.js. One very welcome upcoming feature is threading support [10], as well as real exception handling capabilities. At this point the WASM project is still hampered by having to use a polyfill approach, which simulates the WASM runtime in JavaScript, giving it the same limitations as asm.js.

In summary, while I can see promise in WASM, I feel that asm.js at the very least is quite overhyped. It’s so limited and specialised that beyond porting games to run inside a browser using WebGL and LibSDL, it’s hard to think of suitable use cases. It’s therefore no surprise that most of the projects I stumbled across during my research which actually made it into production are exactly such games.

How long will it take for WASM to reach MVP and post-MVP status? I don’t know. It’s practically a volunteer project at this point, meaning no deadlines or promises. Just the allusions to it possibly becoming something in between asm.js and NaCl. Almost native code, but not quite native.

I’m more interested at this point to see what other approaches to create the new web-based apps of the (near) future have been dreamed up by people so far. Something which does not involve abusing a flawed scripting language’s runtime to do unspeakable things.