Archive for the ‘HTTP’ Category

Cerflet: like Servlets, but with more C++

A few months ago I wrote about the research I had been doing on multiple ways to write server-based web applications, using Java Servlets, FastCGI/C++ and Qt/C++. While this showed that C++-based applications tend to be faster than Java-based ones, it only looked at single-threaded, sequential requests.

While looking at ways to get proper concurrent performance out of a Servlet-like C++ implementation, I decided to take another look at the POCO C++ Libraries [1] and found that their HTTP server uses a proper thread pool of worker threads, allowing it to scale well across many concurrent requests.

After spending a few hours putting a basic wrapper library together, I wrote the following ‘Hello World’ example code to demonstrate a basic HTTP Cerflet:

#include <httpcerflet.h>

#include <iostream>
#include <string>

using namespace std;


class HelloRequestHandler : public HTTPRequestHandler {
public:
	void handleRequest(HTTPServerRequest& request, HTTPServerResponse& response) {
		Application& app = Application::instance();
		app.logger().information("Request from " + request.clientAddress().toString());
		
		response.setChunkedTransferEncoding(false);
		response.setContentType("text/html");
		
		std::ostream& ostr = response.send();
		ostr << "<!DOCTYPE html><html><head><title>Hello World</title></head>";
		ostr << "<body><p>Hello World!</p></body></html>";
	}
};


int main(int argc, char** argv) {
	// 0. Initialise: create Cerflet instance and set routing.
	HttpCerflet cf;
	RoutingMap map;
	map["/"] = &createInstance<HelloRequestHandler>;
	cf.routingMap(map);
	
	// 1. Start the server	
	return cf.run(argc, argv);
}

In the main() function we create a new HttpCerflet instance and a new RoutingMap. The latter contains the routes we wish to map to a handler, which in this case is the HelloRequestHandler. As the handler factory we store a pointer to the function template createInstance<>(), instantiated with our custom handler class as the template argument.

What this mapping does is ensure that when a new request matches one of the keys of the RoutingMap, a fresh instance of the specified handler is created and pushed onto a waiting worker thread.

The handler class itself derives from the HTTPRequestHandler class, which is a standard POCO Net class, reimplementing its handleRequest() method. This shows that Cerflet is more of a complement to POCO instead of abstracting it away. The main goal of Cerflet is to hide some of the complexities and boilerplate of POCO’s HTTP server, allowing one to focus on writing the actual business logic.

Benchmarks:

As for performance, an ApacheBench benchmark was run with a concurrency of 5, for a total of 100,000 requests.
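The corresponding ApacheBench invocation would have looked roughly like this (the port matches the Cerflet output below; the exact command line is an assumption):

```shell
# 100,000 requests in total, 5 concurrent connections
ab -n 100000 -c 5 http://127.0.0.1:9980/
```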

1. Java Servlet

Server Software:        Apache-Coyote/1.1
Server Hostname:        127.0.0.1
Server Port:            8080

Document Path:          /examples/servlets/servlet/HelloWorldExample
Document Length:        400 bytes

Concurrency Level:      5
Time taken for tests:   7.697 seconds
Complete requests:      100000
Failed requests:        0
Total transferred:      56200000 bytes
HTML transferred:       40000000 bytes
Requests per second:    12992.07 [#/sec] (mean)
Time per request:       0.385 [ms] (mean)
Time per request:       0.077 [ms] (mean, across all concurrent requests)
Transfer rate:          7130.42 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.2      0       1
Processing:     0    0   0.5      0      14
Waiting:        0    0   0.4      0      14
Total:          0    0   0.5      0      14

Percentage of the requests served within a certain time (ms)
  50%      0
  66%      1
  75%      1
  80%      1
  90%      1
  95%      1
  98%      1
  99%      1
 100%     14 (longest request)

2. Cerflet

Server Software:
Server Hostname:        127.0.0.1
Server Port:            9980

Document Path:          /
Document Length:        99 bytes

Concurrency Level:      5
Time taken for tests:   7.220 seconds
Complete requests:      100000
Failed requests:        0
Total transferred:      19900000 bytes
HTML transferred:       9900000 bytes
Requests per second:    13850.42 [#/sec] (mean)
Time per request:       0.361 [ms] (mean)
Time per request:       0.072 [ms] (mean, across all concurrent requests)
Transfer rate:          2691.63 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.2      0       1
Processing:     0    0   0.5      0      10
Waiting:        0    0   0.4      0      10
Total:          0    0   0.5      0      10

Percentage of the requests served within a certain time (ms)
  50%      0
  66%      0
  75%      1
  80%      1
  90%      1
  95%      1
  98%      1
  99%      1
 100%     10 (longest request)

Notes:

In this benchmark, Cerflet is about 7% faster than the equivalent Tomcat-based Hello World example, even though Cerflet also logs each request to the console (slowing it down somewhat) while Tomcat does not. Cerflet’s Hello World example was compiled with the -Og optimisation setting using 32-bit GCC 5.3 (on Windows, MSYS2). Version 1.6 of the POCO libraries was used, as obtained via MSYS2’s Pacman package manager.

For Tomcat the binary distribution for 8.0.30 as obtained via the official Apache site was used, with the server manually started using the provided startup.bat script. Both servers were run on a Windows 7 Ultimate x64 platform (Intel i7 6700K, 32 GB DDR4) with ApacheBench using the loopback device.

Discussion:

Without compensating for all differences between the two examples and other potential factors, it is fair to say at this point that Servlets and Cerflets are roughly equivalent in performance for a simple Hello World example. Cerflets are likely slightly faster (5-10%), with further gains to be had via compiler optimisations (-O3).

The type of operations which would be performed further in the business logic likely will have the most influence on the overall performance between these two platforms. Cerflets do however show that C++-based server-side web applications are more than just a viable option, backed by a mature programming language (C++) and cross-platform libraries (POCO).

Cerflets as they exist today are reminiscent of Spring Boot Java applications, which also feature a built-in HTTP server, thus not relying on a Servlet container (e.g. Tomcat). The advantage of Cerflets is however that they only depend on the POCO libraries (if not linked fully statically), and are not dependent on a central runtime (JVM). This significantly eases deployment.

The example used here can be tried for oneself via the Cerflet project’s GitHub page [2], which also hosts the HTTP Cerflet implementation for use in new projects. Both feedback and contributions are welcome.

Maya

[1] http://pocoproject.org/
[2] https://github.com/MayaPosch/Cerflet

Categories: C++, Cerflet, HTTP

First look at servlet and FastCGI performance

January 5, 2016

As a primarily C++ developer who has also done a lot of web-related development (PHP, JSP, Java Servlets, etc.), one of the nagging questions I have had for years was the possible performance gain by moving away from interpreted languages towards native code for (server-based) web applications.

After the demise of the mainframe-and-terminals setup of the 1980s, the World Wide Web (WWW, or simply ‘web’) has been making a gradual return to that model, with server-based web applications (‘mainframes’) serving content to clients using web browsers (‘terminals’). Accordingly, most processing power had to be located on the servers, with little required on the client side, at least until the advent of fancy, resource-heavy JavaScript UIs in the browser.

Even today, however, most of the processing is still done on the servers, with single servers serving thousands of clients per day, hour, or even minute. It’s clear that even saving a second per singular client-request on the server-side can mean big savings. In light of this it is however curious that most server-side processing is done in either interpreted languages via CGI or related (Perl, PHP, ColdFusion, JavaScript, etc.), or bytecode-based languages (C#, Java, VB.NET), instead of going for highly optimised native code.

While I will not go too deeply into the performance differences between those implementations in this article, most readers will at least be familiar with the performance delta between the first two groups. Interpreted languages generally lag behind the pack on sheer performance, owing to the overhead of parsing a text-based source file, generating bytecode from it and running that in the language’s runtime.

In this light, the more interesting comparison in my eyes is that between the last two groups: bytecode and native code. Before a truly fair comparison can be made, I will first have to completely understand how, for example, Java servlets are implemented and run by a servlet container such as Tomcat, so that an equivalent native-code implementation can be built.

As a start, I have however set up a range of examples which I then benchmarked using ApacheBench. The first example uses the ‘Hello World’ servlet example which is provided with Apache Tomcat 8.x. The second uses a basic responder C++ application connected using FastCGI to a Lighttpd server. The third and final example uses C++/Qt to implement a custom QTcpServer instance which does HTTP parsing and responds to queries using a basic REST-based API.

The host system is an Intel 6700K-based x86-64 system with 32 GB of RAM, running Windows 7 Ultimate x64. The servlet example is used as-is, without modification to the distribution from Apache. The FastCGI C++ example is compiled using Mingw64 (GCC 5.3) with -O1. The Qt-based example is compiled using Mingw (GCC 4.9) from within Qt Creator, in debug mode.

All ApacheBench tests are run with 1,000 requests and a concurrency of 1, since no scaling will be tested until the scaling of servlets and their containers is better understood.

Next, the results:

1. Java servlet

Server Software:        Apache-Coyote/1.1
Server Hostname:        127.0.0.1
Server Port:            8080

Document Path:          /examples/servlets/servlet/HelloWorldExample
Document Length:        400 bytes

Concurrency Level:      1
Time taken for tests:   0.230 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      562000 bytes
HTML transferred:       400000 bytes
Requests per second:    4347.83 [#/sec] (mean)
Time per request:       0.230 [ms] (mean)
Time per request:       0.230 [ms] (mean, across all concurrent requests)
Transfer rate:          2386.21 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.8      0      10
Processing:     0    0   1.1      0      10
Waiting:        0    0   0.8      0      10
Total:          0    0   1.4      0      10

Percentage of the requests served within a certain time (ms)
  50%      0
  66%      0
  75%      0
  80%      0
  90%      0
  95%      0
  98%      0
  99%     10
 100%     10 (longest request)

2. FastCGI

Server Software:        LightTPD/1.4.35-1-IPv6
Server Hostname:        127.0.0.1
Server Port:            80

Document Path:          /cerflet/
Document Length:        146 bytes

Concurrency Level:      1
Time taken for tests:   26.531 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      307000 bytes
HTML transferred:       146000 bytes
Requests per second:    37.69 [#/sec] (mean)
Time per request:       26.531 [ms] (mean)
Time per request:       26.531 [ms] (mean, across all concurrent requests)
Transfer rate:          11.30 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:     0   27  11.0     30      50
Waiting:        0   26  11.0     30      40
Total:          0   27  11.0     30      50

Percentage of the requests served within a certain time (ms)
  50%     30
  66%     30
  75%     30
  80%     40
  90%     40
  95%     40
  98%     40
  99%     40
 100%     50 (longest request)

3. C++/Qt

Server Software:
Server Hostname:        127.0.0.1
Server Port:            8010

Document Path:          /greeting/
Document Length:        50 bytes

Concurrency Level:      1
Time taken for tests:   0.240 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      109000 bytes
HTML transferred:       50000 bytes
Requests per second:    4166.67 [#/sec] (mean)
Time per request:       0.240 [ms] (mean)
Time per request:       0.240 [ms] (mean, across all concurrent requests)
Transfer rate:          443.52 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.5      0      10
Processing:     0    0   1.2      0      10
Waiting:        0    0   0.9      0      10
Total:          0    0   1.3      0      10

Percentage of the requests served within a certain time (ms)
  50%      0
  66%      0
  75%      0
  80%      0
  90%      0
  95%      0
  98%      0
  99%     10
 100%     10 (longest request)

Discussion:

It should be noted here that to make the FastCGI example work, the original approach using the fcgi_stdio.h header, as suggested by the FastCGI documentation, had to be abandoned in favour of the fcgiapp.h header and its functions. With the former approach the response times got slower with each run; with the latter they remain constant.

The FastCGI application ended up looking like this:

#include "include/fcgiapp.h"
#include <cstdlib>

int count;
FCGX_Request request;

void initialize() {
	FCGX_Init();
	int sock = FCGX_OpenSocket(":9000", 5000);
	if (sock < 0) {
		// fail: handle.
	}
	
	FCGX_InitRequest(&request, sock, 0);
	count = 0;
}

int main() {
	/* Initialization. */
	initialize();

	/* Response loop. */
	while (FCGX_Accept_r(&request) == 0) {
		const char* host = FCGX_GetParam("SERVER_NAME", request.envp);
		FCGX_FPrintF(request.out, "Content-type: text/html\r\n"
			"\r\n"
			"<title>FastCGI Hello! (C, fcgi_stdio library)</title>"
			"<h1>FastCGI Hello! (C, fcgi_stdio library)</h1>"
			"Request number %d running on host <i>%s</i>\n",
			++count, host ? host : "unknown");

		FCGX_Finish_r(&request);
	}

	return 0;
}

Compared to the baseline values from the Tomcat servlet benchmark, the results from the FastCGI benchmark are downright disappointing, with each request taking roughly 30 ms or longer. The servlet instance needed <1 ms or 10 ms at most. Despite attempts to optimise the FastCGI example, it appears that there exist significant bottlenecks. Whether this is in the Lighttpd server, the mod_fcgi server module, or the FastCGI library is hard to say at this point.

For the C++/Qt example one can say that even this hacked-together, unoptimised code ran on par with the highly optimised production code of the Tomcat server and its servlet API. It should be noted that although this example used the Qt networking classes, it didn't use Qt code for the actual socket communication beyond accepting the client connection.

Due to known issues with the QTcpSocket class on Windows, instead a custom, drop-in class was used which interfaces with the Winsock2 (ws2_32) DLL directly using the standard Berkeley socket API. This class has been used with other projects before and is relatively stable at this point. How this class compares performance-wise with the QTcpSocket class is at this point unknown.

Summarising, it seems at this point at the very least plausible that native code can outperform bytecode for web applications. More research has to be done into scaling methods, as well as into the performance characteristics of applications more complex than a simple 'Hello World'.

Maya

Implementing A Cookiejar for QtWebKit; QNetworkCookieJar Analysis

February 24, 2012

As some of you may know already, I am working on the WildFox browser project which while it initially was going to fork the Mozilla code is now building a browser on top of QtWebKit. See the previous blog post for details on this decision. The WildFox project page and source can be found at www.mayaposch.com/wildfox.php.

One of the features I recently implemented was an advanced cookiejar for storing HTTP cookies. Why was this necessary, you may ask? Qt does provide a cookiejar in QNetworkCookieJar, but aside from its inability to save any cookies to disk, a quick look at its source code reveals further issues: it has a limit of 50 cookies, fewer than the 300 required by the cookie standard (RFC 2965) [1], and it uses a plain QList to store cookies, forcing a linear search through every cookie both when finding those for a specific URL and when checking for duplicates while storing.

In other words, the default implementation is unsuitable for any web browser. One thing it does do right, however, is the way it verifies domains. Due to the design of internet Top Level Domains (TLDs) it is impossible to algorithmically determine whether an internet URL is valid, or specifies a proper TLD.

The need to verify the domain is made clear when one imagines someone setting a cookie for the domain .com, which would then be a cookie valid for every website ending with the TLD .com. Obviously this can’t be allowed and the obvious approach would be to disallow single dot domains (.com, .org, .net). This doesn’t work for domains like .co.uk, however. Disallowing two dot domains would cause issues with the former, single dot type. Further there are more variations on this, such as URLs in the US where the public suffix can entail city.state.us style domains. Clearly the only way to do this verification is to use a look-up table. This can be found in Mozilla’s public suffix list [2].

What we need for a better QtWebKit cookiejar thus entails the following:

  • the ability to store cookies to disk.
  • storing at least 300 cookies.
  • quick look-ups of cookies based on their domain.

For this we recycle the existing functionality in Qt required to do the public suffix verification. The relevant files in Qt 4.8.0 are:

  • src/corelib/io/qtldurl.cpp
  • src/qurltlds_p.h

The former contains some basic routines to obtain the public suffix, which we will expand upon, and the latter contains the Public Suffix list processed into a more accessible format. The latter we’ll use almost as-is, with just the Qt namespace sections removed. The former has a major omission which we’ll remedy with a new function. The functions we keep from qtldurl.cpp are, in renamed form:

  • containsTLDEntry(const QString &entry)
  • isEffectiveTLD(const QString &domain)
  • topLevelDomain(const QString &domain)

We add the following function:

QString getPublicDomain(const QString &domain) {
    QStringList sections = domain.toLower().split(QLatin1Char('.'), QString::SkipEmptyParts);
    if (sections.isEmpty())
        return QString();

    QString tld = "";
    for (int i = sections.count() - 1; i >= 0; --i) {
        tld.prepend(QLatin1Char('.') + sections.at(i));
        if (!isEffectiveTLD(tld.right(tld.size() - 1))) {
             return tld;
        }
    }

    return tld;
}

This allows us to obtain the public suffix plus the first non-public label. For example, “http://www.slashdot.org” would be reduced to “.slashdot.org”. It differs from topLevelDomain() in that the latter returns just the public suffix, e.g. “.org” in the previous example, which is not desirable for our use.

With the domain verification taken care of, we move on to the next stage: the data structure and storage method. To store cookies on disk we elect to use an SQLite database, as this is not only efficient in terms of storage density, but also prevents disk fragmentation and allows SQL-based look-ups instead of the filesystem-based ones common in older browsers. QtSQL comes with an SQLite driver. Do be sure to use the current version of Qt (4.8), as SQLite recently introduced write-ahead logging (WAL) and the Qt 4.7.x libraries still ship an older SQLite version.

For the in-memory data structure we use a QMultiMap. The rationale behind this is the key-based look-up based on the cookie domain. By taking the URL we’re seeking matching cookies for and obtaining its top domain (“.slashdot.org”) we can find any cookie in our data structure using this top domain as the key. This means we can search a large number of cookies in logarithmic (O(log N)) time for a match on the domain, a major improvement on the linear (O(N)) search of the default QList.

The link between the in-memory and on-disk storage is accomplished by the following rules:

  • All new cookies and cookie updates are stored in both in-memory and on-disk, except for session cookies, which are stored only in-memory.
  • Stored cookies are read into memory per-domain and on-demand.

In addition to this I have implemented a cookie manager dialogue which allows one to look through and manage (delete) stored cookies. Expired cookies are automatically deleted the first time they are fetched from the database or before they’re stored. Blocking 3rd-party cookies is also very easy, with a comparison between the top domain and the cookie’s intended domain:

QString baseDomain = getPublicDomain(url.host());
if (skip3rd && (baseDomain != getPublicDomain(cookie.domain()))) {
     continue;
}

With this we have a relatively efficient cookie storage and retrieval mechanism, along with the ability to manage the stored cookies. It can store an unlimited number of cookies and should remain efficient even with well over 10,000 cookies, thanks to the logarithmic look-up of the QMultiMap.

Essential features still missing in the WildFox browser at this point are bookmarks and sessions. The next article on WildFox should be about the Chrome extension support I’m currently implementing, with as direct result XMarks bookmark synchronization support as well as the bookmarks feature. Stay tuned.

Maya

[1] http://www.ietf.org/rfc/rfc2965.txt
[2] http://publicsuffix.org/

Categories: HTTP, programming, Projects, Qt, WildFox