Archive for the ‘Java’ Category

First look at servlet and FastCGI performance

January 5, 2016

As a primarily C++ developer who has also done a lot of web-related development (PHP, JSP, Java Servlets, etc.), one of the questions that has nagged me for years is how much performance could be gained by moving away from interpreted languages towards native code for (server-based) web applications.

After the demise of the mainframe-and-terminals setup in the 1980s, the World Wide Web (WWW, or simply ‘web’) has been making a gradual return to this setup, by having web-based applications on servers (‘mainframes’) serve content to web-browser-using clients (‘terminals’). As part of this, most processing power had to be located on the servers, with little processing power required on the client side, until the advent of fancy UIs built in resource-heavy JavaScript on the client.

Even today, however, most of the processing is still done on the servers, with single servers serving thousands of clients per day, hour, or even minute. It’s clear that even a small saving per client request on the server side can add up to big savings. In light of this it is curious that most server-side processing is done in either interpreted languages via CGI or related mechanisms (Perl, PHP, ColdFusion, JavaScript, etc.), or bytecode-based languages (C#, Java, VB.NET), instead of going for highly optimised native code.

While I will not go too deeply into the performance differences between those different implementations in this article, I think that most reading this will at least be familiar with the performance delta between the first two groups mentioned. Interpreted languages in general tend to lag behind the pack on sheer performance metrics, due to the complexity of parsing a text-based source file, creating bytecode out of that and running this with the language’s runtime.

In this light, the more interesting comparison in my eyes is that between the last two groups: bytecode-based and native code. To make this a fair comparison, I will first have to completely understand how, for example, Java servlets are implemented and run by a servlet container such as Tomcat, so that an equivalent can be built in native code.

As a start, I have however set up a range of examples which I then benchmarked using ApacheBench. The first example uses the ‘Hello World’ servlet example which is provided with Apache Tomcat 8.x. The second uses a basic responder C++ application connected using FastCGI to a Lighttpd server. The third and final example uses C++/Qt to implement a custom QTcpServer instance which does HTTP parsing and responds to queries using a basic REST-based API.

The host system is an Intel 6700K-based x86-64 system, with 32 GB of RAM and running Windows 7 x64 Ultimate. The servlet example is used as-is, without modification to the distribution from Apache. The FastCGI C++ example is compiled using Mingw64 (GCC 5.3) with -O1. The Qt-based example is compiled using Mingw (GCC 4.9) from within Qt Creator in debug mode.

All ApacheBench tests are run with 1,000 requests and a concurrency of 1, since no scaling will be tested until the scaling of servlets and their containers is better understood.

Next, the results:

1. Java servlet

Server Software:        Apache-Coyote/1.1
Server Hostname:
Server Port:            8080

Document Path:          /examples/servlets/servlet/HelloWorldExample
Document Length:        400 bytes

Concurrency Level:      1
Time taken for tests:   0.230 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      562000 bytes
HTML transferred:       400000 bytes
Requests per second:    4347.83 [#/sec] (mean)
Time per request:       0.230 [ms] (mean)
Time per request:       0.230 [ms] (mean, across all concurrent requests)
Transfer rate:          2386.21 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.8      0      10
Processing:     0    0   1.1      0      10
Waiting:        0    0   0.8      0      10
Total:          0    0   1.4      0      10

Percentage of the requests served within a certain time (ms)
  50%      0
  66%      0
  75%      0
  80%      0
  90%      0
  95%      0
  98%      0
  99%     10
 100%     10 (longest request)

2. FastCGI

Server Software:        LightTPD/1.4.35-1-IPv6
Server Hostname:
Server Port:            80

Document Path:          /cerflet/
Document Length:        146 bytes

Concurrency Level:      1
Time taken for tests:   26.531 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      307000 bytes
HTML transferred:       146000 bytes
Requests per second:    37.69 [#/sec] (mean)
Time per request:       26.531 [ms] (mean)
Time per request:       26.531 [ms] (mean, across all concurrent requests)
Transfer rate:          11.30 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.0      0       0
Processing:     0   27  11.0     30      50
Waiting:        0   26  11.0     30      40
Total:          0   27  11.0     30      50

Percentage of the requests served within a certain time (ms)
  50%     30
  66%     30
  75%     30
  80%     40
  90%     40
  95%     40
  98%     40
  99%     40
 100%     50 (longest request)

3. C++/Qt

Server Software:
Server Hostname:
Server Port:            8010

Document Path:          /greeting/
Document Length:        50 bytes

Concurrency Level:      1
Time taken for tests:   0.240 seconds
Complete requests:      1000
Failed requests:        0
Total transferred:      109000 bytes
HTML transferred:       50000 bytes
Requests per second:    4166.67 [#/sec] (mean)
Time per request:       0.240 [ms] (mean)
Time per request:       0.240 [ms] (mean, across all concurrent requests)
Transfer rate:          443.52 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.5      0      10
Processing:     0    0   1.2      0      10
Waiting:        0    0   0.9      0      10
Total:          0    0   1.3      0      10

Percentage of the requests served within a certain time (ms)
  50%      0
  66%      0
  75%      0
  80%      0
  90%      0
  95%      0
  98%      0
  99%     10
 100%     10 (longest request)


It should be noted here that to make the FastCGI example work, the original approach using the fcgi_stdio.h header as suggested by the FastCGI documentation had to be abandoned, and instead the fcgiapp.h header and its functions were used. With the former approach the response times would get slower with each run; with the latter approach they remain constant.

The FastCGI application ended up looking like this:

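#include "include/fcgiapp.h"
#include <cstdlib>

int count;
FCGX_Request request;

// Initialises the FCGX library, opens the listening socket and prepares the request object.
void initialize() {
	FCGX_Init();
	int sock = FCGX_OpenSocket(":9000", 5000);
	if (sock < 0) {
		// fail: handle.
		exit(EXIT_FAILURE);
	}

	FCGX_InitRequest(&request, sock, 0);
	count = 0;
}

int main() {
	/* Initialization. */
	initialize();

	/* Response loop. */
	while (FCGX_Accept_r(&request) == 0) {
		// With fcgiapp.h the request parameters live in request.envp, not in the process environment.
		const char* host = FCGX_GetParam("SERVER_NAME", request.envp);
		FCGX_FPrintF(request.out, "Content-type: text/html\r\n"
		   "\r\n"
		   "<title>FastCGI Hello! (C, fcgi_stdio library)</title>"
		   "<h1>FastCGI Hello! (C, fcgi_stdio library)</h1>"
		   "Request number %d running on host <i>%s</i>\n",
		   ++count, host ? host : "unknown");
		FCGX_Finish_r(&request);
	}

	return 0;
}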

Compared to the baseline values from the Tomcat servlet benchmark, the results from the FastCGI benchmark are downright disappointing, with a typical request taking roughly 30 ms or longer. The servlet instance needed less than 1 ms for most requests and 10 ms at most. Despite attempts to optimise the FastCGI example, it appears that significant bottlenecks exist. Whether these are in the Lighttpd server, its mod_fastcgi server module, or the FastCGI library is hard to say at this point.
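
To make the size of the gap concrete, the derived figures in the ApacheBench output can be recomputed from the raw numbers. The following is only a small sketch: the class and method names are made up for illustration, and the input values are copied from the three result listings above.

```java
// Hypothetical helper: recomputes the derived figures shown in the ApacheBench output.
final class BenchmarkMath {
    /** Mean throughput: completed requests divided by the total time taken. */
    static double requestsPerSecond(int completeRequests, double secondsTaken) {
        return completeRequests / secondsTaken;
    }

    /** Mean time per request in milliseconds, valid for a concurrency level of 1. */
    static double millisPerRequest(int completeRequests, double secondsTaken) {
        return secondsTaken * 1000.0 / completeRequests;
    }

    public static void main(String[] args) {
        String[] names = {"Java servlet", "FastCGI", "C++/Qt"};
        double[] seconds = {0.230, 26.531, 0.240}; // "Time taken for tests" per run
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-12s %9.2f req/s %8.3f ms/req %7.1fx servlet time%n",
                    names[i],
                    requestsPerSecond(1000, seconds[i]),
                    millisPerRequest(1000, seconds[i]),
                    seconds[i] / seconds[0]);
        }
    }
}
```

On these numbers the FastCGI run took roughly 115 times as long as the servlet run, while the C++/Qt run took about 4% longer.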

For the C++/Qt example one can say without reservation that even with the hacked-together code which was used, this unoptimised code ran on par with the highly optimised production code of the Tomcat server and its servlet API. It should be noted that although this example used the Qt networking classes, it didn't use Qt code for the actual socket communication beyond accepting the client connection.

Due to known issues with the QTcpSocket class on Windows, instead a custom, drop-in class was used which interfaces with the Winsock2 (ws2_32) DLL directly using the standard Berkeley socket API. This class has been used with other projects before and is relatively stable at this point. How this class compares performance-wise with the QTcpSocket class is at this point unknown.

In summary, at this point it seems at the very least plausible that native code can outperform bytecode for web applications. More research is needed into scaling methods and into the performance characteristics of applications more complex than a simple 'Hello World'.


My new game development book got published

October 10, 2015

Some people may have noticed a drop in published content on this blog for a while. Part of it was due to working on a new book for Packt Publishing, titled ‘Mastering AndEngine Game Development’, which was finalised last month with its publication. For those interested, it can be purchased both at the Packt store [1] and at Amazon [2].

The book is an in-depth look at how to go from ‘making a basic mobile game’ using a game engine such as AndEngine [3], to making a truly advanced (mobile) game using 3D assets in a 2D game with OpenGL ES, dynamic and static lighting, frame-based and skeletal animation, anti-aliasing, GLSL shaders, 3D sound and advanced sound effects using OpenAL & OpenSL, and much more. While it’s aimed at extending AndEngine-based games, it’s written in a generic enough manner that it should be useful for those using other game engines, on Android or other platforms.

So far this is my first published book, but it probably won’t be my last. In the meantime I will try to step up the publication of content on this blog again, both with programming and electronics-related postings. Please stay tuned 🙂



Java On Android TCP Socket Issue

August 8, 2013

Related to my previous post [1] involving a project using Java sockets, I’d like to post about an issue I encountered while debugging the project. Allow me to first describe the environment and setup.

The Java side, an extended version of the class described in the linked post, ran as the client on Android, specifically a Galaxy Nexus device running Android 4.2.2 and later 4.3. Its goal was to send locally collected arrays of bytes via the socket to the server after connecting. The server was written in C++, with part of the networking side handled by the Qt framework (QTcpServer) and the actual communication via native sockets, on a Windows 7 Ultimate x64 system.

The problem occurred when the Android client connected to the server: the connection itself would be handled fine, and the thread to handle the native socket was initialized and started as it should be. After that, however, no data would ever be received on the server side of the client socket. Checks using select() showed that no data ever arrived in the buffer. Verification with a telnet client (PuTTY) showed that the server was able to receive data just fine, and thus that the issue had to lie with the client side, i.e. the Android client.

Inspection using the Wireshark network traffic sniffer during the communication between the Android client and the server showed a normal TCP sequence, with SYN, SYN-ACK and ACK packets followed by a PSH-ACK from the client with the first data. This was followed by an ACK from the server, indicating that the server network stack had at least acknowledged the data packet. Everything seemed in order, although it was somewhat remarkable that the first client-side ACK had the exact same timestamp in Wireshark as the PSH-ACK packet.

Mystified, I stumbled over a few posts [2][3] on StackOverflow in which it was suggested that using Thread.sleep() after the connecting phase would resolve this. Trying this solution with a 500 ms sleep period, I found that suddenly the client-server communication went flawlessly. The only question left is why this is the case.
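
The workaround can be sketched as follows. This is a minimal illustration rather than the project’s actual client class: the class name, method name and parameters are hypothetical, and only the pause between connecting and the first write reflects what is described above.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.net.Socket;

// Hypothetical helper illustrating the workaround; not the project's actual class.
final class DelayedSender {
    /**
     * Connects to the server, waits for delayMillis and only then sends the payload.
     * The post above used a delay of 500 ms.
     */
    static void connectAndSend(String host, int port, byte[] payload, long delayMillis)
            throws IOException, InterruptedException {
        try (Socket socket = new Socket(host, port)) {
            // Workaround from the StackOverflow posts: pause after the connecting phase.
            Thread.sleep(delayMillis);

            OutputStream out = socket.getOutputStream();
            out.write(payload);
            out.flush();
        }
    }
}
```

On Android this would have to run on a background thread, since both the network access and the sleep would block the UI thread.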

Looking through the TCP specifications I didn’t find anything definite, even though the evidence so far suggests that the ACK on SYN-ACK cannot be accompanied by a PSH or similar at the same time. Possibly something else plays a role here, but the lesson seems to be that in case of non-functioning Java sockets one just has to wait a little while before starting to send data. I’d gladly welcome further clarification on why this happens.



Binary Network Protocol Implementation In Java Using Byte Arrays

July 26, 2013

Java in many ways is a very eccentric programming language. Reading the designer’s responses to questions on its design leads to interesting ideas, such as that unsigned integer types would be confusing and error-prone to the average programmer. There’s also the thought that Java is purely object-oriented, even though it has many primitive types and concepts lurking in its depths. Its design poses very uncomfortable issues for developers who seek to read, write and generally handle binary data and files, as the entire language seems to be oriented towards text-based formats such as XML. This leads one to such problems as how to implement a basic binary networking protocol.

Many network and communication protocols are binary as this makes them easier and faster to parse, more light-weight to transfer and generally less prone to interpretation. The question is how to implement such a protocol in a language which is wholly unfamiliar with the concepts of unsigned integers, operator overloading and similar. The most elegant answer I have found so far is to stay low-level, and I really do mean low-level. We will treat Java’s built-in signed integers as though they are unsigned using bitwise operators where necessary and use byte arrays to translate between Java and the outside world.

The byte type in Java is an 8-bit signed integer with a range from -128 to 127. For our purposes we will ignore the sign bit and treat it as an unsigned 8-bit integer. Network communication occurs in streams of bytes, with the receiving side interpreting it according to a predefined protocol. This means that to write on the Java side to the network socket we will have to put the required bytes into a prepared byte array. As Java arrays are fixed size like in C, it makes the most sense to either use one byte-array per field or to pre-allocate the whole array and copy the bytes into it.
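
As a minimal sketch of what ‘ignoring the sign bit’ means in practice: masking a byte with 0xFF yields its unsigned value as an int. The class and method names here are made up for illustration.

```java
// Hypothetical helper showing the signed and unsigned view of the same 8 bits.
final class UnsignedBytes {
    /** Interprets a signed Java byte as an unsigned value in the range 0..255. */
    static int toUnsigned(byte b) {
        // The mask discards the sign extension that happens when a byte is widened to an int.
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte raw = (byte) 200;               // the bit pattern 0xC8
        System.out.println(raw);             // the signed view of those bits
        System.out.println(toUnsigned(raw)); // the unsigned view of the same bits
    }
}
```

The bits on the wire are identical either way; only the interpretation on the Java side differs.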

Writing is done into the Java Socket via its OutputStream which we wrap into a BufferedOutputStream.

import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;

public class BinaryExample {
	Socket mSocket;
	String mServer = "";
	int mServerPort = 123;
	byte[] header = {0x53, 0x41, 0x4D, 0x50, 0x4C, 0x45}; // SAMPLE
	int mProtocolVersion = 0;
	OutputStream mOutputStream;
	InputStream mInputStream;
	BufferedOutputStream mBufferedOutputStream;

	public void run() {
		try {
			// set up connection with server
			this.mSocket = new Socket(mServer, mServerPort);
		} catch (Exception ee) {
			// connection failed. Abort.
			return;
		}

		// get the I/O streams for the socket.
		try {
			mOutputStream = this.mSocket.getOutputStream();
			mBufferedOutputStream = new BufferedOutputStream(mOutputStream);
			mInputStream = this.mSocket.getInputStream();
		} catch (IOException e) {
			// no streams available. Abort.
			return;
		}

		byte version = (byte) mProtocolVersion;
		// 4 bytes for the length field, plus the header, plus 1 byte for the version.
		int messageLength = 4 + header.length + 1;
		byte[] msgSize = intToByteArray(messageLength);

		// write to the socket. The length goes first; header before version is an assumed order.
		try {
			mBufferedOutputStream.write(msgSize);
			mBufferedOutputStream.write(header);
			mBufferedOutputStream.write(version);
			mBufferedOutputStream.flush();
		} catch (IOException e1) {
			// write failed. Abort.
			return;
		}
	}

	// Writes provided 4-byte integer to a 4 element byte array in Little-Endian order.
	public static final byte[] intToByteArray(int value) {
		return new byte[] {
			(byte)(value & 0xff),
			(byte)(value >> 8 & 0xff),
			(byte)(value >> 16 & 0xff),
			(byte)(value >>> 24)
		};
	}
}

Any ASCII strings in the protocol we define as individual bytes. Fortunately the ASCII codes only go to 127 (0x7F) and thus fit within the positive part of Java’s byte type. For values stretching into the negative range of the byte we might have to use bit masking to deal with the sign bit, or do the conversion ourselves. We define the protocol version as an int (BE signed, 32-bit), which we convert to a byte using a simple cast, stripping off the higher three bytes. Again pay attention to the value of the int. If it’s higher than 127 you have to deal with the sign bit again or risk an overflow.
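
The two conversions described here can be sketched as small helpers. The names are hypothetical, and the range checks are an addition of this sketch: instead of silently truncating, it rejects anything that does not fit.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helpers for the ASCII and version fields; not part of the original class.
final class FieldEncoding {
    /** Encodes a string as one byte per character, rejecting anything outside 7-bit ASCII. */
    static byte[] asciiBytes(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) > 0x7F) {
                throw new IllegalArgumentException("not ASCII at index " + i);
            }
        }
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    /** Narrows an int to a single unsigned byte instead of silently dropping the upper bytes. */
    static byte versionByte(int version) {
        if (version < 0 || version > 0xFF) {
            throw new IllegalArgumentException("version does not fit in one byte: " + version);
        }
        // Values from 128 to 255 come out negative in Java but keep the right bit pattern.
        return (byte) version;
    }
}
```

For example, asciiBytes("SAMPLE") produces the same six bytes as the hard-coded header array in the class above.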

In this example we implement a little-endian (LE) protocol. This means that in converting a 16-bit or larger integer to a byte array we have to place the least significant byte (LSB) first, as is done in the function intToByteArray(). We also add a message length indicator at the beginning of the message we’re sending in the form of an int, extending the message by 4 bytes.
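
As an alternative to assembling the fields by hand, the standard library’s ByteBuffer can be switched to little-endian order. The following is a sketch only: the class and method names are hypothetical, and the field order (length, header, version) is an assumption based on the description above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical builder; field order (length, header, version) is an assumption.
final class MessageBuilder {
    /** Builds one complete message with a 4-byte little-endian length prefix. */
    static byte[] build(byte[] header, byte version) {
        // The length counts the 4-byte length field itself, the header and the version byte.
        int messageLength = 4 + header.length + 1;

        ByteBuffer buffer = ByteBuffer.allocate(messageLength).order(ByteOrder.LITTLE_ENDIAN);
        buffer.putInt(messageLength); // written LSB first because of the byte order set above
        buffer.put(header);
        buffer.put(version);
        return buffer.array();
    }
}
```

The resulting array can be handed to the BufferedOutputStream in a single write() call.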

Reading the response and interpreting it is similar:

		// wait for response. This is a blocking example.
		byte[] responseBytes = new byte[5];
		int bytesRead = 0;
		try {
			// note that a single read() may return fewer bytes than requested.
			bytesRead = mInputStream.read(responseBytes, 0, 5);
		} catch (IOException e1) {
			// read failed. Abort.
			return;
		}

		if (bytesRead != 5) {
			// communication error. Abort.
			return;
		}

		// the fifth byte now contains the value of the response code. 0 means OK, everything else is an error.
		short responseCode = (short) (responseBytes[4] & 0xFF);
		if (responseCode != 0) { return; }

This is a brief and naive sample which just has to read a single response, skipping over the message length indicator and reading just five bytes. In a more complex application you would convert the individual sections of the byte array to their respective formats (strings, ints, etc.) and verify them. For this you would use a function to convert a byte array in LE order back to an int, such as the following:

	// Converts the provided 4-byte array containing a little-endian integer to an int.
	public static final int byteArrayToInt(byte[] value) {
		// value[0] holds the least significant byte, value[3] the most significant one.
		int ret = ((value[3] & 0xFF) << 24) | ((value[2] & 0xFF) << 16) |
					((value[1] & 0xFF) << 8) | (value[0] & 0xFF);

		return ret;
	}
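
A somewhat more robust way of reading a response is to first read the length prefix and then exactly as many bytes as it announces. The following is a sketch under assumptions: the class name, method name and size limit are hypothetical, and it assumes that the response’s length field counts its own four bytes, as the outgoing message does.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical reader; assumes the length field includes its own 4 bytes.
final class MessageReader {
    // Arbitrary upper bound for this sketch, to avoid allocating on a corrupt length.
    static final int MAX_MESSAGE_LENGTH = 1 << 20;

    /** Reads one length-prefixed message and returns the bytes after the length field. */
    static byte[] readBody(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);

        // readFully() blocks until all bytes have arrived and throws EOFException otherwise.
        byte[] lengthBytes = new byte[4];
        data.readFully(lengthBytes);
        int messageLength = ByteBuffer.wrap(lengthBytes).order(ByteOrder.LITTLE_ENDIAN).getInt();

        if (messageLength < 4 || messageLength > MAX_MESSAGE_LENGTH) {
            throw new IOException("invalid message length: " + messageLength);
        }

        byte[] body = new byte[messageLength - 4];
        data.readFully(body);
        return body;
    }
}
```

Unlike a single read() call, this does not treat a response that arrives in several TCP segments as a communication error.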

In many ways it’s ironic that bit shifts and bitwise operators are the way to go with a language which profiles itself as a high-level language, but such is the result of the design choices made. While it is true that the above byte array-oriented code could be encapsulated by fancy classes which would take the tediousness out of implementing such a protocol, in essence they would do exactly the same as detailed above. With the upcoming Java 8 release, support for unsigned integers will be introduced for the first time in a limited manner, as helper methods on the existing wrapper classes rather than as new types, but for most projects (including Android-based ones) upgrading to it is not an option.
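
For projects that can use Java 8, the unsigned helper methods look roughly like this. The wrapper class and its method names are hypothetical; the Byte and Integer methods being called are part of the Java 8 standard library.

```java
// Hypothetical wrapper around the unsigned helper methods added in Java 8.
final class Java8Unsigned {
    /** Same result as b & 0xFF. */
    static int unsignedByte(byte b) {
        return Byte.toUnsignedInt(b);
    }

    /** Widens the 32 bits of an int to a long without sign extension. */
    static long unsignedInt(int bits) {
        return Integer.toUnsignedLong(bits);
    }

    /** Divides treating both operands as unsigned 32-bit values. */
    static int unsignedDivide(int dividend, int divisor) {
        return Integer.divideUnsigned(dividend, divisor);
    }

    public static void main(String[] args) {
        int allBitsSet = 0xFFFFFFFF; // -1 when read as a signed int
        System.out.println(unsignedByte((byte) 0xFF));
        System.out.println(unsignedInt(allBitsSet));
        System.out.println(Integer.toUnsignedString(allBitsSet));
        System.out.println(unsignedDivide(allBitsSet, 2));
    }
}
```

The values are still stored in the signed types; these methods only change how the bits are interpreted for a given operation.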

For reference, the above code is used in an actual project I’m working on and is as far as I am aware functional. I can however not accept any liability for anything going haywire, applications crashing, marriages torn up or pets set on fire. Any further checks and handling of errors is probably an awesome idea to make the code more robust.