It’s benchmark season again, and since I have raised some concerns about the results of the published benchmark, I was asked to explain those concerns properly without making it look like a rant. That is what I try to do with this blog post.
Given the results of the published benchmark, I could go “Wooohooo, KWin’s the fastest!”, but instead I raise concerns. I don’t see such a conclusion in the data, and I hope nobody else reads it into the published data.
First a little bit about my background. After finishing my computer science studies I have been working for the last two and a half years in a research department, not as a researcher but as a developer supporting research. Our main task is to store and manage scientific results, that is, experimental data.
You cannot work for more than two years in research without that influencing how you look at such data. For me a benchmark is very similar to a scientific experiment.
First you come up with a hypothesis (e.g. “compositing on X11 influences the game rendering performance”), then you set up an experiment to test your hypothesis (e.g. “running PTS on various hardware and distributions”), then you run your experiment multiple times to get statistically relevant data, and last but not least you validate the gathered results and go back to step one if something doesn’t fit. All these steps must be properly documented, so that others can reproduce the results. Being able to reproduce the results is the most important part.
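As a small illustration of the documentation step, this is roughly the kind of information I would expect to be recorded next to every result. This is just a sketch of my own; which commands and versions actually matter depends on the experiment:

    import json
    import platform
    import subprocess

    # Rough sketch: store the environment next to the results so that others
    # can at least attempt to reproduce them. What exactly needs to be
    # recorded (driver versions, package versions, ...) depends on the
    # experiment; glxinfo is merely one example for the graphics stack.
    environment = {
        "system": platform.system(),
        "release": platform.release(),
        "version": platform.version(),
        "machine": platform.machine(),
        "glxinfo": subprocess.run(
            ["glxinfo"], capture_output=True, text=True
        ).stdout,
    }
    with open("environment.json", "w") as f:
        json.dump(environment, f, indent=2)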
If you don’t follow these steps, your experiment/benchmark cannot show anything useful, and then you should not publish it. I’m personally not a fan of science’s attitude of not publishing failures, but you should at least make clear that your setup has flaws.
Now I want to assume that the published benchmark is a “paper” and that I have been given the task of reviewing it.
The first thing I would point out is that the gathered data is not statistically relevant, and it is not clear how the environment influenced the results. The benchmark has been performed only on an “Ubuntu 12.10 development snapshot” on one Core i7 “Ivy Bridge” system. This means we don’t know whether the fact that it is a development snapshot has any influence on the result. Maybe there are debug builds? Maybe temporarily changed defaults? It also tests unreleased software (e.g. Compiz) against released software (e.g. KWin). So we have multiple flaws in the experimental setup:
- Only one operating system
- Only one hardware configuration used
- Comparing software in different development stages
Also, the fact that it uses an “Ubuntu 12.10 development snapshot” means that the results cannot be reproduced independently, as one doesn’t know the exact state of the software in use.
I want to further stress the point about the operating system; I think this is in fact the major flaw of the setup. Consider, for example, the performance improvements in openSUSE 12.2 due to switching the toolchain: something like that can considerably influence the results. So we don’t know whether Ubuntu is a good or a bad choice for such benchmarks; we just don’t know. To get comparable results, the benchmark would have to be run on multiple distributions (yes, obviously that’s not possible, as Compiz only exists for Ubuntu). This is especially relevant for the GNOME Shell tests, as Ubuntu focuses only on Compiz and one doesn’t know how that influences the performance of the other systems. Also, what is being tested here are desktop environments in general, but hardly any distribution ships a pure KDE SC: they all make adjustments, change settings and so on. One has to gather enough data to ensure that such factors do not distort the results.
The point about multiple hardware configurations is of course obvious: the differences between hardware are too large to be ignored. A computer is a highly complex system, and an operating system a highly non-deterministic one, where changing a single piece can have quite an influence. Maybe the Intel drivers are just not suited for gaming; I don’t know, and neither does the benchmark.
Now let’s move forward and have a look at the individual experiments. The first thing that strikes me is that the standard deviation is missing, which tells me quite a lot about the experimental setup. Given that it is not stated how often the experiment was run (that is, how many data sets go into one graph) and that no standard deviation is provided, I assume the experiment was run just once. That would mean the experiment is completely flawed and the gathered data has no statistical significance. If this were a paper, I would stop the review here and notify the editor. Just think of Nepomuk starting to index your files in the background while the benchmark runs, or the package manager starting to update the system: that would have quite an influence on the result, but from the given data you cannot tell whether it happened.
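To make the point about repetitions concrete, here is a minimal sketch (entirely my own, not how PTS works) of a harness that runs a test several times and reports the mean together with the standard deviation; the benchmark command and its output format are placeholders:

    import statistics
    import subprocess

    RUNS = 10  # a single run says nothing about the variance

    def run_once():
        # Placeholder: some command that prints the average fps of one
        # benchmark run on its last line; not the actual PTS invocation.
        result = subprocess.run(["./nexuiz-benchmark.sh"],
                                capture_output=True, text=True, check=True)
        return float(result.stdout.strip().splitlines()[-1])

    samples = [run_once() for _ in range(RUNS)]
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)  # sample standard deviation
    print(f"{mean:.2f} fps ± {stdev:.2f} fps (n={RUNS})")

Only with the spread and the number of runs can a reader judge whether a difference of a few tenths of a frame per second between two desktop environments means anything at all.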
But let’s assume I continue with looking at the data. Now I think back to the hypothesis and notice that, while we have quite a few data sets on the influence of desktop environments on game rendering performance, one important data set is missing: the control. For the given hypothesis only one control comes to mind: running just an X server without any desktop environment and performing the test there. This would be a very good control, as it ensures that no overhead is introduced by any desktop environment. But it’s missing. Again, if I were reviewing this as a paper, I would stop here and notify the editor.
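Such a control would not even be hard to set up. As a rough sketch of my own (path and display number are placeholders, and starting a second X server may require appropriate permissions), one could launch an X server whose only client is the benchmark itself:

    import subprocess

    # Start a second X server on display :1 whose only client is the
    # benchmark script: no window manager, no compositor, no desktop
    # environment at all. "/path/to/benchmark.sh" is a placeholder and
    # not the actual PTS command.
    subprocess.run(
        ["xinit", "/path/to/benchmark.sh", "--", ":1"],
        check=True,
    )

The numbers gathered there would show what the hardware and driver are capable of in the first place, and every desktop environment result could then be read relative to that baseline.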
Let’s continue nevertheless. I now want to pick out the data for Nexuiz v2.5.2 at a resolution of 1920×1080. The values range from 9.73 fps (KWin) to 12.79 fps (KWin with unredirection). The latter value is considerably higher than the others, so let’s look at the second best: 10.15 fps (LXDE). So we have results of 10 fps vs. 10 fps vs. 10 fps vs. 10 fps. Now I’m not a gamer, but I would not want to play a game at 10 fps. For me the result of this specific experiment does not show any difference between the various systems, but only that the hardware is not capable of running such a game at a playable frame rate.
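To put these numbers into perspective, here is the simple frame-time arithmetic behind my “10 fps vs. 10 fps” remark; the fps values are taken from the published graphs, the rest is just division:

    # Average time per frame in milliseconds for the reported values.
    for label, fps in [("KWin", 9.73), ("LXDE", 10.15),
                       ("KWin, unredirected", 12.79)]:
        print(f"{label}: {1000.0 / fps:.1f} ms per frame")
    # -> roughly 102.8 ms (KWin), 98.5 ms (LXDE) and 78.2 ms (KWin, unredirected)

Roughly 4 ms per frame separate KWin and LXDE at frame times of about 100 ms; without a standard deviation nobody can say whether that gap is more than noise.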
At this point I want to stop the analysis of this benchmark. I think it is clear that this benchmark does not “Prove Ubuntu Unity Very Slow to KDE, GNOME, Xfce, LXDE”; heck, that title is so wrong that I don’t know where to start. There is no proof, and there is nothing which shows Unity to be slow. Just look at the example given above: the differences in frames per second are not in a noticeable range. Furthermore it’s only about game rendering performance, and only on one system running a pre-release of Ubuntu. So maybe a better title would be “Benchmark on a development snapshot of Ubuntu 12.10 shows Unity to slow down game rendering performance on an Intel Ivy Bridge compared to KDE, GNOME, XFCE, LXDE”. Yes, not very catchy, I agree 🙂
My point here is that this benchmark doesn’t prove anything, and I care about that because, given the methodology of these benchmarks, it’s quite likely that the next time a benchmark is published it “proves” KDE to be the slowest, and then FUD is spread about KDE, just like when there was a benchmark “proving” that KDE uses more RAM. You are in a much better position to highlight the flaws of a benchmark if you are its “winner”; otherwise people tell you that you are in the “denial” stage.
Maybe the sanest approach to handling these benchmarks is to detect PTS in KWin and switch to a benchmark mode, just as we do for games. I have to think about that.