Why I don’t like game rendering performance benchmarks

It’s benchmark season again, and since I have raised some concerns about the results of the published benchmark, I was asked to properly explain my concerns without making it look like a rant. That is what I am trying to do with this blog post.

Given the results of the published benchmark, I could go “Wooohooo, KWin’s the fastest!”, but instead I raise concerns. I don’t see that in the data and I hope nobody else sees that in the published data.

First a little bit about my background. After finishing my computer science studies I have spent the last two and a half years working in a research department, not as a researcher, but as a developer supporting research. Our main task is to store and manage scientific results, that is, experimental data.

You cannot work for more than two years in research without having that influence how you see such data. For me a benchmark is very similar to a scientific experiment.

First you come up with a hypothesis (e.g. “Compositing on X11 influences the game rendering performance”), then you set up an experiment to prove your hypothesis (e.g. “running PTS on various hardware and distributions”), then you run your experiment multiple times to get statistically relevant data, and last but not least you validate your gathered results and go back to step one if something doesn’t fit. All these steps must be properly documented, so that others can reproduce the results. Being able to reproduce the results is most important.

If you don’t follow these steps, your experiment/benchmark is not useful to show anything, and then you should not publish it. I’m personally not a fan of the attitude in science of not publishing failures, but you should at least make clear that your setup has flaws.

Now I want to assume that the published benchmark is a “paper” and that I would have the task to review it.

The first thing I would point out is that the gathered data is not statistically relevant. It is not clear how the environment influenced the results. The benchmark has been performed only on an “Ubuntu 12.10 development snapshot” on one Core i7 “Ivy Bridge” system. This means we don’t know whether the fact that it is a development snapshot has any influence on the result. Maybe there are debug builds? Maybe temporarily changed defaults? Also it’s testing unreleased software (e.g. Compiz) against released software (e.g. KWin). So here we have multiple flaws in the experimental setup:

  • Only one operating system
  • Only one hardware configuration
  • Comparing software in different development stages

Also the fact that it uses an “Ubuntu 12.10 development snapshot” means that one cannot reproduce the results independently as one doesn’t know the exact state of the software in use.

I want to further stress the point of the operating system. I think this is in fact the major flaw of the setup. Consider, for example, the performance improvements in OpenSUSE 12.2 that came from switching the toolchain: something like that can influence the results considerably. We simply don’t know whether Ubuntu is a good or a bad basis for such benchmarks. The benchmark would have needed to be run on multiple distributions to check that the results are the same (yes, obviously that’s not possible as Compiz only exists for Ubuntu). This is especially relevant for the tests of GNOME Shell, as Ubuntu is focusing only on Compiz and one doesn’t know how that influences the performance of other systems. Also in general the desktop environments are tested here, but hardly any distribution ships a pure KDE SC version. They all make adjustments, change settings and so on. One has to gather enough data to ensure that such differences do not distort the results.

The point about multiple hardware configurations is of course obvious. The differences between hardware are too large not to be considered. A computer is a highly complex system and an operating system a highly non-deterministic one, where changing one piece can have quite some influence. Maybe the Intel drivers are just not suited for gaming; I don’t know, and neither does the benchmark.

Now let’s move forward and have a look at the individual experiments. The first thing which strikes me is that the standard deviation is missing. This tells me quite a lot about the experimental setup. Given that the article doesn’t say how often the experiment was run (that is, how many data sets go into one graph) and that the standard deviation is not provided, I assume that the experiment was run just once. This would mean that the experiment is completely flawed and that the gathered data has no statistical significance. If it were a paper I would stop the review here and notify the editor. Just think about Nepomuk starting in the background to index your files while you run the benchmark, or the package manager starting to update your system. This would have quite some influence on your result, and from the given data you cannot tell whether it happened.
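
To make the point about repeated runs concrete, here is a minimal sketch in Python of how a mean and a standard deviation could be reported for one configuration. The run values are made up for illustration and are not taken from the published benchmark, and `summarize_runs` is a hypothetical helper name, not part of PTS.

```python
import statistics


def summarize_runs(fps_runs):
    """Return (mean, sample standard deviation, number of runs)."""
    if len(fps_runs) < 2:
        # A single run gives no information about the spread.
        raise ValueError("need at least two runs to estimate a standard deviation")
    return statistics.mean(fps_runs), statistics.stdev(fps_runs), len(fps_runs)


# Hypothetical numbers, for illustration only; not taken from the benchmark.
example_runs = [10.0, 10.5, 9.5, 10.0]
mean, stdev, n = summarize_runs(example_runs)
print(f"{mean:.2f} fps +/- {stdev:.2f} (n={n})")
```

A graph that showed the mean together with this spread and the number of runs would already say a lot more about how much weight a difference between two bars deserves.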

But let’s assume I continue looking at the data. Thinking back to our hypothesis, I notice that while we have quite a few data sets on the influence of desktop environments on the game rendering performance, one important data set is missing: the control. For the given hypothesis only one control can be thought of: running just an X server without any desktop environment and running the test there. This would be a very good control as it ensures that there is no overhead introduced by any desktop environment. But it’s missing. Again, if I were reviewing this as a paper I would stop here and notify the editor.
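
As a rough sketch of what a control would allow, the following Python snippet expresses a desktop environment’s result as overhead relative to the bare X server run. Both frame rates are assumed values chosen only for illustration (the published benchmark contains no control run), and `overhead_percent` is a hypothetical helper name.

```python
def overhead_percent(control_fps, measured_fps):
    """Slowdown relative to the bare X server control, in percent."""
    if control_fps <= 0:
        raise ValueError("control frame rate must be positive")
    return 100.0 * (control_fps - measured_fps) / control_fps


# Hypothetical frame rates, for illustration only; the benchmark has no control run.
control_fps = 50.0      # bare X server, no desktop environment (assumed value)
composited_fps = 45.0   # some desktop environment (assumed value)
print(f"overhead: {overhead_percent(control_fps, composited_fps):.1f}%")
```

Without the control value there is nothing to divide by, so one can only compare the desktop environments with each other and never say how much any of them costs.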

Let’s continue nevertheless. I now want to pick out the data for Nexuiz v2.5.2 at a resolution of 1920×1080. The values range from 9.73 fps (KWin) to 12.79 fps (KWin with unredirection). The latter value is quite a bit higher than the others, so let’s look at the second best: 10.15 fps (LXDE). So we have results of 10 fps vs. 10 fps vs. 10 fps vs. 10 fps. Now I’m not a gamer, but I would not want to play a game at 10 fps. For me the result of this specific experiment does not show any difference between the various systems, but just that the hardware is not capable of playing such a game.
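
For those who want to redo the arithmetic, here is a small Python sketch that turns the three frame rates quoted above into time per frame and a relative difference. Only the three fps values come from the text; the helper names `frame_time_ms` and `relative_difference_percent` are my own and purely illustrative.

```python
# Frame rates quoted in the text for Nexuiz v2.5.2 at 1920x1080.
fps = {
    "KWin": 9.73,
    "LXDE": 10.15,
    "KWin with unredirection": 12.79,
}


def frame_time_ms(frames_per_second):
    """Average time spent on one frame, in milliseconds."""
    return 1000.0 / frames_per_second


def relative_difference_percent(baseline, other):
    """How much higher `other` is than `baseline`, in percent of `baseline`."""
    return 100.0 * (other - baseline) / baseline


for name, value in fps.items():
    print(f"{name}: {value:.2f} fps, {frame_time_ms(value):.1f} ms per frame")

gap = relative_difference_percent(fps["KWin"], fps["LXDE"])
print(f"LXDE vs. KWin: {gap:.1f}% higher")
```

Whether a gap of that size means anything cannot be decided from single values; that needs the spread of repeated runs.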

At this point I want to stop the analysis of this benchmark. I think it is clear that this benchmark does not “Prove Ubuntu Unity Very Slow to KDE, GNOME, Xfce, LXDE”; heck, that title is so wrong that I don’t know where to start. There is no “prove” and there is nothing which shows it to be slow. Just look at the example given above: the difference in frames per second is too small to be noticeable. Furthermore it’s just about game rendering performance and only on the one system using a pre-release of Ubuntu. So maybe as a title “Benchmark on a development snapshot of Ubuntu 12.10 shows Unity to slow down game rendering performance on an Intel Ivy Bridge compared to KDE, GNOME, XFCE, LXDE”. Yes, not very catchy, I agree :-)

My point here is that this doesn’t prove anything and I care about that, because given the methodology of these benchmarks it’s quite likely that the next time a benchmark is published it “proves” KDE to be slowest and then FUD is spread about KDE just like when there was a benchmark “proving” that KDE uses more RAM. You are in a much better position to highlight the flaws of the benchmarks if you are the “winner” of the benchmark, otherwise people tell you that you are in the “denial” stage.

Maybe the most sane approach to handle these benchmarks is to detect the PTS in KWin and to go to benchmark mode, just like games. I have to think about that.

39 thoughts on “Why I don’t like game rendering performance benchmarks”

  1. Thanks for writing such a well argued piece about this “benchmark” in particular and the whole problem of pts in general. Science, it works…

  2. “but hardly any distribution ships a pure KDE SC version.”

    – KDE in Arch Linux is basically pure vanilla and I love it that way

    1. Oh, forgot to mention that you’re right. Anyone with more scientific background is looking at benchmarks… let’s call it “sceptically”. Phoronix is not a respectable, serious page and has to be taken more like a good blog.

      1. it doesn’t even qualify as good. Most of his articles read as if he was dead drunk while writing; it seems he took only parts of “write on whiskey, edit on caffeine” to heart. Lately Michael has been stroking the egos of people sending crap to the kernel-dev mailing list “because they are funny lol!” and when called on it he came up with some crap about how anyone who wanted attention could just write for his site. Which is imo even worse than the usual troll-bait headlines which are the standard.

  3. Phoronix-test-suite does try to have statistical relevance, but c’mon, you can’t assume a gaming/linux benchmark site, and every benchmark from that site, to be like a scientific paper. People would not bother to read it!

    You can find more info if you follow the link to openbenchmarking, and I think you can also use the command-line app to download the xml file containing all the individual results from the multiple runs.

    The idea of not using a stable version, and instead using the latest test version of ubuntu, is that it is supposed to be tested! If any results taken with a beta version are not valid, should people wait for the final “”stable”” version to start regression testing? Isn’t that against the idea of testing? Sure, it might have some strange data due to test versions, but it’s not like phoronix doesn’t repeat benchmarks later when the final version is released. It says right there “With Unity/Compiz being in a constant state of flux, these same tests will likely be carried out again (and from more GPUs/drivers) once Ubuntu 12.10 is officially released in October.”

    I think phoronix reporting has its issues, sure, but I think you are also being a bit unfair in taking a bunch of assumptions of what this benchmark should be and show, and then criticizing the article for not following them.

    1. Agreed. Yes, the method is fundamentally flawed from a scientific point of view. So is EVERY benchmark from EVERY hardware, software and gaming website out there. While Mr Larabel should probably not be using pre-release software, other than that he’s rather in the top-10% in terms of how well he does his benchmarking. The thing he wrote does do multiple runs, automated and repeatable.

      Yes, hardware and software are never equal – that is an issue. But the hardware he uses is rather typical for a modern gaming rig so it’s not that bad imho… It’s mostly the software (unstable, pre-release stuff) I have a bit of a problem with. Testing stable releases of a few distros would make more sense (but yes, Unity is not available for non-Ubuntu systems ATM).

    2. If you look closely at my blog post you will find that my main criticism is the headline of the post, because that is what the media spread around. And the heading has the word “prove” in it. All I show in this post is that nothing is proven. And yes, if you want to prove something don’t use e.g. a beta release. It needs to be tested, sure, but the data also needs to be analyzed.

      You know I can give you a good example of a benchmark which worked well. Owen Taylor from GNOME Shell fame did a proper benchmark on compositors after one of the Phoronix tests. It showed that KWin’s performance goes down with the number of open windows. It didn’t make sense, I did run the benchmark on my system, I studied the code and the end result is that KWin got a huge performance improvement in 4.8. You have to analyze the data gathered by the benchmark. And that’s just missing, it’s not helpful for developing a compositor.

      1. Exactly, it’s just the headline that is flawed, and there is no real way to make it better within the constraints of, well, a headline. There is a good reason why scientific papers and presentations have such long names, yes, but even then usually it is suggested to keep those names short and more vague – you can tell what is really being talked about while reading the thing itself (and in this case, it’s obvious that the aim was to benchmark compositors on Ivy Bridge using a development snapshot of Ubuntu – it’s other people who should not use the results to talk about other platforms). Although I have to admit, the headlines there are traditionally pretty bad.

        And you already gave an answer to the second paragraph there yourself. Making “proper” (or, rather, more sophisticated) benchmarks takes a long time and effort, and it’s not what the website is all about.

  4. Interesting read, thank you.

    I just want to point out that Compiz is packaged and maintained in at least one other distribution (Mageia, by me FWIW).

    regards
    Julien

    1. You are shipping the latest 0.9 branch? When I last checked most distributions still shipping Compiz used the 0.8 branch

  5. Wanna step up and do some tests yourself? You look like you know the proper way and it would be nice to see some test not from phoronix, as far as I know they are the only ones doing benchmarks for linux :/

    Would love to see some benchmarks from you :)

    1. No, because I don’t have the time to do that. Just think how long it would take to do a proper benchmark which I have described here. It would take days to weeks. So it’s not an easy thing to do.

      And even if I had the time, there is one point not shown in this post: I do not consider game rendering performance as an important factor at all. Heck our solution is to tell people to disable compositing. We optimize for real desktop usage, so I do not care at all about any game rendering performance.

      1. See, this is the problem.
        No one but Michael is taking some time to benchmark Linux software in general.
        My opinion is that his tools are very good and provide all details you mentioned, but the articles are just an overview, if you really care about the details you should follow the link to openbenchmarking.org.
        And yes it does sound like a rant, mostly because your problem seems to be with the title, but you wrote this much complaining about things which don’t even apply!

        1. no, no, you misunderstood. The title is just the tip of the iceberg, the catchy incorrect thing that is copied everywhere. People on e.g. reddit just read the headline, maybe they go to Phoronix, but they will not follow the links to openbenchmarking.org. Such an article simply must include all the information on how the benchmark was performed.

          And all the points are valid. Of course it’s a problem that it’s only run on one system and one distribution; going to openbenchmarking.org doesn’t make it better. (Please note that I also consider the complete PTS fundamentally flawed, but also other benchmark suites such as 3dmark 2000)

  6. For some reason, every time I read one of your blog posts, I just can’t help but think, “That guy sounds like someone with an awesome beard.”

    Just a random post for the sake of being random. :P

  7. Actually, as a good scientist you should name the dubious benchmark (even though the whole article sounds like Phoronix).

    1. I was sure that those who want to know would figure it out – for everyone else it actually doesn’t matter.

      1. Still had to do _two_ Google searches to find it… (*)

        (*) okay, the first one just failed because I copied “Ubuntu 12.10 development snaphot” and Google didn’t like your typo :-P

  8. Good points here. I think there’s very good reason to believe that the choice of Ubuntu over OpenSuSE (say) would affect the results. We have a computer lab that has run Ubuntu for years. After updating to 12.04, the systems ran so slowly that they were essentially unusable for us. I’m still not sure why — something with NFS perhaps. We switched to OpenSuSE 12.2 yesterday, and they are much, much faster. So choice of distribution can have a HUGE impact on performance.

    A study that averaged several systems running several different distributions would definitely be more accurate — but also much harder to pull off.

  9. As usual, if one does not like some benchmark, one could always do one’s own. There are a lot (a lot!) of open source WMs (people obviously find a need for them), so if some open benchmarks are not “good” enough, there will be more coming.

    Little bit on the “scientific” stuff:
    I totally agree, it’s not just the WM that could slow down 3D-intensive applications (remember that games are just the best of them, not the only ones using 3D!), but… the trend in some camps is that the WM is pretty much most of the big desktop (we all love it when simple scripts can break our log-in session :d ).
    Don’t get me wrong, but when it comes to
    “All these steps must be properly documented, so that others can reproduce the results”
    any piece of relatively complex software does not comply 100% with that! Does any relatively big KDE program behave 100% the same on all possible installations?! And by that – benchmarking, or testing, that kind of software by definition could not be 100% “scientific” accurate ;)

  10. More than the science of the benchmark, I question the results: what’s a difference of 1fps in two test WMs? I doubt it makes a real difference (both from a “statistical” point of view, and also from a user’s perspective)

  11. You say Compiz is Ubuntu only, but that is not true ~ you can build and use compiz (the latest code, or ubuntu source package) on pretty much ANY distro. it should also be noted that compiz (itself) often is faster than Unity (including right now, using the revision/bzr that he used in his last benchmarks for Unity).

    but i agree with most of what you say, michael NEVER does proper benchmarking, he just runs a quick once through, often preferring intel iGPUs and tests everything with Ubuntu (which he rationalizes is the ‘standard’ for linux, since it is the most used distro). OS makes a big difference, toolchain, etc all have to be factored in && these types of benchmarks should be done across a wide variety of H/W.

    1. Well, for linux usage he does have a point about ubuntu being most relevant.

      Let’s keep things straight: the OS is Linux. That stays the same whether he tests Ubuntu, SUSE, Gentoo or whatever. It is the distro for a given OS that changes, so that is where the distinction and separate tests are needed, where applicable (i.e. mainly Linux). Among the *BSDs I mainly know FreeBSD (so please add the others): on the desktop there are FreeBSD, PC-BSD and GhostBSD, for networking there are Junos, pfSense and m0n0wall, and for storage there is FreeNAS.

      Regarding “michael NEVER does proper benchmarking”: I haven’t checked openbenchmarking.org thoroughly enough to make that call, but most of his conclusions in the articles on phoronix.com are a bit premature given the data presented.

  12. @DeadLock: [quote]And by that – benchmarking, or testing, that kind of software by definition could not be 100% “scientific” accurate[/quote]

    yes it could!, but you would have to make sure there is a proper scientific method behind it that you can replicate n times, anywhere in the world, …so that’s not an experiment, an experiment would use statistics, I would expect to see at least a χ² (chi-squared) test. I also think you are missing something important here, the tool or concept of approximation; finding an answer that is good enough! imho all these ‘benchmarks’ are junk and I’d never pay attention to them unless they start to use the correct statistical methods, that is. (which isn’t at all difficult, just time consuming)

  13. “Maybe the most sane approach to handle these benchmarks is to detect the PTS in KWin and to go to benchmark mode, just like games. I have to think about that.”

    Hehehe, you wouldn’t :p Not after your other blog post that mentioned such a hack for OpenOffice, which you removed with good reason :)

    Cheers,
    Mark

  14. Very nicely written, with a proper perspective on these so-called benchmark numbers. No one talks about usability; in that sense, Unity is not maturing into something with maximum ease of use. I use Chakra KDE and Ubuntu 12.04 on different systems, so my take is that both are progressing in their own directions and can’t be directly compared. It’s about what you want, and Linux has always been about that.
