On Benchmarks

A well-known Linux website published a “benchmark” comparing Plasma on Wayland, Plasma on Xorg, and GNOME Shell (Wayland and Xorg). Before anybody tries to draw any conclusions: this is not a proper benchmark. It has no statistical relevance, as it was run on only one hardware configuration and only one distribution. It shows numbers, that’s it. The numbers might be nice or not, I don’t know. I am not able to draw any conclusions from them.

This has been a general problem with those benchmarks for years. The setup is IMHO utterly stupid and not helpful for development. Rather the opposite, as we have to spend time explaining why the benchmark has a useless setup. There are two possible ways to address such benchmarks: ignore or volkswagen. On X11 we do, to a certain degree, volkswagen those benchmarks, as we turn the compositor off if the games used in the benchmark request it. But we don’t go the whole way yet: we don’t check for any of the benchmark applications and do not yet adjust our rendering to get better numbers. If anyone is interested: most of it should be easily achievable through KWin scripts (a sketch follows below). But yeah, if the numbers show better results for X11 than for Wayland it might be due to KWin cheating on X11.
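
To make the KWin-script remark concrete, here is a minimal sketch of what such a cheat could look like. It is an illustration under assumptions, not shipped code: the benchmark window classes are invented examples, and it relies on the clientAdded signal and the blocksCompositing property as exposed to KWin scripts (the latter is the same mechanism the “Block compositing” window rule uses).

```js
// Hypothetical KWin script (JavaScript): suspend compositing whenever a
// window that looks like a known benchmark appears. The class names
// below are invented examples, not a real detection list.
var benchmarkClasses = ["glmark2", "unigine_heaven", "unigine_valley"];

workspace.clientAdded.connect(function (client) {
    var windowClass = client.resourceClass.toString();
    if (benchmarkClasses.indexOf(windowClass) !== -1) {
        // Same effect as the "Block compositing" window rule: KWin
        // suspends compositing while this window exists.
        client.blocksCompositing = true;
    }
});
```

Detecting benchmarks by window class and suspending compositing is the script-friendly part; adjusting the actual rendering to inflate numbers would require changes beyond a script.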

32 Replies to “On Benchmarks”

  1. It would be great to have more control over turning the compositor on and off while playing games. Many Linux games which I play are not well optimized for Linux, so even the possibility of getting a few more FPS is good. However, I’m not completely sure it would make any noticeable difference.
    On the other hand, I often run into situations where a game hangs or crashes, and after killing it there are no window borders and all I can do is reboot.
    In rare cases even windowed applications disable some effects like shadows (I’m talking about the Unreal Engine editor).

    1. Oh yes. There are many bad Linux ports out there. The main problem is mostly not that Linux, Wayland, or the compositor in use couldn’t do it better; the problem lies with the many careless vendors who treat Windows as primary and nothing else. Mostly it’s better to ignore such vendors; they obviously have enough money. Personally I only play open-source games, like Xonotic and others, which run perfectly all the time. With many AAA games I have had very frustrating problems, on Linux and in earlier times on Windows too. I will never again be an alpha tester for all these careless vendors, not when every game is so expensive. For me this is typical of closed-source software: don’t ask about quality.

    2. Well, you can either disable compositing temporarily yourself with a keyboard shortcut (Alt+Shift+F12 by default), or you can use KWin’s per-application window rules to specify “this specific game will block compositing while it’s running” (a sample rule follows below).

      Is that what you mean by “more control”?
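
      For reference, such a rule ends up in ~/.config/kwinrulesrc roughly as below. This is a from-memory sketch: the window class “mygame” is a made-up example, and key names may differ between KWin versions, so creating the rule through the System Settings window-rules UI is the reliable route.

      ```ini
      [1]
      Description=Block compositing for this game
      wmclass=mygame
      wmclassmatch=1
      blockcompositing=true
      blockcompositingrule=2

      [General]
      count=1
      ```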

  2. Since you state that “the setup is IMHO utterly stupid and not helpful for development”, could you please constructively suggest an alternative benchmark setup that would be more helpful for users and developers?

  3. Martin, in general the benchmarks on that website are very helpful and keep the competition going between graphics drivers, game ports, file systems, etc.
    What approach would you propose to get solid numbers that are comparable between X/Wayland and GNOME/KDE? I guess you do performance work yourself; which numbers do you use? I’m pretty sure Michael L. will happily incorporate any suggestions into his test suite, as long as there are numbers to compare.

    1. I’m sure Michael is aware of the issues with his benchmark setups. For starters, just fixing the statistical relevance would be an improvement: run the tests multiple times, on multiple systems, and on those systems with multiple distributions. Provide standard deviation and the mean average. Now the thing is: that is not a way to make money. It doesn’t make for click-bait like “Benchmark 5 systems, see which has the highest number”. This is serious work of probably several days setting up the systems and running the benchmarks. So I doubt Michael is interested in improving his benchmark system. It’s not his business model.
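
      To illustrate what repeated runs buy you, a minimal sketch (plain JavaScript, runnable in Node; the FPS numbers are invented): once each configuration has a mean and a standard deviation, two results whose intervals overlap cannot honestly be ranked against each other.

      ```js
      // Aggregate repeated runs of one benchmark into a mean and a sample
      // standard deviation; a single run provides neither.
      function summarize(samples) {
          var mean = samples.reduce(function (a, b) { return a + b; }, 0) / samples.length;
          var variance = samples.reduce(function (acc, x) {
              return acc + (x - mean) * (x - mean);
          }, 0) / (samples.length - 1); // Bessel-corrected sample variance
          return { mean: mean, stdDev: Math.sqrt(variance) };
      }

      var runs = [59.8, 61.2, 60.4, 58.9, 60.7]; // five runs, FPS (invented)
      var s = summarize(runs);
      console.log(s.mean.toFixed(1) + " ± " + s.stdDev.toFixed(1) + " FPS"); // 60.2 ± 0.9 FPS
      ```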

      1. > Provide standard deviation and the mean average.

        Just for Michael’s defense: he provided the standard deviations (SE) and mean averages in his plots. Unfortunately, he doesn’t tell us how often he actually ran each test, and he hides the distributions from us.
        So we don’t know whether there are outliers trashing the statistics.

        From my point of view, frame times and an X-percent quantile would be more interesting than the arithmetic average FPS (see the sketch after this comment).

        Further, how is the GPU test result to be interpreted? Does the quotient Q of the points tell us machine X is Q times faster (or slower) than the other, or what?

        Nonetheless, as you said, the results apparently only represent the performance of this very particular system with its unique configuration, and they are unlikely to represent the performance of other systems.
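
        To make the quantile suggestion concrete, a rough sketch (plain JavaScript, runnable in Node; the frame times are invented): a high percentile of the frame-time distribution exposes stutter that an average FPS figure averages away.

        ```js
        // Nearest-rank percentile over raw per-frame timings in milliseconds.
        function percentile(samples, p) {
            var sorted = samples.slice().sort(function (a, b) { return a - b; });
            var rank = Math.ceil((p / 100) * sorted.length); // nearest-rank method
            return sorted[Math.max(0, rank - 1)];
        }

        // Mostly ~16.6 ms (60 FPS) with two stutter spikes (invented data).
        var frameTimesMs = [16.6, 16.8, 16.5, 33.9, 16.7, 16.6, 50.2, 16.5, 16.6, 16.7];
        console.log("99th percentile frame time: " + percentile(frameTimesMs, 99) + " ms");
        ```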

              1. What I read before is that the standard deviation is shown only if it’s relevant. I think that’s a good system.

      2. Phoronix is not all about helping devs. It’s also about helping users. For me as a Tumbleweed user those numbers were helpful. Why do you even think Michael wants to help you? You are not helping him either.
        A user requested those benchmarks, not you. Did you even read the article?

        1. If the information is wrong, how can it help users? If I, as the maintainer of the software, question the benchmark, how can you, as a non-expert, draw any useful information from it? No, Michael is not helping users. He is confusing them at best.

          1. Wrong or right is a matter of perspective. For me the numbers represent Tumbleweed’s out-of-the-box performance on the given system. What is wrong with that?
            Maybe the way you tried to explain it was wrong. An emotional choice of words usually isn’t persuasive.

            1. Also: how do you expect Michael to understand the internals of KWin, or the shortcomings of his setup, if you are not communicating with him? Pick up the phone and get constructive. Whom do slanderous blog posts help? And if Michael in the end still answers “who pays my bill” and continues with objectively wrong benchmarks – then you can slander.

                  1. As Martin points out, the distribution is not the only factor.

                    Michael has been doing benchmarks for a couple of years now and says himself that it is a lot of work to do them ‘right’.

                    He simply seems to reduce quality in order to improve his income.

                    And he repeatedly ignores the advice and criticism his tests receive.

                    If I were to test such issues, contacting people such as Martin would be the first thing I’d do.

  4. There really is a segment of the market that cares how literally every bit of RAM is used: “I have 16 GB of RAM but refuse to use this DE since it uses 100 MB more RAM than this other one.” This problem is easy to see in the Android world, where phones cheat in benchmark apps by jumping to a higher performance tier when they see apps with certain names running. And that is even though most computers today are overkill for most people’s needs.

    In fact, a new HP laptop with 16 GB of RAM and a recent AMD Ryzen 5 APU can be found for under $800 on Amazon and HP’s website!

  5. Maybe it’s because games typically run on X, so under Wayland compositors they use XWayland, which ends up causing a performance hit.

    I’ve tried running teeworlds 0.7-git (it uses SDL2 with Wayland support) on KDE on Wayland – it was much smoother than running teeworlds on KDE on Xorg.

    Of course, this is a subjective impression; I didn’t run any benchmarks, but it looks like Wayland worked better overall.

    This was on an old weak computer (AMD C-60, 1 GHz dual core).

  6. Maybe this is off topic, but I think KWin’s Wayland session should tune itself automatically based on the hardware, not manually. Some kind of built-in benchmark run at first boot for tuning.

  7. > But yeah, if the numbers show better results for X11 than for Wayland it might be due to KWin cheating on X11.

    Actually, it seems to be rather the opposite in that case.

    Anyway, I fully understand your grudges against “benchmarks” like these, but they still have some value. I put “benchmarks” in quotes, as this is more of a performance test than a proper benchmark.
    That said, I would rather have a single data point than none. In this case, I know that at least some configurations allow the same performance under X and (X)Wayland. Which is more helpful than you seem to think.

    Of course, the methodology isn’t good enough to do a complete performance evaluation of compositor one vs compositor two, but it allows me to judge where we are in terms of performance: I remember seeing some that showed a huge performance hit due to XWayland some years ago. Here, both compositors seem to perform rather well, and stay comparable.

  8. You can’t turn compositing off for Wayland, right?

    The main reason the numbers are useless is that Phoronix benchmarked software that nobody uses (or that is a synthetic benchmark).

    I still think the benchmarks are *a little bit* interesting, since they show that, one useless piece of software aside, Mutter and KWin are basically equal performance-wise.

    1. Whoops, that is not what I meant to say. I meant that the thought passed my mind when I saw the graphs, but then I remembered what benchmarking software was used, how varied it was, and that it was mostly useless.

  9. This well-known Linux website has generated useless and misleading benchmarks for years, especially comparisons between SBCs and SoCs running Linux. I am positively surprised that someone else is publicly confronting such “technical reports”, which serve only the author and his ad revenue but have little value to users.
