KWin and the unmanageable combinations of drivers

This week I was once again reminded of the biggest problem in KWin development: the differences between the many drivers. Since Monday, KWin uses a new OpenGL 2-based code path by default. The code was mostly written on one of my systems with the nouveau driver, and regression testing was mostly done on my second system using fglrx, which uses the "legacy" code path by default. It can be forced onto the new one, but is then too slow for productive use.

Around Wednesday, notmart reported that the code was broken on the NVIDIA proprietary driver. Since I had developed on nouveau, I had not tested with the blob. So I had to switch drivers again, and I could confirm the regression (having fixed it, I can say it was clearly our bug and not NVIDIA's). The regression occurred in a code path that seems to be executed only with the NVIDIA blob. It took me hours to figure out what was causing the bug and how to fix it. This illustrates the big problem in KWin development: without an NVIDIA card and its driver I would not have been able to fix it, and I doubt that anyone unfamiliar with the code would have been able to fix it either.

Currently KWin supports hardware from the three big vendors: NVIDIA, ATI/AMD and Intel. Each vendor has several hardware generations. The biggest difference is between fixed-function hardware and general-purpose GPUs. The differences between hardware generations are so big that assuming code works on all of them just because it works on one is rather naive.

So we have different vendors, each with several hardware generations. Let's make it more complex: drivers. There are different drivers for the same hardware. Currently we see the following drivers:

  • NVIDIA proprietary driver
  • Catalyst (aka fglrx) proprietary driver
  • Intel (Mesa driver)
  • Radeon (Mesa driver for AMD/ATI)
  • Radeon/Gallium3D (Mesa driver for AMD/ATI)
  • Nouveau (Mesa driver for NVIDIA, Gallium3D, experimental)

Except for Intel, we have several drivers for the same piece of hardware; for AMD/ATI there are even several free drivers (the list is missing radeonhd). The big hope here is the Gallium3D stack, which will hopefully simplify the situation by providing one free stack shared by the various drivers. As my example above illustrates, different drivers for the same hardware behave differently. So we need to test not only with various hardware but also with the various drivers. At this point we are somewhere around 30 to 50 combinations of vendors, hardware generations and drivers.
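Testing per driver presupposes knowing which driver is actually running. A minimal sketch of such a classification in Python follows, assuming the driver can be recognized from substrings of the strings returned by glGetString() for GL_VENDOR, GL_RENDERER and GL_VERSION. The substring rules and the enum names are illustrative assumptions, not KWin's real detection code, and real drivers report many more variants than these rules cover.

```python
from enum import Enum


class Driver(Enum):
    """The drivers from the list above (radeonhd left out)."""
    NVIDIA_BLOB = "NVIDIA proprietary driver"
    CATALYST = "Catalyst (fglrx)"
    INTEL = "Intel (Mesa)"
    RADEON = "Radeon (Mesa, classic)"
    RADEON_GALLIUM = "Radeon/Gallium3D (Mesa)"
    NOUVEAU = "Nouveau (Mesa, Gallium3D)"
    UNKNOWN = "unknown"


def classify_driver(vendor: str, renderer: str, version: str) -> Driver:
    """Guess the driver from the GL_VENDOR, GL_RENDERER and GL_VERSION strings.

    All substring rules below are illustrative assumptions.
    """
    if "Mesa" in version:
        # Assumption: every Mesa driver names Mesa in GL_VERSION, e.g. "2.1 Mesa 7.10".
        if vendor == "nouveau":
            return Driver.NOUVEAU
        if "Intel" in renderer:
            return Driver.INTEL
        if "Gallium" in renderer and ("AMD" in renderer or "ATI" in renderer):
            return Driver.RADEON_GALLIUM
        if any(chip in renderer for chip in ("R100", "R200", "R300", "R600")):
            return Driver.RADEON
        # Anything else, e.g. a software rasterizer.
        return Driver.UNKNOWN
    # Proprietary drivers are assumed to identify themselves in GL_VENDOR.
    if vendor.startswith("NVIDIA"):
        return Driver.NVIDIA_BLOB
    if vendor.startswith(("ATI Technologies", "Advanced Micro Devices")):
        return Driver.CATALYST
    return Driver.UNKNOWN


print(classify_driver("nouveau", "Gallium 0.4 on NVA8", "2.1 Mesa 7.10"))
```

Checking for Mesa first matters in this sketch: a free driver may report a vendor string that looks like a hardware vendor, so the vendor prefix alone is not trusted for Mesa drivers.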

That's quite complex already, isn't it? So let's make it more complex: driver versions. Obviously every vendor and driver developer keeps working on their drivers, fixing bugs and introducing new ones. Of course we need to support all versions properly, so we would need to test with all driver versions. For example, we know that there are regressions with the latest NVIDIA blobs. Unfortunately I'm still on an older version because I use Debian Testing and don't want to mess with my setup, so I cannot investigate these issues.

OK, driver versions have made everything more complex. Let's add another complexity factor for the free drivers: they are more or less bundled together with libdrm, Mesa, the X server and the kernel. Some drivers require a specific kernel version or won't work with another one. This opens Pandora's box as far as possible combinations are concerned.
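To see how such requirements interact, the sketch below enumerates kernel, Mesa and X server versions and keeps only the combinations that satisfy a minimum-kernel rule. The version numbers are abstract integers and the rule (Mesa version n needs kernel version n or newer) is made up for illustration; neither reflects real releases or real requirements.

```python
from itertools import product

# Abstract, made-up version numbers: these are not real releases.
KERNEL_VERSIONS = [1, 2, 3]
MESA_VERSIONS = [1, 2, 3]
XSERVER_VERSIONS = [1, 2]


def mesa_runs_on(mesa: int, kernel: int) -> bool:
    """Made-up rule: Mesa version n needs kernel version n or newer."""
    return kernel >= mesa


def valid_stacks():
    """Return all (kernel, mesa, xserver) triples that satisfy the rule."""
    return [
        (kernel, mesa, xserver)
        for kernel, mesa, xserver in product(KERNEL_VERSIONS, MESA_VERSIONS, XSERVER_VERSIONS)
        if mesa_runs_on(mesa, kernel)
    ]


total = len(KERNEL_VERSIONS) * len(MESA_VERSIONS) * len(XSERVER_VERSIONS)
print(f"{len(valid_stacks())} of {total} stacks are valid")  # 12 of 18 stacks are valid
```

The constraint removes some combinations, but every remaining stack is still a separate configuration that can break in its own way.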

So we have reached a number of combinations that cannot be tested without recompiling the complete kernel/Mesa/X stack. Let's add another complexity factor: distributions. Distributions do not just ship the upstream drivers; they "improve" them. We have seen several times in the past that distro-specific code broke KWin. A distribution tries to support a vast diversity of software, and it can happen that an optimization for Compiz causes rendering issues in KWin. Distributions also sometimes ship combinations of drivers and kernel that do not work together. This mostly happens with rolling-release distributions, e.g. a new Mesa version requiring a Linux kernel that has not been released yet. But rolling-release distributions like Arch also hit problems before other distributions do, which can help us work around issues before they reach distributions with a larger user base such as Kubuntu or openSUSE.

Altogether there are several hundred combinations of the stack below KWin, and a change in any of the components can cause regressions. Even trying to test all of them would be ridiculous. If I tried to test a reasonable subset of the combinations (vendors * hardware generations * drivers * latest version), I would need something like 30 machines. Of course I don't have the time for that, which is also why I have declined all offers of hardware so far. It would just be a drop in the bucket.
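The estimate above is a plain product of the dimensions. The sketch below enumerates the vendor/generation/driver matrix for the drivers listed earlier, testing only the latest version of each driver. The per-vendor generation counts are placeholder assumptions, so the resulting number is illustrative only.

```python
from itertools import product

# Drivers per vendor, as in the list above (radeonhd left out).
DRIVERS = {
    "NVIDIA": ["NVIDIA blob", "nouveau"],
    "AMD/ATI": ["Catalyst", "radeon (classic)", "radeon (Gallium3D)"],
    "Intel": ["intel"],
}

# Placeholder assumption: number of hardware generations per vendor that
# behave differently enough to need separate testing.
GENERATIONS = {"NVIDIA": 4, "AMD/ATI": 4, "Intel": 3}


def build_matrix():
    """Return (vendor, generation, driver) triples, latest driver version only."""
    return [
        (vendor, generation, driver)
        for vendor, drivers in DRIVERS.items()
        for generation, driver in product(range(GENERATIONS[vendor]), drivers)
    ]


print(len(build_matrix()), "configurations")  # 23 configurations
```

Every further dimension (older driver versions, kernel/Mesa/X combinations, distribution patches) multiplies this count rather than adding to it.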

But we have users who can test. We need users willing to run the latest KWin master together with the latest driver versions to ensure that we do not run into regressions. And we need users who file high-quality bug reports and are willing to report them on the freedesktop.org bug tracker as well.

And we need developers, preferably with OpenGL experience. Now is the perfect time to join KWin development: we switched to a more readable coding style and introduced a new, modern OpenGL shader-based compositing backend. You could work on improving the user experience for all users by writing more awesome effects or optimizing our backend. It's really fun stuff.

35 Replies to “KWin and the unmanageable combinations of drivers”

  1. Just a great big thank you to the entire team for developing kwin. Reading this, I have to admire the work you do even more, and it looks like the improvements are set to continue 🙂

    Thank you! <3

  2. “Nouveau (Mesa driver for AMD/ATI, Gallium3D, experimental)” <– Understandable typo with two drivers for AMD/ATI above Nouveau, but this should be fixed.

  3. Given the limited amount of developer time, I wonder whether it’s really a good idea to try to support sophisticated effects on such a huge array of choices. Rather than working around the bugs, perhaps fall back to basic functionality for any combination that doesn’t work? That would prevent KWin from looking bad while simultaneously increasing pressure in other places in the stack to get things fixed.

    When an upgrade is performed, is there a way that one could do a “self-test” and figure out whether one is getting the expected results? That might allow a “blacklist” to be largely automated, rather than relying on user reports to determine which combinations work and which ones don’t. If the results of the self-test could also be sent to bugzilla (with user’s permission, of course), you’d get more information from users, not less.

    I suppose the hardest part is figuring out how to do the test. Perhaps there could be a set of scenes rendered, and then sections of screen captured and compared against a stored .png containing the expected results? That, plus capturing driver errors and measuring the execution time should, I would think, tell you most of what you need to know. I say all this naively, not knowing anything about OpenGL.
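As the reply below notes, KWin has no such self-test; the following is only a sketch of the comparison step the commenter describes. It compares a captured region against a stored reference pixel by pixel, allows a small per-channel tolerance (drivers may legitimately differ by rounding), and fails if too many pixels differ. Images are plain lists of RGB tuples to keep it self-contained; capturing the screen and loading the reference .png are left out, and both thresholds are arbitrary assumptions.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0-255


def mismatch_fraction(captured: List[Pixel], reference: List[Pixel], tolerance: int = 2) -> float:
    """Fraction of pixels where any channel differs by more than `tolerance`."""
    if len(captured) != len(reference):
        raise ValueError("captured region and reference differ in size")
    if not reference:
        return 0.0
    bad = sum(
        1
        for got, want in zip(captured, reference)
        if any(abs(g - w) > tolerance for g, w in zip(got, want))
    )
    return bad / len(reference)


def self_test_passes(captured: List[Pixel], reference: List[Pixel],
                     tolerance: int = 2, max_bad_fraction: float = 0.01) -> bool:
    """Pass if at most `max_bad_fraction` of the pixels are off (arbitrary threshold)."""
    return mismatch_fraction(captured, reference, tolerance) <= max_bad_fraction


reference = [(0, 0, 0)] * 100                  # hypothetical 10x10 black test scene
captured = [(1, 1, 1)] * 99 + [(255, 0, 0)]    # rounding noise plus one wrong pixel
print(self_test_passes(captured, reference))
```

A tolerance-based comparison avoids flagging every driver whose blending rounds slightly differently, which an exact match against a reference image would do.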

    1. Given the limited amount of developer time, I wonder whether it’s really a good idea to try to support sophisticated effects on such a huge array of choices. Rather than working around the bugs, perhaps fall back to basic functionality for any combination that doesn’t work?

      Hardly any effect uses GL directly. Most go through our abstraction layer, which makes development easy. A fix in the abstraction directly benefits all effects.

      When an upgrade is performed, is there a way that one could do a “self-test” and figure out whether one is getting the expected results

      No, nothing like this is implemented yet. But I want to have the compositor as a standalone application to make it possible to test it.

      1. > Hardly any effect uses GL directly. Most go through our abstraction layer, which makes development easy. A fix in the abstraction directly benefits all effects.

        Are you saying that KWin doesn’t introduce lots of combination-specific code? And that it’s mostly just a matter of testing the various combinations and reporting the bug to the appropriate place in the stack? If so, then I misunderstood the point of your post: I thought you were saying that the current situation greatly increases the complexity of KWin’s code, whereas now I think you’re saying it is a quality-assurance problem that affects people’s perception of KWin but in fact is only rarely your problem.

        > I want to have the compositor as a standalone application to make it possible to test it.

        I agree that sounds nice, but is it strictly required? Could one have, after an upgrade & reboot, a full-screen application run prior to the log-in screen (“Please wait while the graphical capabilities of your card are tested…”) using the compositor as it is now? Or is that infeasible?

        1. whereas now I think you’re saying it is a quality-assurance problem that affects people’s perception of KWin but in fact is only rarely your problem.

          Yes, the post is about the difficulty of QA. The number of hardware-specific code paths is rather low and I want to keep it that way 🙂

          Could one have, after an upgrade & reboot, a full-screen application run prior to the log-in screen (“Please wait while the graphical capabilities of your card are tested…”) using the compositor as it is now? Or is that infeasible?

          Sounds like a bad first user experience. And when to run it? The most dangerous point is not the KWin update, but the driver update or the kernel update or the X update – completely out of control of KDE.

          1. > The number of hardware-specific code paths is rather low and I want to keep it that way 🙂

            Good thinking!

            > Sounds like a bad first user experience.

            I agree that this is a potential concern, but I doubt it is nearly as serious as having one’s display corrupted to the point of unusability upon first login.

            > And when to run it? The most dangerous point is not the KWin update, but the driver update or the kernel update or the X update – completely out of control of KDE.

            It sounds like the kind of thing that a distribution would have to enable; when upgrading the kernel package, a flag is set so that at next reboot the graphical system is re-tested before the log-in screen is reached.

            I know the distros must also be worried about the same QA issue that concerns you. If you like the idea of getting lots of test results on all kinds of different hardware platforms, then might it be worth asking distributors whether they would support this? If not, I agree that it would be a waste of your time, but if they would be enthusiastic then it sounds like something that could work.

            1. Doing it in distros is too late. We need to find the issues before they hit the distros. It would only make sense in rolling-release distributions, and for them it could be quite annoying to have the test run every other day.

              1. I see your point.

                But how about for live CDs/beta releases/etc? Personally, I almost never compile and test something as large as KDE/X/kernel, but when a new Kubuntu is coming out I do make an effort to download a beta ISO and then report the problems I find. If that could somehow be automated, you would presumably get far more information (specific hardware details, specific operations that fail, etc.) rather than the more amorphous “my screen is messed up” bug reports.

                The bottom line: if there were some application I could run that would put my graphical system through its paces and send the results to you (maybe it already exists?), I’d be happy to do it. Currently, I get the feeling that the graphics landscape is changing so rapidly that I am unsure whether my bug report is of any value to anyone; once you couple that with the challenge (for a poorly-informed user) of figuring out where to send the bug report (KDE? freedesktop.org? Kubuntu?), it becomes a significant barrier. I do it anyway, but I am never sure if I am really helping (especially since many of my bug reports languish anyway).

    But isn’t something being done in that field, or did I just misunderstand some things? Weren’t LLVM, GLSL and Gallium3D introduced to make all the hardware appear to have the same “abilities”, either natively or emulated on the CPU? Won’t that make your work easier, at least on the open-driver front?

    1. Partly. Intel does not use Gallium3D yet, and it seems it won’t switch in the near future. Then there is still the NVIDIA driver, which will be the primary target for years to come. LLVM does not matter for us, except that we need to disable it.

      1. So the situation will be the same in the coming years. That’s bad, as it hampers all the good stuff you are doing… Thank you for all your work, as I can see it every day using KDE along with KWin (the most powerful window manager, in my opinion).

  5. Hi Martin. First of all, let me thank you for the amazing job you are doing.
    I am what you could call an advanced user. I can grab, compile and test most of the software you throw at me. However, for most users to do testing, it must be made simple. Is there a possibility to make a test bundle? Some script that would download, compile and test KWin and preferably send the test results back to you? I know that such a strict test is just the tip of the iceberg, but it would help to build statistics, at least. Anyway, a site with specific code to test, bundled as a tar.gz, would definitely be a +.

    1. Yes, that would be nice to have, and as mentioned in another comment I would like to make the compositor standalone.

    2. As an advanced user you could actually help make this possible quite easily. The Open Build Service instance on build.opensuse.org can pull directly from SVN and build packages for over 20 major Linux distros, so if you manage to get KWin building there (not sure how easy that is, being part of KDE and all), you could easily provide repositories with nightly builds for all major distros 😀

  6. Hi, Martin.

    I’ve carefully read everything you wrote and now I have the following very interesting question.

    There’s an 11-year-old game called Quake 3 which works perfectly on all hardware running any Linux OpenGL driver. There are even two quite advanced games, Doom 3 and Quake 4, which also work quite nicely on all the hardware I’ve ever thrown at them. All these games use a great many OpenGL extensions and they all run fine on Linux.

    What’s so peculiar/extraordinary in KWin’s code that it’s destined to break on certain combinations of hardware/drivers? Isn’t it possible just to use “safe” OpenGL extensions that are known to work? Forgive me if I’m asking the wrong questions, as I’m not an OpenGL expert.

    1. Quite simple: KWin is not a game, but a compositor. A compositor requires completely different functionality than a game, and the most important assumption of a game does not hold at all for a compositor. A game assumes it is the only running application, most likely fullscreen, and it’s totally fine if your fan turns into a helicopter. A compositor should not be noticeable. Performance matters.

      And how many different pieces of hardware do you have? One card, two cards or two hundred?

  7. Hi Martin,

    Thanks for the great work you have been doing on KWin.

    I would be very happy to help with testing. I’m running Arch on top of the latest -rcX kernel, so I should hit problems before most people. I have one system with a radeon chip (open source driver) and one with an intel chip.

    I have not been running KDE from svn/git in some time (mostly due to not wanting to deal with all the, possibly unreleased, dependencies), so I have some questions before diving in:

    Where do I get the source of KWin? How do I go about compiling the minimum amount of code in order to help with testing (can I just compile the latest KWin, or do I need kdelibs, etc from git?).

    Except for “it doesn’t crash my machine”, do you have a list of useful tests or things to look out for? Benchmarks, known difficult scenarios, etc.?

    Sorry if this information is available somewhere and I missed it.

    1. Where do I get the source of KWin? How do I go about compiling the minimum amount of code in order to help with testing (can I just compile the latest KWin, or do I need kdelibs, etc from git?).

      The sources are in the kde-workspace git repository (see http://projects.kde.org). You can build KWin on its own: just run cmake once in the workspace directory and then build kwin. In general KWin does not depend on the latest kdelibs, but it has some dependencies on other parts of the workspace module (e.g. the Oxygen window decoration depends on Oxygen); these can mostly be isolated.

      If you have questions, just jump into the #kwin channel on freenode.

      Except for “it doesn’t crash my machine”, do you have a list of useful tests or things to look out for? Benchmarks, known difficult scenarios, etc.?

      Just using it helps 🙂 In general, problems like performance regressions and visual errors are easy to spot.

      1. Thanks, it built.

        The only thing I had to do was to set the required KDE version to 4.6.0 in the root CMakeLists.txt file.

        It seems to work, I’ll take any problems/questions to IRC.

    1. I just looked through the thread. For us the current way is sufficient. Feel free to reuse our GLPlatform detection code. Crashes during context creation are not a problem for us, as we detect them and do not try again. A solution for you might be to do the querying out-of-process.
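Querying out-of-process means a crash during context creation only kills a throwaway child process. A minimal sketch using Python's subprocess module follows; the probe commands are hypothetical stand-ins (in practice the probe would be a small helper binary that creates a GL context and prints the glGetString() results), and the timeout is an arbitrary assumption.

```python
import subprocess
import sys
from typing import Optional, Sequence


def probe_out_of_process(command: Sequence[str], timeout: float = 10.0) -> Optional[str]:
    """Run a probe in a child process and return its stdout, or None on failure.

    A crash in the driver, a non-zero exit status or a hang only affects the
    child; the caller just sees None and can disable the feature instead of
    crashing itself.
    """
    try:
        result = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        # Probe binary missing/not executable, or the probe hung (the child is killed).
        return None
    if result.returncode != 0:
        # On POSIX a negative return code means the child was killed by a signal.
        return None
    return result.stdout


# Hypothetical stand-in probes using the Python interpreter itself.
ok = probe_out_of_process([sys.executable, "-c", "print('NVIDIA Corporation')"])
crashed = probe_out_of_process([sys.executable, "-c", "import os; os.abort()"])
print(ok, crashed)
```

The caller could then remember a failed probe (for example in its configuration) so that, as described above, it does not try again on the next start.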

  8. Regarding OSS drivers for ATI hardware:

    RadeonHD is a 2D driver that is dead and shouldn’t be used by anyone anymore. Radeon is the alternative in common use. This would affect XRender support, but shouldn’t affect OpenGL.

    The Mesa 3D drivers are the r300 and r600 classic drivers and the r300g and r600g Gallium drivers. The classic drivers will probably go away by 2012; apparently the 6900 support coming in a few months will be in the Gallium drivers only.

  9. Martin,

    I am just thinking aloud, but how about having a look at the Phoronix Test Suite?

    http://www.phoronix-test-suite.com/

    Perhaps you could adapt the tests to your needs and get input from a wide user base.

    Further, I noticed that Phoronix is about to launch OpenBenchmarking.org, an online database with results from its test suite.

    The test suite can be installed very easily on many distros and it seems many users are using it.

    A lot of useful outcomes could be derived from this collaboration, at least it looks like that. What do you think?

    1. In general I agree with the Plasma developers and would be very happy to drop anything except compositing.

  10. There are solutions for the problem.

    – Test based development.
    – Automated tests using Testbot.
    – Release early, release often.
    – Proper beta user feedback system.

    What would this look like? You have a live CD for betas with feedback and debugging enabled. The live CD detects the graphics card, carries out the tests, and submits the automated feedback logs and user comments home. This would obviously include benchmarking. It works in a sort of crowdsourcing manner, similar to the kernel-oops website.

    Feedback would include a kind of time switch: say all runtime logs get mailed to KDE47beta3-tests@kde.org, but submission from the live CD is silenced after 3 weeks and the mailing list disabled. So you have a mailbox with noisy test results and logs from a broad variety of beta tester machines running the same live CD.

    1. In theory this all sounds great; in practice it doesn’t work. Live CDs don’t help: they are a different environment, e.g. there is no NVIDIA driver. If it were that easy, we would have done it already.

  11. I finally solved my performance problems on a Dell E6400 with NVIDIA NVS 160M graphics by switching from the NVIDIA binary driver to nouveau (and giving up the external display in the process). I sometimes had to wait several seconds for a minimized window to finally appear, especially if I had Java apps (NetBeans) or Calibre open; it had been a nightmare, also on 4.6. But with nouveau pretty much everything seems peachy now, even with blur enabled. I didn’t change anything else.
