Profiling of KWin

In my last blog post about KWin I described the improvements to KWin’s startup. Since then I have done some further profiling and improved a few effects so that they load resources only when needed. For example, the Present Windows effect created the texture for the window filter on startup. This is now delayed until the user starts to type, which means not only about 20 msec faster startup but also fewer resources used if the user never uses the filter at all. The effect (and other effects) had created this texture at construction since 4.3. Of course it’s not a long wait, but across the multiple effects which need to be loaded at startup it adds up to a noticeable delay.
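The pattern behind that change can be sketched in a few lines. This is only an illustration in plain C++, not KWin code: `FilterTexture`, `WindowFilterEffect` and `keyPressed` are hypothetical names, and the "texture" is just a string wrapper standing in for an expensive resource.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for an expensive resource such as a GPU texture.
struct FilterTexture {
    explicit FilterTexture(std::string t) : text(std::move(t)) {}
    std::string text;
};

// Hypothetical effect: the texture is not built in the constructor but on
// the first key press, so startup pays nothing for a filter nobody uses.
class WindowFilterEffect {
public:
    WindowFilterEffect() = default; // deliberately cheap

    // Called when the user types into the filter.
    void keyPressed(char c) {
        m_filter.push_back(c);
        ensureTexture().text = m_filter;
    }

    bool textureCreated() const { return m_texture != nullptr; }
    const std::string &filterText() const { return m_filter; }

private:
    // Creates the resource on first use and reuses it afterwards.
    FilterTexture &ensureTexture() {
        if (!m_texture) {
            m_texture = std::make_unique<FilterTexture>(m_filter);
        }
        return *m_texture;
    }

    std::string m_filter;
    std::unique_ptr<FilterTexture> m_texture;
};
```

The constructor stays trivial, and the cost moves to the first call of `keyPressed`, where a user who never types never pays it.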

My investigations have now reached a point where I think there is hardly anything left to do for 4.9. We still have some I/O when starting up the OpenGL compositor (150 msec) because we need to load the shaders from file. This will be quite difficult to improve without a major rewrite of how the compositor startup works, which is of course no longer possible for 4.9. I want to change the compositor startup for 4.10 anyway.

What also remains is the X communication during startup. There are many round trips to the X server, and unfortunately that is noticeable. This will hopefully improve in 4.10 through heavier use of xcb.

Since I was profiling anyway, I was also interested in our rendering and had another look at it to see why we are sometimes not able to render at 60 frames per second. My results are very promising. Some of the assumptions we already had could be validated, which means I have a general idea of where I need to optimize. It will be interesting work, as I won’t have to touch KWin source code for it. And I find it quite satisfying that we have reached a point in KWin development where we need to optimize non-KWin code to improve the performance further.

What remains is how I did the profiling. The obvious candidate would be callgrind, but it is not really useful to find delays in the startup. In case of KWin the compositor is dominating the results of callgrind so that the calls which are executed just once are hardly visible. Another problem with callgrind is that it does not show us when we are waiting for I/O or X.

I needed to know how long the actual tasks take to perform. Therefore I needed to add profiling information which ended up in this small class:

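#include <QElapsedTimer>
#include <QString>
#include <iostream>

// Prints how long the enclosing scope took when it goes out of scope.
class Timing
{
public:
    // QLatin1String does not copy the data, so pass a string literal
    // (or anything else that outlives this object).
    Timing(const char *message)
        : m_message(message) {
        m_timer.start();
    }
    ~Timing() {
        std::cout << m_message.latin1() << ": " <<
            m_timer.elapsed() << " msec" << std::endl;
    }
private:
    QElapsedTimer m_timer;
    QLatin1String m_message;
};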

With that I spiced up my KWin checkout with many Timing timing("Load Cube Effect"); statements, which allowed me to get very close to where the time is spent.

Overall, what I take from this experience is that the key to a faster startup is not a more parallel startup through things like systemd, but questioning everything you do. Of course it is useful to construct everything an object needs when the object is created, but it might not be what is actually required. This is especially true for plugins, and for third-party plugins in particular. I would be positively surprised if a third-party author of a Plasmoid were aware that loading an image in the plugin ctor will delay the startup of the complete KDE Plasma session.

So if someone wants to work on getting the startup faster: there are many Plasmoids, and I am sure that for quite a lot of them we can move the loading of data into threads and delay calls to the next event loop iteration. It could also be nice to create some Krazy checks that test for common “mistakes” like loading an image outside a thread.
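As a rough illustration of "loading in a thread", here is a minimal sketch in standard C++. It is not Plasma code: `Applet` and `loadImageBytes` are hypothetical, the loader only simulates a delay, and in a real Plasmoid one would more likely use Qt facilities such as `QtConcurrent::run` and hand the result back to the GUI thread.

```cpp
#include <chrono>
#include <future>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical slow loader standing in for image decoding or file I/O.
std::vector<unsigned char> loadImageBytes(const std::string &path) {
    std::this_thread::sleep_for(std::chrono::milliseconds(20)); // simulated delay
    return std::vector<unsigned char>(path.size(), 0xFF);       // placeholder data
}

// Hypothetical applet: the constructor only starts the load and returns
// immediately, so it does not hold up whoever is creating the applets.
class Applet {
public:
    explicit Applet(std::string imagePath)
        : m_pending(std::async(std::launch::async, loadImageBytes,
                               std::move(imagePath))) {}

    // Waits only if the load has not finished yet; meant to be called
    // when the image is first needed, e.g. for the first paint.
    const std::vector<unsigned char> &image() {
        if (m_pending.valid()) {
            m_image = m_pending.get();
        }
        return m_image;
    }

private:
    std::future<std::vector<unsigned char>> m_pending;
    std::vector<unsigned char> m_image;
};
```

The design choice is simply to separate "start loading" from "need the result", so that the time between the two can be used for the rest of the startup.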

22 Replies to “Profiling of KWin”

  1. Martin, yet again awesome 🙂

    However, you might be interested in Valgrind’s ability to collect data from a given point in your code. For example:

    CALLGRIND_START_INSTRUMENTATION
    .. your startup code ..
    CALLGRIND_STOP_INSTRUMENTATION

    And then you should pass some argument to valgrind. More about that stuff can be found here: http://valgrind.org/docs/manual/cl-manual.html

    And are you serious, is loading an image in the plasmoid ctor slowing down all of plasma-desktop.. hehehe, that sucks :p Why aren’t all plasmoids loaded from a separate thread? Even if you optimize the plasmoids to super efficiency, it’s still interesting to load plasmoids outside of the “main” plasma-desktop thread and just work with signals/slots to notify the desktop that a plasmoid is done loading and can be displayed. Note: I’m just doing wishful thinking here. I haven’t made plasmoids in C++, nor do I know the plasma-desktop code. Isn’t this a GSoC project this year..?

    Cheers! And keep the improvements coming ^_^

    1. CALLGRIND_START_INSTRUMENTATION
      .. your startup code ..
      CALLGRIND_STOP_INSTRUMENTATION

      Would not help much in that case. E.g. I need to also see the first run of the compositor, till then too much has already happened to see proper results. I looked at the valgrind docs in fact today 🙂

      Why aren’t all plasmoids loaded from a separate thread?

      Actually it might be that plasmoids are loaded in a background thread. But I doubt it, given my experience with Kickoff, where I tried to load some data in a background thread and it failed because it left the GUI thread.

      Oh and even if the Plasmoid is loaded in a background thread of course loading the image in the same thread as the rest of the Plasmoid will delay the loading of the Plasmoid.

        1. > True, but it’s better to have a slow plasmoid than a slow desktop

          but it’s even better still to fix the root of the problem (loading stuff which isn’t used) than to try and hack round it.

  2. Some other useful tools you might want to read up on: strace (which you’ve doubtless come across) – once you’ve used it a lot, it’s very useful for spotting potentially wasteful patterns (plugin loading in Qt 4 will doubtless be one you’ll notice as being horrible if you’re looking at startup).

    ltrace (and latrace) are also useful for showing a library-level trace of the calls that are going on, though there are some caveats to using both of them that I don’t completely recall this late at night.

    http://harmattan-dev.nokia.com/docs/library/html/guide/html/Developer_Library_Developing_for_Harmattan_Developer_tools_Debugging_tools_Using_latrace.html

    There are some other extremely useful tools from Nokia like sp-rtrace (http://harmattan-dev.nokia.com/docs/library/html/guide/html/Developer_Library_Developing_for_Harmattan_Developer_tools_Debugging_tools_Using_sp-rtrace.html) and swaplogger (http://harmattan-dev.nokia.com/docs/library/html/guide/html/Developer_Library_Developing_for_Harmattan_Developer_tools_Performance_testing_tools_Using_swaplogger.html) which you might find useful.

  3. std::cout << m_message.latin1() << ": " <<
    m_timer.elapsed() << " msec" << std::endl;

    You want to change this to:

    int elapsed = m_timer.elapsed();
    std::cout << m_message.latin1() << ": " <<
    elapsed << " msec" << std::endl;

    That way your elapsed time does not include the time it takes to run m_message.latin1() and to print ": ". It's nitpicking, but it helps to have accurate results.

  4. I have 3 wishes for KDE/KWin:

    1- Performance is already great for me but more performance improvements are always welcome.

    2- Wayland. 😀

    3- Please please please implement remembering windows after saving, remember position, dimensions and stuff like that. When you do the Wayland port!

    Thanks a lot for all your work, you rock and I wish I had your skills. 🙂

    1. it’s the client’s task to remember position, that won’t change with Wayland.

      1. It really is? It’s hard for the client to do this properly, because a naive implementation (which just stores x,y width,height) might have missed that last time, the window was on an external screen which is no longer connected, or screen resolution changed, or whatever.
        However, for most windows this actually works fine here – Kopete, Firefox, Konsole, KDevelop and many more properly remember their position and size. I don’t know who is responsible for that, but I like to see it working 🙂

          1. I was actually curious about that because some X clients (VLC for example) are able to remember the position after I close them and open them again, in X.

            How do they achieve this? And why don’t other applications do the same?

            How can we get consistent behaviour for this?

            1. clients store their geometry when being closed (ideally not all the time 😉
              when they get started they tell the WM: “please position me here and i’d like to be of this size”

              KMainWindow only stores the size (per screen size) but does not ask for a specific position (because that’s usually not the best idea, as the environment, i.e. the other windows, has likely changed), so KWin positions them according to a configurable strategy.

              Plain Qt does not re/store its geometry at all – usually “legacy” applications don’t either (no idea how this is handled for Gtk+)

              For most applications you can force a specific geometry with [-]-geometry WIDTHxHEIGHT+X+Y

  5. I was already aware of the fact that loading images at startup of a plasmoid (even a QML one) slows down the whole startup of the Plasma desktop. I became aware of this when I wrote my Luna QML plasmoid (http://kde-apps.org/content/show.php/Luna+QML?content=140204). Originally I loaded the complete luna.svgz file from which I selected the correct part to be displayed (like is done in the original C++ Luna plasmoid), but when I split the svgz file in different svg files containing only one moon phase and I loaded only the correct phase at startup, there was a noticeable speed improvement on the startup of the Plasma desktop (because a much smaller image had to be loaded).

    I should also have a look at the “Loader” QML element and use it to delay loading of the elements in my plasmoids which are not visible at startup. And I wonder whether the “WorkerScript” element can be useful in plasmoids.

    1. there is no async I/O for everything. E.g. QImage does not provide any means to construct it asynchronously.

      1. If construction of a QImage is a CPU-bound task, then a thread is perfectly fine. Is that the case?

        1. I have no idea whether construction of a QImage is a CPU bound task, but I would assume no.
