Profiling of KWin

In my last blog post about KWin I described the improvements to KWin’s startup. Since then I have done some further profiling and improved a few effects so that they load resources only when needed. For example, the Present Windows effect used to create the texture for the window filter on startup. This is now delayed until the user starts to type, which means not only a startup that is about 20 msec faster but also fewer resources used in case the user never uses the filter at all. The effect (and other effects) had created this texture on construction since 4.3. Of course 20 msec is not a long wait, but across the multiple effects which need to be loaded at startup it adds up to a noticeable delay.
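
The pattern behind this change can be sketched in plain C++. This is not the actual KWin code: Lazy and FilterTexture are hypothetical names made up for illustration, and a real effect would create an OpenGL texture rather than a small struct.

```cpp
#include <functional>
#include <memory>
#include <utility>

// Hypothetical helper (not part of KWin): creates a resource on first
// access instead of when the owning object is constructed.
template <typename T>
class Lazy
{
public:
    explicit Lazy(std::function<std::unique_ptr<T>()> factory)
        : m_factory(std::move(factory)) {}

    // Runs the factory on the first call only; later calls reuse the result.
    T &get() {
        if (!m_value) {
            m_value = m_factory();
        }
        return *m_value;
    }

    // True once the resource has actually been created.
    bool isCreated() const { return static_cast<bool>(m_value); }

private:
    std::function<std::unique_ptr<T>()> m_factory;
    std::unique_ptr<T> m_value;
};

// Stand-in for an expensive resource such as the filter texture.
struct FilterTexture {
    int width;
    int height;
};
```

An effect holding a Lazy<FilterTexture> member pays nothing at startup and calls get() only once the user starts to type.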

My investigations have now reached a point where I think there is hardly anything left to do for 4.9. We still have some I/O when starting up the OpenGL compositor (150 msec), as we need to load the shaders from file. This will be quite difficult to improve without a major rewrite of how the compositor startup works. That is of course no longer possible for 4.9, and I want to change the compositor startup for 4.10 anyway.

What also remains is the X communication during startup. There are many round trips to the X server, and unfortunately that is noticeable. This will hopefully improve in 4.10 through heavier use of xcb.
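
Why round trips hurt can be shown with a toy model. It makes no real X calls: fakeRequest is a hypothetical stand-in for a request with 20 msec of latency, and the futures play the role of the cookies that xcb hands out so that replies can be collected later.

```cpp
#include <chrono>
#include <future>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for one request to a server with 20 msec latency.
inline int fakeRequest(const std::string &name)
{
    std::this_thread::sleep_for(std::chrono::milliseconds(20));
    return static_cast<int>(name.size());
}

// Blocking style: every request waits for its reply before the next
// request is sent, so the latencies add up.
inline std::vector<int> fetchSequential(const std::vector<std::string> &names)
{
    std::vector<int> replies;
    for (const auto &name : names) {
        replies.push_back(fakeRequest(name));
    }
    return replies;
}

// Cookie style: send all requests first, then collect the replies,
// so the latencies overlap.
inline std::vector<int> fetchPipelined(const std::vector<std::string> &names)
{
    std::vector<std::future<int>> cookies;
    for (const auto &name : names) {
        cookies.push_back(std::async(std::launch::async, fakeRequest, name));
    }
    std::vector<int> replies;
    for (auto &cookie : cookies) {
        replies.push_back(cookie.get());
    }
    return replies;
}
```

Both functions return the same replies. In this model only the time spent waiting differs, and the gap grows with every additional request.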

Since I was profiling anyway, I was also interested in our rendering and had another look at why we are sometimes not able to render at 60 frames per second. The results are very promising. Some of the assumptions we already had could be validated, which means I have a general idea of where I need to optimize. It will be interesting work, because for that I won’t have to touch the KWin source code. And I find it quite satisfying that we have reached a point in KWin development where we need to optimize non-KWin code to improve the performance further.

What remains is to describe how I did the profiling. The obvious candidate would be callgrind, but it is not really useful for finding delays in the startup. In KWin’s case the compositor dominates callgrind’s results, so calls which are executed just once are hardly visible. Another problem with callgrind is that it does not show when we are waiting for I/O or X.
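
The difference between work and waiting can be illustrated with two clocks. This is only an analogy under stated assumptions: callgrind counts executed instructions rather than reading std::clock, the sleep is a stand-in for waiting on a disk or the X server, and std::clock is assumed to report processor time as it does on POSIX systems.

```cpp
#include <chrono>
#include <ctime>
#include <thread>

struct Measurement {
    double wallMs; // elapsed real time
    double cpuMs;  // processor time consumed by the process
};

// Simulates a task that mostly waits by sleeping, and measures it
// with a wall clock and with the processor-time clock.
inline Measurement measureWaitingTask(std::chrono::milliseconds wait)
{
    const auto wallStart = std::chrono::steady_clock::now();
    const std::clock_t cpuStart = std::clock();

    std::this_thread::sleep_for(wait);

    const std::clock_t cpuEnd = std::clock();
    const auto wallEnd = std::chrono::steady_clock::now();

    Measurement m;
    m.wallMs = std::chrono::duration<double, std::milli>(wallEnd - wallStart).count();
    m.cpuMs = 1000.0 * static_cast<double>(cpuEnd - cpuStart) / CLOCKS_PER_SEC;
    return m;
}
```

For a task that sleeps, wallMs reflects the full wait while cpuMs stays close to zero. A tool that only looks at executed code has the same blind spot.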

I needed to know how long the actual tasks take. Therefore I needed to add my own profiling instrumentation, which ended up as this small class:

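#include <QElapsedTimer>
#include <QLatin1String>
#include <iostream>

// Prints how long the enclosing scope took when that scope is left.
class Timing
{
public:
    explicit Timing(const char *message)
        : m_message(message) {
            m_timer.start();
        }
    ~Timing() {
        // Read the timer first so that producing the output is not measured.
        const qint64 elapsed = m_timer.elapsed();
        std::cout << m_message.latin1() << ": " << elapsed << " msec" << std::endl;
    }
private:
    QElapsedTimer m_timer;
    QLatin1String m_message;
};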

With that I spiced up my KWin checkout with many Timing timing("Load Cube Effect"); statements. This allowed me to get very close to where the time is spent.

Overall I take from this experience that the key to a faster startup is not a more parallel startup through things like systemd, but that you actually have to question everything you do. Of course it is convenient to construct everything an object needs when the object is created, but that might not be what is actually required. This is especially true for plugins, and even more so for 3rd party plugins. I would be positively surprised if a third party author of a Plasmoid were aware that loading an image in the plugin constructor will delay the startup of the complete KDE Plasma session.

So if someone wants to work on making the startup faster: there are many Plasmoids, and I am sure that for quite a lot of them we can move the loading of data into threads and delay calls until the next event loop iteration. It would also be nice to create some krazy checks that test for common “mistakes” like loading an image outside of a thread.
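
Moving the loading into a thread can be sketched with the standard library alone. Everything here is hypothetical: Image, loadImageBlocking and Applet are made-up names, the sleep stands in for a slow decode, and a real Plasmoid would additionally have to respect the toolkit’s rules about which objects may be used outside the GUI thread.

```cpp
#include <chrono>
#include <future>
#include <string>
#include <thread>
#include <vector>

// Hypothetical image type standing in for a decoded picture.
struct Image {
    std::string path;
    std::vector<unsigned char> pixels;
};

// Stand-in for a slow, blocking load; the sleep simulates disk and decode time.
inline Image loadImageBlocking(const std::string &path)
{
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    return Image{path, std::vector<unsigned char>(16, 0xff)};
}

// The constructor only starts the load; it does not wait for it.
class Applet
{
public:
    explicit Applet(const std::string &imagePath)
        : m_pending(std::async(std::launch::async, loadImageBlocking, imagePath)) {}

    // Blocks only if the image is requested before loading has finished.
    const Image &image() {
        if (m_pending.valid()) {
            m_image = m_pending.get();
        }
        return m_image;
    }

private:
    std::future<Image> m_pending;
    Image m_image;
};
```

With this shape the constructor returns right away, and the cost of the load is only paid by the first caller of image(), and only if the background load has not finished by then.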

22 Replies to “Profiling of KWin”

markg85 says:

3. May 2012 at 10:45 pm

Martin, yet again awesome 🙂

However, you might be interested in valgrind’s ability to collect data from a given point in your code. For example:

CALLGRIND_START_INSTRUMENTATION
.. your startup code ..
CALLGRIND_STOP_INSTRUMENTATION

And then you should pass some argument to valgrind. More about that stuff can be found here: http://valgrind.org/docs/manual/cl-manual.html

And are you serious, is adding an image in the plasmoid ctor slowing down all of plasma-desktop.. hehehe, that sucks :p Why aren’t all plasmoids loaded from a separate thread? Even if you optimize the plasmoids to super efficiency, it’s still interesting to load plasmoids outside of the “main” plasma-desktop thread and just work with signals/slots to notify the desktop that a plasmoid is done loading and can be displayed. Note: I’m just doing wishful thinking here. I haven’t made plasmoids in C++, nor do I know the plasma-desktop code. Isn’t this a GSoC this year..?

Cheers! And keep the improvements coming ^_^
1. Martin Gräßlin says:
  
  3. May 2012 at 10:56 pm
  
  > CALLGRIND_START_INSTRUMENTATION
  > .. your startup code ..
  > CALLGRIND_STOP_INSTRUMENTATION
  
  Would not help much in that case. E.g. I also need to see the first run of the compositor, and by then too much has already happened to see proper results. In fact I looked at the valgrind docs today 🙂
  
  > Why aren’t all plasmoids loaded from a separate thread?
  
  Actually it might be that plasmoids are loaded in a background thread. But I doubt it, given my experience with Kickoff, where I tried to load some data in a background thread and it failed because it left the GUI thread.
  
  Oh and even if the Plasmoid is loaded in a background thread of course loading the image in the same thread as the rest of the Plasmoid will delay the loading of the Plasmoid.
  1. markg85 says:
    
    3. May 2012 at 11:06 pm
    
    — i hope this quote stuff works :p —
    
    Oh and even if the Plasmoid is loaded in a background thread of course loading the image in the same thread as the rest of the Plasmoid will delay the loading of the Plasmoid.
    
    True, but it’s better to have a slow plasmoid than a slow desktop. And it indeed is a GSoC: http://community.kde.org/GSoC/2012/Ideas#Project:_Lazy.2FAsynchronous_Plasmoid_Inititalization I don’t know if it’s accepted.
    1. David Edmundson says:
      
      4. May 2012 at 12:31 am
      
      > True, but it’s better to have a slow plasmoid than a slow desktop
      
      but it’s even better still to fix the root of the problem (loading stuff which isn’t used) than to try and hack round it.
Robin Burchell says:

4. May 2012 at 12:07 am

Some other useful tools you might want to read up on are strace (which you’ve doubtless come across) – once you’ve used it a lot, it’s very useful for spotting potentially wasteful patterns (plugin loading in Qt 4 will doubtless be one you’ll notice as being horrible if you’re looking at startup).

ltrace (and latrace) are also useful for showing a library-level trace of calls that are going on, though there are some caveats to both of their usage that I don’t completely recall this late at night.

http://harmattan-dev.nokia.com/docs/library/html/guide/html/Developer_Library_Developing_for_Harmattan_Developer_tools_Debugging_tools_Using_latrace.html

There are some other extremely useful tools from Nokia like sp-rtrace (http://harmattan-dev.nokia.com/docs/library/html/guide/html/Developer_Library_Developing_for_Harmattan_Developer_tools_Debugging_tools_Using_sp-rtrace.html) and swaplogger (http://harmattan-dev.nokia.com/docs/library/html/guide/html/Developer_Library_Developing_for_Harmattan_Developer_tools_Performance_testing_tools_Using_swaplogger.html) which you might find useful.
Alexis Menard says:

4. May 2012 at 3:33 am

std::cout << m_message.latin1() << ": " <<
m_timer.elapsed() << " msec" << std::endl;

You want to replace this with:

int elapsed = m_timer.elapsed();
std::cout << m_message.latin1() << ": " <<
elapsed << " msec" << std::endl;

That way your elapsed time does not include the time it takes to run m_message.latin1() and to create the ": " string. It's nitpicking, but it helps to have accurate results.
1. Martin Gräßlin says:
  
  4. May 2012 at 8:02 am
  
  thanks, adjusted
Diego says:

4. May 2012 at 6:35 am

I have 3 wishes for KDE/KWin:

1- Performance is already great for me but more performance improvements are always welcome.

2- Wayland. 😀

3- Please please please implement remembering windows after saving, remember position, dimensions and stuff like that. When you do the Wayland port!

Thanks a lot for all your work, you rock and I wish I had your skills. 🙂
1. Martin Gräßlin says:
  
  4. May 2012 at 8:03 am
  
  it’s the client’s task to remember position, that won’t change with Wayland.
  1. Ralf says:
    
    4. May 2012 at 10:27 am
    
    It really is? It’s hard for the client to do this properly, because a naive implementation (which just stores x,y width,height) might have missed that last time, the window was on an external screen which is no longer connected, or screen resolution changed, or whatever.
    However, for most windows this actually works fine here – Kopete, Firefox, Konsole, KDevelop and many more properly remember their position and size. I don’t know who is responsible for that, but I like to see it working 🙂
    1. Martin Gräßlin says:
      
      4. May 2012 at 10:30 am
      
      and the window manager is lacking even more information than the windows.
      1. Diego says:
        
        4. May 2012 at 7:49 pm
        
        I was actually curious about that becuse some X clients (VLC for example) are able to remember the position after I close them and open them again, in X.
        
        How do they achieve this? And why don’t other applications do the same?
        
        How can we get consistent behavior for this?
        
        Diego says:
        
        4. May 2012 at 7:49 pm
        
        *because
        
        Thomas says:
        
        5. May 2012 at 11:54 pm
        
        clients store their geometry when being closed (ideally not all the time 😉
        when they get started they tell the WM: “please position me here and i’d like to be of this size”
        
        KMainWindow only stores the size (per screen size) but does not ask for a specific position (because that is usually not the best idea, as the environment, i.e. the other windows, has likely changed), so KWin positions them according to a configurable strategy.
        
        Plain Qt does not re/store its geometry at all – usually “legacy” applications don’t either (no idea how this is handled for Gtk+)
        
        For most applications you can force a specific geometry with [-]-geometry WIDTHxHEIGHT+X+Y
Diego says:

4. May 2012 at 6:37 am

Sorry, I wanted to say “remember windows after closing” (NOT after saving).
fasd says:

4. May 2012 at 11:07 am

Performance improvements!! Yay!! 😀
glad says:

4. May 2012 at 1:59 pm

I was already aware of the fact that loading images at startup of a plasmoid (even a QML one) slows down the whole startup of the Plasma desktop. I became aware of this when I wrote my Luna QML plasmoid (http://kde-apps.org/content/show.php/Luna+QML?content=140204). Originally I loaded the complete luna.svgz file from which I selected the correct part to be displayed (like is done in the original C++ Luna plasmoid), but when I split the svgz file in different svg files containing only one moon phase and I loaded only the correct phase at startup, there was a noticeable speed improvement on the startup of the Plasma desktop (because a much smaller image had to be loaded).

I should also have a look at the “Loader” QML element and use it to delay loading of the elements in my plasmoids which are not visible at startup. And I wonder whether the “WorkerScript” element can be useful in plasmoids.
Andy says:

4. May 2012 at 6:31 pm

Will this come for free when KDE uses Qt5 in a year or so? I’m thinking of the easy support for async loading in qtquick2

http://doc-snapshot.qt-project.org/5.0/qtquick2-performance.html#asynchronous-loading
Andrew says:

6. May 2012 at 8:26 am

Why use threads for IO? Why not async IO?
1. Martin Gräßlin says:
  
  6. May 2012 at 12:43 pm
  
  there is no async IO for everything. E.g. constructing a QImage does not provide any means to do it async.
  1. Andrew says:
    
    6. May 2012 at 4:48 pm
    
    If construction of a QImage is a CPU bound task then a thread is perfectly fine. Is that the case?
    1. Martin Gräßlin says:
      
      6. May 2012 at 7:38 pm
      
      I have no idea whether construction of a QImage is a CPU bound task, but I would assume no.
