Wednesday, August 15, 2012

When a QThread isn't a thread...

The thread that doesn't thread...

Let's say you have some background work you need to get done. The work is pretty time-intensive, so you don't want to block the main UI thread while it's executing. To do this work, you implement a worker thread like the one below, intending to do the work once a second:
from PySide.QtCore import QThread, QTimer

class MyWorkerThread(QThread):
    def __init__(self):
        super(MyWorkerThread, self).__init__()

    def run(self):
        self.timer = QTimer()
        print self.timer.thread()
        self.timer.setSingleShot(False)
        self.timer.timeout.connect(self.doWork)
        self.timer.start(1000)
        self.exec_()

    def doWork(self):
        print "Work!"
On the surface this code seems sane. When the thread starts executing, we set up a QTimer that's going to run in the current thread's event queue. We connect our "doWork" function to the timeout signal. Then we do the work in the worker thread's context... right?

If you try this out with any serious work, though, you'll find the main thread is actually getting blocked by the work that's being done. The GUI running in the main thread will not be responsive. It's almost as if threading itself has stopped working.

So what's happening? Well, there are two pieces of information we need to figure this out. The first is thread affinity. Thread affinity is described by Qt as the "thread the QObject lives in". According to the Qt docs, when a QObject is created, its "thread" pointer is set to the currently executing thread. Now here's the important question -- what's the currently executing thread when we create our worker thread?

Well, it can't be the worker thread, as that thread doesn't really exist yet. The __init__ for the thread runs in the thread creating it. The QThread doesn't actually become, well, a thread until you call start() on it and it begins to run().

So a QThread's affinity is always the thread that creates it.
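You can check this yourself with a quick experiment (a minimal sketch -- I'm assuming PySide imports to match the rest of this post, and PyQt4 behaves the same way):
import sys
from PySide.QtCore import QCoreApplication, QThread

class AffinityDemo(QThread):
    def run(self):
        # self.thread() is where this QThread *object* lives -- the thread
        # that created it. QThread.currentThread() is the thread actually
        # executing run(). They are not the same!
        print self.thread()
        print QThread.currentThread()

app = QCoreApplication(sys.argv)  # QObjects expect an application to exist
demo = AffinityDemo()
demo.start()
demo.wait()  # block until run() finishes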

Ok, the next piece of information is what happens when we connect the signal. By default, the connection is what's known as an AutoConnection. An AutoConnection begins by figuring out the thread affinity of the emitting and receiving QObjects. So what is the affinity of the QTimer firing the signal? Well, it's created while the worker thread is running, so it's the worker thread. Ok, but the receiver, what's its affinity? Well, the receiver is the QThread itself. We just established that the QThread's affinity is the thread that creates it. So we have a signal going from the QTimer living in the worker thread to the QThread living in the main thread.

Because the sender and receiver live in different threads, AutoConnection posts the signal as an event to be queued in the receiving thread's event queue. The timeout signal is posted from the worker thread to the main thread. The main thread eventually gets to this event and figures out the slot to call -- in this case, doWork on our MyWorkerThread object, which, remember, lives in the main thread. The main thread then calls doWork from its own event loop in response to receiving the signal.

The work is intensive. It takes a lot of time. And the main thread's event loop can't run because it's doing the work the worker thread should be doing. So none of the GUI events get processed, and the main thread effectively becomes starved.
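You can watch this happen by printing the current thread from inside doWork in the class above (a sketch of the same broken class, not a fix):
    def doWork(self):
        # In the broken version, this prints the main (GUI) thread --
        # proof that our "background" work runs in the main event loop.
        print QThread.currentThread()
        print "Work!"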

The solution? 

We need to make sure that the QObject doing the work lives in the worker thread. So we need it to not be the QThread itself. We need something like this:
class Worker(QObject):
    """ I'll do the work and live in the worker thread!"""
    def __init__(self):
        super(Worker, self).__init__()
       
    def doWork(self):
        print "Work"

class MyWorkerThread(QThread):
    """ I'm just going to set up the event loop and do
        nothing else..."""
    def __init__(self):
        super(MyWorkerThread, self).__init__()

    def run(self):
        self.timer = QTimer()
        self.worker = Worker()
        print self.timer.thread()
        self.timer.setSingleShot(False)
        self.timer.timeout.connect(self.worker.doWork)
        self.timer.start(1000)
        self.exec_()
So, great: notice we created the Worker while the worker thread is executing. Our Worker and the QTimer have the same thread affinity (that is, they both live in the worker thread). Therefore, when timeout fires, doWork is called in the worker thread's context.
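Client code in the main thread stays simple. A sketch of how you might kick it off (the QApplication/QMainWindow boilerplate here is my assumption, not part of the original example):
import sys
from PySide.QtGui import QApplication, QMainWindow

app = QApplication(sys.argv)
window = QMainWindow()   # stand-in for your real GUI
workerThread = MyWorkerThread()
workerThread.start()     # run() builds the QTimer and Worker in the new
                         # thread, so doWork fires over there, not here
window.show()
sys.exit(app.exec_())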

Problem solved!

Some caveats....

You might want to control how the worker is created before you create the worker thread. Maybe you want to pass the worker into the worker thread, for example. But then wait -- if we create it outside of this thread, it's not living in the worker thread, and we have the same problem again! Aaaah!

Luckily, QObject has a method moveToThread which takes a QThread, changing the object's affinity to the passed-in thread. This method is intended to let you push an object into another thread (important to note: pulling an object into the current thread is not thread-safe). So we can do this:
class MyWorkerThread(QThread):
    """ I'm just going to set up the event loop and do
        nothing else..."""
    def __init__(self, worker):
        super(MyWorkerThread, self).__init__()
        worker.moveToThread(self)
        self.worker = worker

    def run(self):
        self.timer = QTimer()
        print self.timer.thread()
        self.timer.setSingleShot(False)
        self.timer.timeout.connect(self.worker.doWork)
        self.timer.start(1000)
        self.exec_()
Of course, you probably don't want to retain the passed-in worker in the creator of the thread. You might call a method on it from outside the worker thread while it's doing work in the worker thread's context. Most likely that operation will not be thread-safe, and unintentional bugs will ensue. Ideally, you don't want to keep the worker around in your worker thread's client code.

This is something that's easy to forget and hard to enforce, so another solution would be to pass in a factory function that creates the worker for you, instead of passing the worker itself. This removes the ability for the worker thread's owner/client to monkey with the worker in an unsafe way. So we'd then have something like this:
def createWorker():
    return Worker()

class MyWorkerThread(QThread):
    """ I'm just going to set up the event loop and do
        nothing else..."""
    def __init__(self, workerFactory):
        super(MyWorkerThread, self).__init__()
        self.workerFactory = workerFactory

    def run(self):
        self.timer = QTimer()
        print self.timer.thread()
        self.worker = self.workerFactory()
        self.timer.setSingleShot(False)
        self.timer.timeout.connect(self.worker.doWork)
        self.timer.start(1000)
        self.exec_()
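Usage then looks something like this -- the client hands over a callable and never holds a worker reference (a sketch; since classes are callables in Python, passing Worker itself works too):
thread = MyWorkerThread(createWorker)  # or simply MyWorkerThread(Worker)
thread.start()  # run() calls the factory in the new thread's context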

Final thoughts

It's interesting that while researching all this, I discovered that for Qt 5, "subclassing QThread is no longer the recommended way of using QThread". Well, that's not surprising. When you think about it, since a QThread lives in the thread that creates it, methods of the QThread itself are a pretty unsafe place to do any work. We know run() is executing in the new thread's context, so we can set up our workers and whatnot there, but we can't be sure what's going on in other methods of QThread.
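For completeness, here's a rough sketch of the pattern the Qt docs nudge you toward instead: don't subclass QThread at all; push a worker object into a plain QThread and let the started signal trigger the setup (the Worker class and its method names here are mine for illustration, not any Qt API):
import sys
from PySide.QtCore import QCoreApplication, QObject, QThread, QTimer

class Worker(QObject):
    def begin(self):
        # Runs in the worker thread: this object was moved there, and
        # 'started' is emitted from the new thread.
        self.timer = QTimer()
        self.timer.timeout.connect(self.doWork)
        self.timer.start(1000)

    def doWork(self):
        print "Work!"

app = QCoreApplication(sys.argv)
thread = QThread()       # a plain, unsubclassed QThread
worker = Worker()
worker.moveToThread(thread)
thread.started.connect(worker.begin)
thread.start()           # the default run() just spins an event loop
sys.exit(app.exec_())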

This is a little weird if you've ever dealt with other frameworks like MFC. With CWinThread, for example, you'd be used to attaching thread message handlers to methods of the CWinThread for direct handling in that CWinThread's context. You get trained to expect that methods of your derived CWinThread are executed in the CWinThread's context, not in some external context. Sure, you can always call into a CWinThread from another thread's context, but MFC developers know it's better to post to the other thread and have it deal with the event in its own context.

Classes inheriting from QThread don't work that way. And Qt signals/slots are different from Windows thread messages. So beware, and keep your work out of QThread, delegating it to other objects that live in that QThread!

Tuesday, June 12, 2012

VisPerf -- Visualize The Output of Visual Studio's Stand Alone Profiler

At work we've recently been making pretty extensive use of the Visual Studio Stand Alone Profiler. We use the Professional edition of Visual Studio, so we don't have any of the nice integrated profiling features with pretty GUIs built into the more expensive editions. Instead, with the free command line tools, we can get a set of csv performance reports, including, most importantly:

  • A "Function Report" with each function listed with time listed including the functions it calls (inclusive time) and time excluding the functions it calls (exclusive time). 
  • A "Caller-Callee" report with, for each function F in the program (1) a root entry for F (2) A list of callers of F that function with inclusive/exclusive time when that caller calls F (3) A list of callees -- functions that F calls and inclusive/exclusive time when these functions are called by F. 
  • A "Call Tree" report. Similar to the caller-callee report, but represented as a tree starting from the top-level function, a list of all the functions called the next layer down, etc.
The report we've found most useful is the caller-callee report. With it, you can get a nice set of statistics on program performance and answer most of your basic questions. Which functions are taking the most time? Which called functions dominate those functions' performance? When are those functions hurting performance -- when called by a single, specific heavy-hitting caller, or when called by a variety of different callers?

Unfortunately, as you can imagine, the csv files are very hard to use. Imagine you want to know why function G is dominating function F's performance. You need to (1) save off/bookmark the root record for function F, then (2) find the root record for function G and examine it. Once in function G, you need to repeat the process... say, for function H. While keeping these breadcrumbs for the whole F->G->H investigation, you've lost any ability to step back and look at the big picture. Maybe F->G->H isn't worth pursuing, and a lot could be gained by optimizing, say, the F->I call stack instead.

Wouldn't it be nice if you could easily navigate the performance output while keeping the big picture in view? If only we could, in one glance, see F, all the functions F calls, and maybe even some levels deeper into the functions those functions call.

Enter VisPerf:


VisPerf is a visualizer for the caller-callee csv reports. The main view shows the performance of one "parent" function, drawing a rectangle for each function called by the parent. If you've ever used WinDirStat, the following description will be second nature to you. Each function's rectangle is sized proportionally to how much it contributes to the parent's total execution time. So in the above view, we can see that PigLatinize dominates the performance of PigLatin simply by seeing how much area PigLatinize takes up in PigLatin's square. Numbers prepended to the function name give more information: the number in <> angle brackets gives that function's percentage of the immediate parent, and the number in () parentheses gives that function's percentage of the overall program (not just when called here, but when called everywhere in the program).

The view doesn't stop at one level; we can see more than one level of depth in a single glance. In the above view, we see how PigLatin calls PigLatinize, which in turn calls PigLatinizeBeginsWithConsonant. We can see in one glance not just how PigLatinize dominates the performance of PigLatin, but that PigLatinizeBeginsWithConsonant is what's causing PigLatinize to be so slow and ultimately causing PigLatin to be slow. We can spot the obvious places for optimization right away and ignore the harder optimization problems. The depth can be controlled by the spinbox in the upper right.

The current, top-level function can be changed either by selecting from the drop-down or by clicking one of the rectangles. Each rectangle has a nice mouse-over tool-tip summary with detailed info, which you can copy to the clipboard using the context menu. The context menu also lets you browse instantly to other callers of the selected function, to see if that function might matter in performance scenarios elsewhere in your program.

To use VisPerf, download the latest binaries from here. Right-click on a caller-callee csv file you've generated from the profiler, select "Open With", and browse to the VisPerf executable. From that point on, VisPerf will be in your "Open With" context menu for csv files, and you'll be able to open caller-callee reports with ease.

This tool has been a game changer for us. My colleagues and I have been using VisPerf to analyze the performance of our projects, and we've been able to solve and prevent many performance problems in our everyday development. With an easy-to-use tool to see what's going on, bad performance decisions are avoided early, before they become an emergency. It's also helping us learn which constructs to use and which to avoid when performance matters (I'm looking at you, iostreams and boost mutex). The whole team has become a lot more performance-aware and confident, because we're now backed up by powerful, demonstrable data that everyone can see for themselves.

I'm releasing this because I feel there's a sliver of a chance it might be helpful to someone else. You too can get the awesome that we've acquired! Please let me know if you have any feedback or find any bugs -- I'd love to hear from you!

Odds-and-ends

The source code is available under the Boost license on GitHub for anyone to do whatever with. The code is written in Python, using PySide to bind to Qt. There's also a Python module in the source code, "callercallee", that you can use to parse and navigate caller-callee reports programmatically.

It's been a very fun project, giving me a lot of fun Python code to write and challenging problems to solve. It's also been fun to figure out how to map a 1-dimensional value (percentage of time) to a 2-dimensional value (area) and have it work every time. This "windirstat" method seems to be a novel way to represent performance data and get an understanding of it at a quick glance, and this tool continues to be an experiment along those lines. Most profiler GUIs, like JProfiler for example, seem to use a tree view to display this kind of data. So I'd love to hear any feedback you might have should you use this tool. Hopefully I'll have some time to blog about some of the technical details and interesting challenges that came up along the way.
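If you're curious what that mapping looks like, here's a toy sketch of the basic idea (this is not the actual VisPerf layout code, just an illustration): slice the parent's rectangle so each child's area matches its share of the parent's time, and a recursive caller can flip the split direction at each depth.
def layout(x, y, w, h, children, horizontal=True):
    """children: (name, fraction) pairs, with fractions summing to <= 1.
    Returns (name, x, y, w, h) rectangles proportional to each fraction."""
    rects = []
    for name, fraction in children:
        if horizontal:
            cw = w * fraction  # slice off a vertical strip
            rects.append((name, x, y, cw, h))
            x += cw
        else:
            ch = h * fraction  # slice off a horizontal strip
            rects.append((name, x, y, w, ch))
            y += ch
    return rects

print layout(0, 0, 100, 100, [("PigLatinize", 0.8), ("helpers", 0.2)])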

P.S. For a traditional tree view of the caller-callee report, check out this CodeProject article.