Wednesday, September 5, 2012

QThread Best Practices -- When a QThread isn't a thread part 2

Review...

In When a QThread isn't a thread, we learned some of the finickiness in how QThread needs to be used. Qt has a concept of "thread affinity" -- the thread a QObject "lives" in. At construction, a QObject becomes tied to the currently running thread. Naturally, this applies to QThread as well. Since our new thread cannot be running during its own creation, the new QThread instance is created while another "parent" thread is running. Because of the thread-affinity rules, it therefore lives in the parent thread. This association with the parent thread, combined with how Qt signals-and-slots work, creates a problematic situation. When a signal is emitted, Qt is smart enough to figure out, based on the source and destination QObjects' thread affinity, whether the signal's connected slots should be called directly as methods (when the QObjects share thread affinity) or whether Qt should post to the destination QObject's thread's event queue for processing in the correct context (when they don't).

Put all this together, and you'll see a problem. Since a QThread's affinity is the thread that created it, slots on the QThread itself (that is, methods of your QThread subclass) will execute in the QThread's parent thread's context when responding to signals. In other words, if you "post" to a QThread method via a signal, you're not doing work in that QThread as you'd probably expect. A QThread isn't like, say, a CWinThread, where methods of your thread tend to be handling thread messages and running in your thread's context.

To illustrate: in the code below, doWork does not execute in MyWorkerThread's context. The worker thread is effectively meaningless, as the meat of the work is being done in the worker thread's parent thread.

from PySide.QtCore import QThread, QTimer

class MyWorkerThread(QThread):
    def __init__(self):
        super(MyWorkerThread, self).__init__()

    def run(self):
        self.timer = QTimer()
        print(self.timer.thread())
        self.timer.setSingleShot(False)
        self.timer.timeout.connect(self.doWork)
        self.timer.start(1000)
        self.exec_()

    def doWork(self):
        print("Work!")

Solution 0.1

The solution I proposed was to do all your required work in some secondary QObject (worker below). You can then change that object's affinity to the worker thread by calling the very handy moveToThread method (see __init__ below). Now that both QObjects are associated with MyWorkerThread, the work triggered by the timer signal is done in MyWorkerThread's context.

from PySide.QtCore import QThread, QTimer

class MyWorkerThread(QThread):
    """ I'm just going to set up the event loop and do
        nothing else..."""
    def __init__(self, worker):
        super(MyWorkerThread, self).__init__()
        worker.moveToThread(self)
        self.worker = worker

    def run(self):
        self.timer = QTimer()
        print(self.timer.thread())
        self.timer.setSingleShot(False)
        self.timer.timeout.connect(self.worker.doWork)
        self.timer.start(1000)
        self.exec_()

There's a better way...

Since writing the original post, I've read several illuminating articles that discuss this topic, specifically The Great QThread Mess and Using QThread Without Subclassing. In short, they describe the same problem I do and go on to describe the sad state of the QThread documentation. As it turns out, you don't have to subclass QThread at all.

I had been overriding run because I thought I needed to call self.exec_() directly to set up an event queue. It turns out that the default implementation of run gives you an event queue! So an even better solution is one that doesn't subclass QThread at all. Something like:

from PySide.QtCore import QThread, QTimer

worker = Worker()
workerThread = QThread()
timer = QTimer()
timer.timeout.connect(worker.doWork)
timer.start(1000)  # fire every second; start *before* the move --
                   # a timer can only be started from its own thread,
                   # and active timers survive moveToThread
worker.moveToThread(workerThread)
timer.moveToThread(workerThread)
workerThread.start()

We simply create two QObjects that will communicate via signals-and-slots, create a vanilla QThread, move the new objects' thread affinity to that QThread, and voila! All the work is done in the worker thread's context. Moreover, any signal connected to the worker's slots, emitted from anywhere, will be handled in the worker thread's context.
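If the queued-connection behavior feels abstract, here's a rough analogy using nothing but Python's standard library (no Qt involved): a thread that drains a queue of callables, so anything "posted" to it runs in its context. The EventLoopThread class and its names are invented for illustration; Qt's real machinery is the per-thread event loop and queued slot invocation.

```python
import queue
import threading

class EventLoopThread(threading.Thread):
    """A rough stand-in for a thread running an event loop:
    callables 'posted' to its queue run in *its* context,
    much like slots invoked via a queued signal connection."""
    def __init__(self):
        super(EventLoopThread, self).__init__()
        self.events = queue.Queue()
        self.results = []

    def post(self, fn):
        self.events.put(fn)          # like emitting a cross-thread signal

    def run(self):
        while True:
            fn = self.events.get()   # like the event loop dispatching
            if fn is None:           # sentinel: quit the "loop"
                return
            self.results.append(fn())

loop = EventLoopThread()
loop.start()
# "Post" some work; it executes in loop's thread, not the caller's.
loop.post(lambda: threading.current_thread().name)
loop.post(None)
loop.join()
print(loop.results)  # the worker thread's name, not the caller's
```

The point of the analogy: the caller never runs the work itself; it only enqueues it, and whichever thread drains the queue does the work in its own context.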

QThread with new hotness best practices

This turns out to be a neat way to set up work. It creates a bit of a danger zone, though. In my 0.1 solution, the QThread provided a nice container for holding objects that other threads' code shouldn't play with. Here, though, everything is wide open in plain sight in client code.

Despite the awesomeness above, I tend not to want the objects I've just pushed into the new worker thread to hang around, since their code is now potentially executing in two thread contexts, and at the very least my worker is probably not thread-safe. For example, some work might be getting done in the worker thread's context via a signal while I accidentally call a method directly on worker or timer. That said, I still tend to want to control how the worker is created. I may need to construct it with specific parameters describing the work to be done, or connect it to signals it or I might emit to facilitate communication across threads.

As such, I've generally started to follow something like this pattern for my own safety and sanity:

def startPeriodicWork(worker):
    """ Takes "ownership" of worker; returns the
        QThread that's been started """
    workerThread = QThread()   # Create the QThread
    timer = QTimer()           # Create a periodic timer signal
    timer.timeout.connect(worker.doWork)
    timer.start(1000)          # fire every second (start before the move)
    worker.moveToThread(workerThread)
    timer.moveToThread(workerThread)
    workerThread.timer = timer    # Stash away, take ownership
    workerThread.worker = worker  # Stash away, take ownership
    workerThread.start()
    return workerThread

Notice that around the moveToThread calls we store timer and worker as attributes of our QThread. I find this a handy way to keep the objects alive so they're not garbage collected (maybe that's paranoia on my part, but they seem to form a cycle with nothing in the code pointing at them). So I like to stash them away in my QThread.
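Whether or not collection is a real danger in any given Qt binding, the keep-alive idea itself is easy to demonstrate with plain Python (no Qt involved). The Holder class below is an invented stand-in for the QThread we stash attributes on:

```python
import gc
import weakref

class Worker(object):
    pass

class Holder(object):
    """Invented stand-in for the QThread we stash attributes on."""
    pass

stashed = Holder()
loose = Worker()            # nothing will hold on to this one
kept = Worker()
stashed.worker = kept       # "stash away, take ownership"

loose_ref = weakref.ref(loose)
kept_ref = weakref.ref(kept)
del loose, kept
gc.collect()

print(loose_ref() is None)  # True  -- collected, nobody kept it alive
print(kept_ref() is None)   # False -- the stash kept it alive
```

Stashing the objects on something long-lived (here, the thread object itself) guarantees a strong reference exists for as long as the thread does.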

In client code, I would do something like:
# Create the worker
worker = Worker(foo, bar, "otherParam")
# Setup some signals/slots for communication
worker.workDone.connect(self.handleWorkDone)
self.newWork.connect(worker.handleNewWork)
# pass to function to get work going
self.workerThread = startPeriodicWork(worker)
# FORGET worker
# (notice worker isn't stored in "self" only the workerThread)
Notice we forget about worker once we've started it. If for some reason we need to change something about worker, we can't talk to it directly. But that's ok: we've set up some useful signals/slots to communicate between us and the worker. Since the worker is associated with workerThread and our client QObject presumably is not, we can rely on Qt's signals and slots to safely post signals between the two QObjects' event queues.

Conclusion

I could keep fussing over more finicky details of this code and ways to improve it, but this is the basic sketch of what seems to be a sane pattern for working with QThreads. Once you get the hang of it and realize you don't (and shouldn't) subclass QThread, life becomes a lot more straightforward. Simply put, never do work in the QThread itself; you should almost never need to override run. For most use cases, setting up proper associations between a QObject and a QThread and using Qt's signals/slots creates an extremely powerful way to do multithreaded programming. Just be careful not to let the QObjects you've pushed to your worker threads hang around in client code, as certain hilarity will almost definitely ensue.

Wednesday, August 15, 2012

When a QThread isn't a thread...

The thread that doesn't thread...

Let's say you have some background work you need to get done. The work is pretty time intensive, so you don't want to block the main UI thread while it's executing. To do this work, you implement a worker thread like the one below to do the work once a second:
from PySide.QtCore import QThread, QTimer

class MyWorkerThread(QThread):
    def __init__(self):
        super(MyWorkerThread, self).__init__()

    def run(self):
        self.timer = QTimer()
        print(self.timer.thread())
        self.timer.setSingleShot(False)
        self.timer.timeout.connect(self.doWork)
        self.timer.start(1000)
        self.exec_()

    def doWork(self):
        print("Work!")
On the surface this code seems sane. When the thread starts executing, we set up a QTimer that's going to run in the current thread's event queue. We connect our doWork function to the timeout signal. Then we do work in the worker thread's context... right?

If you try this out with any serious work, though, you'll find the main thread is actually getting blocked by the work that's being done. The GUI running in the main thread will not be responsive. It's almost as if threading itself has stopped working.

So what's happening? Well, there are two pieces of information we need to figure this out. First is thread affinity. Thread affinity is described by Qt as the thread the QObject "lives in". According to the Qt docs, when a QObject is created, its thread pointer is set to the currently executing thread. Now here's the important question -- what's the currently executing thread when we create our worker thread?

Well, it can't be the worker thread, as that thread doesn't really exist yet. The __init__ for the thread runs in the thread creating it. The QThread doesn't actually become, well, a thread until you call start on it and it begins to run().

So a QThread's affinity is always the thread that creates it.
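This fact is easy to see with Python's own threading module (no Qt required): a thread object's __init__ runs in whatever thread constructs it, and only run executes in the new thread. The AffinityDemo name is invented for illustration:

```python
import threading

class AffinityDemo(threading.Thread):
    def __init__(self):
        super(AffinityDemo, self).__init__()
        # __init__ runs in whichever thread constructs the object --
        # the same reason a new QThread "lives" in its creator's thread.
        self.constructed_in = threading.current_thread().name

    def run(self):
        # Only run() executes in the newly started thread.
        self.ran_in = threading.current_thread().name

t = AffinityDemo()
t.start()
t.join()
print(t.constructed_in)  # the creating thread's name
print(t.ran_in)          # the new thread's own name
```

The two names differ: the object was "born" in the creating thread, and only run happens in the new one.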

Ok, the next piece of information is what happens when we connect the signal. By default, the connection is what's known as an AutoConnection. An AutoConnection begins by figuring out the thread affinity of the emitting and receiving QObjects. So what is the affinity of the QTimer firing the signal? Well, it's created while the worker thread is running, so it lives in the worker thread. And the receiver, what's its affinity? The receiver is the QThread itself, and we just established that a QThread's affinity is the thread that creates it. So we have a signal going from a QTimer living in the worker thread to a QThread living in the main thread.

An AutoConnection between objects in different threads posts the signal as an event in the receiving thread's event queue. The timeout signal is therefore posted from the worker thread to the main thread. The main thread eventually gets to this event and figures out the slot to call -- in this case, doWork on our worker thread object. The main thread then calls doWork in its own context in response to the signal.

The work is intensive. It takes a lot of time. And the main thread's event queue can't run because it's doing the work the worker thread should be doing. So none of the GUI events get processed, and the main thread effectively becomes starved.

The solution? 

We need to make sure that the QObject doing the work lives in the worker thread. So it must not be the QThread itself. We need something like:
from PySide.QtCore import QObject, QThread, QTimer

class Worker(QObject):
    """ I'll do the work and live in the worker thread!"""
    def __init__(self):
        super(Worker, self).__init__()

    def doWork(self):
        print("Work")

class MyWorkerThread(QThread):
    """ I'm just going to set up the event loop and do
        nothing else..."""
    def __init__(self):
        super(MyWorkerThread, self).__init__()

    def run(self):
        self.timer = QTimer()
        self.worker = Worker()
        print(self.timer.thread())
        self.timer.setSingleShot(False)
        self.timer.timeout.connect(self.worker.doWork)
        self.timer.start(1000)
        self.exec_()
Great -- notice we created the Worker while the worker thread is executing. Our Worker and the QTimer have the same thread affinity (that is, they both live in the worker thread). Therefore, when timeout fires, doWork runs in the worker thread's context.

Problem solved!

Some caveats....

You might want to control how the worker is created before you create the worker thread. Maybe you want to pass the worker into the worker thread, for example. But wait -- if we create it outside of this thread, it's not living in the worker thread, and we have the same problem again! Aaaah!

Luckily, QObject has a method moveToThread, which takes a QThread and changes the object's affinity to the passed-in thread. This method is intended to let you push an object into another thread (it's important to note that pulling an object into the current thread is not thread-safe). So we can do this:
from PySide.QtCore import QThread, QTimer

class MyWorkerThread(QThread):
    """ I'm just going to set up the event loop and do
        nothing else..."""
    def __init__(self, worker):
        super(MyWorkerThread, self).__init__()
        worker.moveToThread(self)
        self.worker = worker

    def run(self):
        self.timer = QTimer()
        print(self.timer.thread())
        self.timer.setSingleShot(False)
        self.timer.timeout.connect(self.worker.doWork)
        self.timer.start(1000)
        self.exec_()
Of course, you probably don't want to retain the passed-in worker in the code that creates the thread. You might call a method on it from outside the worker thread while it's doing work in the worker thread's context. Most likely that operation will not be thread-safe, and unintentional bugs will ensue. Ideally, you don't want to keep worker around in your worker thread's client code.

This is something that's easy to forget and hard to enforce, so another solution is to pass in a factory function to create the worker for you instead of passing the worker itself. This removes the ability for the worker thread's owner/client to monkey with the worker in an unsafe way. So we'd have something like this:
from PySide.QtCore import QThread, QTimer

def createWorker():
    return Worker()

class MyWorkerThread(QThread):
    """ I'm just going to set up the event loop and do
        nothing else..."""
    def __init__(self, workerFactory):
        super(MyWorkerThread, self).__init__()
        self.workerFactory = workerFactory

    def run(self):
        self.timer = QTimer()
        print(self.timer.thread())
        self.worker = self.workerFactory()
        self.timer.setSingleShot(False)
        self.timer.timeout.connect(self.worker.doWork)
        self.timer.start(1000)
        self.exec_()

Final thoughts

It's interesting that while researching all this I discovered that for Qt 5, "Subclassing QThread is no longer recommended way of using QThread". Well, that's not surprising. When you think about it, since a QThread lives in the thread that creates it, methods of the QThread itself are a pretty unsafe place to do any work. We know run executes in the new thread's context, so we can set up our workers and whatnot there, but we can't be sure what's going on in the other methods of QThread.

This is a little weird if you've ever dealt with other frameworks like MFC. With CWinThread, for example, you'd be used to attaching thread message handlers to methods of your CWinThread for direct handling in that CWinThread's context. You get trained to expect that methods of your derived CWinThread execute in the CWinThread's context, not in some external context. Sure, you can always call into a CWinThread from another thread's context, but MFC developers know it's better to post to the other thread and have it deal with the event in its own context.

Classes inheriting from QThread don't work that way, and Qt signals/slots are different from Windows thread messages. So beware, and keep your work out of QThread, delegating it to other objects that live in that QThread!

Tuesday, June 12, 2012

VisPerf -- Visualize The Output of Visual Studio's Stand Alone Profiler

At work we've recently been making pretty extensive use of the Visual Studio Stand Alone Profiler. We use the Professional edition of Visual Studio, so we don't have any of the nice integrated profiling features with pretty GUIs built into the more expensive editions. Instead, with the free command-line tools, we can get a set of csv performance reports, including most importantly --

  • A "Function Report" with each function listed with time listed including the functions it calls (inclusive time) and time excluding the functions it calls (exclusive time). 
  • A "Caller-Callee" report with, for each function F in the program (1) a root entry for F (2) A list of callers of F that function with inclusive/exclusive time when that caller calls F (3) A list of callees -- functions that F calls and inclusive/exclusive time when these functions are called by F. 
  • A "Call Tree" report. Similar to the caller-callee report, but represented as a tree starting from the top-level function, a list of all the functions called the next layer down, etc.
The report we've found most useful is the caller-callee report. With it, you can get a nice set of statistics on program performance and answer most of your basic questions. Which functions are taking the most time? Which called functions dominate those functions' performance? When are those functions hurting performance -- when called by a single, specific heavy-hitting caller, or when called by a variety of different callers?

Unfortunately, as you can imagine, the raw csv files are very hard to use. Imagine you want to know why function G is dominating function F's performance. You need to (1) save off/bookmark the root record for function F, then (2) find the root record for function G and examine it. Once in function G, you need to repeat the process... say, for function H. While keeping these breadcrumbs for the whole F->G->H investigation, you've lost any ability to step back and look at the big picture. Maybe F->G->H isn't worth pursuing, and a lot could be gained by optimizing, say, the F->I call stack instead.

Wouldn't it be nice if you could easily navigate the performance output keeping the big picture in view? If we could only, in one glance, see F, all the functions F calls, and maybe even some levels deeper into the functions those functions call.

Enter VisPerf:


VisPerf is a visualizer for the caller-callee csv reports. The main view shows the performance of one "parent" function, drawing a rectangle for each function called by the parent. If you've ever used WinDirStat, the following description will be second nature to you. Each function's rectangle is sized proportionally to how much it contributes to the parent's total execution time. So in the above view, we can see that PigLatinize dominates the performance of PigLatin simply by seeing how much area PigLatinize takes up in PigLatin's square. Numbers prepended to the function name give more information: the number in <> angle brackets gives that function's percentage of the immediate parent, and the number in () parentheses gives that function's percentage of the overall program (not just when called here, but when called everywhere in the program).
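The proportional-area idea can be sketched in a few lines. This is just an illustrative "slice" layout under the assumption that each child gets a strip whose area matches its share of time; it's not VisPerf's actual algorithm, and the function and names are invented:

```python
def slice_layout(children, x, y, w, h):
    """Split a parent's rectangle into horizontal strips, each with
    area proportional to the child's share of the parent's time."""
    total = float(sum(t for _, t in children))
    rects, cur_y = [], y
    for name, t in children:
        strip_h = h * (t / total)
        rects.append((name, x, cur_y, w, strip_h))
        cur_y += strip_h
    return rects

# A 100x100 parent square; PigLatinize gets 80% of the area.
rects = slice_layout([("PigLatinize", 80.0), ("everything else", 20.0)],
                     0, 0, 100, 100)
for name, rx, ry, rw, rh in rects:
    print(name, rw * rh)
```

Because width is held constant and only height varies, each child's area is exactly its fraction of the parent's area; real treemap algorithms additionally alternate or squarify the splits to keep rectangles closer to square.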

The view doesn't stop at one level; we can see more than one level of depth at a glance. In the above view, we see how PigLatin calls PigLatinize, which in turn calls PigLatinizeBeginsWithConsonant. We can see at a glance not just that PigLatinize dominates the performance of PigLatin, but that PigLatinizeBeginsWithConsonant is what's causing PigLatinize to be so slow and ultimately causing PigLatin to be slow. We can see right away the obvious places for optimization and ignore the harder optimization problems. The depth can be controlled by the spinbox in the upper right.

The current top-level function can be changed either by selecting from the drop-down or by clicking one of the rectangles. Each rectangle has a nice mouse-over tool-tip summary with detailed info, which you can copy to the clipboard using the context menu. Also from the context menu, you can browse instantly to other callers of the selected function to see if it matters in performance scenarios elsewhere in your program.

To use VisPerf, download the latest binaries from here. Right click on a caller-callee csv file you've generated from the profiler and select "Open With". Browse to the VisPerf executable. From that point on, VisPerf will be in your context menu for csv files under "Open With" and you'll be able to open caller-callee reports with ease.

This tool has been a game changer for us. My colleagues and I have been using VisPerf to analyze the performance of our projects, and we've been able to solve and prevent many performance problems in our everyday development. With an easy-to-use tool to see what's going on, bad performance decisions can be caught early, before they become emergencies. It's also helping us learn what constructs to use and what to avoid when performance matters (I'm looking at you, iostreams and boost mutex). The whole team has become a lot more performance-aware and confident, backed up by powerful, demonstrable data that everyone can see for themselves.

I'm releasing this because I feel there's a sliver of a chance it might be helpful to someone else. You too can get the awesome that we've acquired! Please let me know if you have any feedback or find any bugs -- I'd love to hear from you!

Odds-and-ends

The source code is available under the Boost license at github for anyone to do whatever with. The code is written in Python, using PySide to bind to Qt. There's also a Python module in the source code, "callercallee", that you can use to parse and navigate caller-callee reports programmatically.
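For a sense of what programmatic access to such a report might look like, here's a minimal sketch using only the standard csv module. The column names ("Function Name", "Exclusive %") and the sample data are invented for illustration and won't match the real report headers or the callercallee module's API exactly:

```python
import csv
import io

# Invented sample in the spirit of a profiler csv report.
sample = io.StringIO(
    "Function Name,Exclusive %\n"
    "PigLatinize,80.0\n"
    "Other,20.0\n"
)

# Read rows as dicts keyed by the header line, then find the
# function with the largest exclusive-time percentage.
rows = list(csv.DictReader(sample))
hottest = max(rows, key=lambda r: float(r["Exclusive %"]))
print(hottest["Function Name"])  # PigLatinize
```

In real use you'd open the report file instead of an in-memory sample, but the shape of the code is the same: parse, then sort or filter on whichever time column you care about.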

It's been a very fun project, giving me a lot of fun Python code to write and challenging problems to solve. It's also been fun to figure out how to map a one-dimensional value (percentage of time) to a two-dimensional value (area) and have it work every time. This "WinDirStat" method seems to be a novel way to represent performance data and understand it at a quick glance, and the tool continues to be an experiment along those lines. Most profiler GUIs, like JProfiler for example, seem to use a tree view to display this kind of data. So I'd love to hear any feedback you might have should you use this tool. Hopefully I'll have some time to blog about some of the technical details and interesting challenges that came up along the way.

PS for a traditional tree-view of the caller-callee report, check out this CodeProject article.