Tuesday, April 5, 2011

Just the Facts Ma'am

When doing object-oriented programming, the hardest things to define are relationships. How does one class related to another? Does an instance of that class A have a bunch of class B? Do objects of class B need to know about the class A objects that hold it? Whats the best way to model this relationship?

On the one hand we want loose coupling. Just throwing in a pointer back to class B might be convenient, but it adds a dependency. On the other hand we don't want to avoid the natural relationships between things. We struggle with these concepts regurarly.

But there's an even more fundamental question underneath these struggles

Is what we're trying to accomplish descriptive or behavioral in nature? Is this piece of our code capturing a set of facts? Or is it a collection of actors that react/act on each other?

First lets think about what we mean by these, when we have a set of facts we mean a set of public data describing that fact. If our fact was a piece of information about baby names, we'd include lots of useful information about that baby name. For example we'd include the popularity, the culture of origin, the meaning of the name, etc. We also freely build associations between facts and other facts. Lets say we create a seperate fact describing all the stuff we care about cultures these names come from. Through whatever mechanism is available, we can relate the names back to the culture.

When we design actors, we care about encapsulation in the true tradition of object-orientation. We want to expose a small set of specific behaviors that can be executed. They may use facts internally or in their interface as value types. Nevertheless we treat an actor as a black box. Any internal facts being used are hidden to us. If our fictional baby system let you adopt a virtual baby, then the baby itself might be a good example of this. The baby can be fed, burped, and entertained. We have no idea whats going on inside the baby. We just no sometimes it poops and sometimes it cries. Sometimes feeding burping or entertaining helps, but we can't go and unset the poop bit in the baby no matter how much we'd like to.

It should be pretty clear that these are orthogonal concepts. Facts represent information in its most uncluttered form with no behavior. Actors represent behavior with information hiding. This orthoginality is the main theme of Chapter 6 of Robert C. Martin's Clean Code. Martin discusses the differences between working with facts (he calls them data structures) and actors (he calls them objects):

"Objects hide their data behind abstractions and expose functions that operate on that data. Data structures expose their data and have no meaningful functions. Go back and read that again. Notice the complimentary nature of the two definitions. They are virtual opposites. This difference may seem trivial, but it has far-reaching implications."

Martin emphasizes the strength of data structures to the object oriented crowd -- we can easily write a procedural function to do something new with the data without impacting the rest of the system. Somebody can take our baby name data and output it in their mobile baby app cause its just a collection of data. Awesome!

I think the depth of the distinction though goes deeper then Martin describes. The two aren't just opposite on the encapsulation spectrum. The two relate to their brethren in completely orthoginal ways -- facts relate to other facts in a fundamentally different way then how actors relate to other actors. Actors use other actors to do work. The baby object is going to use the mouth object to eat. Actors also implement interfaces or override base classes to provide modified behavior. Facts associate with other facts to indicate additional information -- this baby name is scottish -- the other scottish baby names are ... these people who like scottish baby names. Facts associate themselves with other facts (ie baby names associate themselves with cultures) to provide even more implicit facts. We like our facts this way, and its natural to operate with information along these lines.

To summarize:
  • From a "Facts" perspective, the universe is a database of useful, interrelated but inert information. Exciting connections can be made between different pieces of information and followed freely.

  • From an "Actors" perspective, the universe is a chain of command with main() as God. From there behavior is delegated to different objects. main() orders an object to do something, which orders someone else to do something. Behavior is commanded in an orderly fashion.

We're dealing with two different social systems. As you can imagine, the worse thing we can do is to try to make a fact like an actor and vice versa. Along these lines, in Clean Code Martin talks about avoiding "Hybrids" between these two. He describes data structures with methods added to them to perform some significant behavior.

I'd go a step further, now that we see how these two social systems function on different planets, its even more mind blowingly dangerous to make hybrids between the two. This is the road to everyone knowing about everything. This is where spaghetti code comes from. The "fact" side of the hybrid makes us want to reach out and know everything about everyone. We need to build associations between interrelated facts. However, the actor side of the hybrid wants to tell things what to do. Putting those two together, we've built something where any object can tell any other object what to do. So if somehow we had a Baby object that also had all the stuff about the name globbed into the baby. Say Baby had an accessor for getting the name, the name's originating culture and so on then from there we can find all the other Baby's in that culture through the hybrid culture object. We can now add methods in Baby that operate on all babys! Cruft like this builds and builds in classes like Baby like you wouldn't believe. With hybrids in play, You end up with big ball of mud. woohoo!

This situation arises because Baby starts out as an actor, jumps down into the fact parallel universe, jumps around the fact network of associations, and then because our facts are also hybrids, we can jump back up to the actor universe and tell those facts err I mean actors err I mean facts to do something. We've create a very bad confusing object oriented version of goto.

With that pitfall in mind, when we're back deciding how objects of a given class should be coupled, we need to ask ourselves why we're coupling them? Is it because they are related pieces of information? If so, then as long as information stays inert and behavior-less, allow facts to be facts. Let them relate to each other like they would want to in fact society. Maybe this evolves into a relational database with foreign keys linking rows together, providing additional implicit facts. Maybe its just a couple of nested structs generated and passed in as values, summarizing the important facts.

We should also be sure to take our facts out of our actors. Let information be standalone for all to consume and use it. Don't muddle the information with behavior. Be careful to avoid making behavioral entities pieces of information. Let them tell you facts, but be able to use those facts independent of the actor. For example its natural to use a baby name fact in the Baby actor, but avoid passing around Babys to get baby names. Use the fact instead. If you don't, you'll soon end up with an insane mass of spaghetti code. Once separated out, we can respect facts for what they're good at, giving us information.n. And now that we've separated the facts from our actors, *then* we can figure out how the actors should interelate to implement behavior.

When we let facts be just facts, we can feel confident navigating their web of knowledge for what we need to know. It becomes the "model" in Model-View Controller terminology. When we muddle the differences between the facts/model and the actors/controllers the inertia of software development will begin to create connections between otherwise unrelated pieces of code through the underlying information substrata of our software.

Wednesday, February 23, 2011

Single Threaded -- Not As Error Free As Your Mom Told You

There's good reason to avoid multithreaded programming. Nobody wants to deal with race conditions, shared memory, locking and deadlocks. This is especially true in a language like C++ where there are no guarantees on how locks behave, what kind of thread message passing scheme might be available, or what guarantees of atomicity exist on the target platform. It gets even more fun writing multi-threaded, platform independent code.


BUT

There comes a time when forcing everything into a single thread for the sake of single-threadedness becomes an unwieldy disaster.

Consider an application that must perform multiple, concurrent lines of processing. Lets say that this application must communicate with other pieces of software to perform multi-step tasks to accomplish these concurrent lines of processing. In the process of doing these multi-step tasks, we cannot prevent the application from executing. We can't prevent all of the lines of processing while waiting for one line of processing to complete. Doing so would starve every other line of processing in the system--they'd no longer be concurrent.

An example of such an application might be the backend to multiple automated teller machines (ATMs). The software must communicate with ATMs worldwide. Waiting for an ATM to respond in South Africa can't halt processing for every other ATM in the world. The backend software must be robust enough to keep multiple things going simultaneously.

By describing "lines of processing" and using terms like "concurrently" it may sound obvious that each "line" should get its own thread. Indeed this clearly is an inviting solution. Maybe here's what a "WithdrawFunds" routine might look like on the backend when run in its own thread:

// Run as a job in its own thread
void WithdrawFunds(unsigned int amount)
{
   // Confirm the amount is available in the account
   if (IsAmountAvailable(amount))
   {
      //Order the ATM to dispense funds, block
      //until this is completed
      DispenseFunds(amount); // BLOCKS

      DeductAmountFromAccount(amount);
   }
}

Here everything occurs step-by-step. If we need to wait for an ATM to respond we simply block while waiting. Obviously the other threads in the system continue to happily process their own withdrawals/deposits/etc. Somehow we'll need to deal with some information shared between threads--when deducting from an account for an example. To do this, we can get messy with locking code, use some kind of inter-thread communication, or just trust a database to take care of things for us.

There's another solution. "Lines of processing" need not indicate separating everything out into threads. We can implement our ATM backend system with a single never-blocking thread. The required ingredient is a thread message queue. Instead of blocking, a single thread can process each step asynchronously. Each ongoing line of processing, in our case a transaction with an ATM, gets tracked as a separate transaction. Messages coming into our thread initiate these transactions or push their state a step further.

Here's how the code would look:

// Based roughly on how MFC threading works
class TheAtmBackendThread
{
    // we need to track each ATM session now
    std::vector < atmtransaction> atmTransactions;

    // we received a withdraw request event
    void OnWithdrawRequest(/*some kind of msg*/)
    {
        if (IsAmountAvailable(...))
        {
            // Create a record for this communication with the ATM
            atmTransactions.push_back(new AtmSession);
            // Update the state of this transaction
            atmTransactions.back().UpdateState(WaitingOnFundsDispensed);
            PostMessageToDispenseFunds();
        }
    }

    void OnFundsDispensed(/*some kind of msg*/) 
    {
        // Now we have to figure out what ATM transactionthis is for 
        AtmTransaction tran = LookupTransaction(...);
        // Is the transaction in the expected state?
        if (sess.WaitingOnFundsToBeDispensed())
        {
           DeductAmountFromAccount(...)
        }
        // destroy the transaction
    }
};

In this solution, our thread's message queue looks at the next message and calls the corresponding function for that thread. To implement "WithdrawFunds", we need two events: One for withdraw requests, another confirming that funds were dispensed. Since we might be servicing requests for multiple ATMs simultaneously, we need to store every ongoing ATM transaction. When processing the WithdrawRequest event we create a transaction, store it away, and send a message back to the ATM to dispense funds. Later when we get a "funds dispensed" event we'll need to find the intended transaction for the event.

This method can become unreadable fast. Compare this code with the "WithdrawFunds" function above. "WithdrawFunds" by simply blocking can list everything that needs to be done step-by-step. Because we're not blocking here, we need to track the state of each transaction. Instead of being scoped to a single function as in "WithdrawFunds", now the transaction belongs to the entire thread. The blocking "WithdrawFunds" function, by avoiding all the additional state tracking, is far easier to understand and debug. When debugging the single threaded solution we never see the full story. What happened in the event handling functions when processing earlier events in this transaction? We'll never know because we've lost that function context. We don't know what happened, how we got here, or where we're going next. We're a little lost and its hard to find a map-- its hard to write a step-by-step procedure of how everything should happen just by looking at the code.

Moreover, by introducing extra state, we are just asking for bugs. What happens at each step when something goes wrong? How do we handle missing a step in the conversation? In the "WithrdawFunds" function this can simply be a caught exception or an error return code. In the single-threaded code, we need some kind of "watchdog" event to jump in and figure out what might have timed out. The increased scope of the transaction might expose us to bugs. More code will gain access to this data making it possible that unexpected code may poke around in our state leading to bugs and unexpected behavior. We get extra complexity and harder to debug code with its own potential for bugs.

Whats the answer? Its nice we don't have to think about all the threading issues in the single threaded solution. Its something that can certainly get ugly fast. However, a careful system designer should be able to mitigate these problems. Certainly a database would be an appropriate choice. Multiple threads only interacting with a database would avoid headaches associated with locking by delegating that job to the database. Aside from databases most platforms have ways of passing messages or communicating without the direct use of locks.

The question comes down to: do you want newbs confused about your crazy message passing/state tracking scheme? Or do you want them confused about how they should lock shared resources? Where do you want your newb generated bugs :)? If you can structure your code so that they can create threads and rarely have to think about locking (ie by using a database) then don't restrict yourself to the clumsy single threaded approach. If for some reason your system has poorly defined locking/thread communication semantics than there may be a reason to think more carefully about the trade-offs.

Tuesday, January 25, 2011

Producer, Consumer, Downcaster

Producer/Consumer relationships occur all over the places in code. In most contexts, separating the responsibility of producing something from the various ways that something can be consumed is an invaluable way to simplify a design. However, as we'll see, consumer code can encounter some unfortunate pitfalls when it tries to react to what gets produced.

Commonly the things getting produced inherit from the same interface. Often the class hierarchy looks something like this:

class IProducedThing
{
public:
    virtual void Foo() = 0;
  ...
};

class CConcreteProducedThing1 : public IProducedThing
{
public:
    virtual void Foo();
   ...
};

class CConcreteProducedThing2 : public IProducedThing
{
public:
    virtual void Foo();
   ...
};

IProducedThing* produce()
{
     // In some cases build a CConcreteThing1, in other cases a CConcreteThing2
     if (...)
    {
         return new CConcreteProducedThing1();
    }
    else
    {
         return new CConcreteProducedThing2();
     }
}

Unfortunately this can have the side effect of forcing the consumer to figure out exactly what was just produced. This probably involves a dynamic cast at the consumer to react appropriately to the consumed event:

void consumer()
{
      IProducedThing* thingie = produce();
      thingie->Foo();
      // Is this a CConcreteProducedThing1 or CConcreteProducedThing2?
      if (dynamic_cast<CConcreteProducedThing1>(thingie) != NULL)
      {
           // do something for a produced thing 1
      }
      else if (dynamic_cast<CConcreteProducedThing2>(thingie) != NULL)
      {
           // do something for a produced thing 2
      }
}

The most egregious reason to have this dynamic_cast would be if the "do something" operation is something best implemented as a method in each of the produced thingies. Something that should be pure virtual in the interface. 

Nevertheless, the consumer will need to do things that are best left separate from the produced class hierarchy. For example, the producer may be producing events while the consumer logs those events into a very specific format. Or the consumer itself may need to change its internal state based on what was produced. None of these things should or could be implemented as methods in our producer's class hierarchy.

Still even in these cases, dynamic_cast, casting in general, should give us the heebie-jeebies. We are circumventing what is otherwise a strongly typed language. By casting, we're forcing a conversion the compiler otherwise would see as an error. We tell the compiler to trust us. In strongly-typed languages, we want a persnickety compiler to complain to us as much as possible. We even hope that if we write our code thoughtfully, the compiler's errors can translate directly to prevented bugs. In short we want a compiler that doesn't trust us.

If ignoring compiler errors might mean ignoring bugs, there's got to be a better way. And there is. The producer just needs to be nice enough to implement the Visitor Pattern and we're all set. Here's how it works.

First we need to define a visitor base class:

class IProducedThingVisitor
{
  
};

In our visitor, we need to explicitly list every concrete implementation of IProducedThing, giving each its own virtual method called "visit":

class IProducedThingVisitor
{
public:
        void visit(CConcreteProducedThing1& thing1);
        void visit(CConcreteProducedThing2& thing2);
};

Then we'll need to require that each class that inherits from IProducedThing will accept a visitor

class IProducedThing
{
public:
       virtual void accept(IProducedThingVisitor& visitor) = 0;
  ...
};

class CConcreteProducedThing1 : public IProducedThing
{
       virtual void accept(IProducedThingVisitor& visitor)
       {
               visitor.visit(*this);
       }
   ...
};

class CConcreteProducedThing2 : public IProducedThing
{
       virtual void accept(IProducedThingVisitor& visitor)
       {
               visitor.visit(*this);
       }
   ...
};

What happens here? A class in consumer code inherits from IProducedThingVisitor. For each concrete IProducedThing, the consumers visitor creates a visit method to perform a specific task for that specific concrete type. For example, the consumers visitor may log each produced thing in a different log format. The consumer makes use of the visitor interface by creating its concrete visitor and passes an instance of this visitor into the produced item's accept as follows:

void consumer()
{
      IProducedThing* producedThingie = produce();
      CMyConcreteVisitor* visitor = new CMyConcreteVisitor();
      producedThingie->accept(visitor);
}

On the last line, the consumer ends up invoking either CConcreteProducedThing1::accept or CConcreteProducedThing2::accept. Once in those methods, the type of the this pointer corresponds to the type of the class we are in. This's type is CConcreteProducedThing1 in CConcreteProducedThing1::accept() and CConcreteProducedThing2 in CConcreteProducedThing2::accept() respectively. With this information he compiler can use the type of the this pointer to call the visitor's correct visit method. In this way we can have one set of operations for CConcreteProducedThing1 and another for CConcreteProducedThing2. Viola! no dynamic_cast required.

Problem solved? This solves the problem as presented, but there's still some open questions and issues that you'd have to consider before using the visitor pattern. 
  • What if the producer is in a separate, 3rd-party library and they didn't implement the visitor pattern?
  • What if you want to implement your own concrete implementations of the IProducedThing interface but the visitor base class is in 3rd party code out of your control, can you get producer code to call the visit method for your class?
  • What if you want data *returned* from the visit operations? What if some operations compute data while others don't?
Still, I'll think you'll agree this is an invaluable addition to any developers toolkit.

Saturday, December 18, 2010

Figuring Out Apache's Config File Structure

Figuring out the layout of Apache config files has always been confusing to me. I never quite know where I should put a given directive. Do I put it in the httpd.conf file? Or elsewhere? What about site or web-app specific config--stuff I can't or don't want to put in .htaccess? On my PC there appears to be about half-a-dozen little directories for putting different kinds of Apache configuration. Is this standard? What's the standard, best-practice for configuring Apache? Where do I put my stuff?

Debian & Ubuntu's Layout

Apparently there's not a whole lot that *is* standard. We can though learn a lot by looking a little at what Debian and Ubuntu do. This great article  lists how Debian and Ubuntu's flavors of Apache work. According to the article, Debian's default Apache config file is /etc/apache2/apache2.conf. From this file, files are included using the Include directive in  the following order:

# /etc/apache2/apache2.conf - pulls in additional
# configurations in this order:
Include /etc/apache2/mods-enabled/*.load
Include /etc/apache2/mods-enabled/*.conf
Include /etc/apache2/httpd.conf
Include /etc/apache2/ports.conf
Include /etc/apache2/conf.d/[^.#]*
Include /etc/apache2/sites-enabled/[^.#]*

Debian divides configs between those for Apache modules (mods-enabled directory), listen ports (ports.conf), VirtualHosts (sites-enabled) directory, and random other things (conf.d directory). The file httpd.conf is included for legacy reasons--it's the most common Apache root filename on all other distros.

Each file in "sites enabled" contains a VirtualHost directive. This directive allows configuration to be scoped to the host and/or port number used to access the site. Ports.conf holds both NameVirtualHost and Listen directives indicating what ports Apache should listen on. NameVirtualHost declares the VirtualHosts to be expected in the VirtualHost configurations later on. It says there will be a virtual host at localhost port 80. This is kind of like a "forward declaration" in C++ speak. Naming something without giving an explicit, full-fledged definition of the configuration you want to apply to that something.

"Mods-enabled" holds a .load file for each module containing a "LoadModule" directive-- telling Apache how to load the module. Each module may also have an optional .conf file for module-specific configuration. The dav_svn module on my box, for example, sets the SVNPath variable to the path of my SVN repository in its dav_svn.conf file.

Debian makes it easy to enable/disable modules and virtual hosts. Debian places all available modules and VirtualHost configs in the mods-available and sites-available locations respectively. These could be copied to the *-enabled directory directly, however Debian distros make it a little easier by providing non-standard a2enmod/a2dismod for enabling/disabling modules and a2ensite/a2dissite for enabling/disabling VirtualHosts. These commands place the specified configs will be placed in the corresponding *-enabled location. A restart of apache picks up the new module/site config.

When installing from a package, Debian puts config files specific to each piece of installed web app in the conf.d directory with the other miscellaneous config files. For example when I installed the phpmyadmin web app through synaptic, a phpmyadmin.conf was place in the conf.d directory. This config file uses the Alias directive to point the /phpmyadmin location at /usr/share/phpmyadmin. The file also disables/enables various php settings for phpmyadmin.

The Problem

I find Debian's layout to be a very tidy setup. Unfortunately, based on this article each OS install does something different. Each OS's install of Apache may use a different root configuration file. How this configuration file includes other paths/files also differs for each OS. Oh Joy. So your average Apache professional must learn the *right* place to put stuff for each distro? How does someone maintain a clean setup across multiple OS's without going nuts?

More specifically, consider the time when I want tight control of the server--I want to completely disable .htaccess files.  I want a single file with a very clearly defined, static Apache config for each installed web app (as is done with phpmyadmin). I want an obvious place to put that config on each OS when setting up the app. Furthermore, I may need some configuration that can't be put into an .htaccess file (like phpmyadmin's Alias directive). So where do I put my web app specific configuration? If I create some ostensibly platform-independent LAMP app, how could I write a script to ensure it gets installed on an OS-independent setup? Is this even feasible?

There is Some Hope

One thing that should be standard is that there is always one "root" server level file. Apache's own configuration lets us feel secure in this fact
The main configuration file is usually called httpd.conf. The location of this file is set at compile-time, but may be overridden with the -f command line flag. 
That root file may include others based on a given distribution's setup (as apache2.conf does for Debian). But a single, default, root Apache configuration file should always exists. We can hang our hat on that at least. So as I see it  the best way to place a web app specific config into Apache is to first try to find a directory for miscellaneous configurations (for Debian thats /etc/apache2/conf.d). Failing that, we'll have to do our best to directly include our configuration in the target OS's root configuration file.

It looks like the bottom line is you *can't* get around learning your distribution/OS's standard config file layout. You're gonna have to familiarize yourself with your OS's Apache install a little in order to play nice. Unfortunately the knowledge is not going to be very portable between OS's. But you can help yourself out by finding the root apache config file, seeing what it includes, and moving forward from there.

Doug being an idiot -- An example of checking the most obvious thing first...

Early this morning I set myself onto what should be the relatively simple task of installing a LAMP development environment in my Ubuntu virtualbox. I used aptitude to grab apache2, mysql, php5, etc. To test that I correctly had everything working (and because its useful in general) I grabbed phpmyadmin. Upon browsing to http::/localhost/phpmyadmin I noticed Apache was not running the index.php file, instead my browser downloaded the file. I must be forgetting to install something.

I initially suspected this problem to be related to not having the php5 Apache module installed (a separate package from the base php5 one). After installing/refreshing I still had the problem. Browsing to the IP address directly (http://<ipaddr>/phpmyadmin/) did not show the problem. phpmyadmin worked fine. Even browsing to http://127.0.0.1/localhost didn't show the problem. It all worked. Only http://localhost had the problem. WTF! I suspected a directive in the Apache config made the site available for web users but not through the localhost. I was sure there was something with the name "localhost" somewhere in the config file. After coming up with nothing, I tried changing "localhost" to "localhost2" in my hosts file. Chrome still showed the same bad behavior for http://localhost, even though it shouldn't even be able to contact the web server at all  now! WTF... Wait a second I thought. I bet Chrome is caching this crap and redownloading from the cache.

Resetting my hosts file, I loaded up Firefox, browsed to http://localhost and it worked. Clearing Chrome's cache also got this to work.

Damnit, I'm an idiot.