On Fri, Jul 02, 2010 at 02:32:19PM -0400, Richard Pieri wrote:
> On Jul 2, 2010, at 1:44 PM, Derek Martin wrote:
> >
> > And you can't even conceive that this is caused by a problem with the
> > implementation, rather than the file format?
>
> Large mbox files * heavy disk I/O = connection timeouts and lock problems.

... with your IMAP server.  This is not inherently a problem of mbox.

Let me make myself clear: mbox DOES NOT suck.  Just every
implementation of it that exists does.  If it were implemented better,
it would actually ALWAYS perform essentially as well as (within the
typical human's ability to notice, in the common cases) or better than
maildir, for every operation.

One thing to note is that there is no official mbox standard.  There
are two very common versions of mbox, but even they have variations.
This means that you have the freedom to muck with the format of the
mail store, provided you don't break compatibility with other tools
that you need to use.  This is important, as I'll illustrate
momentarily.

As a file format, mbox has 3 operations which are, as traditionally
implemented, very I/O intensive.  The first is opening a mailbox.
Assuming no caching or indexing, mbox already beats maildir handily
there; with caching and indexing, performance should be about the same
for both formats.  For the uncached case, I and others have benchmarked
it with Mutt, and even Courier's own benchmarks show that mbox beats
maildir (though with the right filesystem and hardware maildir can get
fairly close).  So we don't need to consider it.

The other two are, IIRC, the only two operations where mbox does not
beat or match maildir in terms of performance and amount of raw I/O.

The second one is updating the status (i.e. read, new, etc.) of a
message in the message store.  Traditionally, mbox handles this by
adding a Status: header to the message once it's been looked at, and
messages without a Status header are treated as new.
Also traditionally, the way to update the status of a message is to:

 1. write out a copy of the entire mail folder up to the point where
    you want to add the Status header
 2. write the Status header into the copy
 3. write out the remainder of the mail folder into the copy
 4. unlink the original mail folder
 5. rename the copy of the mail folder to the original mail folder.

That's a crazy amount of I/O for such a simple operation.  If only you
could find a way to update the status of a message without rewriting
the entire mail store...

Well it turns out you can.  NO ONE does this, but if they did, it would
make mbox negligibly slower than maildir for updating message status.
The solution is to ALWAYS write out a Status header, and make its value
a fixed width string -- even on new messages.  If you do this, then you
can:

 1. index the file offset of the Status header
 2. seek() to it when you need to update it
 3. write out the new Status header *IN PLACE*

And you're done.  No copying entire mail folders, very simple, very
easy.  Of course you have to lock, but file locking has been a
well-solved problem for decades (except over NFS, sadly, though that
mostly works much better now than in the past).

To complete this solution, you just need to get your MDA to deliver
mail with the Status header set.  No problem; you don't even need to
patch anything, if you're using procmail: just pipe the message through
something like

    formail -I "Status: N "

and you're done.  Wow, we just made mbox just as fast as maildir for
updating message status, and reduced the I/O load by some ridiculous
amount.

This is a small change that does not break compatibility with other
mbox implementations.  Although, in order to truly reap the benefit of
it, you'd need to always use tools that do this.  And of course, your
implementation needs to be prepared to fix mail that wasn't handled
this way.  But that's doable.

The third case is deleting (expunging) messages from the message store.
The problem is essentially the same as above: to delete a message,
traditionally you write out a copy of the mail store, just as above.
If only there were a way to avoid doing that...

Well, guess what?  Just as above, there is: expunge in place.  People
don't like to implement this, because if they get it wrong, the result
is catastrophic mail loss.  But if you get it right, the result is a
one-file-per-mail-store format that beats or matches maildir FOR EVERY
OPERATION, at least in the common case.

Think about how you use mail, and specifically when you delete
messages.  Most often, people keep a bunch of old mail in their mail
folders, and accumulate new messages at the end.  Periodically, you
might clean up your mail, but typically the old mail you have stays
around for a while, and it's mostly new messages that you're deleting.
This turns out to be perfect for mbox: most of the messages you want to
delete are already at or near the end of the file.  Can anyone see
where this is going?

Let's take the best case scenario first.  Suppose you have 300 messages
in a folder, and 200 of them are new, and you want to delete all of the
last 100 messages.  With maildir, you have to do 100 unlink() calls...
kind of expensive, but much better than rewriting the entire mail
store.  BUT, you don't actually NEED to rewrite the entire mail store,
in this case.  Since all of the 100 messages are at the end of the
file, and there are no messages intermixed that you want to keep, you
can BLOW THE DOORS off maildir: delete them all at once, with a single
call to truncate().  The amount of I/O you need to do just went from
crazy to essentially NONE.

Now, even in the common case, there are usually a few messages in there
that you want to keep.  We can still improve our lot tremendously.  If
you mmap() the file, and chunk by chunk write the messages you want to
keep over the messages you want to delete, you can still do a truncate
and save a ton of file I/O over traditional mbox implementations.
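
To make the two in-place ideas above concrete (the fixed-width Status
rewrite, and expunge by sliding kept messages down and truncating),
here is a minimal sketch in Python.  It is an illustration under stated
assumptions, not an existing implementation: the names index_mbox,
set_status and expunge_in_place are hypothetical, the status value is
assumed to be exactly 2 bytes wide (matching "Status: N "), body lines
starting with "From " are assumed to be quoted, and flock() merely
stands in for whatever locking scheme your MDA and MUA agree on.

```python
import fcntl
import mmap

STATUS = b"Status: "
WIDTH = 2  # assumed fixed width of the status value, e.g. b"N " or b"RO"


def index_mbox(path):
    """Scan the file once; return [start, end, status_offset] per message.

    status_offset is the file offset of the fixed-width status value,
    or None if the message carries no Status header.
    """
    index, pos, in_headers = [], 0, False
    with open(path, "rb") as f:
        for line in f:  # binary iteration splits on b"\n" only
            if line.startswith(b"From "):  # assumes body "From " is quoted
                if index:
                    index[-1][1] = pos  # previous message ends here
                index.append([pos, None, None])
                in_headers = True
            elif in_headers and line in (b"\n", b"\r\n"):
                in_headers = False  # blank line ends the header block
            elif in_headers and line.startswith(STATUS):
                index[-1][2] = pos + len(STATUS)
            pos += len(line)
    if index:
        index[-1][1] = pos  # last message runs to end of file
    return index


def set_status(path, status_offset, value):
    """Overwrite the status value in place; the file size never changes."""
    field = value.ljust(WIDTH)  # pad with spaces up to the fixed width
    if len(field) != WIDTH:
        raise ValueError("status value wider than the fixed field")
    with open(path, "r+b") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # released when the file is closed
        f.seek(status_offset)
        f.write(field)


def expunge_in_place(path, index, doomed):
    """Delete the messages whose positions in index are in the set doomed.

    Kept messages after the first deleted one slide down over the gap,
    then the file is cut to its new length with a single truncate().
    The index is stale afterwards and must be rebuilt.
    """
    if not doomed:
        return
    first = min(doomed)
    write_pos = index[first][0]
    moves = [(s, e) for i, (s, e, _) in enumerate(index)
             if i > first and i not in doomed]
    with open(path, "r+b") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        if moves:  # only map the file when bytes actually have to move
            with mmap.mmap(f.fileno(), 0) as mm:
                for start, end in moves:
                    mm.move(write_pos, start, end - start)
                    write_pos += end - start
                mm.flush()
        f.truncate(write_pos)
```

When every doomed message sits at the tail of the file, moves is empty
and the whole expunge is the single truncate() call.  Note that a crash
in the middle of the compaction loop would leave the mailbox damaged,
which is exactly the risk mentioned above; a real implementation would
need to guard against that.
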
In the common case, you won't notice much difference between expunging
mbox and expunging maildir.  Only when you're doing those infrequent
cleanups will you notice a difference, and even then, it would be much
faster than it would be with traditional implementations.

The only other issue you mentioned was the back-up problem.  I haven't
spent any time thinking about this.  But I'm willing to bet that, since
you're already caching and indexing the messages and their metadata,
efficient incremental back-ups of the mail store that are as good as or
better than what you can do with maildir are well within the realm of
possibility.  I doubt even that is a terribly hard problem.

-- 
Derek D. Martin
http://www.pizzashack.org/
GPG Key ID: 0xDFBEAD02

-=-=-=-=-
This message is posted from an invalid address.  Replying to it will
result in undeliverable mail due to spam prevention.  Sorry for the
inconvenience.