Optimizations and CFLAGS in an ideal world...

I was messing around with my make.conf today, and decided to play with my CFLAGS again. Now, I realize that realistically, there are some seriously diminishing returns for turning on additional compiler flags on a system-wide basis. Even with a single package, if you don't know what you're doing you can make things worse, and if you do know what you're doing, you might just be able to spend a few hours to squeeze out that extra 1% reduction in execution time. Playing with your whole system is likely to make some things marginally better and others marginally worse. And, when all is said and done, you're probably never going to notice anyway, because amarok still takes several minutes to filter your playlist. (Amarok and I have a love-hate relationship, but that's another rant.)

But, (like many Gentoo users) I twiddle with them anyway. Partly because it's fun learning about weird compiler optimizations, but truth be told, I'm also lured by the imaginary speed gains, and the vanity of turning on weird things and knowing what they do.

So, I was reading the Linux Magazine benchmarks comparing different Gentoo Optimization levels, and there's basically no significant difference from -O2 and -O3 with a couple of weird exceptions.

One of the UT2004 demos has a 20 FPS gain for -O2 over -O3, but none of the rest of them do. So that's a weird fluke.

The Dbench filesystem tests show a huge difference between optimization levels, with -O3 losing quite dramatically, with the difference diminishing as the number of clients goes up. This one's kinda concerning, it's definitely not worth setting -O3 if it's going to halve my disk access speed. But (perhaps in my ignorance), I really have no idea how changing your optimization levels will affect disk I/O. That should all be handled in the kernel, which is optimized separately. So I'm gonna assume that's some kinda weird behavior in the test harness, because I dunno anything about Dbench.

The image processing tests show a predictable bonus for -O3, since that's the kind of CPU bound task that would benefit from vectorization and inlining.

And last but not least, is the GtkPerf results for "AddText," where -O3 takes half as long as -O2. In my ignorance, I would guess that the gain is from inlining functions, since from what little gtk programming I've done it seems like there's tons of small functions you call for everything. But I really have no idea. All the AddText test does is keep adding short text fragments into a scroll box, so maybe there's a copy loop that gets optimized somewhere, I dunno. But, I know how to run GtkPerf, so I decided to check this out.

Turns out, because of Gentoo Bug #133469, gtk filters your cflags, and doesn't even distinguish between -O3 and -O2. Since the bug was dealt with in 2006, (and they're still filtering today), and the benchmark was done sometime in 2009, I'm going to assume that they were using the standard Portage tree and were running the same gtk library in both tests. So any speed gains they saw must've been some underlying system library they were calling into. So, I'd put my money on some string copy function which runs about twice as fast under -O3.

Now, the bug report was describing some pretty catastrophic failures, (and I'm sure there were), but after hacking the ebuild to use -O3, and playing with my system, I haven't seen anything crash yet. So why is it still filtered? Well, since Gentoo actually lets users set their own CFLAGS, a bunch of crazy idiots like me decide to go through and turn on whatever they can turn on for kicks, speed and glory, and then complain and file bugs when stuff breaks. They honestly don't want to deal with that, and I can't blame them. So their general policy (as far as I understand it) is to ignore any issues where people have something other than "-march=native -O2" set. And I can't fault them there, especially when most of them are volunteers, and there are lots of idiots who want to muck with CFLAGS that break things. And granted, you can't test every gcc release to see if it stops some random, unpredictable crash. But I've seen things like this come up often enough that I wanna put in my two cents.

First off, a lot of these issues come up because there's really no way to set package specific CFLAGS. The people who are seriously about optimizing, want to set different weird flags for each program, or at very least set all their weird crazy flags, and then filter them for package they know break. (Like this guy...) Yes, I know a couple people have hacked up their own techniques to do this, but they're not integrated into Portage very well, and they're all shunned by the developers. Personally, I don't think this problem is going away. There's always going to be people (crazy or guru) who want to set custom flags for custom packages, to fix known bugs if nothing else. I was once unable to compile a working gmp with my arch setting because of some weird bug, and at that point all you can do is manually adjust the flags yourself to emerge gmp, and then put them back afterwards. Granted that works, but it kinda sucks. And the more packages you want to do that with the more it sucks. I think they should just stick a package.cflags into portage, the same way package.use works now, and let people trash their system if they're retarded. That also has the advantage of letting them put all the cflag filtering that they do into one file, the way they manage package.mask.

But what's a sensible cflags policy? Well, let me describe the imaginary, ideal world first. Ideally, there's really only 3 kinds of optimizations (for whatever you're looking for, speed, size, etc). Optimizations that always work, optimizations that sometimes work, and optimizations that make assumptions in order to work.

Things that always work, are reasonable for any user to turn on, and make their own cost-benefit determination of compile-time/running-time/memory. Setting gcc -O2 turns on more or less things that always work, and anything that breaks with these settings is a bug.

Optimizations that sometimes work, include things like loop-unrolling, and is reasonable for someone to turn on if they know what they're doing. These might actually hurt your performance, but they shouldn't actually break anything, or it's a bug. Enabling gcc -O3 turns on some things in this category like inline-functions. In my opinion, users shouldn't be protected from their own ignorance if they're slowing their system down by setting things beyond their competence level. But if these actually break things, you've probably hit a compiler bug, and to me it seems reasonable to filter them out with a warning message.

Finally optimizations that work with assumptions, are the ones you shouldn't set without knowing what you're doing, and should (probably) never set them system wide. These are things like disabling exception handling in C++, or enabling fast-math. It might help under certain circumstances, but it will break if you don't satisfy the assumptions (like you're using exceptions). Anyone using these system-wide is probably stupid, or at very least they're trying to do something weird on their own.

So ideally, you should filter out everything known to break because of a compiler bug, temporarily, until the compiler can be fixed, and then allow them again. But, we don't live in an ideal world. Compiler bugs are hard to find, hard to prove, and hard to fix. And the crazier the optimization, the harder it is to verify.

So what do you do in the real world?

First, if Portage had a package.cflags, anyone who wants to do turn on weird optimizations globally can have their own place to fix it, without filing a bunch of bugs demanding ebuild filtering for everyone. And, you now have a place to store package-specific performance information. If somebody on their own wants to publish a set of "known fastest CFLAGS for this arch" by package, they can.

Second, I think that a policy of "if you're not using -O2, we don't care" is a bit much. I think it's fair to designate -O2 as stable, and -O3 as unstable, but if there are known issues with -O3 where things actually break, it seems like you should go ahead and filter the flags. So in other words, you don't verify that anything works at -O3, but if its clear something doesn't, then go ahead and make it work for everyone else. It's not that hard, and things really shouldn't be breaking at -O3 (on a theoretical level anyway). But any weird flags making assumptions about math or runtime or whatever, it's fair to ignore, because if you were guru enough to have a reason to need that flag on, you can filter it out yourself in a package.cflags, and if you weren't that guru, you should probably turn it off. And I'm only talking genuine, verifiable crashes or errors at -O3. Slowing your system down is your problem, and if -O3 is designated "unstable," I think it's fair to let other people find and verify the problems as they come up.

Of course, I say -O3 "shouldn't" break things, but I really have no idea how risky it is, or how often bugs come up. I have random parts of my system compiled at -O3 and haven't noticed any problems, but I'm just one guy with my weird setup. If Gentoo decided to start ignoring -O3 crashes because there were too many for them to handle otherwise, I can understand. But if not, it seems like they're worth noting and fixing in the ebuilds when they come up. For the record, I think Gentoo does an excellent job on the whole, and it's really probably not worth a developer's time to chase down weird CFLAG bugs people report when they could be fixing other things. I also totally agreed with the rationale behind filtering gtk+, though if the problems are gone, it might be time to remove the filtering (but I can't tell that from my setup).

So when all is said and done, I went and ran gtkperf with just my gtk+ optimized differently. The final results?

No optimizations: Total time: 76.99
-O2: Total time: 74.25
-O3: Total time: 74.02

There's really no difference.