To be honest, with the amount of bandwidth on today's hardware, uploading changes per surface can still give reasonable performance, and the format differences only really kick in when you encounter a troublesome driver like the one I had. (Aside: developing on bad hardware can sometimes be great for highlighting issues like this that might pass you by otherwise.)
The bigger difficulties come from use of the GL_RGB format and from multiple texture changes.
GL_RGB is bad because no such format actually exists in hardware. Sending data down in GL_RGB means that your driver has to make a copy of it, expand it to four components, and most likely then swizzle it to GL_BGRA. You may as well just use GL_BGRA in your code instead and bypass those steps.
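To illustrate, here's a minimal sketch of the two upload paths; the dimensions and data pointers (width, height, rgb_data, bgra_data) are placeholders, and the second call assumes you've already packed your texels as 4 bytes per pixel in BGRA order:

/* slow path: 3-component data the driver has to copy, expand and swizzle */
glTexImage2D (GL_TEXTURE_2D, 0, GL_RGB8, width, height, 0,
              GL_RGB, GL_UNSIGNED_BYTE, rgb_data);

/* faster path: already 4-component, already in the order the driver wants */
glTexImage2D (GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0,
              GL_BGRA, GL_UNSIGNED_BYTE, bgra_data);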
There's great info about this on OpenGL.org:
http://www.opengl.org/wiki/Common_Mistakes
Multiple texture changes are not that bad in themselves, as texture changes are fast these days. Where the trouble kicks in is that your driver is unable to optimize your vertices into bigger batches for you, so you end up doing lots and lots of itty bitty draw calls instead of very few big ones. Interestingly, with OpenGL this seems to be the case irrespective of whether you use glBegin/glEnd or vertex arrays, so the driver must be doing some behind-the-scenes optimization of its own.
Sorting surfaces by texture, then by lightmap within that, is the way to go. Building lightmaps in texture order also helps, and increasing the lightmap size to 512x512 (so you get more surfs per lightmap, and a better chance that all surfs with the same texture also have the same lightmap) improves things again.
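For what it's worth, the sort itself is just a two-key comparison. Here's a rough sketch using a hypothetical surface struct; the names (msurface_t, texnum, lightmapnum) are stand-ins for whatever your engine actually calls them:

typedef struct msurface_s
{
    int texnum;        /* GL texture object for the diffuse texture */
    int lightmapnum;   /* which lightmap this surface was packed into */
    /* verts, etc... */
} msurface_t;

static int SortSurfaces (const void *a, const void *b)
{
    const msurface_t *s1 = *(const msurface_t **) a;
    const msurface_t *s2 = *(const msurface_t **) b;

    /* primary key: texture; secondary key: lightmap within that texture */
    if (s1->texnum != s2->texnum)
        return s1->texnum - s2->texnum;

    return s1->lightmapnum - s2->lightmapnum;
}

/* then something like:
   qsort (visible_surfs, num_visible, sizeof (msurface_t *), SortSurfaces); */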
Do a single bulk upload of all modified lightmaps after the sorting pass but before the drawing pass. Then draw something else, or do some other CPU work, before you draw the lightmapped surfaces so that the glTexSubImage2D calls have time to finish updating the textures before you need them, and you're in business.
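Something along these lines, assuming some hypothetical per-lightmap bookkeeping (a modified flag and a dirty region) and a LIGHTMAP_SIZE of 512:

for (i = 0; i < numlightmaps; i++)
{
    if (!lightmaps[i].modified)
        continue;

    glBindTexture (GL_TEXTURE_2D, lightmaps[i].texnum);

    /* upload just the rows that changed this frame */
    glTexSubImage2D (GL_TEXTURE_2D, 0,
                     0, lightmaps[i].dirty_top,
                     LIGHTMAP_SIZE, lightmaps[i].dirty_height,
                     GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV,
                     lightmaps[i].data + lightmaps[i].dirty_top * LIGHTMAP_SIZE * 4);

    lightmaps[i].modified = 0;
}

/* draw the sky, MDLs or whatever else is handy here... */

/* ...and only then draw the sorted, lightmapped world surfaces */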
What really caught me by surprise was GL_UNSIGNED_BYTE versus GL_UNSIGNED_INT_8_8_8_8_REV. I'm guessing that on my bad driver GL_UNSIGNED_INT_8_8_8_8_REV is giving it a hint that "this data is already in the format you like best, so there's no need to pull it back to system memory (or whatever it is you're doing) and have your evil way with it there". It was faster by a factor of 30 on this driver, and it marginally edged out GL_UNSIGNED_BYTE on other, better drivers.
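For the record, the two calls look identical apart from the type parameter; lm_data here is assumed to be 32-bit BGRA texels either way:

/* texels treated as 4 separate bytes */
glTexSubImage2D (GL_TEXTURE_2D, 0, 0, 0, w, h,
                 GL_BGRA, GL_UNSIGNED_BYTE, lm_data);

/* texels treated as one packed 32-bit value each - the combination that
   seems to tell the driver "no conversion needed" */
glTexSubImage2D (GL_TEXTURE_2D, 0, 0, 0, w, h,
                 GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, lm_data);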
GL_BGRA, on the other hand, was only about twice as fast as GL_RGBA: a smaller gain, but still a significant one.
Of course, the formats that were fastest on my driver and my platform aren't necessarily going to be the fastest elsewhere, which is why it's important to check a few of them.
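A crude way of checking is to time a burst of uploads for each combination, with everything flushed before and after; the timer, iteration count and test data here are all placeholders, so treat this as a sketch rather than a proper benchmark:

#include <time.h>

static double TimeUploads (GLenum format, GLenum type, const void *data,
                           int w, int h, int iterations)
{
    clock_t start, end;
    int i;

    glFinish ();   /* make sure nothing earlier is still pending */
    start = clock ();

    for (i = 0; i < iterations; i++)
        glTexSubImage2D (GL_TEXTURE_2D, 0, 0, 0, w, h, format, type, data);

    glFinish ();   /* wait for the uploads to actually complete */
    end = clock ();

    return (double) (end - start) / CLOCKS_PER_SEC;
}

/* e.g. compare:
   TimeUploads (GL_RGBA, GL_UNSIGNED_BYTE, data, 512, 512, 1000);
   TimeUploads (GL_BGRA, GL_UNSIGNED_BYTE, data, 512, 512, 1000);
   TimeUploads (GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, data, 512, 512, 1000); */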
The end result is able to handle scenes where wpoly counts go into the thousands, and where almost every surface has an animating lightmap, without dropping below 72 FPS. Which is nice.