When you say that the results can change significantly --
I'm going to discount individual game patches - those are the developer's responsibility to produce and out of the hands of the GPU manufacturer or the reviewer. And if the benchmarks you're using are immature enough that early-release patches are significantly swinging performance, that's probably not the best practice for a hardware review; those titles belong in tech reviews of the game itself.
Driver updates - I'd say those are important, but I'd also say that if a product launches Day 1 and a later driver update changes performance by more than a single-digit percentage, that's on the card manufacturer - they should do a better job on their drivers up front if they want to put their best foot forward.
I expect things to drift a bit - if a review states that Game XYZ got 98.2 FPS on the test rig, I'm not going to cry foul because XYZ on my rig only gets 97.4 FPS. Some variability is to be expected no matter what.
So the real question is: what's acceptable in terms of drift?
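To put rough numbers on what I mean, here's a minimal sketch of that check - the 5% tolerance is just a placeholder I picked, not any kind of industry standard:

```python
def drift_percent(review_fps: float, observed_fps: float) -> float:
    """Relative drift between a review's reported FPS and a later measurement."""
    return abs(observed_fps - review_fps) / review_fps * 100.0

# The 98.2 -> 97.4 FPS example above works out to well under 1% drift.
drift = drift_percent(98.2, 97.4)   # ~0.81%
ACCEPTABLE_DRIFT_PCT = 5.0          # hypothetical tolerance, chosen for illustration
print(f"drift: {drift:.2f}% -> {'fine' if drift <= ACCEPTABLE_DRIFT_PCT else 'worth flagging'}")
```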
I lament that we do not have that database to look things up. I understand things drift, and I do not expect reviews to be exact when presenting benchmarks, but I do expect them to be representative of the product at the time of the review, consistent between products, and relative to other products. In that vein, I've always thought a reference database for quickly inferring relative performance would be really, really handy. Right now I've got... UserBenchmark, and that's about it. That is admittedly not the best comparison to lean on, but it's just about the only tool I have left if I want to compare any two arbitrary pieces of hardware, particularly across generations.
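Something even as simple as the sketch below, filled in with real, consistently collected numbers, would cover most of what I want - the cards and FPS figures here are made-up placeholders, not real measurements:

```python
# A toy version of the lookup I'd want: one common benchmark result per card,
# with relative performance derived from the ratio of the two entries.
REFERENCE_FPS = {
    "card_a": 142.0,
    "card_b": 97.0,
    "card_c": 61.0,
}

def relative_performance(card: str, baseline: str) -> float:
    """How much faster (or slower) `card` is than `baseline` on the common benchmark."""
    return REFERENCE_FPS[card] / REFERENCE_FPS[baseline]

print(f"card_a vs card_c: {relative_performance('card_a', 'card_c'):.2f}x")
```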