The blob uses *both* nops and (ss). It turns out that in some rare cases
the hardware does take more than 6 cycles, at least for movmsk, but
adding nops is unnecessary. I believe the extra nops are only there due
to the immaturity of the blob's implementation of subgroup ops, so we
don't have to copy them - just handle shared reg producers the same as
SFU instructions.
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/14246>