aco: Initial commit of independent AMD compiler

ACO (short for AMD Compiler) is a new compiler backend with the goal of
replacing LLVM for Radeon hardware in the RADV driver.

ACO currently supports only VS, PS and CS on VI and Vega.
Because of unmerged NIR changes, some optimizations are still missing,
which may reduce performance.

Full commit history can be found at
https://github.com/daniel-schuermann/mesa/commits/backend

Co-authored-by: Daniel Schürmann <daniel@schuermann.dev>
Co-authored-by: Rhys Perry <pendingchaos02@gmail.com>
Co-authored-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Co-authored-by: Connor Abbott <cwabbott0@gmail.com>
Co-authored-by: Michael Schellenberger Costa <mschellenbergercosta@googlemail.com>
Co-authored-by: Timur Kristóf <timur.kristof@gmail.com>

Acked-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Acked-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Daniel Schürmann 2019-09-17 13:22:17 +02:00
parent 99cbec0a5f
commit 93c8ebfa78
31 changed files with 25572 additions and 0 deletions

src/amd/compiler/README Normal file
@@ -0,0 +1,87 @@
# Unofficial GCN/RDNA ISA reference errata
## v_sad_u32
The Vega ISA reference describes its behaviour as:
```
D.u = abs(S0.i - S1.i) + S2.u.
```
This is incorrect. The actual behaviour is what is written in the GCN3 reference
guide:
```
ABS_DIFF (A,B) = (A>B) ? (A-B) : (B-A)
D.u = ABS_DIFF (S0.u,S1.u) + S2.u
```
The instruction doesn't subtract S0 from S1 and take the absolute value (the
_signed_ distance); it uses the _unsigned_ distance between the operands. So
`v_sad_u32(-5, 0, 0)` would return `4294967291` (`-5` interpreted as unsigned),
not `5`.
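For clarity, a minimal C model of the behaviour described above (a sketch, not
driver code):
```
#include <stdint.h>

/* models v_sad_u32 per the GCN3 pseudocode: the *unsigned* distance
 * between S0 and S1, plus S2 */
uint32_t v_sad_u32(uint32_t s0, uint32_t s1, uint32_t s2)
{
    uint32_t abs_diff = s0 > s1 ? s0 - s1 : s1 - s0;
    return abs_diff + s2;
}

/* v_sad_u32(-5, 0, 0): -5 wraps to 4294967291, so the result is
 * 4294967291, not 5 */
```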
## s_bfe_*
Both the Vega and GCN3 ISA references state that these instructions don't write
SCC. They do.
## v_bcnt_u32_b32
The Vega ISA reference describes its behaviour as:
```
D.u = 0;
for i in 0 ... 31 do
D.u += (S0.u[i] == 1 ? 1 : 0);
endfor.
```
This is incorrect. The actual behaviour (and number of operands) is what
is written in the GCN3 reference guide:
```
D.u = CountOneBits(S0.u) + S1.u.
```
## SMEM stores
The Vega ISA reference doesn't state this (or doesn't make it clear), but
the offset for SMEM stores must be in m0 if IMM == 0.
The RDNA ISA doesn't mention SMEM stores at all, but they seem to be supported
by the chip and are present in LLVM. However, AMD developers highly recommend
avoiding these instructions.
## SMEM atomics
RDNA ISA: as with SMEM stores, the ISA pretends these instructions don't
exist, but they are present in LLVM.
## VMEM stores
All reference guides say (under "Vector Memory Instruction Data Dependencies"):
> When a VM instruction is issued, the address is immediately read out of VGPRs
> and sent to the texture cache. Any texture or buffer resources and samplers
> are also sent immediately. However, write-data is not immediately sent to the
> texture cache.
Reading that, one might think that waitcnts need to be added when writing to
the registers used for a VMEM store's data. Experimentation has shown that this
does not seem to be the case on GFX8 and GFX9 (GFX6 and GFX7 are untested). It
also seems unlikely, since NOPs are apparently needed in a subset of these
situations.
## MIMG opcodes on GFX8/GCN3
The `image_atomic_{swap,cmpswap,add,sub}` opcodes in the GCN3 ISA reference
guide are incorrect. The Vega ISA reference guide has the correct ones.
## Legacy instructions
Some instructions have a `_LEGACY` variant which implements "DX9 rules", in which
zero "wins" in multiplications, i.e. `0.0*x` is always `0.0`. The Vega ISA
mentions `V_MAC_LEGACY_F32`, but this instruction does not actually exist on Vega.
# Hardware Bugs
## SMEM corrupts VCCZ on SI/CI
https://github.com/llvm/llvm-project/blob/acb089e12ae48b82c0b05c42326196a030df9b82/llvm/lib/Target/AMDGPU/SIInsertWaits.cpp#L580-L616
After issuing an SMEM instruction, we need to wait for it to finish and then
write to VCC (for example, `s_mov_b64 vcc, vcc`) to correct VCCZ.
Currently, we don't do this.
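A sketch of the needed workaround (modelled on what the LLVM pass linked above
emits; register numbers and the label are hypothetical):
```
s_load_dword s0, s[8:9], 0x0
s_waitcnt lgkmcnt(0)        ; wait for outstanding SMEM instructions
s_mov_b64 vcc, vcc          ; rewriting vcc restores a correct vccz
s_cbranch_vccz .Ltarget     ; vccz is now safe to use
```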

src/amd/compiler/aco_assembler.cpp Normal file
@@ -0,0 +1,497 @@
#include <map>
#include "aco_ir.h"
#include "common/sid.h"
namespace aco {
struct asm_context {
Program *program;
enum chip_class chip_class;
std::map<int, SOPP_instruction*> branches;
std::vector<unsigned> constaddrs;
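/* per-generation table mapping aco_opcode to the hardware encoding;
 * entries of -1 mark opcodes that don't exist on the target chip */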
const int16_t* opcode;
// TODO: keep track of branch instructions referring blocks
// and, when emitting the block, correct the offset in instr
asm_context(Program* program) : program(program), chip_class(program->chip_class) {
if (chip_class <= GFX9)
opcode = &instr_info.opcode_gfx9[0];
}
};
void emit_instruction(asm_context& ctx, std::vector<uint32_t>& out, Instruction* instr)
{
uint32_t instr_offset = out.size() * 4u;
/* lower remaining pseudo-instructions */
if (instr->opcode == aco_opcode::p_constaddr) {
unsigned dest = instr->definitions[0].physReg();
unsigned offset = instr->operands[0].constantValue();
/* s_getpc_b64 dest[0:1] */
uint32_t encoding = (0b101111101 << 23);
uint32_t opcode = ctx.opcode[(int)aco_opcode::s_getpc_b64];
if (opcode >= 55 && ctx.chip_class <= GFX9) {
assert(ctx.chip_class == GFX9 && opcode < 60);
opcode = opcode - 4;
}
encoding |= dest << 16;
encoding |= opcode << 8;
out.push_back(encoding);
/* s_add_u32 dest[0], dest[0], ... */
encoding = (0b10 << 30);
encoding |= ctx.opcode[(int)aco_opcode::s_add_u32] << 23;
encoding |= dest << 16;
encoding |= dest;
encoding |= 255 << 8;
out.push_back(encoding);
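/* Record the literal's index: it currently holds the offset minus the PC
 * after s_getpc_b64; fix_constaddrs() later adds the total code size so it
 * ends up pointing into the constant data appended behind the code. */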
ctx.constaddrs.push_back(out.size());
out.push_back(-(instr_offset + 4) + offset);
/* s_addc_u32 dest[1], dest[1], 0 */
encoding = (0b10 << 30);
encoding |= ctx.opcode[(int)aco_opcode::s_addc_u32] << 23;
encoding |= (dest + 1) << 16;
encoding |= dest + 1;
encoding |= 128 << 8;
out.push_back(encoding);
return;
}
uint32_t opcode = ctx.opcode[(int)instr->opcode];
if (opcode == (uint32_t)-1) {
fprintf(stderr, "Unsupported opcode: ");
aco_print_instr(instr, stderr);
abort();
}
switch (instr->format) {
case Format::SOP2: {
uint32_t encoding = (0b10 << 30);
encoding |= opcode << 23;
encoding |= !instr->definitions.empty() ? instr->definitions[0].physReg() << 16 : 0;
encoding |= instr->operands.size() >= 2 ? instr->operands[1].physReg() << 8 : 0;
encoding |= !instr->operands.empty() ? instr->operands[0].physReg() : 0;
out.push_back(encoding);
break;
}
case Format::SOPK: {
uint32_t encoding = (0b1011 << 28);
encoding |= opcode << 23;
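/* SOPK has a single SDST field: use the definition unless it is SCC
 * (e.g. s_cmpk_*), otherwise fall back to the first operand (e.g. s_setreg_b32) */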
encoding |=
!instr->definitions.empty() && !(instr->definitions[0].physReg() == scc) ?
instr->definitions[0].physReg() << 16 :
!instr->operands.empty() && !(instr->operands[0].physReg() == scc) ?
instr->operands[0].physReg() << 16 : 0;
encoding |= static_cast<SOPK_instruction*>(instr)->imm;
out.push_back(encoding);
break;
}
case Format::SOP1: {
uint32_t encoding = (0b101111101 << 23);
if (opcode >= 55 && ctx.chip_class <= GFX9) {
assert(ctx.chip_class == GFX9 && opcode < 60);
opcode = opcode - 4;
}
encoding |= !instr->definitions.empty() ? instr->definitions[0].physReg() << 16 : 0;
encoding |= opcode << 8;
encoding |= !instr->operands.empty() ? instr->operands[0].physReg() : 0;
out.push_back(encoding);
break;
}
case Format::SOPC: {
uint32_t encoding = (0b101111110 << 23);
encoding |= opcode << 16;
encoding |= instr->operands.size() == 2 ? instr->operands[1].physReg() << 8 : 0;
encoding |= !instr->operands.empty() ? instr->operands[0].physReg() : 0;
out.push_back(encoding);
break;
}
case Format::SOPP: {
SOPP_instruction* sopp = static_cast<SOPP_instruction*>(instr);
uint32_t encoding = (0b101111111 << 23);
encoding |= opcode << 16;
encoding |= (uint16_t) sopp->imm;
if (sopp->block != -1)
ctx.branches.insert({out.size(), sopp});
out.push_back(encoding);
break;
}
case Format::SMEM: {
SMEM_instruction* smem = static_cast<SMEM_instruction*>(instr);
uint32_t encoding = (0b110000 << 26);
encoding |= opcode << 18;
if (instr->operands.size() >= 2)
encoding |= instr->operands[1].isConstant() ? 1 << 17 : 0;
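/* SOFFSET-enable: loads take (sbase, offset[, soffset]),
 * stores take (sbase, offset, data[, soffset]) */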
bool soe = instr->operands.size() >= (!instr->definitions.empty() ? 3 : 4);
assert(!soe || ctx.chip_class >= GFX9);
encoding |= soe ? 1 << 14 : 0;
encoding |= smem->glc ? 1 << 16 : 0;
if (!instr->definitions.empty() || instr->operands.size() >= 3)
encoding |= (!instr->definitions.empty() ? instr->definitions[0].physReg() : instr->operands[2].physReg().reg) << 6;
if (instr->operands.size() >= 1)
encoding |= instr->operands[0].physReg() >> 1;
out.push_back(encoding);
encoding = 0;
if (instr->operands.size() >= 2)
encoding |= instr->operands[1].isConstant() ? instr->operands[1].constantValue() : instr->operands[1].physReg().reg;
encoding |= soe ? instr->operands.back().physReg() << 25 : 0;
out.push_back(encoding);
return;
}
case Format::VOP2: {
uint32_t encoding = 0;
encoding |= opcode << 25;
encoding |= (0xFF & instr->definitions[0].physReg().reg) << 17;
encoding |= (0xFF & instr->operands[1].physReg().reg) << 9;
encoding |= instr->operands[0].physReg().reg;
out.push_back(encoding);
break;
}
case Format::VOP1: {
uint32_t encoding = (0b0111111 << 25);
encoding |= (0xFF & instr->definitions[0].physReg().reg) << 17;
encoding |= opcode << 9;
encoding |= instr->operands[0].physReg().reg;
out.push_back(encoding);
break;
}
case Format::VOPC: {
uint32_t encoding = (0b0111110 << 25);
encoding |= opcode << 17;
encoding |= (0xFF & instr->operands[1].physReg().reg) << 9;
encoding |= instr->operands[0].physReg().reg;
out.push_back(encoding);
break;
}
case Format::VINTRP: {
Interp_instruction* interp = static_cast<Interp_instruction*>(instr);
uint32_t encoding = (0b110101 << 26);
encoding |= (0xFF & instr->definitions[0].physReg().reg) << 18;
encoding |= opcode << 16;
encoding |= interp->attribute << 10;
encoding |= interp->component << 8;
if (instr->opcode == aco_opcode::v_interp_mov_f32)
encoding |= (0x3 & instr->operands[0].constantValue());
else
encoding |= (0xFF & instr->operands[0].physReg().reg);
out.push_back(encoding);
break;
}
case Format::DS: {
DS_instruction* ds = static_cast<DS_instruction*>(instr);
uint32_t encoding = (0b110110 << 26);
encoding |= opcode << 17;
encoding |= (ds->gds ? 1 : 0) << 16;
encoding |= ((0xFF & ds->offset1) << 8);
encoding |= (0xFFFF & ds->offset0);
out.push_back(encoding);
encoding = 0;
unsigned reg = !instr->definitions.empty() ? instr->definitions[0].physReg() : 0;
encoding |= (0xFF & reg) << 24;
reg = instr->operands.size() >= 3 && !(instr->operands[2].physReg() == m0) ? instr->operands[2].physReg() : 0;
encoding |= (0xFF & reg) << 16;
reg = instr->operands.size() >= 2 && !(instr->operands[1].physReg() == m0) ? instr->operands[1].physReg() : 0;
encoding |= (0xFF & reg) << 8;
encoding |= (0xFF & instr->operands[0].physReg().reg);
out.push_back(encoding);
break;
}
case Format::MUBUF: {
MUBUF_instruction* mubuf = static_cast<MUBUF_instruction*>(instr);
uint32_t encoding = (0b111000 << 26);
encoding |= opcode << 18;
encoding |= (mubuf->slc ? 1 : 0) << 17;
encoding |= (mubuf->lds ? 1 : 0) << 16;
encoding |= (mubuf->glc ? 1 : 0) << 14;
encoding |= (mubuf->idxen ? 1 : 0) << 13;
encoding |= (mubuf->offen ? 1 : 0) << 12;
encoding |= 0x0FFF & mubuf->offset;
out.push_back(encoding);
encoding = 0;
encoding |= instr->operands[2].physReg() << 24;
encoding |= (mubuf->tfe ? 1 : 0) << 23;
encoding |= (instr->operands[1].physReg() >> 2) << 16;
unsigned reg = instr->operands.size() > 3 ? instr->operands[3].physReg() : instr->definitions[0].physReg().reg;
encoding |= (0xFF & reg) << 8;
encoding |= (0xFF & instr->operands[0].physReg().reg);
out.push_back(encoding);
break;
}
case Format::MTBUF: {
MTBUF_instruction* mtbuf = static_cast<MTBUF_instruction*>(instr);
uint32_t encoding = (0b111010 << 26);
encoding |= opcode << 15;
encoding |= (mtbuf->glc ? 1 : 0) << 14;
encoding |= (mtbuf->idxen ? 1 : 0) << 13;
encoding |= (mtbuf->offen ? 1 : 0) << 12;
encoding |= 0x0FFF & mtbuf->offset;
encoding |= (0xF & mtbuf->dfmt) << 19;
encoding |= (0x7 & mtbuf->nfmt) << 23;
out.push_back(encoding);
encoding = 0;
encoding |= instr->operands[2].physReg().reg << 24;
encoding |= (mtbuf->tfe ? 1 : 0) << 23;
encoding |= (mtbuf->slc ? 1 : 0) << 22;
encoding |= (instr->operands[1].physReg().reg >> 2) << 16;
unsigned reg = instr->operands.size() > 3 ? instr->operands[3].physReg().reg : instr->definitions[0].physReg().reg;
encoding |= (0xFF & reg) << 8;
encoding |= (0xFF & instr->operands[0].physReg().reg);
out.push_back(encoding);
break;
}
case Format::MIMG: {
MIMG_instruction* mimg = static_cast<MIMG_instruction*>(instr);
uint32_t encoding = (0b111100 << 26);
encoding |= mimg->slc ? 1 << 25 : 0;
encoding |= opcode << 18;
encoding |= mimg->lwe ? 1 << 17 : 0;
encoding |= mimg->tfe ? 1 << 16 : 0;
encoding |= mimg->r128 ? 1 << 15 : 0;
encoding |= mimg->da ? 1 << 14 : 0;
encoding |= mimg->glc ? 1 << 13 : 0;
encoding |= mimg->unrm ? 1 << 12 : 0;
encoding |= (0xF & mimg->dmask) << 8;
out.push_back(encoding);
encoding = (0xFF & instr->operands[0].physReg().reg); /* VADDR */
if (!instr->definitions.empty()) {
encoding |= (0xFF & instr->definitions[0].physReg().reg) << 8; /* VDATA */
} else if (instr->operands.size() == 4) {
encoding |= (0xFF & instr->operands[3].physReg().reg) << 8; /* VDATA */
}
encoding |= (0x1F & (instr->operands[1].physReg() >> 2)) << 16; /* T# (resource) */
if (instr->operands.size() > 2)
encoding |= (0x1F & (instr->operands[2].physReg() >> 2)) << 21; /* sampler */
// TODO VEGA: D16
out.push_back(encoding);
break;
}
case Format::FLAT:
case Format::SCRATCH:
case Format::GLOBAL: {
FLAT_instruction *flat = static_cast<FLAT_instruction*>(instr);
uint32_t encoding = (0b110111 << 26);
encoding |= opcode << 18;
encoding |= flat->offset & 0x1fff;
if (instr->format == Format::SCRATCH)
encoding |= 1 << 14;
else if (instr->format == Format::GLOBAL)
encoding |= 2 << 14;
encoding |= flat->lds ? 1 << 13 : 0;
encoding |= flat->glc ? 1 << 16 : 0;
encoding |= flat->slc ? 1 << 17 : 0;
out.push_back(encoding);
encoding = (0xFF & instr->operands[0].physReg().reg);
if (!instr->definitions.empty())
encoding |= (0xFF & instr->definitions[0].physReg().reg) << 24;
else
encoding |= (0xFF & instr->operands[2].physReg().reg) << 8;
if (!instr->operands[1].isUndefined()) {
assert(instr->operands[1].physReg() != 0x7f);
assert(instr->format != Format::FLAT);
encoding |= instr->operands[1].physReg() << 16;
} else if (instr->format != Format::FLAT) {
encoding |= 0x7F << 16;
}
encoding |= flat->nv ? 1 << 23 : 0;
out.push_back(encoding);
break;
}
case Format::EXP: {
Export_instruction* exp = static_cast<Export_instruction*>(instr);
uint32_t encoding = (0b110001 << 26);
encoding |= exp->valid_mask ? 0b1 << 12 : 0;
encoding |= exp->done ? 0b1 << 11 : 0;
encoding |= exp->compressed ? 0b1 << 10 : 0;
encoding |= exp->dest << 4;
encoding |= exp->enabled_mask;
out.push_back(encoding);
encoding = 0xFF & exp->operands[0].physReg().reg;
encoding |= (0xFF & exp->operands[1].physReg().reg) << 8;
encoding |= (0xFF & exp->operands[2].physReg().reg) << 16;
encoding |= (0xFF & exp->operands[3].physReg().reg) << 24;
out.push_back(encoding);
break;
}
case Format::PSEUDO:
case Format::PSEUDO_BARRIER:
unreachable("Pseudo instructions should be lowered before assembly.");
default:
if ((uint16_t) instr->format & (uint16_t) Format::VOP3A) {
VOP3A_instruction* vop3 = static_cast<VOP3A_instruction*>(instr);
if ((uint16_t) instr->format & (uint16_t) Format::VOP2)
opcode = opcode + 0x100;
else if ((uint16_t) instr->format & (uint16_t) Format::VOP1)
opcode = opcode + 0x140;
else if ((uint16_t) instr->format & (uint16_t) Format::VOPC)
opcode = opcode + 0x0;
else if ((uint16_t) instr->format & (uint16_t) Format::VINTRP)
opcode = opcode + 0x270;
// TODO: op_sel
uint32_t encoding = (0b110100 << 26);
encoding |= opcode << 16;
encoding |= (vop3->clamp ? 1 : 0) << 15;
for (unsigned i = 0; i < 3; i++)
encoding |= vop3->abs[i] << (8+i);
if (instr->definitions.size() == 2)
encoding |= instr->definitions[1].physReg() << 8;
encoding |= (0xFF & instr->definitions[0].physReg().reg);
out.push_back(encoding);
encoding = 0;
if (instr->opcode == aco_opcode::v_interp_mov_f32) {
encoding = 0x3 & instr->operands[0].constantValue();
} else {
for (unsigned i = 0; i < instr->operands.size(); i++)
encoding |= instr->operands[i].physReg() << (i * 9);
}
encoding |= vop3->omod << 27;
for (unsigned i = 0; i < 3; i++)
encoding |= vop3->neg[i] << (29+i);
out.push_back(encoding);
return;
} else if (instr->isDPP()){
/* first emit the instruction without the DPP operand */
Operand dpp_op = instr->operands[0];
instr->operands[0] = Operand(PhysReg{250}, v1);
instr->format = (Format) ((uint32_t) instr->format & ~(1 << 14));
emit_instruction(ctx, out, instr);
DPP_instruction* dpp = static_cast<DPP_instruction*>(instr);
uint32_t encoding = (0xF & dpp->row_mask) << 28;
encoding |= (0xF & dpp->bank_mask) << 24;
encoding |= dpp->abs[1] << 23;
encoding |= dpp->neg[1] << 22;
encoding |= dpp->abs[0] << 21;
encoding |= dpp->neg[0] << 20;
encoding |= dpp->bound_ctrl << 19;
encoding |= dpp->dpp_ctrl << 8;
encoding |= (0xFF) & dpp_op.physReg().reg;
out.push_back(encoding);
return;
} else {
unreachable("unimplemented instruction format");
}
}
/* append literal dword */
for (const Operand& op : instr->operands) {
if (op.isLiteral()) {
out.push_back(op.constantValue());
break;
}
}
}
void emit_block(asm_context& ctx, std::vector<uint32_t>& out, Block& block)
{
for (aco_ptr<Instruction>& instr : block.instructions) {
#if 0
int start_idx = out.size();
std::cerr << "Encoding:\t" << std::endl;
aco_print_instr(&*instr, stderr);
std::cerr << std::endl;
#endif
emit_instruction(ctx, out, instr.get());
#if 0
for (int i = start_idx; i < out.size(); i++)
std::cerr << "encoding: " << "0x" << std::setfill('0') << std::setw(8) << std::hex << out[i] << std::endl;
#endif
}
}
void fix_exports(asm_context& ctx, std::vector<uint32_t>& out, Program* program)
{
for (int idx = program->blocks.size() - 1; idx >= 0; idx--) {
Block& block = program->blocks[idx];
std::vector<aco_ptr<Instruction>>::reverse_iterator it = block.instructions.rbegin();
bool endBlock = false;
bool exported = false;
while ( it != block.instructions.rend())
{
if ((*it)->format == Format::EXP && endBlock) {
Export_instruction* exp = static_cast<Export_instruction*>((*it).get());
if (program->stage & hw_vs) {
if (exp->dest >= V_008DFC_SQ_EXP_POS && exp->dest <= (V_008DFC_SQ_EXP_POS + 3)) {
exp->done = true;
exported = true;
break;
}
} else {
exp->done = true;
exp->valid_mask = true;
exported = true;
break;
}
} else if ((*it)->definitions.size() && (*it)->definitions[0].physReg() == exec)
break;
else if ((*it)->opcode == aco_opcode::s_endpgm) {
if (endBlock)
break;
endBlock = true;
}
++it;
}
if (!endBlock || exported)
continue;
/* we didn't find an Export instruction and have to insert a null export */
aco_ptr<Export_instruction> exp{create_instruction<Export_instruction>(aco_opcode::exp, Format::EXP, 4, 0)};
for (unsigned i = 0; i < 4; i++)
exp->operands[i] = Operand(v1);
exp->enabled_mask = 0;
exp->compressed = false;
exp->done = true;
exp->valid_mask = program->stage & hw_fs;
if (program->stage & hw_fs)
exp->dest = 9; /* NULL */
else
exp->dest = V_008DFC_SQ_EXP_POS;
/* insert the null export 1 instruction before endpgm */
block.instructions.insert(block.instructions.end() - 1, std::move(exp));
}
}
void fix_branches(asm_context& ctx, std::vector<uint32_t>& out)
{
for (std::pair<int, SOPP_instruction*> branch : ctx.branches)
{
int offset = (int)ctx.program->blocks[branch.second->block].offset - branch.first - 1;
out[branch.first] |= (uint16_t) offset;
}
}
void fix_constaddrs(asm_context& ctx, std::vector<uint32_t>& out)
{
for (unsigned addr : ctx.constaddrs)
out[addr] += out.size() * 4u;
}
unsigned emit_program(Program* program,
std::vector<uint32_t>& code)
{
asm_context ctx(program);
if (program->stage & (hw_vs | hw_fs))
fix_exports(ctx, code, program);
for (Block& block : program->blocks) {
block.offset = code.size();
emit_block(ctx, code, block);
}
fix_branches(ctx, code);
fix_constaddrs(ctx, code);
unsigned constant_data_offset = code.size() * sizeof(uint32_t);
while (program->constant_data.size() % 4u)
program->constant_data.push_back(0);
/* Copy constant data */
code.insert(code.end(), (uint32_t*)program->constant_data.data(),
(uint32_t*)(program->constant_data.data() + program->constant_data.size()));
return constant_data_offset;
}
}
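For context, a hedged sketch of how a caller might drive this file's entry
point (the Program is assumed to be fully lowered; this is not part of the
file above):
```
std::vector<uint32_t> code;
/* assembles all blocks and returns the byte offset of the constant data
 * that is appended behind the instruction stream */
unsigned const_offset = aco::emit_program(program, code);
/* code[0 .. const_offset / 4) holds instructions; constant data follows */
```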

src/amd/compiler/aco_builder_h.py Normal file
@@ -0,0 +1,400 @@
template = """\
/*
* Copyright (c) 2019 Valve Corporation
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice (including the next
* paragraph) shall be included in all copies or substantial portions of the
* Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
* THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
* IN THE SOFTWARE.
*
* This file was generated by aco_builder_h.py
*/
#ifndef _ACO_BUILDER_
#define _ACO_BUILDER_
#include "aco_ir.h"
#include "util/u_math.h"
#include "util/bitscan.h"
namespace aco {
enum dpp_ctrl {
_dpp_quad_perm = 0x000,
_dpp_row_sl = 0x100,
_dpp_row_sr = 0x110,
_dpp_row_rr = 0x120,
dpp_wf_sl1 = 0x130,
dpp_wf_rl1 = 0x134,
dpp_wf_sr1 = 0x138,
dpp_wf_rr1 = 0x13C,
dpp_row_mirror = 0x140,
dpp_row_half_mirror = 0x141,
dpp_row_bcast15 = 0x142,
dpp_row_bcast31 = 0x143
};
inline dpp_ctrl
dpp_quad_perm(unsigned lane0, unsigned lane1, unsigned lane2, unsigned lane3)
{
assert(lane0 < 4 && lane1 < 4 && lane2 < 4 && lane3 < 4);
return (dpp_ctrl)(lane0 | (lane1 << 2) | (lane2 << 4) | (lane3 << 6));
}
inline dpp_ctrl
dpp_row_sl(unsigned amount)
{
assert(amount > 0 && amount < 16);
return (dpp_ctrl)(((unsigned) _dpp_row_sl) | amount);
}
inline dpp_ctrl
dpp_row_sr(unsigned amount)
{
assert(amount > 0 && amount < 16);
return (dpp_ctrl)(((unsigned) _dpp_row_sr) | amount);
}
inline unsigned
ds_pattern_bitmode(unsigned and_mask, unsigned or_mask, unsigned xor_mask)
{
assert(and_mask < 32 && or_mask < 32 && xor_mask < 32);
return and_mask | (or_mask << 5) | (xor_mask << 10);
}
aco_ptr<Instruction> create_s_mov(Definition dst, Operand src);
class Builder {
public:
struct Result {
Instruction *instr;
Result(Instruction *instr) : instr(instr) {}
operator Instruction *() const {
return instr;
}
operator Temp() const {
return instr->definitions[0].getTemp();
}
operator Operand() const {
return Operand((Temp)*this);
}
Definition& def(unsigned index) const {
return instr->definitions[index];
}
aco_ptr<Instruction> get_ptr() const {
return aco_ptr<Instruction>(instr);
}
};
struct Op {
Operand op;
Op(Temp tmp) : op(tmp) {}
Op(Operand op_) : op(op_) {}
Op(Result res) : op((Temp)res) {}
};
Program *program;
bool use_iterator;
union {
bool forwards; //when use_iterator == true
bool start; //when use_iterator == false
};
std::vector<aco_ptr<Instruction>> *instructions;
std::vector<aco_ptr<Instruction>>::iterator it;
Builder(Program *pgm) : program(pgm), use_iterator(false), start(false), instructions(NULL) {}
Builder(Program *pgm, Block *block) : program(pgm), use_iterator(false), start(false), instructions(&block->instructions) {}
Builder(Program *pgm, std::vector<aco_ptr<Instruction>> *instrs) : program(pgm), use_iterator(false), start(false), instructions(instrs) {}
void moveEnd(Block *block) {
instructions = &block->instructions;
}
void reset() {
use_iterator = false;
start = false;
instructions = NULL;
}
void reset(Block *block) {
use_iterator = false;
start = false;
instructions = &block->instructions;
}
void reset(std::vector<aco_ptr<Instruction>> *instrs) {
use_iterator = false;
start = false;
instructions = instrs;
}
Result insert(aco_ptr<Instruction> instr) {
Instruction *instr_ptr = instr.get();
if (instructions) {
if (use_iterator) {
it = instructions->emplace(it, std::move(instr));
if (forwards)
it = std::next(it);
} else if (!start) {
instructions->emplace_back(std::move(instr));
} else {
instructions->emplace(instructions->begin(), std::move(instr));
}
}
return Result(instr_ptr);
}
Result insert(Instruction* instr) {
if (instructions) {
if (use_iterator) {
it = instructions->emplace(it, aco_ptr<Instruction>(instr));
if (forwards)
it = std::next(it);
} else if (!start) {
instructions->emplace_back(aco_ptr<Instruction>(instr));
} else {
instructions->emplace(instructions->begin(), aco_ptr<Instruction>(instr));
}
}
return Result(instr);
}
Temp tmp(RegClass rc) {
return (Temp){program->allocateId(), rc};
}
Temp tmp(RegType type, unsigned size) {
return (Temp){program->allocateId(), RegClass(type, size)};
}
Definition def(RegClass rc) {
return Definition((Temp){program->allocateId(), rc});
}
Definition def(RegType type, unsigned size) {
return Definition((Temp){program->allocateId(), RegClass(type, size)});
}
Definition def(RegClass rc, PhysReg reg) {
return Definition(program->allocateId(), reg, rc);
}
% for fixed in ['m0', 'vcc', 'exec', 'scc']:
Operand ${fixed}(Temp tmp) {
Operand op(tmp);
op.setFixed(aco::${fixed});
return op;
}
Definition ${fixed}(Definition def) {
def.setFixed(aco::${fixed});
return def;
}
Definition hint_${fixed}(Definition def) {
def.setHint(aco::${fixed});
return def;
}
% endfor
/* hand-written helpers */
Temp as_uniform(Op op)
{
assert(op.op.isTemp());
if (op.op.getTemp().type() == RegType::vgpr)
return pseudo(aco_opcode::p_as_uniform, def(RegType::sgpr, op.op.size()), op);
else
return op.op.getTemp();
}
Result v_mul_imm(Definition dst, Temp tmp, uint32_t imm, bool bits24=false)
{
assert(tmp.type() == RegType::vgpr);
if (imm == 0) {
return vop1(aco_opcode::v_mov_b32, dst, Operand(0u));
} else if (imm == 1) {
return copy(dst, Operand(tmp));
} else if (util_is_power_of_two_or_zero(imm)) {
return vop2(aco_opcode::v_lshlrev_b32, dst, Operand((uint32_t)ffs(imm) - 1u), tmp);
} else if (bits24) {
return vop2(aco_opcode::v_mul_u32_u24, dst, Operand(imm), tmp);
} else {
Temp imm_tmp = copy(def(v1), Operand(imm));
return vop3(aco_opcode::v_mul_lo_u32, dst, imm_tmp, tmp);
}
}
Result v_mul24_imm(Definition dst, Temp tmp, uint32_t imm)
{
return v_mul_imm(dst, tmp, imm, true);
}
Result copy(Definition dst, Op op_) {
Operand op = op_.op;
if (dst.regClass() == s1 && op.size() == 1 && op.isLiteral()) {
uint32_t imm = op.constantValue();
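/* prefer encodings that avoid a 32-bit literal dword: s_movk_i32 for
 * sign-extendable 16-bit values, s_brev_b32 when the bit-reversed value is
 * an inline constant, s_bfm_b32 for contiguous bit masks */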
if (imm >= 0xffff8000 || imm <= 0x7fff) {
return sopk(aco_opcode::s_movk_i32, dst, imm & 0xFFFFu);
} else if (util_bitreverse(imm) <= 64 || util_bitreverse(imm) >= 0xFFFFFFF0) {
uint32_t rev = util_bitreverse(imm);
return dst.regClass() == v1 ?
vop1(aco_opcode::v_bfrev_b32, dst, Operand(rev)) :
sop1(aco_opcode::s_brev_b32, dst, Operand(rev));
} else if (imm != 0) {
unsigned start = (ffs(imm) - 1) & 0x1f;
unsigned size = util_bitcount(imm) & 0x1f;
if ((((1u << size) - 1u) << start) == imm)
return sop2(aco_opcode::s_bfm_b32, dst, Operand(size), Operand(start));
}
}
if (dst.regClass() == s2) {
return sop1(aco_opcode::s_mov_b64, dst, op);
} else if (op.size() > 1) {
return pseudo(aco_opcode::p_create_vector, dst, op);
} else if (dst.regClass() == v1 || dst.regClass() == v1.as_linear()) {
return vop1(aco_opcode::v_mov_b32, dst, op);
} else {
assert(dst.regClass() == s1);
return sop1(aco_opcode::s_mov_b32, dst, op);
}
}
Result vadd32(Definition dst, Op a, Op b, bool carry_out=false, Op carry_in=Op(Operand(s2))) {
if (!b.op.isTemp() || b.op.regClass().type() != RegType::vgpr)
std::swap(a, b);
assert(b.op.isTemp() && b.op.regClass().type() == RegType::vgpr);
if (!carry_in.op.isUndefined())
return vop2(aco_opcode::v_addc_co_u32, Definition(dst), hint_vcc(def(s2)), a, b, carry_in);
else if (program->chip_class < GFX9 || carry_out)
return vop2(aco_opcode::v_add_co_u32, Definition(dst), hint_vcc(def(s2)), a, b);
else
return vop2(aco_opcode::v_add_u32, Definition(dst), a, b);
}
Result vsub32(Definition dst, Op a, Op b, bool carry_out=false, Op borrow=Op(Operand(s2)))
{
if (!borrow.op.isUndefined() || program->chip_class < GFX9)
carry_out = true;
bool reverse = !b.op.isTemp() || b.op.regClass().type() != RegType::vgpr;
if (reverse)
std::swap(a, b);
assert(b.op.isTemp() && b.op.regClass().type() == RegType::vgpr);
aco_opcode op;
Temp carry;
if (carry_out) {
carry = tmp(s2);
if (borrow.op.isUndefined())
op = reverse ? aco_opcode::v_subrev_co_u32 : aco_opcode::v_sub_co_u32;
else
op = reverse ? aco_opcode::v_subbrev_co_u32 : aco_opcode::v_subb_co_u32;
} else {
op = reverse ? aco_opcode::v_subrev_u32 : aco_opcode::v_sub_u32;
}
int num_ops = borrow.op.isUndefined() ? 2 : 3;
int num_defs = carry_out ? 2 : 1;
aco_ptr<Instruction> sub{create_instruction<VOP2_instruction>(op, Format::VOP2, num_ops, num_defs)};
sub->operands[0] = a.op;
sub->operands[1] = b.op;
if (!borrow.op.isUndefined())
sub->operands[2] = borrow.op;
sub->definitions[0] = dst;
if (carry_out) {
sub->definitions[1] = Definition(carry);
sub->definitions[1].setHint(aco::vcc);
}
return insert(std::move(sub));
}
<%
import itertools
formats = [("pseudo", [Format.PSEUDO], 'Pseudo_instruction', list(itertools.product(range(5), range(5))) + [(8, 1), (1, 8)]),
("sop1", [Format.SOP1], 'SOP1_instruction', [(1, 1), (2, 1), (3, 2)]),
("sop2", [Format.SOP2], 'SOP2_instruction', itertools.product([1, 2], [2, 3])),
("sopk", [Format.SOPK], 'SOPK_instruction', itertools.product([0, 1, 2], [0, 1])),
("sopp", [Format.SOPP], 'SOPP_instruction', [(0, 0), (0, 1)]),
("sopc", [Format.SOPC], 'SOPC_instruction', [(1, 2)]),
("smem", [Format.SMEM], 'SMEM_instruction', [(0, 4), (0, 3), (1, 0), (1, 3), (1, 2), (0, 0)]),
("ds", [Format.DS], 'DS_instruction', [(1, 1), (1, 2), (0, 3), (0, 4)]),
("mubuf", [Format.MUBUF], 'MUBUF_instruction', [(0, 4), (1, 3)]),
("mimg", [Format.MIMG], 'MIMG_instruction', [(0, 4), (1, 3), (0, 3), (1, 2)]), #TODO(pendingchaos): less shapes?
("exp", [Format.EXP], 'Export_instruction', [(0, 4)]),
("branch", [Format.PSEUDO_BRANCH], 'Pseudo_branch_instruction', itertools.product([0], [0, 1])),
("barrier", [Format.PSEUDO_BARRIER], 'Pseudo_barrier_instruction', [(0, 0)]),
("reduction", [Format.PSEUDO_REDUCTION], 'Pseudo_reduction_instruction', [(3, 2)]),
("vop1", [Format.VOP1], 'VOP1_instruction', [(1, 1), (2, 2)]),
("vop2", [Format.VOP2], 'VOP2_instruction', itertools.product([1, 2], [2, 3])),
("vopc", [Format.VOPC], 'VOPC_instruction', itertools.product([1, 2], [2])),
("vop3", [Format.VOP3A], 'VOP3A_instruction', [(1, 3), (1, 2), (1, 1), (2, 2)]),
("vintrp", [Format.VINTRP], 'Interp_instruction', [(1, 2), (1, 3)]),
("vop1_dpp", [Format.VOP1, Format.DPP], 'DPP_instruction', [(1, 1)]),
("vop2_dpp", [Format.VOP2, Format.DPP], 'DPP_instruction', itertools.product([1, 2], [2, 3])),
("vopc_dpp", [Format.VOPC, Format.DPP], 'DPP_instruction', itertools.product([1, 2], [2])),
("vop1_e64", [Format.VOP1, Format.VOP3A], 'VOP3A_instruction', itertools.product([1], [1])),
("vop2_e64", [Format.VOP2, Format.VOP3A], 'VOP3A_instruction', itertools.product([1, 2], [2, 3])),
("vopc_e64", [Format.VOPC, Format.VOP3A], 'VOP3A_instruction', itertools.product([1, 2], [2])),
("flat", [Format.FLAT], 'FLAT_instruction', [(0, 3), (1, 2)]),
("global", [Format.GLOBAL], 'FLAT_instruction', [(0, 3), (1, 2)])]
%>\\
% for name, formats, struct, shapes in formats:
% for num_definitions, num_operands in shapes:
<%
args = ['aco_opcode opcode']
for i in range(num_definitions):
args.append('Definition def%d' % i)
for i in range(num_operands):
args.append('Op op%d' % i)
for f in formats:
args += f.get_builder_field_decls()
%>\\
Result ${name}(${', '.join(args)})
{
${struct} *instr = create_instruction<${struct}>(opcode, (Format)(${'|'.join('(int)Format::%s' % f.name for f in formats)}), ${num_operands}, ${num_definitions});
% for i in range(num_definitions):
instr->definitions[${i}] = def${i};
% endfor
% for i in range(num_operands):
instr->operands[${i}] = op${i}.op;
% endfor
% for f in formats:
% for dest, field_name in zip(f.get_builder_field_dests(), f.get_builder_field_names()):
instr->${dest} = ${field_name};
% endfor
% endfor
return insert(instr);
}
% endfor
% endfor
};
}
#endif /* _ACO_BUILDER_ */"""
from aco_opcodes import opcodes, Format
from mako.template import Template
print(Template(template).render(opcodes=opcodes, Format=Format))
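For illustration, a hedged sketch of how the generated helpers are typically
used from a pass (the surrounding Program, Block and the Temps `a` and `b`
are assumed):
```
Builder bld(program, &block);
/* s_add_u32 takes two definitions (result, scc) and two operands */
Temp sum = bld.sop2(aco_opcode::s_add_u32, bld.def(s1), bld.def(s1, scc),
                    Operand(a), Operand(b));
/* copy() picks s_mov/v_mov/s_movk etc. based on destination and value */
bld.copy(bld.def(v1), Operand(42u));
```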

src/amd/compiler/aco_dead_code_analysis.cpp Normal file
@@ -0,0 +1,102 @@
/*
* Copyright © 2019 Valve Corporation
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice (including the next
* paragraph) shall be included in all copies or substantial portions of the
* Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
* THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
* IN THE SOFTWARE.
*
*/
#include "aco_ir.h"
#include <algorithm>
/*
* Implements an analysis pass to determine the number of uses
* for each SSA-definition.
*/
namespace aco {
namespace {
struct dce_ctx {
int current_block;
std::vector<uint16_t> uses;
std::vector<std::vector<bool>> live;
dce_ctx(Program* program) : current_block(program->blocks.size() - 1), uses(program->peekAllocationId())
{
live.reserve(program->blocks.size());
for (Block& block : program->blocks)
live.emplace_back(block.instructions.size());
}
};
void process_block(dce_ctx& ctx, Block& block)
{
std::vector<bool>& live = ctx.live[block.index];
assert(live.size() == block.instructions.size());
bool process_predecessors = false;
for (int idx = block.instructions.size() - 1; idx >= 0; idx--) {
if (live[idx])
continue;
aco_ptr<Instruction>& instr = block.instructions[idx];
const bool is_live = instr->definitions.empty() ||
std::any_of(instr->definitions.begin(), instr->definitions.end(),
[&ctx] (const Definition& def) { return !def.isTemp() || ctx.uses[def.tempId()];});
if (is_live) {
for (const Operand& op : instr->operands) {
if (op.isTemp()) {
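/* a use count going from 0 to 1 can revive a definition in an
 * already-processed predecessor, so those blocks must be visited again */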
if (ctx.uses[op.tempId()] == 0)
process_predecessors = true;
ctx.uses[op.tempId()]++;
}
}
live[idx] = true;
}
}
if (process_predecessors) {
for (unsigned pred_idx : block.linear_preds)
ctx.current_block = std::max(ctx.current_block, (int) pred_idx);
}
}
} /* end namespace */
std::vector<uint16_t> dead_code_analysis(Program *program) {
dce_ctx ctx(program);
while (ctx.current_block >= 0) {
unsigned next_block = ctx.current_block--;
process_block(ctx, program->blocks[next_block]);
}
/* add one use to exec to prevent startpgm from being removed */
aco_ptr<Instruction>& startpgm = program->blocks[0].instructions[0];
assert(startpgm->opcode == aco_opcode::p_startpgm);
ctx.uses[startpgm->definitions.back().tempId()]++;
return ctx.uses;
}
}

src/amd/compiler/aco_dominance.cpp Normal file
@@ -0,0 +1,93 @@
/*
* Copyright © 2018 Valve Corporation
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice (including the next
* paragraph) shall be included in all copies or substantial portions of the
* Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
* THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
* IN THE SOFTWARE.
*
* Authors:
* Daniel Schürmann (daniel.schuermann@campus.tu-berlin.de)
*
*/
#ifndef ACO_DOMINANCE_CPP
#define ACO_DOMINANCE_CPP
#include "aco_ir.h"
/*
* Implements the algorithms for computing the dominator tree from
* "A Simple, Fast Dominance Algorithm" by Cooper, Harvey, and Kennedy.
*
* Different from the paper, our CFG allows us to compute the dominator tree
* in a single pass, as it is guaranteed that the dominating predecessors
* are processed before the current block.
*/
namespace aco {
void dominator_tree(Program* program)
{
program->blocks[0].logical_idom = 0;
program->blocks[0].linear_idom = 0;
for (unsigned i = 1; i < program->blocks.size(); i++) {
Block& block = program->blocks[i];
int new_logical_idom = -1;
int new_linear_idom = -1;
for (unsigned pred_idx : block.logical_preds) {
if ((int) program->blocks[pred_idx].logical_idom == -1)
continue;
if (new_logical_idom == -1) {
new_logical_idom = pred_idx;
continue;
}
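/* intersect(pred_idx, new_logical_idom) from the paper: block indices follow
 * reverse postorder, so repeatedly walk the deeper (higher-index) node up
 * the idom tree until both sides meet */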
while ((int) pred_idx != new_logical_idom) {
if ((int) pred_idx > new_logical_idom)
pred_idx = program->blocks[pred_idx].logical_idom;
if ((int) pred_idx < new_logical_idom)
new_logical_idom = program->blocks[new_logical_idom].logical_idom;
}
}
for (unsigned pred_idx : block.linear_preds) {
if ((int) program->blocks[pred_idx].linear_idom == -1)
continue;
if (new_linear_idom == -1) {
new_linear_idom = pred_idx;
continue;
}
while ((int) pred_idx != new_linear_idom) {
if ((int) pred_idx > new_linear_idom)
pred_idx = program->blocks[pred_idx].linear_idom;
if ((int) pred_idx < new_linear_idom)
new_linear_idom = program->blocks[new_linear_idom].linear_idom;
}
}
block.logical_idom = new_logical_idom;
block.linear_idom = new_linear_idom;
}
}
}
#endif

src/amd/compiler/aco_insert_NOPs.cpp Normal file
@@ -0,0 +1,282 @@
/*
* Copyright © 2019 Valve Corporation
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice (including the next
* paragraph) shall be included in all copies or substantial portions of the
* Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
* THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
* IN THE SOFTWARE.
*
*/
#include "aco_ir.h"
namespace aco {
namespace {
struct NOP_ctx {
/* just initialize these with something less than max NOPs */
int VALU_wrexec = -10;
int VALU_wrvcc = -10;
int VALU_wrsgpr = -10;
enum chip_class chip_class;
unsigned vcc_physical;
NOP_ctx(Program* program) : chip_class(program->chip_class) {
vcc_physical = program->config->num_sgprs - 2;
}
};
bool VALU_writes_sgpr(aco_ptr<Instruction>& instr)
{
if ((uint32_t) instr->format & (uint32_t) Format::VOPC)
return true;
if (instr->isVOP3() && instr->definitions.size() == 2)
return true;
if (instr->opcode == aco_opcode::v_readfirstlane_b32 || instr->opcode == aco_opcode::v_readlane_b32)
return true;
return false;
}
bool regs_intersect(PhysReg a_reg, unsigned a_size, PhysReg b_reg, unsigned b_size)
{
return a_reg > b_reg ?
(a_reg - b_reg < b_size) :
(b_reg - a_reg < a_size);
}
int handle_instruction(NOP_ctx& ctx, aco_ptr<Instruction>& instr,
std::vector<aco_ptr<Instruction>>& old_instructions,
std::vector<aco_ptr<Instruction>>& new_instructions)
{
int new_idx = new_instructions.size();
// TODO: setreg / getreg / m0 writes
// TODO: try to schedule the NOP-causing instruction up to reduce the number of stall cycles
/* break off from prevous SMEM clause if needed */
if (instr->format == Format::SMEM && ctx.chip_class >= GFX8) {
const bool is_store = instr->definitions.empty();
for (int pred_idx = new_idx - 1; pred_idx >= 0; pred_idx--) {
aco_ptr<Instruction>& pred = new_instructions[pred_idx];
if (pred->format != Format::SMEM)
break;
/* Don't allow clauses with store instructions since the clause's
* instructions may use the same address. */
if (is_store || pred->definitions.empty())
return 1;
Definition& instr_def = instr->definitions[0];
Definition& pred_def = pred->definitions[0];
/* ISA reference doesn't say anything about this, but best to be safe */
if (regs_intersect(instr_def.physReg(), instr_def.size(), pred_def.physReg(), pred_def.size()))
return 1;
for (const Operand& op : pred->operands) {
if (op.isConstant() || !op.isFixed())
continue;
if (regs_intersect(instr_def.physReg(), instr_def.size(), op.physReg(), op.size()))
return 1;
}
for (const Operand& op : instr->operands) {
if (op.isConstant() || !op.isFixed())
continue;
if (regs_intersect(pred_def.physReg(), pred_def.size(), op.physReg(), op.size()))
return 1;
}
}
} else if (instr->isVALU() || instr->format == Format::VINTRP) {
int NOPs = 0;
if (instr->isDPP()) {
/* VALU does not forward EXEC to DPP. */
if (ctx.VALU_wrexec + 5 >= new_idx)
NOPs = 5 + ctx.VALU_wrexec - new_idx + 1;
/* VALU DPP reads VGPR written by VALU */
for (int pred_idx = new_idx - 1; pred_idx >= 0 && pred_idx >= new_idx - 2; pred_idx--) {
aco_ptr<Instruction>& pred = new_instructions[pred_idx];
if ((pred->isVALU() || pred->format == Format::VINTRP) &&
!pred->definitions.empty() &&
pred->definitions[0].physReg() == instr->operands[0].physReg()) {
NOPs = std::max(NOPs, 2 + pred_idx - new_idx + 1);
break;
}
}
}
/* SALU writes M0 */
if (instr->format == Format::VINTRP && new_idx > 0 && ctx.chip_class >= GFX9) {
aco_ptr<Instruction>& pred = new_instructions.back();
if (pred->isSALU() &&
!pred->definitions.empty() &&
pred->definitions[0].physReg() == m0)
NOPs = std::max(NOPs, 1);
}
for (const Operand& op : instr->operands) {
/* VALU which uses VCCZ */
if (op.physReg() == PhysReg{251} &&
ctx.VALU_wrvcc + 5 >= new_idx)
NOPs = std::max(NOPs, 5 + ctx.VALU_wrvcc - new_idx + 1);
/* VALU which uses EXECZ */
if (op.physReg() == PhysReg{252} &&
ctx.VALU_wrexec + 5 >= new_idx)
NOPs = std::max(NOPs, 5 + ctx.VALU_wrexec - new_idx + 1);
/* VALU which reads VCC as a constant */
if (ctx.VALU_wrvcc + 1 >= new_idx) {
for (unsigned k = 0; k < op.size(); k++) {
unsigned reg = op.physReg() + k;
if (reg == ctx.vcc_physical || reg == ctx.vcc_physical + 1)
NOPs = std::max(NOPs, 1);
}
}
}
switch (instr->opcode) {
case aco_opcode::v_readlane_b32:
case aco_opcode::v_writelane_b32: {
if (ctx.VALU_wrsgpr + 4 < new_idx)
break;
PhysReg reg = instr->operands[1].physReg();
for (int pred_idx = new_idx - 1; pred_idx >= 0 && pred_idx >= new_idx - 4; pred_idx--) {
aco_ptr<Instruction>& pred = new_instructions[pred_idx];
if (!pred->isVALU() || !VALU_writes_sgpr(pred))
continue;
for (const Definition& def : pred->definitions) {
if (def.physReg() == reg)
NOPs = std::max(NOPs, 4 + pred_idx - new_idx + 1);
}
}
break;
}
case aco_opcode::v_div_fmas_f32:
case aco_opcode::v_div_fmas_f64: {
if (ctx.VALU_wrvcc + 4 >= new_idx)
NOPs = std::max(NOPs, 4 + ctx.VALU_wrvcc - new_idx + 1);
break;
}
default:
break;
}
/* Write VGPRs holding writedata > 64 bit from MIMG/MUBUF instructions */
// FIXME: handle case if the last instruction of a block without branch is such store
// TODO: confirm that DS instructions cannot cause WAR hazards here
if (new_idx > 0) {
aco_ptr<Instruction>& pred = new_instructions.back();
if (pred->isVMEM() &&
pred->operands.size() == 4 &&
pred->operands[3].size() > 2 &&
pred->operands[1].size() != 8 &&
(pred->format != Format::MUBUF || pred->operands[2].physReg() >= 102)) {
/* Ops that use a 256-bit T# do not need a wait state.
* BUFFER_STORE_* operations that use an SGPR for "offset"
* do not require any wait states. */
PhysReg wrdata = pred->operands[3].physReg();
unsigned size = pred->operands[3].size();
assert(wrdata >= 256);
for (const Definition& def : instr->definitions) {
if (regs_intersect(def.physReg(), def.size(), wrdata, size))
NOPs = std::max(NOPs, 1);
}
}
}
if (VALU_writes_sgpr(instr)) {
for (const Definition& def : instr->definitions) {
if (def.physReg() == vcc)
ctx.VALU_wrvcc = NOPs ? new_idx : new_idx + 1;
else if (def.physReg() == exec)
ctx.VALU_wrexec = NOPs ? new_idx : new_idx + 1;
else if (def.physReg() <= 102)
ctx.VALU_wrsgpr = NOPs ? new_idx : new_idx + 1;
}
}
return NOPs;
} else if (instr->isVMEM() && ctx.VALU_wrsgpr + 5 >= new_idx) {
/* If the VALU writes the SGPR that is used by a VMEM, the user must add five wait states. */
for (int pred_idx = new_idx - 1; pred_idx >= 0 && pred_idx >= new_idx - 5; pred_idx--) {
aco_ptr<Instruction>& pred = new_instructions[pred_idx];
if (!(pred->isVALU() && VALU_writes_sgpr(pred)))
continue;
for (const Definition& def : pred->definitions) {
if (def.physReg() > 102)
continue;
if (instr->operands.size() > 1 &&
regs_intersect(instr->operands[1].physReg(), instr->operands[1].size(),
def.physReg(), def.size())) {
return 5 + pred_idx - new_idx + 1;
}
if (instr->operands.size() > 2 &&
regs_intersect(instr->operands[2].physReg(), instr->operands[2].size(),
def.physReg(), def.size())) {
return 5 + pred_idx - new_idx + 1;
}
}
}
}
return 0;
}
void handle_block(NOP_ctx& ctx, Block& block)
{
std::vector<aco_ptr<Instruction>> instructions;
instructions.reserve(block.instructions.size());
for (unsigned i = 0; i < block.instructions.size(); i++) {
aco_ptr<Instruction>& instr = block.instructions[i];
unsigned NOPs = handle_instruction(ctx, instr, block.instructions, instructions);
if (NOPs) {
// TODO: try to move the instruction down
/* create NOP */
aco_ptr<SOPP_instruction> nop{create_instruction<SOPP_instruction>(aco_opcode::s_nop, Format::SOPP, 0, 0)};
nop->imm = NOPs - 1;
nop->block = -1;
instructions.emplace_back(std::move(nop));
}
instructions.emplace_back(std::move(instr));
}
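/* block-local instruction indices restart at 0 in the next block, so shift
 * the recorded hazard positions to keep distances correct across blocks */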
ctx.VALU_wrvcc -= instructions.size();
ctx.VALU_wrexec -= instructions.size();
ctx.VALU_wrsgpr -= instructions.size();
block.instructions = std::move(instructions);
}
} /* end namespace */
void insert_NOPs(Program* program)
{
NOP_ctx ctx(program);
for (Block& block : program->blocks) {
if (block.instructions.empty())
continue;
handle_block(ctx, block);
}
}
}

File diff suppressed because it is too large

src/amd/compiler/aco_insert_waitcnt.cpp Normal file
@@ -0,0 +1,697 @@
/*
* Copyright © 2018 Valve Corporation
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice (including the next
* paragraph) shall be included in all copies or substantial portions of the
* Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
* THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
* IN THE SOFTWARE.
*
*/
#include <algorithm>
#include <map>
#include "aco_ir.h"
#include "vulkan/radv_shader.h"
namespace aco {
namespace {
/**
* The general idea of this pass is:
* The CFG is traversed in reverse postorder (forward).
* Per BB one wait_ctx is maintained.
* The in-context is the joined out-contexts of the predecessors.
* The context contains a map: gpr -> wait_entry
* consisting of the information about the cnt values to be waited for.
* Note: after merge nodes, it can happen that multiple cnt values need to be
* waited for on the same register.
*
* The values are updated according to the encountered instructions:
* - additional events increment the counter of waits of the same type
* - or erase gprs with counters higher than to be waited for.
*/
// TODO: do a more clever insertion of wait_cnt (lgkm_cnt) when there is a load followed by a use of a previous load
/* Instructions of the same event will finish in-order except for smem
* and maybe flat. Instructions of different events may not finish in-order. */
enum wait_event : uint16_t {
event_smem = 1 << 0,
event_lds = 1 << 1,
event_gds = 1 << 2,
event_vmem = 1 << 3,
event_vmem_store = 1 << 4, /* GFX10+ */
event_flat = 1 << 5,
event_exp_pos = 1 << 6,
event_exp_param = 1 << 7,
event_exp_mrt_null = 1 << 8,
event_gds_gpr_lock = 1 << 9,
event_vmem_gpr_lock = 1 << 10,
};
enum counter_type : uint8_t {
counter_exp = 1 << 0,
counter_lgkm = 1 << 1,
counter_vm = 1 << 2,
counter_vs = 1 << 3,
};
static const uint16_t exp_events = event_exp_pos | event_exp_param | event_exp_mrt_null | event_gds_gpr_lock | event_vmem_gpr_lock;
static const uint16_t lgkm_events = event_smem | event_lds | event_gds | event_flat;
static const uint16_t vm_events = event_vmem | event_flat;
static const uint16_t vs_events = event_vmem_store;
uint8_t get_counters_for_event(wait_event ev)
{
switch (ev) {
case event_smem:
case event_lds:
case event_gds:
return counter_lgkm;
case event_vmem:
return counter_vm;
case event_vmem_store:
return counter_vs;
case event_flat:
return counter_vm | counter_lgkm;
case event_exp_pos:
case event_exp_param:
case event_exp_mrt_null:
case event_gds_gpr_lock:
case event_vmem_gpr_lock:
return counter_exp;
default:
return 0;
}
}
struct wait_imm {
static const uint8_t unset_counter = 0xff;
uint8_t vm;
uint8_t exp;
uint8_t lgkm;
uint8_t vs;
wait_imm() :
vm(unset_counter), exp(unset_counter), lgkm(unset_counter), vs(unset_counter) {}
wait_imm(uint16_t vm_, uint16_t exp_, uint16_t lgkm_, uint16_t vs_) :
vm(vm_), exp(exp_), lgkm(lgkm_), vs(vs_) {}
uint16_t pack(enum chip_class chip) const
{
uint16_t imm = 0;
assert(exp == unset_counter || exp <= 0x7);
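/* s_waitcnt immediate layout: VM_CNT[3:0] in bits 3:0, EXP_CNT in bits 6:4,
 * LGKM_CNT starting at bit 8 (4 bits before GFX10, 6 bits on GFX10),
 * VM_CNT[5:4] in bits 15:14 on GFX9+ */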
switch (chip) {
case GFX10:
assert(lgkm == unset_counter || lgkm <= 0x3f);
assert(vm == unset_counter || vm <= 0x3f);
imm = ((vm & 0x30) << 10) | ((lgkm & 0x3f) << 8) | ((exp & 0x7) << 4) | (vm & 0xf);
break;
case GFX9:
assert(lgkm == unset_counter || lgkm <= 0xf);
assert(vm == unset_counter || vm <= 0x3f);
imm = ((vm & 0x30) << 10) | ((lgkm & 0xf) << 8) | ((exp & 0x7) << 4) | (vm & 0xf);
break;
default:
assert(lgkm == unset_counter || lgkm <= 0xf);
assert(vm == unset_counter || vm <= 0xf);
imm = ((lgkm & 0xf) << 8) | ((exp & 0x7) << 4) | (vm & 0xf);
break;
}
if (chip < GFX9 && vm == wait_imm::unset_counter)
imm |= 0xc000; /* should have no effect on pre-GFX9 and now we won't have to worry about the architecture when interpreting the immediate */
if (chip < GFX10 && lgkm == wait_imm::unset_counter)
imm |= 0x3000; /* should have no effect on pre-GFX10 and now we won't have to worry about the architecture when interpreting the immediate */
return imm;
}
void combine(const wait_imm& other)
{
vm = std::min(vm, other.vm);
exp = std::min(exp, other.exp);
lgkm = std::min(lgkm, other.lgkm);
vs = std::min(vs, other.vs);
}
bool empty() const
{
return vm == unset_counter && exp == unset_counter &&
lgkm == unset_counter && vs == unset_counter;
}
};
struct wait_entry {
wait_imm imm;
uint16_t events; /* use wait_event notion */
uint8_t counters; /* use counter_type notion */
bool wait_on_read:1;
bool logical:1;
wait_entry(wait_event event, wait_imm imm, bool logical, bool wait_on_read)
: imm(imm), events(event), counters(get_counters_for_event(event)),
wait_on_read(wait_on_read), logical(logical) {}
void join(const wait_entry& other)
{
events |= other.events;
counters |= other.counters;
imm.combine(other.imm);
wait_on_read = wait_on_read || other.wait_on_read;
assert(logical == other.logical);
}
void remove_counter(counter_type counter)
{
counters &= ~counter;
if (counter == counter_lgkm) {
imm.lgkm = wait_imm::unset_counter;
events &= ~(event_smem | event_lds | event_gds);
}
if (counter == counter_vm) {
imm.vm = wait_imm::unset_counter;
events &= ~event_vmem;
}
if (counter == counter_exp) {
imm.exp = wait_imm::unset_counter;
events &= ~(event_exp_pos | event_exp_param | event_exp_mrt_null | event_gds_gpr_lock | event_vmem_gpr_lock);
}
if (counter == counter_vs) {
imm.vs = wait_imm::unset_counter;
events &= ~event_vmem_store;
}
if (!(counters & counter_lgkm) && !(counters & counter_vm))
events &= ~event_flat;
}
};
struct wait_ctx {
Program *program;
enum chip_class chip_class;
uint16_t max_vm_cnt;
uint16_t max_exp_cnt;
uint16_t max_lgkm_cnt;
uint16_t max_vs_cnt;
uint16_t unordered_events = event_smem | event_flat;
uint8_t vm_cnt = 0;
uint8_t exp_cnt = 0;
uint8_t lgkm_cnt = 0;
uint8_t vs_cnt = 0;
bool pending_flat_lgkm = false;
bool pending_flat_vm = false;
wait_imm barrier_imm[barrier_count];
std::map<PhysReg,wait_entry> gpr_map;
wait_ctx() {}
wait_ctx(Program *program_)
: program(program_),
chip_class(program_->chip_class),
max_vm_cnt(program_->chip_class >= GFX9 ? 62 : 14),
max_exp_cnt(6),
max_lgkm_cnt(program_->chip_class >= GFX10 ? 62 : 14),
max_vs_cnt(program_->chip_class >= GFX10 ? 62 : 0),
unordered_events(event_smem | (program_->chip_class < GFX10 ? event_flat : 0)) {}
void join(const wait_ctx* other, bool logical)
{
exp_cnt = std::max(exp_cnt, other->exp_cnt);
vm_cnt = std::max(vm_cnt, other->vm_cnt);
lgkm_cnt = std::max(lgkm_cnt, other->lgkm_cnt);
vs_cnt = std::max(vs_cnt, other->vs_cnt);
pending_flat_lgkm |= other->pending_flat_lgkm;
pending_flat_vm |= other->pending_flat_vm;
for (std::pair<PhysReg,wait_entry> entry : other->gpr_map)
{
std::map<PhysReg,wait_entry>::iterator it = gpr_map.find(entry.first);
if (entry.second.logical != logical)
continue;
if (it != gpr_map.end())
it->second.join(entry.second);
else
gpr_map.insert(entry);
}
for (unsigned i = 0; i < barrier_count; i++)
barrier_imm[i].combine(other->barrier_imm[i]);
}
};
wait_imm check_instr(Instruction* instr, wait_ctx& ctx)
{
wait_imm wait;
for (const Operand op : instr->operands) {
if (op.isConstant() || op.isUndefined())
continue;
/* check consecutively read gprs */
for (unsigned j = 0; j < op.size(); j++) {
PhysReg reg{op.physReg() + j};
std::map<PhysReg,wait_entry>::iterator it = ctx.gpr_map.find(reg);
if (it == ctx.gpr_map.end() || !it->second.wait_on_read)
continue;
wait.combine(it->second.imm);
}
}
for (const Definition& def : instr->definitions) {
/* check consecutively written gprs */
for (unsigned j = 0; j < def.getTemp().size(); j++)
{
PhysReg reg{def.physReg() + j};
std::map<PhysReg,wait_entry>::iterator it = ctx.gpr_map.find(reg);
if (it == ctx.gpr_map.end())
continue;
/* Vector Memory reads and writes return in the order they were issued */
if (instr->isVMEM() && ((it->second.events & vm_events) == event_vmem)) {
it->second.remove_counter(counter_vm);
if (!it->second.counters)
it = ctx.gpr_map.erase(it);
continue;
}
/* LDS reads and writes return in the order they were issued. same for GDS */
if (instr->format == Format::DS) {
bool gds = static_cast<DS_instruction*>(instr)->gds;
if ((it->second.events & lgkm_events) == (gds ? event_gds : event_lds)) {
it->second.remove_counter(counter_lgkm);
if (!it->second.counters)
it = ctx.gpr_map.erase(it);
continue;
}
}
wait.combine(it->second.imm);
}
}
return wait;
}
wait_imm kill(Instruction* instr, wait_ctx& ctx)
{
wait_imm imm;
if (ctx.exp_cnt || ctx.vm_cnt || ctx.lgkm_cnt)
imm.combine(check_instr(instr, ctx));
if (instr->format == Format::PSEUDO_BARRIER) {
unsigned* bsize = ctx.program->info->cs.block_size;
unsigned workgroup_size = bsize[0] * bsize[1] * bsize[2];
switch (instr->opcode) {
case aco_opcode::p_memory_barrier_all:
for (unsigned i = 0; i < barrier_count; i++) {
if ((1 << i) == barrier_shared && workgroup_size <= 64)
continue;
imm.combine(ctx.barrier_imm[i]);
}
break;
case aco_opcode::p_memory_barrier_atomic:
imm.combine(ctx.barrier_imm[ffs(barrier_atomic) - 1]);
break;
/* see comment in aco_scheduler.cpp's can_move_instr() on why these barriers are merged */
case aco_opcode::p_memory_barrier_buffer:
case aco_opcode::p_memory_barrier_image:
imm.combine(ctx.barrier_imm[ffs(barrier_buffer) - 1]);
imm.combine(ctx.barrier_imm[ffs(barrier_image) - 1]);
break;
case aco_opcode::p_memory_barrier_shared:
if (workgroup_size > 64)
imm.combine(ctx.barrier_imm[ffs(barrier_shared) - 1]);
break;
default:
assert(false);
break;
}
}
if (!imm.empty()) {
if (ctx.pending_flat_vm && imm.vm != wait_imm::unset_counter)
imm.vm = 0;
if (ctx.pending_flat_lgkm && imm.lgkm != wait_imm::unset_counter)
imm.lgkm = 0;
/* reset counters */
ctx.exp_cnt = std::min(ctx.exp_cnt, imm.exp);
ctx.vm_cnt = std::min(ctx.vm_cnt, imm.vm);
ctx.lgkm_cnt = std::min(ctx.lgkm_cnt, imm.lgkm);
ctx.vs_cnt = std::min(ctx.vs_cnt, imm.vs);
/* update barrier wait imms */
for (unsigned i = 0; i < barrier_count; i++) {
wait_imm& bar = ctx.barrier_imm[i];
if (bar.exp != wait_imm::unset_counter && imm.exp <= bar.exp)
bar.exp = wait_imm::unset_counter;
if (bar.vm != wait_imm::unset_counter && imm.vm <= bar.vm)
bar.vm = wait_imm::unset_counter;
if (bar.lgkm != wait_imm::unset_counter && imm.lgkm <= bar.lgkm)
bar.lgkm = wait_imm::unset_counter;
if (bar.vs != wait_imm::unset_counter && imm.vs <= bar.vs)
bar.vs = wait_imm::unset_counter;
}
/* remove all vgprs with higher counter from map */
std::map<PhysReg,wait_entry>::iterator it = ctx.gpr_map.begin();
while (it != ctx.gpr_map.end())
{
if (imm.exp != wait_imm::unset_counter && imm.exp <= it->second.imm.exp)
it->second.remove_counter(counter_exp);
if (imm.vm != wait_imm::unset_counter && imm.vm <= it->second.imm.vm)
it->second.remove_counter(counter_vm);
if (imm.lgkm != wait_imm::unset_counter && imm.lgkm <= it->second.imm.lgkm)
it->second.remove_counter(counter_lgkm);
if (imm.vs != wait_imm::unset_counter && imm.vs <= it->second.imm.vs)
it->second.remove_counter(counter_vs);
if (!it->second.counters)
it = ctx.gpr_map.erase(it);
else
it++;
}
}
if (imm.vm == 0)
ctx.pending_flat_vm = false;
if (imm.lgkm == 0)
ctx.pending_flat_lgkm = false;
return imm;
}
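/* Update the per-barrier wait imms: the barrier this event belongs to now
* needs a full wait (imm 0) on the affected counters, while the imms of all
* other barriers age by one, since one more event is now outstanding on top
* of theirs (saturating at the hw maximum). */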
void update_barrier_imm(wait_ctx& ctx, uint8_t counters, barrier_interaction barrier)
{
unsigned barrier_index = ffs(barrier) - 1;
for (unsigned i = 0; i < barrier_count; i++) {
wait_imm& bar = ctx.barrier_imm[i];
if (i == barrier_index) {
if (counters & counter_lgkm)
bar.lgkm = 0;
if (counters & counter_vm)
bar.vm = 0;
if (counters & counter_exp)
bar.exp = 0;
if (counters & counter_vs)
bar.vs = 0;
} else {
if (counters & counter_lgkm && bar.lgkm != wait_imm::unset_counter && bar.lgkm < ctx.max_lgkm_cnt)
bar.lgkm++;
if (counters & counter_vm && bar.vm != wait_imm::unset_counter && bar.vm < ctx.max_vm_cnt)
bar.vm++;
if (counters & counter_exp && bar.exp != wait_imm::unset_counter && bar.exp < ctx.max_exp_cnt)
bar.exp++;
if (counters & counter_vs && bar.vs != wait_imm::unset_counter && bar.vs < ctx.max_vs_cnt)
bar.vs++;
}
}
}
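/* Account for a newly issued event: bump the affected hw counters and age
* the imm of every tracked gpr whose outstanding events of that counter
* class are ordered with respect to the new event. Unordered events (SMEM,
* FLAT) don't allow partial waits, so they neither age nor get aged. */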
void update_counters(wait_ctx& ctx, wait_event event, barrier_interaction barrier=barrier_none)
{
uint8_t counters = get_counters_for_event(event);
if (counters & counter_lgkm && ctx.lgkm_cnt <= ctx.max_lgkm_cnt)
ctx.lgkm_cnt++;
if (counters & counter_vm && ctx.vm_cnt <= ctx.max_vm_cnt)
ctx.vm_cnt++;
if (counters & counter_exp && ctx.exp_cnt <= ctx.max_exp_cnt)
ctx.exp_cnt++;
if (counters & counter_vs && ctx.vs_cnt <= ctx.max_vs_cnt)
ctx.vs_cnt++;
update_barrier_imm(ctx, counters, barrier);
if (ctx.unordered_events & event)
return;
if (ctx.pending_flat_lgkm)
counters &= ~counter_lgkm;
if (ctx.pending_flat_vm)
counters &= ~counter_vm;
for (std::pair<const PhysReg,wait_entry>& e : ctx.gpr_map) {
wait_entry& entry = e.second;
if (entry.events & ctx.unordered_events)
continue;
assert(entry.events);
if ((counters & counter_exp) && (entry.events & exp_events) == event && entry.imm.exp < ctx.max_exp_cnt)
entry.imm.exp++;
if ((counters & counter_lgkm) && (entry.events & lgkm_events) == event && entry.imm.lgkm < ctx.max_lgkm_cnt)
entry.imm.lgkm++;
if ((counters & counter_vm) && (entry.events & vm_events) == event && entry.imm.vm < ctx.max_vm_cnt)
entry.imm.vm++;
if ((counters & counter_vs) && (entry.events & vs_events) == event && entry.imm.vs < ctx.max_vs_cnt)
entry.imm.vs++;
}
}
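/* FLAT loads increment both lgkm and vm and there is no way to know which
* counter the result returns on, so every tracked imm of either class is
* forced to 0 and the pending_flat flags keep later waits conservative. */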
void update_counters_for_flat_load(wait_ctx& ctx, barrier_interaction barrier=barrier_none)
{
assert(ctx.chip_class < GFX10);
if (ctx.lgkm_cnt <= ctx.max_lgkm_cnt)
ctx.lgkm_cnt++;
if (ctx.vm_cnt <= ctx.max_vm_cnt)
ctx.vm_cnt++;
update_barrier_imm(ctx, counter_vm | counter_lgkm, barrier);
for (std::pair<const PhysReg,wait_entry>& e : ctx.gpr_map)
{
if (e.second.counters & counter_vm)
e.second.imm.vm = 0;
if (e.second.counters & counter_lgkm)
e.second.imm.lgkm = 0;
}
ctx.pending_flat_lgkm = true;
ctx.pending_flat_vm = true;
}
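/* Start tracking the gpr range touched by a new event, with a zeroed imm
* for each counter the event affects (i.e. a wait on this register has to
* wait for the newest outstanding event). */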
void insert_wait_entry(wait_ctx& ctx, PhysReg reg, RegClass rc, wait_event event, bool wait_on_read)
{
uint16_t counters = get_counters_for_event(event);
wait_imm imm;
if (counters & counter_lgkm)
imm.lgkm = 0;
if (counters & counter_vm)
imm.vm = 0;
if (counters & counter_exp)
imm.exp = 0;
if (counters & counter_vs)
imm.vs = 0;
wait_entry new_entry(event, imm, !rc.is_linear(), wait_on_read);
for (unsigned i = 0; i < rc.size(); i++) {
auto it = ctx.gpr_map.emplace(PhysReg{reg.reg+i}, new_entry);
if (!it.second)
it.first->second.join(new_entry);
}
}
void insert_wait_entry(wait_ctx& ctx, Operand op, wait_event event)
{
if (!op.isConstant() && !op.isUndefined())
insert_wait_entry(ctx, op.physReg(), op.regClass(), event, false);
}
void insert_wait_entry(wait_ctx& ctx, Definition def, wait_event event)
{
insert_wait_entry(ctx, def.physReg(), def.regClass(), event, true);
}
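/* Record the events instr generates: bump the counters it touches and
* insert wait entries for the gprs that later instructions must not read or
* overwrite until the event has completed. */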
void gen(Instruction* instr, wait_ctx& ctx)
{
switch (instr->format) {
case Format::EXP: {
Export_instruction* exp_instr = static_cast<Export_instruction*>(instr);
wait_event ev;
if (exp_instr->dest <= 9)
ev = event_exp_mrt_null;
else if (exp_instr->dest <= 15)
ev = event_exp_pos;
else
ev = event_exp_param;
update_counters(ctx, ev);
/* insert new entries for exported vgprs */
for (unsigned i = 0; i < 4; i++)
{
if (exp_instr->enabled_mask & (1 << i)) {
unsigned idx = exp_instr->compressed ? i >> 1 : i;
assert(idx < exp_instr->operands.size());
insert_wait_entry(ctx, exp_instr->operands[idx], ev);
}
}
insert_wait_entry(ctx, exec, s2, ev, false);
break;
}
case Format::FLAT: {
if (ctx.chip_class < GFX10 && !instr->definitions.empty())
update_counters_for_flat_load(ctx, barrier_buffer);
else
update_counters(ctx, event_flat, barrier_buffer);
if (!instr->definitions.empty())
insert_wait_entry(ctx, instr->definitions[0], event_flat);
break;
}
case Format::SMEM: {
update_counters(ctx, event_smem, static_cast<SMEM_instruction*>(instr)->barrier);
if (!instr->definitions.empty())
insert_wait_entry(ctx, instr->definitions[0], event_smem);
break;
}
case Format::DS: {
bool gds = static_cast<DS_instruction*>(instr)->gds;
update_counters(ctx, gds ? event_gds : event_lds, gds ? barrier_none : barrier_shared);
if (gds)
update_counters(ctx, event_gds_gpr_lock);
if (!instr->definitions.empty())
insert_wait_entry(ctx, instr->definitions[0], gds ? event_gds : event_lds);
if (gds) {
for (const Operand& op : instr->operands)
insert_wait_entry(ctx, op, event_gds_gpr_lock);
insert_wait_entry(ctx, exec, s2, event_gds_gpr_lock, false);
}
break;
}
case Format::MUBUF:
case Format::MTBUF:
case Format::MIMG:
case Format::GLOBAL: {
wait_event ev = !instr->definitions.empty() || ctx.chip_class < GFX10 ? event_vmem : event_vmem_store;
update_counters(ctx, ev, get_barrier_interaction(instr));
if (!instr->definitions.empty())
insert_wait_entry(ctx, instr->definitions[0], ev);
if (instr->operands.size() == 4 && ctx.chip_class == GFX6) {
ctx.exp_cnt++;
update_counters(ctx, event_vmem_gpr_lock);
insert_wait_entry(ctx, instr->operands[3], event_vmem_gpr_lock);
}
break;
}
default:
break;
}
}
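/* Emit the actual wait instructions: vscnt needs its own s_waitcnt_vscnt
* (GFX10+), everything else is packed into a single s_waitcnt immediate. */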
void emit_waitcnt(wait_ctx& ctx, std::vector<aco_ptr<Instruction>>& instructions, wait_imm imm)
{
if (imm.vs != wait_imm::unset_counter) {
assert(ctx.chip_class >= GFX10);
SOPK_instruction* waitcnt_vs = create_instruction<SOPK_instruction>(aco_opcode::s_waitcnt_vscnt, Format::SOPK, 0, 0);
waitcnt_vs->imm = imm.vs;
instructions.emplace_back(waitcnt_vs);
imm.vs = wait_imm::unset_counter;
}
if (!imm.empty()) {
SOPP_instruction* waitcnt = create_instruction<SOPP_instruction>(aco_opcode::s_waitcnt, Format::SOPP, 0, 0);
waitcnt->imm = imm.pack(ctx.chip_class);
waitcnt->block = -1;
instructions.emplace_back(waitcnt);
}
}
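/* Rebuild the block's instruction list, inserting an s_waitcnt before every
* instruction for which kill() reports outstanding events it depends on. At
* loop back-edges, all remaining counters are flushed before the branch. */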
void handle_block(Program *program, Block& block, wait_ctx& ctx)
{
std::vector<aco_ptr<Instruction>> new_instructions;
for (aco_ptr<Instruction>& instr : block.instructions) {
wait_imm imm = kill(instr.get(), ctx);
if (!imm.empty())
emit_waitcnt(ctx, new_instructions, imm);
gen(instr.get(), ctx);
if (instr->format != Format::PSEUDO_BARRIER)
new_instructions.emplace_back(std::move(instr));
}
/* check if this block is at the end of a loop */
for (unsigned succ_idx : block.linear_succs) {
/* eliminate any remaining counters */
if (succ_idx <= block.index && (ctx.vm_cnt || ctx.exp_cnt || ctx.lgkm_cnt || ctx.vs_cnt)) {
// TODO: we could do better if we only wait if the regs between the block and other predecessors differ
aco_ptr<Instruction> branch = std::move(new_instructions.back());
new_instructions.pop_back();
wait_imm imm(ctx.vm_cnt ? 0 : wait_imm::unset_counter,
ctx.exp_cnt ? 0 : wait_imm::unset_counter,
ctx.lgkm_cnt ? 0 : wait_imm::unset_counter,
ctx.vs_cnt ? 0 : wait_imm::unset_counter);
emit_waitcnt(ctx, new_instructions, imm);
new_instructions.push_back(std::move(branch));
ctx = wait_ctx(program);
break;
}
}
block.instructions.swap(new_instructions);
}
} /* end namespace */
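/* Entry point of the pass: process the blocks in order, joining the
* contexts of all linear and logical predecessors before handling each
* block. */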
void insert_wait_states(Program* program)
{
std::vector<wait_ctx> out_ctx(program->blocks.size(), wait_ctx(program)); /* per BB ctx */
for (unsigned i = 0; i < program->blocks.size(); i++) {
Block& current = program->blocks[i];
wait_ctx& in = out_ctx[current.index];
for (unsigned b : current.linear_preds)
in.join(&out_ctx[b], false);
for (unsigned b : current.logical_preds)
in.join(&out_ctx[b], true);
if (current.instructions.empty())
continue;
handle_block(program, current, in);
}
}
}

File diff suppressed because it is too large

File diff suppressed because it is too large

View File

@ -0,0 +1,166 @@
/*
* Copyright © 2018 Google
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice (including the next
* paragraph) shall be included in all copies or substantial portions of the
* Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
* THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
* IN THE SOFTWARE.
*/
#include "aco_interface.h"
#include "aco_ir.h"
#include "vulkan/radv_shader.h"
#include "c11/threads.h"
#include "util/debug.h"
#include <iostream>
#include <sstream>
namespace aco {
uint64_t debug_flags = 0;
static const struct debug_control aco_debug_options[] = {
{"validateir", DEBUG_VALIDATE},
{"validatera", DEBUG_VALIDATE_RA},
{"perfwarn", DEBUG_PERFWARN},
{NULL, 0}
};
static once_flag init_once_flag = ONCE_FLAG_INIT;
static void init()
{
debug_flags = parse_debug_string(getenv("ACO_DEBUG"), aco_debug_options);
#ifndef NDEBUG
/* enable some flags by default on debug builds */
debug_flags |= aco::DEBUG_VALIDATE;
#endif
}
}
void aco_compile_shader(unsigned shader_count,
struct nir_shader *const *shaders,
struct radv_shader_binary **binary,
struct radv_shader_info *info,
struct radv_nir_compiler_options *options)
{
call_once(&aco::init_once_flag, aco::init);
ac_shader_config config = {0};
std::unique_ptr<aco::Program> program{new aco::Program};
/* Instruction Selection */
aco::select_program(program.get(), shader_count, shaders, &config, info, options);
if (options->dump_preoptir) {
std::cerr << "After Instruction Selection:\n";
aco_print_program(program.get(), stderr);
}
aco::validate(program.get(), stderr);
/* Boolean phi lowering */
aco::lower_bool_phis(program.get());
//std::cerr << "After Boolean Phi Lowering:\n";
//aco_print_program(program.get(), stderr);
aco::dominator_tree(program.get());
/* Optimization */
aco::value_numbering(program.get());
aco::optimize(program.get());
aco::validate(program.get(), stderr);
aco::setup_reduce_temp(program.get());
aco::insert_exec_mask(program.get());
aco::validate(program.get(), stderr);
aco::live live_vars = aco::live_var_analysis(program.get(), options);
aco::spill(program.get(), live_vars, options);
//std::cerr << "Before Schedule:\n";
//aco_print_program(program.get(), stderr);
aco::schedule_program(program.get(), live_vars);
/* Register Allocation */
aco::register_allocation(program.get(), live_vars.live_out);
if (options->dump_shader) {
std::cerr << "After RA:\n";
aco_print_program(program.get(), stderr);
}
if (aco::validate_ra(program.get(), options, stderr)) {
std::cerr << "Program after RA validation failure:\n";
aco_print_program(program.get(), stderr);
abort();
}
aco::ssa_elimination(program.get());
/* Lower to HW Instructions */
aco::lower_to_hw_instr(program.get());
//std::cerr << "After Eliminate Pseudo Instr:\n";
//aco_print_program(program.get(), stderr);
/* Insert Waitcnt */
aco::insert_wait_states(program.get());
aco::insert_NOPs(program.get());
//std::cerr << "After Insert-Waitcnt:\n";
//aco_print_program(program.get(), stderr);
/* Assembly */
std::vector<uint32_t> code;
unsigned exec_size = aco::emit_program(program.get(), code);
bool get_disasm = options->dump_shader;
#ifndef NDEBUG
get_disasm |= options->record_llvm_ir;
#endif
size_t size = 0;
std::string disasm;
if (get_disasm) {
std::ostringstream stream;
aco::print_asm(program.get(), code, exec_size / 4u, options->family, stream);
stream << '\0';
disasm = stream.str();
size += disasm.size();
}
size += code.size() * sizeof(uint32_t) + sizeof(radv_shader_binary_legacy);
radv_shader_binary_legacy* legacy_binary = (radv_shader_binary_legacy*) malloc(size);
legacy_binary->base.type = RADV_BINARY_TYPE_LEGACY;
legacy_binary->base.stage = shaders[shader_count-1]->info.stage;
legacy_binary->base.is_gs_copy_shader = false;
legacy_binary->base.total_size = size;
memcpy(legacy_binary->data, code.data(), code.size() * sizeof(uint32_t));
legacy_binary->exec_size = exec_size;
legacy_binary->code_size = code.size() * sizeof(uint32_t);
legacy_binary->config = config;
legacy_binary->disasm_size = 0;
legacy_binary->llvm_ir_size = 0;
if (get_disasm) {
disasm.copy((char*) legacy_binary->data + legacy_binary->code_size, disasm.size());
legacy_binary->disasm_size = disasm.size() - 1;
}
*binary = (radv_shader_binary*) legacy_binary;
}

View File

@ -0,0 +1,45 @@
/*
* Copyright © 2018 Google
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice (including the next
* paragraph) shall be included in all copies or substantial portions of the
* Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
* THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
* IN THE SOFTWARE.
*/
#ifndef ACO_INTERFACE_H
#define ACO_INTERFACE_H
#include "nir.h"
#ifdef __cplusplus
extern "C" {
#endif
struct ac_shader_config;
void aco_compile_shader(unsigned shader_count,
struct nir_shader *const *shaders,
struct radv_shader_binary** binary,
struct radv_shader_info *info,
struct radv_nir_compiler_options *options);
#ifdef __cplusplus
}
#endif
#endif

1169
src/amd/compiler/aco_ir.h Normal file

File diff suppressed because it is too large

View File

@ -0,0 +1,243 @@
/*
* Copyright © 2018 Valve Corporation
* Copyright © 2018 Google
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice (including the next
* paragraph) shall be included in all copies or substantial portions of the
* Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
* THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
* IN THE SOFTWARE.
*
* Authors:
* Daniel Schürmann (daniel.schuermann@campus.tu-berlin.de)
* Bas Nieuwenhuizen (bas@basnieuwenhuizen.nl)
*
*/
#include "aco_ir.h"
#include <set>
#include <vector>
#include "vulkan/radv_shader.h"
namespace aco {
namespace {
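/* Backwards walk over one block: kill definitions, gen operands, record the
* per-instruction register demand and propagate the live-in set into the
* predecessors' live-out sets (re-queueing any block whose set grew). */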
void process_live_temps_per_block(Program *program, live& lives, Block* block,
std::set<unsigned>& worklist, std::vector<uint16_t>& phi_sgpr_ops)
{
std::vector<RegisterDemand>& register_demand = lives.register_demand[block->index];
RegisterDemand new_demand;
register_demand.resize(block->instructions.size());
block->register_demand = RegisterDemand();
std::set<Temp> live_sgprs;
std::set<Temp> live_vgprs;
/* add the live_out_exec to live */
bool exec_live = false;
if (block->live_out_exec != Temp()) {
live_sgprs.insert(block->live_out_exec);
new_demand.sgpr += 2;
exec_live = true;
}
/* split the live-outs from this block into the temporary sets */
std::vector<std::set<Temp>>& live_temps = lives.live_out;
for (const Temp temp : live_temps[block->index]) {
const bool inserted = temp.is_linear()
? live_sgprs.insert(temp).second
: live_vgprs.insert(temp).second;
if (inserted) {
new_demand += temp;
}
}
new_demand.sgpr -= phi_sgpr_ops[block->index];
/* traverse the instructions backwards */
for (int idx = block->instructions.size() -1; idx >= 0; idx--)
{
/* subtract the 2 sgprs from exec */
if (exec_live)
assert(new_demand.sgpr >= 2);
register_demand[idx] = RegisterDemand(new_demand.vgpr, new_demand.sgpr - (exec_live ? 2 : 0));
Instruction *insn = block->instructions[idx].get();
/* KILL */
for (Definition& definition : insn->definitions) {
if (!definition.isTemp()) {
continue;
}
const Temp temp = definition.getTemp();
size_t n = 0;
if (temp.is_linear())
n = live_sgprs.erase(temp);
else
n = live_vgprs.erase(temp);
if (n) {
new_demand -= temp;
definition.setKill(false);
} else {
register_demand[idx] += temp;
definition.setKill(true);
}
if (definition.isFixed() && definition.physReg() == exec)
exec_live = false;
}
/* GEN */
if (insn->opcode == aco_opcode::p_phi ||
insn->opcode == aco_opcode::p_linear_phi) {
/* directly insert into the predecessors live-out set */
std::vector<unsigned>& preds = insn->opcode == aco_opcode::p_phi
? block->logical_preds
: block->linear_preds;
for (unsigned i = 0; i < preds.size(); ++i)
{
Operand &operand = insn->operands[i];
if (!operand.isTemp()) {
continue;
}
/* check if we changed an already processed block */
const bool inserted = live_temps[preds[i]].insert(operand.getTemp()).second;
if (inserted) {
operand.setFirstKill(true);
worklist.insert(preds[i]);
if (insn->opcode == aco_opcode::p_phi && operand.getTemp().type() == RegType::sgpr)
phi_sgpr_ops[preds[i]] += operand.size();
}
}
} else if (insn->opcode == aco_opcode::p_logical_end) {
new_demand.sgpr += phi_sgpr_ops[block->index];
} else {
for (unsigned i = 0; i < insn->operands.size(); ++i)
{
Operand& operand = insn->operands[i];
if (!operand.isTemp()) {
continue;
}
const Temp temp = operand.getTemp();
const bool inserted = temp.is_linear()
? live_sgprs.insert(temp).second
: live_vgprs.insert(temp).second;
if (inserted) {
operand.setFirstKill(true);
for (unsigned j = i + 1; j < insn->operands.size(); ++j) {
if (insn->operands[j].isTemp() && insn->operands[j].tempId() == operand.tempId()) {
insn->operands[j].setFirstKill(false);
insn->operands[j].setKill(true);
}
}
new_demand += temp;
} else {
operand.setKill(false);
}
if (operand.isFixed() && operand.physReg() == exec)
exec_live = true;
}
}
block->register_demand.update(register_demand[idx]);
}
/* now, we have the live-in sets and need to merge them into the live-out sets */
for (unsigned pred_idx : block->logical_preds) {
for (Temp vgpr : live_vgprs) {
auto it = live_temps[pred_idx].insert(vgpr);
if (it.second)
worklist.insert(pred_idx);
}
}
for (unsigned pred_idx : block->linear_preds) {
for (Temp sgpr : live_sgprs) {
auto it = live_temps[pred_idx].insert(sgpr);
if (it.second)
worklist.insert(pred_idx);
}
}
if (block->index == 0 && !(live_vgprs.empty() && live_sgprs.empty())) {
aco_print_program(program, stderr);
fprintf(stderr, "These temporaries are never defined or are defined after use:\n");
for (Temp vgpr : live_vgprs)
fprintf(stderr, "%%%d\n", vgpr.id());
for (Temp sgpr : live_sgprs)
fprintf(stderr, "%%%d\n", sgpr.id());
abort();
}
assert(block->index != 0 || new_demand == RegisterDemand());
}
} /* end namespace */
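/* Derive the wave count from the register demand: occupancy is limited by
* whichever of the rounded vgpr and sgpr allocations divides its register
* file into fewer slots, capped at 10 waves per SIMD. */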
void update_vgpr_sgpr_demand(Program* program, const RegisterDemand new_demand)
{
// TODO: also take shared mem into account
const int16_t total_sgpr_regs = program->chip_class >= GFX8 ? 800 : 512;
const int16_t max_addressible_sgpr = program->sgpr_limit;
/* VGPRs are allocated in chunks of 4 */
const int16_t rounded_vgpr_demand = std::max<int16_t>(4, (new_demand.vgpr + 3) & ~3);
/* SGPRs are allocated in chunks of 8 between 8 and 104; VCC occupies the last 2 registers */
const int16_t rounded_sgpr_demand = std::min(std::max<int16_t>(8, (new_demand.sgpr + 2 + 7) & ~7), max_addressible_sgpr);
/* the demand exceeds the hardware limits, so this program cannot be compiled without reducing register pressure */
if (new_demand.vgpr > 256 || new_demand.sgpr > max_addressible_sgpr) {
program->num_waves = 0;
program->max_reg_demand = new_demand;
} else {
program->num_waves = std::min<uint16_t>(10,
std::min<uint16_t>(256 / rounded_vgpr_demand,
total_sgpr_regs / rounded_sgpr_demand));
program->max_reg_demand = { int16_t((256 / program->num_waves) & ~3), std::min<int16_t>(((total_sgpr_regs / program->num_waves) & ~7) - 2, max_addressible_sgpr)};
}
}
live live_var_analysis(Program* program,
const struct radv_nir_compiler_options *options)
{
live result;
result.live_out.resize(program->blocks.size());
result.register_demand.resize(program->blocks.size());
std::set<unsigned> worklist;
std::vector<uint16_t> phi_sgpr_ops(program->blocks.size());
RegisterDemand new_demand;
/* this implementation assumes that the block idx corresponds to the block's position in program->blocks vector */
for (Block& block : program->blocks)
worklist.insert(block.index);
while (!worklist.empty()) {
std::set<unsigned>::reverse_iterator b_it = worklist.rbegin();
unsigned block_idx = *b_it;
worklist.erase(block_idx);
process_live_temps_per_block(program, result, &program->blocks[block_idx], worklist, phi_sgpr_ops);
new_demand.update(program->blocks[block_idx].register_demand);
}
/* calculate the program's register demand and number of waves */
update_vgpr_sgpr_demand(program, new_demand);
return result;
}
}

View File

@ -0,0 +1,241 @@
/*
* Copyright © 2019 Valve Corporation
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice (including the next
* paragraph) shall be included in all copies or substantial portions of the
* Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
* THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
* IN THE SOFTWARE.
*
* Authors:
* Rhys Perry (pendingchaos02@gmail.com)
*
*/
#include <map>
#include "aco_ir.h"
#include "aco_builder.h"
namespace aco {
struct phi_use {
Block *block;
unsigned phi_def;
bool operator<(const phi_use& other) const {
return std::make_tuple(block, phi_def) <
std::make_tuple(other.block, other.phi_def);
}
};
struct ssa_state {
std::map<unsigned, unsigned> latest;
std::map<unsigned, std::map<phi_use, uint64_t>> phis;
};
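/* Look up the current value of the lane mask at block_idx, walking up
* through single predecessors and creating linear phis on demand at merge
* points. Phi uses are recorded in state->phis so they can be patched when
* the value is later overwritten. */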
Operand get_ssa(Program *program, unsigned block_idx, ssa_state *state)
{
while (true) {
auto pos = state->latest.find(block_idx);
if (pos != state->latest.end())
return Operand({pos->second, s2});
Block& block = program->blocks[block_idx];
size_t pred = block.linear_preds.size();
if (pred == 0) {
return Operand(s2);
} else if (pred == 1) {
block_idx = block.linear_preds[0];
continue;
} else {
unsigned res = program->allocateId();
state->latest[block_idx] = res;
aco_ptr<Pseudo_instruction> phi{create_instruction<Pseudo_instruction>(aco_opcode::p_linear_phi, Format::PSEUDO, pred, 1)};
for (unsigned i = 0; i < pred; i++) {
phi->operands[i] = get_ssa(program, block.linear_preds[i], state);
if (phi->operands[i].isTemp()) {
assert(i < 64);
state->phis[phi->operands[i].tempId()][(phi_use){&block, res}] |= (uint64_t)1 << i;
}
}
phi->definitions[0] = Definition(Temp{res, s2});
block.instructions.emplace(block.instructions.begin(), std::move(phi));
return Operand({res, s2});
}
}
}
void update_phi(Program *program, ssa_state *state, Block *block, unsigned phi_def, uint64_t operand_mask) {
for (auto& phi : block->instructions) {
if (phi->opcode != aco_opcode::p_phi && phi->opcode != aco_opcode::p_linear_phi)
break;
if (phi->opcode != aco_opcode::p_linear_phi)
continue;
if (phi->definitions[0].tempId() != phi_def)
continue;
assert(ffsll(operand_mask) <= phi->operands.size());
uint64_t operands = operand_mask;
while (operands) {
unsigned operand = u_bit_scan64(&operands);
Operand new_operand = get_ssa(program, block->linear_preds[operand], state);
phi->operands[operand] = new_operand;
if (!new_operand.isUndefined())
state->phis[new_operand.tempId()][(phi_use){block, phi_def}] |= (uint64_t)1 << operand;
}
return;
}
assert(false);
}
Temp write_ssa(Program *program, Block *block, ssa_state *state, unsigned previous) {
unsigned id = program->allocateId();
state->latest[block->index] = id;
/* update phis */
if (previous) {
std::map<phi_use, uint64_t> phis;
phis.swap(state->phis[previous]);
for (auto& phi : phis)
update_phi(program, state, phi.first.block, phi.first.phi_def, phi.second);
}
return {id, s2};
}
void insert_before_branch(Block *block, aco_ptr<Instruction> instr)
{
int end = block->instructions.size() - 1;
if (block->instructions[end]->format == Format::PSEUDO_BRANCH)
block->instructions.emplace(std::prev(block->instructions.end()), std::move(instr));
else
block->instructions.emplace_back(std::move(instr));
}
void insert_before_logical_end(Block *block, aco_ptr<Instruction> instr)
{
for (int i = block->instructions.size() - 1; i >= 0; --i) {
if (block->instructions[i]->opcode == aco_opcode::p_logical_end) {
block->instructions.emplace(std::next(block->instructions.begin(), i), std::move(instr));
return;
}
}
insert_before_branch(block, std::move(instr));
}
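/* Lower a divergent boolean phi into lane-mask updates in the predecessors:
* each incoming value is merged as new = (old & ~exec) | (src & exec), so
* only the lanes that actually executed the predecessor are overwritten. */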
aco_ptr<Instruction> lower_divergent_bool_phi(Program *program, Block *block, aco_ptr<Instruction>& phi)
{
Builder bld(program);
ssa_state state;
for (unsigned i = 0; i < phi->operands.size(); i++) {
Block *pred = &program->blocks[block->logical_preds[i]];
if (phi->operands[i].isUndefined())
continue;
assert(phi->operands[i].isTemp());
Temp phi_src = phi->operands[i].getTemp();
if (phi_src.regClass() == s1) {
Temp new_phi_src = bld.tmp(s2);
insert_before_logical_end(pred,
bld.sop2(aco_opcode::s_cselect_b64, Definition(new_phi_src),
Operand((uint32_t)-1), Operand(0u), bld.scc(phi_src)).get_ptr());
phi_src = new_phi_src;
}
assert(phi_src.regClass() == s2);
Operand cur = get_ssa(program, pred->index, &state);
Temp new_cur = write_ssa(program, pred, &state, cur.isTemp() ? cur.tempId() : 0);
if (cur.isUndefined()) {
insert_before_logical_end(pred, bld.sop1(aco_opcode::s_mov_b64, Definition(new_cur), phi_src).get_ptr());
} else {
Temp tmp1 = bld.tmp(s2), tmp2 = bld.tmp(s2);
insert_before_logical_end(pred,
bld.sop2(aco_opcode::s_andn2_b64, Definition(tmp1), bld.def(s1, scc),
cur, Operand(exec, s2)).get_ptr());
insert_before_logical_end(pred,
bld.sop2(aco_opcode::s_and_b64, Definition(tmp2), bld.def(s1, scc),
phi_src, Operand(exec, s2)).get_ptr());
insert_before_logical_end(pred,
bld.sop2(aco_opcode::s_or_b64, Definition(new_cur), bld.def(s1, scc),
tmp1, tmp2).get_ptr());
}
}
return bld.sop1(aco_opcode::s_mov_b64, phi->definitions[0], get_ssa(program, block->index, &state)).get_ptr();
}
void lower_linear_bool_phi(Program *program, Block *block, aco_ptr<Instruction>& phi)
{
Builder bld(program);
for (unsigned i = 0; i < phi->operands.size(); i++) {
if (!phi->operands[i].isTemp())
continue;
Temp phi_src = phi->operands[i].getTemp();
if (phi_src.regClass() == s2) {
Temp new_phi_src = bld.tmp(s1);
insert_before_logical_end(&program->blocks[block->linear_preds[i]],
bld.sopc(aco_opcode::s_cmp_lg_u64, bld.scc(Definition(new_phi_src)),
Operand(0u), phi_src).get_ptr());
phi->operands[i].setTemp(new_phi_src);
}
}
}
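/* Rewrite all boolean phis in the program: divergent s2 phis become
* explicit lane-mask merges in the predecessors, linear s1 phis get their
* s2 operands converted with s_cmp_lg_u64. */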
void lower_bool_phis(Program* program)
{
for (Block& block : program->blocks) {
std::vector<aco_ptr<Instruction>> instructions;
std::vector<aco_ptr<Instruction>> non_phi;
instructions.swap(block.instructions);
block.instructions.reserve(instructions.size());
unsigned i = 0;
for (; i < instructions.size(); i++)
{
aco_ptr<Instruction>& phi = instructions[i];
if (phi->opcode != aco_opcode::p_phi && phi->opcode != aco_opcode::p_linear_phi)
break;
if (phi->opcode == aco_opcode::p_phi && phi->definitions[0].regClass() == s2) {
non_phi.emplace_back(std::move(lower_divergent_bool_phi(program, &block, phi)));
} else if (phi->opcode == aco_opcode::p_linear_phi && phi->definitions[0].regClass() == s1) {
/* if it's a valid non-boolean phi, this should be a no-op */
lower_linear_bool_phi(program, &block, phi);
block.instructions.emplace_back(std::move(phi));
} else {
block.instructions.emplace_back(std::move(phi));
}
}
for (auto&& instr : non_phi) {
assert(instr->opcode != aco_opcode::p_phi && instr->opcode != aco_opcode::p_linear_phi);
block.instructions.emplace_back(std::move(instr));
}
for (; i < instructions.size(); i++) {
aco_ptr<Instruction> instr = std::move(instructions[i]);
assert(instr->opcode != aco_opcode::p_phi && instr->opcode != aco_opcode::p_linear_phi);
block.instructions.emplace_back(std::move(instr));
}
}
}
}

View File

@ -0,0 +1,765 @@
/*
* Copyright © 2018 Valve Corporation
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice (including the next
* paragraph) shall be included in all copies or substantial portions of the
* Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
* THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
* IN THE SOFTWARE.
*
* Authors:
* Daniel Schürmann (daniel.schuermann@campus.tu-berlin.de)
*
*/
#include <map>
#include "aco_ir.h"
#include "aco_builder.h"
#include "util/u_math.h"
#include "sid.h"
namespace aco {
struct lower_context {
Program *program;
std::vector<aco_ptr<Instruction>> instructions;
};
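/* Emit op with a DPP-swizzled first source. VOP3 can't encode DPP directly,
* so for VOP3 the swizzle is done by DPP v_movs into vtmp first (preloaded
* with the identity in case the source lane is disabled). */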
void emit_dpp_op(lower_context *ctx, PhysReg dst, PhysReg src0, PhysReg src1, PhysReg vtmp, PhysReg wrtmp,
aco_opcode op, Format format, bool clobber_vcc, unsigned dpp_ctrl,
unsigned row_mask, unsigned bank_mask, bool bound_ctrl_zero, unsigned size,
Operand *identity=NULL) /* for VOP3 with sparse writes */
{
RegClass rc = RegClass(RegType::vgpr, size);
if (format == Format::VOP3) {
Builder bld(ctx->program, &ctx->instructions);
if (identity)
bld.vop1(aco_opcode::v_mov_b32, Definition(vtmp, v1), identity[0]);
if (identity && size >= 2)
bld.vop1(aco_opcode::v_mov_b32, Definition(PhysReg{vtmp+1}, v1), identity[1]);
for (unsigned i = 0; i < size; i++)
bld.vop1_dpp(aco_opcode::v_mov_b32, Definition(PhysReg{vtmp+i}, v1), Operand(PhysReg{src0+i}, v1),
dpp_ctrl, row_mask, bank_mask, bound_ctrl_zero);
if (clobber_vcc)
bld.vop3(op, Definition(dst, rc), Definition(vcc, s2), Operand(vtmp, rc), Operand(src1, rc));
else
bld.vop3(op, Definition(dst, rc), Operand(vtmp, rc), Operand(src1, rc));
} else {
assert(format == Format::VOP2 || format == Format::VOP1);
assert(size == 1 || (op == aco_opcode::v_mov_b32));
for (unsigned i = 0; i < size; i++) {
aco_ptr<DPP_instruction> dpp{create_instruction<DPP_instruction>(
op, (Format) ((uint32_t) format | (uint32_t) Format::DPP),
format == Format::VOP2 ? 2 : 1, clobber_vcc ? 2 : 1)};
dpp->operands[0] = Operand(PhysReg{src0+i}, rc);
if (format == Format::VOP2)
dpp->operands[1] = Operand(PhysReg{src1+i}, rc);
dpp->definitions[0] = Definition(PhysReg{dst+i}, rc);
if (clobber_vcc)
dpp->definitions[1] = Definition(vcc, s2);
dpp->dpp_ctrl = dpp_ctrl;
dpp->row_mask = row_mask;
dpp->bank_mask = bank_mask;
dpp->bound_ctrl = bound_ctrl_zero;
ctx->instructions.emplace_back(std::move(dpp));
}
}
}
uint32_t get_reduction_identity(ReduceOp op, unsigned idx)
{
switch (op) {
case iadd32:
case iadd64:
case fadd32:
case fadd64:
case ior32:
case ior64:
case ixor32:
case ixor64:
case umax32:
case umax64:
return 0;
case imul32:
case imul64:
return idx ? 0 : 1;
case fmul32:
return 0x3f800000u; /* 1.0 */
case fmul64:
return idx ? 0x3ff00000u : 0u; /* 1.0 */
case imin32:
return INT32_MAX;
case imin64:
return idx ? 0x7fffffffu : 0xffffffffu;
case imax32:
return INT32_MIN;
case imax64:
return idx ? 0x80000000u : 0;
case umin32:
case umin64:
case iand32:
case iand64:
return 0xffffffffu;
case fmin32:
return 0x7f800000u; /* infinity */
case fmin64:
return idx ? 0x7ff00000u : 0u; /* infinity */
case fmax32:
return 0xff800000u; /* negative infinity */
case fmax64:
return idx ? 0xfff00000u : 0u; /* negative infinity */
}
unreachable("Invalid reduction operation");
}
aco_opcode get_reduction_opcode(lower_context *ctx, ReduceOp op, bool *clobber_vcc, Format *format)
{
*clobber_vcc = false;
*format = Format::VOP2;
switch (op) {
case iadd32:
*clobber_vcc = ctx->program->chip_class < GFX9;
return ctx->program->chip_class < GFX9 ? aco_opcode::v_add_co_u32 : aco_opcode::v_add_u32;
case imul32:
*format = Format::VOP3;
return aco_opcode::v_mul_lo_u32;
case fadd32:
return aco_opcode::v_add_f32;
case fmul32:
return aco_opcode::v_mul_f32;
case imax32:
return aco_opcode::v_max_i32;
case imin32:
return aco_opcode::v_min_i32;
case umin32:
return aco_opcode::v_min_u32;
case umax32:
return aco_opcode::v_max_u32;
case fmin32:
return aco_opcode::v_min_f32;
case fmax32:
return aco_opcode::v_max_f32;
case iand32:
return aco_opcode::v_and_b32;
case ixor32:
return aco_opcode::v_xor_b32;
case ior32:
return aco_opcode::v_or_b32;
case iadd64:
case imul64:
assert(false);
break;
case fadd64:
*format = Format::VOP3;
return aco_opcode::v_add_f64;
case fmul64:
*format = Format::VOP3;
return aco_opcode::v_mul_f64;
case imin64:
case imax64:
case umin64:
case umax64:
assert(false);
break;
case fmin64:
*format = Format::VOP3;
return aco_opcode::v_min_f64;
case fmax64:
*format = Format::VOP3;
return aco_opcode::v_max_f64;
case iand64:
case ior64:
case ixor64:
assert(false);
break;
}
unreachable("Invalid reduction operation");
return aco_opcode::v_min_u32;
}
void emit_vopn(lower_context *ctx, PhysReg dst, PhysReg src0, PhysReg src1,
RegClass rc, aco_opcode op, Format format, bool clobber_vcc)
{
aco_ptr<Instruction> instr;
switch (format) {
case Format::VOP2:
instr.reset(create_instruction<VOP2_instruction>(op, format, 2, clobber_vcc ? 2 : 1));
break;
case Format::VOP3:
instr.reset(create_instruction<VOP3A_instruction>(op, format, 2, clobber_vcc ? 2 : 1));
break;
default:
assert(false);
}
instr->operands[0] = Operand(src0, rc);
instr->operands[1] = Operand(src1, rc);
instr->definitions[0] = Definition(dst, rc);
if (clobber_vcc)
instr->definitions[1] = Definition(vcc, s2);
ctx->instructions.emplace_back(std::move(instr));
}
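/* Lower a reduction or scan pseudo-instruction: inactive lanes are first
* set to the identity value, then a sequence of DPP swizzles (quad-perm
* swaps, half-row and row mirrors, row shifts and broadcasts) combines
* lanes cluster by cluster. */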
void emit_reduction(lower_context *ctx, aco_opcode op, ReduceOp reduce_op, unsigned cluster_size, PhysReg tmp,
PhysReg stmp, PhysReg vtmp, PhysReg sitmp, Operand src, Definition dst)
{
assert(cluster_size == 64 || op == aco_opcode::p_reduce);
Builder bld(ctx->program, &ctx->instructions);
PhysReg wrtmp{0}; /* should never be needed */
Format format;
bool should_clobber_vcc;
aco_opcode reduce_opcode = get_reduction_opcode(ctx, reduce_op, &should_clobber_vcc, &format);
Operand identity[2];
identity[0] = Operand(get_reduction_identity(reduce_op, 0));
identity[1] = Operand(get_reduction_identity(reduce_op, 1));
Operand vcndmask_identity[2] = {identity[0], identity[1]};
/* First, copy the source to tmp and set inactive lanes to the identity */
// note: this clobbers SCC!
bld.sop1(aco_opcode::s_or_saveexec_b64, Definition(stmp, s2), Definition(scc, s1), Definition(exec, s2), Operand(UINT64_MAX), Operand(exec, s2));
for (unsigned i = 0; i < src.size(); i++) {
/* p_exclusive_scan needs it to be a sgpr or inline constant for the v_writelane_b32 */
if (identity[i].isLiteral() && op == aco_opcode::p_exclusive_scan) {
bld.sop1(aco_opcode::s_mov_b32, Definition(PhysReg{sitmp+i}, s1), identity[i]);
identity[i] = Operand(PhysReg{sitmp+i}, s1);
bld.vop1(aco_opcode::v_mov_b32, Definition(PhysReg{tmp+i}, v1), identity[i]);
vcndmask_identity[i] = Operand(PhysReg{tmp+i}, v1);
} else if (identity[i].isLiteral()) {
bld.vop1(aco_opcode::v_mov_b32, Definition(PhysReg{tmp+i}, v1), identity[i]);
vcndmask_identity[i] = Operand(PhysReg{tmp+i}, v1);
}
}
for (unsigned i = 0; i < src.size(); i++) {
bld.vop2_e64(aco_opcode::v_cndmask_b32, Definition(PhysReg{tmp + i}, v1),
vcndmask_identity[i], Operand(PhysReg{src.physReg() + i}, v1),
Operand(stmp, s2));
}
bool exec_restored = false;
bool dst_written = false;
switch (op) {
case aco_opcode::p_reduce:
if (cluster_size == 1) break;
emit_dpp_op(ctx, tmp, tmp, tmp, vtmp, wrtmp, reduce_opcode, format, should_clobber_vcc,
dpp_quad_perm(1, 0, 3, 2), 0xf, 0xf, false, src.size());
if (cluster_size == 2) break;
emit_dpp_op(ctx, tmp, tmp, tmp, vtmp, wrtmp, reduce_opcode, format, should_clobber_vcc,
dpp_quad_perm(2, 3, 0, 1), 0xf, 0xf, false, src.size());
if (cluster_size == 4) break;
emit_dpp_op(ctx, tmp, tmp, tmp, vtmp, wrtmp, reduce_opcode, format, should_clobber_vcc,
dpp_row_half_mirror, 0xf, 0xf, false, src.size());
if (cluster_size == 8) break;
emit_dpp_op(ctx, tmp, tmp, tmp, vtmp, wrtmp, reduce_opcode, format, should_clobber_vcc,
dpp_row_mirror, 0xf, 0xf, false, src.size());
if (cluster_size == 16) break;
if (cluster_size == 32) {
for (unsigned i = 0; i < src.size(); i++)
bld.ds(aco_opcode::ds_swizzle_b32, Definition(PhysReg{vtmp+i}, v1), Operand(PhysReg{tmp+i}, v1), ds_pattern_bitmode(0x1f, 0, 0x10));
bld.sop1(aco_opcode::s_mov_b64, Definition(exec, s2), Operand(stmp, s2));
exec_restored = true;
emit_vopn(ctx, dst.physReg(), vtmp, tmp, src.regClass(), reduce_opcode, format, should_clobber_vcc);
dst_written = true;
} else {
assert(cluster_size == 64);
emit_dpp_op(ctx, tmp, tmp, tmp, vtmp, wrtmp, reduce_opcode, format, should_clobber_vcc,
dpp_row_bcast15, 0xa, 0xf, false, src.size());
emit_dpp_op(ctx, tmp, tmp, tmp, vtmp, wrtmp, reduce_opcode, format, should_clobber_vcc,
dpp_row_bcast31, 0xc, 0xf, false, src.size());
}
break;
case aco_opcode::p_exclusive_scan:
emit_dpp_op(ctx, tmp, tmp, tmp, vtmp, wrtmp, aco_opcode::v_mov_b32, Format::VOP1, false,
dpp_wf_sr1, 0xf, 0xf, true, src.size());
for (unsigned i = 0; i < src.size(); i++) {
if (!identity[i].isConstant() || identity[i].constantValue()) { /* bound_ctrl should take care of this otherwise */
assert((identity[i].isConstant() && !identity[i].isLiteral()) || identity[i].physReg() == PhysReg{sitmp+i});
bld.vop3(aco_opcode::v_writelane_b32, Definition(PhysReg{tmp+i}, v1),
identity[i], Operand(0u));
}
}
/* fall through */
case aco_opcode::p_inclusive_scan:
assert(cluster_size == 64);
emit_dpp_op(ctx, tmp, tmp, tmp, vtmp, wrtmp, reduce_opcode, format, should_clobber_vcc,
dpp_row_sr(1), 0xf, 0xf, false, src.size(), identity);
emit_dpp_op(ctx, tmp, tmp, tmp, vtmp, wrtmp, reduce_opcode, format, should_clobber_vcc,
dpp_row_sr(2), 0xf, 0xf, false, src.size(), identity);
emit_dpp_op(ctx, tmp, tmp, tmp, vtmp, wrtmp, reduce_opcode, format, should_clobber_vcc,
dpp_row_sr(4), 0xf, 0xf, false, src.size(), identity);
emit_dpp_op(ctx, tmp, tmp, tmp, vtmp, wrtmp, reduce_opcode, format, should_clobber_vcc,
dpp_row_sr(8), 0xf, 0xf, false, src.size(), identity);
emit_dpp_op(ctx, tmp, tmp, tmp, vtmp, wrtmp, reduce_opcode, format, should_clobber_vcc,
dpp_row_bcast15, 0xa, 0xf, false, src.size(), identity);
emit_dpp_op(ctx, tmp, tmp, tmp, vtmp, wrtmp, reduce_opcode, format, should_clobber_vcc,
dpp_row_bcast31, 0xc, 0xf, false, src.size(), identity);
break;
default:
unreachable("Invalid reduction mode");
}
if (!exec_restored)
bld.sop1(aco_opcode::s_mov_b64, Definition(exec, s2), Operand(stmp, s2));
if (op == aco_opcode::p_reduce && cluster_size == 64) {
for (unsigned k = 0; k < src.size(); k++) {
bld.vop3(aco_opcode::v_readlane_b32, Definition(PhysReg{dst.physReg() + k}, s1),
Operand(PhysReg{tmp + k}, v1), Operand(63u));
}
} else if (!(dst.physReg() == tmp) && !dst_written) {
for (unsigned k = 0; k < src.size(); k++) {
bld.vop1(aco_opcode::v_mov_b32, Definition(PhysReg{dst.physReg() + k}, v1),
Operand(PhysReg{tmp + k}, v1));
}
}
}
struct copy_operation {
Operand op;
Definition def;
unsigned uses;
unsigned size;
};
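/* Resolve a set of parallel copies: first emit every copy whose destination
* is not needed as a source elsewhere (paths in the location transfer
* graph), then break the remaining cycles with swaps (v_swap_b32 on GFX9+
* vgprs, xor swaps otherwise) and finally materialize the constants. */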
void handle_operands(std::map<PhysReg, copy_operation>& copy_map, lower_context* ctx, chip_class chip_class, Pseudo_instruction *pi)
{
Builder bld(ctx->program, &ctx->instructions);
aco_ptr<Instruction> mov;
std::map<PhysReg, copy_operation>::iterator it = copy_map.begin();
std::map<PhysReg, copy_operation>::iterator target;
bool writes_scc = false;
/* count the number of uses for each dst reg */
while (it != copy_map.end()) {
if (it->second.op.isConstant()) {
++it;
continue;
}
if (it->second.def.physReg() == scc)
writes_scc = true;
assert(!pi->tmp_in_scc || !(it->second.def.physReg() == pi->scratch_sgpr));
/* if src and dst reg are the same, remove operation */
if (it->first == it->second.op.physReg()) {
it = copy_map.erase(it);
continue;
}
/* check if the operand reg may be overwritten by another copy operation */
target = copy_map.find(it->second.op.physReg());
if (target != copy_map.end()) {
target->second.uses++;
}
++it;
}
/* first, handle paths in the location transfer graph */
bool preserve_scc = pi->tmp_in_scc && !writes_scc;
it = copy_map.begin();
while (it != copy_map.end()) {
/* the target reg is not used as operand for any other copy */
if (it->second.uses == 0) {
/* try to coalesce 32-bit sgpr copies to 64-bit copies */
if (it->second.def.getTemp().type() == RegType::sgpr && it->second.size == 1 &&
!it->second.op.isConstant() && it->first % 2 == it->second.op.physReg() % 2) {
PhysReg other_def_reg = PhysReg{it->first % 2 ? it->first - 1 : it->first + 1};
PhysReg other_op_reg = PhysReg{it->first % 2 ? it->second.op.physReg() - 1 : it->second.op.physReg() + 1};
std::map<PhysReg, copy_operation>::iterator other = copy_map.find(other_def_reg);
if (other != copy_map.end() && !other->second.uses && other->second.size == 1 &&
other->second.op.physReg() == other_op_reg && !other->second.op.isConstant()) {
std::map<PhysReg, copy_operation>::iterator to_erase = it->first % 2 ? it : other;
it = it->first % 2 ? other : it;
copy_map.erase(to_erase);
it->second.size = 2;
}
}
if (it->second.def.physReg() == scc) {
bld.sopc(aco_opcode::s_cmp_lg_i32, it->second.def, it->second.op, Operand(0u));
preserve_scc = true;
} else if (it->second.size == 2 && it->second.def.getTemp().type() == RegType::sgpr) {
bld.sop1(aco_opcode::s_mov_b64, it->second.def, Operand(it->second.op.physReg(), s2));
} else {
bld.copy(it->second.def, it->second.op);
}
/* reduce the number of uses of the operand reg by one */
if (!it->second.op.isConstant()) {
for (unsigned i = 0; i < it->second.size; i++) {
target = copy_map.find(PhysReg{it->second.op.physReg() + i});
if (target != copy_map.end())
target->second.uses--;
}
}
copy_map.erase(it);
it = copy_map.begin();
continue;
} else {
/* the target reg is used as operand, check the next entry */
++it;
}
}
if (copy_map.empty())
return;
/* all target regs are needed as operand somewhere which means, all entries are part of a cycle */
bool constants = false;
for (it = copy_map.begin(); it != copy_map.end(); ++it) {
assert(it->second.op.isFixed());
if (it->first == it->second.op.physReg())
continue;
/* do constants later */
if (it->second.op.isConstant()) {
constants = true;
continue;
}
if (preserve_scc && it->second.def.getTemp().type() == RegType::sgpr)
assert(!(it->second.def.physReg() == pi->scratch_sgpr));
/* to resolve the cycle, we have to swap the src reg with the dst reg */
copy_operation swap = it->second;
assert(swap.op.regClass() == swap.def.regClass());
Operand def_as_op = Operand(swap.def.physReg(), swap.def.regClass());
Definition op_as_def = Definition(swap.op.physReg(), swap.op.regClass());
if (chip_class >= GFX9 && swap.def.getTemp().type() == RegType::vgpr) {
bld.vop1(aco_opcode::v_swap_b32, swap.def, op_as_def, swap.op, def_as_op);
} else if (swap.op.physReg() == scc || swap.def.physReg() == scc) {
/* we need to swap scc and another sgpr */
assert(!preserve_scc);
PhysReg other = swap.op.physReg() == scc ? swap.def.physReg() : swap.op.physReg();
bld.sop1(aco_opcode::s_mov_b32, Definition(pi->scratch_sgpr, s1), Operand(scc, s1));
bld.sopc(aco_opcode::s_cmp_lg_i32, Definition(scc, s1), Operand(other, s1), Operand(0u));
bld.sop1(aco_opcode::s_mov_b32, Definition(other, s1), Operand(pi->scratch_sgpr, s1));
} else if (swap.def.getTemp().type() == RegType::sgpr) {
if (preserve_scc) {
bld.sop1(aco_opcode::s_mov_b32, Definition(pi->scratch_sgpr, s1), swap.op);
bld.sop1(aco_opcode::s_mov_b32, op_as_def, def_as_op);
bld.sop1(aco_opcode::s_mov_b32, swap.def, Operand(pi->scratch_sgpr, s1));
} else {
bld.sop2(aco_opcode::s_xor_b32, op_as_def, Definition(scc, s1), swap.op, def_as_op);
bld.sop2(aco_opcode::s_xor_b32, swap.def, Definition(scc, s1), swap.op, def_as_op);
bld.sop2(aco_opcode::s_xor_b32, op_as_def, Definition(scc, s1), swap.op, def_as_op);
}
} else {
bld.vop2(aco_opcode::v_xor_b32, op_as_def, swap.op, def_as_op);
bld.vop2(aco_opcode::v_xor_b32, swap.def, swap.op, def_as_op);
bld.vop2(aco_opcode::v_xor_b32, op_as_def, swap.op, def_as_op);
}
/* change the operand reg of the target's use */
assert(swap.uses == 1);
target = it;
for (++target; target != copy_map.end(); ++target) {
if (target->second.op.physReg() == it->first) {
target->second.op.setFixed(swap.op.physReg());
break;
}
}
}
/* copy constants into registers which were operands */
if (constants) {
for (it = copy_map.begin(); it != copy_map.end(); ++it) {
if (!it->second.op.isConstant())
continue;
if (it->second.def.physReg() == scc) {
bld.sopc(aco_opcode::s_cmp_lg_i32, Definition(scc, s1), Operand(0u), Operand(it->second.op.constantValue() ? 1u : 0u));
} else {
bld.copy(it->second.def, it->second.op);
}
}
}
}
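/* Final lowering pass: expand pseudo-instructions (vector ops, parallel
* copies, spills, reductions, discards) into real hw instructions and
* select the branch opcodes. */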
void lower_to_hw_instr(Program* program)
{
Block *discard_block = NULL;
for (size_t i = 0; i < program->blocks.size(); i++)
{
Block *block = &program->blocks[i];
lower_context ctx;
ctx.program = program;
Builder bld(program, &ctx.instructions);
for (size_t j = 0; j < block->instructions.size(); j++) {
aco_ptr<Instruction>& instr = block->instructions[j];
aco_ptr<Instruction> mov;
if (instr->format == Format::PSEUDO) {
Pseudo_instruction *pi = (Pseudo_instruction*)instr.get();
switch (instr->opcode)
{
case aco_opcode::p_extract_vector:
{
unsigned reg = instr->operands[0].physReg() + instr->operands[1].constantValue() * instr->definitions[0].size();
RegClass rc = RegClass(instr->operands[0].getTemp().type(), 1);
RegClass rc_def = RegClass(instr->definitions[0].getTemp().type(), 1);
if (reg == instr->definitions[0].physReg())
break;
std::map<PhysReg, copy_operation> copy_operations;
for (unsigned i = 0; i < instr->definitions[0].size(); i++) {
Definition def = Definition(PhysReg{instr->definitions[0].physReg() + i}, rc_def);
copy_operations[def.physReg()] = {Operand(PhysReg{reg + i}, rc), def, 0, 1};
}
handle_operands(copy_operations, &ctx, program->chip_class, pi);
break;
}
case aco_opcode::p_create_vector:
{
std::map<PhysReg, copy_operation> copy_operations;
RegClass rc_def = RegClass(instr->definitions[0].getTemp().type(), 1);
unsigned reg_idx = 0;
for (const Operand& op : instr->operands) {
if (op.isConstant()) {
const PhysReg reg = PhysReg{instr->definitions[0].physReg() + reg_idx};
const Definition def = Definition(reg, rc_def);
copy_operations[reg] = {op, def, 0, 1};
reg_idx++;
continue;
}
RegClass rc_op = RegClass(op.getTemp().type(), 1);
for (unsigned j = 0; j < op.size(); j++)
{
const Operand copy_op = Operand(PhysReg{op.physReg() + j}, rc_op);
const Definition def = Definition(PhysReg{instr->definitions[0].physReg() + reg_idx}, rc_def);
copy_operations[def.physReg()] = {copy_op, def, 0, 1};
reg_idx++;
}
}
handle_operands(copy_operations, &ctx, program->chip_class, pi);
break;
}
case aco_opcode::p_split_vector:
{
std::map<PhysReg, copy_operation> copy_operations;
RegClass rc_op = instr->operands[0].isConstant() ? s1 : RegClass(instr->operands[0].regClass().type(), 1);
for (unsigned i = 0; i < instr->definitions.size(); i++) {
unsigned k = instr->definitions[i].size();
RegClass rc_def = RegClass(instr->definitions[i].getTemp().type(), 1);
for (unsigned j = 0; j < k; j++) {
Operand op = Operand(PhysReg{instr->operands[0].physReg() + (i*k+j)}, rc_op);
Definition def = Definition(PhysReg{instr->definitions[i].physReg() + j}, rc_def);
copy_operations[def.physReg()] = {op, def, 0, 1};
}
}
handle_operands(copy_operations, &ctx, program->chip_class, pi);
break;
}
case aco_opcode::p_parallelcopy:
case aco_opcode::p_wqm:
{
std::map<PhysReg, copy_operation> copy_operations;
for (unsigned i = 0; i < instr->operands.size(); i++)
{
Operand operand = instr->operands[i];
if (operand.isConstant() || operand.size() == 1) {
assert(instr->definitions[i].size() == 1);
copy_operations[instr->definitions[i].physReg()] = {operand, instr->definitions[i], 0, 1};
} else {
RegClass def_rc = RegClass(instr->definitions[i].regClass().type(), 1);
RegClass op_rc = RegClass(operand.getTemp().type(), 1);
for (unsigned j = 0; j < operand.size(); j++)
{
Operand op = Operand(PhysReg{instr->operands[i].physReg() + j}, op_rc);
Definition def = Definition(PhysReg{instr->definitions[i].physReg() + j}, def_rc);
copy_operations[def.physReg()] = {op, def, 0, 1};
}
}
}
handle_operands(copy_operations, &ctx, program->chip_class, pi);
break;
}
case aco_opcode::p_discard_if:
{
bool early_exit = false;
if (block->instructions[j + 1]->opcode != aco_opcode::p_logical_end ||
block->instructions[j + 2]->opcode != aco_opcode::s_endpgm) {
early_exit = true;
}
if (early_exit && !discard_block) {
discard_block = program->create_and_insert_block();
block = &program->blocks[i];
bld.reset(discard_block);
bld.exp(aco_opcode::exp, Operand(v1), Operand(v1), Operand(v1), Operand(v1),
0, V_008DFC_SQ_EXP_NULL, false, true, true);
if (program->wb_smem_l1_on_end)
bld.smem(aco_opcode::s_dcache_wb);
bld.sopp(aco_opcode::s_endpgm);
bld.reset(&ctx.instructions);
}
// TODO: optimize uniform conditions
Definition branch_cond = instr->definitions.back();
Operand discard_cond = instr->operands.back();
aco_ptr<Instruction> sop2;
/* backwards, to finally branch on the global exec mask */
for (int i = instr->operands.size() - 2; i >= 0; i--) {
bld.sop2(aco_opcode::s_andn2_b64,
instr->definitions[i], /* new mask */
branch_cond, /* scc */
instr->operands[i], /* old mask */
discard_cond);
}
if (early_exit) {
bld.sopp(aco_opcode::s_cbranch_scc0, bld.scc(branch_cond.getTemp()), discard_block->index);
discard_block->linear_preds.push_back(block->index);
block->linear_succs.push_back(discard_block->index);
}
break;
}
case aco_opcode::p_spill:
{
assert(instr->operands[0].regClass() == v1.as_linear());
for (unsigned i = 0; i < instr->operands[2].size(); i++) {
bld.vop3(aco_opcode::v_writelane_b32, bld.def(v1, instr->operands[0].physReg()),
Operand(PhysReg{instr->operands[2].physReg() + i}, s1),
Operand(instr->operands[1].constantValue() + i));
}
break;
}
case aco_opcode::p_reload:
{
assert(instr->operands[0].regClass() == v1.as_linear());
for (unsigned i = 0; i < instr->definitions[0].size(); i++) {
bld.vop3(aco_opcode::v_readlane_b32,
bld.def(s1, PhysReg{instr->definitions[0].physReg() + i}),
instr->operands[0], Operand(instr->operands[1].constantValue() + i));
}
break;
}
case aco_opcode::p_as_uniform:
{
if (instr->operands[0].isConstant() || instr->operands[0].regClass().type() == RegType::sgpr) {
std::map<PhysReg, copy_operation> copy_operations;
Operand operand = instr->operands[0];
if (operand.isConstant() || operand.size() == 1) {
assert(instr->definitions[0].size() == 1);
copy_operations[instr->definitions[0].physReg()] = {operand, instr->definitions[0], 0, 1};
} else {
for (unsigned i = 0; i < operand.size(); i++)
{
Operand op = Operand(PhysReg{operand.physReg() + i}, s1);
Definition def = Definition(PhysReg{instr->definitions[0].physReg() + i}, s1);
copy_operations[def.physReg()] = {op, def, 0, 1};
}
}
handle_operands(copy_operations, &ctx, program->chip_class, pi);
} else {
assert(instr->operands[0].regClass().type() == RegType::vgpr);
assert(instr->definitions[0].regClass().type() == RegType::sgpr);
assert(instr->operands[0].size() == instr->definitions[0].size());
for (unsigned i = 0; i < instr->definitions[0].size(); i++) {
bld.vop1(aco_opcode::v_readfirstlane_b32,
bld.def(s1, PhysReg{instr->definitions[0].physReg() + i}),
Operand(PhysReg{instr->operands[0].physReg() + i}, v1));
}
}
break;
}
default:
break;
}
} else if (instr->format == Format::PSEUDO_BRANCH) {
Pseudo_branch_instruction* branch = static_cast<Pseudo_branch_instruction*>(instr.get());
/* check if all blocks from current to target are empty */
bool can_remove = block->index < branch->target[0];
for (unsigned i = block->index + 1; can_remove && i < branch->target[0]; i++) {
if (program->blocks[i].instructions.size())
can_remove = false;
}
if (can_remove)
continue;
switch (instr->opcode) {
case aco_opcode::p_branch:
assert(block->linear_succs[0] == branch->target[0]);
bld.sopp(aco_opcode::s_branch, branch->target[0]);
break;
case aco_opcode::p_cbranch_nz:
assert(block->linear_succs[1] == branch->target[0]);
if (branch->operands[0].physReg() == exec)
bld.sopp(aco_opcode::s_cbranch_execnz, branch->target[0]);
else if (branch->operands[0].physReg() == vcc)
bld.sopp(aco_opcode::s_cbranch_vccnz, branch->target[0]);
else {
assert(branch->operands[0].physReg() == scc);
bld.sopp(aco_opcode::s_cbranch_scc1, branch->target[0]);
}
break;
case aco_opcode::p_cbranch_z:
assert(block->linear_succs[1] == branch->target[0]);
if (branch->operands[0].physReg() == exec)
bld.sopp(aco_opcode::s_cbranch_execz, branch->target[0]);
else if (branch->operands[0].physReg() == vcc)
bld.sopp(aco_opcode::s_cbranch_vccz, branch->target[0]);
else {
assert(branch->operands[0].physReg() == scc);
bld.sopp(aco_opcode::s_cbranch_scc0, branch->target[0]);
}
break;
default:
unreachable("Unknown Pseudo branch instruction!");
}
} else if (instr->format == Format::PSEUDO_REDUCTION) {
Pseudo_reduction_instruction* reduce = static_cast<Pseudo_reduction_instruction*>(instr.get());
emit_reduction(&ctx, reduce->opcode, reduce->reduce_op, reduce->cluster_size,
reduce->operands[1].physReg(), // tmp
reduce->definitions[1].physReg(), // stmp
reduce->operands[2].physReg(), // vtmp
reduce->definitions[2].physReg(), // sitmp
reduce->operands[0], reduce->definitions[0]);
} else {
ctx.instructions.emplace_back(std::move(instr));
}
}
block->instructions.swap(ctx.instructions);
}
}
}

File diff suppressed because it is too large


@ -0,0 +1,74 @@
template = """\
/*
* Copyright (c) 2018 Valve Corporation
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice (including the next
* paragraph) shall be included in all copies or substantial portions of the
* Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
* THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
* IN THE SOFTWARE.
*
*/
#include "aco_ir.h"
namespace aco {
const unsigned VOPC_to_GFX6[256] = {
% for code in VOPC_GFX6:
${code},
% endfor
};
<%
opcode_names = sorted(opcodes.keys())
can_use_input_modifiers = "".join([opcodes[name].input_mod for name in reversed(opcode_names)])
can_use_output_modifiers = "".join([opcodes[name].output_mod for name in reversed(opcode_names)])
%>
extern const aco::Info instr_info = {
.opcode_gfx9 = {
% for name in opcode_names:
${opcodes[name].opcode_gfx9},
% endfor
},
.opcode_gfx10 = {
% for name in opcode_names:
${opcodes[name].opcode_gfx10},
% endfor
},
.can_use_input_modifiers = std::bitset<${len(opcode_names)}>("${can_use_input_modifiers}"),
.can_use_output_modifiers = std::bitset<${len(opcode_names)}>("${can_use_output_modifiers}"),
.name = {
% for name in opcode_names:
"${name}",
% endfor
},
.format = {
% for name in opcode_names:
aco::Format::${str(opcodes[name].format.name)},
% endfor
},
};
}
"""
from aco_opcodes import opcodes, VOPC_GFX6
from mako.template import Template
print(Template(template).render(opcodes=opcodes, VOPC_GFX6=VOPC_GFX6))
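# The template above renders the opcode tables as C++ source on stdout; the
# build system is expected to redirect that output into the generated file.
# A hypothetical manual invocation (file names assumed, not taken from the
# build scripts):
#   python aco_opcodes_cpp.py > aco_opcodes.cpp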


@ -0,0 +1,47 @@
template = """\
/*
* Copyright (c) 2018 Valve Corporation
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice (including the next
* paragraph) shall be included in all copies or substantial portions of the
* Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
* THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
* IN THE SOFTWARE.
*
* Authors:
* Daniel Schuermann (daniel.schuermann@campus.tu-berlin.de)
*/
#ifndef _ACO_OPCODES_
#define _ACO_OPCODES_
#include <cstdint>
<% opcode_names = sorted(opcodes.keys()) %>
enum class aco_opcode : std::uint16_t {
% for name in opcode_names:
${name},
% endfor
last_opcode = ${opcode_names[-1]},
num_opcodes = last_opcode + 1
};
#endif /* _ACO_OPCODES_ */"""
from aco_opcodes import opcodes
from mako.template import Template
print(Template(template).render(opcodes=opcodes))


@ -0,0 +1,327 @@
/*
* Copyright © 2018 Valve Corporation
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice (including the next
* paragraph) shall be included in all copies or substantial portions of the
* Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
* THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
* IN THE SOFTWARE.
*
*/
#include <map>
#include <unordered_set>
#include "aco_ir.h"
/*
* Implements the algorithm for dominator-tree value numbering
* from "Value Numbering" by Briggs, Cooper, and Simpson.
*/
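/* A small illustration (sketch only): given
 *   BB0: %3 = v_mul_f32 %1, %2
 *   BB1 (idom = BB0): %4 = v_mul_f32 %1, %2
 * BB1 starts with a copy of BB0's expression set, so the second multiply
 * hashes to the existing entry; %4 is renamed to %3 and the duplicate
 * instruction is dropped from the block. */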
namespace aco {
namespace {
struct InstrHash {
std::size_t operator()(Instruction* instr) const
{
uint64_t hash = (uint64_t) instr->opcode + (uint64_t) instr->format;
for (unsigned i = 0; i < instr->operands.size(); i++) {
Operand op = instr->operands[i];
uint64_t val = op.isTemp() ? op.tempId() : op.isFixed() ? op.physReg() : op.constantValue();
hash |= val << (i+1) * 8;
}
if (instr->isVOP3()) {
VOP3A_instruction* vop3 = static_cast<VOP3A_instruction*>(instr);
for (unsigned i = 0; i < 3; i++) {
hash ^= vop3->abs[i] << (i*3 + 0);
hash ^= vop3->opsel[i] << (i*3 + 1);
hash ^= vop3->neg[i] << (i*3 + 2);
}
hash ^= (vop3->clamp << 28) * 13;
hash += vop3->omod << 19;
}
switch (instr->format) {
case Format::SMEM:
break;
case Format::VINTRP: {
Interp_instruction* interp = static_cast<Interp_instruction*>(instr);
hash ^= interp->attribute << 13;
hash ^= interp->component << 27;
break;
}
case Format::DS:
break;
default:
break;
}
return hash;
}
};
struct InstrPred {
bool operator()(Instruction* a, Instruction* b) const
{
if (a->format != b->format)
return false;
if (a->opcode != b->opcode)
return false;
if (a->operands.size() != b->operands.size() || a->definitions.size() != b->definitions.size())
return false; /* possible with pseudo-instructions */
for (unsigned i = 0; i < a->operands.size(); i++) {
if (a->operands[i].isConstant()) {
if (!b->operands[i].isConstant())
return false;
if (a->operands[i].constantValue() != b->operands[i].constantValue())
return false;
}
else if (a->operands[i].isTemp()) {
if (!b->operands[i].isTemp())
return false;
if (a->operands[i].tempId() != b->operands[i].tempId())
return false;
}
else if (a->operands[i].isUndefined() ^ b->operands[i].isUndefined())
return false;
if (a->operands[i].isFixed()) {
if (a->operands[i].physReg() == exec)
return false;
if (!b->operands[i].isFixed())
return false;
if (!(a->operands[i].physReg() == b->operands[i].physReg()))
return false;
}
}
for (unsigned i = 0; i < a->definitions.size(); i++) {
if (a->definitions[i].isTemp()) {
if (!b->definitions[i].isTemp())
return false;
if (a->definitions[i].regClass() != b->definitions[i].regClass())
return false;
}
if (a->definitions[i].isFixed()) {
if (!b->definitions[i].isFixed())
return false;
if (!(a->definitions[i].physReg() == b->definitions[i].physReg()))
return false;
}
}
if (a->format == Format::PSEUDO_BRANCH)
return false;
if (a->isVOP3()) {
VOP3A_instruction* a3 = static_cast<VOP3A_instruction*>(a);
VOP3A_instruction* b3 = static_cast<VOP3A_instruction*>(b);
for (unsigned i = 0; i < 3; i++) {
if (a3->abs[i] != b3->abs[i] ||
a3->opsel[i] != b3->opsel[i] ||
a3->neg[i] != b3->neg[i])
return false;
}
return a3->clamp == b3->clamp &&
a3->omod == b3->omod;
}
if (a->isDPP()) {
DPP_instruction* aDPP = static_cast<DPP_instruction*>(a);
DPP_instruction* bDPP = static_cast<DPP_instruction*>(b);
return aDPP->dpp_ctrl == bDPP->dpp_ctrl &&
aDPP->bank_mask == bDPP->bank_mask &&
aDPP->row_mask == bDPP->row_mask &&
aDPP->bound_ctrl == bDPP->bound_ctrl &&
aDPP->abs[0] == bDPP->abs[0] &&
aDPP->abs[1] == bDPP->abs[1] &&
aDPP->neg[0] == bDPP->neg[0] &&
aDPP->neg[1] == bDPP->neg[1];
}
switch (a->format) {
case Format::VOPC: {
/* Since the results depend on the exec mask, these shouldn't
* be value numbered (this matters especially for subgroupBallot()). */
return false;
}
case Format::SOPK: {
SOPK_instruction* aK = static_cast<SOPK_instruction*>(a);
SOPK_instruction* bK = static_cast<SOPK_instruction*>(b);
return aK->imm == bK->imm;
}
case Format::SMEM: {
SMEM_instruction* aS = static_cast<SMEM_instruction*>(a);
SMEM_instruction* bS = static_cast<SMEM_instruction*>(b);
return aS->can_reorder && bS->can_reorder &&
aS->glc == bS->glc && aS->nv == bS->nv;
}
case Format::VINTRP: {
Interp_instruction* aI = static_cast<Interp_instruction*>(a);
Interp_instruction* bI = static_cast<Interp_instruction*>(b);
if (aI->attribute != bI->attribute)
return false;
if (aI->component != bI->component)
return false;
return true;
}
case Format::PSEUDO_REDUCTION:
return false;
case Format::MTBUF: {
/* this is fine since they are only used for vertex input fetches */
MTBUF_instruction* aM = static_cast<MTBUF_instruction *>(a);
MTBUF_instruction* bM = static_cast<MTBUF_instruction *>(b);
return aM->dfmt == bM->dfmt &&
aM->nfmt == bM->nfmt &&
aM->offset == bM->offset &&
aM->offen == bM->offen &&
aM->idxen == bM->idxen &&
aM->glc == bM->glc &&
aM->slc == bM->slc &&
aM->tfe == bM->tfe &&
aM->disable_wqm == bM->disable_wqm;
}
/* we want to optimize these in NIR and don't want to hassle with load-store dependencies here */
case Format::MUBUF:
case Format::FLAT:
case Format::GLOBAL:
case Format::SCRATCH:
case Format::DS:
return false;
case Format::MIMG: {
MIMG_instruction* aM = static_cast<MIMG_instruction*>(a);
MIMG_instruction* bM = static_cast<MIMG_instruction*>(b);
return aM->can_reorder && bM->can_reorder &&
aM->dmask == bM->dmask &&
aM->unrm == bM->unrm &&
aM->glc == bM->glc &&
aM->slc == bM->slc &&
aM->tfe == bM->tfe &&
aM->da == bM->da &&
aM->lwe == bM->lwe &&
aM->r128 == bM->r128 &&
aM->a16 == bM->a16 &&
aM->d16 == bM->d16 &&
aM->disable_wqm == bM->disable_wqm;
}
default:
return true;
}
}
};
typedef std::unordered_set<Instruction*, InstrHash, InstrPred> expr_set;
void process_block(Block& block,
expr_set& expr_values,
std::map<uint32_t, Temp>& renames)
{
bool run = false;
std::vector<aco_ptr<Instruction>>::iterator it = block.instructions.begin();
std::vector<aco_ptr<Instruction>> new_instructions;
new_instructions.reserve(block.instructions.size());
expr_set phi_values;
while (it != block.instructions.end()) {
aco_ptr<Instruction>& instr = *it;
/* first, rename operands */
for (Operand& op : instr->operands) {
if (!op.isTemp())
continue;
auto it = renames.find(op.tempId());
if (it != renames.end())
op.setTemp(it->second);
}
if (instr->definitions.empty() || !run) {
if (instr->opcode == aco_opcode::p_logical_start)
run = true;
else if (instr->opcode == aco_opcode::p_logical_end)
run = false;
else if (instr->opcode == aco_opcode::p_phi || instr->opcode == aco_opcode::p_linear_phi) {
std::pair<expr_set::iterator, bool> res = phi_values.emplace(instr.get());
if (!res.second) {
Instruction* orig_phi = *(res.first);
renames.emplace(instr->definitions[0].tempId(), orig_phi->definitions[0].getTemp());
++it;
continue;
}
}
new_instructions.emplace_back(std::move(instr));
++it;
continue;
}
/* simple copy-propagation through renaming */
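/* e.g. after "%5 = s_mov_b32 %4", later uses of %5 are rewritten to read
 * %4 directly; the mov itself is kept here and (assuming a later
 * dead-code-elimination pass) disappears once nothing uses %5 anymore. */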
if ((instr->opcode == aco_opcode::s_mov_b32 || instr->opcode == aco_opcode::s_mov_b64 || instr->opcode == aco_opcode::v_mov_b32) &&
!instr->definitions[0].isFixed() && instr->operands[0].isTemp() && instr->operands[0].regClass() == instr->definitions[0].regClass() &&
!instr->isDPP() && !((int)instr->format & (int)Format::SDWA)) {
renames[instr->definitions[0].tempId()] = instr->operands[0].getTemp();
}
std::pair<expr_set::iterator, bool> res = expr_values.emplace(instr.get());
/* if there was already an expression with the same value number */
if (!res.second) {
Instruction* orig_instr = *(res.first);
assert(instr->definitions.size() == orig_instr->definitions.size());
for (unsigned i = 0; i < instr->definitions.size(); i++) {
assert(instr->definitions[i].regClass() == orig_instr->definitions[i].regClass());
renames.emplace(instr->definitions[i].tempId(), orig_instr->definitions[i].getTemp());
}
} else {
new_instructions.emplace_back(std::move(instr));
}
++it;
}
block.instructions.swap(new_instructions);
}
void rename_phi_operands(Block& block, std::map<uint32_t, Temp>& renames)
{
for (aco_ptr<Instruction>& phi : block.instructions) {
if (phi->opcode != aco_opcode::p_phi && phi->opcode != aco_opcode::p_linear_phi)
break;
for (Operand& op : phi->operands) {
if (!op.isTemp())
continue;
auto it = renames.find(op.tempId());
if (it != renames.end())
op.setTemp(it->second);
}
}
}
} /* end namespace */
void value_numbering(Program* program)
{
std::vector<expr_set> expr_values(program->blocks.size());
std::map<uint32_t, Temp> renames;
for (Block& block : program->blocks) {
if (block.logical_idom != -1) {
/* initialize expr_values from idom */
expr_values[block.index] = expr_values[block.logical_idom];
process_block(block, expr_values[block.index], renames);
} else {
expr_set empty;
process_block(block, empty, renames);
}
}
for (Block& block : program->blocks)
rename_phi_operands(block, renames);
}
}

File diff suppressed because it is too large


@ -0,0 +1,104 @@
#include <iomanip>
#include "aco_ir.h"
#include "llvm-c/Disassembler.h"
#include "ac_llvm_util.h"
#include <llvm/ADT/StringRef.h>
namespace aco {
void print_asm(Program *program, std::vector<uint32_t>& binary,
unsigned exec_size, enum radeon_family family, std::ostream& out)
{
std::vector<bool> referenced_blocks(program->blocks.size());
referenced_blocks[0] = true;
for (Block& block : program->blocks) {
for (unsigned succ : block.linear_succs)
referenced_blocks[succ] = true;
}
std::vector<std::tuple<uint64_t, llvm::StringRef, uint8_t>> symbols;
std::vector<std::array<char,16>> block_names;
block_names.reserve(program->blocks.size());
for (Block& block : program->blocks) {
if (!referenced_blocks[block.index])
continue;
std::array<char, 16> name;
sprintf(name.data(), "BB%u", block.index);
block_names.push_back(name);
symbols.emplace_back(block.offset * 4, llvm::StringRef(block_names[block_names.size() - 1].data()), 0);
}
LLVMDisasmContextRef disasm = LLVMCreateDisasmCPU("amdgcn-mesa-mesa3d",
ac_get_llvm_processor_name(family),
&symbols, 0, NULL, NULL);
char outline[1024];
size_t pos = 0;
bool invalid = false;
unsigned next_block = 0;
while (pos < exec_size) {
while (next_block < program->blocks.size() && pos == program->blocks[next_block].offset) {
if (referenced_blocks[next_block])
out << "BB" << std::dec << next_block << ":" << std::endl;
next_block++;
}
size_t l = LLVMDisasmInstruction(disasm, (uint8_t *) &binary[pos],
(exec_size - pos) * sizeof(uint32_t), pos * 4,
outline, sizeof(outline));
size_t new_pos;
const int align_width = 60;
if (program->chip_class == GFX9 && !l && ((binary[pos] & 0xffff8000) == 0xd1348000)) { /* not actually an invalid instruction */
out << std::left << std::setw(align_width) << std::setfill(' ') << "\tv_add_u32_e64 + clamp";
new_pos = pos + 2;
} else if (!l) {
out << std::left << std::setw(align_width) << std::setfill(' ') << "(invalid instruction)";
new_pos = pos + 1;
invalid = true;
} else {
out << std::left << std::setw(align_width) << std::setfill(' ') << outline;
assert(l % 4 == 0);
new_pos = pos + l / 4;
}
out << std::right;
out << " ;";
for (; pos < new_pos; pos++)
out << " " << std::setfill('0') << std::setw(8) << std::hex << binary[pos];
out << std::endl;
}
out << std::setfill(' ') << std::setw(0) << std::dec;
assert(next_block == program->blocks.size());
LLVMDisasmDispose(disasm);
if (program->constant_data.size()) {
out << std::endl << "/* constant data */" << std::endl;
for (unsigned i = 0; i < program->constant_data.size(); i += 32) {
out << '[' << std::setw(6) << std::setfill('0') << std::dec << i << ']';
unsigned line_size = std::min<size_t>(program->constant_data.size() - i, 32);
for (unsigned j = 0; j < line_size; j += 4) {
unsigned size = std::min<size_t>(program->constant_data.size() - (i + j), 4);
uint32_t v = 0;
memcpy(&v, &program->constant_data[i + j], size);
out << " " << std::setw(8) << std::setfill('0') << std::hex << v;
}
out << std::endl;
}
}
out << std::setfill(' ') << std::setw(0) << std::dec;
if (invalid) {
/* Invalid instructions usually lead to GPU hangs, which can make
* getting the actual invalid instruction hard. Abort here so that we
* can find the problem.
*/
abort();
}
}
}


@ -0,0 +1,575 @@
#include "aco_ir.h"
#include "aco_builder.h"
#include "sid.h"
namespace aco {
static const char *reduce_ops[] = {
[iadd32] = "iadd32",
[iadd64] = "iadd64",
[imul32] = "imul32",
[imul64] = "imul64",
[fadd32] = "fadd32",
[fadd64] = "fadd64",
[fmul32] = "fmul32",
[fmul64] = "fmul64",
[imin32] = "imin32",
[imin64] = "imin64",
[imax32] = "imax32",
[imax64] = "imax64",
[umin32] = "umin32",
[umin64] = "umin64",
[umax32] = "umax32",
[umax64] = "umax64",
[fmin32] = "fmin32",
[fmin64] = "fmin64",
[fmax32] = "fmax32",
[fmax64] = "fmax64",
[iand32] = "iand32",
[iand64] = "iand64",
[ior32] = "ior32",
[ior64] = "ior64",
[ixor32] = "ixor32",
[ixor64] = "ixor64",
};
static void print_reg_class(const RegClass rc, FILE *output)
{
switch (rc) {
case RegClass::s1: fprintf(output, " s1: "); return;
case RegClass::s2: fprintf(output, " s2: "); return;
case RegClass::s3: fprintf(output, " s3: "); return;
case RegClass::s4: fprintf(output, " s4: "); return;
case RegClass::s6: fprintf(output, " s6: "); return;
case RegClass::s8: fprintf(output, " s8: "); return;
case RegClass::s16: fprintf(output, "s16: "); return;
case RegClass::v1: fprintf(output, " v1: "); return;
case RegClass::v2: fprintf(output, " v2: "); return;
case RegClass::v3: fprintf(output, " v3: "); return;
case RegClass::v4: fprintf(output, " v4: "); return;
case RegClass::v5: fprintf(output, " v5: "); return;
case RegClass::v6: fprintf(output, " v6: "); return;
case RegClass::v7: fprintf(output, " v7: "); return;
case RegClass::v8: fprintf(output, " v8: "); return;
case RegClass::v1_linear: fprintf(output, " v1: "); return;
case RegClass::v2_linear: fprintf(output, " v2: "); return;
}
}
void print_physReg(unsigned reg, unsigned size, FILE *output)
{
if (reg == 124) {
fprintf(output, ":m0");
} else if (reg == 106) {
fprintf(output, ":vcc");
} else if (reg == 253) {
fprintf(output, ":scc");
} else if (reg == 126) {
fprintf(output, ":exec");
} else {
bool is_vgpr = reg / 256;
reg = reg % 256;
fprintf(output, ":%c[%d", is_vgpr ? 'v' : 's', reg);
if (size > 1)
fprintf(output, "-%d]", reg + size -1);
else
fprintf(output, "]");
}
}
static void print_constant(uint8_t reg, FILE *output)
{
if (reg >= 128 && reg <= 192) {
fprintf(output, "%d", reg - 128);
return;
} else if (reg >= 193 && reg <= 208) {
fprintf(output, "%d", 192 - reg);
return;
}
switch (reg) {
case 240:
fprintf(output, "0.5");
break;
case 241:
fprintf(output, "-0.5");
break;
case 242:
fprintf(output, "1.0");
break;
case 243:
fprintf(output, "-1.0");
break;
case 244:
fprintf(output, "2.0");
break;
case 245:
fprintf(output, "-2.0");
break;
case 246:
fprintf(output, "4.0");
break;
case 247:
fprintf(output, "-4.0");
break;
case 248:
fprintf(output, "1/(2*PI)");
break;
}
}
static void print_operand(const Operand *operand, FILE *output)
{
if (operand->isLiteral()) {
fprintf(output, "0x%x", operand->constantValue());
} else if (operand->isConstant()) {
print_constant(operand->physReg().reg, output);
} else if (operand->isUndefined()) {
print_reg_class(operand->regClass(), output);
fprintf(output, "undef");
} else {
fprintf(output, "%%%d", operand->tempId());
if (operand->isFixed())
print_physReg(operand->physReg(), operand->size(), output);
}
}
static void print_definition(const Definition *definition, FILE *output)
{
print_reg_class(definition->regClass(), output);
fprintf(output, "%%%d", definition->tempId());
if (definition->isFixed())
print_physReg(definition->physReg(), definition->size(), output);
}
static void print_barrier_reorder(bool can_reorder, barrier_interaction barrier, FILE *output)
{
if (can_reorder)
fprintf(output, " reorder");
if (barrier & barrier_buffer)
fprintf(output, " buffer");
if (barrier & barrier_image)
fprintf(output, " image");
if (barrier & barrier_atomic)
fprintf(output, " atomic");
if (barrier & barrier_shared)
fprintf(output, " shared");
}
static void print_instr_format_specific(struct Instruction *instr, FILE *output)
{
switch (instr->format) {
case Format::SOPK: {
SOPK_instruction* sopk = static_cast<SOPK_instruction*>(instr);
fprintf(output, " imm:%d", sopk->imm & 0x8000 ? (sopk->imm - 65536) : sopk->imm);
break;
}
case Format::SOPP: {
SOPP_instruction* sopp = static_cast<SOPP_instruction*>(instr);
uint16_t imm = sopp->imm;
switch (instr->opcode) {
case aco_opcode::s_waitcnt: {
/* decoding vmcnt/lgkmcnt properly depends on the chip class, but
* insert_waitcnt() should fill the field in regardless. */
unsigned vmcnt = (imm & 0xF) | ((imm & (0x3 << 14)) >> 10);
if (vmcnt != 63) fprintf(output, " vmcnt(%d)", vmcnt);
if (((imm >> 4) & 0x7) < 0x7) fprintf(output, " expcnt(%d)", (imm >> 4) & 0x7);
if (((imm >> 8) & 0x3F) < 0x3F) fprintf(output, " lgkmcnt(%d)", (imm >> 8) & 0x3F);
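/* e.g. imm == 0x3F70 decodes to vmcnt(0) with expcnt (0x7) and lgkmcnt
 * (0x3F) at their "no wait" values, so only " vmcnt(0)" is printed. */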
break;
}
case aco_opcode::s_endpgm:
case aco_opcode::s_endpgm_saved:
case aco_opcode::s_endpgm_ordered_ps_done:
case aco_opcode::s_wakeup:
case aco_opcode::s_barrier:
case aco_opcode::s_icache_inv:
case aco_opcode::s_ttracedata:
case aco_opcode::s_set_gpr_idx_off: {
break;
}
default: {
if (imm)
fprintf(output, " imm:%u", imm);
break;
}
}
if (sopp->block != -1)
fprintf(output, " block:BB%d", sopp->block);
break;
}
case Format::SMEM: {
SMEM_instruction* smem = static_cast<SMEM_instruction*>(instr);
if (smem->glc)
fprintf(output, " glc");
if (smem->nv)
fprintf(output, " nv");
print_barrier_reorder(smem->can_reorder, smem->barrier, output);
break;
}
case Format::VINTRP: {
Interp_instruction* vintrp = static_cast<Interp_instruction*>(instr);
fprintf(output, " attr%d.%c", vintrp->attribute, "xyzw"[vintrp->component]);
break;
}
case Format::DS: {
DS_instruction* ds = static_cast<DS_instruction*>(instr);
if (ds->offset0)
fprintf(output, " offset0:%u", ds->offset0);
if (ds->offset1)
fprintf(output, " offset1:%u", ds->offset1);
if (ds->gds)
fprintf(output, " gds");
break;
}
case Format::MUBUF: {
MUBUF_instruction* mubuf = static_cast<MUBUF_instruction*>(instr);
if (mubuf->offset)
fprintf(output, " offset:%u", mubuf->offset);
if (mubuf->offen)
fprintf(output, " offen");
if (mubuf->idxen)
fprintf(output, " idxen");
if (mubuf->glc)
fprintf(output, " glc");
if (mubuf->slc)
fprintf(output, " slc");
if (mubuf->tfe)
fprintf(output, " tfe");
if (mubuf->lds)
fprintf(output, " lds");
if (mubuf->disable_wqm)
fprintf(output, " disable_wqm");
print_barrier_reorder(mubuf->can_reorder, mubuf->barrier, output);
break;
}
case Format::MIMG: {
MIMG_instruction* mimg = static_cast<MIMG_instruction*>(instr);
unsigned identity_dmask = !instr->definitions.empty() ?
(1 << instr->definitions[0].size()) - 1 :
0xf;
if ((mimg->dmask & identity_dmask) != identity_dmask)
fprintf(output, " dmask:%s%s%s%s",
mimg->dmask & 0x1 ? "x" : "",
mimg->dmask & 0x2 ? "y" : "",
mimg->dmask & 0x4 ? "z" : "",
mimg->dmask & 0x8 ? "w" : "");
if (mimg->unrm)
fprintf(output, " unrm");
if (mimg->glc)
fprintf(output, " glc");
if (mimg->slc)
fprintf(output, " slc");
if (mimg->tfe)
fprintf(output, " tfe");
if (mimg->da)
fprintf(output, " da");
if (mimg->lwe)
fprintf(output, " lwe");
if (mimg->r128 || mimg->a16)
fprintf(output, " r128/a16");
if (mimg->d16)
fprintf(output, " d16");
if (mimg->disable_wqm)
fprintf(output, " disable_wqm");
print_barrier_reorder(mimg->can_reorder, mimg->barrier, output);
break;
}
case Format::EXP: {
Export_instruction* exp = static_cast<Export_instruction*>(instr);
unsigned identity_mask = exp->compressed ? 0x5 : 0xf;
if ((exp->enabled_mask & identity_mask) != identity_mask)
fprintf(output, " en:%c%c%c%c",
exp->enabled_mask & 0x1 ? 'r' : '*',
exp->enabled_mask & 0x2 ? 'g' : '*',
exp->enabled_mask & 0x4 ? 'b' : '*',
exp->enabled_mask & 0x8 ? 'a' : '*');
if (exp->compressed)
fprintf(output, " compr");
if (exp->done)
fprintf(output, " done");
if (exp->valid_mask)
fprintf(output, " vm");
if (exp->dest <= V_008DFC_SQ_EXP_MRT + 7)
fprintf(output, " mrt%d", exp->dest - V_008DFC_SQ_EXP_MRT);
else if (exp->dest == V_008DFC_SQ_EXP_MRTZ)
fprintf(output, " mrtz");
else if (exp->dest == V_008DFC_SQ_EXP_NULL)
fprintf(output, " null");
else if (exp->dest >= V_008DFC_SQ_EXP_POS && exp->dest <= V_008DFC_SQ_EXP_POS + 3)
fprintf(output, " pos%d", exp->dest - V_008DFC_SQ_EXP_POS);
else if (exp->dest >= V_008DFC_SQ_EXP_PARAM && exp->dest <= V_008DFC_SQ_EXP_PARAM + 31)
fprintf(output, " param%d", exp->dest - V_008DFC_SQ_EXP_PARAM);
break;
}
case Format::PSEUDO_BRANCH: {
Pseudo_branch_instruction* branch = static_cast<Pseudo_branch_instruction*>(instr);
/* Note: BB0 cannot be a branch target */
if (branch->target[0] != 0)
fprintf(output, " BB%d", branch->target[0]);
if (branch->target[1] != 0)
fprintf(output, ", BB%d", branch->target[1]);
break;
}
case Format::PSEUDO_REDUCTION: {
Pseudo_reduction_instruction* reduce = static_cast<Pseudo_reduction_instruction*>(instr);
fprintf(output, " op:%s", reduce_ops[reduce->reduce_op]);
if (reduce->cluster_size)
fprintf(output, " cluster_size:%u", reduce->cluster_size);
break;
}
case Format::FLAT:
case Format::GLOBAL:
case Format::SCRATCH: {
FLAT_instruction* flat = static_cast<FLAT_instruction*>(instr);
if (flat->offset)
fprintf(output, " offset:%u", flat->offset);
if (flat->glc)
fprintf(output, " glc");
if (flat->slc)
fprintf(output, " slc");
if (flat->lds)
fprintf(output, " lds");
if (flat->nv)
fprintf(output, " nv");
break;
}
case Format::MTBUF: {
MTBUF_instruction* mtbuf = static_cast<MTBUF_instruction*>(instr);
fprintf(output, " dfmt:");
switch (mtbuf->dfmt) {
case V_008F0C_BUF_DATA_FORMAT_8: fprintf(output, "8"); break;
case V_008F0C_BUF_DATA_FORMAT_16: fprintf(output, "16"); break;
case V_008F0C_BUF_DATA_FORMAT_8_8: fprintf(output, "8_8"); break;
case V_008F0C_BUF_DATA_FORMAT_32: fprintf(output, "32"); break;
case V_008F0C_BUF_DATA_FORMAT_16_16: fprintf(output, "16_16"); break;
case V_008F0C_BUF_DATA_FORMAT_10_11_11: fprintf(output, "10_11_11"); break;
case V_008F0C_BUF_DATA_FORMAT_11_11_10: fprintf(output, "11_11_10"); break;
case V_008F0C_BUF_DATA_FORMAT_10_10_10_2: fprintf(output, "10_10_10_2"); break;
case V_008F0C_BUF_DATA_FORMAT_2_10_10_10: fprintf(output, "2_10_10_10"); break;
case V_008F0C_BUF_DATA_FORMAT_8_8_8_8: fprintf(output, "8_8_8_8"); break;
case V_008F0C_BUF_DATA_FORMAT_32_32: fprintf(output, "32_32"); break;
case V_008F0C_BUF_DATA_FORMAT_16_16_16_16: fprintf(output, "16_16_16_16"); break;
case V_008F0C_BUF_DATA_FORMAT_32_32_32: fprintf(output, "32_32_32"); break;
case V_008F0C_BUF_DATA_FORMAT_32_32_32_32: fprintf(output, "32_32_32_32"); break;
case V_008F0C_BUF_DATA_FORMAT_RESERVED_15: fprintf(output, "reserved15"); break;
}
fprintf(output, " nfmt:");
switch (mtbuf->nfmt) {
case V_008F0C_BUF_NUM_FORMAT_UNORM: fprintf(output, "unorm"); break;
case V_008F0C_BUF_NUM_FORMAT_SNORM: fprintf(output, "snorm"); break;
case V_008F0C_BUF_NUM_FORMAT_USCALED: fprintf(output, "uscaled"); break;
case V_008F0C_BUF_NUM_FORMAT_SSCALED: fprintf(output, "sscaled"); break;
case V_008F0C_BUF_NUM_FORMAT_UINT: fprintf(output, "uint"); break;
case V_008F0C_BUF_NUM_FORMAT_SINT: fprintf(output, "sint"); break;
case V_008F0C_BUF_NUM_FORMAT_SNORM_OGL: fprintf(output, "snorm"); break;
case V_008F0C_BUF_NUM_FORMAT_FLOAT: fprintf(output, "float"); break;
}
if (mtbuf->offset)
fprintf(output, " offset:%u", mtbuf->offset);
if (mtbuf->offen)
fprintf(output, " offen");
if (mtbuf->idxen)
fprintf(output, " idxen");
if (mtbuf->glc)
fprintf(output, " glc");
if (mtbuf->slc)
fprintf(output, " slc");
if (mtbuf->tfe)
fprintf(output, " tfe");
if (mtbuf->disable_wqm)
fprintf(output, " disable_wqm");
print_barrier_reorder(mtbuf->can_reorder, mtbuf->barrier, output);
break;
}
default: {
break;
}
}
if (instr->isVOP3()) {
VOP3A_instruction* vop3 = static_cast<VOP3A_instruction*>(instr);
switch (vop3->omod) {
case 1:
fprintf(output, " *2");
break;
case 2:
fprintf(output, " *4");
break;
case 3:
fprintf(output, " *0.5");
break;
}
if (vop3->clamp)
fprintf(output, " clamp");
} else if (instr->isDPP()) {
DPP_instruction* dpp = static_cast<DPP_instruction*>(instr);
if (dpp->dpp_ctrl <= 0xff) {
fprintf(output, " quad_perm:[%d,%d,%d,%d]",
dpp->dpp_ctrl & 0x3, (dpp->dpp_ctrl >> 2) & 0x3,
(dpp->dpp_ctrl >> 4) & 0x3, (dpp->dpp_ctrl >> 6) & 0x3);
} else if (dpp->dpp_ctrl >= 0x101 && dpp->dpp_ctrl <= 0x10f) {
fprintf(output, " row_shl:%d", dpp->dpp_ctrl & 0xf);
} else if (dpp->dpp_ctrl >= 0x111 && dpp->dpp_ctrl <= 0x11f) {
fprintf(output, " row_shr:%d", dpp->dpp_ctrl & 0xf);
} else if (dpp->dpp_ctrl >= 0x121 && dpp->dpp_ctrl <= 0x12f) {
fprintf(output, " row_ror:%d", dpp->dpp_ctrl & 0xf);
} else if (dpp->dpp_ctrl == dpp_wf_sl1) {
fprintf(output, " wave_shl:1");
} else if (dpp->dpp_ctrl == dpp_wf_rl1) {
fprintf(output, " wave_rol:1");
} else if (dpp->dpp_ctrl == dpp_wf_sr1) {
fprintf(output, " wave_shr:1");
} else if (dpp->dpp_ctrl == dpp_wf_rr1) {
fprintf(output, " wave_ror:1");
} else if (dpp->dpp_ctrl == dpp_row_mirror) {
fprintf(output, " row_mirror");
} else if (dpp->dpp_ctrl == dpp_row_half_mirror) {
fprintf(output, " row_half_mirror");
} else if (dpp->dpp_ctrl == dpp_row_bcast15) {
fprintf(output, " row_bcast:15");
} else if (dpp->dpp_ctrl == dpp_row_bcast31) {
fprintf(output, " row_bcast:31");
} else {
fprintf(output, " dpp_ctrl:0x%.3x", dpp->dpp_ctrl);
}
if (dpp->row_mask != 0xf)
fprintf(output, " row_mask:0x%.1x", dpp->row_mask);
if (dpp->bank_mask != 0xf)
fprintf(output, " bank_mask:0x%.1x", dpp->bank_mask);
if (dpp->bound_ctrl)
fprintf(output, " bound_ctrl:1");
} else if ((int)instr->format & (int)Format::SDWA) {
fprintf(output, " (printing unimplemented)");
}
}
void aco_print_instr(struct Instruction *instr, FILE *output)
{
if (!instr->definitions.empty()) {
for (unsigned i = 0; i < instr->definitions.size(); ++i) {
print_definition(&instr->definitions[i], output);
if (i + 1 != instr->definitions.size())
fprintf(output, ", ");
}
fprintf(output, " = ");
}
fprintf(output, "%s", instr_info.name[(int)instr->opcode]);
if (instr->operands.size()) {
bool abs[instr->operands.size()];
bool neg[instr->operands.size()];
if ((int)instr->format & (int)Format::VOP3A) {
VOP3A_instruction* vop3 = static_cast<VOP3A_instruction*>(instr);
for (unsigned i = 0; i < instr->operands.size(); ++i) {
abs[i] = vop3->abs[i];
neg[i] = vop3->neg[i];
}
} else if (instr->isDPP()) {
DPP_instruction* dpp = static_cast<DPP_instruction*>(instr);
assert(instr->operands.size() <= 2);
for (unsigned i = 0; i < instr->operands.size(); ++i) {
abs[i] = dpp->abs[i];
neg[i] = dpp->neg[i];
}
} else {
for (unsigned i = 0; i < instr->operands.size(); ++i) {
abs[i] = false;
neg[i] = false;
}
}
for (unsigned i = 0; i < instr->operands.size(); ++i) {
if (i)
fprintf(output, ", ");
else
fprintf(output, " ");
if (neg[i])
fprintf(output, "-");
if (abs[i])
fprintf(output, "|");
print_operand(&instr->operands[i], output);
if (abs[i])
fprintf(output, "|");
}
}
print_instr_format_specific(instr, output);
}
static void print_block_kind(uint16_t kind, FILE *output)
{
if (kind & block_kind_uniform)
fprintf(output, "uniform, ");
if (kind & block_kind_top_level)
fprintf(output, "top-level, ");
if (kind & block_kind_loop_preheader)
fprintf(output, "loop-preheader, ");
if (kind & block_kind_loop_header)
fprintf(output, "loop-header, ");
if (kind & block_kind_loop_exit)
fprintf(output, "loop-exit, ");
if (kind & block_kind_continue)
fprintf(output, "continue, ");
if (kind & block_kind_break)
fprintf(output, "break, ");
if (kind & block_kind_continue_or_break)
fprintf(output, "continue_or_break, ");
if (kind & block_kind_discard)
fprintf(output, "discard, ");
if (kind & block_kind_branch)
fprintf(output, "branch, ");
if (kind & block_kind_merge)
fprintf(output, "merge, ");
if (kind & block_kind_invert)
fprintf(output, "invert, ");
if (kind & block_kind_uses_discard_if)
fprintf(output, "discard_if, ");
if (kind & block_kind_needs_lowering)
fprintf(output, "needs_lowering, ");
}
void aco_print_block(const struct Block* block, FILE *output)
{
fprintf(output, "BB%d\n", block->index);
fprintf(output, "/* logical preds: ");
for (unsigned pred : block->logical_preds)
fprintf(output, "BB%d, ", pred);
fprintf(output, "/ linear preds: ");
for (unsigned pred : block->linear_preds)
fprintf(output, "BB%d, ", pred);
fprintf(output, "/ kind: ");
print_block_kind(block->kind, output);
fprintf(output, "*/\n");
for (auto const& instr : block->instructions) {
fprintf(output, "\t");
aco_print_instr(instr.get(), output);
fprintf(output, "\n");
}
}
void aco_print_program(Program *program, FILE *output)
{
for (Block const& block : program->blocks)
aco_print_block(&block, output);
if (program->constant_data.size()) {
fprintf(output, "\n/* constant data */\n");
for (unsigned i = 0; i < program->constant_data.size(); i += 32) {
fprintf(output, "[%06d] ", i);
unsigned line_size = std::min<size_t>(program->constant_data.size() - i, 32);
for (unsigned j = 0; j < line_size; j += 4) {
unsigned size = std::min<size_t>(program->constant_data.size() - (i + j), 4);
uint32_t v = 0;
memcpy(&v, &program->constant_data[i + j], size);
fprintf(output, " %08x", v);
}
fprintf(output, "\n");
}
}
fprintf(output, "\n");
}
}


@ -0,0 +1,164 @@
/*
* Copyright © 2018 Valve Corporation
* Copyright © 2018 Google
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice (including the next
* paragraph) shall be included in all copies or substantial portions of the
* Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
* THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
* IN THE SOFTWARE.
*
*/
#include "aco_ir.h"
#include "aco_builder.h"
/*
* Insert p_start_linear_vgpr instructions right before RA to correctly allocate
* temporaries for reductions that have to disrespect EXEC by executing in
* WWM.
*/
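/* Informal background (a sketch of the rationale, not normative): a
 * cluster-wide reduction also has to read lanes that are disabled in
 * EXEC, so it executes in whole-wave mode (WWM). Its VGPR temporaries
 * therefore use linear register classes (see as_linear() below), and
 * the p_start_linear_vgpr/p_end_linear_vgpr pair inserted by this pass
 * brackets the live range that RA must keep allocated for them. */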
namespace aco {
void setup_reduce_temp(Program* program)
{
unsigned last_top_level_block_idx = 0;
unsigned maxSize = 0;
std::vector<bool> hasReductions(program->blocks.size());
for (Block& block : program->blocks) {
for (aco_ptr<Instruction>& instr : block.instructions) {
if (instr->format != Format::PSEUDO_REDUCTION)
continue;
maxSize = MAX2(maxSize, instr->operands[0].size());
hasReductions[block.index] = true;
}
}
if (maxSize == 0)
return;
assert(maxSize == 1 || maxSize == 2);
Temp reduceTmp(0, RegClass(RegType::vgpr, maxSize).as_linear());
Temp vtmp(0, RegClass(RegType::vgpr, maxSize).as_linear());
int inserted_at = -1;
int vtmp_inserted_at = -1;
bool reduceTmp_in_loop = false;
bool vtmp_in_loop = false;
for (Block& block : program->blocks) {
/* insert p_end_linear_vgpr after the outermost loop */
if (reduceTmp_in_loop && block.loop_nest_depth == 0) {
assert(inserted_at == (int)last_top_level_block_idx);
aco_ptr<Instruction> end{create_instruction<Instruction>(aco_opcode::p_end_linear_vgpr, Format::PSEUDO, vtmp_in_loop ? 2 : 1, 0)};
end->operands[0] = Operand(reduceTmp);
if (vtmp_in_loop)
end->operands[1] = Operand(vtmp);
/* insert after the phis of the loop exit block */
std::vector<aco_ptr<Instruction>>::iterator it = block.instructions.begin();
while ((*it)->opcode == aco_opcode::p_linear_phi || (*it)->opcode == aco_opcode::p_phi)
++it;
block.instructions.insert(it, std::move(end));
reduceTmp_in_loop = false;
}
if (block.kind & block_kind_top_level)
last_top_level_block_idx = block.index;
if (!hasReductions[block.index])
continue;
std::vector<aco_ptr<Instruction>>::iterator it;
for (it = block.instructions.begin(); it != block.instructions.end(); ++it) {
Instruction *instr = (*it).get();
if (instr->format != Format::PSEUDO_REDUCTION)
continue;
ReduceOp op = static_cast<Pseudo_reduction_instruction *>(instr)->reduce_op;
reduceTmp_in_loop |= block.loop_nest_depth > 0;
if ((int)last_top_level_block_idx != inserted_at) {
reduceTmp = {program->allocateId(), reduceTmp.regClass()};
aco_ptr<Pseudo_instruction> create{create_instruction<Pseudo_instruction>(aco_opcode::p_start_linear_vgpr, Format::PSEUDO, 0, 1)};
create->definitions[0] = Definition(reduceTmp);
/* find the right place to insert this definition */
if (last_top_level_block_idx == block.index) {
/* insert right before the current instruction */
it = block.instructions.insert(it, std::move(create));
it++;
/* inserted_at is intentionally not updated here, so later blocks
* will insert at the end instead of reusing this one. */
} else {
assert(last_top_level_block_idx < block.index);
/* insert before the branch at last top level block */
std::vector<aco_ptr<Instruction>>& instructions = program->blocks[last_top_level_block_idx].instructions;
instructions.insert(std::next(instructions.begin(), instructions.size() - 1), std::move(create));
inserted_at = last_top_level_block_idx;
}
}
/* same as before, except for the vector temporary instead of the reduce temporary */
bool need_vtmp = op == imul32 || op == fadd64 || op == fmul64 ||
op == fmin64 || op == fmax64;
need_vtmp |= static_cast<Pseudo_reduction_instruction *>(instr)->cluster_size == 32;
vtmp_in_loop |= need_vtmp && block.loop_nest_depth > 0;
if (need_vtmp && (int)last_top_level_block_idx != vtmp_inserted_at) {
vtmp = {program->allocateId(), vtmp.regClass()};
aco_ptr<Pseudo_instruction> create{create_instruction<Pseudo_instruction>(aco_opcode::p_start_linear_vgpr, Format::PSEUDO, 0, 1)};
create->definitions[0] = Definition(vtmp);
if (last_top_level_block_idx == block.index) {
it = block.instructions.insert(it, std::move(create));
it++;
} else {
assert(last_top_level_block_idx < block.index);
std::vector<aco_ptr<Instruction>>& instructions = program->blocks[last_top_level_block_idx].instructions;
instructions.insert(std::next(instructions.begin(), instructions.size() - 1), std::move(create));
vtmp_inserted_at = last_top_level_block_idx;
}
}
instr->operands[1] = Operand(reduceTmp);
if (need_vtmp)
instr->operands[2] = Operand(vtmp);
/* scalar temporary */
Builder bld(program);
instr->definitions[1] = bld.def(s2);
/* scalar identity temporary */
if (instr->opcode == aco_opcode::p_exclusive_scan &&
(op == imin32 || op == imin64 ||
op == imax32 || op == imax64 ||
op == fmin32 || op == fmin64 ||
op == fmax32 || op == fmax64 ||
op == fmul64)) {
instr->definitions[2] = bld.def(RegClass(RegType::sgpr, instr->operands[0].size()));
}
/* vcc clobber */
if (op == iadd32 && program->chip_class < GFX9)
instr->definitions[4] = Definition(vcc, s2);
}
}
}
}

File diff suppressed because it is too large


@ -0,0 +1,835 @@
/*
* Copyright © 2018 Valve Corporation
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice (including the next
* paragraph) shall be included in all copies or substantial portions of the
* Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
* THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
* IN THE SOFTWARE.
*
*/
#include "aco_ir.h"
#include <unordered_set>
#include <algorithm>
#include "vulkan/radv_shader.h" // for radv_nir_compiler_options
#include "amdgfxregs.h"
#define SMEM_WINDOW_SIZE (350 - ctx.num_waves * 35)
#define VMEM_WINDOW_SIZE (1024 - ctx.num_waves * 64)
#define POS_EXP_WINDOW_SIZE 512
#define SMEM_MAX_MOVES (80 - ctx.num_waves * 8)
#define VMEM_MAX_MOVES (128 - ctx.num_waves * 4)
#define POS_EXP_MAX_MOVES 512
namespace aco {
struct sched_ctx {
std::vector<bool> depends_on;
std::vector<bool> RAR_dependencies;
RegisterDemand max_registers;
int16_t num_waves;
int16_t last_SMEM_stall;
int last_SMEM_dep_idx;
};
/* This scheduler is a simple bottom-up pass based on ideas from
* "A Novel Lightweight Instruction Scheduling Algorithm for Just-In-Time Compiler"
* from Xiaohua Shi and Peng Guo.
* The basic approach is to iterate over all instructions. When a memory instruction
* is encountered, it tries to move independent instructions from above and below
* between the memory instruction and its first user.
* The novelty is that this scheduler takes the current register pressure into account:
* instructions will only be moved if the register pressure won't exceed a certain bound.
*/
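/* An illustrative (made-up) instruction sequence:
 *   s_load_dwordx4 s[8:11], ...  ; long-latency SMEM load
 *   v_mul_f32 v0, v1, v2         ; ALU work independent of the load
 *   s_waitcnt lgkmcnt(0)
 *   s_add_u32 s0, s8, s1         ; first user of the load
 * The pass would try to keep the independent v_mul between the load and
 * its first user to hide latency, moving candidates only while the
 * tracked register demand stays within ctx.max_registers. */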
template <typename T>
void move_element(T& list, size_t idx, size_t before) {
if (idx < before) {
auto begin = std::next(list.begin(), idx);
auto end = std::next(list.begin(), before);
std::rotate(begin, begin + 1, end);
} else if (idx > before) {
auto begin = std::next(list.begin(), before);
auto end = std::next(list.begin(), idx + 1);
std::rotate(begin, end - 1, end);
}
}
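/* For example, with list = [a,b,c,d,e]:
 *   move_element(list, 1, 4) yields [a,c,d,b,e] (b lands directly before
 *   the element that was at index 4);
 *   move_element(list, 3, 1) yields [a,d,b,c,e] (d lands at index 1). */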
static RegisterDemand getLiveChanges(aco_ptr<Instruction>& instr)
{
RegisterDemand changes;
for (const Definition& def : instr->definitions) {
if (!def.isTemp() || def.isKill())
continue;
changes += def.getTemp();
}
for (const Operand& op : instr->operands) {
if (!op.isTemp() || !op.isFirstKill())
continue;
changes -= op.getTemp();
}
return changes;
}
static RegisterDemand getTempRegisters(aco_ptr<Instruction>& instr)
{
RegisterDemand temp_registers;
for (const Definition& def : instr->definitions) {
if (!def.isTemp() || !def.isKill())
continue;
temp_registers += def.getTemp();
}
return temp_registers;
}
static bool is_spill_reload(aco_ptr<Instruction>& instr)
{
return instr->opcode == aco_opcode::p_spill || instr->opcode == aco_opcode::p_reload;
}
bool can_move_instr(aco_ptr<Instruction>& instr, Instruction* current, int moving_interaction)
{
/* don't move exports so that they stay closer together */
if (instr->format == Format::EXP)
return false;
/* handle barriers */
/* TODO: instead of stopping, maybe try to move the barriers and any
* instructions interacting with them instead? */
if (instr->format != Format::PSEUDO_BARRIER) {
if (instr->opcode == aco_opcode::s_barrier) {
bool can_reorder = false;
switch (current->format) {
case Format::SMEM:
can_reorder = static_cast<SMEM_instruction*>(current)->can_reorder;
break;
case Format::MUBUF:
can_reorder = static_cast<MUBUF_instruction*>(current)->can_reorder;
break;
case Format::MIMG:
can_reorder = static_cast<MIMG_instruction*>(current)->can_reorder;
break;
default:
break;
}
return can_reorder && moving_interaction == barrier_none;
} else {
return true;
}
}
int interaction = get_barrier_interaction(current);
interaction |= moving_interaction;
switch (instr->opcode) {
case aco_opcode::p_memory_barrier_atomic:
return !(interaction & barrier_atomic);
/* For now, buffer and image barriers are treated the same. This is because of
* dEQP-VK.memory_model.message_passing.core11.u32.coherent.fence_fence.atomicwrite.device.payload_nonlocal.buffer.guard_nonlocal.image.comp
* which seems to use an image load to determine if the result of a buffer load is valid. So the ordering of the two loads is important.
* I /think/ we should probably eventually expand the meaning of a buffer barrier so that all buffer operations before it must stay before it,
* and that both image and buffer operations after it must stay after it. We should also do the same for image barriers.
* Or perhaps the problem is that we don't have a combined barrier instruction for both buffers and images, but the CTS test expects us to?
* Either way, this solution should work. */
case aco_opcode::p_memory_barrier_buffer:
case aco_opcode::p_memory_barrier_image:
return !(interaction & (barrier_image | barrier_buffer));
case aco_opcode::p_memory_barrier_shared:
return !(interaction & barrier_shared);
case aco_opcode::p_memory_barrier_all:
return interaction == barrier_none;
default:
return false;
}
}
bool can_reorder(Instruction* candidate, bool allow_smem)
{
switch (candidate->format) {
case Format::SMEM:
return allow_smem || static_cast<SMEM_instruction*>(candidate)->can_reorder;
case Format::MUBUF:
return static_cast<MUBUF_instruction*>(candidate)->can_reorder;
case Format::MIMG:
return static_cast<MIMG_instruction*>(candidate)->can_reorder;
case Format::MTBUF:
return static_cast<MTBUF_instruction*>(candidate)->can_reorder;
case Format::FLAT:
case Format::GLOBAL:
case Format::SCRATCH:
return false;
default:
return true;
}
}
void schedule_SMEM(sched_ctx& ctx, Block* block,
std::vector<RegisterDemand>& register_demand,
Instruction* current, int idx)
{
assert(idx != 0);
int window_size = SMEM_WINDOW_SIZE;
int max_moves = SMEM_MAX_MOVES;
int16_t k = 0;
bool can_reorder_cur = can_reorder(current, false);
/* create the initial set of values which current depends on */
std::fill(ctx.depends_on.begin(), ctx.depends_on.end(), false);
for (const Operand& op : current->operands) {
if (op.isTemp())
ctx.depends_on[op.tempId()] = true;
}
/* maintain how many registers remain free when moving instructions */
RegisterDemand register_pressure = register_demand[idx];
/* first, check if we have instructions before current to move down */
int insert_idx = idx + 1;
int moving_interaction = barrier_none;
bool moving_spill = false;
for (int candidate_idx = idx - 1; k < max_moves && candidate_idx > (int) idx - window_size; candidate_idx--) {
assert(candidate_idx >= 0);
aco_ptr<Instruction>& candidate = block->instructions[candidate_idx];
/* break if we'd make the previous SMEM instruction stall */
bool can_stall_prev_smem = idx <= ctx.last_SMEM_dep_idx && candidate_idx < ctx.last_SMEM_dep_idx;
if (can_stall_prev_smem && ctx.last_SMEM_stall >= 0)
break;
/* break when encountering another MEM instruction, logical_start or barriers */
if (!can_reorder(candidate.get(), false) && !can_reorder_cur)
break;
if (candidate->opcode == aco_opcode::p_logical_start)
break;
if (!can_move_instr(candidate, current, moving_interaction))
break;
register_pressure.update(register_demand[candidate_idx]);
/* if current depends on candidate, add additional dependencies and continue */
bool can_move_down = true;
bool writes_exec = false;
for (const Definition& def : candidate->definitions) {
if (def.isTemp() && ctx.depends_on[def.tempId()])
can_move_down = false;
if (def.isFixed() && def.physReg() == exec)
writes_exec = true;
}
if (writes_exec)
break;
if (moving_spill && is_spill_reload(candidate))
can_move_down = false;
if ((moving_interaction & barrier_shared) && candidate->format == Format::DS)
can_move_down = false;
moving_interaction |= get_barrier_interaction(candidate.get());
moving_spill |= is_spill_reload(candidate);
if (!can_move_down) {
for (const Operand& op : candidate->operands) {
if (op.isTemp())
ctx.depends_on[op.tempId()] = true;
}
continue;
}
bool register_pressure_unknown = false;
/* check if one of candidate's operands is killed by depending instruction */
for (const Operand& op : candidate->operands) {
if (op.isTemp() && ctx.depends_on[op.tempId()]) {
// FIXME: account for difference in register pressure
register_pressure_unknown = true;
}
}
if (register_pressure_unknown) {
for (const Operand& op : candidate->operands) {
if (op.isTemp())
ctx.depends_on[op.tempId()] = true;
}
continue;
}
/* check if register pressure is low enough: the diff is negative if register pressure is decreased */
const RegisterDemand candidate_diff = getLiveChanges(candidate);
const RegisterDemand tempDemand = getTempRegisters(candidate);
if (RegisterDemand(register_pressure - candidate_diff).exceeds(ctx.max_registers))
break;
const RegisterDemand tempDemand2 = getTempRegisters(block->instructions[insert_idx - 1]);
const RegisterDemand new_demand = register_demand[insert_idx - 1] - tempDemand2 + tempDemand;
if (new_demand.exceeds(ctx.max_registers))
break;
// TODO: we might want to look further to find a sequence of instructions to move down which doesn't exceed reg pressure
/* move the candidate below the memory load */
move_element(block->instructions, candidate_idx, insert_idx);
/* update register pressure */
move_element(register_demand, candidate_idx, insert_idx);
for (int i = candidate_idx; i < insert_idx - 1; i++) {
register_demand[i] -= candidate_diff;
}
register_demand[insert_idx - 1] = new_demand;
register_pressure -= candidate_diff;
if (candidate_idx < ctx.last_SMEM_dep_idx)
ctx.last_SMEM_stall++;
insert_idx--;
k++;
}
/* create the initial set of values which depend on current */
std::fill(ctx.depends_on.begin(), ctx.depends_on.end(), false);
std::fill(ctx.RAR_dependencies.begin(), ctx.RAR_dependencies.end(), false);
for (const Definition& def : current->definitions) {
if (def.isTemp())
ctx.depends_on[def.tempId()] = true;
}
/* find the first instruction depending on current or find another MEM */
insert_idx = idx + 1;
moving_interaction = barrier_none;
moving_spill = false;
bool found_dependency = false;
/* second, check if we have instructions after current to move up */
for (int candidate_idx = idx + 1; k < max_moves && candidate_idx < (int) idx + window_size; candidate_idx++) {
assert(candidate_idx < (int) block->instructions.size());
aco_ptr<Instruction>& candidate = block->instructions[candidate_idx];
if (candidate->opcode == aco_opcode::p_logical_end)
break;
if (!can_move_instr(candidate, current, moving_interaction))
break;
const bool writes_exec = std::any_of(candidate->definitions.begin(), candidate->definitions.end(),
[](const Definition& def) { return def.isFixed() && def.physReg() == exec;});
if (writes_exec)
break;
/* check if candidate depends on current */
bool is_dependency = std::any_of(candidate->operands.begin(), candidate->operands.end(),
[&ctx](const Operand& op) { return op.isTemp() && ctx.depends_on[op.tempId()];});
if (moving_spill && is_spill_reload(candidate))
is_dependency = true;
if ((moving_interaction & barrier_shared) && candidate->format == Format::DS)
is_dependency = true;
moving_interaction |= get_barrier_interaction(candidate.get());
moving_spill |= is_spill_reload(candidate);
if (is_dependency) {
for (const Definition& def : candidate->definitions) {
if (def.isTemp())
ctx.depends_on[def.tempId()] = true;
}
for (const Operand& op : candidate->operands) {
if (op.isTemp())
ctx.RAR_dependencies[op.tempId()] = true;
}
if (!found_dependency) {
insert_idx = candidate_idx;
found_dependency = true;
/* init register pressure */
register_pressure = register_demand[insert_idx - 1];
}
}
if (!can_reorder(candidate.get(), false) && !can_reorder_cur)
break;
if (!found_dependency) {
k++;
continue;
}
/* update register pressure */
register_pressure.update(register_demand[candidate_idx - 1]);
if (is_dependency)
continue;
assert(insert_idx != idx);
// TODO: correctly calculate register pressure for this case
bool register_pressure_unknown = false;
/* check if candidate uses/kills an operand which is used by a dependency */
for (const Operand& op : candidate->operands) {
if (op.isTemp() && ctx.RAR_dependencies[op.tempId()])
register_pressure_unknown = true;
}
if (register_pressure_unknown) {
for (const Definition& def : candidate->definitions) {
if (def.isTemp())
ctx.RAR_dependencies[def.tempId()] = true;
}
for (const Operand& op : candidate->operands) {
if (op.isTemp())
ctx.RAR_dependencies[op.tempId()] = true;
}
continue;
}
/* check if register pressure is low enough: the diff is negative if register pressure is decreased */
const RegisterDemand candidate_diff = getLiveChanges(candidate);
const RegisterDemand temp = getTempRegisters(candidate);
if (RegisterDemand(register_pressure + candidate_diff).exceeds(ctx.max_registers))
break;
const RegisterDemand temp2 = getTempRegisters(block->instructions[insert_idx - 1]);
const RegisterDemand new_demand = register_demand[insert_idx - 1] - temp2 + candidate_diff + temp;
if (new_demand.exceeds(ctx.max_registers))
break;
/* move the candidate above the insert_idx */
move_element(block->instructions, candidate_idx, insert_idx);
/* update register pressure */
move_element(register_demand, candidate_idx, insert_idx);
for (int i = insert_idx + 1; i <= candidate_idx; i++) {
register_demand[i] += candidate_diff;
}
register_demand[insert_idx] = new_demand;
register_pressure += candidate_diff;
insert_idx++;
k++;
}
ctx.last_SMEM_dep_idx = found_dependency ? insert_idx : 0;
ctx.last_SMEM_stall = 10 - ctx.num_waves - k;
}
void schedule_VMEM(sched_ctx& ctx, Block* block,
std::vector<RegisterDemand>& register_demand,
Instruction* current, int idx)
{
assert(idx != 0);
int window_size = VMEM_WINDOW_SIZE;
int max_moves = VMEM_MAX_MOVES;
int16_t k = 0;
bool can_reorder_cur = can_reorder(current, false);
/* create the initial set of values which current depends on */
std::fill(ctx.depends_on.begin(), ctx.depends_on.end(), false);
for (const Operand& op : current->operands) {
if (op.isTemp())
ctx.depends_on[op.tempId()] = true;
}
/* maintain how many registers remain free when moving instructions */
RegisterDemand register_pressure = register_demand[idx];
/* first, check if we have instructions before current to move down */
int insert_idx = idx + 1;
int moving_interaction = barrier_none;
bool moving_spill = false;
for (int candidate_idx = idx - 1; k < max_moves && candidate_idx > (int) idx - window_size; candidate_idx--) {
assert(candidate_idx >= 0);
aco_ptr<Instruction>& candidate = block->instructions[candidate_idx];
/* break when encountering another VMEM instruction, logical_start or barriers */
if (!can_reorder(candidate.get(), true) && !can_reorder_cur)
break;
if (candidate->opcode == aco_opcode::p_logical_start)
break;
if (!can_move_instr(candidate, current, moving_interaction))
break;
/* break if we'd make the previous SMEM instruction stall */
bool can_stall_prev_smem = idx <= ctx.last_SMEM_dep_idx && candidate_idx < ctx.last_SMEM_dep_idx;
if (can_stall_prev_smem && ctx.last_SMEM_stall >= 0)
break;
register_pressure.update(register_demand[candidate_idx]);
/* if current depends on candidate, add additional dependencies and continue */
bool can_move_down = true;
bool writes_exec = false;
for (const Definition& def : candidate->definitions) {
if (def.isTemp() && ctx.depends_on[def.tempId()])
can_move_down = false;
if (def.isFixed() && def.physReg() == exec)
writes_exec = true;
}
if (writes_exec)
break;
if (moving_spill && is_spill_reload(candidate))
can_move_down = false;
if ((moving_interaction & barrier_shared) && candidate->format == Format::DS)
can_move_down = false;
moving_interaction |= get_barrier_interaction(candidate.get());
moving_spill |= is_spill_reload(candidate);
if (!can_move_down) {
for (const Operand& op : candidate->operands) {
if (op.isTemp())
ctx.depends_on[op.tempId()] = true;
}
continue;
}
bool register_pressure_unknown = false;
/* check if one of candidate's operands is killed by a depending instruction */
for (const Operand& op : candidate->operands) {
if (op.isTemp() && ctx.depends_on[op.tempId()]) {
// FIXME: account for difference in register pressure
register_pressure_unknown = true;
}
}
if (register_pressure_unknown) {
for (const Operand& op : candidate->operands) {
if (op.isTemp())
ctx.depends_on[op.tempId()] = true;
}
continue;
}
/* check if register pressure is low enough: the diff is negative if register pressure is decreased */
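/* candidate_diff is the demand change the candidate causes at its position;
 * temp/temp2 presumably cover registers that are live only across a single
 * instruction and must be re-accounted once the candidate moves past
 * insert_idx - 1 (an assumption about the helpers, not stated in the commit) */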
const RegisterDemand candidate_diff = getLiveChanges(candidate);
const RegisterDemand temp = getTempRegisters(candidate);
if (RegisterDemand(register_pressure - candidate_diff).exceeds(ctx.max_registers))
break;
const RegisterDemand temp2 = getTempRegisters(block->instructions[insert_idx - 1]);
const RegisterDemand new_demand = register_demand[insert_idx - 1] - temp2 + temp;
if (new_demand.exceeds(ctx.max_registers))
break;
// TODO: we might want to look further to find a sequence of instructions to move down which doesn't exceed reg pressure
/* move the candidate below the memory load */
move_element(block->instructions, candidate_idx, insert_idx);
/* update register pressure */
move_element(register_demand, candidate_idx, insert_idx);
for (int i = candidate_idx; i < insert_idx - 1; i++) {
register_demand[i] -= candidate_diff;
}
register_demand[insert_idx - 1] = new_demand;
register_pressure -= candidate_diff;
insert_idx--;
k++;
if (candidate_idx < ctx.last_SMEM_dep_idx)
ctx.last_SMEM_stall++;
}
/* create the initial set of values which depend on current */
std::fill(ctx.depends_on.begin(), ctx.depends_on.end(), false);
std::fill(ctx.RAR_dependencies.begin(), ctx.RAR_dependencies.end(), false);
for (const Definition& def : current->definitions) {
if (def.isTemp())
ctx.depends_on[def.tempId()] = true;
}
/* find the first instruction depending on current or find another VMEM */
insert_idx = idx;
moving_interaction = barrier_none;
moving_spill = false;
bool found_dependency = false;
/* second, check if we have instructions after current to move up */
for (int candidate_idx = idx + 1; k < max_moves && candidate_idx < (int) idx + window_size; candidate_idx++) {
assert(candidate_idx < (int) block->instructions.size());
aco_ptr<Instruction>& candidate = block->instructions[candidate_idx];
if (candidate->opcode == aco_opcode::p_logical_end)
break;
if (!can_move_instr(candidate, current, moving_interaction))
break;
const bool writes_exec = std::any_of(candidate->definitions.begin(), candidate->definitions.end(),
[](const Definition& def) {return def.isFixed() && def.physReg() == exec; });
if (writes_exec)
break;
/* check if candidate depends on current */
bool is_dependency = !can_reorder(candidate.get(), true) && !can_reorder_cur;
for (const Operand& op : candidate->operands) {
if (op.isTemp() && ctx.depends_on[op.tempId()]) {
is_dependency = true;
break;
}
}
if (moving_spill && is_spill_reload(candidate))
is_dependency = true;
if ((moving_interaction & barrier_shared) && candidate->format == Format::DS)
is_dependency = true;
moving_interaction |= get_barrier_interaction(candidate.get());
moving_spill |= is_spill_reload(candidate);
if (is_dependency) {
for (const Definition& def : candidate->definitions) {
if (def.isTemp())
ctx.depends_on[def.tempId()] = true;
}
for (const Operand& op : candidate->operands) {
if (op.isTemp())
ctx.RAR_dependencies[op.tempId()] = true;
}
if (!found_dependency) {
insert_idx = candidate_idx;
found_dependency = true;
/* init register pressure */
register_pressure = register_demand[insert_idx - 1];
continue;
}
}
/* update register pressure */
register_pressure.update(register_demand[candidate_idx - 1]);
if (is_dependency || !found_dependency)
continue;
assert(insert_idx != idx);
bool register_pressure_unknown = false;
/* check if candidate uses/kills an operand which is used by a dependency */
for (const Operand& op : candidate->operands) {
if (op.isTemp() && ctx.RAR_dependencies[op.tempId()])
register_pressure_unknown = true;
}
if (register_pressure_unknown) {
for (const Definition& def : candidate->definitions) {
if (def.isTemp())
ctx.RAR_dependencies[def.tempId()] = true;
}
for (const Operand& op : candidate->operands) {
if (op.isTemp())
ctx.RAR_dependencies[op.tempId()] = true;
}
continue;
}
/* check if register pressure is low enough: the diff is negative if register pressure is decreased */
const RegisterDemand candidate_diff = getLiveChanges(candidate);
const RegisterDemand temp = getTempRegisters(candidate);
if (RegisterDemand(register_pressure + candidate_diff).exceeds(ctx.max_registers))
break;
const RegisterDemand temp2 = getTempRegisters(block->instructions[insert_idx - 1]);
const RegisterDemand new_demand = register_demand[insert_idx - 1] - temp2 + candidate_diff + temp;
if (new_demand.exceeds(ctx.max_registers))
break;
/* move the candidate above the insert_idx */
move_element(block->instructions, candidate_idx, insert_idx);
/* update register pressure */
move_element(register_demand, candidate_idx, insert_idx);
for (int i = insert_idx + 1; i <= candidate_idx; i++) {
register_demand[i] += candidate_diff;
}
register_demand[insert_idx] = new_demand;
register_pressure += candidate_diff;
insert_idx++;
k++;
}
}
void schedule_position_export(sched_ctx& ctx, Block* block,
std::vector<RegisterDemand>& register_demand,
Instruction* current, int idx)
{
assert(idx != 0);
int window_size = POS_EXP_WINDOW_SIZE;
int max_moves = POS_EXP_MAX_MOVES;
int16_t k = 0;
/* create the initial set of values which current depends on */
std::fill(ctx.depends_on.begin(), ctx.depends_on.end(), false);
for (unsigned i = 0; i < current->operands.size(); i++) {
if (current->operands[i].isTemp())
ctx.depends_on[current->operands[i].tempId()] = true;
}
/* maintain how many registers remain free when moving instructions */
RegisterDemand register_pressure = register_demand[idx];
/* first, check if we have instructions before current to move down */
int insert_idx = idx + 1;
int moving_interaction = barrier_none;
bool moving_spill = false;
for (int candidate_idx = idx - 1; k < max_moves && candidate_idx > (int) idx - window_size; candidate_idx--) {
assert(candidate_idx >= 0);
aco_ptr<Instruction>& candidate = block->instructions[candidate_idx];
/* break when encountering logical_start, VMEM/SMEM instructions or barriers */
if (candidate->opcode == aco_opcode::p_logical_start)
break;
if (candidate->isVMEM() || candidate->format == Format::SMEM)
break;
if (!can_move_instr(candidate, current, moving_interaction))
break;
register_pressure.update(register_demand[candidate_idx]);
/* if current depends on candidate, add additional dependencies and continue */
bool can_move_down = true;
bool writes_exec = false;
for (unsigned i = 0; i < candidate->definitions.size(); i++) {
if (candidate->definitions[i].isTemp() && ctx.depends_on[candidate->definitions[i].tempId()])
can_move_down = false;
if (candidate->definitions[i].isFixed() && candidate->definitions[i].physReg() == exec)
writes_exec = true;
}
if (writes_exec)
break;
if (moving_spill && is_spill_reload(candidate))
can_move_down = false;
if ((moving_interaction & barrier_shared) && candidate->format == Format::DS)
can_move_down = false;
moving_interaction |= get_barrier_interaction(candidate.get());
moving_spill |= is_spill_reload(candidate);
if (!can_move_down) {
for (unsigned i = 0; i < candidate->operands.size(); i++) {
if (candidate->operands[i].isTemp())
ctx.depends_on[candidate->operands[i].tempId()] = true;
}
continue;
}
bool register_pressure_unknown = false;
/* check if one of candidate's operands is killed by a depending instruction */
for (unsigned i = 0; i < candidate->operands.size(); i++) {
if (candidate->operands[i].isTemp() && ctx.depends_on[candidate->operands[i].tempId()]) {
// FIXME: account for difference in register pressure
register_pressure_unknown = true;
}
}
if (register_pressure_unknown) {
for (unsigned i = 0; i < candidate->operands.size(); i++) {
if (candidate->operands[i].isTemp())
ctx.depends_on[candidate->operands[i].tempId()] = true;
}
continue;
}
/* check if register pressure is low enough: the diff is negative if register pressure is decreased */
const RegisterDemand candidate_diff = getLiveChanges(candidate);
const RegisterDemand temp = getTempRegisters(candidate);
if (RegisterDemand(register_pressure - candidate_diff).exceeds(ctx.max_registers))
break;
const RegisterDemand temp2 = getTempRegisters(block->instructions[insert_idx - 1]);
const RegisterDemand new_demand = register_demand[insert_idx - 1] - temp2 + temp;
if (new_demand.exceeds(ctx.max_registers))
break;
// TODO: we might want to look further to find a sequence of instructions to move down which doesn't exceed reg pressure
/* move the candidate below the export */
move_element(block->instructions, candidate_idx, insert_idx);
/* update register pressure */
move_element(register_demand, candidate_idx, insert_idx);
for (int i = candidate_idx; i < insert_idx - 1; i++) {
register_demand[i] -= candidate_diff;
}
register_demand[insert_idx - 1] = new_demand;
register_pressure -= candidate_diff;
insert_idx--;
k++;
}
}
void schedule_block(sched_ctx& ctx, Program *program, Block* block, live& live_vars)
{
ctx.last_SMEM_dep_idx = 0;
ctx.last_SMEM_stall = INT16_MIN;
/* go through all instructions and find memory loads */
for (unsigned idx = 0; idx < block->instructions.size(); idx++) {
Instruction* current = block->instructions[idx].get();
if (current->definitions.empty())
continue;
if (current->isVMEM())
schedule_VMEM(ctx, block, live_vars.register_demand[block->index], current, idx);
if (current->format == Format::SMEM)
schedule_SMEM(ctx, block, live_vars.register_demand[block->index], current, idx);
}
if ((program->stage & hw_vs) && block->index == program->blocks.size() - 1) {
/* Try to move position exports as far up as possible, to reduce register
* usage and because ISA reference guides say so. */
for (unsigned idx = 0; idx < block->instructions.size(); idx++) {
Instruction* current = block->instructions[idx].get();
if (current->format == Format::EXP) {
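/* only position exports (targets in [SQ_EXP_POS, SQ_EXP_PARAM)) are
 * rescheduled; MRT and parameter exports are left in place */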
unsigned target = static_cast<Export_instruction*>(current)->dest;
if (target >= V_008DFC_SQ_EXP_POS && target < V_008DFC_SQ_EXP_PARAM)
schedule_position_export(ctx, block, live_vars.register_demand[block->index], current, idx);
}
}
}
/* resummarize the block's register demand */
block->register_demand = RegisterDemand();
for (unsigned idx = 0; idx < block->instructions.size(); idx++) {
block->register_demand.update(live_vars.register_demand[block->index][idx]);
}
}
void schedule_program(Program *program, live& live_vars)
{
sched_ctx ctx;
ctx.depends_on.resize(program->peekAllocationId());
ctx.RAR_dependencies.resize(program->peekAllocationId());
/* Allowing the scheduler to reduce the number of waves to as low as 5
* improves performance of Thrones of Britannia significantly and doesn't
* seem to hurt anything else. */
//TODO: maybe use some sort of heuristic instead
//TODO: this also increases window-size/max-moves? did I realize that at the time?
ctx.num_waves = std::min<uint16_t>(program->num_waves, 5);
assert(ctx.num_waves);
uint16_t total_sgpr_regs = program->chip_class >= GFX8 ? 800 : 512;
uint16_t max_addressible_sgpr = program->sgpr_limit;
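/* illustrative arithmetic, not from the original commit: with 5 waves on
 * GFX8+, the line below allows ((256 / 5) & ~3) - 2 = 46 VGPRs and
 * min(((800 / 5) & ~7) - 2, max_addressible_sgpr) = 158 SGPRs; the masking
 * keeps the per-wave allocation granularity and the -2 leaves headroom */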
ctx.max_registers = { int16_t(((256 / ctx.num_waves) & ~3) - 2), std::min<int16_t>(((total_sgpr_regs / ctx.num_waves) & ~7) - 2, max_addressible_sgpr)};
for (Block& block : program->blocks)
schedule_block(ctx, program, &block, live_vars);
/* update max_reg_demand and num_waves */
RegisterDemand new_demand;
for (Block& block : program->blocks) {
new_demand.update(block.register_demand);
}
update_vgpr_sgpr_demand(program, new_demand);
/* if enabled, this code asserts that register_demand is updated correctly */
#if 0
int prev_num_waves = program->num_waves;
const RegisterDemand prev_max_demand = program->max_reg_demand;
std::vector<RegisterDemand> demands(program->blocks.size());
for (unsigned j = 0; j < program->blocks.size(); j++) {
demands[j] = program->blocks[j].register_demand;
}
struct radv_nir_compiler_options options;
options.chip_class = program->chip_class;
live live_vars2 = aco::live_var_analysis(program, &options);
for (unsigned j = 0; j < program->blocks.size(); j++) {
Block &b = program->blocks[j];
for (unsigned i = 0; i < b.instructions.size(); i++)
assert(live_vars.register_demand[b.index][i] == live_vars2.register_demand[b.index][i]);
assert(b.register_demand == demands[j]);
}
assert(program->max_reg_demand == prev_max_demand);
assert(program->num_waves == prev_num_waves);
#endif
}
}

File diff suppressed because it is too large.

291
src/amd/compiler/aco_ssa_elimination.cpp Normal file

@ -0,0 +1,291 @@
/*
* Copyright © 2018 Valve Corporation
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice (including the next
* paragraph) shall be included in all copies or substantial portions of the
* Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
* THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
* IN THE SOFTWARE.
*
*/
#include "aco_ir.h"
#include <map>
namespace aco {
namespace {
/* map: block-id -> pair (dest, src) to store phi information */
typedef std::map<uint32_t, std::vector<std::pair<Definition, Operand>>> phi_info;
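/* e.g. a phi "v0 = p_phi v1 (from BB1), v2 (from BB2)" produces the entries
 * BB1 -> [(v0, v1)] and BB2 -> [(v0, v2)], i.e. one copy per incoming edge */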
struct ssa_elimination_ctx {
phi_info logical_phi_info;
phi_info linear_phi_info;
std::vector<bool> empty_blocks;
Program* program;
ssa_elimination_ctx(Program* program) : empty_blocks(program->blocks.size(), true), program(program) {}
};
void collect_phi_info(ssa_elimination_ctx& ctx)
{
for (Block& block : ctx.program->blocks) {
for (aco_ptr<Instruction>& phi : block.instructions) {
if (phi->opcode != aco_opcode::p_phi && phi->opcode != aco_opcode::p_linear_phi)
break;
for (unsigned i = 0; i < phi->operands.size(); i++) {
if (phi->operands[i].isUndefined())
continue;
if (phi->operands[i].isTemp() && phi->operands[i].physReg() == phi->definitions[0].physReg())
continue;
std::vector<unsigned>& preds = phi->opcode == aco_opcode::p_phi ? block.logical_preds : block.linear_preds;
phi_info& info = phi->opcode == aco_opcode::p_phi ? ctx.logical_phi_info : ctx.linear_phi_info;
const auto result = info.emplace(preds[i], std::vector<std::pair<Definition, Operand>>());
result.first->second.emplace_back(phi->definitions[0], phi->operands[i]);
ctx.empty_blocks[preds[i]] = false;
}
}
}
}
void insert_parallelcopies(ssa_elimination_ctx& ctx)
{
/* insert the parallelcopies from logical phis before p_logical_end */
for (auto&& entry : ctx.logical_phi_info) {
Block& block = ctx.program->blocks[entry.first];
unsigned idx = block.instructions.size() - 1;
while (block.instructions[idx]->opcode != aco_opcode::p_logical_end) {
assert(idx > 0);
idx--;
}
std::vector<aco_ptr<Instruction>>::iterator it = std::next(block.instructions.begin(), idx);
aco_ptr<Pseudo_instruction> pc{create_instruction<Pseudo_instruction>(aco_opcode::p_parallelcopy, Format::PSEUDO, entry.second.size(), entry.second.size())};
unsigned i = 0;
for (std::pair<Definition, Operand>& pair : entry.second)
{
pc->definitions[i] = pair.first;
pc->operands[i] = pair.second;
i++;
}
/* this shouldn't be needed since we're only copying vgprs */
pc->tmp_in_scc = false;
block.instructions.insert(it, std::move(pc));
}
/* insert parallelcopies for the linear phis at the end of blocks just before the branch */
for (auto&& entry : ctx.linear_phi_info) {
Block& block = ctx.program->blocks[entry.first];
std::vector<aco_ptr<Instruction>>::iterator it = block.instructions.end();
--it;
assert((*it)->format == Format::PSEUDO_BRANCH);
aco_ptr<Pseudo_instruction> pc{create_instruction<Pseudo_instruction>(aco_opcode::p_parallelcopy, Format::PSEUDO, entry.second.size(), entry.second.size())};
unsigned i = 0;
for (std::pair<Definition, Operand>& pair : entry.second)
{
pc->definitions[i] = pair.first;
pc->operands[i] = pair.second;
i++;
}
pc->tmp_in_scc = block.scc_live_out;
pc->scratch_sgpr = block.scratch_sgpr;
block.instructions.insert(it, std::move(pc));
}
}
void try_remove_merge_block(ssa_elimination_ctx& ctx, Block* block)
{
/* check if the successor is another merge block which restores exec */
// TODO: divergent loops also restore exec
if (block->linear_succs.size() != 1 ||
!(ctx.program->blocks[block->linear_succs[0]].kind & block_kind_merge))
return;
/* check if this block is empty and the exec mask is not needed */
for (aco_ptr<Instruction>& instr : block->instructions) {
if (instr->opcode == aco_opcode::p_parallelcopy) {
if (instr->definitions[0].physReg() == exec)
continue;
else
return;
}
if (instr->opcode != aco_opcode::p_linear_phi &&
instr->opcode != aco_opcode::p_phi &&
instr->opcode != aco_opcode::p_logical_start &&
instr->opcode != aco_opcode::p_logical_end &&
instr->opcode != aco_opcode::p_branch)
return;
}
/* keep the branch instruction and remove the rest */
aco_ptr<Instruction> branch = std::move(block->instructions.back());
block->instructions.clear();
block->instructions.emplace_back(std::move(branch));
}
void try_remove_invert_block(ssa_elimination_ctx& ctx, Block* block)
{
assert(block->linear_succs.size() == 2);
if (block->linear_succs[0] != block->linear_succs[1])
return;
/* check if we can remove this block */
for (aco_ptr<Instruction>& instr : block->instructions) {
if (instr->opcode != aco_opcode::p_linear_phi &&
instr->opcode != aco_opcode::p_phi &&
instr->opcode != aco_opcode::s_andn2_b64 &&
instr->opcode != aco_opcode::p_branch)
return;
}
unsigned succ_idx = block->linear_succs[0];
assert(block->linear_preds.size() == 2);
for (unsigned i = 0; i < 2; i++) {
Block *pred = &ctx.program->blocks[block->linear_preds[i]];
pred->linear_succs[0] = succ_idx;
ctx.program->blocks[succ_idx].linear_preds[i] = pred->index;
Pseudo_branch_instruction *branch = static_cast<Pseudo_branch_instruction*>(pred->instructions.back().get());
assert(branch->format == Format::PSEUDO_BRANCH);
branch->target[0] = succ_idx;
branch->target[1] = succ_idx;
}
block->instructions.clear();
block->linear_preds.clear();
block->linear_succs.clear();
}
void try_remove_simple_block(ssa_elimination_ctx& ctx, Block* block)
{
for (aco_ptr<Instruction>& instr : block->instructions) {
if (instr->opcode != aco_opcode::p_logical_start &&
instr->opcode != aco_opcode::p_logical_end &&
instr->opcode != aco_opcode::p_branch)
return;
}
Block& pred = ctx.program->blocks[block->linear_preds[0]];
Block& succ = ctx.program->blocks[block->linear_succs[0]];
Pseudo_branch_instruction* branch = static_cast<Pseudo_branch_instruction*>(pred.instructions.back().get());
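/* rewire the predecessor's branch around this empty block; the cases below
 * cover an unconditional branch, a conditional branch that targets this block
 * directly, and the case where the removal changes which target remains
 * reachable by fall-through */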
if (branch->opcode == aco_opcode::p_branch) {
branch->target[0] = succ.index;
branch->target[1] = succ.index;
} else if (branch->target[0] == block->index) {
branch->target[0] = succ.index;
} else if (branch->target[0] == succ.index) {
assert(branch->target[1] == block->index);
branch->target[1] = succ.index;
branch->opcode = aco_opcode::p_branch;
} else if (branch->target[1] == block->index) {
/* check if there is a fall-through path from block to succ */
bool falls_through = true;
for (unsigned j = block->index + 1; falls_through && j < succ.index; j++) {
assert(ctx.program->blocks[j].index == j);
if (!ctx.program->blocks[j].instructions.empty())
falls_through = false;
}
if (falls_through) {
branch->target[1] = succ.index;
} else {
/* check if there is a fall-through path for the alternative target */
for (unsigned j = block->index + 1; j < branch->target[0]; j++) {
if (!ctx.program->blocks[j].instructions.empty())
return;
}
/* This is a (uniform) break or continue block. The branch condition has to be inverted. */
if (branch->opcode == aco_opcode::p_cbranch_z)
branch->opcode = aco_opcode::p_cbranch_nz;
else if (branch->opcode == aco_opcode::p_cbranch_nz)
branch->opcode = aco_opcode::p_cbranch_z;
else
assert(false);
/* also invert the linear successors */
pred.linear_succs[0] = pred.linear_succs[1];
pred.linear_succs[1] = succ.index;
branch->target[1] = branch->target[0];
branch->target[0] = succ.index;
}
} else {
assert(false);
}
if (branch->target[0] == branch->target[1])
branch->opcode = aco_opcode::p_branch;
for (unsigned i = 0; i < pred.linear_succs.size(); i++)
if (pred.linear_succs[i] == block->index)
pred.linear_succs[i] = succ.index;
for (unsigned i = 0; i < succ.linear_preds.size(); i++)
if (succ.linear_preds[i] == block->index)
succ.linear_preds[i] = pred.index;
block->instructions.clear();
block->linear_preds.clear();
block->linear_succs.clear();
}
void jump_threading(ssa_elimination_ctx& ctx)
{
for (int i = ctx.program->blocks.size() - 1; i >= 0; i--) {
Block* block = &ctx.program->blocks[i];
if (!ctx.empty_blocks[i])
continue;
if (block->kind & block_kind_invert) {
try_remove_invert_block(ctx, block);
continue;
}
if (block->linear_succs.size() > 1)
continue;
if (block->kind & block_kind_merge ||
block->kind & block_kind_loop_exit)
try_remove_merge_block(ctx, block);
if (block->linear_preds.size() == 1)
try_remove_simple_block(ctx, block);
}
}
} /* end namespace */
void ssa_elimination(Program* program)
{
ssa_elimination_ctx ctx(program);
/* Collect information about every phi-instruction */
collect_phi_info(ctx);
/* eliminate empty blocks */
jump_threading(ctx);
/* insert parallelcopies from SSA elimination */
insert_parallelcopies(ctx);
}
}

233
src/amd/compiler/aco_util.h Normal file

@ -0,0 +1,233 @@
/*
* Copyright Michael Schellenberger Costa
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice (including the next
* paragraph) shall be included in all copies or substantial portions of the
* Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
* THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
* IN THE SOFTWARE.
*
*/
#ifndef ACO_UTIL_H
#define ACO_UTIL_H
#include <cassert>
#include <iterator>
namespace aco {
/*! \brief Definition of a span object
*
* \details A "span" is an "array view" type for holding a view of contiguous
* data. The "span" object does not own the data itself.
*/
template <typename T>
class span {
public:
using value_type = T;
using pointer = value_type*;
using const_pointer = const value_type*;
using reference = value_type&;
using const_reference = const value_type&;
using iterator = pointer;
using const_iterator = const_pointer;
using reverse_iterator = std::reverse_iterator<iterator>;
using const_reverse_iterator = std::reverse_iterator<const_iterator>;
using size_type = std::size_t;
using difference_type = std::ptrdiff_t;
/*! \brief Compiler generated default constructor
*/
constexpr span() = default;
/*! \brief Constructor taking a pointer and the length of the span
* \param[in] data Pointer to the underlying data array
* \param[in] length The size of the span
*/
constexpr span(pointer data, const size_type length)
: data{ data }, length{ length } {}
/*! \brief Returns an iterator to the beginning of the span
* \return data
*/
constexpr iterator begin() noexcept {
return data;
}
/*! \brief Returns a const_iterator to the beginning of the span
* \return data
*/
constexpr const_iterator begin() const noexcept {
return data;
}
/*! \brief Returns an iterator to the end of the span
* \return data + length
*/
constexpr iterator end() noexcept {
return std::next(data, length);
}
/*! \brief Returns a const_iterator to the end of the span
* \return data + length
*/
constexpr const_iterator end() const noexcept {
return std::next(data, length);
}
/*! \brief Returns a const_iterator to the beginning of the span
* \return data
*/
constexpr const_iterator cbegin() const noexcept {
return data;
}
/*! \brief Returns a const_iterator to the end of the span
* \return data + length
*/
constexpr const_iterator cend() const noexcept {
return std::next(data, length);
}
/*! \brief Returns a reverse_iterator to the end of the span
* \return reverse_iterator(end())
*/
constexpr reverse_iterator rbegin() noexcept {
return reverse_iterator(end());
}
/*! \brief Returns a const_reverse_iterator to the end of the span
* \return reverse_iterator(end())
*/
constexpr const_reverse_iterator rbegin() const noexcept {
return const_reverse_iterator(end());
}
/*! \brief Returns a reverse_iterator to the beginning of the span
* \return reverse_iterator(begin())
*/
constexpr reverse_iterator rend() noexcept {
return reverse_iterator(begin());
}
/*! \brief Returns a const_reverse_iterator to the beginning of the span
* \return reverse_iterator(begin())
*/
constexpr const_reverse_iterator rend() const noexcept {
return const_reverse_iterator(begin());
}
/*! \brief Returns a const_reverse_iterator to the end of the span
* \return rbegin()
*/
constexpr const_reverse_iterator crbegin() const noexcept {
return const_reverse_iterator(cend());
}
/*! \brief Returns a const_reverse_iterator to the beginning of the span
* \return rend()
*/
constexpr const_reverse_iterator crend() const noexcept {
return const_reverse_iterator(cbegin());
}
/*! \brief Unchecked access operator
* \param[in] index Index of the element we want to access
* \return *(std::next(data, index))
*/
constexpr reference operator[](const size_type index) noexcept {
assert(length > index);
return *(std::next(data, index));
}
/*! \brief Unchecked const access operator
* \param[in] index Index of the element we want to access
* \return *(std::next(data, index))
*/
constexpr const_reference operator[](const size_type index) const noexcept {
assert(length > index);
return *(std::next(data, index));
}
/*! \brief Returns a reference to the last element of the span
* \return *(std::next(data, length - 1))
*/
constexpr reference back() noexcept {
assert(length > 0);
return *(std::next(data, length - 1));
}
/*! \brief Returns a const_reference to the last element of the span
* \return *(std::next(data, length - 1))
*/
constexpr const_reference back() const noexcept {
assert(length > 0);
return *(std::next(data, length - 1));
}
/*! \brief Returns a reference to the first element of the span
* \return *begin()
*/
constexpr reference front() noexcept {
assert(length > 0);
return *begin();
}
/*! \brief Returns a const_reference to the first element of the span
* \return *cbegin()
*/
constexpr const_reference front() const noexcept {
assert(length > 0);
return *cbegin();
}
/*! \brief Returns true if the span is empty
* \return length == 0
*/
constexpr bool empty() const noexcept {
return length == 0;
}
/*! \brief Returns the size of the span
* \return length
*/
constexpr size_type size() const noexcept {
return length;
}
/*! \brief Decreases the size of the span by 1
*/
constexpr void pop_back() noexcept {
assert(length > 0);
--length;
}
/*! \brief Clears the span
*/
constexpr void clear() noexcept {
data = nullptr;
length = 0;
}
private:
pointer data{ nullptr }; //!< Pointer to the underlying data array
size_type length{ 0 }; //!< Size of the span
};
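/* illustrative usage sketch, not part of the original header:
 *    int buf[4] = {1, 2, 3, 4};
 *    span<int> s{buf, 4};
 *    assert(s.size() == 4 && s.front() == 1 && s.back() == 4);
 *    s.pop_back(); // s now views only the first three elements; buf is untouched
 */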
} // namespace aco
#endif // ACO_UTIL_H

460
src/amd/compiler/aco_validate.cpp Normal file

@ -0,0 +1,460 @@
/*
* Copyright © 2018 Valve Corporation
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice (including the next
* paragraph) shall be included in all copies or substantial portions of the
* Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
* THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
* IN THE SOFTWARE.
*
*/
#include "aco_ir.h"
#include <map>
namespace aco {
#ifndef NDEBUG
void perfwarn(bool cond, const char *msg, Instruction *instr)
{
if (cond) {
fprintf(stderr, "ACO performance warning: %s\n", msg);
if (instr) {
fprintf(stderr, "instruction: ");
aco_print_instr(instr, stderr);
fprintf(stderr, "\n");
}
if (debug_flags & DEBUG_PERFWARN)
exit(1);
}
}
#endif
void validate(Program* program, FILE * output)
{
if (!(debug_flags & DEBUG_VALIDATE))
return;
bool is_valid = true;
auto check = [&output, &is_valid](bool check, const char * msg, aco::Instruction * instr) -> void {
if (!check) {
fprintf(output, "%s: ", msg);
aco_print_instr(instr, output);
fprintf(output, "\n");
is_valid = false;
}
};
for (Block& block : program->blocks) {
for (aco_ptr<Instruction>& instr : block.instructions) {
/* check base format */
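/* instr->format may combine one base encoding with modifier bits (SDWA, DPP,
 * VOP3); strip/decode those bits to recover the base encoding before
 * comparing against the opcode table */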
Format base_format = instr->format;
base_format = (Format)((uint32_t)base_format & ~(uint32_t)Format::SDWA);
base_format = (Format)((uint32_t)base_format & ~(uint32_t)Format::DPP);
if ((uint32_t)base_format & (uint32_t)Format::VOP1)
base_format = Format::VOP1;
else if ((uint32_t)base_format & (uint32_t)Format::VOP2)
base_format = Format::VOP2;
else if ((uint32_t)base_format & (uint32_t)Format::VOPC)
base_format = Format::VOPC;
else if ((uint32_t)base_format & (uint32_t)Format::VINTRP)
base_format = Format::VINTRP;
check(base_format == instr_info.format[(int)instr->opcode], "Wrong base format for instruction", instr.get());
/* check VOP3 modifiers */
if (((uint32_t)instr->format & (uint32_t)Format::VOP3) && instr->format != Format::VOP3) {
check(base_format == Format::VOP2 ||
base_format == Format::VOP1 ||
base_format == Format::VOPC ||
base_format == Format::VINTRP,
"Format cannot have VOP3A/VOP3B applied", instr.get());
}
/* check for undefs */
for (unsigned i = 0; i < instr->operands.size(); i++) {
if (instr->operands[i].isUndefined()) {
bool flat = instr->format == Format::FLAT || instr->format == Format::SCRATCH || instr->format == Format::GLOBAL;
bool can_be_undef = is_phi(instr) || instr->format == Format::EXP ||
instr->format == Format::PSEUDO_REDUCTION ||
(flat && i == 1) || (instr->format == Format::MIMG && i == 2) ||
((instr->format == Format::MUBUF || instr->format == Format::MTBUF) && i == 0);
check(can_be_undef, "Undefs can only be used in certain operands", instr.get());
}
}
/* check num literals */
if (instr->isSALU() || instr->isVALU()) {
unsigned num_literals = 0;
for (unsigned i = 0; i < instr->operands.size(); i++)
{
if (instr->operands[i].isLiteral()) {
check(instr->format == Format::SOP1 ||
instr->format == Format::SOP2 ||
instr->format == Format::SOPC ||
instr->format == Format::VOP1 ||
instr->format == Format::VOP2 ||
instr->format == Format::VOPC,
"Literal applied on wrong instruction format", instr.get());
num_literals++;
check(!instr->isVALU() || i == 0 || i == 2, "Wrong source position for Literal argument", instr.get());
}
}
check(num_literals <= 1, "Only 1 Literal allowed", instr.get());
/* check num sgprs for VALU */
if (instr->isVALU()) {
check(instr->definitions[0].getTemp().type() == RegType::vgpr ||
(int) instr->format & (int) Format::VOPC ||
instr->opcode == aco_opcode::v_readfirstlane_b32 ||
instr->opcode == aco_opcode::v_readlane_b32,
"Wrong Definition type for VALU instruction", instr.get());
unsigned num_sgpr = 0;
unsigned sgpr_idx = instr->operands.size();
for (unsigned i = 0; i < instr->operands.size(); i++)
{
if (instr->operands[i].isTemp() && instr->operands[i].regClass().type() == RegType::sgpr) {
check(i != 1 || (int) instr->format & (int) Format::VOP3A, "Wrong source position for SGPR argument", instr.get());
if (sgpr_idx == instr->operands.size() || instr->operands[sgpr_idx].tempId() != instr->operands[i].tempId())
num_sgpr++;
sgpr_idx = i;
}
if (instr->operands[i].isConstant() && !instr->operands[i].isLiteral())
check(i == 0 || (int) instr->format & (int) Format::VOP3A, "Wrong source position for constant argument", instr.get());
}
check(num_sgpr + num_literals <= 1, "Only 1 Literal OR 1 SGPR allowed", instr.get());
}
if (instr->format == Format::SOP1 || instr->format == Format::SOP2) {
check(instr->definitions[0].getTemp().type() == RegType::sgpr, "Wrong Definition type for SALU instruction", instr.get());
for (const Operand& op : instr->operands) {
check(op.isConstant() || op.regClass().type() <= RegType::sgpr,
"Wrong Operand type for SALU instruction", instr.get());
}
}
}
switch (instr->format) {
case Format::PSEUDO: {
if (instr->opcode == aco_opcode::p_create_vector) {
unsigned size = 0;
for (const Operand& op : instr->operands) {
size += op.size();
}
check(size == instr->definitions[0].size(), "Definition size does not match operand sizes", instr.get());
if (instr->definitions[0].getTemp().type() == RegType::sgpr) {
for (const Operand& op : instr->operands) {
check(op.isConstant() || op.regClass().type() == RegType::sgpr,
"Wrong Operand type for scalar vector", instr.get());
}
}
} else if (instr->opcode == aco_opcode::p_extract_vector) {
check((instr->operands[0].isTemp()) && instr->operands[1].isConstant(), "Wrong Operand types", instr.get());
check(instr->operands[1].constantValue() < instr->operands[0].size(), "Index out of range", instr.get());
check(instr->definitions[0].getTemp().type() == RegType::vgpr || instr->operands[0].regClass().type() == RegType::sgpr,
"Cannot extract SGPR value from VGPR vector", instr.get());
} else if (instr->opcode == aco_opcode::p_parallelcopy) {
check(instr->definitions.size() == instr->operands.size(), "Number of Operands does not match number of Definitions", instr.get());
for (unsigned i = 0; i < instr->operands.size(); i++) {
if (instr->operands[i].isTemp())
check((instr->definitions[i].getTemp().type() == instr->operands[i].regClass().type()) ||
(instr->definitions[i].getTemp().type() == RegType::vgpr && instr->operands[i].regClass().type() == RegType::sgpr),
"Operand and Definition types do not match", instr.get());
}
} else if (instr->opcode == aco_opcode::p_phi) {
check(instr->operands.size() == block.logical_preds.size(), "Number of Operands does not match number of predecessors", instr.get());
check(instr->definitions[0].getTemp().type() == RegType::vgpr || instr->definitions[0].getTemp().regClass() == s2, "Logical Phi Definition must be vgpr or divergent boolean", instr.get());
} else if (instr->opcode == aco_opcode::p_linear_phi) {
for (const Operand& op : instr->operands)
check(!op.isTemp() || op.getTemp().is_linear(), "Wrong Operand type", instr.get());
check(instr->operands.size() == block.linear_preds.size(), "Number of Operands does not match number of predecessors", instr.get());
}
break;
}
case Format::SMEM: {
if (instr->operands.size() >= 1)
check(instr->operands[0].isTemp() && instr->operands[0].regClass().type() == RegType::sgpr, "SMEM operands must be sgpr", instr.get());
if (instr->operands.size() >= 2)
check(instr->operands[1].isConstant() || (instr->operands[1].isTemp() && instr->operands[1].regClass().type() == RegType::sgpr),
"SMEM offset must be constant or sgpr", instr.get());
if (!instr->definitions.empty())
check(instr->definitions[0].getTemp().type() == RegType::sgpr, "SMEM result must be sgpr", instr.get());
break;
}
case Format::MTBUF:
case Format::MUBUF:
case Format::MIMG: {
check(instr->operands.size() > 1, "VMEM instructions must have at least two operands", instr.get());
check(instr->operands[0].hasRegClass() && instr->operands[0].regClass().type() == RegType::vgpr,
"VADDR must be in vgpr for VMEM instructions", instr.get());
check(instr->operands[1].isTemp() && instr->operands[1].regClass().type() == RegType::sgpr, "VMEM resource constant must be sgpr", instr.get());
check(instr->operands.size() < 4 || (instr->operands[3].isTemp() && instr->operands[3].regClass().type() == RegType::vgpr), "VMEM write data must be vgpr", instr.get());
break;
}
case Format::DS: {
for (const Operand& op : instr->operands) {
check((op.isTemp() && op.regClass().type() == RegType::vgpr) || op.physReg() == m0,
"Only VGPRs are valid DS instruction operands", instr.get());
}
if (!instr->definitions.empty())
check(instr->definitions[0].getTemp().type() == RegType::vgpr, "DS instruction must return VGPR", instr.get());
break;
}
case Format::EXP: {
for (unsigned i = 0; i < 4; i++)
check(instr->operands[i].hasRegClass() && instr->operands[i].regClass().type() == RegType::vgpr,
"Only VGPRs are valid Export arguments", instr.get());
break;
}
case Format::FLAT:
check(instr->operands[1].isUndefined(), "Flat instructions don't support SADDR", instr.get());
/* fallthrough */
case Format::GLOBAL:
case Format::SCRATCH: {
check(instr->operands[0].isTemp() && instr->operands[0].regClass().type() == RegType::vgpr, "FLAT/GLOBAL/SCRATCH address must be vgpr", instr.get());
check(instr->operands[1].hasRegClass() && instr->operands[1].regClass().type() == RegType::sgpr,
"FLAT/GLOBAL/SCRATCH sgpr address must be undefined or sgpr", instr.get());
if (!instr->definitions.empty())
check(instr->definitions[0].getTemp().type() == RegType::vgpr, "FLAT/GLOBAL/SCRATCH result must be vgpr", instr.get());
else
check(instr->operands[2].regClass().type() == RegType::vgpr, "FLAT/GLOBAL/SCRATCH data must be vgpr", instr.get());
break;
}
default:
break;
}
}
}
assert(is_valid);
}
/* RA validation */
namespace {
struct Location {
Location() : block(NULL), instr(NULL) {}
Block *block;
Instruction *instr; //NULL if it's the block's live-in
};
struct Assignment {
Location defloc;
Location firstloc;
PhysReg reg;
};
bool ra_fail(FILE *output, Location loc, Location loc2, const char *fmt, ...) {
va_list args;
va_start(args, fmt);
char msg[1024];
vsnprintf(msg, sizeof(msg), fmt, args);
va_end(args);
fprintf(stderr, "RA error found at instruction in BB%d:\n", loc.block->index);
if (loc.instr) {
aco_print_instr(loc.instr, stderr);
fprintf(stderr, "\n%s", msg);
} else {
fprintf(stderr, "%s", msg);
}
if (loc2.block) {
fprintf(stderr, " in BB%d:\n", loc2.block->index);
aco_print_instr(loc2.instr, stderr);
}
fprintf(stderr, "\n\n");
return true;
}
} /* end namespace */
bool validate_ra(Program *program, const struct radv_nir_compiler_options *options, FILE *output) {
if (!(debug_flags & DEBUG_VALIDATE_RA))
return false;
bool err = false;
aco::live live_vars = aco::live_var_analysis(program, options);
std::vector<std::vector<Temp>> phi_sgpr_ops(program->blocks.size());
std::map<unsigned, Assignment> assignments;
for (Block& block : program->blocks) {
Location loc;
loc.block = &block;
for (aco_ptr<Instruction>& instr : block.instructions) {
if (instr->opcode == aco_opcode::p_phi) {
for (unsigned i = 0; i < instr->operands.size(); i++) {
if (instr->operands[i].isTemp() &&
instr->operands[i].getTemp().type() == RegType::sgpr &&
instr->operands[i].isFirstKill())
phi_sgpr_ops[block.logical_preds[i]].emplace_back(instr->operands[i].getTemp());
}
}
loc.instr = instr.get();
for (unsigned i = 0; i < instr->operands.size(); i++) {
Operand& op = instr->operands[i];
if (!op.isTemp())
continue;
if (!op.isFixed())
err |= ra_fail(output, loc, Location(), "Operand %d is not assigned a register", i);
if (assignments.count(op.tempId()) && assignments[op.tempId()].reg != op.physReg())
err |= ra_fail(output, loc, assignments.at(op.tempId()).firstloc, "Operand %d has an inconsistent register assignment with instruction", i);
if ((op.getTemp().type() == RegType::vgpr && op.physReg() + op.size() > 256 + program->config->num_vgprs) ||
(op.getTemp().type() == RegType::sgpr && op.physReg() + op.size() > program->config->num_sgprs && op.physReg() < program->sgpr_limit))
err |= ra_fail(output, loc, assignments.at(op.tempId()).firstloc, "Operand %d has an out-of-bounds register assignment", i);
if (!assignments[op.tempId()].firstloc.block)
assignments[op.tempId()].firstloc = loc;
if (!assignments[op.tempId()].defloc.block)
assignments[op.tempId()].reg = op.physReg();
}
for (unsigned i = 0; i < instr->definitions.size(); i++) {
Definition& def = instr->definitions[i];
if (!def.isTemp())
continue;
if (!def.isFixed())
err |= ra_fail(output, loc, Location(), "Definition %d is not assigned a register", i);
if (assignments[def.tempId()].defloc.block)
err |= ra_fail(output, loc, assignments.at(def.tempId()).defloc, "Temporary %%%d also defined by instruction", def.tempId());
if ((def.getTemp().type() == RegType::vgpr && def.physReg() + def.size() > 256 + program->config->num_vgprs) ||
(def.getTemp().type() == RegType::sgpr && def.physReg() + def.size() > program->config->num_sgprs && def.physReg() < program->sgpr_limit))
err |= ra_fail(output, loc, assignments.at(def.tempId()).firstloc, "Definition %d has an out-of-bounds register assignment", i);
if (!assignments[def.tempId()].firstloc.block)
assignments[def.tempId()].firstloc = loc;
assignments[def.tempId()].defloc = loc;
assignments[def.tempId()].reg = def.physReg();
}
}
}
for (Block& block : program->blocks) {
Location loc;
loc.block = &block;
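/* one slot per physical register (SGPRs from 0, VGPRs from 256 upwards),
 * holding the id of the temporary currently occupying it (0 = free) */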
std::array<unsigned, 512> regs;
regs.fill(0);
std::set<Temp> live;
live.insert(live_vars.live_out[block.index].begin(), live_vars.live_out[block.index].end());
/* remove killed p_phi sgpr operands */
for (Temp tmp : phi_sgpr_ops[block.index])
live.erase(tmp);
/* check live out */
for (Temp tmp : live) {
PhysReg reg = assignments.at(tmp.id()).reg;
for (unsigned i = 0; i < tmp.size(); i++) {
if (regs[reg + i]) {
err |= ra_fail(output, loc, Location(), "Assignment of element %d of %%%d already taken by %%%d in live-out", i, tmp.id(), regs[reg + i]);
}
regs[reg + i] = tmp.id();
}
}
regs.fill(0);
for (auto it = block.instructions.rbegin(); it != block.instructions.rend(); ++it) {
aco_ptr<Instruction>& instr = *it;
/* check killed p_phi sgpr operands */
if (instr->opcode == aco_opcode::p_logical_end) {
for (Temp tmp : phi_sgpr_ops[block.index]) {
PhysReg reg = assignments.at(tmp.id()).reg;
for (unsigned i = 0; i < tmp.size(); i++) {
if (regs[reg + i])
err |= ra_fail(output, loc, Location(), "Assignment of element %d of %%%d already taken by %%%d in live-out", i, tmp.id(), regs[reg + i]);
}
live.emplace(tmp);
}
}
for (const Definition& def : instr->definitions) {
if (!def.isTemp())
continue;
live.erase(def.getTemp());
}
/* don't count phi operands as live-in, since they are actually
* killed when they are copied at the predecessor */
if (instr->opcode != aco_opcode::p_phi && instr->opcode != aco_opcode::p_linear_phi) {
for (const Operand& op : instr->operands) {
if (!op.isTemp())
continue;
live.insert(op.getTemp());
}
}
}
for (Temp tmp : live) {
PhysReg reg = assignments.at(tmp.id()).reg;
for (unsigned i = 0; i < tmp.size(); i++)
regs[reg + i] = tmp.id();
}
for (aco_ptr<Instruction>& instr : block.instructions) {
loc.instr = instr.get();
/* remove killed p_phi operands from regs */
if (instr->opcode == aco_opcode::p_logical_end) {
for (Temp tmp : phi_sgpr_ops[block.index]) {
PhysReg reg = assignments.at(tmp.id()).reg;
regs[reg] = 0;
}
}
if (instr->opcode != aco_opcode::p_phi && instr->opcode != aco_opcode::p_linear_phi) {
for (const Operand& op : instr->operands) {
if (!op.isTemp())
continue;
if (op.isFirstKill()) {
for (unsigned j = 0; j < op.getTemp().size(); j++)
regs[op.physReg() + j] = 0;
}
}
}
for (unsigned i = 0; i < instr->definitions.size(); i++) {
Definition& def = instr->definitions[i];
if (!def.isTemp())
continue;
Temp tmp = def.getTemp();
PhysReg reg = assignments.at(tmp.id()).reg;
for (unsigned j = 0; j < tmp.size(); j++) {
if (regs[reg + j])
err |= ra_fail(output, loc, assignments.at(regs[reg + j]).defloc, "Assignment of element %d of %%%d already taken by %%%d from instruction", j, tmp.id(), regs[reg + j]);
regs[reg + j] = tmp.id();
}
}
for (const Definition& def : instr->definitions) {
if (!def.isTemp())
continue;
if (def.isKill()) {
for (unsigned j = 0; j < def.getTemp().size(); j++)
regs[def.physReg() + j] = 0;
}
}
}
}
return err;
}
}

103
src/amd/compiler/meson.build Normal file

@ -0,0 +1,103 @@
# Copyright © 2018 Valve Corporation
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
aco_depends = files('aco_opcodes.py')
aco_opcodes_h = custom_target(
'aco_opcodes.h',
input : 'aco_opcodes_h.py',
output : 'aco_opcodes.h',
command : [prog_python, '@INPUT@'],
capture : true,
depend_files : aco_depends,
)
aco_opcodes_c = custom_target(
'aco_opcodes.cpp',
input : 'aco_opcodes_cpp.py',
output : 'aco_opcodes.cpp',
command : [prog_python, '@INPUT@'],
capture : true,
depend_files : aco_depends,
)
aco_builder_h = custom_target(
'aco_builder.h',
input : 'aco_builder_h.py',
output : 'aco_builder.h',
command : [prog_python, '@INPUT@'],
capture : true,
depend_files : aco_depends,
)
# Headers-only dependency
idep_aco_headers = declare_dependency(
sources : [aco_opcodes_h],
include_directories : include_directories('.'),
)
libaco_files = files(
'aco_dead_code_analysis.cpp',
'aco_dominance.cpp',
'aco_instruction_selection.cpp',
'aco_instruction_selection_setup.cpp',
'aco_interface.cpp',
'aco_interface.h',
'aco_ir.h',
'aco_assembler.cpp',
'aco_insert_exec_mask.cpp',
'aco_insert_NOPs.cpp',
'aco_insert_waitcnt.cpp',
'aco_reduce_assign.cpp',
'aco_register_allocation.cpp',
'aco_live_var_analysis.cpp',
'aco_lower_bool_phis.cpp',
'aco_lower_to_hw_instr.cpp',
'aco_optimizer.cpp',
'aco_opt_value_numbering.cpp',
'aco_print_asm.cpp',
'aco_print_ir.cpp',
'aco_scheduler.cpp',
'aco_ssa_elimination.cpp',
'aco_spill.cpp',
'aco_util.h',
'aco_validate.cpp',
)
_libaco = static_library(
'aco',
[libaco_files, aco_opcodes_c, aco_opcodes_h, aco_builder_h],
include_directories : [
inc_common, inc_compiler, inc_mesa, inc_mapi, inc_amd, inc_amd_common,
],
dependencies : [
dep_llvm, dep_thread, dep_elf, dep_libdrm_amdgpu, dep_valgrind,
idep_nir_headers, idep_amdgfxregs_h,
],
c_args : [c_vis_args],
cpp_args : [cpp_vis_args],
build_by_default : true,
)
# Also link with aco
idep_aco = declare_dependency(
dependencies : idep_aco_headers,
link_with : _libaco,
)
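# Consumers can then pull in both the generated headers and the static library
# through this single dependency, e.g. (hypothetical target):
#   dependencies : [idep_aco, ...]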