Abstract-Custom instruction set extensions can substantially boost the performance of reconfigurable softcore CPUs. While this approach is commonly tailored to one specific FPGA system, we present a fine-grained, FPGA-like overlay architecture that can be implemented in the user logic of various FPGA families from different vendors. This allows a portable application, consisting of a program binary and an overlay configuration, to execute in a completely heterogeneous environment. Furthermore, we present several optimizations that significantly reduce the implementation cost of the proposed overlay architecture. In particular, these include mapping the overlay interconnection network directly into the switch fabric of the hosting FPGA. Our case study demonstrates an overhead reduction of an order of magnitude compared to related approaches.