Abstract: A processor method and apparatus that allows for the overlapped execution of multiple iterations of a loop while allowing the compiler to include only a single copy of the loop body in the code while automatically managing which iterations are active. Since the prologue and epilogue are implicitly created and maintained within the hardware in the invention, a significant reduction in code size can be achieved compared to software-only modulo scheduling. Furthermore, loops with iteration counts less than the number of concurrent iterations present in the kernel are also automatically handled. This hardware enhanced scheme achieves the same performance as the fully-specified standard method. Furthermore, the hardware reduces the power requirement as the entire fetch unit can be deactivated for a portion of the loop's execution.