Parallel radix-4 turbo decoding is used to enhance throughput while reducing the overall memory cost. The bottleneck is the higher complexity of the radix-4 parallel interleaver implementation. This paper addresses the implementation issues of the radix-4 parallel interleaver and proposes the modifications to the interleaver algorithms that are necessary for parallel address generation. It presents a reconfigurable architecture that allows the same turbo decoding core to be used for multiple standards. The proposed interleaver architecture handles memory conflicts on the fly. It consumes 12.5K gates and can run at a frequency of 285 MHz, supporting a throughput of 173.3 Mbps, which can cover most of the emerging communication standards.
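To make the address-generation problem concrete, the following is a minimal software sketch, not the proposed hardware. It assumes a quadratic permutation polynomial (QPP) interleaver of the kind used in 3GPP LTE, with the parameters K = 40, f1 = 3, f2 = 10 taken from the LTE interleaver table. The function names, the number of parallel decoders, and the bank mapping are all hypothetical choices made for illustration. Each radix-4 decoder consumes two consecutive trellis positions per cycle, so a decoder with P parallel units requests 2P interleaved addresses per cycle; a memory conflict occurs when two of them fall in the same bank.

```python
from collections import Counter


def qpp_interleave(i, K, f1, f2):
    """Interleaved address of position i for a QPP interleaver (assumed here)."""
    return (f1 * i + f2 * i * i) % K


def radix4_parallel_addresses(K, f1, f2, num_siso):
    """Yield one list of 2*num_siso addresses per cycle.

    The block is split into num_siso equal windows (K must be divisible by
    num_siso, and the window length must be even). In each cycle every
    decoder requests the addresses of two consecutive positions, which is
    what a radix-4 recursion needs.
    """
    window = K // num_siso
    for t in range(0, window, 2):
        cycle = []
        for p in range(num_siso):
            base = p * window + t
            cycle.append(qpp_interleave(base, K, f1, f2))
            cycle.append(qpp_interleave(base + 1, K, f1, f2))
        yield cycle


def count_bank_conflicts(addresses, bank_of):
    """Number of accesses in one cycle that collide with an earlier access."""
    banks = Counter(bank_of(a) for a in addresses)
    return sum(n - 1 for n in banks.values() if n > 1)


# Illustrative parameters: smallest LTE block size, two parallel decoders.
K, F1, F2, NUM_SISO = 40, 3, 10, 2
WINDOW = K // NUM_SISO


def window_parity_bank(addr):
    """Hypothetical bank mapping: one bank per (window, even/odd address)."""
    return (addr // WINDOW, addr % 2)


def parity_only_bank(addr):
    """Hypothetical mapping with too few banks, used to show conflicts."""
    return addr % 2


cycles = list(radix4_parallel_addresses(K, F1, F2, NUM_SISO))
conflicts_good = sum(count_bank_conflicts(c, window_parity_bank) for c in cycles)
conflicts_bad = sum(count_bank_conflicts(c, parity_only_bank) for c in cycles)
```

With these particular parameters the window-and-parity mapping yields no conflicts, while the two-bank mapping collides in every cycle. Interleavers of other standards do not generally have such a conflict-free mapping, which is the case that on-the-fly conflict handling is meant to cover.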