This paper describes an on-die lightweight nanoAES hardware accelerator, fabricated in 22 nm tri-gate high-k/metal-gate CMOS, targeted at ultra-low-power symmetric-key encryption and decryption on mobile SoCs. In contrast to conventional 128-bit AES implementations, this design uses a single 8-bit Sbox circuit along with ShiftRows byte-order data processing to compute all AES rounds in a native composite field. This approach, along with a serial-accumulating MixColumns circuit, area-optimized encrypt and decrypt Galois-field polynomials, and an integrated on-the-fly key generation circuit, results in a compact encrypt/decrypt layout occupying 2200/2736 µm² with the lowest-reported gate count of 1947/2090, respectively, while achieving: (i) a maximum operating frequency of 1.133 GHz and total power consumption of 13 mW with a leakage component of 500 µW, measured at 0.9 V, 25 °C; (ii) nominal AES-128 encrypt/decrypt throughput of 432/671 Mbps, respectively, with peak energy efficiency of 289 Gbps/W measured at near-threshold operation of 430 mV (11× higher than previously reported implementations); (iii) encrypt/decrypt latencies of 336/216 cycles and total energy consumption of 3.9/2.5 nJ, respectively; (iv) a wide operating supply voltage range with robust sub-threshold performance of 45 Mbps, 170 µW, measured at 340 mV, 25 °C; and (v) the first-reported Galois-field polynomial-based micro-architectural co-optimization, resulting in distinct area-optimized encrypt and decrypt polynomials with up to 9% area reduction at iso-performance.

Index Terms: Advanced encryption standard, composite-field polynomial arithmetic, encryption hardware accelerator, lightweight crypto, on-the-fly key-generation, security, ultra-low power AES.
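The throughput and energy figures quoted above follow from the stated clock frequency, cycle latencies, and total power. The short Python sketch below reproduces that arithmetic as a consistency check. It rests on two assumptions that the abstract does not state explicitly: that one 128-bit block is processed at a time (no overlap between blocks), and that the 13 mW total power applies equally to encryption and decryption. The function and constant names are illustrative only.

```python
# Figures quoted in the abstract for the 0.9 V, 25 degC operating point.
FREQ_HZ = 1.133e9      # maximum operating frequency
POWER_W = 13e-3        # total power (assumed identical for encrypt and decrypt)
BLOCK_BITS = 128       # AES block size in bits
LATENCY_CYCLES = {"encrypt": 336, "decrypt": 216}


def throughput_mbps(cycles: int) -> float:
    """Throughput in Mbps, assuming one block completes every `cycles` cycles."""
    return BLOCK_BITS * FREQ_HZ / cycles / 1e6


def energy_nj(cycles: int) -> float:
    """Energy per block in nJ, assuming POWER_W is drawn for the full latency."""
    return POWER_W * cycles / FREQ_HZ * 1e9


def summarize() -> dict:
    """Return {mode: (throughput rounded to Mbps, energy rounded to 0.1 nJ)}."""
    return {
        mode: (round(throughput_mbps(cycles)), round(energy_nj(cycles), 1))
        for mode, cycles in LATENCY_CYCLES.items()
    }


if __name__ == "__main__":
    for mode, (mbps, nj) in summarize().items():
        print(f"{mode}: {mbps} Mbps, {nj} nJ per block")
```

Under these assumptions the computed values round to the reported 432/671 Mbps and 3.9/2.5 nJ, which suggests the quoted throughput and energy numbers are derived from the 336/216-cycle latencies at 1.133 GHz.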