Hardware Transactional Memory (HTM) relies heavily on the on-chip network for inter-transaction communication. However, the network bandwidth utilization of transactions has been largely neglected in HTM designs. In this work, we propose a cost model to analyze network bandwidth in transaction execution. The cost model identifies a set of key factors that can be optimized through system design to reduce the communication cost of HTM. Based on the model and network traffic characterization of a representative HTM design, we identify a huge source of superfluous traffic due to failed requests in transaction conflicts. As observed in a spectrum of workloads, 39% of the transactional requests fail due to conflicts, which renders 58% of the transactional network traffic futile. To combat this pathology, a novel in-network filtering mechanism is proposed. The on-chip router is augmented to predict conflicts among transactions and proactively filter out those requests that have a high probability to fail. Experimental results show the proposed mechanism reduces total network traffic by 24% on average for a set of high-contention TM applications, thereby reducing energy consumption by an average of 24%. Meanwhile, the contention in the coherence directory is reduced by 68%, on average. These improvements are achieved with only 5% area added to a conventional on-chip router design.

51:2 L. Zhao et al.

Coarse-grain locks limit performance, while fine-grain locks increase complexity. Transactional Memory (TM) promises to increase productivity in parallel programming by providing language constructs to delimit code blocks (i.e., transactions) that appear to execute atomically and in isolation from other threads. TM systems can be implemented in the software stack, in hardware, or as a hybrid of the two. The pure software approach is inherently slow, which has prevented its large-scale adoption.
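To make the transactional semantics concrete, the following is a minimal software sketch (not the paper's hardware mechanism) of a transaction that buffers its writes, validates its read set at commit, and retries on conflict, which mirrors how an HTM aborts and re-executes a transaction when a conflicting access is detected. The names `TxVar` and `run_transaction` are illustrative, not from any standard library:

```python
import threading

class Aborted(Exception):
    """Raised when a transaction still conflicts after all retries."""

class TxVar:
    """A shared variable with a version number used for conflict detection."""
    def __init__(self, value):
        self.value = value
        self.version = 0

_commit_lock = threading.Lock()  # serializes only the commit/validate step

def run_transaction(body, max_retries=10):
    """Optimistically execute body(read, write); retry on conflict."""
    for _ in range(max_retries):
        read_set = {}    # TxVar -> version observed when first read
        write_set = {}   # TxVar -> buffered new value

        def read(var):
            if var in write_set:          # read-your-own-writes
                return write_set[var]
            read_set[var] = var.version
            return var.value

        def write(var, value):
            write_set[var] = value        # buffer until commit

        result = body(read, write)
        with _commit_lock:
            # Validate: abort if any variable we read has since changed.
            if any(var.version != v for var, v in read_set.items()):
                continue                  # conflict -> retry (an "abort")
            for var, value in write_set.items():
                var.value = value
                var.version += 1
        return result
    raise Aborted("transaction failed after retries")
```

A transfer between two `TxVar` accounts, for example, either commits both updates or neither, even if another thread races with it; a real HTM provides the same all-or-nothing behavior in hardware via the cache coherence protocol rather than software versioning.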
This work focuses on the hardware execution of transactions (applicable to Hardware Transactional Memory (HTM) and hybrid TM), as its tight coupling with evolving parallel processor architectures continues to enable extensive in-hardware optimization opportunities. HTM research has generally focused on performance [Negi et al. 2012; Lupon et al. 2010; Chafi et al. 2007], implementation issues [Sanchez et al. 2007; Blundell et al. 2007], transaction scheduling [Blake et al. 2009, 2011; Scherer III and Scott 2005], and hardware-software interplay [Shriraman et al. 2008; Rajwar et al. 2005; Rossbach et al. 2007]. These efforts have paved the way for HTM to be present in commodity systems [Tremblay and Chaudhry 2008; Haring et al. 2012; Yoo et al. 2013]. However, the majority of the research proposals on HTM assume either an ideal on-chip network with zero latency or a simple communication fabric. While packet-switched on-chip networks are viewed as the de facto solution for supplying low-latency, high-bandwidth, and energy-efficient on-chip communication in future many-core processors, seldom has the interaction between HTM and on-chip networks been studied. It is of vita...