RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing

Liu, Ke; Gupta, Udit; Cho, Benjamin Youngjae; Brooks, David; Chandra, Vikas; Diril, Utku; Firoozshahian, Amin; Hazelwood, Kim; Jia, Bill; Lee, Hsien-Hsin S.; Li, Meng; Maher, Bert; Mudigere, Dheevatsa; Naumov, Maxim; Schätz, Martin; Smelyanskiy, Mikhail; Wang, Xiaodong; Reagen, Brandon; Wu, Carole-Jean; Hempstead, Mark; Zhang, Xuan

doi:10.1109/isca45697.2020.00070

Cited by 140 publications

(87 citation statements)

References 54 publications

“…Such properties render GnR a prime candidate for acceleration using near-data processing (NDP) at the processor-memory interface. Indeed, TensorDIMM [10] and RecNMP [9] are two recent studies that explored the efficacy of NDP in accelerating GnR. However, we observe that the rank-level parallelism exploited by TensorDIMM and RecNMP does not fully reap the maximum potential of NDP acceleration, leaving significant performance capabilities on the table.…”

Section: 1 Deep-learning-based Recommendation System (mentioning)

confidence: 78%
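
For readers unfamiliar with the gather-and-reduce (GnR) operation named in the statement above, here is a minimal sketch in Python. It is illustrative only: the function names and the modulo mapping of table rows to memory nodes are assumptions made for this example and are not taken from TensorDIMM, RecNMP, or the citing paper.

```python
def gather_and_reduce(table, indices):
    """Gather the rows named by `indices` and sum them element-wise."""
    dim = len(table[0])
    acc = [0.0] * dim
    for idx in indices:
        for j, value in enumerate(table[idx]):
            acc[j] += value
    return acc


def gather_and_reduce_partitioned(table, indices, num_nodes):
    """Same result, but each (hypothetical) memory node reduces its own rows first."""
    # Assumed placement for illustration: row i lives on node i % num_nodes.
    per_node = [[] for _ in range(num_nodes)]
    for idx in indices:
        per_node[idx % num_nodes].append(idx)

    # Each node produces one partial sum; only these partials travel to the host.
    partials = [gather_and_reduce(table, node_indices) for node_indices in per_node]

    # The host combines the per-node partial sums into the final vector.
    dim = len(table[0])
    return [sum(partial[j] for partial in partials) for j in range(dim)]
```

The partitioned variant shows why near-data processing is attractive for GnR: only one partial-sum vector per node crosses the memory interface, rather than every gathered row.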

“…When N_lookup is 40 or 80, the relative EDP of TRiM-B is slightly better than that of TRiM-G. However, considering that TRiM-B incurs 4× more area overhead than TRiM-G as it populates a PE per bank, not a bank group, TRiM-G is a better option compared to TRiM-B in the range of N_lookup (between 20 and 80) covered by DLRM [9]. Hereafter, we detail the microarchitecture for TRiM-G. Mitigating load imbalances through replication: At a given N_lookup, a memory node with a PE receives fewer embedding vectors to reduce when TRiM exploits finer-grained parallelism, potentially experiencing load imbalance problems.…”

Section: Trim Architecture (mentioning)

confidence: 99%

“…To estimate the performance of GnR at a production-scale workload, we generate an embedding table access trace with a Zipf distribution according to the studies [4] because the real trace is not open to the public. The detailed parameters are set based on the information in the prior work [6], [9].…”

Section: Trim Architecture (mentioning)

confidence: 99%
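
The statement above describes generating a synthetic embedding-table access trace from a Zipf distribution. A minimal sketch of that idea follows, using only the Python standard library. The function name, the parameters, and the default exponent `alpha=1.0` are assumptions for illustration; the citing paper's actual parameters are not given in the excerpt.

```python
import random
from itertools import accumulate


def zipf_trace(num_rows, num_lookups, alpha=1.0, seed=0):
    """Draw row indices in [0, num_rows) with P(row k) proportional to 1 / (k + 1) ** alpha."""
    rng = random.Random(seed)  # fixed seed so the trace is reproducible
    weights = [1.0 / (k + 1) ** alpha for k in range(num_rows)]
    cum_weights = list(accumulate(weights))
    # random.choices samples with replacement according to the cumulative weights.
    return rng.choices(range(num_rows), cum_weights=cum_weights, k=num_lookups)
```

In such a trace a small set of low-numbered rows receives most of the lookups, which is the kind of locality a cache in the buffer chip could exploit.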

“…Both IDR and NDR consist of a 32-bit floating-point multiply-add units (MACs) for vector reductions and includes registers that temporarily store the partial sums of the reduced vectors. The buffer chip includes a lightweight command decoder that processes compressed instructions and an in-memory cache for NDR to process embedding entries with high locality effectively [9]. TRiM leverages the method of compressing the instructions sent from a host MC into ranks from RecNMP [9], allowing each rank to send DRAM commands to the bank group independently.…”

Section: Trim Architecture (mentioning)

confidence: 99%

“…The buffer chip includes a lightweight command decoder that processes compressed instructions and an in-memory cache for NDR to process embedding entries with high locality effectively [9]. TRiM leverages the method of compressing the instructions sent from a host MC into ranks from RecNMP [9], allowing each rank to send DRAM commands to the bank group independently. Commands for GnR (e.g., reading a vector from a DRAM and reducing it with the vector in the IDR registers) are defined using the reserved-for-future-use (RFU) command of DDR4.…”

Section: Trim Architecture (mentioning)

confidence: 99%
