Challenges In Building An Energy Efficient AI Recommendation System

Every day, people are bombarded with recommendations of shows to watch or products to purchase. Whether you're browsing the web, streaming media, or scrolling on social media, algorithms are working in the background, shaping your decisions. In this article, we will touch on what AI recommendation systems mean for businesses, what optimal hardware architecture for a recommendation system looks like, and what the hardware limitations and challenges are for implementation.

An AI recommendation system is a set of machine learning algorithms used by developers to predict a user's choices and behavior and offer relevant suggestions. These systems deliver personalized product suggestions and automate the search process for customers.

Consumers can buy just about anything online, which has driven significant adoption of AI recommendation systems across online services to enhance the virtual shopping experience. AI recommendation systems are used by many large data center hyperscalers and drive significant financial and compute decisions every day.

Accuracy is one of the most important factors for success when implementing an AI recommendation system. Since recommendation systems rely heavily on data, and data is constantly changing, the algorithms need to keep up so that results stay accurate. Achieving high accuracy, in turn, requires optimal hardware.

Large-scale AI recommendation systems are typically split into two major phases: retrieval and ranking. Retrieval, a key phase of every recommendation system, quickly selects relevant items from a large pool, while ranking sorts those candidates more precisely to choose a handful of final results. Recent work in this domain has focused mainly on embedding-based retrieval (EBR), which is widely used in recommendation systems for online services in data centers. What makes EBR special is its scoring method, which simplifies merging and filtering retrieved items from multiple channels. EBR represents user queries and candidate items as semantic embedding vectors (embeddings for short) learned through representation learning, converting the retrieval problem into a similarity search problem in the embedding space. A good EBR system needs both high throughput and low latency: high throughput usually means cost savings, and low latency improves user experience.
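To make the "retrieval as similarity search" idea concrete, here is a minimal Python sketch of brute-force embedding-based retrieval: score every corpus item against the query embedding by inner product, then keep the top-K. This is an illustrative toy (the corpus values and function name are ours, not from the paper); a production EBR system would run this scan over millions of embeddings on accelerator hardware.

```python
import numpy as np

def embed_retrieve(query_vec, corpus, k):
    """Brute-force embedding-based retrieval: score every corpus item
    against the query by inner product, then return the indices of the
    k highest-scoring items, best first."""
    scores = corpus @ query_vec                    # one similarity score per item
    top_k = np.argpartition(scores, -k)[-k:]       # unordered top-k indices
    return top_k[np.argsort(scores[top_k])[::-1]]  # sort descending by score

# Toy corpus: 5 items in a 3-dimensional embedding space (hypothetical data).
corpus = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.7, 0.7, 0.0],
])
query = np.array([1.0, 0.0, 0.0])
print(embed_retrieve(query, corpus, 2))  # prints [0 1]: the two most similar items
```

In practice the similarity function may be an inner product, cosine similarity, or negative Euclidean distance, but the structure is the same: a full corpus scan followed by K-selection, which is exactly what the retrieval hardware must accelerate.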

Recent research published by the Hong Kong University of Science and Technology and ByteDance (1) found that an FPGA-accelerated EBR system achieves close to the performance of a practically ideal EBR system, whereas an EBR system based on an NVIDIA T4 GPU, while comparably priced, is far from optimal due to its inherent architectural limitations. The research team showed that their FPGA-accelerated embedding-based retrieval system, built on the AMD Virtex UltraScale+™ device in the Alveo™ U55N/C, has the same memory bandwidth as the GPU yet achieves 1.21×-12.27× lower latency and up to 4.29× higher throughput than the GPU-based EBR under a latency target of 10 ms.

FPGA platforms feature large HBM memory capacity (8 to 32 GB), an HBM stack of two parallel DRAM channels, and high bandwidth (460 GB/s), which are ideal for corpus storage and fast corpus scanning. FPGAs also provide a parallel architecture that enables data parallelism for similarity calculations and data/pipeline parallelism for K-selection. An FPGA-accelerated EBR system can be fully pipelined and non-congested across the entire EBR data flow, which is the ideal EBR architecture.
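The K-selection stage mentioned above lends itself to streaming hardware: as similarity scores flow out of the corpus scan, each one is compared against the current K best so far, so selection overlaps with scanning instead of waiting for all scores. The FPGA implements this as a hardware pipeline; the Python sketch below (our illustration, not the paper's design) shows only the algorithmic idea using a size-K min-heap.

```python
import heapq

def streaming_topk(score_stream, k):
    """Streaming K-selection: maintain a size-k min-heap of the best
    (score, item_id) pairs seen so far. Each incoming score is compared
    against the current minimum, an O(log k) update, so selection can
    proceed concurrently with the corpus scan producing the stream."""
    heap = []  # min-heap keyed on score
    for item_id, score in score_stream:
        if len(heap) < k:
            heapq.heappush(heap, (score, item_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, item_id))  # evict current minimum
    return sorted(heap, reverse=True)  # best first

# Scores arriving one at a time, as (item_id, score) pairs.
scores = [(0, 0.2), (1, 0.9), (2, 0.5), (3, 0.7), (4, 0.1)]
print(streaming_topk(scores, 2))  # prints [(0.9, 1), (0.7, 3)]
```

On an FPGA this comparison network is replicated across parallel score lanes and deeply pipelined, which is what makes the end-to-end EBR data flow non-congested.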

The Alveo U55N/C used in the FPGA-accelerated embedding-based retrieval research is an AMD high-performance compute card that provides optimized acceleration for workloads in AI, high-performance computing (HPC), big data analytics, and search. Featuring an AMD Virtex UltraScale+™ FPGA, the Alveo U55N/C packs high-bandwidth memory (HBM2) into a single-slot, small form factor card designed for deployment in any server. With large memory capacity, high memory bandwidth, data parallelism, pipeline parallelism, and batched queries at low latency, FPGAs provide exceptional support for the EBR architecture in a recommendation system.

Next Steps

Read the full research paper "Faery: An FPGA-accelerated Embedding-based Retrieval System" by the Hong Kong University of Science and Technology and ByteDance.

1. Chaoliang Zeng (Hong Kong University of Science and Technology) et al. "Faery: An FPGA-accelerated Embedding-based Retrieval System." Carlsbad, CA: USENIX, 2022, pp. 849-850.