UPI Fraud Inference at Scale: Why Latency Breaks Under PCIe, Memory, and Queue Bottlenecks

Jason Karlin

Last Updated: Jun 30, 2026

Unified Payments Interface (UPI), a real-time digital payment system developed by the National Payments Corporation of India (NPCI), processes close to 755 million transactions every day. Fraud has risen alongside transaction volume: more than 13.4 lakh UPI fraud cases were reported in FY 2023-24, resulting in losses exceeding ₹1,087 crore.

At this transaction velocity, fraud detection is no longer just a modelling problem. It is equally a latency problem, shaped by the infrastructure beneath the model. At production scale, that infrastructure stress surfaces in three places: the PCIe bus that shuttles data between CPU and GPU, the memory layer, and the message queues that feed transactions into the inference pipeline.

How Inference Latency Impacts Fraud Prevention and User Experience

UPI is designed as a real-time payment system where customers expect near-instant responses. As a result, UPI fraud detection systems also operate under extremely tight latency budgets. NPCI, banks, and payment apps such as PhonePe and GPay deploy automated fraud-prevention mechanisms that evaluate transaction data such as user behavior, device trust, location, and payee risk in real time and assign a fraud score between 1 and 100.

A fraud inference or risk-decisioning pipeline can sit inside the PSP, the bank, an acquiring/issuing-side risk system, or an NPCI/bank-integrated fraud-monitoring flow, depending on the participant architecture. It is not a post-transaction review system that flags fraud after the money has already been debited or credited.

Figure 1: Sample UPI Payment Flow [Image Source]

To combat evolving fraud threats, including phishing, fraudulent transactions, and identity theft, financial institutions and payment providers now rely on Machine Learning (ML)-based fraud detection systems. ML-based fraud systems can support real-time risk scoring, rule/ML hybrid decisions, alerts, step-up authentication, transaction decline, velocity controls and post-event investigation workflows.

The entire transaction, from the payer PSP through the NPCI switch to the remitter and beneficiary banks, must be completed in real time. Since this window leaves no room for delay, the fraud score must arrive well before the transaction completes.

Modern fraud detection architectures utilize a range of algorithms, from individual Decision Trees and Support Vector Machines (SVM) to sophisticated ensemble techniques like Random Forest and XGBoost. During inference, these models evaluate incoming transactions using historical data and behavioral signals to detect suspicious activities and generate real-time fraud scores.

As operations scale, deep learning architectures are increasingly integrated; however, their significant computational overhead makes it harder to hold inference to millisecond-scale latency. Because UPI transactions are time-bound, fraud inference must also complete within its latency budget, i.e., the share of the transaction window allotted to scoring.

When inference completes within the latency budget, the pipeline has worked as expected and the transaction is evaluated without adding any delay. When inference exceeds the latency budget, either fraud slips through or legitimate payments get blocked.
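The latency-budget behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the 50 ms budget, the `slow_model` stand-in, and the step-up fallback policy are all assumptions made for the example, not NPCI or PSP figures.

```python
import concurrent.futures
import time

BUDGET_SECONDS = 0.050              # hypothetical scoring budget (50 ms)
FALLBACK_DECISION = "STEP_UP_AUTH"  # hypothetical policy when the score is late

_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def slow_model(features, simulated_delay_s):
    """Stand-in for a real model call; sleeps to simulate inference time."""
    time.sleep(simulated_delay_s)
    return 12  # hypothetical low risk score on the 1-100 scale


def score_with_deadline(features, simulated_delay_s, budget_s=BUDGET_SECONDS):
    """Return the score if it arrives within the budget, else a fixed fallback."""
    future = _executor.submit(slow_model, features, simulated_delay_s)
    try:
        return ("SCORED", future.result(timeout=budget_s))
    except concurrent.futures.TimeoutError:
        # The worker thread keeps running in the background; only the caller
        # stops waiting and applies the deterministic fallback.
        return ("FALLBACK", FALLBACK_DECISION)


if __name__ == "__main__":
    print(score_with_deadline({"amount": 500}, simulated_delay_s=0.0))
    print(score_with_deadline({"amount": 500}, simulated_delay_s=0.2))
```

The point of the sketch is that the caller never waits longer than the budget: a late score is replaced by a predetermined action rather than delaying the payment.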

From the perspective of customer experience, even small delays or repeated transaction attempts can undermine the perception of UPI’s near real-time payments and erode trust in the platform. It is therefore imperative to balance fraud prevention effectiveness with a seamless user experience.

The Fraud Inference Pipeline

As shown below, the fraud inference engine or pipeline is triggered as soon as the user initiates the transaction and the Payer PSP receives the request, before it is forwarded to NPCI.

Figure 1: UPI Payment Flow and role of Fraud Inference Pipeline [Image Source]

The fraud inference pipeline collects relevant signals, computes a risk score against historical and behavioral patterns, and arrives at a final allow-or-block decision. Here is what happens at every stage on the initiation of a UPI transaction:

1. Collect signals

The moment a transaction is initiated, the fraud inference pipeline aggregates the available features about the transaction, the payee, the device, and the user:

Category	Example Features
Transaction Details	Transaction amount, transaction type, timestamp, transaction frequency
Payee Information	Payee VPA, payee risk score, historical fraud association
Device and Network Signals	Device ID or fingerprint, operating system and app version, IP address, Geolocation
User Behavioral Signals	Historical transaction patterns, typical transaction amounts, frequently used payees, recent behavioral changes
Session and Context Signals	Failed authentication attempts, recent device changes, whether the transaction originated from a new device

2. Compute transaction scores

The model computes a risk score taking various features or data-points derived in the previous step.

To score the initiated UPI transaction, the features are run through a trained model, e.g., Random Forest or XGBoost, which compares the transaction against historical patterns of both legitimate and fraudulent behavior.

Figure 2: Steps in the Fraud Inference Pipeline [Image created using AI]

3. Derive a Decision (Go/No-Go)

Based on the computed risk score, the model arrives at a final decision. It could be either allowing the transaction to proceed to NPCI or blocking it at the Payer PSP.
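The three stages above (collect signals, compute a score, derive a decision) can be sketched as follows. The `toy_score` function is a hand-written stand-in for a trained model, and the field names, weights, and the threshold of 80 are hypothetical values chosen only to make the flow concrete.

```python
from dataclasses import dataclass

BLOCK_THRESHOLD = 80  # hypothetical cut-off on the 1-100 fraud score


@dataclass
class Transaction:
    """A small subset of the signal categories listed in the table above."""
    amount: float
    payee_risk_score: int      # 0-100, hypothetical payee-level risk
    new_device: bool
    failed_auth_attempts: int


def toy_score(txn: Transaction) -> int:
    """Stand-in for a trained model (e.g. XGBoost); weights are made up."""
    score = 1
    score += min(txn.payee_risk_score, 60)
    score += 20 if txn.new_device else 0
    score += min(txn.failed_auth_attempts * 5, 15)
    score += 4 if txn.amount > 50_000 else 0
    return max(1, min(score, 100))  # clamp to the 1-100 range


def decide(score: int) -> str:
    """Go/No-Go: block at or above the threshold, otherwise allow."""
    return "BLOCK" if score >= BLOCK_THRESHOLD else "ALLOW"


if __name__ == "__main__":
    routine = Transaction(500.0, 5, False, 0)
    risky = Transaction(100_000.0, 70, True, 3)
    for txn in (routine, risky):
        s = toy_score(txn)
        print(s, decide(s))
```

In a real deployment the threshold would be tuned against false-positive and false-negative rates, and many systems add intermediate outcomes such as step-up authentication between allow and block.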

PCIe Bottlenecks: The Hidden Cost of CPU-to-GPU Data Movement

Fraud inference models built on Decision Trees, Random Forests, XGBoost, and SVMs typically rely only on CPUs for lightweight inference workloads. Feature vectors remain entirely in host memory, and the model computes fraud scores entirely on the CPU.

However, the CPU-only approach becomes increasingly difficult to scale at UPI transaction volumes, where fraud engines must score massive transaction volumes within tight latency budgets. For large-scale workloads, algorithms such as XGBoost and certain SVM implementations can leverage cloud GPUs to accelerate training and inference.

The shift from CPU-only models to GPU-accelerated fraud inference, whether deployed on GPUs such as NVIDIA L4, L40S, H100, or B200, introduces a new source of latency, the Peripheral Component Interconnect Express (PCIe) bus.

In simple terms, a PCIe bus is a high-speed interconnect that is responsible for data movement between the CPU, system memory, and peripherals such as GPUs.

Figure 3: The PCIe bus [Image Source]

The feature vectors, i.e., transaction details, payee information, behavioral signals, etc., in the fraud inference pipeline are assembled on the CPU. Because inference is performed on GPU-accelerated infrastructure, the feature vectors must cross the PCIe bus from system memory to GPU memory. The model for fraud detection is then executed on the GPU(s), which eventually generates a fraud score and its associated metadata.

Once model inference is complete, the computed score and metadata are returned to the application layer over the PCIe bus for the final allow-or-block decision.

At high concurrency, PCIe bandwidth, copy engines, CPU memory bandwidth, NUMA placement and GPU stream scheduling can become shared resources. Measure them with p95/p99 latency and throughput under production-like load.

When multiple fraud inference requests compete for the same bus, transfer queues begin to form, especially during peak business hours. The result is latency that model-level optimizations cannot eliminate.

Since the PCIe bus carries both CPU-to-GPU and GPU-to-CPU traffic, even small transfer delays add to fraud inference latency. Once the inference window is exceeded, legitimate payments slow down or fraudulent transactions slip through.
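A back-of-the-envelope calculation helps show where PCIe cost comes from. The sketch below uses approximate theoretical per-direction bandwidths for x16 links (achieved throughput is lower in practice) and an assumed fixed per-transfer overhead of 10 microseconds; that overhead figure and the 256-feature vector size are illustrative assumptions, not measurements.

```python
# Approximate theoretical per-direction bandwidth of an x16 link, in GB/s.
PCIE_X16_GBPS = {"Gen3": 15.75, "Gen4": 31.5, "Gen5": 63.0}

# Hypothetical fixed cost per host-to-device copy (driver call, DMA setup).
FIXED_OVERHEAD_US = 10.0


def transfer_time_us(num_vectors, floats_per_vector, gen,
                     overhead_us=FIXED_OVERHEAD_US):
    """Estimate one host-to-device copy: fixed overhead plus time on the wire."""
    payload_bytes = num_vectors * floats_per_vector * 4  # float32 features
    wire_us = payload_bytes / (PCIE_X16_GBPS[gen] * 1e9) * 1e6
    return overhead_us + wire_us


if __name__ == "__main__":
    single = transfer_time_us(1, 256, "Gen4")
    batched = transfer_time_us(256, 256, "Gen4")
    print(f"one vector per copy : {single:.2f} us per transaction")
    print(f"256 vectors per copy: {batched / 256:.3f} us per transaction")
```

Under these assumptions a single feature vector spends almost all of its transfer time in fixed overhead rather than on the wire, which is why copying many vectors per transfer lowers the per-transaction cost far more than a faster PCIe generation does.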

Memory Bottlenecks: Why Memory Matters as Much as Compute

Before a fraud model scores a single transaction, it needs data. In stage 1 of the inference pipeline (collect signals), the system assembles a feature vector from the features related to the payee and the transaction. In many production systems, Redis is used as an in-memory key-value store to serve features to ML models with very low latency.

A single transaction does not map to a single Redis key, since every feature group consisting of transaction-level, behavioral, device, payee, etc. is stored and retrieved separately.

Figure 4: Memory aspects w.r.t CPU and GPU in inference pipeline [Image created using AI]

As seen above, individual features are fetched from Redis and assembled in system memory (DRAM). Assembling a complete feature vector can require multiple feature-store reads, but these should be optimized through MGET/pipelining, precomputed features, a local cache, feature co-location, and timeout/default-value policies. At UPI’s transaction volume, unoptimized round trips compound and contribute to the end-to-end (E2E) inference latency.
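The difference between per-key reads and a batched read can be illustrated without a running Redis server. In the sketch below, `FakeFeatureStore` is a hypothetical in-memory stand-in that only counts simulated round trips; its `get` and `mget` methods mirror the shape of the corresponding redis-py client calls. The key names and default values are invented for the example.

```python
class FakeFeatureStore:
    """In-memory stand-in for a Redis client; counts simulated round trips."""

    def __init__(self, data):
        self._data = dict(data)
        self.round_trips = 0

    def get(self, key):
        self.round_trips += 1          # one network round trip per key
        return self._data.get(key)

    def mget(self, keys):
        self.round_trips += 1          # one round trip for the whole batch
        return [self._data.get(k) for k in keys]


# Hypothetical default-value policy for features that are missing.
DEFAULTS = {"payee:risk": 50, "device:trust": 0}


def fetch_one_by_one(store, keys, defaults):
    return {k: (v if (v := store.get(k)) is not None else defaults.get(k))
            for k in keys}


def fetch_batched(store, keys, defaults):
    values = store.mget(keys)
    return {k: (v if v is not None else defaults.get(k))
            for k, v in zip(keys, values)}


if __name__ == "__main__":
    data = {"txn:velocity": 3, "user:avg_amount": 1200, "payee:risk": 17}
    keys = ["txn:velocity", "user:avg_amount", "payee:risk", "device:trust"]
    a, b = FakeFeatureStore(data), FakeFeatureStore(data)
    print(fetch_one_by_one(a, keys, DEFAULTS), a.round_trips)
    print(fetch_batched(b, keys, DEFAULTS), b.round_trips)
```

Both functions return the same feature dictionary, but the batched version pays for one round trip instead of one per feature group, and a missing key falls back to a default rather than stalling the request.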

Apart from the feature store, the system DRAM layer also comes under pressure. This is where the feature vectors received from the in-memory feature store are assembled. As concurrent requests increase, CPU memory bandwidth, cache locality, and allocation overhead can degrade per-request latency, especially if feature assembly is not vectorized or batched.

L1, L2, L3 CPU cache evictions caused by competing inference requests can result in repeated reads from the DRAM. As expected, every cache miss adds latency before the fraud inference model can even begin scoring.

GPU memory, by contrast, is rarely the bottleneck in fraud inference workloads. Data-center GPUs such as the H100 and B200 use High-Bandwidth Memory (HBM), while the L4 and L40S use GDDR6; in either case, on-device memory bandwidth is far higher than PCIe bandwidth. Deep learning-based fraud models involve a large number of matrix multiplications and tensor operations, and the GPU is instrumental in accelerating these computationally intensive parts.

However, the GPU can start inference only after the features are retrieved from the feature store, the feature vector is assembled in DRAM, and the data crosses the PCIe bus into GPU memory. To summarize, memory bottlenecks in large-scale fraud inference systems arise on the CPU side rather than in GPU memory itself.

Queue Bottlenecks: When Requests Wait Longer Than They Compute

In many UPI fraud detection systems, a streaming/event bus such as Kafka buffers incoming transaction events for asynchronous processing, feature updates, or analytics. When the fraud score is required before authorization, scoring should instead run on a synchronous low-latency path.

UPI transactions do not arrive at a uniform rate. Volume spikes during working hours and festival seasons and drops during off-peak hours, especially at night. If the fraud inference engine cannot keep pace during these bursts, transactions begin to queue up and latency propagates upstream.

Apache Kafka decouples the producer, i.e., the PSP application, from the consumer, i.e., the inference engine. The producer writes the transaction to a Kafka topic and moves on. The fraud inference consumers retrieve and process transactions at their own pace.

This architecture can itself become a bottleneck under sustained high load. If the inference consumers cannot keep up with the rate at which producers write, the backlog grows, and transactions may spend more time waiting in the queue than being scored by the fraud model.
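The backlog dynamics described above reduce to simple arithmetic: whenever arrivals exceed the consumers' service rate, the difference accumulates, and the wait for a newly queued transaction is roughly the backlog divided by the service rate. The sketch below assumes one-second time steps and uses made-up traffic figures.

```python
def simulate_backlog(arrivals_per_sec, service_per_sec):
    """Track queue depth second by second for a fixed consumer capacity."""
    backlog = 0
    history = []
    for arriving in arrivals_per_sec:
        # Unserved transactions carry over; the queue cannot go negative.
        backlog = max(0, backlog + arriving - service_per_sec)
        history.append(backlog)
    return history


def queue_wait_seconds(backlog, service_per_sec):
    """Approximate wait for a transaction that joins behind `backlog` items."""
    return backlog / service_per_sec


if __name__ == "__main__":
    # Hypothetical burst: arrivals rise above a 10,000/s scoring capacity.
    arrivals = [8000, 12000, 15000, 12000, 8000, 6000]
    depth = simulate_backlog(arrivals, service_per_sec=10000)
    print(depth)
    print(queue_wait_seconds(max(depth), 10000), "s wait at peak backlog")
```

Even in this toy burst, the backlog keeps growing for a second after the peak arrival rate has passed and is still draining when arrivals fall below capacity, so queueing delay outlasts the spike that caused it.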

The consequence is a bottleneck that sits entirely outside the inference engine. Optimizing feature retrieval, PCIe transfers, and model execution does little to improve E2E fraud inference latency if incoming UPI transactions spend most of their time waiting in queues.

Why Modern Fraud Inference Demands Accelerated Infrastructure

While traditional processors handle basic workloads, deep learning-based fraud engines at UPI’s transaction velocity quickly encounter computational limits within CPU-only environments. Modern cloud GPU instances, such as those powered by NVIDIA A100 and H100 GPUs, use PCIe Gen4 and Gen5 interconnects respectively, which deliver significantly higher CPU-to-GPU bandwidth than legacy PCIe Gen3-based infrastructure.

GPU-accelerated inference enables batching, where multiple transactions are scored in a single GPU pass. Because each inference consumer can process more transactions per unit time, batching increases throughput and reduces consumer lag, which ultimately improves the UPI payment experience.
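Batching of this kind is commonly implemented as micro-batching: the consumer gathers requests until either a size cap or a short wait limit is reached, then scores them together. The following is a minimal sketch using Python's standard `queue` module; `score_batch` is a placeholder for a single GPU pass, and the batch size and wait limit are illustrative assumptions.

```python
import queue
import time


def collect_batch(requests, max_batch_size, max_wait_s):
    """Pull up to max_batch_size items, waiting at most max_wait_s in total."""
    batch = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                      # wait limit reached: ship what we have
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break                      # nothing more arrived within the limit
    return batch


def score_batch(batch):
    """Placeholder for one GPU pass; returns one toy score per transaction."""
    return [min(100, 1 + txn["amount"] // 1000) for txn in batch]


if __name__ == "__main__":
    pending = queue.Queue()
    for amount in (500, 2500, 40000, 120000, 900):
        pending.put({"amount": amount})
    while not pending.empty():
        batch = collect_batch(pending, max_batch_size=3, max_wait_s=0.005)
        print(len(batch), score_batch(batch))
```

The wait limit is the important design choice: it bounds how much latency batching can add to any single transaction, so throughput gains are not paid for with an unbounded delay during quiet periods.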

On-premise GPU infrastructure can be provisioned to handle peak UPI loads, but much of that capacity remains underutilized during off-peak periods. A GPU cloud infrastructure like AceCloud can address this problem, since capacity can be scaled up during bursts and scaled down when traffic normalises. This makes high-throughput fraud inference economically viable.

Addressing fraud inference at UPI’s scale is not just a modelling problem; it is a hardware and infrastructure problem. The PCIe bus, the memory layer, and the message queue each introduce latency that no fraud inference model optimization can eliminate on its own.

By combining scalable infrastructure with low-latency feature stores, bounded queues, optimized CPU/GPU serving, observability, deterministic fallbacks and compliance controls, payment providers can support larger fraud workloads while maintaining strict SLOs.

Jason Karlin

author

Industry veteran with over 10 years of experience architecting and managing GPU-powered cloud solutions. Specializes in enabling scalable AI/ML and HPC workloads for enterprise and research applications. Former lead solutions architect for top-tier cloud providers and startups in the AI infrastructure space.