
Why Gradient Boosting Still Outperforms Deep Learning for Payment Fraud — and When It Doesn't

[Image: gradient boosting decision tree visualization]

Neural networks dominate the ML discourse. Transformer architectures outperform everything else on language, vision, and audio tasks. The obvious inference — that deep learning should dominate payment fraud detection too — turns out to be wrong in most production scenarios, and the reasons why are worth understanding clearly. Gradient boosting isn't winning because practitioners are being conservative; it's winning because the data structure and operational constraints of payment fraud are genuinely better suited to tree-based methods.

The Structural Advantages of Gradient Boosting on Tabular Data

Payment transaction data is tabular: a row of features for each transaction, including transaction amount, merchant category, card type, device fingerprint, velocity counts, and a few hundred other numeric and categorical values. This is the data type where gradient boosting methods (XGBoost, LightGBM, CatBoost) have consistently outperformed neural networks in rigorous benchmarks over the past decade.

The performance advantage on tabular data is not marginal. In 2022 and 2023 benchmark studies comparing models on real financial tabular datasets, LightGBM and XGBoost outperformed feed-forward neural networks, TabNet, and even more sophisticated architectures like FT-Transformer (the best-performing neural architecture for tabular data) on most financial prediction tasks with datasets under 1 million rows. The gap narrows with dataset size and closes around 100M+ rows, but most payment processors don't have 100M labeled fraud transactions for training.

The theoretical reason for this: neural networks excel at learning representations from raw signals — pixels, token sequences — where the raw input doesn't directly encode the useful features. The network learns to extract features. For tabular financial data, the features have already been carefully engineered (velocity counts, BIN lookups, device reputation scores) and the raw input is already meaningful. Gradient boosting trees are particularly efficient at learning threshold-based decision boundaries on pre-engineered features, which is exactly the structure of payment fraud signal.
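To make the threshold-learning point concrete, here is a minimal pure-Python sketch of the split decision a single tree node makes on an engineered feature. This is illustrative only: real libraries like XGBoost and LightGBM grow trees with histogram-based splits over gradient statistics, not this exhaustive impurity scan.

```python
# Minimal sketch: how a single tree node picks a threshold on an
# engineered feature (e.g. "card transactions in the last hour").
# Illustrative only -- production libraries use histogram-based
# splits over gradient statistics, not this exhaustive scan.

def best_threshold_split(feature_values, labels):
    """Find the threshold that minimizes weighted Gini impurity."""
    def gini(ys):
        if not ys:
            return 0.0
        p = sum(ys) / len(ys)  # fraud rate in this partition
        return 2 * p * (1 - p)

    best = (None, float("inf"))
    for t in sorted(set(feature_values)):
        left = [y for x, y in zip(feature_values, labels) if x <= t]
        right = [y for x, y in zip(feature_values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best  # (threshold, weighted impurity after split)

# Velocity counts where fraud concentrates above a threshold: the
# tree recovers the boundary directly from the engineered feature.
velocity = [1, 2, 2, 3, 6, 7, 8, 9]
is_fraud = [0, 0, 0, 0, 1, 1, 1, 1]
threshold, impurity = best_threshold_split(velocity, is_fraud)
```

A boosted ensemble stacks hundreds of these threshold decisions, which is why pre-engineered, already-meaningful features play to its strengths.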

High-Cardinality Categoricals: Where CatBoost Specifically Excels

Payment fraud models have several high-cardinality categorical features: merchant ID (potentially millions of distinct values), BIN prefix (tens of thousands of values), device fingerprint hash, IP address. These features are useful but problematic for neural networks, which require numerical encoding and either collapse the cardinality (losing information) or create embeddings that require substantial data to learn well.

CatBoost's ordered target encoding handles high-cardinality categoricals without explicit embedding layers, using a time-ordered statistics approach that prevents target leakage. For merchant ID specifically — where the model needs to learn per-merchant fraud baselines — CatBoost's handling is significantly better out-of-the-box than neural network approaches that require explicit embedding training. In our internal benchmarks on merchant-level fraud features, CatBoost's per-merchant fraud rate encoding outperformed both one-hot encoding and learned embeddings on datasets under 50M rows.
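The core idea behind ordered target encoding can be sketched in a few lines of pure Python. This is a simplification of CatBoost's actual scheme (which also uses random permutations and multiple orderings), but it shows the leakage-prevention property: each row is encoded using only the fraud history of earlier rows for that merchant.

```python
# Sketch of ordered (time-aware) target encoding for a high-cardinality
# categorical like merchant ID. Each transaction is encoded using only
# *earlier* transactions for that merchant, so the row's own label never
# leaks into its feature. Simplified illustration of the idea behind
# CatBoost's ordered statistics, not CatBoost's actual implementation.

def ordered_target_encode(merchant_ids, labels, prior=0.02, weight=10.0):
    """Encode each transaction with a smoothed running fraud rate.

    `prior` is the portfolio-wide fraud rate used for smoothing;
    `weight` controls how many observations it takes to override it.
    """
    counts, fraud = {}, {}
    encoded = []
    for m, y in zip(merchant_ids, labels):  # rows in time order
        n = counts.get(m, 0)
        f = fraud.get(m, 0)
        # Encode from history only; the current label y is added
        # to the running statistics *after* encoding this row.
        encoded.append((f + prior * weight) / (n + weight))
        counts[m] = n + 1
        fraud[m] = f + y
    return encoded

merchants = ["m1", "m1", "m2", "m1"]
labels = [1, 0, 0, 1]
enc = ordered_target_encode(merchants, labels)
```

A first-seen merchant gets the smoothed prior; a merchant with fraud in its history gets a higher encoding, which is exactly the per-merchant fraud baseline the model needs.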

Latency: The Operational Constraint That Neural Networks Lose

Sub-50ms scoring requirements are the constraint that eliminates most neural network architectures from consideration before the accuracy comparison even begins. A LightGBM model with 500 trees produces a score in 0.5–3ms on CPU. A transformer-based neural network on the same feature set requires GPU inference to achieve similar throughput, with GPU inference adding infrastructure cost and complexity.

For payment processors scoring 500–2,000 transactions per second at peak, the cost difference between CPU-based gradient boosting serving and GPU-based neural network serving is significant. A standard CPU instance (c5.4xlarge, ~$0.68/hour) handles LightGBM scoring at 500 TPS with ample headroom. The equivalent throughput on GPU infrastructure for a neural network model is approximately 5–8x the cost. This cost difference is a permanent operational cost, not a one-time difference, and it compounds across the multi-year lifetime of the fraud scoring system.
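Latency claims like these are worth validating on your own hardware. The skeleton below measures per-call scoring percentiles; the `score` function here is a toy stand-in, and in practice you would substitute your model's single-row predict call (e.g. a LightGBM booster's `predict`).

```python
import statistics
import time

# Microbenchmark skeleton for per-transaction scoring latency.
# `score` is a toy stand-in for a model's predict call; the
# percentile bookkeeping is the part that matters for a latency SLA.

def score(features):
    # Toy additive tree-style scorer: sum of threshold contributions.
    s = 0.0
    for value, threshold, contribution in features:
        if value > threshold:
            s += contribution
    return s

def latency_percentiles(fn, arg, n=1000):
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        fn(arg)
        samples.append((time.perf_counter() - t0) * 1000.0)  # ms
    samples.sort()
    return {
        "p50": samples[n // 2],
        "p99": samples[int(n * 0.99)],
        "mean": statistics.mean(samples),
    }

row = [(6.0, 5.0, 120.0), (0.3, 0.5, -40.0)] * 50  # 100 "features"
stats = latency_percentiles(score, row)
```

Report p99, not the mean: authorization SLAs are broken by tail latency, and a model whose mean is 2ms but whose p99 is 80ms will still time out real transactions.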

The latency and cost arguments don't apply if you're not doing real-time scoring. For batch fraud review — scoring transactions after authorization, identifying suspicious patterns for manual review — neural network inference on GPU with batched throughput is economically viable. The architecture choice depends on whether real-time scoring is a requirement, not a blanket judgment about which approach is better.

Explainability: The Compliance and Operational Argument

Gradient boosting models produce per-prediction feature attributions through SHAP values (SHapley Additive exPlanations) that are interpretable, auditable, and computable in under 100ms with the TreeSHAP algorithm, which exploits tree structure to make exact Shapley value computation tractable. Each transaction score includes the exact contribution of each feature to the prediction, in units that analysts can understand: "the velocity count feature added 120 points to the fraud score; the device reputation score subtracted 40 points."
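The property that makes this format auditable is additivity: the per-feature contributions plus the model's base score sum exactly to the final prediction. The sketch below demonstrates the report format with hard-coded numbers; in production the contributions would come from a tree explainer such as `shap.TreeExplainer` applied to the booster.

```python
# Sketch of the additive explanation format analysts consume.
# Contribution values are hard-coded for illustration; in production
# they come from TreeSHAP on the actual model. The key property shown
# is additivity: base score + contributions == final score.

base_score = 200.0  # model's average score over the training set
contributions = {
    "velocity_count_1h": 120.0,    # pushed score toward fraud
    "device_reputation": -40.0,    # pushed score toward legit
    "merchant_fraud_rate": 35.0,
    "amount_zscore": 10.0,
}

final_score = base_score + sum(contributions.values())

def explain(base, contribs):
    """Render contributions largest-magnitude first, analyst-readable."""
    lines = [f"base score: {base:+.0f}"]
    for name, c in sorted(contribs.items(), key=lambda kv: -abs(kv[1])):
        lines.append(f"{name}: {c:+.0f} points")
    return lines

report = explain(base_score, contributions)
```

Because the decomposition is exact rather than approximate, an analyst can verify that the listed contributions fully account for the score, which is what makes the output usable in a dispute review.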

Neural networks are fundamentally harder to explain in this way. Attribution methods for neural networks exist (Integrated Gradients, DeepLIFT and its DeepSHAP variant) but are either approximate or computationally expensive. For dispute resolution teams handling chargeback disputes, explainable decisions are operationally important: the analyst reviewing a disputed transaction needs to understand why it was declined, not just that the model said so.

This is also a regulatory compliance argument. The EU's GDPR Article 22 restricts solely automated decision-making with legal or similarly significant effects on individuals, and the regulation's transparency provisions require meaningful information about the logic involved; equivalent rules exist in other jurisdictions. Using a black-box neural network for authorization decisions in European markets creates compliance exposure that explainable gradient boosting models avoid.

When Does Deep Learning Actually Win?

The cases where neural network approaches outperform gradient boosting for payment fraud are specific but real.

Sequence modeling for transaction history. When the fraud signal is in the temporal sequence of transactions — the order matters, not just the aggregate statistics — recurrent neural networks and transformer sequence models capture sequential dependencies that tree-based models don't handle well. A time series of 20 transactions where the sequence (small purchase → medium purchase → large purchase) is the fraud signal is better modeled by an LSTM or transformer than by velocity features alone, which lose the sequential information. For account takeover detection, where the behavioral sequence over a session is the primary signal, sequence models have a genuine advantage.
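The information loss from aggregation is easy to demonstrate: two transaction histories with identical aggregate features but different orderings are indistinguishable to a tabular model, while a sequence model sees them as different inputs. The sketch below contrasts order-insensitive velocity-style features with a simple order-aware check (a stand-in for what an LSTM or transformer would learn).

```python
from statistics import mean, pstdev

# Two amount sequences with identical aggregate (velocity-style)
# features but different orderings. A tabular model built on these
# aggregates cannot tell them apart; a sequence model over the
# ordered amounts can. Illustrative sketch.

def aggregate_features(amounts):
    """Typical engineered features: order-insensitive by construction."""
    return {
        "count": len(amounts),
        "total": sum(amounts),
        "mean": mean(amounts),
        "std": round(pstdev(amounts), 6),
        "max": max(amounts),
    }

escalating = [5, 20, 80, 320]   # classic card-testing escalation
shuffled = [320, 5, 80, 20]     # same amounts, different order

agg_a = aggregate_features(escalating)
agg_b = aggregate_features(shuffled)
# agg_a == agg_b: the escalation pattern is invisible to the
# aggregates, because they depend only on the multiset of amounts.

def is_escalating(amounts):
    """Order-aware check: strictly increasing amounts."""
    return all(a < b for a, b in zip(amounts, amounts[1:]))
```

An LSTM or transformer consumes the ordered sequence directly and can learn patterns like this escalation without anyone hand-coding the `is_escalating` rule.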

Very large datasets. With 100M+ labeled fraud examples, the neural network sample efficiency disadvantage disappears and their capacity advantage becomes relevant. Payment networks at Visa/Mastercard scale have this volume. Most payment processors do not, but processors with very large portfolios who have been accumulating labeled data for years may reach the scale where neural architectures become competitive.

Raw behavioral biometrics signals. Keystroke timing sequences, mouse trajectory data, and touchscreen interaction patterns are time series data from raw sensors — the data type where neural networks historically outperform feature-engineered tree models. For behavioral biometrics specifically (not just derived behavioral features), CNNs applied to session interaction data can outperform gradient boosting on pre-extracted features. This is the case in behavioral biometrics products from BioCatch and similar vendors.

The Ensemble Question

Many production fraud systems use ensembles that combine gradient boosting with neural components rather than choosing one exclusively. The most common combination is a gradient boosting base model for tabular features (transaction attributes, velocity, BIN data) with a separate neural component for sequential behavioral features, with outputs combined in a meta-learner or through a weighted score combination.

The ensemble approach gets the benefit of gradient boosting's tabular data efficiency and latency properties for the core scoring path, while adding the sequential modeling capability of neural networks for behavioral signals that tree models handle less well. The complexity cost is real — two models to maintain, train, and monitor — but for high-volume processors where each percentage point of detection accuracy translates to significant dollar values, the ensemble improvement is often worth the operational cost.
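A common combination pattern is a logistic meta-learner over the component scores. The sketch below shows the shape of that combination, including the fallback when no behavioral sequence data is present; the weights and bias here are made up for illustration, and in practice they come from fitting the meta-learner on held-out component scores.

```python
import math

# Sketch of combining a gradient-boosting tabular score with a
# neural sequence score via a weighted logistic combination.
# Weights and bias are illustrative, not fitted values.

W_TABULAR, W_SEQUENCE, BIAS = 3.0, 1.5, -2.5

def combined_fraud_probability(tabular_score, sequence_score=None):
    """Weighted logistic combination of component scores.

    Falls back to the tabular model alone when no behavioral
    sequence data is available for the transaction.
    """
    z = BIAS + W_TABULAR * tabular_score
    if sequence_score is not None:
        z += W_SEQUENCE * sequence_score
    return 1.0 / (1.0 + math.exp(-z))

p_tabular_only = combined_fraud_probability(0.9)
p_with_sequence = combined_fraud_probability(0.9, 0.8)
```

The optional `sequence_score` argument matters operationally: behavioral data is only present for some transactions, and the combiner has to degrade gracefully rather than require both inputs.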

InferX's production stack uses a LightGBM base model with CatBoost handling high-cardinality categorical features and a lightweight LSTM model processing session behavioral sequences where available. The LSTM adds 2–4ms to scoring latency when behavioral data is present. The combined system achieves higher detection rates than either model alone on held-out test data, particularly for account takeover attack patterns where the sequential behavioral signal adds meaningful information beyond the tabular features.

Practical Guidance for Model Selection

The decision framework for model architecture selection in payment fraud comes down to four questions.

1. What's your latency requirement? If sub-100ms real-time scoring is required, gradient boosting on CPU is the default choice unless you have strong evidence that a neural approach provides an accuracy improvement large enough to justify the GPU infrastructure cost.
2. What's your labeled dataset size? Under 10M fraud examples, gradient boosting almost certainly outperforms neural approaches. Above 50M, neural architectures become competitive. Between 10M and 50M, run experiments.
3. Do you need explainable decisions? If yes, gradient boosting makes compliance significantly easier.
4. Does your fraud signal live in behavioral sequences? If yes, add a neural sequence component: the tabular fraud signal and the behavioral sequence signal are additive, not competing.
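The four questions translate directly into a selection function. The thresholds below mirror the rough numbers in this section (10M/50M labeled examples) and should be treated as guidance, not hard rules.

```python
# The four-question selection framework as code. Thresholds mirror
# the rough guidance in the text and are not hard rules.

def select_architecture(realtime_scoring, labeled_examples,
                        needs_explainability, has_behavioral_sequences):
    """Return the recommended model components for a fraud system."""
    if realtime_scoring or needs_explainability:
        # Latency and explainability constraints both point the
        # core tabular model at gradient boosting on CPU.
        core = "gradient_boosting"
    elif labeled_examples < 10_000_000:
        core = "gradient_boosting"   # GB almost certainly wins here
    elif labeled_examples < 50_000_000:
        core = "run_experiments"     # ambiguous zone: benchmark both
    else:
        core = "run_experiments"     # neural nets become competitive
    components = [core]
    if has_behavioral_sequences:
        # Sequence signal is additive: bolt on a neural component
        # rather than replacing the tabular model.
        components.append("neural_sequence_model")
    return components
```

For example, a real-time processor with behavioral biometrics data gets `["gradient_boosting", "neural_sequence_model"]`, which is the ensemble shape described above.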

The gradient boosting advantage in payment fraud is not permanent — it reflects the current state of tabular neural architecture research and the operational constraints of real-time scoring infrastructure. If GPU costs continue to decline and tabular neural architectures continue to improve, the equilibrium will shift. For now, the team making a fresh technology selection for a production payment fraud system in 2025 should default to LightGBM or XGBoost for the core tabular model, evaluate CatBoost if high-cardinality categoricals are a significant feature group, and add neural sequence components specifically for behavioral biometrics if those signals are available in their data pipeline.