The Pandas Syndrome: How 10 Lines of Code Can Save €400K per Year
A story that repeats itself in 8 out of 10 companies
The audit that no longer surprises
I’ve seen it fifteen times over the past few years. Different audits, different companies, but always the same story. A competent data science team, an inexplicably exploding cloud budget, and a request that comes up systematically: “Our ML models need more RAM.”
The last time was earlier this year. The ML team was brilliant: PhDs in machine learning, scientific publications, real technical expertise. Their work on fraud detection was remarkably sophisticated. But their cloud bill exceeded €35,000 per month, just for development environments.
My first reflex during an audit is always the same: examine the instances being used. And there, I found what I was looking for. r5.24xlarge instances everywhere. For those unfamiliar with AWS instances, I’m talking about virtual machines with 96 vCPUs and 768 GB of RAM. The kind of configuration you use for critical databases or massive distributed processing. The cost? €6.05 per hour, or €4,350 per month per data scientist. Multiply by eight data scientists, and we’re at €34,800 monthly just to allow them to develop their models.
The ML lead explained the situation with conviction: “We’re working on fraud detection. Our datasets are absolutely enormous. We genuinely need all this RAM to run our models.” Then, anticipating my question about costs, he added: “Anyway, our system detects fraud that saves us millions of euros in losses per year. The ROI is enormous. These €35,000 per month, it’s a rounding error.”
At first glance, both arguments hold water. Fraud detection does indeed require analyzing considerable volumes of historical transactions. And yes, the ROI of a good fraud detection system is generally excellent.
But I asked to see the code. And that’s where the story gets interesting.
The painful revelation
What I discovered in their Jupyter notebooks, I had seen the week before at a French retailer, the month before at a London fintech, three months earlier at an insurance company, and six months ago at a media group. Always. The. Same. Mistake.
Here’s the code I find in 80% of professional notebooks:
import pandas as pd

# Loading data
df_transactions = pd.read_parquet('s3://bucket/transactions/')  # 2 billion rows, 500 GB
df_clients = pd.read_parquet('s3://bucket/clients/')            # 50 million rows, 20 GB
df_products = pd.read_parquet('s3://bucket/products/')          # 10 million rows, 5 GB

# In-memory joins
df = df_transactions.merge(df_clients, on='customer_id')
df = df.merge(df_products, on='product_id')

# Filtering AFTER joining
df = df[df['date'] >= '2023-01-01']
df = df[df['segment'] == 'premium']
df = df[df['amount'] > 100]

Let’s pause for a moment to calculate what’s actually happening. We’re loading 525 GB of raw data into memory. Then we’re performing joins that inflate this volume even further, temporarily creating intermediate structures far larger than the inputs. And only then, after saturating RAM with billions of rows, do we apply the filters we actually care about.
The problem isn’t in the code itself. Pandas is a remarkable tool, probably one of the most useful Python libraries ever created. The problem is in the fundamental approach: load first, filter later.
The origin of the syndrome
To understand why this error is so widespread, we need to go back to the initial training of data scientists. Most learn their craft on Kaggle, with competitions using carefully prepared datasets of 50 to 500 MB. In this learning environment, loading the entire dataset into memory is not only possible, it’s the norm. It’s even recommended.
Pandas then becomes the conditioned reflex. The universal tool. The comfort zone. And when these data scientists arrive at the company with its multi-terabyte data lake, they naturally reproduce the same patterns they learned. Nobody explained to them that the rules of the game change completely when you go from 100 MB to 100 GB.
But the problem doesn’t just come from Kaggle. University data science programs also bear some responsibility. I’ve gone through dozens of syllabi from Master’s and specialized programs in France and abroad. The result? Hundreds of hours on machine learning algorithms, advanced statistics, and deep learning. But how many hours on enterprise data architecture? On efficient use of a data lake? On the distinction between local and distributed processing? Often zero. Sometimes one or two sessions, buried in an infrastructure course. And what about Data Engineering curricula that, just two or three years ago, were still teaching Hive or Pig without ever mentioning Parquet or the cloud?
We’re training algorithm experts who don’t know how to use the infrastructures on which these algorithms will run. It’s like training race car drivers without ever teaching them automotive mechanics. They know how to take corners at 200 km/h, but don’t understand why the engine overheats.
The gap is even more striking when you look at the profiles companies are seeking. Job postings ask for: “Experience with Snowflake, BigQuery, or Databricks.” But graduates arrive with: “Expert in scikit-learn and TensorFlow.” Both skills are valuable, but one without the other creates exactly the type of inefficiency I see in my audits.
It’s like learning to cook in a student kitchenette, then becoming head chef in a three-star restaurant with a brigade of fifteen people. The fundamental techniques are the same, but orchestration, delegation, and resource optimization become critical.
The paradox of the unused data lake
Here’s what makes the situation particularly ironic: this client had invested hundreds of thousands of euros in a modern data lake. Databricks on AWS, to be precise. A platform specifically designed to process petabytes of data with exceptional performance. A system optimized for massive joins, with dozens of CPUs working in parallel, intelligent indexes...
But their data scientists were loading everything into Pandas from S3. On a single machine. To perform operations that the data lake would have executed a hundred times faster, with a fraction of the resources.
It’s like having a Ferrari in the garage but preferring to push your old car by hand because that’s what you’ve always done. The metaphor isn’t that exaggerated.
The approach that changes everything
I showed the team a different approach. Not revolutionary. Not particularly sophisticated. Just... logical.
“You’ve paid for a performant data lake. Use it for what it does best: filtering, joining, and aggregating massive volumes of data.”
Here’s the SQL query I wrote with them in five minutes:
-- Filter first, load later
SELECT
    t.customer_id,
    t.product_id,
    t.amount,
    c.segment,
    p.category,
    COUNT(*) AS transaction_frequency,
    AVG(t.amount) AS avg_transaction_amount,
    MAX(t.date) AS last_transaction_date
FROM transactions t
JOIN customers c ON t.customer_id = c.id
JOIN products p ON t.product_id = p.id
WHERE t.date >= '2023-01-01'   -- Filtering BEFORE the join
  AND c.segment = 'premium'    -- Massive volume reduction
  AND t.amount > 100           -- Even more reduction
GROUP BY 1, 2, 3, 4, 5

The result of this query? 200 MB. Two hundred megabytes instead of 525 gigabytes. It’s this result, and only this result, that they then load into Pandas for feature engineering and model training.
The fundamental difference boils down to a simple principle: push the computation to the data, not the other way around.
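To make that principle concrete, here is a minimal sketch of what it looks like in a notebook, assuming a Databricks SQL warehouse and the databricks-sql-connector package (this client ran Databricks on AWS; any other warehouse client, or pandas.read_sql, follows the same pattern). The hostname, HTTP path, and token are placeholders, and query holds the SQL from the section above.

from databricks import sql  # pip install databricks-sql-connector

query = "..."  # paste the SQL from the section above as a string

# Placeholders: your workspace hostname, SQL warehouse HTTP path, and access token
with sql.connect(server_hostname="<workspace-host>",
                 http_path="<warehouse-http-path>",
                 access_token="<personal-access-token>") as connection:
    with connection.cursor() as cursor:
        cursor.execute(query)
        df = cursor.fetchall_arrow().to_pandas()  # only the ~200 MB result reaches the notebook

# From here on, Pandas works on a dataset that fits comfortably in 16 GB of RAM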
The impact that justifies everything
The before and after numbers are spectacular, but they deserve to be detailed to truly understand the impact of this transformation.
Before optimization:
Instance: r5.24xlarge (96 vCPU, 768 GB RAM)
Hourly cost: €6.05
Monthly cost per data scientist: €4,350
Data preparation time: 45 minutes on average
Crash rate: about 15% of executions failed with MemoryError
Maintenance complexity: high (manual memory management, chunks, various optimizations)
Scalability: limited (impossible to analyze longer periods)
After optimization:
Instance: m5.xlarge (4 vCPU, 16 GB RAM)
Hourly cost: €0.20
Monthly cost per data scientist: €144
Data preparation time: 8 minutes on average
Crash rate: 0% (the final dataset fits comfortably in RAM)
Maintenance complexity: low (the code works reliably)
Scalability: excellent (the data lake can handle any volume)
Monthly savings per data scientist: €4,206. For eight data scientists: €33,648 per month. Over a year: €403,776 in savings.
For changing ten lines of code.
But the financial savings are only part of the story. The productivity gain is perhaps even more significant. Data scientists no longer spend hours waiting for their joins to complete, or debugging memory crashes, or reorganizing their code to optimize RAM usage. They can focus on what really matters: developing better models.
The trap of the ROI argument
Let’s return to the ML lead’s argument: “The ROI is enormous, these €35,000 per month, it’s a rounding error.”
I hear this argument in almost all my audits. And each time, it hides a dangerous reasoning that deserves to be deconstructed.
First, it’s a false dilemma. Nobody is asking to choose between ROI and operational efficiency. Their fraud detection system would have exactly the same ROI with the optimized approach. Same performance, same business value, but with 97% less cost. It’s not one OR the other, it’s one AND the other.
Second, these costs are never really “a rounding error” when you look at the complete picture. This client saves €400,000 per year only on development environments. But the Pandas syndrome doesn’t stop there. The same inefficiencies are often found:
In staging environments (×2)
In automated retraining pipelines (×3)
In production A/B tests (×5)
When you add it all up, we’re easily talking about several million euros per year. Hard to consider that “a rounding error,” even for a large bank.
But the most problematic thing is the mentality this argument reveals. “It doesn’t matter if we spend, the ROI is good” quickly becomes a universal excuse for not optimizing. I’ve seen this logic gradually extend:
“We’re paying too much for S3 storage, but the data lake ROI is good”
“We could optimize our BigQuery queries, but the analytics ROI is good”
“We duplicate a lot of data, but the project ROI is good”
This mentality ends up creating a culture of generalized inefficiency. And ironically, it reduces the actual ROI. Because the true ROI should also include operational efficiency. If you can get the same business value with 10 times fewer resources, your real ROI is 10 times better.
Finally, there’s an aspect that nobody mentions in these discussions: environmental impact. Running r5.24xlarge instances 24/7 to load data that a data lake could process in seconds is pure energy waste. At a time when all companies are displaying carbon footprint reduction targets, ignoring these inefficiencies becomes increasingly difficult to justify.
So no, it’s never really “a rounding error.” It’s an optical illusion created by high ROI that masks costly structural inefficiencies.
Why this error is universal
I’ve thought long and hard about why I see this pattern everywhere. It’s not a competency problem. The teams I audit are technically excellent. They understand machine learning algorithms, master statistics, know how to optimize hyperparameters.
The problem is more subtle. It’s a gap between two technical cultures. Data scientists come from the academic and research world, where the approach is often: “Give me all the data, I’ll explore.” Data engineers come from the world of distributed systems and databases, where the fundamental principle is: “Filter first, process later.”
This difference in perspective creates a paradoxical situation. Companies invest massively in modern data management infrastructures: data lakes, data warehouses, distributed processing platforms. Then they recruit brilliant data scientists who... don’t use these infrastructures. Not out of ill will, but simply because nobody taught them how.
It’s a bit like hiring Michelin-starred chefs and discovering they all cook their dishes in the microwave because that’s what they know best.
The fundamental principle to remember
After dozens of similar audits, I’ve distilled my recommendation into a simple principle that I share with all teams:
Use each tool for what it does best.
The data lake (Snowflake, BigQuery, Redshift, Databricks, Trino/Presto/Starburst) excels at:
Filtering terabytes of data in seconds
Performing massive joins on billions of rows
Aggregating considerable volumes
Automatically optimizing queries
Pandas excels at:
Manipulating datasets from a few megabytes to a few hundred megabytes
Doing creative feature engineering
Creating quick visualizations
Prototyping complex transformations
The problem arises when we use Pandas for what the data lake does, or vice versa. It’s like using a screwdriver to hammer a nail: technically possible, but inefficient and frustrating.
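As an illustration of that division of labor, here is a small, hypothetical sketch of the kind of work that legitimately stays in Pandas once the data lake has handed back the aggregated result of the earlier query (column names as in that query; the reference date is chosen purely for illustration):

import numpy as np
import pandas as pd

# df is the ~200 MB result of the SQL query shown earlier
df["last_transaction_date"] = pd.to_datetime(df["last_transaction_date"])
df["days_since_last_txn"] = (pd.Timestamp("2024-01-01") - df["last_transaction_date"]).dt.days
df["log_avg_amount"] = np.log1p(df["avg_transaction_amount"])        # tame the long tail of amounts
df["is_frequent_buyer"] = (df["transaction_frequency"] >= 10).astype(int)

# One-hot encode the low-cardinality columns before model training
features = pd.get_dummies(df, columns=["segment", "category"])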
Warning signs to watch for
How do you know if your team is affected by the Pandas syndrome? Here are the indicators I look for during my audits:
Technical signals:
Your data scientists regularly request more RAM
Notebooks crash with “MemoryError” or “Killed” errors
You use r5 or r6 instances (memory-optimized) even though you have a data lake
Your pd.read_parquet() or pd.read_csv() calls read entire files with no column selection or filters (see the sketch after this list)
You run merge() before filtering rows
Data preparation time exceeds 30 minutes for model training
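Even before involving the data lake, the read-everything signals above have a cheap partial fix: let the Parquet reader do some of the filtering. A minimal sketch, assuming a pyarrow-backed transactions dataset with a date-typed date column and a numeric amount column (as in the earlier example):

import datetime
import pandas as pd

# Read only the columns you need, and push the row filters down to the Parquet reader
df = pd.read_parquet(
    "s3://bucket/transactions/",
    engine="pyarrow",
    columns=["customer_id", "product_id", "amount", "date"],
    filters=[
        ("date", ">=", datetime.date(2023, 1, 1)),
        ("amount", ">", 100),
    ],
)
# Far less data reaches RAM than an unfiltered read_parquet() followed by boolean masks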
Organizational signals:
Your ML cloud bill increases without the produced value following
Data scientists spend more time “getting data to fit in memory” than developing models
You hear phrases like “we need to upgrade our machines” or “we need a Spark cluster”
ML projects are delayed due to infrastructure problems
If you check three or more of these boxes, there’s a strong chance you have a data processing architecture problem, not a resource shortage problem.
The three-step solution
Fixing the Pandas syndrome doesn’t require a complete infrastructure overhaul. Here’s the progressive approach I recommend:
Step 1: Audit the existing (1 week)
Start by understanding where you are:
Document how much RAM your current pipelines actually use (a small measurement sketch follows this list)
Measure the time spent loading and transforming data
Calculate the monthly cost of your instances
Identify queries or transformations that could be delegated to the data lake
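A minimal sketch of that measurement, assuming your notebook has some entry point that builds the training DataFrame (load_training_data() below is a hypothetical placeholder for it):

import time
import psutil  # pip install psutil

start = time.perf_counter()
df = load_training_data()                                 # hypothetical: your existing prep pipeline
elapsed_min = (time.perf_counter() - start) / 60

process_gb = psutil.Process().memory_info().rss / 1e9     # RAM held by the whole notebook process
frame_gb = df.memory_usage(deep=True).sum() / 1e9         # RAM held by the DataFrame itself

print(f"prep: {elapsed_min:.1f} min | process: {process_gb:.1f} GB | dataframe: {frame_gb:.1f} GB")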
Step 2: Create a best practices template (2 weeks)
Develop a reference notebook that shows how to:
Formulate efficient SQL queries that filter upstream
Load only pre-aggregated results into Pandas
Use data lake capabilities for heavy joins
Document the resource savings obtained
Use controlled on-demand instances (code starts an instance that processes the request and stops when the result is available)
Share this template as a standard for all new projects.
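As a starting point for such a template, here is a hedged sketch of the core pattern: one helper that pushes the query to the data lake, loads only the result, and documents what actually landed in memory. The cursor interface assumed here is the Databricks connector used earlier; adapt it to your own warehouse client.

import time
import pandas as pd

def fetch_features(query: str, connection) -> pd.DataFrame:
    """Run heavy filtering, joins, and aggregations in the data lake; load only the result."""
    start = time.perf_counter()
    with connection.cursor() as cursor:
        cursor.execute(query)
        df = cursor.fetchall_arrow().to_pandas()
    size_mb = df.memory_usage(deep=True).sum() / 1e6
    print(f"{len(df):,} rows, {size_mb:.0f} MB, {time.perf_counter() - start:.0f}s")  # document the savings
    return df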
Step 3: Progressive migration (1-3 months)
Don’t try to refactor everything at once. Prioritize:
The most resource-intensive pipelines
Treatments that fail regularly
New projects in development
For each migration, measure and communicate the gains achieved. Nothing convinces better than a clearly documented €4,000/month saving.
The decisive test
I always end my audits with a simple recommendation I’d like to share with you:
If your data preparation requires more than 32 GB of RAM, there’s a 95% chance the problem isn’t the amount of RAM available. It’s your architectural approach.
This empirical rule, based on dozens of audits, is obviously not absolute. There are legitimate cases where you need a lot of memory: certain deep learning algorithms, massive Monte Carlo simulations, very high-resolution image processing.
But in the vast majority of cases, if you need several hundred gigabytes of RAM to prepare your data, it’s a sign that you’re loading too much data too early, or that you’re performing operations that should be delegated to your data lake.
Your Pandas syndrome, how much does it cost?
I encourage you to do this calculation today (a worked version in code follows the list):
Price of your current ML instances × number of hours × number of data scientists
Price of m5.xlarge or m5.2xlarge instances × same parameters
Difference × 12 months
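Here is that arithmetic in a few lines of Python, pre-filled with the figures from this story (assuming 720 on-demand hours per month and the hourly prices quoted earlier); replace them with your own numbers:

# Worked example with this article's figures; swap in your own prices and headcount
current_hourly = 6.05        # r5.24xlarge, EUR per hour
target_hourly = 0.20         # m5.xlarge, EUR per hour
hours_per_month = 720        # an always-on development instance
data_scientists = 8

monthly_savings = (current_hourly - target_hourly) * hours_per_month * data_scientists
print(f"~EUR {12 * monthly_savings:,.0f} per year")   # roughly EUR 404,000, the ~€400K quoted above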
This figure probably represents your training and optimization budget for next year. Or maybe a new data scientist to recruit. Or a few international conferences for the whole team.
The choice is yours: continue paying for underutilized resources, or invest that time and money in what will truly create value for your organization.
This story is based on real cases, with details modified to preserve the anonymity of the companies involved. If you recognize your organization in this description, know that you are absolutely not alone. The Pandas syndrome affects the majority of data teams I encounter. The good news? It’s easily corrected once identified.

