Podcast
Build and Scale Data Engineering Systems for Fraud Detection: Feature Pipelines, Real-Time Inference, Graph Databases & Production Debugging
Original Episode
Use these links for the canonical episode and media sources.
- Open the original DataTalks.Club podcast page
- Watch on YouTube
- Listen on Spotify
- Listen on Apple Podcasts
Episode Overview
How do you build data infrastructure that stops stolen-card transactions and return abuse in real time? In this episode, Angela Ramirez, a data engineer at Sam’s Club who previously worked at Sephora and specializes in machine learning for fraud prevention, walks through the engineering behind retail fraud detection. Drawing on her background in NLP and four years as a data engineer, Angela explains the data pipelines behind fraud detection, feature engineering workflows that combine daily batches with real-time scoring, and the MLOps responsibilities involved, including model metrics, deployment, and monitoring.
People
Use these links to connect the episode to guest notes.
Chapter Summary
Use these checkpoints to decide whether to open the source transcript.
- 0:00 - Podcast Introduction & Guest Overview (Angela Ramirez)
- 2:41 - Career Journey: Sephora to Sam’s Club
- 3:45 - Fraud Detection in Retail: Stolen Cards & Return Abuse
- 6:22 - Data Engineering for Fraud: Pipelines, Features, Dashboards
- 8:24 - Feature Engineering Workflow: Daily Batches + Real-Time Scoring
- 9:48 - MLOps Responsibilities: Model Metrics, Deployment, Monitoring
- 11:19 - Team Structure: Data Engineers, ML Engineers, Data Scientists
- 12:48 - Academic Background: Cognitive Science, NLP, HCI
- 14:14 - Data-Centric Mindset: Why Data Engineering Powers ML
- 16:02 - Career Transition: Process Improvement → Data Analyst → Data Engineer
- 19:15 - System Design Best Practices: Stakeholders, Timing, Documentation
- 20:30 - Data Modeling Decisions: Relational vs Document vs Graph
- 21:30 - Elasticsearch & Document Indexing for Entity Data
- 23:04 - Graph Databases & SPARQL: Wikidata and Entity Relationships
- 29:15 - Network Features for Fraud: Members, Transactions, Products
- 33:34 - Real-Time Decisioning: Front-End Signals for Cashiers & Security
- 34:46 - Hybrid Architecture: Batch Computation with Instant Inference
- 35:33 - Database Selection Criteria: Static Schema vs Dynamic Data
- 38:11 - Graph Visualization for Investigations: Neo4j Use Cases
- 40:50 - Software Engineering for Data Engineers: Testing & Code Quality (PySpark)
- 43:28 - Data Quality Tooling: Great Expectations and Cloud Monitoring
- 44:41 - Operational Challenges: Job Failures, Schema Changes, Scaling
- 48:21 - Debugging Playbook: Logs, Runbooks, and Error Documentation
- 50:23 - Tech Stack Overview: GCP, Dataproc/Databricks, PySpark, Cassandra
- 51:23 - Managed vs Serverless Spark: Dataproc, EMR, Serverless Execution
- 53:18 - Pandas & PyArrow: Performance Improvements for Big Data
- 54:57 - Cassandra Use Cases: Scalability, Fault Tolerance, Clusters
- 56:19 - External Data Integration: APIs, Data Contracts, Stability
- 1:00:00 - Recommended Resources: Designing Data-Intensive Applications, PySpark, SQL