Big Data Engineer vs Data Scientist: Skills, Tools, and Career Paths
Original Episode
Use these links for the canonical episode and media sources.
- Open the original DataTalks.Club podcast page
- Watch on YouTube
- Listen on Spotify
- Listen on Apple Podcasts
Episode Overview
How do the day-to-day responsibilities and skill sets of a Big Data Engineer and a Data Scientist differ, and what should you learn to move between the two roles? In this episode, Roksolana Diachuk, a Big Data Engineer at Captify, Women Who Code Kyiv lead, and speaker on Scala and Kubernetes, walks through her career transition from backend Java into big data engineering and R&D.
People
Use these links to connect the episode to guest notes.
Chapter Summary
Use these checkpoints to decide whether to open the source transcript.
- 1:52 - Episode Overview & Guest Introduction
- 2:28 - Career Path: From Backend Java to Big Data Engineering (Scala, R&D, Captify)
- 4:26 - Core Responsibilities: Building ETL Data Pipelines, HDFS/S3, Impala
- 6:38 - Performance Focus: Spark Job Optimization & Cluster Resource Planning
- 7:18 - Big Data Tooling: Spark, S3/HDFS, Kubernetes, Prometheus, Grafana, Scala
- 8:04 - Storytelling in Tech Talks: “Alice” Series and Conference Presentations
- 9:12 - Role Comparison: Big Data Engineer vs Data Engineer (formats: Avro, Parquet,
- 11:07 - Essential Skills: Coding, SQL, Distributed Systems & Infrastructure Awareness
- 13:56 - Data Scientist Scope: Data Cleaning, Feature Engineering, Model Cycle & Deployment
- 15:32 - Tool Overlap: Spark & Python vs ML Libraries for Modeling
- 16:26 - Collaboration Model: File Interfaces (Parquet) and Team Structures
- 18:54 - Case Study: Recommendation System — Streaming and Batch Pipeline Design
- 22:51 - Streaming vs Batch Choices: Flink for Streaming, Spark for Batch, Parquet
- 23:40 - ML Deployment Stack: MLflow, Kubeflow, Kubernetes & ML Engineer Roles
- 24:49 - Cross-Skill Expectations: What Data Scientists Should Know About Pipelines
- 27:30 - Upskilling for Engineers: Data Engineers Learning ML Inputs/Outputs (not
- 30:53 - Transition Path: Analyst/Data Scientist → Data Engineer (coding, DBs, infra)
- 34:53 - Databases to Learn: PostgreSQL, MySQL, MongoDB, Neo4j (SQL vs NoSQL)
- 36:07 - Infrastructure Essentials: Docker, Cloud Services, Intro to Kubernetes
- 39:09 - Data Quality & Monitoring: Flow Metrics, Spikes, and Schema Change Alerts
- 43:37 - Data Documentation & Governance: Schema Descriptions, Confluence, HypeSQL
- 46:14 - Software Engineering for Data Scientists: Code Quality, Reproducibility,
- 48:26 - Hands-on Learning Resources: Katacoda, Google Codelabs, Databricks Trainings
- 49:29 - Career Advice for Graduates: Choosing Data Engineering vs Data Science
- 51:16 - Starter Projects: Word Count, Twitter Streaming, Elasticsearch + Kibana
- 53:28 - Datasets for Practice: Wikipedia Dumps, CommonCrawl, NASA APIs, Social Media
- 56:08 - Pre-built ETL Platforms vs Custom Pipelines: Trade-offs & Scalability
- 58:05 - Operational Challenges: Deduplication, Historical Reprocessing, Risk Management
- 1:00:25 - Data Versioning & Time Travel: Delta Lake for Reprocessing and Auditing
- 1:00:40 - Learning Recommendations: Coursera Big Data Specialization; Spark & Data