Podcast

Big Data Engineer vs Data Scientist: Skills, Tools, and Career Paths

S4E3

career transition, software engineering, data engineering, data science

Original Episode

Use these links for the canonical episode and media sources.

Open the original DataTalks.Club podcast page
Watch on YouTube
Listen on Spotify
Listen on Apple Podcasts

Episode Overview

How do the day-to-day responsibilities and skill sets differ between a Big Data Engineer and a Data Scientist, and what should you learn to move between those roles? In this episode, Roksolana Diachuk, a Big Data Engineer at Captify, Women Who Code Kyiv lead, and speaker on Scala and Kubernetes, walks through her career transition from backend Java into big data engineering and R&D.

People

Use these links to connect the episode to guest notes.

Roksolana Diachuk

Chapter Summary

Use these checkpoints to decide whether to open the source transcript.

1:52 - Episode Overview & Guest Introduction
2:28 - Career Path: From Backend Java to Big Data Engineering (Scala, R&D, Captify)
4:26 - Core Responsibilities: Building ETL Data Pipelines, HDFS/S3, Impala
6:38 - Performance Focus: Spark Job Optimization & Cluster Resource Planning
7:18 - Big Data Tooling: Spark, S3/HDFS, Kubernetes, Prometheus, Grafana, Scala
8:04 - Storytelling in Tech Talks: “Alice” Series and Conference Presentations
9:12 - Role Comparison: Big Data Engineer vs Data Engineer (formats: Avro, Parquet,
11:07 - Essential Skills: Coding, SQL, Distributed Systems & Infrastructure Awareness
13:56 - Data Scientist Scope: Data Cleaning, Feature Engineering, Model Cycle & Deployment
15:32 - Tool Overlap: Spark & Python vs ML Libraries for Modeling
16:26 - Collaboration Model: File Interfaces (Parquet) and Team Structures
18:54 - Case Study: Recommendation System — Streaming and Batch Pipeline Design
22:51 - Streaming vs Batch Choices: Flink for Streaming, Spark for Batch, Parquet
23:40 - ML Deployment Stack: MLflow, Kubeflow, Kubernetes & ML Engineer Roles
24:49 - Cross-Skill Expectations: What Data Scientists Should Know About Pipelines
27:30 - Upskilling for Engineers: Data Engineers Learning ML Inputs/Outputs (not
30:53 - Transition Path: Analyst/Data Scientist → Data Engineer (coding, DBs, infra)
34:53 - Databases to Learn: PostgreSQL, MySQL, MongoDB, Neo4j (SQL vs NoSQL)
36:07 - Infrastructure Essentials: Docker, Cloud Services, Intro to Kubernetes
39:09 - Data Quality & Monitoring: Flow Metrics, Spikes, and Schema Change Alerts
43:37 - Data Documentation & Governance: Schema Descriptions, Confluence, HypeSQL
46:14 - Software Engineering for Data Scientists: Code Quality, Reproducibility,
48:26 - Hands-on Learning Resources: Katacoda, Google Codelabs, Databricks Trainings
49:29 - Career Advice for Graduates: Choosing Data Engineering vs Data Science
51:16 - Starter Projects: Word Count, Twitter Streaming, Elasticsearch + Kibana
53:28 - Datasets for Practice: Wikipedia Dumps, CommonCrawl, NASA APIs, Social Media
56:08 - Pre-built ETL Platforms vs Custom Pipelines: Trade-offs & Scalability
58:05 - Operational Challenges: Deduplication, Historical Reprocessing, Risk Management
1:00:25 - Data Versioning & Time Travel: Delta Lake for Reprocessing and Auditing
1:00:40 - Learning Recommendations: Coursera Big Data Specialization; Spark & Data