In this article, we take a closer look at the Data Engineering Zoomcamp, a free nine-week course that covers the fundamentals of data engineering. It is perfect for people who already know how to code and want to learn about building data systems.
We will describe different aspects of this course so you can learn more about it:
- Who is the course for?
- Course curriculum: what topics and technologies are covered by the course
- Course project for your portfolio
- Course assignments and scoring
- Homework and getting feedback
- Learning in public approach
- DataTalks.Club community
Who is the course for?
Before we get into the details, it’s important to know what skills you should have to join the course comfortably.
Here are the main prerequisites for the course:
- Coding skills
- Comfort with the command line
- Basic SQL
We’ll mainly use Python, but if you’re not familiar with it yet, it’s not a problem. If you’re already programming in another language, you’ll have no trouble picking up Python here.
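As a rough yardstick for the “basic SQL” prerequisite, you should be comfortable reading queries like the one below (shown here through Python’s built-in sqlite3 module; the table and column names are invented for illustration):

```python
import sqlite3

# In-memory database with a toy table, just to illustrate the expected SQL level
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (id INTEGER, distance REAL)")
conn.executemany("INSERT INTO trips VALUES (?, ?)", [(1, 2.5), (2, 4.0), (3, 1.5)])

# A simple aggregation: if this reads naturally to you, you meet the prerequisite
total, = conn.execute("SELECT SUM(distance) FROM trips").fetchone()
print(total)  # 8.0
```

If queries like this feel comfortable, you have enough SQL to start; the course builds the rest.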
Course curriculum
The curriculum guides you through the essential elements of data engineering, starting with foundational concepts and advancing to more complex topics.
You’ll spend the first six weeks learning and practicing each part of the data engineering process. In the final three weeks, you’ll apply what you’ve learned to build an end-to-end data pipeline from the ground up.
- Week 1: Introduction & Prerequisites
- Week 2: Workflow Orchestration
- Week 3: Data Warehouse
- Week 4: Analytics Engineering
- Week 5: Batch Processing
- Week 6: Streaming
- Weeks 7, 8, 9: Project
Let’s quickly go over each week, focusing on the main points and the tech you’ll use.
Week 1: Introduction & Prerequisites
Tech: Docker, Postgres, Google Cloud Platform (GCP), Terraform
Focus: Week 1 is dedicated to setting up the key tools and technologies you’ll be using throughout the course.
- Local Part: Learn to package your application and its dependencies with Docker and run a fully functioning database using PostgreSQL.
- Cloud Part: Discover Google Cloud Platform and Terraform, focusing on automating the setup and management of your cloud architecture.
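As a taste of the local part, a containerized Postgres can be declared in a compose file along these lines (image tag, credentials, and paths are placeholder assumptions, not the course’s exact setup):

```yaml
# docker-compose.yml — placeholder credentials and paths; adjust for your machine
services:
  pgdatabase:
    image: postgres:13
    environment:
      POSTGRES_USER: root        # placeholder credentials
      POSTGRES_PASSWORD: root
      POSTGRES_DB: zoomcamp
    ports:
      - "5432:5432"              # expose Postgres on the host
    volumes:
      - ./postgres_data:/var/lib/postgresql/data   # persist data across restarts
```

Running `docker compose up` then gives you a local database to load data into while you work through the week.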
Week 2: Workflow Orchestration
Tech: Mage.AI, Google Cloud Storage (GCS)
Focus: Week 2 is about task automation and coordination within complex data pipelines.
- Learn what Data Lakes are and how to set them up using GCS.
- Use Mage to connect multiple steps in the workflow into a single process.
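Conceptually, an orchestrated workflow is a chain of steps where each step’s output feeds the next. The sketch below illustrates that idea in plain Python (the function names are invented for illustration; an orchestrator like Mage adds this wiring plus scheduling, retries, and monitoring):

```python
def extract():
    # In a real pipeline this would pull from an API or a file in the data lake
    return [{"id": 1, "amount": "12.5"}, {"id": 2, "amount": "7.0"}]

def transform(rows):
    # Clean and type-cast the raw records
    return [{**row, "amount": float(row["amount"])} for row in rows]

def load(rows, target):
    # In a real pipeline this would write to GCS or a database
    target.extend(rows)
    return len(rows)

def run_pipeline(target):
    # The orchestrator's job: run the steps in order and pass data between them
    return load(transform(extract()), target)

storage = []
print(run_pipeline(storage))  # 2 rows loaded
```

The value of an orchestrator is everything around this chain: running it on a schedule, retrying failed steps, and showing you where a run broke.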
Week 3: Data Warehouse
Tech: Google BigQuery
Focus: Week 3 focuses on storing and querying data efficiently in a cloud data warehouse.
- Learn to use Google’s BigQuery as a managed data warehouse.
- Study data storage techniques like partitioning and clustering, and understand their differences.
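In BigQuery, for example, partitioning and clustering are declared at table-creation time. The statement below is a sketch with invented dataset and column names:

```sql
-- Hypothetical dataset/column names; BigQuery prunes partitions and
-- co-locates clustered values to cut the bytes a query scans
CREATE TABLE taxi.trips_partitioned
PARTITION BY DATE(pickup_datetime)
CLUSTER BY vendor_id AS
SELECT * FROM taxi.trips_raw;
```

Queries that filter on `pickup_datetime` then scan only the matching partitions instead of the whole table, which directly lowers cost and latency.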
Week 4: Analytics Engineering
Tech: dbt, Google Data Studio, Metabase
Focus: Week 4 emphasizes transforming raw data into actionable insights.
- Get introduced to dbt, a tool for data transformation.
- Explore dbt models, their testing, documentation, and deployment options.
- Conclude with a section on data visualization using Google Data Studio or Metabase.
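A dbt model is essentially a SELECT statement saved as a file; dbt handles materialization and dependency order. A minimal sketch, with invented model and column names:

```sql
-- models/fct_daily_trips.sql — hypothetical model; dbt's ref() resolves
-- the upstream model and builds models in dependency order
SELECT
    vendor_id,
    DATE(pickup_datetime) AS trip_date,
    COUNT(*) AS trip_count
FROM {{ ref('stg_trips') }}
GROUP BY 1, 2
```

Because models are just versioned SQL, they can be tested, documented, and deployed like any other code, which is the core idea behind analytics engineering.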
Week 5: Batch Processing
Tech: Apache Spark
Focus: Week 5 covers processing large, bounded chunks of data on a schedule.
- Dive into Spark SQL, DataFrames, and Spark internals.
- Learn data manipulation techniques like joins.
- Explore cloud-based options for running Spark more efficiently.
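To give a flavor of what a join does under the hood, here is the classic hash-join idea in plain Python (Spark implements far more sophisticated distributed variants; the data and names here are invented):

```python
def hash_join(left, right, key):
    """Join two lists of dicts on `key`: build a hash index on one side, probe with the other."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)

    joined = []
    for row in left:
        for match in index.get(row[key], []):
            joined.append({**row, **match})  # merge matching rows
    return joined

trips = [{"trip_id": 1, "zone_id": 10}, {"trip_id": 2, "zone_id": 99}]
zones = [{"zone_id": 10, "zone_name": "Airport"}]
print(hash_join(trips, zones, "zone_id"))
# [{'trip_id': 1, 'zone_id': 10, 'zone_name': 'Airport'}]
```

Spark’s job is to do this across many machines at once, deciding how to partition and shuffle the data so each worker only joins its own slice.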
Week 6: Streaming
Tech: Kafka, Stream Processing
Focus: Week 6 moves you from batch to stream processing, enabling real-time data handling.
- Understand Kafka fundamentals, including advanced techniques like stream joins and windowing.
- Cover Faust and PySpark for Python-based stream processing.
- Learn how to use KSQL and ksqlDB for data manipulation.
- Complete the week by setting up your environment with Docker.
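Windowing is a core streaming idea: events are grouped into fixed time buckets so you can aggregate an unbounded stream. A minimal tumbling-window sketch in plain Python (Kafka Streams and ksqlDB do this with proper event-time and late-data handling; the event format here is invented):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Count events per key in fixed, non-overlapping time windows.

    `events` is an iterable of (epoch_seconds, key) pairs.
    """
    counts = defaultdict(int)
    for ts, key in events:
        window_start = ts - (ts % window_seconds)  # bucket the timestamp
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(0, "rides"), (30, "rides"), (65, "rides"), (70, "fares")]
print(tumbling_window_counts(events, 60))
# {(0, 'rides'): 2, (60, 'rides'): 1, (60, 'fares'): 1}
```

The hard parts in a real streaming system, which the course covers, are handling events that arrive late or out of order and deciding when a window is “done.”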
Weeks 7, 8, 9: Project
Duration: 2 weeks for development, 1 week for peer review
Objective: The project focuses on applying your acquired skills to build a data engineering pipeline from scratch. Completing this hands-on project not only validates your skills but also enhances your portfolio, offering a competitive edge in job searches.
Peer Review: To complete the project, you are required to evaluate projects from at least three of your peers. Failure to do so will result in your project being marked as incomplete. For detailed peer review criteria, check this link.
The project involves the following steps:
- Select a Dataset: Choose a dataset that intrigues you.
- Data Lake Pipeline: Create a pipeline to process the selected dataset and store it in a data lake.
- Data Warehouse Pipeline: Develop another pipeline to move the data from the lake to a data warehouse.
- Data Transformation: Transform and prepare the data in the warehouse to make it dashboard-ready.
- Dashboard Creation: Build a dashboard featuring at least two tiles that visually represent your data.
The course description on GitHub provides a detailed overview of the topics covered each week. You can see the video lectures, slides, code, and community notes for each week of the course to dive into the content. By the end of the course, you will have acquired the fundamental skills necessary for a career as a data engineer.
If you’re ready to join the next cohort of the course, submit this form to register and stay updated.
Course assignments and scoring
To reinforce your learning, you can submit a homework assignment at the end of each week. Assignments are reviewed and scored by course instructors, and your scores are added to an anonymous leaderboard, creating friendly competition among course members and motivating you to do your best.
Homework and getting feedback
For support, we have an FAQ section with quick answers to common questions. If you need more help, our Slack community is always available for technical questions, clarifications, or guidance. We also host live Q&A sessions called “office hours” where you can interact with instructors and get immediate answers to your questions.
Learning in public approach
Throughout the course, we actively encourage and incentivize learning in public. By sharing your progress, insights, and projects online, you earn additional points for your homework and projects.
This not only demonstrates your knowledge but also builds a portfolio of valuable content. Sharing your work online also helps you get noticed by social media algorithms, reaching a broader audience and creating opportunities to connect with individuals and organizations you may not have encountered otherwise.
DataTalks.Club has a supportive community of like-minded individuals in our Slack. It is the perfect place to enhance your skills, deepen your knowledge, and connect with peers who share your passion. These connections can lead to lasting friendships, potential collaborations in future projects, and exciting career prospects.
The Data Engineering Zoomcamp offers a nine-week program covering the essentials of data engineering. As a student, you will explore different aspects of data engineering and technologies to work with each of them. The curriculum covers everything from managing workflows to proficiency in data warehousing and analytics engineering, emphasizing practical knowledge and its application in real-world scenarios.
We focus on hands-on learning through consistent homework assignments and a final project, preparing graduates to advance in their data engineering careers. The supportive DataTalks.Club community helps students stay motivated and finish the course.
The next cohort starts on January 15, 2024! Take the chance to start 2024 by learning data engineering.
Register for the Data Engineering Zoomcamp.