MLOps Zoomcamp FAQ
Table of Contents
General Course-Related Questions
# Course: I forgot if I registered. Can I still join the course?
You don't need to register; registration is not mandatory and is only used for gauging interest and collecting data for analytics. You can start learning and submitting homework without registering, even while a cohort is “live”. There is no check against any registered list.
# Is it going to be live? When?
The course videos are pre-recorded, and you can start watching the course right now.
The zoomcamps are spread out throughout the year. See the article Guide to Free Online Courses at DataTalks Club.
We will also occasionally have office hours—live sessions where we will answer your questions. The office hours sessions are recorded too.
You can see the office hours (playlist with year 20xx) as well as the pre-recorded course videos in the Course Channel’s Bookmarks and/or DTC’s YouTube channel.
# Course - Can I still join the course after the start date?
Yes, even if you don't register, you're still eligible to submit the homeworks as long as the form is still open and accepting submissions.
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything to the last minute.
# Course: How do I start?
Whether you're with a 'live' cohort or following in self-paced mode, start by:
Reading pins and bookmarks on the course channel to see what things are where.
Reading the repository (bookmarked in channel) and watching the video lessons (playlist bookmarked in channel).
If you have questions, search the channel itself first; someone may have already asked and gotten a solution.
For the most Frequently Asked Questions, refer to this document:
If you don't want to read/skim/search the FAQ document, tag the @ZoomcampQABot when asking questions, and it will summarize answers from its knowledge base. For generic, non-zoomcamp queries, consider using tools like ChatGPT, Bing Copilot, or Google Gemini, especially for error messages.
Check if you're on track by checking the deadlines in the Course Management form for homework submissions.
The main difference if you're not in a "live" cohort is that responses to your questions might be delayed because fewer active students are online. This won't be an issue if you do your own due diligence by searching for answers first and reading the documentation of the library.
- If you need to ask questions and the resources above haven't helped, follow the guidelines in the asking-questions.md document (bookmarked in channel) and also check the Pins.
# Course - Can I still graduate if I didn't complete the homework for week x?
Yes
# Certificate - Can I follow the course in a self-paced mode and get a certificate?
No, you can only get a certificate if you finish the course with a “live” cohort. We don't award certificates for the self-paced mode. The reason is you need to peer-review capstone(s) after submitting a project. You can only peer-review projects at the time the course is running.
# What’s the difference between the 2023 and 2022 course?
The difference is the Orchestration and Monitoring modules. Those videos will be re-recorded to use Prefect. The rest should mostly be the same.
Also, all of the homeworks will be changed for the 2023 cohort.
# Cohort: What’s the difference between the 2024 and 2023 course?
The difference is the Orchestration and Monitoring modules. Those videos will be re-recorded to use Mage-AI. The rest should mostly be the same.
Additionally, all of the homeworks will be changed for the 2024 cohort.
# Cohort: Will there be a 2024 Cohort? When will the 2024 cohort start?
Yes, it will start in May 2024.
# Cohort: I missed the current cohort, when is the next cohort scheduled for? Will there be a 202x cohort?
Please see the summary of all zoomcamps and their respective schedule at this link.
Note that there's no guarantee the zoomcamps will be run indefinitely or that the same zoomcamps will be conducted every year.
# Homework: What if my answer is not exactly the same as the choices presented?
Please choose the closest one to your answer. Also, do not post your answer in the course Slack channel.
# Homework: where can I find the schedule and/or deadlines of each homework assignment?
You can find the deadlines for each homework assignment in the course schedule or timeline provided at https://courses.datatalks.club/mlops-zoomcamp-2024/. The time is your own local time, as it has been automatically converted.
# Homework: Is the due date for homework 20th May? How do I check the updated playlist and homework for the mlops course?
The due date differs due to participants being in different time zones. It was the midnight of May 19th/20th in Berlin, and whatever corresponded to that in your particular time zone. You can find the deadline on the homework submission page:
You can find all cohort-specific information for the 2025 cohort here:
# Homework: Why is the experiment in question 6 taking so long to run? Should we use yellow taxi data, or green taxi data?
You might need to use the green taxi data rather than yellow taxi data. The preprocessing code provided with the homework expects green taxi data (even though the question specifies yellow taxi data). Using green taxi data seems to be the correct approach, based on similar questions. Note that there is no official confirmation of this yet (as of early morning, May 27th in Berlin).
# Homework: I am getting conflicts on server ports and cannot establish a connection to the MLflow server, why?
Your port (5000) may be in use by some other process. To resolve this:
Run the following command to find out which process is using the port:
lsof -i :5000
Either kill the process using that port or route to a different port. You can explicitly change the port with the following command:
mlflow ui --backend-store-uri sqlite:///mlflow.db --port 5001
# Homework 3, 2025 Cohort - Do I have to use MAGE as the orchestrator? Can I use any orchestrator I want?
You do not have to use MAGE or any specific orchestrator; it is totally up to you.
# Homework: Just found this course, can I still submit homeworks?
To clarify on late homework submissions:
- You cannot submit after the homework is scored; once the form is closed (i.e., scored), no further submissions are possible.
- You can check your code against the solution by reviewing the homework.md file.
If the due date has passed but the form is still "Open/Submittable":
- This is considered a "late homework submission," and the form is still editable.
- Don’t forget to click the Update button to save any changes.
Please note, it's uncertain when the form will be closed as this process is currently manual.
# Is it too late to start the course if I have ML experience?
It really depends on how much time and effort you can dedicate to the project over the coming weeks. Since you're late for the homeworks and they aren't required for the certificate, it might make sense to focus on the projects. Even if the first attempt is a struggle, it will be the best preparation for a second attempt, should you need or want one.
# Project: Are we free to choose our own topics for the final project?
Please pick a problem you want to solve yourself. Potential datasets can be found on Kaggle, Hugging Face, Google, AWS, or the UCI Machine Learning Datasets Repository. More links are documented in datasets.md. Please also read the README.md in the 07-project folder.
# Project: Is the capstone an individual or team project?
It is an individual project.
# Project: For the final project, is it required to be put on the cloud?
You can get a few cloud points by using Kubernetes even if you deploy it only locally. Alternatively, you can use LocalStack to mimic AWS. Be sure you're clear on the Evaluation Criteria.
# Homework and Leaderboard: what is the system for points in the course management platform?
After you submit your homework, it will be graded based on the number of questions in that particular homework. You can see how many points you have on the homework page at the top.
In the leaderboard, you will find the sum of all points you've earned, which include:
- Points for Homeworks
- Points for FAQs
- Points for Learning in Public
If you submit something to the FAQ, you receive one point. For each Learning in Public link, you also get one point. Hover over the "?" for some explanations.
# What exactly is a learning-in-public post?
A learning-in-public post is content you create about what you have learned on a specific topic. Some DOs and DON'Ts are explained by Alexey in the following video:
https://www.loom.com/share/710e3297487b409d94df0e8da1c984ce
Anyone caught abusing and gaming the system will be publicly called out and have their points stripped so they don’t appear high on the Leaderboard (as of 18 June 2024).
# Leaderboard: I am not on the leaderboard / how do I know which one I am on the leaderboard?
When you set up your account, you are automatically assigned a random name such as “Lucid Elbakyan.” Click on the "Jump to your record on the leaderboard" link to find your entry.
To see what your display name is, click on the Edit Course Profile button.
- The first field is your nickname/displayed name. Change it if you want to use your Slack username, GitHub username, or any other nickname to remain anonymous.
- Unless you want "Lucid Elbakyan" on your certificate, it is mandatory that you change the second field to your official name as per your identification documents (passport, national ID card, driver's license, etc.). This is the name that will appear on your Certificate!
# Error: creating Lambda Function (...): InvalidParameterValueException: The image manifest, config or layer media type for the source image ... is not supported.
This error occurs when the Docker image you are using is a manifest list (multi-platform). AWS Lambda does not support manifest lists—it only accepts single-platform images with a standard image manifest.
Quick fix: Build your Docker image using docker buildx and specify the platform explicitly:
docker buildx build --platform linux/amd64 -t your-ecr-image:latest -f Dockerfile .
This ensures the image is compatible with AWS Lambda. Also, make sure that you push your image using the --platform option.
# Criteria for getting a certificate?
Finish the Capstone project.
# Is completion of Homework necessary for a certificate?
No.
# Can I submit the final project on the second attempt and still receive the certificate?
Yes, absolutely. It's your choice whether to submit one or two times; passing any one attempt is sufficient to earn the certificate.
Module 1: Introduction
# Can I submit and update my project attempt multiple times before the final deadline?
Yes, you can submit and update your project attempts multiple times before the final deadline.
- It is advisable not to wait until the last minute. Submitting even a partially completed project early allows you to make improvements over time.
- Continue adding improvements as needed until the final date.
- Simply update the Git commit SHA to reflect changes.
# Opening Jupyter in VSCode
You can install the Jupyter extension to open notebooks in VSCode.
# Launching Jupyter notebook from codespace VM
When you are ready and have installed Anaconda, you can launch a Jupyter notebook in a new terminal with the following command:
jupyter notebook
Be careful not to make any typos. For instance, entering "jupyter-notebook" will result in an error:
Jupyter command `jupyter-notebook` not found.
# Configuring Github to work from the remote VM
In case you want to set up a GitHub repository (e.g., for homeworks) from a remote VM, you can follow these helpful tutorials:
- Setting up GitHub on AWS instance: Tutorial
- Setting up keys on AWS instance: GitHub Documentation
Once you complete these steps, you should be able to push to your repository successfully.
AWS Instance Note:
The selected AWS instance may not be covered under the free tier due to its size or other factors. Here is what the AWS free tier includes:
- Resizable compute capacity in the Cloud.
- 750 hours per month of Linux, RHEL, or SLES t2.micro or t3.micro* instance, depending on the region.
- 750 hours per month of Windows t2.micro or t3.micro* instance, depending on the region.
- 750 hours per month of public IPv4 address regardless of the instance type.
*Instances launch in Unlimited mode and may incur additional charges.
# Opening Jupyter in AWS
I faced an issue while setting up Jupyter Notebook on AWS: I was unable to access it from my desktop (I was not using VS Code, hence the problem).
Run the following command:
jupyter notebook --generate-config
Edit the file /home/ubuntu/.jupyter/jupyter_notebook_config.py and add the following line:
c.NotebookApp.ip = '*'
# WSL: instructions
If you wish to use WSL on your Windows machine, here are the setup instructions:
Install wget:
sudo apt install wget
Download Anaconda from the Anaconda download page using the wget command:
wget <download-address>
Turn on Docker Desktop WSL 2:
Clone the desired GitHub repository:
git clone <github-repository-address>
Install Jupyter:
pip3 install jupyter
Consider using Anaconda, which includes tools like PyCharm and Jupyter.
Alternatively, download Miniforge for a lightweight, open-source version of conda that supports mamba for improved environment solving speed. The Texas Tech University High Performance Computing Center provides a detailed guide:
For Windows, install WSL via:
wsl --install
If Python shows as version 3.10 after installing Anaconda with Python 3.9, execute:
source .bashrc
If the issue persists, add the following to your PATH:
export PATH="<anaconda-install-path>/bin:$PATH"
For using VSCode with WSL, refer to VSCode on WSL.
# Git: Created repo without .gitignore
If you created a repository without a .gitignore, follow these steps to add one:
Open Terminal.
Navigate to the location of your Git repository.
Create a .gitignore file for your repository:
touch .gitignore
Locate the .gitignore file. If you already have it, open it.
Edit the .gitignore file and add the following lines:
# Python
*.pyc
__pycache__/
*.py[cod]
*$
Save the changes to the .gitignore file.
Commit the changes.
# .gitignore: how-to
If you create a folder data and download datasets or raw files to your local repository, you might want to push all your code to a remote repository without including these files or folders. To achieve this, use a .gitignore file.
Follow these steps to create a .gitignore file:
- Create an empty .txt file using a text editor or command line.
- Save it as .gitignore (ensure you use the dot symbol).
- Add rules, e.g. *.parquet to ignore all Parquet files, or data/ to ignore all files in the data folder.
For more patterns, read the Git documentation.
# AWS: Suggestions
Ensure when stopping an EC2 instance that it fully stops. Look for the status indicator: green (running), orange (stopping), and red (stopped). Refresh the page to confirm it shows a red circle and status as stopped.
Note that stopping an EC2 instance might still incur charges, such as storage costs for uploaded data on an EBS volume.
Consider setting up billing alerts to monitor costs. However, specific instructions for setting them up are not provided here.
# IBM Cloud as an alternative to AWS
You can get an invitation code from Coursera and use it in your account to verify it. IBM Cloud offers different features.
# AWS costs:
I am worried about the cost of keeping an AWS instance running during the course.
With the instance specified during the working environment setup, if you remember to stop the instance once you finish your work for the day, a day with about 5 hours of work will cost around $0.40 USD, which adds up to roughly $12 USD per month. This seems to be an affordable amount.
You must remember that you will have a different public IP address every time you restart your instance, and you will need to edit your SSH Config file. It's worth the time though.
Additionally, AWS enables you to set up an automatic email alert if a predefined budget is exceeded.
Here is a tutorial to set this up.
Also, you can estimate the cost yourself using the AWS pricing calculator. At the time of writing (20.05.2023), a t3a.xlarge instance with 2 hr/day usage (which translates to 10 hr/week and should be enough to complete the course) and 30GB EBS monthly cost is 10.14 USD.
Here’s a link to the estimate.
# Is the AWS free tier enough for doing this course?
For many parts - yes. Some services like Kinesis are not in the AWS free tier, but you can use them locally with LocalStack.
# AWS EC2: this site can’t be reached
When I click an open IP address in an AWS EC2 instance, I get an error: "This site can’t be reached." What should I do?
This IP address is not meant to be opened in a browser. It is used to connect to the running EC2 instance via terminal. Use the following command from your local machine or a remote server:
- Assume the IP address is 11.111.11.111
- The downloaded key name is razer.pem (ensure the key is moved to the hidden .ssh folder)
- Your username is user_name
ssh -i /Users/user_name/.ssh/razer.pem ubuntu@11.111.11.111
# Unprotected private key file!
After running the command:
ssh -i ~/.ssh/razer.pem ubuntu@XX.XX.XX.XX
I encountered the error: "unprotected private key file". To resolve this issue, ensure the file permissions are correctly set by running the following command:
chmod 400 ~/.ssh/razer.pem
For more detailed steps, see this guide.
# AWS EC2 instance constantly drops SSH connection
My SSH connection to AWS cannot last more than a few minutes, whether via terminal or VS Code.
My config:
Host mlops-zoomcamp # ssh connection calling name
User ubuntu # username AWS EC2
HostName <instance-public-IPv4-addr> # Public IP, changes when instance is turned off.
IdentityFile ~/.ssh/name-of-your-private-key-file.pem # Private SSH key file path
LocalForward 8888 localhost:8888 # Connecting to internal service
StrictHostKeyChecking no
The disconnection occurs whether I SSH via WSL2 or via VS Code, often after running some code like import mlflow.
To reconnect, I need to stop and restart the instance, which assigns a new IPv4 address.
I've checked the steps at AWS's troubleshooting page: AWS SSH Connection Errors
Inbound rule should allow all IPs for SSH.
Expected Behavior:
- SSH connection should remain active while using the instance.
- Should be able to reconnect if disconnected.
Solution:
Memory Issue: Disconnections may occur if the instance runs out of memory. Use EC2's screenshot feature to troubleshoot. If it's an OS out-of-memory issue, consider:
- Using a higher compute VM with more RAM.
- Adding a swap file, which uses disk as a RAM substitute to prevent OOM errors.
- Follow Ubuntu's documentation: Ubuntu Swap FAQ.
- Alternatively, follow AWS documentation: AWS Swap File.
Timeout Issue: If connections drop due to timeouts, add the following to your local .ssh/config file to ping the server every 50 seconds:
ServerAliveInterval 50
# AWS EC2: How do I handle changing IP addresses on restart?
Every time I restart my EC2 instance, I receive a different IP and need to update the config file manually.
Solution:
You can create a script to automatically update the IP address of your EC2 instance. Refer to this guide for detailed steps.
# VS Code crashes when connecting to Jupyter
Make sure to use an instance with enough compute capabilities, such as a t2.xlarge. You can check the monitoring tab in the EC2 dashboard to monitor your instance.
# My connection to my GCP VM instance keeps timing out when I try to connect
If you switched off the VM instance completely in GCP, the IP address may change when it switches back on. You need to update the ssh_config file with the new external IP address. This can be done in VS Code if you have the Remote-SSH extension installed.
- Open the command palette and type Remote-SSH: Open SSH Configuration File….
- Select the appropriate ssh_config file.
- Edit the HostName to the correct IP address.
# X has 526 features, but expecting 525 features
Error:
ValueError: X has 526 features, but LinearRegression is expecting 525 features as input.
Solution:
The DictVectorizer creates an initial mapping for the features (columns). When calling the DictVectorizer again for the validation dataset, transform should be used, as it will ignore features that it did not see when fit_transform was last called. For example:
X_train = dv.fit_transform(train_dict)
X_test = dv.transform(test_dict)
# Missing dependencies
If some dependencies are missing, install the following packages: pandas, matplotlib, scikit-learn, fastparquet, pyarrow, seaborn. Alternatively, install everything at once from the requirements file:
pip install -r requirements.txt
I have seen this error when using pandas.read_parquet(). The solution is to install pyarrow or fastparquet by running the following command in the notebook:
!pip install pyarrow
Note: If you're using Conda instead of pip, install fastparquet rather than pyarrow, as it is much easier to install and it's functionally identical to pyarrow for our needs.
# squared Option Not Available in mean_squared_error
The mean_squared_error function in scikit-learn no longer includes the squared parameter. To compute the Root Mean Squared Error (RMSE), use the dedicated root_mean_squared_error function from sklearn.metrics instead.
# No RMSE value in the options
The evaluation RMSE I get doesn’t figure within the options!
If you're evaluating the model on the entire February data, try to filter outliers using the same technique you used on the train data (0 ≤ duration ≤ 60) and you'll get an RMSE which is (approximately) in the options. Also, don't forget to convert the columns' data types to str before using the DictVectorizer.
Another option:
- Along with filtering outliers, additionally filter on null values by replacing them with -1.
- You will get an RMSE which is (almost) the same as in the options.
- Use the .round(2) method to round it to 2 decimal points.
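For reference, a minimal sketch of the filtering and conversion steps described above (the column names assume the yellow taxi schema, and dv is a DictVectorizer already fitted on the training data):
df['duration'] = (df.tpep_dropoff_datetime - df.tpep_pickup_datetime).dt.total_seconds() / 60
df = df[(df.duration >= 0) & (df.duration <= 60)]            # same outlier filter as on the train set
categorical = ['PULocationID', 'DOLocationID']
df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')   # optional: replace nulls with -1, then cast to str
X_val = dv.transform(df[categorical].to_dict(orient='records'))            # transform only, no refit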
# Deprecation warnings
The Python interpreter warns of modules that have been deprecated and will be removed in future releases while also suggesting how to update your code. For example:
C:\ProgramData\Anaconda3\lib\site-packages\seaborn\distributions.py:2619:
FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
warnings.warn(msg, FutureWarning)
To suppress the warnings, you can include this code at the beginning of your notebook:
import warnings
warnings.filterwarnings("ignore")
# How to replace distplot with histplot
To replace sns.distplot with sns.histplot, you can use the following syntax:
sns.distplot(df_train["duration"])
Can be replaced with:
sns.histplot(
df_train["duration"], kde=True,
stat="density", kde_kws=dict(cut=3), bins=50,
alpha=.4, edgecolor=(1, 1, 1, 0.4),
)
This will give you an almost identical result.
# KeyError: 'PULocationID' or 'DOLocationID'
You need to replace the capital letter "L" with a lowercase one: the columns in this dataset are spelled PUlocationID and DOlocationID.
# ImportError: Unable to find a usable engine; tried using: ‘pyarrow’, ‘fastparquet’.
To resolve this error, run the following command:
!pip install pyarrow
After successfully installing, you can delete the command.
# Reading large parquet files
When reading large parquet files, you might encounter the following error:
IndexError: index 311297 is out of bounds for axis 0 with size 131743
Here are some possible solutions:
Run as a Python Script:
- Try executing your code as a standalone Python script instead of within Jupyter Notebook.
Use PySpark Library:
- Consider using the PySpark library, which is optimized for handling large data files.
Read Parquet in Chunks:
- You can read parquet files in chunks using the pyarrow library. Reference this blog post for more details.
Using these methods may help manage and process large parquet files more efficiently.
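To illustrate the chunked option, here is a minimal sketch with pyarrow (the file name is only an example):
import pyarrow.parquet as pq

parquet_file = pq.ParquetFile('yellow_tripdata_2023-01.parquet')   # example file name
for batch in parquet_file.iter_batches(batch_size=100_000):        # read 100k rows at a time
    chunk = batch.to_pandas()
    # process each chunk here instead of loading the whole file into memory at once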
# Kernel getting killed during assignment tasks on local
If the Jupyter notebook kernel gets killed repeatedly due to out-of-memory issues when converting a Pandas DataFrame to a dictionary or other memory-intensive steps, try using Google Colab as it offers more memory.
Here's how you can proceed:
Upload the datasets to Google Drive in the folder "Colab Notebooks."
Mount the drive on Colab:
from google.colab import drive
drive.mount('/content/drive')
Pull the data from uploaded tables in Colab:
df_jan = pq.read_table('/content/drive/My Drive/Colab Notebooks/yellow_tripdata_2023-01.parquet').to_pandas()
Complete the assignment in Colab.
Download the final assignment to your local machine and copy it into the relevant repository.
# What is the difference between label and one-hot encoding?
Two main encoding approaches are generally used to handle categorical data: label encoding and one-hot encoding.
Label Encoding: Assigns each categorical value an integer based on alphabetical order. Suitable for categorical data with a logical order, such as a rating system.
One-Hot Encoding: Creates new variables using 0s and 1s to represent original categorical data. Useful when there is no inherent order or logic to the categories.
Tools and Implementation
Sci-kit Learn:
- Dictionary Vectorizer (DictVectorizer): handles categorical data and generates arrays based on unique instances in a DataFrame or other data structures.
- OneHotEncoder class: specifically for applying one-hot encoding.
Pandas:
- pd.get_dummies(): similar functionality for one-hot encoding.
Note: Sometimes you need to cast the columns to object/string type before applying one-hot encoding, especially when the data has a numeric structure that would otherwise suggest label encoding, which can be limiting for some applications.
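A minimal sketch contrasting the two tools mentioned above (the toy dataframe is made up purely for illustration):
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'color': ['red', 'green', 'red']})

# pandas: one 0/1 column per category
dummies = pd.get_dummies(df['color'])

# scikit-learn: same idea, but returns a (sparse) matrix that fits into ML pipelines
ohe = OneHotEncoder(handle_unknown='ignore')
X = ohe.fit_transform(df[['color']])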
# Distplot takes too long
First, remove the outliers (trips with unusual duration) before plotting.
# RMSE on test set too high
Problem
RMSE on the test set was too high when hot-encoding the validation set with an OneHotEncoder(handle_unknown='ignore') previously fitted on the training set. In contrast, DictVectorizer yielded the correct RMSE.
Explanation
In principle, both transformers should behave identically when treating categorical features, especially in scenarios where there are no sequences of strings in each row (as in this week’s homework):
- Features are put into binary columns encoding their presence (1) or absence (0).
- Unknown categories are imputed as zeros in the hot-encoded matrix.
This discrepancy indicates that there might be a difference in how OneHotEncoder and DictVectorizer handle the data after fitting on the training set and applying to the validation set.
# DictVectorizer: Alexey's answer
In summary:
- pd.get_dummies or One-Hot Encoding (OHE) can produce results in different orders and handle missing data differently, potentially causing train and validation sets to have different columns.
- DictVectorizer will ignore missing values (during training) and new values (during validation) in datasets.
# Why did we not use OneHotEncoder(sklearn) instead of DictVectorizer?
There are several reasons for choosing DictVectorizer over OneHotEncoder:
- Simple One-Step Process: DictVectorizer provides a straightforward method to encode both categorical and numerical features from dictionaries, outputting directly to a sparse matrix.
- Ideal for ML Pipelines: The direct output in sparse matrix format makes DictVectorizer a good fit for machine learning pipelines without needing additional preprocessing.
- Use Cases:
- Use OneHotEncoder if you need full control, are working with sklearn pipelines, or need to handle unknown categories safely.
- Use DictVectorizer when your data is in dictionary format (e.g., JSON or from APIs) and you aim for quick integration into the pipeline.
# Clipping outliers
How to check that we removed the outliers?
Use the pandas describe() function, which provides a report of the data distribution along with summary statistics. For example, after clipping the outliers using a boolean expression, the min and max can be verified using:
df['duration'].describe()
# Replacing NaNs for pickup location and drop off location with -1 for One-Hot Encoding
pd.get_dummies and DictVectorizer both create a one-hot encoding on string values. Therefore, you need to convert the values in PUlocationID and DOlocationID to string.
If you convert the values in PUlocationID and DOlocationID from numeric to string, the NaN values get converted to the string "nan". With DictVectorizer, the RMSE is the same whether you use "nan" or "-1" as the string representation for the NaN values. Therefore, the representation doesn't have to be "-1" specifically; it could also be some other string.
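A minimal sketch of that conversion (column names follow the FHV schema; any placeholder string would work equally well):
categorical = ['PUlocationID', 'DOlocationID']
# NaN -> -1 -> "-1"; a plain .astype(str) would instead produce the string "nan"
df[categorical] = df[categorical].fillna(-1).astype('int').astype('str')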
# Slightly different RMSE
Problem: My LinearRegression RMSE is very close to the answer but not exactly the same. Is this normal?
Answer: No, LinearRegression is a deterministic model; it should always output the same results when given the same inputs.
Check the following:
- Ensure outliers are properly treated in both the train and validation sets.
- Verify that one-hot encoding is correctly applied by inspecting the shape of the one-hot encoded feature matrix. If it shows 2 features, there may be an issue.
- Hint: convert the drop-off and pick-up codes to the proper data format (strings) before fitting with DictVectorizer.
# Extremely low RMSE
Problem: I’m facing an extremely low RMSE score (e.g., 4.3451e-6) - what should I do?
Answer:
- Recheck your code to see if your model is inadvertently learning the target before making predictions.
- Ensure that the target variable is not included as a parameter while fitting the model. Including it can result in misleadingly low scores.
- Verify that X_train does not contain any part of your y_train. This applies to the validation set as well.
- Adjust your data handling to avoid data leakage between your features and the target.
# Enabling Auto-completion in Jupyter Notebook
Problem: How to enable auto-completion in Jupyter Notebook? Tab doesn’t work.
Solution:
You can enable auto-completion by running the following command:
!pip install --upgrade jedi==0.17.2
# Downloading the data from the NY Taxis datasets gives error: 403 Forbidden
Problem: While following the steps in the videos, you may encounter a 403 Forbidden error when trying to download files using wget.
Solution:
The issue occurs because the links point to files on cloudfront.net. An example of such a link is:
https://d37ci6vzurychx.cloudfront.net/trip+data/green_tripdata_2021-01.parquet
Instead of downloading the dataset with wget, you can read it directly from the dataset URL in your code.
Update (27-May-2023):
- You can now download the data from the official NYC trip record page: TLC Trip Record Data.
- Go to the page, right-click, and use "copy link" to get the URL since the URL provided might change if NYC updates their system.
Example command:
wget https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2021-01.parquet
# Using PyCharm & Conda env in remote development
Problem: PyCharm (remote) doesn’t see the conda execution path, preventing the use of a conda environment located on a remote server.
Solution:
On the remote server's command line, run:
conda activate envname
Then, execute:
which python
This will provide the Python execution path.
Use this path to add a new interpreter in PyCharm:
- Add local interpreter.
- Select system interpreter.
- Enter the path obtained from the previous step.
# Running out of memory
Problem: The output of DictVectorizer was consuming too much memory, making it impossible to fit the linear regression model without running out of memory on a 16 GB machine.
Solution:
- In the example for DictVectorizer on the scikit-learn website, the parameter sparse is set to False. While this helps with viewing results, it greatly increases memory usage.
- To address this, either set sparse=True, or leave it at the default setting, which is also True.
By using sparse=True, memory usage will be reduced, allowing for more efficient model fitting.
# Activating Anaconda env in .bashrc
Problem: Installing Anaconda didn't modify the .bashrc profile. This means the Anaconda environment was not activated after exiting and relaunching the Unix shell.
Solution:
For Bash:
- Initiate conda again, which will add entries for Anaconda to the .bashrc file:
cd YOUR_PATH_ANACONDA/bin
./conda init bash
- This will automatically edit your .bashrc.
Reload:
source ~/.bashrc
# The feature size is different for training set and validation set
While working through HW1, you may notice that the feature sizes for the training and validation datasets are different. This issue often arises when using the incorrect method with a dictionary vectorizer.
Ensure you use the transform method on the premade dictionary vectorizer instead of fit_transform. Since you already have the dictionary vectorizer created, there's no need to execute the fit pipeline on the model.
# Permission denied (publickey) Error (when you remove your public key on the AWS machine)
If you encounter a "Permission denied (publickey)" error after removing your public key from an AWS machine, follow these steps:
Access your machine via Session Manager to recreate your public key. Refer to the guide for more details: Fix Permission Denied Errors.
To retrieve your old public key, use this command:
ssh-keygen -y -f /path_to_key_pair/my-key-pair.pem
Replace /path_to_key_pair/my-key-pair.pem with the actual path to your key pair. For additional instructions on retrieving the public key, consult the AWS documentation: Retrieving the Public Key.
# Overfitting: Absurdly high RMSE on the validation dataset
Problem: The February dataset has been used as a validation/test dataset and stripped of the outliers in a similar manner to the train dataset (taking only the rows for the duration between 1 and 60, inclusive). The RMSE obtained afterward is in the thousands.
Solution:
- Ensure that the sparse matrix result from DictVectorizer is not turned into an ndarray. After removing that part of the code, a correct result was achieved.
If there are further issues, carefully review each preprocessing step to ensure consistency between training and validation datasets.
# Can’t import sklearn
If you encounter an error when trying to import sklearn, specifically:
from sklearn.feature_extraction import DictVectorizer
You can resolve it by installing scikit-learn with the following command:
!pip install scikit-learn
# Install docker in WSL2 without installing Docker Desktop
If you want to install Docker in WSL2 on Windows without Docker Desktop, follow these steps:
Install Docker
You can ignore the warnings during installation.
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
Add Your User to the Docker Group
sudo usermod -aG docker $USER
Enable the Docker Service
sudo systemctl enable docker.service
Test the Installation
Verify that both Docker and Docker Compose are installed successfully.
docker --version
docker compose version
docker run hello-world
Ensure Docker Starts Automatically
If the service does not start automatically after restarting WSL, update your .profile or .zprofile file with:
if grep -q "microsoft" /proc/version > /dev/null 2>&1; then
    if service docker status 2>&1 | grep -q "is not running"; then
        wsl.exe --distribution "${WSL_DISTRO_NAME}" --user root \
            --exec /usr/sbin/service docker start > /dev/null 2>&1
    fi
fi
# Zero elements in sparse matrix (AKA when dictionary vectorizer / categorical X transformation fails)
Seeing a message like:
<2855951x515 sparse matrix of type '<class 'numpy.float64'>' with 0 stored elements in Compressed Sparse Row format>
This issue might occur because your variables, intended for vectorization, were imported as floating point numbers rather than integers. This can lead to nonsensical models. To resolve this, convert your data with the following code (assuming dg is your dataframe and categorical stores the names of your variables to be vectorized):
dg[categorical] = dg[categorical].round(0).astype(int).astype(str)
# Using a docker image as development environment (Linux)
If you don’t want to install Anaconda locally and prefer not to use Codespace or a VPS, you can create and run a Docker image locally.
For this, use the following Dockerfile:
FROM docker.io/bitnami/minideb:bookworm
RUN install_packages wget ca-certificates vim less silversearcher-ag
# Uncomment the `COPY` and the commented `RUN bash /tmp/...` lines below, and comment out
# the `RUN wget ...` line, if you have downloaded Anaconda manually
# (I did this to save bandwidth when experimenting with the image creation)
RUN wget https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-x86_64.sh && bash Anaconda3-2022.05-Linux-x86_64.sh -b -p /opt/anaconda3
#COPY Anaconda3-2022.05-Linux-x86_64.sh /tmp/Anaconda3-2022.05-Linux-x86_64.sh
#RUN bash /tmp/Anaconda3-2022.05-Linux-x86_64.sh -b -p /opt/anaconda3 && \
#    rm /tmp/Anaconda3-2022.05-Linux-x86_64.sh
ENV PATH="/opt/anaconda3/bin:$PATH" \
HOME="/app"
EXPOSE 8888
WORKDIR /app
USER 1001
ENTRYPOINT [ "jupyter", "notebook", "--ip", "0.0.0.0" ]
Build the image using:
docker build -f Dockerfile -t mlops:v0 .
Then you can run it with:
mkdir app
chmod -R 777 app
docker run --name jupyter -p 8888:8888 -v ./app:/app mlops:v0
In the logs, you will see the Jupyter URL needed to access the environment. The files you create will be stored in the app directory.
# Use uv as a package manager
There is an option to run the project without Anaconda while easily managing multiple Python versions on your machine. The new package manager, uv, is a fast and efficient one written in Rust. It's recommended for use in Python projects overall. Install guide
uv venv --python 3.9.7 # install python 3.9.7 used in the course
source .venv/bin/activate # activate the environment
python -V # should be 3.9.7
uv pip install pandas scikit-learn notebook seaborn pyarrow # install required packages
jupyter notebook # run jupyter notebook
Cleanup is straightforward. Deactivate the environment and delete the folder:
deactivate
rm -rf .venv
# I get `TypeError: got an unexpected keyword argument 'squared'` when using `mean_squared_error(..., squared=False)`. Why?
The squared parameter was added in scikit-learn 0.22. In earlier versions, it is not recognized, which causes the TypeError.
To compute RMSE in older versions:
- Use np.sqrt(mean_squared_error(...)).
In scikit-learn 1.4 and later, you can use the dedicated function instead:
from sklearn.metrics import root_mean_squared_error
rmse_value = root_mean_squared_error(y_train, y_pred)
print('RMSE:', rmse_value)
This approach is more explicit and convenient.
# Visualizing outliers in large datasets with Seaborn: Boxplot vs Histplot
seaborn.boxplot is generally faster because it uses a small set of summary statistics (min, Q1, median, Q3, max) to represent the data, which requires less computational effort, especially for large datasets.
seaborn.histplot can be slower, particularly with large datasets, because it needs to bin the data and compute frequency counts for each bin, which involves more processing.
So, if speed is a concern, especially with large datasets, boxplots are typically faster than histograms.
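For example (assuming df is the trips dataframe with a duration column):
import seaborn as sns

sns.boxplot(x=df['duration'])           # summary statistics only, so it stays fast on millions of rows
sns.histplot(df['duration'], bins=50)   # bins every value, so it is noticeably slower on large data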
# Reading parquet files with Pandas (pyarrow dependency)
Error
A module that was compiled using NumPy 1.x cannot be run in NumPy 2.2.4 as it may crash.
AttributeError: module 'pyarrow' has no attribute '__version__'
Solution
Downgrade the version of your numpy:
pip uninstall numpy -y
conda remove numpy --force
conda clean --all -y
conda install numpy=1.26 -y
Module 2: Experiment Tracking
# Kernel died during Model Training on Github Codespaces
While training the model in Jupyter Notebook on GitHub Codespaces, the Jupyter kernel may die. To resolve this, upgrade the machine type in Codespaces from 8 cores to 14 cores. It is free to upgrade, but be aware that you will use more hours.
# Do we absolutely need to save data to disk? Can we use it directly from download?
No, you don't have to save the data to disk; you can read it directly from a URL. For example, you can use pandas to read data from a URL.
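A minimal sketch (reusing the CloudFront link shown elsewhere in this FAQ):
import pandas as pd

url = "https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_2021-01.parquet"
df = pd.read_parquet(url)   # pandas fetches the file over HTTP; no local copy is saved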
# Access Denied at Localhost:5000 - Authorization Issue
Problem
Localhost:5000 Unavailable // Access to Localhost Denied // You don’t have authorization to view this page (127.0.0.1:5000)
Solution
If you are using Chrome, follow these steps:
- Navigate to chrome://net-internals/#sockets.
- Press "Flush Socket Pools".
# Connection in use: ('127.0.0.1', 5000)
You have something running on the 5000 port. You need to stop it. Here are some ways to resolve the issue:
Using Terminal on Mac:
Run the command:
ps -A | grep gunicorn
Identify the process ID (the first number after running the command).
Kill the process using the ID:
kill 13580
where 13580 represents the process number.
To Kill All Processes Using Port 5000:
sudo fuser -k 5000/tcp
Alternative Command to Kill the Running Port:
kill -9 $(ps -A | grep python | awk '{print $1}')
Change to a Different Port:
mlflow ui --backend-store-uri sqlite:///mlflow.db --port 5001
For more information, refer to the source.
# Could not convert string to float - ValueError
Running python register_model.py results in the following error:
ValueError: could not convert string to float: '0 int\n1 float\n2 hyperopt_param\n3 Literal{n_estimators}\n4 quniform\n5 Literal{10}\n6 Literal{50}\n7 Literal{1}'
Full Traceback:
Traceback (most recent call last):
File "/Users/name/Desktop/Programming/DataTalksClub/MLOps-Zoomcamp/2. Experiment tracking and model management/homework/scripts/register_model.py", line 101, in <module>
run(args.data_path, args.top_n)
File "/Users/name/Desktop/Programming/DataTalksClub/MLOps-Zoomcamp/2. Experiment tracking and model management/homework/scripts/register_model.py", line 67, in run
train_and_log_model(data_path=data_path, params=run.data.params)
File "/Users/name/Desktop/Programming/DataTalksClub/MLOps-Zoomcamp/2. Experiment tracking and model management/xfsub/scripts/register_model.py", line 41, in train_and_log_model
params = space_eval(SPACE, params)
File "/Users/name/miniconda3/envs/mlops-zoomcamp/lib/python3.9/site-packages/hyperopt/fmin.py", line 618, in space_eval
rval = pyll.rec_eval(space, memo=memo)
File "/Users/name/miniconda3/envs/mlops-zoomcamp/lib/python3.9/site-packages/hyperopt/pyll/base.py", line 902, in rec_eval
rval = scope._impls[node.name](*args, **kwargs)
ValueError: could not convert string to float: '0 int\n1 float\n2 hyperopt_param\n3 Literal{n_estimators}\n4 quniform\n5 Literal{10}\n6 Literal{50}\n7 Literal{1}'
Solution:
There are two plausible errors related to the hpo.py file where hyper-parameter tuning is run. The objective function should be structured as follows:
Ensure the with statement and the log_params function are correctly applied to log all runs and parameters:
def objective(params):
    with mlflow.start_run():
        mlflow.log_params(params)
        rf = RandomForestRegressor(**params)
        rf.fit(X_train, y_train)
        y_pred = rf.predict(X_valid)
        rmse = mean_squared_error(y_valid, y_pred, squared=False)
        mlflow.log_metric('rmse', rmse)
Add the with statement immediately before the function, just after:
X_valid, y_valid = load_pickle(os.path.join(data_path, "valid.pkl"))
Log parameters just after defining the search_space dictionary:
search_space = {....}
mlflow.log_params(search_space)
Logging parameters in groups can lead to issues because register_model.py expects to receive parameters individually. Ensure the objective function matches the example above.
# Experiment not visible in MLflow UI
Make sure you launch the MLflow UI from the same directory as the code that is running the experiments (the same directory that contains the mlruns directory and the database that stores the experiments).
Or, navigate to the correct directory when specifying the tracking_uri.
For example:
If the mlflow.db is in a subdirectory called database, the tracking URI would be:
sqlite:///database/mlflow.db
If the mlflow.db is a directory above your current directory, the tracking URI would be:
sqlite:///../mlflow.db
Another alternative is to use an absolute path to mlflow.db rather than a relative path.
You can also launch the UI from the same notebook by executing the following code cell:
import subprocess
MLFLOW_TRACKING_URI = "sqlite:///data/mlflow.db"
subprocess.Popen(["mlflow", "ui", "--backend-store-uri", MLFLOW_TRACKING_URI])
Then, use the same MLFLOW_TRACKING_URI when initializing MLflow or the client:
from mlflow.tracking import MlflowClient
client = MlflowClient(tracking_uri=MLFLOW_TRACKING_URI)
mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)
# Metrics not visible in mlflow UI
I encountered the following issue: I was able to run experiments and the different model parameters were visible. However, the metrics, including the "handmade" metric rmse in the training script, were not visible (empty field).
I solved my problem by making sure to specify the "key" and "value" explicitly when using mlflow.log_metric:
mlflow.log_metric(key="rmse", value=rmse)
# Unable to create new Experiment
Following the instructions in the video did not work, even though the Jupyter notebook indicates it was successfully created.
It is recommended to set the tracking URI to the server address directly. This discrepancy might be due to differences in the mlflow package versions between the video and the latest version we are using. The documentation for the latest mlflow package suggests setting the URI as follows:
mlflow.set_tracking_uri(uri="http://127.0.0.1:5000")
# Hash Mismatch Error with Package Installation
Problem:
When attempting to install MLflow using pip install mlflow, an error occurs related to a hash mismatch for the Numpy package:
ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE.
Error Details:
During the installation on 27th May 2022, the following occurred while Numpy was being installed:
Collecting numpy
Downloading numpy-1.22.4-cp310-cp310-win_amd64.whl (14.7 MB)
|██████████████ | 6.3 MB 107 kB/s eta 0:01:19
ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE.
If you have updated the package versions, please update the hashes. Otherwise, examine the package contents carefully; someone may have tampered with them.
- Expected SHA256:
3e1ffa4748168e1cc8d3cde93f006fe92b5421396221a02f2274aab6ac83b077
- Got:
15e691797dba353af05cf51233aefc4c654ea7ff194b3e7435e6eec321807e90
Solution:
Install Numpy Separately:
Try installing Numpy separately using:
pip install numpy
Install MLFlow:
After successfully installing Numpy, proceed with reinstalling MLFlow:
pip install mlflow
This approach resolved the issue in this instance, although the problem may not be consistently reproducible. Be aware that similar hash mismatch errors might occur during package installations.
# How to Delete an Experiment Permanently from MLFlow UI
After deleting an experiment from the UI, it may still persist in the database. To delete this experiment permanently, follow these steps:
Install ipython-sql:
pip install ipython-sql
Load the SQL magic extension in Jupyter Notebook:
%load_ext sql
Load the database (replace nameofdatabase.db with your actual database name):
%sql sqlite:///nameofdatabase.db
Run SQL Script
Use SQL commands to delete the experiment permanently. Refer to this link for a detailed guide.
# How to Update Git Public Repo Without Overwriting Changes
Problem: I cloned the public repo, made edits, committed, and pushed them to my own repo. Now I want to get the recent commits from the public repo without overwriting my own changes to my own repo. Which command(s) should I use?
Below is the Git configuration:
[core]
repositoryformatversion = 0
filemode = true
bare = false
logallrefupdates = true
ignorecase = true
precomposeunicode = true
[remote "origin"]
url = git@github.com:my_username/mlops-zoomcamp.git
fetch = +refs/heads/*:refs/remotes/origin/*
[branch "main"]
remote = origin
merge = refs/heads/main
Solution:
- Fork the original repository from DataTalksClub instead of cloning it directly.
- On GitHub, navigate to your forked repository.
- Click “Fetch and Merge” under the “Fetch upstream” menu on the main page of your own repository.
# Image size of 460x93139 pixels is too large. It must be less than 2^16 in each direction.
This issue is caused by mlflow.xgboost.autolog() in version 1.6.1 of XGBoost. To resolve this:
- Downgrade XGBoost to version 1.6.0 using the following command:
pip install xgboost==1.6.0
- Alternatively, update your requirements file to specify xgboost==1.6.0.
# MlflowClient object has no attribute 'list_experiments'
Since version 1.29, the list_experiments method was deprecated, and it was removed in later versions; its replacement is MlflowClient.search_experiments().
To register the best model, you can use the following code:
# Register the best model
model_uri = f"runs:/{best_run.info.run_id}/model"
mlflow.register_model(model_uri=model_uri, name="RandomForestBestModel")
For more details, refer to the Mlflow documentation.
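If you only need the old listing behaviour, search_experiments is the direct replacement; a minimal sketch (the SQLite URI is an assumption):
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="sqlite:///mlflow.db")
for exp in client.search_experiments():      # replaces the removed list_experiments()
    print(exp.experiment_id, exp.name)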
# MLflow Autolog not working
Make sure mlflow.autolog() (or the framework-specific autolog) is called before with mlflow.start_run(), not after.
Also, ensure that all dependencies for the autologger are installed, including matplotlib. A warning about uninstalled dependencies will be raised.
# MLflow URL ([127.0.0.1:5000](http://127.0.0.1:5000)) doesn't open.
If you’re running MLflow on a remote VM, you need to forward the port too, like we did in Module 1 for the Jupyter notebook port 8888. Simply connect your server to VS Code, as we did, and add 5000 to the PORT.
If you are running MLflow locally and 127.0.0.1:5000 shows a blank page, navigate to localhost:5000 instead.
# MLflow.xgboost Autolog Model Signature Failure
Got the same warning message as Warrie Warrie when using mlflow.xgboost.autolog():
It turned out that this was just a warning message, and upon checking MLflow UI (making sure that no "tag" filters were included), the model was actually automatically tracked in MLflow.
# MlflowException: Unable to Set a Deleted Experiment
raise MlflowException(
mlflow.exceptions.MlflowException: Cannot set a deleted experiment 'random-forest-hyperopt' as the active experiment. You can restore the experiment, or permanently delete the experiment to create a new one.
To resolve this issue, consider the following options:
Restore or Permanently Delete the Experiment: Refer to guidance on Stack Overflow for methods to permanently delete an experiment in MLflow.
Command Line Resolution: If you have deleted the experiment from the MLflow UI, run the following command in the CLI. Make sure to use the correct database filename.
mlflow gc --backend-store-uri sqlite:///backend.db
Ensure .trash is Empty: If the above command does not work and your .trash folder is already empty, confirm this by executing:
rm -rf mlruns/.trash/*
Note: Ensure no files remain in .trash/ that could be interfering with the experiment reset.
# MlflowException: Unable to Set a Deleted Experiment with Postgres backend
If you're using a Postgres backend locally or remotely and don't want to delete the entire backend, you can run this script to permanently delete an experiment. The script assumes you have a separate env.py file to retrieve your environment variables.
import os
import sys
import psycopg2
sys.path.insert(0, os.getcwd())
from env import DB_NAME, DB_PASSWORD, DB_PORT, DB_USER
def perm_delete_exp():
connection = psycopg2.connect(
database=DB_NAME,
user=DB_USER,
password=DB_PASSWORD,
host="localhost",
port=int(DB_PORT)
)
with connection.cursor() as cursor:
queries = """
DELETE FROM experiment_tags WHERE experiment_id=ANY(SELECT experiment_id FROM experiments where lifecycle_stage='deleted');
DELETE FROM latest_metrics WHERE run_uuid=ANY(SELECT run_uuid FROM runs WHERE experiment_id=ANY(SELECT experiment_id FROM experiments where lifecycle_stage='deleted'));
DELETE FROM metrics WHERE run_uuid=ANY(SELECT run_uuid FROM runs WHERE experiment_id=ANY(SELECT experiment_id FROM experiments where lifecycle_stage='deleted'));
DELETE FROM tags WHERE run_uuid=ANY(SELECT run_uuid FROM runs WHERE experiment_id=ANY(SELECT experiment_id FROM experiments where lifecycle_stage='deleted'));
DELETE FROM params WHERE run_uuid=ANY(SELECT run_uuid FROM runs where experiment_id=ANY(SELECT experiment_id FROM experiments where lifecycle_stage='deleted'));
DELETE FROM runs WHERE experiment_id=ANY(SELECT experiment_id FROM experiments where lifecycle_stage='deleted');
DELETE FROM datasets WHERE experiment_id=ANY(SELECT experiment_id FROM experiments where lifecycle_stage='deleted');
DELETE FROM experiments where lifecycle_stage='deleted';
"""
for query in queries.splitlines()[1:-1]:
cursor.execute(query.strip())
connection.commit()
connection.close()
if __name__ == "__main__":
perm_delete_exp()
# No Space Left on Device - OSError[Errno 28]
You do not have enough disk space to install the requirements. Here are some solutions:
Increase EBS Volume on AWS: Follow this guide to increase the base EBS volume.
Add an External Disk on AWS: Add and configure an external disk to your instance, then configure conda installation to happen on this external disk.
Add Persistent Disk on GCP:
- Add another disk to your VM and follow this guide to mount the disk.
- Confirm the mount by running the following command in the bash shell:
df -H
- Delete Anaconda and use Miniconda instead. Download Miniconda on the additional disk that you mounted.
- During the Miniconda installation, enter the path to the extra disk instead of the default disk, so that conda is installed on the extra disk.
# Homework: Parameters Mismatch in Homework Q3
I was using an old version of sklearn, which caused a mismatch in the number of parameters. In the latest version, min_impurity_split for RandomForestRegressor was deprecated. Upgrading to the latest version resolved the issue.
# Protobuf error when installing MLflow
Error:
I installed all the libraries from the requirements.txt document in a new environment with the following command:
pip install -r requirements.txt
Then, when I run mlflow from my terminal like this:
mlflow
I get this error:
Solution:
You need to downgrade the protobuf module to version 3.20.x or lower. Initially, it was version 4.21. Use the following command to install the compatible version:
pip install protobuf==3.20
After doing this, I was able to run mlflow from my terminal.
# SSH: Connection to AWS EC2 instance from local machine WSL getting terminated frequently within a minute of inactivity.
If the SSH connection from your local machine’s WSL to an AWS EC2 instance is frequently getting terminated after a short period of inactivity, you might see the following message displayed:
To fix this issue, add the following lines to your config file in the .ssh directory of your WSL environment:
ServerAliveInterval 60
ServerAliveCountMax 3
For example, after adding these lines, your SSH configuration should look somewhat like this:
Host mlops-zoomcamp
HostName 45.80.32.7
User ubuntu
IdentityFile ~/.ssh/siddMLOps.pem
StrictHostKeyChecking no
ServerAliveInterval 60
ServerAliveCountMax 3
# Setting up Artifacts folders
Please check your current directory when running the mlflow ui command. You need to run the mlflow ui or mlflow server command in the right directory.
# Setting up MLflow experiment tracker on GCP
If you have problems setting up MLflow for experiment tracking on GCP, you can check these two links:
# Setuptools Replacing Distutils - MLflow Autolog Warning
Downgrade setuptools:
- Change from version 62.3.2 to 49.1.0
# Sorting runs in MLflow UI
I can’t sort runs in MLflow
Make sure you are in table view (not list view) in the MLflow UI.
# TypeError: send_file() unexpected keyword 'max_age' during MLflow UI Launch
Problem: When running $ mlflow ui on a remote server and attempting to open it in a local browser, the following exception occurs, and the MLflow UI page does not load.
Solution:
Uninstall Flask on your remote server by using:
pip uninstall flask
Reinstall Flask with:
pip install Flask
This issue arises because the base conda environment includes a version of Flask that's less than 1.2. Cloning this environment retains the older version, causing the error. Installing a newer version of Flask resolves the issue.
# mlflow ui on Windows FileNotFoundError: [WinError 2] The system cannot find the file specified
Problem: After successfully installing mlflow using pip install mlflow on a Windows system, running the mlflow ui command results in the error:
FileNotFoundError: [WinError 2] The system cannot find the file specified
Solution:
Add C:\Users\{User_Name}\AppData\Roaming\Python\Python39\Scripts to the PATH.
# Unsupported Operand Type Error in hpo.py
Running the command:
python hpo.py --data_path=./your-path --max_evals=50
leads to the following error:
TypeError: unsupported operand type(s) for -: 'str' and 'int'
Full Traceback:
File "~/repos/mlops/02-experiment-tracking/homework/hpo.py", line 73, in <module>
run(args.data_path, args.max_evals)
File "~/repos/mlops/02-experiment-tracking/homework/hpo.py", line 47, in run
fmin(
File "~/Library/Caches/pypoetry/virtualenvs/mlflow-intro-SyTqwt0D-py3.9/lib/python3.9/site-packages/hyperopt/fmin.py", line 540, in fmin
return trials.fmin(
File "~/Library/Caches/pypoetry/virtualenvs/mlflow-intro-SyTqwt0D-py3.9/lib/python3.9/site-packages/hyperopt/base.py", line 671, in fmin
return fmin(
File "~/Library/Caches/pypoetry/virtualenvs/mlflow-intro-SyTqwt0D-py3.9/lib/python3.9/site-packages/hyperopt/fmin.py", line 586, in fmin
rval.exhaust()
File "~/Library/Caches/pypoetry/virtualenvs/mlflow-intro-SyTqwt0D-py3.9/lib/python3.9/site-packages/hyperopt/fmin.py", line 364, in exhaust
self.run(self.max_evals - n_done, block_until_done=self.asynchronous)
Solution:
The --max_evals
argument in hpo.py
is not defined with a datatype, leading to it being interpreted as a string. It should be an integer to ensure the script functions correctly. Modify the argument definition as follows:
parser.add_argument(
"--max_evals",
type=int,
default=50,
help="the number of parameter evaluations for the optimizer to explore."
)
# Unsupported Scikit-Learn version
Getting the following warning when running mlflow.sklearn
:
2022/05/28 04:36:36 WARNING mlflow.utils.autologging_utils: You are using an unsupported version of sklearn. If you encounter errors during autologging, try upgrading / downgrading sklearn to a supported version, or try upgrading MLflow.
Solution:
- Use scikit-learn version between 0.24.1 and 1.4.2.
Reference: MLflow Documentation
# Mlflow CLI does not return experiments
Problem
CLI commands (mlflow experiments list
) do not return experiments.
Solution
You need to set the environment variable for the Tracking URI:
export MLFLOW_TRACKING_URI=http://127.0.0.1:5000
# Viewing MLflow Experiments using MLflow CLI
Problem:
After starting the tracking server, when trying to use the MLflow CLI commands as listed here, most commands can't find the experiments that have been run with the tracking server.
Solution:
Set the environment variable
MLFLOW_TRACKING_URI
to the URI of the SQLite database:export MLFLOW_TRACKING_URI=sqlite:///{path to sqlite database}
After setting the environment variable, you can view the experiments from the command line using commands like:
mlflow experiments search
Note: Commands like mlflow gc may still not pick up the tracking URI from the environment variable; you may need to pass it explicitly (e.g., via the --backend-store-uri argument) every time you run the command.
# Viewing SQLite Data Raw & Deleting Experiments Manually
All the experiment and other tracking information in MLflow are stored in an SQLite database provided while initiating the mlflow ui
command. This database can be inspected using PyCharm’s Database tab by selecting the SQLite database type.
Once the connection is created, the tables can be queried and inspected using standard SQL. The same applies to any SQL-backed database such as PostgreSQL.
This approach is useful to understand the entity structure of the data being stored within MLflow and is beneficial for systematic archiving of model tracking for extended periods.
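If you prefer the command line to an IDE, here is a minimal sketch using Python's built-in sqlite3 module (it assumes the backend file is named mlflow.db; the experiments table is part of MLflow's SQL backend schema):
import sqlite3

# Connect to the MLflow backend store (adjust the path to your setup)
conn = sqlite3.connect("mlflow.db")
cur = conn.cursor()

# List all tables created by MLflow
cur.execute("SELECT name FROM sqlite_master WHERE type='table'")
print([row[0] for row in cur.fetchall()])

# Inspect the experiments that have been tracked
for row in cur.execute("SELECT experiment_id, name FROM experiments"):
    print(row)

conn.close()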
# What does launching the tracking server locally mean?
Launching the tracking server locally means starting an MLflow server on your own machine rather than on a remote host. Running a dedicated tracking server is useful when multiple colleagues collaborate and need to connect to the same MLflow instance instead of each running MLflow individually on their laptops.
# Parameter adding in case of max_depth not recognized
Problem: Parameter was not recognized during the model registry.
Solution: Parameters should be logged before the model is registered. Use the following method to log them:
mlflow.log_params(params)
This logs the whole dictionary under the run's parameters (accessible via data.run.params).
# Max_depth is not recognize even when I add the mlflow.log_params
Problem:
Max_depth is not recognized even when I add the mlflow.log_params
.
Solution:
The mlflow.log_params(params) call should be added to the hpo.py script. If you re-run it without this change, the new runs are simply appended to the previous experiment, whose earlier runs don't contain the parameters. You should either:
- Remove the previous experiment, or
- Switch to a new experiment (for example, by changing the experiment name).
# AttributeError: 'tuple' object has no attribute 'tb_frame'
Problem: About week_2 homework: The register_model.py
script, when copied into a Jupyter notebook, fails and produces the following error:
AttributeError: 'tuple' object has no attribute 'tb_frame'
Solution: Remove click decorators.
# WandB API error
Problem: When running the preprocess_data.py
file, you encounter the following error:
wandb: ERROR api_key not configured (no-tty). call wandb.login(key=[your_api_key])
Solution:
Go to your WandB profile and navigate to user settings.
Scroll down to the “Danger Zone” and copy your API key.
Before running preprocess_data.py, add and run the following cell in your notebook:
%%bash
wandb login <YOUR_API_KEY_HERE>
# WARNING mlflow.xgboost: Failed to infer model signature: could not sample data to infer model signature: please ensure that autologging is enabled before constructing the dataset.
Please make sure you follow the order below, enabling autologging before constructing the dataset. If you still have this issue, check that your data is in a format compatible with XGBoost.
Enable MLflow autologging for XGBoost
mlflow.xgboost.autolog()
Construct your dataset
X_train, y_train = ...
Train your XGBoost model
import xgboost as xgb
model = xgb.XGBRegressor(...)
model.fit(X_train, y_train)
# Old version of glibc when running XGBoost
Starting from version 2.1.0, XGBoost distributes its Python package in two variants:
- manylinux_2_28: For recent Linux distributions with glibc 2.28 or newer. This variant includes all features, such as GPU algorithms and federated learning.
- manylinux2014: For older Linux distributions with glibc versions older than 2.28. This variant lacks support for GPU algorithms and federated learning.
If you're installing XGBoost via pip, the package manager automatically selects the appropriate variant based on your system's glibc version. Starting May 31, 2025, the manylinux2014 variant will no longer be distributed.
This means that systems with glibc versions older than 2.28 will not be able to install future versions of XGBoost via pip unless they upgrade their glibc version or build XGBoost from source.
# wget not working
Problem
When using the wget command on Windows to download data or Python scripts from a notebook in Visual Studio Code, the command was not recognized; despite having a Python virtual environment, the pip command was not recognized either.
Solution
- Use python -m pip; this applies to other commands as well, e.g., python -m wget.
# Open/run github notebook(.ipynb) directly in Google Colab
Problem: Open/run GitHub notebook (.ipynb) directly in Google Colab
Solution:
- Change the domain from
github.com
togithubtocolab.com
. The notebook will open in Google Colab.
- Note: This only works with public repositories.
# Navigating the WandB UI
Problem: Navigating the WandB UI was difficult for me; I had to guess at some options until I found the correct one.
Solution: Refer to the official documentation.
# Why do we use Jan/Feb/March for Train/Test/Validation Purposes?
We use this type of split approach instead of a random split to address specific needs in model evaluation, primarily focusing on seasonality and preventing data leakage.
Solution:
"Out of Time" Validations:
- Check for Seasonality:
- By using specific periods like Jan/Feb/March, we can assess if there are seasonal effects in the data.
- Example: If the RMSE for the test period is 5, but the RMSE for validation is 20, this indicates significant seasonality. This might suggest switching to Time Series approaches.
Prevent Data Leakage:
- When predicting future outcomes, a "random sample" train/test split can introduce data leakage, resulting in overfitting and poor model performance in production.
- It's crucial not to use future information when predicting the present in a model context.
Approach:
- Train: January
- Test: February
- Validate: March
The validation process is essential for reporting model metrics to leadership, regulators, auditors, and for analyzing target drift in the models.
(Problem and approach discussed were provided by an internal source.)
# WARNING: mlflow.sklearn: Failed to log training dataset information to MLflow Tracking.
Problem
When using MLflow’s autolog function, you may encounter the following warning:
WARNING mlflow.sklearn: Failed to log training dataset information to MLflow Tracking. Reason: 'numpy.ndarray' object has no attribute 'toarray'
This occurs because the autolog function attempts to log your dataset and ends up calling .toarray() on it (as it would for a sparse matrix). The course code provides a numpy.ndarray, which is already a dense array and has no toarray attribute, so the dataset-logging step fails.
Solution
Since we are not processing datasets in this zoomcamp, use the following parameter in the autolog function to prevent logging datasets:
log_datasets = False
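For example, with scikit-learn autologging in recent MLflow versions, this would look like the minimal sketch below (the rest of your training code stays unchanged):
import mlflow

# Keep autologging for params, metrics and models, but skip dataset logging
mlflow.sklearn.autolog(log_datasets=False)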
# mlflow server on AWS with S3 and Postgres fails: urllib3 v2.0 only supports OpenSSL 1.1.1+
Problem
Error when running the mlflow server on AWS CLI with an S3 bucket and POSTGRES database:
Reproducible Command:
mlflow server -h 0.0.0.0 -p 5000 --backend-store-uri postgresql://<DB_USERNAME>:<DB_PASSWORD>@<DB_ENDPOINT>:<DB_PORT>/<DB_NAME> --default-artifact-root s3://<BUCKET_NAME>
Error Message:
ImportError: urllib3 v2.0 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'OpenSSL 1.0.2k-fips 26 Jan 2017'. See: [GitHub](https://github.com/urllib3/urllib3/issues/2168)
Solution
Upgrade mlflow
to address compatibility issues:
pip3 install --upgrade mlflow
Resolution
This process will downgrade urllib3 from version 2.0.3 to 1.26.16, ensuring compatibility with mlflow and the 'ssl' module built against OpenSSL 1.0.2. You should see the following output after the upgrade:
Installing collected packages: urllib3
Attempting uninstall: urllib3
Found existing installation: urllib3 2.0.3
Uninstalling urllib3-2.0.3:
Successfully uninstalled urllib3-2.0.3
Successfully installed urllib3-1.26.16
# ImportError: urllib3 v2.0 only supports OpenSSL 1.1.1+
If you're encountering an error while running S3 buckets, ensure to resolve the dependencies issue by downgrading urllib3 to a compatible version:
pip3 install "urllib3<1.27"
# AttributeError: 'MlflowClient' object has no attribute 'list_run_infos'
Problem: In the scenario 2 notebook, the error
AttributeError: 'MlflowClient' object has no attribute 'list_run_infos'
is thrown when running:
run_id = client.list_run_infos(experiment_id='1')[0].run_id
Solution: Use the following code instead:
run_id = client.search_runs(experiment_ids='1')[0].info.run_id
Scenario: This solution works for MLflow version 2.12.2 and might work for other recent versions as of May, 2024.
# When using Autologging, do I need to set a training parameter to track it on Mlflow UI?
No, in the official documentation it’s mentioned that autologging keeps track of the parameters even when you do not explicitly set them when calling .fit
.
You can run the training while setting only the parameters you care about, and still see all of the model's parameters in the MLflow UI.
# Hyperopt is not installable with Conda
Description
When setting up your virtual environment with
conda install --file requirements.txt
you may encounter the following error:
PackagesNotFoundError: The following packages are not available from current channels:
- hyperopt
Solution
Your conda installation might be out of date. You can update Conda with:
conda update -n base -c defaults conda
If updating does not solve the issue, consider installing the package via the Intel channel, as advised on the conda page:
conda install intel::hyperopt
# Error importing xgboost in python with OS mac: library not loaded: @rpath/libomp.dylib
To fix this error, run the following command:
brew install libomp
# Size limit when uploading to GitHub
To manage size limits effectively when uploading to GitHub, add the mlruns
and artifacts
directories to your .gitignore
, like this:
02-experiment-tracking/mlruns
02-experiment-tracking/running-mlflow-examples/mlruns
02-experiment-tracking/homework/mlruns
02-experiment-tracking/homework/artifacts
Module 3: Orchestration
# Why does MlflowClient no longer support list_experiments?
Older versions of MLflow used client.list_experiments()
, but in recent versions, this method was replaced.
Use client.search_experiments()
instead.
# Mage shortcut key to open Text Editor is not working on Windows
On Windows, use the shortcut key CTRL+WIN+.
.
For MacOS, the shortcut is CMD+.
.
# Mage: Pipeline breaks with `[Errno 2] No such file or directory: '/home/src/mage_data/{…} /.variables/{...}/output_1/object.joblib'`
- Export the pipeline as a zip file.
- Create a new Mage project.
- Import the pipeline zip to the new project.
Additionally, check the following:
- Review the logs of the upstream block that was expected to generate object.joblib. Ensure it completed successfully.
- Verify that the expected output (often named output_1) was created and saved.
- Check in the Mage UI or directly in the file system (if accessible) to confirm whether the file exists in the .variables directory for that upstream block.
# Docker: Update docker-compose to initiate Mage
When running ./scripts/start.sh
, the following error occurs:
ERROR: The Compose file './docker-compose.yml' is invalid because:
Unsupported config option for networks: 'app-network'
Unsupported config option for services: 'magic-platform'
Solution:
Download the latest version of Docker Compose
sudo curl -L "https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
Apply executable permissions to the binary
sudo chmod +x /usr/local/bin/docker-compose
# Mage in Codespaces in a subfolder under mlops-zoomcamp repository
Issue 1: Errors such as:
[+] Running 1/1
✘ magic-database Error too many requests: You have reached your pull rate limit. You may increase the limit by authenticating and upgra...
Error response from daemon: too many requests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: [docker.com](https://www.docker.com/increase-rate-limit)
Issue 2: Popups with different percentage values indicating space is in single digits.
Solution: It is not recommended to set up Mage as a subfolder of mlops-zoomcamp. See findings in this thread for more information.
# Mage in Codespaces
The below errors seem to occur only when using Mage in Codespaces.
Errors
Error (1)
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
Error (2)
Error response from daemon: invalid volume specification: '/workspaces/mage-mlops:/:rw': invalid mount config for type "bind": invalid specification: destination can't be '/'
Solution for (1) & (2):
- Stay tuned…still testing.
- Running
docker info
and docker --version
works fine. - Executing
docker compose down
, stopping Codespaces, and reconnecting resolved the errors, though it might not be reproducible for everyone.
Error (3)
warning: unable to access '/home/codespace/.gitconfig': Is a directory
Solution (3):
This is targeted at 3.5.x Deploying with Mage. If not deploying:
Comment out line #20 in docker-compose.yml.
Place a dummy empty file named .gitconfig in your repo's root folder.
Copy it into the Dockerfile with this line, placed below line #9:
COPY .gitconfig /root/.gitconfig
The reason this happens is that when the file is missing, Docker auto-creates it as a directory instead of a file. Creating a dummy file prevents this.
# Mage updated in UI
When you see the mage version change in the UI after you’ve started the container, and you want to update, follow these steps. Read the release notes first to see if there’s a fix that affected your work and would benefit from an update.
If you want to remain in the previous version, it's also fine unless the fixes were specifically for our zoomcamp coursework (check the repository for any new instructions or PRs added).
Close the browser page.
In the terminal console, bring down the container:
docker compose down
Rebuild the container with the new mage image:
docker compose build --no-cache
Verify that you see:
[magic-platform 1/4] FROM docker.io/mageai/mageai:alpha
This means that the container is being rebuilt with a new version.
If the image is not updated, press
ctrl+c
to cancel the process and pull the image manually:docker pull mageai/mageai:alpha
Then rebuild.
Restart the docker container as before:
./scripts/start.sh
Note: This is the same sequence of steps if you want to switch to the latest tagged image instead of using the alpha image.
What do alpha and latest mean?
Latest is the fully released version ready for production use, and it has gone through verification, testing, QA, and whatever else the release cycle entails.
Alpha is the potentially buggy version with fresh new fixes and newly added features; but not yet put through the full beta test (if there’s one), integration testing, and other QA steps. Expect issues to occur.
# Mage Time Series Bar Chart Not Showing
import requests
from io import BytesIO
from typing import List
import numpy as np
import pandas as pd
if 'data_loader' not in globals():
from mage_ai.data_preparation.decorators import data_loader
@data_loader
def ingest_files(**kwargs) -> pd.DataFrame:
dfs: List[pd.DataFrame] = []
for year, months in [(2024, (1, 3))]:
for i in range(*months):
response = requests.get(
f'https://github.com/mage-ai/datasets/raw/master/taxi/green/{year}/{i:02d}.parquet'
)
if response.status_code != 200:
raise Exception(response.text)
df = pd.read_parquet(BytesIO(response.content))
# if time series chart on mage error, add code below
df['lpep_pickup_datetime_cleaned'] = df['lpep_pickup_datetime'].astype(np.int64) // 10**9
dfs.append(df)
return pd.concat(dfs)
# Mage: data_exporter block not taking all outputs from previous transformer block
I encountered this issue while trying to run the data_export
block that saves the dict vectorizer and the logs of the linear regression model into MLflow. My two distinct outputs were clearly created by the previous transformer block where the linear regression model is trained and the dict vectorizer is fitted to the training dataset.
I received this error while trying to run my export code:
Exception: Block mlflow_model_registry may be missing upstream dependencies. It expected to have 2 arguments, but only received 1. Confirm that the @data_exporter method declaration has the correct number of arguments.
The upstream outputs arrive as a single list whose two elements are the two outputs. I had to modify my data_exporter function to take only one argument and unpack the two variables from it:
Dv = output[0]
Lr = output[1]
This adjustment resolved the issue.
# Mage Dashboard on unit_3 is not showing charts
Error: Cannot cast DatetimeArray to dtype float64
Have the runs completed successfully? We need to have successfully running Pipelines in order to populate the mage and mlflow databases.
If all pipelines are successfully completed and you are still getting this error, please provide further information.
# Creating Helper functions in Mage
There’s no need to add the utility functions in each sub-project when you watch the videos as there only needs to be one set. Just verify the code is still the same as in Mage’s mlops repository.
As for the import statements:
from mlops.utils.[...] import [...]
All refer to the same path in the main mlops "parent" project:
/[mage-mlops-repository-name]/mlops/utils/...
# Video 3.2.1 - Various issues with Global Data Products
Refer to the following documentation for more details:
Issues and Solutions
Running the GDP Block Takes Forever
Exception:
Pipeline run xx for global data product training_set: failed
AttributeError: 'NoneType' object has no attribute 'to_dict'
Potential Causes and Solutions:
Ensure Project and Pipeline Matching:
Make sure the following configurations are correct:
"project": "unit_2_training", "repo_path": "/home/src/mlops/unit_2_training",
Restart Steps:
- Interrupt and restart the Kernel from the Run menu.
- Bring Docker down and restart it via the script.
Recreate Everything (if above steps fail):
- Remove connections from the hyperparameter_tuning/sklearn block in the Tree panel to its upstream blocks: click on the connector → Remove Connection.
- Remove the Global Data Product block from the Tree panel: right click → Delete Block (ignore dependencies).
- Click on All blocks, select Global Data Products, and drag and drop this block so it is the first block in the pipeline.
- Rename the block to the name used in the video.
- Run the block to test it (Play button or Ctrl+Enter).
Note
If helpful, repeat similar steps for the file in path "unit_3_observability." There is an ongoing attempt to replicate this process.
Error with Creating Global Data Product on Mage
Error:
AttributeError: 'NoneType' object has no attribute 'to_dict'
Solution:
Global Data Products currently do not work across projects. You need to create the data preparation pipeline in unit_2_training and configure the Global Data Product to build from it.
# How do you remove a global data product?
There is no way to remove it through the UI. You need to manually edit global_data_products.yaml, which is stored in your project directory. You can do this through the Text Editor.
# Error: TypeError: string indices must be integers
If you've removed and re-added blocks, especially due to issues with Global Data Products, try the following steps:
- Remove the connections from the
hyperparameter_tuning/sklearn
block in the Tree panel to its upstream blocks. - Re-add these connections.
- Remember to save the pipeline using
Ctrl+S
.
# Video 3.2.8 Error
Issue: ValueError: not enough values to unpack (expected 3, got 1)
Ensure your code follows this order:
- data → training_set
- data_2 → hyperparameter_tuning/xgboost
If not, proceed with:
- Remove the connections for the xgboost.
- Reconnect starting with the training set, followed by
hyperparameter_tuning/xgboost
.
# MLflow container error: Can't locate revision identified by …
This means your MLflow container tries to access a db file which was a backend for a different MLflow version than the one you have in the container. Most likely, the MLflow version in the container does not match the MLflow version of the MLflow server you ran in module 2.
The easiest solution is to check which version you worked with before, and change the docker image accordingly.
Open a terminal on your host and activate the conda environment you worked in:
conda activate <your-env-name>
Run the following command to check your MLflow version:
mlflow --version
Edit the
mlflow.dockerfile
line to your version:RUN pip install mlflow==2.??.??
Save the file and rebuild the docker service by running:
docker-compose build
Now you can start up the containers again, and your MLflow container should be able to successfully read your mounted DB file.
# Permission denied in GitHub Codespace
When you encounter a permission denied error while setting up the server in GitHub Codespaces, refer to this guide:
https://askubuntu.com/questions/409025/permission-denied-when-running-sh-scripts
# (root) Additional property mlflow is not allowed
This error means the mlflow entry is not nested under the services: key of your Docker Compose file (it sits at the root level instead). To solve the issue, indent the mlflow service definition so that it appears below services:.
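A minimal sketch of the expected structure (the image and port are illustrative; keep whatever values your setup already uses):
services:
  mlflow:
    image: ghcr.io/mlflow/mlflow
    ports:
      - "5000:5000"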
# Q6: Logged model artifacts lost when mlflow container is down or removed
By default, the logged model and artifacts are stored in a local folder in the mlflow container but not in /home/src/mlflow
. Therefore, when the container is restarted (after a compose down or container removal), the artifacts are deleted and you cannot see them in the mlflow UI.
To prevent this issue, you can include a new volume in the Docker Compose service for mlflow to map a folder on the local machine to the folder /mlartifacts
in the mlflow container:
"${PWD}/mlartifacts:/mlartifacts/"
This way, every data logged to the experiment will be available even when the mlflow container is recreated.
# Q6: mlflow not showing artifacts
When using localstore, try to start mlflow
where mlflow.db
is present. For example, if mlflow.db
is in mlops/mlflow
, navigate to that folder and run ../scripts/start.sh
. This assumes you followed the instructions in the homework.md
file of week 3 and set up the mlops
folder.
# Q6: Correct mlflow tracking uri
For the correct mlflow tracking URI, use:
mlflow.set_tracking_uri(uri="http://mlflow:5000")
This assumes you used the suggested Docker file snippet in Homework Question 6.
# I get the following error: invalid mount config for type "bind": invalid specification: destination can't be '/' when running docker compose up when running mage
You should not run docker compose up
for the mage repo directly. Instead, use:
bash ./scripts/start.sh
Additional Information
- The
start.sh
script handles necessary environment variable settings before executingdocker compose up
. - Key environment variables such as
PROJECT_NAME
andMAGE_CODE_PATH
should be set, potentially in your.env
file. - Note that if you are starting a new mage project, like in a capstone project, you may not have a
start.sh
script or ascripts
directory, so ensure the environment variables are set correctly.
Update by another student from the MLOps Zoomcamp.
# AttributeError: module 'mlflow' has no attribute 'set_tracking_url'
In a mage block, the Python statement mlflow.set_tracking_uri()
was returning an attribute error. This issue was observed when running Mage in one container and MLflow in another. If you encounter this, there may be something else in your project with the name "mlflow."
Debugging the Import Issue:
Insert a print statement before the Python statement that produces the attribute error:
print(mlflow.__file__)
This will show what the
mlflow
module points to. It should return a site-packages location, something like:'/usr/local/lib/python3.10/site-packages/mlflow/__init__.py'
If not, you may have another file or folder called "mlflow" that is confusing the Python import statement.
Checking Backend Store Location:
Look at the folder name where the
mlflow.db
is being created via this command (either in command line or in the Dockerfile for the MLflow service):mlflow server --backend-store-uri sqlite:///home/mlflow/mlflow.db --host 0.0.0.0 --port 5000
If the folder name for the backend store is "mlflow," Python may be trying to import that instead of the MLflow package you installed. Change the backend store folder name to something else, like
mlflow_data
.Rename the folder in your local drive (since it gets mounted in
docker-compose.yml
).Update the folder name in the Dockerfile for the MLflow service:
- Specify the backend-store-uri in the MLflow server command with the new folder name.
Update the folder name in
docker-compose.yml
(when mounting the folder for the MLflow service), e.g.:
volumes:
  - "${PWD}/mlflow_data:/home/mlflow_data/"
If
import mlflow
Gives a Module Not Found Error:Check the
PYTHONPATH
variable in the container:docker ps # Copy the Mage container ID docker exec -it <container-ID> /bin/bash echo $PYTHONPATH
If you do not see the path to the site-packages directory for your Python version, add it to the
PYTHONPATH
environment variable.To find out what path to use, execute this from the running container:
import sys
print(sys.path)
Add this to the
PYTHONPATH
in the Dockerfile for the Mage service:ENV PYTHONPATH="${PYTHONPATH}:/usr/local/lib/python3.10/site-packages"
# prefect project init Error: No such command 'project'.
The newest version of Prefect no longer has the project subcommand. To initialize a project, use the command:
prefect init
# Video 3.3.4: Training Metrics RMSE chart does not show due to the error: KeyError: ‘rmse_LinearRegression’
Solution: Check the difference between xgboost and sklearn pipelines. In the xgboost pipeline, there is a track_experiment
callback, which is missing in the sklearn pipeline.
Please add these lines:
You can refer to them in the similar commit linked here:
# How can I enable communication between Docker containers when invoked from a Kestra task?
Use the docker.Run
plugin in your Kestra task to run containers. This plugin supports advanced Docker options like custom networks.
For local development, you can use networkMode: host
to allow containers to access services on your host (e.g., MLflow running on localhost).
Example:
networkMode: host
Note:
host
mode is only supported on Linux.- For Docker Desktop on Windows/macOS, use
host.docker.internal
or create a shared Docker network.
Best Practice:
In production setups, tools like MLflow should run outside Kestra and be accessed over a stable URI (e.g., a cloud endpoint or a container with a known hostname in a shared network).
Module 4: Deployment
# Fix Out of Memory error while orchestrating the workflow on a ML Pipeline for a high volume dataset.
We come across situations in data transformation & pre-processing as well as model training in a ML pipeline where we need to handle datasets of high dimensionality or high cardinality (usually millions). We often end up with Out of Memory (OOM) errors like below when the flow is running:
If you do not have the option of increasing your RAM, the following approaches can be effective in mitigating this error:
Read Only Required Features/Columns:
- During the data loading step, read only the necessary features/columns from the dataset (see the sketch after this list).
Remove Unused Dataframes:
- Before encoding/vectorizing, remove the dataframe when you have obtained
X_train
&y_train
.
- Before encoding/vectorizing, remove the dataframe when you have obtained
Create or Resize Swap File:
- If you do not have a swap file or have a small one, create a swap file (size as per memory requirement) or replace the existing one with a properly sized one.
To remove an existing swapfile, use:
sudo swapoff /swapfile
sudo rm /swapfile
To create a new properly sized swapfile (e.g., 16 GB), use:
sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
To check the swap file created:
free -h
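As an illustration of the first two tips, here is a minimal pandas sketch (file name and column names are assumptions; adjust them to your dataset):
import gc

import pandas as pd

# Load only the columns that are actually needed
cols = ["lpep_pickup_datetime", "lpep_dropoff_datetime", "PULocationID", "DOLocationID", "trip_distance"]
df = pd.read_parquet("green_tripdata_2024-01.parquet", columns=cols)

# ... build X_train / y_train from df ...

# Drop the raw dataframe once it is no longer needed and reclaim the memory
del df
gc.collect()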
# Docker: aws.exe: error: argument operation: Invalid choice — Docker can not login to ECR.
When using AWS CLI on Windows, you might encounter the following error:
aws.exe: error: argument operation: Invalid choice
Solution
Check your AWS CLI version. For example:
aws-cli/2.4.24 Python/3.8.8 Windows/10 exe/AMD64 prompt/off
Instead of using the outdated command, use the updated command provided by AWS:
aws ecr get-login-password \
    --region <region> \
| docker login \
    --username AWS \
    --password-stdin <aws_account_id>.dkr.ecr.<region>.amazonaws.com
Refer to the official AWS documentation for additional details: AWS CLI ECR Login Password
Ensure that you replace <region>
and <aws_account_id>
with your specific values.
# Multiline commands in Windows Powershell
To use multiline commands in Windows PowerShell, place a backtick (`) at the end of each line except the last. Note that multiline strings do not require a backtick.
- Escape double quotes (") as \" (as in the example below).
- Use
$env:
to create environment variables (non-persistent). For example:
$env:KINESIS_STREAM_INPUT="ride_events"
aws kinesis put-record --cli-binary-format raw-in-base64-out `
--stream-name $env:KINESIS_STREAM_INPUT `
--partition-key 1 `
--data '{
\"ride\": {
\"PULocationID\": 130,
\"DOLocationID\": 205,
\"trip_distance\": 3.66
},
\"ride_id\": 156
}'
# Pipenv installation not working (AttributeError: module 'collections' has no attribute 'MutableMapping')
If you encounter pipenv failures with the command pipenv install
and see the following error:
AttributeError: module 'collections' has no attribute 'MutableMapping'
The issue occurs because you are using the system Python (3.10) for pipenv.
To resolve this issue:
If pipenv was previously installed via
apt-get
, remove it using:sudo apt remove pipenv
Ensure a non-system Python is installed in your environment. An easy way to achieve this is by installing Anaconda or Miniconda.
Install pipenv using your non-system Python:
pip install pipenv
Re-run
pipenv install <dependencies>
with the relevant dependencies. It should work without issues.
This solution was tested and worked on an AWS instance similar to the configuration presented in class.
# module is not available (Can't connect to HTTPS URL)
First, check if the SSL module is configured with the following command:
python -m ssl
If the output is empty, there is no problem with the SSL configuration itself. In that case, upgrade the pipenv package in your current environment to resolve the problem.
# No module named 'pip._vendor.six'
During scikit-learn installation via the command:
pipenv install scikit-learn==1.0.2
The following error is raised:
ModuleNotFoundError: No module named 'pip._vendor.six'
To resolve this issue, follow these steps:
Install the
python-six
package:sudo apt install python-six
Remove the existing Pipenv environment:
pipenv --rm
Reinstall
scikit-learn
:pipenv install scikit-learn==1.0.2
# Pipenv with Jupyter
Problem Description: How can we use Jupyter notebooks with the Pipenv environment?
Solution:
Install Jupyter and
ipykernel
using Pipenv.Register the kernel within the Pipenv shell using the following command:
python -m ipykernel install --user --name=my-virtualenv-name
If you are using Jupyter notebooks in VS Code, this will also add the virtual environment to the list of kernels.
For more details, refer to this Stack Overflow question.
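For the first step, the install command would look like this (a minimal sketch; run it inside your project folder):
pipenv install jupyter ipykernel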
# Pipenv: Jupyter no output
Problem: I tried to run a starter notebook in a Pipenv environment but had issues with no output on prints. I used scikit-learn==1.2.2
and python==3.10
. Tornado version was 6.3.2
.
Solution: The error you're encountering seems to be a bug related to Tornado, which is a Python web server and networking library. It's used by Jupyter under the hood to handle networking tasks.
- Downgrading to
tornado==6.1
fixed the issue.
More information can be found on this Stack Overflow post.
# AWS CLI: 'Invalid base64' error after running `aws kinesis put-record`
Problem Description:
You might encounter an 'Invalid base64' error after executing the aws kinesis put-record
command on your local machine. This issue can occur if you are using AWS CLI version 2. In a referenced video (4.4, around 57:42), a warning is visible as the instructor is using version 1 of the CLI.
Solution:
To resolve this issue, use the argument --cli-binary-format raw-in-base64-out
when executing the command. This option will encode your data string into base64 before transmitting it to Kinesis.
aws kinesis put-record --cli-binary-format raw-in-base64-out --other-parameters
# Error index 311297 is out of bounds for axis 0 with size 131483 when loading parquet file.
Problem description: Running starter.ipynb
in homework’s Q1 will show this error.
Solution:
- Update pandas along with related dependencies to the latest versions.
# Pipenv: Pipfile.lock was not created along with Pipfile
Use the following command to force the creation of Pipfile.lock
:
pipenv lock
# Permission Denied using Pipenv
This issue is usually due to the pythonfinder module in pipenv.
The solution involves manually changing the scripts as described here: python_finder_fix
# Going further with Google Cloud Platform: Load and save data to GCS
There is a possibility to load and store data in a Google Cloud Storage bucket. To do that, authenticate through the IDE you are using and allow read and write access to a GCS bucket:
Authenticate gsutil with your GCP account:
gsutil config
Upload the data to your GCS bucket:
gsutil cp path/to/local/data gs://your-bucket-name
Create a service account and manage permissions:
- In the GCP Console, go to "IAM & Admin," then "Service accounts."
- Create a new service account, grant it permissions (e.g., "Storage Object Admin" for GCS access), and generate a JSON key file.
Install the Google Cloud SDK: Google Cloud SDK Installation Guide
Authenticate the SDK with your GCP account:
gcloud auth login
Set your GCP project:
gcloud config set project YOUR_GCP_PROJECT_ID
Install the Google Cloud Storage library:
!pip install google-cloud-storage
Example Script
Here's how to load a CSV file from Google Cloud Storage into a pandas DataFrame:
from io import BytesIO

from google.cloud import storage
import pandas as pd

# Set up the storage client with the service account key
storage_client = storage.Client.from_service_account_json('path/to/service-account-key.json')

# Get the GCS bucket
bucket = storage_client.get_bucket('your-bucket-name')

# List the contents of the bucket
for blob in bucket.list_blobs():
    print(blob.name)

# Load a CSV file from the bucket into a pandas DataFrame
csv_blob = bucket.blob('path/to/csv/in/bucket.csv')
df = pd.read_csv(BytesIO(csv_blob.download_as_bytes()))
You can directly save output data by setting the output file name to your desired GCS URI.
# Error: Error while parsing arguments via CLI [ValueError: Unknown format code 'd' for object of type 'str']
When passing arguments to a script via the command line and converting them to a 4-digit number using f'{year:04d}', this error can occur.
This happens because command line inputs are read as strings. They need to be converted to integers before formatting with an f-string:
year = int(sys.argv[1])
f'{year:04d}'
If you use the click
library, update your decorator accordingly:
import click
@click.command()
@click.option("--year", help="Year for evaluation", type=int)
def your_function(year):
# Your code
# Docker: Dockerizing tips
Ensure the correct image is being used to derive from.
- Copy the data from local to the Docker image using the
COPY
command to a relative path. Using absolute paths within the image might be troublesome. - Use paths starting from
/app
and don’t forget to doWORKDIR /app
before actually performing the code execution.
Most Common Commands
Build container:
docker build -t mlops-learn .
Execute the script:
docker run -it --rm mlops-learn
<mlops-learn>
is just a name used for the image and does not have any significance.
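Putting the tips above together, a minimal Dockerfile sketch could look like this (file names are illustrative assumptions):
FROM python:3.10-slim

WORKDIR /app

# Install dependencies first so they are cached between builds
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy the code and model with relative paths inside /app
COPY predict.py model.bin ./

ENTRYPOINT ["python", "predict.py"]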
# Running multiple services in a Docker container
If you are trying to run Flask with Gunicorn and an MLFlow server from the same container, defining both services in the Dockerfile with CMD will only run MLFlow and not Flask.
Solution
Create separate shell scripts with server run commands:
For Flask with Gunicorn:
Save as
script1.sh
:#!/bin/bash
gunicorn --bind=0.0.0.0:9696 predict:app
For MLFlow server:
Save as
script2.sh
:#!/bin/bash
mlflow server -h 0.0.0.0 -p 5000 --backend-store-uri=sqlite:///mlflow.db --default-artifact-root=g3://zc-bucket/mlruns/
Create a wrapper script to run the above two scripts:
Save as
wrapper_script.sh
#!/bin/bash
# Start the first process
./script1.sh &
# Start the second process
./script2.sh &
# Wait for any process to exit
wait -n
# Exit with status of process that exited first
exit $?
Give executable permissions to all scripts:
chmod +x *.sh
Define the last line of your Dockerfile as:
CMD ./wrapper_script.sh
Don't forget to expose all ports defined by the services.
# Cannot generate pipfile.lock raise InstallationError( pip9.exceptions.InstallationError)
Problem description: Cannot generate pipfile.lock
. Raises InstallationError( pip9.exceptions.InstallationError: Command "python setup.py egg_info" failed with error code 1
.
Solution:
You need to force an upgrade of
wheel
andpipenv
.Run the following command:
pip install --user --upgrade --upgrade-strategy eager pipenv wheel
# Connecting s3 bucket to MLFLOW
Problem Description
How can we connect an S3 bucket to MLflow?
Solution
To connect an S3 bucket to MLflow, use boto3
and AWS CLI to store access keys. These access keys allow boto3
(AWS' Python API tool) to authenticate and connect with AWS servers. Without access keys, access to the bucket cannot be verified, which could prevent connection attempts by unauthorized individuals.
Steps:
Ensure Access Keys are Available:
- Access keys are essential for
boto3
to communicate with AWS servers securely. - They ensure that only authorized users with the correct permissions can access the bucket.
- Access keys are essential for
Set Bucket as Public (Optional):
- Alternatively, you can set the bucket to public access.
- In this case, access keys are not needed as anyone can access the bucket without authentication.
For more detailed information on credentials management, refer to the official documentation: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html
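As a rough sketch, the whole flow can look like this (bucket name and region are placeholders; the server flags are the same ones used elsewhere in this FAQ):
# Configure credentials once (stored in ~/.aws/credentials)
aws configure

# Or export them for the current shell session
export AWS_ACCESS_KEY_ID=<your-access-key-id>
export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
export AWS_DEFAULT_REGION=<region>

# Start MLflow with the S3 bucket as the artifact store
mlflow server --backend-store-uri sqlite:///mlflow.db --default-artifact-root s3://<your-bucket-name>/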
# Uploading to s3 fails with "An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records."
Even though the upload works using AWS CLI and boto3 in Jupyter notebook.
Solution:
Set the AWS_PROFILE
environment variable (the default profile is called default
).
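For example (assuming your credentials live under the default profile in ~/.aws/credentials):
export AWS_PROFILE=default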
# Docker: Dockerizing LightGBM
Problem Description:
lib_lightgbm.so Reason: image not found
Solution:
Add the following command to your Dockerfile:
RUN apt-get install libgomp1
Modify the installer command based on your OS if needed.
# Error raised when executing mlflow’s pyfunc.load_model in lambda function.
When the request is processed in a lambda function, the mlflow library raises the following warning:
2022/09/19 21:18:47 WARNING mlflow.pyfunc: Encountered an unexpected error (AttributeError("module 'dataclasses' has no attribute '__version__'")) while detecting model dependency mismatches. Set logging level to DEBUG to see the full traceback.
Solution:
- Increase the memory of the lambda function.
# 4.3 FYI: The notebook in the repo is the end state of the video
Just a note if you are following the video but also using the repo’s notebook. The notebook is the end state of the video which eventually uses MLflow pipelines.
Just watch the video and be patient. Everything will work :)
# The notebook in the repo is missing some code
Solution: Include the code to log the dict_vectorizer in your notebook. If the error appears after switching to pipelines, update the predict function as shown in the video.
# Docker: Passing envs to my docker image
Problem Description:
I was having issues because my Python script was not reading AWS credentials from environment variables. After building the image, I was running it like this:
docker run -it homework-04 -e AWS_ACCESS_KEY_ID=xxxxxxxx -e AWS_SECRET_ACCESS_KEY=xxxxxx
Solutions:
Environment Variables Order:
You can set environment variables like
AWS_ACCESS_KEY_ID
,AWS_SECRET_ACCESS_KEY
, andAWS_SESSION_TOKEN
(if using AWS STS). Ensure these variables are passed before the image name:docker run -e AWS_ACCESS_KEY_ID=xxxxxxxx -e AWS_SECRET_ACCESS_KEY=xxxxxx -it homework-04
Using an Env File:
You can pass an env file by using the following command, assuming your env file is named
.env
docker run -it --env-file .env homework-04
AWS Configuration Files:
If AWS credentials are not found, the AWS SDKs and CLI will check the
~/.aws/credentials
and~/.aws/config
files for credentials. You can map these files into your Docker container using volumes:docker run -it --rm -v ~/.aws:/root/.aws homework:v1
# Docker: How to see the model in the docker container in app/?
If you need to view the model inside the Docker container for the image svizor/zoomcamp-model:mlops-3.10.0-slim
, follow these steps:
Create a Dockerfile:
FROM svizor/zoomcamp-model:mlops-3.10.0-slim
Build the Docker Image:
docker build -t zoomcamp_test .
Run the Container and List the Contents of
/app
:docker run -it zoomcamp_test ls /app
The output should include
model.bin
, confirming the model is present.
Additional Instructions
You can copy files into the Docker image by adding lines like
COPY myfile .
to the Dockerfile, and then run a script with arguments:docker run -it myimage myscript arg1 arg2
Remember, a new build is required whenever the Dockerfile is modified.
Alternative Method
To list the contents of /app
when the container runs, modify the Dockerfile:
FROM svizor/zoomcamp-model:mlops-3.10.0-slim
WORKDIR /app
CMD ls
- Note:
- Use
CMD
to specify commands for container runtime. - Use
RUN
for building the image andCMD
during container execution.
- Use
# WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
To resolve this issue, make sure to build the Docker image with the platform tag. Use the following command:
docker build -t homework:v1 --platform=linux/arm64 .
# HTTPError: HTTP Error 403: Forbidden when call apply_model() in score.ipynb
Solution:
Instead of using the following input file:
input_file = f'https://s3.amazonaws.com/nyc-tlc/trip+data/{taxi_type}_tripdata_{year:04d}-{month:02d}.parquet'
Use:
input_file = f'https://d37ci6vzurychx.cloudfront.net/trip-data/{taxi_type}_tripdata_{year:04d}-{month:02d}.parquet'
# ModuleNotFoundError: No module named 'pipenv.patched.pip._vendor.urllib3.response'
If you're encountering the error:
ModuleNotFoundError: No module named 'pipenv.patched.pip._vendor.urllib3.response'
Follow these steps to resolve it:
Reinstall
pipenv
with the following command:pip install pipenv --force-reinstall
If you see an error referring to
site-packages\pipenv\patched\pip\_vendor\urllib3\connectionpool.py
, then:Upgrade
pip
and installrequests
:pip install -U pip pip install requests
# Error: pipenv command not found after pipenv installation
When installing pipenv using the --user
option, you need to update the PATH environment variable to run pipenv commands. It's recommended to update your .bashrc
or .profile
(depending on your OS) to persist the change. Edit your .bashrc
file to include or update a line like this:
PATH="<path_to_your_pipenv_install_dir>:$PATH"
Alternatively, you can reinstall pipenv as root for all users:
sudo -H pip install -U pipenv
# Homework/Question 2: Namerror: name ‘year’ is not defined
For question 2, which requires you to prepare the dataframe with the output, you need to first define the year
and month
as integers.
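A minimal sketch, assuming the script receives the year and month as command-line arguments:
import sys

year = int(sys.argv[1])   # e.g. 2023
month = int(sys.argv[2])  # e.g. 3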
# Mage error: Error loading custom object at…
When returning an object from a block, you may encounter an error like this:
Error loading custom_object at /home/src/mage_data/*************/pipelines/taxi_duration_pipe/.variables/make_predictions/output_0: [Errno 2] No such file or directory: '/home/src/mage_data/*************/pipelines/taxi_duration_pipe/.variables/make_predictions/output_0/object.joblib'
This occurred when returning a numpy.ndarray
, specifically the y_pred
variable containing the predictions for the taxi dataset. It seems Mage struggles with some types of objects and expects data structures like DataFrames instead of numpy.ndarrays
. To resolve this, you can return a DataFrame that includes both the y_pred
and the ride IDs.
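A minimal sketch of the workaround (function and column names are illustrative):
import pandas as pd

def wrap_predictions(ride_ids, y_pred):
    # Return a DataFrame instead of a bare numpy array so Mage can serialize the output
    return pd.DataFrame({"ride_id": ride_ids, "predicted_duration": y_pred})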
# Docker: The arm64 chip doesn’t match with Alexey’s docker image
You may get a warning similar to the one below when trying to run the docker:
WARNING: The requested image's platform (linux/amd64) does not match the detected host platform (linux/arm64/v8) and no specific platform was requested
Python 3.10.13 (main, Mar 12 2024, 12:22:40) [GCC 12.2.0] on linux
Add the tag --platform linux/amd64
when running, and it should work. For example:
docker run -it --platform linux/amd64 --rm -p 9696:9696 homework:v2
# Pipenv installation
Make sure you have Python and pip installed by checking their versions:
python --version
pip --version
To install Pipenv, use the following command:
pip install pipenv --user
# Jupyter: nbconvert error
If you encounter an error when converting your notebook.ipynb
into a Python script using the command:
jupyter nbconvert --to script your_notebook.ipynb
and you see the error message:
Jupyter command `jupyter-nbconvert` not found.
follow these steps:
Verify the Directory
Ensure that you're in the directory containing your Jupyter notebook.
Install the Necessary Package
If the issue persists, you may need to install the
nbconvert
package. Run the following command:pip install nbconvert
Convert the Notebook
After installing
nbconvert
, use the following command to convert your notebook to a Python script:jupyter nbconvert your_notebook.ipynb --to python
Note: The correct command is slightly different (
--to python
instead of--to script
).
# Homework/Question 6: Do not forget to specify that the folder output/yellow should be created in the working directory of your docker file
For question 6, which requires you to include your script in a Dockerfile, specify the creation of the folder output/yellow
in the working directory of your Docker container by adding the following line in your Dockerfile:
RUN mkdir -p output/yellow
# Homework/Question 6: Entry point for running scoring script in Docker container
For question 6, if you are using the script as instructed in the homework and not Flask, your entry point should be bash
. This can be set by specifying:
ENTRYPOINT ["bash"]
# Error: Unable to locate credentials
This error appeared when I was running the Jupyter notebooks inside Visual Studio Code in Codespaces. I fixed it by running the Jupyter notebooks outside of Codespaces.
Module 5: Monitoring
# How do I log in to AWS ECR from the terminal using Docker?
Before (deprecated command):
$(aws ecr get-login --no-include-email)
Now (updated and secure command):
aws ecr get-login-password --region us-west-1 | docker login --username AWS --password-stdin <ACCOUNTID>.dkr.ecr.<REGION>.amazonaws.com
- Note: Make sure you specify the correct AWS region where your ECR repository is located (e.g., us-west-1).
- If the region is incorrect or not set properly, the login will fail with a
400 Bad Request
error — which doesn’t clearly indicate the region is the issue.
# Tip: Use a Python Block in Mage to Interact with Your Dockerized ML Model
The most effective way to integrate your machine learning model into a Mage pipeline is to have your Docker container serve the model via an API (like FastAPI or Flask). Then, a custom Python block within your Mage pipeline can easily call this API to get predictions.
Here’s the concise workflow:
In Your Docker Container:
- Create an API for your model: Use a framework like FastAPI to wrap your model's prediction logic in an API endpoint. For example, create a
/predict
endpoint that accepts input data and returns the model's output. - Build and run the Docker container: Ensure the container is running and the API is accessible. For local development, you can use docker-compose to run both your model's container and the Mage container, connecting them on the same Docker network for easy communication.
- Create an API for your model: Use a framework like FastAPI to wrap your model's prediction logic in an API endpoint. For example, create a
In Your Mage Pipeline:
- Create a custom Python block: Add a new "transformer" or "data loader" block to your pipeline.
- Call the model's API: Inside this block, use a Python library like
requests
to send the data you want to get predictions for to your model's API endpoint. - Process the results: The block will receive the predictions back from the API. You can then continue your Mage pipeline, using the model's output for further transformations or exporting it to a database or other destination.
# ImportError when using ColumnQuantileMetric with Evidently
Problem Description
While working on the monitoring module homework, the instructions mention using ColumnQuantileMetric
. However, attempting to import it results in an error:
ImportError: cannot import name 'ColumnQuantileMetric' from 'evidently.metrics'
Solution Description
The ColumnQuantileMetric
class does not exist in current versions of Evidently (e.g., 0.7.8+). The correct class to use is QuantileValue
, which serves the same purpose.
Additionally, the expected argument is not column_name
, but column
. This differs from other metrics like MissingValueCount
that use column_name
.
If you see a ValidationError: column field required
, you are likely using the wrong parameter name.
You can use it as follows:
from evidently.metrics import QuantileValue
QuantileValue(column="fare_amount", quantile=0.5)
This mismatch likely results from outdated references or changes in the library’s API.
# Login window in Grafana
Problem description: When running docker-compose up
as shown in video 5.2, if you go to http://localhost:3000/, you are asked for a username and a password.
Solution:
- The default credentials are:
- Username:
admin
- Password:
admin
- Username:
- After logging in, you can set a new password.
For more details, see Grafana documentation.
# Error in starting monitoring services in Linux
Problem Description:
In Linux, when starting services using docker compose up --build
as shown in video 5.2, the services won’t start and instead we get the message:
unknown flag: --build
Solution:
Since we install docker-compose separately in Linux, use the following command:
docker-compose up --build
# KeyError ‘content-length’ when running prepare.py
Problem: When running prepare.py
, encountering KeyError: 'content-length'
.
Solution:
From Emeli Dral: It seems the link used in prepare.py
to download taxi data is no longer functional. Replace the URL in the script as follows:
url = f"https://nyc-tlc.s3.amazonaws.com/trip+data/{file}"
with:
url = f"https://d37ci6vzurychx.cloudfront.net/trip-data/{file}"
By making this substitution in prepare.py
, the problem should be resolved, allowing access to the necessary data.
# Evidently service exit with code 2
Problem Description
When running the command docker-compose up --build and sending data to the real-time prediction service, the service returns "Max retries exceeded with url: /api". This happens because the evidently service exits with code 2: its "app.py" is unable to execute the import from pyarrow import parquet as pq.
Solution
Install the pyarrow module:
pip install pyarrow
Restart your machine.
If the first and second solutions don’t work:
- Comment out the
pyarrow
module in "app.py" of the evidently service, as it may not be used, which resolved the issue in some cases.
# ValueError: Incorrect item instead of a metric or metric preset was passed to Report
When using Evidently, this error usually means you passed the metric (or metric preset) class itself instead of an instance; add opening and closing parentheses after the class name to instantiate it.
# For the report RegressionQualityMetric()
You will get an error if you didn't set target='duration_min'.
If you want to use RegressionQualityMetric(), you need to set target='duration_min' in the column mapping, and the duration_min column must be present in your current_data (current_data['duration_min']).
# Found array with 0 sample(s)
Problem Description
ValueError: Found array with 0 sample(s) (shape=(0, 6)) while a minimum of 1 is required by LinearRegression.
Solution Description
This error occurs because the generated data is based on an early date, resulting in an empty training dataset.
Adjust the following:
begin = datetime.datetime(202X, X, X, 0, 0)
# Adding additional metric
Problem Description
Getting “target columns” “prediction columns” not present errors after adding a metric.
Solution Description
Make sure to read through the documentation on what is required or optional when adding the metric. For example, DatasetCorrelationsMetric
doesn’t require any parameters because the metric evaluates for correlations among the features.
# Grafana: Standard login does not work
When trying to log in to Grafana with the standard credentials (admin/admin), an error occurs.
Solution
To reset the admin password, use the following command inside the Grafana container:
grafana cli admin reset-admin-password admin
Note: The
grafana-cli
command is deprecated. Usegrafana cli
instead.Enter the Docker container with Grafana:
Find the Container ID by running:
docker ps
Use the Container ID to reset the password. Replace
<container_ID>
with the actual Container ID:
docker exec -it <container_ID> grafana cli admin reset-admin-password admin
This should resolve the login issue.
# The chart in Grafana doesn’t get updates
Problem Description: While my metric generation script was still running, I noticed that the charts in Grafana don’t get updated.
Solution:
- Refresh Interval: Set it to a small value, such as 5, 10, or 30 seconds.
- Timezone Setting: Ensure you use your local timezone in a call to
pytz.timezone
. For example, change the setting from "Europe/London" to your local timezone to get updates.
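A minimal sketch of that timezone call, assuming your metrics script builds timestamps with pytz (the zone name below is only an example; use your own local timezone):
import datetime
import pytz

local_tz = pytz.timezone("America/New_York")  # example zone: replace with your local timezone
timestamp = datetime.datetime.now(local_tz)   # timestamps in this zone line up with Grafana's time range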
# Prefect: Prefect server was not running locally
Problem Description
Prefect server was not running locally. The command prefect server start
was executed but it stopped immediately.
Solution
- Use Prefect Cloud to run the script instead.
- Report the problem by creating an issue on the Prefect GitHub repository.
# Docker: no disk space left error when doing docker compose up
To resolve the "no disk space left" error when running docker compose up
, follow these steps:
Run the following command to remove unused objects (build cache, containers, images, etc.):
docker system prune
If you want to see what is taking up space before pruning, use:
docker system df
# Failed to listen on :::8080 (reason: php_network_getaddresses: getaddrinfo failed: Address family for hostname not supported)
Problem: When running docker-compose up --build
, you may encounter this error.
To solve this issue, add the following command in the adminer
block in your docker-compose.yml
file:
adminer:
  command: php -S 0.0.0.0:8080 -t /var/www/html
  image: adminer...
This configuration specifies the command to be executed when the container starts, setting up PHP to listen on 0.0.0.0:8080
. This addresses the network error by changing the bind address.
# Generate Evidently Chart in Grafana
Problem: Can we generate charts like Evidently inside Grafana?
Solution:
- In Grafana, you can use a stat panel (just a number) and a scatter plot panel, which may require a plug-in.
- Unfortunately, there's no native method to directly recreate the Evidently dashboard.
- Ensure that all relevant information is logged to your Grafana data source, then design your custom plots.
External Recreation:
- Export the Evidently output in JSON with include_render=True for external visualization. See more details here.
- For non-aggregated visuals, use the option "raw_data": True. More details here.
This specific plot with under- and over-performance segments is particularly useful during debugging and might be easier to view ad hoc using Evidently.
# Error when importing evidently package because of numpy version upgraded
A new version of NumPy, v2.0.0, was released on June 16, 2024, and it causes an import error in the evidently package:
AttributeError: `np.float_` was removed in the NumPy 2.0 release. Use `np.float64` instead.
You can solve it by downgrading numpy to a previous version, 1.26.4. Just run:
python -m pip install numpy==1.26.4
Or modify the requirements.txt
to freeze the version:
numpy==1.26.4
# Bind for 0.0.0.0:5432 failed: port is already allocated
Problem: When trying to start the postgres services through docker-compose up
, this error occurs:
Bind for 0.0.0.0:5432 failed: port is already allocated
Note: This issue occurs because port 5432 is already used by another service.
Solution: Update the port mapping for the Postgres service to 5433:5432
in the Docker Compose YAML file.
# Table/database not showing on Grafana dashboard
Problem:
For version 5.4, when trying to create a new dashboard, Grafana does not list the dummy_metrics
table in the query tab.
Note: Change the datasource name from the default "PostgreSQL."
Solution 1:
Update the config/grafana_datasources.yaml
with the following:
# List of datasources to insert/update
# depending on what's available in the database
datasources:
  - name: NewPostgreSQL
    type: postgres
    url: db:5432
    user: postgres
    secureJsonData:
      password: 'example'
    jsonData:
      sslmode: 'disable'
      database: test
Solution 2:
- Use the "Code" option rather than the "Builder" option.
- Load the data using your own SQL queries.
- Tip: If you write your FROM statement first, the SELECT options can be filled in through auto-complete.
# Adminer Not Loaded
Problem: After running Docker Compose, Adminer cannot be accessed on http://127.0.0.1:8080/
Solution: Add index.php
after the URL, so the URL will be http://127.0.0.1:8080/index.php
# Grafana: UI Changes
Problem: When selecting a column from the table, the error message is displayed:
no time column: no time column found
Solution: Add a timestamp column in the query builder.
# Runtime Error: Failed to Reach API on Prefect
Problem: When running evidently_metrics_calculation.py
, the following error is shown:
RuntimeError: Cannot create flow run. Failed to reach API at https://api.prefect.cloud/api/accounts/ee976605-4ca7-4a27-b5e3-0a37da3c7678/workspaces/78b23cf5-38bb-4d8b-9888-5bf8070d6d62/
Solution:
- Register or sign up at https://app.prefect.cloud/account/
# Grafana dashboard error after reset: db query error: pq: database “test” does not exist
Problem: You’ve already loaded your data, created a dashboard, and saved it. However, upon running docker-compose up
after saving the dashboard, you encounter this error:
db query error: pq: database “test” does not exist
Solution:
This error indicates you haven’t run the DB initialization code. If you did run it before and even saw results, the issue likely arises because you restarted the docker-compose services.
The default docker-compose.yml
file doesn’t have a volume for the Postgres DB. This means every restart will delete the DB data.
To resolve this:
If not planning to restart the services again: Simply rerun the DB initialization and filling code of your exercise.
If you plan to restart services frequently:
Add a volume to your PostgreSQL service in the docker-compose.yml file:
volumes:
  - ./data/postgres:/var/lib/postgresql/data
Note: Ensure you create a
./data
directory in your project.
To attach the volume, run the following:
docker-compose down
docker-compose up --build
# Are there any alternatives to Evidently on cloud platforms?
There are several alternatives to Evidently for monitoring machine learning models in the cloud. Here are a few options on popular cloud platforms:
Google Cloud Platform (GCP): AI Platform Predictions with Cloud Monitoring & Logging
Microsoft Azure: Azure Machine Learning
Amazon Web Services (AWS): Amazon SageMaker Model Monitor
These services provide model monitoring capabilities, allowing you to track the performance and data quality of your machine learning models within the cloud environment.
# docker.errors.DockerException: Error while fetching server API version: HTTPConnection.request() got an unexpected keyword argument 'chunked'
Instead of using:
docker-compose up --build
Use:
docker compose up --build
# Docker: Docker-Compose deprecated
Docker Compose v1 is deprecated from April 2023 onwards. More information on why v2 is better can be found in this blog post:
# psycopg.OperationalError: connection failed: connection to server at "127.0.0.1", port 5432 failed: FATAL: password authentication failed for user "postgres"
It could be that there is already another Docker container running (for example, from a previous session).
To resolve this issue:
- Check for running containers:
docker ps
- Stop the running container:
docker stop <container_name_or_ID>
# Login to DB not working in Adminer UI even after right DB, user and password.
Problem: Adminer UI is not responding or showing database details, even with the correct database, user, and password.
Solution: Try accessing the database from the command line using psql
.
You can quickly install psql via a package manager such as apt (e.g., sudo apt install postgresql-client).
Here is an example:
(base) cpl@inpne-ed-lab003:~$ psql -h localhost -p 5432 -U postgres
Password for user postgres:
psql (14.12 (Ubuntu 14.12-0ubuntu0.22.04.1), server 16.4 (Debian 16.4-1.pgdg120+1))
WARNING: psql major version 14, server major version 16.
Some psql features might not work.
Type "help" for help.
postgres=# \l
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges
-----------+----------+----------+------------+------------+-----------------------
postgres | postgres | UTF8 | en_US.utf8 | en_US.utf8 |
template0 | postgres | UTF8 | en_US.utf8 | en_US.utf8 | =c/postgres +
| | | | | postgres=CTc/postgres
template1 | postgres | UTF8 | en_US.utf8 | en_US.utf8 | =c/postgres +
| | | | | postgres=CTc/postgres
test | postgres | UTF8 | en_US.utf8 | en_US.utf8 |
(4 rows)
# Is it mandatory to use a reference dataset when generating a report with Evidently?
No. While Evidently is designed to compare a reference dataset with a current one, it can also be used without a reference dataset.
In such cases, you can pass reference_data=None
when creating the report. This is useful for generating descriptive statistics or univariate analyses on a single dataset (e.g., using ColumnSummaryMetric
, DatasetMissingValuesMetric
, etc.).
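A minimal sketch, assuming the 0.4.x Evidently API and using DatasetMissingValuesMetric as the example metric:
import pandas as pd
from evidently.report import Report
from evidently.metrics import DatasetMissingValuesMetric

current_data = pd.DataFrame({"duration_min": [10.0, None, 14.0]})  # toy single dataset

report = Report(metrics=[DatasetMissingValuesMetric()])
report.run(reference_data=None, current_data=current_data)  # no reference dataset needed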
# What version of Evidently AI is used in the course?
In the video (current cohort: 2025), the Evidently version used is 0.4.17. However, any version up to 0.6.7 will work with the code provided in the video and the repository.
Note that newer versions have changed the APIs, so the code in the video may not run with versions beyond 0.6.7.
# Error: Failed to create provisioner when running docker-compose up --build
✗ Failed to create provisioner: Failed to read dashboards config: could not parse provisioning config file: dashboards.yaml error: read /etc/grafana/provisioning/dashboards/dashboards.yaml: is a directory
To resolve this error in your docker-compose.yml
file, update the Grafana volumes
:
- Change from a YML file reference to a directory reference.
- Instead of specifying /etc/grafana/provisioning/dashboards/dashboards.yaml, use /etc/grafana/provisioning/dashboards/dashboards.
- Apply this change to all file names in the Grafana volumes section.
Module 6: Best Practices
# Evidently: Import Error
Problem Description
When running the command:
from evidently import ColumnMapping
The following import error occurs:
ImportError: cannot import name 'ColumnMapping' from 'evidently'
Solution
Uninstall the latest version of evidently:
pip uninstall evidently -y
Install an older compatible version:
pip install evidently==0.4.18
Restart the kernel to reload the environment.
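To confirm the downgrade took effect, a quick check after restarting the kernel (the expected version is simply the one pinned above):
import evidently
print(evidently.__version__)          # expected: 0.4.18
from evidently import ColumnMapping   # should now import without error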
# Error following video 6.2: mlflow==1.27.0
When following the video instructions and running the Dockerfile, I encountered an error that the Dockerfile build failed on line 8 due to no matching distribution for mlflow==1.27.0
. Below is the code output:
4.900 ERROR: No matching distribution found for mlflow==1.27.0
4.901 ERROR: Couldn't install package: {}
4.901 Package installation failed...
------
Dockerfile:8
--------------------
6 | COPY [ "Pipfile", "Pipfile.lock", "./" ]
7 |
8 | >>> RUN pipenv install --system --deploy
9 |
10 | COPY [ "lambda_function.py", "model.py", "./" ]
--------------------
ERROR: failed to solve: process "/bin/sh -c pipenv install --system --deploy" did not complete successfully: exit code: 1
# Get an error ‘Unable to locate credentials’ after running localstack with kinesis
You may encounter the error {'errorMessage': 'Unable to locate credentials', …
from the print statement in test_docker.py
after running localstack with Kinesis.
To resolve this issue:
In the docker-compose.yaml file, add the following environment variables:
- AWS_ACCESS_KEY_ID
- AWS_SECRET_ACCESS_KEY
You can assign any value to these variables (e.g., abc).
Alternatively, you can run the following command:
aws --endpoint-url http://localhost:4566 configure
Provide random values for the following prompts:
- AWS Access Key ID
- AWS Secret Access Key
- Default region name
- Default output format
# Get an error ‘unspecified location constraint is incompatible’
You may get an error while creating a bucket with LocalStack and the Boto3 client:
botocore.exceptions.ClientError: An error occurred (IllegalLocationConstraintException) when calling the CreateBucket operation: The unspecified location constraint is incompatible for the region specific endpoint this request was sent to.
To fix this, instead of creating a bucket via:
s3_client.create_bucket(Bucket='nyc-duration')
Create it with:
s3_client.create_bucket(Bucket='nyc-duration', CreateBucketConfiguration={'LocationConstraint': AWS_DEFAULT_REGION})
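Putting it together, a sketch for LocalStack; the endpoint URL, region, and dummy credentials are assumptions, so adjust them to match your docker-compose setup (note that us-east-1 does not accept an explicit LocationConstraint):
import os
import boto3

region = os.getenv("AWS_DEFAULT_REGION", "eu-west-1")  # assumed region; keep it consistent with docker-compose

s3_client = boto3.client(
    "s3",
    endpoint_url="http://localhost:4566",  # assumed LocalStack endpoint
    region_name=region,
    aws_access_key_id="foobar",            # dummy credentials: LocalStack does not validate them
    aws_secret_access_key="foobar",
)

s3_client.create_bucket(
    Bucket="nyc-duration",
    CreateBucketConfiguration={"LocationConstraint": region},
)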
# Get an error "<botocore.awsrequest.AWSRequest object at 0x7fbaf2666280>" after running an AWS CLI command
When executing an AWS CLI command (e.g., aws s3 ls
), you may encounter the error:
<botocore.awsrequest.AWSRequest object at 0x7fbaf2666280>
To fix this, set the AWS CLI environment variables:
export AWS_DEFAULT_REGION=eu-west-1
export AWS_ACCESS_KEY_ID=foobar
export AWS_SECRET_ACCESS_KEY=foobar
Their values are not important; any values will suffice.
# Pre-commit: Triggers an error at every commit: “mapping values are not allowed in this context”
At every commit, the above error is thrown and no pre-commit hooks are run.
Ensure the indentation in .pre-commit-config.yaml
is correct, particularly the 4 spaces ahead of every repo
statement.
# Could not reconfigure pytest from scratch after finishing with a previous folder
There is no obvious option to remove the existing pytest test configuration.
- Remove the .vscode folder located in the folder you previously used for testing. For example, if you chose to test in the "week6-best-practices" folder, remove the .vscode directory inside that folder.
# Empty Records in Kinesis Get Records with LocalStack
Problem Description
Following video 6.3, at minute 11:23, the get records command returns empty records.
Solution
Add --no-sign-request
to the Kinesis get records call:
aws --endpoint-url=http://localhost:4566 kinesis get-records --shard-iterator [...] --no-sign-request
# In Powershell, Git commit raises utf-8 encoding error after creating pre-commit yaml file
Problem Description
When executing the following command in PowerShell, an error occurs:
git commit -m 'Updated xxxxxx'
The error message is:
An error has occurred: InvalidConfigError:
==> File .pre-commit-config.yaml
=====> 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
Solution Description
Set UTF-8 encoding when creating the pre-commit YAML file:
pre-commit sample-config | out-file .pre-commit-config.yaml -encoding utf8
# Git: Commit with pre-commit hook raises error ‘'PythonInfo' object has no attribute 'version_nodot'
Problem Description
When attempting to commit in Git, the following error occurs:
git commit -m 'Updated xxxxxx'
[INFO] Initializing environment for GitHub.
[INFO] Installing environment for GitHub.
[INFO] Once installed this environment will be reused.
An unexpected error has occurred: CalledProcessError: command:
…
return code: 1
expected return code: 0
stdout:
AttributeError: 'PythonInfo' object has no attribute 'version_nodot'
Solution
To resolve this issue, clear the app-data of the virtual environment using the following command:
python -m virtualenv api -vvv --reset-app-data
# Pytest error 'module not found' when using custom packages in the source code
Problem Description
Project structure:
/sources/production/model_service.py
/sources/tests/unit_tests/test_model_service.py
The test file contains:
from production.model_service import ModelService
- Running python test_model_service.py from the sources directory works.
- Running pytest ./test/unit_tests fails with: No module named 'production'.
Solution
Use the following command:
python -m pytest ./test/unit_tests
Explanation
pytest does not automatically add the directory where it is run to sys.path.
Alternatives (see also the conftest.py sketch below) include:
- Running python -m pytest
- Exporting the PYTHONPATH before executing pytest:
export PYTHONPATH=.
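Another option, not shown in the course materials, is a small conftest.py that puts the sources directory on sys.path; a sketch assuming the project layout above, with the file placed in sources/tests/:
# sources/tests/conftest.py (hypothetical location, matching the layout above)
import os
import sys

# Make the "sources" directory importable so "from production..." works under pytest
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))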
# Pytest error ‘module not found’ when using pre-commit hooks if using custom packages in the source code
Problem Description
Project structure:
/sources/production/model_service.py
/sources/tests/unit_tests/test_model_service.py
In test_model_service.py
:
from production.model_service import ModelService
A git commit -m 'test' raises No module named 'production' when calling the pytest hook:
- repo: local
  hooks:
    - id: pytest-check
      name: pytest-check
      entry: pytest
      language: system
      pass_filenames: false
      always_run: true
      args: ["tests/"]
Solution Description
Use this hook instead:
- repo: local
  hooks:
    - id: pytest-check
      name: pytest-check
      entry: "./sources/tests/unit_tests/run.sh"
      language: system
      types: [python]
      pass_filenames: false
      always_run: true
Ensure that run.sh
sets the correct directory and runs pytest:
cd "$(dirname "$0")"
cd ../..
export PYTHONPATH=.
pipenv run pytest ./tests/unit_tests
# Github actions: Permission denied error when executing script file
Problem Description
This issue occurs when running the following step in the CI YAML file definition:
- name: Run Unit Tests
working-directory: "sources"
run: ./tests/unit_tests/run.sh
When executing the GitHub CI action, the following error occurs:
…/tests/unit_test/run.sh Permission error
Error: Process completed with error code 126
Solution
To resolve this issue, add execution permission to the script and commit the changes:
git update-index --chmod=+x ./sources/tests/unit_tests/run.sh
# Managing Multiple Docker Containers with docker-compose profile
Problem Description
When a Docker Compose file contains many containers, running them all may consume too many resources. There is often a need to easily select only a group of containers while ignoring irrelevant ones during testing.
Solution Description
Add profiles: ["profile_name"] in the service definition within your docker-compose.yml file.
Start the service with the specific profile using the command:
docker-compose --profile profile_name up
# AWS CLI: Why do AWS CLI commands throw <botocore.awsrequest.AWSRequest object at 0x74c89c3562d0> type messages when listing or creating AWS S3 buckets with LocalStack?
If you encounter such messages when trying to list your AWS S3 buckets (e.g., aws --endpoint-url=http://localhost:4566 s3 ls
), you can try configuring AWS with the same region, access key, and secret key as those in your docker-compose
file.
To configure AWS CLI, follow these steps:
After installing the AWS CLI, run the following command in your terminal:
aws configure
Input the required information when prompted:
- AWS Access Key ID: [Example: abc]
- AWS Secret Access Key: [Example: xyz]
- Default region name: [Example: eu-west-1]
# AWS: Regions need to match in docker-compose
Problem Description
If you are experiencing issues with integration tests and Kinesis, ensure that your AWS regions are consistent between docker-compose and your local configuration. Otherwise, you may create a stream in an incorrect region.
Solution Description
Set the region in your AWS config file:
~/.aws/config
Example:
region = us-east-1
Ensure that the region in your docker-compose.yaml is also set:
environment:
  - AWS_DEFAULT_REGION=us-east-1
# Isort Pre-commit
Problem Description
Pre-commit command was failing with isort repo.
Solution
- Set the isort version (rev) to 5.12.0 in .pre-commit-config.yaml.
# How to destroy infrastructure created via GitHub Actions
Problem Description
Infrastructure created in AWS with CD-Deploy Action needs to be destroyed.
Solution Description
To destroy the infrastructure from local:
terraform init -backend-config="key=mlops-zoomcamp-prod.tfstate" --reconfigure
terraform destroy --var-file vars/prod.tfvars
# Error "[Errno 13] Permission denied: '/home/ubuntu/.aws/credentials'" when running any aws command
After installing AWS CLI v2 on Linux, you may encounter a permission error when trying to run AWS commands that require access to your credentials. For example, when running aws configure
, you might insert the key and secret but receive a permission error.
The issue arises because the ubuntu
user does not have permission to read or write files in the .aws
folder, and the credentials
and config
files do not exist. To resolve this:
Navigate to the .aws folder, typically located at /home/ubuntu/.aws.
Create empty credentials and config files:
touch credentials
touch config
Modify the file permissions:
sudo chmod -R 777 credentials
sudo chmod -R 777 config
Run aws configure, enter the key and secret, and save them to the credentials file. You can then execute AWS commands from your Python scripts or the command line.
# Why do I get a `ValueError: Invalid endpoint` error when using Boto3 with Docker Compose services?
Boto3 does not support underscores (_) in service URLs. Naming your Docker Compose services with underscores will cause Boto3 to throw an error when connecting to the endpoint. (Source: GitHub Issue)
Incorrect Docker Compose configuration with underscores
docker-compose.yml
version: '3.8'
services:
  backend_service:
    image: my_backend_image
    ...
  s3_service:
    image: localstack/localstack
    …
Rename your services to avoid using underscores. For example, change s3_service
to s3service
.
This way, when you run
client = boto3.client('s3', endpoint_url="http://s3service:4566")
you won’t get an error.
# Pre-commit fails with RuntimeError: The Poetry configuration is invalid
data.extras.pipfile_deprecated_finder[2] must match pattern ^[a-zA-Z-_.0-9]+$
Solution:
This is caused by a mismatch between the version pinned in .pre-commit-config.yaml and the actual package versions. Check the versions in Pipfile.lock and update as appropriate.
# Why do I get a “ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()” error when doing unit test that involves comparing two data frames?
When you compare two Pandas DataFrames, the result is also a DataFrame. The same is true for Pandas Series. To properly compare them, you should not compare data frames directly.
Instead, convert the actual and expected DataFrames into a list of dictionaries, then use assert
to compare the resulting lists.
Example:
actual_df_list_dicts = actual_df.to_dict('records')
expected_df_list_dicts = expected_df.to_dict('records')
assert actual_df_list_dicts == expected_df_list_dicts
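Alternatively, pandas ships pandas.testing.assert_frame_equal, which compares two DataFrames directly and raises a readable diff on mismatch; a short sketch with hypothetical column names:
import pandas as pd

actual_df = pd.DataFrame({"duration": [10.0, 12.5]})    # hypothetical actual result
expected_df = pd.DataFrame({"duration": [10.0, 12.5]})  # hypothetical expected result

pd.testing.assert_frame_equal(
    actual_df.reset_index(drop=True),
    expected_df.reset_index(drop=True),
)  # raises AssertionError with a detailed diff if the frames differ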
Capstone Project
# pytest doesn't recognize my installed libraries, but the script works in the terminal. Why?
This usually happens because VS Code is using a different Python interpreter than the one in your terminal. As a result, pytest can't see the packages installed in your virtual environment.
How to fix:
In your terminal, run:
which python
In VS Code, open the command palette (Ctrl+Shift+P) and select:
Python: Select Interpreter
Choose the same interpreter shown in step 1.
# Is it a group project?
No, the capstone is a solo project.
# Do we submit 2 projects, what does attempt 1 and 2 mean?
You only need to submit one project. If the submission at the first attempt fails, you can improve it and re-submit during the attempt 2 submission window.
- If you want to submit two projects for the experience and exposure, you must use different datasets and problem statements.
- If you can’t make it to the attempt 1 submission window, you still have time to catch up to meet the attempt 2 submission window.
Remember that the submission does not count towards the certification if you do not participate in the peer review of three peers in your cohort.
# How is my capstone project going to be evaluated?
Each submitted project will be evaluated by three randomly assigned students who have also submitted the project.
You will also be responsible for grading the projects of three fellow students yourself. Please be aware that not complying with this rule will result in failing to achieve the Certificate at the end of the course.
The final grade you receive will be the median score of the grades from the peer reviewers.
The peer review criteria for evaluation must follow the guidelines defined here (TBA for link).
# Homework: What are the criteria for scoring homework?
Each homework assignment has a scoring system based on the following criteria:
- Answering 6 questions correctly: 6 points
- Adding 7 public learning items: 7 points
- Adding 1 valid question to the FAQ: 1 point
In total, you can earn up to 14 points per homework, which will contribute to the leaderboard ranking.