Lecture 1.2: Preparing the Environment

Table of Contents

  1. Overview
  2. Setup Option 1: GitHub Codespaces (Recommended)
  3. Common Setup Steps (All Options)
  4. Setup Option 2: Anaconda/Miniconda
  5. Setup Option 3: Local Machine with Docker
  6. Summary and Next Steps

Overview

This lecture covers setting up the development environment for the LLM course. You have three main setup options, each with different advantages depending on your experience level and requirements.

Setup Option 1: GitHub Codespaces (Recommended)

Codespaces comes with Docker and Python pre-installed:

  • Python: The programming language we’ll use for the entire course
  • Docker: Containerization platform for running isolated applications (like databases)
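You can confirm both are available from the Codespaces terminal (exact versions will vary):

# Check the pre-installed tools
python --version
docker --version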

Without Codespaces, you’d need to manually install:

  1. Python (correct version, package manager, virtual environments)
  2. Docker Desktop (can be tricky on some systems, requires admin rights)
  3. Git (for version control)
  4. Code editor (VS Code or similar)

With Codespaces, you get:

  • Everything ready in ~30 seconds
  • Consistent environment for all students
  • No installation headaches
  • Works on any computer with a browser

When to Choose This Option

  • You’re new to development environments
  • You want to get started quickly
  • You don’t want to install software locally
  • You’re comfortable with browser-based development

Creating Your Repository

  1. Go to your GitHub account
  2. Create new repository:
    • Name: llm-zoomcamp (or your preferred name)
    • Visibility: Must be public (for homework sharing)
    • Add: Python .gitignore
    • License: Optional
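If you prefer the command line and already have the GitHub CLI (gh) installed and authenticated, a roughly equivalent sketch:

# Create a public repository with a Python .gitignore
gh repo create llm-zoomcamp --public --gitignore Python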

Launching Codespace

  1. Click Code button in your repository
  2. Switch to Codespaces tab
  3. Click Create codespace on main
  4. Choose browser or VS Code desktop

Using VS Code Desktop

  1. Install VS Code desktop application
  2. Install GitHub Codespaces extension
  3. Click “Open in VS Code Desktop” when prompted

Installing Required Libraries

As mentioned, Codespaces comes with Docker and Python pre-installed, so we only need to add a few Python libraries.

Install the Python Libraries

pip install notebook==7.1.2 openai elasticsearch scikit-learn pandas

Library breakdown:

  • notebook==7.1.2: Jupyter notebook environment for interactive coding and completing homework assignments
  • openai: Official OpenAI Python client for accessing GPT models and other OpenAI APIs
  • elasticsearch: Search engine for document retrieval in RAG (Retrieval-Augmented Generation) systems
  • scikit-learn: Machine learning library for text processing, similarity calculations, and embeddings
  • pandas: Data manipulation library for handling CSV files, datasets, and structured data processing
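To confirm the installation succeeded, you can print the installed versions from Python (a quick check using only the standard library):

# Print the installed version of each course dependency
from importlib.metadata import version

for pkg in ["notebook", "openai", "elasticsearch", "scikit-learn", "pandas"]:
    print(pkg, version(pkg))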

Common Setup Steps (All Options)

OpenAI API Configuration

Getting Your API Key

  1. Visit platform.openai.com
  2. Register/login to your account
  3. Navigate to API Keys section
  4. Create new API key with descriptive name
  5. Critical: Keep your key secure; never share it with others or commit it to Git

Setting Environment Variable with OpenAI API Key

Use this command:

export OPENAI_API_KEY="your-api-key-here"

The key is now available in this terminal session and to any process started from it.
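Note that export only applies to the current terminal session. If you want the key available in every new terminal, one common approach (a sketch assuming a bash shell; adapt for zsh) is to append the export to your shell profile:

# Persist the key for future sessions (keep this file out of Git)
echo 'export OPENAI_API_KEY="your-api-key-here"' >> ~/.bashrc
source ~/.bashrc

In Codespaces, you can alternatively store the key as a Codespaces secret in your GitHub settings.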

Jupyter Notebook Setup

Starting Jupyter

jupyter notebook

Accessing Notebook

  • Codespaces: Automatically forwards port 8888 to localhost:8888
  • Local: Access via localhost:8888 in browser
  • Token: Copy from terminal if prompted
  • Alternative ports: 8889, 8890 if 8888 is occupied
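For example, if 8888 is already occupied, you can pick a port explicitly:

# Start Jupyter on an alternative port
jupyter notebook --port 8889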

When you go to localhost:8888, paste the token printed in the terminal to log in. Alternatively, follow the full URL (with the token already included) that Jupyter prints in the terminal.

When Jupyter Notebook is open, it shows the contents of your llm-zoomcamp repo with all the files you’ve created; in our case, just the README.

Create a new Python 3 notebook to start working there.

Testing Your Setup

In your new notebook, you can test the OpenAI API key by making a single API call to the model.

You can do it like this:

OpenAI API Test

from openai import OpenAI

# Create a client; it reads OPENAI_API_KEY from the environment
client = OpenAI()

# Alternatively, pass the key explicitly (avoid hard-coding keys in notebooks you commit)
# client = OpenAI(api_key='your-api-key-here')

You can test whether the OpenAI API key is set as an environment variable by running:

import os
print(os.environ.get('OPENAI_API_KEY'))
If you encounter an issue here, read on.

Important: Environment Variables in Jupyter

Common issue: if you started Jupyter Notebook from a different terminal than the one where you exported the OpenAI API key, that terminal sees the key, but Jupyter shows None when you check os.environ.get('OPENAI_API_KEY').

Why this happens:

  • Environment variables are specific to each terminal session
  • If you set the variable in Terminal A but start Jupyter from Terminal B, Jupyter won’t see it
  • Even in the same terminal: if you start Jupyter first and export the variable afterwards, the already-running Jupyter process won’t see it

Jupyter runs as a separate process and only inherits the environment variables of the terminal session it was started from.

Solutions:

Option 1: Set the key directly in Jupyter (easiest)

import os
os.environ['OPENAI_API_KEY'] = 'your-actual-api-key-here'

# Verify it worked
print(os.environ.get('OPENAI_API_KEY'))

Option 2: Restart Jupyter from the same terminal

# In the terminal where you set the environment variable
jupyter notebook

Test API Call

# Test API call
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Is it too late to join the course?"}
    ]
)

# Check the answer
print(response.choices[0].message.content)

Parameters explained:

  • model: The model you’re calling (e.g., “gpt-4o”)
  • messages: List of message objects with role and content
  • role: Commonly “system,” “user,” or “assistant.” The system message provides high-level instructions, the user message carries the query or prompt, and assistant messages are the model’s responses. Differentiating these roles sets the context and directs the conversation.
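To illustrate these roles, here is a minimal sketch that adds a system message to the earlier test call (the prompt text is just an example):

# Same call as above, with a system message steering the model's behavior
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful teaching assistant for an LLM course."},
        {"role": "user", "content": "Is it too late to join the course?"}
    ]
)

print(response.choices[0].message.content)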

Git Workflow

Project Structure

llm-zoomcamp/
├── 01-intro/
│   └── homework.ipynb
├── 02-rag/
└── README.md
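You can create this layout from the repository root:

# Create the module folders
mkdir -p 01-intro 02-rag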

Committing Your Work

# Stage changes
git add .

# Commit with meaningful message
git commit -m "Module 1 homework completed"

# Push to GitHub
git push

Note: Stop the codespace when you’re done to conserve your free hours.

Setup Option 2: Anaconda/Miniconda

When to Choose This Option

  • You prefer a Python-focused environment
  • You want many libraries pre-installed
  • You’re familiar with conda package management

Choosing a Distribution

  1. Anaconda (larger, more libraries): Visit docs.anaconda.com
  2. Miniconda (smaller, minimal): Visit docs.conda.io

Installing Anaconda/Miniconda

# Download installer for Miniconda (Linux example)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

# Install
bash Miniconda3-latest-Linux-x86_64.sh

# Accept the license terms when prompted

# Initialize conda, then restart your shell so the changes take effect
conda init
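Optionally, you can keep the course work isolated in its own conda environment instead of installing into base (the environment name and Python version here are choices, not course requirements):

# Create and activate a dedicated environment
conda create -n llm-zoomcamp python=3.10
conda activate llm-zoomcamp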

Installing Required Packages

With Anaconda/Miniconda, you only need to install:

pip install openai elasticsearch tqdm

Pre-installed libraries: jupyter, pandas, scikit-learn
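A quick way to confirm the bundled tools are available:

# Verify the pre-installed libraries
jupyter --version
python -c "import pandas, sklearn; print('pandas and scikit-learn ok')"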

When this is ready, follow the steps from Common Setup Steps.

Setup Option 3: Local Machine with Docker

Docker delivers software in packages called containers that are isolated from each other. This is crucial for reproducible development workflows.

Think of containers as:

  • Lightweight, portable boxes containing everything needed to run an application
  • Isolated environments that don’t interfere with your host system
  • Reproducible environments that work identically everywhere

When to Choose This Option

  • You want full control over your environment
  • You plan to work offline frequently
  • You have a machine with GPU for Module 2
  • You’re comfortable with command-line tools

The Problem Docker Solves

Without Docker (Traditional Problems):

  • Version Conflicts: Multiple Python versions causing PATH issues
  • Library Conflicts: Project A needs pandas 1.0, Project B needs pandas 2.0
  • “Works on My Machine”: Code works locally but fails in production
  • Installation Nightmares: Complex setup procedures for each tool
  • System Pollution: Leftover files and configurations cluttering your system

With Docker (Clean Solution):

Your Computer:
├── Container 1: Python 3.9 + LLM libraries + your scripts
├── Container 2: PostgreSQL 13 database
├── Container 3: Elasticsearch for document search  
├── Container 4: Another project (completely isolated)
└── Host machine stays clean and unaffected

Installing Docker

  1. Visit Docker’s official website
  2. Download installer for your operating system
  3. Follow installation instructions
  4. Test installation: docker run hello-world

For detailed Docker learning, check the Data Engineering Zoomcamp Docker module.

Terminal Setup by OS

  • Windows: Git Bash (recommended), WSL, or PowerShell
  • macOS: Built-in Terminal
  • Linux: Built-in terminal

Creating Your Environment

Method A: Dockerfile

Create a Dockerfile:

FROM python:3.9

# Install required libraries
RUN pip install notebook==7.1.2 openai elasticsearch scikit-learn pandas

# Set working directory
WORKDIR /app

# Default to bash for interactive use
ENTRYPOINT ["/bin/bash"]

Build and run:

# Build your image
docker build -t llm-course .

# Run interactive container
docker run -it llm-course
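To run Jupyter from inside the container, you also need to publish the port and, ideally, mount your working directory so your files survive the container; a sketch assuming the llm-course image built above:

# Publish Jupyter's port and mount the current directory into /app
docker run -it -p 8888:8888 -v "$(pwd)":/app llm-course

# Inside the container, bind to all interfaces so the host browser can reach it
jupyter notebook --ip 0.0.0.0 --port 8888 --allow-root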

Method B: Direct Installation

# Run Python container
docker run -it python:3.9 bash

# Inside container, install libraries
pip install notebook==7.1.2 openai elasticsearch scikit-learn pandas

Important: Changes made inside a running container are lost when the container is removed, and nothing installed this way is saved into the image. Use a Dockerfile for a permanent, reproducible setup.

When this is ready, follow the steps from Common Setup Steps.

Summary and Next Steps

What You Should Have Now

  • ✅ Working development environment (one of the three options)
  • ✅ All required Python libraries installed
  • ✅ OpenAI API key configured
  • ✅ Jupyter notebook running
  • ✅ Git repository set up

Key Takeaways

  1. Multiple paths work: Choose the setup that matches your experience level
  2. Reproducibility matters: Docker ensures consistent environments
  3. Security first: Protect your API keys
  4. Version control: Use Git for tracking your progress

Next Steps

Lecture 1.3 will cover:

  • Document indexing with search engines
  • Building your first RAG (Retrieval-Augmented Generation) pipeline
  • Working with the search implementation

Recommended: Complete a simple test notebook to verify everything works before proceeding.