RepoFlow | Mirror the Entire PyPI Repository with Bash

Mirroring the entire PyPI repository can be essential for organizations with strict security requirements or air-gapped networks that need a complete, self-contained copy of PyPI. This approach can also be useful for enterprises that require local access to all available Python packages without relying on an external internet connection.

Why Mirror PyPI?

Mirroring a package repository like PyPI can be beneficial for the following reasons:

Air-Gapped Networks: For secure environments where internet access is restricted or completely unavailable.
Regulatory Compliance: Some organizations need complete control over their software supply chain for compliance purposes.
Disaster Recovery: Ensures packages are always available, even if the external repository goes down.

Prerequisites

Before you start, make sure you have the following installed:

Bash: Pre-installed on most Linux distributions and macOS. On Windows, you can use the Windows Subsystem for Linux (WSL), Git Bash, or Cygwin.
wget: Downloads the package files.
curl: Fetches the PyPI index pages.
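
Before starting a long-running mirror, it can help to confirm that the required tools are on the PATH. The sketch below is one possible check, not part of the original script; the helper name missing_tools is hypothetical. It also looks for awk, sed, and tput, since the script calls them as well.

```bash
#!/bin/bash

# Print the name of every command in the argument list that is not on PATH.
# missing_tools is an illustrative helper, not part of the mirror script.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# Report anything the mirror script needs but cannot find.
missing=$(missing_tools bash wget curl awk sed tput)
if [ -n "$missing" ]; then
    echo "Missing required tools:" $missing >&2
fi
```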

Understanding the Script

This Bash script is designed to mirror the entire PyPI repository to a local directory. It crawls the PyPI package index, retrieves the list of all available packages, and then downloads every available version of each package. This approach creates a local, self-contained copy of PyPI, which can be particularly useful for air-gapped networks or organizations with strict security requirements.
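
To make the crawling step concrete, here is a small sketch of how package names can be pulled out of the index HTML. It reuses the awk and sed pipeline from the script in this article; the function name extract_package_names and the two-line sample are made up for illustration, and the sample only assumes the index lists packages as links of the form /simple/<name>/.

```bash
#!/bin/bash

# Read index HTML on stdin and print one package name per line.
# Same awk/sed pipeline as the mirror script; the function name is illustrative.
extract_package_names() {
    awk -F '"' '/href="/ {print $2}' | sed 's|/simple/||g' | sed 's|/$||'
}

# A made-up two-entry sample in the shape the script expects.
sample='<a href="/simple/requests/">requests</a>
<a href="/simple/numpy/">numpy</a>'

# Prints "requests" and "numpy" on separate lines.
printf '%s\n' "$sample" | extract_package_names
```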

Consider the Storage Requirements

Keep in mind that this process can require a significant amount of storage, depending on the number of packages and versions you choose to mirror. Currently, PyPI hosts over 4 million packages, totaling around 27.6 TB of data. Be sure you have sufficient storage capacity before starting.
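
One way to avoid running out of disk halfway through is to check free space before starting. The following is a minimal sketch under stated assumptions: has_free_kb is a hypothetical helper, and the required size is simply the 27.6 TB figure quoted above converted to 1024-byte blocks, so adjust it to your own estimate.

```bash
#!/bin/bash

# Succeed if the filesystem holding directory $1 has at least $2 KiB free.
# df -P prints one line per filesystem; column 4 is the available space.
has_free_kb() {
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 {print $4}')
    [ "$avail_kb" -ge "$2" ]
}

# 27.6 TB (27.6 * 10^12 bytes) expressed in 1024-byte blocks.
required_kb=26953125000

if has_free_kb . "$required_kb"; then
    echo "Enough free space for a full mirror."
else
    echo "Not enough free space for a full mirror." >&2
fi
```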

The Script

Here is a Bash script to mirror the entire PyPI repository:

#!/bin/bash

# Create the mirror directory
mkdir -p ./pypi_mirror
# Log file to track last mirrored package
LOG_FILE="./pypi_mirror/index.log"

# Get the list of all package names (strip "/simple/")
packages=($(curl -s https://pypi.org/simple/ | awk -F '"' '/href="/ {print $2}' | sed 's|/simple/||g' | sed 's|/$||'))

# Get the total number of packages
total_packages=${#packages[@]}
start_time=$SECONDS

echo "Total packages to download: $total_packages"
echo ""

# Read last completed package from log
if [[ -f "$LOG_FILE" ]]; then
    last_package=$(tail -n 1 "$LOG_FILE")
    echo "Resuming from package: $last_package"
    skip=true
else
    last_package=""
    skip=false
fi

# Loop through each package and download all available versions
for i in "${!packages[@]}"; do
    package="${packages[$i]}"

    # Skip previously completed packages
    if [[ "$skip" == true ]]; then
        if [[ "$package" == "$last_package" ]]; then
            skip=false # Found the last completed package, start from the next one
        fi
        continue
    fi

    # Update progress
    progress=$(( (i + 1) * 100 / total_packages ))
    elapsed_time=$(( SECONDS - start_time ))
    remaining_pkgs=$(( total_packages - i - 1 ))
    # Multiply before dividing so the ETA does not round down to 0
    # when a package takes less than one second on average
    eta=$(( elapsed_time * remaining_pkgs / (i + 1) ))
    # Prevent negative ETA
    if [[ $eta -lt 0 ]]; then eta=0; fi

    # Progress bar settings
    bar_length=40
    filled_length=$(( bar_length * (i + 1) / total_packages ))
    empty_length=$(( bar_length - filled_length ))

    # Construct progress bar: filled part plus padding, always bar_length wide
    filled=$(printf '%*s' "$filled_length" '')
    bar="${filled// /█}"
    empty_bar=$(printf '%*s' "$empty_length" '')

    # Print progress dynamically
    tput sc
    echo -ne "Progress: [$bar$empty_bar] $progress% | Elapsed: ${elapsed_time}s | ETA: ${eta}s | Downloading: $package\r"
    tput rc

    # Create a directory for the package
    mkdir -p "./pypi_mirror/$package"

    # Get the list of package files from PyPI
    package_page=$(curl -s "https://pypi.org/simple/$package/")

    # Extract all file URLs
    urls=$(echo "$package_page" | awk -F '"' '/href="https/ {print $2}')

    if [[ -z "$urls" ]]; then
        continue # Skip if no files found
    fi

    # Download each file (silent mode to keep terminal clean)
    for url in $urls; do
        # Drop the "#..." fragment so the saved name matches the check below
        cleaned_url="${url%%#*}"
        file_name="./pypi_mirror/$package/$(basename "$cleaned_url")"

        # Check if the file already exists and is not empty
        if [[ -s "$file_name" ]]; then
            echo "Skipping already downloaded file: $file_name"
            continue
        fi

        wget -q -P "./pypi_mirror/$package/" "$cleaned_url"
    done

    # Log the completed package
    echo "$package" >> "$LOG_FILE"
done

# Final message
echo -e "\n\n🎉 PyPI mirroring complete! All $total_packages packages downloaded."
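
The script above discards the part of each file URL after the # character. On PyPI that fragment currently has the form #sha256=<hex digest>, which makes it possible to verify each download. The sketch below shows one way this could be done; it is an optional extension rather than part of the original script, the helper names hash_from_url and verify_sha256 are hypothetical, and it assumes either sha256sum or shasum is installed.

```bash
#!/bin/bash

# Print the sha256 hex digest carried in a URL fragment, or nothing if absent.
hash_from_url() {
    case "$1" in
        *"#sha256="*) echo "${1##*"#sha256="}" ;;
    esac
}

# Succeed if the SHA-256 digest of file $1 equals the expected hex string $2.
# Falls back to shasum (common on macOS) when sha256sum is unavailable.
verify_sha256() {
    if command -v sha256sum >/dev/null 2>&1; then
        actual=$(sha256sum "$1" | awk '{print $1}')
    else
        actual=$(shasum -a 256 "$1" | awk '{print $1}')
    fi
    [ "$actual" = "$2" ]
}

# Inside the download loop, after wget, this could be used as:
#   expected=$(hash_from_url "$url")
#   if [ -n "$expected" ] && ! verify_sha256 "$file_name" "$expected"; then
#       rm -f "$file_name"   # discard the corrupt file so a later run retries it
#   fi
```

Deleting a file that fails verification works with the script's existing skip check: the next run no longer finds the file and downloads it again.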

Key Features of the Script

Resumable Downloads: The script can resume from the last completed package if interrupted.
Progress Bar: Real-time progress bar to track the download status.
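
The resume behavior can be illustrated in isolation. In the script, a package name is written to the log only after all of its files have been processed, and a restart skips every package up to and including the last logged name. The sketch below reproduces that selection on a list read from standard input; remaining_after is a hypothetical helper written for this illustration.

```bash
#!/bin/bash

# Print the package names from stdin that come after the last completed one.
# With an empty marker (no log file yet), every name is passed through.
remaining_after() {
    if [ -z "$1" ]; then
        cat
        return
    fi
    # Appending "" forces a string comparison, even for numeric-looking names.
    awk -v last="$1" 'found { print } ($0 "") == (last "") { found = 1 }'
}

# "beta" was the last logged package, so only "gamma" remains.
printf '%s\n' alpha beta gamma | remaining_after beta
```

Because the marker is per package, a package that was interrupted partway through is processed again on the next run, and the script's file-exists check keeps it from downloading the same files twice.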

Alternative Methods

bandersnatch: A PyPI package for mirroring Python packages. More details at bandersnatch on PyPI.

Final Thoughts

This is a simple example script to demonstrate how mirroring PyPI can be achieved. Feel free to modify it based on your specific needs, whether that's optimizing for speed, adding error handling, or integrating it with your existing infrastructure.
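
As one example of the error handling mentioned above, transient network failures can be handled by retrying a command a few times. This is a minimal sketch, not a definitive implementation; the function name retry and the RETRY_DELAY variable are hypothetical names chosen for this example.

```bash
#!/bin/bash

# Run a command up to $1 times, pausing RETRY_DELAY seconds (default 5)
# between attempts. Returns 0 on the first success, 1 if every attempt fails.
retry() {
    max_attempts=$1
    shift
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "${RETRY_DELAY:-5}"
    done
    return 0
}

# In the download loop, the wget call could be wrapped like this:
#   retry 3 wget -q -P "./pypi_mirror/$package/" "$url"
```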

Happy mirroring!
