How to fix relative links in a site downloaded by Wget

Wget can have trouble turning some absolute links into relative ones when downloading a whole website. I am using GNU Wget 1.21.2, and after running this command to download a site:

wget -mpEk "https://url-of-a-website"

the links that were absolute stay absolute, so the site does not fully work offline.

But it can be fixed with a simple Bash script!

You can replace all absolute links (URLs) easily since you know the prefix https://url-of-a-website. The script just needs to keep track of how deeply each file is nested. For a file at the top level of the mirror it simply drops the prefix https://url-of-a-website/, which leaves a link relative to the current directory. For a file in a subdirectory it replaces the prefix with one ../ per directory level.
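To make the nesting rule concrete, here is a minimal sketch of just the mapping from a file path to the prefix that takes the place of the domain. The helper name rel_prefix is made up for illustration, and the sketch assumes paths in the form printed by find . when run from the root of the mirror (so every path starts with ./).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of wget): print the relative prefix for a
# path as reported by `find .` from the root of the mirror.
rel_prefix() {
    local path=$1 slashes prefix="" i
    # Keep only the slashes: "./index.html" has one, "./a/b/c.html" has three.
    slashes=$(tr -cd '/' <<< "$path")
    # One "../" for every directory level below the root.
    for ((i = 1; i < ${#slashes}; i++)); do
        prefix+="../"
    done
    # At the top level there is nothing to climb, so fall back to "./".
    printf '%s\n' "${prefix:-./}"
}

rel_prefix "./index.html"       # prints ./
rel_prefix "./blog/post.html"   # prints ../
rel_prefix "./a/b/c.html"       # prints ../../
```

Counting slashes works here only because find . always prefixes paths with ./, so the first slash never represents a real directory level.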

The code:

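# Run this from the root of the mirror, i.e. inside the directory wget created.
DOMAIN="https://url-of-a-website"

replace_urls() {
    local file=$1 depth prefix="" i
    # "./index.html" contains one slash; each additional slash is one directory deeper.
    depth=$(echo "$file" | grep -o "/" | wc -l)
    for ((i = 1; i < depth; i++)); do
        prefix+="../"
    done
    # Top-level files get "./" so the link stays relative instead of becoming "/path".
    sed -i "s|${DOMAIN}/|${prefix:-./}|g" "$file"
}

export -f replace_urls
export DOMAIN

find . -type f -exec bash -c 'replace_urls "$0"' {} \;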
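
After the run it can be useful to check whether any absolute links are left. The following is only a sketch: the helper name remaining_absolute_links is invented for this example, and the demo builds a throwaway directory with two made-up files instead of touching a real mirror.

```shell
#!/usr/bin/env bash
DOMAIN="https://url-of-a-website"   # same placeholder as in the script

# Hypothetical helper: list files under a directory that still mention DOMAIN.
# -F treats DOMAIN as a fixed string, so its dots are not regex wildcards.
remaining_absolute_links() {
    grep -rlF -- "$DOMAIN" "${1:-.}" || true
}

# Demo on a temporary directory standing in for the mirror.
demo=$(mktemp -d)
printf '<a href="about.html">already relative</a>\n' > "$demo/index.html"
printf '<a href="%s/about.html">still absolute</a>\n' "$DOMAIN" > "$demo/old.html"
remaining_absolute_links "$demo"    # lists only the path of old.html
rm -rf "$demo"
```

An empty result means every occurrence of the prefix was rewritten.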

Voilà!

published: 2024-04-01
last modified: 2024-04-17

https://vit.baisa.cz/notes/code/fix-wget-relative-links/