How to fix relative links in a site downloaded by wget
Wget may have trouble turning some absolute links into relative ones when downloading a whole website. I am using GNU Wget 1.21.2, and after running this command to download a site:
wget -mpEk "https://url-of-a-website"
the links that were absolute stay absolute, so the website does not fully work offline.
But it can be fixed with a simple Bash script!
You can replace all absolute links (URLs) easily since you know the prefix https://url-of-a-website.
The script just needs to keep track of how deeply each file is nested. At the top level it simply removes the prefix https://url-of-a-website. At deeper levels it replaces the prefix with one ../ per directory level.
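To make the nesting rule concrete, here is a minimal sketch of the prefix calculation on its own. The helper name prefix_for and the sample paths are made up for illustration, and it assumes paths in the form printed by find ., that is, starting with ./ .

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the "../" prefix for a path as printed by `find .`.
prefix_for() {
  local path=$1 slashes prefix="" i
  # Keep only the slashes of the path, then use the string length as the count.
  slashes=$(printf '%s' "$path" | tr -cd '/')
  # One "../" per directory level below the site root (slash count minus one).
  for ((i = 1; i < ${#slashes}; i++)); do
    prefix+="../"
  done
  printf '%s' "$prefix"
}

# Illustrative paths, not taken from a real site.
for p in ./index.html ./notes/index.html ./notes/code/index.html; do
  printf '%s -> "%s"\n' "$p" "$(prefix_for "$p")"
done
```

A file directly in the site root gets an empty prefix, and each extra directory level adds one more ../ .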
The code:
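DOMAIN="https://url-of-a-website"
replace_urls() {
  local file=$1 depth i relative_path=""
  # Paths from find look like ./a/b.html: slash count minus one = nesting level.
  depth=$(grep -o "/" <<< "$file" | wc -l)
  # Loop instead of printf/seq: with no arguments printf still prints its
  # format once, which would add a stray "../" at the top level.
  for ((i = 1; i < depth; i++)); do relative_path+="../"; done
  sed -i "s|${DOMAIN}|${relative_path}|g" "$file"
}
export -f replace_urls
export DOMAIN
# Run from the root directory of the downloaded site.
find . -type f -exec bash -c 'replace_urls "$0"' {} \;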
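After running the script, one way to check the result is to search for files that still contain the prefix. This is only a sketch: the helper name leftover_files is hypothetical, and it assumes GNU grep and the same placeholder DOMAIN as in the script.

```bash
#!/usr/bin/env bash
DOMAIN="https://url-of-a-website"  # same placeholder prefix as in the script

# Hypothetical helper: list files under a directory that still contain the prefix.
# -r recurses, -l prints only file names, -F treats the prefix as a fixed string.
leftover_files() {
  grep -rlF -- "$DOMAIN" "${1:-.}"
}

# grep exits with a non-zero status when nothing matches.
if files=$(leftover_files .); then
  printf 'Still contains %s:\n%s\n' "$DOMAIN" "$files"
else
  echo "No absolute links to $DOMAIN left."
fi
```

An empty result means every occurrence of the prefix was rewritten.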
Voilà!
published: 2024-04-01
last modified: 2024-04-17
https://vit.baisa.cz/notes/code/fix-wget-relative-links/