Unix & Linux: Asked by speld_rwong on November 9, 2021
I am using the following wget command, and it downloads the files I need, except for one thing…
wget -U "Mozilla/5.0" --wait=3 --load-cookies cookies.txt --timestamping --recursive --level=2 --convert-links --no-parent --page-requisites --adjust-extension --max-redirect=0 --exclude-directories=blog --reject "*per_page=18.html" --reject "*per_page=36.html" (url here)
I want to download files like these:
a1546997.html
But I don’t want to download files like these:
a1546997.html?pwd=&per_page=36.html
I cannot figure out how to stop wget from downloading the HTML pages with the query string tacked onto the end.
The main problem is that wget gets stuck retrying and times out on the second kind of link, because those URLs don't go anywhere, and then the client gets banned.
Any suggestions?
What I would do, pragmatic approach ahead:
wget ....
rename 's/\.html\?.*/.html/' *.html*
That is Perl's rename command (not the util-linux one, which takes different arguments).
Answered by Gilles Quenot on November 9, 2021
Try using the --reject-regex switch of wget. You could probably do something like:
wget --recursive --no-parent --reject-regex '\?' url
Answered by gabriel on November 9, 2021