Unix & Linux: Asked by speld_rwong on November 9, 2021
I am using the following wget command, and it downloads the files I need, except for one thing…
wget -U "Mozilla/5.0" --wait=3 --load-cookies cookies.txt --timestamping --recursive --level=2 --convert-links --no-parent --page-requisites --adjust-extension --max-redirect=0 --exclude-directories=blog --reject "*per_page=18.html" --reject "*per_page=36.html" (url here)
I want to download files like these:
a1546997.html
But I don’t want to download files like these:
a1546997.html?pwd=&per_page=36.html
I cannot figure out how to stop wget from downloading the HTML pages with the query string tacked onto the end.
The main problem is that wget gets stuck retrying and times out on the second kind of link, because those URLs don't go anywhere, and then the client gets banned.
Any suggestions?
What I would do, pragmatic approach ahead:
wget ....
rename 's/\.html\?.*/.html/' *.html*
That is Perl's rename command (not the util-linux one, which takes different arguments).
Answered by Gilles Quenot on November 9, 2021
Try using the --reject-regex switch of wget. You could probably do something like:
wget --recursive --no-parent --reject-regex '\?' url
Answered by gabriel on November 9, 2021