Stack Overflow Asked by Slacoff on November 15, 2021
The website has 146 pages of words, but after page 146 the last page is shown again.
`
if next_page is not None:
    yield response.follow(next_page, callback = self.parse)`
With this approach the spider does not stop at page 146; it keeps going because pages 147, 148, 149, ... are the same as page 146. I tried a for loop, but that did not work. I also tried taking the value of the next-page button and breaking out of the function with next_extract. The output of next_extract is ['kelimeler.php?s=1'], and the number increases with the page number, e.g. ['kelimeler.php?s=2']. That did not work either.
next_page = response.css('div.col-md-6.col-sm-6.col-xs-6:nth-child(2) a::attr(href)').get()
next_extract = response.css('div.col-md-6.col-sm-6.col-xs-6:nth-child(2) a').xpath("@href").extract()
print(next_page)
print(next_extract)
if next_extract is 'kelimeler.php?s=147':
    break
if next_page is not None:
    yield response.follow(next_page, callback = self.parse)
What should I do to stop the scraping at page 146?
This is the whole parse function:
def parse(self,response):
    items = TidtutorialItem()
    all_div_kelimeler = response.css('a.collapsed')
    for tid in all_div_kelimeler:
        kelime = tid.css('a.collapsed::text').extract()
        link= tid.css('a.collapsed::text').xpath("@href").extract()
        items['Kelime'] = kelime
        items['Link'] = link
        yield items
    next_page = response.css('div.col-md-6.col-sm-6.col-xs-6:nth-child(2) a::attr(href)').get()
    next_extract = response.css('div.col-md-6.col-sm-6.col-xs-6:nth-child(2) a').xpath("@href").extract()
    print(next_page)
    print(next_extract)
    if next_page is not None:
        #if next_extract is not 'kelimeler.php?s=2':
        #for i in range (10):
        yield response.follow(next_page, callback = self.parse)
I can't be very precise about the best approach without seeing the page, but I can give you some suggestions.
next_page = response.css('div.col-md-6.col-sm-6.col-xs-6:nth-child(2) a::attr(href)').get()
next_extract = response.css('div.col-md-6.col-sm-6.col-xs-6:nth-child(2) a').xpath("@href").extract()
I'm not sure what you are trying to accomplish here: both selectors are essentially the same, except that the second one uses the .extract() method, which returns a list. A list is never the same object as a string (and "is" tests identity, not equality, so strings should be compared with == anyway), which means the following check will always fail:
if next_extract is 'kelimeler.php?s=147':
    break
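As a rough illustration of why that check cannot succeed, here is a small sketch that needs no Scrapy at all. The two variables are hand-written stand-ins for what .get() and .extract() would return, using the href value from the question; they are not real selector output.

```python
# Stand-ins for the two selector results (values copied from the question).
next_page = "kelimeler.php?s=147"        # .get() returns a single str (or None)
next_extract = ["kelimeler.php?s=147"]   # .extract() returns a list of str

target = "kelimeler.php?s=147"

# A list never compares equal to a string, whatever the list contains.
list_equals_string = next_extract == target

# Comparing the string from .get() with == works as intended.
string_equals_string = next_page == target

# If the list is what you have, test membership instead.
target_in_list = target in next_extract

print(list_equals_string, string_equals_string, target_in_list)
```

The sketch uses == rather than "is" on purpose: "is" checks whether two names refer to the same object, which is not a reliable way to compare string contents.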
Another important point is that break is only valid inside a loop. Since parse has no loop around that if statement, Python rejects the code with SyntaxError: 'break' outside loop when the file is compiled, whether or not the condition would ever be True.
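To check how Python treats a break that is not inside a loop, the following sketch compiles a tiny function body from a string. The function in the string is a made-up example, not the spider from the question; the point is only that the error appears at compile time, before any condition is evaluated.

```python
# A made-up function whose only purpose is to contain a break outside a loop.
source = (
    "def parse():\n"
    "    if False:\n"
    "        break\n"
)

try:
    # compile() only parses the code; nothing in `source` is executed.
    compile(source, "<example>", "exec")
    error_message = None
except SyntaxError as exc:
    error_message = exc.msg

print(error_message)
```

Because the condition here is literally False and the error is still reported, the problem is clearly the placement of break rather than the value being compared.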
Again, without seeing the page I can't say this for sure, but I believe this would accomplish what you are trying to do:
if next_page == 'kelimeler.php?s=147':
    return
Notice next_page instead of next_extract. If you want to use the latter, remember it is a list, not a string.
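If you would rather not hardcode the exact href string, one possible variation is to read the number out of the s query parameter and compare it with the last page. This is only a sketch under assumptions: should_follow and LAST_PAGE are hypothetical names introduced here, and it assumes the s value in the next-page link maps to page numbers the way the question describes (so that s=147 is the first link not to follow).

```python
from urllib.parse import urlparse, parse_qs

LAST_PAGE = 146  # assumption: the last real page, as stated in the question


def should_follow(next_page, last_page=LAST_PAGE):
    """Return True if the next-page href should still be crawled."""
    if next_page is None:
        return False
    # Pull the query string out of a relative href like 'kelimeler.php?s=147'.
    query = parse_qs(urlparse(next_page).query)
    try:
        page_number = int(query["s"][0])
    except (KeyError, ValueError):
        # No usable 's' parameter: stop rather than guess.
        return False
    return page_number <= last_page


print(should_follow("kelimeler.php?s=146"))
print(should_follow("kelimeler.php?s=147"))
```

Inside parse, the final check would then become "if should_follow(next_page):" followed by the existing yield response.follow(next_page, callback=self.parse) line.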
Answered by renatodvc on November 15, 2021