Stack Overflow. Asked by Winters on November 16, 2021
I am scraping multiple pages using Scrapy-Splash.
import scrapy


class Spider(scrapy.Spider):
    name = "scrape"

    def start_requests(self):
        # get_urls() is my own helper returning the target URLs.
        urls = get_urls()
        for url in urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 8}
                }
            })
The code works fine; I get the desired results from the pages.
The problem is that I have to set a large wait time (more than 4 seconds), or Splash is sometimes terminated by the next request before it returns a result. This seems terribly unreliable.
Is there a way to set the wait time to something more dynamic? I found a partial solution using a Lua script here:
Adding a wait-for-element while performing a SplashRequest in python Scrapy
function main(splash)
    splash:set_user_agent(splash.args.ua)
    assert(splash:go(splash.args.url))
    -- requires Splash 2.3
    while not splash:select('.my-element') do
        splash:wait(0.1)
    end
    return {html=splash:html()}
end
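For context (my assumption, not something stated in the linked answer): a script like this is not sent to render.html but to Splash's execute endpoint, with the source passed as the lua_source argument. Inside start_requests that would look roughly like the following, where LUA_SCRIPT stands for the script above as a Python string:

# Sketch only: 'execute' runs lua_source; any extra args ('ua' here) are
# exposed inside the script as splash.args.ua, splash.args.url, etc.
yield scrapy.Request(url, self.parse, meta={
    'splash': {
        'endpoint': 'execute',
        'args': {'lua_source': LUA_SCRIPT, 'ua': 'Mozilla/5.0'}
    }
})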
But it appears to require a hard-coded selector ('.my-element') to tell Splash when to stop, and I am scraping many different websites, each with different elements to collect.
How can I set the ‘wait’ arg dynamically, or customise the Lua script so that Splash terminates as soon as it has collected the desired element? Surely this is a common problem?
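One direction that might work (a sketch, not a tested answer): everything under args is exposed to the script as splash.args, so the selector, and a cap on the total wait, can be passed per request instead of being hard-coded. Here get_urls_and_selectors() is hypothetical; it stands for whatever pairs each URL with the CSS selector that signals the page is ready:

import scrapy

# Selector and maximum wait come in via splash.args (splash:select requires
# Splash 2.3+). The loop gives up after max_wait seconds instead of spinning
# forever on a page where the element never appears.
LUA_WAIT_FOR = """
function main(splash)
    assert(splash:go(splash.args.url))
    local elapsed = 0
    while not splash:select(splash.args.selector) and elapsed < splash.args.max_wait do
        splash:wait(0.1)
        elapsed = elapsed + 0.1
    end
    return {html=splash:html()}
end
"""

class Spider(scrapy.Spider):
    name = "scrape"

    def start_requests(self):
        # Hypothetical helper yielding (url, css_selector) pairs, one per site.
        for url, selector in get_urls_and_selectors():
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'execute',
                    'args': {
                        'lua_source': LUA_WAIT_FOR,
                        'selector': selector,
                        'max_wait': 10,
                    },
                }
            })

With scrapy-splash's default response magic, returning {html=splash:html()} should populate response.body with the rendered HTML, so the existing parse() callback should keep working as it does with render.html.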