Stack Overflow Asked by papelr on December 20, 2021
I’m scraping this website: https://eintaxid.com/companies/a/?page=1
I can successfully print the text of every strong tag, which includes the company names:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://eintaxid.com/companies/a/?page=1')
soup = BeautifulSoup(r.content, "html.parser")
# outputs the text of all strong tags
for tag in soup.find_all('strong'):
    print(tag.text)
I can easily isolate the company names doing this:
import pandas as pd
import requests
from bs4 import BeautifulSoup

r = requests.get('https://eintaxid.com/companies/a/?page=1')
soup = BeautifulSoup(r.content, "html.parser")
table = soup.find_all('strong')
comp_list = []
# loop over the strong tags and keep only the text of the a tags inside them
for j in table:
    td = j.find_all(['a'])
    row = [i.text for i in td]
    comp_list.append(row)
# drop strong tags that had no link, then put company names into a pandas df
comp_list = list(filter(lambda x: len(x) > 0, comp_list))
comp_list = pd.DataFrame(comp_list, columns = ['Company']).reset_index(drop = True)
comp_list
I can’t for the life of me extract the EIN numbers, though. The <strong>EIN Number:</strong> label is plain to see, and I can extract that with the first code chunk above. But how do I get the actual number that comes after it (98-1455367, for example)?
For reference, I’m going to put the EIN number next to each company in a pandas df – but can’t really do that until I’ve extracted the EIN number itself.
Referring to the docs, you might want to use the next_sibling of your tag: find the strong tag first, then take the node that follows it:
strong_element.next_sibling # the node right after the "EIN Number:" tag, i.e. the number itself
"Sibling" in this context is the next node, not the next element/tag. Your element's next node is a text node, so you get the text you want.
Answered by Asiri H. on December 20, 2021
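The next_sibling approach from the answer can be sketched as follows. This is only an illustration: the HTML snippet, the company names, and the second EIN are hypothetical placeholders, and the real page may nest its tags differently, so the selection logic may need adjusting.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical markup, assumed for illustration; the real page may differ.
html = """
<div>
  <strong><a href="#">Company A</a></strong>
  <p><strong>EIN Number:</strong> 98-1455367</p>
</div>
<div>
  <strong><a href="#">Company B</a></strong>
  <p><strong>EIN Number:</strong> 00-0000000</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

eins = []
for tag in soup.find_all("strong"):
    # Skip strong tags that are not the EIN label (e.g. the company links).
    if tag.get_text(strip=True) != "EIN Number:":
        continue
    node = tag.next_sibling  # the node immediately after </strong>
    # Guard: the sibling may be missing or may be another tag, not text.
    if isinstance(node, NavigableString):
        eins.append(str(node).strip())

# Company names: the text of the first link inside each strong tag that has one.
names = [s.a.get_text(strip=True) for s in soup.find_all("strong") if s.a]

rows = list(zip(names, eins))
print(rows)
```

The rows list can then be passed to pd.DataFrame(rows, columns=['Company', 'EIN']). Pairing the two lists with zip assumes every company has exactly one EIN label; if some entries lack one, it is safer to loop over each company's container and look up both values inside it.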