Using BeautifulSoup to get the text after a strong tag, when that text is not within a ... itself

Question

I'm scraping this website: https://eintaxid.com/companies/a/?page=1 I've successfully extracted the company name: r = requests.get('https://eintaxid.com/companies/a/?page=1') soup = BeautifulSoup(r.content, "html.parser") # outputs all strong tags for tag in soup.find_all('strong'): print(tag.text) I can easily isolate the company names doing this: r = requests.get('https://eintaxid.com/companies/a/?page=1') soup = BeautifulSoup(r.content, "html.parser") table = soup.find_all('strong') comp_list = [] # loop to extract just the company names from the strong tags, then using the a tags for j in table: td = j.find_all(['a']) row = [i.text for i in td] comp_list.append(row) # puts company names into a pandas df comp_list = list(filter(lambda x: len(x) > 0, comp_list)) comp_list = pd.DataFrame(comp_list, columns = ['Company']).reset_index(drop = True) comp_list I can't for the life of me extract the EIN numbers, though. The EIN Number: is plain to see, and I can extract that from the first code chunk above. But how do I get the actual number? The 98-1455367 as seen in the following screenshot? For reference, I'm going to put the EIN number next to each company in a pandas df - but can't really do that until I've extracted the EIN number itself.

Asiri H. · Answer

Referring to the docs you might want to use the next_sibling of your  tag, catch the strong tag first, then get the next item from the context:
strong_element.next_sibling    # contains "EIN number"

"Sibling" in this context is the next node, not the next element/tag. Your element's next node is a text node, so you get the text you want.

Using BeautifulSoup to get the text after a strong tag, when that text is not within a ... itself

One Answer

Add your own answers!

Ask a Question