Stack Overflow Asked by santma on November 17, 2021
The Data
I have data taken from a webscraper that I am trying to clean. For each webpage scraped, I exported a csv consisting of one row and 10-14 columns.
Input:
"Featured Snippet" | title | misc. content | misc. content | misc. content | website | page title | "Feedback" | "About snippets"
The misc. content cells vary from csv to csv. Sometimes there are two, three, or four. What I am trying to do is combine these middle columns into a single string.
Output:
filename | website | page title | title | content
So, my code imports each csv in a for loop as a pandas dataframe. It extracts the second column for the title, then flips the dataframe to extract the 3rd-to-last column for website, the 4th-to-last for page title, and everything from the 5th-to-last column back to the first for the content (so the content includes extra data, the title and "Featured Snippet", but that’s OK because I can clean it in Excel later). It also gets the filename as a value. It puts all these values for each csv into lists, which I combine into a dataframe at the end.
Code
import glob
import os

import pandas as pd

files = sorted(glob.glob('*.csv'))
filenames = []
websites = []
pagetitles = []
titles = []
contents = []
for f in files:
    df = pd.read_csv(f, index_col=False)
    df = df[0:1]  # keep only the first row
    # Title is the second column
    title = df.iloc[:, 1]
    title = title.to_string(index=False)
    titles.append(title)
    # Reverse the column order so the trailing columns can be addressed from the front
    df_flipped = df.iloc[:, ::-1]
    website = df_flipped.iloc[:, 2]
    website = website.to_string(index=False)
    websites.append(website)
    pagetitle = df_flipped.iloc[:, 3]
    pagetitle = pagetitle.to_string(index=False)
    pagetitles.append(pagetitle)
    # Everything left of the page title, joined into one string
    content = df_flipped.iloc[:, 4:]
    content = content.dropna(axis=1)
    content = content.apply(lambda row: ' // '.join(row.values.astype(str)), axis=1)
    contents.append(content)
    filename = os.path.splitext(str(f))[0]
    filenames.append(filename)
snippet_data = pd.DataFrame(list(zip(filenames, websites, pagetitles, titles, contents)))
snippet_data.to_csv('datasets/black-friday-snippets.csv')
My Problem
I’ve actually done everything I wanted to do, but my content keeps getting truncated. I’ve tried a billion variations of the .join function, tried converting the content into a bunch of different datatypes, and I’ve already tried about 3904312590781038941 different ways of this:
pd.set_option('display.max_columns', 50000000)
pd.set_option('display.width', 1500000000)
Also, I’ve written a bunch of similar scripts and never had this problem.
Clues
I am using Spyder, and when I open up the content variable, I have to double click on the row to see the full content.
Content is a Series, and contents is a list of Series. Likewise, when I open the contents variable, I have to double click on the cell to see the full text.
Just to @#$# with my head even more, it shows the truncated version when I try print(content).
It truncates after pd.DataFrame(), but since it also truncates with the print() function, I have no idea exactly why or how to avoid this.
Yes, I tried pd.set_option(blah blah blah). Maybe I’m not using it right.
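The clues above suggest the stored data is intact and only its rendered form is shortened. A minimal sketch for checking that, assuming default pandas display settings; the 200-character string is a made-up stand-in for one scraped cell, not real data from the csv files:

```python
import pandas as pd

# Hypothetical stand-in for one long scraped cell.
long_text = "x" * 200
series = pd.Series([long_text])

# The value stored in the Series is the full Python string.
raw_value = series.iloc[0]

# to_string() returns a display-formatted copy, which may be shortened.
rendered = series.to_string(index=False)

print(len(raw_value))         # length of the stored value
print(len(rendered))          # length of the rendered text
print(rendered == raw_value)  # whether rendering preserved the value
```

If the two lengths differ, the truncation is happening in the formatting step (to_string() or print()), not in read_csv() or in the join.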
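OK, so I figured this one out by putting pd.options.display.max_colwidth = 500 in the for loop right after pd.read_csv(). to_string() renders values using pandas’ display settings, so long strings were being cut at display.max_colwidth, which defaults to 50 characters.
So it goes: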
import glob
import os

import pandas as pd

files = sorted(glob.glob('*.csv'))
filenames = []
websites = []
pagetitles = []
titles = []
contents = []
for f in files:
    df = pd.read_csv(f, index_col=False)
    df = df[0:1]  # keep only the first row
    # Raise the per-column character limit used when rendering values as text
    pd.options.display.max_colwidth = 500
    title = df.iloc[:, 1]
    title = title.to_string(index=False)
    titles.append(title)
    # Reverse the column order so the trailing columns can be addressed from the front
    df_flipped = df.iloc[:, ::-1]
    website = df_flipped.iloc[:, 2]
    website = website.to_string(index=False)
    websites.append(website)
    pagetitle = df_flipped.iloc[:, 3]
    pagetitle = pagetitle.to_string(index=False)
    pagetitles.append(pagetitle)
    # Everything left of the page title, joined into one string
    content = df_flipped.iloc[:, 4:]
    content = content.dropna(axis=1)
    content = content.apply(lambda row: ' // '.join(row.values.astype(str)), axis=1)
    contents.append(content)
    filename = os.path.splitext(str(f))[0]
    filenames.append(filename)
# Zip the accumulated lists (contents, not the last loop's content)
snippet_data = pd.DataFrame(list(zip(filenames, websites, pagetitles, titles, contents)))
snippet_data.to_csv('datasets/black-friday-snippets.csv')
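A possible variation, offered as a sketch rather than as the fix used above: reading each cell as a plain Python value avoids the display formatter altogether, and pd.option_context can lift the limit temporarily where to_string() is still wanted. The helper name extract_record, the sample row, and its placeholder labels are all hypothetical; the column positions mirror the loop above (second column, third-from-last, fourth-from-last). Unlike the loop, the content is joined in the original left-to-right order, and passing None for display.max_colwidth assumes pandas 1.0 or later.

```python
import pandas as pd


def extract_record(df: pd.DataFrame, sep: str = " // ") -> dict:
    """Pull the fields out of a one-row frame as plain Python values."""
    row = df.iloc[0]                 # first row as a Series
    middle = row.iloc[:-4].dropna()  # every cell before the last four
    return {
        "title": row.iloc[1],
        "website": row.iloc[-3],
        "pagetitle": row.iloc[-4],
        "content": sep.join(str(value) for value in middle),
    }


# Made-up sample row with placeholder labels (not real scraped data).
long_text = "y" * 200
frame = pd.DataFrame([[
    "Featured Snippet", "Some title", long_text, "extra",
    "fourth-from-last", "third-from-last", "Feedback", "About snippets",
]])

# Scalar access never goes through the display formatter.
record = extract_record(frame)
print(len(record["content"]))

# If to_string() is still needed, lift the width limit only inside this block.
with pd.option_context("display.max_colwidth", None):
    rendered = frame.iloc[:, 2].to_string(index=False)
print(len(rendered.strip()))
```

The dictionaries returned for each file could be collected in a list and passed to pd.DataFrame(), which would also give the output named columns.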
Answered by santma on November 17, 2021