Data Science Asked on November 30, 2020
I have a pandas dataframe with a salary column which contains values like:
£36,000 – £40,000 per year plus excellent bene…,
£26,658 to £32,547 etc
I isolated this column and split it, with a view to recombining it into the data frame later via a column bind in pandas.
I now have an object with columns like the ones below. I think the columns produced by the split are unnamed because I didn't specify names (I called df['salary'] = df['salary'].astype(str).str.split()).
So my new object contains this type of information:
[£26,658, to, £32,547],
[Competitive, with, Excellent, Benefits]
What I want to do is:
"^£"
?(substr(x,2,nchar(x)))
I am very new to pandas and programming in general, but keen to learn. Your help would be appreciated.
This is more of a general regex question, rather than pandas specific.
I would first create a function that extracts the numbers you need from a string, and then use pandas.Series.apply to apply it to the pandas column containing the strings. Here is what I would do:
import re

def parseNumbers(salary_txt):
    # Find every run of digits and commas that follows a £ sign,
    # then strip the commas and convert each match to an int.
    return [int(item.replace(',', '')) for item in re.findall(r'£([\d,]+)', salary_txt)]

# testing if this works
testcases = ['£23,000 to £100,000', '£34,000', '£10000']
for testcase in testcases:
    print(testcase, parseNumbers(testcase))
Here, I just used re.findall, which finds all substrings that match the pattern £([\d,]+). This is anything that starts with £ and is followed by one or more digits or commas. The parentheses tell Python to extract only the part after the £ sign. The last thing I do is remove the commas and parse the remaining string into an integer. You could be more elegant about this, I guess, but it works.
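To see what the parentheses change, here is a small sketch that runs the pattern with and without the capture group on one of the strings from the question. The variable names are only for illustration.

```python
import re

text = '£26,658 to £32,547'

# Without a capture group, findall returns each whole match, £ included.
with_sign = re.findall(r'£[\d,]+', text)

# With a capture group, findall returns only the part inside the parentheses.
without_sign = re.findall(r'£([\d,]+)', text)

print(with_sign)     # ['£26,658', '£32,547']
print(without_sign)  # ['26,658', '32,547']
```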
df['salary_list'] = df['salary'].apply(parseNumbers)
df['minsalary'] = df['salary'].apply(parseNumbers).apply(min)
df['maxsalary'] = df['salary'].apply(parseNumbers).apply(max)
Checking if this all works:
import pandas
df = pandas.DataFrame(testcases,columns = ['salary'])
df['minsalary'] = df['salary'].apply(parseNumbers).apply(min)
df['maxsalary'] = df['salary'].apply(parseNumbers).apply(max)
df
                salary  minsalary  maxsalary
0  £23,000 to £100,000      23000     100000
1              £34,000      34000      34000
2               £10000      10000      10000
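One caveat: for rows with no £ amount, such as "Competitive with Excellent Benefits" in the question, parseNumbers returns an empty list, and calling min or max on an empty list raises a ValueError. A minimal sketch of one way around this is below. The helpers min_or_none and max_or_none are hypothetical names I made up for this example, and returning None for such rows is just one possible choice.

```python
import re
import pandas as pd

def parseNumbers(salary_txt):
    # Same parsing as above: digits and commas that follow a £ sign.
    return [int(item.replace(',', '')) for item in re.findall(r'£([\d,]+)', salary_txt)]

def min_or_none(values):
    # Hypothetical helper: return None instead of raising on an empty list.
    return min(values) if values else None

def max_or_none(values):
    # Hypothetical helper: return None instead of raising on an empty list.
    return max(values) if values else None

df = pd.DataFrame({'salary': ['£26,658 to £32,547',
                              'Competitive with Excellent Benefits']})

# Parse once and reuse the result for both columns.
parsed = df['salary'].apply(parseNumbers)
df['minsalary'] = parsed.apply(min_or_none)
df['maxsalary'] = parsed.apply(max_or_none)
print(df)
```

Rows without a salary figure then end up with a missing value in both columns, which you can filter out or fill in later.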
The advantages of moving the parsing logic to a separate function are:
Answered by Ferenc Huszár on November 30, 2020
You can check the data types of your columns with df.dtypes, and if 'salary' isn't a string, you can convert it using df['salary'] = df['salary'].astype(str). This is what you were already doing before splitting. From there, Ferenc's method should work!
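As a rough illustration, assuming a made-up column that mixes text with a bare number, the check and conversion could look like this:

```python
import pandas as pd

# Hypothetical data: one text value and one plain integer in the same column.
df = pd.DataFrame({'salary': ['£26,658 to £32,547', 34000]})

# Column-level dtypes; a mixed column like this is reported as object.
print(df.dtypes)

# The Python type of each individual value.
print(df['salary'].map(type))

# Convert every value to a string so regex functions can be applied to it.
df['salary'] = df['salary'].astype(str)
all_strings = df['salary'].map(lambda v: isinstance(v, str)).all()
print(all_strings)
```

Note that a bare number such as 34000 becomes the string '34000', which has no £ sign, so the £-based pattern above would not pick it up.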
Answered by harshil on November 30, 2020