TransWikia.com

Split columns by finding "£" and then converting to minvalue, maxvalue

Data Science Asked on November 30, 2020

I have a pandas dataframe with a salary column which contains values like:

£36,000 – £40,000 per year plus excellent bene…,
£26,658 to £32,547 etc

I isolated this column and split it, with a view to recombining it into the data frame later via a column bind in pandas.

I now have an object with columns like the ones below. I think the columns produced by splitting the original data frame column have blank names because I didn't specify any (I called df['salary'] = df['salary'].astype(str).str.split()).

So my new object contains this type of information:

[£26,658, to, £32,547],
[Competitive, with, Excellent, Benefits]

What I want to do is:

  1. Create three columns called minvalue, maxvalue and realvalue.
  2. List the items starting with £ (something to do with "^£"?).
  3. Take everything up to the end of each item found, ignoring the £, to get the number out (something to do with substr(x, 2, nchar(x))?).
  4. If two such items are found, call the first number "minvalue" and the second number "maxvalue" and put each in the matching column. If there is only one value in the row, put it in the realvalue column.

I am very new to pandas and to programming in general, but keen to learn. Your help would be appreciated.

2 Answers

This is more of a general regex question, rather than pandas specific.

I would first create a function that extracts the numbers you need from strings, and then use the pandas.DataFrame.apply function to apply it on the pandas column containing the strings. Here is what I would do:

import re

def parseNumbers(salary_txt):
    # find every £ amount, drop the thousands separators and convert to int
    return [int(item.replace(',', '')) for item in re.findall(r'£([\d,]+)', salary_txt)]

# testing if this works
testcases = ['£23,000 to £100,000', '£34,000', '£10000']
for testcase in testcases:
    print(testcase, parseNumbers(testcase))

Here, I just used re.findall, which finds all substrings matching the pattern £([\d,]+): a £ followed by an arbitrary sequence of digits and commas. The parentheses tell Python to extract only the bit after the £ sign. Finally, I remove the commas and parse the remaining string into an integer. You could be more elegant about this, I guess, but it works.
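To make the role of the capture group concrete, here is a small sketch comparing the pattern with and without parentheses. The sample string is made up for illustration.

```python
import re

sample = "£26,658 to £32,547 per year"  # illustrative string, not real data

# Without a capture group, findall returns each whole match, £ sign included.
with_symbol = re.findall(r"£[\d,]+", sample)

# With a capture group, findall returns only the text inside the parentheses.
digits_only = re.findall(r"£([\d,]+)", sample)

# Removing the commas leaves plain digits that int() can parse.
amounts = [int(item.replace(",", "")) for item in digits_only]

print(with_symbol)  # ['£26,658', '£32,547']
print(digits_only)  # ['26,658', '32,547']
print(amounts)      # [26658, 32547]
```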

Using this function in pandas:

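df['salary_list'] = df['salary'].apply(parseNumbers)
# rows with no £ amount give an empty list, so guard min/max against it
df['minsalary'] = df['salary_list'].apply(lambda nums: min(nums) if nums else None)
df['maxsalary'] = df['salary_list'].apply(lambda nums: max(nums) if nums else None)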

Checking if this all works:

import pandas
df = pandas.DataFrame(testcases,columns = ['salary'])
df['minsalary'] = df['salary'].apply(parseNumbers).apply(min)
df['maxsalary'] = df['salary'].apply(parseNumbers).apply(max)
df

                salary  minsalary  maxsalary
0  £23,000 to £100,000      23000     100000
1              £34,000      34000      34000
2               £10000      10000      10000
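The question also asked for a separate realvalue column when a row holds only one amount, and the real data includes rows with no amount at all (such as "Competitive with Excellent Benefits"), where calling min or max on the resulting empty list raises a ValueError. One possible sketch follows. The column names come from the question, split_salary is a hypothetical helper, and using min/max rather than "first" and "second" for ranges is an assumption.

```python
import re
import pandas as pd

NAN = float("nan")  # marks "no value" so the new columns stay numeric

def parse_numbers(salary_txt):
    """Return every £ amount in the text as an int, in order of appearance."""
    return [int(item.replace(",", "")) for item in re.findall(r"£([\d,]+)", salary_txt)]

def split_salary(salary_txt):
    """Hypothetical helper: map one salary string to the three requested columns."""
    numbers = parse_numbers(salary_txt)
    if len(numbers) >= 2:
        # Two or more amounts: treat them as a range.
        return {"minvalue": min(numbers), "maxvalue": max(numbers), "realvalue": NAN}
    if len(numbers) == 1:
        # A single amount goes into realvalue only.
        return {"minvalue": NAN, "maxvalue": NAN, "realvalue": numbers[0]}
    # No £ amount found: leave all three columns empty.
    return {"minvalue": NAN, "maxvalue": NAN, "realvalue": NAN}

df = pd.DataFrame({"salary": ["£26,658 to £32,547",
                              "£34,000",
                              "Competitive with Excellent Benefits"]})

# Build the new columns row by row and attach them to the original frame.
extra = pd.DataFrame([split_salary(s) for s in df["salary"]], index=df.index)
result = df.join(extra)
print(result)
```

With this made-up frame, the first row gets minvalue 26658 and maxvalue 32547, the second gets realvalue 34000, and the third has NaN in all three new columns.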

The advantages of moving the parsing logic to a separate function are that:

  1. it may be reusable in other code
  2. it is easier to read for others, even if they aren't pandas experts
  3. it's easier to develop and test the parsing functionality in isolation
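As an illustration of the third point, the function can be checked with a few plain assertions before pandas is involved at all. This is only a sketch using the pattern £([\d,]+); check_parse_numbers is a hypothetical name and the extra test strings are made up.

```python
import re

def parseNumbers(salary_txt):
    # find every £ amount, drop the thousands separators and convert to int
    return [int(item.replace(",", "")) for item in re.findall(r"£([\d,]+)", salary_txt)]

def check_parse_numbers():
    """Hypothetical helper: run a handful of checks and return True if all pass."""
    assert parseNumbers("£23,000 to £100,000") == [23000, 100000]
    assert parseNumbers("£34,000") == [34000]
    assert parseNumbers("£10000") == [10000]
    # Text with no £ amount should give an empty list, not an error.
    assert parseNumbers("Competitive with Excellent Benefits") == []
    return True

check_parse_numbers()
```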

Answered by Ferenc Huszár on November 30, 2020

You can check the data types of your columns by doing df.dtypes, and if 'salary' isn't a string, you can convert it using df['salary'] = df['salary'].astype(str). This is what you were already doing before splitting. From there, Ferenc's method should work!
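A short sketch of that check, using a made-up frame in which one salary is stored as a plain number. Note that an object dtype usually means strings but can hide mixed types, so the conversion is a cheap safeguard.

```python
import pandas as pd

# Made-up data: the second salary is an int, not a string.
df = pd.DataFrame({"salary": ["£26,658 to £32,547", 34000]})
print(df.dtypes)  # inspect the dtype of each column

# Convert every value to a string so regex-based parsing does not fail on numbers.
df["salary"] = df["salary"].astype(str)

# Confirm that each individual value is now a Python string.
all_strings = df["salary"].map(lambda v: isinstance(v, str)).all()
print(all_strings)
```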

Answered by harshil on November 30, 2020
