Data Science Asked by RustyNails on July 31, 2021
My objective: using pandas, check a column for a partial (not exact) text match and set a new column accordingly.
I created a data frame from a CSV file and check the values of one column, COLUMN_to_Check, against a text pattern, 'PEA'. Depending on whether the pattern matches, a new column in the data frame should be set to YES or NO.
I have the following data in the file DATA2.csv:
ASSIGNMENT,Open date,Resolved date,COLUMN_to_Check,NUMBER,Open Time,RESOLVED_GROUP,RESOLVED_TIME,SUBCATEGORY
GBL_IS_GRC_PROCESSCONTROL,3/1/2017 13:39,11/1/2017 13:09,APAC_LT-ERP-FICO-BOKADABISH_PRD,IM-17-001200,3/1/2017 13:39,GBL_GSO_MQG,11/1/2017 13:09,Security (breach or weakness)
RSP_SERVICEDESK,12/1/2017 0:08,12/1/2017 0:27,APAC_LT-ERP-SALES-PEA_PRD,IM-17-006462,12/1/2017 0:08,RSP_SERVICEDESK,12/1/2017 0:27,failure
RSP_SERVICEDESK,10/1/2017 5:27,12/1/2017 0:52,APAC_LT-ERP-SUPPLY-PEA_PRD,IM-17-004667,10/1/2017 5:27,RSP_PCS_INCIDENTS,12/1/2017 0:52,failure
RSP_SERVICEDESK,12/1/2017 2:35,12/1/2017 3:03,APAC_LT-ERP-SALES-PEA_PRD,IM-17-006483,12/1/2017 2:35,RSP_SERVICEDESK,12/1/2017 3:03,access
RSP_SAP_BI,10/1/2017 21:04,12/1/2017 6:01,APAC_LT-ERP-SALES-PEA_PRD,IM-17-005498,10/1/2017 21:04,RSP_SAP_SALES,12/1/2017 6:01,SAP Sales
And I am using this code:
import pandas as pd
df = pd.read_csv('DATA2.csv')
Search_for_These_values = ['PEA', 'DEF', 'XYZ']  # creating list
pattern = '|'.join(Search_for_These_values)  # joining list for comparison
IScritical = df['COLUMN_to_Check'].str.contains(pattern)
for CHECK in IScritical:
    if not CHECK:
        print(CHECK)
        df['NEWcolumn'] = 'NO'
    else:
        print(CHECK)
        df['NEWcolumn'] = 'YES'
df.to_csv('OUPUT.csv')
Printing the value of 'CHECK' shows the correct values, i.e. the first row is False:
C:\Users\ME\Documents\SandBox (master)
λ python numpytest_pub.py
False
True
True
True
True
But the output CSV file shows every value of 'NEWcolumn' as 'YES', whereas 'NEWcolumn' in row 0 should be 'NO', because 'COLUMN_to_Check' in that row does not match the pattern.
,ASSIGNMENT,Open date,Resolved date,COLUMN_to_Check,NUMBER,Open Time,RESOLVED_GROUP,RESOLVED_TIME,SUBCATEGORY,NEWcolumn
0,GBL_IS_GRC_PROCESSCONTROL,3/1/2017 13:39,11/1/2017 13:09,APAC_LT-ERP-FICO-BOKADABISH_PRD,IM-17-001200,3/1/2017 13:39,GBL_GSO_MQG,11/1/2017 13:09,Security (breach or weakness),YES
1,RSP_SERVICEDESK,12/1/2017 0:08,12/1/2017 0:27,APAC_LT-ERP-SALES-PEA_PRD,IM-17-006462,12/1/2017 0:08,RSP_SERVICEDESK,12/1/2017 0:27,failure,YES
2,RSP_SERVICEDESK,10/1/2017 5:27,12/1/2017 0:52,APAC_LT-ERP-SUPPLY-PEA_PRD,IM-17-004667,10/1/2017 5:27,RSP_PCS_INCIDENTS,12/1/2017 0:52,failure,YES
3,RSP_SERVICEDESK,12/1/2017 2:35,12/1/2017 3:03,APAC_LT-ERP-SALES-PEA_PRD,IM-17-006483,12/1/2017 2:35,RSP_SERVICEDESK,12/1/2017 3:03,access,YES
4,RSP_SAP_BI,10/1/2017 21:04,12/1/2017 6:01,APAC_LT-ERP-SALES-PEA_PRD,IM-17-005498,10/1/2017 21:04,RSP_SAP_SALES,12/1/2017 6:01,SAP Sales,YES
I can sense that something is missing in the CHECK part, but I am not able to figure out what. Can anyone help?
Let me know if the question needs rephrasing for better understanding or future community use.
The statement df['NEWcolumn']='NO' sets the whole column to the value 'NO', not just the current row. Each pass of the loop overwrites the entire column, so you see the result for the last row in your table, distributed over the whole column.
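To see the effect in isolation, here is a minimal sketch on a made-up three-row frame (the column names `text` and `flag` and the sample values are hypothetical, not taken from DATA2.csv). Assigning a scalar to a column broadcasts it to every row, so inside a loop only the last assignment survives:

```python
import pandas as pd

# Hypothetical toy data, not the question's file
df = pd.DataFrame({"text": ["BOKADABISH", "SALES-PEA", "SUPPLY-PEA"]})
mask = df["text"].str.contains("PEA")

for check in mask:
    # Each pass overwrites the ENTIRE column, not just the current row
    df["flag"] = "YES" if check else "NO"

print(mask.tolist())        # per-row match results
print(df["flag"].tolist())  # every row holds the value from the last pass
```

The mask itself is correct per row; it is only the assignment that ignores which row the loop is on.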
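Here is a way to achieve what you want, using the boolean Series IScritical as a mask with .loc (chained indexing such as df['NEWcolumn'][IScritical] = ... is not reliable for assignment):
df.loc[IScritical, 'NEWcolumn'] = 'YES'
df.loc[~IScritical, 'NEWcolumn'] = 'NO'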
See https://pandas.pydata.org/pandas-docs/stable/indexing.html#the-where-method-and-masking
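A one-step variant of the same masking idea can be sketched with `numpy.where`. This is a swapped-in technique, not part of the answer above, and it assumes NumPy is available. The rows are copied inline from the question's COLUMN_to_Check values so that the sketch runs without DATA2.csv:

```python
import numpy as np
import pandas as pd

# Three COLUMN_to_Check values from the question, inlined for a self-contained run
df = pd.DataFrame({
    "COLUMN_to_Check": [
        "APAC_LT-ERP-FICO-BOKADABISH_PRD",
        "APAC_LT-ERP-SALES-PEA_PRD",
        "APAC_LT-ERP-SUPPLY-PEA_PRD",
    ]
})
pattern = "|".join(["PEA", "DEF", "XYZ"])
IScritical = df["COLUMN_to_Check"].str.contains(pattern)

# Choose 'YES' where the mask is True and 'NO' elsewhere, element by element
df["NEWcolumn"] = np.where(IScritical, "YES", "NO")
print(df)
```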
Answered by Matthias Berth on July 31, 2021
You can directly use the IScritical Series you created:
import pandas as pd
df=pd.read_csv('DATA2.csv')
Search_for_These_values = ['PEA', 'DEF', 'XYZ'] #creating list
pattern = '|'.join(Search_for_These_values) # joining list for comparison
IScritical=df['COLUMN_to_Check'].str.contains(pattern)
df['NEWcolumn'] = IScritical.replace((True,False), ('YES','NO'))
Answered by michaelg on July 31, 2021
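You simply need to compute the boolean result and map it to the labels you want (pattern is the one defined in the question):
df['NEWcolumn'] = df['COLUMN_to_Check'].str.contains(pattern)
df['NEWcolumn'] = df['NEWcolumn'].map({True: 'YES', False: 'NO'})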
Answered by Suresh Kasipandy on July 31, 2021
You could first add the column and default the value to 'NO' and then update the dataframe with .loc:
df['NEWcolumn']='NO'
df.loc[df['COLUMN_to_Check'].str.contains(pattern), 'NEWcolumn'] = 'YES'
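One caveat that may be worth a sketch: `str.contains` treats the pattern as a regular expression by default, and it returns a missing value for missing cells, which cannot be used as a boolean mask. The variant below assumes the search terms should be matched literally and that missing cells should count as 'NO'. The sample values and the `A+B` term are made up for illustration:

```python
import re
import pandas as pd

# Hypothetical sample values, including one missing cell
df = pd.DataFrame({
    "COLUMN_to_Check": ["ERP-SALES-PEA_PRD", "ERP-A+B_PRD", None, "ERP-FICO_PRD"]
})
terms = ["PEA", "A+B"]  # 'A+B' contains a regex metacharacter

# Escape each term so that '+' and similar characters are matched literally
pattern = "|".join(re.escape(t) for t in terms)

df["NEWcolumn"] = "NO"
# na=False makes missing cells evaluate to False instead of a missing value
matches = df["COLUMN_to_Check"].str.contains(pattern, na=False)
df.loc[matches, "NEWcolumn"] = "YES"
print(df)
```

For the question's terms ('PEA', 'DEF', 'XYZ') the escaping changes nothing, since they contain no special characters.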
Answered by Bas on July 31, 2021