Stack Overflow Asked by giantg2 on December 10, 2020
I am trying to combine multiple rows that share the same VAERS_ID into a single row, with the columns of each vaccine record suffixed _1, _2, and so on. My current code is below. Is there a better way to do this? The code is extremely slow, even when I process each file concurrently using multiprocessing, and I’m not sure what I can do to speed it up. I think it takes a few hours to run over the 30 years of VAERS data.
Sample Input:
VAERS_ID,VAX_TYPE,VAX_MANU,VAX_LOT,VAX_DOSE_SERIES,VAX_ROUTE,VAX_SITE,VAX_NAME
794159,DTAPIPV,GLAXOSMITHKLINE BIOLOGICALS,G9P35,1,IM,LL,DTAP + IPV (KINRIX)
794159,MMRV,MERCK and CO. INC.,R015744,1,SC,LL,MEASLES + MUMPS + RUBELLA + VARICELLA (PROQUAD)
Sample Output:
VAERS_ID,VAX_TYPE_1,VAX_MANU_1,VAX_LOT_1,VAX_DOSE_SERIES_1,VAX_ROUTE_1,VAX_SITE_1,VAX_NAME_1,VAX_TYPE_2,VAX_MANU_2,VAX_LOT_2,VAX_DOSE_SERIES_2,VAX_ROUTE_2,VAX_SITE_2,VAX_NAME_2
794159,DTAPIPV,GLAXOSMITHKLINE BIOLOGICALS,G9P35,1,IM,LL,DTAP + IPV (KINRIX),MMRV,MERCK and CO. INC.,R015744,1,SC,LL,MEASLES + MUMPS + RUBELLA + VARICELLA (PROQUAD)
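The reshape from the sample input to the sample output can be expressed without a Python-level loop. The following is a minimal sketch, assuming pandas is available, the frame is non-empty, and every field can be read as a string. The names `widen` and `SAMPLE` are hypothetical and used only for illustration, and the sketch has not been timed against the full VAERS files.

```python
import io

import pandas as pd

# The sample input from the question, inlined so the sketch is self-contained.
SAMPLE = """VAERS_ID,VAX_TYPE,VAX_MANU,VAX_LOT,VAX_DOSE_SERIES,VAX_ROUTE,VAX_SITE,VAX_NAME
794159,DTAPIPV,GLAXOSMITHKLINE BIOLOGICALS,G9P35,1,IM,LL,DTAP + IPV (KINRIX)
794159,MMRV,MERCK and CO. INC.,R015744,1,SC,LL,MEASLES + MUMPS + RUBELLA + VARICELLA (PROQUAD)
"""


def widen(df):
    """Return one row per VAERS_ID with columns suffixed _1, _2, ... (hypothetical helper)."""
    fields = [c for c in df.columns if c != 'VAERS_ID']
    # Number the rows within each VAERS_ID: 1 for the first vaccine, 2 for the second, ...
    seq = df.groupby('VAERS_ID').cumcount() + 1
    # Put (VAERS_ID, seq) in the index, then pivot seq out into the columns.
    wide = df.set_index(['VAERS_ID', seq]).unstack()
    # The columns are now (field, seq) pairs; flatten them to names like VAX_TYPE_2.
    wide.columns = [f'{field}_{i}' for field, i in wide.columns]
    # Reorder so all _1 columns come first, then all _2 columns, and so on.
    n = int(seq.max())  # assumes df has at least one row
    return wide[[f'{field}_{i}' for i in range(1, n + 1) for field in fields]]


wide = widen(pd.read_csv(io.StringIO(SAMPLE), dtype=str))
print(wide.to_csv())
```

Reading with `dtype=str` is a choice made for this sketch: IDs with fewer vaccines are padded with missing values, and numeric columns would otherwise be converted to floats by that padding.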
Current Code:
import pandas as pd

def combineVaxRecords(file):
    print('processing ' + file)
    headers = ['VAX_TYPE_1', 'VAX_MANU_1', 'VAX_LOT_1', 'VAX_DOSE_SERIES_1', 'VAX_ROUTE_1', 'VAX_SITE_1', 'VAX_NAME_1',
               'VAX_TYPE_2', 'VAX_MANU_2', 'VAX_LOT_2', 'VAX_DOSE_SERIES_2', 'VAX_ROUTE_2', 'VAX_SITE_2', 'VAX_NAME_2',
               'VAX_TYPE_3', 'VAX_MANU_3', 'VAX_LOT_3', 'VAX_DOSE_SERIES_3', 'VAX_ROUTE_3', 'VAX_SITE_3', 'VAX_NAME_3',
               'VAX_TYPE_4', 'VAX_MANU_4', 'VAX_LOT_4', 'VAX_DOSE_SERIES_4', 'VAX_ROUTE_4', 'VAX_SITE_4', 'VAX_NAME_4',
               'VAX_TYPE_5', 'VAX_MANU_5', 'VAX_LOT_5', 'VAX_DOSE_SERIES_5', 'VAX_ROUTE_5', 'VAX_SITE_5', 'VAX_NAME_5',
               'VAX_TYPE_6', 'VAX_MANU_6', 'VAX_LOT_6', 'VAX_DOSE_SERIES_6', 'VAX_ROUTE_6', 'VAX_SITE_6', 'VAX_NAME_6']
    # drop records with errors (on_bad_lines='skip' replaces error_bad_lines=False, removed in pandas 2.0)
    df = pd.read_csv(file, engine='python', on_bad_lines='skip')
    # get a unique list of the IDs, in order of first appearance
    idList = list(dict.fromkeys(df['VAERS_ID']))
    # combined rows are collected here (DataFrame.append was removed in pandas 2.0)
    outRows = []
    # for each ID, write the row if it's the only one found for that ID. Otherwise combine the rows
    for record in idList:
        inRows = df.loc[df['VAERS_ID'] == record]
        count = 1
        for index, row in inRows.iterrows():
            if count == 1:
                # first row for this ID: keep it, renaming its fields to VAX_TYPE_1, VAX_MANU_1, ...
                outRow = row.rename(lambda name: name if name == 'VAERS_ID' else name + '_1')
            else:
                if count > 6:
                    print('error - more than 6 vaccines for this id ' + str(record))
                # map the current record to the combined record
                strCount = str(count)
                vaxType = 'VAX_TYPE_' + strCount
                vaxManu = 'VAX_MANU_' + strCount
                vaxLot = 'VAX_LOT_' + strCount
                vaxSeries = 'VAX_DOSE_SERIES_' + strCount
                vaxRoute = 'VAX_ROUTE_' + strCount
                vaxSite = 'VAX_SITE_' + strCount
                vaxName = 'VAX_NAME_' + strCount
                # combine the data for the record to be written to the new file
                outRow[vaxType] = row['VAX_TYPE']
                outRow[vaxManu] = row['VAX_MANU']
                outRow[vaxLot] = row['VAX_LOT']
                outRow[vaxSeries] = row['VAX_DOSE_SERIES']
                outRow[vaxRoute] = row['VAX_ROUTE']
                outRow[vaxSite] = row['VAX_SITE']
                outRow[vaxName] = row['VAX_NAME']
            count += 1
        # store the combined row for this ID
        outRows.append(outRow)
    # build the new dataframe; columns beyond the sixth vaccine are not kept
    dfOut = pd.DataFrame(outRows).reindex(columns=['VAERS_ID'] + headers)
    dfOut.set_index("VAERS_ID", inplace=True)
    # note: this overwrites the input file
    dfOut.to_csv(file)
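The loop above filters the whole frame once per ID, so the work grows with the number of IDs times the number of rows. One alternative is to group the rows in a single pass. The sketch below uses only the standard library and assumes the ID is the first column and each file fits in memory. The names `combine_rows`, `MAX_VAX`, and `SAMPLE` are hypothetical, and no timings are claimed.

```python
import csv
import io
from collections import defaultdict

MAX_VAX = 6  # the limit the question's loop checks for

# The sample input from the question, inlined so the sketch is self-contained.
SAMPLE = """VAERS_ID,VAX_TYPE,VAX_MANU,VAX_LOT,VAX_DOSE_SERIES,VAX_ROUTE,VAX_SITE,VAX_NAME
794159,DTAPIPV,GLAXOSMITHKLINE BIOLOGICALS,G9P35,1,IM,LL,DTAP + IPV (KINRIX)
794159,MMRV,MERCK and CO. INC.,R015744,1,SC,LL,MEASLES + MUMPS + RUBELLA + VARICELLA (PROQUAD)
"""


def combine_rows(reader):
    """Group rows from a csv.reader by their first column in one pass (hypothetical helper)."""
    header = next(reader)
    id_col, fields = header[0], header[1:]
    # ID -> list of records; dicts preserve the order in which IDs first appear.
    groups = defaultdict(list)
    for row in reader:
        if len(row) != len(header):
            continue  # skip malformed lines, mirroring the "drop records with errors" intent
        groups[row[0]].append(row[1:])
    # The widest group decides how many numbered column sets the output needs.
    width = max((len(records) for records in groups.values()), default=0)
    out_header = [id_col] + [f'{f}_{i}' for i in range(1, width + 1) for f in fields]
    out_rows = []
    for vaers_id, records in groups.items():
        if len(records) > MAX_VAX:
            print('error - more than 6 vaccines for this id ' + vaers_id)
        flat = [value for record in records for value in record]
        # Pad shorter groups so every output row has the same number of columns.
        flat += [''] * (len(fields) * (width - len(records)))
        out_rows.append([vaers_id] + flat)
    return out_header, out_rows


header, rows = combine_rows(csv.reader(io.StringIO(SAMPLE)))
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(header)
writer.writerows(rows)
print(out.getvalue())
```

For a real file, the `io.StringIO` objects would be replaced by files opened with `newline=''`, as the `csv` module documentation recommends.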