TransWikia.com

Excel error: caused by pandas writing or by big data? Advice needed

Data Science Asked by Monta on March 19, 2021

I am reading multiple XML files, extracting some data, and then building a pandas DataFrame from it. These are the main steps:

  1. Open an XML file
  2. Extract some elements
  3. Create a pandas DataFrame from the extracted elements
  4. Append the results to the Excel file named "output.xlsx" (using the Python code below)

These steps are repeated for every XML file I have (15 GB of initial data, which usually contains about 100 MB of valuable text data).
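For illustration, steps 1 to 3 could be sketched roughly as below. The element name `record`, the field names `title` and `body`, and both helper functions are hypothetical placeholders, since the question does not show the real XML layout or the extraction code.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

import pandas as pd


def extract_records(xml_text, record_tag="record", fields=("title", "body")):
    """Return one dict per <record_tag> element, keeping only `fields`.

    The tag and field names are assumptions made for this sketch.
    """
    root = ET.fromstring(xml_text)
    records = []
    for elem in root.iter(record_tag):
        # findtext returns None when the child element is missing
        records.append({name: elem.findtext(name) for name in fields})
    return records


def frames_from_folder(folder, pattern="*.xml"):
    """Yield one DataFrame per XML file, so only one file is held in memory."""
    for path in sorted(Path(folder).glob(pattern)):
        text = path.read_text(encoding="utf-8")
        yield pd.DataFrame(extract_records(text))
```

Each DataFrame yielded by `frames_from_folder` would then play the role of `output` in the append step.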

This is my Python code for appending DataFrames to the output Excel file:

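import pandas as pd
from openpyxl import load_workbook

# Load the existing workbook so that earlier rows are kept
book = load_workbook('output.xlsx')
writer = pd.ExcelWriter('output.xlsx', engine='openpyxl')
writer.book = book
writer.sheets = {ws.title: ws for ws in book.worksheets}
# Continue writing right after the last used row of Sheet1
startrow = writer.sheets['Sheet1'].max_row
# `output` is the DataFrame built from the current XML file
output.to_excel(writer, startrow=startrow, index=False, header=False)
writer.save()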

When I open "output.xlsx" in Excel, I get a prompt saying "We found a problem with some content in "output.xlsx". Do you want us to try to recover as much as we can?" with Yes and No options.

This is the log file that Excel generates:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<recoveryLog xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"> 
<logFileName>error001280_01.xml</logFileName><summary>Errors were detected in 
 file 'D:JUPYWORKDIR2009Resultsoutput.xlsx'</summary><repairedRecords> 
<repairedRecord>Repaired Records: String properties from /xl/worksheets/sheet1.xml part
</repairedRecord></repairedRecords></recoveryLog>
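The log only says that string content in sheet1.xml was repaired. It does not say why. One thing that may be worth ruling out is text cells longer than Excel's limit of 32,767 characters per cell. That this is the cause here is an assumption, not something the log confirms. A minimal sketch of such a check follows, with `find_oversized_cells` as a hypothetical helper name:

```python
import pandas as pd

EXCEL_MAX_CELL_CHARS = 32_767  # Excel's limit for the text held in one cell


def find_oversized_cells(df, limit=EXCEL_MAX_CELL_CHARS):
    """Return (row_label, column, length) for each string cell longer than `limit`."""
    hits = []
    for column in df.columns:
        for row_label, value in df[column].items():
            # Only string cells matter here; numbers and None are skipped
            if isinstance(value, str) and len(value) > limit:
                hits.append((row_label, column, len(value)))
    return hits
```

Running this on each `output` DataFrame before writing it would show whether any extracted text exceeds the limit. An empty result would point to some other cause.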

I am worried that saving my results in Excel format is corrupting my data. I will read "output.xlsx" with pandas later for data analysis; does this problem affect that analysis? I would like to know why this problem occurs and whether I should save my data as CSV instead. Any suggestions?
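For reference, the CSV alternative could look roughly like the sketch below. The helper name `append_to_csv` and the file name in the usage note are placeholders. It relies on `DataFrame.to_csv` with `mode="a"` and writes the header only when the file is new or empty.

```python
from pathlib import Path

import pandas as pd


def append_to_csv(df, csv_path):
    """Append `df` to `csv_path`, writing the header only for a new or empty file."""
    csv_path = Path(csv_path)
    write_header = not csv_path.exists() or csv_path.stat().st_size == 0
    df.to_csv(csv_path, mode="a", header=write_header, index=False, encoding="utf-8")
```

Calling `append_to_csv(output, "output.csv")` once per XML file would replace the workbook-loading step. Unlike the openpyxl approach, it does not need to reopen the whole existing file on every append.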

P.S. When I check the last row of "output.xlsx" with Python code, the row count matches the number of rows I get when I import the Excel file into a pandas DataFrame. The file recovered by Microsoft Excel also has the same number of rows, so I think it is a generic Microsoft Excel error caused by the large amount of data, but I am not sure.
