Data Science Asked by Monta on March 19, 2021
I am reading multiple XML files, extracting some data, and then building a pandas DataFrame from it. These are the main steps that I follow:
These steps are repeated for every XML file that I have (15 GB of initial data, which usually contains 100 MB of valuable text data).
This is my Python code for appending the DataFrames to the output Excel file:
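import pandas as pd
from openpyxl import load_workbook

# Open the existing workbook so new rows are appended instead of overwriting it
book = load_workbook('output.xlsx')
writer = pd.ExcelWriter('output.xlsx', engine='openpyxl')
writer.book = book
writer.sheets = {ws.title: ws for ws in book.worksheets}
# Index of the first empty row after the data already in Sheet1
startrow = writer.sheets['Sheet1'].max_row
# `output` is the DataFrame built from the current XML file
output.to_excel(writer, startrow=startrow, index=False, header=False)
writer.save()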
When I open "output.xlsx" in Excel, I get a prompt saying "We found a problem with some content in "output.xlsx". Do you want us to try to recover as much as we can?" with Yes and No options.
This is the log file that Excel generates:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<recoveryLog xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
<logFileName>error001280_01.xml</logFileName><summary>Errors were detected in
file 'D:JUPYWORKDIR2009Resultsoutput.xlsx'</summary><repairedRecords>
<repairedRecord>Repaired Records: String properties from /xl/worksheets/sheet1.xml part
</repairedRecord></repairedRecords></recoveryLog>
I am worried that saving my results in Excel format is corrupting my data. I will read "output.xlsx" with pandas later to do some data analysis, so does this problem affect my future analysis? I would like to know why this problem occurs, and whether I should save my data as CSV instead. Any suggestions?
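For reference, the CSV alternative mentioned above could be sketched as follows. This is only an illustration using the standard library `csv` module; the helper name `append_rows_to_csv`, the file path, and the column names are hypothetical and not part of my current code.

```python
import csv
import os


def append_rows_to_csv(path, header, rows):
    """Append rows to a CSV file, writing the header only if the file is new or empty."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    # newline="" lets the csv module handle line endings itself;
    # utf-8 keeps non-ASCII text from the XML files intact.
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(header)
        writer.writerows(rows)


# Hypothetical usage: one call per processed XML file.
# append_rows_to_csv("output.csv", ["id", "text"], [[1, "first"], [2, "second"]])
```

With pandas, the equivalent would presumably be `output.to_csv('output.csv', mode='a', header=False, index=False)`, which appends without reloading the whole file on every iteration.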
P.S. When I check the last row of "output.xlsx" with Python, the row count matches what I get when I import the Excel file into a pandas DataFrame. The file recovered by Microsoft Excel also has the same number of rows, so I think it is a generic Microsoft Excel error caused by the large amount of data, but I am not sure.
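Since the row counts match and the recovery log points at string properties, one further check would be the string cells themselves. The sketch below is a guess at what might be worth inspecting, not a confirmed diagnosis: it flags strings longer than Excel's limit of 32,767 characters per cell and strings containing control characters that XML 1.0 does not allow. The function name `find_suspect_cells` is hypothetical.

```python
import re

# Control characters that XML 1.0 text may not contain (tab, LF and CR are allowed).
ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")
# Maximum number of characters Excel stores in a single cell.
EXCEL_CELL_LIMIT = 32767


def find_suspect_cells(rows):
    """Return (row_index, col_index, reason) for string cells Excel may reject or alter."""
    suspects = []
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            if not isinstance(value, str):
                continue  # only text cells are relevant here
            if len(value) > EXCEL_CELL_LIMIT:
                suspects.append((r, c, "too long"))
            if ILLEGAL_XML_CHARS.search(value):
                suspects.append((r, c, "control character"))
    return suspects
```

It could be run on the DataFrame from my code with `find_suspect_cells(output.values.tolist())`; an empty result would suggest the cause lies elsewhere.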