Data Science Asked by shalu on December 4, 2020
pandas (imported as pd) is one way of reading Excel files, but it is not available on my cluster. I want to read Excel without the pandas module. Code 1 and Code 2 are the two pieces I want to implement in PySpark.
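import pandas as pd

pdf = pd.read_excel("Name.xlsx")  # the file name must be a quoted string
sparkDF = sqlContext.createDataFrame(pdf)  # convert the pandas DataFrame to a Spark DataFrame
df = sparkDF.rdd.map(list)  # RDD in which each row is a plain list
type(df)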
I want to implement this without the pandas module.
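def get_strings_list(df, colname):  # name and signature assumed; the original snippet omitted the def line
    stringsList = []
    columnList = list(df[colname])
    for i in range(len(columnList)):
        # pandas stores empty cells as float NaN, so floats are presumably treated as missing here
        if type(columnList[i]) != float:
            text = columnList[i]
            stringsList.append(text.lower())
        else:
            stringsList.append(u'')
    return stringsList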
I want to implement this in pyspark.
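As a rough sketch of how Code 2 might carry over, the per-value logic can be pulled out into a plain function and then mapped over a Spark column. The function name to_lower_or_empty is hypothetical, and the sketch assumes Python 3 and that Spark represents empty cells as None rather than as float NaN. Unlike the original loop, it also maps any other non-string value to an empty string instead of calling .lower() on it.

```python
def to_lower_or_empty(value):
    """Lowercase string values; map anything else (None, NaN, numbers) to ''."""
    # pandas marks empty cells as float NaN, while Spark typically uses None
    # (assumption), so every non-string value is treated as missing here.
    if isinstance(value, str):
        return value.lower()
    return u""


# Local check with a plain list standing in for the column values.
sample = ["Alpha", None, float("nan"), "BETA"]
strings_list = [to_lower_or_empty(v) for v in sample]
print(strings_list)

# With a Spark DataFrame `df` and a column name `colname` (both assumed to exist),
# the same function could be applied row by row, for example:
#   strings_list = (df.select(colname)
#                     .rdd.map(lambda row: to_lower_or_empty(row[0]))
#                     .collect())
```

Keeping the logic in an ordinary function means it can be checked locally without a cluster before it is handed to rdd.map.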
Is pandas itself available on the cluster?
If so, you may try pandas' built-in read_excel().
You may also try the HadoopOffice library. It contains a Spark DataSource that is also available as a Spark Package, so you can easily test it without any installation:
$SPARK_HOME/bin/pyspark --packages com.github.zuinnote:spark-hadoopoffice-ds_2.11:1.0.4
Some people also recommend the Spark Excel dependency.
Answered by Dominik on December 4, 2020
You need the crealytics jar (spark-excel). Download the jar and add it to your Spark job.
Try this, it should help!
def get_df_from_excel(sqlContext, file_name):
    """
    Create a DataFrame from an Excel file.
    :param sqlContext: sqlContext
    :param file_name: path of the Excel file
    :return: dataframe
    """
    # The parentheses let the chained calls span several lines.
    return (sqlContext.read.format("com.crealytics.spark.excel")
            .option("useHeader", "true")
            .option("treatEmptyValuesAsNulls", "true")
            .option("inferSchema", "true")
            .option("addColorColumns", "false")
            .option("maxRowsInMemory", 2000)
            .option("sheetName", "Import")
            .load(file_name))
Answered by Rahul on December 4, 2020