Is there any way to read an Xlsx file in PySpark? Also want to read the strings of a column for each columnName

Question

The pandas module (pd) is one way of reading Excel, but it is not available on my cluster. I want to read Excel without the pd module. Code 1 and Code 2 are the two implementations I want in PySpark.

Code 1: Reading Excel

import pandas as pd

pdf = pd.read_excel("Name.xlsx")  # the file name must be a quoted string
sparkDF = sqlContext.createDataFrame(pdf)
df = sparkDF.rdd.map(list)
type(df)

I want to implement this without the pandas module.

Code 2: gets list of strings from column colname in dataframe df

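# The original snippet omitted the def line; the name get_strings is illustrative
def get_strings(df, colname):
    stringsList = []
    columnList = list(df[colname])
    for i in range(len(columnList)):
        # pandas stores empty cells as float NaN, so floats are treated as missing
        if type(columnList[i]) != float:
            text = columnList[i]
            stringsList.append(text.lower())
        else:
            stringsList.append(u'')
    return stringsList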

I want to implement this in pyspark.

Dominik · Answer

Is pandas itself available on the cluster?
If so, you can use its built-in read_excel().

You may also try the HadoopOffice library. It contains a Spark DataSource and is also available as a Spark Package, so you can easily test it without any installation:

$SPARK_HOME/bin/pyspark --packages com.github.zuinnote:spark-hadoopoffice-ds_2.11:1.0.4

Some people also recommend the Spark Excel dependency.

Rahul · Answer

You need the crealytics jar (spark-excel). Download the jar and make it available to Spark.

Try this, it would help!

def get_df_from_excel(sqlContext, file_name):
    """
    Create a dataframe from an Excel file.
    :param sqlContext: sqlContext
    :param file_name: path of the Excel file
    :return: dataframe
    """
    # The parentheses let the method chain span several lines
    return (sqlContext.read.format("com.crealytics.spark.excel")
            .option("useHeader", "true")
            .option("treatEmptyValuesAsNulls", "true")
            .option("inferSchema", "true")
            .option("addColorColumns", "False")
            .option("maxRowsInMemory", 2000)
            .option("sheetName", "Import")
            .load(file_name))
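
For Code 2, here is a minimal sketch of the same logic applied to the Spark dataframe returned above. It assumes Python 3 and that empty cells arrive as None (as they should with treatEmptyValuesAsNulls) rather than as pandas' float NaN. The names normalize_cell and column_strings are made up for illustration; they are not part of any library.

```python
def normalize_cell(value):
    """Lower-case string cells; map anything else (None, numbers) to ''."""
    if isinstance(value, str):
        return value.lower()
    return u''


def column_strings(df, colname):
    """Collect one column of a Spark dataframe as a list of lower-cased strings."""
    # select() keeps only the requested column, so each Row has a single field
    rows = df.select(colname).rdd
    # collect() pulls every value to the driver; fine for small sheets only
    return rows.map(lambda row: normalize_cell(row[0])).collect()
```

You would call it as column_strings(get_df_from_excel(sqlContext, "Name.xlsx"), colname). Since collect() brings the whole column into driver memory, this may not be suitable for very large files.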
