Data Science Asked on December 16, 2021
I would like to drop columns that contain all null values using dropna(). With Pandas you can do this by setting the keyword argument axis='columns' in dropna(). Here is an example in a GitHub post.
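For reference, this is the pandas call I mean; how='all' ensures only columns where every value is null get dropped (using the df_1 defined below):

df_1.dropna(axis='columns', how='all')  # drops 'furniture', keeps 'myid' and 'clothing'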
How do I do this in PySpark? dropna() is available as a transformation in PySpark, but axis is not an available keyword.
Note: I do not want to transpose my dataframe for this to work.
How would I drop the furniture column from this dataframe?
import numpy as np
import pandas as pd

data_2 = {'furniture': [np.nan, np.nan, np.nan], 'myid': ['1-12', '0-11', '2-12'], 'clothing': ["pants", "shoes", "socks"]}
df_1 = pd.DataFrame(data_2)
ddf_1 = spark.createDataFrame(df_1)  # `spark` is an existing SparkSession
ddf_1.show()
I know this is a bit late, but I also struggled with this. This is my attempt at removing all-null columns from a Spark DataFrame.
from pyspark.sql.functions import col, count, when

# Count the non-null values in every column in a single pass over the data.
non_null_counts = df.select([count(when(col(c).isNotNull(), c)).alias(c) for c in df.columns]).first().asDict()
# A column with zero non-null values contains only nulls, so drop it.
null_cols = [name for name, cnt in non_null_counts.items() if cnt == 0]
df = df.drop(*null_cols)
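Applied to the ddf_1 from the question, furniture is the only candidate to drop, since myid and clothing are fully populated. One caveat worth checking on your data: Spark distinguishes NULL from NaN, and a pandas float column of np.nan (like furniture above) may arrive as DoubleType NaN rather than NULL depending on how the conversion happens. If it does, isNotNull() will still count those values, and you would need to extend the condition with isnan(col(c)) from pyspark.sql.functions for the numeric columns only (isnan is not defined for string columns).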
Answered by pm2020 on December 16, 2021
You should be able to drop the column by name:
ddf_1 = ddf_1.drop('furniture')
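Note that drop() returns a new DataFrame rather than mutating the existing one, which is why the result is reassigned; it also silently ignores column names that do not exist.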
Answered by Nitish Sahay on December 16, 2021