Data Science Asked by Horbaje on January 14, 2021
I would like to drop columns that contain all null values using dropna(). With Pandas you can do this by setting the keyword argument axis='columns' in dropna(). Here is an example in a GitHub post.
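For reference, a minimal sketch of the pandas behaviour described above, reusing the sample data from further down in this question. The variable name `cleaned` is just illustrative. Note that `how="all"` is needed to drop only fully empty columns, since the default `how="any"` also drops columns with a single missing value.

```python
import numpy as np
import pandas as pd

# Same sample data as in the question; 'furniture' holds only missing values.
df = pd.DataFrame({
    "furniture": [np.nan, np.nan, np.nan],
    "myid": ["1-12", "0-11", "2-12"],
    "clothing": ["pants", "shoes", "socks"],
})

# how="all" drops a column only when every value in it is missing.
cleaned = df.dropna(axis="columns", how="all")

print(list(cleaned.columns))  # ['myid', 'clothing']
```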
How do I do this in PySpark? dropna() is available as a transformation in PySpark, but axis is not an available keyword.
Note: I do not want to transpose my dataframe for this to work.
How would I drop the furniture column from this dataframe?
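import numpy as np
import pandas as pd

# 'furniture' contains only missing values
data_2 = {'furniture': [np.nan, np.nan, np.nan], 'myid': ['1-12', '0-11', '2-12'], 'clothing': ["pants", "shoes", "socks"]}
df_1 = pd.DataFrame(data_2)
ddf_1 = spark.createDataFrame(df_1)  # assumes an active SparkSession named spark
ddf_1.show()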
You should be able to drop the column from the Spark DataFrame by name:
ddf_1 = ddf_1.drop('furniture')
Answered by Nitish Sahay on January 14, 2021
I know this is a bit late, but I struggled with this too. This is my attempt at removing all-null columns from a Spark DataFrame:
from pyspark.sql.functions import col, count
# count() ignores nulls, so a column whose count is 0 holds only nulls in every row
non_null_counts = df.select([count(col(c)).alias(c) for c in df.columns]).first().asDict()
names_of_null_cols = [name for name, n in non_null_counts.items() if n == 0]
df = df.drop(*names_of_null_cols)
Answered by pm2020 on January 14, 2021