Pyspark: Filter dataframe based on separate specific conditions

Question

How can I select only the entries that match a condition and then filter those entries again using a regex?

For instance, I have this dataframe (df):

+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   A|   f|   5|   g|  
|   D|  er|  2e|  sd| 
|   F|   g|23sd|   a| 
|   F| fgf|  45|   d| 
|   E|   r|   3|   e| 
|   A|  sd|  8f|  dw| 
|   F|  sd| 3h1|   d| 
+----+----+----+----+

I want to select the entries with the value 'F' in col1, then filter them again with a regex ([a-zA-Z0-9]+) to keep only the entries that contain both numbers and letters:

+----+----+----+----+         +----+----+----+----+
|col1|col2|col3|col4|         |col1|col2|col3|col4|
+----+----+----+----+         +----+----+----+----+ 
|   F|   g|23sd|   a|   -->   |   F|   g|23sd|   a|
|   F| fgf|  45|   d|         |   F|  sd| 3h1|   d|
|   F|  sd| 3h1|   d|         +----+----+----+----+
+----+----+----+----+

Brian Spiering · Answer

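You can use the filter method of Spark's DataFrame API:
df_filtered = df.filter(df.col1 == "F")
filter also supports regex conditions through Column.rlike. To require a match in every column, combine the per-column conditions with &:
from functools import reduce
pattern = r"[a-zA-Z0-9]+"
df_filtered_regex = df_filtered.filter(reduce(lambda a, b: a & b, [df_filtered[c].rlike(pattern) for c in df_filtered.columns]))
Call .show() or .collect() on the result to display or retrieve the rows.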
