
Pyspark: Filter dataframe based on separate specific conditions

Data Science Asked by randomizer0000 on December 23, 2020

How can I select only the rows that match a condition and then filter those rows again using a regex?

For instance, I have this dataframe (df):

+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   A|   f|   5|   g|  
|   D|  er|  2e|  sd| 
|   F|   g|23sd|   a| 
|   F| fgf|  45|   d| 
|   E|   r|   3|   e| 
|   A|  sd|  8f|  dw| 
|   F|  sd| 3h1|   d| 
+----+----+----+----+

I want to select the rows with ‘F’ in col1, then filter them again with a regex ([a-zA-Z0-9]+) to keep only the rows whose col3 value contains both numbers and letters.

+----+----+----+----+         +----+----+----+----+
|col1|col2|col3|col4|         |col1|col2|col3|col4|
+----+----+----+----+         +----+----+----+----+ 
|   F|   g|23sd|   a|   -->   |   F|   g|23sd|   a|
|   F| fgf|  45|   d|         |   F|  sd| 3h1|   d|
|   F|  sd| 3h1|   d|         +----+----+----+----+
+----+----+----+----+

One Answer

You can use the filter method on Spark's DataFrame API:

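df_filtered = df.filter("col1 = 'F'")

filter also accepts Column expressions, so a regex can be applied with rlike:

from functools import reduce
pattern = r"[a-zA-Z0-9]+"
# Combine one rlike condition per column with a logical AND
condition = reduce(lambda a, b: a & b, [df_filtered[c].rlike(pattern) for c in df_filtered.columns])
df_filtered_regex = df_filtered.filter(condition)

Both results are still DataFrames; call .collect() or .show() on them to get the rows.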
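
One caveat worth sketching: rlike succeeds when the pattern matches anywhere in the value, so [a-zA-Z0-9]+ on its own also accepts a purely numeric value such as 45. The example below is a minimal sketch, not part of the original answer. It assumes the regex is meant for col3 only and uses an assumed pattern with lookaheads (supported by both Java regex, which Spark uses, and Python's re). The helper name filter_f_mixed is hypothetical.

```python
import re

# Assumed pattern (not from the original answer): the lookaheads require at
# least one digit and at least one letter; the anchors force a full match.
PATTERN = r"^(?=.*[0-9])(?=.*[a-zA-Z])[a-zA-Z0-9]+$"


def filter_f_mixed(df):
    """Hypothetical helper: rows with col1 == 'F' whose col3 mixes letters and digits."""
    # Imported here so the pure-Python check below runs without Spark installed.
    from pyspark.sql import functions as F

    return df.filter((F.col("col1") == "F") & F.col("col3").rlike(PATTERN))


# The same logic applied to the question's rows with Python's re module.
rows = [
    ("A", "f", "5", "g"),
    ("D", "er", "2e", "sd"),
    ("F", "g", "23sd", "a"),
    ("F", "fgf", "45", "d"),
    ("E", "r", "3", "e"),
    ("A", "sd", "8f", "dw"),
    ("F", "sd", "3h1", "d"),
]
selected = [r for r in rows if r[0] == "F" and re.search(PATTERN, r[2])]
print(selected)
```

With a Spark session available, filter_f_mixed(df).show() should display the same two rows as the expected output in the question.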

Answered by Brian Spiering on December 23, 2020
