Stack Overflow Asked by user3061338 on November 22, 2021
I have these sample values
prm_2020 P02 United Kingdom London 2 for 2
prm_2020 P2 United Kingdom London 2 for 2
prm_2020 P10 United Kingdom London 2 for 2
prm_2020 P11 United Kingdom London 2 for 2
I need to extract codes such as P2, P02, P11, p06 and p05 from these strings. I am trying to use the regexp_extract function in Databricks but am struggling to find the correct expression. Once I have extracted a code such as P10 or p6, I need to put its number into a new column called ID.
select distinct
promo_name
,regexp_extract(promo_name, '(?<=p\d+\s+)P\d+') as regexp_id
from stock
where promo_name is not null
select distinct
promo_name
,regexp_extract(promo_name, 'P[0-9]+') as regexp_id
from stock
where promo_name is not null
Both queries generate errors.
The regexp_extract function takes three parameters:
def regexp_extract(e: org.apache.spark.sql.Column, exp: String, groupIdx: Int): org.apache.spark.sql.Column
You are missing the last parameter, the group index, in your regexp_extract call.
Check the code below.
scala> df.show(false)
+------------------------------------------+
|data                                      |
+------------------------------------------+
|prm_2020 P02 United Kingdom London 2 for 2|
|prm_2020 P2 United Kingdom London 2 for 2 |
|prm_2020 P10 United Kingdom London 2 for 2|
|prm_2020 P11 United Kingdom London 2 for 2|
+------------------------------------------+
import org.apache.spark.sql.functions.{col, regexp_extract}
// Group index 0 returns the entire match.
df
.withColumn("parsed_data",regexp_extract(col("data"),"(P[0-9]*)",0))
.show(false)
+------------------------------------------+-----------+
|data                                      |parsed_data|
+------------------------------------------+-----------+
|prm_2020 P02 United Kingdom London 2 for 2|P02        |
|prm_2020 P2 United Kingdom London 2 for 2 |P2         |
|prm_2020 P10 United Kingdom London 2 for 2|P10        |
|prm_2020 P11 United Kingdom London 2 for 2|P11        |
+------------------------------------------+-----------+
df.createTempView("tbl")
spark
.sql("select data,regexp_extract(data,'(P[0-9]*)',0) as parsed_data from tbl")
.show(false)
+------------------------------------------+-----------+
|data                                      |parsed_data|
+------------------------------------------+-----------+
|prm_2020 P02 United Kingdom London 2 for 2|P02        |
|prm_2020 P2 United Kingdom London 2 for 2 |P2         |
|prm_2020 P10 United Kingdom London 2 for 2|P10        |
|prm_2020 P11 United Kingdom London 2 for 2|P11        |
+------------------------------------------+-----------+
Answered by Srinivas on November 22, 2021
Just select group 0:
regexp_extract(promo_name, 'P[0-9]+',0)
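The group index also bears on the second half of the question, putting only the number into an ID column. The sketch below is an illustration, not Spark code: it uses Python's standard `re` module as a stand-in, on the assumption that a simple pattern like this behaves the same way in the Java regex engine that Spark uses. The helper name `extract` is hypothetical. It shows how group 0 gives the whole code while group 1 gives just the digits inside the parentheses.

```python
import re

samples = [
    "prm_2020 P02 United Kingdom London 2 for 2",
    "prm_2020 P2 United Kingdom London 2 for 2",
    "prm_2020 P10 United Kingdom London 2 for 2",
    "prm_2020 P11 United Kingdom London 2 for 2",
]

# Group 0 is the whole match ("P02"); group 1 is only the
# parenthesised digits ("02").
pattern = re.compile(r"P([0-9]+)")


def extract(text, group_idx):
    """Hypothetical helper: return the requested group of the first match,
    or an empty string when nothing matches (regexp_extract also returns
    an empty string in that case)."""
    match = pattern.search(text)
    return match.group(group_idx) if match else ""


codes = [extract(s, 0) for s in samples]  # whole code, e.g. "P02"
ids = [extract(s, 1) for s in samples]    # digits only, e.g. "02"

for sample, code, id_ in zip(samples, codes, ids):
    print(sample, "|", code, "|", id_)
```

Under the same assumption, the Spark SQL counterpart would be regexp_extract(promo_name, 'P([0-9]+)', 1) as ID. The result is a string, so a value like "02" keeps its leading zero unless it is cast to a number.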
Answered by Shubham Jain on November 22, 2021
One regex could be (?<=prm_\d+\s+)P\d+
Besides searching for strings of the form P followed by one or more digits, it also checks that such a string is preceded by a string of the form prm_ followed by one or more digits and then whitespace.
Keep case sensitivity in mind. The pattern above IS case sensitive (if your input comes as PRM, the match will be discarded). I am not familiar with apache-spark, but I assume it supports a flag such as /i, as other platforms do, to make the regex case insensitive.
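To make the case-sensitivity point concrete, here is a minimal sketch using Python's standard `re` module as a stand-in, not Spark itself. Java's regex engine, which Spark uses, has no trailing /i modifier but does accept the inline flag (?i) at the start of a pattern, and so does Python. The function name `promo_id` is hypothetical, and the lookbehind is left out to keep the example small.

```python
import re

# (?i) at the start of the pattern switches on case-insensitive matching,
# so both "P05" and "p06" are found. Group 1 holds only the digits.
pattern = re.compile(r"(?i)p([0-9]+)")


def promo_id(text):
    """Hypothetical helper: digits of the first P-code, or "" if none."""
    match = pattern.search(text)
    return match.group(1) if match else ""


print(promo_id("prm_2020 p06 United Kingdom London 2 for 2"))
print(promo_id("PRM_2020 P05 United Kingdom London 2 for 2"))
```

The leading "p" of "prm_2020" is not picked up because it is followed by a letter, not a digit.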
Answered by Veverke on November 22, 2021
The expression would be:
select regexp_extract(col, 'P[0-9]+', 0)
Answered by Gordon Linoff on November 22, 2021