TransWikia.com

How to efficiently map over DF and use combination of outputs?

Stack Overflow Asked on November 18, 2021

Given a DataFrame, say I have 3 classes, each with a method addCol that uses existing columns of the DataFrame to compute and append a new column (each based on a different calculation).

What is the best way to get a resulting DataFrame that contains the original columns plus the 3 added columns?

val df = Seq((1, 2), (2,5), (3, 7)).toDF("num1", "num2")

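import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Each addCol lives in its own class; the trait and object names are placeholders.
trait Action { def addCol(df: DataFrame): DataFrame }
object Method1 extends Action {
    def addCol(df: DataFrame): DataFrame = df.withColumn("method1", col("num1") / col("num2"))
}
object Method2 extends Action {
    def addCol(df: DataFrame): DataFrame = df.withColumn("method2", col("num1") * col("num2"))
}
object Method3 extends Action {
    def addCol(df: DataFrame): DataFrame = df.withColumn("method3", col("num1") + col("num2"))
}
val actions: Seq[Action] = Seq(Method1, Method2, Method3)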

One option is actions.foldLeft(df) { (acc, action) => action.addCol(acc) }. The end result is the DataFrame I want, with columns num1, num2, method1, method2, and method3. But from my understanding this will not make use of distributed evaluation, and each addCol will happen sequentially. Is there a more efficient way to do this?

One Answer

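An efficient way to do this is to use select.

select is faster than foldLeft when the data is very large. Each withColumn call adds another projection to the query plan, whereas a single select adds all the columns in one projection. Check this post.

You can build the required expressions and pass them to select, as in the code below.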

scala> df.show(false)
+----+----+
|num1|num2|
+----+----+
|1   |2   |
|2   |5   |
|3   |7   |
+----+----+
scala> val colExpr = Seq(
         $"num1",
         $"num2",
         ($"num1" / $"num2").as("method1"),
         ($"num1" * $"num2").as("method2"),
         ($"num1" + $"num2").as("method3")
       )

Final Output

scala> df.select(colExpr:_*).show(false)
+----+----+-------------------+-------+-------+
|num1|num2|method1            |method2|method3|
+----+----+-------------------+-------+-------+
|1   |2   |0.5                |2      |3      |
|2   |5   |0.4                |10     |7      |
|3   |7   |0.42857142857142855|21     |10     |
+----+----+-------------------+-------+-------+
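
To make the comparison concrete, here is a minimal sketch that builds the same three columns both ways and prints the plans. It assumes a local SparkSession (in spark-shell, `spark` already exists), and the `newCols` list is a hypothetical stand-in for the three addCol methods. Both variants are lazy: they only describe a query plan, and nothing is computed until an action such as show() or collect() runs.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Assumption: a local session for experimentation; spark-shell provides `spark` already.
val spark = SparkSession.builder().master("local[*]").appName("select-vs-foldLeft").getOrCreate()
import spark.implicits._

val df = Seq((1, 2), (2, 5), (3, 7)).toDF("num1", "num2")

// Hypothetical (name, expression) pairs standing in for the three addCol methods.
val newCols: Seq[(String, Column)] = Seq(
  "method1" -> col("num1") / col("num2"),
  "method2" -> col("num1") * col("num2"),
  "method3" -> col("num1") + col("num2")
)

// Variant A: chain withColumn through foldLeft (one projection per call in the logical plan).
val viaFold: DataFrame =
  newCols.foldLeft(df) { case (acc, (name, expression)) => acc.withColumn(name, expression) }

// Variant B: a single select with the original columns plus all new expressions.
val allCols: Seq[Column] =
  df.columns.toSeq.map(col) ++ newCols.map { case (name, expression) => expression.as(name) }
val viaSelect: DataFrame = df.select(allCols: _*)

// Print the parsed, analyzed, optimized, and physical plans for each variant.
viaFold.explain(true)
viaSelect.explain(true)
```

With only three columns the two variants should behave much the same; the difference is more likely to matter when the list of new columns is long.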

Update

Return a Column instead of a DataFrame. Using a higher-order function, all three of your functions can be replaced by the single function below.

scala> def add(
         num1: Column, // You could use varargs here instead if you need more columns.
         num2: Column,
         f: (Column, Column) => Column
       ): Column = f(num1, num2)

For example, with varargs the function comes first, and the required columns are passed at the end when invoking the method.

def add(f: (Column, Column) => Column, cols: Column*): Column = cols.reduce(f)
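
As a rough illustration of how the varargs version might be called, the sketch below combines three columns. The column num3 and the DataFrame df3 are hypothetical and exist only to show more than two inputs; a local SparkSession is assumed.

```scala
import org.apache.spark.sql.{Column, SparkSession}

// Assumption: a local session for experimentation; spark-shell provides `spark` already.
val spark = SparkSession.builder().master("local[*]").appName("varargs-add").getOrCreate()
import spark.implicits._

// Same definition as above: combine all given columns with f, left to right.
def add(f: (Column, Column) => Column, cols: Column*): Column = cols.reduce(f)

// Hypothetical input with a third column, used only for this illustration.
val df3 = Seq((1, 2, 3), (2, 5, 4)).toDF("num1", "num2", "num3")

val result = df3.select(
  $"num1", $"num2", $"num3",
  add(_ + _, $"num1", $"num2", $"num3").as("sum"),     // (num1 + num2) + num3
  add(_ * _, $"num1", $"num2", $"num3").as("product")  // (num1 * num2) * num3
)
result.show(false)
```

Note that reduce throws on an empty sequence, so this version needs at least one column.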

Invoking the add function (the three-argument version defined first).

scala> val colExpr = Seq(
         $"num1",
         $"num2",
         add($"num1", $"num2", (_ / _)).as("method1"),
         add($"num1", $"num2", (_ * _)).as("method2"),
         add($"num1", $"num2", (_ + _)).as("method3")
       )

Final Output

scala> df.select(colExpr:_*).show(false)
+----+----+-------------------+-------+-------+
|num1|num2|method1            |method2|method3|
+----+----+-------------------+-------+-------+
|1   |2   |0.5                |2      |3      |
|2   |5   |0.4                |10     |7      |
|3   |7   |0.42857142857142855|21     |10     |
+----+----+-------------------+-------+-------+
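
If the three calculations need to stay in separate classes, as in the question, one possible arrangement is for each class to expose a Column and for the caller to issue a single select. This is only a sketch: the trait ColumnAction and the objects RatioCol, ProductCol, and SumCol are hypothetical names, and a local SparkSession is assumed.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.col

// Assumption: a local session for experimentation; spark-shell provides `spark` already.
val spark = SparkSession.builder().master("local[*]").appName("column-actions").getOrCreate()
import spark.implicits._

// Hypothetical trait: each class describes its new column instead of modifying a DataFrame.
trait ColumnAction { def newCol: Column }

object RatioCol extends ColumnAction {
  def newCol: Column = (col("num1") / col("num2")).as("method1")
}
object ProductCol extends ColumnAction {
  def newCol: Column = (col("num1") * col("num2")).as("method2")
}
object SumCol extends ColumnAction {
  def newCol: Column = (col("num1") + col("num2")).as("method3")
}

val actions: Seq[ColumnAction] = Seq(RatioCol, ProductCol, SumCol)
val df = Seq((1, 2), (2, 5), (3, 7)).toDF("num1", "num2")

// Keep every original column ("*") and append one expression per action in a single select.
val selected: Seq[Column] = col("*") +: actions.map(_.newCol)
val result = df.select(selected: _*)
result.show(false)
```

The design choice here is that the classes no longer depend on a DataFrame at all, so adding a fourth calculation only means adding one more entry to `actions`.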

Answered by Srinivas on November 18, 2021
