Posts

Showing posts from July, 2019

Apache Spark UDF: Over Optimization Issue

What is an Apache Spark UDF?

A User-Defined Function (UDF) is a feature of Apache Spark SQL for defining new Column-based functions that extend the vocabulary of Spark SQL's DSL for transforming Datasets and DataFrames.

Defining a UDF

You define a new UDF by passing a function (written in Scala or Python, depending on the underlying language) as an input parameter to the UDF registration function. For example:

    import spark.implicits._
    val squared = (s: Long) => {s * s}
    val squared_udf = spark.udf.register("square", squared)

Once it is registered, it can be used as:

    val df = spark.range(10)
    df.withColumn("s", squared_udf($"id")).show()
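The snippets above assume an existing `spark` session, as in spark-shell. As a minimal standalone sketch, the same function can be wrapped with `functions.udf` for the DataFrame DSL and also called by its registered name from SQL text. The object name `SquareUdfExample`, the app name, the `local[*]` master, and the `numbers` view are assumptions made for illustration and are not part of the original post.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object SquareUdfExample {
  // Plain Scala function that the UDF wraps.
  val squared: Long => Long = (s: Long) => s * s

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (app name and master are assumptions).
    val spark = SparkSession.builder()
      .appName("square-udf-example")
      .master("local[*]")
      .getOrCreate()

    // Option 1: wrap the function with functions.udf and use it in the DataFrame DSL.
    val squaredUdf = udf(squared)
    spark.range(10).withColumn("s", squaredUdf(col("id"))).show()

    // Option 2: register the function by name so it can be called from SQL text.
    spark.udf.register("square", squared)
    spark.range(10).createOrReplaceTempView("numbers") // hypothetical view name
    spark.sql("SELECT id, square(id) AS s FROM numbers").show()

    spark.stop()
  }
}
```

Registering by name is only needed when the function is called from SQL strings; for DSL-only use, the value returned by `udf` or `spark.udf.register` is enough.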