What is the Hive SQL COALESCE function, what does it do, and why on earth is it useful?

What is the Hive SQL COALESCE function?

There is a function called COALESCE that exists in Hive SQL. It has the following syntax.

COALESCE(field1, field2, … , fieldn)

What does it do?

Of the fields above (field1, field2, … , fieldn), for each record returned it returns the value of the first field of the ones listed in the COALESCE function that is not NULL. If all of the fields are NULL, it returns NULL.

For example, let’s say one had 3 date fields, datefield1, datefield2, and datefield3 from the table tblDates.

tblDates
primary_key datefield1 datefield2 datefield3
1 NULL NULL 1993-06-04

The code:

SELECT COALESCE(datefield1, datefield2, datefield3) as first_date_found
FROM
tblDates
WHERE
primary_key = 1

will return ‘1993-06-04’

OK, what on earth do I use this for?

You use it to create proxy values for NULLs in a field.

It is very useful when you have dirty data and have to use another field as an approximation of the dirty data’s actual value. If you have a field that is full of NULLs, you can use another field to put values in for those NULLs that you think provide a good approximate value of what should be there.

I want an example!

OK.

Let’s say you are William the Conqueror and have just decided to put together the Domesday Book, where you need to get everyone’s birth date so that you can figure out who and how much you can tax in your new conquered land of England and Wales. Let’s say to tax them that you need to know that they are legal adults over the age of 18 but birth records are not very good in 1066. Let’s also say you magically have an apache server with hadoop on it.

What do you do?

Let’s say your tblDomesDayProxyBirthDays table has 4 fields

table fields data status of the field
thouGreatethsOfKeyeths an 11th century version of a Social Security number
thouDayOfBirth You have some birthdays but not all
thouBaptism You have more baptisms but not all
thouLordRegistrationDate You have all the days they first registered with the local lord of the manor

Let’s say you have these three records in tblDomesDayProxyBirthDays.

tblDomesDayProxyBirthDays
thouGreatethsOfKeyeths thouDayOfBirth thouBaptism thouLordRegistrationDate
1 1048-02-15 1048-03-01 1048-04-01
2 NULL 1049-08-29 1049-10-29
3 NULL NULL 1050-07-01

Well, you need their birthday to figure out if they are adults you can tax. However, you do not always have their “thouDayOfBirth” as shown above.

COALESCE allows you to use other data from other fields as a proxy.

If you use the code:

SELECT
thouGreatethsOfKeyeths
, COALESCE(thouDayOfBirth, thouBaptism, thouLordRegistrationDate) AS proxy_bday
FROM
tblDomesDayBirthDays

You will get:

thouGreatethsOfKeyeths proxy_bday
1 1048-02-15
2 1049-08-29
3 1050-07-01

For the first subject, you have their actual birthday. For the second subject, you have their first day of kindergarten. For the third subject, you have the day they registered with the lord of the manor. You now have a birthday for all of your subjects, even if some of them are not as accurate as you would like.

facebooktwittergoogle_plusredditpinterestlinkedinmail

Related books picked – and if possible read – by me. Sponsored by Amazon Associates.

4 thoughts on “What is the Hive SQL COALESCE function, what does it do, and why on earth is it useful?

Leave a Reply to Yash Aggarwal Cancel reply

Your email address will not be published. Required fields are marked *