What is the Hive SQL COALESCE function, what does it do, and why on earth is it useful?

What is the Hive SQL COALESCE function?

COALESCE is a built-in Hive SQL function. It has the following syntax:

COALESCE(field1, field2, … , fieldn)

What does it do?

For each record returned, COALESCE checks the listed fields from left to right (field1, field2, … , fieldn) and returns the value of the first one that is not NULL. If all of the fields are NULL, it returns NULL.

For example, let’s say you have three date fields, datefield1, datefield2, and datefield3, in the table tblDates.

tblDates
primary_key   datefield1   datefield2   datefield3
1             NULL         NULL         1993-06-04

The code:

SELECT COALESCE(datefield1, datefield2, datefield3) AS first_date_found
FROM tblDates
WHERE primary_key = 1

will return ‘1993-06-04’
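
If you want to check this behavior without a table, a quick sketch like the one below should work. It assumes a Hive version that allows SELECT without a FROM clause, and the column aliases are made up for illustration.

```sql
-- Literal values stand in for the three date fields of record 1.
SELECT
  COALESCE(CAST(NULL AS STRING), CAST(NULL AS STRING), '1993-06-04') AS first_date_found  -- '1993-06-04'
  , COALESCE(CAST(NULL AS STRING), CAST(NULL AS STRING)) AS all_null                      -- NULL
;
```

The second column shows the fallback case: when every argument is NULL, so is the result.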

OK, what on earth do I use this for?

You use it to create proxy values for NULLs in a field.

It is very useful when you have dirty data and need another field to stand in for a missing value. If a field is full of NULLs, you can fill those NULLs with values from another field that you think gives a good approximation of what should be there.

I want an example!

OK.

Let’s say you are William the Conqueror and have just decided to put together the Domesday Book, where you need to get everyone’s birth date so that you can figure out who you can tax, and how much, in your newly conquered land of England and Wales. Let’s say that, to tax them, you need to know that they are legal adults aged 18 or over, but birth records are not very good in 1066. Let’s also say you magically have an Apache server with Hadoop on it.

What do you do?

Let’s say your tblDomesDayProxyBirthDays table has four fields:

table field                 data status of the field
thouGreatethsOfKeyeths      An 11th century version of a Social Security number
thouDayOfBirth              You have some birthdays but not all
thouBaptism                 You have more baptisms but not all
thouLordRegistrationDate    You have all the days they first registered with the local lord of the manor

Let’s say you have these three records in tblDomesDayProxyBirthDays.

tblDomesDayProxyBirthDays
thouGreatethsOfKeyeths   thouDayOfBirth   thouBaptism   thouLordRegistrationDate
1                        1048-02-15       1048-03-01    1048-04-01
2                        NULL             1049-08-29    1049-10-29
3                        NULL             NULL          1050-07-01

Well, you need their birthday to figure out if they are adults you can tax. However, you do not always have their “thouDayOfBirth” as shown above.

COALESCE allows you to use other data from other fields as a proxy.

If you use the code:

SELECT
thouGreatethsOfKeyeths
, COALESCE(thouDayOfBirth, thouBaptism, thouLordRegistrationDate) AS proxy_bday
FROM
tblDomesDayProxyBirthDays

You will get:

thouGreatethsOfKeyeths   proxy_bday
1                        1048-02-15
2                        1049-08-29
3                        1050-07-01

For the first subject, you have their actual birthday. For the second subject, you have the day of their baptism. For the third subject, you have the day they registered with the lord of the manor. You now have a birthday for all of your subjects, even if some of them are not as accurate as you would like.
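
Once you have proxy_bday, the tax question becomes a filter. The sketch below assumes the date fields are stored as 'YYYY-MM-DD' strings (so they compare correctly as text) and uses a made-up assessment date of 1066-12-25, which puts the 18th-birthday cutoff at 1048-12-25. The bday_source column and the proxied alias are my own additions, there to keep track of which field each proxy came from.

```sql
SELECT
  thouGreatethsOfKeyeths
  , proxy_bday
  , bday_source
FROM (
  SELECT
    thouGreatethsOfKeyeths
    , COALESCE(thouDayOfBirth, thouBaptism, thouLordRegistrationDate) AS proxy_bday
    -- Record which field supplied the value (hypothetical helper column).
    , CASE
        WHEN thouDayOfBirth IS NOT NULL THEN 'birth'
        WHEN thouBaptism IS NOT NULL THEN 'baptism'
        WHEN thouLordRegistrationDate IS NOT NULL THEN 'registration'
      END AS bday_source
  FROM
    tblDomesDayProxyBirthDays
) proxied
WHERE
  -- Born on or before the cutoff means 18 or older on the assumed assessment date.
  proxy_bday <= '1048-12-25'
```

With the three records above, only subject 1 (proxy_bday 1048-02-15, source 'birth') passes the filter. Since baptism and registration happen after birth, a proxy can only make someone look younger than they really are, so a filter like this errs on the side of not taxing people rather than taxing minors by mistake.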


Related books picked – and if possible read – by me. Sponsored by Amazon Associates.

How To Turn A String Into An Integer In Hive SQL For Hadoop

You use the CAST function.


CAST(fieldname AS INT)

Let’s say you have a field called ‘subscription_ids’ in the ‘subscriber’ table that someone saved as strings but that should actually be integers.

To change this to an integer, you would use the command:

SELECT CAST(subscription_ids AS INT) AS subscription_ids_as_int
FROM subscriber
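
One thing to watch for: when a string cannot be converted, Hive's CAST returns NULL rather than raising an error. Here is a minimal sketch of how you might find the offending rows, using the same field and table as above (the bad_subscription_id alias is made up):

```sql
-- Rows where the string is present but does not convert to an integer.
SELECT
  subscription_ids AS bad_subscription_id
FROM
  subscriber
WHERE
  subscription_ids IS NOT NULL
  AND CAST(subscription_ids AS INT) IS NULL
```

If the IDs could be larger than an INT can hold, casting to BIGINT instead is probably the safer choice.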


Five Great Blog Posts on Dealing with Hive for Hadoop SQL: Date and Time, Numeric, String, Conditional, and Collection Functions

I currently work using SQL for Hive for Hadoop. Hive SQL is a little eccentric. If you have worked in other forms of SQL, it takes some getting used to.

Below are five blog posts from folkstalk.com which have helped me immensely.

  1. Date and Time Functions
  2. Numeric functions
  3. String functions
  4. Conditional functions
  5. Collection functions

Why are these posts any good?

  1. Date and Time Functions – Dates and times in Hive tables are often stored as strings in the format ‘YYYY-MM-DD HH:MM:SS.T’ (T is tenths of a second) rather than in a dedicated date/time datatype or as an integer. The post gives you a good sense of how to deal with this.
  2. Numeric functions – It’s pretty much the same as other forms of SQL. Given how nerve rattling Hive SQL can be at first, it’s reassuring to know that some things are still the same.
  3. String functions – Check out the CONCAT function. It’s an actual function in this case and not a symbol like ‘&’.
  4. Conditional functions – Check out the COALESCE function. This one gets used quite a bit when you are trying to clean data by combining multiple fields. All it does is return the first non-NULL value from the fields listed. It’s surprisingly useful.
  5. Collection functions – Check out the CAST function. This is how you change a value from one type to another. For example, to change the field ‘beers_drunk’ from a string to an integer would be CAST(beers_drunk AS int)
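
To get a feel for a few of the functions mentioned above, here is a small sketch using literal values only. It assumes a Hive version that allows SELECT without a FROM clause, and the aliases are made up.

```sql
SELECT
  CONCAT('William', ' ', 'the Conqueror') AS full_name            -- 'William the Conqueror'
  , SUBSTR('1993-06-04 12:30:00.0', 1, 10) AS date_part           -- '1993-06-04'
  , COALESCE(CAST(NULL AS STRING), 'fallback') AS first_non_null  -- 'fallback'
  , CAST('42' AS INT) AS beers_drunk                              -- 42
;
```

The SUBSTR line shows one simple way to pull the date out of a date/time string: character positions start at 1, and the date takes up the first ten characters.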