Five Great Blog Posts on Dealing with Hive for Hadoop SQL: Date and Time, Numeric, String, Conditional, and Collection Functions

I currently work using SQL for Hive for Hadoop. Hive SQL is a little eccentric. If you have worked in other forms of SQL, it takes some getting used to.

Below are five blog posts from folkstalk.com which have helped me immensely.

  1. Date and Time Functions
  2. Numeric functions
  3. String functions
  4. Conditional functions
  5. Collection functions

Why are these posts any good?

  1. Date and Time Functions – Dates and Times in Hadoop are stored as strings in the format ‘YYYY-MM-DD HH:MM:SS.T’ (T is tenth of a second), not a date/time datatype or an integer. It gives you a good sense of how to deal with this.
  2. Numeric functions – It’s pretty much the same as other forms of SQL. Given how nerve rattling Hive SQL can be at first, it’s reassuring to know that some things are still the same.
  3. String functions – Check out the CONCAT function. It’s an actual function in this case and not a symbol like ‘&’.
  4. Conditional functions – Check out the COALESCE function. This one gets used quite a bit when one is trying to clean data by combining multiple fields. All it does is return the first non NULL value from the fields listed. It’s surprisingly useful.
  5. Collection functions – Check out the CAST function. This is how you change a value from one type to another. For example, to change the field ‘beers_drunk’ from a string to an integer would be CAST(beers_drunk AS int)

HELP! Excel Dropped the Leading Zero from My New England 5 Digit ZIP Codes! – The New England Zip Code Problem, Part 1

Contents

Let me guess why you’re here:

Someone just gave you an Excel workbook where one of the columns is ZIP codes and many of them are 4 instead of 5 digits. The cities and states next to ZIP codes are all from New England,
and probably New Jersey and a few other places too
. You no longer have the original text file and are stuck with the workbook.

What’s the problem:

The problem is that Excel is not very good at preserving data types. A five digit ZIP code (which despite the postal service introducing nine digit ones with a dash between the 5th and 6th digit people still generally use) looks like an integer to Excel. Excel treats it as such and eliminates any zeros in the front. However, it is not an integer, it is a code. Specifically, it is a postal code, which you want to use to store as a text string. This isn’t a problem as long as your address is west of Philadelphia, but it becomes a real headache with New England, New Jersey, and Caribbean ZIP codes.

For example, the ZIP code for the neighborhood of Back Bay in Boston should be 02116. However, Excel assumes that it is an integer and gives you 2116.

Another way of thinking about the problem:

A good way of thinking about this is to look at the Postal Codes from other countries, which is what a ZIP Code really is. Here are a couple of examples what they look like in Canada and the United Kingdom, where they use alphanumeric postal codes.

Country Example Postal Code
Canada M5A 1N1
United Kingdom W11 2BQ

You would never store these codes as integers. You would store them as strings. Excel would not mistake them for integers, even if they started with zeros.

What’s the short term solution:

This solution assumes that all of your ZIP codes are in the original 5 digits, and not the new format of 5 digits + 4 digits.

The short term solution is fairly easy. Let’s say the zip codes are in column E. Insert a column to the right of the zip code column E. Next, in the first row of the new column F, type the following formula:

=IF(LEN(E2)<5,REPT("0",5-LEN(E2))&TEXT(E2,"#"),TEXT(E2,"#"))

Replicate the formula down the column.

Wow, that formula is hard to read:

In plain english, what the formula says is:
"If the length of the value in cell E2 is less than the 5 digits in a ZIP code, then concatenate the correct number of zeros to the front of the value of E2 and convert it to a text string. Else, just convert the value in E2 to a text string."

Let’s break it up into its components.

It is all nested in an Excel IF function.

=IF(condition, do if true, do if false)

The condition clause

LEN(E2) < 5 – is the ZIP code less than 5 digits.

The Do if true clause

REPT(“0”,5-LEN(E2)) - repeat the character "0" (NOT the number 0, the text string value "0"). You want to make this a text code, not an integer. You want 5 minus the length of the less than 5
digit numbers because this is the exact number of zeros missing.

&TEXT(E2,”#”) – convert from an integer to a string
and concatenate the 4 or less character ZIP code to the string of zeros you just generated

The Do if false clause

TEXT(E2,"#")) – you found a 5 character ZIP code, convert it to a string from digits.

That's it.

Up Next:
Let’s say you had the original text file. You could prevent this
That will be discussed in The New England ZIP Code Problem: Part II