An Explanation In Plain English Of What Business Intelligence Actually Is, With Two Real World Examples

OK, Explain It To Me

Companies and organizations now feel that they need “Business Intelligence” (BI) to make sense of the data that they gather in order to make better decisions. However, they can be awfully vague on what that means. Before I got hired to work in Business Intelligence, I used to wonder about what on earth they were talking. So what are employers actually doing when they say that they are creating “Business Intelligence”?

The answer is straight forward once you get past the buzzwords:

Business Intelligence means to use SQL queries and scripting to extract information that a human being can easily digest from an organization’s database or databases.

That’s It?

Yes, that’s it. It’s a branch of data analysis.

Fantastic, I am going to set up Business Intelligence for my organization tomorrow!

Now hold on there. Just because one can explain BI easily in one sentence does not mean that it is easy to do. There are at least four major barriers.

  1. The SQL queries can be very complex. They can take days or even weeks to write.
  2. Scripting, using a stats package like SAS or R, or a reporting program like Crystal Reports is often needed in addition to SQL to get the answers you need and to present it in an easily digestible format.
  3. The previous two statements assume that the databases, querying applications, and other software have been set up well in the first place, which often takes a large staff beyond those in an organization’s BI section.
  4. The data needed to do the analysis may not be currently collected. Developers’ of live business databases main priority is usually to ensure that an organization’s information infrastructure runs smoothly from day to day. It is very likely that it never occurred to your organization’s developers to record the data points that you need.

OK, fine, it’s tougher that it looks, but I want to see these real world examples so I can have some idea what I should expect.

No problem.

I can actually give you two good examples of Business Intelligence which I helped create. I used to work for the marketing company LivingSocial as a Business Intelligence Analyst. While most of my work was proprietary, I did BI for two articles for LivingSocial’s Blog.

  1. What are the Nicest Cities in the US? – discusses which of LivingSocial’s American cities have the largest percentage of gift purchases
  2. Who’s Leaving on a Jet Plane? – discusses which of LivingSocial’s American cities have the largest and smallest percentages of travel purchases

I wrote the SQL queries that returned the percentages and tables used in both blog posts. While both are simple examples of BI, they are very typical of the information that organizations wish to get from their data.

facebooktwittergoogle_plusredditpinterestlinkedinmail

Related books picked – and if possible read – by me. Sponsored by Amazon Associates.

What is the Hive SQL COALESCE function, what does it do, and why on earth is it useful?

What is the Hive SQL COALESCE function?

There is a function called COALESCE that exists in Hive SQL. It has the following syntax.

COALESCE(field1, field2, … , fieldn)

What does it do?

Of the fields above (field1, field2, … , fieldn), for each record returned it returns the value of the first field of the ones listed in the COALESCE function that is not NULL. If all of the fields are NULL, it returns NULL.

For example, let’s say one had 3 date fields, datefield1, datefield2, and datefield3 from the table tblDates.

tblDates
primary_key datefield1 datefield2 datefield3
1 NULL NULL 1993-06-04

The code:

SELECT COALESCE(datefield1, datefield2, datefield3) as first_date_found
FROM
tblDates
WHERE
primary_key = 1

will return ‘1993-06-04’

OK, what on earth do I use this for?

You use it to create proxy values for NULLs in a field.

It is very useful when you have dirty data and have to use another field as an approximation of the dirty data’s actual value. If you have a field that is full of NULLs, you can use another field to put values in for those NULLs that you think provide a good approximate value of what should be there.

I want an example!

OK.

Let’s say you are William the Conqueror and have just decided to put together the Domesday Book, where you need to get everyone’s birth date so that you can figure out who and how much you can tax in your new conquered land of England and Wales. Let’s say to tax them that you need to know that they are legal adults over the age of 18 but birth records are not very good in 1066. Let’s also say you magically have an apache server with hadoop on it.

What do you do?

Let’s say your tblDomesDayProxyBirthDays table has 4 fields

table fields data status of the field
thouGreatethsOfKeyeths an 11th century version of a Social Security number
thouDayOfBirth You have some birthdays but not all
thouBaptism You have more baptisms but not all
thouLordRegistrationDate You have all the days they first registered with the local lord of the manor

Let’s say you have these three records in tblDomesDayProxyBirthDays.

tblDomesDayProxyBirthDays
thouGreatethsOfKeyeths thouDayOfBirth thouBaptism thouLordRegistrationDate
1 1048-02-15 1048-03-01 1048-04-01
2 NULL 1049-08-29 1049-10-29
3 NULL NULL 1050-07-01

Well, you need their birthday to figure out if they are adults you can tax. However, you do not always have their “thouDayOfBirth” as shown above.

COALESCE allows you to use other data from other fields as a proxy.

If you use the code:

SELECT
thouGreatethsOfKeyeths
, COALESCE(thouDayOfBirth, thouBaptism, thouLordRegistrationDate) AS proxy_bday
FROM
tblDomesDayBirthDays

You will get:

thouGreatethsOfKeyeths proxy_bday
1 1048-02-15
2 1049-08-29
3 1050-07-01

For the first subject, you have their actual birthday. For the second subject, you have their first day of kindergarten. For the third subject, you have the day they registered with the lord of the manor. You now have a birthday for all of your subjects, even if some of them are not as accurate as you would like.

facebooktwittergoogle_plusredditpinterestlinkedinmail

Related books picked – and if possible read – by me. Sponsored by Amazon Associates.

How To Join Strings In SQL For Hive

Use the CONCAT function. It can be a bit confusing because joining strings, or concatenating (hence the abbreviation CONCAT), is usually done with a symbol like ‘&’ or ‘+’ in other systems and languages.

The syntax is pretty straight forward.


CONCAT(string1, string2, ...)

Let’s say you you had three string fields, address, city, and state from table contact_info
that you wanted to join (concatenate) together with commas and spaces inbetween. You would use the following code:


SELECT CONCAT(address,', ',city,', ',state)
FROM
contact_info

facebooktwittergoogle_plusredditpinterestlinkedinmail

Related books picked – and if possible read – by me. Sponsored by Amazon Associates.

How To Turn A String Into An Integer In Hive SQL For Hadoop

You use the CAST function.


CAST(fieldname AS INT)

Let’s say you have a field called ‘subscription_ids’ in the ‘subscriber’ table that someone saved as a string but actually should be integers.

To change this to an integer, you would use the command:

SELECT CAST(subscription_ids AS INT) as subscription_ids_as_int

FROM Subscriber

Related books picked – and if possible read – by me. Sponsored by Amazon Associates.