Aggregating Using Unpivot

Aug 18, 2017

A common task when analyzing data concerning a population is to determine how various subpopulations compare to one another. If the sub populations are nonoverlaping populations, a simple group by clause will usually solve your problem. When they overlap however, you have to find other solutions.

A common task when analyzing data concerning a population is to determine how various subpopulations compare to one another. If the sub populations are nonoverlaping populations, a simple group by clause will usually solve your problem. When they overlap however, you have to find other solutions.

The most common solution I see to this particular problem is a series of case statements. This can be a huge pain to write and even worse to maintain. Recently, I've learned that Oracle's unpivot functionality can be a better solution.

The Problem

Recently, I was asked to report on course performance based on student demographics. I have a table that includes many indicators (values are mostly Y or N) and looks something like this.

demographics
    person_uid
    oregon_resident_ind
    us_minority_ind
    first_generation_ind

I need to use this table to build a report has a format that looks something like what's below.

course id | course avg grade | course dwfu rate
    group | avg grade | dwfu rate | Δ course avg grade | Δ course dwfu rate

Getting the Course Info

This part is pretty easy; a simple group by expression will give me the three values I need to calculate the desired course wide values in report.

select
    course_id
    , count(*)                  as course_cnt
    , avg(final_grade)          as course_avg_grade
    , sum(is_dwfu(final_grade)) as course_dwfu_cnt
from  ...
where ...
group by
    course_id

The difficulty now lies in dealing with all the demographic information.

Getting the Demographic Info

Solution 1 - Lots of Case Statements

As I said before, the most common solution I see for this is using a series of case statements. If you've done this before, you can probably you guess what this particular nightmare is going to look like.

sum(
    case when us_minority_ind = 'Y'
        then 1
        else 0
    end
) as us_minority_cnt
, avg(
    case when us_minority_ind = 'Y'
        then final_grade
        else null
    end
) as us_minority_avg
, sum(
    case when us_minority_ind = 'Y' and is_dwfu(final_grade) = 1
        then 1
        else 0
    end
) as us_minority_dwfu_cnt

For this one demographic, that's 18 lines of code! The example I listed is 3 demographics. Basic math tells us that's 54 lines of code just for this section of my example. In reality, I currently have 11 different demographic indicators to deal with, and there's a good chance that there will be more in the future. Show me a programmer that wants to deal with 198 lines of this nonsense and I'll show you a crazy person.

Want to or not, sometimes the fact that it works is enough. This does provide the various fields needed to calculate the desired reports, which is why I've used this solution in the past.

Unfortunately, this solution gives me one massively flat line. SSRS doesn't like that. Since SSRS does not include an unpivot feature and I'm loath to start hard coding textboxes, I've traditionally unpivoted these large lines. As I started to implement that here, it gave me the idea for solution 2.

Solution 2 - Calculations in the Unpivot

An unpivot query has two primary parts, the source query and the actual unpivot step. For my source query, I want to maintain the overall course information, and bring in the indicators. To do this, I used Oracle's window clause. This allowed me to define grouping, while still getting results row by row.

The Source Query

select
    course_identification
    , count(*) over (
        partition by
            course_identification
    ) as course_cnt
    , sum(final_grade) over (
            partition by
                course_identification
        ) / count(*) over (
            partition by
                course_identification
        )
    ) as course_avg_grade
    , sum(
        case when is_dwfu(final_grade)
            then 1
            else 0
        end
    ) over (
        partition by
            course_identification
    ) as course_cnt_dwfu
    , oregon_residency_ind
    , us_minority_ind
    , first_gen_ind
    , final_grade
from
 ...

Notice I'm just including the demographic fields that I care about? I haven't included the 198 lines of hell that I mentioned earlier. Even if I were to include all of my 11 demographics, the select clause would still only be around 35 lines long. So onto the unpivot!

The Unpivot Clause

The unpivot clause is the portion of a query where you define the actual unpivoting action you want to perform. Within this clause, you need to list the fields as they'll be available in the select clause. I've decided to name my fields ind (since it's identifying if they're in the group or not), grade (an alias for final_grade) and pop (as in population). The lines beneath that list what fields will be unpivoted. The first set puts oregon_residency_ind into ind, final_grade into grade and Resident into pop.

) unpivot (
    (ind, grade) for pop in (
        (oregon_residency_ind, final_grade) as 'Resident'
        , (us_minority_ind, final_grade) as 'US Minority'
        , (first_gen_ind, final_grade) as 'First Generation'
    )
)

Once that's done I'll have a bunch of rows. In fact, you'll see one row per person for each demographic that was originally selected. Since I've selected 3 demographics, I'll have three rows per person. Our last step will be to aggregate a second time to get our final result.

Putting it all Together

Finally, we can put it all together. We'll need to use our same 3 case statements from earlier, but now we only need to do this once! We can get away with this since each row is now only relevant to a single population, a normal group by clause will wrap everything up.

select 
    course_identification
    , course_cnt
    , course_avg_grade
    , course_cnt_dwfu
    , pop
    , sum(
        case when ind = 'Y'
            then 1
            else 0
        end
    ) as pop_cnt
    , sum(
        case when ind = 'Y'
                then bicreprt.f_letter_to_points_osu(grade)
            else 0
        end
    ) / sum(case when ind = 'Y' then 1 else 0 end) as pop_avg_grade
    , sum(
        case when ind = 'Y' and is_dwfu(grade)
            then 1
            else 0
        end
    ) as pop_cnt_dwfu
from (
    {{ source query goes here }}
) unpivot (
    (ind, grade) for pop in (
        (oregon_residency_ind, final_grade) as 'Resident'
        , (us_minority_ind, final_grade) as 'US Minority'
        , (first_gen_ind, final_grade) as 'First Generation'
    )
)
group by
    course_identification
    , course_cnt
    , course_avg_grade
    , course_cnt_dwfu
    , pop

In Conclusion

The concept of pivoting and unpivoting can be a little hard to get your head around. A lot of SQL is that way for most people. This trick however, does a good job of eliminating a huge amount of code redundancy (probably some error causing typos...), and in my testing actually performed significantly better than having 33 different case statements.