my commentary on things
A little SQL makes the world a better place
A couple of weeks ago a colleague of mine raised a problem they encountered transforming some data to a required output format. The problem was to take a table of traveller itinerary information and produce a summary of the cities they included as a comma separated list. Although I couldn’t think of the solution right away the problem reminded me of similar things I had come across in the past. This problem has now been rattling around in my head for long enough so this afternoon with a little ANZAC Day spirit I sat down to conquer it.
The Problem
To explain the issue in more detail and to run through my solution I will be using a simple table which represents the traveller itinerary data. The Journey table has four columns:
- Start – the start of one leg of a traveller’s journey
- End – the end of one leg of a traveller’s journey
- Hotel – a hotel reservation
- Car – a hire car reservation
Each row of the table represents one leg of a traveller’s journey, and each column contains a city code or a NULL value. Might be a little easier to explain with an example. My example data looks like this:
Start End Hotel Car ------------------------------- SYD MEL MEL NULL MEL ADL NULL ADL ADL SYD NULL NULL NULL NULL BRIS NULL BRIS PER PER PER PER SYD NULL NULL
The end goal is to reduce this matrix of Australia’s most jumping hotspots to a CSV summary of the overall journey like this:
SYD, MEL, ADL, SYD, BRIS, PER, SYD
The sequence of the cities visited is maintained but the duplicates of each city in the sequence are removed. To transform the data I want to go through three transformation stages, each of which are commonly recurring problems when working with data, especially for reporting. The three stages are:
- Rotating the data from the columns of table into rows of data
- Removing duplicates from an ordered lists of values while maintaining the overall sequence
- Producing the CSV string of results
Stage 1 – Rotating columns to rows
This stage will take our journey data and return all the city codes as row data like this.
LegName City ----------------- Start SYD End MEL Hotel MEL Start MEL End ADL Car ADL Start ADL End SYD Hotel BRIS Start BRIS End PER Hotel PER Car PER Start PER End SYD
Just by using the word “rotating” most devs with an exposure to business intelligence work will know where I am heading with this. If you guessed “pumpkin scones” you’re a little off, although they are delicious they are terrible at rotating data. However, if you guessed unpivot you are right. A great deal less delicious than pumpkin scones the SQL Server unpivot function is the perfect method for transforming our columns into rows.
For those unfamiliar with unpivot (or pivot) the syntax can seem a little whaaa? I will do my best to give a short explanation but for the correct, no jibba jabba explanation check out the MSDN documentation. The SQL to perform the unpivot on the Journey table is as follows.
SELECT LegName, City FROM Journey UNPIVOT (City FOR LegName IN (Start, [End], Hotel, Car)) Leg
To perform the unpivot we introduce two new columns to our SQL results set in addition to four columns of the Journey table. The first column LegName will contain the column values being rotated (Start, End, Hotel and Car) this is known as a pivot_column. The second column City will hold the values that currently reside in the columns being rotated, this is known as a value_column.
To properly utilise the unpivoted data to solve the overall problem I will add another column to this result set. Adding a gererated row id called Ord will make it easier keep the data in the correct sequence. To do this I select the result into a temporary table #Legs using the identity type to generate the ids.
SELECT Ord = identity(INT,1,1) ,City INTO #Legs FROM Journey UNPIVOT (City FOR LegName IN Start, [End], Hotel, Car)) Leg
The resulting temp table now looks like this:
Ord City -------------- 1 SYD 2 MEL 3 MEL 4 MEL 5 ADL 6 ADL 7 ADL 8 SYD 9 BRIS 10 BRIS 11 PER 12 PER 13 PER 14 PER 15 SYD
One important thing to note with both the pivot and unpivot in SQL Server, NULL values are not considered. As you can see the NULL values from the original table are not included in the output. This is very important to note if you are performing an aggregate function like count.
Stage 2 – Remove the duplicates
At first glance removing the duplicates in the result from stage 1 (#Legs) should be easy – just use DISTINCT or GROUP BY to return the list less duplicates. Unfortunately there is a condition to this problem which requires the sequence of cities to remain intact. Both distinct and group by will in fact return a unique list of all the cities but it will not keep the sequence of cities that have been visited intact. For example selecting distinct cities from #Legs returns the following result.
City ------- ADL BRIS MEL PER SYD
The result we need to keep the sequence accurate is like this:
City ------- SYD MEL ADL SYD BRIS PER SYD
To achieve this result we use some T-SQL FU that I acquired some time ago and had almost forgotten (behind a bevy of thoughts about pumpkin scones and cupcakes). To remove the duplicate cities within the sequence we need the Ord column added in stage 1 along with an additional ROW_NUMBER() column which will be added like this:
SELECT Ord ,ROW_NUMBER() OVER (PARTITION BY City ORDER BY Ord) Seq ,City FROM #Legs ORDER BY Ord
This now returns a result set like this
Ord Seq City -------------- 1 1 SYD 2 1 MEL 3 2 MEL 4 3 MEL 5 1 ADL 6 2 ADL 7 3 ADL 8 2 SYD 9 1 BRIS 10 2 BRIS 11 1 PER 12 2 PER 13 3 PER 14 4 PER 15 3 SYD
Now for the Fu bit. By grouping our results by the (Ord – Seq) we can strip the duplicates from the sequence. Notice the Ord column is also used to maintain the original sequence of Cities.
SELECT CITY FROM ( SELECT Ord ,ROW_NUMBER() OVER (PARTITION BY City ORDER BY Ord) Seq ,City FROM #Legs ) LegSeq GROUP BY CITY, (Ord-Seq) ORDER BY MIN(Ord)
The result is exactly what we were looking for
City ------- SYD MEL ADL SYD BRIS PER SYD
Stage 3 – SQL to CSV
If you recall at the start of the post the final output needed is a CSV list of the City codes. To do this I will use the magic that is the COALESCE function
DECLARE @AllCities VARCHAR(100) SELECT @AllCities = COALESCE(@AllCities + ', ', '') + CITY FROM ( SELECT Ord ,ROW_NUMBER() OVER (PARTITION BY City ORDER BY Ord) Seq ,City FROM #Legs ) LegSeq GROUP BY CITY, (Ord-Seq) ORDER BY MIN(Ord) SELECT @AllCities
When @AllCities is NULL (the first row processed), it returns an empty string. On subsequent rows, it concatenates the @AllCities variable with a comma and the current City value.
The output of all of this is exactly what is required
SYD, MEL, ADL, SYD, BRIS, PER, SYD
All is right in the world (and I am off for a cupcake)
Although the scenario here is fairly specific and may seem simple on the face of it, the detail is the killer. There are countless other ways to solve this solution (some of them involving loops or cursors *shudder*). I chose this approach to demonstrate three useful techniques I have picked up over the years.
Now I really am off for a cupcake.
| Print article | This entry was posted by stownsend on April 25, 2010 at 1:31 pm, and is filed under Development. Follow any responses to this post through RSS 2.0. You can leave a response or trackback from your own site. |
about 1 year ago
Ah, always happy to see some T-Sql Fu. With this knowledge in hand, I can finally pass my yellow belt T-Sql Fu test.