A couple of weeks ago a colleague of mine raised a problem they encountered transforming some data to a required output format. The problem was to take a table of traveller itinerary information and produce a summary of the cities they included as a comma separated list. Although I couldn’t think of the solution right away the problem reminded me of similar things I had come across in the past. This problem has now been rattling around in my head for long enough so this afternoon with a little ANZAC Day spirit I sat down to conquer it.

The Problem

To explain the issue in more detail and to run through my solution I will be using a simple table which represents the traveller itinerary data. The Journey table has four columns:

  • Start – the start of one leg of a traveller’s journey
  • End – the end of one leg of a traveller’s journey
  • Hotel – a hotel reservation
  • Car – a hire car reservation

Each row of the table represents one leg of a traveller’s journey, and each column contains a city code or a NULL value. Might be a little easier to explain with an example. My example data looks like this:

Start	End	Hotel	Car
-------------------------------
SYD	MEL	MEL	NULL
MEL	ADL	NULL	ADL
ADL	SYD	NULL	NULL
NULL	NULL	BRIS	NULL
BRIS	PER	PER	PER
PER	SYD	NULL	NULL

The end goal is to reduce this matrix of Australia’s most jumping hotspots to a CSV summary of the overall journey like this:

SYD, MEL, ADL, SYD, BRIS, PER, SYD

The sequence of the cities visited is maintained but the duplicates of each city in the sequence are removed. To transform the data I want to go through three transformation stages, each of which are commonly recurring problems when working with data, especially for reporting. The three stages are:

  1. Rotating the data from the columns of table into rows of data
  2. Removing duplicates from an ordered lists of values while maintaining the overall sequence
  3. Producing the CSV string of results

Stage 1 – Rotating columns to rows

This stage will take our journey data and return all the city codes as row data like this.

LegName	City
-----------------
Start	SYD
End	MEL
Hotel	MEL
Start	MEL
End	ADL
Car	ADL
Start	ADL
End	SYD
Hotel	BRIS
Start	BRIS
End	PER
Hotel	PER
Car	PER
Start	PER
End	SYD

Just by using the word “rotating” most devs with an exposure to business intelligence work will know where I am heading with this. If you guessed “pumpkin scones” you’re a little off, although they are delicious they are terrible at rotating data. However, if you guessed unpivot you are right. A great deal less delicious than pumpkin scones the SQL Server unpivot function is the perfect method for transforming our columns into rows.

For those unfamiliar with unpivot (or pivot) the syntax can seem a little whaaa? I will do my best to give a short explanation but for the correct, no jibba jabba explanation check out the MSDN documentation. The SQL to perform the unpivot on the Journey table is as follows.

SELECT LegName, City
FROM Journey UNPIVOT (City FOR LegName IN (Start, [End], Hotel, Car)) Leg

To perform the unpivot we introduce two new columns to our SQL results set in addition to four columns of the Journey table. The first column LegName will contain the column values being rotated (Start, End, Hotel and Car) this is known as a pivot_column. The second column City will hold the values that currently reside in the columns being rotated, this is known as a value_column.

To properly utilise the unpivoted data to solve the overall problem I will add another column to this result set. Adding a gererated row id called Ord will make it easier keep the data in the correct sequence. To do this I select the result into a temporary table #Legs using the identity type to generate the ids.

SELECT Ord = identity(INT,1,1)
	,City
INTO #Legs
FROM Journey
	UNPIVOT (City FOR LegName IN Start, [End], Hotel, Car)) Leg

The resulting temp table now looks like this:

Ord	City
--------------
1	SYD
2	MEL
3	MEL
4	MEL
5	ADL
6	ADL
7	ADL
8	SYD
9	BRIS
10	BRIS
11	PER
12	PER
13	PER
14	PER
15	SYD

One important thing to note with both the pivot and unpivot in SQL Server, NULL values are not considered. As you can see the NULL values from the original table are not included in the output. This is very important to note if you are performing an aggregate function like count.

Stage 2 – Remove the duplicates

At first glance removing the duplicates in the result from stage 1 (#Legs) should be easy –  just use DISTINCT or GROUP BY to return the list less duplicates. Unfortunately there is a condition to this problem which requires the sequence of cities to remain intact. Both distinct and group by will in fact return a unique list of all the cities but it will not keep the sequence of cities that have been visited intact. For example selecting distinct cities from #Legs returns the following result.

City
-------
ADL
BRIS
MEL
PER
SYD

The result we need to keep the sequence accurate is like this:

City
-------
SYD
MEL
ADL
SYD
BRIS
PER
SYD

To achieve this result we use some T-SQL FU that I acquired some time ago and had almost forgotten (behind a bevy of thoughts about pumpkin scones and cupcakes). To remove the duplicate cities within the sequence we need the Ord column added in stage 1 along with an additional ROW_NUMBER() column which will be added like this:

SELECT Ord
	,ROW_NUMBER() OVER (PARTITION BY City ORDER BY Ord) Seq
	,City
FROM #Legs
ORDER BY Ord

This now returns a result set like this

Ord	Seq	City
--------------
1	1	SYD
2	1	MEL
3	2	MEL
4	3	MEL
5	1	ADL
6	2	ADL
7	3	ADL
8	2	SYD
9	1	BRIS
10	2	BRIS
11	1	PER
12	2	PER
13	3	PER
14	4	PER
15	3	SYD

Now for the Fu bit. By grouping our results by the (Ord – Seq) we can strip the duplicates from the sequence. Notice the Ord column is also used to maintain the original sequence of Cities.

SELECT CITY
FROM (
	SELECT Ord
	,ROW_NUMBER() OVER (PARTITION BY City ORDER BY Ord) Seq
	,City
	FROM #Legs
	) LegSeq
GROUP BY CITY, (Ord-Seq)
ORDER BY MIN(Ord)

The result is exactly what we were looking for

City
-------
SYD
MEL
ADL
SYD
BRIS
PER
SYD

Stage 3 – SQL to CSV

If you recall at the start of the post the final output needed is a CSV list of the City codes. To do this I will use the magic that is the COALESCE function

DECLARE @AllCities VARCHAR(100)

SELECT @AllCities =  COALESCE(@AllCities + ', ', '') + CITY
FROM (
	SELECT Ord
	,ROW_NUMBER() OVER (PARTITION BY City ORDER BY Ord) Seq
	,City
	FROM #Legs
	) LegSeq
GROUP BY CITY, (Ord-Seq)
ORDER BY MIN(Ord)

SELECT @AllCities

When @AllCities is NULL (the first row processed), it returns an empty string. On subsequent rows, it concatenates the @AllCities variable with a comma and the current City value.

The output of all of this is exactly what is required

SYD, MEL, ADL, SYD, BRIS, PER, SYD

All is right in the world (and I am off for a cupcake)

Although the scenario here is fairly specific and may seem simple on the face of it, the detail is the killer. There are countless other ways to solve this solution (some of them involving loops or cursors *shudder*). I chose this approach to demonstrate three useful techniques I have picked up over the years.

Now I really am off for a cupcake.