How do we go about calculating paths between two connecting nodes; say we have a table with two columns – FromNode and ToNode; our query has a starting node and a destination node and we need to calculate all possible paths between the start and destination and return them to the user. This comes from a question I answered on the Microsoft NNTP news groups.
Example
**Note, the script to build the table of nodes and data is attached at the bottom of this blog entry.
From To
100035 100046
100035 100327
100327 100046
etc…
The output needs to look like this, basically the relational node structure pairs flattened into an enumerated list (probably a CSV or XML in the real world).
p0 p1 p2 p3 p4 p5 edges
----------- ----------- ----------- ----------- ----------- ----------- -----------
100035 100046 -1 -1 -1 -1 1
100035 100327 100046 -1 -1 -1 2
100035 100334 100046 -1 -1 -1 2
100035 100351 100046 -1 -1 -1 2
100035 100036 100046 -1 -1 -1 2
100035 100065 100046 -1 -1 -1 2
100035 100179 100046 -1 -1 -1 2
100035 100182 100046 -1 -1 -1 2
100035 100076 100046 -1 -1 -1 2
100035 100078 100046 -1 -1 -1 2
100035 100094 100046 -1 -1 -1 2
100035 100125 100046 -1 -1 -1 2
100035 100150 100046 -1 -1 -1 2
100035 100158 100046 -1 -1 -1 2
100035 100267 100046 -1 -1 -1 2
100035 100372 100046 -1 -1 -1 2
100035 100036 100150 100046 -1 -1 3
100035 100036 100158 100046 -1 -1 3
100035 100036 100125 100046 -1 -1 3
So, how do we go about doing this in SQL Server? SQL Server 2005 introduced a new operator on the FROM clause – the APPLY operator; this allows us to take a row from the left table joining and passing a value into a table value function, the table value function runs and returns a recordset for each row passed in; you have CROSS and OUTER functionality, the CROSS is a bit confusing – to me its more like an INNER JOIN in that no rows are produced if the table valued function doesn’t return anything whereas the OUTER you always get a row from the table on the left hand side of the join.
What does that mean in English? Lets look at some examples.
Take our table valued function that really does the biz for us…
CREATE FUNCTION tvfn_get_nodes_for(
@FromNode as int,
@StartAtNode int,
@StopAtNode int )
RETURNS TABLE
AS
RETURN(
SELECT ToNode
FROM Paths
WHERE FromNode = @FromNode
AND FromNode <> @StopAtNode
AND ToNode <> @StartAtNode
)
GO
When we run our table valued function in isolation we get 873 rows…
SELECT *
FROM tvfn_get_nodes_for( 100035, 100035, 100046 )
So, that’s the nodes that 100035 is connected to, we need to do it recursively for each path through the tree, we can use the OUTER APPLY functionality to do that, so take the following example.
declare @nodes table (
start_node int not null
)
insert @nodes values( 100036 )
insert @nodes values( 100041 )
insert @nodes values( 100042 )
insert @nodes values( 100043 )
insert @nodes values( 100044 )
SELECT n.start_node, count(*)
FROM @nodes n
OUTER APPLY tvfn_get_nodes_for( n.start_node, 100035, 100046 )
GROUP BY n.start_node
We are driving the table valued function from the @nodes table variable, the table valued function will execute 5 times, once for 100036, then 100041 etc… The output is shown below…
start_node rows
----------- -----------
100036 738
100041 5
100042 25
100043 20
100044 24
(5 row(s) affected)
Cool or what! The biggest use I have for this CROSS APPLY stuff is for the dynamic management views to get at what SQL Server is doing – a lot of them have been implemented as table valued functions.
So that’s the guts of it, a real query that gives 5 node hops is shown below, increasing the number of hops is simple – you just add more OUTER APPLY’s and the other bits.
DECLARE @FromNode int,
@ToNode int
SET @FromNode = 100035
SET @ToNode = 100046
SELECT *,
edges = 5 - ( CASE WHEN p2 = -1 THEN 4 WHEN p3 = -1 THEN 3 WHEN p4 = -1 THEN 2 WHEN p5 = -1 THEN 1 ELSE 0 END )
FROM (
SELECT p0 = p1.FromNode,
p1 = COALESCE( p1.ToNode, -1 ),
p2 = COALESCE( p2.ToNode, -1 ),
p3 = COALESCE( p3.ToNode, -1 ),
p4 = COALESCE( p4.ToNode, -1 ),
p5 = COALESCE( p5.ToNode, -1 )
FROM paths AS p1
OUTER APPLY tvfn_get_nodes_for( p1.ToNode, @FromNode, @ToNode ) AS p2
OUTER APPLY tvfn_get_nodes_for( p2.ToNode, @FromNode, @ToNode ) AS p3
OUTER APPLY tvfn_get_nodes_for( p3.ToNode, @FromNode, @ToNode ) AS p4
OUTER APPLY tvfn_get_nodes_for( p4.ToNode, @FromNode, @ToNode ) AS p5
WHERE p1.FromNode = @FromNode
) AS d
WHERE ( p1 = @ToNode OR p2 = @ToNode OR p3 = @ToNode OR p4 = @ToNode OR p5 = @ToNode )
AND ( p1 NOT IN ( p2, p3, p4, p5 )
OR p1 = -1 )
AND ( p2 NOT IN ( p1, p3, p4, p5 )
OR p2 = -1 )
AND ( p3 NOT IN ( p1, p2, p4, p5 )
OR p3 = -1 )
AND ( p4 NOT IN ( p1, p2, p3, p5 )
OR p4 = -1 )
AND ( p5 NOT IN ( p1, p2, p3, p4 )
OR p5 = -1 )
ORDER BY edges
The usual recommendations apply, functions are usually resource intensive but in this situation I can’t honestly think of a better way of doing it without resorting to more linear approaches; of course – if the data was stored x.x.x.x. in the first place we wouldn’t have this problem.