**WARNING: Outdated information. STOP READING.**

Please see Optimizing Rule Evaluation for the correct description. It is OK to work the examples on this page, but be sure to give each rule a unique ID as defined in the current writeup.

The goal of project 5 is to optimize the fix-point algorithm for generating facts from rules. The general goal is to determine precisely which rules require iteration, which rules do not, and the order in which the rules should be evaluated.

This explanation assumes there is a query relating to each rule, so if there is a rule for *A*, then there is a query about *A*. A later section will discuss how to perform this same optimization subject to the actual queries in the file (i.e., only evaluate rules needed to answer the queries). The process for optimizing rule evaluation consists of three steps:

- Building a dependency forest from the rules
- Finding the strongly connected components (SCCs) in the forest
- Visiting each SCC in topological order, computing a fix-point on the set of rules in each SCC.

Each step will be discussed independently by way of a working example.

The working example to describe how to optimize rule evaluation is the set of rules:

A(X,Y) :- B(X,Y), C(X,Y)
B(X,Y) :- A(X,Y), D(X,Y)
E(X,Y) :- F(X,Y), G(X,Y)

Recall, too, the meaning of a rule in Datalog. For example, the first rule means $\forall X\ \forall Y\ (B(X,Y) \wedge C(X,Y) \rightarrow A(X,Y))$. This understanding is important to the algorithm for optimizing rule evaluation.

The rule dependency forest shows how the rules are interrelated. The vertices of the forest are the IDs of the rules and any relations referred to in the rules. For the example rule set, the vertex set is $V = \{A, B, C, D, E, F, G\}$.

The edges between the vertices in the forest are created from the rules directly: add a directed edge from the ID on the left of the rule to each ID on the right of the rule. Consider the first rule in the example rule set. The rule creates the edges $\{(A, B), (A, C)\}$ (note that both edges have the same source, as expected). The second rule creates the edges $\{(B, A), (B, D)\}$. The final rule creates the edges $\{(E, F), (E, G)\}$. The edge set for the forest is $E = \{(A, B), (A, C), (B, A), (B, D), (E, F), (E, G)\}$. The resulting graph $G = (V, E)$ is the rule dependency forest for the rule set.
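The construction just described might be sketched as follows, assuming each rule has already been parsed into a head relation and a list of body relations (the `(head, body)` pair representation and the function name are assumptions for illustration, not part of the project spec):

```python
def build_dependency_graph(rules):
    """Map each relation to the set of relations its rule depends on.

    Relations that never appear as a rule head get an empty adjacency
    set so they still show up as vertices (the leaves of the forest).
    """
    graph = {}
    for head, body in rules:
        graph.setdefault(head, set())
        for relation in body:
            graph[head].add(relation)
            graph.setdefault(relation, set())
    return graph

# The example rule set from above.
rules = [("A", ["B", "C"]), ("B", ["A", "D"]), ("E", ["F", "G"])]
graph = build_dependency_graph(rules)
```

Here `graph["A"]` is `{"B", "C"}`, matching the edges $(A, B)$ and $(A, C)$, while a leaf such as `graph["D"]` is the empty set.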

All of the graph concepts used in this section are defined in Chapter 11 of the course text. The full presentation of the algorithm to compute SCCs, with its proof, is found in section 3.4 of *Algorithms* by S. Dasgupta, C. H. Papadimitriou, and U. V. Vazirani. The algorithm is only briefly explained here by way of the example.

For the algorithm, given a graph $G = (V, E)$, let $G^R = (V, E^R)$ be the graph created by reversing all of the edges in $E$ to get $E^R$. For the graph in the example, $E^R = \{(B, A), (C, A), (A, B), (D, B), (F, E), (G, E)\}$. Given the graphs $G$ and $G^R$, the algorithm to compute the SCCs in $G$ is as follows:
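Computing $G^R$ is a mechanical edge flip. A minimal sketch, assuming a dictionary-of-sets adjacency representation (an assumption for illustration, not from the project spec):

```python
# The example dependency graph G.
graph = {
    "A": {"B", "C"}, "B": {"A", "D"}, "C": set(), "D": set(),
    "E": {"F", "G"}, "F": set(), "G": set(),
}

def reverse_graph(graph):
    """Build G^R by flipping every edge of G."""
    reversed_graph = {v: set() for v in graph}
    for source, targets in graph.items():
        for target in targets:
            reversed_graph[target].add(source)
    return reversed_graph
```

Applied to the example, the edge $(A, B)$ becomes $(B, A)$, and so on, reproducing $E^R$ above.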

1. Run depth-first search on $G^R$, labeling each vertex with its post-order number (recall that the graph may be a forest, so you may need to start multiple searches from different vertices).
2. Run depth-first search on $G$, processing the vertices in decreasing order of their post-order numbers and keeping track of the connected components.

Step (2) merits further explanation. Depth-first search can be used to find connected components in an undirected graph by keeping track of all the vertices visited in the search. At the end of the search, if there are unvisited vertices (as is the case in a non-connected graph or forest), then another depth-first search is started at one of the unvisited vertices, and the vertices visited in that new search belong to a new connected component. The process is repeated until all vertices are visited and assigned a connected component. This same algorithm also finds the strongly connected components of a directed graph when the vertices are processed in a specific order. That order is determined by step (1) of the algorithm.

For the example, step (1) processes vertices in lexicographical order in the depth-first search. The resulting post-order from $G^R$ is $B < A < C < D < E < F < G$. Step (2) of the algorithm then finds the connected components, processing vertices in decreasing order. Assuming there exists a method named *DFS(v)* that returns the vertices visited in the graph $G$, the order of calls to the depth-first search in step (2), with the actual connected components, is

- *DFS(G)* = {*G*}
- *DFS(F)* = {*F*}
- *DFS(E)* = {*E*}
- *DFS(D)* = {*D*}
- *DFS(C)* = {*C*}
- *DFS(A)* = {*A*, *B*}

There are six strongly connected components in the graph *G*.
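The two passes can be sketched end to end on the example. This is only an illustrative sketch, assuming the dictionary-of-sets graph representation; function names are hypothetical, and neighbors are visited in sorted order to match the lexicographical convention used above:

```python
# The example dependency graph G and its reversal G^R.
graph = {
    "A": {"B", "C"}, "B": {"A", "D"}, "C": set(), "D": set(),
    "E": {"F", "G"}, "F": set(), "G": set(),
}
reversed_graph = {v: set() for v in graph}
for source, targets in graph.items():
    for target in targets:
        reversed_graph[target].add(source)

def postorder(g):
    """Step (1): DFS in lexicographical order; return vertices in
    post-order (smallest post number first)."""
    visited, order = set(), []
    def dfs(v):
        visited.add(v)
        for w in sorted(g[v]):
            if w not in visited:
                dfs(w)
        order.append(v)
    for v in sorted(g):
        if v not in visited:
            dfs(v)
    return order

def sccs(g, order):
    """Step (2): DFS on G in decreasing post-order number; the
    vertices reached by each new search form one SCC."""
    visited, components = set(), []
    def dfs(v, component):
        visited.add(v)
        component.add(v)
        for w in sorted(g[v]):
            if w not in visited:
                dfs(w, component)
    for v in reversed(order):
        if v not in visited:
            component = set()
            dfs(v, component)
            components.append(component)
    return components

order = postorder(reversed_graph)   # ['B', 'A', 'C', 'D', 'E', 'F', 'G']
components = sccs(graph, order)     # [{G}, {F}, {E}, {D}, {C}, {A, B}]
```

Note that the components come out in exactly the order the document lists the *DFS* calls, which is the order needed for evaluation in the next section.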

It is now possible to determine the optimal rule evaluation order and to detect when a fix-point algorithm is needed to evaluate a set of rules. The procedure is direct: the order in which the SCCs are computed is a topological sort of the meta-graph created by merging the vertices in each strongly connected component into a single vertex. If an SCC contains multiple rules, then those rules must be evaluated to a fix-point. Any relation not updated by a rule is skipped; these relations are the leaves of the graph *G*.

For the example, the algorithm proceeds as follows:

- Skip *G* since it has no rule
- Skip *F* since it has no rule
- Evaluate *E*
- Skip *D* since it has no rule
- Skip *C* since it has no rule
- Evaluate *A* and *B* to a fix-point
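The decision procedure can be sketched as follows, following the rule stated above (an SCC with multiple rules runs to a fix-point). The SCC list, the `rule_heads` set of relations appearing on the left of some rule, and all names here are illustrative assumptions:

```python
# SCCs in the order computed for the example, and the rule heads.
components = [{"G"}, {"F"}, {"E"}, {"D"}, {"C"}, {"A", "B"}]
rule_heads = {"A", "B", "E"}

def evaluation_plan(components, rule_heads):
    """Label each SCC: skip (no rule), evaluate once, or fix-point."""
    plan = []
    for component in components:
        if not component & rule_heads:
            plan.append(("skip", component))        # a leaf relation
        elif len(component) > 1:
            plan.append(("fix-point", component))   # mutually recursive rules
        else:
            plan.append(("evaluate", component))    # a single non-recursive rule
    return plan
```

On the example this reproduces the six steps listed above: skip *G*, *F*, *D*, and *C*, evaluate *E* once, and run *A* and *B* to a fix-point.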

All the facts are now generated and known. The interpreter is able to process queries.

If a rule is not needed to answer any query in the Datalog program, then that rule should not be evaluated. So far, the discussion has assumed all the rules are needed to answer some query in the file. This section assumes that is not the case. To optimize for this situation, suppose that the only query for the working example is

B(X,'a')?

How can this information be used to prune out rules not involved in answering the query? The most direct solution is to build the dependency graph as discussed previously, but before computing the SCCs on that graph, prune the graph to include only vertices reachable from the queries. This pruning is accomplished using a depth-first search from every vertex named in a query.

For the example query above, there is a single query that names vertex *B*, so a depth-first search is started at *B* on the dependency graph. Any vertex reachable from *B* is preserved in the dependency graph. The vertices not reachable from *B* can safely be ignored or removed. For the example, vertices *E*, *F*, and *G* are removed. As a reminder, if the file contains multiple queries, do a depth-first search for each different relation queried until you have all the vertices reachable from the relations named in the various queries, and then compute the SCCs considering only those vertices, with all other vertices either ignored or removed.
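The pruning step can be sketched as a reachability search on the example graph; the dictionary-of-sets representation and the function name are assumptions for illustration:

```python
# The example dependency graph G.
graph = {
    "A": {"B", "C"}, "B": {"A", "D"}, "C": set(), "D": set(),
    "E": {"F", "G"}, "F": set(), "G": set(),
}

def reachable(graph, queried):
    """Iterative DFS from every relation named in a query."""
    visited = set()
    stack = list(queried)
    while stack:
        v = stack.pop()
        if v not in visited:
            visited.add(v)
            stack.extend(graph.get(v, ()))
    return visited

kept = reachable(graph, ["B"])                   # {A, B, C, D}
pruned = {v: graph[v] & kept for v in kept}      # E, F, G are gone
```

The SCC algorithm is then run on `pruned` instead of the full graph, so the rule for *E* is never evaluated.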

For each of the following rule sets, compute the following:

- The dependency graph
- The reversed dependency graph
- The post-order traversal ordering of the vertices using lexicographical ordering
- Each SCC in the dependency graph (as computed by the algorithm using the post-order traversal ordering)
- The order of rule evaluation with indicated fix-points.

Rule set 1:

wnl(Z) :- cdh(Z, 'W', X), cr(Z, 'Newton Lab.').
spr(X, Y) :- cp(X, Z), cp(Z, Y), cr(Y, 'Turing Aud.').
gaisw(X, Y) :- spr(Z, W), wnl(W), csg(W, X, 'A'), snap(X, Y, V, T).

Rule set 2:

nl(X) :- snap(Y, X, Z, W), csg(T, Y, 'A'), before(T, 'CS236').
before(Y, X) :- cp(X, Y).
before(Y, X) :- cp(X, Z), before(Y, Z).

Answer the following questions:

- For one of the rule sets above, prune the dependency graph according to a query.
- How do you avoid having to sort all the vertices in order to process them in decreasing post-order?
- What classes are needed to perform all of the optimizations identified in this worksheet?
- What are the key methods of these classes?

Implement the algorithm to evaluate rules in the optimal order.

TODO: define the new output.