Exploring the structure of the DMOZ topic hierarchy using summary trees

Here we’ll describe how to compute maximum entropy summary trees to visualize the topic hierarchy of a subset of the URLs in DMOZ (aka the Open Directory Project).

The DMOZ directory of URLs contains about 3.7 million URLs (as of April 2015), and each URL is classified into a topic in the DMOZ topic hierarchy, which contains about 600,000 unique topics. The topic hierarchy is a tree whose maximum depth is 15 levels. In this vignette we will describe:

  1. How to process the raw DMOZ data to extract just the counts of URLs that are classified into the “Top/Sports” subtree.

  2. How to prepare this data for computation of maximum entropy summary trees, and how to subsequently compute them.

  3. How to visualize the resulting series of maximum entropy summary trees using d3.js.

Processing the raw DMOZ data

First, we downloaded the DMOZ “content” data in April of 2015 from http://www.dmoz.org/rdf.html. This file was about 1.8 GB unzipped.
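
If you want to reproduce this step from R, something like the following should work. Note that the exact URL and file name of the gzipped content dump are assumptions here, so check the RDF page and adjust them if the file has moved.

# download and decompress the DMOZ content dump (about 1.8 GB once unzipped);
# the URL and file name below are assumptions: check http://www.dmoz.org/rdf.html
download.file("http://rdf.dmoz.org/rdf/content.rdf.u8.gz",
              destfile = "content.rdf.u8.gz", mode = "wb")
R.utils::gunzip("content.rdf.u8.gz", remove = FALSE)  # or run gunzip at the shell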

Next, we used the command line to extract just the assigned topic of each of the approximately 3.77 million URLs and write them, one per line, to a file called “dmoz-topics.txt”.

grep "<topic>" content.rdf.u8 | sed 's/<topic>//g' | sed 's/<\/topic>//g' | sed 's/^    //g' > dmoz-topics.txt

Next, we read this list of topics into R and compute their frequencies. Note that this chunk is not evaluated here, because the raw file is too large to include with the summarytrees R package.

raw <- readLines("dmoz-topics.txt")

# compute the frequency of each unique topic:
dmoz.table <- table(raw)
length(dmoz.table) # There are 595,005 unique topics with at least one URL

# retain only those nodes that fall under "Top/Sports/"
sports.subtree <- dmoz.table[grep("Top/Sports/", names(dmoz.table))]
length(sports.subtree)  # There are 14,284 unique topics in the Top/Sports subtree
sum(sports.subtree)  # There are 76,535 total URLs classified to this subtree

# separate into names + weights:
full.node <- names(sports.subtree)  # full node names with slashes
weight <- as.integer(sports.subtree)  # weights

Now we have to assign each node an ID number and compute the ID number of the node’s parent. The only complication here is that some nodes do not currently have their parent node listed in the data, because the parent node is an internal node to which zero URLs are directly assigned.

The following block of code starts with the nodes of the tree that have nonzero weights, looks for their parents, and if any of their parents are not in the tree already, they are added to the tree.

# initialize variables
parent <- NULL  # the vector of all parents
n.added <- NULL  # how many new internal nodes added each iteration
iter <- 0  # count iterations (just out of curiosity)
to.split <- full.node  # the set of full names to split
new.parent <- rep(NA, length(weight))  # parents found in the current round (start as NA so the loop runs)

t1 <- Sys.time()
while (sum(is.na(new.parent)) > 0) {
  iter <- iter + 1
  print(iter)

  # split by slash and extract the leaf label of each node, and the 'stem'
  split.node <- strsplit(to.split, "/")
  label <- sapply(split.node, function(x) tail(x, 1))
  stem <- sapply(split.node, function(x) paste(x[-length(x)], collapse = "/"))

  # compute the parent of each node:
  new.parent <- match(stem, full.node)

  # if new.parent is NA, then we have to add an internal node
  # get unique internal nodes that must be added
  new.internal.nodes <- unique(stem[is.na(new.parent)])
  n.added <- c(n.added, length(new.internal.nodes))
  # add the new internal nodes to the full list
  full.node <- c(full.node, new.internal.nodes)
  # internal nodes have a weight of zero by definition here
  weight <- c(weight, rep(0, length(new.internal.nodes)))
  # set up the next set of nodes whose parents must be found
  to.split <- new.internal.nodes
  # add to the vector of parents
  parent <- c(parent, match(stem, full.node))
}
t2 <- Sys.time()
t2 - t1

Now we compute the labels, assemble the nodes, parents, weights, and labels into a data frame, and clean up the one pseudo-node that was created as the parent of the root.

label <- sapply(strsplit(full.node, "/"), function(x) tail(x, 1))

# There should be one that is the 'parent' of the root, which is an empty node
# Give it a label of NA
label[sapply(label, length) == 0] <- NA
label <- unlist(label)

# Pull it all into a data.frame:
dmoz <- data.frame(node = 1:length(full.node),
                   parent = parent,
                   weight = weight,
                   label = label,
                   stringsAsFactors = FALSE)

# identify the 'parent' of the root, which doesn't really exist:
to.remove <- which(is.na(dmoz[, "label"]))

# Set the parent of the root to zero
dmoz[dmoz[, "parent"] == to.remove, "parent"] <- 0

# remove the 'parent' of the root
dmoz <- dmoz[-to.remove, ]
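
# quick sanity check: the cleaned tree should have exactly one root, and its
# total weight should still equal the 76,535 URLs counted for this subtree above
sum(dmoz[, "parent"] == 0)  # expect 1 (the root, "Top")
sum(dmoz[, "weight"])  # expect 76535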

# Save this data frame to be included in the package:
save(dmoz, file = "data/dmoz.RData")

Computing Maximum Entropy Summary Trees

Now we load the summarytrees package and compute maximum entropy summary trees for this data using the greedy algorithm, for \(k = 1, 2, ..., K = 100\). Note that these are not technically maximum entropy summary trees, since the greedy algorithm does not carry any performance guarantees. We use it here because it is much faster than the exact algorithm, and in several experiments with real data it has returned summary trees whose entropies are very close to those of the corresponding maximum entropy summary trees.

library(summarytrees)
#> Loading required package: RJSONIO
#> Loading required package: RColorBrewer
#> Loading required package: servr
data(dmoz)

# Look at the data a bit:
dim(dmoz)
#> [1] 15018     4
dmoz[1:10, ]
#>    node parent weight                   label
#> 1     1  14285      9        Adventure_Racing
#> 2     2      1     11                   Clubs
#> 3     3      1      6     Equipment_Suppliers
#> 4     4      1      1          News_and_Media
#> 5     5      1     53                   Races
#> 6     6      5      3 Single_Sport_Adventures
#> 7     7      1      6                 Schools
#> 8     8      1     23                   Teams
#> 9     9  14285      6                 Airsoft
#> 10   10      9     18        Chats_and_Forums
# Look at the breakdown between internal nodes and leaves:
table(dmoz[, "weight"] > 0)
#> 
#> FALSE  TRUE 
#>   734 14284
# compute a set of K summary trees:
t1 <- Sys.time()
K <- 100
g <- greedy(node = dmoz[, "node"],
            parent = dmoz[, "parent"],
            weight = dmoz[, "weight"],
            label = dmoz[, "label"],
            K = K)
#> [1] "Running order.nodes() function to prepare data"
#> [1] "  Computing node levels"
#> [1] "  Computing child indices for each parent"
#> [1] "Running C function to compute summary trees"
#> [1] "Computation finished; now formatting output"
t2 <- Sys.time()
t2 - t1
#> Time difference of 1.022814 secs

We can look at the \(k = 20\)-node summary tree, for example, by typing

g$summary.trees[[20]]
#>    node parent weight type           label
#> 1     1      0      0    1             Top
#> 2     2      1      0    1          Sports
#> 3     4      2   3371    2        Baseball
#> 4     7      2   6508    2      Equestrian
#> 5     8      2   5420    2        Football
#> 6     9      2   6005    2            Golf
#> 7    12      2   2954    2          Hockey
#> 8    13      2   3950    2    Martial_Arts
#> 9    14      2   3608    2     Motorsports
#> 10   16      2   1178    2        Softball
#> 11   18      2   1466    2 Track_and_Field
#> 12   19      2   5421    2      Basketball
#> 13   22      2   2525    2         Cycling
#> 14   25      2  11644    2          Soccer
#> 15   30      2   1494    2          Tennis
#> 16   31      2   1866    2   Winter_Sports
#> 17   42      2   1259    2      Volleyball
#> 18   47      2   3764    2    Water_Sports
#> 19   78      2   2197    2         Running
#> 20   NA      2  11905    3       63 others
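
For reference, the entropy that these trees (approximately) maximize is the Shannon entropy of the summary tree’s node weights, normalized to sum to one. Here is a minimal base-R sketch of that calculation for the 20-node tree above (using log base 2; this is just an illustration of the definition, not a function provided by the package):

# entropy of the normalized node weights of the 20-node summary tree
w <- g$summary.trees[[20]][, "weight"]
p <- w[w > 0] / sum(w)  # drop zero-weight nodes; they contribute zero to the sum
-sum(p * log2(p))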

Visualizing the resulting series of summary trees

# Prepare the summary trees for visualization:
json <- prepare.vis(tree.list = g$summary.trees,
                    labels = g$data[, "label"],
                    tree = g$tree,
                    legend.width = 150,
                    node.width = 225,
                    node.height = 14,
                    units = "# of URLs",
                    print.weights = TRUE,
                    legend.color = "lightsteelblue",
                    color.level = 3)

Note that we set color.level = 3 because every node’s full name starts with “Top/Sports”, so the third level (the individual sports) is likely the most useful level for color-matching, letting us visually track the ancestry of each node in the summary trees.

Last, we serve the visualization in a browser:

draw.vis(json.object = json,
         out.dir = tempfile(),
         open.browser = interactive())

To see this set of 100 maximum entropy summary trees, go to:

http://www.kennyshirley.com/summarytrees/dmozsports

From the visualization, we can see that:

  1. The number of URLs per sport is relatively uniform. Looking at the \(k = 20\)-node maximum entropy summary tree, we see that 18 individual sports are shown (as children of “Top/Sports”) and not one descendant of an individual sport is shown. This contrasts with many other tree-structured data sets, where one or two children of the root contain, say, half the weight of the entire tree, so that some of their descendants would be members of the 20-node maximum entropy summary tree. (A rough base-R check of this uniformity is sketched after this list.)

  2. From the 50-node summary tree, we see that the two largest children of “Top/Sports/Soccer” are “UEFA” and “CONCACAF”, which are the regional governing bodies for European and North/Central American soccer, respectively. Furthermore, within the UEFA subtree, the child node “England” contains about 5/6 of the weight (i.e. URLs).

  3. From the 100-node summary tree, we see that this tree is relatively “wide”. The maximum depth of the 100-node summary tree is 6 (in the “Top/Sports/Soccer” branch), and there are 36 individual children of “Top/Sports” visualized here (out of 100 nodes). Each child of “Top/Sports” is an individual sport. It might make more sense to insert another level of the hierarchy after “Top/Sports” that categorizes all individual sports into 5-10 groups of similar sports, using attributes of the sports such as outdoor vs. indoor, team vs. individual, etc., to control the width of the tree. This would only be necessary if having a wide, shallow tree is suboptimal for visualizing/understanding the data.
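
As a rough check of the first observation, we can aggregate the URL counts by individual sport directly from the dmoz data frame built earlier. The following is just a base-R sketch (not part of the summarytrees package): each node is walked up the tree until its parent is “Top/Sports”, and the URL counts are then summed within each sport’s subtree.

# identify the root ("Top") and its single child ("Top/Sports")
root <- dmoz[dmoz[, "parent"] == 0, "node"]
sports <- dmoz[dmoz[, "parent"] == root, "node"]

# walk every node up the tree until its parent is "Top/Sports", mapping each
# node to the individual sport whose subtree contains it
anc <- dmoz[, "node"]
repeat {
  p <- dmoz[match(anc, dmoz[, "node"]), "parent"]
  climb <- !is.na(p) & !(p %in% c(0, root, sports))
  if (!any(climb)) break
  anc[climb] <- p[climb]
}

# total number of URLs classified within each sport's subtree:
sport.totals <- tapply(dmoz[, "weight"],
                       dmoz[match(anc, dmoz[, "node"]), "label"],
                       sum)
head(sort(sport.totals, decreasing = TRUE))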