Kenny Shirley

Statistics Research Department

AT&T Labs, New York, NY

August 5, 2015

JSM, Seattle, WA

kshirley@research.att.com

github.com/kshirley

twitter.com/kennyshirley

The problem: visualizing large node-weighted trees

Our solution: Maximum entropy summary trees

Going from “code that works for me” to an R package

Miscellaneous thoughts

- This is joint work with Howard Karloff.
- The `summarytrees` R package is hosted at https://www.github.com/kshirley/summarytrees.

- DMOZ is one of the largest directories of websites in the world (although it may be a bit dated…)

Data is available for download at http://www.dmoz.org/rdf.html.

As of April 2015, there were about 3.7 million unique URLs listed in the directory, belonging to about 595,000 unique topics (excluding the Kids and Teens branch).

The topics are organized into a hierarchy.

Question: What is the distribution of URLs over this topic hierarchy?

| | Topic | Frequency |
|---|---|---|
| 1 | Top/Arts/Animation | 6 |
| 2 | Top/Arts/Animation/Anime/Characters | 6 |
| 3 | Top/Arts/Animation/Anime/Clubs_and_Organizations | 31 |
| 4 | Top/Arts/Animation/Anime/Collectibles | 10 |
| 5 | Top/Arts/Animation/Anime/Collectibles/Cels | 12 |
| ... | ... | ... |
| 595001 | Top/World/Uyghurche/Rayonluq/Yawropa | 3 |
| 595002 | Top/World/Uyghurche/Référans | 5 |
| 595003 | Top/World/Uyghurche/Salametlik | 1 |
| 595004 | Top/World/Uyghurche/Sport | 1 |
| 595005 | Top/World/Uyghurche/Xewer | 5 |

Representing this data as a node-weighted tree, we have \(n = 635,855\) total nodes in the tree (40,000 internal nodes with weight = 0 were added to the 595,000 nodes with weight > 0).

The total weight (# of URLs) of the tree is \(W = 3,776,432\).

The maximum depth of the tree is 15.

The range of node weights is \([1, 1276]\).
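The statistics above (node count, total weight, depth, weight range) can be computed from a simple parent-pointer representation of a node-weighted tree. A minimal sketch, using toy data rather than the real DMOZ dump (the node ids, paths, and weights below are illustrative assumptions):

```python
# Toy node-weighted tree: node id -> (parent id, weight).
# Internal "topic" nodes added only for structure get weight 0,
# mirroring the 40,000 zero-weight internal nodes described above.
nodes = {
    1: (None, 0),   # Top (root, weight 0)
    2: (1, 6),      # Top/Arts/Animation
    3: (2, 6),      # .../Anime/Characters (paths shortened for the toy example)
    4: (2, 31),     # .../Anime/Clubs_and_Organizations
    5: (2, 10),     # .../Anime/Collectibles
    6: (5, 12),     # .../Anime/Collectibles/Cels
}

def depth(node_id):
    """Depth of a node: number of edges on the path back to the root."""
    d = 0
    while nodes[node_id][0] is not None:
        node_id = nodes[node_id][0]
        d += 1
    return d

n = len(nodes)                               # total node count
W = sum(w for _, w in nodes.values())        # total weight (# of URLs)
max_depth = max(depth(i) for i in nodes)
positive = [w for _, w in nodes.values() if w > 0]
print(n, W, max_depth, (min(positive), max(positive)))  # → 6 65 3 (6, 31)
```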

- One solution is aggregating the nodes.

- Problem: the distribution of aggregated weights across the children of the root is heavily skewed.

- We’d naturally like to drill down into the largest nodes, and perhaps group the small ones into a cluster called “other”.

- Drilling down into the largest node, we see the same thing: a heavily skewed distribution with a long tail.

- The central problem (*across many real-world trees*): to visualize all the children of a node using a layered layout, we waste space on lots of inconsequential nodes.

- (Quick note on treemaps – yes, they are compact in space, but they don’t always do a good job of showing the hierarchy of a tree).

- The solution: aggregate nodes flexibly at different levels of the tree, and allow aggregation of a subset of children under each parent.

- This leads to our definition of a *summary tree*.
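The per-parent aggregation idea can be sketched as follows: keep the heaviest children of a parent as-is and group the rest into a single "other" node. This is a hedged illustration only (the function name, the toy weights, and the simple "keep the top k" rule are my assumptions), not the paper's algorithm, which chooses clusters to maximize the entropy of the resulting summary tree:

```python
def aggregate_children(weights, k):
    """weights: dict child_name -> weight; returns a dict with <= k+1 entries,
    grouping all but the k heaviest children into one 'other' cluster."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    kept = dict(ranked[:k])
    rest = ranked[k:]
    if rest:  # at most one "other" cluster per parent
        kept["other"] = sum(w for _, w in rest)
    return kept

# Toy weights for the children of one parent node:
children = {"Music": 500, "Movies": 300, "Animation": 65, "Radio": 20, "Clubs": 5}
print(aggregate_children(children, 2))
# → {'Music': 500, 'Movies': 300, 'other': 90}
```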


- Suppose this is the input tree, containing 9 nodes:

- A node in a summary tree can represent an entire subtree of the original tree.

- A node of a summary tree can represent at most one cluster of children under each parent (and all the descendants of those children).


- Not a summary tree!

- Among all summary trees with the same number of nodes, which is the best?
- Define the entropy of a node-weighted tree as the entropy of the discrete probability distribution defined by the normalized node weights.
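The entropy definition above can be sketched directly: normalize the node weights into a probability distribution \(p_i = w_i / W\) and compute \(H = -\sum_i p_i \log_2 p_i\). A minimal sketch (the function name is my own; zero-weight nodes contribute nothing, by the usual convention \(0 \log 0 = 0\)):

```python
from math import log2

def tree_entropy(weights):
    """Entropy of a node-weighted tree: the entropy of the discrete
    distribution p_i = w_i / W given by the normalized node weights."""
    W = sum(weights)
    return -sum((w / W) * log2(w / W) for w in weights if w > 0)

# A 4-node summary tree with equal weights attains the maximum possible
# entropy for 4 nodes, log2(4) = 2 bits:
print(tree_entropy([10, 10, 10, 10]))  # → 2.0
```

Maximizing this quantity over all summary trees with a given number of nodes favors summaries whose node weights are as balanced as possible, which is why skewed "one giant node plus a long tail" aggregations score poorly.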