Kenny Shirley
Statistics Research Department
AT&T Labs, New York, NY
August 5, 2015
JSM, Seattle, WA
kshirley@research.att.com
github.com/kshirley
twitter.com/kennyshirley
The problem: visualizing large node-weighted trees
Our solution: Maximum entropy summary trees
Going from “code that works for me” to an R package
Miscellaneous thoughts
summarytrees
The R package is hosted at https://www.github.com/kshirley/summarytrees.
Data from the DMOZ web directory is available for download at http://www.dmoz.org/rdf.html.
As of April 2015, there were about 3.7 million unique URLs listed in the directory, belonging to about 595,000 unique topics (excluding the Kids and Teens branch).
The topics are organized into a hierarchy.
Question: What is the distribution of URLs over this topic hierarchy?
        Topic                                             Frequency
1       Top/Arts/Animation                                        6
2       Top/Arts/Animation/Anime/Characters                       6
3       Top/Arts/Animation/Anime/Clubs_and_Organizations         31
4       Top/Arts/Animation/Anime/Collectibles                    10
5       Top/Arts/Animation/Anime/Collectibles/Cels               12
...     ...                                                     ...
595001  Top/World/Uyghurche/Rayonluq/Yawropa                      3
595002  Top/World/Uyghurche/Référans                              5
595003  Top/World/Uyghurche/Salametlik                            1
595004  Top/World/Uyghurche/Sport                                 1
595005  Top/World/Uyghurche/Xewer                                 5
Representing this data as a node-weighted tree, we have \(n = 635{,}855\) total nodes in the tree (about 40,000 internal nodes with weight = 0 were added to the roughly 595,000 nodes with weight > 0, so that every topic has a parent in the tree).
The total weight (number of URLs) of the tree is \(W = 3{,}776{,}432\).
The maximum depth of the tree is 15.
The range of nonzero node weights is \([1, 1276]\).
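The step from the topic/frequency table to a node-weighted tree can be sketched in a few lines. This is a minimal Python illustration, not the summarytrees package's API: the function names `add_internal_nodes` and `summarize` are hypothetical, the input is just the first five rows of the table above, and depth is assumed to count path components (so the root "Top" has depth 1), which may differ from the convention behind the figure of 15.

```python
def add_internal_nodes(topic_counts):
    """Return {path: weight}, adding every missing ancestor with weight 0.

    Assumes topics are '/'-separated paths whose proper prefixes are
    their ancestors, as in the DMOZ topic names.
    """
    weights = dict(topic_counts)
    for path in topic_counts:
        parts = path.split("/")
        # Every proper prefix of a topic path is an ancestor node.
        for i in range(1, len(parts)):
            weights.setdefault("/".join(parts[:i]), 0)
    return weights


def summarize(topic_counts):
    """Compute the summary statistics quoted on the slide for a small input."""
    weights = add_internal_nodes(topic_counts)
    positive = [w for w in weights.values() if w > 0]
    return {
        "n": len(weights),                              # total nodes
        "n_added": len(weights) - len(topic_counts),    # zero-weight nodes added
        "W": sum(weights.values()),                     # total weight (URLs)
        # Assumption: depth = number of path components, root has depth 1.
        "max_depth": max(p.count("/") + 1 for p in weights),
        "weight_range": (min(positive), max(positive)), # nonzero weights only
    }


# First five rows of the topic table, used purely as toy input.
sample = {
    "Top/Arts/Animation": 6,
    "Top/Arts/Animation/Anime/Characters": 6,
    "Top/Arts/Animation/Anime/Clubs_and_Organizations": 31,
    "Top/Arts/Animation/Anime/Collectibles": 10,
    "Top/Arts/Animation/Anime/Collectibles/Cels": 12,
}

stats = summarize(sample)
print(stats)
```

On this toy input, three zero-weight nodes are added (Top, Top/Arts, and Top/Arts/Animation/Anime), which mirrors on a small scale how the full tree grows beyond the number of topics with URLs.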