Kenny Shirley
Statistics Research Department
AT&T Labs, New York, NY
August 5, 2015
JSM, Seattle, WA
kshirley@research.att.com
github.com/kshirley
twitter.com/kennyshirley
The problem: visualizing large node-weighted trees
Our solution: Maximum entropy summary trees
Going from “code that works for me” to an R package
Miscellaneous thoughts
summarytrees
R package is hosted at https://www.github.com/kshirley/summarytrees.Data is available for download at http://www.dmoz.org/rdf.html.
As of April, 2015, there were about 3.7 million unique URLs listed in the directory, belonging to about 595,000 unique topics (excluding the Kids and Teens branch).
The topics are organized into a hierarchy.
Question: What is the distribution of URLs over this topic hierarchy?
Topic Frequency 1 Top/Arts/Animation 6 2 Top/Arts/Animation/Anime/Characters 6 3 Top/Arts/Animation/Anime/Clubs_and_Organizations 31 4 Top/Arts/Animation/Anime/Collectibles 10 5 Top/Arts/Animation/Anime/Collectibles/Cels 12 ... ... ... 595001 Top/World/Uyghurche/Rayonluq/Yawropa 3 595002 Top/World/Uyghurche/Référans 5 595003 Top/World/Uyghurche/Salametlik 1 595004 Top/World/Uyghurche/Sport 1 595005 Top/World/Uyghurche/Xewer 5
Representing this data as a node-weighted tree, we have \(n = 635,855\) total nodes in the tree (40,000 internal nodes with weight = 0 were added to the 595,000 nodes with weight > 0).
The total weight (# of URLs) of the tree is \(W = 3,776,432\).
The maximum depth of the tree is 15.
The range of node weights is \([1, 1276]\).
The problem: visualizing large node-weighted trees
Our solution: Maximum entropy summary trees
Going from “code that works for me” to an R package
Miscellaneous thoughts
A series of \(K\) summary trees. We use a layered layout for plots but this isn’t required (computation is independent of vis layout).
The entropy profile of the tree. Simply the plot of entropy as a function of the number of nodes in the maximum entropy summary tree:
A key benefit: There are no tuning parameters here, making it a good default method. The vector of maximum entropies for \(k = 1, ..., K\) is essentially a summary statistic of the input tree.
Wrote a paper (Karloff and Shirley, EuroVis 2013)
Wrote code to implement the algorithm on enough examples to support the paper
Generated a few static plots (for the paper and associated website)
Wrote a paper (Karloff and Shirley, EuroVis 2013)
Wrote code to implement the algorithm on enough examples to support the paper
Generated a few static plots (for the paper and associated website)
Wrote a paper (Karloff and Shirley, EuroVis 2013)
Wrote code to implement the algorithm on enough examples to support the paper
Generated a few static plots (for the paper and associated website)
The problem: visualizing large node-weighted trees
Our solution: Maximum Entropy Summary Trees
Going from “code that works for me” to an R package
Miscellaneous thoughts
summarytrees
summarytrees
package:optimal(..., K = 100, epsilon = 0)
for the exact algorithmoptimal(..., K = 100, epsilon > 0)
for the approximation algorithmgreedy(..., K = 100)
for the greedy algorithmCall prepare.vis()
to set plotting options, such as node colors, the sizes of various plotting elements, etc.
Call draw.vis()
to open a browser and locally serve the visualization from a temporary directory on your machine using the servr
package.
The package has vignettes, and some of this will change over time, most likely.
DMOZ
Math Genealogy
R Source code
Collapsible trees are a very nice d3.js example from Mike Bostock: http://bl.ocks.org/mbostock/4339083.
The function d3.layout.tree() implements the Reingold-Tilform algorithm for a nice-looking layered layout of nodes and links of a tree.
The collapsible tree allows for expanding/collapsing children of nodes.
Maximum entropy summary trees are less flexible, but come with nice properties of their own.
Think of them as a very strictly-guided tour of the structure of a node-weighted tree (rather than allowing one to explore with complete freedom).
The problem: visualizing large node-weighted trees
Our solution: Maximum Entropy Summary Trees
Going from “code that works for me” to an R package
Miscellaneous thoughts
The algorithms work and d3.js is slick and fun!
Take an R list, use toJSON(list), and scoop it up in javascript with d3.json().
jsonlite
, dendrogram()
(h/t Tal Galili, author of dendextend
), ape
(focus on phylogenetic trees), others?htmlwidgets
, rCharts
, networkD3
rotl
(R Open Tree of Life), others?capture.output()
tmp <- capture.output(.C("Roptimal", R_K = as.integer(K), R_n = as.integer(n), R_numparents = as.integer(numparents), R_epsilon = as.double(epsilon), R_weight = as.double(data[, "weight"]), R_childindex = as.integer(childindex), R_childstart = as.integer(childstart), R_childend = as.integer(childend), PACKAGE = "summarytrees"))
Thanks!
Acknowledgements: Thanks to Carson Sievert and Carlos Scheidegger for tips and discussion on d3.js
R summarytrees
package at Github: kshirley/summarytrees