Visualizing Maximum Entropy Summary Trees Using R and d3.js

Kenny Shirley
Statistics Research Department
AT&T Labs, New York, NY

August 5, 2015
JSM, Seattle, WA


kshirley@research.att.com
github.com/kshirley
twitter.com/kennyshirley

Outline

  1. The problem: visualizing large node-weighted trees

  2. Our solution: Maximum entropy summary trees

  3. Going from “code that works for me” to an R package

  4. Miscellaneous thoughts



Motivating Example: DMOZ (the Open Directory Project)


Some Summary Statistics

                                                  Topic Frequency
1                                    Top/Arts/Animation         6
2                   Top/Arts/Animation/Anime/Characters         6
3      Top/Arts/Animation/Anime/Clubs_and_Organizations        31
4                 Top/Arts/Animation/Anime/Collectibles        10
5            Top/Arts/Animation/Anime/Collectibles/Cels        12
...                                                 ...       ...
595001             Top/World/Uyghurche/Rayonluq/Yawropa         3
595002                     Top/World/Uyghurche/Référans         5
595003                   Top/World/Uyghurche/Salametlik         1
595004                        Top/World/Uyghurche/Sport         1
595005                        Top/World/Uyghurche/Xewer         5

Distribution of URLs aggregated to Level 2

Drilling down into Top/World…


Preview of Solution

Outline

  1. The problem: visualizing large node-weighted trees

  2. Our solution: Maximum entropy summary trees

  3. Going from “code that works for me” to an R package

  4. Miscellaneous thoughts

The definition of a summary tree

Maximum entropy summary trees
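
As a reminder of what is being maximized: the entropy of a summary tree is usually taken to be the Shannon entropy of its node weights after normalizing them to sum to one. The sketch below assumes that definition (the slide does not spell it out), and `weight_entropy` is a hypothetical helper written for illustration, not a function from the summarytrees package.

```r
# Hypothetical helper: Shannon entropy (in bits) of a vector of
# non-negative node weights, e.g. the K node weights of one summary tree.
weight_entropy <- function(w) {
  stopifnot(all(w >= 0), sum(w) > 0)
  p <- w / sum(w)   # normalize weights to proportions
  p <- p[p > 0]     # treat 0 * log(0) as 0
  -sum(p * log2(p))
}

# Weights spread evenly across four nodes versus dominated by one node.
even   <- weight_entropy(c(25, 25, 25, 25))
skewed <- weight_entropy(c(97, 1, 1, 1))
print(c(even = even, skewed = skewed))
```

Under this definition, a summary tree whose nodes carry similar weights scores higher than one where a single node holds nearly all of the weight.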

Algorithms:

Output:

The end of the project… or is it?


Outline

  1. The problem: visualizing large node-weighted trees

  2. Our solution: Maximum entropy summary trees

  3. Going from “code that works for me” to an R package

  4. Miscellaneous thoughts

How to make this an R package


Using the summarytrees package:

  1. Read in your data as a “list of edges”:
    • 4 variables: node ID, parent ID, (non-negative) weight, and label.
    • To-do: accept nested JSON-formatted trees, other formats?
  2. Do the computation (with K = 100 for example):
    • optimal(..., K = 100, epsilon = 0) for the exact algorithm
    • optimal(..., K = 100, epsilon > 0) for the approximation algorithm
    • greedy(..., K = 100) for the greedy algorithm
    • All of them return the list of K summary trees as output
  3. Call prepare.vis() to set plotting options, such as node colors, the sizes of various plotting elements, etc.

  4. Call draw.vis() to open a browser and locally serve the visualization from a temporary directory on your machine using the servr package.

The package has vignettes. Some of this interface will most likely change over time.
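
Step 1 can be sketched in base R. The example turns path-style topics, like the DMOZ rows shown earlier, into the four variables of an edge list. `paths_to_edges` is a hypothetical helper, not part of summarytrees. Two choices here are assumptions to check against the package documentation: the root's parent is coded as 0, and ancestors missing from the input get weight 0.

```r
# Toy input: the first five rows of the DMOZ table shown earlier.
dmoz <- data.frame(
  Topic = c("Top/Arts/Animation",
            "Top/Arts/Animation/Anime/Characters",
            "Top/Arts/Animation/Anime/Clubs_and_Organizations",
            "Top/Arts/Animation/Anime/Collectibles",
            "Top/Arts/Animation/Anime/Collectibles/Cels"),
  Frequency = c(6, 6, 31, 10, 12),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not from summarytrees): one row per tree node,
# with columns node ID, parent ID, weight, and label.
paths_to_edges <- function(topic, freq) {
  # Every prefix of every path is a node in the tree.
  prefixes <- function(p) {
    parts <- strsplit(p, "/", fixed = TRUE)[[1]]
    vapply(seq_along(parts),
           function(i) paste(parts[1:i], collapse = "/"),
           character(1))
  }
  all_paths <- unique(unlist(lapply(topic, prefixes)))

  # Sort by depth so that each parent gets a smaller ID than its children.
  depth <- lengths(strsplit(all_paths, "/", fixed = TRUE))
  all_paths <- all_paths[order(depth, all_paths)]

  parent_path <- sub("/[^/]*$", "", all_paths)  # drop the last component
  is_root <- !grepl("/", all_paths, fixed = TRUE)

  weight <- freq[match(all_paths, topic)]
  weight[is.na(weight)] <- 0                    # ancestors absent from input

  data.frame(
    node   = seq_along(all_paths),
    parent = ifelse(is_root, 0L, match(parent_path, all_paths)),  # assumption: root parent = 0
    weight = weight,
    label  = sub("^.*/", "", all_paths),        # last path component
    stringsAsFactors = FALSE
  )
}

edges <- paths_to_edges(dmoz$Topic, dmoz$Frequency)
print(edges)
```

The four columns of `edges` correspond to the node ID, parent ID, weight, and label that step 1 asks for. They would then be handed to greedy() or optimal() in step 2; see the package vignettes for the exact argument names.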

Examples:

Comparison to Collapsible Tree

Outline

  1. The problem: visualizing large node-weighted trees

  2. Our solution: Maximum entropy summary trees

  3. Going from “code that works for me” to an R package

  4. Miscellaneous thoughts

The Good

The Bad

The Ugly


Thanks!

Acknowledgements: Thanks to Carson Sievert and Carlos Scheidegger for tips and discussion on d3.js.


The R summarytrees package is on GitHub: kshirley/summarytrees