how tag clouds work

I like tag clouds. They look cool, and they work.

3d visualization cartography climate change geographic information science geographic visualization geovisualization gis giscience interactive map modeling self-organizing maps uncertainty visual analytics visualization

Tag clouds use the visual variable of size to convey information. As a cartographer, then, my first instinct was to compare tag clouds to proportional and graduated symbol maps.

When viewed in this light, tag cloud scaling begins to seem quite haphazard — font point sizes, not areas, are modified to communicate the data. Thus, two tags of varying character length (GIS vs. Geographic Information Science, for example) but equal incidence will have very different visual presence (though their font point size will be equalized, the area will not).

3d visualization cartography climate change geographic information science geographic visualization geovisualization gis giscience interactive map modeling self-organizing maps uncertainty visual analytics visualization

Sixteen hours into my 30-hour train trip from Boston South Station to Chicago Union Station, I got the harebrained idea that tags could be more accurately scaled: each one can be thought of as a rectangle, and a point size can be calculated for each tag such that its rectangle’s area is proportional to its incidence.[1] I call this the Amtrak method. I first wrote an AS3 class, then a somewhat hacky PHP script to generate the pseudo area-scaled tag clouds. I say it’s hacky because the areas aren’t really exact; I don’t have a method for exactly figuring the bounding area of rendered HTML text. So instead, I estimate it based on character length. Specifically, I estimate the area of a rendered tag to be:

tagarea = taglength * fontsize^2 / 2

A simplification to be sure, but based on some tests in Flash, not too inaccurate (for my Helv anyway). This gives us the formula for the font size of each tag.

fontsize = sqrt( 2 * tagarea / taglength )

In the above, tagarea is determined by dividing the tag’s occurrence value by the maximum number of occurrences among all the tags and multiplying this percentage by a predetermined maxarea value (I use 12,000, natch). So for each tag:

tagarea = occ / maxocc * maxarea

This generates the following, based on the same data as the above two clouds.

3d visualization cartography climate change geographic information science geographic visualization geovisualization gis giscience interactive map modeling self-organizing maps uncertainty visual analytics visualization
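The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the original AS3 class or PHP script: the function name, the sample counts, and the choice to count spaces as characters in taglength are all assumptions, and the units of maxarea are left as whatever the renderer uses.

```python
import math


def amtrak_font_sizes(occurrences, max_area=12000):
    """Return {tag: font size} using the estimate
    tagarea = taglength * fontsize^2 / 2, solved for fontsize."""
    max_occ = max(occurrences.values())
    sizes = {}
    for tag, occ in occurrences.items():
        # Scale the tag's target area by its share of the largest count.
        tag_area = occ / max_occ * max_area
        # Assumption: taglength is the raw character count, spaces included.
        tag_length = len(tag)
        sizes[tag] = math.sqrt(2 * tag_area / tag_length)
    return sizes


# Hypothetical counts, for illustration only.
counts = {
    "gis": 10,
    "geographic information science": 10,
    "cartography": 5,
}
sizes = amtrak_font_sizes(counts)
for tag, size in sizes.items():
    print(f"{tag}: {size:.1f}")
```

With equal counts, the short tag "gis" gets a much larger font size than "geographic information science", so that both end up with the same estimated area.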

Leave it to a grad student to create a problem where none exists. Looking at my new Amtrak tag clouds, I don’t think they communicate the data more clearly. As I said at the beginning, tag clouds work. Existing tag clouds work. And they work for a reason. Font point size is a measure of the height of the text, the distance “from the top of the capital letter to the bottom of the lowest descender, plus a small buffer space” (Ellen Lupton’s Thinking With Type).

Thus, tag clouds employ the specific size variable, height (or length). This is most similar to proportional symbol maps using a bar symbology.

But it isn’t that simple. The symbols in a tag cloud also vary in width. The important data is encoded in height; the width is simply a by-product of font size and tag character length, and readers must temporarily ignore it to determine the tag’s number of occurrences. This is most similar to a bivariate technique in cartography where the width and height of a bar are varied independently to show two different variables.

In such maps, as in tag clouds, the user is expected to attend to and separate the visual dimensions of width and height. Is attending to the height of a symbol, while ignoring its width, a realistic task? Or, in the speak of the discipline, is length a separable or an integral dimension? If it is separable, it can be attended to selectively. If it is integral, then it cannot be ignored.

There is no definitive answer to the question of size (or length) selectivity; existing evidence is mixed. Bertin considered size to be dissociative, or integral (in Semiology of Graphics). Alan MacEachren cites numerous studies of selectivity of visual variables (in How Maps Work) and my own thesis research has required a survey of the bivariate symbol literature. It appears that size-size has a configural relationship, which is somewhere between integral and separable (selectivity appears to be a continuum). Though not a sparkling recommendation, this at least suggests that people can, with a little work, perceptually separate the width and height of a symbol. In doing so, they can estimate the number of occurrences of a tag, while ignoring the tag’s overall visual presence (esp. its width).


[1] OK, I suppose another way of thinking about the area of the text is to look at the amount of ink on the page. But this would be much harder to determine, and will correlate with character length.

cartogram design

I study cartogram design. Cartograms are thematic maps in which the enumeration units (states, countries) are resized based on a particular attribute (population, carbon emissions). There are dozens of types/designs of cartogram and many methods/algorithms for cartogram production.

These have gotten a lot of attention lately (uses the Gastner-Newman diffusion-based algorithm).

Some cartographers manually tweak their automatically generated cartograms to better preserve shape or topology (from the Dutch company Mapping Worlds).

Some preserve shape completely (from the Online Atlas of the Millennium Development Goals also by Mapping Worlds).

Others abstract enumeration units to geometric primitives (generated by my Python script, based on Daniel Dorling’s algorithm).

And of course many other designs can be found between these extremes (redrawn from the NY Times 2006 election results app).

The standard approach to cartogram design is to classify them as either contiguous or noncontiguous (with some adding a pseudo-contiguous category). As the small gallery above illustrates, this is inadequate. It seems to me that cartogram designs vary along three dimensions, and that variation along each dimension is continuous.

  1. Shape preservation — how much the original shapes are preserved on the transformed cartogram (can be quantified with local angles and edge length ratios)
  2. Topology preservation — how well adjacencies are preserved
  3. Density equalization — how accurately unit size represents the chosen attribute

The last of these perhaps requires more explanation. Isn’t size on a cartogram supposed to perfectly reflect the chosen attribute? Sure, but some recent cartogram algorithms (Gastner-Newman slightly, Kocmoud-House somewhat more) have chosen to allow for some inaccuracy in order to better preserve shape or topology. Since readers can’t accurately estimate area anyway, this seems like a fair tradeoff.
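One way to put a number on density equalization is to compare each unit's share of the total cartogram area with its share of the attribute total. The sketch below does that. The measure, the function name, and the toy numbers are my own assumptions for illustration; this is not the error metric used by the Gastner-Newman or Kocmoud-House algorithms.

```python
def area_errors(areas, values):
    """Relative area error per unit: 0 means the unit's share of the
    cartogram area matches its share of the attribute exactly."""
    total_area = sum(areas.values())
    total_value = sum(values.values())
    errors = {}
    for unit, value in values.items():
        target_share = value / total_value
        actual_share = areas[unit] / total_area
        errors[unit] = (actual_share - target_share) / target_share
    return errors


# Hypothetical attribute values for three enumeration units.
values = {"A": 50, "B": 30, "C": 20}

# A cartogram that equalizes density perfectly.
exact = area_errors({"A": 5.0, "B": 3.0, "C": 2.0}, values)

# A cartogram that gives up some accuracy, e.g. to preserve shape.
relaxed = area_errors({"A": 4.5, "B": 3.3, "C": 2.2}, values)

for unit in values:
    print(f"{unit}: exact {exact[unit]:+.0%}, relaxed {relaxed[unit]:+.0%}")
```

A negative error means the unit is drawn smaller than its attribute share warrants, a positive error means it is drawn larger.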

To show these continua, and better portray the tradeoffs involved in preserving individual properties, I drafted the Cartogram Cube. It has helped me think through some of these issues while writing my thesis.

Cartogram Cube

Does any of the above matter? Well, only to the extent that it helps us make better maps. But I believe cartogram effectiveness has a lot to do with these design characteristics, and depends largely on the property tradeoffs made in cartogram design (for manual and algorithmically-produced cartograms). Indeed, this is precisely what my thesis results — to be defended on May 13 — indicate.

More on my actual results later.

aag post 0

Tomorrow, two of my fellow UW cartographers and I board a train in Chicago for our 24-hour ride to the Annual Meeting of the Association of American Geographers in Boston. I’ll present a paper, Cartograms for Political Cartography: A Question of Design, on Saturday, but I’m more excited about the visualization and cartography talks throughout the week.

To see what I can look forward to, I worked up a quick tag cloud of the keywords used by presentations marked ‘visualization’ in the AAG Preliminary Program.

3d visualization cartography climate change geographic visualization geovisualization gis giscience interactive map modeling self-organizing maps uncertainty visual analytics visualization

And for my fellow cartophiles, a cloud of the keywords used by presentations marked ‘cartography’.

animation art atlas cartograms cartography cinema computer mapping critical cartography education geographic education geology gis hazards historical cartography historical geography history of cartography map map use mapping maps navigation political cartography political geography propaganda maps remote sensing united states visualization

When I return, I’ll probably follow up this post with one or two more. My angle? Visualization in contemporary academic geography.

BTW, the map above is from Alex Tait & Erik Steiner’s actually quite cool Amtrak Route Atlas. Oh, and our route stops in Albany because of track repairs; from there we hop on a bus to Boston. Bring it on, Amtrak!