Let’s give every fact its own home page on the web

My sister is writing a report for which she needs facts about the growth of New Jersey’s foreign-born population. She found some numbers at census.gov, and we explored them on a Facebook thread. For my friend Mike Caulfield, who’s writing a textbook called Making Fair Comparisons, the discussion reinforced a lot of what he’s been teaching lately. For me it was a reminder that the dream of straightforward access to canonical facts remains elusive.

I wanted to check my sister’s sources. She gave me this link: http://quickfacts.census.gov/qfd/states/34000.html. That page says New Jersey’s 2010 population was 8,791,894, of which 20.3% were foreign-born — so we can compute the number of those folks to be 1,784,754.

I never did find the 2000 counterpart to that report. While searching the FactFinder site, though, I found this page where, with further searching within the page — for Geography: New Jersey and “foreign born” — I landed on a report called “SELECTED CHARACTERISTICS OF THE NATIVE AND FOREIGN-BORN POPULATIONS 2010 ACS 1-year estimates” with an ID of S0501. According to it, there were 1,844,581 foreign-born New Jerseyans, or 21% (not 20.3%) of the same 8,791,894 total.
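To see how far apart the two 2010 sources are, here is a minimal Python sketch that uses only the figures quoted above. The variable names are illustrative, and rounding the implied count to a whole person is an assumption.

```python
# Figures quoted in the post for New Jersey, 2010.
total_2010 = 8_791_894
quickfacts_share = 0.203          # QuickFacts: 20.3% foreign-born
s0501_foreign_born = 1_844_581    # ACS S0501 1-year estimate

# Count implied by the QuickFacts percentage (rounded to whole people).
quickfacts_foreign_born = round(total_2010 * quickfacts_share)

# Share implied by the S0501 count.
s0501_share = s0501_foreign_born / total_2010

# How many people separate the two sources.
gap = s0501_foreign_born - quickfacts_foreign_born

print(f"QuickFacts implies {quickfacts_foreign_born:,} foreign-born")
print(f"S0501 implies a share of {s0501_share:.1%}")
print(f"Gap between the two sources: {gap:,} people")
```

The two sources disagree by tens of thousands of people for the same state and year, which is why it matters exactly which report a link points to.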

I cited that link in our Facebook discussion, but was later horrified to find that I hadn’t actually cited the report at all, because the base URL never changes. If I navigate to a report on foreign-born New Jerseyans, and you navigate to the same report for Texans, or for the whole US, we both end up at the same URL. That is catastrophic if you’re trying to have a discussion informed by canonical citation of source data.

Meanwhile I still hadn’t found the 2000 counterpart to http://quickfacts.census.gov/qfd/states/34000.html. Back on the FactFinder site I searched in vain for “SELECTED CHARACTERISTICS OF THE NATIVE AND FOREIGN-BORN POPULATIONS 2000” and for combinations of terms like “foreign-born 2000.” So I searched the web for “foreign-born 2000 census”; both Google and Bing pointed me to http://www.census.gov/prod/2003pubs/c2kbr-34.pdf. From this PDF file I was able to extract New Jersey’s total (8,414,350) and foreign-born (1,476,327) populations in 2000. Now I could complete this table (using, arbitrarily, one of the values I found for 2010 foreign-born):

Year	Total population	Foreign-born	Foreign-born share
2000	8,414,350	1,476,327	17.5%
2010	8,791,894	1,784,754	20.3%

Now, finally, we could have the real discussion. Should growth be evaluated in terms of percentages, so (20.3-17.5)/17.5, which comes to 15.7% when computed from the unrounded shares, or in terms of absolute numbers, so (1.784-1.476)/1.476 = 20.9%? It depends, my friend Doug Smith said, on the point you’re trying to make:

When you do the calculation on the growth of the percentages it does not take into account that the total population also grew over the 10 years. So while the percentage of foreign-born people grew by 15.7%, the actual number of foreign-born people in the state grew by 20.9%. If you’re trying to make a case that depends on the total number, like services consumed or potential market size, then you should use the growth of total numbers. If you’re trying to make a case based on percentages, for example the likelihood of encountering a foreign-born individual, then growth based on percentages would be better.
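The two growth figures can be reproduced with a short Python sketch. It assumes the 2000 total of 8,414,350 read from the PDF and the QuickFacts-derived 2010 count; `relative_growth` is just an illustrative helper name.

```python
# New Jersey figures quoted in the post.
total_2000, foreign_2000 = 8_414_350, 1_476_327
total_2010, foreign_2010 = 8_791_894, 1_784_754


def relative_growth(old, new):
    """Growth from old to new, expressed as a fraction of old."""
    return (new - old) / old


# Foreign-born share of the total population in each year.
share_2000 = foreign_2000 / total_2000
share_2010 = foreign_2010 / total_2010

# Growth of the share versus growth of the head count.
share_growth = relative_growth(share_2000, share_2010)
count_growth = relative_growth(foreign_2000, foreign_2010)

print(f"Share: {share_2000:.1%} -> {share_2010:.1%}, growth {share_growth:.1%}")
print(f"Count: {foreign_2000:,} -> {foreign_2010:,}, growth {count_growth:.1%}")
```

Both figures divide by the 2000 value. They differ because the share has a denominator, the total population, that also grew over the decade.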

Doug added this intriguing observation:

This small amount of data actually presents a very interesting picture. The total population of NJ grew 4.5% over ten years. During that time, the natural born population grew only 1%, while the foreign-born population grew 21%. This suggests that more than 80% of the population increase over these ten years came as a result of immigration. So, while going from 17.5% foreign-born to 20.3% foreign-born doesn’t seem like much of a change to me, the implications seem huge.
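Doug’s breakdown can be checked the same way. This sketch assumes the native-born population is simply the total minus the foreign-born count, and it measures the foreign-born share of the net increase, which is the arithmetic behind the “more than 80%” figure.

```python
# New Jersey figures quoted in the post.
total_2000, foreign_2000 = 8_414_350, 1_476_327
total_2010, foreign_2010 = 8_791_894, 1_784_754

# Assumption: native-born = total - foreign-born.
native_2000 = total_2000 - foreign_2000
native_2010 = total_2010 - foreign_2010

# Decade-over-decade growth of each group.
total_growth = (total_2010 - total_2000) / total_2000
native_growth = (native_2010 - native_2000) / native_2000
foreign_growth = (foreign_2010 - foreign_2000) / foreign_2000

# Fraction of the net population increase accounted for by the foreign-born.
foreign_share_of_increase = (foreign_2010 - foreign_2000) / (total_2010 - total_2000)

print(f"Total growth:        {total_growth:.1%}")
print(f"Native-born growth:  {native_growth:.1%}")
print(f"Foreign-born growth: {foreign_growth:.1%}")
print(f"Foreign-born share of the increase: {foreign_share_of_increase:.1%}")
```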

That made me wonder about comparable figures for other states. But the prospect of digging out the numbers from a mishmash of HTML pages and PDF files killed that curiosity. What would help? Let’s give every fact its own home page on the web. OData (the Open Data Protocol) is one good way to do that. Imagine census.gov as a web of data. A top-level path might be:

http://odata.census.gov/states

A next-level path might be:

http://odata.census.gov/states/NewJersey

A path to the ACS survey might be:

http://odata.census.gov/states/NewJersey/S0501

By year:

http://odata.census.gov/states/NewJersey/S0501/2010

And finally, paths to individual facts might be:

http://odata.census.gov/states/NewJersey/S0501/2000/ForeignBorn

http://odata.census.gov/states/NewJersey/S0501/2010/ForeignBorn

Nothing’s hidden behind a JavaScript veil or stored in a cookie. The entire web of data is navigable in a standard browser, which displays human-readable Atom feeds if set for human viewing, or raw XML or JSON if used to discover URLs for machine processing. Every URL is a canonical home page for a data set or an individual datum. User-friendly search and navigational tools are built on top of this foundation. Nobody has to deal with raw URLs and feeds. But they’re always available.
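To make the idea concrete, here is a toy Python sketch of such a web of data. It is not OData and makes no network calls. The host `odata.census.gov` is the imagined one from above, and `DATA`, `resolve`, and `child_urls` are hypothetical names. The only value stored is the S0501 figure quoted earlier.

```python
from urllib.parse import urlparse

BASE = "http://odata.census.gov"  # imagined host from the post, not a real service

# Toy stand-in for the web of data, holding only a value quoted in the post.
DATA = {
    "states": {
        "NewJersey": {
            "S0501": {
                "2010": {"ForeignBorn": 1_844_581},
            },
        },
    },
}


def resolve(url):
    """Walk the URL path through DATA and return a data set (dict) or a datum."""
    node = DATA
    segments = [s for s in urlparse(url).path.split("/") if s]
    for segment in segments:
        node = node[segment]  # a KeyError here means "no such resource"
    return node


def child_urls(url):
    """List the URLs one level below url; a single datum has no children."""
    node = resolve(url)
    if not isinstance(node, dict):
        return []
    return [f"{url.rstrip('/')}/{name}" for name in node]


fact = f"{BASE}/states/NewJersey/S0501/2010/ForeignBorn"
print(child_urls(f"{BASE}/states"))  # URLs discoverable from the top level
print(resolve(fact))                 # the individual fact at its own address
```

Every prefix of the fact’s URL resolves to something, so a person or a program can start at the top and discover the rest by following links.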

I’m not ungrateful for what census.gov (and so many other sites) offer. Any kind of web access to data is infinitely better than no access. But there are better and worse ways to provide access. It’s 2012. We ought to be doing better by now.

6 thoughts on “Let’s give every fact its own home page on the web”

  1. Gary

    Is there any reason that a mechanism like this necessarily has to imply a particular hierarchy for the schema?

    I might prefer to treat:

    http://odata.census.gov/S0501/NJ/2010/ForeignBorn

    and

    http://odata.census.gov/S0501/2010/NJ/ForeignBorn

    equally, or even think up concepts like:

    http://odata.census.gov/S0501/NJ+PA/ForeignBorn

    to get me full data from NJ + PA for all available years, or

    http://odata.census.gov/S0501/2010/States/ForeignBorn

    to get 2010 data from all states, or

    http://odata.census.gov/S0501/2010/MSA/ForeignBorn

    to get MSA data, or whatever.

    So really it’s a slash-delimited set of facets, starting with the dataset and terminating with the measure (or measures?) of interest. I’m not familiar with how oData cooks this, or even whether it addresses that issue.

  2. Jon Udell Post author

    Any schema is possible, including one with aliases that juggle the facets in various ways.

    The NJ+PA bit would most comfortably be handled as a query, i.e.:

    http://odata.census.gov/S0501/ForeignBorn?$filter=state eq 'NJ' or state eq 'PA'

    Systems that automatically generate OData “heads” for SQL databases make best-guess choices about data models and cross-linking. You don’t have to use those generators, though, and if you do, you’re free to override their choices and apply the URI conventions (http://www.odata.org/documentation/uri-conventions) to suit your taste.
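The filter URL above contains spaces and quotes, so a client would normally percent-encode it before sending. Here is a minimal Python sketch using the standard library. The endpoint is the imagined one from this thread, the `state` property comes from the example query, and `state_filter` and `filtered_url` are hypothetical helper names.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

BASE = "http://odata.census.gov/S0501/ForeignBorn"  # imagined endpoint, not a real service


def state_filter(states):
    """Build a $filter expression matching any of the given state codes."""
    return " or ".join(f"state eq '{s}'" for s in states)


def filtered_url(states):
    """Return BASE with a percent-encoded $filter query for the given states."""
    # quote_via=quote encodes spaces as %20; '$' and "'" are left readable.
    query = urlencode({"$filter": state_filter(states)}, quote_via=quote, safe="$'")
    return f"{BASE}?{query}"


url = filtered_url(["NJ", "PA"])
print(url)

# Decoding the query string recovers the original filter expression.
decoded = parse_qs(urlsplit(url).query)["$filter"][0]
print(decoded)
```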

  3. Alex

    Percentages don’t give you exact numbers and can sometimes be confusing, so it’s better to go for actual numbers. That’s what I do for most of my decisions, and I believe it’s good to do it that way.

