Goose Commerce


Ex Readex: Redux by goosecommerce
April 25, 2011, 5:29 pm
Filed under: Archival Follies, History and Historians, Now in Actual Work

Or, the world in a grain of ads

[Image: “Granular :: Gimme some sugar” by Vanessa Pike-Russell, on Flickr. Creative Commons Attribution-Noncommercial-No Derivative Works 2.0 Generic License.]

You’ll recall that in my last I wondered “What am I getting wrong?” (a big question, for sure, with many and varied answers, as friends, acquaintances and passers-by would be happy to tell you). But in this case I was specifically concerned with what I was misunderstanding about the search results I was receiving from a Readex database, America’s Historical Newspapers.

Well, you’ll be pleased to know that Readex, in the person of their marketing director, David Loiterstein, was kind enough to get in touch by e-mail and tell me exactly that. And the answer? Granularity.

Basically, the AHN database does not consistently break down advertising sections at the same level of granularity; it has changed over time. As David explained:

Initially, particularly for the 18th century in which the first series of newspapers was so heavily concentrated, we identified individually every advertisement on every page; however, in later series multiple contiguous advertisements were identified in groups.

So: sometimes individual ads count as individual “articles,” sometimes a multiple-ad block counts as one, and sometimes entire columns of ads count as one unit; and the granularity of the ads goes down, generally speaking, over time. Which means that my results, which included all article types, including ads, were skewed by the ways ads are counted.

David provided a graph of his own, illustrating this effect, and suggesting a way to get clear of it (reproduced here with permission):
[Figure: “AdBlocker,” David’s graph counting advertisements separately from the other article types]
I’ll let him explain:

This approach seen above—in which advertisements are isolated and an aggregate number of the other article types is counted separately—provides a more representative measure of available “texts.” While the data does in fact indicate fewer “articles” available between 1820 and 1850 in what is otherwise a steady increase in articles available between 1690 and 1819 and between 1850 and 1922. The declining number of ads as a percentage of “articles” or “text” is a result not of fewer ads but the changing approach by which we identify them.

Thus, practically speaking, if you want to get some kind of a baseline for how representative a given search’s results are, you’re going to have to sacrifice including ads in those search results. Not ideal, of course, but much better than not knowing what your results mean. In addition to responding directly to this specific question, David also mentioned that Readex was working to update the Readex Help section, and fix the discrepancy between the two portals I had noticed.

So where does this leave us?

Well, with a much better understanding of how one of the most important databases in Early American historical research functions. I am grateful to David and his colleagues for their quick response and kind explanation.

I would note, though, that even using the new numbers, the curve still shows an unexpected dip in the 1820s and 1830s — the heart of the Jacksonian era, where most historians would tell you that print, and especially newspapers, exploded. As I said before, this is not something I think unique to Readex, but rather an artifact of the way many digitization projects have done triage (or, alternately, it might be proof that print output indeed declined, in which case steam-powered presses were not actually all that important in the development of American democracy! But let’s hope not, as then we’d have to revise a lot of historiography…).

In any case, all good factors to keep in mind when trying to use large collections to buttress claims about relative representativeness, ubiquity, or uniqueness. And now on to new and exciting problems…



Ex Readex: Not Much? by goosecommerce
April 9, 2011, 10:45 pm
Filed under: History and Historians, Now in Actual Work

Or, Caveat NewsBank

[Image: “Seeing My World Through a Keyhole” by katerha, on Flickr. Creative Commons Attribution 2.0 Generic License.]

UPDATE: See the subsequent post for the thrilling reveal!

In harmony with one of the recent memes floating around the world of digital history (the happy attention to what historians don’t know about database design, to how particular databases are missing parts of texts within particular series, and to proposals for how we might directly address this issue as a collective), I thought it might be worthwhile to add my own experience to the pile.

Briefly stated: one of the standard databases, Readex’s America’s Historical Newspapers, seems to have a shockingly low number of texts available for the Jacksonian-to-Antebellum period; lower, even, than their own product descriptions (which emphasize the coverage of particular years) would lead you to believe. Here’s a picture:

(see below for a table with the raw numbers)

But before I get too far in here, let me emphasize the caveat: I say “seems” for a couple of reasons.

First and foremost: I may be completely misunderstanding something about how searches work in this database. The y-axis on the above graph is the number of “hits” a blank or wildcard (* or ?) search in the fulltext field of the database returned for 5 year intervals (blanks and wildcards returned equal numbers). This may or may not be the same as the number of “documents” (articles) or “images” in the database; though I should think it would be.

Second, the results I’m getting seem to run counter to some of the statistics Readex itself provides about the component databases searched by “America’s Historical Newspapers.” See, for example, what they say about the number of images in each of the seven (7!) component series of “Early American Newspapers” in the product description. The numbers of images available seem out of whack with what you’d expect…but since these are such big date ranges, it could be that what I’ve found is still true for this period; I don’t know. There also seems to be a discrepancy between these figures and those produced on different search pages available on the Readex site.

On another level, though, all this jibes with something I’ve long suspected: the digitization of print materials from the U.S. follows an uneven U-shaped curve, where the trough is roughly 1800 to 1850. Broadly speaking, it seems like every possible scrap of material from the colonial and revolutionary era has been digitized, extending, in the case of the Founders’ Papers projects (e.g. Rotunda), far into manuscript materials. Then, just as the print explosion begins in the U.S., digitized materials drop off, picking up again with the Civil War, and increasing as we approach the 20th century. That seems to be borne out here (or, at least as far as I was willing to go with the data entry).

This curve is in many ways totally understandable; there are fewer colonial and revolutionary periodicals, so why not be complete about it? And obviously there is more interest in the more recent past (perhaps the post-war stuff is digitized because it’s close enough that it might be good for local histories and genealogical work?). But on another level, it’s troubling; especially given how historians are beginning to use this and like databases to talk about the appearance of particular terms. Comprehensiveness, esp. relative comprehensiveness really matters there.

That’s how I happened on this case. I came across this oddity while trying to control for changes in the size of the database while tracking changes in the occurrence of a particular set of terms.1
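To make “controlling for the size of the database” concrete, here is a minimal sketch of the kind of normalization I have in mind. It assumes the wildcard “hits” totals are a usable proxy for database size (which is exactly the assumption in question in this post). The two totals are copied from the raw numbers at the end of the post; the term counts are hypothetical placeholders, not real search results, and the function name is my own invention.

```python
# Total wildcard "hits" per period, copied from the Raw Numbers table in this post.
TOTAL_HITS = {
    "1815-1819": 6_449_231,
    "1835-1839": 1_702_150,
}

# HYPOTHETICAL term counts, invented purely for illustration (not real results).
EXAMPLE_TERM_HITS = {
    "1815-1819": 100,
    "1835-1839": 100,
}


def hits_per_million(term_hits, total_hits):
    """Return term hits per million total hits, for periods present in both inputs."""
    return {
        period: term_hits[period] / total_hits[period] * 1_000_000
        for period in term_hits
        if period in total_hits and total_hits[period] > 0
    }


rates = hits_per_million(EXAMPLE_TERM_HITS, TOTAL_HITS)
for period, rate in sorted(rates.items()):
    print(f"{period}: {rate:.1f} per million hits")
```

With identical raw counts in both periods, the normalized rate for 1835-1839 comes out more than three times higher than for 1815-1819, simply because the denominator is so much smaller. That is why the size of the denominator matters so much for this kind of claim.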

What really shocks me about the numbers I’ve pulled out of AHN (which is, to my knowledge, far and away the most comprehensive database for this period there is) is how far the absolute number of articles scanned falls over time. I figured, at best, that the coverage was reduced only in terms of geographic range, or narrowed by a focus on particular publishers; only New York, Philadelphia and Boston well-covered, for example, and not the vast West and South. But apparently (and again, I want to emphasize the tentative nature of my conclusion here), that was dead wrong.

The upshot: for given values of the “Early Republic,” digitization is still a ways away, and we should not trust any database’s comprehensiveness, even if a first glance (or, in my case, continuous usage over, aargh, years) seems to suggest that it contains a lot of material.

Okay, so now some blegs: Any thoughts on this? What am I getting wrong? As I said, I can’t help but think this puts a major crimp in what we can use these databases for, in terms of reliability — but I’d be glad to have any mistakes I’m making here pointed out, the sooner the better.

1.) If you’re interested, the string I was searching was this horrible stew of syntax:

(“East Indies” OR “East India” OR “East Indian” OR China OR Chinese OR Orient OR Orient*) NEAR25(specie OR silver OR dollar? OR currency OR circulati*) AND (trade OR commerce) NEAR25(specie OR silver OR dollar? OR currency OR circulati*) AND (drai* OR expor*) NEAR25(specie OR silver OR dollar? OR currency OR circulati*)

Suggestions on how to improve that monster would be very welcome.

2.) There is also a discrepancy between two portals to search the Readex newspaper database. When I’ve searched only newspapers from the Archive of Americana portal, I consistently get higher returns than if I had searched America’s Historical Newspapers directly. The difference is potentially significant: in the period 1835-1839, AHN returns 1,702,150 hits compared to AA’s 1,933,685, a difference of 231,535, or 13.6% of the AHN figure.

I’m not sure why this is so; the two searches say they are tapping into the same databases, to wit:

AA’s search says it includes:

Early American Newspapers, Series 1 (1690 – 1876), Early American Newspapers, Series 2 (1758 – 1900), Early American Newspapers, Series 3 (1829 – 1922), Early American Newspapers, Series 4 (1756 – 1922), Early American Newspapers, Series 5 (1777 – 1922), Early American Newspapers, Series 6 (1741 – 1922), Early American Newspapers, Series 7 (1773 – 1922), Hispanic American Newspapers (1808 – 1980), African American Newspapers, 1827-1998 (1827 – 1998) and Ethnic American Newspapers from the Balch Collection (1808 – 1980).

While AHN’s claims:

Early American Newspapers Series 1 – 7, 1690-1922; African American Newspapers, 1827-1998; Ethnic American Newspapers from the Balch Collection, 1799-1971; Hispanic American Newspapers, 1808-1980 and Selected Historical Newspapers.

That seems comparable to me. If anything, the AHN search should include more, what with the inclusion of “Selected Historical Newspapers.”

I’m planning to e-mail the Readex people to find out what’s going on — and what I might be missing — but any suggestions in the meantime are welcome.
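For anyone checking the arithmetic in note 2, here is a short sketch using the two 1835-1839 figures quoted there. The variable names are mine. The percentage depends on which portal you treat as the base, so both are computed.

```python
# Wildcard hit counts for 1835-1839, as quoted in note 2.
ahn_hits = 1_702_150  # America's Historical Newspapers portal
aa_hits = 1_933_685   # Archive of Americana portal

difference = aa_hits - ahn_hits

# The same gap expressed against each possible base.
pct_of_ahn = difference / ahn_hits * 100
pct_of_aa = difference / aa_hits * 100

print(f"difference: {difference:,}")
print(f"as a share of the AHN count: {pct_of_ahn:.1f}%")
print(f"as a share of the AA count:  {pct_of_aa:.1f}%")
```

The 13.6% in the note is the gap measured against the smaller AHN count; measured against the AA count it is closer to 12%.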


Raw Numbers

(Note: these figures come from searches performed using the AHN portal, not the AA portal)

Years        Total “Hits” (articles?)

1795-1799     3,626,530
1800-1804     4,422,965
1805-1809     5,041,412
1810-1814     4,838,756
1815-1819     6,449,231
1820-1824     3,856,979
1825-1829     2,338,139
1830-1834     1,991,623
1835-1839     1,702,150
1840-1844     1,907,799
1845-1849     2,398,359
1850-1854     2,682,211
1855-1859     2,762,811
1860-1864     2,757,069
1865-1869     3,725,627
1870-1874     4,531,278
1875-1879     4,566,376
1880-1884     5,015,152
1885-1889     6,958,484
1890-1894     9,701,775
1895-1899    11,397,028
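
For readers who would rather not eyeball the table, here is a rough sketch that pulls out the shape of the curve: the lowest five-year period, the highest period before it, and the first later period that exceeds that earlier high. The figures are the AHN totals listed above; the helper names are hypothetical, and nothing here corrects for the way ads are counted.

```python
# (period, total wildcard hits) in chronological order, from the table above.
HITS = [
    ("1795-1799", 3_626_530), ("1800-1804", 4_422_965), ("1805-1809", 5_041_412),
    ("1810-1814", 4_838_756), ("1815-1819", 6_449_231), ("1820-1824", 3_856_979),
    ("1825-1829", 2_338_139), ("1830-1834", 1_991_623), ("1835-1839", 1_702_150),
    ("1840-1844", 1_907_799), ("1845-1849", 2_398_359), ("1850-1854", 2_682_211),
    ("1855-1859", 2_762_811), ("1860-1864", 2_757_069), ("1865-1869", 3_725_627),
    ("1870-1874", 4_531_278), ("1875-1879", 4_566_376), ("1880-1884", 5_015_152),
    ("1885-1889", 6_958_484), ("1890-1894", 9_701_775), ("1895-1899", 11_397_028),
]


def find_trough(rows):
    """Return the index of the period with the fewest hits."""
    return min(range(len(rows)), key=lambda i: rows[i][1])


def peak_before(rows, index):
    """Return the (period, hits) row with the most hits before the given index."""
    return max(rows[:index], key=lambda row: row[1])


def first_recovery(rows, index, level):
    """Return the first period after index whose hits exceed level, or None."""
    for period, hits in rows[index + 1:]:
        if hits > level:
            return period
    return None


trough_index = find_trough(HITS)
trough_period, trough_hits = HITS[trough_index]
peak_period, peak_hits = peak_before(HITS, trough_index)
recovery_period = first_recovery(HITS, trough_index, peak_hits)
trough_share = trough_hits / peak_hits

print(f"trough: {trough_period} ({trough_hits:,} hits)")
print(f"earlier peak: {peak_period} ({peak_hits:,} hits)")
print(f"trough as share of earlier peak: {trough_share:.0%}")
print(f"first period above the earlier peak: {recovery_period}")
```

On these numbers the low point is 1835-1839, at roughly a quarter of the 1815-1819 total, and the totals do not pass the 1815-1819 level again until 1885-1889.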