Ex Readex: Not Much? by goosecommerce
April 9, 2011, 10:45 pm
Filed under: History and Historians, Now in Actual Work

Or, Caveat NewsBank

UPDATE: See the subsequent post for the thrilling reveal!

In harmony with one of the recent memes floating around the world of digital history (the happy attention to what historians don’t know about database design, to how particular databases are missing parts of texts within particular series, and to proposals for how we might address this issue directly, as a collective), I thought it might be worthwhile to add my own experience to the pile.

Briefly stated: one of the standard databases, Readex’s America’s Historical Newspapers, seems to have a shockingly low number of texts available for the Jacksonian-to-Antebellum period; lower, even, than their own product descriptions (which emphasize the coverage of particular years) would lead you to believe. Here’s a picture:

(see below for a table with the raw numbers)

But before I get too far in here, let me emphasize the caveat: I say “seems” for a couple of reasons.

First and foremost: I may be completely misunderstanding something about how searches work in this database. The y-axis on the above graph is the number of “hits” that a blank or wildcard (* or ?) search in the fulltext field of the database returned for each 5-year interval (blanks and wildcards returned equal numbers). This may or may not be the same as the number of “documents” (articles) or “images” in the database, though I should think it would be.

Second, the results I’m getting seem to run counter to some of the statistics Readex itself provides about the component databases searched by “America’s Historical Newspapers.” See, for example, what they say about the number of images in each of the seven (7!) component series of “Early American Newspapers” in the product description. The numbers of images available seem out of whack with what you’d expect…but since these are such big date ranges, it could be that what I’ve found is still true for this period; I don’t know. There also seems to be a discrepancy between these figures and those produced on different search pages available on the Readex site.

On another level, though, all this jibes with something I’ve long suspected: the digitization of print materials from the U.S. follows an uneven U-shaped curve, where the trough is roughly 1800 to 1850. Broadly speaking, it seems like every possible scrap of material from the colonial and revolutionary era has been digitized, extending, in the case of the Founders’ Papers projects (e.g. Rotunda), far into manuscript materials. Then, just as the print explosion begins in the U.S., digitized materials drop off, picking up again with the Civil War and increasing as we approach the 20th century. That seems to be borne out here (or, at least, as far as I was willing to go with the data entry).

This curve is in many ways totally understandable; there are fewer colonial and revolutionary periodicals, so why not be complete about it? And obviously there is more interest in the more recent past (perhaps the post-war stuff is digitized because it’s close enough that it might be good for local histories and genealogical work?). But on another level, it’s troubling, especially given how historians are beginning to use this and similar databases to talk about the appearance of particular terms. Comprehensiveness, esp. relative comprehensiveness, really matters there.

That’s how I happened on this case. I came across this oddity while trying to control for changes in the size of the database while tracking changes in the occurrence of a particular set of terms (see note 1 below).
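For anyone wanting to do the same kind of control, here is a minimal sketch in Python. It divides the hits for a term search by the total hits (the blank or wildcard search) for the same interval, so that what gets compared across periods is a rate rather than a raw count. The function name and the term counts are hypothetical placeholders; only the 1835-1839 total is a real figure, the AHN number quoted in note 2 below.

```python
def hits_per_100k(term_hits, total_hits):
    """Return term hits per 100,000 total hits for each interval.

    Both arguments map an interval label to a hit count. Intervals with
    a missing or zero total are skipped rather than guessed at.
    """
    rates = {}
    for interval, hits in term_hits.items():
        total = total_hits.get(interval)
        if not total:
            continue
        rates[interval] = hits / total * 100_000
    return rates


# The 1835-1839 total is the AHN figure quoted in note 2 below.
total_hits = {"1835-1839": 1_702_150}
# Hypothetical term counts, for illustration only (not real search results).
term_hits = {"1835-1839": 500, "1840-1844": 300}

rates = hits_per_100k(term_hits, total_hits)
print(rates)
```

The second interval is dropped because no total is available for it here, which is the point: a term count with no denominator can’t be compared to anything.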

What really shocks me about the numbers I’ve pulled out of AHN (which is, to my knowledge, far and away the most comprehensive database for this period there is) is how far the absolute number of articles scanned drops over time. I figured, at worst, that the coverage was reduced only in terms of geographic range, or narrowed by a focus on particular publishers; only New York, Philadelphia and Boston well-covered, for example, and not the vast West and South. But apparently (and again, I want to emphasize the tentative nature of my conclusion here), that was dead wrong.

The upshot: for given values of the “Early Republic,” digitization is still a ways away, and we should not trust any database’s comprehensiveness, even if a first glance (or, in my case, continuous usage over, aargh, years) seems to suggest that it contains a lot of material.

Okay, so now some blegs: Any thoughts on this? What am I getting wrong? As I said, I can’t help but think this puts a major crimp in what we can use these databases for, in terms of reliability — but I’d be glad to have any mistakes I’m making here pointed out, the sooner the better.

1.) If you’re interested, the string I was searching was this horrible stew of syntax:

(“East Indies” OR “East India” OR “East Indian” OR China OR Chinese OR Orient OR Orient*) NEAR25(specie OR silver OR dollar? OR currency OR circulati*) AND (trade OR commerce) NEAR25(specie OR silver OR dollar? OR currency OR circulati*) AND (drai* OR expor*) NEAR25(specie OR silver OR dollar? OR currency OR circulati*)

Suggestions on how to improve that monster would be very welcome.
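One modest way to make the monster easier to maintain is to assemble it from parts, so that the repeated money clause is written only once. This is just a sketch: the variable and function names are my own invention, the quotation marks are plain straight quotes, and nothing here is checked against Readex’s search syntax beyond reproducing the structure of the string above.

```python
# Term groups copied from the search string above.
MONEY = ["specie", "silver", "dollar?", "currency", "circulati*"]
PLACES = ['"East Indies"', '"East India"', '"East Indian"',
          "China", "Chinese", "Orient", "Orient*"]
TRADE = ["trade", "commerce"]
OUTFLOW = ["drai*", "expor*"]


def any_of(terms):
    """Join terms into a parenthesized OR group."""
    return "(" + " OR ".join(terms) + ")"


def near_money(terms, distance=25):
    """Pair an OR group with the money group using the NEAR operator."""
    return f"{any_of(terms)} NEAR{distance}{any_of(MONEY)}"


# Three proximity clauses joined by AND, as in the original string.
query = " AND ".join(near_money(group) for group in (PLACES, TRADE, OUTFLOW))
print(query)
```

If the * wildcard also matches zero characters (an assumption I haven’t verified), then the bare “Orient” is already covered by “Orient*” and could be dropped from the first group.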

2.) There is also a discrepancy between two portals to search the Readex newspaper database. When I search only newspapers from the Archive of Americana portal, I consistently get higher returns than when I search America’s Historical Newspapers directly. The difference is potentially significant: in the period 1835-1839, AHN returns 1,702,150 hits compared to AA’s 1,933,685, a difference of 231,535, or 13.6% of the AHN figure.

I’m not sure why this is so; the two searches say they are tapping into the same databases, to wit:

AA’s search says it includes:

Early American Newspapers, Series 1 (1690 – 1876), Early American Newspapers, Series 2 (1758 – 1900), Early American Newspapers, Series 3 (1829 – 1922), Early American Newspapers, Series 4 (1756 – 1922), Early American Newspapers, Series 5 (1777 – 1922), Early American Newspapers, Series 6 (1741 – 1922), Early American Newspapers, Series 7 (1773 – 1922), Hispanic American Newspapers (1808 – 1980), African American Newspapers, 1827-1998 (1827 – 1998) and Ethnic American Newspapers from the Balch Collection (1808 – 1980).

While AHN’s claims:

Early American Newspapers Series 1 – 7, 1690-1922; African American Newspapers, 1827-1998; Ethnic American Newspapers from the Balch Collection, 1799-1971; Hispanic American Newspapers, 1808-1980 and Selected Historical Newspapers.

That seems comparable to me. If anything, the AHN search should include more, what with the inclusion of “Selected Historical Newspapers.”

I’m planning to e-mail the Readex people to find out what’s going on — and what I might be missing — but any suggestions in the meantime are welcome.
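For the record, the arithmetic behind the 13.6% figure in note 2 can be checked in a few lines of Python. The percentage depends on which portal’s count is taken as the base, so the sketch computes both; the variable names are mine, and the two hit counts are the ones quoted above for 1835-1839.

```python
# Hit counts for 1835-1839 as quoted in note 2.
ahn_hits = 1_702_150
aa_hits = 1_933_685

difference = aa_hits - ahn_hits

# Relative to the AHN count (the 13.6% quoted above).
pct_of_ahn = difference / ahn_hits * 100
# Relative to the AA count, which gives a smaller percentage.
pct_of_aa = difference / aa_hits * 100

print(difference, round(pct_of_ahn, 1), round(pct_of_aa, 1))
```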

Raw Numbers

(Note: these figures come from searches performed using the AHN portal, not the AA portal)


