It’s a fact of our researching lives that, in a database of millions of articles, some will inevitably bear quality-control blemishes. Recently, however, we came across a more interesting metadata problem in ProQuest’s Historical Newspapers: one that prompts us to take extra precautions when providing source information to researchers.
I was pulling some NYT articles from the 1970s and viewing them as PDFs. ProQuest stamps its PDFs with helpful metadata such as article title, author, date, and page number. I noticed that for some years in the 1970s the stamped page numbers ran consecutively through the whole paper rather than restarting with each new section (moving, say, from A-28 to B-1). For example, the PDF of Robert J. Cole’s “No Bandwagon Expected for No-Fault Insurance” from August 30, 1970 is stamped by ProQuest with page number 154. Likewise, “Astros’ 2-Run 10th Beats Mets” is stamped as page 139. Confused, I clicked through to the full-page scan of the paper, and a very different story unfolded. ProQuest page “139” is actually NYT page 1 of Section 5 (Sports). And Cole’s article on page “154” is actually “L_S_16,” or late edition, Sports section, page 16. In other words, ProQuest was consecutively numbering pages that were not consecutively numbered in the original NYT.
A researcher not in the know might simply cite to ProQuest’s page number, unaware that these numbers do not correspond to the pagination of the original newspaper. I contacted ProQuest and was advised that their “Manufacturing Area” assigned these page numbers to avoid a “duplicate numbering” problem. They did not explain how including the original pagination would have created such a problem.
I thought, “Okay, to find the original page number, just look at the scan of the original page and not the metadata-stamped PDF, which contains ProQuest’s add-on numbers.” Not so fast. If you have access to the ProQuest database, take a look at the original page view for the September 20, 1977 editorial, “One More Reason for No-Fault.” You’ll find the page shows the “L” for late edition, plus the number 40. The ProQuest metadata stamp also shows page 40. But advance nine pages to ProQuest page 49. As of this morning, it’s NYT original page 73! We’ve advanced nine pages in ProQuest and thirty-three in the original paper! Now try ProQuest page 69. That’s original NYT page 57! So now we’re advancing in the ProQuest pagination and going backward in the original pagination.
“A-ha!” I said. “The numbering must have restarted in a new section.” Alas, no. ProQuest p. 69/NYT p. 57 is the first page of the Business/Finance section, and it has not restarted at “1.” So everything is complicated by the fact that the original NYT late edition in 1977 seems to be consecutively numbered across sections, yet that consecutive numbering does not match the consecutive numbering ProQuest assigned to it. In contrast, back in 1970 with the Cole article mentioned above, NYT was not consecutively numbering pages across sections, but ProQuest was!
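For anyone who wants to keep track of this sort of drift, here is a minimal Python sketch of the comparison I was doing by hand. It records each page twice, once as stamped by ProQuest and once as printed on the scanned NYT page, and then checks whether the two numbering schemes move in step. The names `PageObservation` and `compare_steps` are my own inventions for illustration and are not part of any ProQuest tool; the only data are the three 1977 page pairs described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageObservation:
    """One page, recorded under both numbering schemes (hypothetical record type)."""
    database_page: int  # page number stamped on the ProQuest PDF
    printed_page: int   # page number visible on the scanned NYT page


# The three pages checked in the September 20, 1977 late edition.
observations_1977 = [
    PageObservation(database_page=40, printed_page=40),
    PageObservation(database_page=49, printed_page=73),
    PageObservation(database_page=69, printed_page=57),
]


def compare_steps(observations):
    """For each consecutive pair of observations (ordered by database page),
    return (database_step, printed_step, steps_agree)."""
    ordered = sorted(observations, key=lambda o: o.database_page)
    results = []
    for prev, curr in zip(ordered, ordered[1:]):
        database_step = curr.database_page - prev.database_page
        printed_step = curr.printed_page - prev.printed_page
        results.append((database_step, printed_step, database_step == printed_step))
    return results


for database_step, printed_step, agree in compare_steps(observations_1977):
    print(f"database {database_step:+d}, printed {printed_step:+d}, agree={agree}")
```

Any pair where the two steps disagree is a signal that the stamped number should not be cited without checking the scan. A negative printed step, like the one between ProQuest pages 49 and 69, means the two schemes are not even ordering the pages the same way.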
Good grief! This is not meant as any sort of criticism of ProQuest, whose databases are near and dear to our researching hearts. It does, however, put us on alert to ensure we know what the “real” metadata is. To that end, we are very interested in learning more from NYT historians about the particular years and editions (late, national, etc.) that bear consecutive page numbering across sections (if that really is what’s happening). In addition, we hope to get more information from ProQuest about the years for which they added their own consecutive numbering to the pages, and why they chose to do so for those particular years but not others.
We certainly are not the first or only ones with historical NYT metadata woes, as evidenced by this 1994 e-mail posted to the LOC’s Research site. The issue back in 1994 related to proper cataloging and preservation of the “national” edition of the NYT, but its call for improved metadata standards rings the same today, and quite loudly.