Resolving Citations (we don’t need no stinkin’ parser)

If you are reading this, you may be faced with the following problem: you have a collection of free-form citations that you have copied from a scholarly article, and you want to import them into a bibliographic management tool (or other database). In short, you would like to turn something like this:

Carberry, J 2008, “Toward a Unified Theory of High-Energy Metaphysics: Silly String Theory.” Journal of Psychoceramics, vol. 5, no. 11, pp. 1-3.

Into something more like this:

@article{Carberry_2008, title={Toward a Unified Theory of High-Energy Metaphysics: Silly String Theory}, volume={5}, url={http://dx.doi.org/10.5555/12345678}, DOI={10.5555/12345678}, number={11}, journal={Journal of Psychoceramics}, publisher={Society of Psychoceramics}, author={Carberry, Josiah}, year={2008}, month={Aug}, pages={1-3}}

Or even this:

TY - JOUR
JO - Journal of Psychoceramics
AU - Josiah Carberry
SN - 0264-3561
TI - Toward a Unified Theory of High-Energy Metaphysics: Silly String Theory
SP - 1
EP - 3
VL - 5
PB - Society of Psychoceramics
PY - 2008

The traditional approach to this is often “We’ll start by trying to parse the citation into its component parts.” Indeed, there are a number of tools that try to do this.

That is cool, but it is very difficult, particularly with obscure and/or terse citation styles.

There is another way. Instead of trying to parse the citation, just search for the record in a database that already has the citation parsed. The CrossRef metadata database is good for this. For example, the following query using the CrossRef Metadata Search API:

http://search.labs.crossref.org/dois?q=Carberry%2C+Josiah.+%E2%80%9CToward+a+Unified+Theory+of+High-Energy+Metaphysics%3A+Silly+String+Theory.%E2%80%9D+Journal+of+Psychoceramics+5.11+%282008%29%3A+1-3.#

Gives you the following result:

{
doi: "10.5555/12345678",
score: 7.1926823,
normalizedScore: 100,
title: "Toward a Unified Theory of High-Energy Metaphysics: Silly String Theory",
fullCitation: "Josiah Carberry, 2008, 'Toward a Unified Theory of High-Energy Metaphysics: Silly String Theory', <i>Journal of Psychoceramics</i>, vol. 5, no. 11",
coins: "ctx_ver=Z39.88-2004&amp;rft_id=info%3Adoi%2F10.5555%2F12345678&amp;rfr_id=info%3Asid%2Fcrossref.org%3Asearch&amp;rft.atitle=Toward+a+Unified+Theory+of+High-Energy+Metaphysics%3A+Silly+String+Theory&amp;rft.jtitle=Journal+of+Psychoceramics&amp;rft.date=2008&amp;rft.volume=5&amp;rft.issue=11&amp;rft.aufirst=Josiah&amp;rft.aulast=Carberry&amp;rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&amp;rft.genre=article",
year: 2008
},
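If you want to script that lookup, here is a minimal Python sketch. It only builds the query URL and pulls the DOI out of a response body. The endpoint is the one from the example URL above, and we are assuming the response body is a JSON array of objects shaped like the result shown; the function names are ours, not part of any CrossRef library.

```python
import json
from urllib.parse import urlencode

# Endpoint copied from the example query above. Whether it is reachable
# from your environment is an assumption you should check.
SEARCH_ENDPOINT = "http://search.labs.crossref.org/dois"


def build_search_url(citation: str) -> str:
    """Return the search URL for a free-form citation string."""
    # urlencode percent-escapes punctuation and turns spaces into '+'.
    return SEARCH_ENDPOINT + "?" + urlencode({"q": citation})


def top_doi(response_text: str):
    """Return the DOI of the first result, or None if nothing came back.

    Assumes the body is a JSON array of result objects with a "doi" key.
    """
    results = json.loads(response_text)
    return results[0]["doi"] if results else None
```

You would fetch `build_search_url(citation)` with whatever HTTP client you like and hand the response text to `top_doi`.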

That’s already pretty cool. But if you extract the DOI from the above and use DOI content negotiation to query the DOI like this:

$ curl -LH "Accept: application/x-bibtex" http://dx.doi.org/10.5555/12345678

You get the following result in BibTeX:

@article{Carberry_2008, title={Toward a Unified Theory of High-Energy Metaphysics: Silly String Theory}, volume={5}, url={http://dx.doi.org/10.5555/12345678}, DOI={10.5555/12345678}, number={11}, journal={Journal of Psychoceramics}, publisher={Society of Psychoceramics}, author={Carberry, Josiah}, year={2008}, month={Aug}, pages={1-3}}
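The same content negotiation request can be sketched in Python using only the standard library. The resolver URL and the Accept value are the ones from the curl command above; the helper names `bibtex_request` and `fetch_bibtex` are made up for illustration.

```python
from urllib.request import Request, urlopen


def bibtex_request(doi: str) -> Request:
    """Build a request that asks the DOI resolver for BibTeX."""
    return Request(
        "http://dx.doi.org/" + doi,
        headers={"Accept": "application/x-bibtex"},
    )


def fetch_bibtex(doi: str) -> str:
    """Resolve the DOI over the network and return the body as text."""
    # urlopen follows redirects by default, much like curl's -L flag.
    with urlopen(bibtex_request(doi)) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_bibtex("10.5555/12345678")` needs network access; `bibtex_request` on its own does not, which makes it handy for checking what you are about to send.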

Yay!

There, that wasn’t too hard, was it?

OK, what is the catch?

Well… using CrossRef Metadata Search has a number of limitations that you should be aware of:

  • It can produce false positives. It will almost always match *something*. You need to look at the score, etc. in order to determine the likelihood that you’ve got a correct match.
  • It only works on content listed in CrossRef’s database.
  • The metadata in CrossRef’s database can sometimes be… spotty*
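One hedged way to guard against false positives is to discard results that score below a cutoff. The sketch below filters on the `score` field from the example result. The cutoff itself is an assumption: we are not suggesting a particular value, and you would need to calibrate it against citations whose correct DOIs you already know.

```python
def likely_matches(results, min_score):
    """Return results whose raw score is at least min_score, best first.

    `results` is a list of dicts shaped like the example result above.
    `min_score` is a caller-chosen cutoff, not a value CrossRef recommends.
    """
    kept = [r for r in results if r.get("score", 0.0) >= min_score]
    return sorted(kept, key=lambda r: r["score"], reverse=True)
```

An empty list back from `likely_matches` is your cue to check the citation by hand rather than trust the top hit.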

It also has a big benefit: you are far less likely to get false negatives. If your citation has a typo or incomplete metadata, it will do a much better job than a strict citation parser or OpenURL query.

In short, CrossRef Metadata Search is remarkably good at resolving citations. We encourage you to try it and let us know how it works for you.

Note that if you are having trouble getting hold of free-form citations to begin with, you may want to see our tools for extracting citations from PDFs.

(*unmitigated bilge)