Boston University Noun Phrase Corpus


Welcome to the Boston University NP Corpus. This is a thoroughly coded corpus of noun phrases that is freely accessible to the public. For more information, see below.


The Corpus

The Boston University NP Corpus is specifically a corpus of 'possessive' noun phrases. The tokens are therefore all pairs of nominals (including pronouns), combined either via premodification or via postmodification:

  • his widow
  • the man's widow
  • the widow of the man
  • that type of man
  • a woman of strong principles
  • some of the women
  • etc.

These are not all true possessives; rather, they constitute a superset that includes all possessives. Noun-noun modification (compounding), such as "the garbage man", is not counted as a 'possessive' noun phrase, although such compounds often occur within a possessive. The coding attempts to remain theory-neutral with regard to syntax.

The corpus contains 10,008 tokens of such 'possessive' noun phrases, meaning that it contains 20,016 individual nominal tokens.

All of the tokens are taken from sections of the Brown Corpus. Specifically, they are taken from the following genres:

  • Press: Reportage (A)
  • Belles-lettres, Biography, Memoirs, etc. (G)
  • General Fiction (K)
  • Adventure and Western Fiction (N)
  • Learned (J)

Both NPs in each of the 10,008 tokens have been annotated for many features, including the following:

  • Animacy (human, organization, concrete object, etc.)
  • Definiteness (definite, indefinite)
  • Expression type (pronoun, proper noun, common noun, etc.)
  • Weight (number of words)
  • Certain semantic and constructional classes
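
As a rough illustration of what one annotated token might look like as a data structure, here is a minimal Python sketch. It is not the corpus's actual storage format: the class names, field names, and the sample values (including the genre letter) are assumptions made up for this example. Only the feature categories and their sample labels come from the list above.

```python
from dataclasses import dataclass


@dataclass
class Nominal:
    """One of the two nominals in a token (hypothetical schema)."""
    text: str
    animacy: str          # e.g. "human", "organization", "concrete object"
    definiteness: str     # "definite" or "indefinite"
    expression_type: str  # e.g. "pronoun", "proper noun", "common noun"

    @property
    def weight(self) -> int:
        # Weight is described above as the number of words.
        return len(self.text.split())


@dataclass
class PossessiveToken:
    """A pair of nominals forming one 'possessive' noun phrase (hypothetical)."""
    head: Nominal
    modifier: Nominal
    construction: str  # "premodification" or "postmodification"
    genre: str         # Brown Corpus category letter, e.g. "A", "G", "J", "K", "N"


# Illustrative values for "the man's widow"; not taken from the corpus itself.
token = PossessiveToken(
    head=Nominal("widow", "human", "definite", "common noun"),
    modifier=Nominal("the man", "human", "definite", "common noun"),
    construction="premodification",
    genre="K",
)

print(token.head.weight, token.modifier.weight)
```

Computing weight from the text rather than storing it keeps the two from drifting apart; the real annotation may well record it as a separate coded field.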

The tokens are available with the full co-text of each example, up to the limits of the samples that comprise the Brown Corpus. All text has been part-of-speech tagged using Fred Karlsson's English Constraint Grammar system.


The Search Engine

The corpus is accessed by means of a complex search engine which, when fully developed, will allow the following operations:

  • Search by text
  • Search by part of speech
  • Search by annotation (coded features)
  • Boolean combination of all of the above
  • Mark examples for output or for complex searches
  • Limit search to head nominal or modifier nominal
  • Limit search by genre or text type
  • Show context of tokens
  • Show in list or KWIC view
  • Show only annotations of interest
  • And more...
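
To make the idea of Boolean combination concrete, the following Python sketch shows one way such searches could be composed from small predicates. It says nothing about how the actual search engine is implemented: the function names, the dictionary layout, and the three sample tokens with their genre letters are all hypothetical and chosen only for illustration.

```python
from typing import Callable

# A predicate takes one token (a plain dict here) and returns True or False.
Predicate = Callable[[dict], bool]


def feature(role: str, name: str, value: str) -> Predicate:
    """Search by annotation, limited to the 'head' or 'modifier' nominal."""
    return lambda tok: tok[role].get(name) == value


def genre(letter: str) -> Predicate:
    """Limit the search to one genre (Brown Corpus category letter)."""
    return lambda tok: tok["genre"] == letter


def all_of(*preds: Predicate) -> Predicate:
    return lambda tok: all(p(tok) for p in preds)


def any_of(*preds: Predicate) -> Predicate:
    return lambda tok: any(p(tok) for p in preds)


def negate(pred: Predicate) -> Predicate:
    return lambda tok: not pred(tok)


# Made-up sample data; the genre assignments are not from the corpus.
tokens = [
    {"text": "his widow", "genre": "K",
     "head": {"text": "widow"},
     "modifier": {"text": "his", "expression_type": "pronoun"}},
    {"text": "the man's widow", "genre": "A",
     "head": {"text": "widow"},
     "modifier": {"text": "the man", "expression_type": "common noun"}},
    {"text": "the widow of the man", "genre": "N",
     "head": {"text": "widow"},
     "modifier": {"text": "the man", "expression_type": "common noun"}},
]

# (modifier is a pronoun OR genre is N) AND NOT genre A
query = all_of(
    any_of(feature("modifier", "expression_type", "pronoun"), genre("N")),
    negate(genre("A")),
)

matches = [tok["text"] for tok in tokens if query(tok)]
print(matches)
```

Because every predicate has the same shape, text and part-of-speech searches could be added as further predicate builders and combined in the same way.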

Who We Are

This corpus and website are two of the products of the NSF-funded project Optimal Typology of Determiner Phrases (BCS-0080377), with the following members:

  • M. Catherine O'Connor, Boston University, principal investigator
  • Joan Maling, Brandeis University, senior research consultant
  • Arto Anttila, New York University, senior research consultant
  • Vivienne Fong, New York University, senior research consultant
  • Gregory Garretson, Boston University, research assistant
  • Barbora Skarabela, Boston University, research assistant
  • Marjorie Hogan, Boston University, research assistant
  • Fred Karlsson, University of Helsinki, consultant

This site was created and is maintained by Gregory Garretson. Please direct all correspondence to him.




Last modified August 21, 2006