Note: the instructions contained on this page were written for use specifically in the CLSP WS02 Lab. They may refer to applications, files or other materials that are not accessible from other locations.

JHU 2002 Summer School (pre-workshop) lab, July 5, 4-6pm

Syntactic Annotation Lab

by Jan Hajic and Martin Cmejrek

Welcome to the syntactic annotation lab. The goal of the lab is to

These pages will guide you through the process, starting with the English dependency grammar guidelines, and continuing to the actual lab exercise as prepared in the lab.


1. Annotation Guidelines: English, surface dependency

1.1 Introduction: What Do We Annotate?

Every sentence is annotated separately. Sentences are already identified and tokenized. Two things are being annotated: the dependency structure and the function (type) of each dependency. Every token from the sentence becomes a node in the resulting dependency tree. The tree is (mathematically speaking) an oriented (rooted) tree, where the direction of edges (in our interpretation) shows the direction of the dependency relation between the two nodes connected by that edge. By default, the direction goes from the "upper" node (called the governor) down to its dependent.

1.1.1 Dependency

By linguistic wisdom and common sense (but often also just by convention) the direction of the dependency is determined as follows: the dependent is the one of the two related nodes whose omission damages the grammaticality of the sentence the least (or perhaps not at all). (Obviously, leaving one node out usually changes the meaning of the sentence, but that's not our concern here - we are not actually leaving the node out but just determining the direction of the dependency.) For example, if you have a sentence "I like brown dogs", and want to determine the direction of dependency between brown and dogs, try saying "I like brown" and "I like dogs" - clearly the latter is better, so brown will be the dependent of dogs. The dependency direction of some tokens, primarily punctuation, particles, etc., for which it is hard to use the above test, is determined by convention.

1.1.2 Word Function

Since every token from the sentence appears in the resulting dependency tree, every token but one (the root) depends on some other token. On top of the structure (which also implies the direction of the dependency), we would also like to know the type of the dependency, much the same way as we want to know the type of a phrase in parse trees (such as NP, VP, PP etc.).

The type of the dependency is annotated and recorded with the dependent (since it can have only one governor (parent), it is quite clear and well-defined which dependency it describes). We distinguish about 25 types of dependency, many of them being rather technical; there are only five main dependency types:

Table 1 lists all of the relevant dependency types with short descriptions.
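The structure described above can be sketched in a few lines of Python. This is only an illustration, not the format used by the lab software: the representation (a list of governor indices, with None for the root) and the name is_rooted_tree are assumptions made for this example, and the function labels for "I like brown dogs" are our reading of the guidelines in section 1.2.

```python
from typing import List, Optional


def is_rooted_tree(heads: List[Optional[int]]) -> bool:
    """Check that heads describes a rooted tree.

    heads[i] is the index of token i's governor, or None for the root
    (hypothetical encoding chosen for this sketch).
    """
    n = len(heads)
    # Exactly one token (the root) may lack a governor.
    if sum(1 for h in heads if h is None) != 1:
        return False
    for start in range(n):
        seen = set()
        node = start
        # Climb towards the root; revisiting a node means a cycle.
        while heads[node] is not None:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node]
            if not 0 <= node < n:
                return False  # governor index points outside the sentence
    return True


tokens = ["I", "like", "brown", "dogs"]
heads = [1, None, 3, 1]               # "brown" depends on "dogs", etc.
funcs = ["Sb", "Pred", "Atr", "Obj"]  # the function is stored with the dependent

for i, form in enumerate(tokens):
    governor = "(root)" if heads[i] is None else tokens[heads[i]]
    print(f"{form:6} {funcs[i]:5} <- {governor}")
print(is_rooted_tree(heads))
```

Because every token has exactly one governor, one index and one function label per token are enough to record the whole annotation.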

1.2 Annotation Guidelines

The following guidelines have been prepared specifically for this lab. They do not claim to be complete at all; also, they merely teach-by-example the annotation of the phenomena that will be encountered in the sentences you will annotate. (The complete guidelines with examples would be about 600 pages long, based on our experience with other languages.) Also, please remember there is no "truth" with regard to the style of syntactic annotation; what you see here is in fact what we think is the best surface structure representation of English sentence structure, with possible applications (such as deeper analysis for machine translation) in mind; it is certainly not an attempt to replace Quirk's (or anyone else's, for that matter) English grammar, of course :-).

In the examples below, both the structure of the annotation and the appropriate functions are given. Remember, the function is in fact the name of the dependency type between a node and its governor, thus it cannot be determined for the roots of the example fragments (dashes are used instead) without knowing what they themselves depend on. Sometimes, however, the most typical function is used for the fragment's root.

1.2.1 Core sentence (Pred, with Subject and Object)

Every (complete) sentence (as well as every clause in coordinated/subordinated complex sentences) has a verb somewhere inside. The main verb of the clause is the root of the tree (or clausal subtree). If it is the root of the sentence tree (i.e., it belongs to the main or only clause), it gets the function Pred; otherwise, it gets the function of the clause within its governing clause, and the verb depends on the root verb of the governing clause. Subject and Object(s) depend on the main verb. Final punctuation "depends" also on the main verb of the main clause.

1.2.2 Subjects (Sb)

The head of the subject phrase is annotated as the subject of a clause (see e.g. the above example). Subjects are understood purely syntactically; no attempt should be made to distinguish "deep" subjects ("actors" or "agents") at this annotation level.

Subjects are expressed usually by a "syntactic" noun (noun, pronoun, adjective used as a noun, or a numeral). Sometimes, even a whole clause can be a subject (subjective subordinate clause).

For more examples, see the subject nodes in the coordination examples below; see also the passive verb form example.

1.2.3 Objects (Obj)

Objects are verb arguments. By convention, infinitives in complex verb constructions ("want to read", "begin to move") are considered objects as well.

Direct as well as indirect objects (i.e., noun phrases modifying verbs without the use of a preposition) are a clear example of an object depending on its verb; however, prepositional phrases can become objects as well in certain typical verb constructions (especially when the preposition loses its typical, or "unmarked", function when it goes with the particular verb). As with all functions, subordinate clauses can also be objects.

Examples of objects expressed:
by a simple noun phrase
by an infinitive
by an infinitive (with "to be")
by a clause (and by a simple noun inside it)

1.2.4 Attributes (Atr)

Attributes depend on their governing "syntactic" nouns (which can sometimes be just adjectives used as nouns, or pronouns, or numerals, or perhaps even something else). They never depend on verbs. On the other hand, nothing but Attributes can depend on nouns (for exceptions, see Adverbials; but there are not many exceptions, anyway). Articles (determiners) are considered attributes, too, and they depend on the head of the noun phrase they belong to. Attributes are probably the most common dependency type in the sentence representation.

Attributes (function: Atr) can be expressed in several ways:
by an adjective (+ determiner)
by a simple numeric expression
by a little more complex numeric expression
by a leading noun in a noun phrase
by a prepositional phrase
by a possessive
by a subordinate clause
by a numeric range
by an -ing verb construction

1.2.5 Adverbials (Adv)

Adverbials typically modify verbs and adjectives, but in certain cases they may modify "syntactic" nouns as well (e.g., "almost five").

They can be expressed in various ways:

by an adverb
by a prepositional phrase (numeric, time)
by a prepositional phrase (time, modifying a numeral)
by a prepositional phrase (location)

Negation is also annotated as an adverbial (similarly, intensifiers such as "also", "thus", "so", "only" etc. are Advs as well):

Subordinate clauses can also be adverbial (introduced by subordinate conjunctions such as "as", "when", "if", "where", etc.)

1.2.6 Verbal Attributes (Atv)

Verbal attributes are interesting in that they appear to depend on two nodes: the verb and one of its dependents (in some languages this is clearer than in English, since this "double dependency" is displayed in their morphological agreement).

As double dependency is not desirable (because it would make the resulting structure not a tree), a convention is adopted that the verbal attribute is annotated as if it depended only on the noun (that in turn depends on the verb in question). It usually introduces a "non-projectivity" in the structure (for those familiar with the parsing evaluation terminology, it is something like the "crossing brackets" phenomenon), but in the dependency framework, there is no trouble with that at all.

It is often expressed by a transgressive verb form ("Suzan paid bills keeping some reserve..."):
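For readers who want to see what non-projectivity means in practice, here is a minimal sketch of a projectivity check. It assumes the same hypothetical encoding as a list of governor indices (None for the root) and assumes the input is already a valid tree. The attachment of "keeping" to "Suzan" below is our reading of the convention described above, not a copy of the lab's gold annotation. An edge is projective if every token lying between the governor and the dependent is a descendant of the governor.

```python
from typing import List, Optional


def is_projective(heads: List[Optional[int]]) -> bool:
    """Return True if no dependency edge is non-projective.

    heads[i] is the governor index of token i, None for the root.
    Assumes heads already encodes a valid tree (no cycles).
    """
    for dep, head in enumerate(heads):
        if head is None:
            continue
        lo, hi = sorted((dep, head))
        for between in range(lo + 1, hi):
            # Climb from the in-between token; it must reach the governor.
            node: Optional[int] = between
            while node is not None and node != head:
                node = heads[node]
            if node is None:
                return False  # reached the root without meeting the governor
    return True


# "I like brown dogs": every edge is projective.
simple_heads = [1, None, 3, 1]

# "Suzan paid bills keeping some reserve", with "keeping" (Atv) attached
# to "Suzan" (assumed annotation): the edge Suzan-keeping spans "paid",
# which is not a descendant of "Suzan".
suzan_heads = [1, None, 1, 0, 5, 3]

print(is_projective(simple_heads))
print(is_projective(suzan_heads))
```

The check is quadratic in the worst case, which is harmless for single sentences.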

1.2.7 Coordination and Apposition (Coord, Apos)

Handling coordination and apposition is probably difficult in any annotation style. (The difference between coordination and apposition is perhaps not so important, and rather "semantic": coordination "brackets" several different things together to behave as a single sentence constituent, whereas apposition expresses the same object in two different ways).

Using dependencies, coordination can be handled easily by marking the coordinating conjunction (or a comma if there is no conjunction) by the function Coord and using it in place of any of the true dependents; they are then "dependent" on the conjunction's node, and marked accordingly as members of the coordination (similarly, for apposition the "governor" function is Apos, everything else being the same). Special care must be taken when the coordination is modified by a common phrase: such a phrase is obviously not marked as a coordination member. For marking the coordination/apposition "membership", an appropriate function "suffix" (_Co, _Ap) is used.
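One consequence of this convention is that the "true" governor of a coordination member is not its direct parent (the conjunction) but the node the conjunction hangs on. The following sketch shows how that governor could be recovered from the _Co/_Ap suffixes. The function name effective_governor, the list encoding and the example sentence "dogs and cats sleep" are all hypothetical, introduced only for illustration.

```python
from typing import List, Optional

MEMBER_SUFFIXES = ("_Co", "_Ap")


def effective_governor(i: int,
                       heads: List[Optional[int]],
                       funcs: List[str]) -> Optional[int]:
    """Return the index of the node that token i 'really' depends on.

    A node marked with _Co or _Ap is a member of a coordination or
    apposition, so we step up through the Coord/Apos node(s) first.
    """
    node = i
    while funcs[node].endswith(MEMBER_SUFFIXES):
        parent = heads[node]
        if parent is None:
            return None  # malformed input: a member without a Coord/Apos parent
        node = parent
    return heads[node]


# Hypothetical example: "dogs and cats sleep"
tokens = ["dogs", "and", "cats", "sleep"]
heads = [1, 3, 1, None]
funcs = ["Sb_Co", "Coord", "Sb_Co", "Pred"]

for i, form in enumerate(tokens):
    gov = effective_governor(i, heads, funcs)
    print(form, "->", "(root)" if gov is None else tokens[gov])
```

A common modifier of the whole coordination carries no suffix, so this sketch simply reports the conjunction as its governor; distributing it over the members would need an extra step.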

Examples of coordinations/appositions of various complexity:
Simple coordination
Simple apposition
Coordination with a common modification
Another common modification
Combination of coordination and apposition

1.2.8 Complex verb forms (AuxV, AuxY)

By complex verb form we mean a verb that is expressed as several "words" (tokens, to be precise). The main word is used as the node representing the verb in the global sentence structure; the auxiliary words "depend" on it, with the function AuxV for various forms of auxiliary verbs and for the infinitive particle "to", and AuxY for phrasal verb particles.

Simple passive
Passive w/negation
Infinitive particle

1.2.9 Nominal predicate (with the copula "to be")

Simple enough. Let's just show an example here:

1.2.10 Graphical symbols

Graphical symbols (usually expressed by punctuation other than the sentence-final one and the comma, which always gets AuxX) get the function AuxG:

1.2.11 Other

Please refer to the additional examples of full sentence-length annotation (distributed on paper) to get an idea how things fit together, and how various other phenomena are annotated. And remember, do not hesitate to ask if we forgot to show you something you need for the annotation!

2. The Lab Setup

Every team of 2 has been assigned a number (see the board for the signup sheet if you have forgotten). Let your team number be NN.

2.1 The task

There are 20 sentences to annotate using the above guidelines. The sentences are real-world sentences from several articles from the WSJ from the late 80s and early 90s, manually selected to avoid those that are too difficult. They have been pre-processed in such a way that you will initially see a "string" of nodes once you start the annotation software: each node "depends" on the previous one. No functions are filled in. (This is exactly how the real annotators get the sentences for annotation.) Your task is to modify this initial structure so that it obeys the above guidelines.
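The initial "string" of nodes can be pictured with the following small sketch. It is not how the lab software stores its data: the list encoding and the name initial_chain are assumptions made for illustration, and "---" stands for the empty function shown in the annotation tool.

```python
from typing import List, Optional, Tuple


def initial_chain(tokens: List[str]) -> Tuple[List[Optional[int]], List[str]]:
    """Build the pre-annotation structure: each node depends on the previous one.

    Returns (heads, funcs); the first token is the root (head None) and
    every function is the unfilled placeholder "---".
    """
    heads: List[Optional[int]] = [None if i == 0 else i - 1
                                  for i in range(len(tokens))]
    funcs = ["---"] * len(tokens)
    return heads, funcs


chain_heads, chain_funcs = initial_chain(["I", "like", "brown", "dogs"])
print(chain_heads)
print(chain_funcs)
```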

2.2 How to Annotate

Use the mouse to move nodes into place (by drag-and-drop): drop the node near what you think should be its governor, and it will be moved to the right place once you release the mouse button. Clicking on a token's dependency function (light grey color, below the word form, initially ---) will bring up a window with a list of all possible dependency functions; select the chosen one by double-clicking it.

2.3 The Annotation Procedure

Based on your team number, use one of the following machines:

Team Number    Machine
10             e1
11             e1
12             e2
13             e2
14             e7
15             e7
16             e8
17             e8
18             e10
19             e10
20             e11
21             e11
22             e12
23             e12
24             e13
25             e13
26             e14
27             e14
28             e15
29             e15
30             e16
31             e16

Log in under the login name of one member of the team only; use one terminal and work together. We suggest that one of the team members search the guidelines and the extra examples distributed on paper, come up with solutions and then instruct the other one (sitting in front of the terminal, controlling the mouse and doing the final visual check) what to do. (Of course, you might want to switch places in the middle of the task.)

Do:

...$ cd ~hajic/lab/NN
...$ pwd

Please check VERY CAREFULLY (by using pwd as suggested above) that you are in a directory named NN (your team number)! There is no way at the moment to ensure that two teams do not annotate the same file!!! (resulting in the obvious consequences...)

Then run:

...$ annotate

The main white window of the annotation software ("TrEd") should show up. Resize it so that it is as large as possible while these guidelines remain (at least partially) visible. You will see the first sentence.

Start working. Ask questions whenever you are not sure what to do. Save your work frequently (File->Save; do not use Save As... to save it under a different name - the result must be saved under your team's name, labNN.fs).

2.4 Evaluation

Before 5pm, do your last save and exit TrEd (File->Quit). At 5:00pm (but not earlier than that), you can run

...$ evaluate

to see how well you have done; it will show you your total accuracy against the "gold standard", as well as the separate accuracies in structure building and function assignment. The total accuracy is the complement of the average of the two separate error rates, in percent (equivalently, the average of the two separate accuracies):

Team NN accuracy: 18.78 (structure 21.89, functions 15.68)

The example above also shows the baseline numbers (i.e., this is what you get if you do not do anything and leave the initial structure and functions intact).
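The arithmetic behind these numbers can be sketched as follows. This is not the actual evaluate script: the function name accuracy, the list encoding and the choice to count per node are assumptions made for illustration. As a sanity check on the formula, the average of 21.89 and 15.68 is 18.785, which agrees with the 18.78 shown above up to rounding.

```python
from typing import List, Optional, Tuple


def accuracy(gold_heads: List[Optional[int]], gold_funcs: List[str],
             pred_heads: List[Optional[int]], pred_funcs: List[str]
             ) -> Tuple[float, float, float]:
    """Return (total, structure, functions) accuracy in percent.

    Structure accuracy: share of nodes with the correct governor.
    Function accuracy: share of nodes with the correct function.
    Total: 100 minus the average of the two error rates.
    """
    n = len(gold_heads)
    structure = 100.0 * sum(g == p for g, p in zip(gold_heads, pred_heads)) / n
    functions = 100.0 * sum(g == p for g, p in zip(gold_funcs, pred_funcs)) / n
    total = 100.0 - ((100.0 - structure) + (100.0 - functions)) / 2.0
    return total, structure, functions


# Hypothetical gold and team annotations of "I like brown dogs".
gold_heads = [1, None, 3, 1]
gold_funcs = ["Sb", "Pred", "Atr", "Obj"]
team_heads = [1, None, 1, 1]                # "brown" attached to the wrong node
team_funcs = ["Sb", "Pred", "Adv", "Adv"]   # two wrong functions

total, structure, functions = accuracy(gold_heads, gold_funcs,
                                       team_heads, team_funcs)
print(f"accuracy: {total:.2f} (structure {structure:.2f}, functions {functions:.2f})")
```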

The final results for all the teams, ranked, will be available about a minute later; announcement of the winning team and its accuracy figures will follow immediately!



http://www.clsp.jhu.edu/ws2002/preworkshop/labs/cmejrek/lab.html
This page was originally accessible at http://www.clsp.jhu.edu/~hajic/lab/index.html; graphics were in the same directory as *.jpg.