MEasuring TExt Reuse – Yorick Wilks (Department of Computer Science, University of Sheffield)

February 18, 2003 all-day

In this paper we present initial results from the METER (MEasuring TExt Reuse) project, whose aim is to explore issues pertaining to text reuse and derivation, especially in the context of newspapers using newswire sources. Although the reuse of text by journalists has been studied in linguistics, we are not aware of any investigation using existing computational methods for this particular task and context. We concentrate on classifying newspaper articles according to their dependency upon Press Association (PA) copy, using a 3-class document-level scheme designed by domain experts from journalism and a number of well-known approaches to text analysis. We show that the 3-class document-level scheme is better implemented as 2 binary Naive Bayes classifiers, which gives an F-measure score of 0.7309.
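To make the idea of implementing a 3-class scheme as 2 binary Naive Bayes classifiers concrete, here is a minimal sketch. It rests on several assumptions that the abstract does not state: the class names ("wholly", "partially", "non"), the cascade order (derived vs. non-derived first, then wholly vs. partially derived), the feature tokens such as `overlap_high`, and the toy training data are all hypothetical, and the small `BinaryNB` class is a generic multinomial Naive Bayes with add-one smoothing rather than the project's actual implementation.

```python
import math
from collections import Counter


class BinaryNB:
    """Tiny multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)          # documents per class
        self.n_docs = len(labels)
        self.counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.counts[label].update(doc.split())
        self.vocab = {w for c in self.classes for w in self.counts[c]}
        return self

    def predict(self, doc):
        def log_score(c):
            # Denominator of the smoothed estimate: class tokens + vocabulary size.
            total = sum(self.counts[c].values()) + len(self.vocab)
            score = math.log(self.prior[c] / self.n_docs)
            for w in doc.split():
                if w in self.vocab:           # ignore tokens never seen in training
                    score += math.log((self.counts[c][w] + 1) / total)
            return score

        return max(self.classes, key=log_score)


def train_cascade(docs, labels):
    """Train two binary classifiers that together cover three classes."""
    # Stage 1 (assumed split): derived (wholly or partially) vs. non-derived.
    stage1_labels = ["non" if l == "non" else "derived" for l in labels]
    stage1 = BinaryNB().fit(docs, stage1_labels)
    # Stage 2 (assumed split): wholly vs. partially, trained on derived docs only.
    derived = [(d, l) for d, l in zip(docs, labels) if l != "non"]
    stage2 = BinaryNB().fit([d for d, _ in derived], [l for _, l in derived])
    return stage1, stage2


def classify(stage1, stage2, doc):
    """Return 'non', 'wholly' or 'partially' for a single document."""
    if stage1.predict(doc) == "non":
        return "non"
    return stage2.predict(doc)


# Hypothetical toy data: each "document" is a bag of invented feature tokens.
TOY_DOCS = [
    "overlap_high verbatim_long", "overlap_high verbatim_long",
    "overlap_mid verbatim_short", "overlap_mid verbatim_short",
    "overlap_low verbatim_none", "overlap_low verbatim_none",
]
TOY_LABELS = ["wholly", "wholly", "partially", "partially", "non", "non"]

if __name__ == "__main__":
    s1, s2 = train_cascade(TOY_DOCS, TOY_LABELS)
    for d in ("overlap_high verbatim_long", "overlap_low verbatim_none"):
        print(d, "->", classify(s1, s2, d))
```

One possible reason such a decomposition can help is that each binary classifier only has to model a single, simpler decision boundary; whether that is the explanation behind the reported result is not stated in the abstract.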


Center for Language and Speech Processing