Iterative Part-of-Speech Tagging

Alípio Jorge, Alneu de Andrade Lopes
LIACC - Laboratório de Inteligência Artificial e Ciências de Computadores
Universidade do Porto, Rua Campo Alegre 823, 4150 Porto, Portugal.
{amjorge,alneu}@ncc.up.pt, http://www.ncc.up.pt/~{amjorge, alneu}

Abstract. Assigning a category to a given word (tagging) depends on the particular word and on the categories (tags) of neighboring words. A theory that is able to assign tags to a given text can naturally be viewed as a recursive logic program. This article describes how iterative induction, a technique that has been proven powerful in the synthesis of recursive logic programs, has been applied to the task of part-of-speech tagging. The main strategy consists of inducing a succession T1, T2, ..., Tn of theories, using in the induction of theory Ti all the previously induced theories. Each theory in the sequence may have lexical rules, context rules and hybrid ones. This iterative strategy is, to a large extent, independent of the inductive algorithm underneath. Here we consider one particular relational learning algorithm, CSC(RC), and we induce first order theories from positive examples and background knowledge that are able to successfully tag a relatively large corpus in Portuguese.

1 Introduction

The task of Part-of-Speech Tagging is to assign to each word in a given body of text an appropriate grammatical category like noun, article, ordinal number, etc., according to the role of the word in that particular context. These categories are called part-of-speech tags and may total a few tens, depending on the variants one considers for each particular category. The difficulty of the task lies in the fact that a given word may play different roles in different contexts. Although there is, for each word, a relatively small set of possible tags, for many words there is more than one tag. Words with a single possible tag are handled by a simple lookup table (dictionary). The ambiguity of words with more than one possible tag is resolved by considering the context of the word and possibly employing background knowledge.

In our approach, we represent the text to be tagged as a set of facts. For example, the text “the car is red. I like the car.” is represented as

word(s1,1,the). word(s1,2,car). word(s1,3,is). word(s1,4,red). word(s1,5,'.').
word(s2,1,'I'). word(s2,2,like). word(s2,3,the). word(s2,4,car). word(s2,5,'.').

where s1 and s2 are sentence labels and the second argument is the position of the word within the sentence. Punctuation marks such as '.' are regarded as words. The task of tagging is to assert a set of facts that assign a tag to each word. For the given example one possible result is:

tag(s1,1,art). tag(s1,2,noun). tag(s1,3,v). tag(s1,4,adj). tag(s1,5,dot).
tag(s2,1,art). tag(s2,2,v). tag(s2,3,art). tag(s2,4,noun). tag(s2,5,dot).

In our approach we construct a theory that is interpreted as a decision list of first order rules:

tag(S,P,art)← word(S,P,the),!.
tag(S,P,art)← window(P,L1,L2,L3,L4,R1,R2,R3,R4),
              tag(S,R1,noun),tag(S,R2,verb),!.
tag(S,P,noun)← window(P,L1,L2,L3,L4,R1,R2,R3,R4),
              tag(S,L1,art),tag(S,R1,verb),!.
tag(S,P,art).

Here window/9 defines a window consisting of four positions left of P and four positions right of P.

window(P,L1,L2,L3,L4,R1,R2,R3,R4)←
    L1 is P-1, L2 is P-2, L3 is P-3, L4 is P-4,
    R1 is P+1, R2 is P+2, R3 is P+3, R4 is P+4.

Tagging one particular word (e.g. ‘like’ in position 2 of sentence s2) is done by answering the query

?- tag(s2,2,T).

This query may not terminate if we employ a straightforward top-down interpreter like the one used in Prolog: troublesome recursive calls may occur, leading to non-termination. We describe how we handle the interpretation of these recursive clauses in Section 7 (Algorithm 5 and Algorithm 6).

In this article we use the technique of iterative induction to synthesize the recursive tagger. This technique has proven powerful in the synthesis of recursive logic programs from sparse sets of examples (Jorge & Brazdil 96, Jorge 98). Here we show how it can be used in the task of part-of-speech tagging. In each iteration we use the novel first order algorithm CSC (Characterize, Select, Classify) in combination with the existing algorithm RC (Lopes & Brazdil 98).
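As a side note on the representation above, the word/3 facts for a sentence could be built with a small helper like the following. This is only a sketch, not part of the paper; assert_sentence/2 is a hypothetical predicate that receives a sentence label and a list of tokens.

:- dynamic word/3.

% assert_sentence(+SentId, +Tokens): assert one word/3 fact per token,
% numbering positions from 1, punctuation included as ordinary tokens.
assert_sentence(SentId, Tokens) :-
    assert_tokens(SentId, 1, Tokens).

assert_tokens(_, _, []).
assert_tokens(SentId, Pos, [Tok|Rest]) :-
    assertz(word(SentId, Pos, Tok)),
    Pos1 is Pos + 1,
    assert_tokens(SentId, Pos1, Rest).

% Example: ?- assert_sentence(s2, ['I', like, the, car, '.']).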

2 The Inductive Task

The task of constructing such a theory is defined as:

Given:
  Corpus C (predicate word/3),
  A set of examples E that associate a tag to each word in the corpus (predicate tag/3),
  A clausal language L,
  Background knowledge BK,
Output:
  A theory T ∈ L such that T ∪ BK ∪ C |– E

The above specification is over-constrained since it does not allow the theory T to fail on any of the examples. In practice we will be looking for a theory T that maximizes the expected success rate over an unseen set of words, where the success rate is the proportion of correct answers given by the theory.
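The success rate can be made concrete with a small sketch (an assumption, not the paper's evaluation code): given the gold-standard tag/3 facts for a set of words, and the answers of a theory stored as facts of a hypothetical predicate predicted_tag/3, the proportion of correct answers is computed as follows.

% success_rate(+GoldTags, -Rate)
% Rate is the fraction of gold tag(S,P,T) facts that the theory reproduces.
success_rate(GoldTags, Rate) :-
    length(GoldTags, Total),
    Total > 0,
    findall(hit,
            ( member(tag(S,P,T), GoldTags),
              predicted_tag(S,P,T) ),          % hypothetical: the theory's answers
            Hits),
    length(Hits, Correct),
    Rate is Correct / Total.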

3 The Language

To decide which tag should be assigned to a given word, we want rules that take into account either the word, or the context, or both. For that we will look for rules of the form

tag(S,P,T)←
    window(P,L1,L2,L3,L4,R1,R2,R3,R4)
    [,word(S,P,W)]
    [,tag(S,L1,TL1),tag(S,R1,TR1)
    [,tag(S,L2,TL2),tag(S,R2,TR2)
    [,tag(S,L3,TL3),tag(S,R3,TR3)
    [,tag(S,L4,TL4),tag(S,R4,TR4)]]]].

where [x] means that x is optional. The third argument of each of the predicates word/3 and tag/3 is a constant (ranging over words and tags, respectively). We also considered other combinations of the same literals.

To construct clauses in this recursive language we have to solve the following important problem. We need information about the context of each word. This information is available at induction time, but it is not available at classification time. To assign tags to an unseen text, we start only with the set of words to be tagged (predicate word/3). One common solution is to tag as many words as we can (for example the words that have no ambiguity) and then apply the recursive rules. After that we will have more tagged words, and the recursive rules may be applied repeatedly until all words have been tagged. In other words, the theory is applied in layers. The key idea in this article is to employ the layer-by-layer strategy in induction.

Although the use of a first order language is not strictly necessary to represent the learnt theories presented in this article, it has a few advantages. First of all, the proposed approach allows the use of background knowledge, which can be provided as in other ILP systems. Second, the recursive nature of the rules is represented and exploited in a natural and elegant way. Finally, the use of a first order declarative bias language provides a powerful mechanism for focusing on particular regions of the search space.
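To make the nested optional blocks concrete, the following sketch enumerates the admissible bodies of the template above as lists of goals, adding the context literals pairwise from the closest positions outwards. This is an illustration only; it is an assumption, not one of the two declarative bias formalisms actually used by CSC (described in Section 5).

:- use_module(library(lists)).

% template_body(+S, +P, -Body): on backtracking, enumerate bodies of the
% template tag(S,P,T) <- Body. The word literal is optional; the nested
% brackets mean that a prefix of the left/right context pairs is used.
template_body(S, P, Body) :-
    Window = window(P,L1,L2,L3,L4,R1,R2,R3,R4),
    ( WordLits = [] ; WordLits = [word(S,P,_W)] ),
    ContextPairs = [ [tag(S,L1,_),tag(S,R1,_)],
                     [tag(S,L2,_),tag(S,R2,_)],
                     [tag(S,L3,_),tag(S,R3,_)],
                     [tag(S,L4,_),tag(S,R4,_)] ],
    prefix(UsedPairs, ContextPairs),
    append(UsedPairs, ContextLits),
    append([[Window], WordLits, ContextLits], Body).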

4 The Iterative Induction Strategy

Learning recursive first order clauses is a difficult problem that has been tackled by several different approaches. The iterative approach we present here is to a large extent based on (Jorge 98) and (Jorge & Brazdil 96). Given a set of words, we start by inducing clauses that are able to determine the tag of some of those words without any context information. These will also be the first clauses to be applied in classification. They are the base clauses of the recursive definition we want to induce and are not recursive. These clauses are also used to enrich the background knowledge, thus enabling and/or facilitating the synthesis of recursive clauses in the following iterations. Having obtained this first layer of clauses, let us call it T1, we are able to classify (tag) some of the words in the text used for training. Using the answers given by this theory T1 we may induce some recursive context clauses thus obtaining theory T2. By iterating the process, we obtain a sequence of theories T1, T2, ..., Tn. The final theory is T = T1 ∪ T2 ∪ ... ∪ Tn. To induce each theory in the sequence we may apply a sort of covering strategy, by considering as training examples in iteration i only the ones that have not been covered by theories T1, ..., Ti-1. We stop when all the examples have been covered, or when we cannot find any clauses. The construction of each theory T1, T2, ... is done by a learning algorithm. In this article we consider the algorithm CSC(RC).

Algorithm 1: Iterative Induction
Given
  Language L, examples E and background knowledge BK,
  Learning algorithm ALG(Set of Examples, Background knowledge theory)
Find
  A theory T in L
Algorithm:
  Uncovered ← E
  T ← ∅
  i ← 1
  Do
    Ti ← ALG(Uncovered, BK)
    T ← T ∪ Ti
    BK ← BK ∪ Ti
    Uncovered ← Uncovered – covered_examples(Ti)
    i ← i + 1
  Until covered_examples(Ti) = ∅

Example: Assume that the background knowledge includes the definition of the predicate word/3 (describing the text) and window/9. We now describe in more detail how a theory is produced by iterative induction. In iteration 1 non-recursive rules like the following are induced:

tag(A,B,adj)←
    window(A,B,L1,L2,L3,L4,R1,R2,R3,R4),
    word(A,B,portuguesa),!.
tag(A,B,n)←
    window(A,B,L1,L2,L3,L4,R1,R2,R3,R4),
    word(A,B,documento),!.

These rules are defined solely in terms of the background predicates word/3 and window/9. They do not depend on the context of the word to be tagged. Before proceeding to iteration 2 we add these rules to the background knowledge. In iteration 2, some words can be tagged using the rules induced in iteration 1. Therefore recursive rules like the following appear:

tag(A,B,art)←
    window(A,B,L1,L2,L3,L4,R1,R2,R3,R4),
    tag(A,L1,prep),tag(A,R1,n),tag(A,L2,n),
    tag(A,R2,virg),tag(A,L3,prep),!.
...
tag(A,B,art)←
    window(A,B,L1,L2,L3,L4,R1,R2,R3,R4),
    word(A,B,a),tag(A,R2,prep),
    tag(A,R3,n),tag(A,R4,prep),!.

The second rule shown above is defined in terms of both the word to tag and its context. In this second iteration we also find many non-recursive rules. In subsequent iterations more clauses will appear until the stopping criterion is satisfied. In general, the total number of iterations depends on the data, the language, and the underlying learning algorithm employed.
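Algorithm 1 can be sketched in Prolog roughly as follows. This is an assumption, not the paper's implementation: learn/3 stands for the underlying learner ALG (here CSC(RC)) and covered_examples/3 for its coverage test, both hypothetical placeholders; theories and background knowledge are handled as lists of clauses.

% iterative_induction(+Examples, +BK, -Theory)
% Repeatedly learn a layer Ti, add it to the theory and to the background
% knowledge, and drop the examples it covers, until no example is covered.
iterative_induction(Examples, BK, Theory) :-
    iterate(Examples, BK, [], Theory).

iterate(Uncovered, BK, Acc, Theory) :-
    learn(Uncovered, BK, Ti),                         % Ti <- ALG(Uncovered, BK)
    covered_examples(Ti, BK, Covered),
    !,
    (   Covered == []
    ->  Theory = Acc                                  % stopping criterion
    ;   subtract(Uncovered, Covered, StillUncovered), % Uncovered - covered ones
        append(BK, Ti, BK1),                          % BK <- BK ∪ Ti
        append(Acc, Ti, Acc1),                        % T  <- T ∪ Ti
        iterate(StillUncovered, BK1, Acc1, Theory)
    ).
iterate(_, _, Theory, Theory).                        % no further clauses found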

5 The CSC Algorithm

CSC (Characterize, Select, Classify) is a new first order learning algorithm that learns from positive examples and background knowledge, and enables the use of declarative bias. The algorithm has two distinct learning phases (Algorithm 2). In the first one, all rules in the language that have confidence and support above given thresholds are produced. This is the characterization phase. It is akin to the discovery of association rules (Agrawal et al. 96) but using a first order language. In fact CSC uses the notions of support and confidence to eliminate potentially uninteresting rules. The second phase is selection. The set of all rules is sifted and sorted in order to obtain a decision list of first order rules that are able to classify. Here the selection is done by the algorithm RC described in the following section. We refer to the combination of CSC and RC as CSC(RC).

Algorithm 2: CSC
Given
  Language L, examples E, background knowledge BK,
  Minimal support MS, minimal confidence MC
Find
  A theory T in L
Algorithm
  ALLRULES ← Characterize(L,E,BK,MC,MS)
  T ← Select(ALLRULES)

The measures of support and confidence of a rule correspond to the notions used in the construction of association rules:

  support( A→B ) = #{ true instances of A∧B }
  confidence( A→B ) = #{ true instances of A∧B } / #{ true instances of A }

Algorithm 3 describes the construction of the set ALLRULES. The rules in this set are first order clauses belonging to the language L given as a parameter. This language can be defined by the user via one of two declarative bias formalisms: one is a DCG (definite clause grammar), the other is clause templates. Both describe the set of acceptable clauses. The set ALLRULES can be very large. For the data set used here it reached more than thirteen thousand rules in the second iteration. Other iterations had smaller ALLRULES sets. The overall synthesis time, for all iterations and including the characterization and selection phases, took less than six hundred seconds. In general, the number of rules generated is linear in the number of examples, but it is very sensitive to the minimal support and the size of the language L.

Algorithm 3: Characterize
Given
  Language L, examples E, background knowledge BK,
  Minimal support MS, minimal confidence MC
Find
  All rules A→B in L such that, relatively to E and BK,
    support( A→B ) ≥ MS,
    confidence( A→B ) ≥ MC

We call this set of rules ALLRULES. The selection is done by algorithm RC, which is described in the following section.
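Counting support and confidence for a single candidate rule can be sketched as follows. This is an assumption, not CSC's actual implementation: the rule is given as a Head and a Body that are callable against the corpus and background knowledge loaded into the Prolog database, and instances are counted as proofs.

:- use_module(library(aggregate)).

% support_confidence(+Head, +Body, -Support, -Confidence)
% Support    = number of proofs of Body for which Head also holds (A ∧ B).
% Confidence = Support divided by the number of proofs of Body (A).
support_confidence(Head, Body, Support, Confidence) :-
    aggregate_all(count, (call(Body), call(Head)), Support),
    aggregate_all(count, call(Body), BodyCount),
    BodyCount > 0,
    Confidence is Support / BodyCount.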

6 The RC Algorithm

The RC (Rules and Cases) algorithm learns a logic program represented as an ordered list of clauses. It starts from a general rule (default) and selects, in successive iterations, more specific rules which minimize the error of the previously selected rule. General rules tend to incorrectly cover a large number of examples (negative examples). So, RC tries to minimize this incorrect coverage by selecting other more specific rules that correctly cover the examples. These new rules may themselves incorrectly cover other examples. Therefore the process iterates until there are no

more candidate rules. A detailed description of RC can be found in (Lopes & Brazdil 98). Here we give a short description of the algorithm.

The RC algorithm receives as input the set ALLRULES. From these candidate rules we obtain the final theory in three main steps. Firstly, the candidate rules are organized, by the relation of generality, into a set of hierarchies (a forest). Secondly, the candidate rules are analyzed to establish their effective coverage, a measure explained below. Thirdly, the consistency of the candidate rules is analyzed. This analysis chooses, among the inconsistent rules, the best one and adds it to the theory. RC is described as Algorithm 4. We now briefly describe each step.

After placing the candidate rules into the structure, the system assigns a level to each element in the forest. These levels reflect the generality relation between the candidate rules in each tree. The most general rules are assigned to level 1, their most general specializations to level 2, and so on. To determine the effective coverage of each rule, the levels are processed starting with the most specific level. For each rule R, the system looks for more general rules (in the next upper level) which can be applied to the same examples as R but give different outputs (they are considered inconsistent with R). The effective coverage of each of these more general rules is updated to the sum of its own support and the support of R. This effective coverage gives us a criterion to choose, in the consistency analysis step, the best general rule among the candidates in each level. It estimates how many negative examples covered by the more general rule can possibly be corrected by more specific ones.

Algorithm 4: RC
Given
  A set of candidate rules: ALLRULES
Do
  Let T be the empty theory.
  Add each candidate rule to the existing structure (forest).
  Determine the effective coverage of each candidate rule.
  Perform the Consistency Analysis:
    Identify potentially redundant rules.
    While some non-redundant rule exists do:
      Select the rule with the largest effective coverage
        (among the most general candidate rules).
      Add the selected rule to the theory T
        (to the beginning of the ordered list of clauses).
      Eliminate inconsistent candidate rules.
      Mark descendants of eliminated rules as non-redundant.
  Output the final theory T.

The consistency analysis starts with the most general level. First, the system needs to identify all potentially redundant rules in the forest. These are all the candidate

rules except the most general ones in each tree. They are marked as potentially redundant. To explain the selection of the best rule, suppose we are dealing with the inconsistent set of candidate rules in Table 1. Rule R1 states that the Portuguese word “a” is an article. The other rules define the same word as a preposition. RC chooses the rule with the largest effective coverage. According to this measure, the rule R1 becomes the default, and the rules R2, R3, R4, R5 its exceptions. These other rules are specializations that state that the word “a” is a preposition if there is some particular tag in a given position, for instance a noun to its left (as in R2). This choice leads to an overall gain in terms of global positive coverage. The specialized rules can be seen as exceptions to the default rule that minimize its negative coverage. The effective coverage estimates the quality of the rule, considering the possibility of the existence of more specific rules in the final theory.

Table 1. A set of rules as sorted by RC.

R5: tag(S,P,prep)← window(P,L1,L2,L3,L4,R1,R2,R3,R4),
        word(S,P,a),tag(S,R1,vinf),!.
R4: tag(S,P,prep)← window(P,L1,L2,L3,L4,R1,R2,R3,R4),
        word(S,P,a),tag(S,R1,nc),!.
R3: tag(S,P,prep)← window(P,L1,L2,L3,L4,R1,R2,R3,R4),
        word(S,P,a),tag(S,L1,adj),!.
R2: tag(S,P,prep)← window(P,L1,L2,L3,L4,R1,R2,R3,R4),
        word(S,P,a),tag(S,L1,n),!.
R1: tag(S,P,art)← word(S,P,a).

In addition, to understand why the rules R2,…,R5 were selected, we emphasize the following. Whenever a rule has been selected from a set of inconsistent candidates, the remaining ones are used to identify their immediate descendants. They are marked as non-redundant. This enables choosing the rules R2,…,R5 in the next iteration, because the specializations of R1 remain redundant since R1 has been selected. When there are no more non-redundant rules, all rules selected until then are added to the top of the ordered list of learned rules and we get the final theory.
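In the restricted rule language used here, the generality relation that organizes the forest can be approximated by a simple condition-subset test, as in the sketch below. This is an assumption, not RC's actual test: it ignores variable renaming and, following the Table 1 example, compares only the rule conditions, not the class in the head.

:- use_module(library(lists)).

% more_general(+rule(HeadG, BodyG), +rule(HeadS, BodyS))
% A rule is taken to be more general than another when its body literals are a
% proper subset of the other rule's body literals (e.g. R1 versus R2,...,R5).
more_general(rule(_, BodyG), rule(_, BodyS)) :-
    subset(BodyG, BodyS),
    \+ subset(BodyS, BodyG).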

7 Classification with Iteratively Induced Theories

In this section we describe how theories produced by iterative induction are interpreted. A theory produced by Algorithm RC is interpreted as a decision list by

algorithm Classify (Algorithm 5). This interpretation is equivalent to using a Prolog interpreter on a set of clauses, each ending with a cut (!).

Algorithm 5: Classify
Given
  A set of examples E, with unknown classes,
  Background knowledge BK,
  A theory T
Find
  A set of answers (facts representing the class) for the examples in E
For each example ε in E,
  look for the first clause A→B in T such that
    A is true in BK,
    there is a substitution θ such that εθ = B;
  collect the answer εθ.

Theories induced iteratively are interpreted using an iterative classification algorithm (Algorithm 6). Suppose T = T1 ∪ ... ∪ Tn. We first classify as many examples as we can using algorithm Classify. Then we proceed to theory T2, classify again, and so on until Tn.

Algorithm 6: Iteratively Classify
Given
  A set of examples E, with unknown classes,
  Background knowledge BK,
  A theory T = T1 ∪ ... ∪ Tn
Find
  A set of answers (facts) for the examples in E
Algorithm
  Answers ← ∅
  Unanswered ← E
  For i = 1 to n
    NewAnswers ← Classify(Unanswered, BK, T1 ∪ ... ∪ Ti)
    Unanswered ← Unanswered – {examples answered in this iteration}
    Answers ← Answers ∪ NewAnswers
    BK ← BK ∪ NewAnswers
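A Prolog sketch of Algorithm 6 is given below. It is an assumption, not the paper's interpreter: classify/4 stands for Algorithm 5 and is hypothetical, examples are assumed to be word/3 facts, answers are tag/3 facts, and theories and background knowledge are lists of clauses.

% iteratively_classify(+Layers, +BK, +Examples, -Answers)
% Layers is the list [T1, ..., Tn]; each layer is applied together with the
% previous ones, and its answers are added to the background knowledge before
% the next layer is tried.
iteratively_classify(Layers, BK, Examples, Answers) :-
    iter_classify(Layers, [], BK, Examples, [], Answers).

iter_classify([], _, _, _, Answers, Answers).
iter_classify([Ti|Rest], SoFar, BK, Unanswered, Acc, Answers) :-
    append(SoFar, Ti, Theory),                       % T1 ∪ ... ∪ Ti
    classify(Unanswered, BK, Theory, NewAnswers),    % Algorithm 5 (hypothetical)
    exclude(has_answer(NewAnswers), Unanswered, StillUnanswered),
    append(BK, NewAnswers, BK1),                     % BK <- BK ∪ NewAnswers
    append(Acc, NewAnswers, Acc1),                   % Answers <- Answers ∪ NewAnswers
    iter_classify(Rest, Theory, BK1, StillUnanswered, Acc1, Answers).

% An example is answered when some new tag/3 fact refers to its position.
has_answer(NewAnswers, word(S, P, _)) :-
    memberchk(tag(S, P, _), NewAnswers).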

8 Experiments with the Lusa Corpus

We used the described approach in a set of experiments with a given corpus. These preliminary results provide evidence that the use of the iterative strategy improves on the results obtained by the isolated algorithm. For these experiments we used a corpus of Portuguese text containing 150 sentences with more than 5000 words. This text was produced by the Portuguese news agency (Lusa) and includes short news about politics, sports, economy and other common matters. The corpus was manually tagged. The set of tags is {nco, vppser, conjcoord, trav, ppr, pr, vter, vgerser, pind, ppoa, conjsub, vinfser, vinfter, ch, virg, vger, dpto, pps, ord, nc, adv, v, vser, pd, np, vinf, vpp, prep, art, n, adj, par, pto}. The corpus was divided into a training set with the first 120 sentences, and a test set with the remaining 30 sentences. The theories were induced using the information in the training set only, and then we measured the success rate of the theories on both the training and ...

