
Statistical Sandhi Splitter for Agglutinative Languages Conference Paper · April 2015 DOI: 10.1007/978-3-319-18111-0_13

4 authors: Prathyusha Kuncham, Kovida Nelakuditi, Sneha Nallani, Radhika Mamidi (International Institute of Information Technology, Hyderabad)

All content following this page was uploaded by Prathyusha Kuncham on 10 March 2017.

STATISTICAL SANDHI SPLITTER FOR AGGLUTINATIVE LANGUAGES

K. Prathyusha, N. Kovida, N. Sneha, Radhika Mamidi

IIIT Hyderabad

{prathyusha.k, nelakuditi.kovida, sneha.nallani}@research.iiit.ac.in, radhika.mamidi@iiit.ac.in

Abstract. Sandhi splitting is a primary and important step for any natural language processing (NLP) application for languages which have agglutinative morphology. This paper presents a statistical approach to building a sandhi splitter for agglutinative languages. The input to the model is a valid string in the language and the output is a split of that string into meaningful word/s. The approach comprises two stages, namely Segmentation and Word generation, both of which use conditional random fields (CRFs). Our approach is robust and language independent. The results for two Dravidian languages, viz. Telugu and Malayalam, show an accuracy of 89.07% and 90.50% respectively.

1 Introduction

Agglutinative languages are rich in morphology. There are many agglutinative languages, such as the Dravidian languages and the Turkic languages. In these languages many words combine to form a compound word. In this process, morphophonological changes, i.e. fusion of the final and initial characters, occur at word boundaries. This is termed “Sandhi”. Examples of sandhi in compound words:

(a) Compound Nouns: ‘vixyAlayaM’¹ (university) → ‘vixya’ (education) + ‘AlayaM’ (temple)
(b) Compound Verbs: ‘kUdabeVttu’ (to accumulate) → ‘kUdu’ (be gatherer) + ‘peVttu’ (keep)
(c) Other type of compound words: ‘rAmudeVkkada’ (Where is Ramudu) → ‘rAmudu’ (Ramudu) + ‘eVkkada’ (where)

¹ Words are in wx format (sanskrit.inria.fr/DATA/wx.html). All the examples given in the paper are from the Telugu language.

If the word in (c) is given as input to a Question Answering system, the question word ‘eVkkada’ (where) must be identified for the system to function properly, and it can be obtained only by splitting the compound word. As the above examples show, (c) has to be split because it is morphologically unanalysable and, if left unsplit, degrades the performance of NLP applications [1]. It is not necessary to split (a) and (b), as these are frequently occurring collocations in the language and are also handled by the existing morphological analyser [2]. Therefore, this paper focuses on handling words of type (c). We developed a statistical sandhi splitter for agglutinative languages. Our approach uses CRFs, one of the most successful statistical learning methods in NLP for labelling and segmenting sequential data [3]. It consists of two stages, namely Segmentation [4], [5], [6] and Word generation, as discussed in Section 4.

2 Related Work

Sandhi splitting can be done using (a) rule based techniques, (b) statistical techniques and (c) hybrid approaches.

(a) Rule based systems: [7] and [8] developed rule based systems to split compound words into meaningful sub-words in Malayalam and Marathi respectively. The main drawback of this type of system is that preparing the rules requires a lot of manual effort and time. Moreover, such a system is language dependent.

(b) Statistical systems: [9] built a finite state transducer (FST) which was used to identify possible words for a given compound word with 80.3% accuracy. This approach fails for out-of-vocabulary (OOV) words, i.e. where the base word does not exist in the FST. [10] used statistical methods such as the Dirichlet process and Gibbs sampling for Sanskrit sandhi splitting.

(c) Hybrid systems: These systems combine statistical and rule based techniques. [11] gives an accuracy of 91.1% for Malayalam. Hybrid systems are also not language independent.

Segmentation in our approach is inspired by [5]. Our model is purely statistical and language independent. Compared to rule based and hybrid approaches, our model is robust, faster and requires less effort.

3 Dataset

No sandhi annotated data was available for agglutinative languages except for Malayalam. We have prepared a sandhi annotated dataset for the Telugu language. The following decisions were made while annotating the training data.

(A) Context dependent particles: In cases where contextual information is required to decide whether or not to split a word, we decided not to split the word. The following examples in Telugu give an insight into these occurrences.

(i) “gA”: “gA” can act as a question particle or it can mean “while”, depending on the situation.
a. “unnAvugA” ((you are) present, aren’t you?) → ‘unnAvu’ (present) + ‘gA’ (aren’t you?)
b. “undagA” (while present) → ‘undu’ (present) + ‘gA’ (while)

(ii) Clitics [12]: We look at the Telugu clitics ‘e’ and ‘o’, which are ambiguous. Examples:
a. ‘Ameke puswakam kAvAli’ (What book does she need?) Here, ‘e’ acts as a question marker.
b. ‘nenu Ameke iccAnu’ (I gave only to her (not others)) Here, ‘e’ is an emphatic clitic.

a. ‘rAjuko BArya uMxi’ (King has one wife) Here, ‘o’ acts as the quantifier ‘one’.
b. ‘rAjuko rAniko kala vacciMxi’ (either king or queen got a dream) Here, ‘o’ is an indefinite clitic.

Ideally, the (a) cases should be split and the (b) cases should not. The decision whether or not to split can be made only with contextual information obtained from the words of the compound word itself or from the sentence. We decided not to split such cases, which require contextual information to disambiguate the sense.

(B) Dialectal influence: The base form of a word may change with dialect. Therefore, only words from the standard written language are considered while preparing the training data. Example: the standard Telugu word ‘vaccAraMxaru’ (all came) is ‘vacciMdraMxaru’ in the Telangana dialect of Telugu. So ‘vaccAraMxaru’ → ‘vaccAru’ (came) + ‘aMxaru’ (all) is included in the training data but ‘vacciMdraMxaru’ is not.

4 Our Approach

Our approach consists of two stages viz., Segmentation and Word Generation. A flow chart of the system is shown in Figure 1.

Fig. 1. Flow chart of the Sandhi Splitter module

4.1 Segmentation

In this stage, a CRF model decides at each character of a word whether or not to split at that point. We thereby identify the boundaries between the component words, i.e. the points where the morphophonological changes occur in the compound word. This is formulated as a two-class classification problem. The input for this task is a word and the output is the segments that show the boundary/split points in the input. The resulting segments may or may not be meaningful words.

Example:
Input: ‘pUjayyAkA’ (after having finished the prayer)
Output: ‘pUj’-‘ayyAkA’
Here, ‘pUjayyAkA’ → ‘pUja’ (prayer) + ‘ayyAkA’ (finished). We can observe the morphophonological change (a → a + a) at the word boundary. In the above output, the segments are ‘pUj’ and ‘ayyAkA’, where ‘pUj’ is not a meaningful word in Telugu.

At this stage, the CRF was trained with the following feature set.
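As an illustration of the two-class formulation, the sketch below turns a word and its segments into one 0/1 label per character and builds character and tag windows of the kind listed in the feature set that follows. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the tag names (C, SV, LV), the convention that label 1 marks the character just before a split, and the simplified WX vowel inventory are all hypothetical.

```python
# Assumption: a simplified reading of WX vowels. The paper only says tags are
# consonant / short vowel / long vowel; it does not list the inventory.
# Modifier characters such as 'V' and 'M' fall into the consonant tag here.
SHORT_VOWELS = set("aiuq")
LONG_VOWELS = set("AIUeEoO")


def char_tag(ch):
    """Map a character to a coarse sound-type tag (hypothetical tag names)."""
    if ch in SHORT_VOWELS:
        return "SV"
    if ch in LONG_VOWELS:
        return "LV"
    return "C"


def boundary_labels(word, segments):
    """Return one 0/1 label per character; 1 means a split follows it."""
    if any(not seg for seg in segments) or "".join(segments) != word:
        raise ValueError("segments must be non-empty and concatenate to word")
    labels = [0] * len(word)
    pos = 0
    for seg in segments[:-1]:
        pos += len(seg)
        labels[pos - 1] = 1  # last character of every non-final segment
    return labels


def window_features(word, i, c=3, t=3):
    """Characters within c positions and tags within t positions of index i."""
    feats = {}
    for off in range(-c, c + 1):
        j = i + off
        feats[f"C{off:+d}"] = word[j] if 0 <= j < len(word) else "<pad>"
    for off in range(-t, t + 1):
        j = i + off
        inside = 0 <= j < len(word)
        feats[f"T{off:+d}"] = char_tag(word[j]) if inside else "<pad>"
    return feats


word = "pUjayyAkA"
labels = boundary_labels(word, ["pUj", "ayyAkA"])
features = [window_features(word, i) for i in range(len(word))]
```

Each `(features[i], labels[i])` pair would then be one training position for a sequence labeller; which CRF toolkit consumes them is left open here.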

Feature Set:
Characters: Morphophonological changes occur at the character level. So this feature is important for identifying where and what type of morphophonological changes take place in the word.
Character Tags: Every character is given a tag based on its type of sound (consonant/short vowel/long vowel). This is important for capturing the types of vowel/consonant clusters that occur during morphophonological changes.

4.2 Word Generation

This stage deals mainly with the generation of meaningful words from the segments obtained in the segmentation stage. The input to this stage is the segments of the compound word and the output is the corresponding meaningful words.

Example:
Input: ‘pUj’-‘ayyAkA’
Output: ‘pUja’ (prayer) ‘ayyAkA’ ((after having) finished)

Word generation has two components:
─ Class label assignment
─ Word Formation

4.2.1 Class label assignment

The morphophonological changes of addition or deletion of characters in sandhi are finite and form a class space. From the collected training data, 41 such classes were extracted automatically for Telugu and 49 for Malayalam.

Input: ‘baMXuvoVkaru’ (one relative)
Segmentation: ‘baMXuv-oVkaru’
Class label assignment: ‘baMXuv’ + _u → ‘baMXuvu’ (relative); ‘oVkaru’ + NULL → ‘oVkaru’ (one)

In the above example the segments are ‘baMXuv’ and ‘oVkaru’. ‘baMXuv’ becomes meaningful if ‘u’ is added at the end, and ‘oVkaru’ is itself a meaningful word. So these two segments fall into the ‘_u’ and ‘NULL’ classes respectively.
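One way the class labels could be extracted automatically from annotated data is sketched below. This is an assumption-laden illustration: `derive_label` is a hypothetical helper, and it covers only labels that add characters at the end of a segment (such as ‘_u’) plus ‘NULL’. The paper also mentions deletion classes, whose label format is not given, so they are deliberately left out.

```python
from collections import Counter


def derive_label(segment, word):
    """Label a (segment, word) pair by the characters added at the end."""
    if word == segment:
        return "NULL"
    if word.startswith(segment):
        return "_" + word[len(segment):]
    # Deletions and other changes are outside this sketch.
    raise ValueError("only end-of-segment additions are covered here")


# (segment, meaningful word) pairs taken from the examples in this paper.
training_pairs = [
    ("baMXuv", "baMXuvu"),
    ("oVkaru", "oVkaru"),
    ("man", "mana"),
    ("man", "manaM"),
    ("ixxariki", "ixxariki"),
]

# The set of distinct labels is the class space; counts show its skew.
class_counts = Counter(derive_label(s, w) for s, w in training_pairs)
```

Run over a full annotated corpus, the keys of `class_counts` would form the class inventory for the language at hand.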

Having generated these classes, we prepare the training data for this stage with segments and class labels. The CRF was trained with the following feature set.

Feature Set:
Segments: This feature is important because, in some cases, the segments which precede or follow decide the class label of the current segment. Example:

a. Input: ‘manixxariki’ (for both of us)
   Segmentation: ‘man-ixxariki’
   Class label assignment: ‘man’ + _a → ‘mana’ (our); ‘ixxariki’ + NULL → ‘ixxariki’ (for both)

b. Input: ‘manixxaraM’ (we both)
   Segmentation: ‘man-ixxaraM’
   Class label assignment: ‘man’ + _aM → ‘manaM’ (we); ‘ixxaraM’ + NULL → ‘ixxaraM’ (both)

From this example, the class label for the segment ‘man’ is decided by the segment that follows it.
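The features for this stage (neighbouring segments, plus the prefix and suffix characters described next) might be assembled as in the sketch below. It follows the ‘n-k’ template notation explained with Table 2, with defaults matching template 4 there. The function names and feature keys are hypothetical, and the code is an illustration rather than the authors' feature extractor.

```python
def affixes(segment, k):
    """Prefixes and suffixes of lengths 1..k, clipped to the segment length."""
    feats = {}
    for n in range(1, min(k, len(segment)) + 1):
        feats[f"pre{n}"] = segment[:n]
        feats[f"suf{n}"] = segment[-n:]
    return feats


def segment_features(segments, i, prev=(2, 3), cur=3, nxt=(2, 3)):
    """Features for segments[i]; prev and nxt are (n segments, k characters)."""
    feats = {"seg": segments[i]}
    for key, value in affixes(segments[i], cur).items():
        feats[f"cur_{key}"] = value
    for name, step, (n, k) in (("prev", -1, prev), ("next", 1, nxt)):
        for d in range(1, n + 1):
            j = i + step * d
            if 0 <= j < len(segments):
                feats[f"{name}{d}_seg"] = segments[j]
                for key, value in affixes(segments[j], k).items():
                    feats[f"{name}{d}_{key}"] = value
    return feats


# The two 'man' segments differ only in the features of the next segment.
f_a = segment_features(["man", "ixxariki"], 0)
f_b = segment_features(["man", "ixxaraM"], 0)
```

Here `f_a` and `f_b` share every feature of the current segment and differ only in the `next1_*` entries, which is the information needed to choose between ‘_a’ and ‘_aM’.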

Prefix & suffix characters: The prefix and suffix of a segment play an important role in deciding the class label, as can be seen in Table 2.

4.2.2 Word Formation

This step deals with generating a meaningful word from a segment using the information from the class label. Segments that have the same class label adopt the same method of word formation. Continuing the example ‘baMXuvoVkaru’ discussed in Section 4.2.1, we have

Word formation: ‘baMXuvoVkaru’ (one relative) → ‘baMXuvu’ (relative) + ‘oVkaru’ (one)

We add ‘u’ at the end of ‘baMXuv’ and the resulting word is ‘baMXuvu’. In the case of ‘oVkaru’, as its class label is ‘NULL’, no change is made. So ‘baMXuvoVkaru’ is split into the two meaningful words ‘baMXuvu’ and ‘oVkaru’.
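The word formation step for the two label types shown in this paper can be sketched as follows. This is a minimal illustration with hypothetical function names: a label of the form ‘_x’ is assumed to mean "append x to the segment", ‘NULL’ leaves the segment unchanged, and any other label type (for example deletion classes) is rejected because its format is not specified here.

```python
def form_word(segment, label):
    """Apply a class label to a segment to obtain a meaningful word."""
    if label == "NULL":
        return segment
    if label.startswith("_"):
        return segment + label[1:]  # append the characters after '_'
    raise ValueError(f"unsupported class label: {label!r}")


def generate_words(segments, labels):
    """Form one word per segment from its predicted class label."""
    if len(segments) != len(labels):
        raise ValueError("need exactly one label per segment")
    return [form_word(s, l) for s, l in zip(segments, labels)]


words = generate_words(["baMXuv", "oVkaru"], ["_u", "NULL"])
```

With the segments and labels of the running example, `words` holds the two words ‘baMXuvu’ and ‘oVkaru’.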

5 Experiments and Results

Our model was trained on 1267 Telugu words. The development and test sets have 800 and 1151 words respectively. The test data contains both words which have sandhi (Split) and words which do not (Non-split). When a CRF is used, a proper feature template has to be chosen for the system to perform accurately. Table 1 and Table 2 show the results when the model is trained with different feature templates for the Segmentation and Class label assignment stages respectively. From Table 1 and Table 2, we can observe that template 4 gives the best F-measure in Segmentation and the best word accuracy in Class label assignment.

Table 1. Results of different feature templates in the Segmentation task on development data.

Template  C  T  C&T  Precision  Recall  F-Measure
1         2  0  No   97.14      95.33   96.22
2         2  2  Yes  97.14      94.07   95.58
3         3  0  Yes  97         95.51   96.25
4         3  3  Yes  96.94      96.09   96.51
5         3  2  No   97.07      95.32   96.19
6         3  1  No   97.15      95.39   96.26
7         2  1  No   97.34      95.14   96.23

─ If ‘C’ = k, k characters to the left and right of the current character are included as features.
─ If ‘T’ = k, k tags to the left and right of the current tag are included.
─ An example of ‘C&T’ is C-1/T-1, which means the previous character and the previous tag are included.

Table 2. Results for different feature templates in the Class label assignment task on development data.

Template  Pw-C  C  Nw-C  Word Accuracy
1         1-0   3  1-0   96.61
2         1-2   3  1-2   96.87
3         2-2   3  2-2   96.69
4         2-3   3  2-3   96.95
5         1-3   3  1-3   96.61

─ If ‘Pw-C’ is ‘n-k’, the n previous segments are included, along with k, k-1, ..., 1 character/s from the beginning (prefix) and ending (suffix) of the corresponding segments.
─ If ‘Nw-C’ is ‘n-k’, the n next segments are included, along with k, k-1, ..., 1 character/s from the beginning and ending of the corresponding segments.
─ If ‘C’ is k, then k, k-1, ..., 1 character/s from the beginning and ending of the current segment are considered as features.
─ ‘Word Accuracy’ gives the percentage of words which were correctly class labelled.

After feature template selection for the two stages, the same templates were used for testing on Malayalam. The overall system gave 89.07% accuracy when tested on the Telugu data and 90.5% on the Malayalam data.

Table 3. Overall accuracy of the system on Telugu and Malayalam test data.

Language   #Train  #Test  #Split in Test  #Non-split in Test  Accuracy
Telugu     1267    1151   286             865                 89.07
Malayalam  1926    1000   260             740                 90.50
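An overall accuracy of this kind could be computed as in the sketch below. The scoring rule is an assumption, since the paper does not spell it out: a test word counts as correct only if the predicted list of words matches the gold list exactly, and a non-split word is represented as a one-element list. The function name and the four toy items are illustrative and are not the paper's test data.

```python
def split_accuracy(predicted, gold):
    """Percentage of test words whose predicted split equals the gold split."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("need one prediction per non-empty gold item")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)


# Toy items only: three split words and one non-split word.
gold = [
    ["rAmudu", "eVkkada"],
    ["vixyAlayaM"],
    ["baMXuvu", "oVkaru"],
    ["pUja", "ayyAkA"],
]
predicted = [
    ["rAmudu", "eVkkada"],
    ["vixyAlayaM"],
    ["baMXuv", "oVkaru"],  # word formation missed the final 'u'
    ["pUja", "ayyAkA"],
]
accuracy = split_accuracy(predicted, gold)
```

Under this rule an error in either stage, a wrong boundary or a wrong class label, makes the whole word count as incorrect.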

Comparison with other systems: We compare our system with a few existing statistical and hybrid sandhi splitting systems for the Telugu and Malayalam languages.

─ As mentioned in Section 2, [11] gives an accuracy of 91.1%, whereas our system gives 90.5% on the same Malayalam dataset. Though the difference in accuracy is very small, our system, unlike theirs, can be easily adapted to any language as it is purely statistical.
─ We could not compare our system with [9] as the dataset they used is unavailable. Compared to their system, our system is dynamic: it can handle OOV words because the features are not defined solely with respect to the vocabulary but are also derived from the characters in the word/s.

Due to the unavailability of parallel corpora of sandhi-split words for other agglutinative languages, we could not test on languages other than Telugu and Malayalam. However, we are confident that the approach will also work for other agglutinative languages, on the basis of their common typological features.

6 Conclusion

We have presented our efforts in building a statistical sandhi splitter. Our model is language independent, mainly because the class labels are extracted automatically. Even though we have handled sandhi in only one type of compound word, our model can be readily adapted to other types as well. The model has been tested on Telugu and Malayalam. Through this work, we have also prepared a standard dataset for the Telugu language, consisting of a corpus of agglutinated words and its parallel corpus of sandhi-split words. Showing the impact of sandhi splitting on NLP applications such as machine translation, parsers and dialogue systems is part of our immediate future work. As discussed in Section 3, a few words require contextual information to split and a few words show dialectal influence. We plan to extend our model in future by taking contextual information and non-standard language into consideration.

References

1. Kolachina, S., Sharma, D. M., Gadde, P., Vijay, M., Sangal, R., and Bharati, A. (2011). External sandhi and its relevance to syntactic treebanking. Polibits, (43):67–74.
2. Bharati, A., et al. (1995). Natural language processing: a Paninian perspective, ch. 3. Prentice-Hall of India, New Delhi.
3. Lafferty, J., McCallum, A., and Pereira, F. C. (2001). Conditional random fields: Probabilistic models for segmenting and labelling sequence data.
4. Nguyen, C.-T., Nguyen, T.-K., Phan, X.-H., Nguyen, L.-M., and Ha, Q.-T. (2006). Vietnamese word segmentation with CRFs and SVMs: An investigation. In Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation (PACLIC 2006).
5. Peng, F., Feng, F., and McCallum, A. (2004). Chinese segmentation and new word detection using conditional random fields. In Proceedings of the 20th International Conference on Computational Linguistics, page 562. Association for Computational Linguistics.
6. Xue, N. et al. (2003). Chinese word segmentation as character tagging. Computational Linguistics and Chinese Language Processing, 8(1):29–48.
7. Nair, L. R. and Peter, S. D. (2011). Development of a rule based learning system for splitting compound words in Malayalam language. In Recent Advances in Intelligent Computational Systems (RAICS), 2011 IEEE, pages 751–755. IEEE.
8. Joshi Shripad, S. (2012). Sandhi splitting of Marathi compound words. Int. J. on Adv. Computer Theory and Engg, 2(2).
9. Vempaty, P. C. and Nagalla, S. C. P. (2011). Automatic sandhi spliting method for Telugu, an Indian language. Procedia-Social and Behavioral Sciences, 27:218–225.
10. Natarajan, A. and Charniak, E. (2011). S3 - statistical saṃdhi splitting.
11. Devadath V V, Litton J Kurisinkel, D. M. S. V. V. (2014). A sandhi splitter for Malayalam. (Accepted but yet to be published in the proceedings of ICON 2014.)
12. Krishnamurti, Bh. (1985). A grammar of modern Telugu. Oxford University Press, New York, Toronto.
