CMM539 Coursework 2019

Author: Abimbola Adeniran
Course: Information Retrieval Systems
Institution: The Robert Gordon University




CMM539 Information Retrieval Systems: Coursework 1

Hand-out date: 16th September 2019
Hand-in date: 1st November 2019 @ 5 p.m.

You will need to submit the following:

- Subtask 1A: a description of the Search Engine's indexing process (~300 words);
- Subtask 2A: a comparison of the inverted index with and without stemming (~300 words);
- Subtask 2B: a comparison of retrieval effectiveness for the specified queries (~750 words);
- Subtask 3A: precision and recall calculations for the specified queries with the specified weighting functions (tables + graphs); and
- Subtask 3B: a discussion of the topic retrieval and weighting function retrieval results for the specified queries (~750 words).

Note: The suggested word counts are approximate and do not include any diagrams, tables or graphs you are asked, or may wish, to include.

All coursework must be submitted with the front cover sheet provided for this coursework. You must submit a digital copy of your work to Moodle in the relevant submission area. Please let the School Office know if you are having any difficulties.

The grade for this coursework contributes 50% to the overall assessment of this module. You will be given initial general feedback, as an outline solution, one week after submission. A provisional grade with individual feedback will be made available on Moodle within 20 working days of submission.

You will have been made aware of the University regulations on plagiarism. Any student found guilty of submitting plagiarised work will be referred to the appropriate panel for academic misconduct.

Objectives

The objectives of this coursework are:

- to understand the architecture of an IR search engine;
- to understand the process of stemming and its effect on (a) the size of the inverted file, and (b) retrieval effectiveness;
- to evaluate the retrieval effectiveness of three document weighting functions (tf, idf, and tf*idf); and
- to write a short report on the results of your investigations.

You will be supplied with a Search Engine (a simple IR application) developed in Java that runs locally on the desktop machines. You will evaluate retrievals made with the application for the specified queries under different IR system architectures. These retrievals will be discussed in detail and mostly undertaken during scheduled labs. The challenge is to understand the alternative IR system architectures, and then to analyse and explain the results of your retrieval experiments.

Approach

You will be provided with the Search Engine, a collection of text documents relating to Information Retrieval (Documents), and a second, very small collection of just four documents relating to "cats and dogs" (IRDocuments). We will work on exercises relevant to this coursework in laboratories 2-4, tackling each of the above objectives week by week. We will work through the initial part of each task together in the related lab; you can then complete the assessed subtasks on your own.

Task 1: Describe the Search Engine's Indexing Process (1 grade)

In this section of the coursework you should briefly describe the key tasks that the supplied IR application performs when it indexes a set of new documents. Use examples from the small document collection (IRDocuments) to explain your answer.
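To see what such a description should cover, the core indexing step can be sketched as a toy example. This is a hypothetical illustration in Java, not the supplied Search Engine's actual code; the class and method names are invented:

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy indexer: tokenize each document, lower-case the terms,
// and record a posting (docId -> term frequency) for every distinct term.
public class ToyIndexer {

    // Returns: term -> (docId -> frequency of the term in that document)
    static Map<String, Map<Integer, Integer>> buildIndex(String[] docs) {
        Map<String, Map<Integer, Integer>> index = new TreeMap<>();
        for (int docId = 0; docId < docs.length; docId++) {
            // Crude tokenization on non-word characters
            for (String token : docs[docId].toLowerCase().split("\\W+")) {
                if (token.isEmpty()) continue;
                index.computeIfAbsent(token, t -> new TreeMap<>())
                     .merge(docId, 1, Integer::sum);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        String[] docs = {"The black dog sat", "A dog and a cat"};
        Map<String, Map<Integer, Integer>> index = buildIndex(docs);
        // "dog" occurs once in each document, so its posting list has two entries
        System.out.println(index.get("dog")); // prints {0=1, 1=1}
    }
}
```

A real engine adds stop-word removal and (optionally) stemming between tokenization and posting creation, which is exactly where the IRDocuments examples are useful.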

Task 2: Experiments with Stemming

The IR search engine you have been given can stem document and query words using Porter's stemming algorithm. You can choose parameters to switch stemming on or off. You can also select parameters to display the generated index and to report the following statistics:

- number of documents;
- total number of index terms extracted from the documents;
- number of posting lists, i.e. the number of different index terms;
- total number of postings, i.e. entries in the posting lists; and
- average number of postings per posting list.

Subtask 2A: Compare the Inverted Index with and without Stemming (1 grade)

Compare and discuss the space efficiency of the inverted file when stemming is used compared with when it is not. You must include a printout of the inverted file statistics (a) with stemming on, and (b) with stemming off, in order to make this comparison. You may wish to use specific examples to support your discussion.

Subtask 2B: Compare Retrieval Effectiveness for the Specified Queries (3 grades)

In these experiments, you will compare and discuss in detail the retrieval effectiveness of the search engine with and without stemming. Use the default weighting scheme, namely coordination-level matching. For each topic/query given in the next table, you should carefully judge the relevance of the top 10 retrieved documents, and report Precision with 5 documents retrieved (P@5) and Precision with 10 documents retrieved (P@10). Also record the total number of documents retrieved. For each topic, compare the effectiveness of the system for the three queries, and explain any observed differences in the results.
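Both precision measures follow mechanically from the relevance judgements of the ranked list. An illustrative helper (hypothetical, not part of the Search Engine):

```java
// Illustrative P@k computation for Subtask 2B.
public class PrecisionAtK {

    // relevant[i] is the relevance judgement of the document at rank i+1
    static double precisionAtK(boolean[] relevant, int k) {
        int hits = 0;
        for (int i = 0; i < k && i < relevant.length; i++) {
            if (relevant[i]) hits++;
        }
        // Ranks at which nothing was retrieved count as non-relevant
        return (double) hits / k;
    }

    public static void main(String[] args) {
        // Top-10 judgements for one query: R N R R N N R N N N
        boolean[] rel = {true, false, true, true, false,
                         false, true, false, false, false};
        System.out.println("P@5  = " + precisionAtK(rel, 5));  // 3/5  = 0.6
        System.out.println("P@10 = " + precisionAtK(rel, 10)); // 4/10 = 0.4
    }
}
```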


Based on the results, discuss whether stemming is a recall- or a precision-enhancing device.

Hint: Set up a table in a Word document to record the retrieved documents and your relevance judgements.
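To see why stemming affects recall and precision at all, consider how it conflates word forms. The sketch below is a deliberately crude suffix-stripper, invented for illustration only; Porter's actual algorithm applies a much richer, condition-guarded rule set:

```java
// Crude suffix-stripper for illustration only -- NOT Porter's algorithm.
public class CrudeStemmer {

    static String stem(String word) {
        String w = word.toLowerCase();
        for (String suffix : new String[]{"ing", "es", "ed", "s"}) {
            // Only strip when a reasonably long stem remains
            if (w.endsWith(suffix) && w.length() > suffix.length() + 2) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    public static void main(String[] args) {
        // Without stemming these are three distinct index terms;
        // with stemming they collapse into one posting list.
        System.out.println(stem("retrieves"));  // retriev
        System.out.println(stem("retrieved"));  // retriev
        System.out.println(stem("retrieving")); // retriev
    }
}
```

Because a stemmed query term now matches every document containing any of the conflated forms, more documents are retrieved; whether the extra matches are relevant is exactly the recall-versus-precision question posed above.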

Topics for judgement of relevance, with the queries to run:

Topic 1: Find papers that try to understand or evaluate user satisfaction in IR queries or systems.
- With stemming: query "Understanding User Satisfaction"
- Without stemming: queries "Understanding User Satisfaction" and "Understand User Satisfaction"

Topic 2: Find papers that make use of visualization techniques to display retrieved documents.
- With stemming: query "Visualization of retrieval results"
- Without stemming: queries "Visualization of retrieval results" and "Visualization of retrieval result"

Task 3: Experiments with tf, idf and tf*idf Document Weighting Schemes

Three additional weighting schemes have been implemented in the Search Engine:

- tf weighting, namely tf_i ($2 function);
- idf weighting, namely log(N/n_i) ($3 function); and
- tf*idf weighting, namely tf_i * log(N/n_i) ($4 function);

where tf_i is the frequency of the ith term in the document, n_i is the number of documents to which the ith term is assigned, and N is the total number of documents.

In the query input box, you indicate the use of a particular weighting function by preceding the query with $z, where 'z' is the number of the function. For example, to use tf weighting with the query "black dog sat", enter:

$2 black dog sat

Subtask 3A: Calculate Precision and Recall for the Specified Queries with the Three Weighting Functions (2 grades)

In these experiments, you will compare the retrieval effectiveness of the search engine using the three different weighting functions. Use stemming when indexing the document collection. For each weighting function (tf, idf, and tf*idf), perform the following process. For each topic/query given in the following table, retrieve documents using the current weighting function, and carefully judge the relevance of the top 10 retrieved documents. We will assume for this exercise that all the relevant documents have been located; this is probably incorrect, but it allows us to compute Recall figures.
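The three weighting functions can be written out directly from the definitions above (a sketch; the class and method names are assumptions, not the Search Engine's internals):

```java
// The three document weighting functions from the handout definitions:
// tf_i, log(N/n_i), and their product.
public class Weights {

    static double tf(int tfi) {
        return tfi; // raw within-document term frequency
    }

    static double idf(int N, int ni) {
        return Math.log((double) N / ni); // rarer terms score higher
    }

    static double tfIdf(int tfi, int N, int ni) {
        return tfi * idf(N, ni);
    }

    public static void main(String[] args) {
        // A term occurring 3 times in a document, in 10 of 100 documents overall
        System.out.println("tf     = " + tf(3));
        System.out.println("idf    = " + idf(100, 10));      // log(10) ~ 2.303
        System.out.println("tf*idf = " + tfIdf(3, 100, 10)); // ~ 6.908
    }
}
```

Note how idf rewards selective terms: a term appearing in every document gets idf = log(N/N) = 0 and contributes nothing to the tf*idf score.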


[Note: We will discuss these relevance judgements as a group during lab 4.]

Topics for relevance judgement, with the queries to run:

Topic 1: Deep learning approaches to information retrieval from text.
- Query: "deep learning text retrieval"

Topic 2: Collaborative-filtering approaches for recommender systems.
- Query: "collaborative filtering recommender system"

Topic 3: Employing user models to support personalized retrieval results.
- Query: "user model personalized retrieval"

For each topic with each weighting scheme:

- compute (Precision, Recall) figures at each relevant document rank; and
- draw the Precision-Recall graph with interpolated Precision at the standard Recall values.

For the three topics:

- draw the averaged Precision-Recall graph for each weighting scheme.

You will be supplied with an Excel spreadsheet in Moodle (AnalysisHandout.xls) which you can use to record relevance judgements, perform the calculations, and draw the R-P graphs.
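Interpolated precision at a recall level r is conventionally taken as the maximum precision observed at any recall greater than or equal to r. A sketch of that rule (a hypothetical helper; the spreadsheet's actual internals are not shown here):

```java
// Interpolated precision: P_interp(r) = max precision at any recall >= r.
public class Interpolation {

    // recall[i], precision[i]: the (R, P) pair at the rank of the
    // i-th relevant document in the ranked retrieval list
    static double interpolatedPrecision(double[] recall, double[] precision, double r) {
        double best = 0.0;
        for (int i = 0; i < recall.length; i++) {
            if (recall[i] >= r && precision[i] > best) best = precision[i];
        }
        return best;
    }

    public static void main(String[] args) {
        // Example: 3 relevant documents found at ranks 1, 3 and 10
        // (assuming 3 relevant documents exist in total)
        double[] recall    = {1.0 / 3, 2.0 / 3, 1.0};
        double[] precision = {1.0,     2.0 / 3, 0.3};
        for (double r = 0.0; r <= 1.0; r += 0.1) {
            System.out.printf("P_interp(%.1f) = %.3f%n",
                              r, interpolatedPrecision(recall, precision, r));
        }
    }
}
```

Interpolating at the standard recall values (0.0, 0.1, ..., 1.0) is what makes the per-topic graphs comparable, and hence averageable, across the three topics.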

Subtask 3B: Discuss Topic Retrieval and Weighting Function Retrieval (3 grades)

For each topic:

- compare the individual topic performances, decide which weighting function is most effective in each case, and discuss the results in detail, providing suitable explanations where possible.

For the weighting schemes:

- compare the average performance of the three weighting schemes, and discuss the results in detail, drawing conclusions about the relative effectiveness of the three weighting functions.
