Using Statistical Features to Find Phrasal Terms in Text Collections

Authors

  • Andre Luiz da Costa Carvalho, Universidade Federal do Amazonas
  • Edleno Silva de Moura
  • Pável Calado

Keywords

phrasal terms, phrase queries

Abstract

In this work we investigate alternatives for automatically detecting phrasal terms, defined here as phrasal verbs, phrasal nouns, phrasal adjectives, or phrasal adverbs found in a text. The automatic identification of phrasal terms may have several applications in text processing systems. We present a novel approach for detecting phrasal terms in a collection of documents. Our solution is based on machine learning and uses statistical features of the word n-grams found in the documents. We also investigate the impact of adding phrasal terms to the retrieval model of a search engine when processing queries on several data sets. Our results show that we are able to discover valid phrasal terms with a small error rate, achieving detection results ranging from 70% to 94% in terms of F1. Furthermore, the discovered phrasal terms, when used to enhance search tasks, improve retrieval performance by up to 11% in terms of MAP when considering all queries, and by up to 36% in terms of MAP when considering only the queries that contain the detected phrasal terms.
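
To make the idea of "statistical features of the word n-grams" concrete, the following is a minimal sketch, not the authors' feature set. It assumes two illustrative features per bigram, raw frequency and pointwise mutual information (PMI), neither of which the abstract names. The function name `bigram_features` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def bigram_features(docs):
    """Return {bigram: {"count": int, "pmi": float}} for tokenized documents.

    Both features are assumptions chosen for illustration; the abstract
    does not state which statistics the authors actually use.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in docs:
        unigrams.update(tokens)
        # Adjacent word pairs within a single document.
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    features = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # PMI compares the observed pair probability with the probability
        # expected if the two words occurred independently.
        pmi = math.log2(p_pair / (p_w1 * p_w2))
        features[(w1, w2)] = {"count": count, "pmi": pmi}
    return features


# Toy collection (hypothetical data).
docs = [
    "please give up the search".split(),
    "they give up too early".split(),
    "give the report to them".split(),
]
feats = bigram_features(docs)
```

In a setup like the one the abstract describes, feature vectors of this kind would be passed to a trained classifier that labels each candidate n-gram as a phrasal term or not. In this toy collection, the candidate `("give", "up")` receives a higher PMI than the incidental pair `("give", "the")`.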

Published

2010-09-10

Section

Regular Articles