Identifying Geographic Terms within Natural Language Text

2008

The huge amount of textual data available in digital form today increases the need for methods that make such data easier to access and navigate. Automatic extraction of keywords from text bodies is one promising approach. However, the relevance of a keyword is context dependent, and extracting relevant keywords often requires semantic analysis, simply because words may have different meanings in different contexts. It is well known that resolving such word sense ambiguity automatically can be very challenging. When the topic of interest is geographic information, the important keywords are geographic terms such as the names of countries, cities, counties and states.

This thesis presents a probabilistic method for automatic identification of geographic terms within natural language text. The method uses a database of geographic terms to identify possible geographic entities. In contrast to the state of the art, we resolve semantic ambiguity by using a Bayesian classifier that takes the context of ambiguous words into account. In our empirical results, we report a geographic term identification accuracy of 90%. We thus believe that the approach we present can be valuable to those working in text analysis and data mining when accurate geographic term identification is required.
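The two-stage idea described above (a lookup against a database of geographic terms, followed by a Bayesian classifier over the surrounding words) can be illustrated with a minimal sketch. This is not the thesis's implementation: the tiny gazetteer, the hand-made training examples, the bag-of-words naive Bayes model with add-one smoothing, the two-word context window, and all function names are assumptions made purely for illustration.

```python
import math
from collections import Counter

# Hypothetical gazetteer standing in for the database of geographic terms.
GAZETTEER = {"paris", "georgia", "washington", "norway"}

# Hypothetical labelled contexts: (words around a candidate, is_geographic).
TRAINING = [
    (["flew", "to", "in", "france"], True),
    (["capital", "city", "of", "state"], True),
    (["lives", "in", "near", "city"], True),
    (["mr", "said", "yesterday"], False),
    (["president", "said", "he"], False),
    (["actress", "hilton", "said"], False),
]


def train(examples):
    """Count context words per class for a naive Bayes model."""
    word_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for words, label in examples:
        class_counts[label] += 1
        word_counts[label].update(words)
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, class_counts, vocab


def log_posterior(context, label, model):
    """Unnormalised log P(label | context) with add-one smoothing."""
    word_counts, class_counts, vocab = model
    score = math.log(class_counts[label] / sum(class_counts.values()))
    denom = sum(word_counts[label].values()) + len(vocab)
    for word in context:
        if word in vocab:  # words never seen in training are ignored
            score += math.log((word_counts[label][word] + 1) / denom)
    return score


def find_geographic_terms(text, model, window=2):
    """Return gazetteer hits whose context is classified as geographic."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    found = []
    for i, token in enumerate(tokens):
        if token not in GAZETTEER:
            continue  # stage 1: only gazetteer entries are candidates
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        # stage 2: keep the candidate if the geographic class scores higher
        if log_posterior(context, True, model) > log_posterior(context, False, model):
            found.append(token)
    return found


model = train(TRAINING)
print(find_geographic_terms("She flew to Paris in May.", model))
print(find_geographic_terms("Mr Washington said yesterday", model))
```

With this toy data, "Paris" is kept because its neighbours resemble the geographic training contexts, while "Washington" after "Mr" is rejected. The sketch handles single-word names only; multi-word place names would need a different candidate lookup.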

Master's thesis in information and communication technology 2008 – Universitetet i Agder, Grimstad

Universitetet i Agder / Agder University