Ontology guided financial knowledge extraction from semistructured information sources
Abstract
Intermedium has an agent that searches the Web for financial articles matching certain
criteria, for instance an industrial domain of interest. A portal service for reading and
searching these articles is available to customers. The sources searched are secondary
sources, such as online newspapers. Secondary sources publish information more
frequently than annual reports and the like, and they carry information not found
there, such as predictions. Finding financial figures in the articles is often
time-consuming, and the figures are hard to compare with each other. Presenting the
financial figures, and what they apply to, in an application where the information can
be easily reviewed and compared would provide valuable information to decision makers
in larger companies.
Web documents are usually semi-structured and therefore almost impossible to query for
information. Because computers lack an understanding of the content, only keyword
searches are supported, so more advanced extraction of the information is needed.
This thesis evaluates an ontology-guided approach to extracting financial
information from semi-structured information sources.
A financial ontology has been constructed based on an investigation of 50 articles
gathered by Intermedium’s agent. Instances with synonyms (the words to extract from
the text) and relations between the instances have been defined. RDF has been chosen
as the ontology language and is used throughout the thesis.
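
As a rough illustration of what such an ontology might contain, the sketch below stores instances, synonyms, and relations as subject-predicate-object triples, the data model underlying RDF. The instance names (`Revenue`, `Profit`), the predicates (`hasSynonym`, `relatedTo`), and the helper `match` are assumptions made for illustration; none of them are taken from the thesis's ontology.

```python
# Hypothetical mini-ontology as (subject, predicate, object) triples.
# All names are illustrative; none come from the thesis's ontology.
TRIPLES = [
    ("Revenue", "hasSynonym", "revenue"),
    ("Revenue", "hasSynonym", "turnover"),
    ("Revenue", "hasSynonym", "sales"),
    ("Profit", "hasSynonym", "profit"),
    ("Profit", "hasSynonym", "net income"),
    ("Revenue", "relatedTo", "Profit"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


# Words to look for in the text when searching for the Revenue instance.
revenue_synonyms = [o for (_, _, o) in match(TRIPLES, "Revenue", "hasSynonym")]
# Instances that the Revenue instance refers to.
revenue_related = [o for (_, _, o) in match(TRIPLES, "Revenue", "relatedTo")]
```

The wildcard pattern matching here only loosely mirrors what a triple-pattern query language does; it is not RDQL syntax.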
A prototype application has been developed to perform the extraction process. Articles
are loaded from XML files; the words to extract from the text are found by querying the
ontology with the query language RDQL; NLP techniques and NLTK are used to perform the
extraction based on the words found in the ontology; a Velocity template is used to give
the output files, RDF and an XBRL instance document, the proper structure. The ontology
provides the application with knowledge during the extraction process. When a synonym
of one instance is found in the text, the ontology is queried for references to other
instances, and synonyms of those instances are then searched for in the text. If a text
does not contain any interesting information, the application does not waste time
trying to match every word in the ontology against the text.
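
The guided search described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the prototype itself: a plain Python dictionary stands in for the RDF ontology and its RDQL queries, simple case-insensitive whole-word matching stands in for the NLTK-based extraction, and the instance names, synonyms, the `ENTRY_POINTS` list, and the function names `find_synonym` and `extract` are all hypothetical.

```python
import re

# Hypothetical ontology: instance -> synonyms and references to other instances.
ONTOLOGY = {
    "Revenue": {"synonyms": ["revenue", "turnover"], "related": ["Profit", "Period"]},
    "Profit": {"synonyms": ["profit", "net income"], "related": ["Period"]},
    "Period": {"synonyms": ["quarter", "fiscal year"], "related": []},
}
# Assumption: only these instances are searched for unconditionally.
ENTRY_POINTS = ["Revenue"]


def find_synonym(text, synonyms):
    """Return the first synonym found in text as a whole word or phrase, else None."""
    for synonym in synonyms:
        if re.search(r"\b" + re.escape(synonym) + r"\b", text, re.IGNORECASE):
            return synonym
    return None


def extract(text):
    """Follow references from matched instances only; return {instance: synonym}."""
    found = {}
    visited = set()
    queue = list(ENTRY_POINTS)
    while queue:
        instance = queue.pop(0)
        if instance in visited:
            continue
        visited.add(instance)
        hit = find_synonym(text, ONTOLOGY[instance]["synonyms"])
        if hit is not None:
            found[instance] = hit
            # Only a match triggers the lookup of referenced instances.
            queue.extend(ONTOLOGY[instance]["related"])
    return found
```

In this sketch, `extract("Turnover rose in the third quarter.")` matches `Revenue` and then `Period`, while a text that contains none of the entry-point synonyms is rejected after checking only those few words, even if it happens to contain a synonym of a referenced instance.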
The result is presented with semantic tagging in RDF syntax. Part of the extracted
information is also given in the financial standard XBRL, as an example of how such
output can be produced. The advantage of XBRL is that it can be used directly by
supporting tools, whereas RDF has to be processed by a more intelligent application.
In both formats, the financial information has been enriched with
computer-processable semantic tagging.
Description
Master's thesis in information and communication technology 2003 - Høgskolen i Agder, Grimstad
Publisher
Høgskolen i Agder / Agder University College