Ontology guided financial knowledge extraction from semistructured information sources

Bjoraa, Eivind

Master thesis

Open

master_ikt_2003_bjoraa.pdf (1.067Mb)

Permanent link

http://hdl.handle.net/11250/137543

Date of publication

2003

Collections

Master's theses in Information and Communication Technology [491]

Abstract

Intermedium has an agent that searches the Web for financial articles matching certain criteria, for instance an industrial domain of interest. A portal service for reading and searching these articles is available to customers. The sources searched are secondary sources, such as online newspapers. Secondary sources publish information more frequently, and they publish information that cannot be found in annual reports and similar documents, such as predictions. Finding financial figures in the articles is often time consuming, and the figures are hard to compare with each other. Presenting the financial figures, and what they apply to, in an application where the information can be easily reviewed and compared would supply valuable information to decision makers in larger companies.

Web documents are usually semi-structured and are therefore almost impossible to query for information. Because computers lack an understanding of the content, only keyword searches are supported. More advanced extraction processes need to be performed on the information. This thesis evaluates an ontology-guided approach for extracting financial information from semi-structured information sources.

A financial ontology has been constructed based on an investigation of 50 articles gathered by Intermedium’s agent. Instances with synonyms (the words to extract from the text) and relations between the instances have been defined. RDF has been chosen as the ontology language and is used throughout the thesis.

A prototype application has been developed to perform the extraction process. Articles are loaded from XML files; the words to extract from the text are found by querying the ontology with the query language RDQL; NLP and NLTK are used to do the extraction based on the words found in the ontology; and Velocity templates are used to give the output files, an RDF document and an XBRL instance document, the proper structure. The ontology provides the application with knowledge during the extraction process. When a synonym of one instance is found in the text, a query for references to other instances is performed, and the synonyms of these instances are then searched for in the text. If a text does not contain any interesting information, the application does not waste time trying to match all words in the ontology against the words in the text.
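The ontology-guided lookup described above can be illustrated with a minimal sketch. It is not the thesis prototype: the toy ontology, the instance names, the idea of a fixed list of entry instances, and the function names are all assumptions made for illustration, and a plain Python dictionary stands in for the RDF ontology and its RDQL queries.

```python
import re

# Hypothetical toy ontology (an assumption, not taken from the thesis):
# each instance has synonyms to look for and references to related instances.
ONTOLOGY = {
    "Company": {"synonyms": ["company", "corporation", "group"],
                "related": ["Revenue", "Profit"]},
    "Revenue": {"synonyms": ["revenue", "turnover", "sales"], "related": []},
    "Profit": {"synonyms": ["profit", "net income", "earnings"], "related": []},
}

# Assumption: the search starts from a small set of entry instances.
ENTRY_INSTANCES = ["Company"]


def find_synonyms(instance, text):
    """Return the synonyms of `instance` that occur in `text` as whole words."""
    found = []
    for synonym in ONTOLOGY[instance]["synonyms"]:
        pattern = r"\b" + re.escape(synonym) + r"\b"
        if re.search(pattern, text, re.IGNORECASE):
            found.append(synonym)
    return found


def extract(text):
    """Search entry instances first; follow references only from instances that match."""
    matches = {}
    visited = set()
    queue = list(ENTRY_INSTANCES)
    while queue:
        instance = queue.pop(0)
        if instance in visited:
            continue
        visited.add(instance)
        found = find_synonyms(instance, text)
        if found:
            matches[instance] = found
            # Only a matched instance causes its related instances to be searched.
            queue.extend(ONTOLOGY[instance]["related"])
    return matches


article = "The group reported a turnover of 12 million and higher earnings."
unrelated = "Turnover in the football squad was high this season."
print(extract(article))
print(extract(unrelated))
```

In this sketch the second text yields no matches even though it contains the word "turnover", because no entry instance matched and the related instances were therefore never searched. This mirrors the point that a text without interesting information is not matched against every word in the ontology.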

The result is presented with semantic tagging in RDF syntax. Part of the extracted information is also shown as an example of how it can be expressed in the financial standard XBRL. The advantage of XBRL is that it can be used directly by supporting tools, whereas RDF has to be processed by a more intelligent application. In both formats, the financial information has been enriched with knowledge through computer-processable semantic tagging.

Description

Master's thesis in information and communication technology 2003 - Høgskolen i Agder, Grimstad

Publisher

Høgskolen i Agder
Agder University College