dc.description.abstract | Many papers have been written on mining data from structured web pages. However,
few, if any, of these papers focus on retrieving specific parts of discussion board postings.
A discussion board page contains a set of postings, which can be considered data-records. Our
goal is to provide insight into a specific approach for identifying the locations of the author, content and
date+time elements, which together make up a complete discussion board posting data-record.
Our approach combines a Naive Bayes pattern classifier, structure classification
and grammar to identify the sought-after elements. We give a thorough evaluation of our Naive
Bayes classifier and its components, in addition to how combinations of the different parts of our
approach affected the overall result.
Our best results for identifying the location of the individual elements were 94% for author, 76%
for content, 86% for date+time and 60% for getting every element of each post correct. While the
result for complete posts is noticeably weaker, it depends heavily on the other results, since a post
counts as correct only when every one of its elements is located correctly.
We believe our approach shows promise and that, with further development and refinement, it will be a
viable method for automatic extraction of data from on-line discussion boards. | en |