Towards Detecting Textual Plagiarism Using Machine Learning Methods

2015

Textual plagiarism is the act of passing off someone else’s text as one’s own.
State-of-the-art plagiarism detection performs well, but it often relies on a
series of manually determined metric thresholds to decide whether an author has
plagiarized. These thresholds are optimized for a single data set and are not
optimal for all situations or forms of plagiarism. Because of their complexity,
the detection methodologies also require a professional familiar with the
algorithms to adjust them properly. Given a pre-classified data set, machine
learning methods allow teachers and examiners without knowledge of the
methodology to use a plagiarism detection tool tailored to their needs.
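As a rough illustration of the idea above, the following minimal sketch learns a decision rule from a pre-classified data set instead of relying on a hand-tuned threshold. It is not the method used in the thesis: the nearest-centroid classifier, the two feature names (`ngram_overlap`, `longest_match_ratio`), and all numeric values are assumptions chosen only for this example.

```python
from statistics import mean

# Hypothetical pre-classified examples: (ngram_overlap, longest_match_ratio) -> label.
# Feature names and values are illustrative assumptions, not data from the thesis.
TRAINING = [
    ((0.82, 0.60), "plagiarized"),
    ((0.74, 0.55), "plagiarized"),
    ((0.68, 0.40), "plagiarized"),
    ((0.15, 0.05), "original"),
    ((0.22, 0.10), "original"),
    ((0.30, 0.08), "original"),
]


def fit_centroids(examples):
    """Average the feature vectors of each class (nearest-centroid training)."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in by_label.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(features, centroids[label]))
    return min(centroids, key=distance)


# The decision boundary comes from the labeled data; no threshold is set by hand.
centroids = fit_centroids(TRAINING)
print(predict(centroids, (0.79, 0.50)))
print(predict(centroids, (0.20, 0.07)))
```

Retraining on a different labeled collection moves the boundary automatically, which is the property the abstract attributes to machine learning methods. A real system would use richer features and a stronger classifier.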

This thesis demonstrates that a machine learning methodology, with no need to
set thresholds, can match, and in some cases surpass, the top methodologies in
the current state of the art. With more work, future methodologies may
outperform both the best commercial and the best freely available methodologies.

Master's thesis in information and communication technology - University of Agder, 2015

Universitetet i Agder ; University of Agder