MPS: improving exact string matching through pattern character frequency

Sahota, Vijay, Li, M. and Bayford, Richard (ORCID: https://orcid.org/0000-0001-8863-6385) (2013) MPS: improving exact string matching through pattern character frequency. Journal of Data Processing, 3 (3). pp. 127-137. ISSN 2278-6481 [Article]

Abstract

One of the initial hurdles in taking advantage of big data is the ability to quickly analyze it and establish its relevance. Intrinsic to this big data analysis problem is the need for tools that scale well in proportion to the growth of data and have the flexibility to operate across disparate datasets. Exact string matching is a fundamental tool used to search through data, but many existing algorithms do not scale well because their computational cost is incurred, and grows, while the data is being read. This paper presents a proof of concept that transfers all the required calculations to a pre-processing stage, removing all calculations from the reading stage. The pre-processing creates a search trie that incorporates the statistical distribution of characters in the search pattern to reduce the overall calculation and lookups in a search. The resulting algorithm has a total calculation cost that is independent of data (source) size. Preliminary results show the new algorithm outperforming existing general algorithms as the search pattern becomes large for natural English text, and when searching a small-alphabet source (DNA).
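
The abstract does not give the details of MPS, so the following is not the authors' algorithm. It is only a minimal sketch of the general idea of building lookup tables from the pattern before any source data is read, and of letting the pattern's own character frequencies guide the comparison. It swaps in a Horspool-style shift table in place of the paper's search trie. The names preprocess and search, and the rule of comparing the pattern's rarest characters first, are assumptions made for illustration. Unlike the cost claimed for MPS, the cost of this sketch still grows with the size of the source.

```python
from collections import Counter


def preprocess(pattern):
    """Build every lookup table from the pattern alone, before reading data."""
    m = len(pattern)
    counts = Counter(pattern)
    # Assumed heuristic: compare the positions holding the pattern's rarest
    # characters first (ties broken right-to-left).
    order = sorted(range(m), key=lambda i: (counts[pattern[i]], -i))
    # Horspool bad-character shifts: distance from the last occurrence of each
    # character (ignoring the final position) to the end of the pattern.
    shift = {c: m - 1 - i for i, c in enumerate(pattern[:-1])}
    return order, shift


def search(text, pattern):
    """Return the start index of every exact occurrence of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    order, shift = preprocess(pattern)
    hits = []
    pos = 0
    while pos <= n - m:
        # Check the window in the precomputed, frequency-guided order.
        if all(text[pos + i] == pattern[i] for i in order):
            hits.append(pos)
        # Slide by the precomputed shift for the window's last character;
        # characters absent from the table allow a full-length shift.
        pos += shift.get(text[pos + m - 1], m)
    return hits
```

For example, search("GATTACAGATTACA", "TTACA") returns the start positions of both occurrences in a small-alphabet (DNA-like) source. The comparison order and the shift table are computed once per pattern, so the reading loop itself performs only table lookups and character comparisons.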

Item Type: Article
Research Areas: A. > School of Science and Technology > Natural Sciences > Biophysics and Bioengineering group
Item ID: 15515
Depositing User: Richard Bayford
Date Deposited: 28 Apr 2015 15:22
Last Modified: 13 Oct 2016 14:33
URI: https://eprints.mdx.ac.uk/id/eprint/15515
