Last Update:
Preprocessing Phase for C++17's Searchers
Table of Contents
Searchers from C++17 are a new way to perform efficient pattern lookups. The new standard offers three searchers: default_searcher
, boyer_moore_searcher
and boyer_moore_horspool_searcher
. The last two implements algorithms that require some additional preprocessing for the input pattern. Is there a chance to separate preprocessing time from the search time?
Short Reminder
In my last article I’ve introduced searchers that were added into C++17.
Quoting the standard:
Searchers provides function object types for operations that search for a sequence
[pat_first, pat_last)
in another sequence[first, last)
that is provided to the object’s function call operator. The first sequence (the pattern to be searched for) is provided to the object’s constructor, and the second (the sequence to be searched) is provided to the function call operator. Each specialization of a class template specified in this subclause shall meet theCopyConstructible
andCopyAssignable
requirements.
template<class ForwardIterator, class Searcher>
ForwardIterator search( ForwardIterator first, ForwardIterator last,
const Searcher& searcher );
For now we have three searchers:
default_searcher
boyer_moore_searcher
boyer_moore_horspool_searcher
Last time, however, I didn’t correctly summarise what a searcher is. This is because it’s not immediately clear - if you just look at std::search
reference.
The basic idea is that each Searcher wraps the pattern you’d like to search. That means also doing some necessary preprocessing. Later - inside std::search
- each searcher exposes operator()(first, last)
- a way to look for a pattern into the [first, last)
range.
Additional, since searcher is copyable and assignable - you can pass it around in your application.
Since a searcher is a separate object, we might do a little experiment and measure how much time it takes… let’s see.
The Series
This article is part of my series about C++17 Library Utilities. Here’s the list of the other topics that I’ll cover:
- Refactoring with
std::optional
- Using
std::optional
- Error handling and
std::optional
- About
std::variant
- About
std::any
- In place construction for
std::optional
,std::variant
andstd::any
std::string_view
Performance- Speeding up Pattern Searches with Boyer-Moore Algorithm from C++17 - C++ Stories
- Preprocessing Phase for C++17’s Searchers - C++ Stories
- Working with
std::filesystem
- Even more:
Resources about C++17 STL:
- C++17 In Detail by Bartek!
- C++17 - The Complete Guide by Nicolai Josuttis
- C++ Fundamentals Including C++ 17 by Kate Gregory
- Practical C++14 and C++17 Features - by Giovanni Dicanio
- C++17 STL Cookbook by Jacek Galowicz
Demo Application
Source code: github.com/fenbf/articles/cpp17/searchers/searchers.cpp
How the test works:
- the app loads a file, like a book sample - 500kb of text.
- the whole file content is stored in one input string.
- a pattern is selected
- you can look for a string
- or for N characters from the input string (from start, centre or the end)
- the app uses several algorithms and runs each search
ITER
times.
Command line:
searcher.exe file iterations N|string Pos
file - text file to load
iterations - the number of iterations
N|string - number of letters or a given string
Pos - optional parameter when N is specified:
0 - the start of the input string
1 - the centre of the input string
> 1 - end of the input string
For example:
.\searchers.exe ..\..\SampleBooks\book-test.txt 1000 "the town"
The above command will look for “the town” string in the input file “book-test.txt” and will perform 1000 iterations.
Another command:
.\searchers.exe ..\..\SampleBooks\book-test.txt 1000 10 1
This will look for 10 characters from the centre (pos=1)
.
Here’s the code for the boyer_moore_horspool
version:
Searcher Preprocessing
In the first version of the demo application I used code:
RunAndMeasure("boyer_moore_horspool_searcher", [&]() {
for (size_t i = 0; i < ITERS; ++i)
{
auto it = std::search(testString.begin(), testString.end(),
std::boyer_moore_horspool_searcher(
needle.begin(), needle.end()));
if (it == testString.end())
std::cout << "The string " << needle << " not found\n";
}
});
The above code measured the whole search. But now we can split it, or extract the preprocessing phase.
For example:
RunAndMeasure("boyer_moore_searcher init only", [&]() {
for (size_t i = 0; i < ITERS; ++i)
{
std::boyer_moore_searcher b(needle.begin(), needle.end());
DoNotOptimizeAway(&b);
}
return 0;
});
All data structures must be initialised in the constructor of the searcher objects. Later only operator()
is used to perform the search.
Some Performance Results
Here’s what I got from running a few tests.
.\searchers.exe ..\..\SampleBooks\book-test.txt 1000 1000 1
string length: 547412
test iterations: 1000
needle from the center...
pattern length: 1000
string::find: 207.235 ms
default searcher: 336.469 ms
boyer_moore_searcher init only: 4.65379 ms
boyer_moore_searcher: 33.383 ms
boyer_moore_horspool_searcher init only: 0.926099 ms
boyer_moore_horspool_searcher: 31.652 ms
When searching for 1000 letters from the centre of the input string, both of the new algorithms were faster than the default searcher and string::find
. boyer_moore
uses more time to perform the initialisation than boyer_moore_horspool
(it creates two lookup tables, rather than one, so it will use more space and preprocessing). But it looks like boyer_moore
search time is a bit faster: 33ms - 4.6ms
vs 31.6 - 0.92ms
.
The cost of preprocessing in boyer_moore
is more visible if you make the pattern even larger:
.\searchers.exe ..\..\SampleBooks\book-test.txt 1000 10000 1
string length: 547412
test iterations: 1000
needle from the center...
pattern length: 10000
string::find: 154.501 ms
default searcher: 291.107 ms
boyer_moore_searcher init only: 104.912 ms
boyer_moore_searcher: 126.098 ms
boyer_moore_horspool_searcher init only: 6.35085 ms
boyer_moore_horspool_searcher: 25.0702 ms
104ms
vs 6ms
!
How about more realistic searchers and patterns. It’s probably quite rare to look for 1000 letters…
.\searchers.exe ..\..\SampleBooks\book-test.txt 1000 "the town"
string length: 547412
test iterations: 1000
needle is a string...
pattern length: 8
string::find: 32.6093 ms
default searcher: 57.8666 ms
boyer_moore_searcher init only: 0.423179 ms
boyer_moore_searcher: 22.0527 ms
boyer_moore_horspool_searcher init only: 0.288173 ms
boyer_moore_horspool_searcher: 21.9978 ms
When looking for “the town” (appears in line 711 out of 9469 line). The preprocessing seems to be super fast, and the new algorithms could beat the string::find
version.
If the string is longer and positioned in near the end of the file:
.\searchers.exe ..\..\SampleBooks\book-test.txt 1000 "This Web site
includes information about Project"
string length: 547412
test iterations: 1000
needle is a string...
pattern length: 48
string::find: 60.324 ms
default searcher: 408.87 ms
boyer_moore_searcher init only: 0.670692 ms
boyer_moore_searcher: 125.899 ms
boyer_moore_horspool_searcher init only: 0.326464 ms
boyer_moore_horspool_searcher: 127.477 ms
Here, when looking for “This Web site includes information about Project” - which is located at the end of the file (a single occurrence) Boyer-Moore algorithms are 2x slower than string::find
.
As usual, I encourage you to perform your own tests.
Summary
In this post, I wanted to stress that each searcher can perform some initialisation in its constructor. What’s more, searchers can be initialised once and then passed around in the application - might be useful when you search for the same pattern over and over.
I've prepared a valuable bonus for you!
Learn all major features of recent C++ Standards on my Reference Cards!
Check it out here: