cjgdev / aho_corasick Goto Github PK
View Code? Open in Web Editor NEWA C++ implementation of the aho corasick pattern search algorithm
A C++ implementation of the aho corasick pattern search algorithm
there are some cases that overlap matching has not been removed even .remove_overlaps() has been invoked.
Hello, I got a performance problem about the ac machine of this project, and the following was my job.
First i constructed a trie tree with 2700+ patterns and did something about string-matching.
Hello. Can i add your implementation of Aho-Corasik algorithm to the Boost.Algorithm library?
[code ]
aho_corasick::trie trie;
trie.remove_overlaps()
.only_whole_words()
.case_insensitive();
trie.insert("great question");
trie.insert("forty-two");
trie.insert("deep thought");
auto tokens = trie.tokenise("123great question");
std::stringstream html;
html << "<html><body><p>";
for (const auto& token : tokens) {
if (token.is_match()) html << "<i>";
html << token.get_fragment();
if (token.is_match()) html << "</i>";
}
html << "</p></body></html>";
std::cout << html.str();
[output]
<html><body><p>great question</p></body></html>
that's not the whole words.
if change remove_partial_matches as below, it works ok (isalpha ---> isalnum)
void remove_partial_matches(string_ref_type search_text, emit_collection &collected_emits) const
{
size_t size = search_text.size();
emit_collection remove_emits;
for (const auto &e : collected_emits) {
if ((e.get_start() == 0 || !std::isalnum(search_text.at(e.get_start() - 1))) &&
(e.get_end() + 1 == size || !std::isalnum(search_text.at(e.get_end() + 1)))
) {
continue;
}
remove_emits.push_back(e);
}
for (auto &e : remove_emits) {
collected_emits.erase(
std::find(collected_emits.begin(), collected_emits.end(), e)
);
}
}
It is just desired to have empty string matches too, if necessary.
Another bonus would be to return index directly from insert() method, to simplify index based matching with associated data later on.
Also, added const ref to the argument of insert
This changes are proposed:
int insert(const string_type& keyword) {
// commented to allow empty string matches too
// if (keyword.empty())
// return;
..
// bugfix, proposed here #7
d_constructed_failure_states = false;
return d_num_keywords-1;
--
emit_collection collected_emits;
// added to collect empty string matches too
store_emits( pos, cur_state, collected_emits );
for( auto c : text ) {
I have a usecase, where i parse for the existence of certain keywords and add them incrementally.
However, i found that the code segfaults when i have a word that matches a previous word plus a number.
aho_corasick::trie trie;
trie.parse_text("she");
trie.insert("she");
trie.parse_text("she1");
Interestingly this only happens when the parsed word is NOT already inserted. So the following runs fine:
aho_corasick::trie trie;
trie.insert("she");
trie.parse_text("she");
trie.parse_text("she1");
The culprit seems to be
ptr next_state(CharType character, bool ignore_root_state) const {
ptr result = nullptr;
auto found = d_success.find(character);
if (found != d_success.end()) {
result = found->second.get();
} else if (!ignore_root_state && d_root != nullptr) {
result = d_root;
}
return result;
}
where d_success.find(character) segfaults when searching for the number
The source code compiles fine under C++ 2011,
But the benchmark can only use C++ 2017.
For someone who only has C++ 2011, this cmake
fails overall due to this inability compile benchmark.
I've found you got an error when adding items using iterators
it should be:
for (InputIterator it = first; it != last; ++it) {
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffdadb1700 (LWP 14062)]
0x0000000000834cc4 in aho_corasick::state::next_state(char, bool) const () at /home/opt/compiler/gcc-8.2/gcc-8.2/include/c++/8.2.0/bits/stl_tree.h:2549
2549 _Rb_tree<_Key, _Val, _KeyOfValue, _Compare, _Alloc>::
#0 0x0000000000834cc4 in aho_corasick::state::next_state(char, bool) const () at /home/opt/compiler/gcc-8.2/gcc-8.2/include/c++/8.2.0/bits/stl_tree.h:2549
transition_collection get_transitions() const {
transition_collection result;
for (auto it = d_success.cbegin(); it != d_success.cend(); ++it) {
result.push_back(it->first); //maybe it is a crash
}
return transition_collection(result);
}
code:result.push_back(it->first);
i found it sometimes crash
It's in the header file, but it would be ideal to have it in the LICENSE
file at the top level.
I have millions of pattern to search, so it may cost much time to build the tree.
could you save the already-built-tree in file and then load it when use?
In particular, the main parse_text
method. Is it safe for multiple threads to call parse_text
at the same time?
Hi,
currently if one parses the text, one get a vector containing the keyword itself (Plus start and end of it). In many cases however, the convenient output would be the ID of the keyword in the provided keyword list, e.g in the provided example
aho_corasick::trie trie;
trie.insert("hers");
trie.insert("his");
trie.insert("she");
trie.insert("he");
auto result = trie.parse_text("ushers");
One should be able to do
std::cout << result.at(0).get_keyword() <<"\n"
<< result.at(0).get_index() << "\n";
And get the output
she
2
As far as i understand the algorithm the trie nodes should be indexed themselves so this should be quite straightforward. Thoughts?
Hi,
in
size_t determine_median(const interval_collection& intervals) const {
size_t start = -1;
size_t end = -1;
for (const auto& i : intervals) {
size_t cur_start = i.get_start();
size_t cur_end = i.get_end();
if (start == -1 || cur_start < start) {
start = cur_start;
}
if (end == -1 || cur_end > end) {
end = cur_end;
}
}
return (start + end) / 2;
}
you compare the unsigned size_t with -1. It might be best to change the declaration of start and end to ìnt`. In the end it wont really matter as the result is the same as for the size_t case, but the compiler is happier about it.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.