Coder Social home page Coder Social logo

aho_corasick's People

Contributors

cjgdev avatar miscco avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

aho_corasick's Issues

remove_partial_matches only consider alpha, the digial may not match?

[code ]

aho_corasick::trie trie;
trie.remove_overlaps()
    .only_whole_words()
    .case_insensitive();
trie.insert("great question");
trie.insert("forty-two");
trie.insert("deep thought");
auto tokens = trie.tokenise("123great question");
std::stringstream html;
html << "<html><body><p>";
for (const auto& token : tokens) {
        if (token.is_match()) html << "<i>";
        html << token.get_fragment();
        if (token.is_match()) html << "</i>";
}
html << "</p></body></html>";
std::cout << html.str();

[output]

<html><body><p>great question</p></body></html>

that's not the whole words.

if change remove_partial_matches as below, it works ok (isalpha ---> isalnum)

void remove_partial_matches(string_ref_type search_text, emit_collection &collected_emits) const
    {
        size_t size = search_text.size();
        emit_collection remove_emits;
        for (const auto &e : collected_emits) {
            if ((e.get_start() == 0 || !std::isalnum(search_text.at(e.get_start() - 1))) &&
                (e.get_end() + 1 == size || !std::isalnum(search_text.at(e.get_end() + 1)))
               ) {
                continue;
            }
            remove_emits.push_back(e);
        }
        for (auto &e : remove_emits) {
            collected_emits.erase(
                std::find(collected_emits.begin(), collected_emits.end(), e)
            );
        }
    }

Adding empty string matches too

It is just desired to have empty string matches too, if necessary.
Another bonus would be to return index directly from insert() method, to simplify index based matching with associated data later on.
Also, added const ref to the argument of insert

This changes are proposed:

at the start of the insert method:

int insert(const string_type& keyword) {
// commented to allow empty string matches too
// if (keyword.empty())
// return;

..

in the end of the insert method:

// bugfix, proposed here #7
d_constructed_failure_states = false;
return d_num_keywords-1;

--

At emit_collection parse_text(string_type text):

emit_collection collected_emits;
// added to collect empty string matches too
store_emits( pos, cur_state, collected_emits );
for( auto c : text ) {

Segfault with incremental parsing.

I have a usecase, where i parse for the existence of certain keywords and add them incrementally.

However, i found that the code segfaults when i have a word that matches a previous word plus a number.

aho_corasick::trie trie;
trie.parse_text("she");
trie.insert("she");
trie.parse_text("she1");

Interestingly this only happens when the parsed word is NOT already inserted. So the following runs fine:

aho_corasick::trie trie;
trie.insert("she");
trie.parse_text("she");
trie.parse_text("she1");

The culprit seems to be

        ptr next_state(CharType character, bool ignore_root_state) const {
            ptr result = nullptr;
            auto found = d_success.find(character);
            if (found != d_success.end()) {
                result = found->second.get();
            } else if (!ignore_root_state && d_root != nullptr) {
                result = d_root;
            }
            return result;
        }

where d_success.find(character) segfaults when searching for the number

source is C++11; benchmark is C++17

The source code compiles fine under C++ 2011,
But the benchmark can only use C++ 2017.

For someone who only has C++ 2011, this cmake fails overall due to this inability compile benchmark.

Segmentation fault found when using gdb debug

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffdadb1700 (LWP 14062)]
0x0000000000834cc4 in aho_corasick::state::next_state(char, bool) const () at /home/opt/compiler/gcc-8.2/gcc-8.2/include/c++/8.2.0/bits/stl_tree.h:2549
2549 _Rb_tree<_Key, _Val, _KeyOfValue, _Compare, _Alloc>::

#0 0x0000000000834cc4 in aho_corasick::state::next_state(char, bool) const () at /home/opt/compiler/gcc-8.2/gcc-8.2/include/c++/8.2.0/bits/stl_tree.h:2549

get_transitions() ======//maybe it is a crash

transition_collection get_transitions() const {
transition_collection result;
for (auto it = d_success.cbegin(); it != d_success.cend(); ++it) {
result.push_back(it->first); //maybe it is a crash
}
return transition_collection(result);
}

code:result.push_back(it->first);

i found it sometimes crash

License?

It's in the header file, but it would be ideal to have it in the LICENSE file at the top level.

do the tree-building off-line

I have millions of pattern to search, so it may cost much time to build the tree.

could you save the already-built-tree in file and then load it when use?

Add key ID to emit type.

Hi,

currently if one parses the text, one get a vector containing the keyword itself (Plus start and end of it). In many cases however, the convenient output would be the ID of the keyword in the provided keyword list, e.g in the provided example

aho_corasick::trie trie;
trie.insert("hers");
trie.insert("his");
trie.insert("she");
trie.insert("he");
auto result = trie.parse_text("ushers");

One should be able to do

std::cout << result.at(0).get_keyword() <<"\n"
          << result.at(0).get_index() << "\n";

And get the output

she
2

As far as i understand the algorithm the trie nodes should be indexed themselves so this should be quite straightforward. Thoughts?

Silence compiler warning

Hi,

in

            size_t determine_median(const interval_collection& intervals) const {
                size_t start = -1;
                size_t end = -1;
                for (const auto& i : intervals) {
                    size_t cur_start = i.get_start();
                    size_t cur_end = i.get_end();
                    if (start == -1 || cur_start < start) {
                        start = cur_start;
                    }
                    if (end == -1 || cur_end > end) {
                        end = cur_end;
                    }
                }
                return (start + end) / 2;
            }

you compare the unsigned size_t with -1. It might be best to change the declaration of start and end to ìnt`. In the end it wont really matter as the result is the same as for the size_t case, but the compiler is happier about it.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.