zf's Introduction

This is a fork of one version of the Zettair search engine, with my experiments added.

It retains Zettair's BSD license.

Zettair homepage: http://www.seg.rmit.edu.au/zettair/

zf's People

Contributors

ayourtch

zf's Issues

Segfault in capitalise summary highlighting

Hi,

Compared with the original version of zettair, this fork has fixed some bugs, such as the one triggered by running zet -a. However, I have still found a segfault in capitalise summary highlighting.

How to reproduce this error:
-download the HTML file with the command: wget www.al-ilmu.com/magazines/detail.php?id=76
-index the file: /usr/local/zettair/bin/zet -i -c /usr/local/zettair/share/psettings.xml -t HTML -f debug detail.php?id=76
-search for a term such as "tauhid" or "pembeli" (Indonesian words) using: /usr/local/zettair/bin/zet -n 10 -f debug --summary=capitalise
-the following lines are the output of the search command:

      aris@aris-laptop:~/Documents$ /usr/local/zettair/bin/zet -n 10 -f debug --summary=capitalise
      AYXX: before reposset_set_record: rset->entries: 0. max: 1
      AYXX: many files, adding quantity: 1
      AYXX: reposset_set_record: rset->entries: 2
      AYXX: docmap load: rset->entries: 2
      > tauhid
      AYXX: doc_ord_eval
      AYXX: vocabsize[0]: 9, memsum: 9
      AYXX: fit in memory: 1, total: 1
      AYXX: searching in THRESH mode
      AYXX: searching in AND mode
      AYXX: remaining...
      AYXX: normal term: tauhid
      return from doc_ord_eval: 0, accs: 1
      ERROR: index_search (src/search.c::1365): AUX: file:///home/aris/Documents/detail.php?id=76 // just my line to show which file caused error.
      Segmentation fault

-If I remember correctly, there are other files that cause the same error. I will post them once I find them.

-aris

In tag summary highlighting, " is not properly HTML-escaped

Steps to reproduce (because of internet problems I tested this feature offline, programmatically, but these steps will work too):
-wget http://kaahil.wordpress.com/2009/05/16/giliran-%E2%80%9Cflu-india%E2%80%9D_-flu-kuda-equine-influenza-h7n7-dan-h3n8/
-/usr/local/zettair/bin/zet -i -c /usr/local/zettair/share/psettings.xml -t HTML -f debug index.html
-/usr/local/zettair/bin/zet -f debug -n 10 --summary=tag
-search for the term "kuda"

And the output with tag summary:

Giliran &#-0;&#.(;�FLU INDIA&#-0;&#.(;&#'';_ Flu ...

but with capitalise summary:

Giliran “FLU INDIA”_ Flu Kuda (Equine influenza) ...

best regards,
-aris

Unclosed <b></b> tagging in tag summary highlighting

Steps to reproduce:
-download the source files (just 23 HTML files): wget --user-agent="" -rN -R css,js,jpg,jpeg,png,bmp,ico,gif,pdf,txt,doc,odt,dbk,README,readme,lexi,texi,tex,xml,ogg,tar,tar.gz,zip,rar,mp3,swf,rtf,bz2 --limit-rate=20k --background -o www.gwan.com.wget-log -t 10 http://www.gwan.com
-index them: find www.gwan.com/* -type f | /usr/local/zettair/bin/zet -i -c /usr/local/zettair/share/psettings.xml -t HTML -f debug
-search for "microsoft" using tag summary highlighting: /usr/local/zettair/bin/zet -n 10 -f debug --summary=tag
-the 8th result is:
8. file:///root/Desktop/www.gwan.com/en_bing.html (score 0.134769, docid 10)
title: TrustLeap G-WAN HTTP Server Software - MICROSOFT MSN Robots
Should I assume that these were considered as "bugs" by MICROSOFT? ... Seven months later, MICROSOFT still did not explain why it uses its robots to disrupt the Web sites of its competitiors with the kind of time-out attacks that kill Apache and IIS (yes, IIS too, thanks to a 120-second 'after accept' time-out)... While GOOGLE robots never attacked http:/ /gwan.ch, MICROSOFT robots never stopped their attacks -even after my emails... Now, what about Cyveillance? (another MICROSOFT ' stragetic partner ') ... They are used as a cover channel to send (really dangerous) timeout attacks -the same deadly attacks that MICROSOFT Bing and MSN robots are using under the cover of similar 'benign' junk traffic to crash Apache Web servers.

The bold tag around "MICROSOFT? ..." is not closed.
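
A small checker can make this kind of report easy to verify automatically. The following is a minimal sketch, assuming the tag summary marks hits with literal <b> and </b> as the issue title suggests; bold_tags_balanced is a hypothetical helper and not part of zettair.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of zettair): returns 1 if every "<b>"
 * in s is closed by a later "</b>" and no "</b>" appears without a
 * preceding open tag; returns 0 otherwise. */
int bold_tags_balanced(const char *s)
{
    int open = 0;

    assert(s != NULL);
    while (*s != '\0') {
        if (strncmp(s, "<b>", 3) == 0) {
            open++;
            s += 3;
        } else if (strncmp(s, "</b>", 4) == 0) {
            if (open == 0)
                return 0;       /* close tag with nothing open */
            open--;
            s += 4;
        } else {
            s++;
        }
    }
    return open == 0;           /* anything still open is a failure */
}
```

Running such a check over each summary string in a test harness would flag a result like the 8th one above without reading the output by eye.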

best regards,
-aris

Segfault in tag summary highlighting, but capitalise highlighting succeeds

Steps to reproduce:
-download the files: wget http://en.wikipedia.org/wiki/Che_Guevara_Mausoleum http://en.wikipedia.org/wiki/Guevarism http://en.wikipedia.org/wiki/Che_Guevara_in_popular_culture http://en.wikipedia.org/wiki/Bibliography_of_Che_Guevara http://en.wikipedia.org/wiki/Che_Guevara
-index the result: /usr/local/zettair/bin/zet -i -c /usr/local/zettair/share/psettings.xml -t HTML -f debug Che_Guevara Che_Guevara_Mausoleum Che_Guevara_in_popular_culture Guevarism
-search for "guevara" using /usr/local/zettair/bin/zet -n 10 -f debug --summary=tag; the result is:

AYXX: before reposset_set_record: rset->entries: 0. max: 1
AYXX: many files, adding quantity: 4
AYXX: reposset_set_record: rset->entries: 5
AYXX: docmap load: rset->entries: 5
> guevara
AYXX: doc_ord_eval
AYXX: vocabsize[0]: 939, memsum: 939
AYXX: fit in memory: 1, total: 1
AYXX: searching in THRESH mode
AYXX: searching in AND mode
AYXX: remaining...
AYXX: normal term: guevara
return from doc_ord_eval: 0, accs: 4
Segmentation fault

-but searching for "guevara" succeeds with /usr/local/zettair/bin/zet -n 10 -f debug --summary=capitalise. The results are:

AYXX: before reposset_set_record: rset->entries: 0. max: 1
AYXX: many files, adding quantity: 4
AYXX: reposset_set_record: rset->entries: 5
AYXX: docmap load: rset->entries: 5
> guevara
AYXX: doc_ord_eval
AYXX: vocabsize[0]: 939, memsum: 939
AYXX: fit in memory: 1, total: 1
AYXX: searching in THRESH mode
AYXX: searching in AND mode
AYXX: remaining...
AYXX: normal term: guevara
return from doc_ord_eval: 0, accs: 4
1. file:///home/aris/Che_Guevara (score 0.084276, docid 0)
title: Che Guevara - Wikipedia, the free encyclopedia 
GUEVARA returned to ...
2. file:///home/aris/Che_Guevara_in_popular_culture (score -0.043245, docid 2)
title: Che Guevara in popular culture - Wikipedia, the free encyclopedia 
Eleven days after GUEVARA'S execution, ...
3. file:///home/aris/Che_Guevara_Mausoleum (score -0.182210, docid 1)
title: Che Guevara Mausoleum - Wikipedia, the free encyclopedia 
The Che GUEVARA Mausoleum (Mausoleo Che GUEVARA) ...
4. file:///home/aris/Guevarism (score -0.208675, docid 3)
title: Guevarism - Wikipedia, the free encyclopedia 
�� Che GUEVARA ... Guevarism is a theory of communist ...

4 results of 4 shown (took 0.051080 seconds)

best regards,
-aris

check that the tag highlight always closes

085cab4 fixed the more obvious case. However, I believe there is also a corner case: we have highlighting, the text does not fit into the buffer, and we roll back, which probably wipes out the closing bold tag. This ticket is for investigating that condition further.
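
One way to rule out this corner case is to treat the open tag, the term, and the close tag as a single unit that is either written completely or not at all. The following is a minimal sketch under that assumption; append_highlight and its parameters are hypothetical and do not reflect the actual code in src/summarise.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: appends "<b>" + term + "</b>" to buf, which has
 * capacity cap (including the terminating NUL) and current length *len.
 * Returns 1 on success. Returns 0 and leaves buf and *len untouched if
 * the whole unit does not fit, so no rollback is ever needed. */
int append_highlight(char *buf, size_t cap, size_t *len, const char *term)
{
    static const char open_tag[] = "<b>";
    static const char close_tag[] = "</b>";
    const size_t olen = sizeof open_tag - 1;
    const size_t clen = sizeof close_tag - 1;
    size_t tlen, need;

    assert(buf != NULL && len != NULL && term != NULL);
    tlen = strlen(term);
    need = olen + tlen + clen;

    /* Check space for the complete unit before writing anything. */
    if (*len >= cap || need > cap - *len - 1)
        return 0;

    memcpy(buf + *len, open_tag, olen);
    memcpy(buf + *len + olen, term, tlen);
    memcpy(buf + *len + olen + tlen, close_tag, clen);
    *len += need;
    buf[*len] = '\0';
    return 1;
}
```

Because the space check happens up front, a failed append can never leave an open tag behind in the buffer.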

The string in result[i].title is not terminated with '\0'

I'm now writing an experiment using zettair in a C web development environment. Searching with zettair means that I must allocate a "struct index_result *" buffer for each search and free it again on every connection.

The problem starts when I repeat the search (reallocating the "struct index_result *" buffer). Garbage from the previous title is still visible when the title from the 2nd search is shorter than the title from the 1st search (in the same result position, of course). The problem disappears when I use calloc rather than malloc to allocate the results buffer.

To reproduce the problem I modified the commandline.c source so that every interactive-mode search reallocates the results buffer. I used the steps in #3 to produce the index. The code is:

            for (i = 0; args->list && args->list[i]; i++) { // line 1503 -------------
                if (!(search(idx, args->list[i], results, args->results,
                  args->first_result, stats.maxtermlen, args->sopts,
                  &args->sopt))) {
                    index_delete(idx);
                    free_args(args);
                    return EXIT_FAILURE;
                }
            }

            if (args->qlist != stdin || !args->list || !args->list[0]) {
                /* stream-sourced mode */
                char querybuf[QUERYBUF + 1];

                free(results); /// the code which I added -------------------                    

                while (((args->qlist != stdin) 
                    || (printf("> ") && (fflush(stdout) == 0)))
                  && fgets(querybuf, QUERYBUF, args->qlist)) {
                    querybuf[QUERYBUF] = '\0';

                    results = malloc(sizeof(*results) * args->results); /// the code which I added ----------------

                    if (!(search(idx, querybuf, results, args->results, 
                      args->first_result, stats.maxtermlen, args->sopts, 
                      &args->sopt))) {
                        index_delete(idx);
                        free_args(args);
                        return EXIT_FAILURE;
                    }

                    free(results); /// the code which I added --------------------

                }
            }

            gettimeofday(&now, NULL);

            printf("%lu microseconds querying "
              "(excluding loading/unloading)\n", 
              (unsigned long int) (now.tv_usec - then.tv_usec 
                + (now.tv_sec - then.tv_sec) * 1000000)); //line 1543  -------------

Then search for "ibm" and then "microsoft" using this command: /usr/local/zettair/bin/zet -f debug -n 10 --summary=capitalise and the error will occur.
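
The calloc workaround hides the symptom because every byte after the copied title is already zero. A more direct fix is to terminate the string explicitly wherever the title is copied. The following is a minimal sketch of that idea; struct demo_result, DEMO_TITLE_LEN and demo_set_title are hypothetical stand-ins, not zettair's real struct index_result or its field sizes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_TITLE_LEN 50          /* hypothetical field size */

struct demo_result {               /* stand-in for struct index_result */
    char title[DEMO_TITLE_LEN + 1];
};

/* Copies at most DEMO_TITLE_LEN bytes of src into r->title and always
 * writes the terminating NUL, so stale bytes left over from a previous
 * search (or from malloc) can never show up as part of the title. */
void demo_set_title(struct demo_result *r, const char *src, size_t srclen)
{
    size_t n;

    assert(r != NULL && src != NULL);
    n = srclen < DEMO_TITLE_LEN ? srclen : DEMO_TITLE_LEN;
    memcpy(r->title, src, n);
    r->title[n] = '\0';
}
```

With explicit termination the result buffer can come from malloc and be reused across searches, regardless of whether the second title is shorter than the first.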

best regards,
-aris

Error summarizing an HTML doc containing a "meta" tag with a very long attribute value

The error here is that the summary string is built from the meta attribute values. The tags that contribute to the error:
...
<meta name="robots" content="index, follow" />
<meta name="keywords" content="comma delimited string with 1791 of lengths" />
<meta name="description" content="Majalah Islam AsySyariah, Ilmiah & Mudah dipahami." />
<meta name="generator" content="Joomla! 1.5 - Open Source Content Management" />
...

The summary is produced without error if I cut the string to half its length.

Steps to reproduce:
-wget http://www.asysyariah.com/syariah/seputar-hukum-islam.html
-/usr/local/zettair/bin/zet -i -c /usr/local/zettair/share/psettings.xml -t HTML -f debug seputar-hukum-islam.html
-/usr/local/zettair/bin/zet -f debug -n 10 --summary=capitalise
-search for the term "pendidikan"

best regards,
-aris

Compiling zettair as a 32-bit application/library on a 64-bit machine

Hi,

I tried to compile zettair as a 32-bit application/library on a 64-bit machine, and got an error.

I used these options when configuring zettair (the same approach worked when building tokyocabinet and libcaptcha):

sudo ./configure CPPFLAGS=-m32 LDFLAGS=-m32 --prefix=/usr/local/zettair/

Then I ran make and the following error occurred:

.....
.....
gcc -shared  src/.libs/str.o src/.libs/index.o src/.libs/mlparse.o src/.libs/stop.o src/.libs/stop_default.o src/.libs/postings.o src/.libs/merge.o src/.libs/vec.o src/.libs/makeindex.o src/.libs/freemap.o src/.libs/bit.o src/.libs/binsearch.o src/.libs/search.o src/.libs/chash.o src/.libs/stem.o src/.libs/heap.o src/.libs/queryparse.o src/.libs/index_querybuild.o src/.libs/bucket.o src/.libs/mem.o src/.libs/fdset.o src/.libs/pyramid.o src/.libs/iobtree.o src/.libs/getmaxfsize.o src/.libs/storagep.o src/.libs/btbucket.o src/.libs/btbulk.o src/.libs/vocab.o src/.libs/getlongopt.o src/.libs/error.o src/.libs/mlparse_wrap.o src/.libs/summarise.o src/.libs/mime.o src/.libs/remerge.o src/.libs/signals.o src/.libs/stack.o src/.libs/rbtree.o src/.libs/psettings.o src/.libs/psettings_default.o src/.libs/lcrand.o src/.libs/objalloc.o src/.libs/docmap.o src/.libs/reposset.o src/.libs/poolalloc.o src/.libs/alloc.o src/.libs/staticalloc.o src/.libs/dirichlet.o src/.libs/pcosine.o src/.libs/cosine.o src/.libs/hawkapi.o src/.libs/okapi_k3.o src/.libs/impact.o src/.libs/impact_build.o src/libtextcodec/.libs/crc.o src/libtextcodec/.libs/stream.o src/libtextcodec/.libs/detectfilter.o src/libtextcodec/.libs/gunzipfilter.o  -lm  -m32 -Wl,-soname -Wl,libzet.so.0 -o .libs/libzet.so.0.0.0
/usr/bin/ld: i386:x86-64 architecture of input file `src/libtextcodec/.libs/stream.o' is incompatible with i386 output
/usr/bin/ld: i386:x86-64 architecture of input file `src/libtextcodec/.libs/detectfilter.o' is incompatible with i386 output
/usr/bin/ld: final link failed: Invalid operation
collect2: ld returned 1 exit status
make[1]: *** [libzet.la] Error 1
make[1]: Leaving directory `/home/aris/Downloads/ayourtch-zf-6252417'
make: *** [all] Error 2

Are my configure options correct? Or do I need a further workaround (e.g. editing the makefile) to compile zettair as a 32-bit application/library on a 64-bit machine?
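
The linker messages say that stream.o and detectfilter.o are 64-bit objects while the output is 32-bit. One plausible cause, not confirmed in this thread, is that those two files were compiled without -m32. To see which objects are affected, the ELF class byte of each .o file can be inspected. The following is a minimal sketch; both function names are hypothetical, and it only looks at the class byte (offset 4 of the ELF identification), not at the machine type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Returns 32 or 64 for an ELF header of that class, or 0 if the bytes
 * are not a recognisable ELF identification. */
int elf_class_from_header(const unsigned char *hdr, size_t n)
{
    assert(hdr != NULL);
    if (n < 5 || hdr[0] != 0x7f || hdr[1] != 'E'
        || hdr[2] != 'L' || hdr[3] != 'F')
        return 0;
    if (hdr[4] == 1)
        return 32;              /* ELFCLASS32 */
    if (hdr[4] == 2)
        return 64;              /* ELFCLASS64 */
    return 0;
}

/* Reads the first bytes of the file at path and classifies them.
 * Returns -1 if the file cannot be opened. */
int elf_class_of_file(const char *path)
{
    unsigned char hdr[5];
    size_t n;
    FILE *fp = fopen(path, "rb");

    if (fp == NULL)
        return -1;
    n = fread(hdr, 1, sizeof hdr, fp);
    fclose(fp);
    return elf_class_from_header(hdr, n);
}
```

Calling elf_class_of_file on each path from the failing gcc -shared line would show which objects still report 64 after configuring with -m32.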

best regards,
-aris

In capitalise summary highlighting, " is not properly formatted

Steps to reproduce (because of internet problems I tested this feature offline, programmatically, but these steps will work too):
-wget ahlulhadist.wordpress.com/2007/10/11/zaid-al-khair/?like=1&_wpnonce=8c2f8cca82
-/usr/local/zettair/bin/zet -i -c /usr/local/zettair/share/psettings.xml -t HTML -f debug file_outputed_from_step_1
-/usr/local/zettair/bin/zet -f debug -n 10 --summary=capitalise
-search for the term "kuda"

The output with capitalise summary (note the added space in "& #8221"):

Katanya, &#8220;Inilah rampasanku yang pertama.& #8221; Lalu dihampirinya anak ...

but with tag summary:

Katanya, &amp;#8220;Inilah rampasanku yang pertama.&amp; #8221; Lalu dihampirinya anak ...

and the original text:

Katanya, “Inilah rampasanku yang pertama.” Lalu dihampirinya anak ...

and the original page source (using view page source menu):

Katanya, &#8220;Inilah rampasanku yang pertama.&#8221; Lalu dihampirinya anak ...
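
The tag summary output above looks like a double escape: the & of an existing numeric character reference is turned into &amp;. One possible approach, sketched below, is to escape markup characters while copying decimal references of the form &#digits; through unchanged. The function names are hypothetical, hexadecimal and named references are not handled, and this is not a description of how zettair's summariser currently works.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Length of a decimal character reference such as "&#8220;" starting
 * at s, or 0 if s does not start with one. */
static size_t numeric_ref_len(const char *s)
{
    size_t i = 2;

    if (s[0] != '&' || s[1] != '#')
        return 0;
    while (isdigit((unsigned char) s[i]))
        i++;
    if (i == 2 || s[i] != ';')
        return 0;
    return i + 1;
}

/* Writes an escaped copy of src into dst (capacity cap, including the
 * NUL). Returns 1 on success; returns 0 and leaves dst empty if the
 * result does not fit. */
int escape_keep_refs(char *dst, size_t cap, const char *src)
{
    size_t out = 0;

    assert(dst != NULL && src != NULL);
    if (cap == 0)
        return 0;
    while (*src != '\0') {
        const char *rep = src;
        size_t consumed = numeric_ref_len(src);
        size_t replen = consumed;

        if (consumed == 0) {    /* not a reference: escape one char */
            consumed = 1;
            replen = 1;
            switch (*src) {
            case '&': rep = "&amp;";  replen = 5; break;
            case '<': rep = "&lt;";   replen = 4; break;
            case '>': rep = "&gt;";   replen = 4; break;
            case '"': rep = "&quot;"; replen = 6; break;
            default: break;
            }
        }
        if (replen > cap - out - 1) {
            dst[0] = '\0';
            return 0;
        }
        memcpy(dst + out, rep, replen);
        out += replen;
        src += consumed;
    }
    dst[out] = '\0';
    return 1;
}
```

With this behaviour, the page source line above would pass through with its &#8220; and &#8221; references intact, while a bare & elsewhere in the text would still become &amp;.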

best regards,
-aris

Investigate why the zero length happens

Issue #1 covers the crash caused by indexing with len-1 when len is zero (so the index wraps around). This ticket tracks investigating whether the zero-length scenario is legitimate or a bug.
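
For illustration, the wrap-around happens because len - 1 on an unsigned zero yields the largest representable value, so buf[len - 1] reads far outside the buffer. The following is a minimal sketch of a guarded access; trim_trailing_space is a hypothetical helper, not the function involved in issue #1.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: strips trailing whitespace from the first len
 * bytes of buf and NUL-terminates the result. buf must have room for
 * len + 1 bytes. The len > 0 test runs before buf[len - 1] is read,
 * so a zero-length input never indexes with a wrapped-around value. */
size_t trim_trailing_space(char *buf, size_t len)
{
    assert(buf != NULL);
    while (len > 0 && isspace((unsigned char) buf[len - 1]))
        len--;
    buf[len] = '\0';
    return len;
}
```

Whether a zero length is legitimate or not, a guard of this shape keeps the access in bounds while the root cause is being investigated.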
