ayourtch / zf
zettair fork
This is a fork of one of the versions of the zettair search engine, with my experiments added. It retains zettair's BSD license. Zettair homepage: http://www.seg.rmit.edu.au/zettair/
Hi,
Compared with the original version of zettair, this fork has fixed some bugs, such as the one when executing zet -a. However, I have still found a segfault in summary capitalisation highlighting.
How to reproduce this error:
-download the HTML file with the command: wget www.al-ilmu.com/magazines/detail.php?id=76
-index the file: /usr/local/zettair/bin/zet -i -c /usr/local/zettair/share/psettings.xml -t HTML -f debug detail.php?id=76
-search for a term such as "tauhid" or "pembeli" (Indonesian terms) using: /usr/local/zettair/bin/zet -n 10 -f debug --summary=capitalise
-the following lines are the output of the search command:
aris@aris-laptop:~/Documents$ /usr/local/zettair/bin/zet -n 10 -f debug --summary=capitalise
AYXX: before reposset_set_record: rset->entries: 0. max: 1
AYXX: many files, adding quantity: 1
AYXX: reposset_set_record: rset->entries: 2
AYXX: docmap load: rset->entries: 2
> tauhid
AYXX: doc_ord_eval
AYXX: vocabsize[0]: 9, memsum: 9
AYXX: fit in memory: 1, total: 1
AYXX: searching in THRESH mode
AYXX: searching in AND mode
AYXX: remaining...
AYXX: normal term: tauhid
return from doc_ord_eval: 0, accs: 1
ERROR: index_search (src/search.c::1365): AUX: file:///home/aris/Documents/detail.php?id=76 // just my line to show which file caused error.
Segmentation fault
-If I remember correctly, there is another file that causes the same error. I will post it once I find it.
-aris
Hi,
I found that the text produced by summary capitalisation highlighting is not displayed well in my (experimental) web page. This is caused by HTML tags in the text not being escaped. Should the summary text be HTML-escaped?
best regards,
-aris
The steps to reproduce the error are below (because of an internet problem, I tested this feature offline and programmatically, but these steps will work too):
-wget http://kaahil.wordpress.com/2009/05/16/giliran-%E2%80%9Cflu-india%E2%80%9D_-flu-kuda-equine-influenza-h7n7-dan-h3n8/
-/usr/local/zettair/bin/zet -i -c /usr/local/zettair/share/psettings.xml -t HTML -f debug index.html
-/usr/local/zettair/bin/zet -f debug -n 10 --summary=tag
-and search term "kuda"
And the output with tag summary:
Giliran &#-0;&#.(;�FLU INDIA&#-0;&#.(;&#'';_ Flu ...
but with capitalise summary:
Giliran “FLU INDIA”_ Flu Kuda (Equine influenza) ...
best regards,
-aris
The steps to reproduce the error are:
-download the source files (just 23 HTML files): wget --user-agent="" -rN -R css,js,jpg,jpeg,png,bmp,ico,gif,pdf,txt,doc,odt,dbk,README,readme,lexi,texi,tex,xml,ogg,tar,tar.gz,zip,rar,mp3,swf,rtf,bz2 --limit-rate=20k --background -o www.gwan.com.wget-log -t 10 http://www.gwan.com
-index them: find www.gwan.com/* -type f | /usr/local/zettair/bin/zet -i -c /usr/local/zettair/share/psettings.xml -t HTML -f debug
-search for "microsoft" using tag summary highlighting: /usr/local/zettair/bin/zet -n 10 -f debug --summary=tag
-the 8th result is:
8. file:///root/Desktop/www.gwan.com/en_bing.html (score 0.134769, docid 10)
title: TrustLeap G-WAN HTTP Server Software - MICROSOFT MSN Robots
Should I assume that these were considered as "bugs" by MICROSOFT? ... Seven months later, MICROSOFT still did not explain why it uses its robots to disrupt the Web sites of its competitiors with the kind of time-out attacks that kill Apache and IIS (yes, IIS too, thanks to a 120-second 'after accept' time-out)... While GOOGLE robots never attacked http:/ /gwan.ch, MICROSOFT robots never stopped their attacks -even after my emails... Now, what about Cyveillance? (another MICROSOFT ' stragetic partner ') ... They are used as a cover channel to send (really dangerous) timeout attacks -the same deadly attacks that MICROSOFT Bing and MSN robots are using under the cover of similar 'benign' junk traffic to crash Apache Web servers.
The bold tag in "MICROSOFT? ..." is not closed.
best regards,
-aris
The steps to reproduce the error are:
-download the files: wget http://en.wikipedia.org/wiki/Che_Guevara_Mausoleum http://en.wikipedia.org/wiki/Guevarism http://en.wikipedia.org/wiki/Che_Guevara_in_popular_culture http://en.wikipedia.org/wiki/Bibliography_of_Che_Guevara http://en.wikipedia.org/wiki/Che_Guevara
-index the result: /usr/local/zettair/bin/zet -i -c /usr/local/zettair/share/psettings.xml -t HTML -f debug Che_Guevara Che_Guevara_Mausoleum Che_Guevara_in_popular_culture Guevarism
-search for "guevara" using /usr/local/zettair/bin/zet -n 10 -f debug --summary=tag. The result is:
AYXX: before reposset_set_record: rset->entries: 0. max: 1
AYXX: many files, adding quantity: 4
AYXX: reposset_set_record: rset->entries: 5
AYXX: docmap load: rset->entries: 5
> guevara
AYXX: doc_ord_eval
AYXX: vocabsize[0]: 939, memsum: 939
AYXX: fit in memory: 1, total: 1
AYXX: searching in THRESH mode
AYXX: searching in AND mode
AYXX: remaining...
AYXX: normal term: guevara
return from doc_ord_eval: 0, accs: 4
Segmentation fault
-but searching for "guevara" succeeds using /usr/local/zettair/bin/zet -n 10 -f debug --summary=capitalise. The result is:
AYXX: before reposset_set_record: rset->entries: 0. max: 1
AYXX: many files, adding quantity: 4
AYXX: reposset_set_record: rset->entries: 5
AYXX: docmap load: rset->entries: 5
> guevara
AYXX: doc_ord_eval
AYXX: vocabsize[0]: 939, memsum: 939
AYXX: fit in memory: 1, total: 1
AYXX: searching in THRESH mode
AYXX: searching in AND mode
AYXX: remaining...
AYXX: normal term: guevara
return from doc_ord_eval: 0, accs: 4
1. file:///home/aris/Che_Guevara (score 0.084276, docid 0)
title: Che Guevara - Wikipedia, the free encyclopedia
GUEVARA returned to ...
2. file:///home/aris/Che_Guevara_in_popular_culture (score -0.043245, docid 2)
title: Che Guevara in popular culture - Wikipedia, the free encyclopedia
Eleven days after GUEVARA'S execution, ...
3. file:///home/aris/Che_Guevara_Mausoleum (score -0.182210, docid 1)
title: Che Guevara Mausoleum - Wikipedia, the free encyclopedia
The Che GUEVARA Mausoleum (Mausoleo Che GUEVARA) ...
4. file:///home/aris/Guevarism (score -0.208675, docid 3)
title: Guevarism - Wikipedia, the free encyclopedia
�� Che GUEVARA ... Guevarism is a theory of communist ...
4 results of 4 shown (took 0.051080 seconds)
best regards,
-aris
085cab4 fixed the more obvious case. However, I believe there is also a corner case where we had highlighting but could not fit it into the buffer: we then roll back, and as a result we probably wipe out the closing bold tag. This ticket is to eventually investigate this condition further.
I'm now writing an experiment using zettair in a C web development environment. Searching with zettair means that I must allocate some "struct index_result *" for the search and then free it on every connection.
The problem starts when I repeat the search (reallocating the "struct index_result *"). The garbage string from the previous title is still visible when the title of the 2nd search is shorter than the title of the 1st search (in the same result position, of course). The problem disappears when I use calloc rather than malloc to allocate the results buffers.
To reproduce the problem I modified the commandline.c source so that every search in interactive mode reallocates the results buffer. I used the steps in #3 to produce the index. The code is:
    for (i = 0; args->list && args->list[i]; i++) { // line 1503 -------------
        if (!(search(idx, args->list[i], results, args->results,
          args->first_result, stats.maxtermlen, args->sopts,
          &args->sopt))) {
            index_delete(idx);
            free_args(args);
            return EXIT_FAILURE;
        }
    }
    if (args->qlist != stdin || !args->list || !args->list[0]) {
        /* stream-sourced mode */
        char querybuf[QUERYBUF + 1];
        free(results); /// the code which I added -------------------
        while (((args->qlist != stdin)
            || (printf("> ") && (fflush(stdout) == 0)))
          && fgets(querybuf, QUERYBUF, args->qlist)) {
            querybuf[QUERYBUF] = '\0';
            results = malloc(sizeof(*results) * args->results); /// the code which I added ----------------
            if (!(search(idx, querybuf, results, args->results,
              args->first_result, stats.maxtermlen, args->sopts,
              &args->sopt))) {
                index_delete(idx);
                free_args(args);
                return EXIT_FAILURE;
            }
            free(results); /// the code which I added --------------------
        }
    }
    gettimeofday(&now, NULL);
    printf("%lu microseconds querying "
      "(excluding loading/unloading)\n",
      (unsigned long int) (now.tv_usec - then.tv_usec
        + (now.tv_sec - then.tv_sec) * 1000000)); //line 1543 -------------
Then search for "ibm" and then "microsoft" using this command: /usr/local/zettair/bin/zet -f debug -n 10 --summary=capitalise and the error will occur.
best regards,
-aris
The error here occurs when the summary string is built from the values of meta attributes. These tags contribute to the error:
...
<meta name="robots" content="index, follow" />
<meta name="keywords" content="comma delimited string with 1791 of lengths" />
<meta name="description" content="Majalah Islam AsySyariah, Ilmiah & Mudah dipahami." />
<meta name="generator" content="Joomla! 1.5 - Open Source Content Management" />
...
Summarizing does not fail if I cut the keywords string to half its length.
The steps to reproduce the error:
-wget http://www.asysyariah.com/syariah/seputar-hukum-islam.html
-/usr/local/zettair/bin/zet -i -c /usr/local/zettair/share/psettings.xml -t HTML -f debug seputar-hukum-islam.html
-/usr/local/zettair/bin/zet -f debug -n 10 --summary=capitalise
-and search term "pendidikan"
best regards,
-aris
I have again found a segfault in summary capitalisation highlighting. The steps to reproduce the error are in #1 (comment)
Hi,
I tried to compile zettair as a 32-bit application/library on a 64-bit machine and got an error.
I used these options when configuring zettair (the same options worked when I built tokyocabinet and libcaptcha):
sudo ./configure CPPFLAGS=-m32 LDFLAGS=-m32 --prefix=/usr/local/zettair/
Then I ran "make" and the following error occurred:
.....
.....
gcc -shared src/.libs/str.o src/.libs/index.o src/.libs/mlparse.o src/.libs/stop.o src/.libs/stop_default.o src/.libs/postings.o src/.libs/merge.o src/.libs/vec.o src/.libs/makeindex.o src/.libs/freemap.o src/.libs/bit.o src/.libs/binsearch.o src/.libs/search.o src/.libs/chash.o src/.libs/stem.o src/.libs/heap.o src/.libs/queryparse.o src/.libs/index_querybuild.o src/.libs/bucket.o src/.libs/mem.o src/.libs/fdset.o src/.libs/pyramid.o src/.libs/iobtree.o src/.libs/getmaxfsize.o src/.libs/storagep.o src/.libs/btbucket.o src/.libs/btbulk.o src/.libs/vocab.o src/.libs/getlongopt.o src/.libs/error.o src/.libs/mlparse_wrap.o src/.libs/summarise.o src/.libs/mime.o src/.libs/remerge.o src/.libs/signals.o src/.libs/stack.o src/.libs/rbtree.o src/.libs/psettings.o src/.libs/psettings_default.o src/.libs/lcrand.o src/.libs/objalloc.o src/.libs/docmap.o src/.libs/reposset.o src/.libs/poolalloc.o src/.libs/alloc.o src/.libs/staticalloc.o src/.libs/dirichlet.o src/.libs/pcosine.o src/.libs/cosine.o src/.libs/hawkapi.o src/.libs/okapi_k3.o src/.libs/impact.o src/.libs/impact_build.o src/libtextcodec/.libs/crc.o src/libtextcodec/.libs/stream.o src/libtextcodec/.libs/detectfilter.o src/libtextcodec/.libs/gunzipfilter.o -lm -m32 -Wl,-soname -Wl,libzet.so.0 -o .libs/libzet.so.0.0.0
/usr/bin/ld: i386:x86-64 architecture of input file `src/libtextcodec/.libs/stream.o' is incompatible with i386 output
/usr/bin/ld: i386:x86-64 architecture of input file `src/libtextcodec/.libs/detectfilter.o' is incompatible with i386 output
/usr/bin/ld: final link failed: Invalid operation
collect2: ld returned 1 exit status
make[1]: *** [libzet.la] Error 1
make[1]: Leaving directory `/home/aris/Downloads/ayourtch-zf-6252417'
make: *** [all] Error 2
Are my configure options correct? Or do I need a further workaround to compile zettair as a 32-bit application/library on a 64-bit machine (e.g. editing the makefile)?
best regards,
-aris
The steps to reproduce the error are below (because of an internet problem, I tested this feature offline and programmatically, but these steps will work too):
-wget ahlulhadist.wordpress.com/2007/10/11/zaid-al-khair/?like=1&_wpnonce=8c2f8cca82
-/usr/local/zettair/bin/zet -i -c /usr/local/zettair/share/psettings.xml -t HTML -f debug file_outputed_from_step_1
-/usr/local/zettair/bin/zet -f debug -n 10 --summary=capitalise
-and search term "kuda"
And the output with capitalise summary (I added a space in "& #8221"):
Katanya, “Inilah rampasanku yang pertama.& #8221; Lalu dihampirinya anak ...
but with tag summary:
Katanya, &#8220;Inilah rampasanku yang pertama.& #8221; Lalu dihampirinya anak ...
and the original text:
Katanya, “Inilah rampasanku yang pertama.” Lalu dihampirinya anak ...
and the original page source (using view page source menu):
Katanya, “Inilah rampasanku yang pertama.” Lalu dihampirinya anak ...
best regards,
-aris
Issue #1 covers the crash caused by indexing with len-1 (so the index wraps around). This ticket tracks that I need to investigate whether the zero-length scenario is legitimate or a bug.