Hi, I've been trying to index a wiki article dump using this library

Incomplete document returned about indexwikipedia HOT 7 CLOSED

lemire commented on September 20, 2024

Incomplete document returned

from indexwikipedia.

Comments (7)

lemire commented on September 20, 2024

I think you expect the index to include the content of the documents. You can achieve this by setting the doc.store property to true. This will considerably increase the size of the index, however.
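A minimal sketch of that configuration, assuming the settings are passed to Lucene's benchmark DocMaker through a java.util.Properties object (the thread mentions a `properties` variable in the indexer). One caveat: as far as I know, the Lucene benchmark DocMaker documentation spells the property `doc.stored`, with `doc.body.stored` for the body field, so that spelling is used below. Treat the property names as an assumption to verify against your Lucene version; the class and method names are hypothetical.

```java
import java.util.Properties;

public class StoredFieldsConfig {
    // Builds the settings that ask the DocMaker to store field values.
    // Assumption: property names follow the Lucene benchmark DocMaker docs.
    public static Properties storedFieldsProperties() {
        Properties properties = new Properties();
        // Store field values so they can be read back from search hits.
        properties.setProperty("doc.stored", "true");
        // The body field is controlled separately; set it explicitly for clarity.
        properties.setProperty("doc.body.stored", "true");
        return properties;
    }

    public static void main(String[] args) {
        Properties properties = storedFieldsProperties();
        System.out.println("doc.stored = " + properties.getProperty("doc.stored"));
        System.out.println("doc.body.stored = " + properties.getProperty("doc.body.stored"));
    }
}
```

The index has to be rebuilt after changing these settings, since stored values are written at indexing time.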

XinyuHua commented on September 20, 2024

Thanks for the timely response!

I just tried adding properties.setProperty("doc.store", "true"); after line 84, but I still got the same problem.

I called System.out.println(doc) right after each document is added to the indexWriter, and it shows all of those fields, so I suppose these documents are valid. I can also search for keywords in the title and retrieve some documents, so the data must be somewhere in the final index. But I can't retrieve any field except the docid.

modified code:

            while ((doc = docMaker.makeDocument()) != null) {
                indexWriter.addDocument(doc);
                System.out.println(doc); // print each document right after it is added
                ++count;

output:
Document<stored,indexed,omitNorms,indexOptions=DOCS_ONLYdocid:207687 indexed,tokenized,omitNormsdocname:5040412 indexed,tokenized,omitNorms<docdate:31-OCT-2017 04:57:57.000> indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY,numericType=LONG,numericPrecisionStep=16docdatenum:1509440277000 indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY,numericType=INT,numericPrecisionStep=8doctimesecnum:32277 indexed,tokenized,omitNorms<doctitle:Mr. Sc....

lemire commented on September 20, 2024

@XinyuHua

Can you go into a fresh directory and type the following, exactly...

git clone https://github.com/lemire/IndexWikipedia.git
cd IndexWikipedia
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles10.xml-p2336425p3046511.bz2
mkdir Index
mvn compile
mvn exec:java -Dexec.args="enwiki-latest-pages-articles10.xml-p2336425p3046511.bz2 Index"

Let me know what it tells you...

XinyuHua commented on September 20, 2024

Hi Lemire,

Below is the output from executing the last command. I'm now checking whether I can retrieve documents:

Starting Indexing of Wikipedia dump /home/xinyu/tutorial/lucene/IndexWikipedia/enwiki-latest-pages-articles10.xml-p2336425p3046511.bz2
Indexed 1000 documents (1000 bodies) in 2129 ms
Indexed 2000 documents (2000 bodies) in 3717 ms
...
Indexed 159000 documents (158972 bodies) in 254279 ms
Indexed 160000 documents (159972 bodies) in 255809 ms
Indexed 161000 documents (160972 bodies) in 257037 ms
org.apache.lucene.benchmark.byTask.feeds.NoMoreDataException
	at org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource$Parser.run(EnwikiContentSource.java:196)
	at java.lang.Thread.run(Thread.java:748)
Indexing 161973 documents took 258266 ms
Total data processed: 0 bytes
Index should be located at /home/xinyu/tutorial/lucene/IndexWikipedia/Index
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 04:21 min
[INFO] Finished at: 2017-12-01T22:49:56-05:00
[INFO] Final Memory: 32M/2632M
[INFO] ------------------------------------------------------------------------

lemire commented on September 20, 2024

Ok so I expect the issue might be with your chosen data source.

I am closing this issue. Reopen if you cannot resolve your issues.

XinyuHua commented on September 20, 2024

Hi Lemire,

I just checked with my retrieval program, and the problem persists. This is the snippet I used to query some keywords:


        String INDEX_PATH = "/home/xinyu/tutorial/lucene/IndexWikipedia/Index";
        StandardAnalyzer analyzer = new StandardAnalyzer();
        IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(INDEX_PATH)));
        IndexSearcher searcher = new IndexSearcher(reader);
        QueryParser parser = new QueryParser("doctitle", analyzer);
        Query query = parser.parse("Query this sentence");
        TopDocs results = searcher.search(query, 5 * 10);
        ScoreDoc[] hits = results.scoreDocs;
        int numTotalHits = Math.toIntExact(results.totalHits);
        for (ScoreDoc hit : hits) {
            Document doc = searcher.doc(hit.doc);
            System.out.println(hit);
            System.out.println(doc);
        }

Here is what it prints:

doc=62331 score=11.589677 shardIndex=0
Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<docid:71477>>
doc=55135 score=10.742379 shardIndex=0
Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<docid:61916>>
doc=72442 score=10.742379 shardIndex=0
Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<docid:76853>>
doc=145087 score=10.742379 shardIndex=0
Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<docid:141977>>

So the problem seems to be that the Document object only contains the docid, but it should contain the other text fields as well, right? I'm not sure whether there is a problem with my code...

lemire commented on September 20, 2024

The primary purpose of the index is to allow you to find the document identifiers matching a query. If you want to store additional fields, you can do so with commands such as doc.add(new Field("mykey","my value", TextField.TYPE_STORED));. Please see the Lucene documentation.
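To illustrate the distinction between indexed and stored fields, here is a toy model in plain Java. It is not Lucene's implementation, and every name in it (`ToyIndex`, `add`, `search`, `doc`) is hypothetical. It only sketches the idea that a field can be searchable through the inverted index without its original text being kept for retrieval, which would match the behaviour reported above: title queries find hits, but the returned documents carry only the docid.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ToyIndex {
    // Inverted index: "field:term" -> ids of the documents containing that term.
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // Stored values: document id -> (field -> original text), only for stored fields.
    private final Map<Integer, Map<String, String>> stored = new HashMap<>();

    // Every field is indexed; its text is kept only when 'store' is true.
    public void add(int docId, String field, String text, boolean store) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(field + ":" + term, k -> new TreeSet<>()).add(docId);
            }
        }
        if (store) {
            stored.computeIfAbsent(docId, k -> new LinkedHashMap<>()).put(field, text);
        }
    }

    // Returns the ids of documents whose given field contains the term.
    public Set<Integer> search(String field, String term) {
        return postings.getOrDefault(field + ":" + term.toLowerCase(), Collections.emptySet());
    }

    // Returns only the stored fields of a document.
    public Map<String, String> doc(int docId) {
        return stored.getOrDefault(docId, Collections.emptyMap());
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add(1, "docid", "71477", true);                  // stored
        index.add(1, "doctitle", "Query this sentence", false); // indexed only
        System.out.println(index.search("doctitle", "query")); // prints [1]
        System.out.println(index.doc(1));                      // prints {docid=71477}
    }
}
```

In this model, passing `true` for the title field would make its text appear in the retrieved document, which is the role a stored field type such as TextField.TYPE_STORED plays in the Lucene call mentioned above.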
