Comments (7)

lemire commented on September 20, 2024

I think you expect the index to include the content of the documents. You can achieve this by setting the doc.store property to true. This will considerably increase the size of the index, however.

from indexwikipedia.

XinyuHua commented on September 20, 2024

Thanks for the timely response!

I just tried adding properties.setProperty("doc.store", "true"); after line 84, but still got the same problem.

I tried calling System.out.println(doc) after each document is added to the indexWriter, and it shows all of those fields, so I suppose these documents are valid. I can also search for keywords in the title to retrieve some documents, so the content must be somewhere in the final index. But I just couldn't retrieve any of those fields except for the docid.

modified code:

    while ((doc = docMaker.makeDocument()) != null) {
        indexWriter.addDocument(doc);
        System.out.println(doc); // debug: print the fields of the document just added
        ++count;

output:
Document<stored,indexed,omitNorms,indexOptions=DOCS_ONLYdocid:207687 indexed,tokenized,omitNormsdocname:5040412 indexed,tokenized,omitNorms<docdate:31-OCT-2017 04:57:57.000> indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY,numericType=LONG,numericPrecisionStep=16docdatenum:1509440277000 indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY,numericType=INT,numericPrecisionStep=8doctimesecnum:32277 indexed,tokenized,omitNorms<doctitle:Mr. Sc....

lemire commented on September 20, 2024

@XinyuHua

Can you go into a fresh directory and type the following, exactly...

git clone https://github.com/lemire/IndexWikipedia.git
cd IndexWikipedia
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles10.xml-p2336425p3046511.bz2
mkdir Index
mvn compile
mvn exec:java -Dexec.args="enwiki-latest-pages-articles10.xml-p2336425p3046511.bz2 Index"

Let me know what it tells you...

XinyuHua commented on September 20, 2024

Hi Lemire,

Below is the output from executing the last command. I'm now checking whether I can retrieve documents:

Starting Indexing of Wikipedia dump /home/xinyu/tutorial/lucene/IndexWikipedia/enwiki-latest-pages-articles10.xml-p2336425p3046511.bz2
Indexed 1000 documents (1000 bodies) in 2129 ms
Indexed 2000 documents (2000 bodies) in 3717 ms
...
Indexed 159000 documents (158972 bodies) in 254279 ms
Indexed 160000 documents (159972 bodies) in 255809 ms
Indexed 161000 documents (160972 bodies) in 257037 ms
org.apache.lucene.benchmark.byTask.feeds.NoMoreDataException
	at org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource$Parser.run(EnwikiContentSource.java:196)
	at java.lang.Thread.run(Thread.java:748)
Indexing 161973 documents took 258266 ms
Total data processed: 0 bytes
Index should be located at /home/xinyu/tutorial/lucene/IndexWikipedia/Index
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 04:21 min
[INFO] Finished at: 2017-12-01T22:49:56-05:00
[INFO] Final Memory: 32M/2632M
[INFO] ------------------------------------------------------------------------

lemire commented on September 20, 2024

Ok so I expect the issue might be with your chosen data source.

I am closing this issue. Reopen if you cannot resolve your issues.

XinyuHua commented on September 20, 2024

Hi Lemire,

I just checked with the retrieval program, and the problem still persists. This is the snippet I used to query some keywords:


        String INDEX_PATH = "/home/xinyu/tutorial/lucene/IndexWikipedia/Index";
        StandardAnalyzer analyzer = new StandardAnalyzer();
        // Open the index produced by the indexing run above.
        IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(INDEX_PATH)));
        IndexSearcher searcher = new IndexSearcher(reader);
        // Parse the query text against the doctitle field.
        QueryParser parser = new QueryParser("doctitle", analyzer);
        Query query = parser.parse("Query this sentence");
        TopDocs results = searcher.search(query, 5 * 10);
        ScoreDoc[] hits = results.scoreDocs;
        int numTotalHits = Math.toIntExact(results.totalHits);
        // Load the document for each hit and print both the hit and the document.
        for (ScoreDoc hit : hits) {
            Document doc = searcher.doc(hit.doc);
            System.out.println(hit);
            System.out.println(doc);
        }

Here is what it prints:

doc=62331 score=11.589677 shardIndex=0
Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<docid:71477>>
doc=55135 score=10.742379 shardIndex=0
Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<docid:61916>>
doc=72442 score=10.742379 shardIndex=0
Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<docid:76853>>
doc=145087 score=10.742379 shardIndex=0
Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<docid:141977>>

So the problem is that the Document object seems to contain only the docid, but it should contain some other text fields, right? I'm not sure whether there is a problem with my code...

lemire commented on September 20, 2024

The primary purpose of the index is to allow you to find the document identifiers matching a query. If you want to store additional fields, you can do so with commands such as doc.add(new Field("mykey","my value", TextField.TYPE_STORED));. Please see the Lucene documentation.
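To illustrate the distinction between indexed and stored fields, here is a minimal toy model in plain Java. It is not Lucene's API and not how Lucene is implemented: the class ToyIndex and all of its methods are hypothetical names made up for this sketch. It only shows the idea that searching consults the inverted index, while retrieving a document returns just the values that were explicitly stored.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: NOT Lucene's API. All names here are hypothetical.
class ToyIndex {
    // Inverted index: "field:term" -> ids of the documents containing that term.
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // Stored fields: one map per document, holding only values flagged as stored.
    private final List<Map<String, String>> storedFields = new ArrayList<>();

    // Start a new, empty document and return its id.
    int newDocument() {
        storedFields.add(new HashMap<>());
        return storedFields.size() - 1;
    }

    // Index the field's terms; keep the original value only if 'stored' is true.
    void addField(int docId, String field, String value, boolean stored) {
        for (String term : value.toLowerCase().split("\\s+")) {
            postings.computeIfAbsent(field + ":" + term, k -> new TreeSet<>()).add(docId);
        }
        if (stored) {
            storedFields.get(docId).put(field, value);
        }
    }

    // Searching consults only the inverted index.
    Set<Integer> search(String field, String term) {
        return postings.getOrDefault(field + ":" + term.toLowerCase(), new TreeSet<>());
    }

    // Retrieval consults only the stored fields.
    Map<String, String> doc(int docId) {
        return storedFields.get(docId);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        int id = index.newDocument();
        index.addField(id, "docid", "71477", true);                   // indexed and stored
        index.addField(id, "doctitle", "Query this sentence", false); // indexed only

        // The title is searchable...
        System.out.println(index.search("doctitle", "sentence")); // prints [0]
        // ...but only the stored field comes back with the document.
        System.out.println(index.doc(id)); // prints {docid=71477}
    }
}
```

In this toy model, the title matches the query yet is absent from the retrieved document, which mirrors the output reported above where only docid is printed. Passing true as the last argument for doctitle would make it come back as well, at the cost of keeping a second copy of the text.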
