Comments (7)
I think you expect the index to include the content of the documents. You can achieve this by setting the doc.store
property to true. This will considerably increase the size of the index, however.
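One thing that may be worth double-checking here: as far as I recall from the Lucene benchmark DocMaker Javadoc, the property it reads is spelled doc.stored (with doc.body.stored defaulting to the same value), not doc.store. Please verify that against the Lucene version you build with. A minimal sketch follows, assuming the indexer collects its settings in a java.util.Properties object before handing them to the benchmark configuration. The class and method names are hypothetical and only the standard library is used.

```java
import java.util.Properties;

/** Hypothetical helper: copies existing settings and asks for stored fields. */
class DocStoredConfig {

    /** Returns a copy of base with the stored-field property switched on. */
    static Properties withStoredFields(Properties base) {
        Properties props = new Properties();
        props.putAll(base);
        // Assumption: DocMaker looks up "doc.stored"; check the Javadoc of your Lucene version.
        props.setProperty("doc.stored", "true");
        return props;
    }

    public static void main(String[] args) {
        Properties base = new Properties();
        base.setProperty("content.source.forever", "false");
        Properties props = withStoredFields(base);
        // List every setting so a misspelled key is easy to spot.
        props.stringPropertyNames().forEach(
            key -> System.out.println(key + " = " + props.getProperty(key)));
    }
}
```

Printing the effective settings before indexing is a cheap way to catch a misspelled key, since an unknown property name is silently ignored rather than rejected.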
from indexwikipedia.
Thanks for the timely response!
I just tried adding properties.setProperty("doc.store", "true");
after line 84, but I still get the same problem.
I tried calling System.out.println(doc) after the document is added to the indexWriter, and it shows all those fields, so I suppose these documents are valid. I can also search for keywords in the title to retrieve some documents, so the data must be somewhere in the final index. But I just can't retrieve any of those fields except for the docid.
modified code:
while ((doc = docMaker.makeDocument()) != null) {
    indexWriter.addDocument(doc);
    System.out.println(doc);
    ++count;
}
output:
Document<stored,indexed,omitNorms,indexOptions=DOCS_ONLYdocid:207687 indexed,tokenized,omitNormsdocname:5040412 indexed,tokenized,omitNorms<docdate:31-OCT-2017 04:57:57.000> indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY,numericType=LONG,numericPrecisionStep=16docdatenum:1509440277000 indexed,tokenized,omitNorms,indexOptions=DOCS_ONLY,numericType=INT,numericPrecisionStep=8doctimesecnum:32277 indexed,tokenized,omitNorms<doctitle:Mr. Sc....
from indexwikipedia.
Can you go into a fresh directory and type the following, exactly...
git clone https://github.com/lemire/IndexWikipedia.git
cd IndexWikipedia
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles10.xml-p2336425p3046511.bz2
mkdir Index
mvn compile
mvn exec:java -Dexec.args="enwiki-latest-pages-articles10.xml-p2336425p3046511.bz2 Index"
Let me know what it tells you...
from indexwikipedia.
Hi Lemire,
Below is the output from executing the last command. I'm now checking whether I can retrieve documents:
Starting Indexing of Wikipedia dump /home/xinyu/tutorial/lucene/IndexWikipedia/enwiki-latest-pages-articles10.xml-p2336425p3046511.bz2
Indexed 1000 documents (1000 bodies) in 2129 ms
Indexed 2000 documents (2000 bodies) in 3717 ms
...
Indexed 159000 documents (158972 bodies) in 254279 ms
Indexed 160000 documents (159972 bodies) in 255809 ms
Indexed 161000 documents (160972 bodies) in 257037 ms
org.apache.lucene.benchmark.byTask.feeds.NoMoreDataException
at org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource$Parser.run(EnwikiContentSource.java:196)
at java.lang.Thread.run(Thread.java:748)
Indexing 161973 documents took 258266 ms
Total data processed: 0 bytes
Index should be located at /home/xinyu/tutorial/lucene/IndexWikipedia/Index
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 04:21 min
[INFO] Finished at: 2017-12-01T22:49:56-05:00
[INFO] Final Memory: 32M/2632M
[INFO] ------------------------------------------------------------------------
from indexwikipedia.
OK, so I expect the issue might be with your chosen data source.
I am closing this issue. Reopen if you cannot resolve your issues.
from indexwikipedia.
Hi Lemire,
I just checked with the retrieval program, and the problem still persists. This is the snippet I used to query some keywords:
String INDEX_PATH = "/home/xinyu/tutorial/lucene/IndexWikipedia/Index";
StandardAnalyzer analyzer = new StandardAnalyzer();
IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(INDEX_PATH)));
IndexSearcher searcher = new IndexSearcher(reader);
QueryParser parser = new QueryParser("doctitle", analyzer);
Query query = parser.parse("Query this sentence");
TopDocs results = searcher.search(query, 5 * 10);
ScoreDoc[] hits = results.scoreDocs;
int numTotalHits = Math.toIntExact(results.totalHits);
for (ScoreDoc hit : hits) {
    Document doc = searcher.doc(hit.doc);
    System.out.println(hit);
    System.out.println(doc);
}
Here is what it prints:
doc=62331 score=11.589677 shardIndex=0
Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<docid:71477>>
doc=55135 score=10.742379 shardIndex=0
Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<docid:61916>>
doc=72442 score=10.742379 shardIndex=0
Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<docid:76853>>
doc=145087 score=10.742379 shardIndex=0
Document<stored,indexed,tokenized,omitNorms,indexOptions=DOCS<docid:141977>>
So the problem is that the Document object seems to contain only the docid, but it should contain some other text fields, right? I'm not sure whether there is a problem with my code...
from indexwikipedia.
The primary purpose of the index is to allow you to find the document identifiers matching a query. If you want to store additional fields, you can do so with commands such as doc.add(new Field("mykey", "my value", TextField.TYPE_STORED));
Please see the Lucene documentation.
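The distinction between an indexed field and a stored field can be sketched with a toy model. This is not Lucene code and not how Lucene is implemented. It is a simplified illustration, using only the Java standard library, of why a search can match on a title while the retrieved document shows only the docid. The ToyIndex class and all of its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model (not Lucene): every field is searchable, only stored fields can be read back. */
class ToyIndex {
    // "field:term" -> ids of the documents whose field contains that term
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // document id -> only the fields that were explicitly marked as stored
    private final List<Map<String, String>> storedFields = new ArrayList<>();

    /** Creates an empty document and returns its id. */
    int newDocument() {
        storedFields.add(new HashMap<>());
        return storedFields.size() - 1;
    }

    /** Indexes the field for search; keeps the original value only if stored is true. */
    void addField(int docId, String name, String value, boolean stored) {
        for (String term : value.toLowerCase(Locale.ROOT).split("\\s+")) {
            postings.computeIfAbsent(name + ":" + term, k -> new TreeSet<>()).add(docId);
        }
        if (stored) {
            storedFields.get(docId).put(name, value);
        }
    }

    /** Returns the ids of the documents whose field contains the term. */
    Set<Integer> search(String field, String term) {
        return postings.getOrDefault(field + ":" + term.toLowerCase(Locale.ROOT), Set.of());
    }

    /** Returns what can be read back for a hit: the stored fields only. */
    Map<String, String> doc(int docId) {
        return storedFields.get(docId);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        int id = index.newDocument();
        index.addField(id, "docid", "71477", true);                  // stored
        index.addField(id, "doctitle", "Query this sentence", false); // indexed only
        for (int hit : index.search("doctitle", "sentence")) {
            System.out.println(hit + " -> " + index.doc(hit)); // prints "0 -> {docid=71477}"
        }
    }
}
```

In this model the title is found by the search but cannot be read back, which mirrors the output reported above. Marking the field as stored trades a larger index for the ability to retrieve the original text.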
from indexwikipedia.