Comments (4)

hankcs commented on May 28, 2024
  1. hanlp-lucene-plugin currently supports Lucene 7.2.0, not Lucene 9.7. The class org.apache.lucene.analysis.util.TokenizerFactory does not exist in Lucene 9.7, so your code cannot compile against it. Either you are not running the code you listed and are using a different tokenizer, or you are not running the official release.
  2. With Lucene 7.2.0 the "no search results" problem does not occur: https://github.com/hankcs/hanlp-lucene-plugin/blob/c6be0de363022a38436490cd19761881ebad41e8/src/test/java/com/hankcs/lucene/HanLPAnalyzerTest.java#L87
    public void testIndexAndSearch() throws Exception
    {
        // Use the same analyzer for indexing and for query parsing
        Analyzer analyzer = new HanLPAnalyzer();
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        Directory directory = new RAMDirectory();
        IndexWriter indexWriter = new IndexWriter(directory, config);

        // Index a single document with one stored text field
        Document document = new Document();
        document.add(new TextField("content", "**人", Field.Store.YES));
        indexWriter.addDocument(document);

        indexWriter.commit();
        indexWriter.close();

        // Search for the same text and expect exactly one hit
        IndexReader ireader = DirectoryReader.open(directory);
        IndexSearcher isearcher = new IndexSearcher(ireader);
        QueryParser parser = new QueryParser("content", analyzer);
        Query query = parser.parse("**人");
        ScoreDoc[] hits = isearcher.search(query, 300000).scoreDocs;
        assertEquals(1, hits.length);
        for (ScoreDoc scoreDoc : hits)
        {
            Document targetDoc = isearcher.doc(scoreDoc.doc);
            System.out.println(targetDoc.getField("content").stringValue());
        }
    }
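
When a build pulls in Lucene transitively, it is easy to lose track of which Lucene jar actually supplies a class such as TokenizerFactory. The following is a minimal sketch of one way to check, not part of hanlp-lucene-plugin: `ClasspathProbe` and `locate` are hypothetical helper names, and the sketch only reports where the JVM would load a class from.

```java
import java.security.CodeSource;

public class ClasspathProbe {
    /** Returns the location a class is loaded from, or a short marker if it is absent. */
    public static String locate(String className) {
        try {
            // Resolve the class without running its static initializers
            Class<?> cls = Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            // Classes from the JDK itself have no code source
            return source == null ? "(bootstrap/JDK)" : String.valueOf(source.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // Prints the jar that provides the class, if any jar on the classpath does
        System.out.println(locate("org.apache.lucene.analysis.util.TokenizerFactory"));
    }
}
```

If the printed location is a jar, its file name normally includes the Lucene version. Running `mvn dependency:tree` gives the same information from the Maven side.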

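SxunS commented on May 28, 2024
  1. Sorry, you are right. Because the project is built with Maven, I did not notice that the org.apache.lucene.analysis.util.TokenizerFactory class actually in use comes from Lucene 7.2.0, which is why compilation did not fail (I am running the official release).
  2. For the test case above, I set up a fresh, clean environment. The Maven dependency coordinates are as follows: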
<dependencies>
    <dependency>
      <groupId>com.hankcs.nlp</groupId>
      <artifactId>hanlp-lucene-plugin</artifactId>
      <version>1.1.7</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.13.2</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>com.hankcs</groupId>
      <artifactId>hanlp</artifactId>
      <version>portable-1.8.4</version>
    </dependency>
</dependencies>

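The result is still the same as described in the issue.
  3. After removing the portable-1.8.4 dependency, the search returns the document as expected, so I suspect the problem is related to com.hankcs:hanlp:portable-1.8.4.
  4. Test result with the portable-1.8.4 dependency included:
image
  5. Test result with the portable-1.8.4 dependency removed:
image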

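SxunS commented on May 28, 2024

Additional notes:

  1. portable-1.7.6: queries work correctly
  2. portable-1.8.4: queries fail (figure 1)
  3. Using option 2 (release jar + data + properties): queries work correctly

What is the difference between portable and the release jar? Is it only that the data dictionaries and models differ?
With portable I also use a custom dictionary (downloaded from the official source).
properties configuration: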

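#Root directory for the paths in this configuration file; root + other path = full path (relative paths are supported, see: https://github.com/hankcs/HanLP/pull/254)
#Windows users, note: always use / as the path separator
root=E:/xx/demo/document-search/document-search/document-search-server/src/main/resources

#That is the only part that must be modified; uncomment and edit the options below as needed.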
Normalization=true

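hankcs commented on May 28, 2024
  • Commit 3a99bc6 appears to have introduced an initialization bug
  • The portable version loads the small (mini) model by default
  • The bug only affects the mini model, and only the result of the first segmentation after the JRE starts
  • If you use the mini model, please use a version earlier than https://github.com/hankcs/HanLP/releases/tag/v1.8.1. Otherwise, portable or not, you are not affected as long as your hanlp.properties does not load the mini model.

Thanks for the feedback. This has been fixed; please check whether the commit above resolves the problem.
If the problem persists, feel free to reopen the issue.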
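
The thread does not spell out why a bug in the first segmentation makes a search miss. One plausible reading, offered here as an assumption rather than a confirmed diagnosis, is that the document is indexed with the faulty first segmentation while the query is analyzed by a later, correct call, so the indexed and queried terms never match. The toy model below illustrates only that reading: `FirstCallBugDemo`, `FlakyFirstCallSegmenter` and `matches` are hypothetical names, the "faulty" output is invented for illustration, and nothing here calls HanLP or Lucene.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

public class FirstCallBugDemo {
    /** Hypothetical stand-in for a segmenter whose very first call returns wrong tokens. */
    public static class FlakyFirstCallSegmenter implements Function<String, List<String>> {
        private boolean firstCall = true;

        @Override
        public List<String> apply(String text) {
            if (firstCall) {
                firstCall = false;
                // Simulated faulty result: the whole text as a single token
                return Arrays.asList(text);
            }
            // Simulated correct result: one token per character
            return Arrays.asList(text.split(""));
        }
    }

    /** Returns true if any query token was also produced at index time. */
    public static boolean matches(List<String> indexed, List<String> queried) {
        Set<String> terms = new HashSet<>(indexed);
        return queried.stream().anyMatch(terms::contains);
    }

    public static void main(String[] args) {
        Function<String, List<String>> segmenter = new FlakyFirstCallSegmenter();
        List<String> indexed = segmenter.apply("ab"); // first call: faulty tokens
        List<String> queried = segmenter.apply("ab"); // later call: correct tokens
        System.out.println("hit = " + matches(indexed, queried));
    }
}
```

In this model a throwaway call before indexing makes the miss disappear, which would be consistent with only the first segmentation being affected. That is an inference from the toy model, not a tested workaround; the remedy given in the thread is to use a version that does not contain the bug.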
