phymbert / spark-search
Spark Search - high performance advanced search features based on Apache Lucene
License: Apache License 2.0
After the #64 upgrade, the check on the submodule disappears.
When running SearchRDDJavaExamples:
Serialization stack:
- object not serializable (class: java.lang.Object, value: java.lang.Object@6528d1e1)
- field (class: org.apache.spark.search.package$SearchRecord, name: source, type: class java.lang.Object)
- object (class org.apache.spark.search.package$SearchRecord, SearchRecord(8399,1,1.7722454071044922,0,java.lang.Object@6528d1e1))
- element of array (index: 0)
- array (class [Lorg.apache.spark.search.package$SearchRecord;, size 100)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
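The record type carried in SearchRecord#source must survive Java serialization, and the default JavaSerializer is what rejects it here. A possible mitigation sketch, assuming the example records are otherwise Kryo-friendly: switch the job to Kryo serialization, which does not require java.io.Serializable:

import org.apache.spark.SparkConf

// hedged sketch: Kryo can serialize plain objects that the default
// JavaSerializer in the stack above rejects
val conf = new SparkConf()
  .setAppName("SearchRDDJavaExamples")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")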
It should be possible to continuously index and search using Spark Streaming.
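A minimal sketch of what this could look like with the standard DStream API; the searchList operation and the org.apache.spark.search.rdd._ implicits are taken from the project examples and should be treated as assumptions:

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.search.rdd._ // assumed spark-search implicits

case class Review(reviewText: String)

val ssc = new StreamingContext(sc, Seconds(10)) // sc: existing SparkContext
// index each micro-batch as it arrives, then query it
ssc.socketTextStream("localhost", 9999)
  .map(Review(_))
  .foreachRDD(rdd => rdd.searchList("reviewText:happy", topK = 10).foreach(println))
ssc.start()
ssc.awaitTermination()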
Separate join and matches aggregation into two methods.
This will allow the user to choose a better aggregation method and top-k monoid algorithm, as sketched below.
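For illustration, the combine step of such a top-k monoid could look like this generic sketch (not an existing API):

// merge two locally sorted top-k hit lists into a global top-k,
// keeping the k highest-scored entries
def mergeTopK[T](k: Int)(left: Seq[(Double, T)], right: Seq[(Double, T)]): Seq[(Double, T)] =
  (left ++ right).sortBy(-_._1).take(k)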
Need to clean up temporary resources in the local directory after save.
When loading a SearchRDD from HDFS, Lucene indices are corrupted.
If SearchIndexedRDD is not cached on YARN, the child RDD does not run on the same workers, triggering a new indexation.
Conversely, if it is cached, the MatchRDD partition is not located on the same executor and the index directory is not found.
Standalone Spark with no cache:
20/06/28 21:41:31 WARN TaskSetManager: Lost task 0.0 in stage 14.0 (TID 46, 172.21.0.6, executor 2): org.apache.lucene.index.IndexNotFoundException: no segments* file found in MMapDirectory@/tmp/spark-search-rdd9-index-0 lockFactory=org.apache.lucene.store.NoLockFactory@253f0c73: files: []
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:675)
at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:84)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:76)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:64)
at org.apache.spark.search.rdd.SearchPartitionReader.<init>(SearchPartitionReader.java:67)
at org.apache.spark.search.rdd.SearchRDD.reader(SearchRDD.scala:206)
at org.apache.spark.search.rdd.MatchRDD.compute(MatchRDD.scala:55)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
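A workaround sketch for the locality side of this: persist the indexed RDD so downstream RDDs are scheduled on the executors holding the per-partition Lucene indices. The searchRDD() builder from the org.apache.spark.search.rdd._ implicits is an assumption:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.search.rdd._ // assumed spark-search implicits

case class Review(reviewText: String)
val reviews = sc.parallelize(Seq(Review("great product"), Review("poor quality")))
// keep indexed partitions alive so MatchRDD finds its index directory
val indexed = reviews.searchRDD().persist(StorageLevel.MEMORY_AND_DISK)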
Revert the Nexus configuration at module level back to the parent.
Need to refactor the examples and integration tests to run faster and use a single Docker/docker-compose file, with build args/env variables based on the current action matrix.
At the moment duplicates are lost; an accumulator might optionally be passed to convert duplicates.
Example:
var searchRDD: SearchRDD[R] = ???
searchRDD.searchDropDuplicates(t => s"name:${t.name}", (v1, v2) => v1 ++ v2, Seq())
Also include splitting SearchRDD into search and pair RDD operations.
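A hypothetical sketch of the optional accumulator, capturing a Spark accumulator in the merge closure to count converted duplicates (the searchDropDuplicates signature is reused from the example above):

import org.apache.spark.util.LongAccumulator

val dropped: LongAccumulator = sc.longAccumulator("droppedDuplicates")
searchRDD.searchDropDuplicates(
  t => s"name:${t.name}",
  (v1, v2) => { dropped.add(1); v1 ++ v2 }, // count each merged duplicate
  Seq()
)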
The reloaded partition fails to be opened by the search reader because the indices are corrupted.
Solution: add an optional step to download the file from HDFS locally before unzipping it; see the sketch below.
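A minimal sketch of that step using the standard Hadoop FileSystem API (paths are illustrative):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val src = new Path("hdfs:///spark-search/rdd0-index-0.zip")
val dst = new Path("/tmp/spark-search/rdd0-index-0.zip")
// copy the zipped index to the executor's local disk before unzipping
src.getFileSystem(new Configuration()).copyToLocalFile(src, dst)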
Link to:
https://oss.sonatype.org/
https://central.sonatype.org/publish/publish-maven/
https://central.sonatype.org/publish/publish-guide/#releasing-to-central
https://oss.sonatype.org/#view-repositories;releases~browsestorage~io/github/phymbert
https://issues.sonatype.org/browse/OSSRH-58231
Fix: sources & javadoc generation should happen during the package lifecycle phase, as well as during deploy and maven:release (needs to be added to the parent & modules).
At the moment:
mvn clean source:jar javadoc:jar verify nexus-staging:deploy nexus-staging:deploy-staged nexus-staging:release -DskipTests -P deploy,scala-2.11
To deploy only the modules (not the parent) for another Scala version:
mvn clean source:jar javadoc:jar verify nexus-staging:deploy nexus-staging:deploy-staged nexus-staging:release -DskipTests -pl core,sql -P deploy,scala-2.12
Upon release, your component will be published to Central: this typically occurs within 30 minutes, though updates to search can take up to four hours.
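A sketch of the parent-pom change, binding the standard plugins so sources & javadoc are attached automatically during package instead of being invoked by hand:

<!-- attach -sources.jar during the package phase -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-source-plugin</artifactId>
  <executions>
    <execution>
      <id>attach-sources</id>
      <goals><goal>jar-no-fork</goal></goals>
    </execution>
  </executions>
</plugin>
<!-- attach -javadoc.jar during the package phase -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-javadoc-plugin</artifactId>
  <executions>
    <execution>
      <id>attach-javadocs</id>
      <goals><goal>jar</goal></goals>
    </execution>
  </executions>
</plugin>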
Hello - I'm attempting to use this library (v0.2) on YARN, with my driver running on the cluster.
I am encountering the following exception:
Caused by: org.apache.lucene.index.IndexNotFoundException: no segments* file found in MMapDirectory@/local/hadoop/disksdl/yarn/nodemanager/usercache/spotci/appcache/application_1617967855014_1171701/container_e136_1617967855014_1171701_02_000001/tmp/spark-search/application_1617967855014_1171701-sparksearch-rdd0-index-3 lockFactory=org.apache.lucene.store.NoLockFactory@4a1941a4: files: []
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:715)
at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:84)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:64)
I'm wondering if there is any info on where to start looking into why it would be empty?
Thanks.
Continue the SQL logical & physical plan for the Spark Search SQL API.
See action #130.
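For context, a physical plan typically gets wired into Spark SQL through experimental extra strategies; SearchStrategy below is a hypothetical stub, not an existing spark-search class:

import org.apache.spark.sql.{SparkSession, Strategy}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// hypothetical stub: would translate search predicates into a
// Lucene-backed physical scan
object SearchStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = Nil // fall through for now
}

val spark = SparkSession.builder.appName("spark-search-sql").getOrCreate()
spark.experimental.extraStrategies ++= Seq(SearchStrategy)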
Same for the Java & Scala examples:
2021-08-20T22:36:18.0615405Z Some typo in names:
2021-08-20T22:36:24.0708492Z Downloading software reviews...
2021-08-20T22:36:27.3572928Z Joined software and computer reviews by reviewer names:
2021-08-20T22:36:36.7498019Z Dropping duplicated reviewers:
2021-08-20T22:37:23.4888234Z Restoring from previous indexation:
2021-08-20T22:37:30.1019131Z 8103 positive reviews after restoration
For small amounts of data we do not get this issue, but when we try to process huge data we get the following exception. Any guidance here would be a great help.
22/09/16 08:17:34 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 32.0 in stage 47.0 (TID 1449) (offerexposure-cluster-naidu-1-w-10.c.wmt-mtech-offerexposure-stg.internal executor 1): org.apache.spark.search.SearchException: indexation failed on partition 1 and directory /tmp/spark-search/application_1663309524066_0008-sparksearch-rdd149-index-1
at org.apache.spark.search.rdd.SearchPartitionIndex.monitorIndexation(SearchPartitionIndex.java:145)
at org.apache.spark.search.rdd.SearchPartitionIndex.index(SearchPartitionIndex.java:82)
at org.apache.spark.search.rdd.SearchRDDIndexer.compute(SearchRDDIndexer.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:386)
at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$6(BlockManager.scala:1461)
at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$6$adapted(BlockManager.scala:1459)
at org.apache.spark.storage.DiskStore.put(DiskStore.scala:70)
at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1459)
at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1350)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1414)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1237)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:384)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:335)
at org.apache.spark.search.rdd.SearchRDDCartesian.compute(SearchRDDCartesian.scala:54)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.nio.file.NoSuchFileException: /tmp/spark-search/application_1663309524066_0008-sparksearch-rdd149-index-1/pending_segments_1
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
at sun.nio.fs.UnixFileSystemProvider.implDelete(UnixFileSystemProvider.java:244)
at sun.nio.fs.AbstractFileSystemProvider.delete(AbstractFileSystemProvider.java:103)
at java.nio.file.Files.delete(Files.java:1126)
at org.apache.lucene.store.FSDirectory.privateDeleteFile(FSDirectory.java:370)
at org.apache.lucene.store.FSDirectory.deleteFile(FSDirectory.java:339)
at org.apache.lucene.store.LockValidatingDirectoryWrapper.deleteFile(LockValidatingDirectoryWrapper.java:38)
at org.apache.lucene.index.IndexFileDeleter.deleteFile(IndexFileDeleter.java:705)
at org.apache.lucene.index.IndexFileDeleter.deleteFiles(IndexFileDeleter.java:699)
at org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:238)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1089)
at org.apache.spark.search.rdd.SearchPartitionIndex.lambda$index$1(SearchPartitionIndex.java:90)
at org.apache.spark.search.rdd.SearchPartitionIndex.monitorIndexation(SearchPartitionIndex.java:128)
... 29 more
Thanks,
Naidu
When SearchRDDLucene#save is called, instead of throwing an exception, the user may want to erase the existing path.
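Until such an overwrite option exists, a workaround sketch with the standard Hadoop FileSystem API (the output path is illustrative):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val out = new Path("hdfs:///indices/reviews")
val fs = out.getFileSystem(new Configuration())
if (fs.exists(out)) fs.delete(out, true) // erase the existing path first
searchRDD.save(out.toString)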
If a search operation occurs on a different partition, it still triggers indexation.
To prevent this, SearchRDDLucene* should be marked as cached; the option to cache locally (defaulting to yes on YARN) must be set to true regardless of the backend scheduler.