solr4-extras
Random Solr4 customizations (in Scala).
Secure - search "encrypted" data collections
A custom Solr search component that allows searching against a collection that is split between MongoDB and Solr. Solr contains the searchable but unstored version of each document, and MongoDB contains the encrypted version (along with the user keys in another collection). At search time, the query is run against Solr, and the matching documents are retrieved from MongoDB, decrypted, and presented to the client.
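At the core of the store/retrieve cycle is symmetric encryption of the document body. A minimal Scala sketch, assuming AES/CBC via javax.crypto (the actual cipher, key size, and key-storage scheme are the ones described in the blog post, not necessarily these):

```scala
import javax.crypto.{Cipher, KeyGenerator, SecretKey}
import javax.crypto.spec.IvParameterSpec

object CryptoSketch {
  // A per-user AES key; in the component the keys live in their own
  // MongoDB collection.
  def generateKey(): SecretKey = {
    val kg = KeyGenerator.getInstance("AES")
    kg.init(128)
    kg.generateKey()
  }

  // Encrypt a document body before writing it to MongoDB; the random IV
  // is prepended to the ciphertext so decryption can recover it.
  def encrypt(plain: Array[Byte], key: SecretKey): Array[Byte] = {
    val cipher = Cipher.getInstance("AES/CBC/PKCS5Padding")
    cipher.init(Cipher.ENCRYPT_MODE, key)
    cipher.getIV ++ cipher.doFinal(plain)
  }

  // Decrypt a document fetched from MongoDB at search time.
  def decrypt(enc: Array[Byte], key: SecretKey): Array[Byte] = {
    val cipher = Cipher.getInstance("AES/CBC/PKCS5Padding")
    cipher.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(enc.take(16)))
    cipher.doFinal(enc.drop(16))
  }
}
```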
More info on my Blog Post
Setup Instructions
- Download and install MongoDB and Solr4.
- Copy the contents of conf/secure to the Solr example/solr/collection1/conf directory.
- git clone this project, then run "sbt" from the command line. This will populate your .ivy2 cache with the necessary JAR files for this project.
- Make a directory example/solr/collection1/lib and copy the following JAR files from your .ivy2 cache into it. Here is the list I built by trial and error: (casbah-commons_2.9.2-2.3.0.jar, mongo-java-driver-2.8.0.jar, casbah-core_2.9.2-2.3.0.jar, scala-library.jar, casbah-query_2.9.2-2.3.0.jar, scalaj-collection_2.9.1-1.2.jar, casbah-util_2.9.2-2.3.0.jar).
- Run "sbt package" on the command line. This will create the solr4-extras JAR file. Copy this file to the lib directory above.
- Start mongod.
- Start Solr4 (java -jar start.jar).
- Download and expand the Enron dataset and update conf/secure/secure.properties to point to it.
- Run "sbt run" to run the indexer and populate the index and MongoDB tables.
- Create an index on the email collection: db.emails.ensureIndex({"message_id": 1})
- You should now see results from queries to the custom /secure_select service. Example URL: http://localhost:8983/solr/collection1/secure_select?q=body:%22hedge%20fund%22&fq=from:[email protected]&[email protected]
FuncQuery - function queries to influence ranking using demographics
SolrJ code to write random score values and a title to a Solr instance so these can be used in function queries. There is no front-end code (although I could have written a JUnit test to demonstrate the function queries in action) and no configuration changes. More info on my Blog Post.
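Hedged examples of the kind of function queries these stored scores enable at search time (field names such as age_score and gender_score are assumptions for illustration; the actual queries are in the blog post):

```
# additive boost with edismax: add the demographic score to the text score
q=title:shoes&defType=edismax&bf=age_score

# multiplicative boost: scale the text score by the stored score
q={!boost b=gender_score}title:shoes
```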
Payloads - a Solr4 port for concept maps as payloads
Payload implementation for modeling concepts and their scores as payload fields, with a custom Similarity and a QParser for payloads. Needs the following configuration:
Setup Instructions
- Build the JAR using sbt package.
- Copy the JAR into lib, along with scala-compiler.jar and scala-library.jar.
- Make the following modifications to conf/schema.xml and conf/solrconfig.xml.
schema.xml:
<field name="cscores" type="payloads" indexed="true" stored="true"/>
<similarity
class="com.mycompany.solr4extras.payloads.MyCompanySimilarityWrapper"/>
solrconfig.xml:
<queryParser name="payloadQueryParser"
class="com.mycompany.solr4extras.payloads.PayloadQParserPlugin"/>
<requestHandler name="/cselect" class="solr.SearchHandler">
<lst name="defaults">
<str name="defType">payloadQueryParser</str>
</lst>
</requestHandler>
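How a payload-aware similarity consumes such a field can be sketched in a few lines of Scala. This assumes the "term|score" format that Solr's DelimitedPayloadTokenFilterFactory produces by default with its float encoder; the actual field format and similarity logic are described in the blog post:

```scala
object PayloadSketch {
  // Parse a whitespace-separated payload field value such as
  // "diabetes|0.75 insulin|0.25" into (concept -> score) pairs.
  def parse(fieldValue: String): Map[String, Float] =
    fieldValue.split("\\s+").map { tok =>
      val Array(term, score) = tok.split("\\|")
      term -> score.toFloat
    }.toMap

  // A payload-aware similarity scores a matching concept term by its
  // payload value rather than by TF-IDF; here we just look it up.
  def payloadScore(fieldValue: String, concept: String): Float =
    parse(fieldValue).getOrElse(concept, 0.0f)
}
```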
More info on my Blog Post.
NER - Named Entity Extraction with LingPipe
Uses LingPipe to construct regex-based and dictionary-based Named Entity Extractors backed by Solr, used for query preprocessing.
The following fields need to be defined in schema.xml for the SolrMapDictionary object:
<field name="nercat" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="nerval" type="text_general" indexed="true" stored="true"/>
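The dictionary lookup itself can be sketched as a greedy longest-match over tokens, a simplified stand-in for LingPipe's ExactDictionaryChunker (the phrases and categories below are made up for illustration; the real dictionary is loaded from the nercat/nerval fields via SolrMapDictionary):

```scala
object DictNerSketch {
  // (phrase -> category) dictionary, lowercased.
  val dict = Map(
    "new york" -> "PLACE",
    "goldman sachs" -> "ORG")

  // Scan the text left to right, preferring the longest phrase that
  // matches at the current token, and emit (phrase, category) chunks.
  def chunk(text: String): List[(String, String)] = {
    val tokens = text.toLowerCase.split("\\s+").toList
    val maxLen = dict.keys.map(_.split(" ").length).max
    def go(ts: List[String]): List[(String, String)] = ts match {
      case Nil => Nil
      case _ =>
        val hit = (math.min(maxLen, ts.length) to 1 by -1).iterator
          .map(n => ts.take(n).mkString(" "))
          .find(dict.contains)
        hit match {
          case Some(phrase) =>
            (phrase, dict(phrase)) :: go(ts.drop(phrase.split(" ").length))
          case None => go(ts.tail)
        }
    }
    go(tokens)
  }
}
```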
More info on my Blog Post.
Concept Embedding - Mixed Concept + Text queries
Code written against Solr4 to embed concept IDs, like synonyms, within text. Includes a custom TokenFilter and Analyzer to support this work, plus configuration and JUnit tests. The configuration consists of the following field definitions and the following fieldType definition:
<!-- for concept position -->
<field name="itemtitle" type="text_en" indexed="true" stored="true"/>
<field name="itemtitle_cp" type="text_cp" indexed="true" stored="true"/>
<!-- text_cp field type definition -->
<fieldType name="text_cp" class="solr.TextField">
<analyzer type="index"
class="com.mycompany.solr4extras.cpos.ConceptPositionAnalyzer"/>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_en.txt"
enablePositionIncrements="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory"
protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
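The core idea of the index-side analyzer can be sketched as follows: a concept ID is injected at the position of the first word of its phrase with a position increment of 0, the same trick Solr uses for multi-word synonyms. The phrase-to-ID mapping below is made up for illustration; the real analyzer looks concepts up in a concept database:

```scala
object ConceptPositionSketch {
  // Hypothetical mapping from surface phrases to concept IDs.
  val concepts = Map("heart attack" -> "8001365")

  // Emit (token, positionIncrement) pairs; increment 0 means "same
  // position as the previous token".
  def analyze(text: String): List[(String, Int)] = {
    val tokens = text.toLowerCase.split("\\s+").toList
    tokens.zipWithIndex.flatMap { case (tok, i) =>
      val injected = concepts.collect {
        case (phrase, id) if {
          val words = phrase.split(" ").toList
          tokens.slice(i, i + words.length) == words
        } => (id, 0)
      }
      (tok, 1) :: injected.toList
    }
  }
}
```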
More info on my Blog Post.
Near-Duplicate Detection
Uses shingles and MinHashing to implement near-duplicate detection on the Restaurant Dataset. No customization of Solr is required; everything is done in the client. The following new fields need to be declared to use this application:
<field name="content" type="text_general" indexed="false" stored="true"
multiValued="false"/>
<field name="md5_hash" type="string" indexed="true" stored="true"/>
<field name="num_words" type="int" indexed="true" stored="true" />
<field name="first_word" type="string" indexed="true" stored="true"/>
<field name="content_ng" type="string" indexed="true" stored="true"
multiValued="true"/>
<field name="content_sg" type="string" indexed="true" stored="true"
multiValued="true"/>
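The client-side pipeline can be sketched as below. The shingle size, number of hash functions, and salted-string hashing are illustrative choices, not the exact parameters used in the code:

```scala
object MinHashSketch {
  // k-word shingles of a document, the unit of near-duplicate comparison.
  def shingles(text: String, k: Int = 3): Set[String] =
    text.toLowerCase.split("\\s+").sliding(k).map(_.mkString(" ")).toSet

  // MinHash signature: for each of n salted hash functions, keep the
  // minimum hash value over the shingle set.
  def signature(sh: Set[String], n: Int = 16, seed: Int = 42): Seq[Int] = {
    val rnd = new scala.util.Random(seed)
    val salts = Seq.fill(n)(rnd.nextInt())
    salts.map(salt => sh.map(s => (s + salt).hashCode).min)
  }

  // Fraction of matching signature slots estimates Jaccard similarity
  // between the two shingle sets.
  def similarity(a: Seq[Int], b: Seq[Int]): Double =
    a.zip(b).count { case (x, y) => x == y }.toDouble / a.length
}
```

Documents whose estimated similarity exceeds a chosen threshold are flagged as near-duplicates.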
More info on my Blog Post.