Coder Social home page Coder Social logo

bcgov / bcdc-search-improvement-capstone Goto Github PK

View Code? Open in Web Editor NEW
1.0 3.0 1.0 166.99 MB

BC Data Catalogue search engine improvement proof-of-concept project. This work was completed by a student team as a UBC MDS Capstone project (June 2023).

License: Apache License 2.0

Dockerfile 0.03% Python 0.36% Shell 6.71% Batchfile 4.41% AMPL 0.02% CSS 14.66% JavaScript 72.30% XSLT 1.52%
bcdc bcstats citz data-science

bcdc-search-improvement-capstone's Introduction

LicenseLifecycle:Experimental

bcdc-search-improvement-capstone

A UBC MDS Capstone project focused on a proof-of-concept of the application of natural language processing modeling to improve the user search experience of the B.C. Data Catalogue.

Project Status

Work In Progress

Problem Statement

The BC Data Catalogue contains over 4,000 datasets with lots of useful information, available for anyone to use. Despite its wealth of information, users often faced difficulties in locating the datasets they needed, resulting in lower engagement with the platform.
To address this issue, we used SBERT to implement Semantic Search on top of Solr search engine. Semantic search is an advanced NLP technique that focuses on understanding user intent rather than relying solely on specific keywords.
The new search engine shows a significant improvement in search performance and has the ability to comprehend synonyms and phrases. It can also handle typing errors, making it more intuitive and user-friendly.

How to run?

  1. install Java 8
  2. install dependencies using conda
conda env create -f environment.yml
  1. start Solr
bin/solr.cmd start
  1. Run the streamlit app
streamlit run search_engine.py

Data Sources

B.C. Data Catalogue text data: sourced directly from the B.C. Data Catalogue available under the Open Government Licence - British Columbia.

Software

Getting Help or Reporting an Issue

To report bugs/issues/feature requests, please file an issue.

How to Contribute

If you would like to contribute, please see our CONTRIBUTING guidelines.

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

Dependencies

Solr is an open source search platform built on Apache Lucene. We used Solr 6.6.6 in this project. The license for Solr can be found at solr/LICENSE.txt.
We also used Vector Scoring Plugin for Solr to calculate the distance between the query and the documents.

License

Copyright 2023 Province of British Columbia

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and limitations under the License.

bcdc-search-improvement-capstone's People

Contributors

behrooz-q avatar repo-mountie[bot] avatar stephhazlitt avatar

Stargazers

 avatar

Watchers

 avatar  avatar  avatar

Forkers

jonjunduan

bcdc-search-improvement-capstone's Issues

Add missing topics

TL;DR

Topics greatly improve the discoverability of repos; please add the short code from the table below to the topics of your repo so that ministries can use GitHub's search to find out what repos belong to them and other visitors can find useful content (and reuse it!).

Why Topic

In short order we'll add our 800th repo. This large number clearly demonstrates the success of using GitHub and our Open Source initiative. This huge success means it's critical that we work to make our content as discoverable as possible. Through discoverability, we promote code reuse across a large decentralized organization like the Government of British Columbia as well as allow ministries to find the repos they own.

What to do

Below is a table of abbreviation a.k.a short codes for each ministry; they're the ones used in all @gov.bc.ca email addresses. Please add the short codes of the ministry or organization that "owns" this repo as a topic.

add a topic

That's it, you're done!!!

How to use

Once topics are added, you can use them in GitHub's search. For example, enter something like org:bcgov topic:citz to find all the repos that belong to Citizens' Services. You can refine this search by adding key words specific to a subject you're interested in. To learn more about searching through repos check out GitHub's doc on searching.

Pro Tip ๐Ÿค“

  • If your org is not in the list below, or the table contains errors, please create an issue here.

  • While you're doing this, add additional topics that would help someone searching for "something". These can be the language used javascript or R; something like opendata or data for data only repos; or any other key words that are useful.

  • Add a meaningful description to your repo. This is hugely valuable to people looking through our repositories.

  • If your application is live, add the production URL.

Ministry Short Codes

Short Code Organization Name
AEST Advanced Education, Skills & Training
AGRI Agriculture
ALC Agriculture Land Commission
AG Attorney General
MCF Children & Family Development
CITZ Citizens' Services
DBC Destination BC
EMBC Emergency Management BC
EAO Environmental Assessment Office
EDUC Education
EMPR Energy, Mines & Petroleum Resources
ENV Environment & Climate Change Strategy
FIN Finance
FLNR Forests, Lands, Natural Resource Operations & Rural Development
HLTH Health
IRR Indigenous Relations & Reconciliation
JEDC Jobs, Economic Development & Competitiveness
LBR Labour Policy & Legislation
LDB BC Liquor Distribution Branch
MMHA Mental Health & Addictions
MAH Municipal Affairs & Housing
BCPC Pension Corporation
PSA Public Service Agency
PSSG Public Safety and Solicitor General
SDPR Social Development & Poverty Reduction
TCA Tourism, Arts & Culture
TRAN Transportation & Infrastructure
WLRS Water, Land and Resource Stewardship

NOTE See an error or omission? Please create an issue here to get it remedied.

Lets use common phrasing

TL;DR ๐ŸŽ๏ธ

Teams are encouraged to favour modern inclusive phrasing both in their communication as well as in any source checked into their repositories. You'll find a table at the end of this text with preferred phrasing to socialize with your team.

Words Matter

We're aligning our development community to favour inclusive phrasing for common technical expressions. There is a table below that outlines the phrases that are being retired along with the preferred alternatives.

During your team scrum, technical meetings, documentation, the code you write, etc. use the inclusive phrasing from the table below. That's it - it really is that easy.

For the curious mind, the Public Service Agency (PSA) has published a guide describing how Words Matter in our daily communication. Its an insightful read and a good reminder to be curious and open minded.

What about the master branch?

The word "master" is not inherently bad or non-inclusive. For example people get a masters degree; become a master of their craft; or master a skill. It's generally when the word "master" is used along side the word "slave" that it becomes non-inclusive.

Some teams choose to use the word main for the default branch of a repo as opposed to the more commonly used master branch. While it's not required or recommended, your team is empowered to do what works for them. If you do rename the master branch consider using main so that we have consistency among the repos within our organization.

Preferred Phrasing

Non-Inclusive Inclusive
Whitelist => Allowlist
Blacklist => Denylist
Master / Slave => Leader / Follower; Primary / Standby; etc
Grandfathered => Legacy status
Sanity check => Quick check; Confidence check; etc
Dummy value => Placeholder value; Sample value; etc

Pro Tip ๐Ÿค“

This list is not comprehensive. If you're aware of other outdated nomenclature please create an issue (PR preferred) with your suggestion.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.