
dbahadori / document-similarity


Web-based plagiarism detection system for Word documents using information-retrieval methods

License: Apache License 2.0



DocumentSimilarity

DocumentSimilarity is a web-based plagiarism detection system for Word documents that uses information-retrieval methods. Its core model is document similarity analysis, with tf-idf as the term-weighting method. The system is implemented with Lucene.Net and ASP.NET MVC.
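To make the tf-idf idea concrete, here is a minimal sketch in Python of weighting terms and comparing two documents by cosine similarity. It is an illustration only: the function names, the sample corpus, and the particular tf-idf variant (raw term frequency times `log(N / df)`) are assumptions, and it is not the scoring formula that Lucene.Net applies internally.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    Weight = raw term frequency * log(N / document frequency).
    This is one common tf-idf variant, chosen for illustration only.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine between two sparse vectors (0.0 if either has zero length)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical three-document corpus, already tokenized.
corpus = [
    "the cat sat on the mat".split(),
    "the cat lay on the rug".split(),
    "stock prices fell sharply today".split(),
]
vectors = tf_idf_vectors(corpus)
sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
```

In this sketch the first two documents share several terms and get a score between 0 and 1, while the third shares none with the first and scores 0. Terms that occur in every document would receive weight 0, which is the usual effect of the idf factor.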

Solution

There are two pre-processing steps that prepare documents for indexing in the document repository.

  1. In the first step, the system pre-processes the document as soon as it is imported in order to extract its sections. The extracted information is initially held only in memory. To save it permanently on disk, an XML template is used.

  2. After all sections of the imported document have been extracted, the system pre-processes the content of each section by applying normalization, lemmatization, and tokenization. To perform lemmatization, each section is first split into sentences, and each sentence is then split into tokens. Each token is then lemmatized. Finally, the tokens are merged in the same order in which they were separated to rebuild the sentences.
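The second step can be sketched roughly as follows. This is a simplified Python illustration, not the project's code: the `LEMMAS` table is a hypothetical stand-in for a real lemmatizer, and the normalization and sentence-splitting rules are naive assumptions.

```python
import re

# Hypothetical toy lemma table; a real system would use a full lemmatizer.
LEMMAS = {
    "was": "be",
    "were": "be",
    "copied": "copy",
    "documents": "document",
    "sections": "section",
}


def normalize(text):
    """Collapse runs of whitespace and lowercase the text."""
    return re.sub(r"\s+", " ", text).strip().lower()


def split_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (a naive rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def tokenize(sentence):
    """Separate words from punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


def lemmatize(token):
    """Look the token up in the toy table; unknown tokens pass through."""
    return LEMMAS.get(token, token)


def preprocess_section(text):
    """Normalize, lemmatize token by token, and rejoin in the original order."""
    sentences = split_sentences(normalize(text))
    return [" ".join(lemmatize(tok) for tok in tokenize(s)) for s in sentences]


result = preprocess_section("The  Sections were copied. Documents  WERE indexed!")
# result == ["the section be copy .", "document be indexed !"]
```

Because tokens are rejoined in the order they were separated, each output string still corresponds to one input sentence, which keeps sentence boundaries available for later comparison.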

After the pre-processing steps, a document is created in the document database and each section is added to it as a field. The document is then indexed by the Lucene.Net search engine.

When the document database is searched, the query created in the system goes through the same pre-processing as a newly added document.
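The idea of storing sections as fields and analyzing queries the same way as documents can be illustrated with a tiny in-memory index. This Python sketch is a stand-in for the concept only and does not use or imitate the Lucene.Net API. The class `FieldedIndex`, the `analyze` function, and the field names `title` and `abstract` are all hypothetical.

```python
import re
from collections import defaultdict


def analyze(text):
    """Shared pre-processing stand-in: lowercase and split into word tokens."""
    return re.findall(r"\w+", text.lower())


class FieldedIndex:
    """Tiny in-memory inverted index keyed by (field, term)."""

    def __init__(self):
        self.postings = defaultdict(set)  # (field, term) -> set of doc ids
        self.documents = {}               # doc id -> {field: original text}

    def add_document(self, doc_id, sections):
        """Store the document and index each section as a separate field."""
        self.documents[doc_id] = dict(sections)
        for field, text in sections.items():
            for term in analyze(text):
                self.postings[(field, term)].add(doc_id)

    def search(self, field, query):
        """Return ids of documents whose field contains every query term."""
        terms = analyze(query)  # the query goes through the same analysis
        if not terms:
            return set()
        result = set(self.postings.get((field, terms[0]), set()))
        for term in terms[1:]:
            result &= self.postings.get((field, term), set())
        return result


index = FieldedIndex()
index.add_document("d1", {"title": "Plagiarism Detection",
                          "abstract": "Detecting copied text with TF-IDF"})
index.add_document("d2", {"title": "Image Compression",
                          "abstract": "Plagiarism is not discussed here"})

hits_title = index.search("title", "PLAGIARISM detection")
hits_abstract = index.search("abstract", "plagiarism")
```

Running the query through the same `analyze` function as the documents is what lets `"PLAGIARISM detection"` match the title `"Plagiarism Detection"`. If the two sides were analyzed differently, equivalent terms could fail to match.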

System Functionality

  • Add a new document: The user can upload new documents to the system. A document must be uploaded before it can be displayed, searched, or included in similarity calculations.

  • Index documents: The system automatically indexes documents after they are uploaded. Indexing makes later searches and the retrieval of similar documents possible.

  • Display documents: The user can view the documents indexed in the document repository.

  • Delete a document: The user can delete any displayed document that is indexed in the document repository. Documents deleted from the database are also physically deleted from the storage where they are kept.

  • Search (simple and advanced): The user can search the documents indexed in the document repository. The search function includes a simple search and an advanced search. In the advanced search, the user can narrow the scope of the search with various filters.

  • Create a new user account: The user can create a new account in the system. A valid user account is required to use the other features of the system.

  • Log in: The user can log into the system with the username and password chosen when the account was created. Logging in is required to use the other features of the system.

  • Log out: A logged-in user can log out of the system.

  • Change password: After logging in, the user can change the password set when the account was created.

  • Calculate the degree of similarity between new and existing documents: The user can obtain the degree of similarity between their document and the documents in the document repository.

System Architecture

This system is implemented using a layered architectural style. The layers of the system and the components of each layer are shown below.
