patzacher / wfo_fuzzy_join

A repository for R code that uses a fuzzy join operation to find scientific plant name matches in abstracts of scientific articles with the option of using parallel processing.

doparallel dplyr fuzzy-matching fuzzyjoin iterators parallel-processing r rprogramming text-classification text-preprocessing

World Flora Online (WFO) Fuzzy Join Example

This project uses fuzzy string matching to join biological scientific names against the World Flora Online (WFO) dataset. The code demonstrates several ways to perform fuzzy joins that match and retrieve records across datasets. The aim is to support data integration and comparison by tolerating slight discrepancies or spelling variations in recorded scientific names.

Overview

This repository contains R code that uses a fuzzy join operation to find scientific plant name matches in the abstracts of scientific articles. The scientific names come from the World Flora Online Plant List, a comprehensive and authoritative list of vascular plants.

Installation

To use this code, ensure you have the required R packages installed. The essential packages include:

  • dplyr
  • tidytext
  • tm
  • fuzzyjoin
  • WorldFlora
  • data.table
  • stringr
  • parallel
  • doParallel
  • foreach
  • iterators

The packages can be installed in R using the install.packages("package_name") command. The parallel package ships with base R and does not need to be installed separately.
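As a convenience, the list above can be checked in one pass. The following is a minimal sketch, not part of the repository's script; the variable names (required, is_installed, to_install) are chosen here for illustration only.

```r
# Packages named in the list above (parallel is omitted because it ships with base R).
required <- c("dplyr", "tidytext", "tm", "fuzzyjoin", "WorldFlora",
              "data.table", "stringr", "doParallel", "foreach", "iterators")

# requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise.
is_installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need to be installed.
to_install <- required[!is_installed]

if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install them from CRAN
}
```

Leaving the install.packages() call commented out keeps the check free of side effects; uncomment it once the list of missing packages looks right.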

Usage

Loading Data

Load the wfo_species_example and example_data CSV files. The WFO.download() function retrieves the World Flora Online data and only needs to be run once. The data frames are subset for troubleshooting purposes, and all scientific names are converted to lowercase.
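The loading step might look roughly like the sketch below. The file paths, the column names (scientificName, id, abstract), and the subset size are assumptions made for illustration, and small in-memory data frames stand in for the CSV files so that the snippet runs on its own.

```r
# One-time download of the WFO data (requires the WorldFlora package and network access):
# WorldFlora::WFO.download()

# In practice the data would come from the CSV files, for example:
# wfo_species  <- read.csv("wfo_species_example.csv")  # path is an assumption
# example_data <- read.csv("example_data.csv")         # path is an assumption

# Stand-in data frames with assumed column names.
wfo_species <- data.frame(
  scientificName = c("Quercus robur", "Pinus sylvestris", "Betula pendula"),
  stringsAsFactors = FALSE
)
example_data <- data.frame(
  id = 1:2,
  abstract = c("The oak Quercus robur L. is common in 12 European forests.",
               "We sampled Pinus sylvestris and Betula pendula across 3 sites."),
  stringsAsFactors = FALSE
)

# Work on a small subset while troubleshooting, then lowercase the names
# so that matching is not affected by capitalization.
wfo_subset <- head(wfo_species, 2)
wfo_subset$scientificName <- tolower(wfo_subset$scientificName)
```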

Cleaning Abstracts

Text preprocessing is performed before the matching operations. It helps the matching process by reducing the size of the data set, which means fewer comparisons, and by removing text that is unlikely to contain words of interest. Specifically, we use the tm package to remove punctuation and numbers and to convert all text to lowercase. We then remove stop words (e.g., "is", "are", "the"). Lastly, the abstracts are tokenized into n-grams (i.e., chunks of n consecutive words) of a length specified by the user (e.g., 4-word chunks) using the unnest_tokens function from the tidytext package.
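The preprocessing steps described above can be sketched as follows. This is an illustration under stated assumptions rather than the repository's exact code: the sample abstracts, the column names, and the clean_text helper are hypothetical, while removePunctuation, removeNumbers, removeWords, stopwords, and stripWhitespace come from tm and unnest_tokens comes from tidytext.

```r
library(tm)
library(tidytext)

# Hypothetical sample abstracts; column names are assumptions.
abstracts <- data.frame(
  id = 1:2,
  abstract = c("The oak Quercus robur L. is common in 12 European forests.",
               "We sampled Pinus sylvestris and Betula pendula across 3 sites."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: strip punctuation and numbers, lowercase,
# drop English stop words, and collapse the leftover whitespace.
clean_text <- function(x) {
  x <- removePunctuation(x)
  x <- removeNumbers(x)
  x <- tolower(x)
  x <- removeWords(x, stopwords("en"))
  trimws(stripWhitespace(x))
}

abstracts$clean <- clean_text(abstracts$abstract)

# Tokenize the cleaned text into overlapping 4-word chunks (n is user-chosen).
ngrams <- unnest_tokens(abstracts, output = ngram, input = clean,
                        token = "ngrams", n = 4)
```

Each row of ngrams holds one 4-word chunk together with the id of the abstract it came from, which is the shape the fuzzy join steps expect as input.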

Fuzzy Join Options

Three approaches to fuzzy joins are demonstrated, with a chunked variant of the second:

Option 1: Fuzzy join using WFO.match.fuzzyjoin without parallel processing

  • Tokenized n-grams are processed for fuzzy joins with the WFO dataset.

Option 2: Fuzzy join using WFO.match.fuzzyjoin with parallel processing

  • Utilizes parallel processing to perform fuzzy joins with the WFO dataset.

Option 2.1: Fuzzy join using WFO.match.fuzzyjoin with parallel processing and chunked data frames

  • Splits the data frame into chunks for parallel processing to enhance performance.
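The chunk-and-parallelize pattern behind Options 2 and 2.1 can be sketched as below. To keep the example self-contained, a hypothetical match_chunk function based on base R's adist (Levenshtein distance) stands in for the WFO.match.fuzzyjoin call; the data, the number of chunks, the worker count, and the distance threshold are likewise assumptions.

```r
library(foreach)
library(doParallel)

# Hypothetical n-grams and reference names.
ngrams <- data.frame(
  ngram = c("quercus robar", "common european",
            "pinus sylvestris", "betula pendula"),
  stringsAsFactors = FALSE
)
reference <- c("quercus robur", "pinus sylvestris", "betula pendula")

# Hypothetical stand-in for the real matching call: keep the closest
# reference name when it is within max_dist edits, otherwise NA.
match_chunk <- function(chunk, reference, max_dist = 1) {
  d <- utils::adist(chunk$ngram, reference)      # edit-distance matrix
  best <- apply(d, 1, which.min)                 # closest reference per row
  best_dist <- d[cbind(seq_len(nrow(d)), best)]
  chunk$match <- ifelse(best_dist <= max_dist, reference[best], NA_character_)
  chunk$dist <- best_dist
  chunk
}

# Split the data frame into roughly equal chunks.
n_chunks <- 2
chunk_id <- cut(seq_len(nrow(ngrams)), breaks = n_chunks, labels = FALSE)
chunks <- split(ngrams, chunk_id)

# Start a small cluster, process one chunk per task, and bind the results.
cl <- parallel::makeCluster(2)
registerDoParallel(cl)
results <- foreach(chunk = chunks, .combine = rbind) %dopar% {
  match_chunk(chunk, reference)
}
parallel::stopCluster(cl)
```

Chunking trades per-task overhead against load balance: a few large chunks keep communication costs low, while many small chunks spread uneven workloads more evenly across workers.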

Option 3: Fuzzy join using fuzzyjoin

  • Performs fuzzy join based on approximate string matching using the stringdist_left_join function.
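A minimal sketch of Option 3 is shown below. The data frames, column names, distance method, and threshold are assumptions chosen for illustration; stringdist_left_join and its by, method, max_dist, and distance_col arguments come from the fuzzyjoin package.

```r
library(fuzzyjoin)

# Hypothetical n-grams taken from cleaned abstracts.
ngrams <- data.frame(
  ngram = c("quercus robar", "pinus sylvestris", "common european"),
  stringsAsFactors = FALSE
)

# Hypothetical lowercase reference names.
wfo <- data.frame(
  scientificName = c("quercus robur", "pinus sylvestris", "betula pendula"),
  stringsAsFactors = FALSE
)

# Left join: keep every n-gram and attach a reference name when its
# Levenshtein distance to the n-gram is at most max_dist.
matches <- stringdist_left_join(
  ngrams, wfo,
  by = c("ngram" = "scientificName"),
  method = "lv",
  max_dist = 1,
  distance_col = "dist"
)
```

Because this is a left join, n-grams with no reference name within the threshold stay in the result with NA in the scientificName and dist columns, so unmatched text can be inspected or filtered afterwards.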

Each section provides code snippets and detailed explanations on how the fuzzy joins are executed and the rationale behind each method.

Contributing

Contributions to this project are welcome. If you'd like to contribute, please follow these steps:

Fork the repository.

Create a new branch for your feature.

Make your changes and submit a pull request.
