tonanhngo / discover-archetype

This project forked from ibm/discover-archetype

Discover archetypes in your text corpus using Watson Natural Language Understanding.

License: Apache License 2.0

Python 67.87% Jupyter Notebook 32.13%

discover-archetype's Introduction

Discover the archetypes in your system of records

Systems of record are ubiquitous in the world around us: music playlists, job listings, medical records, customer service calls, GitHub issues, and so on. An archetype is formally defined as a pattern or model from which all things of the same type are copied. More informally, we can think of archetypes as categories, classes, or topics.

When we read through a set of these records, our mind naturally groups them into a collection of archetypes. For example, we may sort a song collection into easy listening, classical, rock, and so on. This manual process is practical for a small number of records, for example a few dozen. Large systems can hold millions of records, so we need an automated way to process them. In addition, without prior knowledge of the records, we may not know beforehand which archetypes exist in them, so we also need a way to discover meaningful archetypes that can be adopted. Since records are often unstructured text, such automated processing must be able to understand natural language. Watson Natural Language Understanding, coupled with statistical techniques, can help you to (1) discover meaningful archetypes in your records and then (2) classify new records against this set of archetypes.

In this example, we will use a medical dictation data set to illustrate the process. The data is provided by ezDI and includes 249 actual medical dictations that have been anonymized.

When the reader has completed this code pattern, they will understand how to:

  • Work with the Watson Natural Language Understanding service (NLU) through API calls.
  • Work with the IBM Cloud Object Storage service (COS) through the SDK to hold data and results.
  • Perform statistical analysis on the results from Watson Natural Language Understanding.
  • Explore the archetypes through graphical interpretation of the data in a Jupyter Notebook or a web interface.

Flow

  1. The user downloads the custom medical dictation data set from ezDI and prepares the text data for processing.
  2. The user interacts with the Watson Natural Language Understanding service via the provided application UI or the Jupyter Notebook.
  3. The user runs a series of statistical analyses on the results from Watson Natural Language Understanding.
  4. The user uses the graphical display to explore the archetypes that the analysis discovers.
  5. The user classifies a new dictation by providing it as input and seeing which archetype it is mapped to.
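
The README does not say which statistical technique the notebook applies, so the following is only a minimal sketch of steps 3 and 5 under stated assumptions: each dictation is represented as a vector of NLU relevance scores, archetypes are found with non-negative matrix factorization (NMF, using the standard multiplicative updates), and a new dictation is assigned to the archetype with the highest cosine similarity. The feature names and scores are made-up toy data, not values from the ezDI set, and the function names are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def nmf(x, k, iterations=200, seed=0, eps=1e-9):
    """Factor x (documents x features) into non-negative w @ h with k archetypes."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # Multiplicative update for h: h * (w^T x) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # Multiplicative update for w: w * (x h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h


def frobenius_error(x, w, h):
    """Frobenius norm of x - w @ h, i.e. how well the archetypes explain x."""
    approx = matmul(w, h)
    return sum((a - b) ** 2
               for rx, ra in zip(x, approx) for a, b in zip(rx, ra)) ** 0.5


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0


def classify(vector, archetypes):
    """Return the index of the archetype (row of h) most similar to vector."""
    scores = [cosine(vector, a) for a in archetypes]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy data (assumption): rows are dictations, columns are keyword relevances.
FEATURES = ["chest pain", "ecg", "fracture", "x-ray"]
X = [
    [0.9, 0.8, 0.0, 0.0],
    [0.7, 0.9, 0.0, 0.1],
    [0.0, 0.1, 0.9, 0.8],
    [0.0, 0.0, 0.8, 0.9],
]

W, H = nmf(X, k=2)            # W: dictation-to-archetype, H: archetype-to-feature
new_dictation = [0.8, 0.7, 0.0, 0.0]
print("archetype index for new dictation:", classify(new_dictation, H))
```

Each row of H can be read as one archetype (a weighted set of features), and each row of W says how strongly a dictation expresses each archetype. The number of archetypes k is a choice the analyst makes; other clustering or topic-modeling methods could be substituted.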

Steps

  1. Clone the repo
  2. Create IBM Cloud services
  3. Download and prepare the data
  4. Run the Jupyter notebook
  5. Run the Web UI

1. Clone the repo

git clone https://github.com/IBM/discover-archetype

2. Create IBM Cloud services

You will use three IBM Cloud services.

a. Watson Natural Language Understanding

On your IBM Cloud dashboard, navigate to Watson -> Watson Services -> Browse Services -> Natural Language Understanding. Select the Lite plan and click Create. When the service becomes available, copy the endpoint and credentials for use later.
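
As a rough illustration of how the endpoint and credentials are used, the sketch below calls NLU through the ibm-watson Python SDK and flattens one feature of the response into a dictionary. It is not taken from the notebook: the function names, the choice of features (keywords, entities, concepts), the limit, and the version date are assumptions, and the sample result is hand-written to show the response shape only.

```python
def analyze_text(text, api_key, service_url, limit=20):
    """Call NLU and return the raw JSON result (requires `pip install ibm-watson`)."""
    # Imported lazily so the helper below works without the SDK installed.
    from ibm_watson import NaturalLanguageUnderstandingV1
    from ibm_watson.natural_language_understanding_v1 import (
        Features, KeywordsOptions, EntitiesOptions, ConceptsOptions)
    from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

    nlu = NaturalLanguageUnderstandingV1(
        version="2022-04-07",  # assumed API version date
        authenticator=IAMAuthenticator(api_key))
    nlu.set_service_url(service_url)
    features = Features(
        keywords=KeywordsOptions(limit=limit),
        entities=EntitiesOptions(limit=limit),
        concepts=ConceptsOptions(limit=limit))
    return nlu.analyze(text=text, features=features).get_result()


def relevance_by_text(result, feature="keywords"):
    """Flatten one feature of an NLU result into {text: relevance}."""
    return {item["text"]: item["relevance"] for item in result.get(feature, [])}


# Hand-written sample in the shape of an NLU response (not real output).
sample_result = {
    "keywords": [
        {"text": "chest pain", "relevance": 0.93, "count": 2},
        {"text": "ecg", "relevance": 0.71, "count": 1},
    ]
}
print(relevance_by_text(sample_result))
```

Dictionaries like this, one per dictation, are a convenient starting point for building the document-by-feature matrix used in the statistical analysis.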

b. IBM Cloud Object Storage

On your IBM Cloud dashboard, navigate to Classic Infrastructure -> Storage -> Object Storage. Select the Lite plan and click Create. When the service becomes available, click on Create bucket and create two buckets: one for the medical dictations and one for the NLU results. Copy the bucket instance CRN, endpoints, and credentials for use later.

c. Watson Studio

On your IBM Cloud dashboard, navigate to Watson -> Watson Services -> Browse Services -> Watson Studio. Select the Lite plan and click Create. Click on New project and create an empty project. Navigate into your new empty project and click on New notebook. Select From file and upload the Jupyter notebook from the local git repo:

discover-archetype/notebook/WATSON_Document_Archetypes_Analysis_Showcase.ipynb

3. Download and prepare the data

Go to the ezDI web site and download both the medical dictation text files. The downloaded files will be contained in zip files.

Create a Documents subdirectory and then extract the downloaded zip files into their respective locations.

The dictation files stored in the Documents directory are in RTF format and need to be converted to plain text. Use the following bash script to convert them all to txt files.

Note: Run the following script with Python 2.7.

pip install pyth
for name in Documents/*.rtf
do
  python python/convert_rtf.py "$name"
done

Upload the dictation files in text format to the IBM Cloud Object Storage bucket that you created for the dictations.
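
The upload can be done in the IBM Cloud console, or scripted. The following is a minimal sketch using the ibm-cos-sdk Python package (ibm_boto3); it is not part of the repo, and the function names and the choice to use each file name as the object key are assumptions. The bucket name, API key, instance CRN, and endpoint are the values copied in step 2.

```python
from pathlib import Path


def object_key(path):
    """Use the bare file name as the object key (an assumption, not a requirement)."""
    return Path(path).name


def text_files(directory):
    """Return the .txt files in a directory in a stable, sorted order."""
    return sorted(Path(directory).glob("*.txt"))


def upload_directory(directory, bucket, api_key, instance_crn, endpoint_url):
    """Upload every .txt file in directory to the bucket (requires `pip install ibm-cos-sdk`)."""
    # Imported lazily so the helpers above work without the SDK installed.
    import ibm_boto3
    from ibm_botocore.client import Config

    cos = ibm_boto3.client(
        "s3",
        ibm_api_key_id=api_key,
        ibm_service_instance_id=instance_crn,
        config=Config(signature_version="oauth"),
        endpoint_url=endpoint_url)
    for path in text_files(directory):
        cos.upload_file(Filename=str(path), Bucket=bucket, Key=object_key(path))
```

A call would look like upload_directory("Documents", "<your dictation bucket>", "<api key>", "<instance CRN>", "<endpoint URL>"), with the placeholders replaced by your own values.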

4. Run the Jupyter notebook

a. Configure credentials and endpoints

The second cell of the notebook contains a number of parameters that must be filled in with the credentials, endpoints, and resource ID for the three IBM Cloud services. Use the values obtained in step 2. Then execute each cell in the notebook.
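
If you would rather not paste secrets directly into a notebook cell, one option is to read them from environment variables and fail early when any are missing. This is only a sketch: the variable names below are hypothetical and do not correspond to the parameter names used in the notebook.

```python
import os

# Hypothetical setting names; map them to the notebook's own parameters.
REQUIRED = (
    "NLU_APIKEY", "NLU_URL",
    "COS_APIKEY", "COS_INSTANCE_CRN", "COS_ENDPOINT",
    "COS_DICTATION_BUCKET", "COS_RESULT_BUCKET",
)


def load_settings(env=None):
    """Return the required settings, raising KeyError if any are missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError("missing settings: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED}
```

Calling load_settings() at the top of the notebook surfaces a missing credential immediately instead of partway through the analysis.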

5. Run the Web UI

Follow the instructions in the README.

discover-archetype's People

Contributors

dnordfors, pvaneck, stevemar, tonanhngo
