
This project forked from brightonkahrs/project-gutenberg-analysis


This project explores how Microsoft Fabric and Azure OpenAI can analyze a document repository of text-based data. It serves as a submission for the Global Microsoft Fabric AI Hackathon.


Introduction

Objective

This project explores how Microsoft Fabric and Azure OpenAI can analyze a document repository of text-based data. Microsoft Fabric offers a suite of data analytics tooling. It also offers OneLake, an automatically provisioned enterprise data lake that can store any type of file, including unstructured text data. The intelligence of Azure OpenAI, combined with strategic prompting, extracts valuable information from text with remarkable efficiency. By combining Microsoft Fabric and Azure OpenAI, you will be able to analyze your text-based data like never before, particularly with Spark notebooks and Power BI! Our document store for this walkthrough will be public domain eBooks.

This project explores the following capabilities of Azure OpenAI using Microsoft Fabric:

  1. Entity Extraction
  2. Text Summarization
  3. Text Classification
  4. Text Embeddings and Semantic Similarity

Here is a link to the video submission. This project should take about 1 hour to complete; follow the project steps below to get started.

Note: This project leverages data from Project Gutenberg, the first free provider of public domain eBooks. Please consider donating at https://www.gutenberg.org/donate/

Real World Value

Many organizations have a treasure trove of text and PDF documents. These document stores can be massive, though, and searching through them manually would be far too time-intensive. The tools used in this project make understanding your data, and how it connects, extremely efficient. Entity extraction eliminates the need to manually sift through documents for metadata, text summarization lets you understand a document much more quickly, and text classification groups your data into meaningful categories. For example, a lawyer could use such tooling to find previous court cases similar to the one they are working on, without having to manually search through countless documents!

Prerequisites

  1. An Azure subscription
  2. Contributor access to a Microsoft Fabric workspace
  3. Access to a Microsoft Fabric F64 capacity or higher (the Fabric trial FT1 SKU will not work for direct Azure OpenAI integration). This capacity should be connected to your workspace.

Note: An alternative approach is to use a Microsoft Fabric capacity of any size together with a provisioned Azure OpenAI service.

Architecture

Project Architecture

Steps

Project Steps

  1. Build Lakehouse

    Go to (or create) the Fabric workspace that will be used for this project, select the 'Data Science' or 'Data Engineering' experience, and create your Lakehouse. Feel free to name it whatever you like!

    Build Lakehouse

    Then go back to the workspace. The result should look something like this:

    Step 1 End Result

  2. Ingest eBook Data onto OneLake

    Download or clone this repo to access the Jupyter notebooks in the scripts folder, then import the 01_data_ingestion_and_prep notebook into your Fabric workspace.

    Import Notebook

    If you do not see the Import notebook option, make sure you are in the Data Science or Data Engineering experience.

    Open the notebook, then click Add Lakehouse. Select 'Existing Lakehouse' and choose the Lakehouse you just created. This makes your Lakehouse the default for this notebook.

    Set Default Lakehouse

    Run each cell in the notebook and follow along with the markdown. Notice how quickly the Spark pool starts! You are given options to change some parameters, but the recommended values are already set. This notebook will create the necessary folders, ingest the data from Project Gutenberg, and then prepare the data for use with Azure OpenAI using Semantic Kernel, more specifically its text chunker.
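    The chunking step can be pictured with a minimal standalone sketch. The notebook itself uses Semantic Kernel's text chunker; this simplified version (function name and parameters are illustrative, not the notebook's) just shows the idea of splitting a long text into fixed-size, overlapping word windows so each chunk fits a model's token window.

```python
# Minimal sketch of the chunking idea: fixed-size word windows with overlap.
# The real notebook uses Semantic Kernel's text chunker instead.

def chunk_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into chunks of roughly chunk_size words, overlapping by `overlap`."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(words):
            break
    return chunks

book = ("word " * 250).strip()  # stand-in for an eBook's plain text
chunks = chunk_text(book, chunk_size=100, overlap=20)
print(len(chunks))  # 3 overlapping windows over 250 words
```

    Overlap is a common choice here so that a sentence split at a chunk boundary still appears whole in at least one chunk.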

    After running the script, go back to the workspace and open your Lakehouse. It should look like the following (if it doesn't, try hitting refresh in the top left):

    Step 2 End Result

    You can explore the data using the Lakehouse explorer

    Lakehouse Explorer

  3. Enrich eBook Data using Azure OpenAI

    Import the 02_enrich_data_with_AzureOpenAI notebook using the same process as before (including setting the default Lakehouse)

    This notebook accesses Azure OpenAI directly from within Microsoft Fabric. When using an F64 SKU or higher, notice that you do not need an API key or a provisioned service in Azure! The use of Azure OpenAI is charged against the capacity units of your F64 capacity. Amazing! With this lightweight, yet extremely powerful, use of Azure OpenAI we will perform the following:

    1. Entity Extraction
    2. Text Summarization
    3. Text Classification
    4. Generate Embeddings

    Run each cell in the notebook, examining how each function uses Azure OpenAI prompting.
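    The general prompting pattern those functions follow can be sketched as below. This is a hypothetical illustration, not the notebook's actual prompts: a system message instructs the model what to extract, and the book text is passed as the user message. The entity names in the system prompt are assumptions for the sketch.

```python
# Hypothetical sketch of the entity-extraction prompting pattern
# (the exact prompts live in 02_enrich_data_with_AzureOpenAI).

def build_entity_extraction_messages(text: str) -> list[dict]:
    """Build a chat-completion message list asking the model for book metadata."""
    system = (
        "Extract the following entities from the eBook text and return them "
        "as JSON: title, author, publication_year, main_characters."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": text},
    ]

messages = build_entity_extraction_messages("The Project Gutenberg eBook of ...")
print(messages[0]["role"])  # system
```

    In Fabric on an F64+ capacity these messages would go to the built-in Azure OpenAI endpoint; on smaller SKUs they would go to your own provisioned Azure OpenAI resource.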

    The entire text of a book would be too large to fit into the token window of these models. That is why we use the chunks we created earlier, together with a text reduction technique: we summarize each of the smaller chunks, then summarize all of those summaries to produce a summary of the entire document.
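    The map-reduce shape of that reduction can be sketched in a few lines. Here `summarize` is a stand-in stub for the notebook's Azure OpenAI call (a real implementation would prompt the model), so the structure, not the summaries, is the point.

```python
# Sketch of map-reduce summarization: summarize each chunk, then summarize
# the concatenated chunk summaries.

def summarize(text: str, max_words: int = 20) -> str:
    # Placeholder: a real implementation would prompt Azure OpenAI here.
    return " ".join(text.split()[:max_words])

def summarize_document(chunks: list[str]) -> str:
    chunk_summaries = [summarize(c) for c in chunks]           # map step
    return summarize(" ".join(chunk_summaries), max_words=50)  # reduce step

doc_summary = summarize_document(["chunk one text ...", "chunk two text ..."])
```

    For very long books this pattern can be applied recursively: if the joined chunk summaries are still too long for one call, summarize them in batches first.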

    We then use that summary as the input for our classification prompt, using the predefined categories in the script.
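    Classification-by-prompting can be sketched as constraining the model to a fixed label set. The category names below are illustrative only; the script defines its own.

```python
# Sketch of classification via prompting: the model is asked to pick exactly
# one label from a predefined list. Categories here are illustrative.

CATEGORIES = ["fiction", "poetry", "history", "science", "philosophy"]

def build_classification_prompt(summary: str) -> str:
    return (
        f"Classify the following summary into exactly one of these "
        f"categories: {', '.join(CATEGORIES)}.\n"
        f"Respond with the category name only.\n\n"
        f"Summary: {summary}"
    )

prompt = build_classification_prompt("A study of planetary motion ...")
```

    Asking for "the category name only" keeps the response easy to parse and store in a table column.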

    We use a similar reduction technique for embeddings as for summarization. Embeddings are numeric representations of the semantic meaning of text. Here we get the embedding of each chunk, then take the average of all the embeddings; put simply, we are aiming for the average meaning of the entire text. There are more advanced techniques that weight chunks differently, but here every chunk gets an equal weight.
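    The evenly weighted averaging, and the cosine similarity used later to compare books, can be sketched with NumPy. Real ada-002 vectors have 1536 dimensions; 4 are used here for readability, and the vectors are made up for the example.

```python
import numpy as np

# Sketch of the evenly weighted document embedding: average the per-chunk
# embedding vectors, then compare documents with cosine similarity.

def document_embedding(chunk_embeddings: np.ndarray) -> np.ndarray:
    """Average the chunk embeddings (equal weight per chunk)."""
    return chunk_embeddings.mean(axis=0)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

book_a = document_embedding(np.array([[1.0, 0.0, 0.0, 0.0],
                                      [0.0, 1.0, 0.0, 0.0]]))
book_b = document_embedding(np.array([[1.0, 1.0, 0.0, 0.0]]))
print(round(cosine_similarity(book_a, book_b), 3))  # 1.0
```

    Note that cosine similarity ignores vector length, which is why averaging (which shrinks the vector) still compares meaningfully against single-chunk documents.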

    All enriched data is saved back to JSON for future use. The data is also saved as a Lakehouse table so it can be analyzed with notebooks, SQL, and Power BI! Your Lakehouse should now look something like this:

    Step 3 End Result

  4. Analyze Enriched Data using Notebooks and Power BI

    Import the 03_TSNE_data_analysis notebook using the same process as step 2 (including setting the default Lakehouse)

    Run the notebook. It uses the embeddings we generated to measure how semantically similar the books are, based on their cosine similarity. OpenAI ada-002 embeddings have 1,536 dimensions, which is far too many for humans to visualize, so t-SNE gives us a good, human-friendly estimate of how similar these embeddings are. Here is an example from my most recent run:
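    The reduction from 1,536 dimensions to a plottable 2-D scatter can be sketched with scikit-learn's t-SNE, assuming scikit-learn is available on the Spark pool. The data below is random, for illustration only; the notebook uses the real book embeddings.

```python
import numpy as np
from sklearn.manifold import TSNE

# Sketch of the t-SNE step: project high-dimensional embeddings (1536-d for
# ada-002) down to 2-D so they can be plotted. Random data for illustration.

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(20, 1536))  # 20 "books"

xy = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(embeddings)
print(xy.shape)  # (20, 2): one (x, y) point per book
```

    The two output columns are exactly the x_axis and y_axis values the notebook writes back to the books table for Power BI.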

    TSNE Visual

    Already we can see some clusters forming among these books! The notebook saves the x and y coordinates from this visual and updates the books table with them, so we can take the analysis further in Power BI, which is our next step!

    Go to the Lakehouse, then open the SQL analytics endpoint.

    SQL Analytics Endpoint

    Here you can do a variety of analytics. You can write SQL queries, generate visual queries, manage the default semantic data model, and create a new report. The report we will create comes from the default semantic data model of the Lakehouse. It leverages Direct Lake mode, which means there are no extra steps needed before using Power BI on top of the data lake!

    Click new report

    Create New Report

    This will open a report connected to the books data we just created.

    New Power BI Report

    Here is the report I created, but feel free to get creative and make your own! To show how many books were assigned to each category, I made a clustered column chart with 'category' on the x-axis and 'Count of book_id' on the y-axis (clicking the arrow next to book_id in the y-axis lets you change the aggregation metric). On the right, I made a scatter chart with 'book_id' in the values, 'Sum of x_axis' on the x-axis, 'Sum of y_axis' on the y-axis, and 'category' in the legend (I also added zoom sliders from the formatting pane). On the bottom, I provided the book details in a table visual.

    Finished Power BI Report

    Click 'File' in the top left and save the Power BI report to your workspace, naming it whatever you like. Now it's time to start using it! Power BI is very interactive: clicking on any visual cross-filters the others.

    Cross Filter

    In the scatter chart, you can box-highlight any of the groupings to see which texts OpenAI considered similar.

    Scatter Chart Box Highlight

    In a single Power BI report we can see all of the work we did with Azure OpenAI. Entity extraction gives us valuable metadata such as book title, author, and more. Text summarization shows us the summary in the book details. Text classification is shown in our category bar chart. Semantic similarity is shown through the t-SNE visualization. Through this we have turned unstructured text data into meaningful insights!

    This concludes the project; thank you for your time. I hope you are now as excited about Microsoft Fabric and Azure OpenAI as I am!

    developer: Brighton Kahrs | [email protected]

