- ⚡ Fun fact: I use Arch btw
- I converted my notes from my academic days into interactive notebooks for future undergrads and people who're curious to learn, under project Wellspring
nlp-issue-labeller's Issues
Dockerize our application
Check for issues with multiple tags in dataset
If we find issues that belong to 2 different tags, do we keep both or drop both?
I'd vote for dropping both, since it's quite a pain to account for them down the line.
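A minimal sketch of the drop-both option, assuming each scraped issue is a dict with a `labels` list. The `LABEL_TO_CLASS` lookup, the `resolve_class` helper and the sample issues are hypothetical and only there for illustration; the real scraped data may be shaped differently.

```python
from __future__ import annotations

# Hypothetical lookup from raw labels to standard classes (a few entries only).
LABEL_TO_CLASS = {
    "type:feature": "feature",
    "type:bug": "bug",
    "type:docs-bug": "doc",
}


def resolve_class(labels: list[str]) -> str | None:
    """Return the single class an issue maps to, or None for zero or several classes."""
    classes = {LABEL_TO_CLASS[label] for label in labels if label in LABEL_TO_CLASS}
    return next(iter(classes)) if len(classes) == 1 else None


# Made-up issues: the second one carries two classes and gets dropped.
issues = [
    {"id": 1, "labels": ["type:bug"]},
    {"id": 2, "labels": ["type:bug", "type:feature"]},
    {"id": 3, "labels": ["type:docs-bug", "some-unmapped-label"]},
]
kept = [issue for issue in issues if resolve_class(issue["labels"]) is not None]
```

Labels outside the lookup are ignored here, so an issue with one mapped label plus unrelated labels is still kept.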
Scraped data analysis
Analysis summary (using jupyter notebook)
Overall
-------
Total number of tensorflow issues:19207
Total number of rust issues:18022
Total number of kubernetes issues:18955
Tensorflow issue analysis
-------------------------
Number of feature issues:2461
Number of bug issues:15098
Number of doc issues:1648
Rust issue analysis
-------------------
Number of feature issues:10163
Number of bug issues:7049
Number of doc issues:766
Kubernetes issue analysis
-------------------------
Number of feature issues:4302
Number of bug issues:11695
Number of doc issues:968
Total issue analysis
-------------------------
Number of feature issues:16926
Number of bug issues:33842
Number of doc issues:3382
German repo issue analysis
--------------------------
Total number of corona widget issues:192
Total number of openWB issues:145
French repo issue analysis
--------------------------
Total number of DVF-app issues:104
Total number of Grafikart issues:313
Total number of azure docs issues:247
Total number of bcdlibre issues:36
Readme line 27 needs to be more specific
Dockerize application
Sequence classifier returns empty array
Just testing stuff
Function `has_code_block` does not return expected output
this is a nonsense issue
i don't know what this issue is
Class standardisation policy
We will map the labels of all scraped issues onto our standard set of classes, namely {feature, bug, doc}.
Tensorflow:
Class | Labels |
---|---|
feature | type:feature |
bug | type:bug, type:build/install, type:performance, type:support |
doc | type:docs-feature, type:docs-bug |
Rust:
Class | Labels |
---|---|
feature | C-feature-request, C-feature-accepted, C-enhancement |
bug | C-bug |
doc | T-doc |
Kubernetes:
Class | Labels |
---|---|
feature | kind/feature, kind/api-change |
bug | kind/bug, kind/failing-test |
doc | kind/documentation |
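The three tables above could be kept as one mapping and inverted per repository. This is only a sketch: the names `STANDARD_CLASSES` and `build_lookup` are assumptions, not something from the codebase, while the label strings are copied from the tables.

```python
from __future__ import annotations

# Class -> labels, per repository, copied from the tables above.
STANDARD_CLASSES = {
    "tensorflow": {
        "feature": ["type:feature"],
        "bug": ["type:bug", "type:build/install", "type:performance", "type:support"],
        "doc": ["type:docs-feature", "type:docs-bug"],
    },
    "rust": {
        "feature": ["C-feature-request", "C-feature-accepted", "C-enhancement"],
        "bug": ["C-bug"],
        "doc": ["T-doc"],
    },
    "kubernetes": {
        "feature": ["kind/feature", "kind/api-change"],
        "bug": ["kind/bug", "kind/failing-test"],
        "doc": ["kind/documentation"],
    },
}


def build_lookup(repo: str) -> dict[str, str]:
    """Invert class -> labels into label -> class for one repository."""
    return {
        label: cls
        for cls, labels in STANDARD_CLASSES[repo].items()
        for label in labels
    }
```

For example, `build_lookup("rust")["T-doc"]` gives `"doc"`.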
Refactor remove_url to replace_url
Update Report
Implement Autolabeller bot
what the heck is this issue?
Finish report
Documentation for codebase
Tokenisation
Ideas for tokenising both the issue title and the description text:
text | regex |
---|---|
title | whitespace |
description | whitespace, `\r`, `\n`, `\t` |
For description text, we might also want to skip code blocks and reference links. They can be vectorised separately for information extraction.
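A rough sketch of the idea, assuming plain regex splitting is enough for a first pass. The function names and the exact patterns are assumptions; in particular, the URL pattern is a simple approximation and will also swallow trailing punctuation.

```python
import re

FENCE = "`" * 3  # built at runtime so this example contains no literal fence

CODE_BLOCK = re.compile(FENCE + r".*?" + FENCE, re.DOTALL)
URL = re.compile(r"https?://\S+")


def tokenise_title(title):
    """Split a title on runs of spaces."""
    return [token for token in re.split(r" +", title) if token]


def tokenise_description(description):
    """Drop code blocks and links, then split on spaces, \\r, \\n and \\t."""
    stripped = CODE_BLOCK.sub(" ", description)
    stripped = URL.sub(" ", stripped)
    return [token for token in re.split(r"[ \r\n\t]+", stripped) if token]
```

The removed code blocks and links could be collected instead of discarded if they are to be vectorised separately.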
Consultation slides & Poster
Consultation Slides:
https://docs.google.com/presentation/d/1rleH5cI1pHe5NWnodWVVc_4bLxFnH8FQ9KvQdbApgAw/edit?usp=sharing (updated)
Poster: (preview only)
https://create.piktochart.com/output/53480512-nlp-poster
Update dependencies for library pip 4.2.2
We urgently need to upgrade, else our application will be broken.
Deadline: Next Friday!!
Update on pipeline and feature issues
Tasks:
- HTTP links (and maybe image links as well) are not yet replaced in “text”, which might affect the word embedding.
- Indicative word counts (bug, feature, doc) can be added to the feature vector.
- Do some selection on the feature vector and remove irrelevant features that adversely impact accuracy.
- `dataframe_generator` might want to incorporate `xxx_links_extracted.json`.
- Update the `code_count`, `link_count` and `image_count` methods in `handcrafted_feature_extractor.py` (links and code blocks can be taken directly from the JSON data; consider removing the image feature).
Model-wise, `SGDClassifier` could be tried as well.
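For the first task, one option is to swap links for placeholder tokens rather than deleting them, so the embedding still sees that a link was there. A minimal sketch, assuming the signature below: the placeholder tokens `<URL>` and `<IMAGE>` are made up here, and this is not the project's actual `replace_url`.

```python
import re

# Markdown image syntax: ![alt](target)
IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\([^)]*\)")
# Bare HTTP(S) links; a simple approximation that stops at whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


def replace_url(text, url_token="<URL>", image_token="<IMAGE>"):
    """Replace markdown images first, then any remaining HTTP(S) links."""
    text = IMAGE_PATTERN.sub(image_token, text)
    return URL_PATTERN.sub(url_token, text)
```

Images are handled first because their targets are usually HTTP links too and would otherwise be half-replaced by the URL pattern.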
Feature Engineering
Language-dependent:
- count the number of indicative word occurrences
  - feature: `feature`, `features`
  - doc: `doc`, `docs`, `documentation`, `documentations`
  - bug: `bug`, `bugs`, `fix`, `fixes`
- text-based features:
  - title length
  - description length
  - distinct word count
  - uppercase word count
  - stopword count
- use word embeddings to detect word similarity
  - compare Gensim word2vec models (Gensim 4.0 and later name the `size` parameter `vector_size`)
    - CBOW: `gensim.models.Word2Vec(data, min_count = 1, size = 100, window = 5)`
    - Skip-gram: `gensim.models.Word2Vec(data, min_count = 1, size = 100, window = 5, sg = 1)`
  - basic implementation: https://towardsdatascience.com/a-beginners-guide-to-word-embedding-with-gensim-word2vec-model-5970fa56cc92
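The indicative-word and text-based features listed above might be computed roughly like this. It is a sketch under assumptions: lengths are measured in characters, the stopword list is a tiny hypothetical stand-in for a real one, and the function and feature names are invented for illustration.

```python
import re
from collections import Counter

# Indicative words per class, taken from the list above.
INDICATIVE_WORDS = {
    "feature": {"feature", "features"},
    "doc": {"doc", "docs", "documentation", "documentations"},
    "bug": {"bug", "bugs", "fix", "fixes"},
}
# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"the", "a", "an", "is", "in", "to", "of", "and"}


def text_features(title, description):
    """Compute indicative word counts and simple text statistics for one issue."""
    words = re.findall(r"[A-Za-z']+", f"{title} {description}")
    counts = Counter(word.lower() for word in words)
    features = {
        f"{cls}_word_count": sum(counts[word] for word in vocab)
        for cls, vocab in INDICATIVE_WORDS.items()
    }
    features.update(
        title_length=len(title),
        description_length=len(description),
        distinct_word_count=len(counts),
        uppercase_word_count=sum(1 for w in words if len(w) > 1 and w.isupper()),
        stopword_count=sum(counts[word] for word in STOPWORDS),
    )
    return features
```

Matching is case-insensitive for the indicative words, so "Fix" in a title counts towards the bug class.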
Language-independent: (markdown)
- count of occurrences of common markdown elements
  - highlight: bold, italics, section name
  - link: image, reference
  - code: code line (between ``), code block (between ``` ```)
  - special: list, task
  - possible APIs:
- element length (e.g. bold word count, code block word count, list item count)
  - implementation: rely on parser output from above
- further information extraction on meaningful entities (e.g. reference URL, code block content)
  - implementation: TODO
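Until a proper markdown parser is picked, the element counts could be approximated with regular expressions. This is only a sketch under that assumption: the patterns below cover common cases, not the full markdown grammar, and all names are hypothetical.

```python
import re

FENCE = "`" * 3  # built at runtime so this example contains no literal fence

CODE_BLOCK = re.compile(FENCE + r".*?" + FENCE, re.DOTALL)
ELEMENT_PATTERNS = {
    "bold": re.compile(r"\*\*[^*\n]+\*\*"),
    "italics": re.compile(r"(?<!\*)\*(?![\s*])[^*\n]+(?<!\s)\*(?!\*)"),
    "section": re.compile(r"^#{1,6} .+$", re.MULTILINE),
    "image": re.compile(r"!\[[^\]]*\]\([^)]*\)"),
    "reference": re.compile(r"(?<!!)\[[^\]]*\]\([^)]*\)"),
    "code_line": re.compile(r"`[^`\n]+`"),
    "task": re.compile(r"^[ \t]*[-*] \[[ xX]\] ", re.MULTILINE),
    "list": re.compile(r"^[ \t]*[-*] (?!\[[ xX]\] )", re.MULTILINE),
}


def count_markdown_elements(text):
    """Count code blocks first, then count other elements outside of them."""
    counts = {"code_block": len(CODE_BLOCK.findall(text))}
    remainder = CODE_BLOCK.sub("", text)
    for name, pattern in ELEMENT_PATTERNS.items():
        counts[name] = len(pattern.findall(remainder))
    return counts
```

Code blocks are removed before the other patterns run so that, say, a `#` comment inside a block is not counted as a section name.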
Language-independent: (GitHub)
- `@github_username` count of occurrences
  - implementation: custom tokeniser
- issue / PR reference (e.g. `#12`) count of occurrences
  - implementation: custom tokeniser
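The custom tokeniser for these two GitHub features could start out as a pair of regular expressions. A sketch only: the patterns and the function name are assumptions, and the lookbehinds are there to skip email addresses and `#` characters glued to other text.

```python
import re

# @username: alphanumerics and hyphens, not preceded by a word character or slash.
MENTION = re.compile(r"(?<![\w/])@([A-Za-z0-9][A-Za-z0-9-]{0,38})")
# #123 style issue / PR references, not preceded by a word character or '&'.
ISSUE_REF = re.compile(r"(?<![\w&])#(\d+)\b")


def github_reference_counts(text):
    """Count @mentions and #number references in an issue body."""
    return {
        "mention_count": len(MENTION.findall(text)),
        "issue_ref_count": len(ISSUE_REF.findall(text)),
    }
```

For a body like "Thanks @octocat, this duplicates #12", this counts one mention and one reference.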