maple's People

Contributors

imzzheng, omarpaladines, zelliott

maple's Issues

Nenkova Meeting 10/5

  • Improve statistical analysis:
    • Plot tokens vs types within a single topic as the number of tokens increases
    • Use 100 medline papers instead of just 1
    • Determine shared vocabulary
    • Any other analysis we think might be useful
  • Train 3 language models:
    • (1) science/medical section in the NYT
    • (2) nltk's Brown corpus
    • (3) medline articles
    • Make sure the number of tokens is equal for each corpus.
  • Design a test to manually evaluate difficulty:
    • Build a test so that people can evaluate the difficulty of texts
    • Each of us evaluates 200 texts
    • Design some kind of rubric, metric to judge difficulty
  • Take a look at PLOS ONE for comparing abstract difficulty

Breaking these up into tasks:

  1. Pull 100 medline files of abstract text from nlpgrid
  2. Perform statistical analysis on this text
  3. Build a generic language model generator where we can feed in a corpus and it spits out a language model
  4. Pull the NYT corpus, Brown corpus, (we already have 100 file medline corpus)
  5. Design a test to evaluate difficulty

Omar README

I kicked off an nlpgrid job this afternoon that runs through 100 medline files and pulls all abstracts with the MeSH keyword "Obesity". Once that's finished, we can do the stat analysis on the output.

The output for a single topic will be of the form:

<FilesAbstracts>
  <Topic>[topic]</Topic>
  <File>
    <Abstract>[abstract text]</Abstract>
    <Abstract>[abstract text]</Abstract>
    ...
  </File>
  <File>
    <Abstract>[abstract text]</Abstract>
    <Abstract>[abstract text]</Abstract>
    ...
  </File>
  ...
</FilesAbstracts>
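Once files in this format exist, pulling the abstracts back out is straightforward; here's a minimal sketch using the standard library (the function name and file handling are just illustrative):

```python
import xml.etree.ElementTree as ET

def load_abstracts(source):
    """Return (topic, list of abstract strings) from one output file.

    `source` is a path or file-like object in the <FilesAbstracts>
    format shown above.
    """
    root = ET.parse(source).getroot()
    topic = root.findtext("Topic")
    # Flatten every <Abstract> across every <File> for this topic.
    abstracts = [a.text for f in root.findall("File")
                 for a in f.findall("Abstract")]
    return topic, abstracts
```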

Given it's going to take some time for that script to finish, the best move is probably to create a test XML file of the same format and test your script on it while we wait for the actual data. In the end, we need to produce a couple of numbers/plots for each topic.

  • Total number of abstracts
  • Total number of types
  • Total number of tokens
  • A plot of types vs. tokens (essentially a line chart, with the x-axis being number of tokens processed, and the y-axis being number of types identified)
  • A plot identical to the one above, but this time ignoring any token/type with <= 5 occurrences
  • Maybe a histogram of counts of tokens (i.e. take the top 100-1000 tokens, and make a histogram of their counts)
  • Anything else?
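The counts and the types-vs-tokens curve above can be produced in one pass; a sketch (whitespace tokenization is an assumption, we may want something smarter):

```python
from collections import Counter

def type_token_curve(abstracts, min_count=0):
    """Return a list of (tokens_seen, types_seen) points, one per abstract.

    With min_count > 0, a type only counts once it has appeared more
    than min_count times (e.g. 5, for the filtered plot above).
    """
    counts = Counter()
    tokens_seen = 0
    points = []
    for text in abstracts:
        for tok in text.lower().split():  # naive whitespace tokenizer
            tokens_seen += 1
            counts[tok] += 1
        types_seen = sum(1 for c in counts.values() if c > min_count)
        points.append((tokens_seen, types_seen))
    return points
```

The final point gives the total token and type counts; the whole list is the data for the line chart.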

Let me know if you need help with any of this stuff.

Nenkova Meeting 9/30

  • Get some specific data from the research papers to support underlying motivation
  • Read the CMU paper
  • Unigram language model (probabilities of certain words appearing in text). Trigram language model (probability of 3 words appearing in sequence). Perhaps stick to unigram for baseline system.
  • Train model on some publicly read corpus (NYT, Wikipedia). Don't train on med_line corpus.
  • SRILM (standard tool used for language modeling), CMU also has their own.
  • Determine paper readability by summing log probs. Have to make sure the corpus sizes of each corpus are identical (or normalized by the square of the number of words n).
  • Evaluation: Each person gets 20 articles from each domain (mutually exclusive), and judges difficulty. Then, compare with analysis. Do this evaluation before training/testing.
  • Evaluation: Also can compare to the difficulty of each individual topic. That is, if obesity has a difficulty of 0.6, articles tagged with obesity should have a difficulty of 0.6.
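As a concrete version of the unigram idea above: train on token counts, then score a document by its summed (here length-normalized) log probabilities. Add-one smoothing is an assumption on my part; SRILM would handle this properly, this sketch just pins down the arithmetic:

```python
import math
from collections import Counter

def train_unigram(tokens):
    """Return a log-probability function over tokens, with add-one smoothing."""
    counts = Counter(tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 bucket for unseen words
    def logprob(tok):
        return math.log((counts[tok] + 1) / (total + vocab))
    return logprob

def readability_score(logprob, tokens):
    """Average log probability per token; higher = more 'ordinary' text."""
    return sum(logprob(t) for t in tokens) / len(tokens)
```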

Nenkova 10/14

  • Cloze test & readability test
  • 100 abstracts each- 75 unique, 5 repeats, 25 shared
  • For each abstract, randomly remove 5 words, perform cloze test, then re-add words and perform readability test
  • Pull NYT articles from nlpgrid
  • Plot that shows what % of tokens are seen only rarely
  • Plot that shows, as you read more text, how many types have been seen more than five times
  • Compare Cell Line Tumor & Obesity texts with Spencer's lang model
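The "randomly remove 5 words" step can be sketched like this (whitespace tokenization and the blank marker are assumptions; keeping the answers lets us re-add the words for the readability version):

```python
import random

def make_cloze(text, n_blanks=5, seed=None):
    """Blank n_blanks random words; return (blanked_text, {index: answer})."""
    words = text.split()
    rng = random.Random(seed)  # seed for reproducible test generation
    idxs = rng.sample(range(len(words)), n_blanks)
    answers = {i: words[i] for i in idxs}
    blanked = [("_____" if i in idxs else w) for i, w in enumerate(words)]
    return " ".join(blanked), answers
```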

Outstanding tasks given by Nenkova

  • Get nlpgrid accounts from Nenkova
  • Choose 5 topics/journals by some criterion
  • Determine the size of each topic (number of tokens)
  • Determine the vocabulary of each topic (number of unique words, called types)
  • Plot types vs tokens for each topic
  • Discard any word that has appeared fewer than 5 times

Nenkova 11/4

Tasks:

  • Get the LDC datasets from the library and store/organize the transcripts
  • Finish the cloze + readability web test

Presentation structure:

  • What we are building (mockups of what we're building)
    • Maybe a timeline of where we're at now and where we're eventually headed (timeline of mockups or something)
  • What we've built so far
  • Could demo the cloze test, interactive with the class
  • Very, very clearly explain the language model and its underlying assumptions
    • Bring in some of the analysis (plots) we've generated
  • Could demo which domains come out as most difficult under language models trained on different corpora.

Meeting 1/20

Zhi:

  • Shuffle questions within each test.
  • Add group ids to tests

Zack:

  • Work with TAs to email links out to students.
  • Change readability to difficulty
  • Domain difficulty thing

Omar:

  • Bunch of language model stuff (looks like you took notes)
  • Figure out how to deal with unknown tokens
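For the unknown-token question, one standard option is to map rare training words to an <UNK> symbol, so words unseen at test time still get a real probability; a sketch (the threshold k is arbitrary, not something we've decided):

```python
from collections import Counter

def with_unk(tokens, k=2):
    """Replace training tokens seen fewer than k times with '<UNK>'."""
    counts = Counter(tokens)
    return [t if counts[t] >= k else "<UNK>" for t in tokens]
```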

Spencer:

General:

  • Need to determine whether domain determines readability, or whether readability varies greatly within a single domain
  • Keep track of results better - weekly conclusions

Meeting 12/2

  • Omar: Graphs for coverage of lang models
  • Spencer: Create a lang model that is made up from random words chosen from the abstracts
  • Spencer: Compute lang model stuff at some fixed length (150 words)
  • Spencer: Fraction of unknown tokens for each lang model
  • Zack: Figure out which abstracts were shared and do some analysis, avg abstract length per topic, stat conf interval
  • Zhi/Zack: Prepare cloze test for potentially more people
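The unknown-token fraction is cheap to compute once a model vocabulary exists; a sketch (whitespace tokenization assumed):

```python
def unknown_fraction(vocab, tokens):
    """Share of test tokens not covered by the model vocabulary."""
    unknown = sum(1 for t in tokens if t not in vocab)
    return unknown / len(tokens)
```

Running this per language model over the same test abstracts gives the coverage numbers for Omar's graphs.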

Meeting 1/13

TODO

  • Zack - analysis on repeated questions
  • Have the 7 tests finished by next Friday so we can test everything (release to students the week of the 23rd)
  • Easiest and hardest topics, plus our abstract data and cloze test scores for them

Notes:

  1. Test Generation Structure
    • Test size of 7, piloted on Prof. Nenkova's linguistics class; want to record test time.
    • Each question is given to three students (the minimum to break ties; pilot number).
    • Maybe eventually just one question for MTurk.
    • Insert new blanks when really simple words were removed (e.g. "the"), or use canned phrases for each abstract at the end.
    • Maybe include some NYT or gossip material (fairly readable, common stuff), and test whether accuracy increases.

  2. MTurk
    • Give workers a link to our website and have them take the test there. After they finish, generate a code (based on their work on the blank questions) that they can paste into MTurk to mark the task as finished (for payment and identification).
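One way to generate that completion code so it's verifiable on our side is to hash the worker's answers with a server-side secret; a sketch (the secret, code length, and message layout are hypothetical choices, not anything we've agreed on):

```python
import hashlib
import hmac

SECRET = b"site-secret"  # hypothetical server-side key

def completion_code(worker_id, answers):
    """Short, deterministic code derived from a worker's blank answers."""
    msg = (worker_id + "|" + "|".join(answers)).encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()[:10]
```

The same computation on the server verifies the pasted code and ties it to a specific worker and answer set.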

  3. Language Model
    • Use other metrics, e.g. the vocab from NYT.
    • Skip over unseen words.
