maple's People

Contributors

imzzheng, omarpaladines, zelliott

maple's Issues

Nenkova Meeting 10/5

  • Improve statistical analysis:
    • Plot tokens vs types within a single topic as the number of tokens increases
    • Use 100 medline papers instead of just 1
    • Determine shared vocabulary
    • Any other analysis we think might be useful
  • Train 3 language models:
    • (1) science/medical section in the NYT
    • (2) nltk's Brown corpus
    • (3) medline articles
    • Make sure the number of tokens is equal for each corpus.
  • Design a test to manually evaluate difficulty:
    • Build a test so that people can evaluate the difficulty of texts
    • Each of us evaluates 200 texts
    • Design some kind of rubric, metric to judge difficulty
  • Take a look at PLOS ONE for comparing abstract difficulty

Breaking these up into tasks:

  1. Pull 100 medline files of abstract text from nlpgrid
  2. Perform statistical analysis on this text
  3. Build a generic language model generator where we can feed in a corpus and it spits out a language model
  4. Pull the NYT corpus, Brown corpus, (we already have 100 file medline corpus)
  5. Design a test to evaluate difficulty

Omar README

I kicked off an nlpgrid job this afternoon that runs through 100 medline files and pulls all abstracts with the MeSH keyword "Obesity". Once that's finished, we can do the stat analysis on the output.

The output for a single topic will be of the form:

<FilesAbstracts>
  <Topic>[topic]</Topic>
  <File>
    <Abstract>[abstract text]</Abstract>
    <Abstract>[abstract text]</Abstract>
    ...
  </File>
  <File>
    <Abstract>[abstract text]</Abstract>
    <Abstract>[abstract text]</Abstract>
    ...
  </File>
  ...
</FilesAbstracts>
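Once files in this format exist, pulling the abstracts back out is straightforward; here's a minimal sketch using the standard library (the function name and file handling are just illustrative):

```python
import xml.etree.ElementTree as ET

def load_abstracts(source):
    """Return (topic, list of abstract strings) from one output file.

    `source` is a path or file-like object in the <FilesAbstracts>
    format shown above.
    """
    root = ET.parse(source).getroot()
    topic = root.findtext("Topic")
    # Flatten every <Abstract> across every <File> for this topic.
    abstracts = [a.text for f in root.findall("File")
                 for a in f.findall("Abstract")]
    return topic, abstracts
```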

Given it's going to take some time for that script to finish, the best move is probably to create a test XML file of the same format and test your script on it while we wait for the actual data. In the end, we need to produce a couple of numbers/plots for each topic.

  • Total number of abstracts
  • Total number of types
  • Total number of tokens
  • A plot of types vs. tokens (essentially a line chart, with the x-axis being number of tokens processed, and the y-axis being number of types identified)
  • A plot identical to the one above, but this time ignoring any token/type with <= 5 occurrences
  • Maybe a histogram of counts of tokens (i.e. take the top 100-1000 tokens, and make a histogram of their counts)
  • Anything else?
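The counts and the types-vs-tokens curve above can be produced in one pass; a sketch (whitespace tokenization is an assumption, we may want something smarter):

```python
from collections import Counter

def type_token_curve(abstracts, min_count=0):
    """Return a list of (tokens_seen, types_seen) points, one per abstract.

    With min_count > 0, a type only counts once it has appeared more
    than min_count times (e.g. 5, for the filtered plot above).
    """
    counts = Counter()
    tokens_seen = 0
    points = []
    for text in abstracts:
        for tok in text.lower().split():  # naive whitespace tokenizer
            tokens_seen += 1
            counts[tok] += 1
        types_seen = sum(1 for c in counts.values() if c > min_count)
        points.append((tokens_seen, types_seen))
    return points
```

The final point gives the total token and type counts; the whole list is the data for the line chart.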

Let me know if you need help with any of this stuff.

Nenkova Meeting 9/30

  • Get some specific data from the research papers to support underlying motivation
  • Read the CMU paper
  • Unigram language model (probabilities of certain words appearing in text). Trigram language model (probability of 3 words appearing in sequence). Perhaps stick to unigram for baseline system.
  • Train model on some publicly read corpus (NYT, Wikipedia). Don't train on med_line corpus.
  • SRILM (standard tool used for language modeling), CMU also has their own.
  • Determine paper readability by summing log probs. Have to make sure the corpus sizes of each corpus are identical (or normalized by the square of the number of words n).
  • Evaluation: Each person gets 20 articles from each domain (mutually exclusive), and judges difficulty. Then, compare with analysis. Do this evaluation before training/testing.
  • Evaluation: Also can compare to the difficulty of each individual topic. That is, if obesity has a difficulty of 0.6, articles tagged with obesity should have a difficulty of 0.6.
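As a concrete version of the unigram idea above: train on token counts, then score a document by its summed (here length-normalized) log probabilities. Add-one smoothing is an assumption on my part; SRILM would handle this properly, this sketch just pins down the arithmetic:

```python
import math
from collections import Counter

def train_unigram(tokens):
    """Return a log-probability function over tokens, with add-one smoothing."""
    counts = Counter(tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 bucket for unseen words
    def logprob(tok):
        return math.log((counts[tok] + 1) / (total + vocab))
    return logprob

def readability_score(logprob, tokens):
    """Average log probability per token; higher = more 'ordinary' text."""
    return sum(logprob(t) for t in tokens) / len(tokens)
```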

Nenkova 10/14

  • Cloze test & readability test
  • 100 abstracts each- 75 unique, 5 repeats, 25 shared
  • For each abstract, randomly remove 5 words, perform cloze test, then re-add words and perform readability test
  • Pull NYT articles from nlpgrid
  • Plot that shows what % of tokens are seen only rarely
  • Plot that shows, as you read more text, how many types have been seen more than five times
  • Compare Cell Line Tumor & Obesity texts with Spencer's lang model
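The "randomly remove 5 words" step can be sketched like this (whitespace tokenization and the blank marker are assumptions; keeping the answers lets us re-add the words for the readability version):

```python
import random

def make_cloze(text, n_blanks=5, seed=None):
    """Blank n_blanks random words; return (blanked_text, {index: answer})."""
    words = text.split()
    rng = random.Random(seed)  # seed for reproducible test generation
    idxs = rng.sample(range(len(words)), n_blanks)
    answers = {i: words[i] for i in idxs}
    blanked = [("_____" if i in idxs else w) for i, w in enumerate(words)]
    return " ".join(blanked), answers
```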

Outstanding tasks given by Nenkova

  • Get nlpgrid accounts from Nenkova
  • Choose 5 topics/journals by some criterion
  • Determine the size of each topic (number of tokens)
  • Determine the vocabulary of each topic (number of unique words, called types)
  • Plot types vs tokens for each topic
  • Discard any word that has appeared fewer than 5 times

Nenkova 11/4

Tasks:

  • Get the LDC datasets from the library and store/organize the transcripts
  • Finish the cloze + readability web test

Presentation structure:

  • What we are building (mockups of what we're building)
    • Maybe a timeline of where we're at now and where we're eventually headed (timeline of mockups or something)
  • What we've built so far
  • Could demo the cloze test, interactive with the class
  • Very, very clearly explain the language model and its underlying assumptions
    • Bring in some of the analysis (plots) we've generated
  • Could demo which domains come out as most difficult under language models trained on different corpora.

Meeting 1/20

Zhi:

  • Shuffle questions within each test.
  • Add group ids to tests

Zack:

  • Work with TAs to email links out to students.
  • Change readability to difficulty
  • Domain difficulty thing

Omar:

  • Bunch of language model stuff (looks like you took notes)
  • Figure out how to deal with unknown tokens
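For the unknown-token question, one standard option is to map rare training words to an <UNK> symbol, so words unseen at test time still get a real probability; a sketch (the threshold k is arbitrary, not something we've decided):

```python
from collections import Counter

def with_unk(tokens, k=2):
    """Replace training tokens seen fewer than k times with '<UNK>'."""
    counts = Counter(tokens)
    return [t if counts[t] >= k else "<UNK>" for t in tokens]
```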

Spencer:

General:

  • Need to determine whether domain determines readability, or whether readability varies greatly within a single domain
  • Keep track of results better - weekly conclusions

Meeting 12/2

  • Omar: Graphs for coverage of lang models
  • Spencer: Create a lang model that is made up from random words chosen from the abstracts
  • Spencer: Compute lang model stuff at some fixed length (150 words)
  • Spencer: Fraction of unknown tokens for each lang model
  • Zack: Figure out which abstracts were shared and do some analysis, avg abstract length per topic, stat conf interval
  • Zhi/Zack: Prepare cloze test for potentially more people
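The unknown-token fraction is cheap to compute once a model vocabulary exists; a sketch (whitespace tokenization assumed):

```python
def unknown_fraction(vocab, tokens):
    """Share of test tokens not covered by the model vocabulary."""
    unknown = sum(1 for t in tokens if t not in vocab)
    return unknown / len(tokens)
```

Running this per language model over the same test abstracts gives the coverage numbers for Omar's graphs.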

Meeting 1/13

TODO

  • Zack - analysis on repeated questions
  • Have the 7 tests finished by next Friday so we can test everything (release to students the week of the 23rd)
  • Easiest and hardest topics, plus our abstract data and cloze test scores for them

Notes:

  1. Test Generation Structure
    • Test size of 7, piloted on Prof. Nenkova's linguistics class; want to record test time.
    • Each question is given to three students (the minimum to break ties; pilot number).
    • Maybe eventually just one question for MTurk.
    • Insert new blanks when really simple words were removed (e.g. "the"), or use canned phrases for each abstract at the end.
    • Maybe include some NYT or gossip material (fairly readable, common stuff), and test whether accuracy increases.

  2. MTurk
    • Give workers a link to our website and have them take the test there. After they finish, generate a code (based on their work on the blank questions) that they can paste into MTurk to mark the task as finished (for payment and identification).
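One way to generate that completion code so it's verifiable on our side is to hash the worker's answers with a server-side secret; a sketch (the secret, code length, and message layout are hypothetical choices, not anything we've agreed on):

```python
import hashlib
import hmac

SECRET = b"site-secret"  # hypothetical server-side key

def completion_code(worker_id, answers):
    """Short, deterministic code derived from a worker's blank answers."""
    msg = (worker_id + "|" + "|".join(answers)).encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()[:10]
```

The same computation on the server verifies the pasted code and ties it to a specific worker and answer set.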

  3. Language Model
    • Use other metrics, e.g. the vocab from NYT.
    • Skip over unseen words.
