Coder Social home page Coder Social logo

asilev / nlp_finalproject Goto Github PK

View Code? Open in Web Editor NEW
0.0 0.0 0.0 87.14 MB

The goal for this project is to research and present a method for hebrew short messages stylometry. The system that should be implemented should be able to identify an author of a message based on previously learned authoring style. The goal is to study the features required for such identification, when working on short messages in Hebrew language.

Java 80.84% Python 19.16%
nlp nlp-machine-learning hebrew stylometry

nlp_finalproject's Introduction

Hebrew Short Message Stylometry

The goal for this project is to research and present a method for hebrew short messages stylometry. The system that should be implemented should be able to identify an author of a message based on previously learned authoring style. The goal is to study the features required for such identification, when working on short messages in Hebrew language.

the code of this project is arrange according to the architecture shown in https://github.com/asilev/NLP_FinalProject/blob/master/doc/architecture/NLP%20Author%20Classification%20Architecture.png.

The files in TwitterReader are used to retrieve data sets from Twitter, using the twitter4j framework. This folder is meant to be compiled and run as a standalone executable with its own main method (Twitter4j is provided in a pre-compiled JAR library, twitter4j-core-4.0.4.jar, imported into the project). Running it requires one parameter, which is the output path (where to put the tweets from tweeter, in two formats - raw and normalized). After reading the raw data, our code "normalizes" (or re-fromats) the data received in order to prepare it for processing by the YAP library. In this process, two sets of files are generated - for each user, the raw data file named "ID_username.txt" (ID is a serial number starting with 00) and a normalized file named the same, with the suffix "-normalized".

The files with the "-normalized" suffix are then fed, one by one, manually, to the YAP tool. YAP (and GO) were downloaded from GitHub (and https://golang.org/), respectively. After built into an executable tool, we used YAP with two commands: "yap hebma -raw input.raw -out lattices.conll": This command runs YAP as a morphological analyzer. The ouptut file "lattices.conll" is then fed to the second command - "yap md -in lattices.conll -om output.conll": This command runs YAP disambiguation function and outputs the file "output.conll" which provides, for each tweet, its morphological analysis, allowing us to continue with the simulations.

The files in Feature model are helper classes that help arrange the output data from YAP according to different types of feature sets: AllFeaturesModel creates a vector with all the features from other models. BagOfWordsModel has several static methods that create different types of vectors, where each feature in the vector is a word or word-based feature (bi-gram, word with POS, stemmed word, etc.). FeatureMarker has a static method that creates a vector with style marker features. FeatureSelector has a static method that creates a vector with selected features from other classes. This can be used to create several bag of words models combined, a style marker model with any sub-set of style features, or a combination of models.

These classes and static methods can be read from a learning model.

LogisticRegression uses Logistic regression to train a model by externalizing the model's parameters as files. It then evaluates the model using the test corpus, and prints the accuracy. This class is meant to be run as a stand-alone executable. It receives two parameters: The input path (that should contain outputs from YAP) and encoding type for these files(e.g. UTF8). This class was not used intensively, some parameters are hard coded and can easily be changed in source code. The features to be tested are put inside a model object by commenting and uncommenting any feature to be removed / added respectively. The number of authors to train and evaluate is a variable named numOfAuthors.

SupportVectorMachine is used in order to prepare the inputs for the svmlib tool. This is also meant to be run as a stand-alone executable that receives parameters in a key-value format. The keys and their default values are described hereafter, and where no default is mentioned it means that this is not an optional key:

  • [-p input_path]
  • [-e encoding=UTF8]
  • [-t training_filename=training]
  • [-v evalutation_filename=evaluation]
  • [-f feature_mapper_filename=featureMapper]
  • [-m model_filename]
  • [-n number_of_authors=10]

The model_filename is a simple text file, that contains names of features that this model should be trained on. Each line contains a name of a feature-set, and lines preceded with "%" are considered comments and their feature is not used by the model. The class creates the required input files for the svmlib tool, that should be run also as a stand-alone utility in order to evaluate the model. We used the windows executable version of svmlib, in the following manner:

  • Updating the model file according to the required feature sets.
  • Running the SupportVectorMachine utility to produce inputs.
  • If scaling is tested, running svm-scale.exe -l 0 [name of training input file] > [name of desired scaled training input file]
  • Then, if scaling is tested, running svm-scale.exe -l 0 [name of evaluation input file] > [name of desired scaled evaluation input file]
  • running svm-train.exe -t 1 -d 1 -r 0 -g [gamma parameter] -c [c parameter] [name of training / scaled training input file]
    • The c and gamma parameters were changed in runs in order to find optimal values.
  • running svm-predict.txt [name of evaluation / scaled evaluation input file] [name of training / scaked training model file] eval
  • The output of the svm-predict was saved as the accuracy of the model. Examples of python scripts that we used for this process can be found in resources folder.

The last class we've used is the "ConfusionMatrixBuilder", that takes a tagged evaluation file and an actual evaluation from libSVM, and creates the confusion matrix for these parameters. This currently assumes 10 authors exactly.

The rosources folder is organized as follows:

  • RawDataFromTwitter is the output of the TwitterReader, and its normalized form.
  • OutputFromYap is the output from YAP, after given the normalized form of the twitter reader as input. It is further divided to 1K_20A for ~1000 tweets from 20 authors, 10K_20A for ~4,000 tweets from 20 authors, and 10K_10A for ~4,000 tweets from 10 authors
  • Results shows results for our runs. Each result folder may contain:
    • How we ran the SVM model, using a python script call SvmRunner. (Logistic Regression was ran without scripts). This also includes the hyper parameters in use.
    • model.txt file, that lists the features that were used (features listed in this file without the % symbol).
    • The training file, which is the output of the SvmPrepare (on gold corpus) and input for svm-train.
    • The evaluation file, which is also the outptu of the SvmPrepare (on test corpus) and input for svm-predict.
    • The training model, which is the output of svm-train.
    • FeatureMapper, a file that maps the number of features in other files to their names.
    • resultFile, the output of the running script, with all the accuract results from svm-predict.
    • The folders are arranged as follows:
      • BOBW: A test for Bag of Bigram Words.
      • BOPW: A test for Bag of POS Words.
      • BOSW: A test for Bag of Stemmed Words.
      • BOW: A test for simple Bag of Words.
      • ConfusionMatrix: A test to provide input to confusion matrix analysis (includes the actual prediction inside as eval file).
      • FeatureContribution: A test to see which feature from style marker mostly improves bag of words. This folder includes a file for each style marker feature.
      • SVMn: For each n, an SVM run with all style marker feature for n authors.
      • Scale-FM: A run of SVM with all style marker features with feature scaling.
      • Scale-BOW: A run of SVM with Bag of Words features with feature scaling.

nlp_finalproject's People

Contributors

asilev avatar ronenj avatar

nlp_finalproject's Issues

Implement Feature Marker

Feature marker model is equivalent to bag of words model (both represent features models).
This model should get as input Map<String, List<List>> userTweets (as read from structured data reader); This represents a map from author (string) to a list of tweets, where each tweet is a list of morphemes. A morepheme is a struct of its attributes (gender, stem, num, etc.).
And should return List<TaggedFeatureVector>, which represents:

  1. A tag, which is the author
  2. A feature vector, which is a hashmap from "feature" (of type K) and its value.
    Note: when creating a feature vector, you must provide the name of the zero index, from type K.

Segmentation of Data

The system should be able to deal with segmented data, and non segmented data.
The segmentation should take place over all data, and create an alternative data source that is segmented.

Data is segmented in YAP

Data Retrieval

Retrieve data for training and validating.
Data sources: Twitter, SMS, Whatsapp, Telegram, Facebook.
Data should be short, up to 144 characters per message.
Data should be tagged with the author id.
Data should contain many messages per author. assumed ~500 messages per author.
At least 10 different authors should be provided.

  • Find data sources
  • Extract the data
  • Encode (transliterate) data and normalize.

All done; transliterate is now performed as part of YAP

Paper Research

Search for relevant papers, and summarize important and relevant aspects of them.
summaries should be shared between project members, with links to the original paper.

Add corpus matching features

Download available Hebrew corpora, and for each corpus create in preprocess a list of unique words from it.
Add a feature that checks for each tweet, how many words are from a particular list (corpus).

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.