
Comments (43)

gabrielrueda avatar gabrielrueda commented on September 2, 2024 2

Update

Here is an update: I researched how to detect gender based on name. First, I found a few APIs that detect gender, but those had limits that would not be practical for us. Then I looked into three articles, labeled as Methods 1-3 in the PDF. In the PDF I provide some rough notes that I took from each of the articles. Methods 1 and 3 take a similar approach, while Method 2 is different.

For preprocessing, Methods 1 and 3 split each name into characters and assign a number to each possible character. Method 2 uses a count vectorizer: it takes substrings of a specified size (e.g. 2-4 chars), collects all such substrings that occur in the dataset, and then, for each name, counts the occurrences of each substring.

For the model, Methods 1 and 3 use a bidirectional LSTM. Method 2 uses logistic regression. From what I read, logistic regression would be more lightweight but less accurate. However, I am not too familiar with these models and will probably do more research on them.

Also, there are possible differences in patterns for names from different countries. We could possibly train different models for a few different countries, but I am not sure how practical this would be. 

We can also account for names that are used interchangeably between two genders (e.g. Alex) and ignore those. 

Let me know if you have any suggestions.


Notes

Method 1:

Input Modifications:
  • lowercase
  • split each character
  • pad empty spaces to make all names same length
  • encode characters to numbers
    • space = 0, a = 1, b = 2, ...
  • Encode gender (F to 0 and M to 1)
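
The input modifications above can be sketched in a few lines. This is only an illustration: `max_len` and the choice to map any character outside a-z to the padding index are assumptions, not something taken from the article.

```python
import string

# Index 0 is reserved for the padding character (space); a = 1, b = 2, ...
CHAR_TO_INDEX = {" ": 0}
CHAR_TO_INDEX.update({ch: i for i, ch in enumerate(string.ascii_lowercase, start=1)})

# Gender labels as in the notes: F -> 0, M -> 1
GENDER_TO_LABEL = {"F": 0, "M": 1}


def encode_name(name, max_len=10):
    """Lowercase the name, pad or truncate it to max_len, and encode each character."""
    chars = name.lower()[:max_len].ljust(max_len)
    # Assumption: characters outside a-z (hyphens, accents) fall back to index 0.
    return [CHAR_TO_INDEX.get(ch, 0) for ch in chars]


print(encode_name("Tina", max_len=6))  # [20, 9, 14, 1, 0, 0]
```

The resulting integer lists are what an embedding layer would take as input.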
NLP Model:

Embedding Layer

  • to embed each input character's encoded number into a dense 256 dimension vector.
  • embedding is a method used to represent discrete variables as continuous vectors

Bidirectional LSTM Layer

  • read the seq of character embeddings from the previous step and output a single vector representing that sequence
  • The values for units and dropouts are hyperparameters as well

Dense Layer

  • outputs single value close to 0 as 'F'
    • close to 1 for 'M'
  • Not sure if we should also have threshold
    • options for kind of male or kind of female

Method 2:

Generate Features
  • Count Vectorizer: a way to build vocabulary and features from a corpus automatically
    • frequency of substrings in a certain string
  • Example (2-4) char count vectorizer for "Chris" (lowercased)
    • ch
    • hr
    • ri
    • is
    • chr
    • hri
    • ris
    • chri
    • hris
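
To make the substring counting concrete, here is a small dependency-free sketch of a 2-4 character n-gram counter. The function names are made up for illustration; a library vectorizer (for example scikit-learn's `CountVectorizer` with `analyzer="char"` and `ngram_range=(2, 4)`) would do the same over a whole corpus.

```python
from collections import Counter


def char_ngrams(name, min_n=2, max_n=4):
    """Return every substring of length min_n..max_n, shortest first."""
    text = name.lower()
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams


def ngram_counts(name, min_n=2, max_n=4):
    """Count how often each n-gram occurs in a single name."""
    return Counter(char_ngrams(name, min_n, max_n))


print(char_ngrams("Chris"))
# ['ch', 'hr', 'ri', 'is', 'chr', 'hri', 'ris', 'chri', 'hris']
```

The per-name counts are the feature vector that the logistic regression would receive.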

For These few names:
2-4 grams

Logistic Regression
  • lightweight model
    • if runtime is more important than accuracy -> this is a good option
  • other options: Decision Trees, Neural Networks, SVMs
  • input: the frequency of the repetition of the sub-strings defined above
    • Example:
    • Tina
Improvements Suggested:
  • 87% accuracy -> ways to improve this could be
    • training one model per country -> feminine/masculine names may differ between countries
    • vocabulary size could be too large for count vectorizer

Method 3:

Data Preprocess
  • removes accents from letters
LSTM
  • also uses LSTM algorithm -> character level LSTM
  • used bidirectional LSTM

from adila.

gabrielrueda avatar gabrielrueda commented on September 2, 2024 2

@hosseinfani @Hamedloghmani
It seems the second option (https://genderize.io/) is good based on it being more affordable than the other website, while having data from various countries.

I also had an idea to reduce number of requests we would need to make:

What if we kept our own record of name and gender every time we make a request? If there were duplicate names in our datasets, then we wouldn't have to make the request again.
This could possibly reduce the 11,215,383 requests to a number below 10,000,000.
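
A minimal sketch of that record-keeping idea, assuming names are compared case-insensitively. `fetch_gender` stands for whatever function performs the real API request; `fake_fetch` below is only a stand-in so the example runs offline.

```python
def make_cached_lookup(fetch_gender):
    """Wrap fetch_gender so that each distinct name triggers at most one request."""
    cache = {}
    stats = {"requests": 0}

    def lookup(first_name):
        key = first_name.strip().lower()  # assumption: case and padding do not matter
        if key not in cache:
            stats["requests"] += 1
            cache[key] = fetch_gender(key)
        return cache[key]

    return lookup, stats


def fake_fetch(name):
    # Hypothetical stand-in for the paid API call.
    return ("male", 0.99) if name == "hossein" else (None, 0.0)


lookup, stats = make_cached_lookup(fake_fetch)
for name in ["Hossein", "hossein", "Ali", "ali ", "Hossein"]:
    lookup(name)
print(stats["requests"])  # 2
```

Five lookups cost two requests here because only two distinct names occur.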

from adila.

Hamedloghmani avatar Hamedloghmani commented on September 2, 2024 2

Hi Gabriel, hope you are doing well.
Last time we spoke in the issue page, I mentioned two APIs for retrieving gender from names.
Before entering the final phase and buying one, we would like to run an experiment. I broke down the steps as follows:

  1. Create an Excel or CSV file with 100 random full names (first name and last name). Please try to include some diversity in these samples regarding gender and country of origin.
  2. Using the free version of each API, get the results for each of these samples. gender-api takes the last name too, but genderize.io does not.
  3. Compare the results and decide between the two APIs based on the output of this experiment. The final output will have 4 columns: the first two are first name and last name, the third and fourth are the gender results from each of the APIs.

Please note that genderize.io also has a link to a Python library for usage as well as their API details, which might be helpful. You can log your process in the issue page and define tasks in Trello as well. Let me know what you think about it.

from adila.

hosseinfani avatar hosseinfani commented on September 2, 2024 2

@gabrielrueda
@Hamedloghmani
I assume the dataset is dblp, right?
Also, keep track of which expert has which (firstname, lastname), because later we want to double check with the actual persons in the dataset. Something like this:

[40], Hossein, Fani, 0, 0
[42], Ali, Fani, 0, 1
...
0 being male, 1 being female,
e.g., Hossein Fani is the 40th author in dblp

make sense?

from adila.

gabrielrueda avatar gabrielrueda commented on September 2, 2024 2

@Hamedloghmani

Here are the observations that I mentioned earlier. They are some of the inconsistencies in how names are represented in the dataset.

Case 1) The dataset shows "DuarteCesar" but on dblp the actual name is "Cesar Duarte"
Case 2) The dataset shows "A AbramovSergei" but on dblp the actual name is "Sergei A. Abramov" 
Case 3) The dataset shows "A A Aoude" but there are multiple results on dblp (no first name given, it seems)
Case 4) The dataset shows "M. Turunen" but M is just the first letter of the first name

I looked into a few names like case 1 and case 2 and there seems to be a pattern, that when there is no space between the names, the order for first name, last name is reversed. I was wondering if I could implement something in the code to detect that and assume case 1 and case 2. As for case 3 and 4, I was thinking I could just discard those. 


My idea to deal with these cases:

Pass Through 1:

  1. Create new json file(e.g. dblp_correctNames.json)
  2. If the name is successful in finding first name using regular method (FIRSTNAME SPACE LASTNAME), that json row is copied to the dblp_correctNames.json file.
  3. If the name follows case (1) or case (2), then the name will be modified and also copied to the dblp_correctNames.json file.
  4. Otherwise, if the name follows case (3) or case (4), those will be copied to another json file (dblp_failed_to_parse_name.json), where they could possibly be used for future if needed.
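
The case handling in Pass Through 1 could look roughly like the sketch below. It is a heuristic built only from the four example cases: the regular expressions and the function name are assumptions, and surnames with an internal capital (e.g. "McDonald") would be misread as fused names.

```python
import re

# A fused token is two capitalised words with no space, e.g. "DuarteCesar".
FUSED = re.compile(r"^([A-Z][a-z]+)([A-Z][a-z]+)$")
# A bare initial such as "A" or "M."
INITIAL = re.compile(r"^[A-Z]\.?$")


def parse_first_name(raw):
    """Return (first_name, normalised_full_name), or None if no first name is recoverable."""
    tokens = raw.split()
    if not tokens:
        return None
    fused = FUSED.match(tokens[-1])
    if fused:
        # Cases 1 and 2: the last token is LastnameFirstname, preceded by optional initials.
        last, first = fused.groups()
        middle = [t if t.endswith(".") else t + "." for t in tokens[:-1]]
        return first, " ".join([first, *middle, last])
    if len(tokens) >= 2 and not INITIAL.match(tokens[0]):
        # Regular layout: FIRSTNAME SPACE LASTNAME
        return tokens[0], raw
    # Cases 3 and 4: only initials are available, so the name is set aside.
    return None


for raw in ["DuarteCesar", "A AbramovSergei", "A A Aoude", "M. Turunen", "Cesar Duarte"]:
    print(raw, "->", parse_first_name(raw))
```

Rows for which the function returns `None` would go to the failed-to-parse file, and all others to the corrected-names file.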

Pass Through 2:

  1. Will go through dblp_correctNames.json and record all unique names as originally intended. They will be recorded in an indexed pandas dataframe with empty columns of "Gender" and "Probability"

I will let you know when I implement this idea in code and I'll share the results.

from adila.

Hamedloghmani avatar Hamedloghmani commented on September 2, 2024 1

Hi Gabriel,

Thank you so much for the update.

Great explanation. I'll take a deeper look at the methods that you kindly suggested and send you an update then we'll choose between these methods.

Thanks a lot.

from adila.

Hamedloghmani avatar Hamedloghmani commented on September 2, 2024 1

Hello @hosseinfani , @gabrielrueda
I read and thought about the methods. Based on the pros and cons and their evaluation results, I think the third method is more accurate and practical.
However, there are some significant concerns about using machine learning to predict gender based on first names:

• We need a massive dataset of names from around the world with very little or no bias in the distribution of names. This is because our target dataset (for example, DBLP) is from all around the world, and a model fitted on a specific region or country will perform considerably worse in the prediction phase.

• The best method offered around 89% accuracy with potential signs of overfitting. Even if we round that up to 90%, it is still not sufficient in our case. Even on a small scale, we would have almost 10,000 misses out of 100,000. This is critical because, based on the results from this model, we will conduct research on “fairness and bias” and make assumptions about that. I think this will be a loose end if we continue based on noisy or faulty predictions.

To put it in a nutshell, I think paid query-based APIs will be our best option at the moment. As Gabriel kindly mentioned, there are a few, and I looked into 2 popular options:

  1. https://gender-api.com/en/
  2. https://genderize.io/

The pricing seems reasonable, especially the second one.

I would be happy to hear your thoughts about my opinion.

from adila.

Hamedloghmani avatar Hamedloghmani commented on September 2, 2024 1

@hosseinfani

  • DBLP v12 has 4894081 records

  • IMDB title.basics.tsv.gz has 6321302 records

  • They sum up to 11,215,383 records. I think we have to get the 99$/month plan.

I don't think this method will be effective on GitHub though, because I examined some data and too many people use nicknames or similar as their name. But if we count that, we might have around 1,000,000 there.

from adila.

Hamedloghmani avatar Hamedloghmani commented on September 2, 2024 1

@hosseinfani
I'll check for educational discount asap.
We'll be done in almost 1 month with 10.000.000 requests/month, which is 99$/month.

from adila.

Hamedloghmani avatar Hamedloghmani commented on September 2, 2024 1

@gabrielrueda
I think it's a brilliant idea. It will add a bit of time complexity for us, but I think the number of requests will be significantly decreased.

from adila.

gabrielrueda avatar gabrielrueda commented on September 2, 2024 1

@Hamedloghmani
I just pushed my code.

from adila.

Hamedloghmani avatar Hamedloghmani commented on September 2, 2024 1

@gabrielrueda
Thank you so much for your progress report.

As we discussed, your idea was great based on the observations that you had.
Thank you, looking forward to seeing the implementation and results 😄

from adila.

gabrielrueda avatar gabrielrueda commented on September 2, 2024 1

Hello @Hamedloghmani,

After running the filter on the initial dataset, and then finding all the unique first names, the program found 275 859 unique first names in the filtered dataset (removing cases 3 & 4 as well as modifying cases 1 & 2).

I have made a pull request (#59) to include the code. I have also included the results as uniquenames_filtered.pkl and uniquenames_filtered.csv (uniquenames_filtered.pkl is there to preserve the index I set up on the pandas dataframe).

I also have the json files for dblp_correctNames.json and dblp_failed_to_parse.json. I can send these privately since the files are too large to be pushed to git.

Let me know if you have any questions.

from adila.

Hamedloghmani avatar Hamedloghmani commented on September 2, 2024 1

Thank you so much for the update @gabrielrueda
I'll go over your code and results today. We'll start labeling after I merge this request and refactor it.
I'll keep you posted.
You can upload the large results in Teams -> Adila -> DBLP Labeling Files
Thanks.

from adila.

gabrielrueda avatar gabrielrueda commented on September 2, 2024 1

Hello @Hamedloghmani,
I filled in the table with the gender of each unique name. The table will be stored as both a .pkl and a .csv, but use the .pkl to import data into the program, just like the uncommented lines in the main function.

I basically completed two steps in order to obtain data:

  1. Made HTTP requests to get the data from genderize and output the whole thing to a text file. (I have the text files on my computer if you're interested) -> makeParallelAPIReqs()
  2. Read from those text files and updated the values in the dataframe -> addGenderResultsFromFile()
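
For illustration, the second step could be sketched as below. The on-disk format of the text files is not described in this thread, so the sketch assumes one JSON object per line with `name`, `gender` and `probability` fields; `parse_saved_responses` is a hypothetical helper, not the `addGenderResultsFromFile()` function from the pull request.

```python
import json


def parse_saved_responses(lines):
    """Map each name to (gender, probability) from saved JSON lines."""
    results = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between saved responses
        record = json.loads(line)
        results[record["name"]] = (record.get("gender"), record.get("probability"))
    return results


# Made-up sample lines in the assumed format.
saved = [
    '{"name": "tina", "gender": "female", "probability": 0.98}',
    "",
    '{"name": "xqz", "gender": null, "probability": 0.0}',
]
table = parse_saved_responses(saved)
print(table["tina"])  # ('female', 0.98)
```

Names whose gender comes back as `None` correspond to the NULL results and can be filtered later.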

I left some things commented out at the bottom, since for a chunk of the data (between records 90k and 160k) I obtained the data in a different way. For future use, however, the functions in the class should be used.

Changes are in pull request #61

from adila.

Hamedloghmani avatar Hamedloghmani commented on September 2, 2024 1

@gabrielrueda
Yes. I think that's great since we might need to use inferred genders with specific levels of confidence in some cases.

Thanks.

from adila.

Hamedloghmani avatar Hamedloghmani commented on September 2, 2024 1

Hello @gabrielrueda
Thank you so much for the update. I'll go over your pull request tonight.
Please upload them in MS Teams -> Adila -> DBLP Labeling Files
You can create another folder inside this directory if you want. I can reformat it later, no worries.

Let me know if you run into any issues.
Thanks

from adila.

gabrielrueda avatar gabrielrueda commented on September 2, 2024 1

Hi @Hamedloghmani ,
I finished labelling the dataset, but I filtered some of the members from the name.basics.tsv file. Since I filtered out the names, should I update the title.basics.tsv and title.principals.tsv files to remove the entries that include the names that I filtered out in the labelling process?

from adila.

gabrielrueda avatar gabrielrueda commented on September 2, 2024 1

Hello @Hamedloghmani,
It shouldn't be time consuming to filter it out in the other files, so I'll do that and let you know when I complete it. As for the next task, yes it would be great to schedule a meeting for that.

from adila.

gabrielrueda avatar gabrielrueda commented on September 2, 2024 1

Hello @Hamedloghmani,

Here is the approach I will take to map the indexes from OpeNTF to the raw dataset:

Part 1:

In the indexes.pkl there is the 'i2c' and 'c2i' dictionaries:

Example:

'i2c': {
    9: '54977.0_Reginald_Barker'
}
'c2i': {
    '54977.0_Reginald_Barker': 9
}

The string '54977.0_Reginald_Barker' includes the member id from the raw dataset before the '.'.
In this case it would be 54977.

Part 2:

However, in the name.basics.tsv file, the member id would be listed as 'nm0054977'.

The member id has the layout of "nm" + 7 digits.

e.g. 54977 would be 'nm0054977'

(add 0's in unassigned digits)
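
Parts 1 and 2 amount to a small string conversion, sketched here. The function name is hypothetical; it assumes every key starts with the numeric id followed by a '.'.

```python
def key_to_member_id(key):
    """Convert an OpeNTF key such as '54977.0_Reginald_Barker' to 'nm0054977'."""
    number = int(key.split(".", 1)[0])  # the digits before the first '.'
    # Zero-pad to 7 digits; longer numbers are kept as they are.
    return f"nm{number:07d}"


print(key_to_member_id("54977.0_Reginald_Barker"))  # nm0054977
```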

Part 3: The mapping:

  1. Loop through 'c2i' dictionary and create new dictionary with layout as described in part 2. {memberID: opeNTF_output_index}
  2. Create a pandas dataframe (make sure to set opeNTF_output_index as the index):

     opeNTF_output_index | gender            | probability
     (integer)           | (true/false/null) | (double from 0.0 to 1.0 or null)

true = male, false = female

  3. Loop through name.basics_labelled.tsv. For each member id, find the index using the dictionary created in step 1. Add the opeNTF_output_index and the gender/probability from the file to the new dataframe.
  4. Then, loop through the list of keys from 'i2c'. If the opeNTF_output_index is not in the pandas dataframe, then add that opeNTF_output_index with gender as null and probability as null.
  5. Export the dataframe as .csv and .pkl for future use.
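
The mapping steps can be sketched without pandas to show the logic. In this sketch a plain dict keyed by opeNTF_output_index stands in for the dataframe, the export step is left out, and all names and sample rows (other than the Reginald Barker key) are made up.

```python
def member_id_from_key(key):
    """'54977.0_Reginald_Barker' -> 'nm0054977' (see Part 2)."""
    return f"nm{int(key.split('.', 1)[0]):07d}"


def build_gender_table(c2i, i2c, labelled_rows):
    """labelled_rows: iterable of (member_id, gender, probability) from the labelled file."""
    # Member id -> OpeNTF output index
    member_to_index = {member_id_from_key(key): idx for key, idx in c2i.items()}
    table = {}
    # Copy gender/probability for every labelled member that OpeNTF knows about.
    for member_id, gender, probability in labelled_rows:
        idx = member_to_index.get(member_id)
        if idx is not None:
            table[idx] = (gender, probability)
    # Any index without a label gets null values.
    for idx in i2c:
        table.setdefault(idx, (None, None))
    return dict(sorted(table.items()))


c2i = {"54977.0_Reginald_Barker": 9, "12.0_Jane_Doe": 3}
i2c = {idx: key for key, idx in c2i.items()}
rows = [("nm0054977", True, 0.97), ("nm9999999", False, 0.80)]
print(build_gender_table(c2i, i2c, rows))
# {3: (None, None), 9: (True, 0.97)}
```

With pandas, the same dict could be turned into a dataframe indexed by opeNTF_output_index before exporting.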

from adila.

Hamedloghmani avatar Hamedloghmani commented on September 2, 2024 1

Hi @gabrielrueda
Thank you so much for the update. Make it to the main branch please.

Thanks

from adila.

hosseinfani avatar hosseinfani commented on September 2, 2024

The second one worked pretty well on some samples. It also accepts a country.

Honestly, we can try both and make a vote.

How much will they cost us for dblp, imdb, gith?

for uspt, we have it already I believe.

from adila.

hosseinfani avatar hosseinfani commented on September 2, 2024

@Hamedloghmani
How many months do we need?
Is there any educational discount?

from adila.

gabrielrueda avatar gabrielrueda commented on September 2, 2024

@Hamedloghmani
Sounds good, I can begin to write a program that keeps a record of duplicate names and records the gender for each respective name. Should I add the gender parameter to each person's record or make a new record of just person and gender? Also, should I start with the toy dataset from DBLP?

from adila.

Hamedloghmani avatar Hamedloghmani commented on September 2, 2024

@hosseinfani
Exactly, we are starting with dblp.

from adila.

gabrielrueda avatar gabrielrueda commented on September 2, 2024

Hello @Hamedloghmani,
I have made some observations. I took 100 names from the DBLP trying to include diversity (although I ended up with a majority of male names (77 vs 23)).

Observations:

  • 96/100 names had the same results for both APIs
  • Gender-API had no NULL results
  • Genderize had 3 entries with a NULL result
  • Thus, only 1 of the names had a conflicting result (male vs female)
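
A tally like the one above can be computed with a few lines. The sketch below is not the script used for the experiment, and the sample rows are made up; `None` stands for a NULL result.

```python
def compare_results(rows):
    """rows: iterable of (gender_api_result, genderize_result) pairs."""
    summary = {"agree": 0, "null": 0, "conflict": 0}
    for first, second in rows:
        if first is None or second is None:
            summary["null"] += 1      # at least one API returned no answer
        elif first == second:
            summary["agree"] += 1
        else:
            summary["conflict"] += 1  # male vs female
    return summary


sample = [("male", "male"), ("female", "female"), ("male", None), ("male", "female")]
print(compare_results(sample))  # {'agree': 2, 'null': 1, 'conflict': 1}
```

Applied to the 100 names, the three counters should add up to 100, as 96 + 3 + 1 does in the observations.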

Probabilities:
Since both of the APIs report an accuracy/probability for their result, I graphed these values for each result in order to compare them. For the most part, both APIs report around the same value. However, at times one API would report a higher percentage than the other, especially for the names which were harder to predict.

Here is the graph for the first 10 entries:

[Figure: accuracies_0-10, bar graph of the accuracies from both APIs for the first 10 entries]

I have the output.csv, the remaining bar graphs for the accuracies, and the python scripts to formulate this information. Where in the repository should I share this information, or should I share it privately?

from adila.

Hamedloghmani avatar Hamedloghmani commented on September 2, 2024

Hello @gabrielrueda
Thank you so much for your update and informative representation.
Based on the plot, it seems like Genderize usually outperforms Gender API.

Please push your code as a .py file in fair_team_formation/src/util directory.
I'll examine the full results and we will shortly start labeling the whole DBLP dataset.

Thanks

from adila.

Hamedloghmani avatar Hamedloghmani commented on September 2, 2024

@gabrielrueda
Thanks a lot. I'll review the code and the results asap.

from adila.

Hamedloghmani avatar Hamedloghmani commented on September 2, 2024

Hi @gabrielrueda

Thank you so much for the implementation. I merged your pull request.
I believe we can proceed to the final phase and label the whole dataset since we have the gender for all the unique names.
What do you think ?

from adila.

gabrielrueda avatar gabrielrueda commented on September 2, 2024

@Hamedloghmani
I think that we are ready to label the whole dataset. Also, some of the entries will result in NULL for gender/probability. Should we just filter out those entries when creating our new dataset?

from adila.

Hamedloghmani avatar Hamedloghmani commented on September 2, 2024

@gabrielrueda
Great !
Yes, I believe we can filter them out.

from adila.

gabrielrueda avatar gabrielrueda commented on September 2, 2024

@Hamedloghmani
Also, would a structure like this: "gender": {"value": true, "probability": 0.97} for each author be good to represent the values in the dataset?

Example:
{"id":1,"authors":[{"name":"Hinton","gender": {"value": true, "probability": 0.97},"org":"Shinshu University","id":1},{"name":"LeCun","gender": {"value": false, "probability": 0.87},"org":"Shinshu University","id":3}],"fos":[{"name":"Machine Learning","w":0.45139},{"name":"Image Captioning", "w":0.3241}],"title":"Preliminary Design of a Network Protocol Learning Tool Based on the Comprehension of High School Students: Design by an Empirical Study Using a Simple Mind Map","year":2000,"n_citation":1,"page_start":"89","page_end":"93","doc_type":"Conference","publisher":"Springer, Berlin, Heidelberg","volume":"","issue":"","doi":"10.1007/978-3-642-39476-8_19","references":[2005687710,2018037215],"indexed_abstract":{"IndexLength":58,"InvertedIndex":{"tool.":[42],"study":[4],"aim":[37],"purpose":[1],"scientific":[17],"for":[11],"aspects":[18],"students":[14,46],"focus":[27],"hands-on":[47],"learning":[9,41],"experience":[48],"our":[40],"we":[26],"network":[33,56],"The":[0],"More":[24],"high":[12],"protocols.":[57],"school":[13],"and":[21],"of":[2,19,32,55],"communication":[22],"protocols":[34],"gives":[45],"on":[28],"a":[8],"studying":[15],"specifically,":[25],"this":[3],"understand":[51],"is":[5],"develop":[7,39],"Our":[43],"tool":[10,44],"the":[16,29,36,52],"help":[50],"as":[35],"principles":[31,54],"information":[20],"networks.":[23],"to":[6,38,49],"basic":[30,53]}},"venue":{"raw":"International Conference on Human-Computer Interaction","id":1127419992,"type":"C"}}
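
A sketch of how such a field could be attached to each author, assuming a lookup table keyed by lowercase first name and assuming that authors without a usable result are dropped (the policy discussed above for NULL entries). The helper name and the sample record are made up.

```python
def label_authors(record, genders):
    """Return a copy of record whose authors carry a 'gender' field.

    genders maps a lowercase first name to (value, probability); true = male.
    """
    labelled = []
    for author in record["authors"]:
        parts = author.get("name", "").split()
        entry = genders.get(parts[0].lower()) if parts else None
        if entry is None or entry[0] is None:
            continue  # no gender found for this author: filter the entry out
        value, probability = entry
        labelled.append({**author, "gender": {"value": value, "probability": probability}})
    return {**record, "authors": labelled}


record = {"id": 1, "authors": [{"name": "Geoffrey Hinton", "id": 1},
                               {"name": "Q. Unknown", "id": 3}]}
genders = {"geoffrey": (True, 0.97)}
labelled = label_authors(record, genders)
print(labelled["authors"])
```

The original record is left untouched, so the unlabelled version stays available for comparison.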

from adila.

gabrielrueda avatar gabrielrueda commented on September 2, 2024

Hi @Hamedloghmani,
I just wanted to let you know that I labelled the dataset and kept all the successful entries (entries whose names could successfully be assigned a gender). I will create a pull request for the code. Where should I upload the labelled dataset (since it's too big to upload to GitHub)?

from adila.

gabrielrueda avatar gabrielrueda commented on September 2, 2024

@Hamedloghmani
The dataset finished uploading. It should be in MS Teams -> Adila -> DBLP Labeling Files -> FinalGenderLabelledDataset. Let me know if you are unable to see/access it.

Thanks

from adila.

Hamedloghmani avatar Hamedloghmani commented on September 2, 2024

@gabrielrueda
Thank you Gabriel. I just checked and I got access to it.

from adila.

Hamedloghmani avatar Hamedloghmani commented on September 2, 2024

Hello @gabrielrueda
Thank you so much for the update. You can consider doing that if it is not too time consuming.
In the next steps we would discuss mapping the names from the OpeNTF outputs to the raw dataset and also a policy for missing names.
I will schedule a meeting to talk about that soon if you want.

from adila.

gabrielrueda avatar gabrielrueda commented on September 2, 2024

Hi @Hamedloghmani,
I finished filtering the entries in title.basics.tsv and title.principals.tsv in order to match what was filtered in the name.basics.tsv file. These 3 files are uploaded to the "IMDB Labelling Files" folder. I'll make a pull request tomorrow for the code; I just have to add comments and rename some of the functions.

from adila.

Hamedloghmani avatar Hamedloghmani commented on September 2, 2024

Hello @gabrielrueda
Thanks a lot for the update.
Please make the pull request on the main branch this time. Including your previous implementations for dblp.

Thank you

from adila.

gabrielrueda avatar gabrielrueda commented on September 2, 2024

Hi @Hamedloghmani,
I just created the pull request for the code.

Thanks, Gabriel

from adila.

Hamedloghmani avatar Hamedloghmani commented on September 2, 2024

Hello @gabrielrueda
I merged your pull request. Thanks a lot.

from adila.

Hamedloghmani avatar Hamedloghmani commented on September 2, 2024

Hi @gabrielrueda
Thank you so much for your report. The approach makes sense as we discussed. You can proceed with the implementation.

Thanks

from adila.

gabrielrueda avatar gabrielrueda commented on September 2, 2024

Hi @Hamedloghmani,
I finished the implementation. Should I make a pull request for my implementation to the dev or main branch?

Thanks, Gabriel

from adila.

gabrielrueda avatar gabrielrueda commented on September 2, 2024

Hello @Hamedloghmani,

I just wanted to let you know that I changed the true/false in the dataset to M/F, and updated the results in the mapping table for IMDB. I also created a mapping table for DBLP.

In the pull request I updated mappingGender.py and labelDataset.py for the changes needed. I have also included a file called changeDataset.py, which I only used as a temporary way to change the values from true/false to M/F.

Finally, the updated datasets and mapping tables will be uploaded to the Teams Adila channel in folders of DBLP Labelling Files/Gender Mappings and IMDB Labelling Files/Gender Mappings (I'll let you know when they finish uploading).

from adila.
