A generative model that integrates the embedded topic model (ETM) and node2vec.
A detailed description and its application to the UK Biobank can be found here
(a) GETM training. GETM is a variational autoencoder (VAE) model. The neural network encoder takes individuals' condition and medication information as input and produces the variational mean μ and variance σ² for the patient topic mixture θ. The decoder is linear and consists of two tri-factorizations: one learns the medication-defined topic embedding α(med) and the medication embedding ρ(med); the other learns the condition-specific topic embedding α(cond) and the condition embedding ρ(cond). We separately pre-train (b) the medication embedding ρ(med) and (c) the condition embedding ρ(cond) using node2vec on their structural meta-information. This is done by learning the node embedding that maximizes the likelihood of the tree-structured relational graphs of conditions and medications.
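The linear decoder described above can be sketched in numpy as follows. This is an illustrative sketch, not code from the GETM repository: all sizes, variable names, and the random initialization are made up; each topic's distribution over a vocabulary is obtained by a softmax over the inner products of topic and feature embeddings, as in ETM.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: K topics, L embedding dims, V_med meds, V_cond conditions
K, L, V_med, V_cond = 4, 8, 10, 12
rng = np.random.default_rng(0)

alpha_med = rng.normal(size=(K, L))      # medication-defined topic embedding α(med)
rho_med = rng.normal(size=(V_med, L))    # medication embedding ρ(med)
alpha_cond = rng.normal(size=(K, L))     # condition-specific topic embedding α(cond)
rho_cond = rng.normal(size=(V_cond, L))  # condition embedding ρ(cond)

# Tri-factorization: topics over each vocabulary, β = softmax(α ρᵀ)
beta_med = softmax(alpha_med @ rho_med.T, axis=1)     # K x V_med
beta_cond = softmax(alpha_cond @ rho_cond.T, axis=1)  # K x V_cond

theta = softmax(rng.normal(size=(1, K)), axis=1)  # one patient's topic mixture θ
p_med = theta @ beta_med    # reconstruction probabilities over medications
p_cond = theta @ beta_cond  # reconstruction probabilities over conditions
```

Because θ and each row of β are probability distributions, the reconstructed probabilities over each vocabulary also sum to one.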
The requirements.txt is located in scripts/requirements.txt:

```shell
pip install -r scripts/requirements.txt
```
- GETM takes three inputs: a bag-of-words individual-by-(medication+condition) numpy matrix, a medication embedding matrix, and a condition embedding matrix.
- node2vec requires a text file with one edge per line in the format: `node1 node2`.
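The edge-list format above can be produced with a few lines of plain Python. This is a minimal sketch; the file name and node IDs are made up for illustration.

```python
# Parent-child pairs from a tree-structured ontology (hypothetical IDs)
edges = [(0, 1), (0, 2), (1, 3)]

# Write one "node1 node2" edge per line, as node2vec expects
with open("graph_file.txt", "w") as f:
    for parent, child in edges:
        f.write(f"{parent} {child}\n")
```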
- Get node embeddings with node2vec

```python
import networkx as nx
from node2vec import Node2Vec

# Build the graph from the edge-list text file
G = nx.read_edgelist(graph_file, nodetype=int, create_using=nx.DiGraph())
for edge in G.edges():
    G[edge[0]][edge[1]]['weight'] = 1
G = G.to_undirected()

# Run node2vec and fit the skip-gram model to the sampled walks
node2vec = Node2Vec(G, dimensions=dimensions, walk_length=walk_length,
                    num_walks=num_walks, workers=workers)
model = node2vec.fit(window=10, min_count=1)
embeddings = model.wv  # per-node embedding vectors
```
Commands to run getm
- Run GETM without masking test information

```shell
python main_multi_etm_sep.py --epochs=10 --lr=0.01 --batch_size=100 \
    --save_path="acute2chronic_results/results_m802c443_topic128" \
    --vocab_size1=802 --vocab_size2=443 --data_path="data/drug802_cond443" \
    --num_topics=128 --rho_size=128 --emb_size=128 --t_hidden_size=128 --enc_drop=0.0 \
    --train_embeddings1=0 --embedding1="drug_emb.npy" \
    --train_embeddings2=0 --embedding2="code_emb.npy" --rho_fixed1=1 --rho_fixed2=1
```
- `--data_path`: path for loading input data in the form of bag-of-words for each feature
- `--vocab_size1`: number of unique medications
- `--vocab_size2`: number of unique conditions
- `--train_embeddings1`: whether to randomly initialize the medication embedding (0 to use a pretrained embedding)
- `--train_embeddings2`: whether to randomly initialize the condition embedding (0 to use a pretrained embedding)
- `--embedding1`: path to the pretrained medication embedding
- `--embedding2`: path to the pretrained condition embedding
- `--rho_fixed1`: whether to fix the medication embedding during training
- `--rho_fixed2`: whether to fix the condition embedding during training

- Run GETM with partial test information masked
```shell
python main_multi_etm_rec.py ...
```
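Conceptually, the masked evaluation hides the medication portion of the test bag-of-words matrix and asks the model to reconstruct it from conditions alone. A minimal numpy sketch of the masking step, with made-up matrix sizes matching the vocabulary sizes in the example command:

```python
import numpy as np

# Hypothetical test matrix: individuals x (medications + conditions)
V_med, V_cond = 802, 443
rng = np.random.default_rng(0)
X_test = rng.integers(0, 2, size=(5, V_med + V_cond))

# Zero out all medication columns; condition columns stay visible
X_masked = X_test.copy()
X_masked[:, :V_med] = 0
```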
- subjob
  - run_metm.sh: bash script to run the job
- scripts
  - multi_etm_sep.py: GETM model script
  - main_multi_etm_sep.py: script to instantiate a GETM model, then train and evaluate it
  - main_multi_etm_rec.py: script to instantiate a GETM model, then train and evaluate it with the test medications fully masked
  - etm.py: ETM model script
  - main.py: script to instantiate an ETM model, then train and evaluate it
[1] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. CoRR, abs/1607.00653, 2016.

[2] Adji B. Dieng, Francisco J. R. Ruiz, and David M. Blei. Topic modeling in embedding spaces. arXiv preprint arXiv:1907.04907, 2019.