
Commit c35f05b

[enh] Updating versions of all packages + corresponding modifications. Further hydra config parametrization.
1 parent 9ff340d commit c35f05b

14 files changed, +115 -129 lines

.gitignore

Lines changed: 3 additions & 1 deletion
@@ -3,4 +3,6 @@
 .idea
 *~
 *.w2v
-*.json*.txt
+*.json*.txt
+*.log
+.hydra

README.md

Lines changed: 8 additions & 32 deletions
@@ -2,54 +2,30 @@
 
 Yet another PyTorch implementation of the model described in the paper [**An Unsupervised Neural Attention Model for Aspect Extraction**](https://aclweb.org/anthology/papers/P/P17/P17-1036/) by He, Ruidan and Lee, Wee Sun and Ng, Hwee Tou and Dahlmeier, Daniel, **ACL2017**.
 
+**NOTA BENE**: now `gensim>=4.0.0` and `hydra` are required.
+
 ## Example
 
 **For a working example of a whole pipeline please refer to `example_run.sh`**
 
 Let's get some data:
 
 ```
-wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Electronics_5.json.gz
-gunzip reviews_Electronics_5.json.gz
-python3 custom_format_converter.py reviews_Electronics_5.json
+wget http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Cell_Phones_and_Accessories_5.json.gz
+gunzip reviews_Cell_Phones_and_Accessories_5.json.gz
+python3 custom_format_converter.py reviews_Cell_Phones_and_Accessories_5.json
 ```
 
 Then we need to train the word vectors:
 
 ```
-python3 word2vec.py reviews_Electronics_5.json.txt
+python3 word2vec.py reviews_Cell_Phones_and_Accessories_5.json.txt
 ```
 And run
 
+**TODO**: running with hydra params example is in progress
 ```
-usage: main.py [-h] [--word-vectors-path <str>] [--batch-size BATCH_SIZE]
-               [--aspects-number ASPECTS_NUMBER] [--ortho-reg ORTHO_REG]
-               [--epochs EPOCHS] [--optimizer {adam,adagrad,sgd}]
-               [--negative-samples NEG_SAMPLES] [--dataset-path DATASET_PATH]
-               [--maxlen MAXLEN]
-
-optional arguments:
-  -h, --help            show this help message and exit
-  --word-vectors-path <str>, -wv <str>
-                        path to word vectors file
-  --batch-size BATCH_SIZE, -b BATCH_SIZE
-                        Batch size for training
-  --aspects-number ASPECTS_NUMBER, -as ASPECTS_NUMBER
-                        A total number of aspects
-  --ortho-reg ORTHO_REG, -orth ORTHO_REG
-                        Ortho-regularization impact coefficient
-  --epochs EPOCHS, -e EPOCHS
-                        Epochs count
-  --optimizer {adam,adagrad,sgd}, -opt {adam,adagrad,sgd}
-                        Optimizer
-  --negative-samples NEG_SAMPLES, -ns NEG_SAMPLES
-                        Negative samples per positive one
-  --dataset-path DATASET_PATH, -d DATASET_PATH
-                        Path to a training texts file. One sentence per line,
-                        tokens separated wiht spaces.
-  --maxlen MAXLEN, -l MAXLEN
-                        Max length of the considered sentence; the rest is
-                        clipped if longer
+usage: main.py ...
 
 ```
 
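As a quick sanity check of the word-vector step above, here is a minimal sketch (not part of the diff) that inspects the trained embeddings with the gensim 4 API required by the NOTA BENE. It assumes `word2vec.py` saves a full `Word2Vec` model under `word_vectors/` (the path is taken from the embeddings config in this commit); the query token "battery" is only illustrative.

```python
# Hedged sketch: inspect the vectors produced by word2vec.py (gensim >= 4.0 API).
from gensim.models import Word2Vec

w2v = Word2Vec.load("word_vectors/reviews_Cell_Phones_and_Accessories_5.json.txt.w2v")

print(w2v.wv.vector_size)                       # embedding dimensionality
print(len(w2v.wv.key_to_index))                 # vocabulary size (gensim 4 attribute)
print(w2v.wv.most_similar("battery", topn=5))   # illustrative query; raises KeyError if absent
```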

configs/config.yaml

Lines changed: 9 additions & 5 deletions
@@ -1,14 +1,18 @@
 defaults:
   - embeddings: word2vec-custom
-  - optimizers: adam
+  - optimizer: adam
 
 data:
-  path: "reviews_Electronics_5.json.txt"
+  path: "reviews_Cell_Phones_and_Accessories_5.json.txt"
 
 model:
   batch_size: 50
-  ortho_reg: 0.1
-  aspects_number: 40
+  ortho_reg: 0.2
+  aspects_number: 25
   epochs: 1
   negative_samples: 5
-  max_len: 201
+  max_len: 201
+
+hydra:
+  run:
+    dir: . #results/sessions_${now:%Y-%m-%d}_${now:%H-%M-%S}
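
The defaults list above selects one file from each config group (`embeddings`, `optimizer`). A minimal sketch of composing and overriding this configuration from Python with `hydra-core>=1.1` follows; the group and key names come from the files in this commit, but the override values and the command line in the comment are only illustrative, since the README still marks a hydra usage example as TODO.

```python
# Hedged sketch: compose configs/config.yaml via the hydra compose API (hydra-core >= 1.1).
# Roughly what an invocation like `python main.py optimizer=sgd model.aspects_number=30`
# would resolve to (illustrative overrides, run from the repository root).
from hydra import compose, initialize
from omegaconf import OmegaConf

with initialize(config_path="configs"):
    cfg = compose(config_name="config",
                  overrides=["optimizer=sgd", "model.aspects_number=30"])
    print(OmegaConf.to_yaml(cfg))                           # full resolved configuration
    print(cfg.optimizer.name, cfg.optimizer.learning_rate)  # -> sgd 0.05
```
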
Lines changed: 1 addition & 1 deletion
@@ -1,2 +1,2 @@
 name: word2vec-custom
-path: "word_vectors/reviews_Cell_Phones_and_Accessories_5.json.txt.w2v"
+path: word_vectors/reviews_Cell_Phones_and_Accessories_5.json.txt.w2v
File renamed without changes.

configs/optimizer/asgd.yaml

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+name: asgd
+learning_rate: 0.05

configs/optimizer/sgd.yaml

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+name: sgd
+learning_rate: 0.05

custom_format_converter.py

Lines changed: 1 addition & 1 deletion
@@ -53,6 +53,6 @@ def read_amazon_format(path: str, sentence=True):
 if len(sys.argv) > 1:
     path = sys.argv[1]
 else:
-    path = "reviews_Electronics_5.json"
+    path = "reviews_Cell_Phones_and_Accessories_5.json"
 
 read_amazon_format(path, sentence=True)

example_run.sh

Lines changed: 2 additions & 1 deletion
@@ -19,4 +19,5 @@ if [ ! -f ./word_vectors/$DATA_NAME.json.txt.w2v ]; then
 fi
 
 echo "Training ABAE..."
-python main.py -as 30 -d $DATA_NAME.json.txt -wv word_vectors/$DATA_NAME.json.txt.w2v
+echo "A working example is in progress... Please see 'main.py' code."
+#python main.py -as 30 -d $DATA_NAME.json.txt -wv word_vectors/$DATA_NAME.json.txt.w2v

main.py

Lines changed: 63 additions & 64 deletions
@@ -1,77 +1,76 @@
 # -*- coding: utf-8 -*-
+import logging
+
+import hydra
 import numpy as np
 import torch
-import hydra
+
 from model import ABAE
 from reader import get_centroids, get_w2v, read_data_tensors
 
+logger = logging.getLogger(__name__)
+
 
 @hydra.main("configs", "config")
 def main(cfg):
-
     w2v_model = get_w2v(cfg.embeddings.path)
-    print(cfg)
-    print(w2v_model)
-    # wv_dim = w2v_model.vector_size
-    # y = torch.zeros(args.batch_size, 1)
-    #
-    # model = ABAE(wv_dim=wv_dim,
-    #              asp_count=args.aspects_number,
-    #              init_aspects_matrix=get_centroids(w2v_model, aspects_count=args.aspects_number))
-    # print(model)
-    #
-    # criterion = torch.nn.MSELoss(reduction="sum")
-    #
-    # optimizer = None
-    # scheduler = None
-    #
-    # if args.optimizer == "adam":
-    #     optimizer = torch.optim.Adam(model.parameters())
-    # elif args.optimizer == "sgd":
-    #     optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
-    # elif args.optimizer == "adagrad":
-    #     optimizer = torch.optim.Adagrad(model.parameters())
-    # elif args.optimizer == "asgd":
-    #     optimizer = torch.optim.ASGD(model.parameters(), lr=0.05)
-    # else:
-    #     raise Exception("Optimizer '%s' is not supported" % args.optimizer)
-    #
-    # for t in range(args.epochs):
-    #
-    #     print("Epoch %d/%d" % (t + 1, args.epochs))
-    #
-    #     data_iterator = read_data_tensors(args.dataset_path, args.wv_path,
-    #                                       batch_size=args.batch_size, maxlen=args.maxlen)
-    #
-    #     for item_number, (x, texts) in enumerate(data_iterator):
-    #         if x.shape[0] < args.batch_size:  # pad with 0 if smaller than batch size
-    #             x = np.pad(x, ((0, args.batch_size - x.shape[0]), (0, 0), (0, 0)))
-    #
-    #         x = torch.from_numpy(x)
-    #
-    #         # extracting bad samples from the very same batch; not sure if this is OK, so todo
-    #         negative_samples = torch.stack(
-    #             tuple([x[torch.randperm(x.shape[0])[:args.neg_samples]] for _ in range(args.batch_size)]))
-    #
-    #         # prediction
-    #         y_pred = model(x, negative_samples)
-    #
-    #         # error computation
-    #         loss = criterion(y_pred, y)
-    #         optimizer.zero_grad()
-    #         loss.backward()
-    #         optimizer.step()
-    #
-    #         if item_number % 1000 == 0:
-    #
-    #             print(item_number, "batches, and LR:", optimizer.param_groups[0]['lr'])
-    #
-    #             for i, aspect in enumerate(model.get_aspect_words(w2v_model)):
-    #                 print(i + 1, " ".join([a for a in aspect]))
-    #
-    #             print("Loss:", loss.item())
-    #             print()
+    wv_dim = w2v_model.vector_size
+    y = torch.zeros((cfg.model.batch_size, 1))
+
+    model = ABAE(wv_dim=wv_dim,
+                 asp_count=cfg.model.aspects_number,
+                 init_aspects_matrix=get_centroids(w2v_model, aspects_count=cfg.model.aspects_number))
+    logger.debug(str(model))
+
+    criterion = torch.nn.MSELoss(reduction="sum")
+
+    if cfg.optimizer.name == "adam":
+        optimizer = torch.optim.Adam(model.parameters())
+    elif cfg.optimizer.name == "sgd":
+        optimizer = torch.optim.SGD(model.parameters(), lr=cfg.optimizer.learning_rate)
+    elif cfg.optimizer.name == "adagrad":
+        optimizer = torch.optim.Adagrad(model.parameters())
+    elif cfg.optimizer.name == "asgd":
+        optimizer = torch.optim.ASGD(model.parameters(), lr=cfg.optimizer.learning_rate)
+    else:
+        raise Exception("Optimizer '%s' is not supported" % cfg.optimizer.name)
+
+    for t in range(cfg.model.epochs):
+
+        logger.debug("Epoch %d/%d" % (t + 1, cfg.model.epochs))
+
+        data_iterator = read_data_tensors(cfg.data.path, cfg.embeddings.path,
+                                          batch_size=cfg.model.batch_size, maxlen=cfg.model.max_len)
+
+        for item_number, (x, texts) in enumerate(data_iterator):
+            if x.shape[0] < cfg.model.batch_size:  # pad with 0 if smaller than batch size
+                x = np.pad(x, ((0, cfg.model.batch_size - x.shape[0]), (0, 0), (0, 0)))
+
+            x = torch.from_numpy(x)
+
+            # extracting bad samples from the very same batch; not sure if this is OK, so todo
+            negative_samples = torch.stack(
+                tuple([x[torch.randperm(x.shape[0])[:cfg.model.negative_samples]]
+                       for _ in range(cfg.model.batch_size)]))
+
+            # prediction
+            y_pred = model(x, negative_samples)
+
+            # error computation
+            loss = criterion(y_pred, y)
+            optimizer.zero_grad()
+            loss.backward()
+            optimizer.step()
+
+            if item_number % 1000 == 0:
+
+                logger.info("%d batches, and LR: %.5f" % (item_number, optimizer.param_groups[0]['lr']))
+
+                for i, aspect in enumerate(model.get_aspect_words(w2v_model, logger)):
+                    logger.info("[%d] %s" % (i + 1, " ".join([a for a in aspect])))
+
+                logger.info("Loss: %.4f" % loss.item())
 
 
 if __name__ == "__main__":
-    main()
+    main()
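
The in-batch negative sampling above ("extracting bad samples from the very same batch") is easier to follow with concrete shapes. The sketch below replays the same tensor manipulation on dummy data, assuming the `(batch, maxlen, wv_dim)` layout implied by the padding call; the sizes are made up.

```python
# Hedged sketch: shapes involved in the in-batch negative sampling (dummy sizes).
import torch

batch_size, maxlen, wv_dim, negative_samples = 4, 10, 8, 2
x = torch.randn(batch_size, maxlen, wv_dim)  # one padded batch of word-vector sequences

# for each batch element, pick `negative_samples` random rows of the same batch
neg = torch.stack(
    tuple(x[torch.randperm(x.shape[0])[:negative_samples]] for _ in range(batch_size)))

print(x.shape)    # torch.Size([4, 10, 8])
print(neg.shape)  # torch.Size([4, 2, 10, 8]) -> (batch, negatives, maxlen, wv_dim)
```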

model.py

Lines changed: 11 additions & 9 deletions
@@ -13,10 +13,10 @@ def __init__(self, wv_dim: int, maxlen: int):
         # max sentence length -- batch 2nd dim size
         self.maxlen = maxlen
         self.M = Parameter(torch.empty(size=(wv_dim, wv_dim)))
-        init.kaiming_uniform(self.M.data)
+        init.kaiming_uniform_(self.M.data)
 
         # softmax for attending to wod vectors
-        self.attention_softmax = torch.nn.Softmax()
+        self.attention_softmax = torch.nn.Softmax(dim=-1)
 
     def forward(self, input_embeddings):
         # (b, wv, 1)
@@ -63,7 +63,7 @@ def __init__(self, wv_dim: int = 200, asp_count: int = 30,
 
         self.attention = SelfAttention(wv_dim, maxlen)
         self.linear_transform = torch.nn.Linear(self.wv_dim, self.asp_count)
-        self.softmax_aspects = torch.nn.Softmax()
+        self.softmax_aspects = torch.nn.Softmax(dim=-1)
         self.aspects_embeddings = Parameter(torch.empty(size=(wv_dim, asp_count)))
 
         if init_aspects_matrix is None:
@@ -108,7 +108,9 @@ def forward(self, text_embeddings, negative_samples_texts):
         reconstruction_triplet_loss = ABAE._reconstruction_loss(weighted_text_emb,
                                                                 recovered_emb,
                                                                 averaged_negative_samples)
-        max_margin = torch.max(reconstruction_triplet_loss, torch.zeros_like(reconstruction_triplet_loss))
+        max_margin = torch \
+            .max(reconstruction_triplet_loss, torch.zeros_like(reconstruction_triplet_loss)) \
+            .unsqueeze(dim=-1)
 
         return self.ortho * self._ortho_regularizer() + max_margin
 
@@ -126,20 +128,20 @@ def _ortho_regularizer(self):
             torch.matmul(self.aspects_embeddings.t(), self.aspects_embeddings) \
             - torch.eye(self.asp_count))
 
-    def get_aspect_words(self, w2v_model, topn=15):
+    def get_aspect_words(self, w2v_model, logger, topn=15):
         words = []
 
         # getting aspects embeddings
         aspects = self.aspects_embeddings.detach().numpy()
 
         # getting scalar products of word embeddings and aspect embeddings;
         # to obtain the ``probabilities'', one should also apply softmax
-        words_scores = w2v_model.wv.syn0.dot(aspects)
+        # words_scores = w2v_model.wv.syn0.dot(aspects)
+        words_scores = w2v_model.wv.vectors.dot(aspects)
 
         for row in range(aspects.shape[1]):
            argmax_scalar_products = np.argsort(- words_scores[:, row])[:topn]
-            # print([w2v_model.wv.index2word[i] for i in argmax_scalar_products])
-            # print([w for w, dist in w2v_model.similar_by_vector(aspects.T[row])[:topn]])
-            words.append([w2v_model.wv.index2word[i] for i in argmax_scalar_products])
+            # print([w for w, dist in w2v_model.wv.similar_by_vector(aspects.T[row])[:topn]])
+            words.append([w2v_model.wv.index_to_key[i] for i in argmax_scalar_products])
 
         return words
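
Most of the `model.py` edits (and the `reader.py` ones below) are the gensim 3.x to 4.x `KeyedVectors` renames. As a compact reference, here is a hedged sketch for a generic trained model; the helper function itself is hypothetical and not part of this commit.

```python
# Hedged reference: gensim 3.x attributes vs. the 4.x equivalents used in this commit.
from gensim.models import Word2Vec


def show_gensim4_api(w2v_model: Word2Vec, topn: int = 5):
    wv = w2v_model.wv
    print(wv.vectors.shape)        # gensim 3.x: wv.syn0         (full embedding matrix)
    print(len(wv.key_to_index))    # gensim 3.x: wv.vocab        (token -> index mapping)
    print(wv.index_to_key[:topn])  # gensim 3.x: wv.index2word   (index -> token)
    # similarity helpers such as similar_by_vector are likewise reached via wv in gensim 4
```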

reader.py

Lines changed: 4 additions & 5 deletions
@@ -4,7 +4,7 @@
 from sklearn.cluster import MiniBatchKMeans
 
 
-def read_data_batches(path, batch_size=50, minlength=5):
+def read_data_batches(path: str, batch_size: int=50, minlength: int=5):
     """
     Reading batched texts of given min. length
     :param path: path to the text file ``one line -- one normalized sentence''
@@ -26,7 +26,7 @@ def read_data_batches(path, batch_size=50, minlength=5):
             yield batch
 
 
-def text2vectors(text, w2v_model, maxlen, vocabulary):
+def text2vectors(text: list, w2v_model, maxlen: int, vocabulary):
     """
     Token sequence -- to a list of word vectors;
     if token not in vocabulary, it is skipped; the rest of
@@ -40,7 +40,7 @@ def text2vectors(text, w2v_model, maxlen, vocabulary):
     acc_vecs = []
 
     for word in text:
-        if word in w2v_model and (vocabulary is None or word in vocabulary):
+        if word in w2v_model.wv and (vocabulary is None or word in vocabulary):
             acc_vecs.append(w2v_model.wv[word])
 
     # padding for consistent length with ZERO vectors
@@ -94,11 +94,10 @@ def get_centroids(w2v_model, aspects_count):
     km = MiniBatchKMeans(n_clusters=aspects_count, verbose=0, n_init=100)
     m = []
 
-    for k in w2v_model.wv.vocab:
+    for k in w2v_model.wv.key_to_index:
         m.append(w2v_model.wv[k])
 
     m = np.matrix(m)
-
     km.fit(m)
     clusters = km.cluster_centers_
 

requirements.txt

Lines changed: 7 additions & 8 deletions
@@ -1,8 +1,7 @@
-nltk>=3.5
-gensim>=3.8.3
-torch>=1.5.0
-torchvision>=0.6.0
-tqdm>=4.45.0
-scikit-learn>=0.22.2.post1
-numpy>=1.18.4
-hydra>=2.5
+nltk>=3.6.2
+gensim>=4.1.0
+torch>=1.9.0
+torchvision>=0.10.0
+scikit-learn>=0.24.2
+numpy>=1.21.2
+hydra-core>=1.1.1
