1 file changed, +53 -0 lines changed
1
+ # Data configuration
2
+
3
+ **Install [torchtext](https://github.com/pytorch/text) for data processing**
4
+
5
+ The datasets module currently contains:
6
+
7
+ - Sentiment analysis: SST and IMDb
8
+ - Question classification: TREC
9
+ - Entailment: SNLI
10
+ - Language modeling: WikiText-2
11
+ - Machine translation: Multi30k, IWSLT, WMT14
12
+
13
+ Others are planned or a work in progress:
14
+
15
+ - Question answering: SQuAD
16
+
17
+ The following data currently has to be downloaded and set up manually:
18
+
19
+ ### GloVe
20
+
21
+ Download the archives below into the .vector_cache folder in the project's root directory:
22
+
23
+ - [42B](http://nlp.stanford.edu/data/glove.42B.300d.zip)
24
+ - [840B](http://nlp.stanford.edu/data/glove.840B.300d.zip)
25
+ - [twitter.27B](http://nlp.stanford.edu/data/glove.twitter.27B.zip)
26
+ - [6B](http://nlp.stanford.edu/data/glove.6B.zip)
27
+
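The download step for the GloVe archives can be scripted. The following is a minimal sketch, not part of the project: the helper name `fetch_glove` is hypothetical, the URLs are the four listed for GloVe in this README, and the target folder is assumed to be `.vector_cache` as shown under File Structure. It skips any archive that is already present and does no retrying or checksum verification.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve

# URLs copied from the GloVe list in this README.
GLOVE_URLS = {
    "42B": "http://nlp.stanford.edu/data/glove.42B.300d.zip",
    "840B": "http://nlp.stanford.edu/data/glove.840B.300d.zip",
    "twitter.27B": "http://nlp.stanford.edu/data/glove.twitter.27B.zip",
    "6B": "http://nlp.stanford.edu/data/glove.6B.zip",
}


def fetch_glove(name, cache_dir=".vector_cache"):
    """Download one GloVe archive into cache_dir unless it is already there.

    Hypothetical helper for illustration only.
    """
    url = GLOVE_URLS[name]
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    # Keep the file name from the URL, e.g. glove.6B.zip.
    dest = cache / Path(urlparse(url).path).name
    if not dest.exists():
        urlretrieve(url, str(dest))
    return dest
```

Calling `fetch_glove("6B")` from the project's root directory would place `glove.6B.zip` under `.vector_cache`. The archives are large, so downloading only the one you plan to use is probably the practical choice.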
28
+ ### Classification Datasets
29
+
30
+ - Download the [IMDB](http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz) dataset to .data/imdb
31
+ - Download the [SST](http://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip) dataset to .data/sst
32
+ - Download the TREC Question Classification dataset ([train](http://cogcomp.org/Data/QA/QC/train_5500.label), [test](http://cogcomp.org/Data/QA/QC/TREC_10.label)) to .data/trec
33
+
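To make the target locations explicit, the sketch below maps each classification dataset URL to its destination path. It is only an illustration: the names `DATASET_URLS` and `download_plan` are hypothetical, and the per-dataset folders (`imdb`, `sst`, `trec`) are taken from the File Structure section of this README. It computes paths only and does not touch the network.

```python
from pathlib import Path
from urllib.parse import urlparse

# URLs copied from this README; one folder per dataset under .data.
DATASET_URLS = {
    "imdb": ["http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"],
    "sst": ["http://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip"],
    "trec": [
        "http://cogcomp.org/Data/QA/QC/train_5500.label",
        "http://cogcomp.org/Data/QA/QC/TREC_10.label",
    ],
}


def download_plan(data_root=".data"):
    """Return (url, destination) pairs, one per file to download."""
    plan = []
    for dataset, urls in DATASET_URLS.items():
        for url in urls:
            filename = Path(urlparse(url).path).name
            plan.append((url, Path(data_root) / dataset / filename))
    return plan


# Print the plan so it can be checked before downloading anything.
for url, dest in download_plan():
    print(f"{url} -> {dest}")
```

Each pair can then be handed to whatever download tool you prefer (a browser, `wget`, or `urllib.request.urlretrieve`).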
34
+ ### File Structure
35
+
36
+ - TextClassificationBenchmark
37
+   - .data
38
+     - imdb
39
+       - aclImdb_v1.tar.gz
40
+     - sst
41
+       - trainDevTestTrees_PTB.zip
42
+     - trec
43
+       - train_5500.label
44
+       - TREC_10.label
45
+   - .vector_cache
46
+     - glove.42B.300d.zip
47
+     - glove.840B.300d.zip
48
+     - glove.twitter.27B.zip
49
+     - glove.6B.zip
50
+
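A quick way to confirm that the layout above is in place is to check for each expected file. This is a sketch under assumptions: `missing_files` is a hypothetical helper, and the list assumes every file shown in the tree is wanted. The README does not say whether all four GloVe archives are required, so trim the list to the ones you actually use.

```python
from pathlib import Path

# Paths relative to the project root, taken from the File Structure tree.
EXPECTED_FILES = [
    ".data/imdb/aclImdb_v1.tar.gz",
    ".data/sst/trainDevTestTrees_PTB.zip",
    ".data/trec/train_5500.label",
    ".data/trec/TREC_10.label",
    ".vector_cache/glove.42B.300d.zip",
    ".vector_cache/glove.840B.300d.zip",
    ".vector_cache/glove.twitter.27B.zip",
    ".vector_cache/glove.6B.zip",
]


def missing_files(project_root="."):
    """Return the expected files that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Running `missing_files()` from the TextClassificationBenchmark directory returns an empty list once everything is in place, and otherwise lists what still needs to be downloaded.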
51
+
52
+
53
+ ## More datasets and updates are coming soon