Commit f21d06e

authored
Create data_config_en.md
1 parent 8c904a7 commit f21d06e

1 file changed

+53
-0
lines changed

docs/data_config_en.md

Lines changed: 53 additions & 0 deletions
@@ -0,0 +1,53 @@
# Data configuration

**Install [torchtext](https://github.com/pytorch/text) for data processing.**

The datasets module currently contains:

- Sentiment analysis: SST and IMDb
- Question classification: TREC
- Entailment: SNLI
- Language modeling: WikiText-2
- Machine translation: Multi30k, IWSLT, WMT14

Others are planned or a work in progress:

- Question answering: SQuAD

The following data currently needs to be set up manually:

### GloVe

Download the GloVe archives into the .vector_cache folder under the project's root directory:

- [42B](http://nlp.stanford.edu/data/glove.42B.300d.zip)
- [840B](http://nlp.stanford.edu/data/glove.840B.300d.zip)
- [twitter.27B](http://nlp.stanford.edu/data/glove.twitter.27B.zip)
- [6B](http://nlp.stanford.edu/data/glove.6B.zip)
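
The download step can also be scripted. The following is a minimal sketch, not part of the project: `glove_target` and `download_glove` are hypothetical helper names, and it assumes the archives belong in `.vector_cache` under the project root, as shown in the file structure of this document. It uses only the Python standard library.

```python
from pathlib import Path
from urllib.request import urlretrieve

# URLs copied from the list above.
GLOVE_URLS = {
    "42B": "http://nlp.stanford.edu/data/glove.42B.300d.zip",
    "840B": "http://nlp.stanford.edu/data/glove.840B.300d.zip",
    "twitter.27B": "http://nlp.stanford.edu/data/glove.twitter.27B.zip",
    "6B": "http://nlp.stanford.edu/data/glove.6B.zip",
}


def glove_target(name, project_root="."):
    """Return the path where the archive for `name` should be stored.

    Assumption: archives live in <project_root>/.vector_cache.
    """
    url = GLOVE_URLS[name]
    return Path(project_root) / ".vector_cache" / url.rsplit("/", 1)[-1]


def download_glove(name, project_root="."):
    """Download one GloVe archive unless it is already present (hypothetical helper)."""
    target = glove_target(name, project_root)
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        # Plain blocking download; no resume or checksum verification.
        urlretrieve(GLOVE_URLS[name], str(target))
    return target
```

Calling `download_glove("6B")` from the project root would fetch only that archive. The archives are large, so it is probably worth downloading just the variants you intend to use.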

### Classification Datasets

- Download the [IMDB](http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz) dataset to .data/imdb
- Download the [SST](http://nlp.stanford.edu/sentiment/trainDevTestTrees_PTB.zip) dataset to .data/sst
- Download the TREC question classification dataset ([train_5500.label](http://cogcomp.org/Data/QA/QC/train_5500.label) and [TREC_10.label](http://cogcomp.org/Data/QA/QC/TREC_10.label)) to .data/trec

### File Structure

- TextClassificationBenchmark
  - .data
    - imdb
      - aclImdb_v1.tar.gz
    - sst
      - trainDevTestTrees_PTB.zip
    - trec
      - train_5500.label
      - TREC_10.label
  - .vector_cache
    - glove.42B.300d.zip
    - glove.840B.300d.zip
    - glove.twitter.27B.zip
    - glove.6B.zip
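
To confirm that the downloads ended up in the expected places, a small check like the one below may help. It is a sketch under the assumption that the layout above is relative to the project root; `EXPECTED_FILES` and `missing_files` are illustrative names, not part of the project.

```python
from pathlib import Path

# Relative paths taken from the file structure listed above.
EXPECTED_FILES = [
    ".data/imdb/aclImdb_v1.tar.gz",
    ".data/sst/trainDevTestTrees_PTB.zip",
    ".data/trec/train_5500.label",
    ".data/trec/TREC_10.label",
    ".vector_cache/glove.42B.300d.zip",
    ".vector_cache/glove.840B.300d.zip",
    ".vector_cache/glove.twitter.27B.zip",
    ".vector_cache/glove.6B.zip",
]


def missing_files(project_root="."):
    """Return the expected files that are not present under project_root."""
    root = Path(project_root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Running `missing_files()` from the project root returns an empty list once everything is in place. If you only use some of the GloVe variants, trim `EXPECTED_FILES` accordingly.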

## More datasets and updates coming soon
