# Dynamic Web Crawling in Python

## Tutorial for Crawling the NSTL Words (~620,000 words)

**Official Website**: https://www.nstl.gov.cn/

**Words Website**: https://www.nstl.gov.cn/stkos.html?t=Concept&q=

---

## Table of Contents

<ul>
<li><a href="#Usage-Demo">Usage Demo</a></li>
<li><a href="#Other-Useful-Resources">Other Useful Resources</a></li>
<li><a href="#Prevent-Anti-Crawler">Prevent Anti-Crawler</a></li>
<li><a href="#Links-to-the-Contents">Links to the Contents</a></li>
<li><a href="#ID-Order">ID Order</a></li>
<li><a href="#Announcement">Announcement</a></li>
<li><a href="#License">License</a></li>
</ul>

---

## Usage Demo

1. First of all, get all the word IDs from the main page.

<p align="center">
<img src='Photos/new-1.png'>
<br />
<br />

<img src='Photos/new-2.png'>
<br />
<br />

<img src='Photos/new-3.png'>
<br />
<br />

<img src='Photos/new-4.png'>
<br />
<br />

<img src='Photos/new-5.png'>
<br />
<br />

<img src='Photos/new-6.png'>
<br />
<br />
</p>

For example: https://www.nstl.gov.cn/execute?target=nstl4.search4&function=paper/pc/list/pl&query=%7B%22c%22%3A10%2C%22st%22%3A%220%22%2C%22f%22%3A%5B%5D%2C%22p%22%3A%22%22%2C%22q%22%3A%5B%7B%22k%22%3A%22%22%2C%22v%22%3A%22%22%2C%22e%22%3A1%2C%22es%22%3A%7B%7D%2C%22o%22%3A%22AND%22%2C%22a%22%3A0%7D%5D%2C%22op%22%3A%22AND%22%2C%22s%22%3A%5B%22yea%3Adesc%22%5D%2C%22t%22%3A%5B%22Concept%22%5D%7D&sl=1&pageSize=10&pageNumber=1

You can request this URL directly to get the IDs from each page; changing "&pageNumber=1" to "&pageNumber=100" returns the contents of page 100. Please follow the code in "get-website-IDs.ipynb"; a minimal sketch is shown below.
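
A rough sketch of this step with the requests package (the JSON field names in the response are assumptions; inspect one response before relying on them):

```python
# A sketch of the ID-collection step; paste the encoded "query=..."
# string from the example URL above in place of the ellipsis.
import requests

LIST_URL = ("https://www.nstl.gov.cn/execute?target=nstl4.search4"
            "&function=paper/pc/list/pl&query=..."  # encoded query string
            "&sl=1&pageSize=10&pageNumber={page}")

all_ids = []
for page in range(1, 101):
    res = requests.get(LIST_URL.format(page=page), timeout=(5, 10))
    data = res.json()
    # "data" and "id" are hypothetical keys; check the real response.
    for record in data.get("data", []):
        all_ids.append(record.get("id"))
```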

2. After getting the word IDs, use "fast-crawler-NSTL-data.ipynb" to download and capture all the contents from the websites.

<p align="center">
<img src='Photos/1.png'>
<br />
<br />

<img src='Photos/2.png'>
<br />
<br />

<img src='Photos/3.png'>
<br />
<br />

<img src='Photos/4.png'>
<br />
<br />

<img src='Photos/5.png'>
<br />
<br />

<img src='Photos/6.png'>
<br />
<br />
</p>

The words will be saved in JSON format.
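
A minimal sketch of the download-and-save step, using the detail endpoint listed under "Links to the Contents" below (the layout of one JSON file per word ID is my own choice):

```python
# A sketch of fetching one word's detail page and saving it as JSON.
import json
import requests

DETAIL_URL = ("https://www.nstl.gov.cn/execute?target=nstl4.search4"
              "&function=paper/pc/detail&id={word_id}")

word_ids = ["C018781660"]  # replace with the IDs collected in step 1
for word_id in word_ids:
    res = requests.get(DETAIL_URL.format(word_id=word_id), timeout=(5, 10))
    with open(f"{word_id}.json", "w", encoding="utf-8") as f:
        json.dump(res.json(), f, ensure_ascii=False)
```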

Good luck with your future crawling!

---

## Other Useful Resources

1. Python **selenium** package *(5-20 seconds per website)*

1). **This method is not recommended** if there are tons of websites to crawl.

2). You need to download *ChromeDriver* to render the websites.

3). If you only have a few dynamic websites, you can use the "*selenium-reptile-script.py*" script.
This file can be used as a reference; after minor corrections, it should fit your case (see the sketch after this list).

2. Chrome Developer Tools -> Network -> Headers -> Request URL *(less than one second per website)*

1). This method is recommended.

2). Use the "*fast-reptile-script-YEAR2018.py*" script or follow the "fast-crawler-NSTL-data.ipynb" notebook.
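
A minimal Selenium sketch for method 1, assuming selenium 4 and a ChromeDriver binary on PATH (the URL is just the words page from above):

```python
# Render one dynamic page with headless Chrome and grab its HTML.
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # run Chrome without a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.nstl.gov.cn/stkos.html?t=Concept&q=")
    time.sleep(5)  # crude wait for the JavaScript-rendered content
    html = driver.page_source  # the fully rendered HTML
finally:
    driver.quit()
```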

---

## Prevent Anti-Crawler

1. Use a fake User-Agent instead of visiting directly

```python
# A fake User-Agent to avoid anti-crawler blocking
from random import randint

USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
    "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
    "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
]

# Pick one User-Agent at random for each request
random_agent = USER_AGENTS[randint(0, len(USER_AGENTS) - 1)]
headers = {
    'User-Agent': random_agent,
}
```

2. Add a try-except-sleep loop to recover from "Connection refused" errors, and use timeout=(5, 10) to avoid Python hanging with no response

```python
import time

import requests

# Retry up to 10 times; sleep between attempts so the server is not
# hammered, and stop as soon as one request succeeds.
for j in range(10):
    try:
        res = requests.get(url, headers=headers, verify=False, timeout=(5, 10))
        contents = res.text
    except Exception:
        if j >= 9:
            print('The exception has happened', '-' * 100)
        else:
            time.sleep(0.5)
    else:
        time.sleep(0.5)
        break
```

3. Use the ssl package or "verify=False" to skip SSL certificate verification

```python
import ssl

# Skip certificate verification for urllib-based HTTPS requests
ssl._create_default_https_context = ssl._create_unverified_context
```

or

```python
res = requests.get(url, headers=headers, verify=False, timeout=(5, 10))
contents = res.text
```

4. Use the urllib3 package to disable warnings

```python
import urllib3

# Disable urllib3 warnings (e.g., the InsecureRequestWarning raised by verify=False)
urllib3.disable_warnings()
```

5. If your current IP address gets blocked

1). Use VPN software to change the IP address.

2). Crawl the websites from home instead.

3). If it is not urgent, wait for 2 days before crawling again.

---

## Links to the Contents

Here is an example with ID **C018781660**:

To get ***English Term + Chinese Term + Synonyms***: [**Link**](https://www.nstl.gov.cn/execute?target=nstl4.search4&function=paper/pc/detail&id=C018781660)

To get ***Fields***: [**Link**](https://www.nstl.gov.cn/execute?target=nstl4.search4&function=stkos/pc/detail/ztree&id=C018781660)

To get the **IDs** on a single page (~10 IDs per page): [**Link**](https://www.nstl.gov.cn/execute?target=nstl4.search4&function=paper/pc/list/pl&query=%7B%22c%22%3A10%2C%22st%22%3A%220%22%2C%22f%22%3A%5B%5D%2C%22p%22%3A%22%22%2C%22q%22%3A%5B%7B%22k%22%3A%22%22%2C%22v%22%3A%22%22%2C%22e%22%3A1%2C%22es%22%3A%7B%7D%2C%22o%22%3A%22AND%22%2C%22a%22%3A0%7D%5D%2C%22op%22%3A%22AND%22%2C%22s%22%3A%5B%22yea%3Adesc%22%5D%2C%22t%22%3A%5B%22Concept%22%5D%7D&sl=1&pageSize=10&pageNumber=1)
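
As a sketch, both detail views for one ID can be requested like this (the URLs are exactly the two detail links above; the raw JSON responses are left for you to inspect):

```python
# Request both detail endpoints for a single word ID.
import requests

word_id = "C018781660"
term_url = ("https://www.nstl.gov.cn/execute?target=nstl4.search4"
            f"&function=paper/pc/detail&id={word_id}")
fields_url = ("https://www.nstl.gov.cn/execute?target=nstl4.search4"
              f"&function=stkos/pc/detail/ztree&id={word_id}")

terms = requests.get(term_url, timeout=(5, 10)).json()    # English/Chinese terms + synonyms
fields = requests.get(fields_url, timeout=(5, 10)).json()  # subject fields
```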

---

## ID Order

In fact, the NSTL word IDs follow an order with respect to the year, e.g., 2018, 2019, 2020,
but I **don't recommend** the order method because you might miss some words this way.

Instead, I think it is better to **capture all the word IDs first**,
and **then capture the contents for those word IDs**.

Orders for NSTL words:

**YEAR 2020**: C0200 + {29329 - 55000} --> 19,892 Words

**YEAR 2019**: C019 + {000000 - 500000} --> 395,090 Words

**YEAR 2018**: C018 + {781660 - 999999} --> 200,068 Words
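
If you do want the order method, here is a sketch of generating the candidate IDs from the ranges above (the zero-padding widths are inferred from the example ID C018781660):

```python
# Generate the year-ordered candidate IDs; many of them point to
# blank pages, which is why the ID-first method is recommended.
year2020_ids = [f"C0200{n:05d}" for n in range(29329, 55001)]
year2019_ids = [f"C019{n:06d}" for n in range(0, 500001)]
year2018_ids = [f"C018{n:06d}" for n in range(781660, 1000000)]
```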

FYI, there are lots of blank pages among these IDs; run the code and you will see this for yourself.

This crawl captured **614,888** words.

As of **August 18th, 2020**, there were **614,959** words available.

---

## Announcement

These scripts and methods were mainly used to capture word contents from the NSTL word websites.
However, I believe they can be transferred to other dynamic web crawling tasks beyond the NSTL words.
So, enjoy your Python crawling and have a wonderful journey!

-- Shuyue Jia

---

## License

MIT License