# Dynamic Web Crawling in Python

## Tutorial for Crawling the NSTL Words (~620,000 words)

**Official Website**: https://www.nstl.gov.cn/

**Words Website**: https://www.nstl.gov.cn/stkos.html?t=Concept&q=

---

## Table of Contents

<ul>
<li><a href="#Usage-Demo">Usage Demo</a></li>
<li><a href="#Other-Useful-Resources">Other Useful Resources</a></li>
<li><a href="#Prevent-Anti-Crawling">Prevent Anti-Crawling</a></li>
<li><a href="#Links-of-the-Contents">Links of the Contents</a></li>
<li><a href="#ID-Order">ID Order</a></li>
<li><a href="#Announcement">Announcement</a></li>
<li><a href="#License">License</a></li>
</ul>

---

## Usage Demo

1. First, get all of the words' IDs from the main page.

   <p align="center">
   <img src='Photos/new-1.png'>
   <br />
   <br />

   <img src='Photos/new-2.png'>
   <br />
   <br />

   <img src='Photos/new-3.png'>
   <br />
   <br />

   <img src='Photos/new-4.png'>
   <br />
   <br />

   <img src='Photos/new-5.png'>
   <br />
   <br />

   <img src='Photos/new-6.png'>
   <br />
   <br />
   </p>

   For example: https://www.nstl.gov.cn/execute?target=nstl4.search4&function=paper/pc/list/pl&query=%7B%22c%22%3A10%2C%22st%22%3A%220%22%2C%22f%22%3A%5B%5D%2C%22p%22%3A%22%22%2C%22q%22%3A%5B%7B%22k%22%3A%22%22%2C%22v%22%3A%22%22%2C%22e%22%3A1%2C%22es%22%3A%7B%7D%2C%22o%22%3A%22AND%22%2C%22a%22%3A0%7D%5D%2C%22op%22%3A%22AND%22%2C%22s%22%3A%5B%22yea%3Adesc%22%5D%2C%22t%22%3A%5B%22Concept%22%5D%7D&sl=1&pageSize=10&pageNumber=1

   You can request this URL instead to get the IDs from the page. Change "&pageNumber=1" to "&pageNumber=100" to get the contents of page 100. Please follow the code in "get-website-IDs.ipynb".

2. After getting the words' IDs, use "fast-crawler-NSTL-data.ipynb" to download and capture all the contents from the website. A minimal end-to-end sketch of both steps follows at the end of this section.

   <p align="center">
   <img src='Photos/1.png'>
   <br />
   <br />

   <img src='Photos/2.png'>
   <br />
   <br />

   <img src='Photos/3.png'>
   <br />
   <br />

   <img src='Photos/4.png'>
   <br />
   <br />

   <img src='Photos/5.png'>
   <br />
   <br />

   <img src='Photos/6.png'>
   <br />
   <br />
   </p>

   The words will be saved in JSON format.

   Good luck with your future crawling!
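
Putting the two steps together, here is a minimal sketch of the whole flow; it is not a substitute for the two notebooks above. It makes a few assumptions that this README does not spell out: the query payload is exactly the one decoded from the example URL in step 1, word IDs match the pattern `C` followed by 9 digits (inferred from the example IDs further down, e.g. C018781660), and the detail response is stored as-is because its internal structure is only shown in the notebooks.

```python
import json
import re
import time

import requests

# Base of the dynamic-data endpoint discovered through the browser's Network tab.
BASE = "https://www.nstl.gov.cn/execute?target=nstl4.search4"

# The query payload decoded from the example URL in step 1 above.
QUERY = {
    "c": 10, "st": "0", "f": [], "p": "",
    "q": [{"k": "", "v": "", "e": 1, "es": {}, "o": "AND", "a": 0}],
    "op": "AND",
    "s": ["yea:desc"],
    "t": ["Concept"],
}


def get_ids_on_page(page_number, page_size=10):
    """Step 1: request one page of the word list and pull out the word IDs."""
    url = f"{BASE}&function=paper/pc/list/pl"
    params = {
        "query": json.dumps(QUERY, separators=(",", ":")),
        "sl": 1,
        "pageSize": page_size,
        "pageNumber": page_number,
    }
    res = requests.get(url, params=params, timeout=(5, 10))
    # Assumption: word IDs look like "C" + 9 digits (e.g. C018781660), so a regex over
    # the raw response body is enough; the exact JSON layout is shown in get-website-IDs.ipynb.
    return sorted(set(re.findall(r"C\d{9}", res.text)))


def get_word_detail(word_id):
    """Step 2: request the detail endpoint for one word ID and keep the raw payload."""
    url = f"{BASE}&function=paper/pc/detail&id={word_id}"
    res = requests.get(url, timeout=(5, 10))
    content_type = res.headers.get("Content-Type", "")
    return res.json() if "json" in content_type else res.text


if __name__ == "__main__":
    all_words = {}
    for page in range(1, 3):                  # only the first two pages, as a demo
        for word_id in get_ids_on_page(page):
            all_words[word_id] = get_word_detail(word_id)
            time.sleep(0.5)                   # be polite; see the anti-crawling tips below
    with open("nstl_words_demo.json", "w", encoding="utf-8") as f:
        json.dump(all_words, f, ensure_ascii=False, indent=2)
```

In a real run you would loop `page` over all available pages and combine this with the User-Agent, retry, and `verify=False` tricks from the anti-crawling section below.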

---

## Other Useful Resources

1. Python **selenium** package *(5 - 20 seconds per page)*

   1). **Not recommended** if there are tons of pages to crawl.

   2). You need to download the *Chrome Driver* to capture the pages.

   3). If you only have a few dynamic pages, you can use the "*selenium-reptile-script.py*" script.
   It can serve as a reference; after a few small corrections, you can use it directly. A minimal sketch follows after this list.

2. Google Developer Tools -> Network -> Headers -> Request URL *(less than one second per page)*

   1). This is the recommended method.

   2). Use the "*fast-reptile-script-YEAR2018.py*" script or follow the "fast-crawler-NSTL-data.ipynb" notebook.
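
For the Selenium route, a minimal sketch might look like the following. It assumes a locally installed ChromeDriver and simply waits a fixed time for the JavaScript to render before dumping the page source; the repository's "selenium-reptile-script.py" is the actual reference.

```python
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Assumption: chromedriver is installed and discoverable on PATH (see the Chrome Driver note above).
options = Options()
options.add_argument("--headless")        # render pages without opening a visible browser window
driver = webdriver.Chrome(options=options)

try:
    # The words page from the top of this README; any dynamic page works the same way.
    driver.get("https://www.nstl.gov.cn/stkos.html?t=Concept&q=")
    time.sleep(5)                         # crude wait for the JavaScript-rendered content (5 - 20 s per page)
    html = driver.page_source             # fully rendered HTML, ready for BeautifulSoup or regex parsing
finally:
    driver.quit()

print(len(html), "characters of rendered HTML")
```

This round trip is exactly why the Network-tab method is preferred: a single plain `requests.get` to the underlying endpoint replaces the whole browser session.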

---

## Prevent Anti-Crawling

1. Use a fake User-Agent (pretend to be another browser/device) instead of visiting with your real one

   ```python
   from random import randint

   # A pool of fake User-Agent strings to avoid anti-crawling detection
   USER_AGENTS = [
       "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
       "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
       "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
       "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
       "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
       "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
       "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
       "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
       "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
       "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
       "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
       "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
       "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
       "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
       "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
       "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
   ]

   # Pick a random User-Agent for each request
   random_agent = USER_AGENTS[randint(0, len(USER_AGENTS) - 1)]
   headers = {
       'User-Agent': random_agent,
   }
   ```

2. Add a try-except-sleep retry to recover from "Connection refused" errors, and use timeout=(5, 10) to avoid requests hanging with no response

   ```python
   import time

   import requests

   # Retry up to 10 times; `url` and `headers` come from the snippet above
   for j in range(10):
       try:
           res = requests.get(url, headers=headers, verify=False, timeout=(5, 10))
           contents = res.text
       except Exception as e:
           if j >= 9:
               print('The exception has happened', '-' * 100)
           else:
               time.sleep(0.5)
       else:
           time.sleep(0.5)
           break
   ```

3. Use the ssl module or "verify=False" to skip SSL certificate verification

   ```python
   import ssl

   # Skip SSL certificate verification when opening HTTPS URLs
   ssl._create_default_https_context = ssl._create_unverified_context
   ```

   or

   ```python
   res = requests.get(url, headers=headers, verify=False, timeout=(5, 10))
   contents = res.text
   ```

4. Use the urllib3 package to suppress warnings

   ```python
   import urllib3

   # Suppress urllib3 warnings (e.g., the InsecureRequestWarning triggered by verify=False)
   urllib3.disable_warnings()
   ```

5. If your current IP address gets blocked

   1). Use a VPN to change your IP address.

   2). Crawl the websites from home (a different network).

   3). If it's not urgent, wait a couple of days before crawling again.

---

## Links of the Contents

Here is an example with ID **C018781660**:

To get the ***English Term + Chinese Term + Synonyms***: [**Link**](https://www.nstl.gov.cn/execute?target=nstl4.search4&function=paper/pc/detail&id=C018781660)

To get the ***Fields***: [**Link**](https://www.nstl.gov.cn/execute?target=nstl4.search4&function=stkos/pc/detail/ztree&id=C018781660)

To get the **IDs** on a single page (~10 IDs per page): [**Link**](https://www.nstl.gov.cn/execute?target=nstl4.search4&function=paper/pc/list/pl&query=%7B%22c%22%3A10%2C%22st%22%3A%220%22%2C%22f%22%3A%5B%5D%2C%22p%22%3A%22%22%2C%22q%22%3A%5B%7B%22k%22%3A%22%22%2C%22v%22%3A%22%22%2C%22e%22%3A1%2C%22es%22%3A%7B%7D%2C%22o%22%3A%22AND%22%2C%22a%22%3A0%7D%5D%2C%22op%22%3A%22AND%22%2C%22s%22%3A%5B%22yea%3Adesc%22%5D%2C%22t%22%3A%5B%22Concept%22%5D%7D&sl=1&pageSize=10&pageNumber=1)
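
As a quick way to poke at the first two endpoints, something like the sketch below can be used. Whether each endpoint returns JSON or plain text is not documented here, so the sketch falls back to the raw body if JSON parsing fails.

```python
import requests

BASE = "https://www.nstl.gov.cn/execute?target=nstl4.search4"
word_id = "C018781660"   # the example ID from above

endpoints = {
    "terms_and_synonyms": f"{BASE}&function=paper/pc/detail&id={word_id}",
    "fields": f"{BASE}&function=stkos/pc/detail/ztree&id={word_id}",
}

for name, url in endpoints.items():
    res = requests.get(url, timeout=(5, 10))
    try:
        payload = res.json()   # assumption: the endpoint returns JSON
    except ValueError:
        payload = res.text     # otherwise keep the raw body
    print(name, "->", type(payload).__name__)
```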

---

## ID Order

In fact, there is an order for NSTL words w.r.t. different years, e.g., 2018, 2019, 2020,
but I **don't recommend** using the order method because you might miss some words that way.

Instead, I think it is better to **capture all the words' IDs first**,
and **then capture the contents w.r.t. these word IDs**.

Orders for NSTL words:

**YEAR 2020**: C0200 + {29329 - 55000} --> 19,892 Words

**YEAR 2019**: C019 + {000000 - 500000} --> 395,090 Words

**YEAR 2018**: C018 + {781660 - 999999} --> 200,068 Words

FYI, many of these IDs lead to blank pages. Look through and run the code, and you will see this for yourself.
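
If you do want to enumerate candidate IDs by year anyway, the patterns above suggest something like the sketch below. The zero-padding widths are only inferred from the example IDs in this README (e.g. C018781660), so treat them as an assumption, and remember that many generated IDs will point to blank pages.

```python
# Enumerate candidate word IDs from the year prefixes listed above.
# Assumption: an ID is the prefix plus a zero-padded counter, e.g. "C018" + "781660" = C018781660.
def candidate_ids():
    for n in range(781660, 1000000):   # YEAR 2018: C018 + {781660 - 999999}
        yield f"C018{n:06d}"
    for n in range(0, 500001):         # YEAR 2019: C019 + {000000 - 500000}
        yield f"C019{n:06d}"
    for n in range(29329, 55001):      # YEAR 2020: C0200 + {29329 - 55000}
        yield f"C0200{n:05d}"

print(next(candidate_ids()))           # -> C018781660
```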

In total, **614,888** words were crawled.

As of **August 18th, 2020**, there are **614,959** words available.

---

## Announcement

These scripts and methods were mainly used to capture word contents from the NSTL words website.
However, I believe they can be transferred to other dynamic web crawling tasks besides the NSTL words.
So, enjoy your Python crawling and have a wonderful journey!

-- Shuyue Jia

---

## License

MIT License