Level 4: Bot-blocking sites
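The page for this level looks similar to Level 2: pagination.
But if you rerun the Level 2 program, you will find that it fetches the first page every time, and pinning the PHPSESSID sent with each request does not change that. At this point we need to consider another property of web traffic: the headers.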
All data sent over HTTP looks like the following: a block of headers, then the content. These are the headers the browser sends with its request:

```
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding: gzip, deflate, sdch
Accept-Language: zh-TW,zh;q=0.8,en-US;q=0.6,en;q=0.4
Connection: keep-alive
Host: axe-level-4.herokuapp.com
Referer: http://axe-level-4.herokuapp.com/lv4/?page=1
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36
```

And this is the response from the server:

```
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 2513
Content-Type: text/html
Date: Thu, 29 Dec 2016 06:37:27 GMT
Server: Apache/2.2.25 (Unix) PHP/5.3.27
Via: 1.1 vegur
X-Powered-By: PHP/5.3.27

{HTML CONTENT}
```
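To see how a plain script differs from the browser request shown above, it can help to look at what requests sends when no headers are set. The sketch below compares the library's default headers with the header names from the browser capture. `browser_header_names` and `missing` are names made up for this illustration, and the source does not say which header the site actually checks, so this only lists the differences rather than identifying the cause.

```python
import requests

# Headers that requests attaches when the caller sets none.
default_headers = requests.utils.default_headers()

# Header names from the browser capture above. "Host" is left out on
# purpose: the underlying HTTP library adds it when the request is sent.
browser_header_names = [
    "Accept",
    "Accept-Encoding",
    "Accept-Language",
    "Connection",
    "Referer",
    "Upgrade-Insecure-Requests",
    "User-Agent",
]

# Headers the browser sends that a default requests call does not.
missing = [name for name in browser_header_names if name not in default_headers]

# The default User-Agent names the library itself rather than a browser.
print("default User-Agent:", default_headers["User-Agent"])
print("not sent by default:", missing)
```

Even for headers present in both, the values differ: the default User-Agent starts with `python-requests/`, which a server can easily tell apart from Chrome.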
So our crawler has to set the headers manually. The steps are as follows:

```python
import requests

url = "http://axe-level-4.herokuapp.com/lv4/?page=1"

# Copy the browser's request headers into a dict (note the comma after each entry)
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, sdch",
    "Accept-Language": "zh-TW,zh;q=0.8,en-US;q=0.6,en;q=0.4",
    "Connection": "keep-alive",
    "Host": "axe-level-4.herokuapp.com",
    "Referer": "http://axe-level-4.herokuapp.com/lv4/?page=1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
}

# Pass the dict along with the request
result = requests.get(url, headers=headers)
```
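Typing the headers dict by hand is error-prone, so one option is to paste the header lines copied from the browser and split them in code. The following is a minimal sketch using only the standard library; `parse_raw_headers` and `RAW_HEADERS` are hypothetical names introduced here, not part of requests or of the original program.

```python
# Header lines as copied from the browser's developer tools.
RAW_HEADERS = """
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip, deflate, sdch
Accept-Language:zh-TW,zh;q=0.8,en-US;q=0.6,en;q=0.4
Connection:keep-alive
Host:axe-level-4.herokuapp.com
Referer:http://axe-level-4.herokuapp.com/lv4/?page=1
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36
"""


def parse_raw_headers(raw):
    """Turn 'Name: value' lines into a dict; lines without a colon are skipped."""
    headers = {}
    for line in raw.strip().splitlines():
        # Split on the first colon only, so values such as URLs stay intact.
        name, sep, value = line.partition(":")
        if not sep:
            continue
        headers[name.strip()] = value.strip()
    return headers


headers = parse_raw_headers(RAW_HEADERS)
for name, value in headers.items():
    print("%s: %s" % (name, value))
```

The resulting dict can be passed to `requests.get(url, headers=headers)` in the same way as a hand-written one.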
Finally, the cleaned-up program looks like this:

```python
import json

import requests
from lxml import etree

if __name__ == "__main__":
    # Browser-like headers copied from a real Chrome request
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-TW,zh;q=0.8,en-US;q=0.6,en;q=0.4",
        "Connection": "keep-alive",
        "Host": "axe-level-4.herokuapp.com",
        "Referer": "http://axe-level-4.herokuapp.com/lv4/?page=1",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
    }
    rows = []
    for i in range(1, 25):  # pages 1 through 24
        result = requests.get("http://axe-level-4.herokuapp.com/lv4/?page=%d" % i, headers=headers)
        result.encoding = "utf8"
        root = etree.fromstring(result.text, etree.HTMLParser())
        # position()>1 skips the table's header row
        for row in root.xpath("//table[@class='table']/tr[position()>1]"):
            column = row.xpath("./td/text()")
            rows.append({"town": column[0], "village": column[1], "name": column[2]})
    # json.dumps handles quoting; ensure_ascii=False keeps the Chinese text readable
    print(json.dumps(rows, ensure_ascii=False))
```
That wraps up this section. The next section will show how to use threads in a crawler.