Web Scraping in Practice - Using Python3 Part5

Level 4: Blocking Bots

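The page for this level looks similar to Level 2: Pagination.

However, if you re-run the Level 2 program against it, every request comes back with the first page, and pinning the PHPSESSID sent with each request does not change that. This is where we need to look at another part of every web request: the headers.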

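Every message that travels over HTTP carries a set of headers along with its content. Captured in the browser, a request to this site and the server's response look like this: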

Request headers:
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip, deflate, sdch
Accept-Language:zh-TW,zh;q=0.8,en-US;q=0.6,en;q=0.4
Connection:keep-alive
Host:axe-level-4.herokuapp.com
Referer:http://axe-level-4.herokuapp.com/lv4/?page=1
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36

Response:
HTTP/1.1 200 OK
Connection:keep-alive
Content-Length:2513
Content-Type:text/html
Date:Thu, 29 Dec 2016 06:37:27 GMT
Server:Apache/2.2.25 (Unix) PHP/5.3.27
Via:1.1 vegur
X-Powered-By:PHP/5.3.27
{HTML CONTENT}
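
For comparison, the sketch below prints the headers that the `requests` library attaches when none are supplied. It makes no network call. It does not prove which header the server inspects (that remains an assumption), but it shows that a bare script announces itself quite differently from the browser capture above.

```python
import requests

# Headers that requests adds on its own when the caller supplies none.
defaults = requests.utils.default_headers()

for name, value in defaults.items():
    print(f"{name}: {value}")

# The User-Agent identifies the client as a script rather than a browser,
# and no Referer or Accept-Language header is sent at all.
```

The User-Agent value begins with `python-requests/` followed by the installed version, so a server that filters on browser-like headers can tell the two clients apart easily.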

A script does not send these browser headers on its own, so our crawler has to set them manually. This is done as follows:

import requests

url = "http://axe-level-4.herokuapp.com/lv4/?page=2"

# Copy the headers a real browser sends so the request looks like a normal visit.
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, sdch",
    "Accept-Language": "zh-TW,zh;q=0.8,en-US;q=0.6,en;q=0.4",
    "Connection": "keep-alive",
    "Host": "axe-level-4.herokuapp.com",
    "Referer": "http://axe-level-4.herokuapp.com/lv4/?page=1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
}
# Pass the dictionary through the headers argument of requests.get.
result = requests.get(url, headers=headers)
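
Typing the dictionary by hand is error-prone, since a single missing comma or quote breaks it. As a minimal sketch, the header lines copied from the browser can be parsed into a dictionary instead. The helper name `parse_header_block` is hypothetical and not part of `requests`, and it assumes one `Name:value` pair per line, as in the capture above.

```python
def parse_header_block(text):
    """Turn 'Name:value' lines copied from the browser into a dict."""
    headers = {}
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only: values such as URLs contain colons too.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


raw = """
Accept-Language:zh-TW,zh;q=0.8,en-US;q=0.6,en;q=0.4
Host:axe-level-4.herokuapp.com
Referer:http://axe-level-4.herokuapp.com/lv4/?page=1
"""

headers = parse_header_block(raw)
print(headers["Referer"])
```

The resulting dictionary can be passed to `requests.get(url, headers=headers)` in the same way as a hand-written one.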

Putting it all together, the final program looks like this:

# -*- coding: utf-8 -*-
import json

import requests
from lxml import etree

if __name__ == "__main__":
    # Browser-like headers; without them every request returns the first page.
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-TW,zh;q=0.8,en-US;q=0.6,en;q=0.4",
        "Connection": "keep-alive",
        "Host": "axe-level-4.herokuapp.com",
        "Referer": "http://axe-level-4.herokuapp.com/lv4/?page=1",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
    }
    records = []
    for i in range(1, 25):  # pages 1 through 24
        result = requests.get("http://axe-level-4.herokuapp.com/lv4/?page=%d" % i, headers=headers)
        result.encoding = "utf8"
        root = etree.fromstring(result.text, etree.HTMLParser())
        # Skip the first table row and read the three cells of each remaining row.
        for row in root.xpath("//table[@class='table']/tr[position()>1]"):
            column = row.xpath("./td/text()")
            records.append({"town": column[0], "village": column[1], "name": column[2]})
    # json.dumps takes care of quoting and escaping; ensure_ascii=False keeps the Chinese text readable.
    print(json.dumps(records, ensure_ascii=False))
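
As a variation, the headers can be attached once to a `requests.Session`, which then sends them with every request and also keeps cookies such as PHPSESSID between pages. This is a sketch only: the names `BASE_URL`, `make_session` and `fetch_page` are illustrative, the header set is trimmed to three entries for brevity, and whether this site accepts the shorter set has not been verified.

```python
import requests

BASE_URL = "http://axe-level-4.herokuapp.com/lv4/"


def make_session():
    """Create a session that sends browser-like headers on every request."""
    session = requests.Session()
    # update() merges these values into the session's default headers.
    session.headers.update({
        "Accept-Language": "zh-TW,zh;q=0.8,en-US;q=0.6,en;q=0.4",
        "Referer": "http://axe-level-4.herokuapp.com/lv4/?page=1",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
    })
    return session


def fetch_page(session, page):
    """Request one page; the session adds its headers and stored cookies."""
    return session.get(BASE_URL, params={"page": page}, timeout=10)


session = make_session()
```

Calling `fetch_page(session, 2)` would then request `?page=2` with the session's headers, and the response text can be parsed exactly as in the program above.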

That wraps up this chapter. The next part will show how to use threads in a crawler.