Web Scraping in Practice - Using Python3 Part5 | 阿狗的程式雜記

Level4: Anti-Bot Protection

The page for this level looks much like Level 2: pagination.

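However, if you rerun the Level 2 program, you will find that every request returns the first page, even when you pin the PHPSESSID sent with each request. At this point we need to look at another part of the HTTP exchange: the headers.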

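Every message that travels over HTTP carries headers like the ones below. The browser's request headers come first, followed by the server's response:
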
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip, deflate, sdch
Accept-Language:zh-TW,zh;q=0.8,en-US;q=0.6,en;q=0.4
Connection:keep-alive
Host:axe-level-4.herokuapp.com
Referer:http://axe-level-4.herokuapp.com/lv4/?page=1
Upgrade-Insecure-Requests:1
User-Agent:Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36

HTTP/1.1 200 OK
Connection:keep-alive
Content-Length:2513
Content-Type:text/html
Date:Thu, 29 Dec 2016 06:37:27 GMT
Server:Apache/2.2.25 (Unix) PHP/5.3.27
Via:1.1 vegur
X-Powered-By:PHP/5.3.27
{HTML CONTENT}
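
To make the structure of the listing above concrete, here is a minimal sketch that turns a raw `Name:value` header listing into a Python dict. The function name `parse_header_block` is made up for this illustration, and the sample text is the response listing shown above without its body. It only handles the simple one-header-per-line form, not folded or repeated headers.

```python
def parse_header_block(block):
    """Split a raw 'Name:value' header listing into a dict (hypothetical helper)."""
    headers = {}
    for line in block.strip().splitlines():
        name, sep, value = line.partition(":")
        # Header names never contain spaces, so a status line such as
        # "HTTP/1.1 200 OK" (no colon at all) or a request line is skipped.
        if not sep or " " in name.strip():
            continue
        headers[name.strip()] = value.strip()
    return headers


raw_response = """HTTP/1.1 200 OK
Connection:keep-alive
Content-Length:2513
Content-Type:text/html
Date:Thu, 29 Dec 2016 06:37:27 GMT
Server:Apache/2.2.25 (Unix) PHP/5.3.27
"""

response_headers = parse_header_block(raw_response)
print(response_headers["Content-Type"])
print(response_headers["Date"])
```

Splitting on the first colon only is what keeps the time in the `Date` value intact.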

So our crawler has to set the headers by hand. The steps are as follows:

import requests

# Headers copied from the browser's request, so the crawler looks like the browser
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate, sdch",
    "Accept-Language": "zh-TW,zh;q=0.8,en-US;q=0.6,en;q=0.4",
    "Connection": "keep-alive",
    "Host": "axe-level-4.herokuapp.com",
    "Referer": "http://axe-level-4.herokuapp.com/lv4/?page=1",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
}
# Any page of this level is requested the same way; page 2 is used as the example
url = "http://axe-level-4.herokuapp.com/lv4/?page=2"
result = requests.get(url, headers=headers)
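
The article does not say which of these headers the site actually checks. One way to find out is to drop one header at a time and see whether the page still comes back correctly. The sketch below shows that idea under stated assumptions: `find_required_headers` and `page_ok` are names invented for this example, and `fake_page_ok` is a stand-in with a made-up rule (it wants both `User-Agent` and `Referer`), not the real server's behavior.

```python
def find_required_headers(headers, page_ok):
    """Return the names of headers whose removal makes page_ok() fail.

    page_ok is a hypothetical callable: it receives a headers dict and
    returns True when the server answered with the requested page.
    """
    required = []
    for name in headers:
        # Build a copy of the headers with exactly one entry left out.
        trimmed = {k: v for k, v in headers.items() if k != name}
        if not page_ok(trimmed):
            required.append(name)
    return required


def fake_page_ok(headers):
    # Stand-in for the real site, whose rules are unknown: this fake
    # accepts a request only when both headers below are present.
    return "User-Agent" in headers and "Referer" in headers


sample_headers = {
    "Accept": "text/html",
    "Referer": "http://axe-level-4.herokuapp.com/lv4/?page=1",
    "User-Agent": "Mozilla/5.0",
}
required = find_required_headers(sample_headers, fake_page_ok)
print(required)
```

Against the real site, `page_ok` could wrap `requests.get(url, headers=...)` and check that the returned table is not the first page again. That check is left out here because it needs network access.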

The final, cleaned-up program is as follows:

# -*- coding:utf8 -*-
import json

import requests
from lxml import etree

if __name__ == "__main__":
    # Browser-like headers; without them the site keeps returning the first page
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate, sdch",
        "Accept-Language": "zh-TW,zh;q=0.8,en-US;q=0.6,en;q=0.4",
        "Connection": "keep-alive",
        "Host": "axe-level-4.herokuapp.com",
        "Referer": "http://axe-level-4.herokuapp.com/lv4/?page=1",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36",
    }
    records = []
    for i in range(1, 25):  # pages 1 to 24
        result = requests.get("http://axe-level-4.herokuapp.com/lv4/?page=%d" % i, headers=headers)
        result.encoding = 'utf8'
        root = etree.fromstring(result.text, etree.HTMLParser())
        # Skip the table's header row, then read the three cells of each data row
        for row in root.xpath("//table[@class='table']/tr[position()>1]"):
            column = row.xpath("./td/text()")
            records.append({"town": column[0], "village": column[1], "name": column[2]})
    # json.dumps handles quoting and escaping; ensure_ascii=False keeps Chinese text readable
    print(json.dumps(records, ensure_ascii=False))
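
One detail about the JSON output is worth a small illustration: a JSON string assembled with `%s` formatting breaks as soon as a cell contains a double quote, while the `json` module escapes it. The row below is made up for this sketch and does not come from the site.

```python
import json

# Made-up row; the quotes inside the name are the problem case.
row = ["Town A", "Village B", 'Chen "Ace"']

# Hand-built JSON: the inner quotes end the string early.
hand_built = '{"town": "%s", "village": "%s", "name": "%s"}' % tuple(row)
try:
    json.loads(hand_built)
    hand_built_valid = True
except json.JSONDecodeError:
    hand_built_valid = False

# The json module escapes the quotes, so the text parses back unchanged.
record = dict(zip(("town", "village", "name"), row))
serialized = json.dumps(record, ensure_ascii=False)
round_trip = json.loads(serialized)

print(hand_built_valid)
print(round_trip == record)
```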

That wraps up this chapter. The next part will show how to use threads in a crawler.