Scrape Crawler

Requests, BeautifulSoup, Scrapy, Selenium, CAPTCHA

HTTP Methods: GET POST PATCH PUT DELETE

Python Web Scraping: From 0 to 1 (19.5 hours)

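pycurl. 1. celery. 2. Implement the crawler and the parser as two separate stages. Each crawler must bind to a different IP, and each IP is allowed one request per minute to any given website; a task queue manages this. The crawler does no parsing: fetched pages are pushed into a queue such as Kafka, parsed on other machines, and then stored in the database.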
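The design in the note above can be sketched with the standard library alone. This is a minimal illustration, not the original setup: `queue.Queue` stands in for Kafka, a plain loop stands in for the celery task queue, and `RateLimiter`, `crawl`, `parse_all`, `fake_fetch`, and `fake_parse` are hypothetical names invented here. The clock is passed in as `now` so the one-request-per-minute rule is easy to check.

```python
import queue
import re
from dataclasses import dataclass, field


@dataclass
class RateLimiter:
    """Allow one fetch per (source IP, site) pair every `interval` seconds."""
    interval: float = 60.0
    _last: dict = field(default_factory=dict)

    def try_acquire(self, ip, site, now):
        key = (ip, site)
        last = self._last.get(key)
        if last is not None and now - last < self.interval:
            return False  # this IP hit this site too recently
        self._last[key] = now
        return True


def crawl(tasks, ips, limiter, fetch, raw_pages, now):
    """Crawler stage: fetch only, never parse. Returns the deferred tasks."""
    deferred = []
    for site, url in tasks:
        # Use the first IP that has not hit this site within the interval.
        ip = next((i for i in ips if limiter.try_acquire(i, site, now)), None)
        if ip is None:
            deferred.append((site, url))  # retry on a later pass
            continue
        raw_pages.put((url, fetch(url, ip)))
    return deferred


def parse_all(raw_pages, parse, store):
    """Parser stage: drain the queue, parse each page, store the result."""
    while True:
        try:
            url, html = raw_pages.get_nowait()
        except queue.Empty:
            break
        store[url] = parse(html)


def fake_fetch(url, ip):
    # Stand-in for a real fetch bound to `ip` (assumption: with pycurl this
    # would be done by setting the INTERFACE option before perform()).
    return f"<title>{url} via {ip}</title>"


def fake_parse(html):
    # Stand-in parser: extract the <title> text.
    return re.search(r"<title>(.*?)</title>", html).group(1)


if __name__ == "__main__":
    limiter = RateLimiter()
    pages = queue.Queue()  # stand-in for a Kafka topic
    db = {}                # stand-in for the database
    tasks = [("example.com", f"http://example.com/{p}") for p in "abc"]
    left = crawl(tasks, ["10.0.0.1", "10.0.0.2"], limiter, fake_fetch, pages, now=0.0)
    parse_all(pages, fake_parse, db)
    print("deferred:", left)
    print("stored:", db)
```

With two IPs and three URLs on the same site, the third URL is deferred until one of the IPs has waited out its minute. Keeping `crawl` free of parsing means the fetch workers stay cheap and the parse workers can be scaled separately.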

captcha

Selenium

Controlling the Web with Python Selenium

Selenium for Chrome Extension XPath

By replacing BeautifulSoup with selectolax, you can get a 5-30x speedup almost for free

bloomberg