
Scrape Crawler

Requests, BeautifulSoup, Scrapy, Selenium, CAPTCHA

HTTP Methods: GET, POST, PATCH, PUT, DELETE
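A quick sketch of these verbs with requests (the URL and payload are placeholders). Preparing the request without sending it is a handy way to inspect what would go over the wire:

```python
import requests

# Build a PATCH request without sending it; requests also provides
# requests.get / post / patch / put / delete helpers that send directly.
req = requests.Request(
    "PATCH",
    "https://example.com/api/items/1",  # placeholder endpoint
    json={"name": "new-name"},
)
prepared = req.prepare()
print(prepared.method)                   # PATCH
print(prepared.headers["Content-Type"])  # application/json
```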

Reddit Scraper Scrapy Reddit New York Transportation Data

Python Web Crawling: From 0 to 1 (19.5h)

pycurl, celery. Split the implementation into two stages: crawler and parser. Bind each crawler to a different IP; each IP is allowed one access per site per minute, managed through a task queue. The crawler does no parsing: fetched pages are pushed into a queue such as Kafka, parsed on other machines, and then stored in the database.
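A minimal in-process sketch of that design, assuming a one-minute window per (IP, site) pair; the `RateLimiter` name and `crawl` helper are invented for illustration, and a production setup would use celery/Kafka instead of an in-process `Queue`:

```python
import time
from collections import defaultdict
from queue import Queue


class RateLimiter:
    """Allow one access per (ip, site) pair per interval (one minute here)."""

    def __init__(self, interval=60.0):
        self.interval = interval
        self.last = defaultdict(lambda: float("-inf"))  # last access per pair

    def acquire(self, ip, site, now=None):
        now = time.monotonic() if now is None else now
        if now - self.last[(ip, site)] >= self.interval:
            self.last[(ip, site)] = now
            return True   # this crawler may fetch now
        return False      # still inside the one-minute window


# Crawlers push raw pages here without parsing them; in production this
# would be Kafka (or similar) feeding separate parser machines.
page_queue = Queue()


def crawl(url, raw_html):
    page_queue.put((url, raw_html))  # hand off; parse + DB insert happen elsewhere
```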

DataSet with API

IMDB Metacritic Movie Rating

requests may fail due to SSL/TLS version mismatches
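One common workaround is to pin the TLS configuration through a custom transport adapter; a sketch assuming the server requires TLS 1.2+ (the `TLSAdapter` name is invented here):

```python
import ssl

import requests
from requests.adapters import HTTPAdapter


class TLSAdapter(HTTPAdapter):
    """HTTPAdapter that injects a specific SSLContext into the pool."""

    def __init__(self, ssl_context=None, **kwargs):
        self._ssl_context = ssl_context
        super().__init__(**kwargs)

    def init_poolmanager(self, *args, **kwargs):
        kwargs["ssl_context"] = self._ssl_context
        return super().init_poolmanager(*args, **kwargs)


ctx = ssl.create_default_context()
ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # require TLS 1.2 or newer

session = requests.Session()
session.mount("https://", TLSAdapter(ssl_context=ctx))
# session.get("https://example.com/") now negotiates with the pinned context
```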

CAPTCHA

BeautifulSoup
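A minimal BeautifulSoup example over an inline snippet, extracting link text and hrefs:

```python
from bs4 import BeautifulSoup

html = '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>'
soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag; get_text() strips the markup
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]
print(links)  # [('First', '/a'), ('Second', '/b')]
```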

On Windows, read text files with open('file.txt', encoding='utf8') to avoid the UnicodeDecodeError: 'cp950' problem caused by the default locale codec.
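A short demonstration: write a throwaway UTF-8 file, then read it back with an explicit encoding (the file name is a placeholder):

```python
import os
import tempfile

# Write a UTF-8 file containing non-ASCII text.
path = os.path.join(tempfile.gettempdir(), "file.txt")
with open(path, "w", encoding="utf8") as f:
    f.write("中文測試")

# On Windows, plain open(path) decodes with the locale codec (often cp950)
# and can raise UnicodeDecodeError; passing encoding='utf8' avoids that.
with open(path, encoding="utf8") as f:
    text = f.read()
print(text)  # 中文測試
```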

img src replace

Selenium

Controlling the Web with Python Selenium

Selenium for Chrome Extension XPath

By replacing BeautifulSoup with selectolax, you can get a 5-30x speedup almost for free

Bloomberg

Scrapy

scrapyrt