
Scrape Crawler

Requests, BeautifulSoup, Scrapy, Selenium, CAPTCHA. Related read on the JavaScript side: automating and scraping the web with Chrome, Puppeteer, and Node.js: https://medium.com/@bmorelli25/a-guide-to-automating-scraping-the-web-with-javascript-chrome-puppeteer-node-js-b18efb9e9921

HTTP methods: GET, POST, PATCH, PUT, DELETE
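All of these map directly onto requests calls. A quick sketch; httpbin.org is a public echo service used here only as a test target:

    import requests

    base = 'https://httpbin.org'
    print(requests.get(base + '/get').status_code)
    print(requests.post(base + '/post', json={'k': 'v'}).status_code)
    print(requests.patch(base + '/patch', json={'k': 'v2'}).status_code)
    print(requests.put(base + '/put', json={'k': 'v'}).status_code)
    print(requests.delete(base + '/delete').status_code)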

BeautifulSoup URL-editing example · Reddit Scraper · Scrapy Reddit · New York Transportation Data

Python web crawling: from 0 to 1 (19.5 h)

pycurl

1. celery
2. Split the implementation into a crawler stage and a parser stage. Bind each crawler to a different IP, allow each IP one request per site per minute, and manage the schedule with a task queue. The crawler does no parsing: push fetched pages into a queue such as Kafka, parse them on other machines, then store the results in the database. A sketch of this split follows.
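A minimal sketch of the crawler/parser split, assuming Celery with a Redis broker and the kafka-python client; the broker URLs and the raw_pages topic name are assumptions, not from the original note:

    # crawler.py -- fetch only, never parse
    import requests
    from celery import Celery
    from kafka import KafkaProducer   # kafka-python

    app = Celery('crawler', broker='redis://localhost:6379/0')
    producer = KafkaProducer(bootstrap_servers='localhost:9092')

    # Celery's rate_limit is per worker, so run one worker per outgoing IP
    # to approximate the one-request-per-IP-per-minute rule above.
    @app.task(rate_limit='1/m')
    def fetch(url):
        resp = requests.get(url, timeout=30)
        producer.send('raw_pages', resp.content)   # raw bytes go to the parser machines

    # parser.py -- runs on other machines, stores parsed rows in the database
    from kafka import KafkaConsumer
    from bs4 import BeautifulSoup

    for msg in KafkaConsumer('raw_pages', bootstrap_servers='localhost:9092'):
        soup = BeautifulSoup(msg.value, 'html.parser')
        row = {'title': soup.title.string if soup.title else None}
        # ... INSERT row into the database here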

Datasets with an API

IMDb / Metacritic movie ratings

requests can fail with SSL errors caused by an SSL/TLS version mismatch; a workaround sketch follows.
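One common workaround is to pin the TLS version through a custom transport adapter. A sketch, assuming Python 3.7+; example.com is a placeholder:

    import ssl
    import requests
    from requests.adapters import HTTPAdapter
    from urllib3.poolmanager import PoolManager

    class TLSAdapter(HTTPAdapter):
        def init_poolmanager(self, connections, maxsize, block=False, **kwargs):
            ctx = ssl.create_default_context()
            ctx.minimum_version = ssl.TLSVersion.TLSv1_2   # force TLS >= 1.2
            kwargs['ssl_context'] = ctx
            self.poolmanager = PoolManager(num_pools=connections,
                                           maxsize=maxsize, block=block, **kwargs)

    session = requests.Session()
    session.mount('https://', TLSAdapter())
    resp = session.get('https://example.com')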

CAPTCHA

BeautifulSoup

On Windows, read text files with open('file.txt', encoding='utf8'); otherwise the default cp950 codec can raise UnicodeDecodeError: 'cp950'.
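For example, feeding a saved page into BeautifulSoup; page.html is a placeholder file name:

    from bs4 import BeautifulSoup

    # without encoding='utf8', Windows falls back to cp950 (Big5) and may raise
    with open('page.html', encoding='utf8') as f:
        soup = BeautifulSoup(f, 'html.parser')
    print(soup.title.string)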

img src replace: rewriting image URLs in parsed HTML (sketch below)
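A minimal sketch of rewriting every img src to an absolute URL with BeautifulSoup; the HTML snippet and base URL are made up for the demo:

    from urllib.parse import urljoin
    from bs4 import BeautifulSoup

    html = '<img src="/static/a.png"><img src="b.png">'
    soup = BeautifulSoup(html, 'html.parser')
    for img in soup.find_all('img'):
        img['src'] = urljoin('https://example.com/page/', img.get('src', ''))
    print(soup)   # both src values are now absolute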

Selenium

Controlling the Web with Python Selenium

Selenium for Chrome Extension XPath
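A minimal Selenium 4 sketch for grabbing elements by XPath; the URL and the XPath expression are illustrative, not from the linked pages:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()        # Selenium 4 resolves the driver automatically
    driver.get('https://example.com')
    # an XPath like this is typically copied from devtools or an XPath extension
    links = driver.find_elements(By.XPATH, '//a[@href]')
    print([a.get_attribute('href') for a in links])
    driver.quit()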

By replacing BeautifulSoup with selectolax, you can get a 5-30x speedup almost for free
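The selectolax API is close enough that the swap is mostly mechanical. A sketch; the HTML snippet is made up for the demo:

    from selectolax.parser import HTMLParser

    html = '<body><a href="/a">A</a><a href="/b">B</a></body>'
    tree = HTMLParser(html)
    print([node.attributes.get('href') for node in tree.css('a')])   # ['/a', '/b']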

Bloomberg

Scrapy

scrapyrt (run Scrapy spiders on demand through an HTTP API)
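A minimal Scrapy spider for reference; quotes.toscrape.com is a public practice site:

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        start_urls = ['https://quotes.toscrape.com/']

        def parse(self, response):
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').get(),
                    'author': quote.css('small.author::text').get(),
                }

    # run with: scrapy runspider quotes_spider.py -o quotes.json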

urllib

urllib.urlencode vs urllib.quote (in Python 3 both live in urllib.parse). See also, on the Node.js side: https://medium.com/@ankitjain28may/how-to-perform-web-scraping-using-node-js-5a96203cb7cb
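The difference in one sketch: urlencode builds a whole query string from a dict, while quote percent-encodes a single component:

    from urllib.parse import urlencode, quote

    params = {'q': 'web scraping', 'page': 2}
    print(urlencode(params))      # q=web+scraping&page=2
    print(quote('web scraping'))  # web%20scraping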