scraper

Apache Nutch™

Nutch is a highly extensible, highly scalable, matured, production-ready Web crawler which enables fine grained configuration and accomodates a wide variety of data acquisition tasks.

ArchiveBox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more…