page-fetch

page-fetch is a tool for researchers that lets you:

  • Fetch web pages using headless Chrome, storing all fetched resources including JavaScript files
  • Run arbitrary JavaScript on many web pages and see the returned values

Installation

page-fetch is written in Go and can be installed with go get:

▶ go get github.com/detectify/page-fetch

Or you can clone the repository and build it manually:

▶ git clone https://github.com/detectify/page-fetch.git
▶ cd page-fetch
▶ go install

Dependencies

page-fetch uses chromedp, which requires that a Chrome or Chromium browser be installed. It tries the following executable names when looking for a browser to launch:

  • headless_shell
  • headless-shell
  • chromium
  • chromium-browser
  • google-chrome
  • google-chrome-stable
  • google-chrome-beta
  • google-chrome-unstable
  • /usr/bin/google-chrome

Basic Usage

page-fetch takes a list of URLs as its input on stdin. You can provide the input list using IO redirection:

▶ page-fetch < urls.txt

Or using the output of another command:

▶ grep admin urls.txt | page-fetch

By default, responses are stored in a directory called 'out', which is created if it does not exist:

▶ echo https://detectify.com | page-fetch
GET https://detectify.com/ 200 text/html; charset=utf-8
GET https://detectify.com/site/themes/detectify/css/detectify.css?v=1621498751 200 text/css
GET https://detectify.com/site/themes/detectify/img/detectify_logo_black.svg 200 image/svg+xml
GET https://fonts.googleapis.com/css?family=Merriweather:300i 200 text/css; charset=utf-8
...
▶ tree out
out
├── detectify.com
│   ├── index
│   ├── index.meta
│   └── site
│       └── themes
│           └── detectify
│               ├── css
│               │   ├── detectify.css
│               │   └── detectify.css.meta
...

The directory structure used in the output directory mirrors the directory structure used on the target websites. A ".meta" file is stored for each request; it contains the originally requested URL (including the query string), the request and response headers, etc.
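
The mapping from URL to output path described above can be sketched in Go. This is only an illustration, not page-fetch's actual code: the function name outputPath is hypothetical, and the rules it applies (the query string is dropped from the file name, a path ending in "/" is stored as "index", the port is ignored) are assumptions inferred from the example output.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// outputPath maps a URL to a file path under outDir that mirrors the host
// and path of the URL. Assumptions (not confirmed by page-fetch itself):
// the query string is dropped, a path ending in "/" becomes "index", and
// the port is not part of the directory name.
func outputPath(outDir, rawURL string) (string, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", err
	}
	p := u.Path
	if p == "" || strings.HasSuffix(p, "/") {
		p += "/index"
	}
	// Cleaning a rooted path keeps ".." segments from escaping outDir.
	p = path.Clean("/" + p)
	return path.Join(outDir, u.Hostname(), p), nil
}

func main() {
	urls := []string{
		"https://detectify.com/",
		"https://detectify.com/site/themes/detectify/css/detectify.css?v=1621498751",
	}
	for _, u := range urls {
		p, err := outputPath("out", u)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		// The matching ".meta" file sits next to the response body.
		fmt.Println(p, p+".meta")
	}
}
```

The sketch uses the path package (forward slashes) to keep the output readable; code that writes real files would more likely use path/filepath.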

Options

You can get the page-fetch help output by running page-fetch -h:

▶ page-fetch -h
Request URLs using headless Chrome, storing the results

Usage:
  page-fetch [options] < urls.txt

Options:
  -c, --concurrency     Concurrency Level (default 2)
  -e, --exclude         Do not save responses matching the provided string (can be specified multiple times)
  -i, --include         Only save requests matching the provided string (can be specified multiple times)
  -j, --javascript      JavaScript to run on each page
  -o, --output          Output directory name (default 'out')
  -w, --overwrite       Overwrite output files when they already exist
      --no-third-party  Do not save responses to requests on third-party domains
      --third-party     Only save responses to requests on third-party domains

Concurrency

You can change how many headless Chrome processes are used with the -c / --concurrency option. The default value is 2.

Excluding responses based on content-type

You can choose to not save responses that match particular content types with the -e / --exclude option. Any response with a content-type that partially matches the provided value will not be stored; so you can, for example, avoid storing image files by specifying:

▶ page-fetch --exclude image/

The option can be specified multiple times to exclude multiple different content-types.

Including responses based on content-type

Rather than excluding specific content-types, you can opt to only save certain content-types with the -i / --include option:

▶ page-fetch --include text/html

The option can be specified multiple times to include multiple different content-types.
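
The partial matching used by --exclude and --include can be illustrated with a small Go sketch. The helper names matchesAny and shouldSave are hypothetical. The sketch also assumes that matching is a case-sensitive substring test and that exclusions are checked before inclusions; how page-fetch behaves when both options are given is not stated here.

```go
package main

import (
	"fmt"
	"strings"
)

// matchesAny reports whether contentType contains any pattern as a substring.
func matchesAny(contentType string, patterns []string) bool {
	for _, p := range patterns {
		if strings.Contains(contentType, p) {
			return true
		}
	}
	return false
}

// shouldSave decides whether a response with the given content type is kept.
// Assumption: exclusions win over inclusions, and an empty include list
// means "include everything".
func shouldSave(contentType string, include, exclude []string) bool {
	if matchesAny(contentType, exclude) {
		return false
	}
	if len(include) > 0 {
		return matchesAny(contentType, include)
	}
	return true
}

func main() {
	exclude := []string{"image/"}
	for _, ct := range []string{"text/html; charset=utf-8", "image/svg+xml", "text/css"} {
		fmt.Printf("%-26s save=%v\n", ct, shouldSave(ct, nil, exclude))
	}
}
```

Because the test is a substring match, a pattern such as image/ covers image/png, image/svg+xml and every other image type in one go.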

Running JavaScript on each page

You can run arbitrary JavaScript on each page with the -j / --javascript option. The return value of the JavaScript is converted to a string and printed on a line prefixed with "JS":

▶ echo https://example.com | page-fetch --javascript document.domain
GET https://example.com/ 200 text/html; charset=utf-8
JS (https://example.com): example.com

This option can be used for a very wide variety of purposes. As an example, you could extract the href attribute from all links on a webpage:

▶ echo https://example.com | page-fetch --javascript '[...document.querySelectorAll("a")].map(n => n.href)' | grep ^JS
JS (https://example.com): [https://www.iana.org/domains/example]
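
If you want to post-process the "JS" lines in a program rather than with grep, a parser along the following lines may be enough. It is a sketch that assumes the line format is exactly the one shown in the examples above ("JS (", the page URL, "): ", then the value); parseJSLine is a hypothetical helper, not part of page-fetch.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// parseJSLine splits a line of the form "JS (<url>): <value>" into its URL
// and value. ok is false for any other line, such as the "GET ..." lines.
func parseJSLine(line string) (pageURL, value string, ok bool) {
	const prefix = "JS ("
	const sep = "): "
	if !strings.HasPrefix(line, prefix) {
		return "", "", false
	}
	rest := line[len(prefix):]
	i := strings.Index(rest, sep)
	if i < 0 {
		return "", "", false
	}
	return rest[:i], rest[i+len(sep):], true
}

func main() {
	// Read page-fetch output from stdin and print URL and value, tab-separated.
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		if u, v, ok := parseJSLine(sc.Text()); ok {
			fmt.Printf("%s\t%s\n", u, v)
		}
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "read error:", err)
	}
}
```

The parser splits on the first "): " it finds, so a page URL that itself contains that sequence would be cut short. bufio.Scanner also rejects lines longer than 64 KiB by default, which could matter if the JavaScript returns large values.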

Setting the output directory name

By default, files are stored in a directory called out. This can be changed with the -o / --output option:

▶ echo https://example.com | page-fetch --output example
GET https://example.com/ 200 text/html; charset=utf-8
▶ find example/ -type f
example/example.com/index
example/example.com/index.meta

The directory is created if it does not already exist.

Overwriting files

By default, when a file already exists, a new file is created with a numeric suffix, e.g. if index already exists, index.1 will be created. This behaviour can be overridden with the -w / --overwrite option. When the option is used, matching files are overwritten instead.

Excluding third-party responses

You may sometimes wish to exclude responses from third-party domains. This can be done with the --no-third-party option. Responses to requests for domains that match neither the domain of the input URL nor one of its subdomains will not be saved.
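
One way to read the rule above is shown in this Go sketch. It is an interpretation rather than the tool's own logic: isThirdParty is a hypothetical helper that treats a request as first-party when its host equals the input URL's host or ends with "." followed by that host.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// isThirdParty reports whether requestURL points at a host that is neither
// the host of inputURL nor one of its subdomains. Assumption: the comparison
// uses the full input host, so for an input of www.example.com a request to
// example.com counts as third-party.
func isThirdParty(inputURL, requestURL string) (bool, error) {
	in, err := url.Parse(inputURL)
	if err != nil {
		return false, err
	}
	req, err := url.Parse(requestURL)
	if err != nil {
		return false, err
	}
	inHost := strings.ToLower(in.Hostname())
	reqHost := strings.ToLower(req.Hostname())
	if reqHost == inHost || strings.HasSuffix(reqHost, "."+inHost) {
		return false, nil
	}
	return true, nil
}

func main() {
	input := "https://detectify.com"
	requests := []string{
		"https://detectify.com/site/themes/detectify/css/detectify.css",
		"https://blog.detectify.com/",
		"https://fonts.googleapis.com/css?family=Merriweather:300i",
	}
	for _, r := range requests {
		third, err := isThirdParty(input, r)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%v\t%s\n", third, r)
	}
}
```

Matching on "." plus the host, rather than on the bare host as a suffix, keeps a look-alike such as notdetectify.com from being treated as a subdomain.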

Including only third-party responses

On rare occasions you may wish to store only responses from third-party domains. This can be done with the --third-party option.

Issues
  • Adding `ignore-certificate-errors` Chrome option

    page-fetch currently validates TLS certificates, which is generally an undesired feature for security tools. This PR adds the ignore-certificate-errors option to Chrome to disable certificate validation.

    I have not extensively tested this, but it seems to work in line with Chrome against badssl.com

    opened by ajxchapman 1