ant (alpha) is a web crawler for Go.

Declarative

The package includes functions that can scan data from a page into your structs or slices of structs, which reduces the noise and complexity in your source code.

You can also use a jQuery-like API to scrape complex HTML pages when needed.

var data struct {
  Title string `css:"title"`
}
page, _ := ant.Fetch(ctx, "https://apple.com")
page.Scan(&data)
data.Title // => Apple
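
Scanning into a slice of structs works the same way; each element is filled from one matched node. A minimal sketch, mirroring the selectors used in the full example further below, and assuming Scan returns an error (the `page.Scan(&data)` call above discards its result):

// Each element of Quotes is scanned from one `.quote` node.
type quote struct {
  Text string `css:".text"`
}
var data struct {
  Quotes []quote `css:".quote"`
}
page, err := ant.Fetch(ctx, "http://quotes.toscrape.com")
if err != nil {
  // handle the error.
}
if err := page.Scan(&data); err != nil {
  // handle the error.
}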

Headless

By default the crawler uses http.Client; if you're crawling single-page applications, you can use the antcdp.Client implementation instead, which crawls pages with a headless Chrome browser.

eng, err := ant.NewEngine(ant.EngineConfig{
  Fetcher: &ant.Fetcher{
    Client: antcdp.Client{},
  },
})
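
Note that antcdp drives headless Chrome over the Chrome DevTools Protocol (it builds on github.com/mafredri/cdp), so it expects a Chrome or Chromium installation to be available.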

Polite

The crawler automatically fetches, caches, and respects robots.txt, making sure it never causes problems for small website owners. Of course, you can disable this behavior.

eng, err := ant.NewEngine(ant.EngineConfig{
  Impolite: true,
})
eng.Run(ctx)

Concurrent

The crawler maintains a configurable number of "worker" goroutines that read URLs off the queue and spawn a new goroutine for each URL.

Depending on your configuration, you may want to increase the number of workers to speed up URL processing; of course, if you don't have enough resources you can reduce it too.

eng, err := ant.NewEngine(ant.EngineConfig{
  // Spawn 5 worker goroutines that dequeue
  // URLs and spawn a new goroutine for each URL.
  Workers: 5,
})
eng.Run(ctx)
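
Conceptually, the worker loop looks like the sketch below (illustrative only, not ant's actual implementation; `crawl` is a placeholder):

// run starts n workers; each worker dequeues URLs and
// spawns a goroutine per URL, mirroring the description above.
func run(ctx context.Context, queue <-chan string, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for url := range queue {
				go crawl(ctx, url) // placeholder per-URL work.
			}
		}()
	}
	wg.Wait() // returns once the queue is closed and drained.
}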

Rate limits

The package includes a powerful ant.Limiter interface that allows you to define rate limits per URL. There are some built-in limiters as well.

ant.Limit(1)                             // 1 rps on all URLs.
ant.LimitHostname(5, "amazon.com")       // 5 rps on the amazon.com hostname.
ant.LimitPattern(5, "amazon.com.*")      // 5 rps on URLs starting with `amazon.com`.
ant.LimitRegexp(5, `^apple.com/iphone/`) // 5 rps on URLs that match the regex.

Note that LimitPattern and LimitRegexp only match on the host and path of the URL.
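
A limiter is plugged into the engine through its configuration; a minimal sketch, assuming the EngineConfig field is named Limiter:

eng, err := ant.NewEngine(ant.EngineConfig{
  // Assumes EngineConfig exposes a Limiter field.
  Limiter: ant.LimitHostname(5, "amazon.com"),
})
if err != nil {
  // handle the error.
}
eng.Run(ctx)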


Matchers

Another powerful interface is ant.Matcher, which allows you to define URL matchers; matchers are called before URLs are queued.

ant.MatchHostname("amazon.com")        // scrape amazon.com URLs only.
ant.MatchPattern("amazon.com/help/*")  // scrape URLs that match the pattern.
ant.MatchRegexp(`amazon\.com/help/.+`) // scrape URLs that match the regex.
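
For illustration, here is a hypothetical custom matcher; this sketch assumes ant.Matcher is satisfied by a single `Match(*url.URL) bool` method, so check the package docs for the exact signature:

// maxDepth queues only URLs whose path has at most `depth` segments.
// Hypothetical type; assumes the ant.Matcher interface shape above.
type maxDepth struct{ depth int }

func (m maxDepth) Match(u *url.URL) bool {
	segments := strings.Split(strings.Trim(u.Path, "/"), "/")
	return len(segments) <= m.depth
}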

Robust

The crawl engine automatically retries any errors that implement a `Temporary() bool` method that returns true.

Because the standard library returns errors that implement this interface, the engine will retry most temporary network and HTTP errors.

eng, err := ant.NewEngine(ant.EngineConfig{
  Scraper: myscraper{},
  MaxAttempts: 5,
})

// Blocks until one of the following is true:
//
// 1. No more URLs to crawl (the scraper stops returning URLs)
// 2. A non-temporary error occurred.
// 3. MaxAttempts was reached.
//
err = eng.Run(ctx)
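
This means any custom error can opt into retries by exposing the same method; a minimal sketch:

// throttled is retried by the engine because it
// implements `Temporary() bool` and returns true.
type throttled struct{ err error }

func (t throttled) Error() string   { return t.err.Error() }
func (t throttled) Temporary() bool { return true }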

Built-in Scrapers

The whole point of scraping is to extract data from websites into a machine-readable format such as CSV or JSON. ant comes with built-in scrapers to make this ridiculously easy; here's a full crawler that extracts quotes to stdout.

package main

import (
	"context"
	"log"
	"os"
	"time"

	"github.com/yields/ant"
)

func main() {
	var url = "http://quotes.toscrape.com"
	var ctx = context.Background()
	var start = time.Now()

	// quote describes how a single quote is extracted.
	type quote struct {
		Text string   `css:".text"   json:"text"`
		By   string   `css:".author" json:"by"`
		Tags []string `css:".tag"    json:"tags"`
	}

	// page collects all quotes on a page.
	type page struct {
		Quotes []quote `css:".quote" json:"quotes"`
	}

	eng, err := ant.NewEngine(ant.EngineConfig{
		Scraper: ant.JSON(os.Stdout, page{}, `li.next > a`),
		Matcher: ant.MatchHostname("quotes.toscrape.com"),
	})
	if err != nil {
		log.Fatalf("new engine: %s", err)
	}

	if err := eng.Run(ctx, url); err != nil {
		log.Fatal(err)
	}

	log.Printf("scraped in %s :)", time.Since(start))
}
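
Given the json tags above, each scraped page is written to stdout as JSON shaped roughly like the following (values elided; the exact encoding may differ):

{"quotes":[{"text":"…","by":"…","tags":["…"]}]}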

Testing

The anttest package makes it easy to test your scraper implementations; it fetches a page by URL, caches it in the OS's temporary directory, and re-uses it on subsequent runs.

The cached file expires daily based on its modtime; you can adjust the TTL by setting anttest.FetchTTL.

// Fetch calls `t.Fatal` on errors.
page := anttest.Fetch(t, "https://apple.com")

_, err := myscraper.Scrape(ctx, page)
assert.NoError(t, err)


Issues
  • build(deps): bump github.com/andybalholm/cascadia from 1.1.0 to 1.2.0

    Bumps github.com/andybalholm/cascadia from 1.1.0 to 1.2.0.

    dependencies
    opened by dependabot-preview[bot] 2
  • build(deps): bump github.com/mafredri/cdp from 0.29.2 to 0.30.0

    Bumps github.com/mafredri/cdp from 0.29.2 to 0.30.0. Release notes (0.30.0):

    • Refactor errors and add support for Go 1.13 errors (#124)
      • all: Implement Is, As and Unwrap for errors
      • Deprecate cdp.ErrorCause in favor of Is, As, Unwrap
      • Use errors.Is and errors.As where applicable
      • travis: Drop Go 1.12, add Go 1.15

    dependencies
    opened by dependabot-preview[bot] 1
  • MatchRegexp incorrect behavior

    MatchRegexp tries to match r.Host+r.Path, but r.Path is normalized to drop its leading slash. So if url := "https://google.com/search/car" and one tries to match with exp := regexp.QuoteMeta("google.com/search/car"), it returns false because MatchRegexp compares "google.comsearch/car" against the exp.

    opened by rschio 1
  • Errors in the example for Built-in Scrapers

    The example under https://github.com/yields/ant#built-in-scrapers has a few minor errors in it.

    A corrected version would be:

    package main
    
    import (
    	"context"
    	"os"
    
    	"github.com/yields/ant"
    )
    
    func main() {
    	// Describe how a quote should be extracted.
    	type Quote struct {
    		Text string   `css:".text"`
    		By   string   `css:".author"`
    		Tags []string `css:".tag"`
    	}
    
    	// A page may have many quotes.
    	type Page struct {
    		Quotes []Quote `css:".quote"`
    	}
    
    	// Where we want to fetch quotes from.
    	const host = "quotes.toscrape.com"
    
    	// Initialize the engine with a built-in scraper
    	// that receives a type and extract data into an io.Writer.
    	eng, err := ant.NewEngine(ant.EngineConfig{
    		Scraper: ant.JSON(os.Stdout, Page{}),
    		Matcher: ant.MatchHostname(host),
    	})
    	if err != nil {
    		panic(err)
    	}
    
    	// Block until there are no more URLs to scrape.
    	if err := eng.Run(context.Background(), "http://"+host); err != nil {
    		panic(err)
    	}
    }
    

    One suggestion: include examples in your README using something like https://github.com/campoy/embedmd, which would make it easier to spot when examples go out of date.

    opened by peterhellberg 1
  • Fix import in example for antcdp

    The import in the example for antcdp was incorrect (probably outdated), and gave the following error:

    go get: module github.com/yields/ant found (v0.0.0-20210325225620-7fef9fdab5a8), but does not contain package github.com/yields/ant/exp/antcdp

    opened by faheel 1
  • build(deps): bump github.com/stretchr/testify from 1.6.1 to 1.7.0

    Bumps github.com/stretchr/testify from 1.6.1 to 1.7.0. Release notes: minor feature improvements and bug fixes.

    dependencies
    opened by dependabot-preview[bot] 1
  • build(deps): bump github.com/hashicorp/go-multierror from 1.1.0 to 1.1.1

    Bumps github.com/hashicorp/go-multierror from 1.1.0 to 1.1.1.

    dependencies
    opened by dependabot-preview[bot] 1
  • add antcdp

    • antcdp: move out of exp/
    • antcdp: new target per request
    • antcdp: refactor, optimize

    opened by yields 0
  • Add compression to *antcache.Diskstore

    Adds snappy compression to reduce disk usage when crawling larger websites.

    opened by yields 0
  • build(deps): bump github.com/mafredri/cdp from 0.31.0 to 0.32.0

    Bumps github.com/mafredri/cdp from 0.31.0 to 0.32.0. Release notes (v0.32.0):

    • Update to latest protocol definitions (39a25b4)
    • cmd/cdpgen: Add initialisms (a69e549)
    • cmd/cdpgen: Handle project outside GOPATH (a7ff101)

    dependencies
    opened by dependabot[bot] 0
  • build(deps): bump github.com/golang/snappy from 0.0.3 to 0.0.4

    Bumps github.com/golang/snappy from 0.0.3 to 0.0.4.

    dependencies
    opened by dependabot[bot] 0