
Overview

crawlergo


A powerful browser crawler for web vulnerability scanners

English Document | 中文文档

crawlergo is a browser crawler that uses chrome headless mode for URL collection. It hooks key positions throughout the page during the DOM rendering stage, automatically fills and submits forms, intelligently triggers JS events, and collects as many entry points exposed by the website as possible. The built-in URL de-duplication module filters out large numbers of pseudo-static URLs while maintaining fast parsing and crawling speeds on large websites, producing a high-quality collection of request results.

crawlergo currently supports the following features:

  • chrome browser environment rendering
  • Intelligent form filling and automated submission
  • Full DOM event collection with automated triggering
  • Smart URL de-duplication that removes most duplicate requests
  • Intelligent analysis of web pages and collection of URLs, including JavaScript file content, page comments, robots.txt files, and automatic fuzzing of common paths
  • Host binding support, with the Referer header automatically fixed and added
  • Browser request proxy support
  • Pushing results to passive web vulnerability scanners

Screenshot

Installation

Please read and confirm the disclaimer carefully before installing and using.

Build

cd crawlergo/cmd/crawlergo
go build crawlergo_cmd.go
  1. crawlergo only requires a chrome environment to run; go to download the latest version of chromium, or just click to download the Linux build of version 79.
  2. Go to the download page for the latest version of crawlergo and extract it to any directory. If you are on Linux or macOS, give crawlergo executable permission (+x), as shown in the example below.
  3. Or you can modify the code and build it yourself.
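
For example, assuming the release binary was extracted into the current directory (the filename may differ per release), granting executable permission looks like:

chmod +x ./crawlergo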

If you are using a Linux system and chrome reports missing dependencies, please see the TroubleShooting section below.

Quick Start

Go!

Assuming your chromium installation directory is /tmp/chromium/, open up to 10 tabs at the same time and crawl testphp.vulnweb.com:

./crawlergo -c /tmp/chromium/chrome -t 10 http://testphp.vulnweb.com/

Using Proxy

./crawlergo -c /tmp/chromium/chrome -t 10 --request-proxy socks5://127.0.0.1:7891 http://testphp.vulnweb.com/

Calling crawlergo with python

By default, crawlergo prints the results directly to the screen. Next we set the output mode to json; sample code for calling it from Python is as follows:

#!/usr/bin/python3
# coding: utf-8

import simplejson
import subprocess


def main():
    target = "http://testphp.vulnweb.com/"
    cmd = ["./crawlergo", "-c", "/tmp/chromium/chrome", "-o", "json", target]
    rsp = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    output, error = rsp.communicate()
	#  "--[Mission Complete]--"  is the end-of-task separator string
    result = simplejson.loads(output.decode().split("--[Mission Complete]--")[1])
    req_list = result["req_list"]
    print(req_list[0])


if __name__ == '__main__':
    main()

Crawl Results

When the output mode is set to json, the returned result, after JSON deserialization, contains four parts:

  • all_req_list: All requests found during this crawl task, including resources of any type from other domains.
  • req_list: The same-domain results of this crawl task, de-duplicated for pseudo-static URLs and without static resource links. It is a subset of all_req_list (see the extraction sketch after this list).
  • all_domain_list: List of all domains found.
  • sub_domain_list: List of subdomains found.
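
As a hedged illustration of working with req_list, the sketch below extends the Python example above and writes the de-duplicated same-domain requests to a text file. The method and url keys are assumptions about the request dict layout, so verify them against print(req_list[0]) first.

#!/usr/bin/python3
# coding: utf-8

import simplejson
import subprocess


def dump_req_urls(target, out_path="result_url.txt"):
    cmd = ["./crawlergo", "-c", "/tmp/chromium/chrome", "-o", "json", target]
    rsp = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    output, error = rsp.communicate()
    # Everything after the end-of-task separator is the JSON result
    result = simplejson.loads(output.decode().split("--[Mission Complete]--")[1])
    with open(out_path, "w") as f:
        for req in result["req_list"]:
            # "method" and "url" are assumed keys; inspect a real entry to confirm
            f.write("%s %s\n" % (req.get("method", "GET"), req["url"]))


if __name__ == '__main__':
    dump_req_urls("http://testphp.vulnweb.com/")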

Examples

crawlergo returns the full request and URL, which can be used in a variety of ways:

  • Used in conjunction with other passive web vulnerability scanners

    First, start a passive scanner and set the listening address to: http://127.0.0.1:1234/

    Next, assuming crawlergo is on the same machine as the scanner, start crawlergo and set the parameters (a complete example command is shown after this list):

    --push-to-proxy http://127.0.0.1:1234/

  • Host binding (not available on newer chrome versions) (example)

  • Custom Cookies (example)

  • Regularly clean up zombie processes generated by crawlergo (example), contributed by @ring04h
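
For instance, reusing the chromium path and tab count from the Quick Start section, the full push-to-proxy command would look like:

./crawlergo -c /tmp/chromium/chrome -t 10 --push-to-proxy http://127.0.0.1:1234/ http://testphp.vulnweb.com/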

Bypass headless detect

crawlergo can bypass headless mode detection by default.

https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html

TroubleShooting

  • 'Fetch.enable' wasn't found

    Fetch is a feature supported by newer versions of chrome. If this error occurs, your chrome version is too old; please upgrade it.

  • chrome fails to run due to missing dependencies such as xxx.so

    // Ubuntu
    apt-get install -yq --no-install-recommends \
         libasound2 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 \
         libexpat1 libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 \
         libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 \
         libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 libnss3
         
    // CentOS 7
    sudo yum install pango.x86_64 libXcomposite.x86_64 libXcursor.x86_64 libXdamage.x86_64 libXext.x86_64 libXi.x86_64 \
         libXtst.x86_64 cups-libs.x86_64 libXScrnSaver.x86_64 libXrandr.x86_64 GConf2.x86_64 alsa-lib.x86_64 atk.x86_64 gtk3.x86_64 \
         ipa-gothic-fonts xorg-x11-fonts-100dpi xorg-x11-fonts-75dpi xorg-x11-utils xorg-x11-fonts-cyrillic xorg-x11-fonts-Type1 xorg-x11-fonts-misc -y
    
    sudo yum update nss -y
  • Running reports Navigation timeout / browser not found, or you don't know the correct browser executable path

    Make sure the browser executable path is configured correctly: type chrome://version in the address bar and find the executable file path (see also the hint below).
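
On Linux, if chrome was installed from a distribution package, a quick shell lookup may also reveal the path; the exact command name varies by distribution, so treat these as candidates to try:

which google-chrome google-chrome-stable chromium chromium-browser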

Parameters

Required parameters

  • --chromium-path Path, -c Path The path to the chrome executable. (Required)

Basic parameters

  • --custom-headers Headers Customize the HTTP headers. Pass in the data after JSON serialization; the headers are defined globally and will be used for all requests (see the sketch after this list). (Default: null)
  • --post-data PostData, -d PostData POST data. (Default: null)
  • --max-crawled-count Number, -m Number The maximum number of crawl tasks, used to avoid excessively long crawling times caused by pseudo-static links. (Default: 200)
  • --filter-mode Mode, -f Mode Filtering mode. simple: only static resources and duplicate requests are filtered. smart: can also filter pseudo-static links. strict: stricter pseudo-static filtering rules. (Default: smart)
  • --output-mode value, -o value Result output mode. console: print human-readable results directly to the screen. json: print the JSON-serialized string of all results. none: don't print any output. (Default: console)
  • --output-json filepath Write the result to the specified file after JSON serializing it. (Default: null)
  • --request-proxy proxyAddress socks5 proxy address; all network requests from crawlergo and the chrome browser are sent through the proxy. (Default: null)
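
A minimal sketch of passing JSON-serialized custom headers from Python, extending the earlier subprocess example (the header values are placeholders):

#!/usr/bin/python3
# coding: utf-8

import simplejson
import subprocess

# Placeholder header values; replace with whatever your target needs
headers = {
    "User-Agent": "Mozilla/5.0 (compatible; my-scanner)",
    "Cookie": "session=xxxxxx",
}
cmd = ["./crawlergo", "-c", "/tmp/chromium/chrome", "-o", "json",
       "--custom-headers", simplejson.dumps(headers),
       "http://testphp.vulnweb.com/"]
rsp = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
output, error = rsp.communicate()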

Expand input URL

  • --fuzz-path Use the built-in dictionary for path fuzzing. (Default: false)
  • --fuzz-path-dict Customize path fuzzing by passing in a dictionary file path, e.g. /home/user/fuzz_dir.txt; each line of the file represents a path to be fuzzed. (Default: null)
  • --robots-path Parse paths from the /robots.txt file. (Default: false) An example command follows this list.
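
For example, enabling both built-in path fuzzing and robots.txt parsing against the Quick Start target (same chromium path as above):

./crawlergo -c /tmp/chromium/chrome --fuzz-path --robots-path http://testphp.vulnweb.com/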

Form auto-fill

  • --ignore-url-keywords, -iuk URL keywords that you don't want to visit, generally used to exclude logout links when custom cookies are set. Usage: -iuk logout -iuk exit. (Default: "logout", "quit", "exit")
  • --form-values, -fv Customize the form fill values, set by text type. Supported types: default, mail, code, phone, username, password, qq, id_card, url, date and number. The text type is identified from the four attribute values id, name, class and type of the input element. For example, to fill mailbox inputs with A and password inputs with B: -fv mail=A -fv password=B. Here default is the fill value used when the text type is not recognized, which is "Cralwergo". (Default: Cralwergo)
  • --form-keyword-values, -fkv Customize the form fill values, set by keyword fuzzy match against the four attribute values id, name, class and type of the input element. For example, to fill 123456 wherever the pass keyword matches and admin wherever the user keyword matches: -fkv user=admin -fkv pass=123456 (a combined example follows this list). (Default: Cralwergo)
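
Putting the form options together with the logout exclusion, a command using the illustrative values from the bullets above (the mail address is just a placeholder) might look like:

./crawlergo -c /tmp/chromium/chrome -fv mail=test@example.com -fv password=123456 -fkv user=admin -iuk logout http://testphp.vulnweb.com/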

Advanced settings for the crawling process

  • --incognito-context, -i Start the browser in incognito mode. (Default: true)
  • --max-tab-count Number, -t Number The maximum number of tabs the crawler can open at the same time. (Default: 8)
  • --tab-run-timeout Timeout Maximum runtime for a single tab page. (Default: 20s)
  • --wait-dom-content-loaded-timeout Timeout The maximum timeout to wait for the page to finish loading. (Default: 5s)
  • --event-trigger-interval Interval The interval between automatically triggered events, generally used when the target network is slow and DOM update conflicts cause URLs to be missed. (Default: 100ms)
  • --event-trigger-mode Value The DOM event auto-trigger mode, either async or sync, used when DOM update conflicts cause URLs to be missed. (Default: async)
  • --before-exit-delay The delay before closing chrome at the end of a single tab task, used to wait for remaining DOM updates and XHR requests to be captured (a combined example follows this list). (Default: 1s)
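
For a slow target, a hedged combination of these knobs (durations use the same format as the defaults above) might look like:

./crawlergo -c /tmp/chromium/chrome -t 4 --tab-run-timeout 40s --wait-dom-content-loaded-timeout 10s --event-trigger-interval 200ms --before-exit-delay 2s http://testphp.vulnweb.com/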

Other

  • --push-to-proxy The listener address to which crawler results are pushed, usually the listener address of a passive scanner. (Default: null)
  • --push-pool-max The maximum concurrency when pushing crawler results to the listener address. (Default: 10)
  • --log-level Logging levels, debug, info, warn, error and fatal. (Default: info)
  • --no-headless Turn off chrome headless mode to visualize the crawling process. (Default: false)

Follow me

Weibo: @9ian1i | Twitter: @9ian1i

Related article: A browser crawler practice for web vulnerability scanning

Issues
  • push to proxy not working

    I tried pushing the requests to the proxy, but it's not working.

    https://user-images.githubusercontent.com/87262382/173264330-89be5cff-aa97-4f17-a30d-30d5d5b8437d.mp4

    opened by melmel27 8
  • crawlergo doesn't work with some sites like twitter

    Hey Qianlitp, thank you for this really awesome tool.

    Here is the whole problem in one picture:

    (screenshot)

    Also, when I tried to close the running crawlergo with Ctrl+C, it didn't let me exit the process.

    opened by amammad 6
  • Save request results directly to a file

    Hi, great browser crawler! Could the crawler's request results be saved directly to a file? For example, saving the crawled request URLs directly to result_url.txt.

    I tried: 1. -o json, which only prints the JSON form of req_list to the console and is inconvenient for data processing. 2. --output-json result.json, which only writes the output of attempt 1 to a file.

    If the results could be saved directly to a file, data processing would be much easier, e.g. --output-req-url result_url.txt: ./crawlergo -c /bin/chrome -t 5 http://testphp.vulnweb.com/ --output-req-url result_url.txt

    It would be much more convenient if the four kinds of results mentioned in the readme could each be saved to a file: all_req_list: all requests found during this crawl task, including resources of any type from other domains. req_list: the same-domain results of this crawl task, de-duplicated for pseudo-static URLs, without static resource links; in theory a subset of all_req_list. all_domain_list: list of all domains found. sub_domain_list: list of subdomains of the target found.

    It seems this can be handled by calling crawlergo from Python, but that is not convenient enough.

    // For reference, rad can save the requested URLs or the full requests directly to a file.

    feature 
    opened by theLSA 5
  • Running on Windows, every site reports a timeout error

    $ crawlergo.exe -c .\GoogleChromePortable64\GoogleChromePortable.exe http://www.baidu.com
    Crawling GET https://www.baidu.com/
    Crawling GET http://www.baidu.com/
    time="2019-12-31T10:56:43+08:00" level=error msg="navigate timeout chrome failed to start:\n"
    time="2019-12-31T10:56:43+08:00" level=error msg="https://www.baidu.com/"
    time="2019-12-31T10:56:43+08:00" level=debug msg="all navigation tasks done."
    time="2019-12-31T10:56:43+08:00" level=error msg="navigate timeout chrome failed to start:\n"
    time="2019-12-31T10:56:43+08:00" level=error msg="http://www.baidu.com/"
    time="2019-12-31T10:56:43+08:00" level=debug msg="get comment nodes err"
    time="2019-12-31T10:56:43+08:00" level=debug msg="all navigation tasks done."
    time="2019-12-31T10:56:43+08:00" level=debug msg="invalid target"
    time="2019-12-31T10:56:43+08:00" level=debug msg="get comment nodes err"
    time="2019-12-31T10:56:43+08:00" level=debug msg="invalid target"
    --[Mission Complete]--
    GET http://www.baidu.com/ HTTP/1.1
    User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.0 Safari/537.36
    Spider-Name: crawlergo-0KeeTeam

    GET https://www.baidu.com/ HTTP/1.1
    Spider-Name: crawlergo-0KeeTeam
    User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.0 Safari/537.36

    opened by littleheary 5
  • My current approach is concatenation, e.g. for http://www.A.com with two known paths /path_a and /path_b

    My current approach is concatenation: for example, for http://www.A.com with two known paths /path_a and /path_b, the command becomes: crawlergo -c chrome http://www.A.com/ http://www.A.com/path_a http://www.A.com/path_b

    There are two problems:

    1. If there are many known paths, manual concatenation is cumbersome.
    2. Does passing multiple URLs this way give the same results as running them one by one, or is there a difference? I haven't verified this.

    Of course, it would be best if a parameter supporting multiple paths as entry points could be added later.

    Originally posted by @djerrystyle in https://github.com/0Kee-Team/crawlergo/issues/31#issuecomment-591703807

    opened by asdfasadfasfa 4
  • URLs with a port return results instantly

    After running ./crawlergo_linux -c chrome-linux/chrome -output-mode json http://A.B.com:80/, it instantly returns:

    --[Mission Complete]--
    {"req_list":null,"all_domain_list":[xxxxx],"all_req_list":[xxxxx]}
    

    But with: ./crawlergo_linux -c chrome-linux/chrome -output-mode json http://A.B.com/

    Crawling GET http://A.B.com/
    DEBU[0000] 
    DEBU[0006] context deadline exceeded
    --[Mission Complete]--
    {"req_list":[xxxxx],"all_domain_list":[xxxxx],"sub_domain_list":[xxxxx]}
    
    bug 
    opened by djerryz 4
  • Could setting default parameter values be supported?

    From my testing so far, if I set the POST data to 'username=admin&password=password', it is only tried once, and other parameters that appear on the same page are ignored; subsequent username/password fields keep using the default KeeTeam value, and so on. Could it be supported that, once username=admin is set, every place where username appears uses admin instead of KeeTeam? Similarly for password.

    feature 
    opened by ufo009e 3
  • A site takes too long to crawl

    The target site is https://www.che168.com/. It has been crawling for two days and still hasn't finished, so I hope the author can help look into the cause. Because crawlergo is chained into a program I wrote myself, the program keeps crawling and cannot finish. How can I limit the maximum crawl time or depth?

    Some of the crawled URLs are as follows:

    http://www.che168.com/suihua/suva0-suva-suvb-suvc-suvd/0_8/a0_0msdgscncgpi1ltocsp1exa16/#pvareaid=100943
    https://www.che168.com/china/baoma/baoma5xi/0_5/a3_8msdgscncgpi1ltocspexx0a1/#pvareaid=108403%23seriesZong
    http://www.che168.com/jiangsu/suva0-suva-suvb-suvc-suvd/0_8/a0_0msdgscncgpi1ltocsp1exa16/
    http://www.che168.com/nanjing/suva0-suva-suvb-suvc-suvd/0_8/a0_0msdgscncgpi1ltocsp1exa16/#pvareaid=100943
    https://www.che168.com/china/aodi/aodia6l/0_5/a3_8msdgscncgpi1ltocspexx0a1/#pvareaid=108403%23seriesZong
    http://www.che168.com/xuzhou/suva0-suva-suvb-suvc-suvd/0_8/a0_0msdgscncgpi1ltocsp1exa16/#pvareaid=100943
    http://www.che168.com/wuxi/suva0-suva-suvb-suvc-suvd/0_8/a0_0msdgscncgpi1ltocsp1exa16/#pvareaid=100943
    https://www.che168.com/china/baoma/baoma3xi/0_5/a3_8msdgscncgpi1ltocspexx0a1/#pvareaid=108403%23seriesZong
    http://www.che168.com/changzhou/suva0-suva-suvb-suvc-suvd/0_8/a0_0msdgscncgpi1ltocsp1exa16/#pvareaid=100943
    http://www.che168.com/suzhou/suva0-suva-suvb-suvc-suvd/0_8/a0_0msdgscncgpi1ltocsp1exa16/#pvareaid=1009
    
    opened by djerryz 3
  • crawlergo exits immediately

    The downloaded 0.12 release is only 5.1 MB; I'm not sure whether too much was stripped out. It exits immediately after running:

    ➜ crawlergo mv ~/Downloads/crawlergo ./
    ➜ crawlergo chmod +x crawlergo
    ➜ crawlergo ./crawlergo
    [1] 9838 killed ./crawlergo
    ➜ crawlergo ./crawlergo -h
    [1] 9845 killed ./crawlergo -h
    ➜ crawlergo ./crawlergo
    [1] 9852 killed ./crawlergo
    ➜ crawlergo

    bug 
    opened by 0xa-saline 3
  • navigate timeout context deadline exceeded

    Running ./crawlergo -c /usr/bin/google-chrome-stable -t 20 http://testphp.vulnweb.com/

    (screenshot)

    Only one URL was crawled from the passed-in target: GET http://testphp.vulnweb.com/search.php?test=query release

    Distributor ID:	CentOS
    Description:	CentOS Linux release 7.6.1810 (Core) 
    Release:	7.6.1810
    Codename:	Core
    
    opened by Stu2014 3
  • Crashes immediately with an error

    panic: runtime error: invalid memory address or nil pointer dereference
    [signal SIGSEGV: segmentation violation code=0x1 addr=0x38 pc=0x16b9e8f]

    goroutine 975 [running]:
    ioscan-ng/src/tasks/crawlergo/engine.(*Tab).InterceptRequest(0xc00062c1c0, 0xc0005e5d80)
        D:/go_projects/ioscan-ng/src/tasks/crawlergo/engine/intercept_request.go:42 +0x25f
    created by ioscan-ng/src/tasks/crawlergo/engine.NewTab.func1
        D:/go_projects/ioscan-ng/src/tasks/crawlergo/engine/tab.go:90 +0x2e8

    bug 
    opened by 0xa-saline 3
  • [code review] The code block that blocks static resources

    Problem description

    For the static-resource check, I am not sure whether this is a design decision or a bug, so I'd like to ask the author.

    Code block in question

    https://github.com/Qianlitp/crawlergo/blob/551acb2b75403985493b56414d797ce5a1da480f/pkg/engine/intercept_request.go#L52-L53

    Question: this code matches not only /path/to/{filename}.{extension} but also paths like /path/to/modifyPng. The latter may well be a normal web page, and filtering it out can cause it to be missed by the crawl. Isn't this matching rule too strict?

    The referenced config.StaticSuffix is as follows:

    https://github.com/Qianlitp/crawlergo/blob/551acb2b75403985493b56414d797ce5a1da480f/pkg/config/config.go#L71-L79

    opened by PIGfaces 1
  • [Bug] Repeatedly resubmitted blocked requests cause the navigation page to time out

    Problem description

    When crawling the target site https://19.offcn.com/, some links are not fully crawled. After turning off headless mode and opening the chrome developer tools, I found that failed requests keep being resubmitted, blocking other requests; this makes the whole page rendering time out and causes links to be missed.

    Console when the page is opened normally

    (screenshot)

    Console when crawled by crawlergo in headless mode

    (screenshot)

    Steps to reproduce

    Version

    • Commit Version: 551acb2b75403985493b56414d797ce5a1da480f
    • Browser: 1.39.122 Chromium: 102.0.5005.115 (official build) (arm64)

    Command executed

     ./crawlergo -m 2 -c **** --no-headless https://19.offcn.com/
    

    Expected behavior

    • The page finishes loading

    Actual behavior

    • The page cannot be loaded completely, so the following <div> is never rendered and is missed
    						<div class="zg_personal already_login" style="display: none">
    							<p class="zg_personalP"><strong><img src=""/></strong><i></i></p>
    							<div class="zg_person_list" style="display: none;">
    							<em>&nbsp;</em>
    							<a href="/mycourse/index/">我的课程</a>
    							<a href="/svipcourse/">学员专享</a>
    							<a href="/orders/myorders/">我的订单</a>
    							<a href="/mycoupon/index/">我的优惠券</a>
    							<a href="/user/index/">账号设置</a>
    							<a href="/foreuser/outlogin/">退出登录</a>
    							</div>
    						</div>
    

    When opened normally the page loads quickly, but with crawlergo the load time is too long. Is this a bug? All proxies have been turned off.

    bug 
    opened by PIGfaces 1
  • http://dvwa.bihuo.cn/login.php does not crawl the POST to login.php

    Site: http://dvwa.bihuo.cn/login.php contains a POST to login.php that is not captured. (screenshot)

    ./crawlergo --chromium-path=/opt/google/chrome/chrome http://dvwa.bihuo.cn/login.php --no-headless --log-level=debug (screenshot)

    opened by chushuai 2
  • [code review] A puzzling code segment in the strategy for computing the unique URL path value

    Many thanks to the author for providing such a great tool.

    Description

    While reading the source code around smart_filter.go#L576, I had some questions about this code segment:

    /**
    Compute the unique request ID after marking
    */
    func (s *SmartFilter) getMarkedUniqueID(req *model.Request) string {
            ......
    	if req.URL.Fragment != "" && strings.HasPrefix(req.URL.Fragment, "/") {
    		uniqueStr += req.URL.Fragment
    	}
            ......
    }
    
    1. What is the reason for adding the fragment when computing the final unique value of the URL, given that the fragment is never marked during the marking stage? Is this a bug, similar to #70?
    2. Could this code be removed, or could a mark for the fragment be added? What is the reason this code exists?

    Thanks again to the author for providing such a good tool and for taking time out of a busy schedule to answer; I hope to receive the author's guidance.

    bug 
    opened by PIGfaces 1
  • Output Request Response to ".txt" File

    First of all this tool is fantastic!

    The primary reason I would like to switch to this tool is the DOM Rendering. Would you be able to incorporate a feature that outputs http.response to ".txt" files after the dom has been rendered? A lot of tools like httpx lack this feature and it would be super nice for indexing.

    opened by gromhacks 1