Fast, highly configurable, cloud-native dark web crawler.

Overview


Bathyscaphe is a fast, highly configurable, cloud-native dark web crawler written in Go.

How to start the crawler

To start the crawler, just run the following command:

$ ./scripts/docker/start.sh

and wait for all containers to start.

Notes

  • You can start the crawler in detached mode by passing --detach to start.sh, as shown below.
  • Ensure you have at least 3 GB of memory available, as the Elasticsearch stack alone will require 2 GB.
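For example, to start the whole stack in the background and detach from the container logs:

$ ./scripts/docker/start.sh --detach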

How to initiate crawling

Use the RabbitMQ dashboard available at localhost:15003 and publish a new JSON object to the crawlingQueue queue.

The object should look like this:

{
  "url": "https://facebookcorewwwi.onion"
}
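If you prefer the command line over the dashboard, here is a minimal sketch using the RabbitMQ management HTTP API. It assumes the management API is exposed on the same port as the dashboard (15003) and that the default guest/guest credentials are unchanged:

$ curl -u guest:guest \
    -H "Content-Type: application/json" \
    -X POST "http://localhost:15003/api/exchanges/%2F/amq.default/publish" \
    -d '{"properties":{},"routing_key":"crawlingQueue","payload":"{\"url\": \"https://facebookcorewwwi.onion\"}","payload_encoding":"string"}'

A response of {"routed":true} means the message reached the queue.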

How to speed up crawling

To speed up crawling, you can scale the crawler component to run several instances in parallel. This can be done by issuing the following command after the crawler has started:

$ ./scripts/docker/start.sh -d --scale crawler=5

This sets the number of crawler instances to 5.

How to view results

You can use the Kibana dashboard available at http://localhost:15004. You will need to create an index pattern named 'resources', and when asked for the time field, choose 'time'.
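If you want to peek at the raw documents without Kibana, a quick sketch is to query Elasticsearch directly. This assumes Elasticsearch's port 9200 is published on the host and that the index behind the 'resources' pattern is simply named resources:

$ curl "http://localhost:9200/resources/_search?size=5&pretty"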

How to hack the crawler

If you've made a change to one of the crawler components and wish to use the updated version when running start.sh, just issue the following command:

$ goreleaser --snapshot --skip-publish --rm-dist

This rebuilds all images from your local changes. After that, run start.sh again to use the updated version.

Architecture

The architecture details are available here.

Comments
  • Elasticsearch Crashing with Code 127

    I'm running this on a Google Cloud Platform compute instance with 8GB RAM and 2 cores.

    When I open the Kibana dashboard and create a canvas with a data table of the crawled content from resources *, it appears to lag for a brief moment and then gives me 401 unauthorized errors.

    In the console, I see that docker_elasticsearch_1 exited with code 127

    Memory usage at the time of crash doesn't seem to be high either, with around 2/8GB RAM being used.

    scheduler_1      | time="2020-09-07T03:53:12Z" level=info msg="Successfully initialized tdsh-scheduler. Waiting for URLs"
    torproxy_1       | WARNING: no logs are available with the 'none' log driver
    // After opening kibana dashboard and waiting about 20 seconds
    docker_elasticsearch_1 exited with code 127
    
    bug 
    opened by GaganBhat 10
  • Kibana server is not ready yet

    I tried the new project but can't get past "Kibana server is not ready yet". I used the packaged build and start scripts. Are there additional steps or an installation guide somewhere?

    Edit: Everything appeared to start OK, here's my output:

    Starting deployments_nats_1          ... done
    Starting deployments_elasticsearch_1 ... done
    Starting deployments_torproxy_1      ... done
    Starting deployments_scheduler_1     ... done
    Starting deployments_crawler_1       ... done
    Starting deployments_kibana_1        ... done
    Starting deployments_api_1           ... done
    Starting deployments_persister_1     ... done
    Attaching to deployments_torproxy_1, deployments_nats_1, deployments_elasticsearch_1, deployments_scheduler_1, deployments_crawler_1, deployments_api_1, deployments_kibana_1, deployments_persister_1
    torproxy_1       | WARNING: no logs are available with the 'none' log driver
    nats_1           | WARNING: no logs are available with the 'none' log driver
    elasticsearch_1  | WARNING: no logs are available with the 'none' log driver
    scheduler_1      | time="2020-08-05T21:39:31Z" level=info msg="Starting trandoshan-scheduler v0.0.1"
    scheduler_1      | time="2020-08-05T21:39:31Z" level=debug msg="Using NATS server at: nats"
    scheduler_1      | time="2020-08-05T21:39:31Z" level=debug msg="Using API server at: http://api:8080"
    scheduler_1      | time="2020-08-05T21:39:31Z" level=info msg="Successfully initialized trandoshan-scheduler. Waiting for URLs"
    crawler_1        | time="2020-08-05T21:39:32Z" level=info msg="Starting trandoshan-crawler v0.0.1"
    crawler_1        | time="2020-08-05T21:39:32Z" level=debug msg="Using NATS server at: nats"
    crawler_1        | time="2020-08-05T21:39:32Z" level=debug msg="Using TOR proxy at: torproxy:9050"
    crawler_1        | time="2020-08-05T21:39:32Z" level=info msg="Successfully initialized trandoshan-crawler. Waiting for URLs"
    api_1            | {"time":"2020-08-05T21:39:33.269084605Z","level":"INFO","prefix":"echo","file":"api.go","line":"73","message":"Starting trandoshan-api v0.0.1"}
    api_1            | {"time":"2020-08-05T21:39:33.269182929Z","level":"DEBUG","prefix":"echo","file":"api.go","line":"75","message":"Using elasticsearch server at: http://elasticsearch:9200"}
    api_1            | {"time":"2020-08-05T21:39:33.295324468Z","level":"INFO","prefix":"echo","file":"api.go","line":"88","message":"Successfully initialized trandoshan-api. Waiting for requests"}
    api_1            | ⇨ http server started on [::]:8080
    kibana_1         | WARNING: no logs are available with the 'none' log driver
    persister_1      | time="2020-08-05T21:39:34Z" level=info msg="Starting trandoshan-persister v0.0.1"
    persister_1      | time="2020-08-05T21:39:34Z" level=debug msg="Using NATS server at: nats"
    persister_1      | time="2020-08-05T21:39:34Z" level=debug msg="Using API server at: http://api:8080"
    persister_1      | time="2020-08-05T21:39:34Z" level=info msg="Successfully initialized trandoshan-persister. Waiting for resources"
    deployments_elasticsearch_1 exited with code 1

    documentation 
    opened by FFrozTT 10
  • Problems crawling torch results pages

    For some reason the crawler is not parsing Torch results pages correctly because none of the links end up being scheduled or crawled.

    e.g.

    http://xmh57jrzrnw6insl.onion/4a1f6b371c/search.cgi?s=DRP&q=irc&cmd=Search%21

    Should return plenty of results.

    I have no idea why; the only thing remotely interesting is that there is an iframe at the start of the page. I think I've seen this with some other pages too, but I can't remember what they were.

    wontfix 
    opened by FFrozTT 9
  • Duplicate URLs in ElasticSearch DB

    Hi there,

    I've been playing with this Tor crawler for some time and generally it works pretty well. However, I have a problem with duplicate URLs. It has been running for 4 days and has recorded over 4000 hits, but the count of unique URLs is only around 1000.

    I noticed that there is a query method in the scheduler that asks the Elasticsearch DB whether a found URL already exists.

    b64URI := base64.URLEncoding.EncodeToString([]byte(normalizedURL.String()))
    apiURL := fmt.Sprintf("%s/v1/resources?url=%s", apiURI, b64URI)
    
    var urls []proto.ResourceDto
    r, err := httpClient.JSONGet(apiURL, &urls)
    ...
    if len(urls) == 0 {
    ...
    

    I've copied this method to the crawler and persister as well to check to-do URLs and resource URLs. However, it still only finds around 1000 unique URLs out of over 4000 hits.

    Does anyone have any idea of how to fix this problem? Any hint would be greatly appreciated.

    bug 
    opened by ht-weng 6
  • Add switches to docker-compose allowing detaching

    Add the -t and -i switches to the docker command to allow detaching. Right now I have to restart the whole project if I want to attach/detach.

    Alternatively, you could assign unique detach keys:

    --detach-keys "ctrl-a,a"

    Sorry, I was trying to add an "Enhancement" label but I don't think I can.

    question 
    opened by FFrozTT 5
  • Updated extractor to fix doubling up on same URL

    It seems that the extractor saves the URL without the protocol, but the scheduler searches for the URL with the protocol attached.

    extractor.go:

    resDto := api.ResourceDto{
        URL:   protocolRegex.ReplaceAllLiteralString(msg.URL, ""), // here it sanitizes it..
        Title: extractTitle(msg.Body),
        Body:  msg.Body,
        Time:  time.Now(),
    }

    scheduler.go:

        _, count, err := apiClient.SearchResources(u.String(), "", time.Time{}, endDate, 1, 1)

    bug 
    opened by smit1759 4
  • Elasticsearch: max virtual memory areas vm.max_map_count [xxxx] is too low, increase to at least [xxxx]

    I'm seeing a lot more results than before, but also API errors on URLs I know to be working:

    time="2020-08-10T20:40:42Z" level=error msg="Error getting response from ES: dial tcp: lookup elasticsearch on 127.0.0.11:53: no such host"

    bug 
    opened by FFrozTT 4
  • Build Error invalid argument "creekorful/"

    Hi, our Docker version is 19.03.12. We tried to build trandoshan using the build.sh script, but it gives the following error:

    invalid argument "creekorful/" for "-t, --tag" flag: invalid reference format See 'docker build --help'.

    Where can we find the documentation on this project? Thanks

    opened by blueteamzone 3
  • Request to Elasticsearch failed: {"error":{}}

    So I've had all containers running overnight without exiting, and there is certainly a lot of activity, but something doesn't seem quite right between Kibana and Elasticsearch. Kibana is only showing me 8 entries and giving this error:

    Request to Elasticsearch failed: {"error":{}}
    
    Error: Request to Elasticsearch failed: {"error":{}}
        at http://x.x.x.x:15004/bundles/commons.bundle.js:3:4900279
        at Function._module.service.Promise.try (http://x.x.x.x:15004/bundles/commons.bundle.js:3:2504083)
        at http://x.x.x.x:15004/bundles/commons.bundle.js:3:2503457
        at Array.map (<anonymous>)
        at Function._module.service.Promise.map (http://x.x.x.x:15004/bundles/commons.bundle.js:3:2503414)
        at callResponseHandlers (http://x.x.x.x:15004/bundles/commons.bundle.js:3:4898793)
        at http://x.x.x.x:15004/bundles/commons.bundle.js:3:4881154
        at processQueue (http://x.x.x.x:15004/built_assets/dlls/vendors.bundle.dll.js:435:204190)
        at http://x.x.x.x:15004/built_assets/dlls/vendors.bundle.dll.js:435:205154
        at Scope.$digest (http://x.x.x.x:15004/built_assets/dlls/vendors.bundle.dll.js:435:215159)
    
    bug 
    opened by FFrozTT 3
  • %!s(<nil>) response

    I'm having that previous issue again with the new build:

    scheduler_1 | time="2020-08-08T17:17:49Z" level=debug msg="Processing URL: https://www.facebookcorewwwi.onion"
    api_1       | time="2020-08-08T17:17:49Z" level=debug msg="Successfully published URL: https://www.facebookcorewwwi.onion"
    api_1       | time="2020-08-08T17:17:49Z" level=error msg="Error getting response: %!s(<nil>)"
    scheduler_1 | time="2020-08-08T17:17:49Z" level=error msg="Error while searching URL: %!s(<nil>)"
    scheduler_1 | time="2020-08-08T17:17:49Z" level=error msg="Received status code: 500"

    bug 
    opened by FFrozTT 3
  • Investigate redis memory optimization

    On my test instance:

    • 54,199,819 keys
    • avg 200 bytes / key

    => 10,839,963,800 bytes ≈ 10 GB


    127.0.0.1:6379> MEMORY USAGE url:[scrubbed]
    (integer) 200
    
    optimization 
    opened by creekorful 2
Releases: v1.0.0

Owner: Darkspot
Building products to analyse the dark places of the internet.