DataHen Till is a standalone tool that instantly makes your existing web scraper scalable, maintainable, and more unblockable, with minimal code changes to your scraper.

Overview

DataHen Till is a standalone tool that runs alongside your web scraper and makes it more scalable, maintainable, and harder to block. It integrates with your existing scraper with minimal code changes: in most cases you only need to point the scraper at Till as its proxy.

Till was architected to follow the best practices that DataHen has accumulated over years of scraping at massive scale.

[Figure: How it works]

Problems with Web Scraping

Web scraping is usually easy to get started with, especially on a small scale. However, it gets much harder as you scale up. Scraping 10,000 records can easily be done with a simple script in any programming language, but once you try to scrape millions of pages, you need to architect and build features into your scraper that allow you to scale, maintain, and unblock it.

DataHen Till solves the following problems:

Scaling your scraper

Scraping millions or even billions of records requires much more planning. It is not simply a matter of running your existing web scraper script on a machine with more CPU and RAM. More thought is needed, such as:

  • How to log massive amounts of HTTP requests.
  • How to troubleshoot HTTP requests when they fail at scale.
  • How to minimize bandwidth usage.
  • How to rotate proxy IPs.
  • How to handle anti-scrapers.
  • What happens when a scraper fails.
  • How to resume scrapers after they are fixed.
  • etc.

Till provides a plug-and-play way to make your web scrapers scalable and maintainable, following the best practices at DataHen that make web scraping a pleasant experience.

Blocked scraper

As you scale up the number of requests, target websites will quite often detect your scraper and try to block it with CAPTCHAs, throttling, or outright denial of your requests.

Till helps you avoid being detected as a web scraper by making your scraper look like a real web browser. It does this by generating random user-agent headers and randomizing proxy IPs (that you supply) on every HTTP request.

Till also makes it easier to troubleshoot why the target website blocked your scraper.

Scraper Maintenance

Maintaining high-scale scrapers is challenging because of the massive volume of requests and interactions between your scrapers and the target websites. For operations to run smoothly, you need to think through how you will maintain your scrapers on a regular basis.

You need to know how to raise and triage errors as they occur in your scrapers. Not all web scraping errors should be treated equally: some can be ignored, and some are urgent. So you need to work out the details of your "development-deployment-maintenance" process.

Till solves this by logging all your HTTP requests and categorizing each one as a success (2XX status) or a failure (non-2XX status). Till also provides a Web UI to analyze the request history and make sense of what happened during your scraping process.

Till makes scraper maintenance even easier by assigning each request a unique Global ID (GID) that is derived from the request's URL, method, body, etc. You can then use this GID to find where your scraper went wrong.

Postmortem analysis & reproducibility

The biggest difficulty facing any web scraper developer is dealing with scraping failures. Your scraper fails when fetching or parsing certain URLs, but when you look at the target website and URLs, everything looks fine. How do you troubleshoot something that has already happened? How do you reproduce that failed scrape so that you can fix the issue?

Till stores all HTTP requests and their responses (including the response body/content) in a local cache. If your scraper encounters an error at any time, you can use the request's GID (the Global ID that Till assigns to every request) to find the request and its actual response and content in the cache. This lets you analyze what went wrong with that particular request.

Starting over from scratch when it fails mid-way

Websites change all the time and without notice. Imagine running your web scraper for a week and then suddenly, somewhere along the way, it fails. Frustratingly, once you've fixed the scraper, there is a high chance that you'll need to start over from scratch. On top of this, there are additional consequences, such as delays and further charges for proxy usage, bandwidth, storage, VM costs, etc.

Till solves this by allowing you to replay your scrapers without actually resending the HTTP requests to the target server. Till does this by assigning each HTTP request its own unique Global ID (GID) that is generated from the request's URL, method, headers, etc. It then stores all HTTP responses in the cache, keyed by their GID.

When you restart your scraper, the scraping process can run much faster because Till now serves the cached version of the HTTP responses, and your existing web scraper code stays the same.

Features

User-Agent randomizer

Till automatically generates a random user-agent on every request. You can choose to identify your scraper as a desktop browser or a mobile browser, or override it with your own custom user-agent.

Proxy IP address rotation

Supply a list of proxy IPs, and Till will randomly pick one for every request. This saves you the time of setting up a separate proxy rotation service.
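As a rough illustration of what per-request rotation means, here is a minimal conceptual sketch. It is not Till's actual implementation: the proxy addresses are placeholders and `pick_proxy` is a hypothetical helper name.

```python
import random

# Placeholder addresses for illustration only; replace with your own proxies.
PROXIES = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]


def pick_proxy(proxies, rng=random):
    """Pick one proxy at random, independently for each request."""
    if not proxies:
        raise ValueError("proxy list is empty")
    return rng.choice(proxies)
```

Calling `pick_proxy(PROXIES)` once per outgoing request gives each request an independently chosen proxy, which is the behavior a separate rotation service would otherwise have to provide.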

Sticky Sessions

Your scraper can selectively reuse the same user-agent, proxy IP, and cookie jar across multiple requests. This lets you group your requests by workflow and helps you avoid detection by anti-scraping systems.

Managing Cookies

There is no need to build cookie management logic into your scraper code. Till can store the cookies for you so that you can easily reuse them on subsequent requests.

Request Logging

Till logs each request as either successful (2XX status code) or failed (non-2XX status code). This makes it easier to troubleshoot your scraper later.

The Till UI allows you to make sense of HTTP request history, and troubleshoot what happens during a scraping session.
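The 2XX / non-2XX split described above can be sketched in a few lines. This is only an illustration of the rule, assuming plain integer status codes; `classify_status` and `summarize` are hypothetical names, not part of Till.

```python
from collections import Counter


def classify_status(status_code: int) -> str:
    """Label a response as "success" (2XX) or "failure" (anything else)."""
    return "success" if 200 <= status_code <= 299 else "failure"


def summarize(status_codes):
    """Count successes and failures across a batch of status codes."""
    return Counter(classify_status(code) for code in status_codes)
```

Note that under this rule redirects (3XX) count as failures along with 4XX and 5XX responses, since only 2XX statuses are treated as successful.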

HTTP Caching

Till caches all of your HTTP responses (and their contents), so that as needed, your web scraper will reuse the cache without needing to do another HTTP request to the target server.

You can choose whether to accept a particular cached response by specifying how fresh the cached content must be. For example, if Till holds cached content that is 1 week old but your web scraper only accepts content up to 1 day old, Till will not serve the stale entry; it only serves cached content that is at most 1 day old.
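The freshness rule can be pictured with the following sketch. It is a simplified model under stated assumptions, not Till's code: the `CachedResponse` type, the `serve_from_cache` function, and the use of seconds as the freshness unit are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class CachedResponse:
    body: bytes
    stored_at: float  # Unix timestamp of when the response was cached


def serve_from_cache(
    entry: Optional[CachedResponse],
    max_age_seconds: float,
    now: Optional[float] = None,
) -> Optional[bytes]:
    """Return the cached body only if it is fresh enough, otherwise None."""
    if entry is None:
        return None  # nothing cached for this request
    current = time.time() if now is None else now
    age = current - entry.stored_at
    return entry.body if age <= max_age_seconds else None
```

With a week-old entry and a one-day limit, the function returns `None`, which corresponds to the case where the cached copy is too old to be served.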

[Figure: HTTP Caching Flowchart]

Global ID (GID)

Till uses the DataHen Platform's convention of marking every unique request with a signature (we call this the Global ID, or GID for short). Think of it as a checksum of the actual request.
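To make the checksum analogy concrete, here is a minimal sketch of a request signature. Till's real GID algorithm is not specified in this document, so everything here is an assumption made for illustration: the fields that are hashed, the choice of SHA-256, the `host-digest` layout, and the function name `request_signature`.

```python
import hashlib
from urllib.parse import urlsplit


def request_signature(method: str, url: str, body: bytes = b"") -> str:
    """Derive a deterministic signature from the method, URL, and body."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256()
    for part in (method.upper().encode(), url.encode(), body):
        digest.update(part)
        digest.update(b"\x00")  # separator so adjacent fields cannot run together
    return f"{host}-{digest.hexdigest()}"
```

The useful property is determinism: the same request always maps to the same signature, so it can serve as a lookup key for logs and cached responses, while any change to the method, URL, or body produces a different key.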

Any time your scraper sends a request through Till, Till returns the response with an X-DH-GID header that contains the GID. This GID lets you easily look up specific requests in the log, or their contents in the cache, when you need to troubleshoot.

How DataHen Till works

Till works as a Man In The Middle (MITM) proxy that listens to incoming HTTP(S) requests and forwards those requests to the target server as needed. While it does so, it enhances each request to avoid being detected by anti-scrapers. It also logs and caches the responses to make your scraper maintainable and scalable.

Connect your scraper to Till using the HTTP proxy settings that most programming languages and HTTP libraries support.

Your scraper will then continue to run as-is, and it will instantly become more unblockable, scalable, and maintainable.

[Figure: How it works]

Installation

Step 1: Download Till

The recommended way to install DataHen Till is by downloading one of the standalone binaries according to your OS.

Step 2: Get your auth Token

You need to get your auth token to run Till.

Get your token for FREE by signing up for an account at till.datahen.com.

Step 3: Start Till

Start the Till server with the following command:

$ till serve -t <your token here> 

The above will start the proxy on http://localhost:2933 and the Till UI on http://localhost:2980.

[Figure: Request Log UI]

Step 4: Connect to Till

You can connect your scraper to Till without many code changes.

If you want to connect to Till using curl, this is how:

$ curl -k --proxy http://localhost:2933 https://fetchtest.datahen.com/echo/request
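For a scraper written in Python, a roughly equivalent setup using only the standard library might look like the sketch below. It assumes Till is listening on the default proxy address shown above, and it disables certificate verification in the same spirit as curl's `-k` flag. The helper names `build_till_opener` and `fetch_via_till` are hypothetical.

```python
import ssl
import urllib.request

TILL_PROXY = "http://localhost:2933"  # proxy address from "Step 3: Start Till"


def build_till_opener(proxy=TILL_PROXY):
    """Build a urllib opener that sends HTTP and HTTPS traffic through Till."""
    # Equivalent of curl's -k: skip certificate verification, because Till
    # re-signs HTTPS traffic with its own CA certificate.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return urllib.request.build_opener(
        urllib.request.ProxyHandler({"http": proxy, "https": proxy}),
        urllib.request.HTTPSHandler(context=context),
    )


def fetch_via_till(url, proxy=TILL_PROXY, timeout=30):
    """Fetch url through Till; return (status, X-DH-GID value or None, body)."""
    opener = build_till_opener(proxy)
    # Note: urllib raises urllib.error.HTTPError for 4XX/5XX responses.
    with opener.open(url, timeout=timeout) as response:
        return response.status, response.headers.get("X-DH-GID"), response.read()
```

With Till running, `fetch_via_till("https://fetchtest.datahen.com/echo/request")` would mirror the curl command above and also return the GID header that Till attaches to the response.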

Certificate Authority (CA) Certificates

Till decrypts and encrypts HTTPS traffic on the fly between your scraper and the target websites. For this to work, your scraper (or browser) must trust Till's built-in Certificate Authority (CA). This means that the CA certificate Till generates for you needs to be installed on the computer where the scraper is running.

Note: If you do not wish to install the CA certificate, you can still have your scraper connect to the Till server by disabling/ignoring security checks in your scraper. Please refer to the programming language/framework/tool that your scraper uses.
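In Python, the two options (trusting a CA certificate file, or ignoring security checks) could be expressed as follows. This is a sketch under assumptions: it expects a PEM-encoded CA certificate, the file path is whatever Till generated on your machine (the file name is not specified here), and `make_ssl_context` is a hypothetical helper name.

```python
import ssl
from typing import Optional


def make_ssl_context(ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Build an SSL context for a scraper that talks to Till.

    With ca_file: trust the system CAs plus the given CA certificate.
    Without ca_file: disable certificate checks entirely.
    """
    context = ssl.create_default_context()
    if ca_file is not None:
        # Assumes a PEM-encoded certificate at a path you supply.
        context.load_verify_locations(cafile=ca_file)
        return context
    # check_hostname must be turned off before verify_mode can be CERT_NONE.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context
```

Passing the CA file keeps certificate verification enabled, which is the safer choice. The no-argument form is the programmatic counterpart of the "disable security checks" route and is best limited to local development.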

Installing the generated CA certificates onto your computer

The first time Till runs as a server, Till generates the CA certificates in the following directory:

Linux or MacOS:

~/.config/datahen/till/

Windows:

C:\Users\\.config\datahen\till\

Then follow the instructions below to install the CA certificates:

MacOS

Add certificates to a keychain using Keychain Access on Mac

Ubuntu/Debian

How do I install a root certificate

Mozilla Firefox

How to import the Mozilla Root Certificate into your Firefox web browser

Chrome

Getting Chrome to accept self-signed localhost certificate

Windows

Use certutil with the following command:

certutil -addstore root 

Read more about certutil

Comments
  • Insecure Browser Detected!

    At the site below, I am using puppeteer through till. Till is started with the command:

    till serve --proxy-file c:\temp\till\proxylist.txt --token --force-user-agent --ua-type desktop

    https://secure.utah.gov/llv/search/index.html

    Insecure Browser Detected! We noticed that your browser is REALLY OLD.

    Let me know if there is additional information I can provide to help debug.

    Info:
      Cache: MISS
      RID: 01FDWJ3NMJK05AZM77M50DGTHS
      GID: secure.utah.gov-08228751e1bab3b452a3e0d6830a1e50
      SID:
      Timestamp: 2021-08-24 13:07:45

    Config:
      ForceUA: true
      UaType: desktop
      UseProxy: true
      StickyCookies: true
      StickyUA: true
      IgnoreInterceptors: []
      IgnoreAllInterceptors: false
      CacheFreshness: now
      CacheServeFailures: false

    Request:
      Method: POST
      URL: https://secure.utah.gov/llv/search/index.html
      Header:
        Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
        Accept-Encoding: gzip, deflate, br
        Accept-Language: en-US
        Cache-Control: no-cache
        Connection: keep-alive
        Content-Length: 573
        Content-Type: application/x-www-form-urlencoded
        Cookie: JSESSIONID=997DA0A8AD40B2C1AB70B5B556A1E845; TS01bdb7d2=0143bf51700840319b0a08eeb8fbe8681009410c51139c8d6e58c60496aabe65498495e0a8af296ef0edd92ba99bd801c0a1857f32b92dcf6a887da12f8c350179f47b9356; TS01959f26=0143bf5170adbfef85245e0b4e92afb875bb8f3992139c8d6e58c60496aabe65498495e0a8edfd3d3b396fd57ee8cec40e757bb731; __utma=128287630.704573828.1629824827.1629824827.1629824827.1; __utmc=128287630; __utmz=128287630.1629824827.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmt=1; fontsize=90%25; _ga=GA1.2.704573828.1629824827; _gid=GA1.2.1327052485.1629824827; _gat_UA-103830962-11=1; __utmb=128287630.2.9.1629824865296
        Origin: https://secure.utah.gov
        Pragma: no-cache
        Referer: https://secure.utah.gov/llv/search/index.html
        Sec-Fetch-Dest: document
        Sec-Fetch-Mode: navigate
        Sec-Fetch-Site: same-origin
        Sec-Fetch-User: ?1
        Upgrade-Insecure-Requests: 1
        User-Agent: Mozilla/5.0 (Windows NT 6.1; Trident/7.0; rv:11.0) like Gecko
      ContentLength: 573
      Body: g-recaptcha-response=03AGdBq25ElbcQolMLQOZ3PnpAWF7Tvqhe55PBHrAp1oWqRWEQyBdlZquRQa5rXaucm4CiAxyZR7roFzAQsWJjjzesc_br5Sywr21DVfYyDNCjHEbGYorI3fsPOgxmijW0p9TJZGLtiaZmBVE9J4MeOZxGrUD9NQSR517qnmphipvyOqODKOPESQwcKoYOLDIaNsA4PWOMsbj2EnKZZc4j-79W5FBPT5hDmnyfgSNLcW7VFSQ73O5muE_jybf0LvgyWKKtaKexuIK_lLwjt44qXAzz7xoG2ruafB6N7xo2vrUoplhQ394iSeO7chDN47QlbI_4x1SIq0f0g5KRxVN9GYuCgPsCRCc547f6HkigwK-BZueHD8eSVG2YglwxL7vvUHj_eXHml2_gdHUVwp_Gk9QJdeFnM6vfPKDxpEqMgcjdH8IBzqxJmzo&licenseNumberCore=339473&licenseNumberFourDigit=5518&type=by_number&_csrf=e92e3abe-4aab-4843-87d4-bbbd87d4b879

    Response:
      Status: 200 OK
      Proto: HTTP/1.1
      Header:
        Accept-Ranges: bytes
        Connection: Keep-Alive
        Content-Length: 17488
        Content-Type: text/html; charset=UTF-8
        Date: Tue, 24 Aug 2021 17:07:46 GMT
        Keep-Alive: timeout=5, max=100
        Server: Apache
        Strict-Transport-Security: max-age=16070400; includeSubDomains
      ContentLength: 17488

    enhancement 
    opened by danpmohn 5
  • Doesn't work with puppeteer

    When I try to use till with curl, it works fine after installing the cert, but when I specify the proxy with puppeteer:

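    const puppeteer = require('puppeteer');

    (async () => {
      const browser = await puppeteer.launch({
        headless: false,
        args: [
          '--proxy-server=http://localhost:2933',
          '--ignore-certificate-errors',
          '--ignore-certificate-errors-spki-list ',
        ],
      });
      // The rest of the script was not included in the original report.
    })();
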

    I get a err_ssl_version_or_cipher_mismatch error message on Chrome, using Windows 10.

    opened by deoxykev 4
  • Invalid memory address or nil pointer dereference

    A user reported the following issue:

    gotten error: Put "https://till.datahen.com/api/v1/instances/default/stats": read tcp 192.168.1.100:54493->104.21.62.154:443: read: connection reset by peer
    panic: runtime error: invalid memory address or nil pointer dereference
    [signal SIGSEGV: segmentation violation code=0x2 addr=0x0 pc=0x1028d8244]
    
    goroutine 76 [running]:
    github.com/DataHenHQ/till/server.startRecurringStatUpdate()
    	/__w/till/till/server/stats.go:51 +0x194
    created by github.com/DataHenHQ/till/server.Serve
    	/__w/till/till/server/server.go:108 +0x3a8
    
    opened by paramaw 1
  • Doesn't work with node

    Works with curl but doesn't work with node:

    const got = require('got');
    const {HttpsProxyAgent} = require('hpagent');
    
    (async function main() {
      const response = await got.post('https://example.com/', {
        agent: {
          https: new HttpsProxyAgent({
            proxy: 'http://localhost:2933',
          }),
        },
    
        https: {
          rejectUnauthorized: false,
        },
    
        json: { query },
      });
    
      console.log({ response });
    })();
    

    I tried with http.request and request.post as well; no joy; I get ECONNRESET every time.

    opened by beaugunderson 1
  • Fixes #8 Add custom user-agent config file support

    This will allow users to provide a custom user-agent config file containing only the user-agents that are useful for their needs, either by using the --ua-config-file /path/to/custom-ua-config.json flag or by setting ua-config-file: /path/to/custom-ua-config.json in their config.yaml file.

    For example, for issue #8, the user can download our default user-agent config file, then remove the Internet Explorer entry to prevent the target website's "Insecure Browser Detected!" error message.

    The git patch to create that custom user-agent config file without Internet Explorer would be

    diff --git a/default-ua-config.json b/custom-ua-config.json
    index f7257c1..9a56a83 100644
    --- a/default-ua-config.json
    +++ b/custom-ua-config.json
    @@ -12,7 +12,6 @@
                             "signatures": ["Windows NT 6.1; Win64; x64"],
                             "probability": 0.1856,
                             "browser_ids": [
    -                            "ie",
                                 "edge",
                                 "chrome",
                                 "firefox",
    @@ -26,7 +25,6 @@
                             "signatures": ["Windows NT 6.2; Win64; x64"],
                             "probability": 0.0106,
                             "browser_ids": [
    -                            "ie",
                                 "edge",
                                 "chrome",
                                 "firefox",
    @@ -40,7 +38,6 @@
                             "signatures": ["Windows NT 6.3; Win64; x64"],
                             "probability": 0.0416,
                             "browser_ids": [
    -                            "ie",
                                 "edge",
                                 "chrome",
                                 "firefox",
    @@ -54,7 +51,6 @@
                             "signatures": ["Windows NT 10.0; Win64; x64"],
                             "probability": 0.7400,
                             "browser_ids": [
    -                            "ie",
                                 "edge",
                                 "chrome",
                                 "firefox",
    @@ -220,18 +216,6 @@
                 }
             ],
             "browsers": {
    -            "ie": {
    -                "id": "ie",
    -                "probability": 0.0119,
    -                "ua_format": "Mozilla/5.0 (<os:kernel>; Trident/7.0; rv:11.0) like Gecko",
    -                "variants": [
    -                    {
    -                        "id": "ie",
    -                        "probability": 1,
    -                        "data": {}
    -                    }
    -                ]
    -            },
                 "edge": {
                     "id": "edge",
                     "probability": 0.0261,
    
    
    enhancement 
    opened by colorfulsing 0
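The manual edit described in the pull request above (removing Internet Explorer from the user-agent config) could also be scripted. The sketch below relies only on the two structures visible in that diff, namely `"browser_ids"` lists and a `"browsers"` mapping keyed by browser id. The rest of the file layout is unknown here, so the code walks the JSON generically, and the function names are hypothetical.

```python
import json


def remove_browser(node, browser_id):
    """Return a copy of node without browser_id in any "browser_ids" list
    or "browsers" mapping, at any nesting depth."""
    if isinstance(node, dict):
        result = {}
        for key, value in node.items():
            if key == "browser_ids" and isinstance(value, list):
                result[key] = [b for b in value if b != browser_id]
            elif key == "browsers" and isinstance(value, dict):
                result[key] = {
                    k: remove_browser(v, browser_id)
                    for k, v in value.items()
                    if k != browser_id
                }
            else:
                result[key] = remove_browser(value, browser_id)
        return result
    if isinstance(node, list):
        return [remove_browser(item, browser_id) for item in node]
    return node


def strip_browser_from_file(src_path, dst_path, browser_id):
    """Read a user-agent config, remove one browser, and write the result."""
    with open(src_path, encoding="utf-8") as src:
        config = json.load(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        json.dump(remove_browser(config, browser_id), dst, indent=4)
```

For example, `strip_browser_from_file("default-ua-config.json", "custom-ua-config.json", "ie")` would write a custom file that can then be passed via `--ua-config-file`. Like the diff, this sketch leaves the remaining probability values untouched.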
  • POST with body does not work

    When doing POST with a request body, it drops the connection:

    curl -X POST 'https://postman-echo.com/post' -H 'X-DH-Cache-Freshness: now' -H "Content-Type: application/json" --data '{"hello":"world"}' -kv --proxy http://localhost:2933                      
    Note: Unnecessary use of -X or --request, POST is already inferred.
    ...
    > POST /post HTTP/1.1
    > Host: postman-echo.com
    > User-Agent: curl/7.58.0
    > Accept: */*
    > X-DH-Cache-Freshness: now
    > Content-Type: application/json
    > Content-Length: 17
    >
    * upload completely sent off: 17 out of 17 bytes
    * Connection #0 to host localhost left intact
    
    opened by paramaw 0
  • Add a Gitter chat badge to README.md

    DataHenHQ/till now has a Chat Room on Gitter

    @paramaw has just created a chat room. You can visit it here: https://gitter.im/DataHenHQ/till.

    This pull-request adds this badge to your README.md:

    Gitter

    If my aim is a little off, please let me know.

    Happy chatting.

    PS: Click here if you would prefer not to receive automatic pull-requests from Gitter in future.

    opened by gitter-badger 0
  • version `GLIBC_2.28' not found, when running Till 0.8.0 on Ubuntu 18.04

    Hello, when I try to run Till (till_0.8.0_Linux_x86_64.tar.gz) on Ubuntu 18.04, I got this error: ./till: /lib/x86_64-linux-gnu/libc.so.6: version 'GLIBC_2.28' not found (required by ./till)

    I think it's because Till is not compatible with the GLIBC that comes with Ubuntu 18.04 or lower. These are the GLIBC versions that come with Ubuntu 18.04 or lower:

    Ubuntu 16.04 -> GLIBC 2.23
    Ubuntu 18.04 -> GLIBC 2.27

    Compiling the Till binaries against a lower version of GLIBC might fix this issue.

    These are my system details: [image]

    opened by zokovi 0
Releases(v0.10.2)
Owner
DataHenHQ
DataHen provides services and platform for scalable web scraping, data processing & ETL