Extract structured data from websites. Web scraping.

Overview

Dataflow kit

Dataflow kit ("DFK") is a Web Scraping framework for Gophers. It extracts data from web pages, following the specified CSS Selectors.

You can use it in many ways for data mining, data processing or archiving.

The Web Scraping Pipeline

The web-scraping pipeline consists of three general components:

  • Downloading an HTML web page (Fetch service);
  • Parsing the HTML page and retrieving the data we're interested in (Parse service);
  • Encoding the parsed data to CSV, MS Excel, JSON, JSON Lines or XML format.

Fetch service

The fetch.d server downloads the content of HTML web pages. Depending on the fetcher type, web page content is downloaded using either the Base fetcher or the Chrome fetcher.

The Base fetcher uses the standard Go HTTP client to fetch pages as is. It is faster than the Chrome fetcher, but it cannot render dynamic JavaScript-driven web pages.
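
To illustrate, here is a minimal sketch of what the Base fetcher's approach boils down to (not DFK's internal code): a plain GET with the standard library client that returns the page source without executing any JavaScript.

package main

import (
	"fmt"
	"io"
	"net/http"
)

// fetchBase downloads a page with the standard HTTP client,
// mirroring the Base fetcher's approach: no JavaScript
// execution, just the raw HTML response body.
func fetchBase(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	html, err := fetchBase("http://books.toscrape.com")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(html), "bytes fetched")
}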

The Chrome fetcher renders dynamic JavaScript-based content by sending requests to Chrome running in headless mode.
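
By way of comparison, here is a hypothetical sketch of driving headless Chrome from Go with the third-party chromedp package; DFK's actual Chrome fetcher may work differently. It assumes Chrome's DevTools endpoint is reachable at 127.0.0.1:9222, the port exposed by the docker run command in the Manual way section below.

package main

import (
	"context"
	"fmt"

	"github.com/chromedp/chromedp"
)

func main() {
	// Attach to an already-running headless Chrome instance.
	allocCtx, cancelAlloc := chromedp.NewRemoteAllocator(context.Background(), "ws://127.0.0.1:9222/")
	defer cancelAlloc()

	ctx, cancelCtx := chromedp.NewContext(allocCtx)
	defer cancelCtx()

	var html string
	// Navigate, let the page's scripts run, then grab the rendered DOM.
	err := chromedp.Run(ctx,
		chromedp.Navigate("https://scrape.dataflowkit.com/persons/page-0"),
		chromedp.OuterHTML("html", &html),
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(html), "bytes of rendered HTML")
}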

A fetched web page is passed on to the parse.d service.

Parse service

parse.d is the service that extracts data from a downloaded web page, following the rules listed in a JSON configuration file. The extracted data is returned in CSV, MS Excel, JSON or XML format.

Note: Sometimes the Parse service cannot extract data from pages retrieved by the default Base fetcher, and empty results may be returned when parsing JavaScript-generated pages. In that case the Parse service automatically falls back to the Chrome fetcher to render the same dynamic JavaScript-driven content. Have a look at https://scrape.dataflowkit.com/persons/page-0 for a sample of a JavaScript-driven web page.
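
Programmatically, launching a parse is just an HTTP POST of a configuration file to parse.d; a minimal Go sketch, assuming the service is running on port 8001 as in the curl example from the Usage section below:

package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// The scraper configuration file is sent as the POST body.
	cfg, err := os.Open("examples/books.toscrape.com.json")
	if err != nil {
		panic(err)
	}
	defer cfg.Close()

	resp, err := http.Post("http://127.0.0.1:8001/parse", "application/json", cfg)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// parse.d answers with the extracted data in the format
	// requested by the configuration's "format" field.
	out, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}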

Dataflow kit benefits:

  • Scraping of JavaScript-generated pages;

  • Data extraction from paginated websites;

  • Processing of infinitely scrolled pages;

  • Scraping of websites behind a login form;

  • Cookies and session handling;

  • Following links and processing of detail pages;

  • Managing delays between requests per domain (see the sketch after this list);

  • Following robots.txt directives;

  • Saving intermediate data in Diskv or MongoDB. The storage interface is flexible enough to add more storage types easily;

  • Encoding results to CSV, MS Excel, JSON (Lines) or XML format;

  • Dataflow kit is fast. It takes about 4-6 seconds to fetch and then parse 50 pages;

  • Dataflow kit is suitable for processing quite large volumes of data. Our tests show that parsing approximately 4 million pages takes about 7 hours.
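
To illustrate the per-domain delay handling mentioned above, here is a hypothetical sketch of a minimal per-host rate limiter (not DFK's actual middleware):

package main

import (
	"fmt"
	"net/url"
	"time"
)

// domainLimiter enforces a minimum delay between consecutive
// requests to the same host.
type domainLimiter struct {
	delay time.Duration
	last  map[string]time.Time
}

func newDomainLimiter(delay time.Duration) *domainLimiter {
	return &domainLimiter{delay: delay, last: map[string]time.Time{}}
}

// Wait blocks until the configured delay since the previous
// request to the URL's host has elapsed.
func (l *domainLimiter) Wait(rawURL string) error {
	u, err := url.Parse(rawURL)
	if err != nil {
		return err
	}
	if wait := l.delay - time.Since(l.last[u.Host]); wait > 0 {
		time.Sleep(wait)
	}
	l.last[u.Host] = time.Now()
	return nil
}

func main() {
	limiter := newDomainLimiter(2 * time.Second)
	for _, u := range []string{
		"https://example.com/page-1",
		"https://example.com/page-2", // waits ~2s: same host
		"https://example.org/page-1", // no wait: different host
	} {
		if err := limiter.Wait(u); err != nil {
			panic(err)
		}
		fmt.Println(time.Now().Format("15:04:05"), "fetching", u)
	}
}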

Installation

go get -u github.com/slotix/dataflowkit

Usage

Docker

  1. Install Docker and Docker Compose

  2. Start services.

cd $GOPATH/src/github.com/slotix/dataflowkit && docker-compose up

This command fetches docker images automatically and starts services.

  3. Launch parsing in a second terminal window by sending a POST request to the parse daemon. Some JSON configuration files for testing are available in the /examples folder.
curl -XPOST  127.0.0.1:8001/parse --data-binary "@$GOPATH/src/github.com/slotix/dataflowkit/examples/books.toscrape.com.json"

Here is a sample JSON configuration file:

{
	"name":"collection",
	"request":{
	   "url":"https://example.com"
	},
	"fields":[
	   {
		  "name":"Title",
		  "selector":".product-container a",
		  "extractor":{
			 "types":["text", "href"],
			 "filters":[
				"trim",
				"lowerCase"
			 ],
			 "params":{
				"includeIfEmpty":false
			 }
		  }
	   },
	   {
		  "name":"Image",
		  "selector":"#product-container img",
		  "extractor":{
			 "types":["alt","src","width","height"],
			 "filters":[
				"trim",
				"upperCase"
			 ]
		  }
	   },
	   {
		  "name":"Buyinfo",
		  "selector":".buy-info",
		  "extractor":{
			 "types":["text"],
			 "params":{
				"includeIfEmpty":false
			 }
		  }
	   }
	],
	"paginator":{
	   "selector":".next",
	   "attr":"href",
	   "maxPages":3
	},
	"format":"json",
	"fetcherType":"chrome",
	"paginateResults":false
}

Read more about scraper configuration JSON files in our GoDoc reference.

Extractors and filters are described at https://godoc.org/github.com/slotix/dataflowkit/extract.

  4. To stop the services, press Ctrl+C and run
cd $GOPATH/src/github.com/slotix/dataflowkit && docker-compose down --remove-orphans --volumes

Image: Dataflow kit CLI

Click on the image to see the CLI in action.

Manual way

  1. Start the Chrome Docker container
docker run --init -it --rm -d --name chrome --shm-size=1024m -p=127.0.0.1:9222:9222 --cap-add=SYS_ADMIN \
  yukinying/chrome-headless-browser

Headless Chrome is used for fetching web pages to feed a Dataflow kit parser.

  2. Build and run the fetch.d service
cd $GOPATH/src/github.com/slotix/dataflowkit/cmd/fetch.d && go build && ./fetch.d
  3. In a new terminal window, build and run the parse.d service
cd $GOPATH/src/github.com/slotix/dataflowkit/cmd/parse.d && go build && ./parse.d
  4. Launch parsing. See step 3 from the previous section.

Run tests

  • docker-compose -f test-docker-compose.yml up -d
  • ./test.sh
  • To stop the services, run docker-compose -f test-docker-compose.yml down

Front-End

Try https://dataflowkit.com/dfk , a front-end with a point-and-click interface to the Dataflow kit services. It generates a JSON configuration file and sends a POST request to the DFK parser.

Image: Dataflow kit web scraping framework

Click on image to see Dataflow kit in action.

License

This is Free Software, released under the BSD 3-Clause License.

Contributing

You are welcome to contribute to our project.

Issues
  • UI included?

    Will the UI available for toscrape data be published as a general service? Without that UI the whole service is... well, useful only to some extent...

    opened by uded 7
  • Adding option for bypassing render.

    Most websites return the same result whether fetched via a browser or downloaded directly. Can you add an option to bypass rendering?

    opened by lutfuahmet 4
  • Content behind login

    I would like to scrape a website behind a login form (e.g. http://quotes.toscrape.com/login). Is Dataflowkit able to send forms and keep session information during scraping? If yes, then how?

    opened by ghost 4
  • How to Use Proxy with DataFlowKit?

    How to use a proxy to prevent IPs from being blocked? Please give an example/guide tutorial with Docker or standalone.

    question 
    opened by NiNJAD3vel0per 4
  • JSON Lines Newline Delimited JSON (.jsonl) format support.

    Is your feature request related to a problem? Please describe. Here are some use cases of using JSON lines:

    • Store multiple JSON records in a file, so any kind of (uniform) structured data can be stored, such as a list of users, products or log entries.
    • JSON Lines can be streamed easily.
    • Quick insertions.
    • Querying the last (n) items quickly.

    Describe the solution you'd like

    Add a new parameter here:

    type JSONEncoder struct {
    	JSONLines bool
    }
    

    Implement encoding to JSON Lines in the function

    func (e JSONEncoder) encode(ctx context.Context, w *bufio.Writer, payloadMD5 string, keys *map[int][]int) error {}
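
    For context, a general sketch of JSON Lines output in Go (not DFK's implementation): encoding/json's Encoder already terminates every value with a newline, so streaming records through it yields valid .jsonl output.

    package main

    import (
    	"encoding/json"
    	"os"
    )

    type record struct {
    	Name  string `json:"name"`
    	Price int    `json:"price"`
    }

    func main() {
    	// Encode writes one JSON value per line -> JSON Lines.
    	enc := json.NewEncoder(os.Stdout) // any io.Writer works
    	for _, r := range []record{{"book-a", 10}, {"book-b", 12}} {
    		if err := enc.Encode(r); err != nil {
    			panic(err)
    		}
    	}
    }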

    opened by slotix 2
  • Multiple robots.txt files on the server. How to process them correctly?

    Let's consider the following case. Domain: http://example.com . Obviously, the robots.txt file is located at http://example.com/robots.txt . This robots.txt has no access restrictions.

    Let's assume we have a link like http://adv.example.com/click?item=1 to be scraped. It redirects to http://example.com/item1 . For security reasons, the second robots.txt file at http://adv.example.com/robots.txt

    User-agent: *
    Disallow: /
    

    forbids everyone from accessing the page http://adv.example.com/click?item=1. But the redirected page http://example.com/item1 is open for crawling according to http://example.com/robots.txt .

    To respect robots.txt we have to parse it BEFORE downloading the corresponding page. But following the rules listed in http://adv.example.com/robots.txt keeps us from accessing the final redirected page http://example.com/item1 . Fetching stops with the error "Forbidden by robots.txt".

    So... the only solution that comes to my mind is to download the page, generate the robots.txt link from the final redirected page's response and check whether its processing is allowed by that robots.txt.
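
    A sketch of that idea with the standard library client (hypothetical, for discussion): after redirects are followed, resp.Request.URL holds the final URL, so the authoritative robots.txt location can be derived from the page that was actually served.

    package main

    import (
    	"fmt"
    	"net/http"
    	"net/url"
    )

    func main() {
    	// http.Get follows redirects; resp.Request is the final request.
    	resp, err := http.Get("http://adv.example.com/click?item=1")
    	if err != nil {
    		panic(err)
    	}
    	defer resp.Body.Close()

    	final := resp.Request.URL // e.g. http://example.com/item1
    	robotsURL := url.URL{Scheme: final.Scheme, Host: final.Host, Path: "/robots.txt"}

    	// Check the rules of the robots.txt that actually governs
    	// the served page, then decide whether to keep the result.
    	fmt.Println("check rules at:", robotsURL.String())
    }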

    Please have a look at robotstxt.mw.go: func (mw robotstxtMiddleware) Fetch(req interface{}) (output interface{}, err error) {}

    Please share your ideas about the most elegant solution.

    question 
    opened by slotix 1
  • gofmt -s -w .

    100% in gofmt at https://goreportcard.com/report/github.com/slotix/dataflowkit.

    opened by cassiobotaro 1
  • Example of a along for doctors

    I read it was used for this. Is the script public? I want to get an idea of a production example and any issues that come up. Great toolkit, and really useful in Go.

    opened by ghost 1
  • Add excel encoder

    opened by slotix 1
  • Add stat information to Task

    Currently, the only information returned about a Parse task is the output file path. Some extra information should be added, such as request counts divided by type (initial, paginator, details), response count, error count, time elapsed, etc.

    enhancement 
    opened by slotix 1