Pagser is a simple, extensible, configurable library for Go crawlers that parses and deserializes HTML pages into structs, based on goquery and struct tags.

Pagser

Pagser is inspired by page parser.

Install

go get -u github.com/foolin/pagser

Or get the specified version:

go get github.com/foolin/pagser@{version}

The list of released versions for {version} is at: https://github.com/foolin/pagser/releases

Features

  • Simple - Uses Go struct tag syntax.
  • Easy - Easy to use in your spider/crawler/colly application.
  • Extensible - Supports extension functions.
  • Struct tag grammar - The grammar is simple, like `pagser:"a->attr(href)"`.
  • Nested Structure - Supports nested structs for nodes.
  • Configurable - Supports configuration.
  • Implicit type conversion - Output strings are automatically converted to the field type, such as int, int64, or float64.
  • GoQuery/Colly - Works with any goquery-based project, such as go-colly.

Docs

See the Pagser documentation: https://pkg.go.dev/github.com/foolin/pagser

Usage

package main

import (
	"encoding/json"
	"github.com/foolin/pagser"
	"log"
)

const rawPageHtml = `
<!doctype html>
<html>
<head>
    <meta charset="utf-8">
    <title>Pagser Title</title>
	<meta name="keywords" content="golang,pagser,goquery,html,page,parser,colly">
</head>

<body>
	<h1>H1 Pagser Example</h1>
	<div class="navlink">
		<div class="container">
			<ul class="clearfix">
				<li id=''><a href="/">Index</a></li>
				<li id='2'><a href="/list/web" title="web site">Web page</a></li>
				<li id='3'><a href="/list/pc" title="pc page">Pc Page</a></li>
				<li id='4'><a href="/list/mobile" title="mobile page">Mobile Page</a></li>
			</ul>
		</div>
	</div>
</body>
</html>
`

type PageData struct {
	Title    string   `pagser:"title"`
	Keywords []string `pagser:"meta[name='keywords']->attrSplit(content)"`
	H1       string   `pagser:"h1"`
	Navs     []struct {
		ID   int    `pagser:"->attrEmpty(id, -1)"`
		Name string `pagser:"a->text()"`
		Url  string `pagser:"a->attr(href)"`
	} `pagser:".navlink li"`
}

func main() {
	//New default config
	p := pagser.New()

	//data parser model
	var data PageData
	//parse html data
	err := p.Parse(&data, rawPageHtml)
	//check error
	if err != nil {
		log.Fatal(err)
	}

	//print data
	log.Printf("Page data json: \n-------------\n%v\n-------------\n", toJson(data))
}

func toJson(v interface{}) string {
	data, _ := json.MarshalIndent(v, "", "\t")
	return string(data)
}

Run output:


Page data json: 
-------------
{
	"Title": "Pagser Title",
	"Keywords": [
		"golang",
		"pagser",
		"goquery",
		"html",
		"page",
		"parser",
		"colly"
	],
	"H1": "H1 Pagser Example",
	"Navs": [
		{
			"ID": -1,
			"Name": "Index",
			"Url": "/"
		},
		{
			"ID": 2,
			"Name": "Web page",
			"Url": "/list/web"
		},
		{
			"ID": 3,
			"Name": "Pc Page",
			"Url": "/list/pc"
		},
		{
			"ID": 4,
			"Name": "Mobile Page",
			"Url": "/list/mobile"
		}
	]
}
-------------

Configuration

type Config struct {
	TagName    string //struct tag name, default is `pagser`
	FuncSymbol string //Function symbol, default is `->`
	Debug      bool   //Debug mode, debug will print some log, default is `false`
}

Struct Tag Grammar

[goquery selector]->[function]

Example:

type ExamData struct {
	Href string `pagser:".navLink li a->attr(href)"`
}

  1. Struct tag name: pagser
  2. goquery selector: .navLink li a
  3. Function symbol: ->
  4. Function name: attr
  5. Function arguments: href
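
The breakdown above can be sketched with the standard library alone. The helper below is hypothetical and is not part of pagser's API; the names `splitTag` and `tagParts` are assumptions made for illustration, and quoting or escaping inside arguments is deliberately ignored.

```go
package main

import (
	"fmt"
	"strings"
)

// tagParts holds the pieces of a tag value such as ".navLink li a->attr(href)".
type tagParts struct {
	Selector string
	FuncName string
	Args     []string
}

// splitTag is a hypothetical helper, not pagser's own parser. It splits a
// tag value on the function symbol, then separates the function name from
// its comma-separated arguments. Quotes and escapes are not handled here.
func splitTag(tag, funcSymbol string) tagParts {
	var parts tagParts
	pieces := strings.SplitN(tag, funcSymbol, 2)
	parts.Selector = strings.TrimSpace(pieces[0])
	if len(pieces) == 1 {
		// No function in the tag: the documented default is text().
		parts.FuncName = "text"
		return parts
	}
	call := strings.TrimSpace(pieces[1])
	open := strings.Index(call, "(")
	if open < 0 {
		parts.FuncName = call
		return parts
	}
	parts.FuncName = strings.TrimSpace(call[:open])
	inner := strings.TrimSpace(strings.TrimSuffix(call[open+1:], ")"))
	if inner == "" {
		return parts
	}
	for _, arg := range strings.Split(inner, ",") {
		parts.Args = append(parts.Args, strings.TrimSpace(arg))
	}
	return parts
}

func main() {
	p := splitTag(".navLink li a->attr(href)", "->")
	fmt.Printf("selector=%q func=%q args=%q\n", p.Selector, p.FuncName, p.Args)
	// prints: selector=".navLink li a" func="attr" args=["href"]
}
```

A tag with an empty selector, such as `->attrEmpty(id, -1)`, yields an empty `Selector` in this sketch, which matches how the Usage example applies a function to the current node.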

Functions

Builtin functions

  • text() gets the element text and returns a string; this is the default function when the struct tag does not specify one.
  • eachText() gets the text of each element and returns []string.
  • html() gets the element's inner HTML and returns a string.
  • eachHtml() gets the inner HTML of each element and returns []string.
  • outerHtml() gets the element's outer HTML and returns a string.
  • eachOutHtml() gets the outer HTML of each element and returns []string.
  • attr(name) gets the value of the named attribute and returns a string.
  • eachAttr() gets the attribute value of each element and returns []string.
  • attrSplit(name, sep) gets the attribute value, splits it by the separator, and returns []string.
  • attr('value') gets the value of the attribute named value and returns a string; e.g. an element with value="xxx" returns "xxx".
  • textSplit(sep) gets the element text, splits it by the separator, and returns []string.
  • eachTextJoin(sep) gets the text of each element, joins it with the separator, and returns a string.
  • eq(index) reduces the set of matched elements to the one at the specified index and returns a Selection for use with a nested struct.
  • ...

For more builtin functions, see the docs: https://pkg.go.dev/github.com/foolin/pagser?tab=doc#BuiltinFunctions
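
To make the string handling behind a few of these functions concrete, here is a rough standard-library sketch. It does not call pagser; `splitTrim` and `orDefault` are hypothetical stand-ins for the split step of attrSplit/textSplit and for the fallback seen with attrEmpty in the Usage example. The "," default separator is an assumption inferred from the Usage output, where `attrSplit(content)` splits the keywords on commas.

```go
package main

import (
	"fmt"
	"strings"
)

// splitTrim splits s on sep and trims each piece. It is a hypothetical
// stand-in for the split step of attrSplit and textSplit. An empty sep
// falls back to "," (an assumption based on the Usage example).
func splitTrim(s, sep string) []string {
	if sep == "" {
		sep = ","
	}
	parts := strings.Split(s, sep)
	for i, p := range parts {
		parts[i] = strings.TrimSpace(p)
	}
	return parts
}

// orDefault returns def when s is blank. It is a hypothetical stand-in for
// the fallback behaviour of attrEmpty(id, -1) in the Usage example.
func orDefault(s, def string) string {
	if strings.TrimSpace(s) == "" {
		return def
	}
	return s
}

func main() {
	keywords := splitTrim("golang,pagser,goquery,html,page,parser,colly", "")
	fmt.Println(len(keywords), keywords[0], keywords[6]) // prints: 7 golang colly

	// Joining texts with a separator, as eachTextJoin(sep) is described to do.
	fmt.Println(strings.Join([]string{"Index", "Web page"}, "|")) // prints: Index|Web page

	fmt.Println(orDefault("", "-1"), orDefault("2", "-1")) // prints: -1 2
}
```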

Extension functions

  • Markdown() //convert html to markdown format.
  • UgcHtml() //sanitize html

Extension functions must be registered before use, for example:

import "github.com/foolin/pagser/extensions/markdown"

p := pagser.New()

//Register Markdown
markdown.Register(p)

Custom function

Function interface

type CallFunc func(node *goquery.Selection, args ...string) (out interface{}, err error)

Define global function

// A global function must be registered with p.RegisterFunc("MyGlob", MyGlobalFunc) before it is used.
func MyGlobalFunc(node *goquery.Selection, args ...string) (out interface{}, err error) {
	return "Global-" + node.Text(), nil
}

type PageData struct{
  MyGlobalValue string    `pagser:"->MyGlob()"`
}

func main(){

    p := pagser.New()

    //Register global function `MyGlob`
    p.RegisterFunc("MyGlob", MyGlobalFunc)

    //Todo

    //data parser model
    var data PageData
    //parse html data
    err := p.Parse(&data, rawPageHtml)

    //...
}

Define struct function

type PageData struct{
  MyFuncValue string    `pagser:"->MyFunc()"`
}

// This method is called automatically; it does not need to be registered.
func (d PageData) MyFunc(node *goquery.Selection, args ...string) (out interface{}, err error) {
	return "Struct-" + node.Text(), nil
}


func main(){

    p := pagser.New()

    //Todo

    //data parser model
    var data PageData
    //parse html data
    err := p.Parse(&data, rawPageHtml)

    //...
}

Call Syntax

Note: all function arguments are strings; single quotes are optional.

  1. Function call with no arguments

->fn()

  2. Function call with one argument (single quotes are optional)

->fn(one)

->fn('one')

  3. Function call with many arguments

->fn(one, two, three, ...)

->fn('one', 'two', 'three', ...)

  4. Function call with single quotes and an escape character

->fn('it\'s ok', 'two,xxx', 'three', ...)
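
The quoting rules above can be illustrated with a small argument splitter. This is a sketch under stated assumptions, not pagser's own parser: `splitArgs` is a hypothetical name, and trimming whitespace around every argument (even inside quotes) is a simplification.

```go
package main

import (
	"fmt"
	"strings"
)

// splitArgs is a hypothetical helper, not pagser's own parser. It splits the
// text between the parentheses of a call into arguments. Commas inside
// single quotes are kept as literal text, and \' is an escaped quote.
// As a simplification, whitespace around each argument is trimmed.
func splitArgs(s string) []string {
	var args []string
	var cur strings.Builder
	inQuote := false
	runes := []rune(s)
	for i := 0; i < len(runes); i++ {
		r := runes[i]
		switch {
		case r == '\\' && i+1 < len(runes) && runes[i+1] == '\'':
			cur.WriteRune('\'') // escaped quote becomes a literal quote
			i++
		case r == '\'':
			inQuote = !inQuote // quotes delimit the argument but are not kept
		case r == ',' && !inQuote:
			args = append(args, strings.TrimSpace(cur.String()))
			cur.Reset()
		default:
			cur.WriteRune(r)
		}
	}
	if cur.Len() > 0 || len(args) > 0 {
		args = append(args, strings.TrimSpace(cur.String()))
	}
	return args
}

func main() {
	fmt.Printf("%q\n", splitArgs(`'it\'s ok', 'two,xxx', three`))
	// prints: ["it's ok" "two,xxx" "three"]
}
```

Quoted and unquoted forms give the same result for simple values, so `splitArgs("one")` and `splitArgs("'one'")` both return a single argument `one`.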

Priority Order

Lookup function priority order:

struct method -> parent method -> ... -> global
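
One way to picture this lookup order is a walk from the innermost struct outward, with a global table as the last resort. The sketch below uses only `reflect` from the standard library; `lookup`, `globals`, `Child`, and `Parent` are hypothetical names for illustration and say nothing about how pagser implements the lookup internally.

```go
package main

import (
	"fmt"
	"reflect"
)

// globals stands in for functions registered on the parser.
var globals = map[string]func() string{
	"Shared":     func() string { return "global Shared" },
	"OnlyGlobal": func() string { return "global OnlyGlobal" },
}

// Child is the innermost struct; Parent is the struct that contains it.
type Child struct{}

func (Child) Shared() string { return "Child.Shared" }

type Parent struct{}

func (Parent) Shared() string     { return "Parent.Shared" }
func (Parent) OnlyParent() string { return "Parent.OnlyParent" }

// lookup checks each owner in order (innermost struct first) for a method
// with the given name, then falls back to the global table.
func lookup(name string, chain ...interface{}) (func() string, bool) {
	for _, owner := range chain {
		m := reflect.ValueOf(owner).MethodByName(name)
		if !m.IsValid() {
			continue
		}
		if fn, ok := m.Interface().(func() string); ok {
			return fn, true
		}
	}
	fn, ok := globals[name]
	return fn, ok
}

func main() {
	chain := []interface{}{Child{}, Parent{}}
	for _, name := range []string{"Shared", "OnlyParent", "OnlyGlobal"} {
		if fn, ok := lookup(name, chain...); ok {
			fmt.Println(name, "->", fn())
		}
	}
	// prints:
	// Shared -> Child.Shared
	// OnlyParent -> Parent.OnlyParent
	// OnlyGlobal -> global OnlyGlobal
}
```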

More Examples

See the advanced example: https://github.com/foolin/pagser/tree/master/_examples/advance

Implicit type conversion

Output strings are automatically converted to the type of the destination field, such as int, int64, or float64.

Supported types:

  • bool
  • float32
  • float64
  • int
  • int32
  • int64
  • string
  • []bool
  • []float32
  • []float64
  • []int
  • []int32
  • []int64
  • []string
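
As a rough picture of what such a conversion involves, the sketch below converts raw strings into scalar struct fields with `reflect` and `strconv`. It is an illustration only: pagser lists github.com/spf13/cast among its dependencies, and the names `setFromString`, `fill`, and `row` here are hypothetical. Slice types are left out to keep the example short.

```go
package main

import (
	"fmt"
	"reflect"
	"strconv"
)

// setFromString is a hypothetical helper. It parses raw according to the
// kind of the destination field and stores the result in it.
func setFromString(dst reflect.Value, raw string) error {
	switch dst.Kind() {
	case reflect.String:
		dst.SetString(raw)
	case reflect.Bool:
		b, err := strconv.ParseBool(raw)
		if err != nil {
			return err
		}
		dst.SetBool(b)
	case reflect.Int, reflect.Int32, reflect.Int64:
		n, err := strconv.ParseInt(raw, 10, dst.Type().Bits())
		if err != nil {
			return err
		}
		dst.SetInt(n)
	case reflect.Float32, reflect.Float64:
		f, err := strconv.ParseFloat(raw, dst.Type().Bits())
		if err != nil {
			return err
		}
		dst.SetFloat(f)
	default:
		return fmt.Errorf("unsupported kind %s", dst.Kind())
	}
	return nil
}

// row is a sample destination struct with several scalar field types.
type row struct {
	ID    int
	Score float64
	Open  bool
	Name  string
}

// fill sets each named field of r from its raw string value.
func fill(r *row, values map[string]string) error {
	v := reflect.ValueOf(r).Elem()
	for name, raw := range values {
		f := v.FieldByName(name)
		if !f.IsValid() {
			return fmt.Errorf("no field %q", name)
		}
		if err := setFromString(f, raw); err != nil {
			return fmt.Errorf("field %s: %w", name, err)
		}
	}
	return nil
}

func main() {
	var r row
	err := fill(&r, map[string]string{"ID": "4", "Score": "3.5", "Open": "true", "Name": "Mobile Page"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", r) // prints: {ID:4 Score:3.5 Open:true Name:Mobile Page}
}
```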

Examples

Crawl page example

package main

import (
	"encoding/json"
	"github.com/foolin/pagser"
	"log"
	"net/http"
)

type PageData struct {
	Title    string `pagser:"title"`
	RepoList []struct {
		Names       []string `pagser:"h1->textSplit('/', true)"`
		Description string   `pagser:"h1 + p"`
		Stars       string   `pagser:"a.muted-link->eqAndText(0)"`
		Repo        string   `pagser:"h1 a->attrConcat('href', 'https://github.com', $value, '?from=pagser')"`
	} `pagser:"article.Box-row"`
}

func main() {
	resp, err := http.Get("https://github.com/trending")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	//New default config
	p := pagser.New()

	//data parser model
	var data PageData
	//parse html data
	err = p.ParseReader(&data, resp.Body)
	//check error
	if err != nil {
		log.Fatal(err)
	}

	//print data
	log.Printf("Page data json: \n-------------\n%v\n-------------\n", toJson(data))
}

func toJson(v interface{}) string {
	data, _ := json.MarshalIndent(v, "", "\t")
	return string(data)
}

Run output:


2020/04/25 12:26:04 Page data json: 
-------------
{
	"Title": "Trending  repositories on GitHub today · GitHub",
	"RepoList": [
		{
			"Names": [
				"pcottle",
				"learnGitBranching"
			],
			"Description": "An interactive git visualization to challenge and educate!",
			"Stars": "16,010",
			"Repo": "https://github.com/pcottle/learnGitBranching?from=pagser"
		},
		{
			"Names": [
				"jackfrued",
				"Python-100-Days"
			],
			"Description": "Python - 100天从新手到大师",
			"Stars": "83,484",
			"Repo": "https://github.com/jackfrued/Python-100-Days?from=pagser"
		},
		{
			"Names": [
				"brave",
				"brave-browser"
			],
			"Description": "Next generation Brave browser for macOS, Windows, Linux, Android.",
			"Stars": "5,963",
			"Repo": "https://github.com/brave/brave-browser?from=pagser"
		},
		{
			"Names": [
				"MicrosoftDocs",
				"azure-docs"
			],
			"Description": "Open source documentation of Microsoft Azure",
			"Stars": "3,798",
			"Repo": "https://github.com/MicrosoftDocs/azure-docs?from=pagser"
		},
		{
			"Names": [
				"ahmetb",
				"kubectx"
			],
			"Description": "Faster way to switch between clusters and namespaces in kubectl",
			"Stars": "6,979",
			"Repo": "https://github.com/ahmetb/kubectx?from=pagser"
		},

        //...        

		{
			"Names": [
				"serverless",
				"serverless"
			],
			"Description": "Serverless Framework – Build web, mobile and IoT applications with serverless architectures using AWS Lambda, Azure Functions, Google CloudFunctions \u0026 more! –",
			"Stars": "35,502",
			"Repo": "https://github.com/serverless/serverless?from=pagser"
		},
		{
			"Names": [
				"vuejs",
				"vite"
			],
			"Description": "Experimental no-bundle dev server for Vue SFCs",
			"Stars": "1,573",
			"Repo": "https://github.com/vuejs/vite?from=pagser"
		}
	]
}
-------------

Colly Example

Work with colly:

p := pagser.New()

// collector is a *colly.Collector created elsewhere.
// For every element matching "body", parse its DOM into PageData.
collector.OnHTML("body", func(e *colly.HTMLElement) {
	//data parser model
	var data PageData
	//parse html data
	err := p.ParseSelection(&data, e.DOM)
	//check error
	if err != nil {
		log.Println(err)
		return
	}
})

Dependencies

  • github.com/PuerkitoBio/goquery

  • github.com/spf13/cast

Extensions:

  • github.com/mattn/godown

  • github.com/microcosm-cc/bluemonday
