Pagser is a simple, extensible, configurable library for Go crawlers that parses and deserializes an HTML page into a struct, based on goquery and struct tags.

Overview

Pagser

The name Pagser is inspired by "page parser".

Install

go get -u github.com/foolin/pagser

Or get the specified version:

go get github.com/foolin/pagser@{version}

For the available {version} values, see the release list: https://github.com/foolin/pagser/releases

Features

  • Simple - Uses Go struct tag syntax.
  • Easy - Easy to use in your spider/crawler/colly application.
  • Extensible - Supports extension functions.
  • Struct tag grammar - The grammar is simple, for example `pagser:"a->attr(href)"`.
  • Nested structure - Supports nested structs for nodes.
  • Configurable - Supports configuration.
  • Implicit type conversion - Result strings are converted automatically to int, int64, float64...
  • GoQuery/Colly - Works with any goquery-based project, such as go-colly.

Docs

See Pagser

Usage

package main

import (
	"encoding/json"
	"github.com/foolin/pagser"
	"log"
)

const rawPageHtml = `
<!doctype html>
<html>
<head>
    <meta charset="utf-8">
    <title>Pagser Title</title>
	<meta name="keywords" content="golang,pagser,goquery,html,page,parser,colly">
</head>

<body>
	<h1>H1 Pagser Example</h1>
	<div class="navlink">
		<div class="container">
			<ul class="clearfix">
				<li id=''><a href="/">Index</a></li>
				<li id='2'><a href="/list/web" title="web site">Web page</a></li>
				<li id='3'><a href="/list/pc" title="pc page">Pc Page</a></li>
				<li id='4'><a href="/list/mobile" title="mobile page">Mobile Page</a></li>
			</ul>
		</div>
	</div>
</body>
</html>
`

type PageData struct {
	Title    string   `pagser:"title"`
	Keywords []string `pagser:"meta[name='keywords']->attrSplit(content)"`
	H1       string   `pagser:"h1"`
	Navs     []struct {
		ID   int    `pagser:"->attrEmpty(id, -1)"`
		Name string `pagser:"a->text()"`
		Url  string `pagser:"a->attr(href)"`
	} `pagser:".navlink li"`
}

func main() {
	//New default config
	p := pagser.New()

	//data parser model
	var data PageData
	//parse html data
	err := p.Parse(&data, rawPageHtml)
	//check error
	if err != nil {
		log.Fatal(err)
	}

	//print data
	log.Printf("Page data json: \n-------------\n%v\n-------------\n", toJson(data))
}

func toJson(v interface{}) string {
	data, _ := json.MarshalIndent(v, "", "\t")
	return string(data)
}

Run output:


Page data json: 
-------------
{
	"Title": "Pagser Title",
	"Keywords": [
		"golang",
		"pagser",
		"goquery",
		"html",
		"page",
		"parser",
		"colly"
	],
	"H1": "H1 Pagser Example",
	"Navs": [
		{
			"ID": -1,
			"Name": "Index",
			"Url": "/"
		},
		{
			"ID": 2,
			"Name": "Web page",
			"Url": "/list/web"
		},
		{
			"ID": 3,
			"Name": "Pc Page",
			"Url": "/list/pc"
		},
		{
			"ID": 4,
			"Name": "Mobile Page",
			"Url": "/list/mobile"
		}
	]
}
-------------

Configuration

type Config struct {
	TagName    string //struct tag name, default is `pagser`
	FuncSymbol string //function symbol, default is `->`
	Debug      bool   //debug mode, prints some logs when enabled, default is `false`
}

Struct Tag Grammar

[goquery selector]->[function]

Example:

type ExamData struct {
	Href string `pagser:".navLink li a->attr(href)"`
}

1. Struct tag name: pagser
2. goquery selector: .navLink li a
3. Function symbol: ->
4. Function name: attr
5. Function arguments: href
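
To make the breakdown above concrete, here is a minimal sketch of how a tag value could be split into its parts. It uses only the standard library; `tagParts` and `splitTag` are hypothetical names for illustration and are not pagser's own parser. Arguments are split naively on commas, so quoted arguments containing commas are out of scope here.

```go
package main

import (
	"fmt"
	"strings"
)

// tagParts holds the pieces of a tag such as ".navLink li a->attr(href)".
// Hypothetical type for illustration only.
type tagParts struct {
	Selector string   // goquery selector, may be empty
	FuncName string   // function name, empty when the tag has no function symbol
	Args     []string // raw arguments, split naively on commas
}

// splitTag splits a tag value at the first function symbol, then separates
// the function name from the text between the parentheses.
func splitTag(tag, funcSymbol string) tagParts {
	selector, call, found := strings.Cut(tag, funcSymbol)
	parts := tagParts{Selector: strings.TrimSpace(selector)}
	if !found {
		return parts // selector only, e.g. `pagser:"title"`
	}
	name, rest, hasParen := strings.Cut(strings.TrimSpace(call), "(")
	parts.FuncName = strings.TrimSpace(name)
	if !hasParen {
		return parts
	}
	rest = strings.TrimSuffix(strings.TrimSpace(rest), ")")
	if strings.TrimSpace(rest) == "" {
		return parts // call with no arguments, e.g. text()
	}
	for _, arg := range strings.Split(rest, ",") {
		parts.Args = append(parts.Args, strings.TrimSpace(arg))
	}
	return parts
}

func main() {
	p := splitTag(".navLink li a->attr(href)", "->")
	// prints: selector=".navLink li a" func="attr" args=["href"]
	fmt.Printf("selector=%q func=%q args=%q\n", p.Selector, p.FuncName, p.Args)
}
```

The function symbol is passed in as a parameter because the Configuration section lists it as configurable (`FuncSymbol`).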

Functions

Builtin functions

  • text() gets the element text, returns string. This is the default function when the struct tag defines none.
  • eachText() gets the text of each element, returns []string.
  • html() gets the element inner HTML, returns string.
  • eachHtml() gets the inner HTML of each element, returns []string.
  • outerHtml() gets the element outer HTML, returns string.
  • eachOutHtml() gets the outer HTML of each element, returns []string.
  • attr(name) gets the element attribute value, returns string.
  • eachAttr(name) gets the attribute value of each element, returns []string.
  • attrSplit(name, sep) gets the attribute value and splits it by the separator, returns []string.
  • attr('value') gets the value of the attribute named `value`, returns string, e.g. an element whose value attribute is "xxx" returns "xxx".
  • textSplit(sep) gets the element text and splits it by the separator, returns []string.
  • eachTextJoin(sep) gets the text of each element and joins it into one string, returns string.
  • eq(index) reduces the set of matched elements to the one at the specified index, returns a Selection for a nested struct.
  • ...

More builtin functions see docs: https://pkg.go.dev/github.com/foolin/pagser?tab=doc#BuiltinFunctions

Extension functions

  • Markdown() //converts HTML to Markdown format.
  • UgcHtml() //sanitizes HTML.

Extension functions must be registered before use, for example:

import "github.com/foolin/pagser/extensions/markdown"

p := pagser.New()

//Register Markdown
markdown.Register(p)

Custom function

Function interface

type CallFunc func(node *goquery.Selection, args ...string) (out interface{}, err error)

Define global function

// MyGlobalFunc is a global function. Register it with
// p.RegisterFunc("MyGlob", MyGlobalFunc) before using it in a struct tag.
func MyGlobalFunc(node *goquery.Selection, args ...string) (out interface{}, err error) {
	return "Global-" + node.Text(), nil
}

type PageData struct{
  MyGlobalValue string    `pagser:"->MyGlob()"`
}

func main(){

    p := pagser.New()

    //Register global function `MyGlob`
    p.RegisterFunc("MyGlob", MyGlobalFunc)

    //Todo

    //data parser model
    var data PageData
    //parse html data
    err := p.Parse(&data, rawPageHtml)

    //...
}

Define struct function

type PageData struct{
  MyFuncValue string    `pagser:"->MyFunc()"`
}

// MyFunc is a struct method. It is called automatically and needs no registration.
func (d PageData) MyFunc(node *goquery.Selection, args ...string) (out interface{}, err error) {
	return "Struct-" + node.Text(), nil
}


func main(){

    p := pagser.New()

    //Todo

    //data parser model
    var data PageData
    //parse html data
    err := p.Parse(&data, rawPageHtml)

    //...
}

Call Syntax

Note: all function arguments are strings; single quotes are optional.

  1. Function call with no arguments

->fn()

  2. Function call with one argument (single quotes are optional)

->fn(one)

->fn('one')

  3. Function call with many arguments

->fn(one, two, three, ...)

->fn('one', 'two', 'three', ...)

  4. Function call with single quotes and an escape character

->fn('it\'s ok', 'two,xxx', 'three', ...)
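
The quoting rules above can be sketched as a small tokenizer. This is an illustration written against the four cases listed, using only the standard library; `splitArgs` is a hypothetical helper and is not claimed to match pagser's internal parser in every edge case.

```go
package main

import (
	"fmt"
	"strings"
)

// splitArgs splits the text between the parentheses of a call into string
// arguments. Single quotes are optional, a comma inside quotes does not
// split, and a backslash escapes the next character.
func splitArgs(raw string) []string {
	if strings.TrimSpace(raw) == "" {
		return nil // ->fn()
	}
	var (
		args    []string
		cur     strings.Builder
		inQuote bool // currently between two single quotes
		quoted  bool // the current argument used quotes
	)
	flush := func() {
		arg := cur.String()
		if !quoted {
			arg = strings.TrimSpace(arg) // bare arguments ignore padding
		}
		args = append(args, arg)
		cur.Reset()
		quoted = false
	}
	runes := []rune(raw)
	for i := 0; i < len(runes); i++ {
		r := runes[i]
		switch {
		case r == '\\' && i+1 < len(runes):
			i++ // keep the escaped character literally
			cur.WriteRune(runes[i])
		case r == '\'':
			if !inQuote {
				cur.Reset() // drop whitespace before the opening quote
			}
			inQuote = !inQuote
			quoted = true
		case r == ',' && !inQuote:
			flush()
		case inQuote || !quoted:
			cur.WriteRune(r)
		}
		// Anything after a closing quote and before the next comma is ignored.
	}
	flush()
	return args
}

func main() {
	fmt.Printf("%q\n", splitArgs(`one, two, three`))
	fmt.Printf("%q\n", splitArgs(`'it\'s ok', 'two,xxx', 'three'`))
}
```

Because every argument arrives as a string, a function that needs a number or boolean has to convert it itself.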

Priority Order

Lookup function priority order:

struct method -> parent method -> ... -> global
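
The lookup order above can be illustrated with reflection. This is a sketch under simplifying assumptions: a plain string stands in for `*goquery.Selection`, and `callFunc`, `Parent`, `Child`, and `lookup` are hypothetical names, not pagser's API.

```go
package main

import (
	"fmt"
	"reflect"
)

// callFunc has the same shape as a tag function, with a plain string
// standing in for *goquery.Selection to keep the sketch dependency-free.
type callFunc func(text string, args ...string) (interface{}, error)

// Parent and Child are hypothetical nested structs.
type Parent struct{ Child Child }

func (Parent) Shared(text string, args ...string) (interface{}, error) {
	return "parent:" + text, nil
}

type Child struct{}

func (Child) Own(text string, args ...string) (interface{}, error) {
	return "child:" + text, nil
}

// lookup walks the chain from the innermost struct outwards and only then
// falls back to the global registry.
func lookup(name string, chain []interface{}, globals map[string]callFunc) (callFunc, bool) {
	for _, owner := range chain {
		m := reflect.ValueOf(owner).MethodByName(name)
		if !m.IsValid() {
			continue
		}
		if fn, ok := m.Interface().(func(string, ...string) (interface{}, error)); ok {
			return fn, true
		}
	}
	fn, ok := globals[name]
	return fn, ok
}

func main() {
	globals := map[string]callFunc{
		"Shared": func(text string, args ...string) (interface{}, error) { return "global:" + text, nil },
		"Glob":   func(text string, args ...string) (interface{}, error) { return "global:" + text, nil },
	}
	chain := []interface{}{Child{}, Parent{}} // innermost struct first

	for _, name := range []string{"Own", "Shared", "Glob"} {
		fn, _ := lookup(name, chain, globals)
		out, _ := fn("x")
		fmt.Println(name, "=>", out)
	}
}
```

In this sketch `Shared` resolves to the parent method even though a global function with the same name is registered, which is the behavior the priority order describes.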

More Examples

See the advanced example: https://github.com/foolin/pagser/tree/master/_examples/advance

Implicit type conversion

Type conversion is automatic and implicit: the string result of a function is converted to the type of the target field, such as int, int64, or float64.

Supported types:

  • bool
  • float32
  • float64
  • int
  • int32
  • int64
  • string
  • []bool
  • []float32
  • []float64
  • []int
  • []int32
  • []int64
  • []string
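
As a rough illustration of what such a conversion involves, the sketch below converts extracted strings into scalar field types with `reflect` and `strconv`. The Dependencies section lists github.com/spf13/cast; this standard-library version is a stand-in for illustration, and `setFromString`, `fill`, and `nav` are hypothetical names. Slice types are left out to keep it short.

```go
package main

import (
	"fmt"
	"reflect"
	"strconv"
)

// setFromString converts s to the kind of the target field and stores it.
// Only the scalar kinds from the list above are handled in this sketch.
func setFromString(field reflect.Value, s string) error {
	switch field.Kind() {
	case reflect.String:
		field.SetString(s)
	case reflect.Bool:
		b, err := strconv.ParseBool(s)
		if err != nil {
			return err
		}
		field.SetBool(b)
	case reflect.Int, reflect.Int32, reflect.Int64:
		n, err := strconv.ParseInt(s, 10, field.Type().Bits())
		if err != nil {
			return err
		}
		field.SetInt(n)
	case reflect.Float32, reflect.Float64:
		f, err := strconv.ParseFloat(s, field.Type().Bits())
		if err != nil {
			return err
		}
		field.SetFloat(f)
	default:
		return fmt.Errorf("unsupported kind %s", field.Kind())
	}
	return nil
}

// fill assigns each raw string to the struct field with the same name.
// dst must be a pointer to a struct.
func fill(dst interface{}, raw map[string]string) error {
	v := reflect.ValueOf(dst).Elem()
	for name, s := range raw {
		f := v.FieldByName(name)
		if !f.IsValid() {
			return fmt.Errorf("no field %q", name)
		}
		if err := setFromString(f, s); err != nil {
			return fmt.Errorf("field %s: %w", name, err)
		}
	}
	return nil
}

// nav is a hypothetical target struct.
type nav struct {
	ID     int
	Score  float64
	Active bool
	Name   string
}

func main() {
	var n nav
	err := fill(&n, map[string]string{"ID": "2", "Score": "4.5", "Active": "true", "Name": "Web page"})
	if err != nil {
		fmt.Println("convert:", err)
		return
	}
	fmt.Printf("%+v\n", n) // prints: {ID:2 Score:4.5 Active:true Name:Web page}
}
```

A string that cannot be parsed as the target type, such as "abc" for an int field, surfaces as an error rather than a silent zero value in this sketch.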

Examples

Crawl page example

package main

import (
	"encoding/json"
	"github.com/foolin/pagser"
	"log"
	"net/http"
)

type PageData struct {
	Title    string `pagser:"title"`
	RepoList []struct {
		Names       []string `pagser:"h1->textSplit('/', true)"`
		Description string   `pagser:"h1 + p"`
		Stars       string   `pagser:"a.muted-link->eqAndText(0)"`
		Repo        string   `pagser:"h1 a->attrConcat('href', 'https://github.com', $value, '?from=pagser')"`
	} `pagser:"article.Box-row"`
}

func main() {
	resp, err := http.Get("https://github.com/trending")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	//New default config
	p := pagser.New()

	//data parser model
	var data PageData
	//parse html data
	err = p.ParseReader(&data, resp.Body)
	//check error
	if err != nil {
		log.Fatal(err)
	}

	//print data
	log.Printf("Page data json: \n-------------\n%v\n-------------\n", toJson(data))
}

func toJson(v interface{}) string {
	data, _ := json.MarshalIndent(v, "", "\t")
	return string(data)
}

Run output:


2020/04/25 12:26:04 Page data json: 
-------------
{
	"Title": "Trending  repositories on GitHub today · GitHub",
	"RepoList": [
		{
			"Names": [
				"pcottle",
				"learnGitBranching"
			],
			"Description": "An interactive git visualization to challenge and educate!",
			"Stars": "16,010",
			"Repo": "https://github.com/pcottle/learnGitBranching?from=pagser"
		},
		{
			"Names": [
				"jackfrued",
				"Python-100-Days"
			],
			"Description": "Python - 100天从新手到大师",
			"Stars": "83,484",
			"Repo": "https://github.com/jackfrued/Python-100-Days?from=pagser"
		},
		{
			"Names": [
				"brave",
				"brave-browser"
			],
			"Description": "Next generation Brave browser for macOS, Windows, Linux, Android.",
			"Stars": "5,963",
			"Repo": "https://github.com/brave/brave-browser?from=pagser"
		},
		{
			"Names": [
				"MicrosoftDocs",
				"azure-docs"
			],
			"Description": "Open source documentation of Microsoft Azure",
			"Stars": "3,798",
			"Repo": "https://github.com/MicrosoftDocs/azure-docs?from=pagser"
		},
		{
			"Names": [
				"ahmetb",
				"kubectx"
			],
			"Description": "Faster way to switch between clusters and namespaces in kubectl",
			"Stars": "6,979",
			"Repo": "https://github.com/ahmetb/kubectx?from=pagser"
		},

        //...        

		{
			"Names": [
				"serverless",
				"serverless"
			],
			"Description": "Serverless Framework – Build web, mobile and IoT applications with serverless architectures using AWS Lambda, Azure Functions, Google CloudFunctions \u0026 more! –",
			"Stars": "35,502",
			"Repo": "https://github.com/serverless/serverless?from=pagser"
		},
		{
			"Names": [
				"vuejs",
				"vite"
			],
			"Description": "Experimental no-bundle dev server for Vue SFCs",
			"Stars": "1,573",
			"Repo": "https://github.com/vuejs/vite?from=pagser"
		}
	]
}
-------------

Colly Example

Work with colly:

p := pagser.New()


// Call the callback on the body element of every visited page
collector.OnHTML("body", func(e *colly.HTMLElement) {
	//data parser model
	var data PageData
	//parse html data
	err := p.ParseSelection(&data, e.DOM)
	//check error
	if err != nil {
		log.Println(err)
		return
	}
	//use data ...
})

Dependencies

  • github.com/PuerkitoBio/goquery

  • github.com/spf13/cast

Extensions:

  • github.com/mattn/godown

  • github.com/microcosm-cc/bluemonday

Issues
  • Why are there "each" versions of functions?

    I think it could be simplified so that one could do:

    type PageData struct {
    	Link    string   `pagser:"a->attr(href)"`
    	Links   []string `pagser:"a->attr(href)"`
    }
    

    The package could inspect the type associated with the tag and if it is slice, automatically assume "each" behavior. In general I think all functions should be returning slices, which then optionally are converted to an individual value (by default the first one, but you could set using ->Eq some other index).

    opened by mitar 0
  • How to select an HTML element based on its text content?

    What should be the selector for this kind of html code:

    <h2>Firstname</h2>
    <p>John</p>
    <!--- some random html code, with random h2 tags -->
    <h2>Lastname</h2>
    <p>Doe</p>
    

    I would like to fill this struct:

    type Person struct {
        Firstname string `pagser:"h2[text=Firstname]+p"`
        Lastname  string `pagser:"h2[text=Lastname]+p"`
    }
    
    opened by sebbbastien 1
  • `->eq()` should return a Selection if it's followed by something else?

    type Item struct {
           Title       string `pagser:"td->eq(0)"`
           Image       string `pagser:"td a img->attr(src)"`
           Quote       string `pagser:"td->eq(3)"`
           Description string `pagser:"td->eq(4)"`
    }
    

    If I put td->eq(2) in the tag for Image, I get the text() for the full tr. But without the ->eq(2), it's possible I may end up with a different image. Ideally don't want to make a sub-struct just for this one field if possible.

    (Happy to have a go at implementing this if it's actually possible.)

    opened by rjp 1