soup

Web Scraper in Go, similar to BeautifulSoup

Overview

soup is a small web scraper package for Go, with its interface highly similar to that of BeautifulSoup.

Exported variables and functions implemented so far:

var Headers map[string]string // Set headers as a map of key-value pairs, an alternative to calling Header() individually
var Cookies map[string]string // Set cookies as a map of key-value pairs, an alternative to calling Cookie() individually
func Get(string) (string, error) {} // Takes the url as an argument, returns HTML string
func GetWithClient(string, *http.Client) (string, error) {} // Takes the url and a custom HTTP client as arguments, returns HTML string
func Post(string, string, interface{}) (string, error) {} // Takes the url, bodyType, and payload as arguments, returns HTML string
func PostForm(string, url.Values) (string, error) {} // Takes the url and body; bodyType is set to "application/x-www-form-urlencoded"
func Header(string, string) {} // Takes a key-value pair to set as a header for the HTTP request made in Get()
func Cookie(string, string) {} // Takes a key-value pair to set as a cookie to be sent with the HTTP request in Get()
func HTMLParse(string) Root {} // Takes the HTML string as an argument, returns a pointer to the DOM constructed
func Find([]string) Root {} // Element tag (and optional attribute key-value pair) as arguments, pointer to the first occurrence returned
func FindAll([]string) []Root {} // Same as Find(), but pointers to all occurrences returned
func FindStrict([]string) Root {} // Element tag (and optional attribute key-value pair) as arguments, pointer to the first occurrence with exactly matching values returned
func FindAllStrict([]string) []Root {} // Same as FindStrict(), but pointers to all occurrences returned
func FindNextSibling() Root {} // Pointer to the next sibling of the Element in the DOM returned
func FindNextElementSibling() Root {} // Pointer to the next element sibling of the Element in the DOM returned
func FindPrevSibling() Root {} // Pointer to the previous sibling of the Element in the DOM returned
func FindPrevElementSibling() Root {} // Pointer to the previous element sibling of the Element in the DOM returned
func Children() []Root {} // Returns all direct children of this DOM element
func Attrs() map[string]string {} // Map returned with all the attributes of the Element as lookups to their respective values
func Text() string {} // Full text inside a non-nested tag returned; text before the first nested tag returned in a nested one
func FullText() string {} // Full text inside a nested/non-nested tag returned
func SetDebug(bool) {} // Sets the debug mode to true or false; false by default
func HTML() string {} // Returns the HTML code for the specific element
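
The request helpers above can be combined with a custom HTTP client; a minimal sketch, assuming the signatures listed (GetWithClient returning the HTML string and an error):

package main

import (
	"fmt"
	"net/http"
	"os"
	"time"

	"github.com/anaskhan96/soup"
)

func main() {
	// Set a header and a cookie to be sent with subsequent requests.
	soup.Header("User-Agent", "my-scraper/0.1")
	soup.Cookie("session", "abc123")

	// Use a custom client with a timeout instead of the default one.
	client := &http.Client{Timeout: 10 * time.Second}
	html, err := soup.GetWithClient("https://example.com", client)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(len(html), "bytes fetched")
}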

Root is a struct containing three fields:

  • Pointer containing the pointer to the current html node
  • NodeValue containing the current html node's value, i.e. the tag name for an ElementNode, or the text in the case of a TextNode
  • Error containing an error in a struct if one occurs, else nil. A detailed text explanation of the error can be accessed using the Error() function. A field Type in this struct, of type ErrorType, denotes the kind of error that took place, and will be one of the following (see the sketch after this list):
    • ErrUnableToParse
    • ErrElementNotFound
    • ErrNoNextSibling
    • ErrNoPreviousSibling
    • ErrNoNextElementSibling
    • ErrNoPreviousElementSibling
    • ErrCreatingGetRequest
    • ErrInGetRequest
    • ErrReadingResponse
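
A minimal sketch of inspecting these errors, assuming Root.Error holds a soup.Error value whose Type field carries one of the constants above:

package main

import (
	"fmt"

	"github.com/anaskhan96/soup"
)

func main() {
	doc := soup.HTMLParse("<html><body><p>hello</p></body></html>")
	link := doc.Find("a") // no <a> in the document, so Error is set
	if link.Error != nil {
		// Branch on the error kind instead of matching message text.
		if err, ok := link.Error.(soup.Error); ok && err.Type == soup.ErrElementNotFound {
			fmt.Println("element not found:", err.Error())
			return
		}
		fmt.Println("unexpected error:", link.Error)
	}
}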

Installation

Install the package using the command

go get github.com/anaskhan96/soup

Example

Example code is given below to scrape the "Comics I Enjoy" section (text and its links) from xkcd.

More Examples

package main

import (
	"fmt"
	"github.com/anaskhan96/soup"
	"os"
)

func main() {
	resp, err := soup.Get("https://xkcd.com")
	if err != nil {
		os.Exit(1)
	}
	doc := soup.HTMLParse(resp)
	links := doc.Find("div", "id", "comicLinks").FindAll("a")
	for _, link := range links {
		fmt.Println(link.Text(), "| Link :", link.Attrs()["href"])
	}
}
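
Since, as noted in the function list above, Text() returns only the text before the first nested tag while FullText() returns everything, a quick comparison sketch:

package main

import (
	"fmt"

	"github.com/anaskhan96/soup"
)

func main() {
	doc := soup.HTMLParse(`<p>Hello <b>world</b> again</p>`)
	p := doc.Find("p")
	fmt.Println(p.Text())     // "Hello " (stops at the nested <b> tag)
	fmt.Println(p.FullText()) // "Hello world again"
}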

Contributions

This package was developed in my free time; however, contributions from everyone in the community are welcome to make it a better web scraper. If you think a particular feature or function should be included in the package, feel free to open an issue or a pull request.

Issues
  • Find by single class

    Currently Find("a", "class", "message") would only work if the element was <a class="message"></a>, but would not work on <a class="message input-message"></a>, even though both have the class message.

    Could this be added?

    opened by ghandic 16
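
    This matching was later relaxed by the FindStrict/FindAllStrict change further down; a minimal sketch of the behavior since then, assuming a recent version of the package:

    package main

    import (
    	"fmt"

    	"github.com/anaskhan96/soup"
    )

    func main() {
    	doc := soup.HTMLParse(`<a class="message input-message">hi</a>`)

    	// Find matches when "message" is any one of the space-separated
    	// class values, so this succeeds.
    	fmt.Println(doc.Find("a", "class", "message").Error) // <nil>

    	// FindStrict requires the attribute value to match exactly,
    	// so this reports an element-not-found error.
    	fmt.Println(doc.FindStrict("a", "class", "message").Error)
    }
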
  • &nbsp; causes no text to be returned

    An odd issue I'm having while using soup to parse Fmylife's site for FMLs: when I get an FML that contains an &nbsp; entity

    <p class="block">
    <a href="/article/today-on-the-bus-i-saw-my-ex-girlfriend-get-on-despite-several-seats-being-open-she-specifically_190836.html">
    <span class="icon-piment"></span>&nbsp;
    [Insert FML text here] FML
    </a>
    </p>
    

    When I try to call the text, it returns blank text and nothing else.

    I usually call it using .Find("p", "class", "block").Find("a").Text(), and if the element doesn't have the whitespace entity, it returns fine.

    bug help wanted stale 
    opened by FM1337 10
  • Proposal: Add an "Empty" func to Root that would make it easier to tell when a query didn't return results

    Right now I suppose you would do this by checking if the error was non-nil and then checking the error to see if it contained "not found", which you would only know about if you read the source code of this project 😄

    I think what I am proposing is to add something that does that check for you in the library. Maybe something like:

    func (r Root) Empty() bool {
    	if r.Error == nil {
    		return false
    	}
    	return strings.Contains(r.Error.Error(), "not found")
    }
    

    Is this something other people would see as valuable? I would use it sorta like this:

    main := doc.Find("section", "class", "gramb")
    if main.Empty() {
      return errors.New("No results for this query")
    }
    defs := main.FindAll("span", "class", "ind")
    // Other processing here
    

    Right now I'm just checking if main.Error is non-nil and returning no results. It would just be nice (I think) to have a cleaner interface around it.

    If you think this is worth doing I'd love to take a crack at it!

    Thanks for this library, it's immensely helpful to my side project 😄

    enhancement 
    opened by ghostlandr 7
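
    With the defined errors that were later added (see "Add defined errors to the package" below), a helper like this could check the error kind instead of its message; a hypothetical sketch, not part of the package:

    // isEmpty reports whether a query returned no results, assuming
    // Root.Error wraps a soup.Error carrying an ErrorType.
    func isEmpty(r soup.Root) bool {
    	if r.Error == nil {
    		return false
    	}
    	err, ok := r.Error.(soup.Error)
    	return ok && err.Type == soup.ErrElementNotFound
    }
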
  • [BUG]: Search classes with spaces fails every time (even in the weather example you provided)

    Hi, I tried your weather example and it always throws an "invalid memory address" error. I tried to reproduce the same bug with another website, and it can actually only search for classes without any spaces inside them. I don't know why, but your parser stopped understanding spaces. I added an fmt.Println() call in order to print the only class search with spaces (grid); that's the code:

    package main
    
    import (
    	"bufio"
    	"fmt"
    	"log"
    	"os"
    	"strings"
    
    	"github.com/anaskhan96/soup"
    )
    
    func main() {
    	fmt.Printf("Enter the name of the city : ")
    	city, _ := bufio.NewReader(os.Stdin).ReadString('\n')
    	city = city[:len(city)-1]
    	cityInURL := strings.Join(strings.Split(city, " "), "+")
    	url := "https://www.bing.com/search?q=weather+" + cityInURL
    	resp, err := soup.Get(url)
    	if err != nil {
    		log.Fatal(err)
    	}
    	doc := soup.HTMLParse(resp)
    	grid := doc.Find("div", "class", "b_antiTopBleed b_antiSideBleed b_antiBottomBleed")
    	fmt.Println("Print grid:", grid)
    	heading := grid.Find("div", "class", "wtr_titleCtrn").Find("div").Text()
    	conditions := grid.Find("div", "class", "wtr_condition")
    	primaryCondition := conditions.Find("div")
    	secondaryCondition := primaryCondition.FindNextElementSibling()
    	temp := primaryCondition.Find("div", "class", "wtr_condiTemp").Find("div").Text()
    	others := primaryCondition.Find("div", "class", "wtr_condiAttribs").FindAll("div")
    	caption := secondaryCondition.Find("div").Text()
    	fmt.Println("City Name : " + heading)
    	fmt.Println("Temperature : " + temp + "˚C")
    	for _, i := range others {
    		fmt.Println(i.Text())
    	}
    	fmt.Println(caption)
    }
    

    And that's the output:

    Enter the name of the city : New York
    Print grid: {<nil>  element `div` with attributes `class b_antiTopBleed b_antiSideBleed b_antiBottomBleed` not found}
    panic: runtime error: invalid memory address or nil pointer dereference
    [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x61d1f5]
    
    goroutine 1 [running]:
    github.com/anaskhan96/soup.findOnce(0x0, 0xc42005be68, 0x3, 0x3, 0xc420050000, 0x4aa247, 0xc420261e00)
    	/home/fef0/go/src/github.com/anaskhan96/soup/soup.go:304 +0x315
    github.com/anaskhan96/soup.Root.Find(0x0, 0x0, 0x0, 0x6e1e60, 0xc420242070, 0xc42005be68, 0x3, 0x3, 0x0, 0x0, ...)
    	/home/fef0/go/src/github.com/anaskhan96/soup/soup.go:120 +0x8d
    main.main()
    	/home/fef0/Code/Go/Test/Test.go:26 +0x4e3
    exit status 2
    

    If you notice, in the second line it was impossible to find the grid, but in fact that happens only because there are spaces in the class name. I hope you can fix this as soon as possible, bye for now!

    opened by Fef0 6
  • Add defined errors to the package

    Hello! I have finally come back for #29 and solved it basically the way you suggested. I'm looking to get involved with more Go OSS, and this is my first PR so far.

    My biggest concern is that it's theoretically a breaking change: if someone was depending on having error details in Error previously, those are now gone.

    The reason this touches so many lines is that, as part of adding the ErrorDetails to Root, I also changed the usages of Root to use labelled fields rather than specifying everything positionally in each initialization.

    Let me know if I should add an example of this in use to either the Readme or the examples folder.

    opened by ghostlandr 5
  • Check if element exists without triggering warnings in console?

    I'm curious if there's a way to check whether an element exists and have a Boolean returned, rather than having the console just output something like

    2017/06/06 11:21:52 Error occurred in Find() : Element `div` with attributes `class title` not found
    
    opened by FM1337 5
  • Debug mode, check if element is found and correct comments

    According to #7, with this merge you will be able to:

    1. Check if the element is found with the Error field in the Root struct;
    2. Toggle debug mode with the SetDebug() function. Default is false; if set to true, it will show the various panic() calls.

    Example to check if the node is found (no panic will appear in the terminal):

    source := soup.HTMLParse(resp)
    articles := source.Find("section", "class", "loop").FindAll("article")
    for _, article := range articles {
    	link := article.Find("h2").Find("a")
    	if link.Error == nil { // link is an instance of Root
    		fmt.Println(link.Text(), "| Link :", link.Attrs()["href"])
    	}
    }
    

    Example to check if the node is found with debug mode (panic will appear in terminal):

    soup.SetDebug(true)
    
    source := soup.HTMLParse(resp)
    articles := source.Find("section", "class", "loop").FindAll("article")
    for _, article := range articles {
    	link := article.Find("h2").Find("a")
    	if link.Error == nil { // link is an instance of Root
    		fmt.Println(link.Text(), "| Link :", link.Attrs()["href"])
    	}
    }
    

    Notes:

    • I added correct comments to each function, interface and struct. Node, Root, FindNextSibling and FindPrevSibling need edits on their comments.
    • The example codes should be updated.
    opened by danilopolani 4
  • FindStrict and FindAllStrict

    Implementation of the FindStrict and FindAllStrict functions, based on the #15 discussion. New test scenarios were added for these functions and for the old Find and FindAll functions.

    I've also implemented a slightly different algorithm for searching for a value occurrence in a tag's attribute values. In the previous implementation, only the first occurrence in the attribute values was checked:

    strings.Fields(n.Attr[i].Val)[0]  == args[2]
    

    I've changed this behavior to search in all attribute values:

    func attributeContainsValue(attr html.Attribute, attribute, value string) bool {
    	if attr.Key == attribute {
    		for _, attrVal := range strings.Fields(attr.Val) {
    			if attrVal == value {
    				return true
    			}
    		}
    	}
    	return false
    }
    

    I think that in this case the order of the values doesn't matter, so there is no difference between the <div class="first second"> and <div class="second first"> elements.

    opened by Salmondx 4
  • Crashed with SIGSEGV

    Trying to run the test weather.go on my machine, I got this:

    Enter the name of the city : Brisbane
    panic: runtime error: invalid memory address or nil pointer dereference
    [signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x665715]

    goroutine 1 [running]:
    github.com/anaskhan96/soup.findOnce(0x0, 0xc0000bdea8, 0x3, 0x3, 0x0, 0x70207e, 0x13)
    	/home/stevek/go/src/github.com/anaskhan96/soup/soup.go:345 +0x315
    github.com/anaskhan96/soup.Root.Find(0x0, 0x0, 0x0, 0x75c820, 0xc000364040, 0xc0000bdea8, 0x3, 0x3, 0x0, 0x0, ...)
    	/home/stevek/go/src/github.com/anaskhan96/soup/soup.go:121 +0x82
    main.main()
    	/home/stevek/tmp/go-lang/src/weather.go:24 +0x49d
    exit status 2

    opened by sunshine69 4
  • soup.HTMLParse() returning nil

    This method was previously working, but for some reason it now returns nil every single time.

    //example
    t, _ := soup.Get("https://google.com")
    fmt.Println(soup.HTMLParse(t)) //prints {address <nil>}
    
    opened by amirgamil 0
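
    The snippet above discards the error from Get; checking it, along with Root.Error, narrows down whether the request or the parse failed. A sketch:

    t, err := soup.Get("https://google.com")
    if err != nil {
    	log.Fatal(err) // a failed request would otherwise leave t empty
    }
    doc := soup.HTMLParse(t)
    if doc.Error != nil {
    	log.Fatal(doc.Error)
    }
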
  • findOnce breaks after the first child node

    If the element is not found in the first child node, the value is returned and the loop has no effect.

    I think this should be if q {

    for c := n.FirstChild; c != nil; c = c.NextSibling {
    	p, q := findOnce(c, args, true, strict)
    	if !q {
    		return p, q
    	}
    }
    

    https://github.com/anaskhan96/soup/blob/cb47551b378185a4504cf253c2abfbeea361cabf/soup.go#L504

    opened by knuppe 0
  • fatal error: concurrent map iteration and map write

    fatal error: concurrent map iteration and map write
    
    goroutine 833 [running]:
    runtime.throw(0x7161cb, 0x26)
            /root/.gvm/gos/go1.15.5/src/runtime/panic.go:1116 +0x72 fp=0xc000069938 sp=0xc000069908 pc=0x437312
    runtime.mapiternext(0xc000069a10)
            /root/.gvm/gos/go1.15.5/src/runtime/map.go:853 +0x554 fp=0xc0000699b8 sp=0xc000069938 pc=0x412574 
    github.com/anaskhan96/soup.setHeadersAndCookies(0xc0002e4600)
            /root/go/pkg/mod/github.com/anaskhan96/[email protected]/soup.go:145 +0x87 fp=0xc000069b28 sp=0xc0000699b8 pc=0x6831a7
    github.com/anaskhan96/soup.GetWithClient(0xc000288660, 0x24, 0xc0004b3da0, 0x0, 0x0, 0x0, 0x0)
            /root/go/pkg/mod/github.com/anaskhan96/[email protected]/soup.go:117 +0x18b fp=0xc000069be0 sp=0xc000069b28 pc=0x682cab
    
    opened by weaming 0
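
    The trace shows the package-level Headers map being iterated by setHeadersAndCookies while another goroutine writes to it. Until the package synchronizes access itself, one workaround is to serialize header writes and requests behind a caller-owned lock; a sketch:

    var mu sync.Mutex

    func fetch(url string) (string, error) {
    	// soup reads the shared Headers/Cookies maps inside Get, so
    	// guard both the write and the request with the same lock.
    	mu.Lock()
    	defer mu.Unlock()
    	soup.Header("User-Agent", "my-scraper/0.1")
    	return soup.Get(url)
    }
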
  • InnerHTML

    Is there any way to get the equivalent of .HTML() but excluding the element's own markup (just like JS' .innerHTML), without having to resort to regex?

    An example:

    element.HTML() yields <p><a href="square-cover-art.jpeg">My <em>wacky</em> label with <strong>bold</strong> and <code>code</code> and stuff “hmmm”</a></p>

    I want to get <a href="square-cover-art.jpeg">My <em>wacky</em> label with <strong>bold</strong> and <code>code</code> and stuff “hmmm”</a>

    I guess I could iterate over element.Children() and concatenate each child's .HTML(), but I think having an .InnerHTML() would make things nicer (and a tad better when it comes to performance, I guess).

    I'm willing to make a PR :)

    opened by ewen-lbh 0
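
    The iteration the issue suggests can be written today as a small helper; a sketch, with the caveat that it only covers whatever Children() returns (innerHTML here is hypothetical, not part of the package):

    import "strings"

    // innerHTML concatenates the HTML of each direct child, approximating
    // JS' innerHTML without the element's own markup.
    func innerHTML(r soup.Root) string {
    	var b strings.Builder
    	for _, child := range r.Children() {
    		b.WriteString(child.HTML())
    	}
    	return b.String()
    }
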
  • How to get the HTTP status code?

    I want to get the status code from the HTTP request to make sure the response from the website returns 200 OK before continuing the process. Sometimes the website returns 404.

    Any way to get the HTTP status code?

    opened by chuongtrh 1
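
    The package does not expose the response itself, but a workaround is to perform the request with net/http directly and hand the body to HTMLParse; a sketch:

    resp, err := http.Get("https://example.com")
    if err != nil {
    	log.Fatal(err)
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
    	log.Fatalf("unexpected status: %s", resp.Status)
    }
    body, err := io.ReadAll(resp.Body)
    if err != nil {
    	log.Fatal(err)
    }
    doc := soup.HTMLParse(string(body))
    _ = doc // continue scraping as usual
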
  • Navigating to Parent

    For proper selection it would be awesome to be able to navigate to the current element's parent, then keep going through siblings and so on.

    Right now it is quite hard to properly find what I am looking for from a strict top-down view.

    enhancement 
    opened by joekinley 1
  • Anything akin to BeautifulSoup's Comment?

    Just curious if soup has anything similar to how BeautifulSoup lets you parse HTML comments in Python?

    https://www.crummy.com/software/BeautifulSoup/bs4/doc/#comments-and-other-special-strings

    Trying to parse some HTML where some data is commented, and able to do the following in Python:

    from bs4 import BeautifulSoup, Comment
    soup = BeautifulSoup(html, 'lxml')
    comments = soup.find_all(text=lambda text: isinstance(text, Comment))
    comments_soup = BeautifulSoup(comments[0], 'lxml')
    

    Is there anything close to that here? Or any chance or adding something like it?

    enhancement 
    opened by mtslzr 1
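
    There is no direct equivalent yet, but since Root exposes the underlying *html.Node via its Pointer field, comment nodes can be collected with golang.org/x/net/html directly; a sketch:

    import "golang.org/x/net/html"

    // comments walks the subtree rooted at n and collects the text of
    // every comment node.
    func comments(n *html.Node) []string {
    	var out []string
    	if n.Type == html.CommentNode {
    		out = append(out, n.Data)
    	}
    	for c := n.FirstChild; c != nil; c = c.NextSibling {
    		out = append(out, comments(c)...)
    	}
    	return out
    }

    // Usage: comments(doc.Pointer) where doc is a soup.Root.
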
  • Find Or

    I have a case where I want an element that is either a div or a p. I don't know how to do that; it's probably not possible with the existing library, and we would need something like FindOr.

    enhancement 
    opened by bajro17 2
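
    Until something like FindOr exists, a small helper can try each tag in turn; a hypothetical sketch, not part of the package:

    // findOr returns the first successful Find among the given tags;
    // if none match, the last (error-carrying) Root is returned.
    func findOr(r soup.Root, tags ...string) soup.Root {
    	var last soup.Root
    	for _, tag := range tags {
    		last = r.Find(tag)
    		if last.Error == nil {
    			return last
    		}
    	}
    	return last
    }
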
  • How to use selectors?

    bs allows you to use select for CSS selectors. Is there any such thing in this library?

    enhancement 
    opened by kadnan 1
  • FindAll Regex

    Hi, I'm just starting out with Go, so this question might be dumb.

    Is there a way, with this library, to FindAll with a regular expression?

    If it is not implemented, will it be? Or am I looking at the wrong package?

    thanks

    enhancement 
    opened by guanicoe 1
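
    Regex matching isn't built in, but results from FindAll can be filtered with the standard regexp package; a sketch:

    re := regexp.MustCompile(`(?i)comics`)
    for _, link := range doc.FindAll("a") {
    	if re.MatchString(link.Text()) {
    		fmt.Println(link.Text(), link.Attrs()["href"])
    	}
    }
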
Releases

Latest release: v1.2.4

Owner: Anas Khan