Declarative web scraping

Overview

Ferret

Try it! · Docs · CLI · Test runner · Web worker

What is it?

ferret is a web scraping system. It aims to simplify data extraction from the web for UI testing, machine learning, analytics and more.
ferret allows users to focus on the data. It abstracts away the technical details and complexity of underlying technologies using its own declarative language. It is extremely portable, extensible, and fast.
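
For a taste of the language, here is a minimal FQL sketch (the URL and selector are illustrative only):

    // Load a page with the default static HTTP driver
    // and return the text and target of every matched link.
    LET doc = DOCUMENT("https://news.ycombinator.com/")

    FOR el IN ELEMENTS(doc, "a.titlelink")
        RETURN {
            title: el.innerText,
            url: el.attributes.href
        }

Passing an options object such as { driver: 'cdp' } to DOCUMENT renders the page in headless Chrome instead, which is how dynamic pages are handled.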

Read the introductory blog post about Ferret here!

Features

  • Declarative language
  • Support of both static and dynamic web pages
  • Embeddable
  • Extensible

Documentation is available at our website.

Issues
  • Website

    Create a website with a nice and clean design. It should contain:

    • [ ] Introduction
    • [ ] Quick start
    • [ ] Guideline
    • [ ] API documentation
    • [ ] FAQ (?)
    • [ ] Contact information

    The website repo is here.

    type/enhancement help wanted 
    opened by ziflex 19
  • New to Go, Not working on Ubuntu 18

    I'm new to this go stuff. I tried installing this on Ubuntu 18. First installing go, and then trying to make ferret... Would it be possible to post a completely newbie guide with all the command line steps? Would be much appreciated, thanks!

    opened by alteredorange 17
  • How to get rid of converted characters in URLs

    I am developing a crawler and so far, so very good: thank you for this outstanding crawler.

    The only issue is that, in the returned URLs, there is a & character which gets converted into \u0026, thus: "https://thedomain/alphabet=M\u0026borough=Bronx"

    So I tried to replace it, either by using SUBSTITUTE: RETURN SUBSTITUTE(prfx + letter.attributes.href, "\u0026", "&")

    or REGEX_REPLACE.

    In both cases, the \u0026 string is NOT replaced and remains embedded in the resulting URLs. However, when I try SUBSTITUTE on, say, a -> z, it works fine.

    Is it a limitation of JSON, which I use as an output format? How can I get rid of the converted string, as it prevents me from crawling the lower levels of the website?

    type/bug good first issue area/stdlib 
    opened by asitemade4u 15
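
    A plausible explanation, sketched below: inside the runtime the value holds a plain "&", and the \u0026 form appears only once the result is serialized to JSON (JSON encoders may escape "&" this way); any JSON parser decodes it back to "&". That would also explain why SUBSTITUTE finds nothing to replace. A hedged FQL check:

    // The runtime value contains a literal "&", so substituting on "&" works;
    // there is no "\u0026" substring for SUBSTITUTE to find.
    LET url = "https://thedomain/alphabet=M&borough=Bronx"
    RETURN {
        has_amp: CONTAINS(url, "&"),       // true
        swapped: SUBSTITUTE(url, "&", "+") // replaces the ampersand itself
    }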
  • Create a Docker image with stripped down version of Chromium

    Chrome is awesome and all, but for scraping tasks it's too heavy. We need to investigate how to create a custom build that strips features irrelevant to web scraping, and publish it as a Docker image.

    Some links:

    • https://github.com/gcarq/inox-patchset
    • https://github.com/Eloston/ungoogled-chromium
    type/enhancement help wanted good first issue area/drivers/cdp hacktoberfest 
    opened by ziflex 13
  • Add object functions

    • [x] KEYS(object, sort) → strArray
    • [x] HAS(object, keyName) → isPresent
    • [x] ~~LENGTH(object) → count~~ (Implemented here)
    • [x] MERGE(object1, object2, ... objectN) → newMergedObject
    • [x] MERGE_RECURSIVE(object1, object2, ... objectN) → newMergedObject
    • [x] VALUES(document, removeInternal) → anyArray
    • [x] ZIP(keys, values) → newObj
    • [x] KEEP(object, key1, key2, ... keyN) → newObj
    type/enhancement good first issue area/stdlib 
    opened by ziflex 13
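
    For reference, hedged FQL examples of the helpers listed above, with results assuming the signatures from the checklist:

    LET obj = { b: 2, a: 1 }

    RETURN {
        keys: KEYS(obj, true),           // ["a", "b"]
        has: HAS(obj, "a"),              // true
        merged: MERGE(obj, { c: 3 }),    // { a: 1, b: 2, c: 3 }
        zipped: ZIP(["x", "y"], [1, 2]), // { x: 1, y: 2 }
        kept: KEEP(obj, "a")             // { a: 1 }
    }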
  • Many changes

    1. Added options for the HTTP driver: WithTimeout, WithBodyLimit.
    2. Changed the default log level from Debug to Trace when setting HTTP headers.
    3. Added a ferret function, to_binary, for converting data to a byte array (necessary for requests with a body).
    4. Added a mutex around the params map because, in some cases, a race condition emerges.
    5. The big change is a unification: loading a document without preparing a DOM tree. The current implementation (IO::NET::HTTP) does not use the proxy, does not return headers and cookies, and duplicates code in http.Driver, so I rewrote it so that ferret returns an HTTPResponse object (implementation and tests).
    opened by bundleman 12
  • STYLE_GET seems broken

    Describe the bug: Not sure, since it's my first test of it, but STYLE_GET seems broken.

    To Reproduce

    LET doc = DOCUMENT('https://news.ycombinator.com/', {
        driver: 'cdp',
        viewport: {
            width: 1920,
            height: 1080
        }
    })
    
    
    WAIT_ELEMENT(doc, '.storylink', 5000)
    
    FOR el IN ELEMENTS(doc, '.title')
        RETURN STYLE_GET(el, 'font-family')
    

    will return [{},{},{}...]

    Expected behavior

    font-family: Verdana, Geneva, sans-serif

    Desktop (please complete the following information):

    • Version: 0.11.1
    type/bug area/drivers/cdp area/drivers status/review-needed 
    opened by PierreBrisorgueil 12
  • Fix arithmetic operators

    The arithmetic operators must accept operands of any type. When a non-numeric value is passed to an arithmetic operator, the operands must be cast to numbers:

    • NONE will be converted to 0
    • false will be converted to 0, true will be converted to 1
    • a valid numeric value remains unchanged, but NaN and Infinity will be converted to 0
    • string values are converted to a number if they contain a valid string representation of a number. Any whitespace at the start or the end of the string is ignored. Strings with any other contents are converted to the number 0
    • an empty array is converted to 0, an array with one member is converted to the numeric representation of its sole member. Arrays with more members are converted to the number 0.
    • objects, binary and custom types are converted to the number 0.

    Here are a few examples:

    Update:

    1 + "a"                 // "1a"
    1 + "99"                // "199"
    1 + NONE                // 1
    NONE + 1                // 1
    3 + [ ]                 // 3
    24 + [ 2 ]              // 26
    24 + [ 2, 4 ]           // 30
    25 - NONE               // 25
    17 - true               // 16
    23 * { }                // 0
    5 * [ 7 ]               // 35
    5 * [ 7, 2 ]            // 45
    24 / "12"               // 2
    1 / 0                   // panic
    
    type/bug area/runtime 
    opened by ziflex 12
  • Google example does not work with version 0.10

    Describe the bug: The Google example no longer works with version 0.10.

    To Reproduce Steps to reproduce the behavior:

    1. Run with version 0.9 the script https://github.com/MontFerret/ferret/blob/master/examples/google-search.fql
    2. The script works as expected
    3. Run with version 0.10 the script https://github.com/MontFerret/ferret/blob/master/examples/google-search.fql
    4. The script prints the error:
    Failed to execute the query
    cdp.DOM: GetContentQuads: rpc error: Could not compute content quads. (code = -32000): CLICK(google,'input[name="btnK"]') at 8:0
    

    Expected behavior: The script is expected to behave the same as with version 0.9.

    Desktop (please complete the following information):

    • OS: OSX 10.14.5
    • Browser
    /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --version
    Google Chrome 80.0.3987.132
    
    Additional context: Chrome is launched with the command line:

    /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222
    
    type/bug area/drivers/cdp area/drivers status/review-needed 
    opened by gsempe 11
  • What should be the origin when NAVIGATE

    This is a question rather than an issue. I am trying to get the URLs of files inside the FOR loop (or, ideally, to download the PDFs). The flow is: login page -> saved content page -> Refcard page -> download button. I struggle to understand what the origin should be and how many DOCUMENTs I need. Any help would be appreciated.

    // Login works fine.
    LET base_url = DOCUMENT("https://dzone.com/", true)
    LET login_doc = DOCUMENT("https://dzone.com/users/login.html", true)
    LET login_btn = ELEMENT(login_doc, "button[type=submit]")
    INPUT(login_doc, "form[role=form] input[name=j_username]", "[email protected]", 5)
    INPUT(login_doc, "form[role=form] input[name=j_password]", "XXXXXX", 5)
    CLICK(login_btn)
    WAIT_NAVIGATION(login_doc, 25000)
    
    // Loop in Refcardz on the 'Saved' content page of the user to get the links. Also working fine.
    LET origin_doc = DOCUMENT("https://dzone.com/users/3590306/dzone-refcardz.html?sort=saved", true)
    LET origin_url = "https://dzone.com/users/3590306/dzone-refcardz.html?sort=saved"
    NAVIGATE(login_doc, origin_url, 25000)
    WAIT_ELEMENT(origin_doc, 'p[class=comment-title]', 50000)
    LET titles = ELEMENTS(origin_doc, 'div[class="col-md-11 comment-description"] p[class="comment-title"]')
    LET links = (
      FOR el IN titles
        LET refcard_name = ELEMENT(el, "a")
        LET refcard_url = "https://dzone.com" + refcard_name.attributes.href
        RETURN refcard_url
    )
    
    // On each Refcard page, click on the 'Download' button, get the URL and then go back. Does not work.
    FOR link_url IN links
      LET link_origin_doc = DOCUMENT(origin_url, true)
      NAVIGATE(link_origin_doc, link_url, 50000)
      WAIT_ELEMENT(link_origin_doc, 'button[class="btn download btn-lg"]', 5000)
      LET download_btn = ELEMENT(link_origin_doc, 'button[class="btn download btn-lg"]')
      CLICK(download_btn)
      RETURN(link_origin_doc.url)
      NAVIGATE_BACK(link_origin_doc)
    
    type/question 
    opened by luckylittle 11
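
    One pattern worth trying, sketched under the assumption that a single dynamic document can be re-pointed with NAVIGATE instead of creating a new DOCUMENT per URL (links stands in for the array built above; note that the NAVIGATE_BACK placed after RETURN in the original loop is likely never reached):

    LET page = DOCUMENT("https://dzone.com/", { driver: "cdp" })

    FOR link_url IN links
        NAVIGATE(page, link_url, 50000)
        WAIT_ELEMENT(page, 'button[class="btn download btn-lg"]', 5000)
        RETURN page.url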
  • Added fuzzer

    This PR adds a fuzzer that targets Compile().

    I managed to get the fuzzer running on oss-fuzz's platform as well, and I propose setting ferret up in oss-fuzz. This will allow oss-fuzz to run this fuzzer, as well as future fuzzers, continuously. If a bug is discovered, oss-fuzz sends a bug report via email to all maintainers on their contact list. It doesn't cost anything to be a part of oss-fuzz, but the bugs do need to get fixed.

    If you would like to set ferret up on oss-fuzz, I need a primary contact email address and the email addresses of all maintainers to add to the email list. This can be updated at any time.

    Google might ask about the userbase of ferret in order to accept the project, so if you have any information about any companies or other packages using ferret, it would help with getting ferret accepted.

    opened by AdamKorcz 9
  • Bump github.com/stretchr/testify from 1.7.0 to 1.7.5

    Bumps github.com/stretchr/testify from 1.7.0 to 1.7.5.

    dependencies 
    opened by dependabot[bot] 0
  • Bump github.com/rs/zerolog from 1.26.1 to 1.27.0

    Bumps github.com/rs/zerolog from 1.26.1 to 1.27.0.

    Commits
    • e9344a8 docs: add an example for Lshortfile-like implementation of CallerMarshalFunc ...
    • 263b0bd #411 Add FieldsExclude parameter to console writer (#418)
    • 588a61c ctx: Modify WithContext to use a non-pointer receiver (#409)
    • 361cdf6 Remove extra space in console when there is no message (#413)
    • fc26014 MsgFunc function added to Event (#406)
    • 025f9f1 journald: don't call Enabled before each write (#407)
    • 3efdd82 call done function when logger is disabled (#393)
    • c0c2e11 Consistent casing, redundancy, and spelling/grammar (#391)
    • 665519c Fix ConsoleWriter color on Windows (#390)
    • 0c8d3c0 move the lint command to its own package (#389)
    • See full diff in compare view

    dependencies 
    opened by dependabot[bot] 0
  • Bump github.com/mafredri/cdp from 0.32.0 to 0.33.0

    Bumps github.com/mafredri/cdp from 0.32.0 to 0.33.0.

    Release notes

    Sourced from github.com/mafredri/cdp's releases.

    v0.33.0

    • Switch to GitHub Actions (#129) (159a656)
    • Update to latest protocol definitions (#130) (d0c159a)
    • rpcc: Fix potential deadlock in stream Sync (#134) (91594d7)
    • Format project with gofumpt (#135) (3c5eab7)
    • Add example using alternative websocket implementation (69932b8)
    • Update go.mod (9d6eeb0)
    • Update to latest protocol definitions (#136) (b079440)

    https://github.com/mafredri/cdp/compare/v0.32.0...v0.33.0

    dependencies 
    opened by dependabot[bot] 0
  • Bump github.com/corpix/uarand from 0.1.1 to 0.2.0

    Bumps github.com/corpix/uarand from 0.1.1 to 0.2.0.

    dependencies 
    opened by dependabot[bot] 0
  • Installation documentation is incorrect

    Describe the bug: The install docs don't work.

    Running any of the documented install commands does not produce a successful outcome.

    To Reproduce Steps to reproduce the behavior:

    1. Run any of the install commands from the documentation

    Expected behavior: Ferret installs successfully.

    Desktop (please complete the following information):

    • OS: Mac OS 12
    • Version: Latest

    Additional context: This works: go install github.com/MontFerret/cli/ferret@latest

    type/enhancement help wanted 
    opened by subimage 2
Releases: v0.16.6
Owner: MontFerret (Modern web scraping system)
Implementing WEB Scraping with Go

WEB Scraping with Go. In this project I implement a web scraper that creates a CSV file with quotes and authors from the Pensador programming web page…

Jailton Silva 1 Dec 10, 2021
🦙 acao(阿草), the tool man for data scraping of https://asoul.video/.

🦙 acao (阿草), the tool man for data scraping of https://asoul.video/. Deploy to an Aliyun serverless function with Raika…

A-SOUL Video 9 Apr 7, 2022
A neat wrapper around the 4chan API for content scraping.

moonarch, a neat wrapper around the 4chan API for content scraping. How-to: first, get the repository: go get github.com/lazdotdigital/fourscrape…

Laz 0 Nov 27, 2021
Distributed web crawler admin platform for spider management regardless of languages and frameworks.

Crawlab…

Crawlab Team 8.9k Jun 30, 2022
ant (alpha) is a web crawler for Go.

The package includes functions that can scan data from the page into your structs or slices of structs, which allows you to reduce the noise and complexity in your source code.

Amir Abushareb 263 Jun 12, 2022
Fetch web pages using headless Chrome, storing all fetched resources including JavaScript files

Fetch web pages using headless Chrome, storing all fetched resources including JavaScript files. Run arbitrary JavaScript on many web pages and see the returned values

Detectify 448 Jun 24, 2022
Web Scraper in Go, similar to BeautifulSoup

soup is a small web scraper package for Go, with its interface highly similar to that of BeautifulSoup…

Anas Khan 1.8k Jun 20, 2022
Gospider - Fast web spider written in Go

GoSpider, a fast web spider written in Go. Painlessly integrate Gospider into your recon workflow. Enjoying this tool? Support its development…

Jaeles Project 1.5k Jun 22, 2022
Apollo 💎 A Unix-style personal search engine and web crawler for your digital footprint.

Apollo 💎 A Unix-style personal search engine and web crawler for your digital footprint…

Amir Bolous 1.3k Jul 3, 2022
DataHen Till is a standalone tool that instantly makes your existing web scraper scalable, maintainable, and more unblockable, with minimal code changes on your scraper.

DataHenHQ 782 Jun 4, 2022
Fast, highly configurable, cloud native dark web crawler.

Bathyscaphe is a fast, highly configurable, cloud-native dark web crawler written in Go…

Darkspot 77 Jun 21, 2022
Just a web crawler

gh-dependents, a gh command extension to see the dependents of your repository. See The GitHub Blog: GitHub CLI 2.0 includes extensions!…

Hiromu OCHIAI 10 Jun 28, 2022
Golang based web site opengraph data scraper with caching

Snapper, a web microservice for capturing a website's OpenGraph data, built in Golang…

Stephen Schmidt 2 Feb 11, 2022
Crawls web pages and prints any link it can find.

crawley crawls web pages and prints any link it can find. Scan depth (0 by default) can be configured. Features a fast SAX-parser…

Alexei Shevchenko 61 Jun 20, 2022
skweez spiders web pages and extracts words for wordlist generation.

skweez (pronounced like "squeeze") spiders web pages and extracts words for wordlist generation…

Michael Eder 46 Mar 10, 2022
WebWalker - Fast Script To Walk Web for find urls...

WebWalker sends an HTTP request to a URL to collect all the URLs it contains, then requests each of those URLs in turn, and so on. WebWalker can find 10,000 URLs in 10 seconds.

WolvesLeader 1 Nov 28, 2021
Examples for chromedp for web scrapping

About chromedp examples: this folder contains a variety of code examples for working with chromedp. The godoc page contains a number of simple examples…

egnimos 0 Nov 30, 2021
Dumbass-news - A web service to report dumbass news

Dumbass News, a web service to report dumbass news. Copyright (C) 2022 Mike Taylor…

Mike Taylor 0 Jan 18, 2022
A recursive, mirroring web crawler that retrieves child links.

Tony Afula 0 Jan 29, 2022