Declarative web scraping

Overview

Ferret

What is it?

ferret is a web scraping system. It aims to simplify data extraction from the web for UI testing, machine learning, analytics and more.
ferret allows users to focus on the data. It abstracts away the technical details and complexity of underlying technologies using its own declarative language. It is extremely portable, extensible, and fast.

Read the introductory blog post about Ferret here!

Features

  • Declarative language
  • Support of both static and dynamic web pages
  • Embeddable
  • Extensible

Documentation is available at our website.

Issues
  • Website

    Create a website with nice and clean design. It should contain:

    • [ ] Introduction
    • [ ] Quick start
    • [ ] Guideline
    • [ ] API documentation
    • [ ] FAQ (?)
    • [ ] Contact information

    The website repo is here.

    help wanted type/enhancement 
    opened by ziflex 19
  • New to Go, Not working on Ubuntu 18

    I'm new to Go. I tried installing this on Ubuntu 18: first installing Go, and then trying to make ferret... Would it be possible to post a complete beginner's guide with all the command-line steps? It would be much appreciated, thanks!

    opened by alteredorange 17
  • Feature/#10 values

    There is a bug in this PR. In about 1 run out of 10, the "When complex type attributes (object)" test fails. Help is needed to fix it before merging.

    opened by 3timeslazy 17
  • How to get rid of converted characters in URLs

    I am developing a crawler and so far, so very good: thank you for this outstanding crawler.

    The only issue is that, in the returned URLs, there is a & character which gets converted into \u0026, thus: "https://thedomain/alphabet=M\u0026borough=Bronx"

    So I tried to replace it, either by using SUBSTITUTE: RETURN SUBSTITUTE(prfx + letter.attributes.href, "\u0026", "&")

    or REGEX_REPLACE.

    In both cases, the \u0026 string is NOT replaced and remains embedded in the resulting URLs. However, when I try SUBSTITUTE on, say, a -> z, it works fine.

    Is it a limitation of JSON, which I use as an output format? How can I get rid of the converted string as it prevents me from crawling at the lower levels of the website.
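One possible explanation, offered as an assumption rather than a confirmed diagnosis: Go's standard `encoding/json` escapes `&`, `<`, and `>` as `\u0026`, `\u003c`, and `\u003e` by default when it serializes strings. If the output is produced that way, the string inside the query still contains a plain `&`, so there is nothing for SUBSTITUTE to match, and any JSON parser will decode `\u0026` back to `&`. The sketch below demonstrates this behavior with the standard library only; `encodeJSON` and `decodeJSON` are hypothetical helper names, not part of Ferret.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// encodeJSON serializes s as a JSON string literal, with or without
// the HTML-safe escaping that encoding/json applies by default.
func encodeJSON(s string, escapeHTML bool) (string, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(escapeHTML)
	if err := enc.Encode(s); err != nil {
		return "", err
	}
	// Encode appends a trailing newline; trim it for comparison.
	return string(bytes.TrimSpace(buf.Bytes())), nil
}

// decodeJSON parses a JSON string literal back into a Go string.
func decodeJSON(lit string) (string, error) {
	var s string
	err := json.Unmarshal([]byte(lit), &s)
	return s, err
}

func main() {
	const link = "https://thedomain/alphabet=M&borough=Bronx"

	escaped, _ := encodeJSON(link, true)
	plain, _ := encodeJSON(link, false)
	decoded, _ := decodeJSON(escaped)

	fmt.Println(escaped)         // the & appears as \u0026
	fmt.Println(plain)           // the & is written as-is
	fmt.Println(decoded == link) // decoding restores the original URL
}
```

Under that assumption, the escaped form is still valid JSON for the same URL, so a consumer that parses the output (instead of reading it as raw text) receives the `&` unchanged.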

    area/stdlib good first issue type/bug 
    opened by asitemade4u 15
  • Create a Docker image with stripped down version of Chromium

    Chrome is awesome and all, but for scraping tasks it's too heavy. We need to investigate how we can create a custom build with stripped features that are not relevant to web scraping and publish this Docker image.

    Some links:

    • https://github.com/gcarq/inox-patchset
    • https://github.com/Eloston/ungoogled-chromium
    area/drivers/cdp good first issue hacktoberfest help wanted type/enhancement 
    opened by ziflex 13
  • Add object functions

    • [x] KEYS(object, sort) → strArray
    • [x] HAS(object, keyName) → isPresent
    • ~~LENGTH(object) → count~~ (Implemented here)
    • [x] MERGE(object1, object2, ... objectN) → newMergedObject
    • [x] MERGE_RECURSIVE(object1, object2, ... objectN) → newMergedObject
    • [x] VALUES(document, removeInternal) → anyArray
    • [x] ZIP(keys, values) → newObj
    • [x] KEEP(object, key1, key2, ... key) → newObj
    area/stdlib good first issue type/enhancement 
    opened by ziflex 13
  • STYLE_GET seems broken

    Describe the bug

    I'm not sure, because this is my first test with it, but STYLE_GET seems broken.

    To Reproduce

    LET doc = DOCUMENT('https://news.ycombinator.com/', {
        driver: 'cdp',
        viewport: {
            width: 1920,
            height: 1080
        }
    })
    
    
    WAIT_ELEMENT(doc, '.storylink', 5000)
    
    FOR el IN ELEMENTS(doc, '.title')
        RETURN STYLE_GET(el, 'font-family')
    

    will return [{},{},{}...]

    Expected behavior

    font-family: Verdana, Geneva, sans-serif

    Desktop (please complete the following information):

    • Version: 0.11.1
    area/drivers area/drivers/cdp status/review-needed type/bug 
    opened by PierreBrisorgueil 12
  • screenshot ?

    How do I take a screenshot? I do not see any examples.

    area/drivers area/stdlib type/question 
    opened by heiheshang 12
  • Fixed headers

    area/drivers area/drivers/cdp area/drivers/http type/bug 
    opened by ziflex 12
  • Fix arithmetic operators

    The arithmetic operators must accept operands of any type. Passing non-numeric values to an arithmetic operator must cast the operands to numbers:

    • NONE will be converted to 0
    • false will be converted to 0, true will be converted to 1
    • a valid numeric value remains unchanged, but NaN and Infinity will be converted to 0
    • string values are converted to a number if they contain a valid string representation of a number. Any whitespace at the start or the end of the string is ignored. Strings with any other contents are converted to the number 0
    • an empty array is converted to 0, an array with one member is converted to the numeric representation of its sole member. Arrays with more members are converted to the number 0.
    • objects, binary and custom types are converted to the number 0.
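The casting rules listed above can be sketched as a single conversion function. This is a minimal illustration that follows the bullets as written, not Ferret's runtime code: `toNumber` is a hypothetical helper, Go's `nil` stands in for NONE, and `[]interface{}` stands in for an FQL array.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// toNumber casts a loosely typed operand to a float64 following the
// bulleted rules: NONE -> 0, booleans -> 0/1, NaN/Infinity -> 0,
// numeric strings -> their value (surrounding whitespace ignored),
// single-member arrays -> their member, everything else -> 0.
func toNumber(v interface{}) float64 {
	switch x := v.(type) {
	case nil: // NONE
		return 0
	case bool:
		if x {
			return 1
		}
		return 0
	case int:
		return float64(x)
	case float64:
		if math.IsNaN(x) || math.IsInf(x, 0) {
			return 0
		}
		return x
	case string:
		n, err := strconv.ParseFloat(strings.TrimSpace(x), 64)
		if err != nil || math.IsNaN(n) || math.IsInf(n, 0) {
			return 0
		}
		return n
	case []interface{}:
		if len(x) == 1 {
			return toNumber(x[0])
		}
		return 0 // empty arrays and arrays with several members
	default:
		return 0 // objects, binary and custom types
	}
}

func main() {
	fmt.Println(toNumber(" 12 "))
	fmt.Println(toNumber([]interface{}{7}))
	fmt.Println(toNumber(map[string]interface{}{}))
}
```

With a helper like this, each arithmetic operator would cast both operands first and then apply the numeric operation to the results.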

    Here are a few examples:

    Upd:

    1 + "a"                 // "1a"
    1 + "99"                // "199"
    1 + NONE                // 1
    NONE + 1                // 1
    3 + [ ]                 // 3
    24 + [ 2 ]              // 26
    24 + [ 2, 4 ]           // 30
    25 - NONE               // 25
    17 - true               // 16
    23 * { }                // 0
    5 * [ 7 ]               // 35
    5 * [ 7, 2 ]            // 45
    24 / "12"               // 2
    1 / 0                   // panic
    
    area/runtime type/bug 
    opened by ziflex 12
  • Bump github.com/mafredri/cdp from 0.31.0 to 0.32.0

    Bumps github.com/mafredri/cdp from 0.31.0 to 0.32.0.

    Release notes

    Sourced from github.com/mafredri/cdp's releases.

    v0.32.0

    • Update to latest protocol definitions (39a25b4)
    • cmd/cdpgen: Add initialisms (a69e549)
    • cmd/cdpgen: Handle project outside GOPATH (a7ff101)

    v0.31.1

    • devtool: Close http client request body (#126) (fd8f409)
    dependencies 
    opened by dependabot[bot] 1
  • R&D pester behavior

    Currently, "pester" makes 3 requests. We need to find out why.

    area/drivers area/drivers/http area/stdlib good first issue type/bug 
    opened by bundleman 0
  • Tests range

    Added tests for operators/range. Issue: #34

    opened by ConaGo 1
  • Event triggering syntax

    Changes

    This proposal introduces:

    • New keyword
    • New runtime interface
    • New runtime components

    Background

    Since FQL is intended to help users interact with web pages by writing declarative queries, it makes sense to come up with a more consistent, built-in way to express this interaction.

    Some of the most common interactions are:

    • Typing text into a textbox
    • Selecting a value from a dropdown
    • Pressing a button

    At the moment, all these actions have their own functions in the standard library. That means that every time you write new functionality for communicating with a web page (or any external system), you have to provide a function that acts as a bridge between the runtime and your domain objects (like HTMLElement). This makes writing modules and extensions more time-consuming, difficult, and cumbersome. We can do better.

    Proposal

    This proposal offers an optional alternative: a uniform way of sending events/actions to remote systems.

    Instead of doing:

    CLICK(element)
    

    we could do the following:

    DISPATCH click IN element
    

    Instead of doing:

    INPUT(element, "My text")
    

    we could do the following:

    DISPATCH input IN element WITH "My text"
    

    Instead of doing:

    SELECT(element, "my-option")
    

    we could do the following:

    DISPATCH select IN element WITH ["option1", "option2"]
    

    Instead of doing:

    INPUT(page, "#search", "Ferret")
    

    we could do the following:

    DISPATCH input IN page WITH "Ferret" OPTIONS { selector: "#search", delay: 50 }
    

    And so on.

    So, the general syntax would be:

    DISPATCH event IN target [WITH payload] [OPTIONS opts]
    

    All runtime values that can be used with this syntax must implement the following interface:

    
    type Dispatcher interface {
        Dispatch(ctx context.Context, event values.String, payload core.Value, options core.Value) (core.Value, error) 
    }
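To make the contract concrete, here is a minimal, self-contained sketch of a value that satisfies a Dispatcher-shaped interface. It is an illustration only: `Value`, `fakeElement`, and `send` are hypothetical names, the plain `string` event type and `Value` are simplified stand-ins for `values.String` and `core.Value`, and the event names simply mirror the examples in this proposal.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Value is a hypothetical stand-in for core.Value.
type Value interface{}

// Dispatcher mirrors the proposed interface, with simplified types.
type Dispatcher interface {
	Dispatch(ctx context.Context, event string, payload Value, options Value) (Value, error)
}

// fakeElement is a hypothetical in-memory target used only for illustration.
type fakeElement struct {
	text   string
	clicks int
}

// Dispatch handles the "click" and "input" events and rejects anything else.
func (e *fakeElement) Dispatch(ctx context.Context, event string, payload Value, options Value) (Value, error) {
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	switch event {
	case "click":
		e.clicks++
		return nil, nil
	case "input":
		s, ok := payload.(string)
		if !ok {
			return nil, errors.New("input expects a string payload")
		}
		e.text = s
		return s, nil // some events may return a result
	default:
		return nil, fmt.Errorf("unsupported event %q", event)
	}
}

// send is roughly what `DISPATCH event IN target WITH payload` could
// translate to at runtime, without options.
func send(target Dispatcher, event string, payload Value) (Value, error) {
	return target.Dispatch(context.Background(), event, payload, nil)
}

func main() {
	el := &fakeElement{}
	out, err := send(el, "input", "My text")
	fmt.Println(out, err)
}
```

The point of the sketch is that the runtime only needs to know about the interface; each target decides which events it understands and what, if anything, it returns.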
    
    

    I'm not fully sold on the DISPATCH keyword, so it's open for discussion (Could be TRIGGER, CREATE EVENT, SEND, DO, ACTION). Also, the interface name can be changed too. But the idea is this: generic syntax and the runtime contract.

    P.S. Since some of the actions return a value, the dispatcher can return an optional result. Thus, the following syntax is valid too:

    LET values = DISPATCH select IN element WITH ["option1", "option2"]
    
    area/compiler area/runtime area/syntax status/proposal type/enhancement 
    opened by ziflex 0
  • Update e2e tests

    It's been a while since I updated the e2e tests, and some of them are failing (most of those are related to examples).

    Also, we need to add e2e tests that cover headers and cookies for both drivers.

    area/drivers good first issue help wanted type/maintenance 
    opened by ziflex 0
  • Documentation: add examples of queries made using `XPATH`

    Unless I missed something, the documentation doesn't explain how to query document metadata (searching "site:montferret.dev metadata" through Google returned nothing, neither did grepping the source code).

    As an example, I tried to query the og:url metadata. I tried variations of //meta[property='og:url']::attr(content), with or without the leading //, and with or without the attr(content), but was unsuccessful.

    Cheers

    good first issue help wanted type/documentation 
    opened by ngirard 2
  • What is the right way to send HttpHeader in FQL ?

    Hi,

    I am trying to send Proxy-Authorization header as below.

    LET proxy_header = {"Proxy-Authorization": ["Basic e40b7d5eff464a4fb51efed2d1a19a24"]}
    
    LET doc = DOCUMENT("https://example.com", { driver: "cdp", headers: proxy_header})
    

    But Ferret is crashing with the following error. What could be the issue?

    fatal error: fault
    [signal 0xc0000005 code=0x0 addr=0x47ed50 pc=0x8412aa]
    
    goroutine 1 [running]:
    runtime.throw(0xf230c1, 0x5)
            /opt/hostedtoolcache/go/1.15.5/x64/src/runtime/panic.go:1116 +0x79 fp=0xc000214e78 sp=0xc000214e48 pc=0x868e19
    runtime.sigpanic()
            /opt/hostedtoolcache/go/1.15.5/x64/src/runtime/signal_windows.go:249 +0x24f fp=0xc000214ea8 sp=0xc000214e78 pc=0x87c14f
    runtime.evacuated(...)
            /opt/hostedtoolcache/go/1.15.5/x64/src/runtime/map.go:203
    runtime.mapiternext(0xc000214fc8)
            /opt/hostedtoolcache/go/1.15.5/x64/src/runtime/map.go:876 +0x4aa fp=0xc000214f28 sp=0xc000214ea8 pc=0x8412aa
    runtime.mapiterinit(0xec87a0, 0xc000215318, 0xc000214fc8)
            /opt/hostedtoolcache/go/1.15.5/x64/src/runtime/map.go:843 +0x1da fp=0xc000214f48 sp=0xc000214f28 pc=0x840d1a
    github.com/MontFerret/ferret/pkg/drivers.HTTPHeaders.MarshalJSON(0xc000215318, 0xec87a0, 0xc000215318, 0x209b6cbc090, 0xc000215318, 0x30002)
            /home/runner/work/ferret/ferret/pkg/drivers/headers.go:111 +0x9d fp=0xc000215038 sp=0xc000214f48 pc=0xa7aa7d
    github.com/wI2L/jettison.encodeJSONMarshaler(0xec87a0, 0xc000215318, 0xc000749000, 0x0, 0x1000, 0xff9220, 0xc0000a40b8, 0xf3d23d, 0x23, 0x5, ...)
            /home/runner/go/pkg/mod/github.com/w!i2!l/[email protected]/encode.go:692 +0x79 fp=0xc0002150f0 sp=0xc000215038 pc=0xa489f9
    github.com/wI2L/jettison.encodeMarshaler(0xc000215318, 0xc000749000, 0x0, 0x1000, 0xff9220, 0xc0000a40b8, 0xf3d23d, 0x23, 0x5, 0x80, ...)
            /home/runner/go/pkg/mod/github.com/w!i2!l/[email protected]/encode.go:668 +0x182 fp=0xc0002151c8 sp=0xc0002150f0 pc=0xa48262
    github.com/wI2L/jettison.newJSONMarshalerInstr.func1(0xc000215318, 0xc000749000, 0x0, 0x1000, 0xff9220, 0xc0000a40b8, 0xf3d23d, 0x23, 0x5, 0x80, ...)
            /home/runner/go/pkg/mod/github.com/w!i2!l/[email protected]/instruction.go:241 +0xc8 fp=0xc000215280 sp=0xc0002151c8 pc=0xa53468
    github.com/wI2L/jettison.wrapInlineInstr.func1(0xc000936ab0, 0xc000749003, 0x0, 0x1000, 0xff9220, 0xc0000a40b8, 0xf3d23d, 0x23, 0x5, 0x80, ...)
            /home/runner/go/pkg/mod/github.com/w!i2!l/[email protected]/instruction.go:406 +0xa6 fp=0xc000215318 sp=0xc000215280 pc=0xa53e06
    github.com/wI2L/jettison.marshalJSON(0xec87a0, 0xc000936ab0, 0xff9220, 0xc0000a40b8, 0xf3d23d, 0x23, 0x5, 0x80, 0x0, 0x0, ...)
            /home/runner/go/pkg/mod/github.com/w!i2!l/[email protected]/json.go:167 +0x103 fp=0xc0002153f8 sp=0xc000215318 pc=0xa4dac3
    github.com/wI2L/jettison.MarshalOpts(0xec87a0, 0xc000936ab0, 0xc000215608, 0x1, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0)
            /home/runner/go/pkg/mod/github.com/w!i2!l/[email protected]/json.go:142 +0x20a fp=0xc000215598 sp=0xc0002153f8 pc=0xa4d78a
    github.com/MontFerret/ferret/pkg/drivers/cdp/network.(*Manager).SetHeaders(0xc00003e1c0, 0xff9260, 0xc0005b5740, 0xc000936ab0, 0x0, 0x0)
            /home/runner/work/ferret/ferret/pkg/drivers/cdp/network/manager.go:185 +0x10a fp=0xc000215630 sp=0xc000215598 pc=0xcc56ea
    github.com/MontFerret/ferret/pkg/drivers/cdp.LoadHTMLPage(0xff9260, 0xc0005b5740, 0xc0000a03c0, 0xc00002e231, 0x3f, 0x0, 0x0, 0x0, 0x0, 0xc000936ab0, ...)
            /home/runner/work/ferret/ferret/pkg/drivers/cdp/page.go:83 +0x2fe fp=0xc000215768 sp=0xc000215630 pc=0xcc943e
    github.com/MontFerret/ferret/pkg/drivers/cdp.(*Driver).Open(0xc0002c9e80, 0xff9260, 0xc0005b5740, 0xc00002e231, 0x3f, 0x0, 0x0, 0x0, 0x0, 0xc000936ab0, ...)
            /home/runner/work/ferret/ferret/pkg/drivers/cdp/driver.go:65 +0x2d8 fp=0xc000215860 sp=0xc000215768 pc=0xcc75b8
    github.com/MontFerret/ferret/pkg/stdlib/html.Open(0xff9260, 0xc0005b5740, 0xc00092cd00, 0x2, 0x2, 0x0, 0x0, 0x0, 0x0)
            /home/runner/work/ferret/ferret/pkg/stdlib/html/document.go:75 +0x399 fp=0xc000215a20 sp=0xc000215860 pc=0xcd2099
    github.com/MontFerret/ferret/pkg/runtime/expressions.(*FunctionCallExpression).Exec(0xc0002c9800, 0xff92a0, 0xc0009369c0, 0xc00092cce0, 0xc00069a1e0, 0x0, 0x0, 0x0)
            /home/runner/work/ferret/ferret/pkg/runtime/expressions/func_call.go:58 +0x2a2 fp=0xc000215aa0 sp=0xc000215a20 pc=0xd11262
    github.com/MontFerret/ferret/pkg/runtime/expressions.(*VariableDeclarationExpression).Exec(0xc00092c600, 0xff92a0, 0xc0009369c0, 0xc00092cce0, 0x1000280, 0x14d4220, 0x0, 0x0)
            /home/runner/work/ferret/ferret/pkg/runtime/expressions/variable.go:48 +0x5d fp=0xc000215af0 sp=0xc000215aa0 pc=0xd11f5d
    github.com/MontFerret/ferret/pkg/runtime/expressions.(*BodyExpression).Exec(0xc000936030, 0xff92a0, 0xc0009369c0, 0xc00092cce0, 0xc0005b56e0, 0x0, 0x94e439, 0x14a1d20)
            /home/runner/work/ferret/ferret/pkg/runtime/expressions/body.go:41 +0xb6 fp=0xc000215b58 sp=0xc000215af0 pc=0xd0ee36
    github.com/MontFerret/ferret/pkg/runtime.(*Program).Run(0xc0009367e0, 0xff91e0, 0xc0002c9f00, 0xc000431d18, 0x3, 0x3, 0x0, 0x0, 0x0, 0x0, ...)
            /home/runner/work/ferret/ferret/pkg/runtime/program.go:95 +0x202 fp=0xc000215c08 sp=0xc000215b58 pc=0xa61c62
    github.com/MontFerret/ferret/cli.Exec(0xc000216000, 0x4cc, 0x0, 0x0, 0xc000155800, 0x0, 0x0, 0x0, 0x0, 0x0)
            /home/runner/work/ferret/ferret/cli/exec.go:62 +0x451 fp=0xc000215d60 sp=0xc000215c08 pc=0xd80891
    github.com/MontFerret/ferret/cli.ExecFile(0xc0000a4080, 0xa, 0x0, 0x0, 0xc000155800, 0x0, 0x0, 0x0, 0x0, 0x0)
            /home/runner/work/ferret/ferret/cli/exec.go:24 +0x128 fp=0xc000215dd0 sp=0xc000215d60 pc=0xd80428
    main.main()
            /home/runner/work/ferret/ferret/main.go:191 +0x332 fp=0xc000215f88 sp=0xc000215dd0 pc=0xd88b72
    runtime.main()
            /opt/hostedtoolcache/go/1.15.5/x64/src/runtime/proc.go:204 +0x209 fp=0xc000215fe0 sp=0xc000215f88 pc=0x86b5c9
    runtime.goexit()
            /opt/hostedtoolcache/go/1.15.5/x64/src/runtime/asm_amd64.s:1374 +0x1 fp=0xc000215fe8 sp=0xc000215fe0 pc=0x89b181
    
    goroutine 6 [syscall]:
    os/signal.signal_recv(0x0)
            /opt/hostedtoolcache/go/1.15.5/x64/src/runtime/sigqueue.go:147 +0xa9
    os/signal.loop()
            /opt/hostedtoolcache/go/1.15.5/x64/src/os/signal/signal_unix.go:23 +0x29
    created by os/signal.Notify.func1.1
            /opt/hostedtoolcache/go/1.15.5/x64/src/os/signal/signal.go:150 +0x4b
    
    goroutine 7 [chan receive]:
    github.com/MontFerret/ferret/cli.Exec.func1(0xc0005b5680, 0xc00005e810, 0xc0005b5560)
            /home/runner/work/ferret/ferret/cli/exec.go:49 +0x3b
    created by github.com/MontFerret/ferret/cli.Exec
            /home/runner/work/ferret/ferret/cli/exec.go:47 +0x35d
    
    goroutine 10 [select]:
    net/http.(*persistConn).readLoop(0xc0004370e0)
            /opt/hostedtoolcache/go/1.15.5/x64/src/net/http/transport.go:2161 +0x9cc
    created by net/http.(*Transport).dialConn
            /opt/hostedtoolcache/go/1.15.5/x64/src/net/http/transport.go:1708 +0xcd7
    
    goroutine 11 [select]:
    net/http.(*persistConn).writeLoop(0xc0004370e0)
            /opt/hostedtoolcache/go/1.15.5/x64/src/net/http/transport.go:2340 +0x134
    created by net/http.(*Transport).dialConn
            /opt/hostedtoolcache/go/1.15.5/x64/src/net/http/transport.go:1709 +0xcfc
    
    goroutine 38 [IO wait]:
    internal/poll.runtime_pollWait(0x209b6d86440, 0x72, 0xff0fa0)
            /opt/hostedtoolcache/go/1.15.5/x64/src/runtime/netpoll.go:222 +0x65
    internal/poll.(*pollDesc).wait(0xc00019e1b8, 0x72, 0x13ab300, 0x0, 0x0)
            /opt/hostedtoolcache/go/1.15.5/x64/src/internal/poll/fd_poll_runtime.go:87 +0x4c
    internal/poll.execIO(0xc00019e018, 0xf82dc0, 0x908001, 0x909acd, 0x60473a5b)
            /opt/hostedtoolcache/go/1.15.5/x64/src/internal/poll/fd_windows.go:175 +0x105
    internal/poll.(*FD).Read(0xc00019e000, 0xc0004d2000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
            /opt/hostedtoolcache/go/1.15.5/x64/src/internal/poll/fd_windows.go:441 +0x2ff
    net.(*netFD).Read(0xc00019e000, 0xc0004d2000, 0x1000, 0x1000, 0x86ba65, 0xc0004dbb58, 0x893080)
            /opt/hostedtoolcache/go/1.15.5/x64/src/net/fd_posix.go:55 +0x56
    net.(*conn).Read(0xc000006058, 0xc0004d2000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
            /opt/hostedtoolcache/go/1.15.5/x64/src/net/net.go:182 +0x95
    net/http.(*persistConn).Read(0xc000478000, 0xc0004d2000, 0x1000, 0x1000, 0xc0004cc060, 0xc0004dbc58, 0x835b95)
            /opt/hostedtoolcache/go/1.15.5/x64/src/net/http/transport.go:1887 +0x7c
    bufio.(*Reader).fill(0xc0004d0000)
            /opt/hostedtoolcache/go/1.15.5/x64/src/bufio/bufio.go:101 +0x10a
    bufio.(*Reader).Peek(0xc0004d0000, 0x1, 0x0, 0x0, 0x1, 0x0, 0xc000944300)
            /opt/hostedtoolcache/go/1.15.5/x64/src/bufio/bufio.go:139 +0x56
    net/http.(*persistConn).readLoop(0xc000478000)
            /opt/hostedtoolcache/go/1.15.5/x64/src/net/http/transport.go:2040 +0x1ba
    created by net/http.(*Transport).dialConn
            /opt/hostedtoolcache/go/1.15.5/x64/src/net/http/transport.go:1708 +0xcd7
    
    goroutine 39 [select]:
    net/http.(*persistConn).writeLoop(0xc000478000)
            /opt/hostedtoolcache/go/1.15.5/x64/src/net/http/transport.go:2340 +0x134
    created by net/http.(*Transport).dialConn
            /opt/hostedtoolcache/go/1.15.5/x64/src/net/http/transport.go:1709 +0xcfc
    
    goroutine 23 [IO wait]:
    internal/poll.runtime_pollWait(0x209b6d86358, 0x72, 0xff0fa0)
            /opt/hostedtoolcache/go/1.15.5/x64/src/runtime/netpoll.go:222 +0x65
    internal/poll.(*pollDesc).wait(0xc00019e438, 0x72, 0x13ab300, 0x0, 0x0)
            /opt/hostedtoolcache/go/1.15.5/x64/src/internal/poll/fd_poll_runtime.go:87 +0x4c
    internal/poll.execIO(0xc00019e298, 0xf82dc0, 0xf04201, 0x198, 0xebd280)
            /opt/hostedtoolcache/go/1.15.5/x64/src/internal/poll/fd_windows.go:175 +0x105
    internal/poll.(*FD).Read(0xc00019e280, 0xc00078a000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
            /opt/hostedtoolcache/go/1.15.5/x64/src/internal/poll/fd_windows.go:441 +0x2ff
    net.(*netFD).Read(0xc00019e280, 0xc00078a000, 0x1000, 0x1000, 0xc00056fb10, 0x92bf96, 0xc0005b8868)
            /opt/hostedtoolcache/go/1.15.5/x64/src/net/fd_posix.go:55 +0x56
    net.(*conn).Read(0xc000006048, 0xc00078a000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
            /opt/hostedtoolcache/go/1.15.5/x64/src/net/net.go:182 +0x95
    bufio.(*Reader).fill(0xc0004d0060)
            /opt/hostedtoolcache/go/1.15.5/x64/src/bufio/bufio.go:101 +0x10a
    bufio.(*Reader).Peek(0xc0004d0060, 0x2, 0xc0005cd005, 0x4, 0x1000, 0xc00056fc68, 0x83a685)
            /opt/hostedtoolcache/go/1.15.5/x64/src/bufio/bufio.go:139 +0x56
    github.com/gorilla/websocket.(*Conn).read(0xc0005b8580, 0x2, 0x0, 0x0, 0xee6440, 0x4, 0xfef6a0)
            /home/runner/go/pkg/mod/github.com/gorilla/[email protected]/conn.go:370 +0x47
    github.com/gorilla/websocket.(*Conn).advanceFrame(0xc0005b8580, 0xfef920, 0xc000088080, 0x0)
            /home/runner/go/pkg/mod/github.com/gorilla/[email protected]/conn.go:798 +0x6b
    github.com/gorilla/websocket.(*Conn).NextReader(0xc0005b8580, 0xc000088030, 0xc000088030, 0x201, 0x0, 0xfef920)
            /home/runner/go/pkg/mod/github.com/gorilla/[email protected]/conn.go:980 +0x9b
    github.com/mafredri/cdp/rpcc.(*wsReadWriteCloser).Read(0xc000558280, 0xc000958000, 0x200, 0x200, 0xc00019c090, 0x83, 0x90)
            /home/runner/go/pkg/mod/github.com/mafredri/[email protected]/rpcc/socket.go:37 +0x4f
    encoding/json.(*Decoder).refill(0xc0005b8840, 0x0, 0x0)
            /opt/hostedtoolcache/go/1.15.5/x64/src/encoding/json/stream.go:165 +0xf8
    encoding/json.(*Decoder).readValue(0xc0005b8840, 0x0, 0x0, 0x90)
            /opt/hostedtoolcache/go/1.15.5/x64/src/encoding/json/stream.go:140 +0x1ff
    encoding/json.(*Decoder).Decode(0xc0005b8840, 0xe3bee0, 0xc000930000, 0x83, 0x90)
            /opt/hostedtoolcache/go/1.15.5/x64/src/encoding/json/stream.go:63 +0x93
    github.com/mafredri/cdp/rpcc.(*jsonCodec).ReadResponse(0xc00062c1f0, 0xc000930000, 0xc00019c090, 0x83)
            /home/runner/go/pkg/mod/github.com/mafredri/[email protected]/rpcc/conn.go:245 +0x4c
    github.com/mafredri/cdp/rpcc.(*Conn).recv(0xc0000a0280, 0xc00062c210, 0xc00062c200)
            /home/runner/go/pkg/mod/github.com/mafredri/[email protected]/rpcc/conn.go:333 +0xde
    created by github.com/mafredri/cdp/rpcc.DialContext
            /home/runner/go/pkg/mod/github.com/mafredri/[email protected]/rpcc/conn.go:197 +0x37b
    
    goroutine 24 [select]:
    github.com/mafredri/cdp/rpcc.(*streamClient).watch(0xc000334180)
            /home/runner/go/pkg/mod/github.com/mafredri/[email protected]/rpcc/stream.go:171 +0x115
    created by github.com/mafredri/cdp/rpcc.newStreamClient
            /home/runner/go/pkg/mod/github.com/mafredri/[email protected]/rpcc/stream.go:142 +0x1bf
    
    goroutine 25 [select]:
    github.com/mafredri/cdp/rpcc.(*streamClient).watch(0xc000334200)
            /home/runner/go/pkg/mod/github.com/mafredri/[email protected]/rpcc/stream.go:171 +0x115
    created by github.com/mafredri/cdp/rpcc.newStreamClient
            /home/runner/go/pkg/mod/github.com/mafredri/[email protected]/rpcc/stream.go:142 +0x1bf
    
    goroutine 26 [select]:
    github.com/mafredri/cdp/session.(*Manager).watch(0xc00024a440, 0xc0005582a0, 0xc0005542a0, 0xc0004d01e0, 0xc0004d00c0)
            /home/runner/go/pkg/mod/github.com/mafredri/[email protected]/session/manager.go:103 +0x265
    created by github.com/mafredri/cdp/session.NewManager
            /home/runner/go/pkg/mod/github.com/mafredri/[email protected]/session/manager.go:192 +0x1a5
    
    goroutine 30 [select]:
    github.com/mafredri/cdp/session.(*session).ReadResponse(0xc00024a2c0, 0xc000930050, 0x7, 0xc000720340)
            /home/runner/go/pkg/mod/github.com/mafredri/[email protected]/session/session.go:43 +0x10a
    github.com/mafredri/cdp/rpcc.(*Conn).recv(0xc0000a03c0, 0xc00005e220, 0xc00005e210)
            /home/runner/go/pkg/mod/github.com/mafredri/[email protected]/rpcc/conn.go:333 +0xde
    created by github.com/mafredri/cdp/rpcc.DialContext
            /home/runner/go/pkg/mod/github.com/mafredri/[email protected]/rpcc/conn.go:197 +0x37b
    
    goroutine 76 [select]:
    github.com/mafredri/cdp/rpcc.(*streamClient).watch(0xc000334300)
            /home/runner/go/pkg/mod/github.com/mafredri/[email protected]/rpcc/stream.go:171 +0x115
    created by github.com/mafredri/cdp/rpcc.newStreamClient
            /home/runner/go/pkg/mod/github.com/mafredri/[email protected]/rpcc/stream.go:142 +0x1bf
    
    goroutine 75 [select]:
    github.com/mafredri/cdp/rpcc.(*streamClient).watch(0xc000334100)
            /home/runner/go/pkg/mod/github.com/mafredri/[email protected]/rpcc/stream.go:171 +0x115
    created by github.com/mafredri/cdp/rpcc.newStreamClient
            /home/runner/go/pkg/mod/github.com/mafredri/[email protected]/rpcc/stream.go:142 +0x1bf
    
    goroutine 77 [runnable]:
    github.com/MontFerret/ferret/pkg/drivers/cdp/events.(*SourceCollection).Get(0xc0004c8210, 0x1, 0x3, 0x2, 0x0, 0x0)
            /home/runner/work/ferret/ferret/pkg/drivers/cdp/events/sources.go:54 +0x145
    github.com/MontFerret/ferret/pkg/drivers/cdp/events.(*Loop).run(0xc0004c8150, 0xff91e0, 0xc00003e180)
            /home/runner/work/ferret/ferret/pkg/drivers/cdp/events/loop.go:87 +0x2f2
    created by github.com/MontFerret/ferret/pkg/drivers/cdp/events.(*Loop).Run
            /home/runner/work/ferret/ferret/pkg/drivers/cdp/events/loop.go:24 +0x5a 
    
    area/runtime type/bug 
    opened by aravindajju 3
  • visualize For Declarative language

    FQL (Ferret Query Language) is a powerful way to describe web scraping processes.

    To reduce the learning cost, would it be possible to provide a web UI that records interactions with a page and generates FQL from them?

    By the way, thanks to everyone who built this great project!

    help wanted status/available type/enhancement type/question 
    opened by hyyc554 1
  • Reuse a browser tab between multiple scripts (stateful execution)

    Add the ability to reuse a browser tab between multiple scripts so that they can all work with an existing session.

    Possible solution: allow a driver to create a new tab and return a handle that the user can pass in to a program.

    area/drivers area/drivers/cdp type/enhancement 
    opened by ziflex 0
  • Is there a way to always get absolute URLs?

    I wanted to know if there's a way to make Ferret always return absolute URLs when they are relative in the source code, like web browsers do.

    I'm crawling a site by getting a bunch of href attribute values from different anchors into an array and then iterating that array to load and return the content I need from each of the URLs.

    The problem is that some of the URLs are absolute (https://example.com/whatever) and others are relative (/whichever), so when I try to get a DOCUMENT from one of the relative URLs, I get the following error:

    Failed to execute the query
    failed to retrieve a document /whichever: Get /whichever: unsupported protocol scheme "": DOCUMENT(url) at 11:16: FORurlinurlsLETpropDoc=DOCUMENT(url)RETURN{...} at 10:1
    

    I'd ideally want to run the entire process in a single FQL script, but I couldn't find a way to convert the relative URLs or make them work, so it seems my only option is to first return them to a Go program to be fixed and then run an additional data-gathering query on each of them.
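If the URLs do end up being fixed on the Go side, the standard library can resolve each `href` against the URL of the page it was found on, which is what browsers do. The sketch below is a minimal example under that assumption; `absolutize` is a hypothetical helper name and the example.com addresses are placeholders.

```go
package main

import (
	"fmt"
	"net/url"
)

// absolutize resolves href against the URL of the page it came from.
// Absolute hrefs are returned unchanged; relative ones inherit the
// scheme and host (and, where relevant, the path) of the base.
func absolutize(base, href string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", fmt.Errorf("parse base: %w", err)
	}
	r, err := url.Parse(href)
	if err != nil {
		return "", fmt.Errorf("parse href: %w", err)
	}
	return b.ResolveReference(r).String(), nil
}

func main() {
	page := "https://example.com/list?page=2"
	for _, href := range []string{"/whichever", "https://example.com/whatever"} {
		abs, err := absolutize(page, href)
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Println(abs)
	}
}
```

The resolved URLs can then be passed back into a second query as parameters, at the cost of the two-step process described above.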

    status/review-needed type/enhancement type/question 
    opened by gonssal 4
Releases (v0.15.0)
Owner: MontFerret (Modern web scraping system)
Elegant Scraper and Crawler Framework for Golang

Colly Lightning Fast and Elegant Scraping Framework for Gophers Colly provides a clean interface to write any kind of crawler/scraper/spider. With Col

Colly 14.4k Jul 24, 2021
A little like that j-thing, only in Go.

goquery - a little like that j-thing, only in Go goquery brings a syntax and a set of features similar to jQuery to the Go language. It is based on Go

null 10.4k Jul 19, 2021
ant (alpha) is a web crawler for Go.

The package includes functions that can scan data from the page into your structs or slice of structs, this allows you to reduce the noise and complexity in your source-code.

Amir Abushareb 246 Jul 20, 2021
Collyzar - A distributed redis-based framework for colly.

Collyzar A distributed redis-based framework for colly. Collyzar provides a very simple configuration and tools to implement distributed crawling/scra

Zarten 209 Jul 16, 2021
Getting new films from baskino.me

Filmparser Description Every time I want to watch a film, I have to go to baskino.me to check if it is available on torrent trackers to download it th

Artyom 4 Jul 19, 2021
Gospider - Fast web spider written in Go

GoSpider GoSpider - Fast web spider written in Go Painless integrate Gospider into your recon workflow? Enjoying this tool? Support it's development a

Jaeles Project 925 Jul 22, 2021
Distributed web crawler admin platform for spiders management regardless of languages and frameworks.

Crawlab Chinese | English Installation | Run | Screenshot | Architecture | Integration | Compare | Community & Sponsorship | CHANGELOG | Disclaimer Golang-

Crawlab Team 8k Jul 26, 2021
Web Scraper in Go, similar to BeautifulSoup

soup Web Scraper in Go, similar to BeautifulSoup soup is a small web scraper package for Go, with its interface highly similar to that of BeautifulSou

Anas Khan 1.6k Jul 20, 2021
Pholcus is a distributed high-concurrency crawler software written in pure golang

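Pholcus Pholcus (Ghost Spider) is a distributed, high-concurrency crawler written in pure Go, intended only for programming study and research. It supports three run modes (standalone, server, and client) and three interfaces (Web, GUI, and command line); its rules are simple and flexible, batch tasks run concurrently, and output options are rich (mysql/mongodb/kafka/csv/excel, etc.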

henrylee2cn 6.9k Jul 24, 2021
Fetch web pages using headless Chrome, storing all fetched resources including JavaScript files

Fetch web pages using headless Chrome, storing all fetched resources including JavaScript files. Run arbitrary JavaScript on many web pages and see the returned values

Detectify 358 Jul 23, 2021
:paw_prints: Creeper - The Next Generation Crawler Framework (Go)

About Creeper is a next-generation crawler which fetches web page by creeper script. As a cross-platform embedded crawler, you can use it for your new

Plutonist 763 May 7, 2021