
Overview

skweez

skweez (pronounced like "squeeze") spiders web pages and extracts words for wordlist generation.

It is basically an attempt to build a more operator-friendly version of CeWL. It is written in Go, making it far more portable and performant than CeWL's Ruby implementation.

Build / Install

Binary Releases

You can use the binary releases from the releases section for your operating system.

Build from source

Assuming you have Go 1.16+ installed and working, just clone the repo and run go build, or use go get github.com/edermi/skweez.
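For example, building from a local clone:

git clone https://github.com/edermi/skweez.git
cd skweez
go build

The resulting skweez binary is placed in the current directory.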

Usage

./skweez -h
skweez is a fast and easy to use tool that allows you to (recursively)
crawl websites to generate word lists.

Usage:
  skweez domain1 domain2 domain3 [flags]

Flags:
      --debug                 Enable Debug output
  -d, --depth int             Depth to spider. 0 = unlimited, 1 = Only provided site, 2... = specific depth (default 2)
  -h, --help                  help for skweez
      --json                  Write words + counts in a json file. Requires --output/-o
  -n, --max-word-length int   Maximum word length (default 24)
  -m, --min-word-length int   Minimum word length (default 3)
      --no-filter             Do not filter out strings that don't match the regex to check if it looks like a valid word (starts and ends with alphanumeric letter, anything else in between). Also ignores --min-word-length and --max-word-length
  -o, --output string         When set, write an output file to <output>.txt (<output>.json when --json is specified). Empty writes no output to disk
      --scope strings         Additional site scope, for example subdomains. If not set, only the provided site's domains are in scope

skweez takes an arbitrary number of links and crawls them, extracting the words. skweez will only crawl sites under a link's domain, so if you submit www.somesite.com, it will not visit, for example, blog.somesite.com, even if links to it are present. You may provide a list of additionally allowed domains for crawling via --scope, as shown below.
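For example, to also allow the blog subdomain from the (hypothetical) scenario above:

./skweez https://www.somesite.com --scope blog.somesite.com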

./skweez https://en.wikipedia.org/wiki/Sokushinbutsu -d 1
19:07:44 Finished https://en.wikipedia.org/wiki/Sokushinbutsu
There
learned
Edit
Alternate
mantra
Sokushinbutsu
36–37
Pseudoapoptosis
Necrophilia
information
mummification
many
identifiers
reducing
range
threat
popular
Honmyō-ji
Republic
Dignified
Recent
Himalayan
Burning
cause
Last
Español
honey
Siberia
That
Megami
Karyorrhexis
have
practically
1962
Forensic
Magyar
[...]

skweez is pretty fast, crawling several pages per second. With default settings (depth=2), the example Wikipedia article above takes skweez 38 seconds to crawl over 360 Wikipedia pages, generating a dictionary of more than 109,000 unique words.

skweez allows you to write the results to a file; if you choose JSON, you will also get the count for each word. I recommend jq for working with JSON.
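For instance, assuming the JSON output is a flat object mapping each word to its count (the output name words is hypothetical), jq can print the words sorted by frequency:

./skweez https://www.somesite.com -o words --json
jq -r 'to_entries | sort_by(-.value) | .[].key' words.json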

In order to improve result quality, skweez has a built-in regex to filter out strings that do not look like words; --no-filter disables this behavior. skweez only selects words between 3 and 24 characters long - you can override these bounds with --min-word-length and --max-word-length.
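The exact expression lives in the skweez source; a minimal Go sketch of the described check (starts and ends with an alphanumeric character, anything in between, plus the length bounds) could look like this:

package main

import (
	"fmt"
	"regexp"
)

// Illustrative approximation of skweez's word filter, not the actual
// implementation: a candidate must start and end with an alphanumeric
// character, with anything allowed in between.
var wordPattern = regexp.MustCompile(`^[a-zA-Z0-9].*[a-zA-Z0-9]$`)

// looksLikeWord also applies the configurable length bounds
// (defaults: min 3, max 24).
func looksLikeWord(s string, minLen, maxLen int) bool {
	return len(s) >= minLen && len(s) <= maxLen && wordPattern.MatchString(s)
}

func main() {
	fmt.Println(looksLikeWord("Sokushinbutsu", 3, 24)) // true
	fmt.Println(looksLikeWord("--", 3, 24))            // false: too short, bad edges
}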

Bugs, Feature requests

Just file a new issue or, even better, submit a PR and I will have a look.

Future improvements

These are just ideas; I have no plans to implement them right now since I usually don't need them.

  • Features CeWL provides (E-Mail filtering, proxy auth, custom headers, custom user agent)
  • Better performance
  • More control over what's getting scraped

License

GPL, see LICENSE file.
