A quick-and-dirty but useful tool that downloads every text/html page the Wayback Machine holds for a specific domain and searches the saved content for keywords

Overview

wayback-keyword-search

This tool downloads every "text/html" page archived by the Wayback Machine for a specific domain and stores the content of each retrieved page as a .txt file in the local directory.

Run python3 download.py and, when prompted, enter your domain, e.g. nytimes.com (no quotes!).

When the download finishes, a directory named after the domain is created in the local path.
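
As a rough illustration of the download step, the sketch below builds a query for the Wayback Machine's public CDX index and turns one result row into a snapshot URL. It is not taken from download.py: the choice of the CDX endpoint, the query parameters, and the helper names build_cdx_query and snapshot_url are assumptions made for this example.

```python
from urllib.parse import urlencode

# Public CDX index endpoint of the Wayback Machine (assumed to be what the
# tool queries; download.py itself may do this differently).
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(domain: str) -> str:
    """Build a CDX query listing archived text/html captures for a domain."""
    params = {
        "url": domain,
        "matchType": "domain",           # include subdomains of the domain
        "filter": "mimetype:text/html",  # keep only text/html captures
        "output": "json",
        "fl": "timestamp,original",      # fields needed to rebuild snapshot URLs
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def snapshot_url(timestamp: str, original: str) -> str:
    """Return the URL of one archived capture."""
    return f"https://web.archive.org/web/{timestamp}/{original}"
```

Fetching the query URL would return one row per capture; each row can then be passed to snapshot_url to get the page to download.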

You can then search for keyword matches in every file in that directory using "search.py":

Run python3 search.py and, when prompted, enter your keyword (no quotes!).
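
The search step can be pictured with a minimal sketch like the one below. It is an assumption about how such a search might work, not the code in search.py: the function name find_keyword is hypothetical, and the case-insensitive match is a choice made for this example.

```python
from pathlib import Path


def find_keyword(directory: str, keyword: str) -> list[str]:
    """Return the names of .txt files in `directory` that contain `keyword`."""
    needle = keyword.lower()  # case-insensitive match (an assumption)
    matches = []
    for path in sorted(Path(directory).glob("*.txt")):
        # Archived pages may contain odd bytes, so undecodable ones are skipped.
        text = path.read_text(encoding="utf-8", errors="ignore")
        if needle in text.lower():
            matches.append(path.name)
    return matches
```

For example, find_keyword("nytimes.com", "election") would list the saved pages in the nytimes.com directory that mention the keyword.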

BE CAREFUL: BIG DOMAINS MAY REQUIRE A LONG TIME TO DOWNLOAD!

The tool is available in both Python 3 and Go versions. The Go version is currently Linux-only; a Windows version will be added soon.

IF YOU CHOOSE TO RUN THE GO VERSION, YOU SHOULD COMPILE IT FIRST (go build download.go; go build search.go). OTHERWISE, REMEMBER TO RUN IT IN THE TERMINAL FROM THE SCRIPT'S DIRECTORY.


Comments
  • Adding a main() function.

    When I tried to run this, it halted in each Pool worker, waiting for domain input.

    To get around this I've added a main() function that runs only in the entry process.

    I've moved all the code that was not already in functions into this function. This took some variables out of global scope, so where appropriate I've also refactored the functions to receive the variables they previously read from global scope.

    opened by stephenpaulger 2
  • Refactoring

    There are a few changes here that don't change the functionality (except in one case).

    Using context managers

    I've changed the code to use context managers for the multiprocessing pool and for writing the files to disk. This shouldn't affect how the program works, but it removes the need to call the tidying-up functions.

    Ran black

    I've run both Python files through black, just to make it easier to keep a consistent code style.

    Added a WaybackRecord data class and a parsing function

    The getUrls function was doing a bit of record parsing and I wanted to extract this for a few reasons. I think it improves readability and may improve the ability to add some filtering to what is downloaded before it is downloaded in the future. A next step there might be separating the creation of wayback URLs from those records.

    Extracted the filename creation function

    I've added a function called safe_filename to create the filenames that content is downloaded to. Extracting this means it could be unit tested separately in the future, and hopefully it makes download a little easier to read.

    Refactored the retry logic in the download function.

    There is a check that the output filename isn't over 255 characters. I've moved this outside of the retry logic, as the filename doesn't change between tries.

    I've changed the type of loop to a for loop so the number of retries can be limited, just in case it gets stuck in a constant retry situation.

    I also moved the file writing code outside of the retry loop, as I don't think that needs to be in there.

    opened by stephenpaulger 0
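
The refactoring described in the comment above might look roughly like the following sketch. It is not the code from the pull request: the record layout, the character-replacement rule, the injected fetch callable, and every signature here are assumptions for illustration. Only the names WaybackRecord, safe_filename and download, and the 255-character limit, come from the comment.

```python
import os
import re
import time
from dataclasses import dataclass
from typing import Callable, Optional

MAX_FILENAME_LENGTH = 255  # limit mentioned in the comment above


@dataclass
class WaybackRecord:
    timestamp: str
    original: str


def parse_record(line: str) -> WaybackRecord:
    """Parse a 'timestamp original-url' line (layout assumed for this sketch)."""
    timestamp, original = line.split(maxsplit=1)
    return WaybackRecord(timestamp=timestamp, original=original.strip())


def safe_filename(record: WaybackRecord) -> str:
    """Build a .txt name, replacing characters that are awkward in filenames."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", record.original)
    return f"{record.timestamp}_{cleaned}.txt"


def download(
    record: WaybackRecord,
    fetch: Callable[[WaybackRecord], str],  # hypothetical: returns page text
    out_dir: str,
    retries: int = 3,
    delay: float = 0.0,
) -> Optional[str]:
    """Fetch a record with a bounded number of tries and save it to out_dir."""
    # The filename never changes between tries, so it is checked once, up front.
    filename = safe_filename(record)
    if len(filename) > MAX_FILENAME_LENGTH:
        return None

    content = None
    for _ in range(retries):  # a for loop caps the number of attempts
        try:
            content = fetch(record)
            break
        except OSError:
            time.sleep(delay)
    if content is None:
        return None

    # Writing happens outside the retry loop, using a context manager.
    with open(os.path.join(out_dir, filename), "w", encoding="utf-8") as handle:
        handle.write(content)
    return filename
```

Passing the fetch step in as a callable is a design choice for this sketch: it keeps the retry logic testable without network access.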