Apollo 💎 A Unix-style personal search engine and web crawler for your digital footprint.

Overview

Demo

apollodemo.mp4

Contents

Background
Thesis
Design
Architecture
Data Schema
Workflows
Document Storage
Shut up, how can I use it?
Notes
Future
Inspirations

Background

Apollo is a different type of search engine. Traditional search engines (like Google) are great for discovery when you're trying to find the answer to a question, but you don't know what you're looking for.

However, they're very poor at recall and synthesis when you've seen something before on the internet somewhere but can't remember where. Trying to find it becomes a nightmare - how can you synthesize the great material on the internet when you forgot where it even was? I've wasted many an hour combing through Google and my search history to look up a good article, blog post, or just something I've seen before.

Even with built in systems to store some of my favorite articles, podcasts, and other stuff, I forget things all the time.

Thesis

Screw finding a needle in the haystack. Let's create a new type of search to choose which gem you're looking for.

Apollo is a search engine and web crawler to digest your digital footprint. What this means is that you choose what to put in it. When you come across something that looks interesting, be it an article, blog post, website, whatever, you manually add it (with built-in systems to make doing so easy). If you always want to pull in data from a certain data source, like your notes or something else, you can do that too. This tackles one of the biggest problems with recall in search engines: they return a lot of irrelevant information. With Apollo, the signal-to-noise ratio is very high because you've chosen exactly what to put in it.

Apollo is not necessarily built for raw discovery (although it certainly supports rediscovery); it's built for knowledge compression and transformation - that is, looking up things that you've previously deemed to be cool.

Design

The first thing you might notice is that the design is reminiscent of the old digital computer age, back in the Unix days. This is intentional for many reasons. In addition to paying homage to the greats of the past, this design makes me feel like I'm searching through something that is authentically my own. When I search for stuff, I genuinely feel like I'm travelling through the past.

Architecture

Apollo's client side is written in Poseidon. The client side interacts with the backend via a REST-like API which provides endpoints for searching data and adding a new entry.

The backend is written in Go and is composed of a few important components:

  1. The web server which serves the endpoints
  2. A tokenizer and stemmer used during search queries and when building the inverted index on the data
  3. A simple web crawler for scraping links to articles, blog posts, and YouTube videos
  4. The actual search engine, which takes a query, tokenizes and stems it, finds the relevant results from the inverted index using those stemmed tokens, then ranks the results with TF-IDF
  5. A package which pulls in data from a couple of different sources - if you want to pull data from a custom data source, this is where you should add it.
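
To make the search component more concrete, here is a minimal sketch of an inverted index with TF-IDF ranking. It is not Apollo's actual code: the names `tokenize`, `Index`, `BuildIndex`, and `Search` are made up for illustration, stemming is left out, and the IDF formula `log(1 + N/df)` is just one common variant.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
	"unicode"
)

// tokenize lowercases text and splits it on anything that is not a letter or digit.
// A real pipeline would also stem each token; that step is skipped in this sketch.
func tokenize(text string) []string {
	return strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// Index is a hypothetical inverted index: token -> document ID -> frequency.
type Index struct {
	postings map[string]map[string]int
	docCount int
}

// BuildIndex tokenizes every document and records per-document token counts.
func BuildIndex(docs map[string]string) *Index {
	idx := &Index{postings: make(map[string]map[string]int), docCount: len(docs)}
	for id, text := range docs {
		for _, tok := range tokenize(text) {
			if idx.postings[tok] == nil {
				idx.postings[tok] = make(map[string]int)
			}
			idx.postings[tok][id]++
		}
	}
	return idx
}

// Search sums tf*idf over the query tokens for every document that contains
// at least one of them and returns the document IDs, best score first.
func (idx *Index) Search(query string) []string {
	scores := make(map[string]float64)
	for _, tok := range tokenize(query) {
		docs := idx.postings[tok]
		if len(docs) == 0 {
			continue
		}
		idf := math.Log(1 + float64(idx.docCount)/float64(len(docs)))
		for id, tf := range docs {
			scores[id] += float64(tf) * idf
		}
	}
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool {
		if scores[ids[i]] != scores[ids[j]] {
			return scores[ids[i]] > scores[ids[j]]
		}
		return ids[i] < ids[j] // stable order for equal scores
	})
	return ids
}

func main() {
	idx := BuildIndex(map[string]string{
		"a": "Go web crawler notes",
		"b": "Notes on search engines and search ranking",
		"c": "Cooking recipes",
	})
	fmt.Println(idx.Search("search notes")) // prints [b a]
}
```

Document `b` ranks first because it contains both query tokens and mentions the rarer one ("search") twice; `c` matches nothing and is not returned.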

Data Schema

We use two schemas. The first parses incoming data into a standardized intermediate format. It does not get stored; it's purely an intermediate step before we transform the data into a record for our inverted index, which is the second schema. Why is this important?

  • Because any data gets parsed into this standardized format, you can link any data source you want. If you build your own tool, or if you store a lot of data in some existing one, you don't have to manually add everything. You can pull in data from any data source provided you give the API data in this format.

type Data struct {
    title string //a title of the record, self-explanatory
    link string //links to the source of a record, e.g. a blog post, website, podcast etc.
    content string //actual content of the record, must be text data
    tags []string //list of potential high-level document tags you want to add that will be
                  //indexed in addition to the raw data contained 
}
//smallest unit of data that we store in the database
//this will store each "item" in our search engine with all of the necessary information
//for the inverted index
type Record struct {
	//unique identifier
	ID string `json:"id"`
	//title
	Title string `json:"title"`
	//potential link to the source if applicable
	Link string `json:"link"`
	//text content to display on results page
	Content string `json:"content"`
	//map of tokens to their frequency
	TokenFrequency map[string]int `json:"tokenFrequency"`
}
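
As a rough illustration of how the intermediate format could turn into a stored record, the sketch below counts token frequencies over the title, content, and tags. The helper `toRecord` is hypothetical, the `Data` fields are exported here so the example is self-contained, and whether Apollo counts the title is an assumption; real code would also run tokens through the stemmer.

```go
package main

import (
	"fmt"
	"strings"
)

// Data mirrors the intermediate schema (fields exported for this sketch).
type Data struct {
	Title   string
	Link    string
	Content string
	Tags    []string
}

// Record mirrors the stored schema.
type Record struct {
	ID             string         `json:"id"`
	Title          string         `json:"title"`
	Link           string         `json:"link"`
	Content        string         `json:"content"`
	TokenFrequency map[string]int `json:"tokenFrequency"`
}

// toRecord is a hypothetical helper: it lowercases and whitespace-splits the
// title, content, and tags, and counts how often each token appears.
func toRecord(id string, d Data) Record {
	freq := make(map[string]int)
	count := func(text string) {
		for _, tok := range strings.Fields(strings.ToLower(text)) {
			freq[tok]++
		}
	}
	count(d.Title)
	count(d.Content)
	for _, tag := range d.Tags {
		count(tag)
	}
	return Record{ID: id, Title: d.Title, Link: d.Link, Content: d.Content, TokenFrequency: freq}
}

func main() {
	r := toRecord("1", Data{
		Title:   "Go maps",
		Content: "maps in go are hash maps",
		Tags:    []string{"go"},
	})
	fmt.Println(r.TokenFrequency["maps"], r.TokenFrequency["go"]) // prints 3 3
}
```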

Workflows

Data comes in many forms, and the more varied those forms are, the harder it is to write reliable software to deal with them. If everything I wanted to index was just stuff I wrote, life would be easy. All of my notes would probably live in one place, so I would just have to grab the data from that data source and chill.

The problem is I don't take a lot of notes and not everything I want to index is something I'd take notes of.

So what to do?

Apollo can't handle all types of data; it's not designed to. However, in building a search engine to index stuff, there are a couple of things I focused on:

  1. Any data that comes from a specific platform can be integrated. If you want to index all your Twitter data, for example, this is possible since all of the data can be absorbed in a consistent format, converted into the compatible Apollo format, and sent off. So data sources can be easily integrated; this is by design in case I want to pull in data from personal tools.
  2. The harder thing is what I will call "writing on the internet." I read a lot of stuff on the Internet, much of which I'd like to be able to index without necessarily having to take notes on everything I read, because I'm lazy. The dream would be to just drop a link and have Apollo intelligently try to fetch the content, so I can index it without having to go to the post and copy the content, which would be painful and too slow. This was a large motivation for the web crawler component of the project.
  • If it's writing on the Internet, you should be able to post a link and autofill pwd
  • If it's a podcast episode or any YouTube video, download the text transcription
  • If you want to pull data from a custom data source, add it as a file in the pkg/apollo/sources folder, following the same rules as some of the examples, and make sure to add it in the GetData() method of the source.go file in this package
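
A custom source could look roughly like the sketch below. Only the `map[string]schema.Data` return type of `GetData` comes from the setup instructions; everything else here (the `Tweet` type, `getTweets`, the `tw` key prefix, the `example.com` links, and the local `Data` type standing in for `schema.Data`) is assumed for illustration.

```go
package main

import "fmt"

// Data stands in for schema.Data so this sketch compiles on its own.
type Data struct {
	Title   string
	Link    string
	Content string
	Tags    []string
}

// Tweet is a hypothetical shape for one item exported from a platform.
type Tweet struct {
	ID   string
	Text string
}

// sampleTweets is placeholder input; a real source would read an export or call an API.
var sampleTweets = []Tweet{
	{ID: "1", Text: "Inverted indexes are neat"},
	{ID: "2", Text: "TF-IDF still holds up"},
}

// getTweets converts tweets into the common format, keyed by a source-prefixed ID
// so keys from different sources cannot collide.
func getTweets(tweets []Tweet) map[string]Data {
	out := make(map[string]Data, len(tweets))
	for _, t := range tweets {
		out["tw"+t.ID] = Data{
			Title:   "Tweet " + t.ID,
			Link:    "https://example.com/tweets/" + t.ID, // placeholder URL
			Content: t.Text,
			Tags:    []string{"twitter"},
		}
	}
	return out
}

// GetData merges every registered source into a single map.
func GetData() map[string]Data {
	all := make(map[string]Data)
	sources := []map[string]Data{
		getTweets(sampleTweets),
		// add further sources here
	}
	for _, src := range sources {
		for key, d := range src {
			all[key] = d
		}
	}
	return all
}

func main() {
	for key, d := range GetData() {
		fmt.Println(key, "->", d.Content)
	}
}
```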

Document storage

Local records and data from data sources are stored in separate JSON files. This is for convenience.

I also personally store my Kindle highlights as a JSON file - I use read.amazon.com and a Readwise extension to download the exported highlights for a book. I put any new book JSON files in a kindle folder in the outer directory, and every time the inverted index is recomputed, the Kindle source takes any new book highlights, integrates them into the main kindle.json file stored in the data folder, then deletes the old files.
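
The merge-then-delete flow described above might be sketched as follows. The JSON shapes are assumptions (each new file holds one book with a `title` and a list of `highlights`, and the main file maps titles to highlights), and `mergeHighlights` and `demo` are hypothetical names, not functions from the Apollo source.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// Book is an assumed shape for one exported highlights file.
type Book struct {
	Title      string   `json:"title"`
	Highlights []string `json:"highlights"`
}

// mergeHighlights folds every *.json file in newDir into mainFile, then deletes
// the merged files. It returns how many files were merged.
func mergeHighlights(newDir, mainFile string) (int, error) {
	merged := make(map[string][]string)
	if raw, err := os.ReadFile(mainFile); err == nil {
		if err := json.Unmarshal(raw, &merged); err != nil {
			return 0, err
		}
	} else if !os.IsNotExist(err) {
		return 0, err
	}
	paths, err := filepath.Glob(filepath.Join(newDir, "*.json"))
	if err != nil {
		return 0, err
	}
	for _, p := range paths {
		raw, err := os.ReadFile(p)
		if err != nil {
			return 0, err
		}
		var b Book
		if err := json.Unmarshal(raw, &b); err != nil {
			return 0, err
		}
		merged[b.Title] = append(merged[b.Title], b.Highlights...)
	}
	out, err := json.MarshalIndent(merged, "", "  ")
	if err != nil {
		return 0, err
	}
	if err := os.WriteFile(mainFile, out, 0o644); err != nil {
		return 0, err
	}
	// Only delete the new files once the main file has been written.
	for _, p := range paths {
		if err := os.Remove(p); err != nil {
			return 0, err
		}
	}
	return len(paths), nil
}

// demo runs the merge in a temporary directory and reports
// (files merged, new files left behind).
func demo() (int, int, error) {
	dir, err := os.MkdirTemp("", "apollo-kindle")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)
	newDir := filepath.Join(dir, "kindle")
	if err := os.Mkdir(newDir, 0o755); err != nil {
		return 0, 0, err
	}
	book := []byte(`{"title":"Some Book","highlights":["first highlight"]}`)
	if err := os.WriteFile(filepath.Join(newDir, "book.json"), book, 0o644); err != nil {
		return 0, 0, err
	}
	n, err := mergeHighlights(newDir, filepath.Join(dir, "kindle.json"))
	if err != nil {
		return 0, 0, err
	}
	left, _ := filepath.Glob(filepath.Join(newDir, "*.json"))
	return n, len(left), nil
}

func main() {
	fmt.Println(demo()) // prints 1 0 <nil>
}
```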

Shut up, how can I use it?

Although I built Apollo first and foremost for myself, I also wanted other people to be able to use it if they found it valuable. To use Apollo locally:

  1. Clone the repo: git clone ....
  2. Make sure you have Go installed, along with youtube-dl, which is how we download the subtitles of a video.
  3. Navigate to the root directory of the project: cd apollo. Note that since Apollo syncs from some personal data sources, you'll want to remove them, add your own, or build stuff on top of them. Otherwise the terminal will complain if you attempt to run it, so:
  4. Navigate to pkg/apollo/sources in your preferred editor and replace the body of the GetData function with return make(map[string]schema.Data)
  5. Create a folder named data in the outer directory
  6. Create a .env file in the outermost directory (i.e. in the same directory as the README.md) and add PASSWORD=<value>, where <value> is whatever password you want. This is necessary for adding or scraping data: you'll need to "prove you're Amir", i.e. authenticate yourself, and then you won't need to do this again in the future. If this is not making sense, try adding some data on apollo.amirbolous.com/add and see what happens.
  7. Go back to the outer directory (meaning you should see the files the way GitHub is displaying them right now) and run go run cmd/apollo.go in the terminal.
  8. Navigate to 127.0.0.1:8993 in your browser
  9. It should be working! You can add data and index data from the database. If you run into problems, open an issue or DM me on Twitter.

A little more information on the Add Data section

  • In order to add data, you'll first need to authenticate yourself - enter your password once in the "Please prove you're Amir" box, and if you see a Hooray! popup, you were authenticated successfully. You only need to do this once since we use localStorage to save whether you've been authenticated or not.
  • In order to scrape a website, you'll want to paste a link in the link textbox, then click on the button scrape. Note this does not add the website/content - you still need to click the add button if you want to save it. The web crawler works reliably most of the time if you're dealing with written content on a web page or a YouTube video. We use a Go ported version of readability to scrape the main contents from a page if it's written content and youtube-dl to get the transcript of a video. In the future, I'd like to make this web crawler more robust, but it works well enough most of the time for now.

As a side note, although I want others to be able to use Apollo, this is not a "commercial product" so feel free to open a feature request if you'd like one but it's unlikely I will get to it unless it becomes something I personally want to use.

Notes

  • The inverted index is regenerated once every n days (currently n = 3)
  • Since this is not a commercial product, I will not be running your version of this (if you find it useful) on my server. However, although I designed this first and foremost for myself, I want other people to be able to use it if it's something they find useful; refer to "Shut up, how can I use it?"
  • I had the choice between using Go's gob package and JSON for the database/inverted index. The gob package is definitely faster, but it's only native to Go, so I decided to go with JSON to keep the data available for potential non-Go integrations in the future and to be able to switch the infrastructure completely if I want to.
  • I use a ported version of the Go snowball algorithm for my stemmer. Although I would have liked to build my own stemmer, implementing a robust one (which is what I wanted) was not the focus of the project. Since the algorithm for a stemmer does not need to be maintained like other types of software, I decided to use one out of the box. If I write my own in the future, I'll swap it out.
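
To illustrate the gob versus JSON trade-off mentioned in the notes, here is a small sketch that encodes the same record both ways using only the standard library. The helper names are made up for this example; the point is that the JSON output is plain text any language can read, while the gob bytes are only meaningful to Go's `encoding/gob`.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"encoding/json"
	"fmt"
)

// Record mirrors the stored schema from the Data Schema section.
type Record struct {
	ID             string         `json:"id"`
	Title          string         `json:"title"`
	Link           string         `json:"link"`
	Content        string         `json:"content"`
	TokenFrequency map[string]int `json:"tokenFrequency"`
}

// encodeJSON produces a portable, human-readable encoding.
func encodeJSON(r Record) ([]byte, error) {
	return json.Marshal(r)
}

// encodeGob produces Go's native binary encoding.
func encodeGob(r Record) ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(r); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decodeGob reads a record back from gob bytes.
func decodeGob(b []byte) (Record, error) {
	var r Record
	err := gob.NewDecoder(bytes.NewReader(b)).Decode(&r)
	return r, err
}

func main() {
	r := Record{ID: "1", Title: "Go maps", TokenFrequency: map[string]int{"go": 1, "map": 1}}

	j, err := encodeJSON(r)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(j))

	g, err := encodeGob(r)
	if err != nil {
		panic(err)
	}
	back, err := decodeGob(g)
	if err != nil {
		panic(err)
	}
	fmt.Println(back.Title, back.TokenFrequency["go"])
}
```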

Future

  • Improve the search algorithm, more like Elasticsearch when data grows a lot?
  • Improve the web crawler - make more robust like mercury parser, maybe write my own
  • Speed up search

Inspirations

Issues
  • error in the instructions or not clear

    I don't know go but followed the instructions and replaced the content of the GetData function with return []schema.Data{} after deleting all other content. the function has just that one line and when I run I get:

    cant use []schema.Data{} (type []schema.Data) as type map[string]schema.Data in return argument

    Clearly I shouldn't have deleted everything, but I'm not sure what to do, not knowing Go.

    opened by pcause 8
  • Error when running server

    When I run the server as described in the readme, I get the following errors:

    $ go run cmd/apollo.go
    
    # github.com/amirgamil/apollo/pkg/apollo/sources
    pkg/apollo/sources/kindle.go:50:2: undefined: ensureFileExists
    pkg/apollo/sources/kindle.go:66:3: undefined: deleteFiles
    pkg/apollo/sources/kindle.go:121:11: undefined: getFilesInFolder
    
    opened by mbokinala 3
  • Apollo from command line & Chrome extension

    First of all, amazing project! Really clever way of sifting through search results like that. I was wondering if you thought about cmdline client for apollo? I spend quite a lot of time in my terminal and have to context switch to browser if i need something. Tmuxing into apollo would be really useful in my case, especially for code documentation or papers I've read.

    Might be material for another feature request / idea but having a Chrome / Firefox extension to quickly pull webpages into personal store for apollo as you read them would be very cool. I often read something, leave it in the 200 Chrome tabs I have open and need to find it 2 days later and find myself digging through my history. So far I've tried things like Workona to categorise tabs into projects, but its tedious and I can see Apollo being much less attention hungry than that.

    enhancement 
    opened by dborowiec10 2
  • [suggestion/enhancement] Use pandoc

    You use youtube-dl to get subtitles for videos. For local files and things like PDF's and other form's that might get downloaded, pandoc will convert just about anything to txt. Perhaps if you check in the crawler for file:// and then use pandoc to get the file as text which you can index a lot more options open up. I suspect a lot of people have local files/content they want to add to apollo and this would give huge flexibility.

    enhancement 
    opened by pcause 2
  • Missing license

    Since this repo contains no LICENSE file and no license or copyright notice in any file, we must assume that it is copyright (c) 2021 Amir Bolous, All Rights Reserved. I would not recommend that anyone download, install, distribute, or contribute here until the licensing is resolved.

    This issue appears to affect all other repos owned by Bolous that I have checked.

    opened by Eleison23 1
  • update Readme to reflect current repo status

    While going through the readme I saw this was an issue so I just swapped the name out

    opened by Mrashes 1
  • Error when running server: could not create database for path

    I got the following error when running the server:

    $ go run cmd/apollo.go
    
    2021/07/29 16:14:58 Error, could not create database for path: ./data/sources.json with: open ./data/sources.json: no such file or directory
    exit status 1
    

    This is my file structure:

    .
    ├── LICENSE
    ├── README.md
    ├── cmd
    │   └── apollo.go
    ├── docs
    │   ├── Screen\ Shot\ 2021-07-25\ at\ 4.36.15\ PM.png
    │   ├── apollo.png
    │   └── architecture.png
    ├── go.mod
    ├── go.sum
    ├── pkg
    │   └── apollo
    │       ├── backend
    │       │   ├── api.go
    │       │   ├── searcher.go
    │       │   └── tokenizer.go
    │       ├── data
    │       ├── schema
    │       │   ├── crawler.go
    │       │   └── schema.go
    │       ├── server.go
    │       └── sources
    │           ├── athena.go
    │           ├── kindle.go
    │           ├── source.go
    │           ├── utils.go
    │           └── zeus.go
    ├── static
    │   ├── css
    │   │   └── stylesheet.css
    │   ├── img
    │   │   ├── about.png
    │   │   ├── add.png
    │   │   └── home.png
    │   ├── index.html
    │   └── js
    │       ├── main.js
    │       └── poseidon.min.js
    └── tests
        └── main_test.go
    
    13 directories, 27 files
    
    opened by mbokinala 1
  • Create LICENSE

    opened by amirgamil 0
  • Fixed spelling mistakes

    opened by vishalsodani 0
  • Trying to get in touch regarding a security issue

    Hey there!

    I'd like to report a security issue but cannot find contact instructions on your repository.

    If not a hassle, might you kindly add a SECURITY.md file with an email, or another contact method? GitHub recommends this best practice to ensure security issues are responsibly disclosed, and it would serve as a simple instruction for security researchers in the future.

    Thank you for your consideration, and I look forward to hearing from you!

    (cc @huntr-helper)

    opened by JamieSlome 1
  • Add Dockerfile

    I tried to create a quick Dockerfile that includes youtube-dl. I also had to ignore errors from calling godotenv.Load() to be able to set the env variables from the docker-compose.yml

    opened by JaCoB1123 0
  • Added ApolloNIA - Chrome extension for Apollo

    PR: Chrome extension for Apollo - more in the submodule and at: ApolloNIA

    I've modified the server.go slightly to ensure CORS behaves with the extension (it should not affect normal operation of Apollo)

    opened by dborowiec10 2
  • [enhancement request] take input from an rss/atom feed

    Can you support taking input from an RSS/Atom feed? Lots of sites have feeds and some have data you want to remember. Plus, with rss-bridge, you can get an RSS feed from lots of sites and apps that don't have easy API access, and in a uniform way. For example, I listen to NPR Fresh Air. The page has an RSS feed, so I can add that to Apollo and automatically have an index of the podcasts, at least from the summary/description provided.

    enhancement 
    opened by pcause 1
Owner
Amir Bolous
Hacker, maker, and professional noob