Overview

Lieu

an alternative search engine

Created in response to the environs of apathy concerning the use of hypertext search and discovery. In Lieu, the internet is not what is made searchable, but instead one's own neighbourhood. Put differently, Lieu is a neighbourhood search engine, a way for personal webrings to increase serendipitous connexions.

Goals

  • Enable serendipitous discovery
  • Support personal communities
  • Be reusable, easily

Usage

$ lieu help
Lieu: neighbourhood search engine

Commands
- precrawl (scrapes config's general.url for a list of links: <li> elements containing an anchor tag)
- crawl    (start crawler, crawls all urls in config's crawler.webring file)
- ingest   (ingest crawled data, generates database)
- search   (interactive cli for searching the database)
- host     (hosts search engine over http)

Example:
    lieu precrawl > data/webring.txt
    lieu ingest
    lieu host

Lieu's crawl & precrawl commands output to standard output, for easy inspection of the data. You typically want to redirect their output to the files Lieu reads from, as defined in the config file. See below for a typical workflow.

    Workflow

    • Edit the config
    • Add domains to crawl in config.crawler.webring
      • If you have a webpage with links you want to crawl:
        • Set the config's url field to that page
        • Populate the list of domains to crawl with precrawl: lieu precrawl > data/webring.txt
    • Crawl: lieu crawl > data/source.txt
    • Create database: lieu ingest
    • Host engine: lieu host

    After ingesting the data with lieu ingest, you can also use lieu to search the corpus in the terminal with lieu search.

    Config

    The config file is written in TOML.

    [general]
    name = "Merveilles Webring"
    # used by the precrawl command and linked to in /about route
    url = "https://webring.xxiivv.com"
    port = 10001
    
    [data]
    # the source file should contain the crawl command's output 
    source = "data/crawled.txt"
    # location & name of the sqlite database
    database = "data/searchengine.db"
    # contains words and phrases disqualifying scraped paragraphs from being presented in search results
    heuristics = "data/heuristics.txt"
    # aka stopwords, in the search engine biz: https://en.wikipedia.org/wiki/Stop_word
    wordlist = "data/wordlist.txt"
    
    [crawler]
    # manually curated list of domains, or the output of the precrawl command
    webring = "data/webring.txt"
    # domains that are banned from being crawled but might originally be part of the webring
    bannedDomains = "data/banned-domains.txt"
    # file suffixes that are banned from being crawled
    bannedSuffixes = "data/banned-suffixes.txt"
    # phrases and words which won't be scraped (e.g. if contained in a link)
    boringWords = "data/boring-words.txt"
    # domains that won't be output as outgoing links
    boringDomains = "data/boring-domains.txt"
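As an illustration of how a crawler might consume a list like bannedSuffixes, here is a minimal Go sketch of suffix-based link filtering. The function name skipLink and the sample suffixes are illustrative assumptions, not Lieu's actual internals.

```go
package main

import (
	"fmt"
	"strings"
)

// skipLink reports whether a candidate URL should be skipped because it
// ends in one of the banned suffixes (e.g. as loaded from banned-suffixes.txt).
func skipLink(link string, bannedSuffixes []string) bool {
	for _, s := range bannedSuffixes {
		if strings.HasSuffix(link, s) {
			return true
		}
	}
	return false
}

func main() {
	banned := []string{".jpg", ".png", ".pdf"}
	fmt.Println(skipLink("https://example.com/photo.jpg", banned)) // true
	fmt.Println(skipLink("https://example.com/about", banned))     // false
}
```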

    For your own use, the following config fields should be customized:

    • name
    • url
    • port
    • source
    • webring
    • bannedDomains

    The following config-defined files can stay as-is unless you have specific requirements:

    • database
    • heuristics
    • wordlist
    • bannedSuffixes

    For a full rundown of the files and their various jobs, see the files description.

    License

    The source code is licensed under AGPL-3.0-or-later. Inter is available under the SIL Open Font License, Version 1.1; Noto Serif is licensed under the Apache License, Version 2.0.

    Issues
    • HTML + CSS + overhaul performance

      TLDR:

      • Design is the same except when accessibility mattered most
      • HTML was heavily tweaked in some places for accessibility and semantic reasons, but the design stays the same
      • CSS was remade from scratch but old files are in the "css_old" folder
      • Loading performance was improved by using woff2 instead of ttf

      Notable changes:

      • A reset was added to limit the number of basic CSS fixes or addons to add. It's placed inside the
      • For accessibility reasons, an input MUST have a label. A label was added to the search input. Not very esthetic I know, but well...
      • For accessibility reasons, each page MUST start with at least an h1. It was added in some pages. In other places, h2 were converted to simple text.
      • The CSS is loosely following the CUBE CSS methodology, could maybe use some cleaning.
      • Entries are now lists for accessibility reasons. Screen readers announce the number of elements inside a list when they enter them, and shortcuts help jumping from one to the next

      The commits broadly show my thought process when I coded this. If you need tweaks or corrections, please ask.

      opened by Thomasorus 12
    • Allows the configuration of a proxy for the HTTP client and the Colly HTTP client, Allows extracting precrawl links without matching the pattern

      This PR adds a configuration option which allows a user to configure an http:// or socks:// proxy in lieu.toml, and an additional option for extracting precrawl links from web sites that don't present them in the form of <li><a></a></li>. My intent with this is to build a plugin for the I2P network which enhances the user's ability to search for sites by sharing the task of crawling sites and making it easy to run a small search engine. Simply setting the http_proxy environment variable does not appear to be sufficient due to DNS leakage; the proxy must be configured by replacing the default transport.

      opened by eyedeekay 7
    • Light theme wanted

      Hello, a light theme is needed. I tried to fork and modify the project but had some problems (on my side, probably) related to sqlite. I don't want to deal with them currently, so I'm leaving the task to you :-)

      You probably need to add this to base.css:

      @media screen and (prefers-color-scheme: light) {
          :root {
              --primary: #000;
              --secondary: #fefefe;
          }
      }
      

      It will basically swap colors everywhere, except for the search button. You have hard-coded colors in the logo.svg, there are two ways to swap the colors:

      1. Provide two versions of the file and serve them depending on current theme.
      2. Change the svg color with CSS. Look here or somewhere else maybe.

      Thanks

      opened by bouncepaw 3
    • Crawler indexes all pages on a particular domain rather than pages under a path

      When running Lieu over all the sites in the fediring, we've found that it's only bound by domain rather than domain+path. This causes quirks with static site hosts like cronut.cafe; the only cronut.cafe user who's also a member of the ring is ~sfr, but multiple other users who aren't members have been indexed as well: https://search.fediring.net/?q=cronut

      I think a good solution might be keeping track of not only the domain that's being crawled but also the original URL and ignoring links to parent directories.

      opened by Amolith 1
    • Add opensearch metadata for browser integration

      This allows Lieu to be discovered as a search engine by browsers, which then can be set as the default search engine for example.

      Looks like this in Firefox: [screenshot]

      opened by claudiiii 1
    • Path-enhanced crawling

      Webring sites passed to Lieu which end with a path, e.g. https://example.com/site/lupin will now only have their children pages crawled (as opposed to allowing all pages of example.com being crawled).

      This falls more in line with expectations for webring sites which might exist on shared hosting, or just sites which have separate areas that should not be crawled.
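The behaviour described above can be sketched as a scope check on both host and path prefix. This is an illustrative approximation, not Lieu's actual crawler code.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// inScope reports whether candidate falls under the webring entry's
// domain *and* path prefix, so that an entry like
// https://example.com/site/lupin only admits its child pages.
func inScope(entry, candidate string) bool {
	e, err := url.Parse(entry)
	if err != nil {
		return false
	}
	c, err := url.Parse(candidate)
	if err != nil {
		return false
	}
	return c.Hostname() == e.Hostname() &&
		strings.HasPrefix(c.Path, e.Path)
}

func main() {
	entry := "https://example.com/site/lupin"
	fmt.Println(inScope(entry, "https://example.com/site/lupin/posts/1")) // true
	fmt.Println(inScope(entry, "https://example.com/other/page"))         // false
}
```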

      Thanks @amolith for the issue!

      opened by cblgh 0
    • Implement site:<domain> filter, add indexing of outgoing links

      • Implemented a search function allowing visitors to search the wider web from the perspective of webring denizens, by way of indexing and making searchable the webring's outgoing links
      • Implemented the site:<domain> filter to allow searching the webring for results from a single site
      • Made it possible for static sites to implement search for their sites using Lieu (example code)
      opened by cblgh 0
    • crawl-delay

      Lieu crawls fast; it seems to crawl at a rate of 5req/sec by default.

      Normally, this wouldn't be too worrying; however, the types of sites Lieu crawls are small hobbyist sites. Some people's sites may be self-hosted on limited hardware.

      Respecting a crawl-delay robots.txt directive could help avoid overwhelming smaller sites.

      opened by Seirdy 1
    • Handle query argument type of link style like "?post=xxx"

      Hi there administrators! My blog in the webring uses a link format of index.php?post=20220612001238, like this; however, it looks to me that the crawler doesn't like query arguments in the url: https://github.com/cblgh/lieu/blob/b0ad7dce102d35123bb0092527b7ceea6df8ad86/crawler/crawler.go#L51

      Can there be a way for sites to hint that they may want to use ? or # separated URLs? From what I know, MDWiki is quite popular and it uses #! to specify page links, so that way we could index more pages for these sites as well.

      I can see where # could pose some problems with title links... I'd suggest allowing a <meta> or some sort of tag in the page head to hint to the crawler that some link formats are allowed; if the href matches the "allowed link format" regex, the link would be preserved.

      Thanks!

      opened by ChengduLittleA 0
    • Potential sadness if you find your domain in "boringDomains" config

      :wave: r a d project :partying_face:

      As an admin, it feels a bit uncomfortable putting domains into that config with such a name. The functionality is great & relevant, but in a pubnix / shared server environment, other users might get the wrong idea seeing the naming: I'm trying to focus my search space on relevant links, not saying I think their stuff is "boring".

      Proposal: skipDomains.

      opened by decentral1se 1
    • pizza finds .pizza domain

      I noticed that a search for pizza just shows stuff that is on a site with a .pizza domain. I don't know if this is preventing actual pizza content from turning up or if there isn't any, but either way, it could be more useful if the engine was aware that the match is just part of the domain.
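One possible way to detect such domain-only matches, so a ranker could down-weight them, is a simple hostname check. This is an illustrative sketch, not Lieu's actual ranking code; the function name domainOnlyMatch is made up.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// domainOnlyMatch reports whether a query term appears in a result's
// hostname, which a ranker could use to down-weight hits like ".pizza"
// domains matching the query "pizza".
func domainOnlyMatch(rawURL, term string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	return strings.Contains(strings.ToLower(u.Hostname()), strings.ToLower(term))
}

func main() {
	fmt.Println(domainOnlyMatch("https://tasty.pizza/blog", "pizza"))         // true
	fmt.Println(domainOnlyMatch("https://example.com/pizza-recipe", "pizza")) // false
}
```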

      opened by benatkin 0