
ipfs-search / ipfs-search


Search engine for the Interplanetary Filesystem.

Home Page: http://ipfs-search.com

License: GNU Affero General Public License v3.0

Go 99.59% Makefile 0.15% Shell 0.05% Dockerfile 0.22%
search-engine ipfs-search rabbitmq elasticsearch ipfs golang

ipfs-search's Introduction


Search engine for the Interplanetary Filesystem. Sniffs the DHT gossip and indexes file and directory hashes.

Metadata and contents are extracted using ipfs-tika, searching is done using OpenSearch, queueing is done using RabbitMQ. The crawler is implemented in Go, the API and frontend are built using Node.js.

The ipfs-search command consists of two components: the crawler and the sniffer. The sniffer extracts hashes from the gossip between nodes. The crawler extracts data from the hashes and indexes them.
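As a rough illustration, here is a minimal Go sketch of this pipeline. It is not the actual ipfs-search internals: a channel stands in for RabbitMQ and the hard-coded hash is a placeholder.

package main

import "fmt"

func sniff(out chan<- string) {
    // In ipfs-search the sniffer extracts hashes from DHT gossip;
    // here we feed a single hard-coded example hash.
    out <- "QmS4ustL54uo8FzR9455qaxZwuMiUhyvMcX9Ba8nUH4uVv"
    close(out)
}

func crawl(in <-chan string) {
    // In ipfs-search the crawler extracts data and metadata for each
    // hash and indexes it; here we just print.
    for h := range in {
        fmt.Println("indexing", h)
    }
}

func main() {
    queue := make(chan string, 16) // RabbitMQ plays this role in production
    go sniff(queue)
    crawl(queue)
}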

Docs

Documentation is hosted on Read the Docs, based on files contained in the docs folder. In addition, there's extensive Go docs for the internal API as well as SwaggerHub OpenAPI documentation for the REST API.

Contact

Please find us on our Freenode/Riot/Matrix channel #ipfs-search:matrix.org.

Snapshots

ipfs-search provides daily snapshots of all indexed data. To learn more about downloading and restoring snapshots, please refer to the relevant section in our documentation.

Related repos

Contributors wanted

Building a search engine like this takes a considerable amount of resources (money and TLC). If you are able to help out with either of them, do reach out (see the contact section in this file).

Please read the Contributing.md file before contributing.

Roadmap

For discussing and suggesting features, look at the issues.

External dependencies

  • Go 1.19
  • OpenSearch 2.3.x
  • RabbitMQ / AMQP server
  • NodeJS 9.x
  • IPFS 0.7
  • Redis

Internal dependencies

Building

$ go get ./...
$ make

Running

Docker

The most convenient way to run the crawler is through Docker. Simply run:

docker-compose up

This will start the crawler, the sniffer and all of their dependencies. Hashes can also be queued for crawling manually by running ipfs-search add <hash> from within the running container. For example:

docker-compose exec ipfs-crawler ipfs-search add QmS4ustL54uo8FzR9455qaxZwuMiUhyvMcX9Ba8nUH4uVv

Ansible deployment

Automated deployment can be done on any (virtual) Ubuntu 16.04 machine. The full production stack is automated and can be found in its own repository.

Contributors

This project exists thanks to all the people who contribute.

Backers

Thank you to all our backers! 🙏 [Become a backer]

Sponsors


ipfs-search is supported by NLNet through the EU's Next Generation Internet (NGI0) programme.


RedPencil is supporting the hosting of ipfs-search.com.

Support this project by becoming a sponsor. Your logo will show up here with a link to your website. [Become a sponsor]

ipfs-search's People

Contributors

dokterbob, femans, fnkr, landakram, lastexile16, monkeywithacupcake, mrd0ll4r, szeket, tungland


ipfs-search's Issues

Display last seen & first seen + status colour

We could use last seen as a status indicator of how likely a resource is to be available.

E.g.:

  • last 24 hours: green
  • last 2 days: orange
  • later than 5 days: red

I would recommend using a logarithmic scale here/exponential backoff.
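A minimal Go sketch of such a mapping, using the thresholds from the example above (the 2-to-5-day band is unspecified in the issue; this sketch keeps it orange):

package main

import (
    "fmt"
    "time"
)

// statusColour maps a last-seen timestamp to an availability indicator.
// Thresholds follow the example above; a real implementation could tune
// them on a logarithmic scale, as recommended.
func statusColour(lastSeen time.Time) string {
    age := time.Since(lastSeen)
    switch {
    case age < 24*time.Hour:
        return "green"
    case age < 5*24*time.Hour:
        return "orange" // the 2-to-5-day band is an assumption
    default:
        return "red"
    }
}

func main() {
    fmt.Println(statusColour(time.Now().Add(-3 * time.Hour))) // green
}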

Object type filter

A common use case, at least for me, and I believe for other people too, is to search for content by its type. For example, when searching for Doxygen you can find a myriad of files containing this text, but what you are in fact searching for is, say, doxygen.exe, i.e. querying by the filename (if it has one).

Basically, there should be an easy way to specify what kind of objects you are querying for.

Best regards.

Help wanted

Hello!

It's really nice to have such a search engine, mainly because if you want to check whether something exists, you have no other way; there is no such thing as a catalogue. Hence a search engine like this one seems really useful. I have trouble using it, though. Maybe other people have the same issues too.

Basically, I do not know what it does and how it works. Could there be some examples, a short doc, or a help page? For example, I have just typed "doxygen" and got some results, but I have no idea what the title with the icon means - it says "d", "D", "t"? Also, can I search by filename? Basically that's what I want in this case - not the content so much as the file name (it would also be really nice to have regular expressions, by the way).

Small hint: I see the IPFS guys do this too, but IMHO the size in bytes should be in human-readable form, i.e. KB, MB, GB, etc.

Best Regards!
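A minimal Go sketch of the human-readable size formatting suggested in the hint above, using binary (1024-based) units:

package main

import "fmt"

// humanSize renders a byte count as B, KiB, MiB, GiB, etc.
func humanSize(n int64) string {
    const unit = 1024
    if n < unit {
        return fmt.Sprintf("%d B", n)
    }
    div, exp := int64(unit), 0
    for m := n / unit; m >= unit; m /= unit {
        div *= unit
        exp++
    }
    return fmt.Sprintf("%.1f %ciB", float64(n)/float64(div), "KMGTPE"[exp])
}

func main() {
    fmt.Println(humanSize(12871234)) // prints "12.3 MiB"
}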

Recreate/clean references

Due to some unknown bug (probably in the past) a lot of erroneous references ended up in the search index.

This makes it impossible to retrieve items reliably using their referenced name, making mime type guessing harder. Moreover, it yields incorrect search results.

To do: write a JS script that iterates over all items and checks their references against indexed directories.

Getting an error when running the Manual Provisioning for backend

I tried running the manual provisioning steps on a DO server hosted in Singapore.

When I do so, I run into this error:


ERROR! no action detected in task

The error appears to have been in '/root/go/bin/src/github.com/ipfs-search/ipfs-search/provisioning/roles/common/tasks/main.yml': line 10, column 3, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

  debconf: name=unattended-upgrades question=enable_auto_updates value=true vtype=boolean
- name: set timezone to Singapore
  ^ here

Can someone advise how to fix this?

Index file types to document types

Use mapping in config.yml to map mime types to different object types, depending on their typical rendering (see the sketch after this list):

  • document
  • image
  • video
  • audio
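A minimal Go sketch of such a mapping. The prefixes and type names are illustrative assumptions; the real mapping would be read from config.yml.

package main

import (
    "fmt"
    "strings"
)

// typeByMimePrefix maps mime-type prefixes to document types.
var typeByMimePrefix = map[string]string{
    "text/":  "document",
    "image/": "image",
    "video/": "video",
    "audio/": "audio",
}

func documentType(mime string) string {
    for prefix, t := range typeByMimePrefix {
        if strings.HasPrefix(mime, prefix) {
            return t
        }
    }
    return "other" // fallback for unmapped types; an assumption
}

func main() {
    fmt.Println(documentType("video/mp4")) // video
}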

Sane method of detecting & skipping partials

No idea yet how to implement this. There currently seems to be no consistent way to identify partials - we're currently skipping unreferenced items with the default chunker block size.

Feature Suggestion: Use bleve as Search Engine

Currently we have a reliance on an Elasticsearch server, which uses Java. There is an equivalent called "Bleve" with similar functionality:

  • It has an Apache licence.
  • Written in Go.
  • Already provides a lot of value-add functionality.
  • It is also designed to be sharded onto many servers, and so can handle high throughput.

DGraph implemented Bleve into their stack in one week, so it's not that much work at all.

URLs:
http://www.blevesearch.com/

https://github.com/blevesearch/bleve

There are many Go projects using bleve, so there are heaps of examples out there.
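A minimal sketch of indexing and searching with bleve, using its current v2 import path; the index path and document fields are illustrative assumptions.

package main

import (
    "fmt"

    "github.com/blevesearch/bleve/v2"
)

func main() {
    // Create an on-disk index with the default mapping.
    index, err := bleve.New("ipfs-search.bleve", bleve.NewIndexMapping())
    if err != nil {
        panic(err)
    }

    // Index a document keyed by its IPFS hash.
    doc := map[string]string{"filename": "doxygen.exe"}
    if err := index.Index("QmS4ustL54uo8FzR9455qaxZwuMiUhyvMcX9Ba8nUH4uVv", doc); err != nil {
        panic(err)
    }

    // Query it back.
    result, err := index.Search(bleve.NewSearchRequest(bleve.NewMatchQuery("doxygen")))
    if err != nil {
        panic(err)
    }
    fmt.Println(result)
}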

Tag repo on deploy

So that next time we know what version was running previously. Note that this is 'difficult' as we're often upgrading different components at different times.

Separate index for 'unindexables'

Create a separate index with only empty documents of the following types:

  • invalid
  • partial

This would make the search index more efficient and would prevent partials from being scraped for size in the first place, so they're skipped more easily.

API: Metadata backend

We have a LOT of metadata we're currently storing (some of which we're indexing).

It would be great to have this available through the API, possibly for use in the frontend as well.

Find and add extracted IPFS URLs

We are already extracting URLs from documents, and many of them contain IPFS addresses. It would be great to add these to the indexing queue.

Also see #49
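A minimal Go sketch of pulling IPFS hashes out of already-extracted URLs. The pattern only covers base58 CIDv0 ("Qm...") hashes and is an assumption, not the crawler's actual logic.

package main

import (
    "fmt"
    "regexp"
)

// ipfsPath matches /ipfs/<CIDv0>: "Qm" followed by 44 base58 characters.
var ipfsPath = regexp.MustCompile(`/ipfs/(Qm[1-9A-HJ-NP-Za-km-z]{44})`)

func extractHashes(urls []string) []string {
    var hashes []string
    for _, u := range urls {
        if m := ipfsPath.FindStringSubmatch(u); m != nil {
            hashes = append(hashes, m[1])
        }
    }
    return hashes
}

func main() {
    fmt.Println(extractHashes([]string{
        "https://ipfs.io/ipfs/QmS4ustL54uo8FzR9455qaxZwuMiUhyvMcX9Ba8nUH4uVv/readme",
    }))
}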

Configuration file

Use a configuration file (currently config.yml) for configuration such as thread counts and file type mappings.

Allow Docker/Compose workflow

This should/might replace Vagrant, especially for quick-and-dirty build tests - although a complete build is definitely slower than incremental builds in mounted dirs under Vagrant.

WIP on docker branch.

Missing: ipfs-tika needs to be told where to listen, and where to find IPFS, through environment variables. Plus, cleanup.

DHT sniffer is no longer working

In the recent go-ipfs update:

  • .event will never equal handleAddProvider in ipfs log tail
  • but handleAddProvider is available in Opentracing's logs (maybe)
  • but Opentracing's log no longer outputs multihash. It only prints the peer ID:
$ ipfs log tail|grep handleAddProvider
{"Operation":"handleAddProvider","Fields":[{"Key":"peer","Value":"QmYxoZmhx5kiAd8MK5ZgMF2Co2GpH2xXWSm8T6JPnrwzUt"}], <others..>}

So the code in dht-snifloop.sh, ipfs log tail | jq -r 'if .event == "handleAddProvider" then .key else empty end', is no longer working.

Weird Tika error

Somehow the crawler generates empty metadata requests (to URL http://localhost:8081/). This should only be possible from here:
https://github.com/ipfs-search/ipfs-search/blob/master/crawler/metadata.go#L32

However, I have indeed confirmed that on no occasion does it get an 'empty' URL here... Is this a weird race condition?

Nov 03 23:28:05 oetmoen.ipfs-search.com java[19496]: Fetching: http://localhost:8080/
Nov 03 23:28:05 oetmoen.ipfs-search.com java[19496]: Internal server error:
Nov 03 23:28:05 oetmoen.ipfs-search.com java[19496]: java.io.FileNotFoundException: http://localhost:8080/

I'm getting lots of these and they seem to be uncorrelated to any crawling activity.

Also note that, generally, the crawler seems to function just fine. Metadata seems to be fetched just fine and content gets indexed and everything.

Spare server resources

We could increase the IPFS connection limit, the ES heap size, or scale the server down. (Resources became available because, as of 0.4.18, IPFS is much more resource-efficient.)

First make an inventory of the memory available and then experimentally refine.

Index IPNS

Index IPNS entries. I guess the sensible way would be to create an IPNS index with the IPNS id (pubkey) as ES document id. The document structure could be:

{
  "links": [
    {
      "first-seen": "<date>",
      "last-seen": "<date>",
      "cid": "<ipfs-cid>"
    }
  ]
}

This way we not only store the current state but also previous versions and when they were seen.

@madnificent Does this make sense?
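For illustration, the proposed document could map onto Go types like these (field types are assumptions):

package main

import (
    "encoding/json"
    "fmt"
    "time"
)

// IPNSLink records one observed IPNS -> CID resolution.
type IPNSLink struct {
    FirstSeen time.Time `json:"first-seen"`
    LastSeen  time.Time `json:"last-seen"`
    CID       string    `json:"cid"`
}

// IPNSDocument is the ES document, keyed by the IPNS id (pubkey).
type IPNSDocument struct {
    Links []IPNSLink `json:"links"`
}

func main() {
    doc := IPNSDocument{Links: []IPNSLink{{
        FirstSeen: time.Now(), LastSeen: time.Now(), CID: "<ipfs-cid>",
    }}}
    b, _ := json.Marshal(doc)
    fmt.Println(string(b))
}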

Separate out components

We actually consist of a bunch of components now, many of which belong in their own repository:

  • metadata-api
  • search-api
  • frontend
  • provisioning?

With regard to the latter, the question is whether we want to include provisioning roles in the respective projects' repos or whether we want to keep it all together for maximum consistency. Also, having them in separate repos might cause version mismatches - forcing us to manually update the specific hash on every update (this is the initial reason for going with the current approach).

Alternatively, we might use git submodules, but they imply headaches of their own.

Developer documentation

  1. What's this project about
  2. Directory layout
  3. Where and how to contribute
  4. ... (suggestions welcome)
