
ipfs-search / ipfs-search


Search engine for the Interplanetary Filesystem.

Home Page: http://ipfs-search.com

License: GNU Affero General Public License v3.0

Go 99.59% Makefile 0.15% Shell 0.05% Dockerfile 0.22%
search-engine ipfs-search rabbitmq elasticsearch ipfs golang

ipfs-search's Introduction


Search engine for the Interplanetary Filesystem. Sniffs the DHT gossip and indexes file and directory hashes.

Metadata and contents are extracted using ipfs-tika, searching is done using OpenSearch, queueing is done using RabbitMQ. The crawler is implemented in Go, the API and frontend are built using Node.js.

The ipfs-search command consists of two components: the crawler and the sniffer. The sniffer extracts hashes from the gossip between nodes. The crawler extracts data from the hashes and indexes them.
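As a rough illustration, here is a minimal Go sketch of this pipeline. It is not the actual ipfs-search internals: a channel stands in for RabbitMQ and the hard-coded hash is a placeholder.

package main

import "fmt"

func sniff(out chan<- string) {
    // In ipfs-search the sniffer extracts hashes from DHT gossip;
    // here we feed a single hard-coded example hash.
    out <- "QmS4ustL54uo8FzR9455qaxZwuMiUhyvMcX9Ba8nUH4uVv"
    close(out)
}

func crawl(in <-chan string) {
    // In ipfs-search the crawler extracts data and metadata for each
    // hash and indexes it; here we just print.
    for h := range in {
        fmt.Println("indexing", h)
    }
}

func main() {
    queue := make(chan string, 16) // RabbitMQ plays this role in production
    go sniff(queue)
    crawl(queue)
}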

Docs

Documentation is hosted on Read the Docs, based on files contained in the docs folder. In addition, there's extensive Go docs for the internal API as well as SwaggerHub OpenAPI documentation for the REST API.

Contact

Please find us on our Freenode/Riot/Matrix channel #ipfs-search:matrix.org.

Snapshots

ipfs-search provides daily snapshots of all indexed data. To learn more about downloading and restoring snapshots, please refer to the relevant section in our documentation.

Related repos

Contributors wanted

Building a search engine like this takes a considerable amount of resources (money and TLC). If you are able to help out with either of them, do reach out (see the contact section in this file).

Please read the Contributing.md file before contributing.

Roadmap

For discussing and suggesting features, look at the issues.

External dependencies

  • Go 1.19
  • OpenSearch 2.3.x
  • RabbitMQ / AMQP server
  • NodeJS 9.x
  • IPFS 0.7
  • Redis

Internal dependencies

Building

$ go get ./...
$ make

Running

Docker

The most convenient way to run the crawler is through Docker. Simply run:

docker-compose up

This will start the crawler, the sniffer and all of their dependencies. Hashes can also be queued for crawling manually by running ipfs-search add <hash> from within the running container. For example:

docker-compose exec ipfs-crawler ipfs-search add QmS4ustL54uo8FzR9455qaxZwuMiUhyvMcX9Ba8nUH4uVv

Ansible deployment

Automated deployment can be done on any (virtual) Ubuntu 16.04 machine. The full production stack is automated and can be found in its own repository.

Contributors

This project exists thanks to all the people who contribute.

Backers

Thank you to all our backers! 🙏 [Become a backer]

Sponsors


ipfs-search is supported by NLNet through the EU's Next Generation Internet (NGI0) programme.


RedPencil is supporting the hosting of ipfs-search.com.

Support this project by becoming a sponsor. Your logo will show up here with a link to your website. [Become a sponsor]

ipfs-search's People

Contributors

dokterbob, femans, fnkr, landakram, lastexile16, monkeywithacupcake, mrd0ll4r, szeket, tungland


ipfs-search's Issues

Display last seen & first seen + status colour

We could use last seen as a status indicator of how likely a resource is to be available.

E.g.:

  • last 24 hours: green
  • last 2 days: orange
  • later than 5 days: red

I would recommend using a logarithmic scale here/exponential backoff.
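A minimal Go sketch of such a mapping, using the thresholds from the example above (the 2-to-5-day band is unspecified in the issue; this sketch keeps it orange):

package main

import (
    "fmt"
    "time"
)

// statusColour maps a last-seen timestamp to an availability indicator.
// Thresholds follow the example above; a real implementation could tune
// them on a logarithmic scale, as recommended.
func statusColour(lastSeen time.Time) string {
    age := time.Since(lastSeen)
    switch {
    case age < 24*time.Hour:
        return "green"
    case age < 5*24*time.Hour:
        return "orange" // the 2-to-5-day band is an assumption
    default:
        return "red"
    }
}

func main() {
    fmt.Println(statusColour(time.Now().Add(-3 * time.Hour))) // green
}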

Object type filter

A common use case, at least for me, and I believe for other people too, is to search for content by its type. For example, when searching for Doxygen you can find a myriad of files containing this text, but what you are in fact searching for is, say, doxygen.exe, i.e. querying by the filename (if it has one).

Basically, there should be an easy way to specify what kind of objects you are querying for.

Best regards.

Help wanted

Hello!

It's really nice to have such a search engine, mainly because if you want to check whether something exists, you have no other way; there is no such thing as a catalogue. Hence a search engine like this one seems really useful. I have trouble using it, though. Maybe other people have the same issues too.

Basically, I do not know what it does and how it works. Could there be some examples, a short doc, or a help page? For example, I have just typed "doxygen" and got some results, but I have no idea what the title with the icon means - it says "d", "D", "t"? Also, can I search by filename? Basically that's what I want in this case - not the content so much as the file name (it would also be really nice to have regular expressions, by the way).

Small hint: I see the IPFS guys do this too, but IMHO the size in bytes should be in human-readable form, i.e. KB, MB, GB, etc.

Best Regards!
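A minimal Go sketch of the human-readable size formatting suggested in the hint above, using binary (1024-based) units:

package main

import "fmt"

// humanSize renders a byte count as B, KiB, MiB, GiB, etc.
func humanSize(n int64) string {
    const unit = 1024
    if n < unit {
        return fmt.Sprintf("%d B", n)
    }
    div, exp := int64(unit), 0
    for m := n / unit; m >= unit; m /= unit {
        div *= unit
        exp++
    }
    return fmt.Sprintf("%.1f %ciB", float64(n)/float64(div), "KMGTPE"[exp])
}

func main() {
    fmt.Println(humanSize(12871234)) // prints "12.3 MiB"
}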

Recreate/clean references

Due to some unknown bug (probably in the past) a lot of erroneous references ended up in the search index.

This makes it impossible to retrieve items reliably using their referenced name, making mime type guessing harder. Moreover, it yields incorrect search results.

To do: write a JS script that iterates over all items and checks their references against indexed directories.

Getting an error when running the Manual Provisioning for backend

I tried running the manual provisioning steps on a DO server hosted in Singapore.

When I do so, I run into this error:


ERROR! no action detected in task

The error appears to have been in '/root/go/bin/src/github.com/ipfs-search/ipfs-search/provisioning/roles/common/tasks/main.yml': line 10, column 3, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

  debconf: name=unattended-upgrades question=enable_auto_updates value=true vtype=boolean
- name: set timezone to Singapore
  ^ here

Can someone advise how to fix this?

Index file types to document types

Use mapping in config.yml to map mime types to different object types, depending on their typical rendering (see the sketch after this list):

  • document
  • image
  • video
  • audio
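A minimal Go sketch of such a mapping. The prefixes and type names are illustrative assumptions; the real mapping would be read from config.yml.

package main

import (
    "fmt"
    "strings"
)

// typeByMimePrefix maps mime-type prefixes to document types.
var typeByMimePrefix = map[string]string{
    "text/":  "document",
    "image/": "image",
    "video/": "video",
    "audio/": "audio",
}

func documentType(mime string) string {
    for prefix, t := range typeByMimePrefix {
        if strings.HasPrefix(mime, prefix) {
            return t
        }
    }
    return "other" // fallback for unmapped types; an assumption
}

func main() {
    fmt.Println(documentType("video/mp4")) // video
}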

Sane method of detecting & skipping partials

No idea yet how to implement this. There currently seems to be no consistent way to identify partials - we're currently skipping unreferenced items with the default chunker block size.

Feature Suggestion: Use bleve as Search Engine

Currently we have a reliance on an Elasticsearch server, which uses Java. There is an equivalent called "Bleve" with similar functionality:

  • It has an Apache licence.
  • Written in Go.
  • Already provides a lot of value-add functionality.
  • It is also designed to be sharded onto many servers, and so can handle high throughput.

DGraph implemented Bleve into their stack in one week, so it's not that much work at all.

URLs:
http://www.blevesearch.com/

https://github.com/blevesearch/bleve

There are many Go projects using bleve, so there are heaps of examples out there.
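A minimal sketch of indexing and searching with bleve, using its current v2 import path; the index path and document fields are illustrative assumptions.

package main

import (
    "fmt"

    "github.com/blevesearch/bleve/v2"
)

func main() {
    // Create an on-disk index with the default mapping.
    index, err := bleve.New("ipfs-search.bleve", bleve.NewIndexMapping())
    if err != nil {
        panic(err)
    }

    // Index a document keyed by its IPFS hash.
    doc := map[string]string{"filename": "doxygen.exe"}
    if err := index.Index("QmS4ustL54uo8FzR9455qaxZwuMiUhyvMcX9Ba8nUH4uVv", doc); err != nil {
        panic(err)
    }

    // Query it back.
    result, err := index.Search(bleve.NewSearchRequest(bleve.NewMatchQuery("doxygen")))
    if err != nil {
        panic(err)
    }
    fmt.Println(result)
}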

Tag repo on deploy

So that next time we know what version was running previously. Note that this is 'difficult' as we're often upgrading different components at different times.

Separate index for 'unindexables'

Create a separate index with only empty documents of the following types:

  • invalid
  • partial

This would make the search index more efficient and would prevent partials from being scraped for size in the first place, so they're skipped more easily.

API: Metadata backend

We have a LOT of metadata we're currently storing (some of which we're indexing).

It would be great to have this available through the API, possibly for use in the frontend as well.

Find and add extracted IPFS URLs

We are already extracting URLs from documents, and many of them contain IPFS addresses. It would be great to add these to the indexing queue.

Also see #49
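A minimal Go sketch of pulling IPFS hashes out of already-extracted URLs. The pattern only covers base58 CIDv0 ("Qm...") hashes and is an assumption, not the crawler's actual logic.

package main

import (
    "fmt"
    "regexp"
)

// ipfsPath matches /ipfs/<CIDv0>: "Qm" followed by 44 base58 characters.
var ipfsPath = regexp.MustCompile(`/ipfs/(Qm[1-9A-HJ-NP-Za-km-z]{44})`)

func extractHashes(urls []string) []string {
    var hashes []string
    for _, u := range urls {
        if m := ipfsPath.FindStringSubmatch(u); m != nil {
            hashes = append(hashes, m[1])
        }
    }
    return hashes
}

func main() {
    fmt.Println(extractHashes([]string{
        "https://ipfs.io/ipfs/QmS4ustL54uo8FzR9455qaxZwuMiUhyvMcX9Ba8nUH4uVv/readme",
    }))
}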

Configuration file

Use a configuration file (currently config.yml) for configuration such as thread counts and file type mappings.

Allow Docker/Compose workflow

This should/might replace Vagrant, especially for quick-and-dirty build tests - although a complete build is definitely slower than incremental builds in mounted dirs under Vagrant.

WIP on docker branch.

Missing: ipfs-tika needs to be told where to listen, and where to find IPFS, through environment variables. Plus, cleanup.

DHT sniffer is no longer working

In the recent go-ipfs update:

  • .event will never equal handleAddProvider in ipfs log tail
  • but handleAddProvider is available in Opentracing's logs (maybe)
  • but Opentracing's log no longer outputs multihash. It only prints the peer ID:
$ ipfs log tail|grep handleAddProvider
{"Operation":"handleAddProvider","Fields":[{"Key":"peer","Value":"QmYxoZmhx5kiAd8MK5ZgMF2Co2GpH2xXWSm8T6JPnrwzUt"}], <others..>}

So the code in dht-snifloop.sh, ipfs log tail | jq -r 'if .event == "handleAddProvider" then .key else empty end', is no longer working.

Weird Tika error

Somehow the crawler generates empty metadata requests (to URL http://localhost:8081/). This should only be possible from here:
https://github.com/ipfs-search/ipfs-search/blob/master/crawler/metadata.go#L32

However, I have indeed confirmed that on no occasion does it get an 'empty' URL here... Is this a weird race condition?

Nov 03 23:28:05 oetmoen.ipfs-search.com java[19496]: Fetching: http://localhost:8080/
Nov 03 23:28:05 oetmoen.ipfs-search.com java[19496]: Internal server error:
Nov 03 23:28:05 oetmoen.ipfs-search.com java[19496]: java.io.FileNotFoundException: http://localhost:8080/

I'm getting lots of these and they seem to be uncorrelated to any crawling activity.

Also note that, generally, the crawler seems to function just fine. Metadata seems to be fetched just fine and content gets indexed and everything.

Spare server resources

We could increase the IPFS connection limit, the ES heap size, or scale the server down. (Resources became available because, as of 0.4.18, IPFS is much more resource-efficient.)

First make an inventory of the memory available and then experimentally refine.

Index IPNS

Index IPNS entries. I guess the sensible way would be to create an IPNS index with the IPNS id (pubkey) as ES document id. The document structure could be:

{
  "links": [
    {
      "first-seen": "<date>",
      "last-seen": "<date>",
      "cid": "<ipfs-cid>"
    }
  ]
}

This way we not only store the current state but also previous versions and when they were seen.

@madnificent Does this make sense?
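For illustration, the proposed document could map onto Go types like these (field types are assumptions):

package main

import (
    "encoding/json"
    "fmt"
    "time"
)

// IPNSLink records one observed IPNS -> CID resolution.
type IPNSLink struct {
    FirstSeen time.Time `json:"first-seen"`
    LastSeen  time.Time `json:"last-seen"`
    CID       string    `json:"cid"`
}

// IPNSDocument is the ES document, keyed by the IPNS id (pubkey).
type IPNSDocument struct {
    Links []IPNSLink `json:"links"`
}

func main() {
    doc := IPNSDocument{Links: []IPNSLink{{
        FirstSeen: time.Now(), LastSeen: time.Now(), CID: "<ipfs-cid>",
    }}}
    b, _ := json.Marshal(doc)
    fmt.Println(string(b))
}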

Separate out components

We actually consist of a bunch of components now, many of which belong in their own repository:

  • metadata-api
  • search-api
  • frontend
  • provisioning?

With regard to the latter, the question is whether we want to include provisioning roles in the respective projects' repos or whether we want to keep it all together for maximum consistency. Also, having them in separate repos might cause version mismatches - forcing us to manually update the specific hash on every update (this is the initial reason for going with the current approach).

Alternatively, we might use git submodules, but they imply headaches of their own.

Developer documentation

  1. What's this project about
  2. Directory layout
  3. Where and how to contribute
  4. ... (suggestions welcome)
