A Modest Proposal For Federated Search In ActivityPub

As a tech nerd with strong opinions on a lot of things (as you probably already knew if you read literally any of my other posts or stuff I write on social media), I obviously have opinions regarding how federated search ought to work. I don't think anyone is ever going to really listen to this, but hey, nobody listens to political pundits either and it hasn't stopped them.

Note this is going to be more of a "soft tech" than "hard tech" post, but that's most of what I post anyway.

Status quo

Right now in ActivityPub, every message is sent to every server a connection has been made with, and stored there persistently. This is basically unavoidable: even if servers were instead handed a link to a status for the client to follow, that would open up massive XSS issues and be really inefficient. It also wouldn't solve anything, as servers could still scrape the endpoint given, and it would create even bigger privacy implications (anyone can read the message if they have the link).

But even in a centralised social media system, you have to trust the security of the central authority anyway, and anyone who has ever followed you has your posts. So ultimately this point isn't terribly important.

Why do I mention all this? Because it makes one way of doing search very tempting: just search the posts the local server already has. This is how Mastodon does it, I think this is how Pleroma does it, and it is how any future implementation is likely to do it.

However, this has a few drawbacks.

Unable to search posts you've not seen yet

Some may consider this a feature, but I consider it a violation of user expectations: you can't search for posts the server has never seen. This means when a new instance appears, its users can't search for any posts older than the instance itself. Bummer.

This could theoretically be worked around by retrieving past posts, but this has its own implications for resource usage and privacy.

This also means that if you wish to search for older posts, you need an account on a big instance like mastodon.social, which is likely to have seen these posts.

No flags for search privacy

At present, ActivityPub does not have any way to hint to clients that a post should not show up in searchable indexes. In my view, this is actually a bug, and this should be added, even if servers ultimately are sworn "on their honour" to respect this.

Regardless, this is a case of being careful what instances you follow people on. Technology can't solve a social problem like this, as far as I know.
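
To make the "flag" idea a bit more concrete, here's a minimal sketch of what honouring such a hint could look like on the indexing side. Note the `searchable` field is purely my invention for illustration; ActivityPub defines nothing like it, and the `Post` type and `build_index` helper are equally hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Post:
    id: str
    content: str
    # Hypothetical opt-out hint; not part of ActivityPub.
    searchable: bool = True


def build_index(posts):
    """Return only the posts whose author has not opted out of search."""
    return [p for p in posts if p.searchable]


posts = [
    Post("1", "hello fediverse"),
    Post("2", "please do not index me", searchable=False),
]
indexed = build_index(posts)
```

Of course nothing here forces a server to run the filter; it's still "on their honour".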

Bearing costs for searching remote posts

Ultimately, the local server bears the costs for searching remote posts, possibly in a very real sense, since Mastodon uses Elasticsearch and many hosted Elasticsearch providers are metered.

Proposal

What I propose be done is simple:

Distributed search.

By this I mean that search queries are sent to remote servers on behalf of local clients, and the results are reported back by those remote servers.

Queries can be sent to all known instances connected to the server, or to individual servers.
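
As a rough idea of what might travel over the wire, here's a sketch of a query and a response as plain JSON. Everything in it is an assumption on my part: the field names (`terms`, `requester`, `limit`, `offset`, `truncated`) and the helper functions are made up, and no such format exists in ActivityPub today.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SearchQuery:
    terms: str
    requester: str  # domain of the asking instance (hypothetical field)
    limit: int = 20
    offset: int = 0


@dataclass
class SearchResponse:
    instance: str
    post_urls: list = field(default_factory=list)
    # True when the remote side had more hits than it was willing to return.
    truncated: bool = False


def encode(obj) -> str:
    """Serialise a query or response to a JSON string."""
    return json.dumps(asdict(obj), sort_keys=True)


def decode_query(raw: str) -> SearchQuery:
    """Rebuild a SearchQuery from its JSON form."""
    return SearchQuery(**json.loads(raw))
```

Keeping the query this small is deliberate: the less expressive the request, the less power a remote searcher has.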

Advantages

In my opinion, this offers many advantages:

  • Limiting of search query power through careful API design
  • Respect for local flags for search privacy
  • No need to keep expensive indexes for remote posts
  • Allows searching for posts that predate the searcher's instance
  • Distributes costs of searching posts across many nodes
  • Instances can disallow searching for certain search terms (like Nazi) and have it generally respected
  • Instances can disable searching of posts altogether if they don't want their posts searched
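
Several of these advantages come down to the remote instance getting to decide how it answers. Here's a minimal sketch of that decision, assuming a made-up `InstancePolicy` with a kill switch, a blocked-term list, and a result cap, plus the hypothetical per-post `searchable` flag. The matching is a naive substring check standing in for a real index.

```python
from dataclasses import dataclass, field


@dataclass
class Post:
    url: str
    content: str
    searchable: bool = True  # hypothetical per-post opt-out


@dataclass
class InstancePolicy:
    search_enabled: bool = True
    blocked_terms: set = field(default_factory=set)  # lowercase words
    max_results: int = 20


def handle_search(policy, posts, terms, limit):
    """Answer a remote query according to this instance's own rules."""
    if not policy.search_enabled:
        return []  # instance has opted out of search entirely
    words = terms.lower().split()
    if not words or any(w in policy.blocked_terms for w in words):
        return []  # empty query, or a term this instance refuses to serve
    hits = [
        p.url
        for p in posts
        if p.searchable and all(w in p.content.lower() for w in words)
    ]
    # The instance, not the searcher, has the final say on result count.
    return hits[: min(limit, policy.max_results)]
```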

Disadvantages

I admit, however, that this system has several disadvantages, although some can be mitigated:

  • Bandwidth; although I consider this a red herring, since posts already use a lot of bandwidth (often posts need to be sent to hundreds if not thousands of instances, especially when popular users make posts)
  • CPU costs; although throttling of queries can mitigate this one
  • Doesn't solve all privacy issues; this is generally intractable given the current state of the art, regardless of search method used
  • Remote instance search filtering or prohibitions could be worked around by just searching the stored posts anyway
  • UI; since search is usually paginated, each instance will only return a limited number of results; a user-friendly way to indicate this would need to be figured out
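
For the CPU cost point, throttling could be as simple as a token bucket kept per requesting instance. This is just one way to do it, and the `rate` and `capacity` numbers are placeholders, not recommendations. The `clock` parameter only exists so the behaviour can be checked without actually waiting.

```python
import time


class TokenBucket:
    """Allow bursts of up to `capacity` queries, refilled at `rate` per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        """Return True and spend a token if a query may run now."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A server would keep one bucket per remote domain and answer with an error (or just an empty result) when `allow()` returns False.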

How it would work

A sort of high-level overview:

  • Search scope is figured out, i.e. which servers need to receive which queries and for which users, or whether the query applies to all servers; maximum scope is all connected instances
  • If the default is to only show a certain number of posts, the number of instances can be limited accordingly
  • Users could choose instances to search on
  • All relevant instances are sent a query on some well-known API endpoint
  • The responses are gathered and, after a certain period of time (or asynchronously, using AJAX or whatever), sent to the user
  • Pagination, etc. all works the same way
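
The steps above can be sketched with asyncio. To keep it runnable without a network, `query_instance` is a stand-in that just sleeps and returns canned results; a real version would make an HTTP request to whatever well-known endpoint gets agreed on (none exists today). The function names and the `(name, delay, results)` tuples are all made up for the example.

```python
import asyncio


async def query_instance(name, delay, results):
    """Stand-in for a request to a hypothetical remote search endpoint."""
    await asyncio.sleep(delay)
    return name, results


async def federated_search(instances, timeout, limit):
    """Fan a query out, wait up to `timeout` seconds, and merge what came back.

    Returns (merged results, instances that answered, number that timed out).
    """
    if not instances:
        return [], [], 0
    tasks = [asyncio.create_task(query_instance(n, d, r)) for n, d, r in instances]
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    # Give up on slow instances rather than keeping the user waiting.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)

    merged, answered = [], []
    for task in tasks:  # iterate in request order for a stable result
        if task in done and task.exception() is None:
            name, results = task.result()
            answered.append(name)
            merged.extend(results)
    return merged[:limit], answered, len(pending)
```

Returning the list of instances that answered, along with the count that didn't, gives the UI something to show the user about how complete the results are.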

Conclusion

I think such a system would be a good solution for search privacy and distributing queries. I don't think it's perfect, but perfection is impossible to achieve when humans are involved.

I could also be full of shit and this is a terrible idea; this is always a possibility. :p
