Published: Tue 13 March 2018
As a tech nerd with strong opinions on a lot of things (as you probably already knew if you read literally any of my other posts or stuff I write on social media), I obviously have opinions regarding how federated search ought to work. I don't think anyone is ever going to really listen to this, but hey, nobody listens to political pundits either and it hasn't stopped them.
Note this is going to be more of a "soft tech" than "hard tech" post, but that's most of what I post anyway.
Right now in ActivityPub, all messages are sent to every server a connection has been made with, and are stored there persistently. This is basically unavoidable; even if servers were instead given a link to a status for the client to follow, that would open up massive XSS issues and be really inefficient. It also wouldn't solve anything, as servers could still scrape the endpoint given, and it would create even bigger privacy problems (anyone can read the message if they have the link).
But even in a centralised social media system, you have to trust the security of the central authority anyway, and anyone who has ever followed you has your posts. So ultimately this point isn't terribly important.
Why do I mention all this? Because this brings about a major tempting way for search to work: just search the local posts. This is how Mastodon does it, I think this is how Pleroma does it, and how any future implementation is likely to do it.
However, this has a few drawbacks.
Unable to search posts you've not seen yet
Some may consider this a feature, but I consider it a break of user expectations: you can't search for posts the server has never seen. This means when a new instance appears, users can't search for any posts older than the instance itself. Bummer.
This could theoretically be worked around by retrieving past posts, but this has its own implications for resource usage and privacy.
This also means if you wish to search for older posts, you have to have an account on a big instance like mastodon.social which is likely to have seen these posts.
No flags for search privacy
At present, ActivityPub does not have any way to hint to clients that a post should not show up in searchable indexes. In my view, this is actually a bug, and this should be added, even if servers ultimately are sworn "on their honour" to respect this.
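To make the idea concrete, here is a minimal sketch of what honouring such a flag could look like on the indexing side. The `searchable` property is entirely made up for illustration (ActivityPub defines nothing of the sort today), and treating a missing flag as "go ahead and index" is my assumption, not a settled default.

```python
from typing import Iterable

# "searchable" is a hypothetical property name; ActivityPub defines no such flag.
SEARCH_FLAG = "searchable"


def is_indexable(post: dict) -> bool:
    """Return False only when the author explicitly opted out of search."""
    # Assumption: a post without the flag is treated as searchable.
    return post.get(SEARCH_FLAG, True) is not False


def indexable_posts(posts: Iterable[dict]) -> list[dict]:
    """Keep only the posts a well-behaved server would add to its index."""
    return [post for post in posts if is_indexable(post)]
```

Nothing here stops a badly behaved server from ignoring the flag, which is exactly the "on their honour" part.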
Regardless, this is a case of being careful what instances you follow people on. Technology can't solve a social problem like this, as far as I know.
Bearing costs for searching remote posts
Ultimately, the local server bears the costs for searching remote posts, possibly in a real sense, since Mastodon uses ElasticSearch and many providers of ES are metered.
What I propose is simple: search queries are sent to remote servers on behalf of local clients, and the results are reported back by those remote servers.
Queries can be searched from all known instances connected to the server, or individual servers.
In my opinion, this offers many advantages:
Limiting of search query power through careful API design
Respect for local flags for search privacy
No need to keep expensive indexes for remote posts
Allows posts from before the searcher's instance existed to be searched for
Distributes costs of searching posts across many nodes
Instances can disallow searching for certain search terms (like Nazi) and have it generally respected
Instances can disable searching of posts altogether if they don't want their posts searched
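A rough sketch of how the receiving instance could enforce its own rules before answering a remote query. Everything named here (`SearchPolicy`, `handle_remote_query`, the status strings, the `searchable` flag on posts) is hypothetical, and the substring match stands in for whatever real index an instance would use.

```python
from dataclasses import dataclass, field


@dataclass
class SearchPolicy:
    # Hypothetical per-instance settings; no existing server has this object.
    enabled: bool = True
    blocked_terms: set[str] = field(default_factory=set)  # lowercase words


def handle_remote_query(policy: SearchPolicy, query: str, posts: list[dict]) -> dict:
    """Answer a search query from another instance, applying local policy."""
    if not policy.enabled:
        # The instance does not allow its posts to be searched at all.
        return {"status": "search_disabled", "results": []}

    if set(query.lower().split()) & policy.blocked_terms:
        # At least one word of the query is on the local blocklist.
        return {"status": "term_blocked", "results": []}

    needle = query.lower()
    hits = [
        post
        for post in posts
        # Naive substring match; also skip posts that opted out of search
        # via the (hypothetical) "searchable" flag.
        if needle in post["content"].lower()
        and post.get("searchable", True) is not False
    ]
    return {"status": "ok", "results": hits}
```

The point of the design is that the policy lives with the instance that owns the posts, so the searcher's server never gets to decide what is allowed.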
I admit, however, that this system may have several disadvantages, although some can be mitigated:
Bandwidth; although I consider this a red herring, since posts already use a lot of bandwidth (often posts need to be sent to hundreds if not thousands of instances, especially when popular users make posts)
CPU costs; although throttling of queries can mitigate this one
Doesn't solve all privacy issues; this is generally intractable given the current state of the art, regardless of search method used
Remote instance search filtering or prohibitions could be worked around by just searching the stored posts anyway
UI; since search is usually paginated, it means searches from each instance will have a limited number of returns; a user-friendly way to note this would need to be figured out
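For the CPU cost point, throttling could be as simple as a token bucket per querying instance. This is only a sketch under my own assumptions: the class name, the rate and burst numbers, and keying on the instance's domain are all illustrative choices, not anything an existing server does.

```python
import time


class QueryThrottle:
    """Token bucket per remote instance: `burst` queries at once, refilled at `rate_per_second`."""

    def __init__(self, rate_per_second: float, burst: int, clock=time.monotonic):
        self.rate = rate_per_second
        self.burst = burst
        self.clock = clock  # injectable so the behaviour can be tested without sleeping
        self.buckets: dict[str, tuple[float, float]] = {}  # instance -> (tokens, last_seen)

    def allow(self, instance: str) -> bool:
        """Return True if this instance may run a query right now."""
        now = self.clock()
        tokens, last = self.buckets.get(instance, (float(self.burst), now))
        # Refill according to elapsed time, never exceeding the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < 1:
            self.buckets[instance] = (tokens, now)
            return False
        self.buckets[instance] = (tokens - 1, now)
        return True
```

An instance that gets `False` back would simply be told to retry later, which keeps one chatty server from eating everyone else's search capacity.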
How it would work
A sort of high-level overview:
Search scope is figured out, i.e. what servers need to receive what queries and for what users, or if it applies to all servers; maximum scope is all connected instances
If the default is to only show a certain number of posts, the number of instances can be limited accordingly
Users could choose instances to search on
All relevant instances are sent a query on some well-known API endpoint
The reports are gathered and after a certain period of time (or using AJAX or whatever for asynchronous reports), they are sent to the user
Pagination, etc. all works the same way
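The steps above could be sketched roughly like this. The `send_query` callable is a placeholder for "POST the query to some well-known endpoint on that instance", which I deliberately leave undefined since no such endpoint exists; `federated_search`, `max_instances`, and the shape of the returned dict are likewise my own invention.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, Optional

# Hypothetical transport: takes (instance, query) and returns that instance's hits.
SendQuery = Callable[[str, str], Awaitable[list[dict]]]


async def federated_search(
    query: str,
    instances: Iterable[str],
    send_query: SendQuery,
    timeout: float = 5.0,
    max_instances: Optional[int] = None,
) -> dict:
    """Fan a query out to the instances in scope and gather whatever comes back in time."""
    # Scope: all given instances, optionally capped (slicing with None keeps all).
    scope = list(instances)[:max_instances]

    async def ask(instance: str):
        try:
            return instance, await asyncio.wait_for(send_query(instance, query), timeout)
        except (asyncio.TimeoutError, OSError):
            # Too slow or unreachable: record it instead of failing the whole search.
            return instance, None

    replies = await asyncio.gather(*(ask(instance) for instance in scope))
    results = [post for _, posts in replies if posts for post in posts]
    unanswered = [instance for instance, posts in replies if posts is None]
    return {"results": results, "unanswered": unanswered}
```

Returning the list of instances that didn't answer gives the UI something honest to show ("results from 12 of 15 instances") rather than pretending the search was complete.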
I think such a system would be a good solution for search privacy and distributing queries. I don't think it's perfect, but perfection is impossible to achieve when humans are involved.
I could also be full of shit and this is a terrible idea; this is always a possibility. :p