Hello, I am Dr. Ajay Kumar, PhD.
I am a mathematician with a strong interest in technology. One of my projects is to set up a system for producing interesting mathematics and computer science research outside of the universities. The current name for my project is "QAnal", which stands for "rational analysis". I could (and should) write a whole 1000-page manifesto on why mathematics has become so difficult to understand and so inaccessible to most people. Centralization of knowledge in universities is one of the many boogeymen. Even mathematics articles on Wikipedia are completely incomprehensible. They shouldn't be, given the astounding simplicity of the ideas.
I haven't set up a website yet, but I've begun incrementally publishing some of my work on GitLab: https://gitlab.com/DoctorAjayKumar
Real reason I'm here:
I am good friends with a distributed systems engineer named Craig Everett ( http://zxq9.com/ , @zxq9 on Twitter). Craig is having some trouble making an account, so I'm introducing him.
We've been talking for a while about how to write software to solve the cancellation problem. I have, at best, a vague grasp of what the software looks like. Craig seems to have a very clear picture of what the software needs to be. I'm trying to learn enough about distributed systems so that I can help.
Craig sent me this, to introduce himself:
My name is Craig Everett. I work as a distributed systems engineer and am very familiar with technical solutions to decentralized data distribution, peering and meshed network systems, robust architecture, and finding optimal tradeoffs for problems inherent in dealing with distributed data. I very much want to write a system and have been discussing one with Dr. Kumar for quite a while, but don't see any realistic way to get development funded, particularly because the first phase of development is quite difficult to monetize (software-as-infrastructure always is). Anyway, I'll explain in very rough terms how such a thing can work without getting too in the weeds.
First off: Separation of concerns.
1. Distributed data is about infrastructure: retrieving the correct bits, verifying they are the expected bits, and hopefully doing that in a timely manner.
2. Social media features are an application issue and an orthogonal concern.
You could certainly write a distributed social media app that conflates everything together, but that would be a massive waste of effort. The same infrastructure that could support the creation of a Twitter-like alternative could also support the creation of a YouTube-like alternative, leaving the difference between the two up to an application author who need not know or understand the details of the infrastructure component he's writing his application on top of. Any data schema that works by referenced retrieval (that is to say, non-ACID, linked or navigational data) could be based on the same infrastructure. (Contrast this with relational data that requires ACID-compliant features -- a much stronger model for representation of data, but much harder to do in a distributed way. It is doable, but would take longer to implement.)
There are necessary tradeoffs between complete anonymity and latency, and in the context of social media, latency is the more valuable feature to favor. This does not mean real names are necessarily known (that is an application-level issue) but rather that canonical network origin is known (basically the IP address of a node providing data), just as in any other peer-to-peer system. Pure peer systems suffer from resource contention when a very popular resource is located somewhere in a bandwidth-constrained network (residential networks, for example). Systems such as BitTorrent get around this by chunking data and distributing the chunks broadly. A robust data infrastructure must combine approaches to achieve both robust, distributed data provision and the very high responsiveness required by streaming, chat, and similar services.
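To make the chunking idea concrete, here is a minimal sketch of BitTorrent-style chunk-and-verify. All names (`make_manifest`, `verify_chunk`, the chunk size) are illustrative, not a real API; the point is that once each chunk has a known hash, any peer can serve any chunk and the receiver can verify it independently.

```python
import hashlib

CHUNK_SIZE = 4  # tiny for illustration; real systems use 256 KiB or larger

def make_manifest(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split a resource into fixed-size chunks and record each chunk's hash.

    The manifest is what lets chunks be sourced from many caches at once:
    provenance no longer matters, only that the hash matches.
    """
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    manifest = [hashlib.sha256(c).hexdigest() for c in chunks]
    return manifest, chunks

def verify_chunk(chunk: bytes, expected_hash: str) -> bool:
    # "Verifying they are the expected bits": hash must match the manifest.
    return hashlib.sha256(chunk).hexdigest() == expected_hash

manifest, chunks = make_manifest(b"hello, distributed world")
assert all(verify_chunk(c, h) for c, h in zip(chunks, manifest))
assert not verify_chunk(b"tampered!", manifest[0])
```

The design choice worth noticing: hashing shifts trust from *who served the bits* to *what the bits are*, which is exactly what allows popular resources to be pulled from untrusted residential peers.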
The way this manifests in software will feel familiar to anyone who has used a P2P application before:
- The users' systems are the physical infrastructure the system runs on
- The nodes inform each other of other available connections to form a data distribution mesh
- A canonical name registry origin is defined, with the contents of that registry being append-only to enable distribution and caching of it (so that outages do not damage the network -- this is as close to an SPOF as the system can have)
- Resource retrieval involves something like a web request and a torrent/Freenet-style resource request and advertisement combined:
1. Dereference the name to a network address and key or certificate
  2. Send the request for the resource to the canonical origin
3. If the origin is reachable then retrieve the data, if not, broadcast the request to the peer network to source the constituent chunks from cache (basically go into torrent mode -- this will incur latency, of course)
4. Any responding nodes (including the origin if it was reachable in the previous step) will forward a list of other nodes that have recently accessed the same data resource
5. As the data arrives it is cached as "fresh" and is available to forward to any other node that requests it.
  6. The receiving node decides whether to make a broad torrent request to the list of nodes known to have recently accessed the data (combining torrenting with transfer from the origin in the case of large or streaming data resources)
7. Registry, public key, and certificate data is cached in a similar way.
Of course there are a million little details to tweak on the way to making that a highly responsive system, but that's the basics of it. There is also the matter of variable network transmission (direct TCP for those who know how to forward ports, UDP hole punching for those who don't, UPnP where it is available, proxying schemes for passing data around blocked intermediate routes, and so on -- every approach must be employed at once), and this is where a lot of the actual work needs to be put in.
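A skeletal sketch of the transport-selection idea, sequential for clarity (as noted above, a real implementation would race the approaches concurrently rather than try them one at a time). The method names (`direct_tcp`, `udp_hole_punch`, `upnp`, `proxy`) are placeholders for real NAT-traversal code, which is where the hard work actually lives.

```python
PREFERENCE_ORDER = ("direct_tcp", "udp_hole_punch", "upnp", "proxy")

def connect(peer, transports):
    """Try each available transport in preference order until one succeeds.

    `transports` maps a method name to a callable that attempts a connection
    to `peer` and returns a connection object, or None on failure.
    """
    for method in PREFERENCE_ORDER:
        attempt = transports.get(method)
        if attempt is None:
            continue  # this traversal method isn't implemented/available
        conn = attempt(peer)
        if conn is not None:
            return method, conn
    return None, None  # peer unreachable by every available means
```

For example, a node behind NAT with no port forwarding would fail `direct_tcp` and fall through to `udp_hole_punch`:

```python
transports = {
    "direct_tcp": lambda p: None,            # port not forwarded: fails
    "udp_hole_punch": lambda p: f"udp:{p}",  # hole punch succeeds
}
method, conn = connect("10.0.0.7", transports)
# method == "udp_hole_punch", conn == "udp:10.0.0.7"
```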
The bottom line is that there is no off-the-shelf system that can provide this, and there is no getting around the fact that getting the infrastructure *right*, and properly implemented, is critical to giving application developers the freedom to write apps on top of it in whatever way they might choose.