Dynamic Document Networks

September 12, 2017

Dynamic Document Networks

So this one is a bit far fetched and hypothetical. I will only briefly cover the idea here, and then link to the "official" documentation that I wrote up which get's more technical and much more lengthy.

Many of us think of the internet as "wild west" of sorts where anybody can author an contribute content. We view it as a place for the everyman, where large corporations are humbled and individuals are empowered.

At least, that is what we use to think. More and more we are realizing that the internet is another place where power, control and content is centralized. What percentage of videos watched over the internet come from somewhere other than Youtube, Netflix and maybe a half dozen other providers? How much social interaction happens outside of Facebook, Twitter and again, maybe a half dozen or so other providers? We could go on, but they key is that while the internet radically transformed how what we do, see and read is delivered to us, it has not distributed that power as much as we may think.

I am not claiming that this is necessarily a bad thing, nor do I want to discredit the positive effects that the internet has had. However after realizing this it caused me to think about other ways to solve some of these issues. I wrote up a semi technical description of what I came to call, "Bidirectional Network of Dynamic and Distributed Documents". The "living version" of this document is here. Below is the version that I had at the time of this post.

Bidirectional Network of Dynamic and Distributed Documents

Assumptions

Everyone's computer has a port forwarded to accept incoming connections to the network application that will be running on their computers, or otherwise can connect to and receive TCP/IP connections.

The application will be running the majority of the time. Downtimes on any one computer will cause major disruptions in the DDN network.

Terminology

Note: We use the term “network” and “map” interchangeably as well as the terms “node” and “computer”

We will call this idea a Bidirectional Network of Dynamic and Distributed Documents.

We will call the application the Distributed Document Network (DDN).

The Idea

Computer A has a document that has some associated metadata which defines how that document can be modified, as well as how to deal with change conflicts.

Computer B requests that document from computer A. Both computer A and computer B add an entry to their records for one another. This entry is specific to this document only.

Computer C requests that document from computer B. Again, both computer B and C add an entry to their records for one another. This entry is specific to this document only.

This process continues so that new computers request the document from other computers, and each computer remembers these connections.

When a computer updates that documents data (only in the prescribed manner based on the metadata) the computer updates neighboring computers with the change. Those computers in turn will update all of the computers they are neighbors to, except the one from which the update came. If a computer updates the information in a manner that is not allowed, the computers it updates will not propagate the change.

Unresolved Problem: How do you know that the metadata itself has not been tampered with? Well, once a computer has received the metadata it can assume that the metadata will not change, and therefore if a computer modifies the metadata it will not be able to send changes upstream because upstream computers will not accept the changes. However downstream computers will not have tampered metadata and will not be able to communicate with computers who have the original metadata. To prevent computers from modifying the metadata and thereby corrupting computers downstream, a digital signature could be associated with the meta data to ensure that it has not be tampered with. However the problem of public key agreement exists, as described in the admin access problem.

Undesirable Resolution: Ignore the problem and allow computers to modify the metadata. Hopefully the effect will be limited if the document network is big enough to discourage computers from creating subnetworks downstream from itself by changing the metadata.

Conflict Resolution

If two computers change the data at the same time, and those changes reach one another at another computer, that computer must resolve that conflict according to the metadata.

Metadata Modification Rules and Resolution Rules Example

If a list of comments is the modifiable section then you might have rules that say you can only add comments after the previous one's and the order must remain the same, and each comment must have atleast a message and a timestamp that is later than the previous comment but not after the current time. Then the resolution metadata might say that if you receive updates from two different computers that both added the “next” comment, order the comments based on the timestamp.

Modeling the network

Notice that we have a single undirected tree where each node is a computer and each edge is an agreement between two computers to update one another when the document changes. These edges are created when one computer asks another computer for the document. Each node is responsible for updating each node that it is connected to with new data. Every time their is a data change the changes begin propagating throughout the tree, as long as the changes are valid (based on the metadata). If a node has its edge data corrupted or if a node begins to misbehave, then this undirected tree becomes a directed tree where every edge is a two way edge except for the one's involved in some sort of data corruption. In order to always have a valid tree represented by the nodes, it is best to represent the tree as a directed tree where every edge should go both directions, and any edge that does not go both directions is a result of either a node having corrupted data or a node misbehaving. Also notice that this map is specific to a document. Each document has a map of the interconnected nodes that are sharing that document. If a map is unconnected then it is a result of either two neighboring nodes misbehaving or having corrupted edge data, or it is a result of two documents of the same name being created by two different nodes. These two documents, for our purposes, are considered the same document (because the name is the same) even though they probably are entirely unrelated.

Knowing when to stop propagation

If a node makes an update, it updates every neighboring node, which all do the same. This two way data flow is problematic, how do you know when to stop? When a node receives an update it only propagates the update the first time it comes through. Depending on the map the update might come back to the node, but it won’t be propagated from this node again. You can check this by hashing the update message and storing that hash value in a table, and check incoming updates against that table.

On second thought with this even be a problem? If the map is a tree, then there are no loops and so an update should never “loop” back around.

Preventing spam updates

The document metadata can specify how often any given computer can update the document. Then, neighboring nodes would be responsible for not propagating changes that are not allowed based on the time constraints. These constraints could be document wide or specific to certain areas. For example you might put in a constraint on the comment section to only allow computers to put in one comment per minute. Some sort of document wide limit might be added to make sure that random modification spams don’t clog the network.

Admin access

Digital signatures can be used to give certain computers admin rights on the document. In this way certain computers could make any change they wanted to the document. They could add new sections with new modification and resolution metadata, change existing metadata or even rewrite history. When an admin level computer makes a change outside of the prescribed changes allowed by the metadata, they would send a note to the neighboring computers claiming that this change is an admin change along with a signature computed with the message and private key; signature=Sign(message, private_key). Those computers would then use the signature and public key to validate that the signature was valid; Vrfy(signature, public_key). This update, along with the signature, would then be propagated throughout the network.

Unresolved Problem: Everyone needs to agree on a public key. So when the document is created the creator would also create a public and private key so that he can use them in a digital signature scheme, and therefore have admin rights on this document. He would send the document along with the corresponding public key. Other computers could modify this public key to a new public key (that has a corresponding private key that they know) and then any computers downstream would think that this computer was the admin.

If this problem is not resolved then essentially the admin access idea would need to be scrapped, and therefore once a document is built it’s metadata cannot be changed.

The problem boils down to public key agreement. If you can figure out how all the computers in the network can agree on a public key without any one of the computers being able to tamper with the key agreement process then you have solved this problem.

The public key could be a hash of the original contents of the document. So that while the document could change, the public key is always based on the original contents. This doesn’t really get us anywhere, because malicious nodes could change both the document and the public key, so that downstream nodes wouldn’t know what the truly original document / public key pair was.

Undesirable Solution: Instead of passing the public key along with the document, store it in some public location. Then when a node receives an admin update and does not have the public key available, it is the user's responsibility to go and retrieve the known-to-be-correct public key and give that value to the software. Then the software can check the update’s signature and make sure it is correct before applying the update and propagating it. This is undesirable because the solution comes from an external source, a centralized location, and is not built into the scheme, which breaks every benefit of the DDN. Also anyone could claim to be the true admin and publish a public key for the document, and if some people use one public key and others use a different public key, you would once again break the network into two halves. Also anyone with a large enough following could try to convince people to use their public key, thereby gaining admin access to the document.

Backend “Server” Logic / Database

With this system the document seems to be “static” and it seems like there is no server logic running on each request. However this is not as true as you might think. Each computer is capable of modifying the document based on the metadata. When a computer modifies the document it is acting like the logic that the server would have otherwise been doing. The document needs no backing database because it is being updated with new data by each individual computer in the system. So this “server logic” is not lost, it is just not as explicit anymore. The metadata on the document defines this logic. We have given the example of having a document with a comments section. This document is not a “static” document but a growing, changing, dynamic document. This is just one example of how these documents might be dynamic but you could do much more.

Document Naming

Say that a give node A creates a document named “foo” and through other nodes sending requests for it this document is sent around. Say that at the same time a node B creates a document named “foo” and also sends it around. You will now have two maps of distinct nodes, one for each document, but both documents are named “foo”. Say at some point a node from one map asks for the “foo” document from a node in the other map. The documents are, in reality, two separate entities with completed unrelated metadata. There is going to be a strong conflict on this edge, each node thinking that the other node is wrong every time it tries to send an update. Computers in the first map will not have access to the document in the other map and vice versa, unless they first drop out of the map they are currently in, and then switch over to the other map by requesting the document from a node of that other map. For this reason if two documents have the same name there may be distribution problems, and for this reason are also considered a single document for our purposes. Therefore it is suggested to have a very unique document name, possible by prefixing the document with some sort of random string. However this is not required because there is no way to enforce the rule.

Document Name Attack

Document naming gives way to an attack. If I want to create a new document named “foo” and prevent naming collision by naming the document “fo93nga93bcj/bar” (i.e. random string as the first element in the full pathname), an attacker could see this being distributed and then start distributing a document of the same name. Anyone who accidently got added into the attackers document map would then be unable to see my document unless they first removed the attackers document and then sent a request to someone that had my document.

Unresolved Problem: I don't see a way of preventing the document name attack.

Documents need to reference one another

If a document A references another document B, the current node should be expected to request document B from the same node that it has stored in its table for document A. There should be some way to specify if the name is relative based on the current document or absolute from the root level. That way you can reference nested documents with a relative path or any document with a complete name.

Problem: What if no one has the document and each node sends back to the original node that they don’t have it and they should keep looking? The algorithm never quites.

Resolution: Required Resource + Timeout as explained in the “Document Search Loop” document. Also quoted here: The original node that made the modification to document X which added the reference to document Y is required to have document Y loaded. As long as edges in the map are not destroyed in unexpected ways node A should always find a node that has document Y. However their is the possibility of corrupted edge data and misbehaving nodes and therefore a timeout is still needed.

Metadata Specified Document creation

When a node edits a document, they should also be allowed to create new sub documents, but this ability is defined and limited by the metadata of the document. When they create a change to a document that involves adding a new document, that new document is put at the correct location, then when that change is propagated both documents are sent along. The other computers check to make sure that the change is allowed and continue to pass along both sets of data. One global rule however is that a change to a document can only add new documents to a file name that is deeper in the path hierarchy than the current document.

Problem: I see a problem growing however, and that is that if a document grows, then everyone will need to maintain a complete copy of that document. If a document has a big listing of large image files, there should be a way to do this such that not everyone needs to have the complete list.

Resolution: When the space a node has to store data fills up it can begin to drop documents. The document network data is kept but the document itself can be dropped. When a request comes through for a document which this node is apart of (i.e. it is in the document network) but that it doesn’t have loaded, it should request the document from a neighboring node in the network. That way the entire set of documents does not need to live in its entirety on a single node except for the node requesting the entire document.

Lazy Loading

Document references in a document should have the ability to be lazy loaded. Their might need to be a DDN protocol level API for loading in document references and using the returned data (similar to what AJAX is used for in HTTP). This way a list of document references could be defined in the document and then this DDN protocol could be used for loading in these documents after the document itself is loaded.

Linking

You could link to a document from an already circulated document. So if you have a document A that is already highly circulated then you could update that document with a link to the new document B. Nodes should ask for document B from neighboring nodes in the document A network. If the neighbor it asks does not have the document, it is that neighbors responsibility to do the same. Once a node is found that has the document, that node should load that document in (or at least build map linkages for document B, so that it has it available for future requests) and then pass along the document. All the nodes in the chain should do the same (that is, build the map linkages for document B). Finally the original node will get the document, and a nice distributed map will have been built for document B.

Exploring

Nodes should be able to “publish” documents. Other nodes should be able to request the hierarchy of published documents that live on another node. Then request an actual document if desired.

Extra Edge Creation

When you request a document for the first time that node will either respond with an error saying that it has no such node, or it will let you know that it has hit its connection capacity and will give a list of neighboring nodes to request the document from, or it will respond with the document itself. Regardless of all this, at some point you will get to a node that has the document and both you and the other node will build a connection to one another. After this however there is nothing in the protocol which builds new connections to you unless more nodes request the document from you, or filled nodes pass the responsibility off to you. This will create a tree structure in which data flows in both directions. This is okay except for the fact that their is no redundancy. If a single node goes down then the network is split and changes will not propagate throughout the network.

Unresolved Problem: Their needs to be some sort of mechanism for building connections between nodes that are both in the network, but are far apart (for best possible redundancy).

Privacy

When a node connects itself to a document network by requesting that document, thereby attaching itself to that document hierarchy, that document is now publicly viewable as having been accessed by that node. This is not a new statement, but a result of the setup as it now exists. If I want to know whether node A has accessed document “foo” all I have to do is request that document from node A. If node A replies that it has then I know that node A accessed that document. This has very little privacy. Now I can’t get a complete list of all of the documents that node A has, but one by one I can request documents and see if node A has them. A misbehaving node that is not following the protocol could request documents but not build the edge data for the document, thereby erasing the history that it has ever requested the document, but for the DDN network to work nodes need to obey the network creation protocol and should not do this “private browsing”. Also, while this private browsing makes erasing the edge data on the requesting node, that is only half of the history. The node that it asked for the document from also built the edge information (assuming it is obeying the rules), and so the history that node A asked for “foo” still exists.

Limiting Responsibility

A node should be able to limit the number of connections that it will build for a single document. If a node is at its full capacity, and other computer request the document, instead of replying with the document and building a connection with this new node for that document, it will instead reply with a list of nodes that it is connected to. In order to get the document the requesting node must then choose a node from that list. This continues until a node is found that has an available connection for the document. In this way a given node is only responsible for propagating updates to a certain number of nodes, but if it wants to receive updates then it must propagate the update to at least one node otherwise it won’t ever get updates itself. Also if the node wants a level of redundancy in case its single neighboring node goes down, it should build connections with other nodes for that document. However the process used to build these redundant connections is still an unanswered problem.

Local Content

Some document modifications can be specified as “local”. That is, each computer is allowed to make certain local changes to the document (again, based on the metadata). Each computer is responsible for following the rules, but of course if they don’t follow the rules they are only hurting themselves.

Example: A document might have a “friends list” section. In this section you could be allowed to add an entry that represents a friend. This section is marked “local” by the meta data, so changes that computers make to this section are not passed on to other computers, and if a computer were to pass on these changes, neighboring computers would drop/ignore them.

Data Storage

A computer keeps as many documents around for as long as it can, but if it run out of space to store the documents then it can drop the document data, but hold on to the document name and map information. If the user ever wants to go back to that document it simply asks a neighboring node for the document. If a node is asked for a document that it has in its table but does not currently have the data for, it asks a neighboring node and then passes the information back along to the original requester.

Search This Blog

Alex Lockhart

Dynamic Document Networks

Popular Posts

How to Draw an Isometric Village

How to Draw an Isometric Forest