Data format considerations

Added by koszko 3 months ago

With Repository_API more or less defined, we should now decide how content should be stored on-disk. I think we could make on-disk and network formats more similar than they are now...

Suggestions?


Replies (6)

RE: Data format considerations - Added by jahoti 3 months ago

I agree; in fact, using the filesystem API you discovered, it should be possible to use the Hydrilla format as-is for this purpose. That can also be used for the new repository format, just changing the contents of the index.json files.

As a partial aside, the Repository_API seems to imply script locations are resolved relative to the Hydrilla content directory, whereas the implementation resolves them relative to the script index.json file.

RE: Data format considerations - Added by koszko 3 months ago

As a partial aside, the Repository_API seems to imply script locations are resolved relative to the Hydrilla content directory, whereas the implementation resolves them relative to the script index.json file.

That's one of the differences between the network and on-disk formats we currently have. Hydrilla simply prepends the right directory name to script filenames. I agree that this might be a bit counter-intuitive.

What options do we have?

We could obviously define the format so that Hydrilla scripts are always referenced by content/-relative paths. In this case, however, we have redundancy - index.json is supposed to always specify paths that include its own directory. Allowing it to reference other directories would remove the redundancy, but we don't really want it to reference other dirs...

We could move all index.json's one directory higher and use content/-relative script paths in them, like this:

content/
  bsome_bag.json
  bsome_bag/
    script1.js
    script2.js
  bother_bag.json
  bother_bag/
    scripta.js
    scriptb.js
  pfirst_site.json
  psecond_site.json

and a JSON file like:

{
        "type" : "bag",
        "name" : "some_bag",
        "scripts": [{
                "location": "some_bag/script1.js",
                "sha256":   "e4dbe4dba40e8bd159fb987b0f0cf2c243d7e6b9b9dc792e58dedf1fae38b0a1"
        }, {
                "location": "some_bag/script2.js",
                "sha256":   "5099d27284c2257d2983450585cbd4bede6475519755508047e213d985cbc7c9"
        }]
}

cons:

  • If we add Hachette support for this data format, the user, when presented with a directory-selection window, will only be able to select the entire content/ and not single bags or pages within it. We could make up for that by giving the user fine-grained control over what to import in Hachette's UI, though.
  • The redundancy is still there.
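
To make the proposed layout more concrete, here is a minimal Python sketch of how a tool might check one such JSON file against the files next to it. The function name verify_bag is hypothetical, and it assumes every "location" is a content/-relative path that exists on disk; it is not how Hydrilla or Hachette currently work.

```python
import hashlib
import json
from pathlib import Path
from typing import List


def verify_bag(content_dir: Path, index_name: str) -> List[str]:
    """Return the content/-relative locations whose SHA-256 does not match.

    Assumption: the index JSON sits directly in content/ and every
    "location" in it is relative to content/ as well.
    """
    index = json.loads((content_dir / index_name).read_text())
    mismatched = []
    for script in index["scripts"]:
        data = (content_dir / script["location"]).read_bytes()
        if hashlib.sha256(data).hexdigest() != script["sha256"]:
            mismatched.append(script["location"])
    return mismatched
```

An empty result means every listed script matched its recorded hashsum.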

We could modify the above approach by requiring that a script's directory always equals name (or some prefix + name) and then modifying Hachette to respect that.
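
A rough Python sketch of that rule, where the directory is derived from the item itself and the JSON only needs bare filenames. The "b"/"p" prefix mapping is an assumption guessed from the bsome_bag and pfirst_site names in the listing above, and the helper names are made up for illustration.

```python
# Assumed mapping from item type to directory prefix; only the "b" and "p"
# prefixes visible in the listing are covered, and "page" is a guessed type name.
TYPE_PREFIXES = {"bag": "b", "page": "p"}


def item_directory(item: dict) -> str:
    """Directory name that an item must use: prefix + name."""
    return TYPE_PREFIXES[item["type"]] + item["name"]


def script_location(item: dict, filename: str) -> str:
    """Build the content/-relative path of one of the item's own scripts."""
    # Only bare filenames are accepted, so an item cannot point into
    # another item's directory.
    if "/" in filename or filename in ("", ".", ".."):
        raise ValueError(f"not a bare filename: {filename!r}")
    return f"{item_directory(item)}/{filename}"
```

With this, the directory part no longer has to be repeated in every "location", which is where the redundancy came from.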

Or, we could just entirely remove location from the network format and instead use script URLs that contain their hashsums. Unfortunately, this actually makes on-disk and network formats less similar...
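
For illustration only, hash-based URLs could look roughly like the Python sketch below. The /script/sha256-<digest> path scheme and both function names are hypothetical; nothing like this is defined in Repository_API yet.

```python
import hashlib


def script_url(repo_base: str, data: bytes) -> str:
    """Build a URL that identifies a script purely by its SHA-256 hashsum."""
    digest = hashlib.sha256(data).hexdigest()
    # Hypothetical URL scheme, chosen only for this example.
    return f"{repo_base.rstrip('/')}/script/sha256-{digest}"


def matches_url(url: str, data: bytes) -> bool:
    """Check downloaded bytes against the hashsum embedded in the URL."""
    return url.rsplit("sha256-", 1)[-1] == hashlib.sha256(data).hexdigest()
```

A client would then need no separate "location" or "sha256" fields, since the URL carries both pieces of information.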

Well, there is one reason we might actually want to allow bags to reference multiple dirs. That's in case we have multiple versions of the same bag and some of the scripts remain the same between versions while others don't... But is such little space saving worth the complication involved?

RE: Data format considerations - Added by jahoti 3 months ago

That's one of the differences between the network and on-disk formats we currently have.

Just to clarify, what exactly do you mean by "on-disk format"? I'm beginning to suspect that my interpretation, the Hachette import settings format, is incorrect, which would explain some confusion :).

Well, there is one reason we might actually want to allow bags to reference multiple dirs. That's in case we have multiple versions of the same bag and some of the scripts remain the same between versions while others don't... But is such little space saving worth the complication involved?

Not to mention it seems unlikely to be a problem that gets solved just once; every time a new feature is added, we would need to consider how to deal with a file belonging to multiple bags.

RE: Data format considerations - Added by koszko 3 months ago

That's one of the differences between the network and on-disk formats we currently have.

Just to clarify, what exactly do you mean by "on-disk format"?

In this particular case I meant Hydrilla's content/ dir format. I do not guarantee that I haven't called Hachette's JSON format this way on Hachettebugs before, though...

Well, there is one reason we might actually want to allow bags to reference multiple dirs. That's in case we have multiple versions of the same bag and some of the scripts remain the same between versions while others don't... But is such little space saving worth the complication involved?

Not to mention it seems unlikely to be a problem that gets solved just once; every time a new feature is added, we would need to consider how to deal with a file belonging to multiple bags.

A new feature to the original tool like Etherpad? Whenever a feature is added, the version of that tool would increase, so by default we would create a new bag for that version and only put the modified file there, without modifying the old bags. Although we could also do some sorcery like backporting of purely client-side features, which would indeed bring the issue you predict, it would by itself be something that assumes this kind of problem anyway...

RE: Data format considerations - Added by jahoti 3 months ago

Just to clarify, what exactly do you mean by "on-disk format"?

In this particular case I meant Hydrilla's content/ dir format. I do not guarantee that I haven't called Hachette's JSON format this way on Hachettebugs before, though...

That makes more sense; thanks for clarifying! The confusion was entirely mine, however; I don't think anyone has ever used the term anywhere on here before this thread.

Well, there is one reason we might actually want to allow bags to reference multiple dirs. That's in case we have multiple versions of the same bag and some of the scripts remain the same between versions while others don't... But is such little space saving worth the complication involved?

Not to mention it seems unlikely to be a problem that gets solved just once; every time a new feature is added, we would need to consider how to deal with a file belonging to multiple bags.

A new feature to the original tool like Etherpad? Whenever a feature is added, the version of that tool would increase, so by default we would create a new bag for that version and only put the modified file there, without modifying the old bags. Although we could also do some sorcery like backporting of purely client-side features, which would indeed bring the issue you predict, it would by itself be something that assumes this kind of problem anyway...

I was thinking more of new features to Hydrilla (or other components of the packaging infrastructure); having "cross-linked" bags means software must check and account for other bags that depend on files before deleting, moving, or modifying them. While such actions should probably be avoided anyway, making them even more risky seems like a very bad idea.

RE: Data format considerations - Added by jahoti 3 months ago

Just to clarify, what exactly do you mean by "on-disk format"?

In this particular case I meant Hydrilla's content/ dir format. I do not guarantee that I haven't called Hachette's JSON format this way on Hachettebugs before, though...

That makes more sense; thanks for clarifying! The confusion was entirely mine, however; I don't think anyone has ever used the term anywhere on here before this thread.

Well, there is one reason we might actually want to allow bags to reference multiple dirs. That's in case we have multiple versions of the same bag and some of the scripts remain the same between versions while others don't... But is such little space saving worth the complication involved?

Not to mention it seems unlikely to be a problem that gets solved just once; every time a new feature is added, we would need to consider how to deal with a file belonging to multiple bags.

A new feature to the original tool like Etherpad? Whenever a feature is added, the version of that tool would increase, so by default we would create a new bag for that version and only put the modified file there, without modifying the old bags. Although we could also do some sorcery like backporting of purely client-side features, which would indeed bring the issue you predict, it would by itself be something that assumes this kind of problem anyway...

I was thinking more of new features to Hydrilla (or other components of the packaging infrastructure); having "cross-linked" bags means software must check and account for other bags that depend on files before deleting, moving, or modifying them. While such actions should probably be avoided anyway, making them even more risky seems like a very bad idea (on top of the general complexity just from adding directory paths).
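
As a rough illustration of that extra bookkeeping, here is a Python sketch of the check every tool would need before touching a file once bags can share scripts. The function names and the idea of passing already-parsed index dicts are assumptions made for this example, not existing Hydrilla code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_reference_map(indexes: Iterable[dict]) -> Dict[str, Set[str]]:
    """Map each script location to the names of all bags that reference it."""
    refs: Dict[str, Set[str]] = defaultdict(set)
    for index in indexes:
        for script in index.get("scripts", []):
            refs[script["location"]].add(index["name"])
    return refs


def safe_to_delete(location: str, owner: str, refs: Dict[str, Set[str]]) -> bool:
    """True only if no bag other than `owner` references the file."""
    return refs.get(location, set()) <= {owner}
```

With one directory per bag, this whole scan is unnecessary, because a file can only ever belong to the bag whose directory it is in.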
