Conversation
| l.append(self._compressor.flush()) | ||
| return b''.join(l) | ||
|
|
||
| def _calculate_hash(self): |
There was a problem hiding this comment.
This is used in super().__init__()
I've added the comment below to explain why we need to override it (to hash the raw file, not the gzip stream).
|
|
||
| def _calculate_hash(self): | ||
| '''ensure the hash is over the raw file, not the gzip steam''' | ||
| b2 = blake2b(digest_size=16) |
There was a problem hiding this comment.
Is digest_size required here? Or just to ensure we're consistent?
There was a problem hiding this comment.
yeah, I we need both a hashing algorithm and a standard way to call it, otherwise we might have the same file twice with different hash lengths.
Actually, I'll probably encode the hash type and length with something like pymultihash.
| import zlib | ||
| import socket | ||
| import datetime | ||
| from pyblake2 import blake2b |
There was a problem hiding this comment.
Why not builtin? https://docs.python.org/3/library/hashlib.html#hashlib.blake2b
There was a problem hiding this comment.
I'm relying on planetlabs/datalake, which uses pyblake2
There was a problem hiding this comment.
I guess I could vendor this functionality actually and make the modifications directly.
| start = UTC("now").iso | ||
|
|
||
| if where is None: | ||
| where = socket.gethostname() |
There was a problem hiding this comment.
good point - I'll change to getfqdn
|
|
||
| cid = f.metadata["hash"] | ||
|
|
||
| fleetmeta = { |
There was a problem hiding this comment.
I'm inheriting from planetlabs/datalake File class, but we've changed our metadata structure, so this converts to the "fleet" version of metadata.
Ideally we'd fork/enhance this library with metadata 2.0 and distribute it, but using it as is provides a lot of value without much effort in the near term.
There was a problem hiding this comment.
Ah right.
At this point I don't actually see what this File class adds to us? Can you explain a bit?
There was a problem hiding this comment.
Yeah - I'm trying to leverage as much of the datalake infrastructure as possible, as it includes lessons learned and features developed over several years.
For example, datalake files have a tar bundle format, and an inotify-based auto-uploaded service. If we can re-use a lot of this work, we won't have to rewrite it from scratch.
There was a problem hiding this comment.
Except currently we're just doing a POST directly anyway?
I'm wondering what specifically this PR users from the File class of datalake
Maybe the answer is 'nothing yet'?
There was a problem hiding this comment.
I think the mismatch is that I was overriding the methods used in this diff to get streaming gzip working.
I've since moved that into a datalake fork. I think it's worth factoring out the metadata and tooling as it's a bit more complex than just normal REST api stuff.
Ideally this library would be very lean and not doing too much magic.
36689b5 to
c19b4db
Compare
This adds client-side functionality for interfacing with the missioncontrol files endpoints.
Note this requires us to land the changes into missioncontrol first.