More text changes
mitsuhiko committed Jul 13, 2023
commit b0a1340c0efed26b7117ef6f088647baa29c3830
65 changes: 55 additions & 10 deletions text/XXXX-filestore-new.md
@@ -28,23 +28,68 @@ done with the internal abstractions and how they should be used.

# Background

The primary internal abstraction in Sentry today is the `filestore` service, which is itself
built on top of Django's `files` system. At this level "files" have names and are
stored in a specific GCS bucket (or an alternative backend). The `files` models are built
on top of that. There, each file is assembled from blobs, and each blob is stored
(deduplicated) just once in the `filestore` backend.

For this purpose each blob is given a unique filename (a UUID). Blobs are deduplicated
by content hash and only stored once. This poses a challenge: because a blob can be shared
by several files, auto-expiration is no longer possible and the deletion of blobs has to be
driven by the system itself.
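
To make the deletion problem concrete, here is a minimal in-memory sketch of content-hash deduplication. It is not Sentry's implementation: the `DedupBlobStore` class, its methods, and the choice of SHA-1 are assumptions made purely for illustration.

```python
import hashlib
import uuid


class DedupBlobStore:
    """Hypothetical in-memory stand-in for a deduplicating blob backend."""

    def __init__(self):
        self.objects = {}      # storage name (UUID) -> bytes
        self.by_checksum = {}  # content hash -> storage name
        self.refcount = {}     # storage name -> number of files using the blob

    def put(self, data: bytes) -> str:
        # SHA-1 is an assumption here; any content hash behaves the same way.
        checksum = hashlib.sha1(data).hexdigest()
        name = self.by_checksum.get(checksum)
        if name is None:
            # First time this content is seen: store it under a fresh UUID.
            name = uuid.uuid4().hex
            self.objects[name] = data
            self.by_checksum[checksum] = name
            self.refcount[name] = 0
        self.refcount[name] += 1
        return name

    def release(self, name: str) -> None:
        # A shared blob may only be deleted once nothing references it.
        self.refcount[name] -= 1
        if self.refcount[name] == 0:
            data = self.objects.pop(name)
            del self.by_checksum[hashlib.sha1(data).hexdigest()]
            del self.refcount[name]
```

Because two uploads of the same content share one stored object, a time-based expiry on that object could remove data that a newer file still needs. Only the reference bookkeeping knows when deletion is safe.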

# Supporting Data

We currently store petabytes of file assets we would like to delete.

# Possible Changes

These are some plans about what can be done to improve the system:

## Removal of Blob Deduplication

Today it's not possible for us to use GCS-side expiration. That's because without
knowledge from the database of how blobs are used, it's not safe to delete blobs. This
can be resolved by removing deduplication; blobs would then be written more than once.
This works on the `filestore` level, but it does not work on the `FileBlob` level.
However, `FileBlob` itself is rather well abstracted away from most users. A new model
could be added to replace the old one. One area where `FileBlob` leaks out is the
data export system, which would need to be considered.

`FileBlobOwner` itself could be fully removed, and the same goes for `FileBlobIndex`: once
deduplication is removed the owner info is no longer needed, and the index
info can be stored on the blob itself.

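```python
# Import paths are assumed; adjust them to the actual module layout.
from django.db.models import CharField, DateTimeField, TextField
from django.utils import timezone
from sentry.db.models import BoundedBigIntegerField, BoundedPositiveIntegerField, Model


class FileBlob2(Model):
    organization_id = BoundedBigIntegerField(db_index=True)
    path = TextField(null=True)
    offset = BoundedPositiveIntegerField()
    size = BoundedPositiveIntegerField()
    checksum = CharField(max_length=40, unique=True)
    timestamp = DateTimeField(default=timezone.now, db_index=True)
```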

## TTL Awareness

The abstractions in place today do not have any support for storage classes. However, once
blobs are no longer deduplicated it would be possible to fully rely on GCS to clean up on its own.
Because certain operations go through our filestore proxy service, it would be preferable
if the policies were encoded into the URL in one form or another.
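
One possible way to encode a policy into the URL is a path prefix, so that both the proxy and a bucket-side cleanup rule can recognize it without a database lookup. The sketch below only illustrates that idea; the policy names, durations, and helper functions are hypothetical and not part of the RFC.

```python
from datetime import timedelta

# Hypothetical retention policies; names and durations are placeholders.
POLICIES = {
    "ttl-30d": timedelta(days=30),
    "ttl-90d": timedelta(days=90),
}


def blob_path(policy: str, organization_id: int, blob_id: str) -> str:
    """Build a storage path whose first segment names the retention policy."""
    if policy not in POLICIES:
        raise ValueError(f"unknown retention policy: {policy}")
    return f"{policy}/{organization_id}/{blob_id}"


def policy_for_path(path: str) -> timedelta:
    """Recover the retention period from a path produced by blob_path()."""
    prefix = path.split("/", 1)[0]
    return POLICIES[prefix]
```

With this layout the proxy can reject or route a request from the path alone, for example `policy_for_path(blob_path("ttl-30d", 1, "abc"))` yields a 30-day retention period.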

## Assemble Staging Area

Chunk upload today depends on the ability to place blobs somewhere one at a time. Once blobs are
stored regularly in GCS, there is no significant reason to slice them up into small pieces, as
range requests are possible. This means that the assembly of the file needs to be reconsidered.
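
As a small illustration of why slicing becomes unnecessary: given an offset and a size (like the fields on the proposed `FileBlob2` model), a reader can request just that byte window of a larger object with a standard HTTP `Range` header. The helper name below is hypothetical.

```python
def range_header(offset: int, size: int) -> dict:
    """Return an HTTP Range header selecting `size` bytes starting at `offset`."""
    if offset < 0 or size <= 0:
        raise ValueError("offset must be >= 0 and size must be > 0")
    # HTTP byte ranges are inclusive on both ends, hence the -1.
    return {"Range": f"bytes={offset}-{offset + size - 1}"}
```

For example, `range_header(1024, 512)` selects bytes 1024 through 1535 of the stored object.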

The easiest solution here would be to allow chunks to be uploaded to a per-org staging area where
they linger for up to two hours per blob. That gives plenty of time to use these blobs for
assembly. A cleanup job (or a TTL policy if placed in GCS) would then collect the leftovers
automatically. This also decouples external blob sizes from internal blob
storage, which gives us the ability to change blob sizes as we see fit.
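
The staging idea can be sketched as follows. This is an in-memory model under stated assumptions, not a proposed implementation: the `StagingArea` class, its method names, and keying chunks by `(org_id, checksum)` are all hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Two hours, as suggested above.
STAGING_TTL = timedelta(hours=2)


class StagingArea:
    """Hypothetical per-org staging area for uploaded chunks."""

    def __init__(self):
        self.chunks = {}  # (org_id, checksum) -> (data, uploaded_at)

    def upload(self, org_id: int, checksum: str, data: bytes, now: datetime) -> None:
        self.chunks[(org_id, checksum)] = (data, now)

    def assemble(self, org_id: int, checksums: list, now: datetime) -> bytes:
        """Concatenate staged chunks in order; fail if any is missing or expired."""
        parts = []
        for checksum in checksums:
            entry = self.chunks.get((org_id, checksum))
            if entry is None or now - entry[1] > STAGING_TTL:
                raise KeyError(f"chunk {checksum} is missing or expired")
            parts.append(entry[0])
        return b"".join(parts)

    def cleanup(self, now: datetime) -> int:
        """Drop chunks older than the TTL and return how many were removed."""
        expired = [key for key, (_, ts) in self.chunks.items() if now - ts > STAGING_TTL]
        for key in expired:
            del self.chunks[key]
        return len(expired)
```

The assembled bytes can then be written to permanent storage in whatever blob size the backend prefers, independent of the chunk size the client uploaded.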

# Unresolved questions

TBD