
Conversation

dimitri-yatsenko (Member) commented Oct 20, 2017

dimitri-yatsenko changed the title from "Fixed #264, #367, #375" to "[WIP] External storage, enhancements, bugfixes" on Oct 20, 2017
dimitri-yatsenko changed the title from "[WIP] External storage, enhancements, bugfixes" to "External storage, enhancements, bugfixes" on Nov 5, 2017
coveralls commented Nov 5, 2017

Coverage decreased (-2.5%) to 89.414% when pulling 90c021e on dimitri-yatsenko:master into 8ea75a6 on datajoint:master.

coveralls commented Nov 6, 2017

Coverage decreased (-2.4%) to 89.492% when pulling 7302571 on dimitri-yatsenko:master into 8ea75a6 on datajoint:master.

coveralls commented Nov 6, 2017

Coverage decreased (-2.4%) to 89.492% when pulling 569881a on dimitri-yatsenko:master into 8ea75a6 on datajoint:master.

ixcat commented Nov 6, 2017

Generally speaking, this seems reasonable to me.
I'll be doing more direct testing as part of the S3 feature (#204).

Thoughts:

  • If we are definitely leaving external file cleanup as a separate action, it would be good to have a reference implementation of it, even if it is simple.
  • I think it would be a good idea to abstract the low-level external storage functions (currently get/put) into a separate class hierarchy that is called from the main ExternalTable operations. This would clearly encapsulate which methods are required to create ExternalTable 'backends', and would also allow for reuse among backends (e.g. S3 and any other 'WAN' backends could conceivably use operations defined by the plain-file backend to manage the local cache). I'd probably go with an abstract or plain-file class as the base and have a {'backendname': BackendClass} dict statically configured as a class property of ExternalFile, at least initially, but that's me.
  • Unless I am mistaken, the blob is currently being written outside of any serialization of operations on the specific external record file; this could conceivably create a race condition in some (rare) corner cases. I am admittedly not yet 100% up to date on DJ's locking mechanisms and how they might relate to, say, two processes writing the same external blob when using populate in parallel on different keys that generate the same externally stored result in a related table. For POSIX the data should be coherent (though possibly with incorrect timestamps, if a slower writer finished after the faster writer overwrote its data and committed the timestamp of the now-replaced file to the database), but I am not sure this is well defined on other systems or that their semantics would match. This is quite possibly paranoia, but it might be a good idea to create a temporary '-lock' file and either stall or raise an error if it exists, or to define a similar locking protocol tracked on the database side. In previous discussions I had assumed external file transactions would be serialized; I'm now not sure that is happening, so I'm mentioning it here.
  • If the database insertion of the external file record fails, we are left with a dangling blob. While this is a rare case and should be caught by the 'external cleanup' mechanism, it might be useful to handle cleanup here and then re-raise the error.
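The backend abstraction proposed in the second bullet could be sketched roughly as follows. All class, method, and registry names here are hypothetical illustrations, not existing DataJoint API, and the in-memory dict stands in for real file I/O:

```python
# Sketch of the proposed backend hierarchy; names are hypothetical.
from abc import ABC, abstractmethod


class ExternalBackend(ABC):
    """Abstract base declaring what an ExternalTable 'backend' must provide."""

    @abstractmethod
    def put(self, blob_hash, blob):
        """Store the serialized blob under its content hash."""

    @abstractmethod
    def get(self, blob_hash):
        """Retrieve the serialized blob by its content hash."""


class FileBackend(ExternalBackend):
    """Plain-file backend; a WAN backend (e.g. S3) could reuse it as a local cache."""

    def __init__(self, location):
        self.location = location
        self.store = {}  # stand-in for files under `location`

    def put(self, blob_hash, blob):
        self.store[blob_hash] = blob

    def get(self, blob_hash):
        return self.store[blob_hash]


# Statically configured registry, as suggested: {'backendname': BackendClass}
BACKENDS = {'file': FileBackend}
```

A real plain-file backend would read and write files under `location`; an S3 backend could subclass `FileBackend` to reuse its cache handling, which is the reuse the bullet above alludes to.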


# serialize object
blob = pack(obj)
hash = long_hash(blob) + store[len('external-'):]
Review comment:

store[len('external-'):] seems incongruent w/r/t line 88: store = 'external' + ('-' if store else '') + store, which could conceivably allow bare external; that said, using only 'external' as a specifier is already blocked in declare.py, so I'm not sure how important this is. Perhaps line 88 should throw an error rather than allowing 'external' only?

Member Author:

The idea is to allow either external or external-something, where something cannot exceed 8 characters. external by itself should not be blocked; if it is, that's my mistake and I will fix it and add a test for it. Many people will have only one external storage and will just use external.

Member Author:

Why is store[len('external-'):] incongruent? You can do a[n:] where n exceeds the size of a with no problem.
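A quick demonstration of that slicing behavior: a Python slice never raises when its start index exceeds the string length; it simply yields an empty string.

```python
# len('external-') is 9, which exceeds len('external') == 8
store = 'external'
suffix = store[len('external-'):]
assert suffix == ''  # out-of-range slice yields the empty string

store = 'external-raw'
assert store[len('external-'):] == 'raw'  # named store keeps its suffix
```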

Reply:

Ah, OK. On first read I took this to mean there was an explicit expectation that the slice would return something to append to the hash, which was then "confirmed" when I saw that declare.py requires the external- convention (via the isidentifier check on the secondary portion) and throws exceptions suggesting this is a requirement.

Not allowing the bare 'external' in some cases (here and in declare.py) while allowing it in another (line 88) is why I proposed one was incongruent with respect to the other.

Reply:

If 'external' alone is allowed, it looks like the store_name parts from declare (line 191+) need to be altered:

>>> storestr='external'
>>> storestr.split('-')
['external']
>>> store_name = storestr.split('-')
>>> store_name[1:]
[]
>>> ''.join(store_name[1:])
''
>>> ''.join(store_name[1:]).isidentifier()
False

Member Author:

I see. It should be isidentifier or empty.
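A minimal sketch of that "isidentifier or empty" check; the helper name is illustrative, not the actual declare.py code:

```python
def is_valid_store_suffix(suffix):
    # Accept the bare 'external' store (empty suffix) or a valid identifier.
    return suffix == '' or suffix.isidentifier()


assert is_valid_store_suffix('')              # bare 'external'
assert is_valid_store_suffix('raw')           # 'external-raw'
assert not is_valid_store_suffix('my store')  # spaces are not identifiers
```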

ixcat commented Nov 7, 2017

Concerning preparation for #204:
In my s3 branch I've refactored the file-handling operations into a separate class hierarchy, which should help keep things more modular. The specific commit is:

ixcat@2729fd5

(other commits in that branch relate to other parts of S3 integration)

Currently defined tests are passing locally.

If you agree with this approach, it might make sense to get that mechanism sorted out and merged into your branch with this PR; alternatively, we can tack it on afterwards. Either way, it's good to get the discussion started.

return arg, _negate
elif isinstance(arg, AndList):
return '(' + ' AND '.join([make_condition(element)[0] for element in arg]) + ')', _negate
elif arg is True or arg is False:
Contributor:

Much prefer isinstance(arg, bool)
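For built-in values the two checks agree, since bool cannot be subclassed in Python; isinstance(arg, bool) is simply the more idiomatic spelling:

```python
# Both checks accept exactly the two boolean singletons and reject
# everything else, including the integers 0 and 1.
for value in (True, False, 1, 0, 'True', None):
    identity_check = value is True or value is False
    type_check = isinstance(value, bool)
    assert identity_check == type_check
```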

raise DataJointError('External attributes cannot be primary in:\n%s' % line)
store_name = match['type'].split('-')
if store_name[0] != 'external':
raise DataJointError('External store types must be in format external-<name>')
Contributor:

Should say: in format external or external-<name>.

if store_name[0] != 'external':
raise DataJointError('External store types must be in format external-<name>')
store_name = '-'.join(store_name[1:])
if not store_name.isidentifier():
Contributor:

This will reject bare external; the check needs to be updated to not (store_name == '' or store_name.isidentifier()).

"Cannot restrict in place a projection with renamed attributes."
if isinstance(restriction, AndList):
self.restrictions.extend(restriction)
:param as_or: if True then the applied restriction becomes or-listed with already existing conditions
Contributor:

This is documented but not present; either implement it or update the doc.

return self & Not(restriction)

def restrict(self, restriction):
def restrict(self, arg):
Contributor:

I liked it better to call the parameter restriction rather than the arbitrary arg.

from . schema_external import schema


def test_external_put():
Contributor:

We need to add a test case for the pure external store in addition to external-xxx formatted configs.
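A test along these lines could cover both specifier forms. The helper below is an illustrative stand-in for the store-name parsing in declare.py, not the actual code:

```python
def split_store(store):
    """Split 'external' or 'external-<name>' into its suffix."""
    head, _, suffix = store.partition('-')
    assert head == 'external'
    return suffix


def test_store_specifiers():
    assert split_store('external') == ''         # bare external store
    assert split_store('external-raw') == 'raw'  # named store


test_store_specifiers()
```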

@@ -1,5 +1,6 @@
from nose.tools import assert_equal, assert_false, assert_true
from datajoint.declare import declare
from datajoint import AndList, OrList
Contributor:

Why import AndList and OrList here? They appear to be unused.

logger.info('Found %d keys to populate' % len(keys))
for key in keys:

make = self._make_tuples if hasattr(self, '_make_tuples') else self.make
Contributor:

We should throw a deprecation warning when we find _make_tuples.
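If deprecation were pursued, the standard mechanism would be warnings.warn with DeprecationWarning at the point where _make_tuples is detected. This is a sketch only; the helper name is hypothetical, and as the follow-up comment notes, deprecation was ultimately held off:

```python
import warnings


def resolve_make(table):
    """Return the make callable, warning if the legacy name is used."""
    if hasattr(table, '_make_tuples'):
        warnings.warn(
            '`_make_tuples` is deprecated; rename it to `make`.',
            DeprecationWarning, stacklevel=2)
        return table._make_tuples
    return table.make
```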

Contributor:

We concluded that we are holding off on deprecating _make_tuples.

coveralls commented Nov 13, 2017

Coverage decreased (-1.8%) to 90.155% when pulling 86c2480 on dimitri-yatsenko:master into 8ea75a6 on datajoint:master.

coveralls commented Nov 14, 2017

Coverage decreased (-1.8%) to 90.181% when pulling 6f7c6bd on dimitri-yatsenko:master into 8ea75a6 on datajoint:master.

coveralls commented Nov 15, 2017

Coverage decreased (-1.8%) to 90.181% when pulling 8018a9d on dimitri-yatsenko:master into 8ea75a6 on datajoint:master.

coveralls commented Nov 15, 2017

Coverage decreased (-1.8%) to 90.181% when pulling 7627569 on dimitri-yatsenko:master into 8ea75a6 on datajoint:master.
