Conversation

@mckinsel

`SwaggerClient` discovers methods by reading a Swagger document, creating
an instance of `_ClientMethodFactory` for each path in the document, and
assigning those instances as class attributes of `SwaggerClient`.
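
A minimal sketch of why that matters, with illustrative class bodies
rather than the real implementation: attributes set on a class are
shared by every instance, and therefore by every thread.

```python
# Illustrative only: the real _ClientMethodFactory and SwaggerClient do more.
class _ClientMethodFactory:
    def __init__(self):
        self._context_manager_response = None  # per-request state


class SwaggerClient:
    pass


# What discovery effectively does: one factory per Swagger path,
# attached to the class itself.
setattr(SwaggerClient, "get_file", _ClientMethodFactory())

client_a, client_b = SwaggerClient(), SwaggerClient()
assert client_a.get_file is client_b.get_file  # one shared object
```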

This means that a single instance of each `_ClientMethodFactory` is shared
across threads. That is not so good, because a `_ClientMethodFactory`
stores state associated with the request it is handling. In particular,
consider the following scenario:

    Thread-1: client = hca.dss.DSSClient()
    Thread-2: client = hca.dss.DSSClient()
    Thread-1: with client.get_file.stream(...) as handle:
    Thread-2: with client.get_file.stream(...) as handle:
    Thread-1: handle.raw.read()

Depending on how the two calls to `__enter__` interleave, Thread-1 may have
just read from the file meant for Thread-2! Going on,

    Thread-1: __exit__ the with block
    Thread-2: handle.raw.read()
    Thread-2: __exit__ the with block
    Exception: NoneType has no attribute close()

Or you know, something like that.

This change makes the methods instance attributes.
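
Roughly, that means moving the `setattr` from the class to `__init__`;
a sketch, continuing the illustrative classes above:

```python
class SwaggerClient:
    def __init__(self):
        # A fresh factory per instance, so two clients held by two
        # threads no longer share request state.
        setattr(self, "get_file", _ClientMethodFactory())


client_a, client_b = SwaggerClient(), SwaggerClient()
assert client_a.get_file is not client_b.get_file
```

Two threads sharing a single client instance would still share one
factory; the review discussion below touches on that remaining hazard.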

mckinsel added a commit to HumanCellAtlas/matrix-service that referenced this pull request Oct 24, 2018
calvinnhieu pushed a commit to HumanCellAtlas/matrix-service that referenced this pull request Oct 24, 2018
* Parallelize reads from DSS in workers

* Use thread-safer DCP interface

This incorporates the change in HumanCellAtlas/dcp-cli#202

* Include thread id in logs

This is helpful in our multithreaded code paths (see the logging sketch after this list).

* Don't print logs twice

AWS Lambda has a default root logger that captures everything that goes
to stdout. If you don't turn `propagate` off, you get every message
twice in CloudWatch.

* Poll for locks much less frequently

A failure mode we encounter is using up capacity on the lock table. If
we're anticipating smaller numbers of worker lambdas that handle larger
blocks of work, we can wait longer between checks for lock
availability. This reduced our DynamoDB consumed capacity (sketched
below, after this list).

* Distribute 100 bundles to each lambda

And aggressively parallelize DSS I/O.

* Assume that each bundle has a single cell in the mapper

This is a bad assumption, but it's helpful for the demo.

* Address comments
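
For the two logging items above, a sketch of the sort of configuration
described; the logger name is hypothetical:

```python
import logging

logger = logging.getLogger("matrix.worker")  # hypothetical name
handler = logging.StreamHandler()
# %(thread)d stamps every record with its thread id, which is what makes
# interleaved output from the multithreaded code paths readable.
handler.setFormatter(logging.Formatter(
    "%(asctime)s [%(thread)d] %(levelname)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)
# AWS Lambda already installs a root handler, so without this every
# record shows up twice in CloudWatch.
logger.propagate = False
```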
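
And for the lock-polling item, the shape of the change is just a longer
sleep between attempts; a sketch with a hypothetical `try_acquire`
callable:

```python
import time

LOCK_POLL_INTERVAL_SECONDS = 30  # hypothetical value


def wait_for_lock(try_acquire):
    """Block until try_acquire() returns True.

    try_acquire is assumed to make one attempt against the DynamoDB lock
    table; sleeping longer between attempts is what cuts the table's
    consumed capacity.
    """
    while not try_acquire():
        time.sleep(LOCK_POLL_INTERVAL_SECONDS)
```
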
@mckinsel mckinsel requested a review from ttung October 26, 2018 16:10
Member

@ttung ttung left a comment

Unit tests are broken, btw.

My suggestion for making this a bit safer: in `__init__(..)`, create a lock. In `stream(..)`, hold the lock and verify that `_context_manager_response` is `None` before issuing the request. If it's not `None`, then the previous request is plausibly still running.
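
One way to read that suggestion as code; a sketch, with the
request-issuing call left hypothetical:

```python
import threading


class _ClientMethodFactory:
    def __init__(self):
        self._lock = threading.Lock()
        self._context_manager_response = None

    def stream(self, *args, **kwargs):
        # Blocks until any previous streamed request has left its
        # `with` block, so concurrent callers are serialized.
        self._lock.acquire()
        assert self._context_manager_response is None
        self._context_manager_response = self._issue_request(*args, **kwargs)  # hypothetical
        return self

    def __enter__(self):
        return self._context_manager_response

    def __exit__(self, exc_type, exc_value, traceback):
        try:
            self._context_manager_response.close()
        finally:
            self._context_manager_response = None
            self._lock.release()
```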

@mckinsel
Author

That's definitely safer, but I think the safety is achieved by fully serializing calls to `stream`. Once one thread calls `stream`, no other thread can until that first thread calls `__exit__`.

@mckinsel
Author

Also, re unit tests, I'm pretty sure master is broken? Is that possible?

@ttung
Member

ttung commented Oct 29, 2018

> That's definitely safer, but I think the safety is achieved by fully serializing calls to `stream`. Once one thread calls `stream`, no other thread can until that first thread calls `__exit__`.

I think my suggestion achieves that same effect...?

@ttung
Member

ttung commented Oct 29, 2018

> Also, re unit tests, I'm pretty sure master is broken? Is that possible?

Yes, that's definitely possible.

@kozbo
Collaborator

kozbo commented Nov 13, 2018

Passing now; would you like to try your unit tests again?

@kozbo kozbo added this to the M-11-25-18 milestone Nov 14, 2018
calvinnhieu pushed a commit to HumanCellAtlas/matrix-service that referenced this pull request Nov 15, 2018
Really feel like this one should pass...
@mckinsel mckinsel force-pushed the mckinsel-thread-safety branch from ae247b3 to 3c57a7f on November 16, 2018
@mckinsel
Author

Thanks @kozbo, still failing, but interesting failures now.

@kozbo kozbo modified the milestones: M-11-25-18, M-12-21-2018 Dec 21, 2018
@kozbo kozbo removed this from the M-12-21-2018 milestone Feb 26, 2019
@kozbo
Collaborator

kozbo commented Jul 1, 2019

@mckinsel any progress on this? Is this still an issue?
