All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog and this project adheres to Semantic Versioning.
- Support for x-cook-pool header, from @pschorf
- Bug in reporting total usage when pools are enabled, from @pschorf
- Updated some metric names to incorporate pools, from @pschorf and @dposada
- Rate limiting on job submission, from @scrosby
- Remove stale dataset cost data, from @pschorf
- Don't show uncommitted jobs in unscheduled_jobs endpoint, from @pschorf
- Support for contacting a data local service to obtain cost data for scheduling, from @pschorf
- Bug in quota-checking when running without pools, from @dposada
- Bug in the rebalancer's retrieval of DRU divisors when running with pools, from @dposada
- Integer overflows in timer tasks when the scheduler runs for a long time, from @shamsimam
- Per-pool job scheduling, from @dposada and @pschorf
- Support for self-impersonation requests from normal users, from @DaoWen
- Exit code syncer to handle a high rate of incoming exit code messages, from @shamsimam
- Removed TTL from agent attributes cache, from @dposada
- Performance improvements to job submission, from @scrosby and @pschorf
- data-local field to jobs, from @pschorf
- Performance improvements to job submission, from @scrosby and @pschorf
- Consume entire request before sending response, from @pschorf
- Container fields to /jobs, from @dposada
- reason_mea_culpa to instance responses, from @dposada
- Support for x-forwarded-proto header for CORS requests, from @pschorf
- Removed mesos master-hosts config, from @dposada
- Removed rebalancer min-utilization-threshold, from @dposada
- Better authorization failed message on job deletion, from @dposada
- Handle edge case in estimated completion constraint, from @pschorf
- Issue where task reconciliation was failing, from @pschorf
- Issue where nil instance timestamps would cause NPEs, from @dposada
- Pool support to /jobs, from @dposada
- Estimated completion constraint, from @pschorf
- Pool submap to /quota and /share, from @pschorf
- Improvements to job query times, from @scrosby
- Added pool support to /share and /quota endpoints, from @pschorf
- Returns 409 on some retry operations instead of retrying jobs which could end up in a bad state, from @pschorf
- Fixed bug with disable_mea_culpa_retries, from @pschorf
- Improved logging for some error cases, from @dposada
- Support for pool param to /usage endpoint, from @dposada
- Support for pool param on job submission, from @dposada
- Support for SSL, from @pschorf
- Support for api-only mode, from @dposada
- Issue where monitor metrics would sometimes stop on a non-zero value, from @dposada
- Fix performance regression in list API, from @scrosby
- Support for listing custom executor jobs in /jobs endpoint, from @dposada
- Kill instances for cancelled jobs on leadership election, from @pschorf
- Performance improvements to scheduling and list APIs, from @scrosby
- Fixed GPU support, from @dPeS
- Support for CORS requests, from @pschorf
- Scheduling performance improvements, from @scrosby
- Counters for job cpu/mem/runtime by failure reason, from @dposada
- Endpoint for instance statistics, from @dposada
- Support for a configurable run as user, from @shamsimam
- Support for configuring number of instances which can fail before falling back to the mesos executor, from @shamsimam
- Performance improvements to sandbox syncer, from @shamsimam
- Rebalancer now reserves hosts after preempting, from @pschorf
- Performance improvements to DRU computation, from @shamsimam
- Added timely sandbox directory updates for tasks that are not executed by the cook executor, from @shamsimam
- Added environment variables that contain the resources requested by the job, from @shamsimam
- Converted monitor Riemann events to codahale metrics, from @dposada
- Fixed string encoding on `/rawscheduler` POST, from @pschorf
- The `start-time` timestamp on `/info` no longer re-evaluates to `now` on each request, from @DaoWen
- Added user-impersonation functionality to support services running on top of Cook Scheduler, from @DaoWen
- Jobs that exceed a user's total resource quota are rejected rather than waiting indefinitely, from @DaoWen
- Added unauthenticated /info endpoint for retrieving basic setup information, from @DaoWen
- Added metrics for message rates of Mesos status changes and framework updates, from @shamsimam
- Added check for required `reason` parameter on share and quota deletions, from @DaoWen
- Fixed error in Kerberos middleware setup, from @DaoWen
- Reclassified `MESOS_EXECUTOR_TERMINATED` as a mea-culpa error, from @shamsimam
- Fixed bug preventing group retry updates by non-admin users, from @DaoWen
- Fixed bug causing a 500 rather than a 404 for gets on non-existent groups, from @DaoWen
- Re-enabled Fenzo group constraints, from @pschorf
- Added /instances endpoint for retrieving job instances, from @dposada
- Added /jobs resource for retrieving jobs, from @dposada
- Added /usage endpoint for displaying user resource usage, from @DaoWen
- Added failed-only option for retry endpoint, from @DaoWen
- Fixed authorization check on group endpoint, from @DaoWen
- Disabled fenzo group constraints, from @pschorf
- Retries sandbox syncing of hosts when cache entries expire, from @shamsimam
- Allow partial results from /unscheduled_jobs, from @dposada
- Improve performance by deferring calculation of group components, from @pschorf
- Support millisecond time resolution for lingering tasks, from @DaoWen
- Added COOK_JOB_UUID and COOK_JOB_GROUP_UUID to the job environment, from @shamsimam
- Added support for killing a group of jobs, from @DaoWen
- Added sysouts to get job output closer to Mesos' CommandExecutor, from @shamsimam
- Added metrics for usage of /list, from @dposada
- Added support for retrying a group of jobs, from @DaoWen
- Added support for configurable environment passed to Cook Executor, from @shamsimam
- Fixed bug with job group constraints, from @pschorf
- Fixed bug where Cook Executor jobs were opting in to the heartbeat support, from @shamsimam
- Changed (simplified) the sandbox directory syncing mechanism for jobs, from @shamsimam
- Renamed user whitelist to users allowed, from @dposada
- Fixes for stderr/out file handling in Cook executor, from @shamsimam
- Fixed bug with /unscheduled_jobs endpoint, from @pschorf
- Added support for allowing job to specify which executor (cook|mesos) to use, from @shamsimam
- Added support for passing state=success/failed in /list, from @dposada
- Added support for filtering by name in /list, from @dposada
- More failure codes have been classified as mea-culpa failures, from @pschorf
- /queue endpoint redirects to the master on non-master hosts, from @pschorf
- Fixed handling of detailed parameter on group queries, from @DaoWen
- Fixed bug with launching docker container jobs, from @DaoWen
- Fixed bug with docker container port mappings, from @pschorf
- Performance improvement in rank jobs, from @wyegelwel
- Added JVM metric reporting, from @pschorf
- Added support for partial results when querying for groups, from @dposada
- Added support for user whitelisting, from @dposada
- Added support for throttling rate of publishing instance progress updates, from @shamsimam
- Added authorization check for job creation, from @dposada
- The Mesos Framework ID is now configurable, from @dposada
- Added configuration for agent-query-cache, from @shamsimam
- Added support for Cook Executor, from @shamsimam
- Replaced aggregate preemption logging with individual preemption decisions, from @wyegelwel
- /debug endpoint now returns the version number, from @dposada
- Fixed a bug which was overwriting end-time on duplicate mesos messages, from @pschorf
- Fixed a bug with querying for jobs with a non-zero number of ports, from @dposada
- Parallelize in-order processing of status messages, from @shamsimam
- Change reason string from "Mesos command executor failed" to "Command exited non-zero", from @wyegelwe
- Added configuration option for the leader to report unhealthy, from @pschorf
- Optimized list endpoint query for running and waiting jobs, from @wyegelwel and @pschorf
- Lowered log level of sandbox directory fetch error to reduce noise, from @wyegelwel
- Further optimize list endpoint query, from @pschorf and @wyegelwel
- Optimized the query in the list endpoint to avoid an expensive datomic join, from @pschorf and @wyegelwel
- Change the list endpoint time range to be inclusive on start, from @wyegelwel
- Add check to ensure job/group uuids do not exist before creation, from @pschorf
- Limit rebalancer jobs to consider to max preemptions, from @wyegelwel
- Added simulator to test scheduler performance, from @wyegelwel
- Added job constraints, from @wyegelwel
- Added instance progress to query response, from @dposada
- Fixed bug where job submit errors would return 201, from @pschorf
- Optimizations in ranking to improve schedule time, from @shamsimam
- Refactor fenzo constraints to use less memory, from @pschorf
- Added disable-mea-culpa-retries to jobclient, from @WenboZhao
- Fix bug with disable-mea-culpa-retries, from @pschorf
- Make DRU order deterministic, from @wyegelwel
- Change default cycle time for checking max-runtime exceeded to 1m, from @wyegelwel
- Remove concat usage, from @pschorf
- /unscheduled_jobs API endpoint, from @mforsyth
- Added application to job description, from @dposada
- Added disable-mea-culpa-retries flag, from @pschorf
- Added docker, from @dposada
- Added support for job groups in simulator, from @mforsyth
- Added /failure_reasons API endpoint, from @mforsyth
- Added expected-runtime to job description, from @dposada
- Added /settings API endpoint, from @dposada
- Added group host placement constraints, from @DiegoAlbertoTorres
- Require an explicit reason when changing shares or quotas (from @mforsyth). This intentionally breaks backwards compatibility.
- Optimized matching code to speed schedule time, from @wyegelwel
- Stream JSON responses, from @pschorf
- Speed up ranking with commit latch and caching from @wyegelwel
- Fixed a bug in calculating whether we matched the head of the queue, which caused Cook to schedule only 1 job at a time (this is why 1.2.0 was yanked)
- Start of CHANGELOG. We are likely missing some items from 1.0.1; this will be better from now on.
- Switch to use Fenzo for matching from @dgrnbrg and @mforsyth
- GPU support from @dgrnbrg
- Swaggerized endpoints from @mforsyth
- Groups (https://github.com/twosigma/Cook/blob/master/scheduler/docs/groups.md) from @DiegoAlbertoTorres
- Containers support from @sdegler, @leifwalsh, @wyegelwel
- Retry endpoint from @pjlegato and @wyegelwel
- Authorization on endpoints from @pjlegato and @wyegelwel
- System simulator and CI from @mforsyth
- Access logs for server from @sophaskins
- Mea culpa reasons so some failures don't count against retries from @DiegoAlbertoTorres @mforsyth
- Switch to use mesomatic over clj-mesos from @mforsyth
- Tied to mesos 1.x.x (exact version is 1.0.1)
- State change of a job from waiting to running now occurs when Cook submits the job to mesos (not when mesos confirms the job is running) from @aadamson and @DiegoAlbertoTorres
- Performance improvements to ranking and scheduling from @wyegelwel
- Split brain on mesos / zk failover. Cook will now exit when it loses leadership with either zk or mesos. A supervisor is expected to restart it (from @wyegelwel).