-
Notifications
You must be signed in to change notification settings - Fork 6
Add Hybrid Query Builder #960
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
fc3e6f5
7468316
ab32c4e
612845b
73d5e1f
eacbff5
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,69 @@ | ||
| class HybridQueryBuilder | ||
| def build(params, fulltext: false) | ||
| query_text = params[:q].to_s.strip | ||
|
|
||
| lexical_query = LexicalQueryBuilder.new.build(params, fulltext: fulltext) | ||
|
|
||
|
jazairi marked this conversation as resolved.
|
||
| # If no query text provided, return lexical query so filters/other constraints are still applied | ||
| return lexical_query if query_text.blank? | ||
|
|
||
| begin | ||
| semantic_query = SemanticQueryBuilder.new.build(params, fulltext: fulltext) | ||
|
|
||
| # Both succeeded - combine them with should clause while preserving filters | ||
| combine_queries(semantic_query, lexical_query) | ||
| rescue SemanticQueryBuilder::LambdaError => e | ||
| # Lambda service failure - report to Sentry and gracefully fall back to lexical search | ||
| Sentry.capture_exception(e, level: 'warning') | ||
| Rails.logger.warn( | ||
| "HybridQueryBuilder semantic query failed: #{e.class}: #{e.message}" | ||
| ) | ||
| lexical_query | ||
| end | ||
|
qltysh[bot] marked this conversation as resolved.
qltysh[bot] marked this conversation as resolved.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||
| end | ||
|
|
||
| private | ||
|
|
||
| # Combines semantic and lexical queries while preserving non-q filters. | ||
| # The q multi_match stays in the lexical branch to allow semantic-only matches. | ||
| def combine_queries(semantic_query, lexical_query) | ||
| # Extract filters (non-q constraints like title/citation/geo) to apply at top level. | ||
| # Do NOT extract must (which contains the q multi_match) - it stays in lexical branch | ||
| # so semantic matches can be returned without matching the q query. | ||
| lexical_bool = lexical_query.is_a?(Hash) && lexical_query[:bool] ? lexical_query[:bool] : {} | ||
| top_level_filters = lexical_bool[:filter] || [] | ||
|
|
||
| # Keep the full lexical query structure (with q multi_match in must) but remove filters | ||
| # so we don't duplicate them in the final query | ||
| lexical_search = if lexical_query.is_a?(Hash) && lexical_query[:bool] | ||
| { | ||
| bool: { | ||
| should: lexical_bool[:should] || [], | ||
| must: lexical_bool[:must] || [] | ||
| }.reject { |_, v| v.blank? } | ||
| } | ||
| else | ||
| lexical_query | ||
| end | ||
|
|
||
| hybrid_bool = { | ||
| should: [ | ||
| semantic_query, | ||
| lexical_search | ||
| ] | ||
| } | ||
|
|
||
| # Apply only filters (non-q constraints) at top level so they apply to both branches | ||
| hybrid_bool[:filter] = top_level_filters if top_level_filters.present? | ||
|
|
||
| # In OpenSearch, when a bool query has no filters, should clauses are required by default. | ||
| # When filters are added, should clauses become optional. We explicitly require at least | ||
| # one should clause to match (semantic or lexical) when filters are present, so we don't | ||
| # return filter-only results that matched neither branch. | ||
| hybrid_bool[:minimum_should_match] = 1 if top_level_filters.present? | ||
|
|
||
| { | ||
| bool: hybrid_bool | ||
| } | ||
|
qltysh[bot] marked this conversation as resolved.
qltysh[bot] marked this conversation as resolved.
qltysh[bot] marked this conversation as resolved.
qltysh[bot] marked this conversation as resolved.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Found 5 issues: |
||
| end | ||
| end | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,4 +1,7 @@ | ||
| class SemanticQueryBuilder | ||
| # Dedicated exception for Lambda invocation failures (not parsing/validation errors) | ||
| class LambdaError < StandardError; end | ||
|
|
||
| def build(params, fulltext: false) | ||
| query_text = params[:q].to_s.strip | ||
|
|
||
|
|
@@ -13,16 +16,23 @@ def build(params, fulltext: false) | |
|
|
||
| def invoke_semantic_builder(query_text) | ||
| payload = { query: query_text } | ||
| function_name = ENV.fetch('TIMDEX_SEMANTIC_BUILDER_FUNCTION_NAME') | ||
|
|
||
| response = Timdex::LambdaClient.invoke( | ||
| function_name: ENV.fetch('TIMDEX_SEMANTIC_BUILDER_FUNCTION_NAME'), | ||
| invocation_type: 'RequestResponse', | ||
| payload: payload.to_json | ||
| ) | ||
| begin | ||
| response = Timdex::LambdaClient.invoke( | ||
| function_name: function_name, | ||
| invocation_type: 'RequestResponse', | ||
| payload: payload.to_json | ||
| ) | ||
| rescue StandardError => e | ||
| # All errors from the Lambda service call are wrapped in LambdaError | ||
| # so HybridQueryBuilder can catch it and gracefully fall back to lexical search. | ||
| raise LambdaError, "Lambda invocation error: #{e.message}", e.backtrace | ||
| end | ||
|
jazairi marked this conversation as resolved.
|
||
|
|
||
| # Response parsing below is outside the rescue block, so JSON/validation errors | ||
| # propagate as-is and fail fast rather than triggering graceful fallback. | ||
| parse_lambda_payload(response.payload) | ||
|
qltysh[bot] marked this conversation as resolved.
qltysh[bot] marked this conversation as resolved.
qltysh[bot] marked this conversation as resolved.
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||
| rescue StandardError => e | ||
| raise "Semantic query builder Lambda error: #{e.message}" | ||
| end | ||
|
|
||
| def parse_lambda_payload(payload) | ||
|
|
@@ -39,12 +49,26 @@ def parse_lambda_payload(payload) | |
|
|
||
| def parse_lambda_response(lambda_response) | ||
| # Lambda returns: { "query": { "bool": { "should": [...] } } } | ||
| # We extract and return just the inner query object | ||
| # We extract and return just the inner query object with keys normalized to symbols | ||
| raise "Invalid semantic query builder response: missing 'query' key" unless lambda_response.key?('query') | ||
|
|
||
| query = lambda_response['query'] | ||
| raise 'Invalid semantic query builder response: query must be a Hash' unless query.is_a?(Hash) | ||
|
|
||
| query | ||
| # Normalize string keys to symbols for consistency with LexicalQueryBuilder | ||
| normalize_keys(query) | ||
| end | ||
|
|
||
| # Recursively converts all string keys to symbols in hashes and nested structures. | ||
| def normalize_keys(value) | ||
| case value | ||
| when Hash | ||
| value.transform_keys { |k| k.is_a?(String) ? k.to_sym : k } | ||
| .transform_values { |v| normalize_keys(v) } | ||
| when Array | ||
| value.map { |item| normalize_keys(item) } | ||
| else | ||
| value | ||
| end | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. |
||
| end | ||
| end | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,20 +1,17 @@ | ||
| require 'aws-sdk-lambda' | ||
|
|
||
| def configure_lambda_client | ||
| if ENV['AWS_SESSION_TOKEN'].present? | ||
| Aws::Lambda::Client.new( | ||
| region: ENV.fetch('AWS_REGION', 'us-east-1'), | ||
| access_key_id: ENV.fetch('AWS_ACCESS_KEY_ID'), | ||
| secret_access_key: ENV.fetch('AWS_SECRET_ACCESS_KEY'), | ||
| session_token: ENV.fetch('AWS_SESSION_TOKEN') | ||
| ) | ||
| else | ||
| Aws::Lambda::Client.new( | ||
| region: ENV.fetch('AWS_REGION', 'us-east-1'), | ||
| access_key_id: ENV.fetch('AWS_ACCESS_KEY_ID'), | ||
| secret_access_key: ENV.fetch('AWS_SECRET_ACCESS_KEY') | ||
| ) | ||
| end | ||
| options = { | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah this is probably better configuration even if you weren't adding the new feature to override the endpoint :) |
||
| region: ENV.fetch('AWS_REGION', 'us-east-1'), | ||
| access_key_id: ENV.fetch('AWS_ACCESS_KEY_ID'), | ||
| secret_access_key: ENV.fetch('AWS_SECRET_ACCESS_KEY') | ||
| } | ||
| options[:session_token] = ENV['AWS_SESSION_TOKEN'] if ENV['AWS_SESSION_TOKEN'].present? | ||
|
|
||
| # AWS SDK sets this env in prod. However, we need to conditionally set it for tests so VCR can | ||
| # intercept the requests with a fake URL. | ||
| options[:endpoint] = ENV['AWS_ENDPOINT_URL_LAMBDA'] if ENV['AWS_ENDPOINT_URL_LAMBDA'].present? | ||
| Aws::Lambda::Client.new(options) | ||
| end | ||
|
|
||
| Timdex::LambdaClient = configure_lambda_client | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm okay with this not being documented as we don't anticipate it ever being used anywhere but the tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, I did mean to document it, just to say that it's test env only. Will do so before merging.