-
Notifications
You must be signed in to change notification settings - Fork 4
Add support for HCA tissue atlas (#7128) #7877
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Changes from all commits
3f3e5c0
8c7831d
b5366ff
1350081
e3e5a86
1d37e17
0910a5e
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -21,6 +21,7 @@ | |||||
| FieldTypes, | ||||||
| FieldTypes1, | ||||||
| Nested, | ||||||
| pass_thru_bool, | ||||||
| ) | ||||||
| from azul.indexer.document import ( | ||||||
| Aggregate, | ||||||
|
|
@@ -96,7 +97,6 @@ | |||||
| if isinstance(field_types, Nested): | ||||||
| element = next(elements, None) | ||||||
| if element is not None: | ||||||
| assert element == field_types.agg_property, (element, field_types) | ||||||
| field_types = field_types.properties[element] | ||||||
| assert isinstance(field_types, FieldType), (path, field_types) | ||||||
| element = next(elements, None) | ||||||
|
|
@@ -122,6 +122,27 @@ | |||||
| # does not undergo translation | ||||||
| ) | ||||||
|
|
||||||
| @cache | ||||||
Check warningCode scanning / CodeQL Use of the return value of a procedure Warning
The result of
cache Error loading related location Loading |
||||||
|
|
||||||
| def field_types_by_name(self, catalog: CatalogName) -> Mapping[str, FieldType]: | ||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. PL
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. TODO: rename method, use |
||||||
| """ | ||||||
| Returns the field type for each supported sort and filter field, using | ||||||
| the name of the field as provided by clients. Unlike field_types(), this | ||||||
| is a flat mapping and includes the synthetic field 'accessible' that has | ||||||
| no entry in the plugin's field_mapping. | ||||||
|
|
||||||
| :return: dict with field names as keys and each field's type as value | ||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
| """ | ||||||
| plugin = self.metadata_plugin(catalog) | ||||||
| result = {} | ||||||
| for field, path in plugin.field_mapping.items(): | ||||||
| field_type = self.field_type(catalog, path) | ||||||
| if isinstance(field_type, FieldType): | ||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. PL
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. TODO: investigate in which case
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The returned value from |
||||||
| result[field] = field_type | ||||||
| accessible_field = plugin.special_fields.accessible.name | ||||||
|
dsotirho-ucsc marked this conversation as resolved.
|
||||||
| assert accessible_field not in result, result | ||||||
| result[accessible_field] = pass_thru_bool | ||||||
| return result | ||||||
|
|
||||||
| def catalogued_field_types(self) -> CataloguedFieldTypes: | ||||||
| return { | ||||||
| catalog: self.field_types(catalog) | ||||||
|
|
||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -735,7 +735,14 @@ def _project(self, project: api.Project) -> MutableJSON: | |
| 'accessions': list(map(self._accession, project.accessions)), | ||
| 'is_tissue_atlas_project': any(bionetwork.atlas_project | ||
| for bionetwork in project.bionetworks), | ||
| 'tissue_atlas': list(map(self._tissue_atlas, project.bionetworks)), | ||
| # We deduplicate the `tissue_atlas` field values since duplicate | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. PL
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. TODO: replace custom deduplication with an existing mechanism, and investigate why this is necessary since a SetOfDictAccumulator is used during aggregation. |
||
| # values in a nested field would cause incorrect term facet totals. | ||
| 'tissue_atlas': [ | ||
| dict(d) for d in dict.fromkeys( | ||
| tuple(self._tissue_atlas(b).items()) | ||
| for b in project.bionetworks | ||
| ) | ||
| ], | ||
| 'bionetwork_name': json_sorted(bionetwork.name | ||
| for bionetwork in project.bionetworks), | ||
| 'estimated_cell_count': project.estimated_cell_count, | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -18,6 +18,9 @@ | |
| one, | ||
| ) | ||
|
|
||
| from azul.field_type import ( | ||
| Nested, | ||
| ) | ||
| from azul.lib import ( | ||
| cached_property, | ||
| ) | ||
|
|
@@ -558,9 +561,11 @@ def file_type_summary(aggregate_file: JSON) -> FileTypeSummaryForHit: | |
| ] | ||
| return summarized_hit | ||
|
|
||
| def make_terms(self, agg) -> Terms: | ||
| def choose_entry(_term): | ||
| if 'key_as_string' in _term: | ||
| def make_terms(self, field_type, agg) -> Terms: | ||
| def choose_entry(_term, nested_keys): | ||
| if nested_keys is not None: | ||
| return dict(zip(nested_keys, _term['key'])) | ||
| elif 'key_as_string' in _term: | ||
| return _term['key_as_string'] | ||
| elif (term_key := _term['key']) is None: | ||
| return None | ||
|
|
@@ -571,10 +576,15 @@ def choose_entry(_term): | |
| else: | ||
| return str(term_key) | ||
|
|
||
| if isinstance(field_type, Nested): | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. PL
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. TODO: Rename |
||
| nested_keys = [path[-1] for path in agg['myTerms']['meta']['paths']] | ||
| else: | ||
| nested_keys = None | ||
| terms: list[Term] = [] | ||
| for bucket in agg['myTerms']['buckets']: | ||
| term = Term(term=choose_entry(bucket), | ||
| count=bucket['doc_count']) | ||
| doc_count = bucket['doc_count'] | ||
| term = Term(term=choose_entry(bucket, nested_keys), | ||
| count=doc_count) | ||
| try: | ||
| sub_agg = bucket['myProjectIds'] | ||
| except KeyError: | ||
|
|
@@ -605,8 +615,9 @@ def choose_entry(_term): | |
| type='terms') | ||
|
|
||
| def make_facets(self, aggs: JSON) -> dict[str, Terms]: | ||
| field_types = self.service.field_types_by_name(self.catalog) | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Please familiarize yourself with what the "Pull up method" refactoring does. This change is not that.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. PL
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. TODO: rename commit title |
||
| facets = {} | ||
| for facet, agg in aggs.items(): | ||
| if facet != '_project_agg': # Filter out project specific aggs | ||
| facets[facet] = self.make_terms(agg) | ||
| facets[facet] = self.make_terms(field_types[facet], agg) | ||
| return facets | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -35,6 +35,7 @@ | |
| ) | ||
| from opensearchpy.helpers.aggs import ( | ||
| Agg, | ||
| MultiTerms, | ||
| Terms, | ||
| ) | ||
| from opensearchpy.helpers.query import ( | ||
|
|
@@ -327,21 +328,35 @@ def _prepare_aggregation(self, *, facet: str, facet_path: FieldPath) -> Agg: | |
|
|
||
| field_type = self.service.field_type(self.catalog, facet_path) | ||
| if isinstance(field_type, Nested): | ||
| nested_agg = agg.bucket(name='nested', | ||
| agg_type='nested', | ||
| path=dotted(facet_path)) | ||
| facet_path = dotted(facet_path, field_type.agg_property) | ||
| path = dotted(facet_path) | ||
| # A nested aggregation to aggregate on fields inside a nested field | ||
| agg.bucket(name='nested', | ||
| agg_type='nested', | ||
| path=path) | ||
| # A multi-terms aggregation to form composite keys made from the | ||
| # fields inside a nested field | ||
| agg.aggs.nested.bucket(name='myTerms', | ||
| agg_type='multi_terms', | ||
| terms=[ | ||
| {'field': path + f'.{field}.keyword'} | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. PL |
||
| for field in field_type.properties | ||
| ], | ||
| size=config.terms_aggregation_size) | ||
| # A filter aggregation to work around that we can't use a missing | ||
| # aggregation with a nested field. | ||
| # See https://github.com/elastic/elasticsearch/issues/9571 | ||
| agg.bucket(name='untagged', | ||
| agg_type='filter', | ||
| filter=Q('bool', must_not=[ | ||
| Q('nested', path=path, query=Q('exists', field=path)) | ||
| ])) | ||
| else: | ||
| nested_agg = agg | ||
| # Make an inner agg that will contain the terms in question | ||
| path = dotted(facet_path, 'keyword') | ||
| # FIXME: Approximation errors for terms aggregation are unchecked | ||
| # https://github.com/DataBiosphere/azul/issues/3413 | ||
| nested_agg.bucket(name='myTerms', | ||
| agg_type='terms', | ||
| field=path, | ||
| size=config.terms_aggregation_size) | ||
| nested_agg.bucket('untagged', 'missing', field=path) | ||
| path = dotted(facet_path, 'keyword') | ||
| agg.bucket(name='myTerms', | ||
| agg_type='terms', | ||
| field=path, | ||
| size=config.terms_aggregation_size) | ||
| agg.bucket('untagged', 'missing', field=path) | ||
| return agg | ||
|
|
||
| def _annotate_aggs_for_translation(self, request: Search): | ||
|
|
@@ -352,13 +367,23 @@ def _annotate_aggs_for_translation(self, request: Search): | |
| """ | ||
|
|
||
| def annotate(agg: Agg): | ||
| if isinstance(agg, Terms): | ||
| path = agg.field.split('.') | ||
| if path[-1] == 'keyword': | ||
| path.pop() | ||
| if isinstance(agg, (Terms, MultiTerms)): | ||
| if not hasattr(agg, 'meta'): | ||
| agg.meta = {} | ||
| agg.meta['path'] = path | ||
| if isinstance(agg, Terms): | ||
| # A Terms agg is for a single field, so we only put one | ||
| # field path in `paths`. | ||
| path = agg.field.removesuffix('.keyword').split('.') | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. PL
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. TODO: Remove `.removesuffix('.keyword'), instead split first with an existing method, assert last element is 'keyword' and remove. |
||
| agg.meta['paths'] = [path] | ||
| else: | ||
| # A MultiTerms agg contains multiple fields, so we need the | ||
| # path of each one. By storing these paths in the same order | ||
| # that the fields occur in `agg.terms`, we can later pair | ||
| # these paths to the values in the aggregation buckets. | ||
| agg.meta['paths'] = [] | ||
| for term in agg.terms: | ||
| path = term['field'].removesuffix('.keyword').split('.') | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. PL
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. TODO: remove duplication between if/else branches. |
||
| agg.meta['paths'].append(path) | ||
| if hasattr(agg, 'aggs'): | ||
| subs = agg.aggs | ||
| for sub_name in subs: | ||
|
|
@@ -391,13 +416,25 @@ def translate(k, v: MutableJSON): | |
| translate(k, v) | ||
| else: | ||
| try: | ||
| path = v['meta']['path'] | ||
| # `paths` is a key we added to `meta` to have available here | ||
| # when processing the response. Each path is a list (e.g. | ||
| # ['contents', 'projects', 'document_id']) and `paths` will | ||
| # have only one path in the case of a Terms aggregation, or | ||
| # many paths in the case of a MultiTerms aggregation. | ||
| paths = v['meta']['paths'] | ||
| except KeyError: | ||
| pass | ||
| else: | ||
| field_type = self.service.field_type(self.catalog, tuple(path)) | ||
| for i, path in enumerate(paths): | ||
| field_type = self.service.field_type(self.catalog, tuple(path)) | ||
| for bucket in buckets: | ||
| # If the bucket is from a MultiTemrms aggregation | ||
| if isinstance(bucket['key'], list): | ||
| bucket['key'][i] = field_type.from_index(bucket['key'][i]) | ||
| # If the bucket is from a Terms aggregation | ||
| else: | ||
| bucket['key'] = field_type.from_index(bucket['key']) | ||
| for bucket in buckets: | ||
| bucket['key'] = field_type.from_index(bucket['key']) | ||
| translate(k, bucket) | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. PL to explain why
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I was mistaken in thinking that this is no longer needed. We still have occurrences of nested agg buckets such as in the HCA agg
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Please don't ignore my PL request.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Decided in PL to refactor the changes to the translation of aggregates ( |
||
|
|
||
| for k, v in aggs.items(): | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment was generated by Claude Code.
With these two overrides, does it still make sense for
Nestedto inheritPassThrough?The defining characteristic of
PassThroughis thatto_indexandfrom_indexare identity operations. After this change,Nestedoverrides both with non-trivial logic that delegates to per-property field types — the values are no longer "passed through" unchanged.What
Nestedstill gets fromPassThrough:es_typeproperty (trivial — stores and returns_es_type)allow_sorting_by_empty_lists = False__init__convenience of setting bothnative_formandindex_formto the same type (JSON)All three are easy to replicate by inheriting
FieldType[JSON, JSON]directly.