When fetching data with find_polars_all, find_pandas_all, or find_arrow_all from pymongoarrow.api, the schema is inferred from the first document. If the same key holds a different data type in a later document, that value comes back as null.
MongoDB documents
[
{
"name": "test",
"code": "1"
},
{
"name": "test",
"code": 1
}
]
Current implementation
from pymongoarrow.api import find_polars_all

# `client` here is the pymongo Collection being queried; `query` is the filter dict.
query_result_df = find_polars_all(
    collection=client,
    query=query
)
query_result_df
# Schema([('_id', Binary), ('name', String), ('code', String)]), Shape ==> (2, 3)
# shape: (2, 3)
# ┌─────────────────────────────────┬──────┬──────┐
# │ _id ┆ name ┆ code │
# │ --- ┆ --- ┆ --- │
# │ binary ┆ str ┆ str │
# ╞═════════════════════════════════╪══════╪══════╡
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ 1 │
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ null │
# └─────────────────────────────────┴──────┴──────┘
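The result above can be modeled in a few lines of plain Python. This is a simplified illustration only, not pymongoarrow's actual code: the helper name infer_from_first is hypothetical, and a "type" here is just the Python class of each value in the first document.

```python
from typing import Any


def infer_from_first(docs: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Toy model: fix each field's type from the first document, null out mismatches."""
    if not docs:
        return []
    # The "schema" is the Python type of each value in the first document.
    schema = {key: type(value) for key, value in docs[0].items()}
    rows = []
    for doc in docs:
        row = {}
        for key, expected in schema.items():
            value = doc.get(key)
            # A value whose type differs from the inferred one becomes None (null).
            row[key] = value if isinstance(value, expected) else None
        rows.append(row)
    return rows


docs = [{"name": "test", "code": "1"}, {"name": "test", "code": 1}]
rows = infer_from_first(docs)
```

In this model the integer code in the second document does not match the str type taken from the first document, so it is replaced with None, mirroring the null in the output above.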
For known discrepancies like this, where the first document holds a string (pyarrow.string()) and subsequent documents hold an integer (pyarrow.int*()) under the same key, the values could all be read as pyarrow.string() if the find_* APIs accepted an optional parameter coerce_number_to_str.
Expected implementation
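from pymongoarrow.api import find_polars_all

# coerce_number_to_str is the proposed parameter; it does not exist in pymongoarrow yet.
query_result_df = find_polars_all(
    collection=client,
    query=query,
    coerce_number_to_str=True
)
query_result_df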
query_result_df
# Schema([('_id', Binary), ('name', String), ('code', String)]), Shape ==> (2, 3)
# shape: (2, 3)
# ┌─────────────────────────────────┬──────┬──────┐
# │ _id ┆ name ┆ code │
# │ --- ┆ --- ┆ --- │
# │ binary ┆ str ┆ str │
# ╞═════════════════════════════════╪══════╪══════╡
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ 1 │
# │ b"f\xfb\xe8\x0a\x9f\x16\xe1\xe… ┆ test ┆ 1 │
# └─────────────────────────────────┴──────┴──────┘
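The intended semantics can be sketched in plain Python. This is only an assumption about what the proposed flag would do: coerce_numbers is a hypothetical helper, not part of pymongoarrow, and it operates on plain dictionaries such as those a regular PyMongo cursor yields.

```python
from typing import Any, Iterable


def coerce_numbers(docs: Iterable[dict[str, Any]], fields: set[str]) -> list[dict[str, Any]]:
    """Return copies of docs with int/float values in `fields` converted to str."""
    result = []
    for doc in docs:
        fixed = dict(doc)  # shallow copy so the input documents are not mutated
        for field in fields:
            value = fixed.get(field)
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                fixed[field] = str(value)
        result.append(fixed)
    return result


docs = [{"name": "test", "code": "1"}, {"name": "test", "code": 1}]
coerced = coerce_numbers(docs, {"code"})
```

Here both documents end up with the string "1" under code, which matches the expected output above. Restricting the conversion to named fields is a design choice of this sketch; the proposed parameter might instead apply to every string-typed column.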
Reference: the coerce_numbers_to_str option in pydantic, see https://docs.pydantic.dev/latest/api/fields/#pydantic.fields.Field