When reading a large parquet file from S3 using read_parquet, I get errors like:

ValueError: Length mismatch: Expected axis has 75536 elements, new values have 6741043 elements

The "expected axis" count matches the integer value of chunked (or 65_536 if chunked=True).
Traceback (most recent call last):
  File "refresh.py", line 3, in <module>
    scores.refresh_score_partitions()
  File "/Users/sameckert/aw/project_explorer/app/scores.py", line 152, in refresh_score_partitions
    for df in dfs:
  File "/Users/sameckert/Library/Caches/pypoetry/virtualenvs/project-explorer-sNmfCv15-py3.7/lib/python3.7/site-packages/awswrangler/s3/_read_parquet.py", line 400, in _read_parquet_chunked
    path_root=path_root,
  File "/Users/sameckert/Library/Caches/pypoetry/virtualenvs/project-explorer-sNmfCv15-py3.7/lib/python3.7/site-packages/awswrangler/s3/_read_parquet.py", line 295, in _arrowtable2df
    df = _apply_index(df=df, metadata=metadata)
  File "/Users/sameckert/Library/Caches/pypoetry/virtualenvs/project-explorer-sNmfCv15-py3.7/lib/python3.7/site-packages/awswrangler/s3/_read_parquet.py", line 224, in _apply_index
    df.index = pd.RangeIndex(start=col["start"], stop=col["stop"], step=col["step"])
  File "/Users/sameckert/Library/Caches/pypoetry/virtualenvs/project-explorer-sNmfCv15-py3.7/lib/python3.7/site-packages/pandas/core/generic.py", line 5154, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/_libs/properties.pyx", line 66, in pandas._libs.properties.AxisProperty.__set__
  File "/Users/sameckert/Library/Caches/pypoetry/virtualenvs/project-explorer-sNmfCv15-py3.7/lib/python3.7/site-packages/pandas/core/generic.py", line 564, in _set_axis
    self._mgr.set_axis(axis, labels)
  File "/Users/sameckert/Library/Caches/pypoetry/virtualenvs/project-explorer-sNmfCv15-py3.7/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 227, in set_axis
    f"Length mismatch: Expected axis has {old_len} elements, new "
ValueError: Length mismatch: Expected axis has 75536 elements, new values have 6741043 elements
Environment
asn1crypto==1.4.0; python_version >= "3.6" and python_version < "3.10"
awswrangler==2.9.0; python_version >= "3.6" and python_version < "3.10"
beautifulsoup4==4.9.3; python_version >= "3.6" and python_version < "3.10"
boto3==1.17.100; python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.2" and python_full_version < "4.0.0"
botocore==1.20.100; python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.2" and python_full_version < "4.0.0" and (python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.10" or python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.0")
certifi==2021.5.30; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
chardet==4.0.0; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
click==7.1.2; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
dataclasses==0.8; python_version >= "3.6" and python_version < "3.7" and python_full_version >= "3.6.1"
et-xmlfile==1.1.0; python_version >= "3.6" and python_version < "3.10"
fastapi==0.63.0; python_version >= "3.6"
future==0.18.2; python_version >= "2.6" and python_full_version < "3.0.0" or python_full_version >= "3.3.0"
h11==0.12.0; python_version >= "3.6"
idna==2.10; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
jmespath==0.10.0; python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.10" or python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.0"
llvmlite==0.36.0; python_version >= "3.6" and python_version < "3.10"
lmdb==1.2.1
lxml==4.6.3; python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.10" or python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.5.0"
mysqlclient==2.0.3; python_version >= "3.5"
nmslib==2.1.1
numba==0.53.1; python_version >= "3.6" and python_version < "3.10"
numpy==1.19.5; python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.1"
openpyxl==3.0.7; python_version >= "3.6" and python_version < "3.10"
pandas==1.1.5; python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.1"
pg8000==1.19.5; python_version >= "3.6" and python_version < "3.10"
psutil==5.8.0; python_version >= "2.6" and python_full_version < "3.0.0" or python_full_version >= "3.4.0"
pyarrow==4.0.1; python_version >= "3.6" and python_version < "3.10"
pyathena==2.3.0; python_full_version >= "3.6.1" and python_full_version < "4.0.0"
pybind11==2.6.1; python_version >= "2.7" and python_version < "3.0" or python_version > "3.0" and python_version < "3.1" or python_version > "3.1" and python_version < "3.2" or python_version > "3.2" and python_version < "3.3" or python_version > "3.3" and python_version < "3.4" or python_version > "3.4"
pydantic==1.8.2; python_full_version >= "3.6.1" and python_version >= "3.6"
pymysql==1.0.2; python_version >= "3.6" and python_version < "3.10"
python-dateutil==2.8.1; python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.1"
pytz==2021.1; python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.1"
redis==3.5.3; python_version >= "2.7" and python_full_version < "3.0.0" or python_full_version >= "3.5.0"
redshift-connector==2.0.882; python_version >= "3.6" and python_version < "3.10"
requests==2.25.1; python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.10" or python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.5.0"
retrying==1.3.3; python_full_version >= "3.6.2" and python_full_version < "4.0.0"
s3transfer==0.4.2; python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.10" or python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.0"
scramp==1.4.0; python_version >= "3.6" and python_version < "3.10"
six==1.16.0; python_version >= "3.6" and python_version < "3.10" and python_full_version >= "3.6.2" and python_full_version < "4.0.0"
soupsieve==2.2.1; python_version >= "3.6" and python_version < "3.10"
standardiser==0.1.12
starlette==0.13.6; python_version >= "3.6"
tenacity==6.3.1; python_full_version >= "3.6.2" and python_full_version < "4.0.0"
typing-extensions==3.10.0.0; python_full_version >= "3.6.1" and python_version >= "3.6" and python_version < "3.8"
urllib3==1.26.6; python_version >= "3.6" and python_full_version < "3.0.0" and python_version < "3.10" or python_full_version >= "3.6.0" and python_version < "3.10" and python_version >= "3.6"
uvicorn==0.13.4
To Reproduce
Failing code:
import boto3
import awswrangler as wr

boto3_session = boto3.Session()
dfs = wr.s3.read_parquet("s3://large_file.parquet.gz", chunked=75_536, ignore_index=True, boto3_session=boto3_session)
for df in dfs:
    print(len(df.index))
Unfortunately that's not possible. I temporarily got around this by downloading the file and then using pyarrow's iter_batches directly:
# Note: this snippet lives inside a generator function (hence the yield).
# Imports assumed: os, tempfile, awswrangler as wr, pyarrow.parquet as pq.
fh, temp_file = tempfile.mkstemp()
os.close(fh)
wr.s3.download(
    "s3://large_file.parquet.gz",  # placeholder S3 path, as in the failing example above
    local_file=temp_file,
    use_threads=True,
    boto3_session=boto3_session,
)
pfile = pq.ParquetFile(temp_file)
for table in pfile.iter_batches():
    yield table.to_pandas()
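For anyone who wants to adapt this, here is the workaround as a self-contained generator. This is a minimal sketch: the function name read_parquet_chunks is made up for illustration, and the S3 path is the placeholder from the failing example. The batch_size argument of pyarrow's iter_batches caps the rows per batch, which approximates what chunked=75_536 was meant to do.

import os
import tempfile

import awswrangler as wr
import boto3
import pyarrow.parquet as pq


def read_parquet_chunks(path, batch_size=75_536, boto3_session=None):
    # Download the parquet object to a temp file, then stream it back
    # out as pandas DataFrames of at most batch_size rows each.
    fh, temp_file = tempfile.mkstemp()
    os.close(fh)
    try:
        wr.s3.download(path, local_file=temp_file, use_threads=True,
                       boto3_session=boto3_session)
        pfile = pq.ParquetFile(temp_file)
        for batch in pfile.iter_batches(batch_size=batch_size):
            yield batch.to_pandas()
    finally:
        os.remove(temp_file)


for df in read_parquet_chunks("s3://large_file.parquet.gz",
                              boto3_session=boto3.Session()):
    print(len(df.index))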
Hi @JahnKhan, I am unable to reproduce the issue. In all my tests with various file sizes and chunksize values everything works fine, which suggests this is either a data issue or a concurrent modification of the object in S3.
Is this intermittent or consistent in your case? Does it happen only on specific data? Can you provide steps to reproduce?
@kukushking Please check this out; the same code works fine in version 1.8.1.
Yes, this is happening for me with AWS Wrangler as well.
I have a parquet file with 5000 rows and want to read it in chunks of 1000. It throws:
Length mismatch: Expected axis has 1000 elements, new values have 5000 elements
where 1000 is my chunk size and 5000 is the total number of rows in the parquet file.
The script looks like:
dfs = awswrangler.s3.read_parquet(
    path=f"s3://{self.bucket}/{self.object_key}",
    chunked=self.chunk_size,
)
for data in dfs:
    print(data)
The error is thrown while looping over the chunked generator.
Note: I have verified that the file path is correct and that this works fine with AWS Wrangler version 1.8.1.
This issue was also happening for me on the latest awswrangler version (2.16.1). After some debugging I discovered that the "length mismatch" error only happens with newer versions of pyarrow. I was running pyarrow 7.0.0, and the error occurs with pyarrow >= 3.0 (I believe; I haven't identified the exact breaking version).
The reason awswrangler 1.8.1 works is that it lists pyarrow~=1.0.0 as a requirement. When I combined awswrangler==2.16.1 with pyarrow==2.0.0, the chunked parameter worked as intended with integer arguments.
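Until the underlying bug is fixed, a cheap defensive option is to check the installed pyarrow version at runtime before relying on an integer chunked. This is only a sketch based on the observations above; the exact breaking pyarrow release is unconfirmed.

import awswrangler as wr
import pyarrow

# Per the observations above, integer `chunked` misbehaves with newer
# pyarrow releases (reportedly >= 3.0; the exact breaking version is
# unconfirmed). Fail loudly instead of silently mis-sizing chunks.
if int(pyarrow.__version__.split(".")[0]) >= 3:
    raise RuntimeError(
        f"pyarrow {pyarrow.__version__} may break read_parquet(chunked=<int>); "
        "pin pyarrow==2.0.0 alongside awswrangler==2.16.1, or fall back to "
        "the download-and-iter_batches workaround shown earlier."
    )

dfs = wr.s3.read_parquet("s3://large_file.parquet.gz", chunked=1000)
for df in dfs:
    print(len(df.index))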