@muhammadbadar1998
Contributor

This PR adds Parquet file handling:

  • Adds fromparquet and toparquet for reading and writing Parquet tables.
  • Hooks these routines onto the core Table API.
  • Adds tests.
  • Updates the I/O documentation.
  • Includes pandas and pyarrow in the test requirements.

Closes issue #627.

def test_fromparquet(tmp_path):
    path = make_sample(tmp_path)
    tbl = etl.io.fromparquet(str(path))
    assert tbl.header() == ('x',)

Check warning

Code scanning / Bandit (reported by Codacy)

Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.

path = make_sample(tmp_path)
tbl = etl.io.fromparquet(str(path))
assert tbl.header() == ('x',)
assert list(tbl.values()) == [(1,), (2,), (3,)]

Check warning

Code scanning / Bandit (reported by Codacy)

Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.

out = tmp_path / 'out.parquet'
tbl.toparquet(str(out))
df2 = pd.read_parquet(out)
assert list(df2['y']) == [10,20]

Check warning

Code scanning / Bandit (reported by Codacy)

Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.

if not indices and field == () and len(hdr) == 1:
    indices = [0]

assert indices, 'no field selected'

Check warning

Code scanning / Bandit (reported by Codacy)

Use of assert detected. The enclosed code will be removed when compiling to optimised byte code.
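
For library code such as the snippet above (as opposed to the pytest assertions, where `assert` is the norm), one way to address this warning is to raise an explicit exception. A minimal sketch, assuming a hypothetical helper named `resolve_indices` and a local stand-in for `petl.errors.FieldSelectionError`; the real surrounding function is not shown in this thread:

```python
class FieldSelectionError(Exception):
    """Local stand-in for petl.errors.FieldSelectionError (assumption)."""


def resolve_indices(hdr, field, indices):
    """Hypothetical wrapper around the snippet flagged by Bandit."""
    # Same fallback as in the PR: a single-column table with an empty
    # field selection defaults to its only column.
    if not indices and field == () and len(hdr) == 1:
        indices = [0]

    # Explicit raise instead of `assert`, so the check survives `python -O`.
    if not indices:
        raise FieldSelectionError('no field selected')
    return indices
```

Unlike `assert`, the explicit `raise` is not stripped when Python runs with optimisation enabled, so callers always see the error.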


from petl.io.gsheet import fromgsheet, togsheet, appendgsheet

from petl.io.parquet import fromparquet, toparquet

Check warning

Code scanning / Ruff (reported by Codacy)

`petl.io.parquet.fromparquet` imported but unused; consider removing, adding to `__all__`, or using a redundant alias (F401)


from petl.io.gsheet import fromgsheet, togsheet, appendgsheet

from petl.io.parquet import fromparquet, toparquet

Check warning

Code scanning / Ruff (reported by Codacy)

`petl.io.parquet.toparquet` imported but unused; consider removing, adding to `__all__`, or using a redundant alias (F401)

from __future__ import absolute_import, print_function, division

# standard library dependencies
from petl.compat import PY2

Check warning

Code scanning / Ruff (reported by Codacy)

`petl.compat.PY2` imported but unused (F401)




import operator

Check warning

Code scanning / Ruff (reported by Codacy)

Module level import not at top of file (E402)




import operator

Check warning

Code scanning / Ruff (reported by Codacy)

Redefinition of unused `operator` from line 8 (F811)


@github-advanced-security github-advanced-security bot left a comment

Pylint (reported by Codacy) found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.


from petl.io.gsheet import fromgsheet, togsheet, appendgsheet

from petl.io.parquet import fromparquet, toparquet

Check warning

Code scanning / Prospector (reported by Codacy)

'petl.io.parquet.fromparquet' imported but unused (F401)

from __future__ import absolute_import, print_function, division

# standard library dependencies
from petl.compat import PY2

Check warning

Code scanning / Prospector (reported by Codacy)

Unused PY2 imported from petl.compat (unused-import)




import operator

Check warning

Code scanning / Prospector (reported by Codacy)

Reimport 'operator' (imported line 8) (reimported)




import operator

Check warning

Code scanning / Prospector (reported by Codacy)

redefinition of unused 'operator' from line 8 (F811)




import operator

Check warning

Code scanning / Prospector (reported by Codacy)

Import "import operator" should be placed at the top of the module (wrong-import-position)

@@ -0,0 +1,64 @@
# -*- coding: utf-8 -*-

Check warning

Code scanning / Pylintpython3 (reported by Codacy)

Missing module docstring

from __future__ import absolute_import, print_function, division

# standard library dependencies
from petl.compat import PY2

Check notice

Code scanning / Pylintpython3 (reported by Codacy)

Unused PY2 imported from petl.compat



# third-party dependencies
import pandas as pd

Check warning

Code scanning / Pylintpython3 (reported by Codacy)

third party import "pandas" should be placed before first party imports "petl.compat.PY2", "petl.io.pandas.fromdataframe", "petl.util.base.Table", "petl.io.sources.read_source_from_arg"


src = read_source_from_arg(source)
with src.open('rb') as f:
    df = pd.read_parquet(f, **kwargs)

Check warning

Code scanning / Pylintpython3 (reported by Codacy)

Module 'pandas' has no 'read_parquet' member




import operator

Check warning

Code scanning / Pylintpython3 (reported by Codacy)

Imports from package operator are not grouped




import operator

Check notice

Code scanning / Pylintpython3 (reported by Codacy)

Reimport 'operator' (imported line 8)




import operator

Check warning

Code scanning / Pylintpython3 (reported by Codacy)

Import "import operator" should be placed at the top of the module




import operator

Check warning

Code scanning / Pylintpython3 (reported by Codacy)

standard import "operator" should be placed before first party imports "petl.compat.imap", "petl.errors.FieldSelectionError", "petl.comparison.comparable_itemgetter"

@muhammadbadar1998 force-pushed the issue-627-parquet-output branch from 7e81cf6 to 8b302aa on July 3, 2025 20:33
@@ -0,0 +1,23 @@
import pandas as pd

Check warning

Code scanning / Pylintpython3 (reported by Codacy)

Django was not configured. For more information run pylint --load-plugins=pylint_django --help-msg=django-not-configured

@@ -0,0 +1,23 @@
import pandas as pd

Check warning

Code scanning / Pylintpython3 (reported by Codacy)

Missing module docstring



def make_sample(tmp_path):
    df = pd.DataFrame([{'x': 1}, {'x': 2}, {'x': 3}])

Check warning

Code scanning / Pylintpython3 (reported by Codacy)

Module 'pandas' has no 'DataFrame' member

@@ -0,0 +1,64 @@
# -*- coding: utf-8 -*-

Check warning

Code scanning / Pylint (reported by Codacy)

Missing module docstring

from __future__ import absolute_import, print_function, division

# standard library dependencies
from petl.compat import PY2

Check notice

Code scanning / Pylint (reported by Codacy)

Unused PY2 imported from petl.compat

"""

src = read_source_from_arg(source)
with src.open('rb') as f:

Check warning

Code scanning / Pylint (reported by Codacy)

Variable name "f" doesn't conform to snake_case naming style


src = read_source_from_arg(source)
with src.open('rb') as f:
    df = pd.read_parquet(f, **kwargs)

Check warning

Code scanning / Pylint (reported by Codacy)

Variable name "df" doesn't conform to snake_case naming style


src = read_source_from_arg(source)
with src.open('rb') as f:
    df = pd.read_parquet(f, **kwargs)

Check warning

Code scanning / Pylint (reported by Codacy)

Module 'pandas' has no 'read_parquet' member




import operator

Check warning

Code scanning / Pylint (reported by Codacy)

standard import "import operator" should be placed before "from petl.compat import imap, izip, izip_longest, ifilter, ifilterfalse, reduce, next, string_types, text_type"




import operator

Check warning

Code scanning / Pylint (reported by Codacy)

Import "import operator" should be placed at the top of the module




import operator

Check notice

Code scanning / Pylint (reported by Codacy)

Reimport 'operator' (imported line 8)




import operator

Check warning

Code scanning / Pylint (reported by Codacy)

Imports from package operator are not grouped

setup.py Outdated
'xlsx': ['openpyxl>=2.6.2'],
'xpath': ['lxml>=4.4.0'],
'whoosh': ['whoosh'],
"parquet": ["pandas>=1.3.0","pyarrow>=4.0.0"]

Check warning

Code scanning / Pylint (reported by Codacy)

Exactly one space required after comma

@coveralls

coveralls commented Jul 3, 2025

Pull Request Test Coverage Report for Build 16583777847

Details

  • 24 of 81 (29.63%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.3%) to 90.584%

Changes Missing Coverage      Covered Lines    Changed/Added Lines    %
petl/util/base.py             9                11                     81.82%
petl/test/io/test_arrow.py    7                34                     20.59%
petl/io/arrow.py              7                35                     20.0%

Totals Coverage Status
Change from base Build 16205643418: -0.3%
Covered Lines: 13478
Relevant Lines: 14879

💛 - Coveralls

"""
src = write_source_from_arg(source)
with src.open('wb') as f:
    df = todataframe(table)

Check warning

Code scanning / Pylint (reported by Codacy)

Variable name "df" doesn't conform to snake_case naming style

@juarezr self-requested a review on July 10, 2025 20:26
@juarezr added the Feature label (A nice to have thing that we don't have yet) on Jul 10, 2025
@juarezr
Member

juarezr commented Jul 10, 2025

@muhammadbadar1998,

Nice addition!

Maybe you would consider:

  • Removing the use of the pandas package in the functions fromparquet and toparquet:
    • Because pandas loads all the data into memory, it would not suit every use case.
    • Alternatively, you could call pyarrow directly.
  • Renaming the functions to fromarrow and toarrow, and documenting that they can be used to load Parquet and other formats.

@github-advanced-security

This pull request sets up GitHub code scanning for this repository. Once the scans have completed and the checks have passed, the analysis results for this pull request branch will appear on this overview. Once you merge this pull request, the 'Security' tab will show more code scanning analysis results (for example, for the default branch). Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results. For more information about GitHub code scanning, check out the documentation.

@muhammadbadar1998 force-pushed the issue-627-parquet-output branch from f2d0d2c to 56071d9 on July 29, 2025 00:51
@muhammadbadar1998
Contributor Author

> @muhammadbadar1998,
>
> Nice addition!
>
> Maybe you would consider:
>
>   • Removing the use of the pandas package in the functions fromparquet and toparquet:
>     • Because pandas loads all the data into memory, it would not suit every use case.
>     • Alternatively, you could call pyarrow directly.
>   • Renaming the functions to fromarrow and toarrow, and documenting that they can be used to load Parquet and other formats.

@juarezr Thank you for the feedback!
I’ve removed the pandas dependency from fromparquet/toparquet and renamed them to fromarrow/toarrow.

I also noticed the new GitHub workflows are rejecting multiple SARIF runs (“Error: The CodeQL Action does not support uploading multiple SARIF runs with the same category.”). I’m not yet familiar enough with the Actions YAML to fix it immediately — would you like me to open a follow‑up issue and investigate it separately?
