Issue 627 parquet output #683

muhammadbadar1998 · 2025-07-03T19:54:31Z

This PR adds Parquet file handling:

- Adds fromparquet and toparquet for reading and writing Parquet tables.
- Hooks these routines onto the core Table API.
- Adds tests.
- Updates the I/O documentation.
- Includes pandas and pyarrow in the test requirements.

Closes issue #627.

petl/test/io/test_parquet.py

+def test_fromparquet(tmp_path):
+    path = make_sample(tmp_path)
+    tbl = etl.io.fromparquet(str(path))
+    assert tbl.header() == ('x',)
+    assert list(tbl.values()) == [(1,), (2,), (3,)]

petl/test/io/test_parquet.py

+    out = tmp_path / 'out.parquet'
+    tbl.toparquet(str(out))
+    df2 = pd.read_parquet(out)
+    assert list(df2['y']) == [10,20]


petl/util/base.py

+    if not indices and field == () and len(hdr) == 1:
+        indices = [0]
+
+    assert indices, 'no field selected'
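
The fallback in this hunk can be read as "when no field is named and the table has exactly one column, select that column". Below is a minimal standalone sketch of that rule. `select_indices` and `values` are hypothetical stand-ins written for illustration, not petl's actual helpers, and the real code may resolve fields differently.

```python
import operator


def select_indices(hdr, field=()):
    """Resolve field names to column positions (hypothetical helper)."""
    indices = [hdr.index(f) for f in field]
    # Fallback mirrored from the diff: no field given and a single column.
    if not indices and field == () and len(hdr) == 1:
        indices = [0]
    assert indices, 'no field selected'
    return indices


def values(rows, hdr, *field):
    """Return the selected column(s) from each row."""
    getter = operator.itemgetter(*select_indices(hdr, field))
    return [getter(row) for row in rows]


single = values([(1,), (2,), (3,)], ('x',))
named = values([(1, 'a'), (2, 'b')], ('x', 'y'), 'y')
```

With more than one column and no field named, the assertion still fires, so the fallback only changes behaviour for single-column tables.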


petl/io/__init__.py


 from petl.io.gsheet import fromgsheet, togsheet, appendgsheet
+
+from petl.io.parquet import fromparquet, toparquet


petl/io/parquet.py

+from __future__ import absolute_import, print_function, division
+
+# standard library dependencies
+from petl.compat import PY2


petl/util/base.py


+
+
+import operator


github-advanced-security

Pylint (reported by Codacy) found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

petl/io/parquet.py

@@ -0,0 +1,64 @@
+# -*- coding: utf-8 -*-


petl/io/parquet.py

+
+
+# third-party dependencies
+import pandas as pd


petl/io/parquet.py

+
+    src = read_source_from_arg(source)
+    with src.open('rb') as f:
+        df = pd.read_parquet(f, **kwargs)


petl/test/io/test_parquet.py

@@ -0,0 +1,23 @@
+import pandas as pd


petl/test/io/test_parquet.py

+
+
+def make_sample(tmp_path):
+    df = pd.DataFrame([{'x': 1}, {'x': 2}, {'x': 3}])


setup.py

        'xlsx': ['openpyxl>=2.6.2'],
        'xpath': ['lxml>=4.4.0'],
        'whoosh': ['whoosh'],
+        "parquet": ["pandas>=1.3.0","pyarrow>=4.0.0"]
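
Since pandas and pyarrow are declared as an optional extra here, the I/O module would normally import them lazily and fail with a helpful message when they are missing. A rough sketch of that pattern follows. `require_optional` is a hypothetical helper, not something this PR or petl provides, and the `pip install petl[parquet]` hint assumes the `parquet` key sits in `extras_require`.

```python
import importlib


def require_optional(module_name, extra):
    """Import an optional dependency or raise with an install hint.

    Hypothetical helper for illustration only; not part of petl.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        raise ImportError(
            "%s is required for this feature; try: pip install petl[%s]"
            % (module_name, extra)
        )


# A module from the standard library imports normally.
json_module = require_optional('json', 'parquet')
```

Calling the helper inside the reader and writer functions, rather than at module import time, keeps `import petl` working for users who never touch Parquet.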


coveralls · 2025-07-03T21:30:46Z

Pull Request Test Coverage Report for Build 16583777847

Details

24 of 81 (29.63%) changed or added relevant lines in 4 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage decreased (-0.3%) to 90.584%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| petl/util/base.py | 9 | 11 | 81.82% |
| petl/test/io/test_arrow.py | 7 | 34 | 20.59% |
| petl/io/arrow.py | 7 | 35 | 20.0% |

Totals
Change from base Build 16205643418: -0.3%
Covered Lines: 13478
Relevant Lines: 14879

💛 - Coveralls

petl/io/parquet.py

+    """
+    src = write_source_from_arg(source)
+    with src.open('wb') as f:
+        df = todataframe(table)


juarezr · 2025-07-10T20:34:00Z

@muhammadbadar1998,

Nice addition!

Maybe you would consider:

- Removing the use of the pandas package in the functions fromparquet and toparquet:
  - As pandas loads all data into memory, it would not suit all use cases.
  - Alternatively, you could call pyarrow directly.
- Renaming the functions to fromarrow and toarrow, and documenting that they can be used to load Parquet and other formats.

github-advanced-security · 2025-07-27T23:38:53Z

This pull request sets up GitHub code scanning for this repository. Once the scans have completed and the checks have passed, the analysis results for this pull request branch will appear on this overview. Once you merge this pull request, the 'Security' tab will show more code scanning analysis results (for example, for the default branch). Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results. For more information about GitHub code scanning, check out the documentation.

muhammadbadar1998 · 2025-07-29T01:07:21Z

> @muhammadbadar1998,
>
> Nice addition!
>
> Maybe you would consider:
>
> - Removing the use of the pandas package in the functions fromparquet and toparquet:
>   - As pandas loads all data into memory, it would not suit all use cases.
>   - Alternatively, you could call pyarrow directly.
> - Renaming the functions to fromarrow and toarrow, and documenting that they can be used to load Parquet and other formats.

@juarezr Thank you for the feedback!
I’ve removed the pandas dependency from fromparquet/toparquet and renamed them to fromarrow/toarrow.

I also noticed the new GitHub workflows are rejecting multiple SARIF runs (“Error: The CodeQL Action does not support uploading multiple SARIF runs with the same category.”). I’m not yet familiar enough with the Actions YAML to fix it immediately. Would you like me to open a follow-up issue and investigate it separately?

github-advanced-security bot found potential problems Jul 3, 2025

Implement Parquet I/O and add docs/tests (closes petl-developers#627)

8b302aa

muhammadbadar1998 force-pushed the issue-627-parquet-output branch from 7e81cf6 to 8b302aa Compare July 3, 2025 20:33

muhammadbadar1998 added 4 commits July 3, 2025 16:51

docs: add Parquet to supported I/O formats list

3ed6af4

docs: add Parquet to supported I/O formats list

aebd345

fixed spacing in io.rst

2331fa7

make parquet example self-contained with tempfile

5bdd2a7

github-advanced-security bot found potential problems Jul 3, 2025

Remove hard-coded example from parquet.fromparquet docstring

407c5ed

github-advanced-security bot found potential problems Jul 3, 2025

muhammadbadar1998 added 5 commits July 3, 2025 19:01

fixed panas version

f592993

fixed pyarrow version

d6562e9

fixed ancii charachter problem in commit

35cbf59

fixed ancii charachter problem in commit

2a81e48

fixed ancii charachter problem in commit

03cec66

juarezr self-requested a review July 10, 2025 20:26

juarezr added the Feature label (a nice-to-have thing that we don't have yet) Jul 10, 2025

muhammadbadar1998 added 5 commits July 26, 2025 19:14

updated to use pyarrow and support multiple arrow formats including parquet

20b7e07

updated install.rst to refrence corrent io

7425a60

adjusted arrow to work with python 2 and 3

e82ca7f

fixed python 2 logic

cfad14b

fixed arrow test to be compatible with python 2

56071d9

muhammadbadar1998 force-pushed the issue-627-parquet-output branch from f2d0d2c to 56071d9 Compare July 29, 2025 00:51

