CSV reader ignore brackets

Discussion:

Mihir Kothari

2019-09-24 19:55:25 UTC

Hi Team,

I am using python 3.4. I have a CSV file as below:

ABC,PQR,(TEST1,TEST2)
FQW,RTE,MDE

Basically comma-separated rows, where some rows have data in a column which
is array-like, i.e. in brackets.
So I need to read the file and treat such columns as one, i.e. do not
split on a comma if it is inside the brackets.

In short, I need to read a CSV file where separators inside brackets
need to be ignored.

Output:
Column: 1 2 3
Row1: ABC PQR (TEST1,TEST2)
Row2: FQW RTE MDE

Can you please help with the snippet?

Regards,
Mihir.

Cameron Simpson

2019-09-24 23:09:02 UTC

Post by Mihir Kothari
ABC,PQR,(TEST1,TEST2)
FQW,RTE,MDE

Really? No quotes around the (TEST1,TEST2) column value? I would have
said this is invalid data, but that does not help you.

Post by Mihir Kothari
Basically comma-separated rows, where some rows have a data in column which
is array like i.e. in brackets.
So I need to read the file and treat such columns as one i.e. do not
separate based on comma if it is inside the bracket.
In short I need to read a CSV file where separator inside the brackets
needs to be ignored.
Column: 1 2 3
Row1: ABC PQR (TEST1,TEST2)
Row2: FQW RTE MDE
Can you please help with the snippet?

I would be reaching for a regular expression. If you partition your
values into 2 types: those starting and ending in a bracket, and those
not, you could write a regular expression for the former:

\([^)]*\)

which matches a string like (.....) with, importantly, no embedded
brackets, only those at the beginning and end.

And you can write a regular expression like:

[^,]*

for a value containing no commas i.e. all the other values.

Test the bracketed one first, because the second one always matches
something.

Then you would not use the CSV module (which expects better formed data
than you have) and instead write a simple parser for a line of text
which tries to match one of these two expressions repeatedly to consume
the line. Something like this (UNTESTED):

import re

bracketed_re = re.compile(r'\([^)]*\)')
no_commas_re = re.compile(r'[^,]*')

def split_line(line):
    line = line.rstrip()  # drop trailing whitespace/newline
    fields = []
    offset = 0
    while offset < len(line):
        m = bracketed_re.match(line, offset)
        if m:
            field = m.group()
        else:
            m = no_commas_re.match(line, offset)  # this always matches
            field = m.group()
        fields.append(field)
        offset += len(field)
        if line.startswith(',', offset):
            # another column
            offset += 1
        elif offset < len(line):
            raise ValueError(
                "incomplete parse at offset %d, line=%r" % (offset, line))
    return fields

Then read the lines of the file and split them into fields:

rows = []
with open(datafilename) as f:
    for line in f:
        fields = split_line(line)
        rows.append(fields)

So basically you're writing a little parser. If you have nested brackets
things get harder.
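
For the nested case, one option is to drop the regular expressions and track bracket depth by hand. The following is only a sketch under assumptions the thread never confirms: round brackets are the only grouping characters, they are balanced within a single line, and no quoting is in use. The name split_nested is made up for illustration. Unlike split_line above, it keeps a trailing empty field and does not require a bracket to sit at the start or end of a field.

```python
def split_nested(line):
    """Split on commas that are not inside (possibly nested) round brackets."""
    line = line.rstrip("\r\n")  # drop the line ending only
    fields = []
    depth = 0   # current bracket nesting level
    start = 0   # offset where the current field begins
    for pos, ch in enumerate(line):
        if ch == "(":
            depth += 1
        elif ch == ")":
            if depth == 0:
                raise ValueError(
                    "unbalanced ')' at offset %d, line=%r" % (pos, line))
            depth -= 1
        elif ch == "," and depth == 0:
            # a top-level comma ends the current field
            fields.append(line[start:pos])
            start = pos + 1
    if depth != 0:
        raise ValueError("unclosed '(' in line=%r" % (line,))
    fields.append(line[start:])  # the last field has no trailing comma
    return fields


print(split_nested("ABC,PQR,(TEST1,TEST2)\n"))
print(split_nested("A,(B,(C,D)),E\n"))
```

Raising on unbalanced brackets keeps bad rows visible instead of quietly splitting them in the wrong place.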

Cheers,
Cameron Simpson <***@cskk.id.au>

Stefan Ram

2019-09-25 17:12:55 UTC

Post by Cameron Simpson
Really? No quotes around the (TEST1,TEST2) column value? I would have
said this is invalid data, but that does not help you.

The language to be used for interpretation of the file is
not CSV when it is required that it treats »(TEST1,TEST2)«
as a single column value.

So the task is to write a parser for some language which is
not CSV and whose "specification" is a single example.

From my point of view, it is no fun to spend time on such a
requirement.

MRAB

2019-09-24 23:50:40 UTC

Post by Cameron Simpson

Post by Mihir Kothari
ABC,PQR,(TEST1,TEST2)
FQW,RTE,MDE

Really? No quotes around the (TEST1,TEST2) column value? I would have
said this is invalid data, but that does not help you.

I would be reaching for a regular expression. [...]

Then you would not use the CSV module (which expects better formed data
than you have) and instead write a simple parser for a line of text
which tries to match one of these two expressions repeatedly to consume
the line. [...]

You can simplify that somewhat to this:

import re
rows = []

with open(datafilename) as f:
    for line in f:
        rows.append(re.findall(r'(\([^)]*\)|(?=.)[^,\n]*),?', line))
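
To see what that pattern does without a data file, here is a small sketch that runs it over a few in-memory lines. The sample list and the name field_re are made up for illustration; the pattern itself is copied from the post. The (?=.) lookahead requires at least one more non-newline character, which stops the second alternative from producing a spurious empty match at the newline or at the end of the string.

```python
import re

# The pattern from the post above, compiled once.
field_re = re.compile(r'(\([^)]*\)|(?=.)[^,\n]*),?')

# Hypothetical sample input: the OP's two rows plus a row with an empty field.
sample = ["ABC,PQR,(TEST1,TEST2)\n", "FQW,RTE,MDE\n", "A,,B\n"]

rows = [field_re.findall(line) for line in sample]
for row in rows:
    print(row)

# Malformed input is split silently rather than rejected:
print(field_re.findall("(a)b\n"))
```

Note the trade-off against the explicit parser: this version never raises, so a value like (a)b quietly becomes two fields.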

Skip Montanaro

2019-09-25 00:02:46 UTC

How about just replacing \(([^)]*)\) with "\1" in a wrapper class's
line reading method? (I think I have the re syntax approximately right.)
The csv reader will "just work". Again, nesting parens not allowed.

Skip

Piet van Oostrum

2019-09-26 15:54:45 UTC

Post by Skip Montanaro
How about just replacing \(([^)]*)\) with "\1" in a wrapper class's
line reading method? (I think I have the re syntax approximately right.)
The csv reader will "just work". Again, nesting parens not allowed.
Skip

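Here is some working code:

import csv
import re

def PReader(csvfile):
    # Wrap each bracketed group in double quotes so csv treats it as one field.
    for line in csvfile:
        line = re.sub(r'\(.*?\)', r'"\g<0>"', line)
        yield line

# newline='' is what the csv module documentation recommends for files
with open('testcsv.csv', newline='') as csvfile:
    reader = csv.reader(PReader(csvfile), quotechar='"')
    for row in reader:
        print(row)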

--
Piet van Oostrum <piet-***@vanoostrum.org>
WWW: http://piet.vanoostrum.org/
PGP key: [8DAE142BE17999C4]

Cameron Simpson

2019-09-25 01:09:28 UTC

Post by Skip Montanaro
How about just replacing \(([^)]*)\) with "\1" in a wrapper class's
line reading method?

Will that work if the OP's (TEST1,TEST2) term itself contains quotes?
Not that his example data did, but example data are usually incomplete
:-)

Also, that would match FOO(TEST1,TEST2)BAH as well, making
FOO"(TEST1,TEST2)"BAH. Which might be wanted, or not wanted, or be
bad data (including but not restricted to data the csv module cannot parse).
I was deliberately being very conservative and kind of treating brackets
like quotes (needed at start and end) but not trying to hit things in
one go. Better to match exactly the special case you expect and then
scour for mismatches than to incorrectly match and have that mistake
buried in the data.
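
One way to combine the two ideas is sketched below: quote a bracketed value only when it fills a whole field, and double any quotes inside it before handing the line to the csv module. This is an illustration, not anything posted in the thread; the names whole_field_re and quote_bracketed and the sample data are made up, and it assumes no nested brackets and no fields that were already quoted.

```python
import csv
import io
import re

# A bracketed value counts only if it fills a whole field: it starts at the
# beginning of the line or right after a comma, and is followed by a comma
# or the end of the line. Nested brackets are deliberately not matched.
whole_field_re = re.compile(r'(?:^|(?<=,))\([^()]*\)(?=,|$)')


def quote_bracketed(lines):
    """Yield lines with whole-field bracketed values wrapped in quotes."""
    for line in lines:
        line = line.rstrip("\r\n")
        # Double embedded quotes so csv reads them back as literal quotes.
        yield whole_field_re.sub(
            lambda m: '"' + m.group().replace('"', '""') + '"', line)


# Hypothetical sample data, including a bracketed value containing quotes.
sample = io.StringIO(
    'ABC,PQR,(TEST1,TEST2)\nFQW,RTE,MDE\nX,(say "hi",ok)\n')
rows = list(csv.reader(quote_bracketed(sample)))
for row in rows:
    print(row)
```

A line such as FOO(TEST1,TEST2)BAH is left untouched, so it comes out of the csv reader with an extra column; checking the column count of each row is then a cheap way to scour for mismatches.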

Post by Skip Montanaro
(I think I have the re syntax approximately right.)
The csv reader will "just work". Again, nesting parens not allowed.

Otherwise, a neat idea.

<snark>Besides, the point isn't the shortest code but to illustrate the
idea of handling special syntax.</snark>

Cheers,
Cameron Simpson <***@cskk.id.au>

Skip Montanaro

2019-09-25 17:50:18 UTC

Post by Cameron Simpson
<snark>Besides, the point isn't the shortest code but to illustrate the
idea of handling special syntax.</snark>

In my defense, I was typing on my phone while watching a show on
Netflix. I was hardly in a position to test any code. :-)

As you indicated though, the problem is under-specified (nesting?
presence of quotation marks? newlines between balanced parens? input
size? etc.). It probably does little good to try and cook up a
comprehensive solution to such a problem. Better to just toss out some
ideas for the OP and let them mull it over, maybe try to solve the
problem themselves.

Skip