Discussion:
grep and anchoring
Daniël de Kok
2016-06-26 13:10:57 UTC
Dear all,

After a BSD hiatus of many years, I am tinkering with FreeBSD again. I’ve run into some strange issue with grep and beginning of line (^) anchoring:


% echo "1234 1234 1234" | egrep -o '^….'
1234
123
4 12
% echo "123412341234" | egrep -o '^....'
1234
1234
1234


Any idea what is going on here?

(Recent GNU grep from ports gives the expected output.)

With kind regards,
Daniël de Kok

P.S. grep on OS X, which seems to be the same version, has similar problems…
Polytropon
2016-06-26 14:34:11 UTC
Post by Daniël de Kok
Dear all,
After a BSD hiatus of many years, I am tinkering with FreeBSD again.
I’ve run into some strange issue with grep and beginning of line (^)

% echo "1234 1234 1234" | egrep -o '^….'
1234
123
4 12
% echo "123412341234" | egrep -o '^....'
1234
1234
1234

Any idea what is going on here?
I think what you see here is a typical "UTF-8 fsck-up".
The first search pattern contains an ellipsis ("…",
a single character, 3 bytes long in UTF-8, that looks
like 3 dots), and a single dot (".", one byte long,
1 character); the second pattern contains four dots
(4 x ".", each 1 byte long, 1 character).
Of course grep interprets "…" and "..." differently.
In my mailer, I can see the difference clearly as the
ellipsis … is displayed in monospace font as a _one_
character wide symbol on the screen.
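
The character and byte counts can be checked with a short Python sketch. It uses Python's re module as a stand-in, not grep itself, so it only approximates how grep would treat such a pattern. It shows that the ellipsis is one code point encoded as three bytes in UTF-8, and that a pattern containing a literal ellipsis cannot match the digits-and-spaces input at all.

```python
import re

# U+2026 HORIZONTAL ELLIPSIS: the character a mail client may substitute for "..."
ELLIPSIS = "\u2026"

def utf8_len(text):
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

line = "1234 1234 1234"

print(len(ELLIPSIS), utf8_len(ELLIPSIS))       # 1 3  (one character, three bytes)
print(len("..."), utf8_len("..."))             # 3 3  (three characters, three bytes)
print(re.search("^" + ELLIPSIS + ".", line))   # None (no literal ellipsis in the input)
print(re.search("^....", line).group(0))       # 1234
```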

Or is this just an "enrichment" your MUA added? :-)

I'm quite sure you will run into similar problems when
you include ligatures (like st, ft, ffi, ck or the like)
or one of the many different hyphens and spaces in a
search pattern. :-)

Otherwise, your example seems to show the expected
behaviour.

% echo "1234 1234 1234" | egrep -o '^....'
1234
123
4 12

% echo "123412341234" | egrep -o '^....'
1234
1234
1234

First 4-character pattern is "1234", next is " 123",
and last is "4 12" (each 4 characters wide, as the
space character " " is also "any character" that matches
the . pattern). In the second example, the groups match
4 characters each ("1234" x 3).
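
The chunking described here is what an unanchored pattern (.... without the leading ^) would produce. A minimal sketch in Python, using the re module as a stand-in for grep (the helper name chunks is made up for illustration):

```python
import re

def chunks(pattern, line):
    """Return all successive, non-overlapping matches of pattern in line."""
    return re.findall(pattern, line)

# Without '^', the line is carved into 4-character pieces; the trailing
# "34" in the first input is too short to match and is dropped.
print(chunks(r"....", "1234 1234 1234"))  # ['1234', ' 123', '4 12']
print(chunks(r"....", "123412341234"))    # ['1234', '1234', '1234']
```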

What different results did you expect? Or am I misinterpreting
your question?
--
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...
Eir Nym
2016-06-26 14:41:47 UTC
Post by Polytropon
Post by Daniël de Kok
Dear all,
After a BSD hiatus of many years, I am tinkering with FreeBSD again.
I’ve run into some strange issue with grep and beginning of line (^)

% echo "1234 1234 1234" | egrep -o '^….'
1234
123
4 12
% echo "123412341234" | egrep -o '^....'
1234
1234
1234

Any idea what is going on here?
I think what you see here is a typical "UTF-8 fsck-up".
The first search pattern contains an ellipsis ("…",
a single character, 3 bytes long in UTF-8, that looks
like 3 dots), and a single dot (".", one byte long,
1 character); the second pattern contains four dots
(4 x ".", each 1 byte long, 1 character).
Of course grep interprets "…" and "..." differently.
In my mailer, I can see the difference clearly as the
ellipsis … is displayed in monospace font as a _one_
character wide symbol on the screen.
I think this was automatic spelling correction, and he meant four dot characters (.), not a '…' followed by a '.'.
Post by Polytropon
Or is this just an "enrichment" your MUA added? :-)
I'm quite sure you will run into similar problems when
you include ligatures (like st, ft, ffi, ck or the like)
or one of the many different hyphens and spaces in a
search pattern. :-)
Otherwise, your example seems to show the expected
behaviour.
% echo "1234 1234 1234" | egrep -o '^....'
1234
123
4 12
% echo "123412341234" | egrep -o '^....'
1234
1234
1234
First 4-character pattern is "1234", next is " 123",
and last is "4 12" (each 4 characters wide, as the
space character " " is also "any character" that matches
the . pattern). In the second example, the groups match
4 characters each ("1234" x 3).
What different results did you expect? Or am I misinterpreting
your question?
--
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...
Daniël de Kok
2016-06-26 14:44:44 UTC
Post by Polytropon
Or is this just an "enrichment" your MUA added? :-)
Yes, Mac’s Mail.app likes to replace these. I didn’t use an ellipsis in the actual expression ;), just four dots.
Post by Polytropon
% echo "1234 1234 1234" | egrep -o '^....'
1234
123
4 12
[...]
Post by Polytropon
First 4-character pattern is "1234", next is " 123",
and last is "4 12" (each 4 characters wide, as the
space character " " is also "any character" that matches
the . pattern). In the second example, the groups match
4 characters each ("1234" x 3).
Note the anchoring (^): the pattern should only match any four characters at the beginning of the line, so the expected output is '1234' and nothing more. ' 123' and '4 12' are not at the beginning of the line and should consequently not be printed to stdout.

For comparison, the output of a recent GNU grep:


% echo "1234 1234 1234" | grep -o '^....'
1234


With kind regards,
Daniël de Kok
Matthew Seaman
2016-06-26 15:31:20 UTC
Post by Daniël de Kok
Note the anchoring (^): the pattern should only match any four characters at the beginning of the line, so the expected output is '1234' and nothing more. ' 123' and '4 12' are not at the beginning of the line and should consequently not be printed to stdout.

% echo "1234 1234 1234" | grep -o '^....'
1234

You are completely correct -- this is a bug in grep(1) on FreeBSD. In
all current releases, including the upcoming 11.0-RELEASE, grep is
actually GNU grep version 2.5.1. However, the same bug occurs in
bsdgrep(1):

% echo 1234 1234 1234 | bsdgrep -o '^....'
1234
123
4 12

There is already an open PR about a very similar issue:

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=201650

Could you update that PR with your findings? In particular, note that
bsdgrep(1) shows the same problem.

Cheers,

Matthew
Polytropon
2016-06-26 15:49:39 UTC
Post by Daniël de Kok
Post by Polytropon
Or is this just an "enrichment" your MUA added? :-)
Yes, Mac’s Mail.app likes to replace these. I didn’t use an ellipsis
in the actual expression ;), just four dots.
And it also seems to turn the apostrophe ' into a single
closing quote ’. :-)
Post by Daniël de Kok
Post by Polytropon
% echo "1234 1234 1234" | egrep -o '^....'
1234
123
4 12
[...]
Post by Polytropon
First 4-character pattern is "1234", next is " 123",
and last is "4 12" (each 4 characters wide, as the
space character " " is also "any character" that matches
the . pattern). In the second example, the groups match
4 characters each ("1234" x 3).
Note the anchoring (^): the pattern should only match any four
characters at the beginning of the line, so the expected output
is '1234' and nothing more. ' 123' and '4 12' are not at the
beginning of the line and should consequently not be printed
to stdout.
You're right; according to "man grep":

-o, --only-matching
Show only the part of a matching line that matches PATTERN.

the only match for "^...." should be the first 4 digits, and
the output should then stop. Instead the pattern matching is
repeated over the rest of the input line (leading to two
"additional results"), which really looks like a bug.
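
The difference between the two behaviours can be sketched in Python. This is only an illustration built on Python's re module, not the actual grep or bsdgrep code, and both function names are invented for the example. The first function restarts the search on the unconsumed remainder as if it were a fresh line, so ^ matches again; the second keeps searching the same line from an offset, so ^ stays tied to the real beginning of the line.

```python
import re

def only_matching_buggy(pattern, line):
    """Emulate the reported output: each new search sees only the
    remainder of the line, so '^' matches at the start of every remainder."""
    regex = re.compile(pattern)
    out = []
    rest = line
    while rest:
        m = regex.search(rest)
        if m is None or m.end() == m.start():
            break  # no match, or an empty match that would loop forever
        out.append(m.group(0))
        rest = rest[m.end():]  # slicing discards the true line start
    return out

def only_matching_expected(pattern, line):
    """Search the whole line from an offset; in Python, '^' then still
    refers to the real start of the string, not to the offset."""
    regex = re.compile(pattern)
    out = []
    pos = 0
    while pos <= len(line):
        m = regex.search(line, pos)
        if m is None or m.end() == m.start():
            break
        out.append(m.group(0))
        pos = m.end()
    return out

line = "1234 1234 1234"
print(only_matching_buggy("^....", line))     # ['1234', ' 123', '4 12']
print(only_matching_expected("^....", line))  # ['1234']
```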
Post by Daniël de Kok

% echo "1234 1234 1234" | grep -o '^....'
1234

That is what _should_ happen, correct. Thanks for clarifying.
--
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...
Daniël de Kok
2016-06-26 16:39:58 UTC
Post by Matthew Seaman
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=201650
Could you update that PR with your findings? In particular, note that
bsdgrep(1) shows the same problem.
Thanks for confirming and for pointing me to the PR. I have added a comment.

Thanks,
Daniël de Kok
Warren Block
2016-06-26 18:44:48 UTC
Post by Matthew Seaman
Post by Daniël de Kok
Note the anchoring (^): the pattern should only match any four characters at the beginning of the line, so the expected output is '1234' and nothing more. ' 123' and '4 12' are not at the beginning of the line and should consequently not be printed to stdout.

% echo "1234 1234 1234" | grep -o '^....'
1234

You are completely correct -- this is a bug in grep(1) on FreeBSD. In
all current releases, including the upcoming 11.0-RELEASE, grep is
actually GNU grep version 2.5.1. However, the same bug occurs in
bsdgrep(1):
% echo 1234 1234 1234 | bsdgrep -o '^....'
1234
123
4 12
Yay for compatibility! :)
Daniël de Kok
2016-07-12 17:36:36 UTC
Post by Matthew Seaman
You are completely correct -- this is a bug in grep(1) on FreeBSD. In
all current releases, including the upcoming 11.0-RELEASE, grep is
actually GNU grep version 2.5.1. However, the same bug occurs in
bsdgrep(1):
% echo 1234 1234 1234 | bsdgrep -o '^....'
1234
123
4 12
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=201650
FWIW, I took some time to look at the bsdgrep issue tonight and have
attached a fix to the PR above.

With kind regards,
Daniël de Kok
