Discussion:
[GTALUG] tr: Illegal byte sequence
Giles Orr via talk
2018-09-26 14:43:46 UTC
Permalink
I wrote a random password generator shell script, the core of which is this
one-liner:

dd if=/dev/urandom bs=1 count=256 2>/dev/null |
tr -dc 'A-Za-z0-9!@$%^&*(){}[]=+-_/?\|~`' | head -c 32

The very ugly string 'A-Za-z0-9!@$%^&*(){}[]=+-_/?\|~`' is the set of ALLOWED
characters. The two counts are replaced by variables; the first, 'count=',
needs to be a lot bigger than the final '-c <number>', which is the length
of the password generated. The size difference is necessary because 'tr'
throws away a lot of values.
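
To see how much 'tr' throws away, one rough check is to push every byte value through the same filter once and count what survives. This is only a sketch: the variable names 'allowed' and 'survivors' are made up for illustration, and NUL is skipped because shell printf handling of it varies. Note that 'tr' reads '+-_' inside the set as a range from '+' to '_', so characters such as ';' and ',' also pass, which is why a ';' can show up in the generated password.

```shell
# The ALLOWED set exactly as written in the one-liner above.
allowed='A-Za-z0-9!@$%^&*(){}[]=+-_/?\|~`'

# Emit every byte value from 1 to 255 once (NUL is skipped), filter, and count.
survivors=$(
    i=1
    while [ "$i" -le 255 ]; do
        # Inner printf builds a three-digit octal escape, outer printf emits the byte.
        printf "\\$(printf '%03o' "$i")"
        i=$((i + 1))
    done | LC_ALL=C tr -dc "$allowed" | wc -c | tr -d ' '
)
echo "$survivors of 256 byte values pass the filter"
```

Roughly a third of the random bytes get through, so a 'count=' several times larger than the password length leaves a comfortable margin.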

I've never had a problem with this on Linux, but on a Mac under some
circumstances we get:

tr: Illegal byte sequence

My coworker, who's also using the script, always got that error. It seems
to come down to locale settings. Mine by default are:

$ locale
LANG="en_CA.UTF-8"
LC_COLLATE="en_CA.UTF-8"
LC_CTYPE="en_CA.UTF-8"
LC_MESSAGES="en_CA.UTF-8"
LC_MONETARY="en_CA.UTF-8"
LC_NUMERIC="en_CA.UTF-8"
LC_TIME="en_CA.UTF-8"
LC_ALL=

My co-worker's settings are:

LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL="en_US.UTF-8"

A reliable fix (so far ...):

$ export LC_CTYPE=C
$ export LC_ALL=C
$ dd if=/dev/urandom bs=1 count=256 2>/dev/null |
tr -dc 'A-Za-z0-9!@$%^&*(){}[]=+-_/?\|~`' | head -c 32
z%V;d9uZfWLTgsT*J]Bz`mAmA

I'd really like to understand what the problem is, why 'tr' barfs, and what
the 'locale' settings have to do with this. Thanks.

(Should anyone have arguments against this as a method of password
generation, I'll entertain those too. And yes, I'm aware of 'apg' but it's
not readily available for Mac and this is much lighter weight.)
--
Giles
https://www.gilesorr.com/
***@gmail.com
Mauro Souza via talk
2018-09-26 14:45:20 UTC
Permalink
I put a base64 after dd, and cut in place of head. Never had any issue...
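
The exact command was not posted, so the following is only a guess at the idea. It relies on the fact that 24 random bytes encode to exactly 32 base64 characters with no '=' padding and no line wrapping, which makes a trailing 'cut' or 'head' unnecessary. The alphabet is limited to A-Z, a-z, 0-9, '+' and '/'. The variable name 'pw' is hypothetical.

```shell
# 24 bytes -> 32 base64 characters (24 is a multiple of 3, so no padding).
# base64 output is pure ASCII, so no locale-sensitive tool sees raw bytes.
pw=$(dd if=/dev/urandom bs=1 count=24 2>/dev/null | base64)
echo "$pw"
```

For other lengths, pick a byte count that is three quarters of the desired length, or encode more than needed and trim with 'cut -c'.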
---
Talk Mailing List
https://gtalug.org/mailman/listinfo/talk
Stewart C. Russell via talk
2018-09-26 14:58:06 UTC
Permalink
Post by Giles Orr via talk
I'd really like to understand what the problem is, why 'tr' barfs, and
what the 'locale' settings have to do with this.  Thanks.
tr on Mac OS seems to assume input is valid UTF-8 text (if locale is
suitably UTF-8). You can set your tr string to something trivial and it
still barfs:

dd if=/dev/urandom bs=1 count=256 2>/dev/null | tr -dc 'A-Za-z0-9' |
head -c 32

A portable hack might be to use iconv to say that the input is an 8-bit
charset:

dd if=/dev/urandom bs=1 count=256 2>/dev/null | iconv -f ISO-8859-1 |
tr -dc 'A-Za-z0-9!@$%^&*(){}[]=+-_/?\|~`' | head -c 32

cheers,
Stewart
D. Hugh Redelmeier via talk
2018-09-27 04:55:15 UTC
Permalink
| From: Stewart C. Russell via talk <***@gtalug.org>

| tr on Mac OS seems to assume input is valid UTF-8 text (if locale is
| suitably UTF-8).

To amplify this, not all byte sequences are valid UTF-8. Random byte
sequences will sometimes be invalid.

Off the top of my head, I think that the following are invalid:

- A continuation byte (0x80 to 0xBF) that does not follow a lead byte or
another continuation byte of the same character

- A string ending partway through a multi-byte character

- A lead byte followed by too few continuation bytes, or an encoding
longer than 4 bytes

Each valid character is either a single byte without the high bit on
(plain ASCII), or a lead byte (0xC2 to 0xF4) followed by one to three
continuation bytes, all with the high bit on. The payload bits of each
byte are concatenated to form the UTF-32 value. Overlong encodings and
values above U+10FFFF are forbidden.
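
These rules can be poked at from the shell. The sketch below assumes an 'iconv' that exits non-zero on undecodable input, which is how the GNU and macOS versions behave as far as I know; 'is_valid_utf8' is a made-up helper name.

```shell
# Hypothetical helper: succeeds only if stdin decodes cleanly as UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Bytes are written as octal escapes: \303\251 is 0xC3 0xA9 (e-acute), \200 is 0x80.
printf 'abc'      | is_valid_utf8 && echo "plain ASCII: valid"
printf '\303\251' | is_valid_utf8 && echo "lead byte + continuation byte: valid"
printf '\200'     | is_valid_utf8 || echo "lone continuation byte: invalid"
printf '\303'     | is_valid_utf8 || echo "truncated character: invalid"
```

Random bytes from /dev/urandom hit the invalid cases almost immediately, which is what a UTF-8-aware 'tr' objects to.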

On the other hand, UTF-8 is UTF-8, whether you are in US or CA locale.
So the different behaviours between the two UTF-8 locales would seem
to be a bug. (In theory, collating sequences could be different so
ranges in tr could be different, but I would not see that affecting
the ASCII subset you are using in your ranges.)

Using C locale should give you 8-bit characters, not UTF-8. So it
should work.

This (untested) small change to Giles' script should work.

dd if=/dev/urandom bs=1 count=256 2>/dev/null |
LC_ALL=C tr -dc 'A-Za-z0-9!@$%^&*(){}[]=+-_/?\|~`' |
head -c 32

LC_ALL might be overkill. I don't know.

I'd probably add an echo to put a newline at the end.
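
Putting those pieces together, a wrapper might look like the sketch below. The function name 'genpw' and the factor of 8 between bytes read and characters wanted are assumptions for illustration, not part of the original script. Writing 'LC_ALL=C' directly in front of 'tr' sets it for that one command only, so the rest of the session keeps its locale; LC_ALL takes precedence over LC_CTYPE and LANG, so no other variable needs to be touched.

```shell
# Hypothetical wrapper around the one-liner; the length defaults to 32.
genpw() {
    len=${1:-32}
    # Read 8 times as many bytes as needed, since tr discards most of them.
    dd if=/dev/urandom bs=1 count=$((len * 8)) 2>/dev/null |
        LC_ALL=C tr -dc 'A-Za-z0-9!@$%^&*(){}[]=+-_/?\|~`' |
        head -c "$len"
    # Finish with a newline so the prompt does not run into the password.
    echo
}

genpw 32
```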
Stewart C. Russell via talk
2018-09-27 12:52:40 UTC
Permalink
Post by D. Hugh Redelmeier via talk
On the other hand, UTF-8 is UTF-8, whether you are in US or CA locale.
So the different behaviours between the two UTF-8 locales would seem
to be a bug.
The Mac I tested this on used the same CA locale as my Linux box. It
still failed on the Mac. The issue is more likely to be that Mac OS 'tr'
is a BSD version, and the Linux one is GNU.

Mac OS's command line suite is a mish-mash of sources and versions.
Their tr is marked BSD, from 2005. Their sed (which also requires valid
UTF-8 byte streams) is from FreeBSD circa 2004. Mac OS awk is bwk's "One
True awk" (which doesn't seem to care if a byte stream is valid or not),
but a couple of versions behind current.
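
A script that has to run on both systems could guess which 'tr' it has. The sketch below assumes that only the GNU version accepts '--version' and that BSD-derived versions reject it with a usage error; 'tr_flavour' is a made-up name, and the check is a heuristic rather than a guarantee.

```shell
# Hypothetical helper: prints "gnu" if tr accepts --version, otherwise "other".
tr_flavour() {
    if tr --version </dev/null >/dev/null 2>&1; then
        echo gnu
    else
        echo other
    fi
}

echo "tr flavour: $(tr_flavour)"
```

Forcing 'LC_ALL=C' for the 'tr' call remains the simpler route, since it should behave the same on either flavour.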

Linux distros tend to be more homogeneous. The only difference I've
found that's common is that Debian tends to prefer mawk (it's much
faster) while others ship with gawk (it has better - but still limited -
UTF-8 support). There's still enough difference between the two that it
can trip you up on edge-case input data. Or more likely, it's tripped
*me* up a couple of times: the rest of you will know what you're doing.

cheers,
Stewart
Jamon Camisso via talk
2018-09-27 13:29:46 UTC
Permalink
Post by Giles Orr via talk
I wrote a random password generator shell script, the core of which is
dd if=/dev/urandom bs=1 count=256 2>/dev/null | tr -dc
If semi-random 32 (or n) character passwords is what you're after, pwgen
should work on Linux and macOS:

pwgen -s -y 32 1
f.,,H%+IMpQ-yDG+W'5'+AmjU$CcF*ZK

That said, if that's a password for a human, I pity the person who has
to type it.

What are you using passwords like that for, as opposed to some kind of
key based auth?

Cheers, Jamon
James Knott via talk
2018-09-27 13:35:20 UTC
Permalink
Post by Jamon Camisso via talk
f.,,H%+IMpQ-yDG+W'5'+AmjU$CcF*ZK
That said, if that's a password for a human, I pity the person who has
to type it.
What???  You mean you haven't memorized it?  ;-)
Post by Jamon Camisso via talk
What are you using passwords like that for, as opposed to some kind of
key based auth?
I use that sort of password for WiFi.  However, I use the Perfect
Passwords from www.grc.com.  They have 63 random character strings just
for that purpose.

Here's an example:
"57,%Y9N<Ure}tgrJO[7DS;NElk~/\"mxPyE1BB#,n!so%sl/j6[0JS*R_Db(Yx

Jamon Camisso via talk
2018-09-27 13:46:38 UTC
Permalink
Post by James Knott via talk
Post by Jamon Camisso via talk
f.,,H%+IMpQ-yDG+W'5'+AmjU$CcF*ZK
That said, if that's a password for a human, I pity the person who has
to type it.
What???  You mean you haven't memorized it?  ;-)
Post by Jamon Camisso via talk
What are you using passwords like that for, as opposed to some kind of
key based auth?
I use that sort of password for WiFi.  However, I use the Perfect
Passwords from www.grc.com.  They have 63 random character strings just
for that purpose.
"57,%Y9N<Ure}tgrJO[7DS;NElk~/\"mxPyE1BB#,n!so%sl/j6[0JS*R_Db(Yx
Doesn't seem worth the hassle for short sequences.

GRC sequence:
echo '"57,%Y9N<Ure}tgrJO[7DS;NElk~/\"mxPyE1BB#,n!so%sl/j6[0JS*R_Db(Yx' |
ent | grep Entropy
Entropy = 5.468750 bits per byte.

pwgen sequence:
pwgen -s -y 63 1 |ent |grep Entropy
Entropy = 5.538910 bits per byte.

Negligible difference, and FWIW Ted Ts'o wrote pwgen.
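
A caveat on reading those figures: 'ent' reports the byte-frequency entropy of the sample it is given, and a sample of 64 bytes (63 characters plus the newline) cannot score above 6 bits per byte however good the generator is. A different yardstick is the size of the search space, length times log2 of the alphabet size, which assumes every character is drawn uniformly and independently. The helper name 'password_bits' below is hypothetical.

```shell
# Hypothetical helper: bits = length * log2(alphabet size).
# awk's log() is the natural logarithm, hence the division by log(2).
password_bits() {
    awk -v len="$1" -v alphabet="$2" \
        'BEGIN { printf "%.1f\n", len * log(alphabet) / log(2) }'
}

password_bits 32 94   # 32 characters from the 94 printable ASCII characters
password_bits 63 94   # 63 characters, as in the examples above
```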

Cheers, Jamon
Stewart C. Russell via talk
2018-09-27 14:33:20 UTC
Permalink
Post by Jamon Camisso via talk
Negligible difference, and FWIW Ted Ts'o wrote pwgen.
The big difference, though, is that pwgen isn't installed by default
under Mac OS¹, and Giles's original approach was intended to be
portable. Installing packages is a huge hurdle for many users.

Stewart

----
¹: and, to be fair, neither is it installed by default under Ubuntu.
Bill Thanis via talk
2018-09-27 13:47:33 UTC
Permalink
The locale indirectly controls the character encoding in the shell. That is
the reason why the locale settings have to do with this. I may be wrong,
but I believe the shell on the Mac is hardcoded with a specific character
encoding, probably 7-bit ASCII. Try changing your count to 128.

Bill
William Park via talk
2018-09-27 15:30:38 UTC
Permalink
If Mac has recent Bash, then you could probably use $RANDOM variable
which picks a number from 0-32767 every time you read it. From top of
my head,
for i in $(seq 32); do
printf '%x' $((RANDOM % 94 + 33))
done | xxd -r -ps
That will give you full 94 character range you want.
Giles Orr via talk
2018-10-01 14:16:44 UTC
Permalink
Thanks to everyone that responded - this has been very helpful.

So it seems that this _is_ locale-related: Hugh's explanation points out
that not all two-byte strings are valid characters under UTF-8, and that
would break 'tr'. Thus the change to 'C' fixing the problem. That really
helped me understand the problem, thanks.

I want a command line solution, so GRC's website doesn't work for me. And
I don't think it's a good idea to take a password from another source: it's
unlikely GRC stores generated passwords and then tries to hack the
associated IP or web browser with it, but isn't it better to do this
yourself so only you know the outputted password?

As for 'pwgen', it has precisely the same problem as 'apg' - it's not
installed by default as Stewart mentioned.

Someone asked what these passwords are used for. We have to create accounts on
many services (most of which don't support any authentication method except
passwords) and give those accounts to other people to use. It's my intent
that the recipient should change the password to something more to their
liking. But many people don't: they just let their web browser memorize
the password and then let us reset the password when they "forget" it by
changing browsers. At least this way I know they have a relatively random
and secure password to start with, usually much better than what they would
have changed it to.
William Park via talk
2018-10-02 04:44:22 UTC
Permalink
Post by Giles Orr via talk
Post by William Park via talk
for i in $(seq 32); do
printf '%x' $((RANDOM % 94 + 33))
done | xxd -r -ps
Even more portable would be

echo -e $(for i in $(seq 32); do printf '\\x%x' $((RANDOM % 94 + 33)); done)
Stewart C. Russell via talk
2018-10-02 13:06:05 UTC
Permalink
Post by William Park via talk
Even more portable would be
echo -e $(for i in $(seq 32); do printf '\\x%x' $((RANDOM % 94 + 33)); done)
It might be more portable, but bash's $RANDOM comes from a very simple
pseudorandom number generator, where Giles's solution uses /dev/urandom.
There's also a bit of modulo bias in the selection method.
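
The bias arises because $RANDOM has 32768 possible values and 32768 is not a multiple of 94, so the first 56 characters of the range come up slightly more often than the rest. One way to keep /dev/urandom as the source and avoid the bias is rejection sampling, sketched below. The function name 'genpw_unbiased' is made up, and the sketch assumes 'od' accepts '-An -tu1 -N', as the GNU and BSD versions do.

```shell
# Hypothetical sketch: printable ASCII from /dev/urandom without modulo bias.
genpw_unbiased() {
    len=${1:-32}
    out=''
    while [ "${#out}" -lt "$len" ]; do
        # od prints each random byte as an unsigned decimal number.
        for b in $(od -An -tu1 -N64 /dev/urandom); do
            # 188 = 2 * 94: accept bytes 0..187, reject 188..255, so that
            # each of the 94 characters has exactly two bytes mapping to it.
            [ "$b" -lt 188 ] || continue
            # Map to codes 33..126 and emit the byte via an octal escape.
            c=$(printf "\\$(printf '%03o' $((b % 94 + 33)))")
            out="$out$c"
            [ "${#out}" -ge "$len" ] && break
        done
    done
    printf '%s\n' "$out"
}

genpw_unbiased 32
```

No 'tr' is involved, so the locale problem from the start of the thread does not arise here either.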

cheers,
Stewart