From JoyceUlysses.txt -- words occurring exactly once

Post by HenHanna
Given a text file of a novel (JoyceUlysses.txt) ...
could someone give me a pretty fast (and simple) Python program that'd
give me a list of all words occurring exactly once?
              -- Also, a list of words occurring once, twice or 3 times
re: hyphenated words        (you can treat it anyway you like)
       but ideally, i'd treat [editor-in-chief]
                               [go-ahead] [pen-knife]
                               [know-how] [far-fetched] ...
       as one unit.

Did you mention the pay-rate for this work?

Split into words - defined as you will.
Use Counter.

Show some (of your) code and we'll be happy to critique...

--
Regards,
=dn

HenHanna

2024-05-31 02:26:37 UTC

Post by HenHanna
Given a text file of a novel (JoyceUlysses.txt) ...
could someone give me a pretty fast (and simple) Python program that'd
give me a list of all words occurring exactly once?
               -- Also, a list of words occurring once, twice or 3 times
re: hyphenated words        (you can treat it anyway you like)
        but ideally, i'd treat [editor-in-chief]
                                [go-ahead] [pen-knife]
                                [know-how] [far-fetched] ...
        as one unit.

Split into words - defined as you will.
Use Counter.
Show some (of your) code and we'll be happy to critique...

hard to decide what to do with hyphens
and apostrophes
(I'd, he's, can't, haven't, A's and B's)

2-step-Process

1. make a file listing all words (one word per line)

2. then, doing the counting. using
from collections import Counter

Related code (for 1) that i'd used before:

with open("JoyceUlysses.txt", "r") as rfile, open("Out.txt", "w") as fo:
    for line in rfile:
        for w in line.split():
            # Strip punctuation from both ends; inner hyphens and
            # apostrophes (editor-in-chief, can't) are kept.
            w = w.strip(";:,'\"[]()*&^%$#@!,./<>?_-+=")
            # Skip tokens that were nothing but punctuation.
            if w:
                fo.write(w.lower() + "\n")
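
Step 2 of the process above (the counting) is not shown in the post. A minimal sketch of it, assuming the word file holds one lowercased word per line; the function name words_with_counts and its parameters are made up for illustration:

```python
from collections import Counter


def words_with_counts(lines, lowest=1, highest=1):
    """Return sorted words whose occurrence count lies in [lowest, highest]."""
    # One word per line; blank lines are ignored.
    counts = Counter(line.strip() for line in lines if line.strip())
    return sorted(w for w, n in counts.items() if lowest <= n <= highest)


# Tiny in-memory stand-in for the lines of the word file.
sample = ["stately\n", "plump\n", "buck\n", "plump\n", "\n"]
print(words_with_counts(sample))        # words occurring exactly once
print(words_with_counts(sample, 1, 3))  # words occurring once, twice or 3 times
```

An open file object can be passed in place of the sample list, since iterating over a text file yields its lines.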

Peter J. Holzer

2024-06-01 08:04:29 UTC

Post by HenHanna
hard to decide what to do with hyphens
and apostrophes
(I'd, he's, can't, haven't, A's and B's)

Especially since the same character is used as both an apostrophe and a
closing quotation mark. And while that's pretty unambiguous between two
characters it isn't at the end of a word:

This is Alex’ house.
This type of building is called an ‘Alex’ house.
The sentence ‘We are meeting at Alex’ house’ contains an apostrophe.

(using proper unicode quotation marks. It gets worse if you stick to
ASCII.)

Personally I like to use U+0027 APOSTROPHE as an apostrophe and U+2018
LEFT SINGLE QUOTATION MARK and U+2019 RIGHT SINGLE QUOTATION MARK as
single quotation marks[1], but despite the suggestive names, this is not
the common typographical convention, so your texts are unlikely to make
this distinction.

hp

[1] Which I use rarely, anyway.
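
One way to act on this ambiguity, offered only as a sketch: treat an apostrophe or hyphen as part of a word only when it sits between two letters. Under that assumption a trailing U+2019 (as in Alex’) cannot be told apart from a closing quote and is dropped. The names WORD_RE and tokenize are hypothetical:

```python
import re

# Assumption: a word is a run of letters, optionally joined to further runs
# of letters by a single hyphen or apostrophe (ASCII ' or U+2019).
WORD_RE = re.compile(r"[^\W\d_]+(?:['\u2019-][^\W\d_]+)*")


def tokenize(text):
    """Return lowercased words; U+2019 inside a word is normalized to '."""
    return [m.group().lower().replace("\u2019", "'")
            for m in WORD_RE.finditer(text)]


print(tokenize("He can\u2019t find the editor-in-chief\u2019s pen-knife."))
print(tokenize("\u2018We are meeting at Alex\u2019 house\u2019"))
```

With this definition "can’t" and "editor-in-chief’s" stay whole, while the possessive in "Alex’ house" is counted as plain "alex".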

Thomas Passin

2024-06-01 13:38:51 UTC

Post by Peter J. Holzer

Post by HenHanna
hard to decide what to do with hyphens
and apostrophes
(I'd, he's, can't, haven't, A's and B's)

Especially since the same character is used as both an apostrophe and a
closing quotation mark. And while that's pretty unambiguous between two
characters it isn't at the end of a word:
This is Alex’ house.
This type of building is called an ‘Alex’ house.
The sentence ‘We are meeting at Alex’ house’ contains an apostrophe.
(using proper unicode quotation marks. It gets worse if you stick to
ASCII.)
Personally I like to use U+0027 APOSTROPHE as an apostrophe and U+2018
LEFT SINGLE QUOTATION MARK and U+2019 RIGHT SINGLE QUOTATION MARK as
single quotation marks[1], but despite the suggestive names, this is not
the common typographical convention, so your texts are unlikely to make
this distinction.
hp
[1] Which I use rarely, anyway.

My usual approach is to replace punctuation by spaces and then to
discard anything remaining that is only one character long (or sometimes
two, depending on what I'm working on). Yes, OK, I will miss words like
"I". Usually I don't care about them. Make exceptions to the policy if
you like.
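
A rough sketch of that approach, assuming "punctuation" means the ASCII set in string.punctuation (typographic quotes would need adding); rough_words and min_len are illustrative names:

```python
import string

# Map every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation,
                                " " * len(string.punctuation))


def rough_words(text, min_len=2):
    """Replace punctuation by spaces, then drop tokens shorter than min_len."""
    cleaned = text.lower().translate(_PUNCT_TO_SPACE)
    return [w for w in cleaned.split() if len(w) >= min_len]


print(rough_words("I can't go-ahead, A's and B's!"))
```

Note the trade-off: hyphenated compounds are split into their parts, and the leftovers of contractions ("t", "s") are discarded along with "I" and "a".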

dn

2024-06-05 04:33:15 UTC

Post by HenHanna

Post by HenHanna
Given a text file of a novel (JoyceUlysses.txt) ...
could someone give me a pretty fast (and simple) Python program
that'd give me a list of all words occurring exactly once?
               -- Also, a list of words occurring once, twice or 3 times
re: hyphenated words        (you can treat it anyway you like)
        but ideally, i'd treat [editor-in-chief]
                                [go-ahead] [pen-knife]
                                [know-how] [far-fetched] ...
        as one unit.

Split into words - defined as you will.
Use Counter.
Show some (of your) code and we'll be happy to critique...

Apologies for lateness - only just able to come back to this.

This issue is not Python, and is not solved by code!

If you/your teacher can't define a "word", the code, any code, will
almost-certainly be wrong!

One of the interesting aspects of our work is that we can write all
manner of tests to try to ensure that the code is correct: unit tests,
integration tests, system tests, acceptance tests, eye-tests, ...

However, there is no such thing as a test (or proof) that statements of
requirements are complete or correct!
(nor for any other previous stages of the full project life-cycle)

As coders we need to learn to require clear specifications and not
attempt to read-between-the-lines, use our initiative, or otherwise 'not
bother the ...'. When there is ambiguity, we should go back to the
user/client/boss and seek clarification. They are the
domain/subject-matter experts...

I'm reminded of a cartoon, possibly from some IBM source, first seen in
black-and-white but here in living-color:
https://www.monolithic.org/blogs/presidents-sphere/what-the-customer-really-wants

That has been the sad history of programming and dev.projects - wherein
we are blamed for every short-coming, because no-one else understands
the nuances of development projects.

If we don't insist on clarity, are we our own worst enemy?

--
Regards,
=dn

Grant Edwards

2024-06-05 15:24:32 UTC

Post by dn
If you/your teacher can't define a "word", the code, any code, will
almost-certainly be wrong!

Back when I was a student...

When there was a homework/project assignment with a vague requirement
(and it wasn't practical to get the requirement refined), what always
worked for me was to put in the project report or program comments or
somewhere a statement that the requirement could be interpreted in
different ways and here is the precise interpretation of the
requirement that is being implemented.

Thomas Passin

2024-06-05 11:10:19 UTC

Post by HenHanna

Post by HenHanna
Given a text file of a novel (JoyceUlysses.txt) ...
could someone give me a pretty fast (and simple) Python program
that'd give me a list of all words occurring exactly once?
               -- Also, a list of words occurring once, twice or 3 times
re: hyphenated words        (you can treat it anyway you like)
        but ideally, i'd treat [editor-in-chief]
                                [go-ahead] [pen-knife]
                                [know-how] [far-fetched] ...
        as one unit.

Split into words - defined as you will.
Use Counter.
Show some (of your) code and we'll be happy to critique...

Apologies for lateness - only just able to come back to this.
This issue is not Python, and is not solved by code!
If you/your teacher can't define a "word", the code, any code, will
almost-certainly be wrong!
One of the interesting aspects of our work is that we can write all
manner of tests to try to ensure that the code is correct: unit tests,
integration tests, system tests, acceptance tests, eye-tests, ...
However, there is no such thing as a test (or proof) that statements of
requirements are complete or correct!
(nor for any other previous stages of the full project life-cycle)
As coders we need to learn to require clear specifications and not
attempt to read-between-the-lines, use our initiative, or otherwise 'not
bother the ...'. When there is ambiguity, we should go back to the
user/client/boss and seek clarification. They are the
domain/subject-matter experts...
I'm reminded of a cartoon, possibly from some IBM source, first seen in
black-and-white but here in living-color:
https://www.monolithic.org/blogs/presidents-sphere/what-the-customer-really-wants

That one's been kicking around for years ... good job in finding a link
for it!

Post by dn
That has been the sad history of programming and dev.projects - wherein
we are blamed for every short-coming, because no-one else understands
the nuances of development projects.

Of course, we see this lack of clarity all the time in questions to the
list. I often wonder how these askers can possibly come up with
acceptable code if they don't realize they don't truly know what it's
supposed to do.

Post by dn
If we don't insist on clarity, are we our own worst enemy?

Mats Wichmann

2024-06-07 14:37:07 UTC

Post by Thomas Passin
Of course, we see this lack of clarity all the time in questions to the
list. I often wonder how these askers can possibly come up with
acceptable code if they don't realize they don't truly know what it's
supposed to do.

Fortunately, having to explain to someone else why something is giving
you trouble can help shed light on the fact the problem statement isn't
clear, or isn't clearly understood. Sometimes (sadly, many times it
doesn't).

Larry Martell

2024-06-08 15:54:07 UTC

On Sat, Jun 8, 2024 at 10:39 AM Mats Wichmann via Python-list <

The original question struck me as homework or an interview question for a
junior position. But having no clear requirements or specifications is good
training for the real world where that is often the case. When you question
that, you are told to just do something, and then you’re told it’s not what
is wanted. That frustrates people but it’s often part of the process.
People need to see something to help them know what they really want.

Stefan Ram

2024-06-08 16:06:07 UTC

Post by Larry Martell
People need to see something to help them know what they really want.

|The hardest single part of building a software system is
|deciding precisely what to build.
Brooks, F.P. Jr., The Mythical Man-Month: Essays on Software
Engineering, Addison Wesley, Reading, MA, 1995, Second Edition.

Thomas Passin

2024-06-08 17:10:13 UTC

Post by Larry Martell
On Sat, Jun 8, 2024 at 10:39 AM Mats Wichmann via Python-list <

At the extremes, there are two kinds of approaches you are alluding to.
One is what I learned to call "rock management": "Bring me a rock ...
no, that's not the right one, bring me another ... no that's not what
I'm looking for, bring me another...". If this is your situation, so,
so sorry!

At the other end, there is a mutual evolution of the requirements
because you and your client could not have known what they should be
until you have spent effort and time feeling your way along. With the
right client and management, this kind of project can be a joy to work
on. I've been lucky enough to have worked on several projects of this kind.

In truth, there always are requirements. Often (usually?) they are not
thought out, not consistent, not articulated clearly, and not
communicated well. They may live only in the mind of one person.

a***@gmail.com

2024-06-08 18:46:43 UTC

Agreed, Thomas.

As someone who has spent lots of time writing code OR requirements of various levels or having to deal with the bugs afterwards, there can be a huge disconnect between the people trying to decide what to do and the people having to do it. It is not necessarily easy to come back later and ask for changes that were not anticipated in the design or implementation.

I recently wrote a program where the original specifications seemed reasonable. In one part, I was asked to make a graph with some random number (or all) of the data shown as a series of connected line segments showing values for the same entity at different measurement periods and then superimpose the mean for all the original data, not just the subsample shown. This needed to be done on multiple subsamples of the original/calculated data so I made it into a function.

One of the datasets contained a column that was either A or B and the function was called multiple times to show what a random sample of A+B, just A and just B graphed like along with the mean of the specific data it was drawn from. But then, I got an innocuously simple request.

Could we graph A+B and overlay not only the means for A+B as was now done, but also the mean for A and the mean for B. Ideally, this would mean three bolder jagged lines superimposed above the plot and seemed simple enough.

But was it? To graph the means in the first place, I made a more complex data structure needed so when graphed, it aligned well with what was below it. But that was hard coded in my function, but in one implementation, I now needed it three times. Extracting it into a new function was not trivial as it depended initially on other things within the body of the function. But, it was doable and might have been done that way had I known such a need might arise. It often is like that when there seems no need to write a function for just one use. The main function now needed to be modified to allow optionally adding one or two more datasets and if available, call the new function on each and add layers to the graph with the additional means (dashed and dotted) if they are called while otherwise, the function worked as before.

But did I do it right? Well, if next time I am asked to have the data extended to have more measurements in more columns at more times, I might have to rewrite quite a bit of the code. My localized change allowed one or two additional means to be plotted. Adding an arbitrary number takes a different approach and, frankly, there are limits on how many kinds of "line" segments can be used to differentiate among them.

Enough of the example except to make a point. In some projects, it is not enough to tell a programmer what you want NOW. You may get what you want fairly quickly but if you have ideas of possible extensions or future upgrades, it would be wiser to make clear some of the goals so the programmer creates an implementation that can be more easily adjusted to do more. Such code can take longer and be more complex so it may not pay off immediately.

But, having said that, plenty of software may benefit from looking at what is happening and adjusting on the fly. Clearly my client cannot know what feedback they may get when showing an actual result to others who then suggest changes or enhancements. The results may not be anticipated so well in advance and especially not when the client has no idea what is doable and so on.

A related example was a request for how to modify a sort of Venn Diagram chart to change the font size. Why? Because some of the labels were long and the relative sizes of the pie slices were not known till an analysis of the data produced the appropriate numbers and ratios. This was a case where the documentation of the function used by them did not suggest how to do many things as it called a function that called others to quite some depth. A few simple experiments and some guesses and exploration showed me ways to pass arguments along that were not documented but that were passed properly down the chain and I could now change the text size and quite a few other things. But I asked myself if this was really the right solution the client needed. I then made a guess on how I could get the long text wrapped into multiple lines that fit into the sections of the Venn Diagram without shrinking the text at all, or as much. The client had not considered that as an option, but it was better for their display than required. But until people see such output, unless they have lots of experience, it cannot be expected they can tell you up-front what they want.

One danger of languages like Python is that often people get the code you supply and modify it themselves or reuse it on some project they consider similar. That can be a good thing but often a mess as you wrote the code to do things in a specific way for a specific purpose ...


Thomas Passin

2024-06-08 19:41:03 UTC

Post by a***@gmail.com
Agreed, Thomas.
As someone who has spent lots of time writing code OR requirements of various levels or having to deal with the bugs afterwards, there can be a huge disconnect between the people trying to decide what to do and the people having to do it. It is not necessarily easy to come back later and ask for changes that were not anticipated in the design or implementation.

And typical contract vehicles aren't often flexible to allow for this
kind of thing. I've always tried to persuade my management to allow
built-in phases where re-evaluation can take place based on what's been
learned. To have a hope of that working, though, there needs to be a lot
of trust between client and development folks. Can be hard to come by.

Post by a***@gmail.com
I recently wrote a program where the original specifications seemed reasonable. In one part, I was asked to make a graph with some random number (or all) of the data shown as a series of connected line segments showing values for the same entity at different measurement periods and then superimpose the mean for all the original data, not just the subsample shown. This needed to be done on multiple subsamples of the original/calculated data so I made it into a function.
One of the datasets contained a column that was either A or B and the function was called multiple times to show what a random sample of A+B, just A and just B graphed like along with the mean of the specific data it was drawn from. But then, I got an innocuously simple request.
Could we graph A+B and overlay not only the means for A+B as was now done, but also the mean for A and the mean for B. Ideally, this would mean three bolder jagged lines superimposed above the plot and seemed simple enough.
But was it? To graph the means in the first place, I made a more complex data structure needed so when graphed, it aligned well with what was below it. But that was hard coded in my function, but in one implementation, I now needed it three times. Extracting it into a new function was not trivial as it depended initially on other things within the body of the function. But, it was doable and might have been done that way had I known such a need might arise. It often is like that when there seems no need to write a function for just one use. The main function now needed to be modified to allow optionally adding one or two more datasets and if available, call the new function on each and add layers to the graph with the additional means (dashed and dotted) if they are called while otherwise, the function worked as before.

I feel your pain. In the generalized X-Y graphing program I've evolved
over several generations, I have graphing methods that can plot points
and curves, optionally overlaying them. Any function that wants to plot
something has to generate a dataset object of the type that the plotter
knows how to plot. No exceptions. Nothing else ever plots to the
screen. It's simple and works very well ... but I only designed it to
have axis labels and the title of the plot. They are all three
interactive, editable by the user. That's good, but for anything else
it's hack time. Witness lines, legends, point labels, etc., etc. don't
have a natural home.

Post by a***@gmail.com
But did I do it right? Well, if next time I am asked to have the data extended to have more measurements in more columns at more times, I might have to rewrite quite a bit of the code. My localized change allowed one or two additional means to be plotted. Adding an arbitrary number takes a different approach and, frankly, there are limits on how many kinds of "line" segments can be used to differentiate among them.

This is the kind of situation where it needs to be implemented three
times before it gets good. One always thinks that the second time
around will work well because all the lessons were learned the first
time around. But no, it's not the second but the third implementation
that can start to be really good.

Post by a***@gmail.com
Enough of the example except to make a point. In some projects, it is not enough to tell a programmer what you want NOW. You may get what you want fairly quickly but if you have ideas of possible extensions or future upgrades, it would be wiser to make clear some of the goals so the programmer creates an implementation that can be more easily adjusted to do more. Such code can take longer and be more complex so it may not pay off immediately.
But, having said that, plenty of software may benefit from looking at what is happening and adjusting on the fly. Clearly my client cannot know what feedback they may get when showing an actual result to others who then suggest changes or enhancements. The results may not be anticipated so well in advance and especially not when the client has no idea what is doable and so on.
A related example was a request for how to modify a sort of Venn Diagram chart to change the font size. Why? Because some of the labels were long and the relative sizes of the pie slices were not known till an analysis of the data produced the appropriate numbers and ratios. This was a case where the documentation of the function used by them did not suggest how to do many things as it called a function that called others to quite some depth. A few simple experiments and some guesses and exploration showed me ways to pass arguments along that were not documented but that were passed properly down the chain and I could now change the text size and quite a few other things. But I asked myself if this was really the right solution the client needed. I then made a guess on how I could get the long text wrapped into multiple lines that fit into the sections of the Venn Diagram without shrinking the text at all, or as much. The client had not considered that as an option, but it was better for their display than required. But until people see such output, unless they have lots of experience, it cannot be expected they can tell you up-front what they want.
One danger of languages like Python is that often people get the code you supply and modify it themselves or reuse it on some project they consider similar. That can be a good thing but often a mess as you wrote the code to do things in a specific way for a specific purpose ...

a***@gmail.com

2024-06-08 22:07:07 UTC

I agree with Larry that the OP was asking something that might be fair to use in an interview process where perhaps an exact answer is not as important as showing how the person thinks about a process. The original question can be incomplete or vague. Do they give up? Do they ask careful questions that might elicit missing parts? Do they examine alternatives and then pick a reasonable one and document exactly which possible version(s) their presented code should solve?

In this case, as mentioned, one approach is to isolate the determination of what a word means from the rest of the problem.

In effect, the code can be:

- read in all lines of text.
- words = function_name(text)
- optionally, manipulate the words such as making them all the same case, removing some characters, throwing some away.
- count the words in words remaining using some method such as a dictionary.
- output reports as requested.

You could then design any number of functions you can slide into place and the remaining code may continue working without changes.
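
The outline above might be sketched as follows. The two splitters are just examples of interchangeable definitions of a word, and all function names here are hypothetical:

```python
import re
from collections import Counter


def split_on_whitespace(text):
    """Simplest definition: a word is a run of non-whitespace characters."""
    return text.split()


def split_on_word_chars(text):
    """Alternative definition: a word is a run of regex word characters."""
    return re.findall(r"\w+", text)


def count_words(text, splitter=split_on_whitespace, normalize=str.lower):
    """Count words, leaving the definition of a word to `splitter`."""
    return Counter(normalize(w) for w in splitter(text))


def report(counts, max_count=1):
    """Words occurring at most max_count times, sorted alphabetically."""
    return sorted(w for w, n in counts.items() if n <= max_count)


text = "The cat. The dog, the cat!"
print(report(count_words(text)))
print(report(count_words(text, splitter=split_on_word_chars)))
```

Swapping the splitter changes the answer while the counting and reporting code stays untouched, which is the point of isolating that decision.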

It may not matter if you can specify the specific definition of text if you show the general solution, and maybe instantiate some fairly simple way of making words.

Note, the above logic applies not to just python but most programming environments. If someone interviewed me for a job in say, Rust, which I am just now learning out of curiosity, I might not know how to program some parts of a problem like this, let alone make use of the idioms of that language right away. But if they want someone who can rapidly learn the local ways and adapt, then the best they can do is try to see if you think like a programmer and can think abstractly and be able to do lots of work even while waiting for more info from someone on what they want for a specific part.

Or, in a case like this problem, I wonder if they would want to hear a suggestion that this may more easily be done in languages that support something like the dictionary or hash or associative array concept or that have available modules/packages/crates/etc. that provide such functionality.


Grant Edwards

2024-06-09 01:51:09 UTC

Post by Larry Martell
The original question struck me as homework or an interview question for a
junior position. But having no clear requirements or specifications is good
training for the real world where that is often the case. When you question
that, you are told to just do something, and then you’re told it’s not what
is wanted. That frustrates people but it’s often part of the process.
People need to see something to help them know what they really want.

Too true. You can spend all sorts of time getting people to pin down
and sign off on the initial requirements, but it all goes right out
the window when they get the first prototype.

"This isn't what we want, we want it to do <something else>."

"It does what you specified."

"But, this isn't what we want."

...

If you're on salary, it's all part of the job. If you're a contractor,
you either figure it in to the bid or charge for change orders.

Pieter van Oostrum

2024-05-31 12:39:37 UTC

Post by HenHanna
Given a text file of a novel (JoyceUlysses.txt) ...
could someone give me a pretty fast (and simple) Python program that'd
give me a list of all words occurring exactly once?
-- Also, a list of words occurring once, twice or 3 times
re: hyphenated words (you can treat it anyway you like)
but ideally, i'd treat [editor-in-chief]
[go-ahead] [pen-knife]
[know-how] [far-fetched] ...
as one unit.

That is a famous Unix task : (Sorry, no Python)

grep -o '\w*' JoyceUlysses.txt | sort | uniq -c | sort -n
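
For comparison, a rough Python counterpart of that pipeline, not an exact reimplementation: it is case-sensitive like the shell version, but Python's \w is Unicode-aware and ties are ordered alphabetically here. The function name count_ascending is made up:

```python
import re
from collections import Counter


def count_ascending(text):
    """Return (count, word) pairs sorted by count, rarest first."""
    # re.findall plays the role of grep -o; Counter replaces sort | uniq -c.
    counts = Counter(re.findall(r"\w+", text))
    # Sorting the pairs stands in for the final sort -n.
    return sorted((n, w) for w, n in counts.items())


for n, w in count_ascending("b a b c c c"):
    print(n, w)
```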

--
Pieter van Oostrum <***@vanoostrum.org>
www: http://pieter.vanoostrum.org/
PGP key: [8DAE142BE17999C4]

Grant Edwards

2024-05-31 18:58:55 UTC

Post by Pieter van Oostrum

Post by HenHanna
Given a text file of a novel (JoyceUlysses.txt) ...
could someone give me a pretty fast (and simple) Python program that'd
give me a list of all words occurring exactly once?
-- Also, a list of words occurring once, twice or 3 times
re: hyphenated words (you can treat it anyway you like)
but ideally, i'd treat [editor-in-chief]
[go-ahead] [pen-knife]
[know-how] [far-fetched] ...
as one unit.

That is a famous Unix task : (Sorry, no Python)
grep -o '\w*' JoyceUlysses.txt | sort | uniq -c | sort -n

Yep, that's what came to my mind (though I couldn't remember the exact
grep option without looking it up). However, I assume that doesn't
get you very many points on a homework assignment from an "Introduction
to Python" class.

d***@online.de

2024-05-31 17:59:15 UTC

Post by HenHanna
Given a text file of a novel (JoyceUlysses.txt) ...
could someone give me a pretty fast (and simple) Python program that'd
give me a list of all words occurring exactly once?

Your task can be split into several subtasks:
* parse the text into words

This depends on your notion of "word".
In the simplest case, a word is any maximal sequence of non-whitespace
characters. In this case, you can use `split` for this task

* Make a list unique -- you can use `set` for this

Post by HenHanna
-- Also, a list of words occurring once, twice or 3 times

For this you count the number of occurrences in a `list`.
You can use the `count` method of lists for this.

All individual subtasks are simple. I am confident that
you will be able to solve them by yourself (if you are willing
to invest a bit of time).
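
Put together, those subtasks could look like the sketch below (words_by_count and wanted are illustrative names). One caveat worth stating: list.count rescans the whole list for every unique word, so this is likely to be slow on a full novel compared with a single counting pass.

```python
def words_by_count(text, wanted=(1,)):
    """Return sorted words whose occurrence count is one of `wanted`."""
    words = text.split()   # simplest notion of "word"
    unique = set(words)    # each distinct word once
    # list.count walks the full list for each unique word.
    return sorted(w for w in unique if words.count(w) in wanted)


print(words_by_count("a b a c c c d"))              # exactly once
print(words_by_count("a b a c c c d", (1, 2, 3)))   # once, twice or 3 times
```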

Thomas Passin

2024-05-31 21:27:00 UTC

You will probably get a thousand different suggestions, but here's a
fairly direct and readable one in Python:

s1 = 'Is this word is the only word repeated in this string'

counts = {}
for w in s1.lower().split():
    counts[w] = counts.get(w, 0) + 1
print(sorted(counts.items()))
# [('in', 1), ('is', 2), ('only', 1), ('repeated', 1), ('string', 1),
#  ('the', 1), ('this', 2), ('word', 2)]

Of course you can adjust the definition of what constitutes a word,
handle punctuation and so on, and tinker with the output format to suit
yourself. You would replace s1.lower().split() with, e.g.,
my_custom_word_splitter(s1).

Mats Wichmann

2024-06-01 19:34:11 UTC

On 5/31/24 11:59, Dieter Maurer via Python-list wrote:

hmmm, I "sent" this but there was some problem and it remained unsent.

Post by d***@online.de

Post by HenHanna
Given a text file of a novel (JoyceUlysses.txt) ...
could someone give me a pretty fast (and simple) Python program that'd
give me a list of all words occurring exactly once?

* parse the text into words
This depends on your notion of "word".
In the simplest case, a word is any maximal sequence of non-whitespace
characters. In this case, you can use `split` for this task

This piece is by far "the hard part", because of the ambiguity. For
example, if I just say non-whitespace, then words followed by punctuation
count as distinct words. What about hyphenation - of which there's both
the compound word forms and the ones at the end of lines if the source
text has been formatted that way. Are all-lowercase words different
than the same word starting with a capital? What about non-initial
capitals, as happens a fair bit in modern usage with acronyms,
trademarks (perhaps not in Ulysses? :-) ), etc. What about accented letters?
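
Each of those questions needs a decision before counting. The sketch below picks one possible set of answers (compose accents, rejoin line-end hyphens, ignore case) purely as an illustration; normalize_text is a hypothetical helper, and rejoining will wrongly fuse a genuine compound that happens to break at a line end:

```python
import re
import unicodedata


def normalize_text(text):
    """Apply one possible set of answers to the questions above."""
    # Compose accented letters so 'e' + combining acute equals a single 'é'.
    text = unicodedata.normalize("NFC", text)
    # Rejoin words split by a hyphen at the end of a line.
    text = re.sub(r"(\w)-\n\s*(\w)", r"\1\2", text)
    # Treat capitalized and lowercase forms as the same word.
    return text.casefold()


print(normalize_text("The far-\nfetched Cafe\u0301"))
```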

If you want what's at least a quick starting point to play with, you
could use a very simple regex - a fair amount of thought has gone into
what a "word character" is (\w), so it deals with excluding both
punctuation and whitespace.

import re
from collections import Counter

with open("JoyceUlysses.txt", "r") as f:
    wordcount = Counter(re.findall(r'\w+', f.read().lower()))

Now you have a Counter object counting all the "words" with their
occurrence counts (by this definition) in the document. You can fish
through that to answer the questions asked (find entries with a count of
1, 2, 3, etc.)
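
The "fishing" step might look like the following. This is a minimal sketch; the helper name words_with_counts and the tiny stand-in Counter are made up for illustration and are not part of the program above.

```python
from collections import Counter

def words_with_counts(wordcount, wanted):
    """Return the sorted words whose count is in the collection `wanted`."""
    return sorted(w for w, n in wordcount.items() if n in wanted)

# Small stand-in for the Counter built from the novel.
wordcount = Counter("a b b c c c d d d d e".split())

once = words_with_counts(wordcount, {1})
up_to_three = words_with_counts(wordcount, {1, 2, 3})
print(once)
print(up_to_three)
```

Passing a set of wanted counts covers both questions (exactly once; once, twice or three times) with one pass each over the Counter.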

Some people Go Big and use something that actually tries to recognize
the language, as opposed to making assumptions from ranges of
characters. nltk is a choice there. But at this point it's not really
"simple" any longer (though nltk experts might end up disagreeing with
that).

Edward Teach

2024-06-03 09:47:42 UTC

On Sat, 1 Jun 2024 13:34:11 -0600

Post by Mats Wichmann
hmmm, I "sent" this but there was some problem and it remained
unsent. Just in case it hasn't All Been Said Already, here's the

Post by d***@online.de

Post by HenHanna
Given a text file of a novel (JoyceUlysses.txt) ...
could someone give me a pretty fast (and simple) Python program
that'd give me a list of all words occurring exactly once?

* parse the text into words
This depends on your notion of "word".
In the simplest case, a word is any maximal sequence of
non-whitespace characters. In this case, you can use `split` for
this task

This piece is by far "the hard part", because of the ambiguity. For
example, if I just say non-whitespace, then a word followed by
punctuation counts as a distinct word. What about hyphenation - of which
there's both the compound word forms and the ones at the end of lines if the
source text has been formatted that way. Are all-lowercase words
different than the same word starting with a capital? What about
non-initial capitals, as happens a fair bit in modern usage with
acronyms, trademarks (perhaps not in Ulysses? :-) ), etc. What about
accented letters?

If you want what's at least a quick starting point to play with, you
could use a very simple regex - a fair amount of thought has gone
into what a "word character" is (\w), so it deals with excluding both
punctuation and whitespace.

import re
from collections import Counter

with open("JoyceUlysses.txt", "r") as f:
    wordcount = Counter(re.findall(r'\w+', f.read().lower()))

Now you have a Counter object counting all the "words" with their
occurrence counts (by this definition) in the document. You can fish
through that to answer the questions asked (find entries with a count
of 1, 2, 3, etc.)

Some people Go Big and use something that actually tries to recognize
the language, as opposed to making assumptions from ranges of
characters. nltk is a choice there. But at this point it's not
really "simple" any longer (though nltk experts might end up
disagreeing with that).

The Gutenberg Project publishes "plain text". That's another problem,
because "plain text" means UTF-8....and that means unicode...and that
means running some sort of unicode-to-ascii conversion in order to get
something like "words". A couple of hours....a couple of hundred lines
of C....problem solved!
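
For comparison, if such a conversion really were wanted, a best-effort version in Python is only a few lines. This is a sketch under assumptions, not a recommendation: the function name to_ascii is made up, and the approach (decompose with NFKD, then drop anything non-ASCII) silently discards characters that have no ASCII decomposition, such as curly quotes.

```python
import unicodedata

def to_ascii(text):
    """Best-effort ASCII folding: decompose, then drop what is not ASCII."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" drops the marks and anything else non-ASCII.
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("café naïve"))
```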

Grant Edwards

2024-06-03 18:58:26 UTC

Post by Edward Teach
The Gutenberg Project publishes "plain text". That's another
problem, because "plain text" means UTF-8....and that means
unicode...and that means running some sort of unicode-to-ascii
conversion in order to get something like "words". A couple of
hours....a couple of hundred lines of C....problem solved!

I'm curious. Why does it need to be converted from Unicode to ASCII?

When you read it into Python, it gets converted right back to Unicode...

Edward Teach

2024-06-04 11:21:34 UTC

On Mon, 03 Jun 2024 14:58:26 -0400 (EDT)

I'm curious. Why does it need to be converted from Unicode to ASCII?
When you read it into Python, it gets converted right back to
Unicode...

Well.....when using the file linux.words as a useful master list of
"words".....linux.words is strict ASCII........

Grant Edwards

2024-06-04 17:05:10 UTC

Post by Edward Teach
On Mon, 03 Jun 2024 14:58:26 -0400 (EDT)

I'm curious. Why does it need to be converted from Unicode to ASCII?
When you read it into Python, it gets converted right back to
Unicode...

Well.....when using the file linux.words as a useful master list of
"words".....linux.words is strict ASCII........

I guess I missed the part of the problem description where it said to
use linux.words to decide what a word is. :)

--
Grant

a***@gmail.com

2024-06-04 21:30:47 UTC

Post by Edward Teach
Well.....when using the file linux.words as a useful master list of
"words".....linux.words is strict ASCII........

The meaning of "words" depends on the context. The contents of the file
mentioned are a minor attempt to capture a common subset of English words,
but they are probably not what you mean by words in other contexts. They omit
plenty of words that are also plain ASCII, such as names (especially uncommon
ones) and words like UNESCO. There are other curated word lists, such as valid
Scrabble words or WORDLE words, built for specialized purposes and excluding
words of lengths that cannot be used. The person looking to count words in a
work must determine what words make sense for their purpose.

ASCII is a small subset of UNICODE. So when using a concept of word that
includes many characters from many character sets, and in many languages,
things may not be easy to parse unambiguously, such as words containing an
apostrophe early on, as in d'eau. Words can flow in different
directions. There can be fairly complex rules, and sometimes things like
compound words may need to be treated as either one word or several. They
may even occur both ways in the same work, so is "every body" the same as
"everybody"?

So what is being discussed here may have several components. One is to
tokenize all the text to make a set of categories. Another is to count them.
Perhaps another might even analyze and combine multiple categories or even
look at words in context to determine if two uses of the same word are
different enough to try to keep both apart in two categories. Is polish the
same as Polish?

Once that is decided, you have a fairly simple exercise in storing the data
in a searchable data structure and doing your searches to get subsets and
counts and so on.

As mentioned, Python's native string type is UNICODE, and ASCII files
being read in will be UNICODE internally unless you explicitly ask
otherwise (for example, by reading bytes). The conversion from ASCII to
UNICODE is trivial.
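
A small illustration of that point, as a sketch: the temporary file and its one-line contents are stand-ins invented for the example, not the novel itself.

```python
import os
import tempfile

# Write a small UTF-8 file to a temporary location (stand-in for the novel).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("Déjà vu, said Bloom.\n")

# Reading in text mode decodes the bytes into a str of Unicode code points.
with open(path, encoding="utf-8") as f:
    text = f.read()
print(type(text).__name__)

# Pure-ASCII bytes decode to the same string under either codec.
ascii_bytes = b"said Bloom"
same = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")
print(same)
```

Naming the encoding explicitly in open() avoids depending on the platform default.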

As for how well the regular expressions like \w work in general, I have no
idea. I can be very sure they are way more costly than the simpler ones you
can write that just know enough about what English words in ASCII look like
and perhaps get it wrong on some edge cases.
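
Rather than guess at the relative cost, it can be measured. The following is a minimal sketch; the sample text and repeat counts are arbitrary, the timings will vary by machine and Python version, and no particular outcome is claimed here.

```python
import re
import timeit

# Made-up, all-lowercase ASCII sample; substitute the real text to compare.
sample = "stately plump buck mulligan came from the stairhead " * 2000

unicode_words = re.compile(r"\w+")      # Unicode-aware for str patterns
ascii_letters = re.compile(r"[a-z]+")   # lowercase ASCII letters only

# On this sample both patterns find the same words.
same_result = unicode_words.findall(sample) == ascii_letters.findall(sample)
print(same_result)

t_unicode = timeit.timeit(lambda: unicode_words.findall(sample), number=20)
t_ascii = timeit.timeit(lambda: ascii_letters.findall(sample), number=20)
print(f"\\w+: {t_unicode:.3f}s   [a-z]+: {t_ascii:.3f}s")
```

Note that the two patterns are not equivalent in general: \w also matches digits, underscores and non-ASCII letters.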

-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=***@python.org> On
Behalf Of Edward Teach via Python-list
Sent: Tuesday, June 4, 2024 7:22 AM
To: python-***@python.org
Subject: Re: From JoyceUlysses.txt -- words occurring exactly once

On Mon, 03 Jun 2024 14:58:26 -0400 (EDT)

I'm curious. Why does it need to be converted from Unicode to ASCII?
When you read it into Python, it gets converted right back to
Unicode...

Well.....when using the file linux.words as a useful master list of
"words".....linux.words is strict ASCII........


Chris Angelico

2024-06-04 22:02:26 UTC

On Wed, 5 Jun 2024 at 02:49, Edward Teach via Python-list

Post by Edward Teach
On Mon, 03 Jun 2024 14:58:26 -0400 (EDT)

I'm curious. Why does it need to be converted from Unicode to ASCII?
When you read it into Python, it gets converted right back to Unicode...

Well.....when using the file linux.words as a useful master list of
"words".....linux.words is strict ASCII........

Whatever gave you that idea? I have a large number of dictionaries in
/usr/share/dict, all of them encoded UTF-8 except one (and I don't
know why that is). Even the English ones aren't entirely ASCII.

There is no need to "convert from Unicode to ASCII", which makes no sense.

ChrisA

d***@online.de

2024-06-04 16:13:47 UTC