Archive for July, 2015


Mapping changing language

July 30, 2015

Based on a Writing Tip I wrote for my team…

*************

English is a changing language. Some changes take centuries; others are much quicker. Changes that apply in Australia may not apply in the UK or US, and vice versa.

This Writing Tip discusses some language changes and shows how computer algorithms can plot these graphically. My focus is on the paired words that start as separate words, morph into a hyphenated form, and then become a single word.

I’ll start with an easy one that changed very quickly: database. At one point in the 1960s and 70s it was two separate words (‘data base’), then briefly went through a hyphenated phase (‘data-base’), before settling down and becoming the single word we’re all familiar with (‘database’).

ngrams_database

Likewise, ‘email’ – it started as ‘e mail’, became ‘e-mail’ for quite a while, and is now accepted by most style guides as the single word, ‘email’, though ‘e-mail’ is still hanging on in usage.

ngrams_email

Many words are at different stages of morphing into a single word – some will never get there, others may remain hyphenated for a long time (decades, centuries), while others leap into single word status very easily. Who decides when a word pair becomes one? Lexicographers (the people who write dictionaries), who follow usage patterns when deciding whether to hyphenate, join, or leave separate a word pair. But lexicographers can’t keep up with usage patterns that change very quickly, so dictionaries are just one guide as to how to deal with a word pair.

More recently, computer algorithms (such as that used by Google Books Ngram Viewer: https://books.google.com/ngrams/) are plotting usage across centuries of writing (since 1800). However, as far as I can tell, there seems to be no way for casual users of the Ngrams to separate the language of books written in American English from those written in British English (and forget about Australian English – I doubt many books in the Google Books database use Australian English). This means that the results, while fascinating to some, don’t take the place of dictionaries in the ‘home’ language. Algorithms of usage patterns also don’t take the place of style guides put out by professional associations and societies (see this article on whether a honey bee should be a ‘honeybee’ or not: http://entomologytoday.org/2014/05/06/is-it-honey-bee-or-honeybee-bed-bug-or-bedbug-house-fly-or-housefly/).

I used Google Books Ngram Viewer to track down a few words that we use, which have several variations, to see when usage changed. My search criteria for each were the separate words, the hyphenated form, and the joined form. What these graphs don’t show is whether the word is used as a noun, adjective, verb, or something else – context can dictate how you deal with a word pair (e.g. ‘to start up’ something [two words, verb] is quite different to ‘start-up activities’ [adjectival phrase]). Update: I’ve since found out you can specify the part of speech; see the ‘Part-of-speech’ information on this page: http://books.google.com/ngrams/info

If you’re interested in language, try the Ngram Viewer with your own words to see when words came into usage (just enter one word), how they changed (enter variations of the word), etc.
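If you'd rather build the comparison links in a script than type each variation into the page, here is a minimal Python sketch. It assumes the Viewer's graph page takes content, year_start and year_end as query parameters; that matches the links the Viewer generates, but it is not a documented API and could change. The function name ngram_url is made up for this example.

```python
from urllib.parse import urlencode

# Assumed address of the Viewer's graph page (not a documented API).
NGRAM_GRAPH = "https://books.google.com/ngrams/graph"

def ngram_url(variants, year_start=1800, year_end=2000):
    """Build a link that plots several variants of a word on one graph."""
    query = urlencode({
        "content": ",".join(variants),  # comma-separated search terms
        "year_start": year_start,
        "year_end": year_end,
    })
    return f"{NGRAM_GRAPH}?{query}"

url = ngram_url(["data base", "data-base", "database"])
print(url)
```

Pass the separate, hyphenated and joined forms in one list to get all three lines on the same graph.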

ngrams_macroinvertebrate

 

ngrams_dataset

ngrams_seabed

 

 

ngrams_subsea

ngrams_subtropical

ngrams_website

 

ngrams_colour

This last one compares ‘color’ and ‘colour’, but doesn’t separate British English usage from American English usage. Surprisingly, ‘colour’ was most common until about 1890, when ‘color’ came into prominence. Was this a reflection of language usage, or of the Google Books project having scanned more British English books than American English ones from the 1800s? And vice versa after the 1890s?

*************

For fun, I entered a single four-letter word and came up with this graph, which clearly shows that the word was used in printed books until the 1820s and then did not appear again in books until the late 1950s! Those Victorians had a lot of influence…

ngrams_f

 

And I entered another word (‘gay’) that has changed meaning over the past 100 years, dropping out of usage as that change occurred (from the 1940s to 1980s), then picking up its current usage from the early 1980s:

ngrams_gay

I also checked the stats for ‘awesome’, which show a big swing upwards from the 1960s onwards:

ngrams_awesome

And in the same vein, I checked ‘groovy’, which surprisingly was used as far back as the 1840s, but had its heyday in the 1960s and 70s, with a resurgence in the late 1990s/early 2000s:

ngrams_groovy

Fascinating stuff! I could play all day…

See also: http://www.brainpickings.org/2014/01/17/uncharted-big-data/

[Links last checked Aug 2015]


Word: Find words with numerals and delete the numeral

July 29, 2015

Lesley, a transcriptionist from the UK, emailed me with her problem:

…although [I’m] a fast typist I do make mistakes. One of my common ones is that when I’m typing quickly I can hit a number key so a word can come out like tr4st for example or what4ever.

She wanted to know if there was a wildcard find/replace for [a-z][0-9] that would solve her problem.

Well, yes there is, but Word already offers some standard solutions to identify and fix these situations (Method 1), which may mean you don’t need the find/replace solution (Method 2).

Method 1: Use Word’s existing settings and functions

There’s a setting to ignore numbers in words when doing a spell check. If you turn that off and turn on check spelling as you type, AND run a final spell check, that may solve the problem as the spell check function will find them.

  1. Go to File > Options > Proofing.
  2. Make sure the Ignore words that contain numbers checkbox is clear (unselected).
  3. For added checks, turn on Check spelling as you type (also in that Proofing area).
  4. Don’t forget to run the spellchecker when you’ve finished your document.

Method 2: Use a wildcard find and replace

If you’d rather use a find/replace routine to find words with a single numeral surrounded by lower case letters (as in Lesley’s examples above), then follow the steps below.

NOTE: This find string will NOT find a numeral that has an upper case letter before or after it, two or more consecutive numerals (e.g. tr45st), or a numeral at the start or end of a word (e.g. 4trust, whatever4).

  1. Press Ctrl+H to open the Find and Replace dialog box on the Replace tab.
  2. Click More.
  3. Select the Use wildcards check box.
  4. In the Find What field, type: ([a-z])([0-9])([a-z]) (This string looks for a lower case letter immediately followed by a single number, immediately followed by a lower case letter. There are NO spaces in this string.)
  5. In the Replace With field, type \1\3 (You’re replacing the first and third found elements [i.e. the lower case letters] with themselves, and not replacing the number, thus deleting it. NOTE: There are NO spaces in this replace string.)
  6. Click Find Next to find the first instance of a number inside a word.
  7. Assuming the number is found correctly and it’s what you want to delete, click Replace.
  8. Repeat steps 6 and 7 for all instances.

Warning: You could do Replace All but you have to be ABSOLUTELY certain you aren’t replacing something you shouldn’t. Replace All is very powerful and makes global changes… You have been warned!
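For anyone cleaning up plain text outside Word, the same find/replace idea can be sketched in Python. This is an illustration only: Python's regular expressions are a different engine from Word's wildcards, and the function name drop_stray_digits is made up for this example.

```python
import re

# Same idea as the Word find string ([a-z])([0-9])([a-z]):
# a lower case letter, a single digit, a lower case letter.
STRAY_DIGIT = re.compile(r"([a-z])[0-9]([a-z])")

def drop_stray_digits(text: str) -> str:
    # Put back the two letters (groups 1 and 2) and leave out the digit.
    return STRAY_DIGIT.sub(r"\1\2", text)

print(drop_stray_digits("I tr4st that what4ever happens, room 4 is free."))
```

As with the Word version, a numeral at the start or end of a word, next to an upper case letter, or next to another numeral is left alone, and a standalone number such as the 4 in "room 4" is untouched.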

This is what your Find and Replace dialog box should look like:

numeral_FR


Word: Change subscripted numerals to normal and surround with square brackets

July 22, 2015

On another blog post (https://cybertext.wordpress.com/2009/08/06/word-replace-text-containing-superscript-and-subscript-characters/), Bill asked:

All of my subscripts are consecutive numbers. I need to change all of them to normal size and with brackets around each consecutive number. Any ideas how I can do this for a large document?

You can do this using wildcards in the find and replace window, BUT I couldn’t find an easy way to get numbers greater than one numeral, so you may have to run two (for tens) or three (for hundreds) instances to get them all. Despite the apparent length of the steps below, it shouldn’t take very long — a few minutes at most.

(For others reading this, yes, I tried ([0-9]@) but for some reason it didn’t work — when the replace occurred, it replaced each number individually and surrounded each numeral with its own set of square brackets, which is not what was wanted. Update Aug 2015: I’ve found a simpler solution! And as a result have deleted all the complex steps I originally documented in this post.)

As the consecutive numbering that Bill talked about is irrelevant to my solution, I’ll ignore it.

Because you are potentially making global changes to your document, I strongly suggest you work on a copy of it, not the original, until you are satisfied that it works as you expect it to.

Find and replace one or more subscripted numbers

  1. Turn off track changes, if it’s on. (see Nic’s comment dated 10 April 2019)
  2. Press Ctrl+H to open the Find and Replace dialog box on the Replace tab.
  3. Make sure your cursor is in the Find what field.
  4. Click More.
  5. Select the Use wildcards check box.
  6. Click Format.
  7. Click Font.
  8. Select the Subscript check box until it becomes a check mark.
  9. Click OK to close the Font dialog box. You should have Use wildcards and Subscript listed below the Find What field.
  10. In the Find What field, type: ([0-9]{1,}) (NO spaces).
  11. In the Replace With field, type [\1] (you’re putting square brackets around what you’re replacing, and you’re replacing the found element with itself [that’s the \1 bit]).
  12. While your cursor is still in the Replace With field, click Format again.
  13. Click Font.
  14. Select the Subscript check box until it becomes blank, then click OK.
  15. Click Find Next to find the first subscripted numeral.
  16. Assuming the number is found correctly and it’s what you want to change, click Replace.
  17. Repeat steps 15 and 16 for all the other subscripted numerals.

Warning: You could do Replace All but you have to be ABSOLUTELY certain you aren’t replacing something you shouldn’t. Replace All is very powerful and makes global changes… You have been warned!
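The difference between bracketing each numeral and bracketing the whole number can be pictured with Python's regular expressions. This is an analogy only. It assumes that ([0-9]@) behaved like a "match as few as possible" pattern in that replace, which fits the result described earlier in this post but isn't confirmed by it. It also works on plain text, so the subscript formatting is ignored.

```python
import re

text = "H2O and C12H22O11"

# Lazy: each match stops after one digit, so every digit gets its own brackets.
one_at_a_time = re.sub(r"([0-9]+?)", r"[\1]", text)

# Greedy: each match takes the whole run of digits.
whole_run = re.sub(r"([0-9]+)", r"[\1]", text)

print(one_at_a_time)
print(whole_run)
```

The second pattern plays the role of ([0-9]{1,}) in the steps above: one set of brackets per number, however many digits it has.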

Your Find and Replace dialog box should look like this:

subscript_FR_01a


Word: Wildcard find and replace for numbers and trailing punctuation

July 14, 2015

On another post on this blog, Michael commented that he had a situation where he needed to run find and replace routines (possibly using wildcards) on hundreds of documents to convert MANUALLY entered step numbers (with various trailing punctuation) to a common format ready for conversion into another type of document.

He wanted to convert numbered steps like 1), 2), 3), etc. (each on a new line) to (1), (2), (3), etc. (also on separate lines). And convert numbered steps like (1), (2), (3), etc. (each on a new line) to 1., 2., 3., etc. (also on separate lines).

Notes and assumptions:

  • These steps ONLY work for manually entered numbered steps, NOT automatic numbering in Word.
  • I have assumed that each numbered item starts a new paragraph.
  • Before you test this on your document, make a copy of the document and work on that until you are comfortable with what you have to do, and until you are satisfied that these steps are what you want.
  • Be very careful with the ‘Replace All’ command — you may inadvertently replace something you don’t want to replace. Only if you’re 100% sure there are no other uses of a number and the relevant trailing punctuation would you use ‘Replace All’. If you’re not sure, use ‘Replace’ only. Yes, this will mean clicking ‘Replace’ for each one, but until you are confident that you won’t replace something you shouldn’t, it’s a safer option.

Example 1: Replace 1) etc. with (1)

  1. Press Ctrl+H to open the Find and Replace dialog.
  2. Click More, then select the Use wildcards check box.
  3. In Find What, type: (^013)([0-9]@)([\)])
  4. In Replace With, type: \1(\2\3

The three elements of the Find are:

  1. (^013): This represents the paragraph marker at the end of the line above the numbered step.
  2. ([0-9]@): The [0-9] represents any single digit from 0 to 9, and the @ represents one or more of those digits, so the find isn’t limited to single-digit numbers. This finds numbers like 2, 25, and 283.
  3. ([\)]): You need to find a specific character (the closing parenthesis). Because parentheses are special wildcard characters in their own right, you need to tell Word to treat this one as a normal text character, so you put a backslash ‘\’ (also known as an ‘escape’ character) before the ). Here the escaped character is also wrapped in square brackets, and the whole thing is enclosed in parentheses to make it the third element.

There are no spaces preceding or trailing any of these elements, or in between them, so if you copy the code from this blog post, PLEASE get rid of any preceding spaces otherwise it won’t work (yes, I know this because it caught me out too!).

The four elements of the Replace are:

  1. \1 — Tells Word to replace the first element of the Find with what was in the Find (the paragraph marker).
  2. (  — Tells Word to add an opening parenthesis before the next element (the number).
  3. \2 — Tells Word to replace the second element of the Find with the same text as what was found (the numerals).
  4. \3 — Tells Word to replace the third element of the Find with what was in the Find (the closing parenthesis).
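To see how the three find elements and the four replace elements fit together, here is a rough Python equivalent for plain text. It is a sketch, not Word's own behaviour: a newline stands in for the paragraph marker, [0-9]+ stands in for [0-9]@, and the function name close_paren_to_both is made up for this example.

```python
import re

def close_paren_to_both(text: str) -> str:
    # Group 1: the line break before the step (stands in for ^013).
    # Group 2: one or more digits. Group 3: the closing parenthesis.
    # The replacement puts group 1 back, adds "(", then groups 2 and 3.
    return re.sub(r"(\n)([0-9]+)(\))", r"\1(\2\3", text)

steps = "Setup\n1) Open the file\n2) Edit it\n10) Save it"
print(close_paren_to_both(steps))
```

Because the pattern needs a line break in front of the number, a step on the very first line of the text is not changed in this sketch, and a "2)" in the middle of a sentence is left alone.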

Example 2: Replace (1) etc. with 1.

  1. Press Ctrl+H to open the Find and Replace dialog.
  2. Click More, then select the Use wildcards check box.
  3. In Find What, type: (^013)([\(])([0-9]@)(\))
  4. In Replace With, type: \1\3.

What the find and replace ‘codes’ mean:

The four elements of the Find are:

  1. (^013): This represents the paragraph marker at the end of the line above the numbered step.
  2. ([\(]): You need to find a specific character (the opening parenthesis). Because parentheses are special wildcard characters in their own right, you need to tell Word to treat this one as a normal text character, so you put a backslash ‘\’ (also known as an ‘escape’ character) before the (, AND square brackets around that pair (otherwise, it won’t work). The whole thing is then enclosed in parentheses to make it the second element.
  3. ([0-9]@): The [0-9] represents any single digit from 0 to 9, and the @ represents one or more of those digits, so the find isn’t limited to single-digit numbers. This finds numbers like 2, 25, and 283.
  4. (\)): As with the opening parenthesis, you put a backslash ‘\’ before the closing parenthesis so that Word treats it as a normal text character, then enclose it in parentheses to make it the fourth element.

There are no spaces preceding or trailing any of these elements, or in between them, so if you copy the code from this blog post, get rid of any preceding spaces, otherwise it won’t work.

The three elements of the Replace are:

  1. \1 — Tells Word to replace the first element of the Find with what was in the Find (the paragraph marker).
  2. \3 — Tells Word to replace the third element of the Find with the same text as what was found (the numerals).
  3. . — Tells Word to add a period (full stop) after the third element of the Find.
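The same conversion can be sketched in Python for plain text. Again, this is an illustration rather than Word's own behaviour: a newline stands in for the paragraph marker, and the function name both_parens_to_period is made up for this example.

```python
import re

def both_parens_to_period(text: str) -> str:
    # Groups: 1 = line break, 2 = "(", 3 = the digits, 4 = ")".
    # Only groups 1 and 3 are put back, followed by a period,
    # so both parentheses are dropped.
    return re.sub(r"(\n)(\()([0-9]+)(\))", r"\1\3.", text)

steps = "Setup\n(1) Open the file\n(25) Save it"
print(both_parens_to_period(steps))
```

Leaving groups 2 and 4 out of the replacement is what deletes the parentheses; nothing has to be replaced with an empty string explicitly.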

Michael: I hope this solves your problem. Donations to keeping this blog ad-free gratefully accepted (see the link at the top right of the page).

Update: Michael also wanted to know how to revise these find strings if the lists used lower case letters, e.g. a), b) or (a), (b), etc., instead of numerals. In that case you’d substitute the [0-9]@ in both examples above with [a-z]. If the letters might include upper case letters, then you’d use [A-z]. Don’t forget to omit the @!
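For lettered steps in plain text, a rough Python equivalent of the first example looks like this. It is a sketch under the same assumptions as before (a newline stands in for the paragraph marker; the function name letter_steps is made up). One difference to note: in Python's engine, [A-z] also matches a few punctuation characters that sit between Z and a, so the sketch spells out [A-Za-z] instead. Whether Word's wildcards behave the same way isn't verified here.

```python
import re

def letter_steps(text: str) -> str:
    # A single letter, so no quantifier is needed (the "omit the @" point).
    # [A-Za-z] covers lower and upper case letters and nothing else.
    return re.sub(r"(\n)([A-Za-z])(\))", r"\1(\2\3", text)

options = "Options\na) Red\nB) Blue\nab) not a step"
print(letter_steps(options))
```

Because only one letter is allowed before the closing parenthesis, "ab)" at the start of a line is not treated as a step.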