Perkins - The phonetician's assistant

Version 0.4.6.3 beta

Copyright (c) 2012 Scott Sadowsky

ssadowsky at g m a i l dot com
http://sadowsky.cl/

Perkins is software which phonemically transcribes, syllabifies and assigns accents and pauses to orthographic Spanish texts. Other common transcription modes include CV, place of articulation and manner of articulation.

When beta testing is over, Perkins will be licensed under a GNU free software license. In the meantime, to protect the innocent and unwary, it is copyrighted, proprietary software. You may use Perkins to transcribe anything you wish. You may not copy, distribute, modify or reverse engineer it, nor use it in your own programs (for now).

Please send bug reports to the e-mail address given above.

Installation and use

To process a text using Perkins' default options, unzip the downloaded file and do the following:

Windows executable

GNU/Linux binary

To change the interface language to English or Spanish, run it with the -eng or -esp switches, respectively.

Getting help

Run Perkins with the -h switch for help, and with the -u switch for usage information.

Keep in mind that the Windows command line cannot correctly display Unicode text, and so non-ASCII characters in the built-in help and usage info will be rather difficult to read on Windows. The README file contains mostly the same information, however.

Also note that this issue in no way affects Perkins' output.

Additional requirements

Input files must be ISO-8859-1 (Latin-1) encoded plain text. Output files are UTF-8 (Unicode) plain text. In order to correctly view Perkins' transcriptions, you need the following:

General usage information

Selecting a transcription mode

Processing options

Main options

-i source.txt
--input=source.txt
  Specify the file to be processed. MANDATORY.
-o trans.txt
--output=trans.txt
  Specify the file in which to save Perkins' output. If not specified, a name will be automatically generated using the input file basename and an appropriate extension (e.g. .phnm).
-en   Set interface language to English.
-es   Set interface language to Spanish.
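
The automatic output name described under -o can be pictured as follows. This is only a sketch in Python, not Perkins' own code: the function name default_output_name is hypothetical, and .phnm is the only extension this document mentions, so any other extension is an assumption.

```python
from pathlib import Path


def default_output_name(input_path, extension=".phnm"):
    """Swap the input file's extension for `extension` (hypothetical helper)."""
    # with_suffix() replaces only the last extension and keeps the basename.
    return str(Path(input_path).with_suffix(extension))


print(default_output_name("source.txt"))
```

With the input name used in the examples, this yields source.phnm.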
   

Transcription mode options

-MODE
-f MODE
--format=MODE
  Specify the transcription format or mode. NOT case-sensitive. The possible modes are:
  F or PH (phonemic transcription)
  CV (consonant/vowel transcription)
  CVG (consonant/vowel/glide transcription)
  CVN (consonant/vowel/nasal/lateral/rhotic/glide)
  M or MANNER (manner of articulation)
  P or PLACE (place of articulation)
  V (voicing)
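
As a rough illustration of what the CV and CVG modes produce, the Python sketch below reduces an existing phonemic transcription to its skeleton. It is not how Perkins works internally; the function name skeleton and the vowel and glide inventories are assumptions made for the example. Diacritics are ignored and an affricate written with a tie bar counts as a single consonant, which is consistent with the transcriptions shown in the usage examples.

```python
import unicodedata

VOWELS = set("aeiou")
GLIDES = set("jw")
STRESS = "\u02c8"  # IPA primary stress mark
TIE = "\u0361"     # tie bar joining the two halves of an affricate


def skeleton(transcription, glides=False):
    """Return a CV skeleton (glides=False) or a CVG skeleton (glides=True)."""
    out = []
    skip_next = False
    for ch in transcription:
        if ch == TIE:
            skip_next = True          # the next letter belongs to the same affricate
        elif unicodedata.combining(ch):
            continue                  # dental, retracted and similar diacritics
        elif skip_next:
            skip_next = False         # second half of the affricate, already counted
        elif ch in VOWELS:
            out.append("V")
        elif ch in GLIDES:
            out.append("G" if glides else "V")
        elif ch.isalpha() and ch != STRESS:
            out.append("C")
        else:
            out.append(ch)            # dots, spaces, stress and pause marks pass through
    return "".join(out)


print(skeleton("ka.\u02c8t\u0361\u0283aj"))               # CV mode
print(skeleton("ka.\u02c8t\u0361\u0283aj", glides=True))  # CVG mode
```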
   

Specific phoneme options

-multi, -mc   Use multi-character IPA symbols for some phonemes.
-tg   Treat /tr/ as a single phoneme (use ligature, or represent it as voiceless retroflex fricative /ʂ/, depending on the setting of -mc).
-yf   Represent the "ye" phoneme as the fricative /ʝ/.
-ya   Represent the "ye" phoneme as the affricate /d͡ʒ/.
-ar   Use the "retracted" diacritic in some affricates (e.g. t̠͡ʃ).
   

Glide options

-gd, --glides-dia   Represent glides as vowel + the "non-syllabic" diacritic (/i̯/ and /u̯/).
-nogd, --noglides-dia   Represent glides as /j/ and /w/.
-wv   Represent wau as u + the "non-syllabic" diacritic (/u̯/).
-yv   Represent yod as i + the "non-syllabic" diacritic (/i̯/).
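
The effect of the glide options can be sketched as a plain substitution on a transcription that uses /j/ and /w/. The Python below is an illustration only; the function name glides_as_vowels and its yod and wau parameters are hypothetical stand-ins for -yv and -wv (both together behaving like -gd).

```python
NON_SYLLABIC = "\u032f"  # combining inverted breve below, the "non-syllabic" diacritic


def glides_as_vowels(transcription, yod=True, wau=True):
    """Rewrite /j/ and /w/ as i and u plus the non-syllabic diacritic."""
    if yod:
        transcription = transcription.replace("j", "i" + NON_SYLLABIC)
    if wau:
        transcription = transcription.replace("w", "u" + NON_SYLLABIC)
    return transcription


print(glides_as_vowels("a.gwan"))
print(glides_as_vowels("\u02c8sjon", wau=False))
```

The fricative /ʝ/ is a different character from /j/, so it is left untouched.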
   

Stress options

-st   Mark stress with an acute accent over the vowel rather than the IPA apostrophe-like symbol.
-oa   Mark stress with orthographic apostrophe, rather than the similar IPA symbol.
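
The two alternative stress notations can be sketched as transformations of a transcription that already carries the IPA stress mark. This Python is an assumption-laden illustration, not Perkins' implementation: the function names are hypothetical, and the accent version assumes glides are written as /j/ and /w/, so that the first vowel letter after the stress mark is the syllable nucleus.

```python
import re
import unicodedata

STRESS = "\u02c8"  # IPA primary stress mark
ACUTE = "\u0301"   # combining acute accent

# Stress mark, then the syllable onset (anything but vowels, dots, spaces), then the nucleus.
_STRESSED = re.compile(STRESS + r"([^aeiou.\s]*)([aeiou])")


def stress_as_accent(transcription):
    """Drop the IPA stress mark and put an acute accent on the stressed vowel."""
    marked = _STRESSED.sub(lambda m: m.group(1) + m.group(2) + ACUTE, transcription)
    # NFC turns vowel + combining acute into the precomposed accented letter.
    return unicodedata.normalize("NFC", marked)


def stress_as_apostrophe(transcription):
    """Replace the IPA stress mark with a plain orthographic apostrophe."""
    return transcription.replace(STRESS, "'")


print(stress_as_accent("en.kon.sep.\u02c8sjon"))
print(stress_as_apostrophe("d\u032ae.\u02c8la.\u0272o"))
```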
   

Syllabification options

-sd, --syl-dots   Mark syllable divisions with dots (periods).
-ss, --syl-spaces   Represent syllable divisions with spaces.
-sbu   Syllabify by utterance/sentence, not by word ("los hombres" becomes /lo.som.bres/ instead of /los om.bres/).
-nosbu   Syllabify by word, not by utterance/sentence ("los hombres" becomes /los om.bres/ instead of /lo.som.bres/)
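
The difference between word-level and utterance-level syllabification comes down to moving a word-final consonant into the onset of a following vowel-initial word. The Python sketch below illustrates that single rule on words that are already syllabified. It is a simplification with hypothetical function names: it assumes the final consonant is a single character without diacritics and does not claim to reproduce Perkins' full rule set.

```python
VOWELS = set("aeiou")
STRESS = "\u02c8"  # IPA primary stress mark


def syllabify_by_word(words):
    """Join syllabified words with spaces, keeping word boundaries."""
    return " ".join(".".join(word) for word in words)


def syllabify_by_utterance(words):
    """Join syllabified words with dots, resyllabifying across word boundaries."""
    syllables = []
    for word in words:
        word = list(word)
        mark = STRESS if word[0].startswith(STRESS) else ""
        body = word[0][len(mark):]
        prev = syllables[-1] if syllables else ""
        # Consonant-final syllable followed by a vowel-initial word:
        # the consonant becomes the onset of the next syllable.
        if body[:1] in VOWELS and len(prev) > 1 and prev[-1] not in VOWELS:
            syllables[-1] = prev[:-1]
            word[0] = mark + prev[-1] + body
        syllables.extend(word)
    return ".".join(syllables)


words = [["los"], ["om", "bres"]]
print(syllabify_by_word(words))
print(syllabify_by_utterance(words))
```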
   

Pause / group options

-ip, --ipa-pauses   Represent pauses / groups with the IPA symbols | and ||.
-cmp   Treat commas as pauses.
-clp   Treat colons as pauses.
-scp   Treat semicolons as pauses.
-snp   Treat sentence breaks as pauses.
-ppp   Treat paragraph breaks as pauses.
-elp   Treat ellipses (...) as pauses.
-brp   Treat square brackets [] as pauses.
-pnp   Treat parentheses as pauses.
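
As an illustration of the pause options, the Python sketch below marks commas and sentence breaks in orthographic text. It is not Perkins' implementation: the function name and parameters are hypothetical, and the choice of | for commas and ‖ for sentence breaks is an assumption based on the transcriptions in the usage examples.

```python
import re

MINOR_PAUSE = "|"       # assumed symbol for comma-type pauses
MAJOR_PAUSE = "\u2016"  # double vertical line, assumed symbol for sentence breaks


def mark_pauses(text, commas=True, sentences=True):
    """Replace commas and sentence-internal breaks with pause symbols."""
    if sentences:
        # Sentence-final punctuation followed by more text.
        text = re.sub(r"\s*[.!?]+\s+(?=\S)", " " + MAJOR_PAUSE + " ", text)
    if commas:
        text = re.sub(r"\s*,\s*", " " + MINOR_PAUSE + " ", text)
    return text


print(mark_pauses("En Concepción, se trata de aguantar. ¿Cachái?"))
```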
   

Substitution options

-n2w   Convert numerals into word form ("4" is converted into "cuatro", which is then transcribed).
-sn=SYMBOL   Replace numerals with the SYMBOL specified here.
-cur=TEXT   Replace the $ symbol with the TEXT specified here.
-sl=TEXT   Replace the "/" symbol with the TEXT specified here.
-nsm, --no-stress-marks   Do not indicate stress in any way.
-pu   Process URLs as linguistic items. Otherwise, they're deleted. If treated linguistically, common items such as "Gmail", "Facebook", "http" and "www" are transcribed as commonly pronounced, while other things are transcribed as they would be pronounced if spelled out loud.
-pe   Process e-mail addresses as linguistic items. Otherwise, they're deleted.
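
The simple replacement options (-sn, -cur and the "/" replacement) can be pictured as text substitutions applied before transcription. The Python below is a sketch under assumptions: the function and parameter names are hypothetical, and it assumes that each run of digits is replaced by one SYMBOL, which this document does not specify.

```python
import re


def substitute(text, numeral_symbol=None, currency_text=None, slash_text=None):
    """Apply the requested replacements; options left as None are skipped."""
    if numeral_symbol is not None:
        # A function is used as the replacement so backslashes in the symbol stay literal.
        text = re.sub(r"\d+", lambda match: numeral_symbol, text)
    if currency_text is not None:
        text = text.replace("$", currency_text)
    if slash_text is not None:
        text = text.replace("/", slash_text)
    return text


print(substitute("durante 5 meses", numeral_symbol="#"))
```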
   

Presentation options

-owl   One word per line (split at words).
-osl   One syllable per line (split at syllables).
-kp   Keep paragraph breaks. Otherwise, output will be a wall of text.
-kc   Eliminate common words (for testing purposes).
   

Number processing options

-nyr   Treat two groups of 4 digits separated by "-" as a range of years ("1900-2000" > "1900 a 2000", not "1900 menos 2000").
-byr   Treat two groups of 1 to 4 digits separated by "-" as a range of years ("43-103" > "43 a 103", not "43 menos 103").
-ayr   Treat ALL groups of 1 to 4 digits separated by a "-" as ranges of years.
-bcy   Also process BCE years.
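
The year-range options amount to rewriting the hyphen as "a" before the numbers are processed. The Python sketch below shows the two patterns described for -nyr and -byr. The function name and the short parameter are hypothetical, and BCE handling (-bcy) is not covered.

```python
import re

FOUR_DIGIT_RANGE = re.compile(r"\b(\d{4})-(\d{4})\b")    # two groups of 4 digits
SHORT_RANGE = re.compile(r"\b(\d{1,4})-(\d{1,4})\b")     # two groups of 1 to 4 digits


def expand_year_ranges(text, short=False):
    """Rewrite "1900-2000" as "1900 a 2000"; with short=True also "43-103"."""
    pattern = SHORT_RANGE if short else FOUR_DIGIT_RANGE
    return pattern.sub(r"\1 a \2", text)


print(expand_year_ranges("entre 1900-2000"))
print(expand_year_ranges("entre 43-103", short=True))
```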
   

Meta-configurations

-rt, --corpus   For processing corpora of running text.
-sl, --syl-list   For creating transcriptions that permit easy processing at the syllable level.
-vrt   For processing verticalized text (one word per line of input). Can't perform all analyses (e.g. expanding abbreviations).
-wl, --word-list   For processing word lists (syllabifies at word level, not sentence level).

 

Usage examples

Below are some examples of the different types of transcription that Perkins can produce. In order to correctly view the IPA symbols, your browser must support Unicode and you must have an appropriate font installed. The text transcribed in all cases is:

En Concepción, se trata de aguantar la lluvia durante 5 meses del año. ¿Cachái?
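
Because input files must be ISO-8859-1 and output files are UTF-8, the sample sentence has to be saved and the results read back with the right encodings. The following Python sketch shows one way to do that; the helper names are hypothetical and source.txt is simply the file name used in the commands in this section.

```python
from pathlib import Path

SAMPLE = ("En Concepción, se trata de aguantar la lluvia "
          "durante 5 meses del año. ¿Cachái?")


def write_latin1(path, text):
    """Save text as ISO-8859-1 (Latin-1), the encoding required for input files."""
    Path(path).write_text(text, encoding="iso-8859-1")


def read_utf8(path):
    """Read a file as UTF-8, the encoding of the output files."""
    return Path(path).read_text(encoding="utf-8")


# Example: write_latin1("source.txt", SAMPLE) before running Perkins,
# then read_utf8() on the output file afterwards.
```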

Command:   ./perkins.pl -i=source.txt
Transcription:   en.kon.sep.ˈsjon | se.ˈt̪ɾa.t̪a.d̪e.a.gwan.ˈt̪aɾ.la.ˈʝu.bja.
d̪u.ˈɾan.t̪e.ˈsin.ko.ˈme.ses.d̪e.ˈla.ɲo ‖ ka.ˈt͡ʃaj
Description:   Default options. Phonemic transcription. Affricates have ligature. Yod and wau are represented as /j/ and /w/. IPA stress apostrophe. Dentals have diacritic. Multi-character symbols (e.g. /t͡ʃ/). Utterance-level processing. The "ye" phoneme is transcribed as /ʝ/.
     
Command:   ./perkins.pl -i=source.txt -at
Transcription:   en.kon.sep.sjón | se.t̪ɾá.t̪a.d̪e.a.gwan.t̪áɾ.la.ʝú.bja.
d̪u.ɾán.t̪e.sín.ko.mé.ses.d̪e.lá.ɲo ‖ ka.t͡ʃáj
Description:   Stress is marked with an acute accent on the vowel instead of an IPA apostrophe before the syllable.
     
Command:   ./perkins.pl -i=source.txt -ya
Transcription:   en.kon.sep.ˈsjon | se.ˈt̪ɾa.t̪a.d̪e.a.gwan.ˈt̪aɾ.la.ˈd͡ʒu.bja.
d̪u.ˈɾan.t̪e.ˈsin.ko.ˈme.ses.d̪e.ˈla.ɲo ‖ ka.ˈt͡ʃaj
Description:   The "ye" phoneme is transcribed as the affricate /d͡ʒ/.
     
Command:   ./perkins.pl -i=source.txt -ya -ar
Transcription:   en.kon.sep.ˈsjon | se.ˈt̪ɾa.t̪a.d̪e.a.gwan.ˈt̪aɾ.la.ˈd̠͡ʒu.bja.d̪u.ˈɾan.
t̪e.ˈsin.ko.ˈme.ses.d̪e.ˈla.ɲo ‖ ka.ˈt̠͡ʃaj
Description:   The "retracted" diacritic is used to represent the affricates /d̠͡ʒ/ and /t̠͡ʃ/.
     
Command:   ./perkins.pl -i=source.txt -ya -tg
Transcription:   en.kon.sep.ˈsjon | se.ˈt̪͡ɾa.t̪a.d̪e.a.gwan.ˈt̪aɾ.la.ˈd͡ʒu.bja.d̪u.ˈɾan.
t̪e.ˈsin.ko.ˈme.ses.d̪e.ˈla.ɲo ‖ ka.ˈt͡ʃaj
Description:   The "tr" cluster is treated as a phoneme (which is how it behaves in many Chilean speakers).
     
Command:   ./perkins.pl -i=source.txt -ya -tg -nomc
Transcription:   en.kon.sep.ˈsjon | se.ˈʂa.ta.de.a.gwan.ˈtaɾ.la.ˈʤu.bja.du.ˈɾan.te.
ˈsin.ko.ˈme.ses.de.ˈla.ɲo ‖ ka.ˈʧaj
Description:   Phonemes are represented only with one-character symbols (/ʤ/; /ʧ/; /ʂ/ instead of /t̪͡ɾ/) except for glides, which may be configured separately with the -gd and -nogd switches.
     
Command:   ./perkins.pl -i=source.txt -gd
Transcription:   en.kon.sep.ˈsi̯on | se.ˈt̪ɾa.t̪a.d̪e.a.gu̯an.ˈt̪aɾ.la.ˈʝu.bi̯a.d̪u.ˈɾan.t̪e.
ˈsin.ko.ˈme.ses.d̪e.ˈla.ɲo ‖ ka.ˈt͡ʃai̯
Description:   Transcribe glides as vowel + the "non-syllabic" diacritic (/i̯/, /u̯/) instead of /j/ and /w/.
     
Command:   ./perkins.pl -i=source.txt -nospe
Transcription:   en kon.sep.ˈsjon | se ˈt̪ɾa.t̪a d̪e a.gwan.ˈt̪aɾ la ˈʝu.bja d̪u.ˈɾan.t̪e
ˈsin.ko ˈme.ses d̪el ˈa.ɲo ‖ ka.ˈt͡ʃaj
Description:   Syllabify at word-level rather than utterance/sentence-level.
     
Command:   ./perkins.pl -i=source.txt -cv
Transcription:   VC.CVC.CVC.ˈCVVC | CV.ˈCCV.CV.CV.V.CVVC.ˈCVC.CV.ˈCV.CVV.CV.ˈCVC.
CV.ˈCVC.CV.ˈCV.CVC.CVC.ˈV.CV ‖ CV.ˈCVV
Description:   Analyze input in terms of consonant/vowel.
     
Command:   ./perkins.pl -i=source.txt -cvg
Transcription:   VC.CVC.CVC.ˈCGVC | CV.ˈCCV.CV.CV.V.CGVC.ˈCVC.CV.ˈCV.CGV.CV.ˈCVC.
CV.ˈCVC.CV.ˈCV.CVC.CVC.ˈV.CV ‖ CV.ˈCVG
Description:   Analyze input in terms of consonant/vowel/glide.
     
Command:   ./perkins.pl -i=source.txt -cvn
Transcription:   VN.CVN.CVC.ˈCGVN | CV.ˈCRV.CV.CV.V.CGVN.ˈCVR.LV.ˈCV.CGV.CV.ˈRVN.
CV.ˈCVN.CV.ˈNV.CVC.CVL.ˈV.NV ‖ CV.ˈCVG
Description:   Analyze input in terms of consonant/vowel/glide/nasal/lateral/rhotic.
     
Command:   ./perkins.pl -i=source.txt -m
Transcription:   VN.PVN.FVP.ˈFXVN | FV.ˈPTV.PV.PV.V.PXVN.ˈPVT.LV.ˈFV.PXV.PV.ˈTVN.
PV.ˈFVN.PV.ˈNV.FVF.PVL.ˈV.NV ‖ PV.ˈAVX
Description:   Analyze input in terms of MANNERS of articulation. (P=plosive, N=nasal, R=trill, T=tap/flap, F=fricative, L=lateral, A=affricate, X=approximant, V=vowel).
     
Command:   ./perkins.pl -i=source.txt -p
Transcription:   -A.V-A.A-B.ˈAP-A | A-.ˈDA-.D-.D-.-.VW-A.ˈD-A.A-.ˈP-.BP-.D-.ˈA-A.D-.
ˈA-A.V-.ˈB-.A-A.D-A.ˈ-.P- ‖ V-.ˈT-P
Description:   Analyze input in terms of PLACES of articulation. (B=bilabial, L=labiodental, D=dental, A=alveolar, T=post-alveolar, P=palatal, V=velar, W=labiovelar, -=vowel).
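
The legend given for the manner of articulation example can be read as a lookup table. The Python sketch below applies it to a phonemic transcription. It is an illustration rather than Perkins' own logic: the function name is hypothetical, the assignment of phonemes to classes covers only the symbols that appear in this document, and only the affricates /t͡ʃ/ and /d͡ʒ/ are handled.

```python
import unicodedata

TIE = "\u0361"  # tie bar joining the two halves of an affricate

MANNER = {
    **dict.fromkeys("ptkbdg\u0261", "P"),   # plosives
    **dict.fromkeys("mn\u0272", "N"),       # nasals
    "r": "R",                               # trill
    "\u027e": "T",                          # tap/flap
    **dict.fromkeys("fsx\u029d", "F"),      # fricatives
    "l": "L",                               # lateral
    **dict.fromkeys("jw", "X"),             # approximants
    **dict.fromkeys("aeiou", "V"),          # vowels
}


def manner_codes(transcription):
    """Replace each phoneme with its manner code; other symbols pass through."""
    # Drop diacritics (dental, retracted, ...) but keep the tie bar for now.
    kept = "".join(ch for ch in transcription
                   if ch == TIE or not unicodedata.combining(ch))
    # Affricates count as one segment.
    kept = kept.replace("t" + TIE + "\u0283", "A").replace("d" + TIE + "\u0292", "A")
    return "".join(MANNER.get(ch, ch) for ch in kept)


print(manner_codes("ka.\u02c8t\u0361\u0283aj"))
```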

 

Known issue

In all modes other than phonemic (e.g. CV, CVN), syllabification is always performed at the word level.