Perkins

The phonetician's assistant

Version 1.0.6

Perkins is software that phonemically transcribes, syllabifies, and assigns accents and pauses to orthographic Spanish texts. Other available transcription modes include CV, place of articulation and manner of articulation. Perkins' output is highly configurable through command line switches.

Perkins is written in Perl. It requires a few modules from CPAN which you may not have installed. If you want to avoid the hassle of doing this, just use the .exe version for Windows, or the binary version for x64 Linux.

Please send bug reports to this e-mail address.

How to cite: Sadowsky, Scott. 2016. Perkins - The Phonetician's Assistant. Version 1.0.6. Computer software. http://sadowsky.cl/perkins.html

Download from GitHub

Older versions

Published under the GNU Affero GPL v3 license.

Installation and use

To process a text using Perkins' default options, unzip the downloaded file and do the following:

Perl script

Make the .pl file executable (in *nix OSes).
Copy it to the directory (folder) where the text to be processed is located (or to a directory that's in your path).
Open a terminal window (command line) and navigate to the directory that contains the script and the text to be processed.
Execute the following command: ./perkins-1.0.5.pl -i sourcetext.txt

Windows executable

Copy the program (perkins-win-x86-1.0.5.exe or perkins-win-x64-1.0.5.exe) to the folder where the text to be processed is located (or copy it to a folder that's on your path, such as C:\Windows or C:\Windows\System32, to avoid this hassle).
Open a command prompt by hitting WINDOWS+R and typing cmd.exe (you can also type this in the Start Menu search box on Vista or later).
In the command prompt, navigate to the directory with the text to be processed using the cd command.
Type the following: perkins-win-x86-1.0.5.exe -i sourcetext.txt (or, if you're using the 64-bit version, perkins-win-x64-1.0.5.exe -i sourcetext.txt).

GNU/Linux binary

Make the .bin file executable.
Copy it to the directory where the text to be processed is located (or to a directory that's on your path).
Open a terminal window and navigate to the directory that contains the text to be processed.
Execute the following command: ./perkins-x86-1.0.5.bin -i sourcetext.txt (or, if you're using the 64-bit version, ./perkins-x64-1.0.5.bin -i sourcetext.txt).

To change the interface language to English or Spanish, run it with the -eng or -esp switches, respectively.

Getting help

Run Perkins with the -h switch for help, and with the -u switch for usage information.

Keep in mind that the Windows command line cannot correctly display Unicode text, and so non-ASCII characters in the built-in help and usage info will be rather difficult to read on Windows. The README file contains mostly the same information, however.

Also note that this issue in no way affects Perkins' output.

Additional requirements

Input files must be ISO-8859-1 (Latin-1) encoded plain text. Output files are UTF-8 (Unicode) plain text. In order to correctly view Perkins' transcriptions, you need the following:

A Unicode font that supports IPA symbols, such as Charis SIL or Doulos SIL. MS Arial Unicode will do, though it has problems correctly displaying some phonetic symbols and many diacritics.
A text editor that supports Unicode. Notepad++ is a good, open-source alternative for Windows. Modern versions of MS Word can also be used to open the files Perkins generates.
Whatever software you use to view the transcriptions, don't forget to set the font accordingly.
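
Because input must be Latin-1 while many editors save UTF-8 by default, a conversion step may be needed before running Perkins. The following is a minimal Python sketch, assuming the source file is UTF-8. The function and file names are illustrative and are not part of Perkins.

```python
from pathlib import Path


def to_latin1_bytes(text):
    """Encode text as ISO-8859-1.

    Raises UnicodeEncodeError if a character (for example a curly quote
    or an IPA symbol) has no Latin-1 equivalent.
    """
    return text.encode("latin-1")


def convert_file(src, dst, src_encoding="utf-8"):
    """Re-encode a text file (assumed UTF-8 by default) as Latin-1 input."""
    text = Path(src).read_text(encoding=src_encoding)
    Path(dst).write_bytes(to_latin1_bytes(text))


def read_transcription(path):
    """Decode Perkins output explicitly as UTF-8 instead of relying on
    the platform's default encoding."""
    return Path(path).read_text(encoding="utf-8")
```

For example, convert_file("draft.txt", "sourcetext.txt") would write a Latin-1 copy that can then be passed to Perkins with -i.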

General usage information

In the text that follows, perkins-1.0.5.pl is used in the examples. You should change this to the name of the version of the program you're actually using.

Options can be entered with either - or --. The = is optional. Thus, the following all do the same thing:
- perkins-1.0.5.pl --i=inputfile.txt
- perkins-1.0.5.pl -i=inputfile.txt
- perkins-1.0.5.pl -i inputfile.txt
The order of options and filenames is irrelevant.
Most binary options can be inverted by inserting 'no' between the hyphen and the option itself (e.g. -mc can be deactivated with -nomc).
There is no limit on the number of options that can be selected.
If a filename contains spaces or certain special characters, it must be entered in quotation marks.
If an output file name is not specified, a name will be automatically generated using the input file's base name and an extension that reflects the transcription mode chosen.
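
When Perkins is driven from a script, the rules above can be sketched as follows. This is only an illustration in Python: invert_option and build_command are hypothetical helpers, not part of Perkins, and the file name "my text.txt" is made up to show the quoting rule.

```python
import shlex


def invert_option(option):
    """Insert 'no' after the leading hyphen(s): -mc -> -nomc."""
    name = option.lstrip("-")
    hyphens = option[: len(option) - len(name)]
    return f"{hyphens}no{name}"


def build_command(program, input_file, *options):
    """Return the command as an argument list (option order is irrelevant)."""
    return [program, "-i", input_file, *options]


cmd = build_command("./perkins-1.0.5.pl", "my text.txt", invert_option("-mc"))

# shlex.join shows the command as it would be typed in a POSIX shell,
# with the file name quoted because it contains a space.
print(shlex.join(cmd))
```

Passing the list itself to subprocess.run(cmd, check=True) avoids manual quoting altogether, since no shell is involved in that case.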

Selecting a transcription mode

The transcription mode to be used can be selected from the command line in two different ways: -f=MODE and -MODE.
Valid transcription modes are: F, CV, CVG, CVN, M, P, S. See below for details.

Processing options

Main options
-i source.txt --input=source.txt		Specify the file to be processed. MANDATORY.
-o trans.txt --output=trans.txt		Specify the file in which to save Perkins' output. If not specified, a name will be automatically generated using the input file basename and an appropriate extension (e.g. .phnm).
-en		Set interface language to English.
-es		Set interface language to Spanish.

Transcription format options
-MODE -f MODE --format=MODE		Specify the transcription format. NOT case-sensitive. The possible formats are:
		F or PH (phonemic transcription)
		CV (consonant/vowel transcription)
		CVG (consonant/vowel/glide transcription)
		CVN (cons/vowel/nasal/liquid/rhotic/glide)
		M or MANNER (manner of articulation)
		P or PLACE (place of articulation)
		V (voicing)

Specific phoneme options
-multi, -mc		Use multi-character IPA symbols for some phonemes.
-tg		Treat /tr/ as a single phoneme (use ligature, or represent it as voiceless retroflex fricative /ʂ/, depending on the setting of -mc).
-yf		Represent the "ye" phoneme as the fricative /ʝ/.
-ya		Represent the "ye" phoneme as the affricate /d͡ʒ/.
-ar		Use the "retracted" diacritic in some affricates (e.g. t̠͡ʃ).
-och		Represent the "ch" phoneme as the one-character symbol /ʧ/. Overrides all other options affecting this phoneme.
-oye		Represent the "ye" phoneme as the one character symbol "ʝ". Overrides all other options regarding this phoneme.

Glide options
-gd, --glides-dia		Represent glides as vowel + the "non-syllabic" diacritic (/i̯/ and /u̯/).
-nogd, --noglides-dia		Represent glides as /j/ and /w/.
-wv		Represent wau as u + the "non-syllabic" diacritic (/u̯/).
-yv		Represent yod as i + the "non-syllabic" diacritic (/i̯/).

Stress options
-st		Mark stress with an acute accent over the vowel rather than the IPA apostrophe-like symbol.
-oa		Mark stress with orthographic apostrophe, rather than the similar IPA symbol.

Syllabification options
-sd, --syl-dots		Mark syllable divisions with dots (periods).
-ss, --syl-spaces		Represent syllable divisions with spaces.
-nosd		Do not separate syllables with dots or spaces.
-sbu		Syllabify by utterance/sentence, not by word ("los hombres" becomes /lo.som.bres/ instead of /los om.bres/)
-nosbu		Syllabify by word, not by utterance/sentence ("los hombres" becomes /los om.bres/ instead of /lo.som.bres/)
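
The difference between word-level and utterance-level syllabification can be illustrated with a toy Python sketch. It is not Perkins' actual algorithm: it only moves a word-final consonant into the onset of a following vowel-initial word, and it assumes plain symbols with no stress marks or diacritics. The function name is hypothetical.

```python
VOWELS = set("aeiou")


def resyllabify(words):
    """Join word-level syllabified words (e.g. ["los", "om.bres"]) into
    one utterance-level string, resyllabifying across word boundaries."""
    syllables = []
    for word in words:
        parts = word.split(".")
        if (syllables
                and parts[0]
                and len(syllables[-1]) > 1
                and syllables[-1][-1] not in VOWELS
                and parts[0][0] in VOWELS):
            # Move the word-final consonant to the next word's first syllable.
            parts[0] = syllables[-1][-1] + parts[0]
            syllables[-1] = syllables[-1][:-1]
        syllables.extend(parts)
    return ".".join(syllables)


print(resyllabify(["los", "om.bres"]))  # lo.som.bres
```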

Pause / group options
-ip, --ipa-pauses		Represent pauses / groups with the IPA symbols | and ||.
-cmp		Treat commas as pauses.
-clp		Treat colons as pauses.
-scp		Treat semicolons as pauses.
-snp		Treat sentence breaks as pauses.
-ppp		Treat paragraph breaks as pauses.
-elp		Treat ellipses (...) as pauses.
-brp		Treat square brackets [] as pauses.
-pnp		Treat parentheses as pauses.

Substitution options
-n2w		Convert numerals into word form ("4" is converted into "cuatro", which is then transcribed).
-sn=SYMBOL		Replace numerals with the SYMBOL specified here.
-cur=TEXT		Replace the $ symbol with the TEXT specified here.
-sl=TEXT		Replace the "/" symbol with the TEXT specified here.
-nsm, --no-stress-marks		Do not indicate stress in any way.
-pu		Process URLs as linguistic items. Otherwise, they're deleted. If treated linguistically, common items such as "Gmail", "Facebook", "http" and "www" are transcribed as commonly pronounced, while other things are transcribed as they would be pronounced if spelled out loud.
-pe		Process e-mail addresses as linguistic items. Otherwise, they're deleted.
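
To make the effect of -n2w concrete, here is a toy Python sketch of numeral-to-word substitution. It is an assumption-laden illustration rather than Perkins' implementation: it only covers 0 to 10 and leaves every other number untouched. The names SPANISH_NUMBERS and numerals_to_words are hypothetical.

```python
import re

SPANISH_NUMBERS = {
    "0": "cero", "1": "uno", "2": "dos", "3": "tres", "4": "cuatro",
    "5": "cinco", "6": "seis", "7": "siete", "8": "ocho", "9": "nueve",
    "10": "diez",
}


def numerals_to_words(text):
    """Replace each run of digits with its Spanish word form if known."""
    def replace(match):
        numeral = match.group(0)
        # Numbers outside the small table are left as they are.
        return SPANISH_NUMBERS.get(numeral, numeral)
    return re.sub(r"\d+", replace, text)


print(numerals_to_words("durante 5 meses"))  # durante cinco meses
```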

Presentation options
-owl		One word per line (split at words).
-osl		One syllable per line (split at syllables).
-kp		Keep paragraph breaks. Otherwise, output will be a wall of text.
-kc		Eliminate common words (for testing purposes).

Number processing options
-nyr		Treat two groups of 4 digits separated by "-" as a range of years ("1900-2000" > "1900 a 2000", not "1900 menos 2000").
-byr		Treat two groups of 1 to 4 digits separated by "-" as a range of years ("43-103" > "43 a 103", not "43 menos 103").
-ayr		Treat ALL groups of 1 to 4 digits separated by a "-" as ranges of years.
-bcy		Also process BCE years.
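
The year-range rewriting described for -nyr and -byr can be approximated with regular expressions. The Python sketch below is illustrative only and makes no claim about how Perkins implements these options. The allow_short parameter is a hypothetical stand-in for the -byr behaviour.

```python
import re

# Two groups of exactly four digits, not embedded in longer digit runs.
FOUR_DIGIT_RANGE = re.compile(r"(?<!\d)(\d{4})-(\d{4})(?!\d)")
# Two groups of one to four digits.
SHORT_RANGE = re.compile(r"(?<!\d)(\d{1,4})-(\d{1,4})(?!\d)")


def expand_year_ranges(text, allow_short=False):
    """Rewrite "1900-2000" as "1900 a 2000" so the hyphen is read as
    "a" (to) rather than "menos" (minus)."""
    pattern = SHORT_RANGE if allow_short else FOUR_DIGIT_RANGE
    return pattern.sub(r"\1 a \2", text)


print(expand_year_ranges("1900-2000"))                 # 1900 a 2000
print(expand_year_ranges("43-103", allow_short=True))  # 43 a 103
```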

Meta-configurations
-rt, --corpus		For processing corpora of running text.
-sl, --syl-list		For creating transcriptions that permit easy processing at the syllable level.
-vrt		For processing verticalized text (one word per line of input). Can't perform all analyses (e.g. expanding abbreviations).
-wl, --word-list		For processing word lists (syllabifies at word level, not sentence level).

Usage examples

Below are some examples of the different types of transcription that Perkins can produce. In order to correctly view the IPA symbols, your browser must support Unicode and you must have an appropriate font installed. The text transcribed in all cases is:

En Concepción, se trata de aguantar la lluvia durante 5 meses del año. ¿Cachái?

Command:		perkins-1.0.5.pl -i=source.txt
Transcription:		en.kon.sep.ˈsjon | se.ˈt̪ɾa.t̪a.d̪e.a.gwan.ˈt̪aɾ.la.ˈʝu.bja.d̪u.ˈɾan.t̪e.ˈsin.ko.ˈme.ses.d̪e.ˈla.ɲo ‖ ka.ˈt͡ʃaj
Description:		Default options. Phonemic transcription. Affricates have ligature. Yod and wau are represented as /j/ and /w/. IPA stress apostrophe. Dentals have diacritic. Multi-character symbols (e.g. /t͡ʃ/). Utterance-level processing. The "ye" phoneme is transcribed as /ʝ/.

Command:		perkins-1.0.5.pl -i=source.txt -at
Transcription:		en.kon.sep.sjón | se.t̪ɾá.t̪a.d̪e.a.gwan.t̪áɾ.la.ʝú.bja.d̪u.ɾán.t̪e.sín.ko.mé.ses.d̪e.lá.ɲo ‖ ka.t͡ʃáj
Description:		Stress is marked with an acute accent on the vowel instead of an IPA apostrophe before the syllable.

Command:		perkins-1.0.5.pl -i=source.txt -ya
Transcription:		en.kon.sep.ˈsjon | se.ˈt̪ɾa.t̪a.d̪e.a.gwan.ˈt̪aɾ.la.ˈd͡ʒu.bja.d̪u.ˈɾan.t̪e.ˈsin.ko.ˈme.ses.d̪e.ˈla.ɲo ‖ ka.ˈt͡ʃaj
Description:		The "ye" phoneme is transcribed as the affricate /d͡ʒ/.

Command:		perkins-1.0.5.pl -i=source.txt -ya -ar
Transcription:		en.kon.sep.ˈsjon | se.ˈt̪ɾa.t̪a.d̪e.a.gwan.ˈt̪aɾ.la.ˈd̠͡ʒu.bja.d̪u.ˈɾan.t̪e.ˈsin.ko.ˈme.ses.d̪e.ˈla.ɲo ‖ ka.ˈt̠͡ʃaj
Description:		The "retracted" diacritic is used to represent the affricates /d̠͡ʒ/ and /t̠͡ʃ/.

Command:		perkins-1.0.5.pl -i=source.txt -ya -tg
Transcription:		en.kon.sep.ˈsjon | se.ˈt̪͡ɾa.t̪a.d̪e.a.gwan.ˈt̪aɾ.la.ˈd͡ʒu.bja.d̪u.ˈɾan.t̪e.ˈsin.ko.ˈme.ses.d̪e.ˈla.ɲo ‖ ka.ˈt͡ʃaj
Description:		The "tr" cluster is treated as a phoneme (which is how it behaves in many Chilean speakers).

Command:		perkins-1.0.5.pl -i=source.txt -ya -tg -nomc
Transcription:		en.kon.sep.ˈsjon | se.ˈʂa.ta.de.a.gwan.ˈtaɾ.la.ˈʤu.bja.du.ˈɾan.te.ˈsin.ko.ˈme.ses.de.ˈla.ɲo ‖ ka.ˈʧaj
Description:		Phonemes are represented only with one-character symbols (/ʤ/; /ʧ/; /ʂ/ instead of /t̪͡ɾ/) except for glides, which may be configured separately with the -gd and -nogd switches.

Command:		perkins-1.0.5.pl -i=source.txt -gd
Transcription:		en.kon.sep.ˈsi̯on | se.ˈt̪ɾa.t̪a.d̪e.a.gu̯an.ˈt̪aɾ.la.ˈʝu.bi̯a.d̪u.ˈɾan.t̪e.ˈsin.ko.ˈme.ses.d̪e.ˈla.ɲo ‖ ka.ˈt͡ʃai̯
Description:		Transcribe glides as vowel + "non-syllabic diacritic" (/i̯/, /u̯/) instead of /j/ and /w/.

Command:		perkins-1.0.5.pl -i=source.txt -nosbu
Transcription:		en kon.sep.ˈsjon | se ˈt̪ɾa.t̪a d̪e a.gwan.ˈt̪aɾ la ˈʝu.bja d̪u.ˈɾan.t̪e ˈsin.ko ˈme.ses d̪el ˈa.ɲo ‖ ka.ˈt͡ʃaj
Description:		Syllabify at word-level rather than utterance/sentence-level.

Command:		perkins-1.0.5.pl -i=source.txt -cv
Transcription:		VC.CVC.CVC.ˈCVVC | CV.ˈCCV.CV.CV.V.CVVC.ˈCVC.CV.ˈCV.CVV.CV.ˈCVC.CV.ˈCVC.CV.ˈCV.CVC.CVC.ˈV.CV ‖ CV.ˈCVV
Description:		Analyze input in terms of consonant/vowel.

Command:		perkins-1.0.5.pl -i=source.txt -cvg
Transcription:		VC.CVC.CVC.ˈCGVC | CV.ˈCCV.CV.CV.V.CGVC.ˈCVC.CV.ˈCV.CGV.CV.ˈCVC.CV.ˈCVC.CV.ˈCV.CVC.CVC.ˈV.CV ‖ CV.ˈCVG
Description:		Analyze input in terms of consonant/vowel/glide.

Command:		perkins-1.0.5.pl -i=source.txt -cvn
Transcription:		VN.CVN.CVC.ˈCGVN | CV.ˈCRV.CV.CV.V.CGVN.ˈCVR.LV.ˈCV.CGV.CV.ˈRVN.CV.ˈCVN.CV.ˈNV.CVC.CVL.ˈV.NV ‖ CV.ˈCVG
Description:		Analyze input in terms of consonant/vowel/glide/nasal/liquid/rhotic.
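
The CV, CVG and CVN patterns above can be related to the phonemic transcription with a small Python sketch. This is a simplified illustration under stated assumptions, not Perkins' code: the symbol classes are inferred from the examples on this page, diacritics are ignored, and a tie bar is taken to join two symbols into one segment. The function names are hypothetical.

```python
import unicodedata

VOWELS = set("aeiou")
GLIDES = set("jw")
NASALS = set("mn\u0272")          # m, n, ɲ
LIQUIDS = set("l")
RHOTICS = set("\u027er")          # ɾ, r
TIE_BAR = "\u0361"                # joins the halves of an affricate (t͡ʃ)
PASS_THROUGH = set(". |\u2016\u02c8")  # dots, spaces, pauses, stress mark


def classify(segment, mode):
    """Return the class letter for one segment in the given mode."""
    if segment in VOWELS:
        return "V"
    if segment in GLIDES:
        return "V" if mode == "CV" else "G"
    if mode == "CVN":
        if segment in NASALS:
            return "N"
        if segment in LIQUIDS:
            return "L"
        if segment in RHOTICS:
            return "R"
    return "C"


def to_pattern(transcription, mode="CV"):
    """Map a phonemic transcription to a CV, CVG or CVN pattern."""
    out = []
    skip_next = False
    for ch in transcription:
        if ch == TIE_BAR:
            skip_next = True      # the next symbol belongs to the same segment
            continue
        if unicodedata.combining(ch):
            continue              # ignore diacritics such as the dental bridge
        if skip_next:
            skip_next = False
            continue
        out.append(ch if ch in PASS_THROUGH else classify(ch, mode))
    return "".join(out)


for mode in ("CV", "CVG", "CVN"):
    print(mode, to_pattern("kon.sep.\u02c8sjon", mode))
```

Because the sketch works symbol by symbol, it reproduces the patterns for individual words such as "Concepción" but does not model syllabification differences between modes.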

Command:		perkins-1.0.5.pl -i=source.txt -m
Transcription:		VN.PVN.FVP.ˈFXVN | FV.ˈPTV.PV.PV.V.PXVN.ˈPVT.LV.ˈFV.PXV.PV.ˈTVN.PV.ˈFVN.PV.ˈNV.FVF.PVL.ˈV.NV ‖ PV.ˈAVX
Description:		Analyze input in terms of MANNER of articulation. (P=plosive, N=nasal, R=trill, T=tap/flap, F=fricative, L=lateral, A=affricate, X=approximant, V=vowel).

Command:		perkins-1.0.5.pl -i=source.txt -p
Transcription:		-A.V-A.A-B.ˈAP-A | A-.ˈDA-.D-.D-.-.VW-A.ˈD-A.A-.ˈP-.BP-.D-.ˈA-A.D-.ˈA-A.V-.ˈB-.A-A.D-A.ˈ-.P- ‖ V-.ˈT-P
Description:		Analyze input in terms of PLACE of articulation. (B=bilabial, L=labiodental, D=dental, A=alveolar, T=post-alveolar, P=palatal, V=velar, W=labiovelar, -=vowel).

Known issues

In all modes except phonemic (i.e. CV, CVG, etc.), syllabification is always performed at word level.

Old versions

1.0.5

1.0.0

Perl script / source code
Windows x64 (.exe)
Windows x86 (.exe)
Linux x64 (binary)
Linux x86 (binary)
Mac OSX (experimental)

0.4.6.3

Windows x86 (exe)
Windows x64 (exe)
Linux x86 (binary)
Linux x64 (binary)
Mac OS (experimental)
