Frequency List Wizard
Frequency List Wizard is a command-line program that does various useful things with... frequency lists. It's free software, written in Perl and licensed under the GPL v3.
Quick usage info
To process a frequency list using FLW's default options, unzip the downloaded file and do the following:
- Copy the program to the folder where your frequency list is (or copy it to a folder that's on your path, such as C:\Windows or C:\Windows\System32, to avoid this hassle).
- Open a command prompt by hitting WINDOWS+R and typing cmd.exe (you can also type this in the Start Menu search box on Vista or later).
- In the command prompt, navigate to the directory with the frequency list using the cd command.
- Type the following: frequency-list-wizard.exe -i your-list.txt
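If you are using the Perl script (.pl) rather than the Windows executable, do the following instead:
- Make the .pl file executable.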
- Copy it to the directory where your frequency list is (or to a directory that's on your path).
- Open a terminal window and navigate to the directory with your frequency list.
- Execute the following command:
  - GNU/Linux: ./frequency-list-wizard.pl -i your-list.txt
  - Windows: perl frequency-list-wizard.pl -i your-list.txt
Run the program with the -h switch to see help and usage information.
Frequency List Wizard's default processing mode takes a 2-column frequency list in ISO-8859-1 (Latin-1) encoding, merges all entries that vary only by their capitalization (e.g. house, House and HOUSE), and sums the frequencies of each of these items to give you the total frequency per set of variant capitalizations (which is almost certainly what is desired when working with lexical items, lemmas, etc.). It performs a reverse natural numeric sort on the results (1000, 200, 30, 1 instead of 30, 200, 1000, 1) and outputs them to a text file.
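The default mode described above can be illustrated with a short Python sketch. This is not FLW's own Perl code: the function name, the choice of a lowercased key to represent each set of variants, and the alphabetical tie-break are assumptions made for the example.

```python
from collections import defaultdict

def merge_case_variants(rows):
    """Sum the frequencies of entries that differ only in capitalization.

    rows: iterable of (frequency, item) pairs, as read from a 2-column list.
    Returns (frequency, item) pairs sorted by descending numeric frequency.
    """
    totals = defaultdict(int)
    for freq, item in rows:
        # Assumption: the lowercased form stands for the whole variant set.
        totals[item.lower()] += int(freq)
    # Sort on the integer value, highest first, so 1000 precedes 200, 30 and 1.
    # Ties are broken alphabetically (also an assumption).
    return sorted(((f, w) for w, f in totals.items()),
                  key=lambda pair: (-pair[0], pair[1]))

rows = [("30", "house"), ("200", "House"), ("1", "HOUSE"), ("1000", "the")]
merged = merge_case_variants(rows)
for freq, item in merged:
    print(f"{freq}\t{item}")
```

Converting the frequency to an integer before sorting is what keeps the order numeric rather than character-by-character.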
Three-column lists (e.g. frequency + lemma + POS) can be processed using the -3c switch. This option allows identical lemmas with different POSes to be processed (and counted) separately (e.g. jump (NOUN) and jump (VERB)).
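As a rough Python illustration of the three-column idea (not FLW's actual implementation), the item and its POS can be combined into a single key, so that identical lemmas are only merged when their POS also matches. The function name and the sample rows are hypothetical.

```python
from collections import Counter

def merge_three_column(rows):
    """Sum frequencies per (item, POS) pair, ignoring capitalization of the item.

    rows: iterable of (frequency, item, pos) triples.
    """
    totals = Counter()
    for freq, item, pos in rows:
        # The POS is part of the key, so jump/NOUN and jump/VERB stay apart.
        totals[(item.lower(), pos)] += int(freq)
    return totals

rows = [(12, "jump", "NOUN"), (40, "Jump", "VERB"), (3, "JUMP", "NOUN")]
totals = merge_three_column(rows)
for (item, pos), freq in totals.most_common():
    print(f"{freq}\t{item}\t{pos}")
```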
FLW can also calculate the total number of types and tokens in the frequency list, as well as its type-token ratio. This is done by default, and the results are printed at the end of the processed frequency list.
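For readers unfamiliar with these statistics, here is a minimal Python sketch using the conventional definitions: types are the distinct entries, tokens are the sum of their frequencies, and the type-token ratio is types divided by tokens. How FLW itself formats or scales the ratio is not stated here, so treat the plain fraction as an assumption.

```python
def type_token_stats(rows):
    """Return (types, tokens, ttr) for a list of (frequency, item) pairs."""
    types = len(rows)                        # one row per distinct item
    tokens = sum(freq for freq, _ in rows)   # total number of occurrences
    ttr = types / tokens if tokens else 0.0  # avoid dividing by zero
    return types, tokens, ttr

rows = [(1000, "the"), (231, "house"), (19, "wizard")]
types, tokens, ttr = type_token_stats(rows)
print(f"TYPES={types} TOKENS={tokens} TTR={ttr:.4f}")
```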
Optionally, FLW can eliminate entries containing numerals (using -nn) and/or punctuation (using -np) from frequency lists. It can also merge certain Spanish allomorphs (y + e, o + u) into a single item (-ma) (this is FLW's only language-specific feature). All three options are activated by default, and can be deactivated with the -nonn, -nonp and -noma switches. The difference between the number of items in the source frequency list and the number actually processed after eliminating numbers and/or punctuation marks is reflected in the type and token counts shown with the --print-stats option (INPUT_TYPES versus PROCESSED_TYPES, etc.).
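The filtering and merging steps can be sketched in Python as follows. This is only an illustration under stated assumptions: "punctuation" is taken to mean ASCII punctuation (Python's `string.punctuation`), the allomorphs "e" and "u" are folded into "y" and "o" respectively, and all names are hypothetical. FLW's own rules may differ.

```python
import string

# Assumed direction of the merge: the variant is folded into the base form.
ALLOMORPHS = {"e": "y", "u": "o"}

def has_digit(item):
    return any(ch.isdigit() for ch in item)

def has_punct(item):
    return any(ch in string.punctuation for ch in item)

def clean(rows, drop_nums=True, drop_punct=True, merge_allo=True):
    """Filter and merge (frequency, item) pairs; returns {item: frequency}."""
    totals = {}
    for freq, item in rows:
        if drop_nums and has_digit(item):
            continue  # e.g. Bill7
        if drop_punct and has_punct(item):
            continue  # e.g. email@example.com
        key = ALLOMORPHS.get(item, item) if merge_allo else item
        totals[key] = totals.get(key, 0) + freq
    return totals

rows = [(500, "y"), (20, "e"), (7, "Bill7"),
        (2, "email@example.com"), (90, "o"), (4, "u")]
cleaned = clean(rows)
print(len(rows), "input types;", len(cleaned), "processed types")
```

Comparing `len(rows)` with `len(cleaned)` mirrors the difference between the input and processed type counts mentioned above.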
When using the 3-column option, POS information in the third column can be pruned if it is in a Connexor-style format (e.g. @NH N MSC SG). The -kh (--killhead) switch will eliminate the head of the field (the first block of characters plus the first space; in the example tag, this eliminates @NH and the space that follows it), while -kt (--killtail) will eliminate the tail (everything between the second space and the end of the POS field; in the example tag, MSC SG). To process differently-formatted POS information, use only the --posfull option.
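The head and tail rules can be expressed in a few lines of Python. This is a sketch of the description above rather than FLW's code. The function names are hypothetical, and leaving a tag with no spaces unchanged is an assumption.

```python
def kill_head(tag):
    """Drop the first block of characters plus the first space."""
    head, sep, rest = tag.partition(" ")
    return rest if sep else tag  # no space: leave the tag as it is

def kill_tail(tag):
    """Drop everything from the second space to the end of the field."""
    return " ".join(tag.split(" ", 2)[:2])

def kill_both(tag):
    """Apply both rules, counting spaces in the original field."""
    # The tail is removed first so that "second space" still refers to
    # the original tag; the head is then stripped from what remains.
    return kill_head(kill_tail(tag))

tag = "@NH N MSC SG"
print(kill_head(tag))   # N MSC SG
print(kill_tail(tag))   # @NH N
print(kill_both(tag))   # N
```

The order in `kill_both` matters: stripping the head first would shift the position of the second space and leave "N MSC" instead of "N".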
The "meta-frequency" (AKA "legomena") processing mode, activated with the -mf or -hx switches, calculates the frequency of each frequency in the list. Its output is a frequency list of frequencies -- how many items occur 1 time, 2 times, and so on.
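A meta-frequency list is simply a count of how often each frequency value occurs. The Python sketch below shows the idea. The function name, the sample data and the ascending output order are assumptions, not a description of FLW's output format.

```python
from collections import Counter

def meta_frequencies(rows):
    """Count how many items occur once, twice, and so on.

    rows: iterable of (frequency, item) pairs.
    Returns (frequency, number_of_items) pairs in ascending frequency order.
    """
    counts = Counter(freq for freq, _ in rows)
    return sorted(counts.items())

rows = [(1, "wizard"), (1, "lexeme"), (2, "corpus"),
        (1, "hapax"), (5, "the"), (2, "list")]
meta = meta_frequencies(rows)
for freq, n_items in meta:
    print(f"{n_items} item(s) occur {freq} time(s)")
```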
|-i, --input||Name of the input frequency list file. MANDATORY! Must be ISO-8859-1 (Latin-1).|
|-o, --output||Name of the output file. If not provided, a name will be automatically generated using the input file base name.|
|-ps, --print-stats||Calculate and print type, token and TTR statistics to output file (DEFAULT: ON).|
|-mf, --meta-freq||Calculate the frequencies of each frequency in the list. In other words, generates a meta-frequency list, or list of n-legomena.|
|-leg, --legomena||Same as -mf or --meta-freq.|
|-nn, --nonums||Eliminate frequency list entries that contain numbers (e.g. Bill7).|
|-np, --nopunct||Eliminate frequency list entries that contain punctuation (e.g. email@example.com).|
|-ma, --mergeallo||Merge Spanish allomorphs (e.g. "y" and "e", "o" and "u").|
|-3c, --3-col||Process 3-column frequency lists. This allows intelligent handling of identical items that have different POSes/lemmas assigned to them (e.g. canto (NOUN SG MSC) and canto (V 1SG PRES IND)).|
|-kh, --killhead||In frequency lists that provide syntactic info in the format @NH (e.g. Connexor output), eliminate this information, leaving only POS info. Assumes that this info is in the THIRD column.|
|-kt, --killtail||In frequency lists with POS info, eliminate all of this info EXCEPT the general grammatical category (e.g. DET MSC SG becomes DET). Forces --killhead.|
|-so, --spliton||Define the character that input frequency list columns will be split on. The default value is \t (tab).|
|-d, --delimiter||Allows an alternative column delimiter character to be specified in the output file. This is the character that is inserted between columns in the output file. Entering t will produce \t. The default value is \t (tab).|
|-st, --spaces-split||Treat 2 or more spaces in the input frequency list as the split character. Typically for messy lists. Care must be taken with this option, as any extraneous space can (and will) have undesirable consequences.|
|-db, --debug||Show debug info.|
|-h, --help||Show FLW's help information.|
|-w, --words||Process frequency list as words (2 columns: FREQ, WORD).|
|-l, --lemmas||Process frequency list as lemmas (2 columns: FREQ, LEMMA).|
|-pm, --posmin||Process frequency list as minimal POS (2 columns: FREQ, POS. Kills POS head and tail).|
|-p, --pos||Process frequency list as partial POS (2 columns: FREQ, POS. Kills POS head, leaves tail intact).|
|-pf, --posfull||Process frequency list as full POS (2 columns: FREQ, POS. Leaves entire POS intact).|
|-sr, --synrel||Process frequency list as syntactic relationships (2 columns, deactivates potentially destructive options).|
|-wp, --wordpos||Process frequency list as words + POS (3 columns: FREQ, WORD, POS. Kills POS head and tail, and eliminates numbers and punctuation).|
|-lp, --lemmapos||Process frequency list as lemmas + POS (3 columns: FREQ, LEMMA, POS. Kills POS head and tail, and eliminates numbers and punctuation).|
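To make the difference between the -so and -st input options in the table above more concrete, here is a small Python sketch of the two splitting strategies. It is an illustration only: the function and parameter names are hypothetical and do not correspond to FLW's internals.

```python
import re

def split_columns(line, split_on="\t", spaces_split=False):
    """Split one input line into columns.

    split_on:      single delimiter character (tab by default).
    spaces_split:  if True, treat any run of 2 or more spaces as the delimiter.
    """
    line = line.rstrip("\n")
    if spaces_split:
        return re.split(r" {2,}", line.strip())
    return line.split(split_on)

print(split_columns("231\thouse\n"))                       # tab-delimited
print(split_columns("231   house", spaces_split=True))     # space-aligned
print(split_columns("231  New  York", spaces_split=True))  # stray double space
```

The last call shows why care is needed with space splitting: an extra double space inside an item produces three columns instead of two.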