Frequency List Wizard

Version 1.2.0

Frequency List Wizard is a command-line program that does various useful things with... frequency lists. It's free software, written in Perl and licensed under the GPL v3.

Download from GitHub

Older versions

 

Quick usage info

To process a frequency list using FLW's default options, unzip the downloaded file and do the following:

Windows executable

  • Copy the program to the folder where your frequency list is (or copy it to a folder that's on your path, such as C:\Windows or C:\Windows\System32, to avoid this hassle).
  • Open a command prompt by hitting WINDOWS+R and typing cmd.exe (you can also type this in the Start Menu search box on Vista or later).
  • In the command prompt, navigate to the directory with the frequency list using the cd command.
  • Type the following: frequency-list-wizard.exe -i your-list.txt

Perl script

  • Make the .pl file executable.
  • Copy it to the directory where your frequency list is (or to a directory that's on your path).
  • Open a terminal window and navigate to the directory with your frequency list.
  • Execute the following command:
    • GNU/Linux: ./frequency-list-wizard.pl -i your-list.txt
    • Windows: perl frequency-list-wizard.pl -i your-list.txt

Run the program with the -h switch to see help and usage information.

Description

Frequency List Wizard's default processing mode takes a 2-column frequency list in ISO-8859-1 (Latin-1) encoding and merges all entries that differ only in capitalization (e.g. house, House and HOUSE), summing their frequencies to give one total per set of capitalization variants. This is almost certainly what you want when working with lexical items, lemmas, etc. It then sorts the results in descending numeric order (1000, 200, 30, 1 rather than the 30, 200, 1000, 1 that a reverse string sort would produce) and writes them to a text file.
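The default processing described above can be sketched in a few lines of Python. This is only an illustration, not FLW's actual Perl code: the function name merge_case_variants is hypothetical, and both the choice of the lowercased form as the merged key and the alphabetical tie-break are assumptions.

```python
from collections import defaultdict

def merge_case_variants(rows):
    """Sum the frequencies of entries that differ only in capitalization.

    rows: iterable of (frequency, item) pairs, frequency first.
    Assumption: the merged entry is keyed by its lowercased form.
    """
    totals = defaultdict(int)
    for freq, item in rows:
        totals[item.lower()] += int(freq)
    # Sort on the frequency as a number, highest first. Ties are broken
    # alphabetically, which is an assumption of this sketch.
    return sorted(((f, w) for w, f in totals.items()),
                  key=lambda pair: (-pair[0], pair[1]))

rows = [("30", "house"), ("200", "House"), ("1", "HOUSE"),
        ("1000", "the"), ("5", "The")]
for freq, item in merge_case_variants(rows):
    print(f"{freq}\t{item}")
```

With the sample rows, the three spellings of "house" collapse into one entry with frequency 231, which is listed after "the" (1005).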

Three-column lists (e.g. frequency + lemma + POS) can be processed using the -3c switch. This option allows identical lemmas with different POSes to be processed (and counted) separately (e.g. jump (NOUN) and jump (VERB)).
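To see why the third column matters, here is a rough Python sketch of the idea (not FLW's own code). It assumes that entries are grouped by the pair (lowercased lemma, POS); the function name merge_three_column is hypothetical.

```python
from collections import Counter

def merge_three_column(rows):
    """Sum frequencies per (lemma, POS) pair instead of per lemma.

    rows: iterable of (frequency, lemma, pos) triples.
    Assumption: capitalization is ignored in the lemma but not in the POS.
    """
    totals = Counter()
    for freq, lemma, pos in rows:
        totals[(lemma.lower(), pos)] += int(freq)
    # Highest frequency first; ties ordered by lemma, then POS (an assumption).
    return sorted(((f, lemma, pos) for (lemma, pos), f in totals.items()),
                  key=lambda r: (-r[0], r[1], r[2]))

rows = [("12", "jump", "NOUN"), ("40", "jump", "VERB"), ("3", "Jump", "VERB")]
for freq, lemma, pos in merge_three_column(rows):
    print(f"{freq}\t{lemma}\t{pos}")
```

In this example the two VERB rows are merged (43), while jump (NOUN) keeps its own count (12).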

FLW also calculates the total number of types and tokens in the frequency list, as well as its type-token ratio. This is done by default, and the statistics are printed at the end of the processed frequency list.
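For reference, these statistics can be derived from a frequency list as in the following Python sketch. It uses the usual definitions (types = number of distinct entries, tokens = sum of all frequencies, TTR = types / tokens); whether FLW reports the ratio as a fraction or as a percentage is not stated here, so the fraction is an assumption. The function name type_token_stats is hypothetical.

```python
def type_token_stats(rows):
    """Return (types, tokens, ttr) for a list of (frequency, item) pairs.

    Assumes the list has already been merged, so each item appears once.
    """
    types = len(rows)                       # distinct entries
    tokens = sum(freq for freq, _ in rows)  # total occurrences
    ttr = types / tokens if tokens else 0.0
    return types, tokens, ttr

rows = [(1005, "the"), (231, "house"), (14, "wizard")]
types, tokens, ttr = type_token_stats(rows)
print(f"TYPES: {types}  TOKENS: {tokens}  TTR: {ttr:.4f}")
```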

FLW can eliminate entries containing numerals (-nn) and/or punctuation (-np) from frequency lists. It can also merge certain Spanish allomorphs (y + e, o + u) into a single item (-ma); this is FLW's only language-specific feature. All three options are activated by default and can be deactivated with the -nonn, -nonp and -noma switches. The difference between the number of items in the source frequency list and the number actually processed after eliminating numbers and/or punctuation marks is reflected in the type and token counts shown with the --print-stats option (INPUT_TYPES versus PROCESSED_TYPES, etc.).
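The effect of these three filters can be approximated with the Python sketch below. Several details are assumptions rather than documented FLW behaviour: "punctuation" is taken to mean the ASCII punctuation characters, the allomorphs are folded into "y" and "o" (rather than "e" and "u"), and the names clean_list and ALLOMORPHS are hypothetical.

```python
import re
import string
from collections import Counter

HAS_DIGIT = re.compile(r"\d")
# Assumption: "punctuation" means ASCII punctuation characters.
HAS_PUNCT = re.compile("[" + re.escape(string.punctuation) + "]")
# Assumption: the less common allomorph is folded into the common one.
ALLOMORPHS = {"e": "y", "u": "o"}

def clean_list(rows, no_nums=True, no_punct=True, merge_allo=True):
    """Filter and merge (frequency, item) pairs; returns a Counter."""
    totals = Counter()
    for freq, item in rows:
        if no_nums and HAS_DIGIT.search(item):
            continue  # e.g. Bill7
        if no_punct and HAS_PUNCT.search(item):
            continue  # e.g. a@b.com
        item = item.lower()
        if merge_allo:
            item = ALLOMORPHS.get(item, item)
        totals[item] += freq
    return totals

rows = [(50, "y"), (4, "e"), (7, "Bill7"), (2, "a@b.com"), (30, "o"), (1, "u")]
cleaned = clean_list(rows)
print("input entries:", len(rows), "processed entries:", len(cleaned))
```

Here six input entries are reduced to two processed ones, which is the kind of gap the INPUT_TYPES and PROCESSED_TYPES figures make visible.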

When using the 3-column option, POS information in the third column can be pruned if it is in a Connexor-style format (e.g. @NH N MSC SG). The -kh (--killhead) switch eliminates the head of the field (the first block of characters plus the first space; in the example tag, this eliminates @NH ), while -kt (--killtail) eliminates the tail (everything from the second space to the end of the POS field; in the example tag, MSC SG). To process differently formatted POS information, use only the --posfull option.
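The head/tail pruning can be pictured with this small Python sketch. It applies the description above literally to a tag that starts with a syntactic head such as @NH, and it assumes (as the options list states) that killing the tail also kills the head. The function names kill_head and kill_tail are hypothetical, and this is not FLW's own code.

```python
def kill_head(tag):
    """Drop the first block of characters plus the first space."""
    head, sep, rest = tag.partition(" ")
    # A tag without any space has no separate head, so it is left alone.
    return rest if sep else tag

def kill_tail(tag):
    """Drop the head, then everything after the general category."""
    return kill_head(tag).split(" ", 1)[0]

tag = "@NH N MSC SG"
print(kill_head(tag))  # N MSC SG
print(kill_tail(tag))  # N
```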

The "meta-frequency" (AKA "legomena") processing mode, activated with the -mf or -leg switches, calculates the frequency of each frequency in the list. Its output is a frequency list of frequencies: how many items occur 1 time, 2 times, and so on.
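A meta-frequency list is easy to picture with a short Python sketch. The function name meta_frequencies is hypothetical, and the ascending sort order of the result is an assumption, not necessarily the order FLW uses.

```python
from collections import Counter

def meta_frequencies(rows):
    """Count how many items share each frequency.

    rows: iterable of (frequency, item) pairs.
    Returns (frequency, number_of_items) pairs, lowest frequency first.
    """
    counts = Counter(freq for freq, _ in rows)
    return sorted(counts.items())

rows = [(1, "wizard"), (1, "perl"), (2, "list"),
        (1, "corpus"), (5, "the"), (2, "of")]
for freq, n_items in meta_frequencies(rows):
    print(f"{n_items} item(s) occur {freq} time(s)")
```

In the sample, three items occur once (the hapax legomena), two occur twice, and one occurs five times.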

Options

   
   
-i, --input   Name of the input frequency list file. MANDATORY! Must be ISO-8859-1 (Latin-1).
-o, --output   Name of the output file. If not provided, a name will be automatically generated using the input file base name.
-ps, --print-stats   Calculate and print type, token and TTR statistics to output file (DEFAULT: ON).
-mf, --meta-freq   Calculate the frequencies of each frequency in the list. In other words, generates a meta-frequency list, or list of n-legomena.
-leg, --legomena   Same as -mf or --meta-freq.
-nn, --nonums   Eliminate frequency list entries that contain numbers (e.g. Bill7).
-np, --nopunct   Eliminate frequency list entries that contain punctuation (e.g. a@b.com).
-ma, --mergeallo   Merge Spanish allomorphs (e.g. "y" and "e", "o" and "u").
-3c, --3-col   Process 3-column frequency lists. This allows intelligent handling of identical items that have different POSes/lemmas assigned to them (e.g. canto (NOUN SG MSC) and canto (V 1SG PRES IND)).
-kh, --killhead   In frequency lists that provide syntactic info in the format @NH (e.g. Connexor), eliminate this information, leaving only POS info. Assumes that this info is in the THIRD column.
-kt, --killtail   In frequency lists with POS info, eliminate all of this info EXCEPT the general grammatical category (e.g. DET MSC SG becomes DET). Forces --killhead.
-so, --spliton   Define the character that input frequency list columns will be split on. The default value is \t (tab).
-d, --delimiter   Allows an alternative column delimiter character to be specified in the output file. This is the character that is inserted between columns in the output file. Entering t will produce \t. The default value is \t (tab).
-st, --spaces-split   Treat 2 or more spaces in the input frequency list as the split character. Typically for messy lists. Care must be taken with this option, as any extraneous space can (and will) have undesirable consequences.
-db, --debug   Show debug info.
-h, --help   Show FLW's help information.
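
The difference between splitting on a single character (-so) and on runs of spaces (-st), and why the latter needs care, can be illustrated with the following Python sketch. It is an approximation under stated assumptions, not FLW's code: the function names split_columns and output_delimiter are hypothetical, and the pattern of two or more spaces is taken directly from the option description.

```python
import re

def split_columns(line, split_on="\t", spaces_split=False):
    """Split one input line into columns."""
    line = line.rstrip("\r\n")
    if spaces_split:
        # Any run of 2 or more spaces is treated as a column boundary.
        return re.split(r" {2,}", line.strip())
    return line.split(split_on)

def output_delimiter(value="\t"):
    """Map the shorthand t to a tab, as described for -d."""
    return "\t" if value == "t" else value

print(split_columns("1000\tthe\n"))                      # ['1000', 'the']
print(split_columns("1000    the", spaces_split=True))   # ['1000', 'the']
# An extraneous double space inside an item creates a spurious column:
print(split_columns("12  New  York", spaces_split=True)) # ['12', 'New', 'York']
print(output_delimiter("t").join(["1000", "the"]))
```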
   

Meta-configurations

   
   
-w, --words   Process frequency list as words (2 columns: FREQ, WORD).
-l, --lemmas   Process frequency list as lemmas (2 columns: FREQ, LEMMA).
-pm, --posmin   Process frequency list as minimal POS (2 columns: FREQ, POS. Kills POS head and tail).
-p, --pos   Process frequency list as partial POS (2 columns: FREQ, POS. Kills POS head, leaves tail intact).
-pf, --posfull   Process frequency list as full POS (2 columns: FREQ, POS. Leaves entire POS intact).
-sr, --synrel   Process frequency list as syntactic relationships (2 columns, deactivates potentially destructive options).
-wp, --wordpos   Process frequency list as words + POS (3 columns: FREQ, WORD, POS. Kills POS head and tail, and eliminates numbers and punctuation).
-lp, --lemmapos   Process frequency list as lemmas + POS (3 columns: FREQ, LEMMA, POS. Kills POS head and tail, and eliminates numbers and punctuation).

 

 

 

Older versions

Windows executable: Local - Mirror
Perl script / source code: Local - Mirror