NWLC


1 About

New Word Level Checker (NWLC) is a web app for vocabulary profiling. Powered by the ingenious concept of its predecessor, Word Level Checker, developed by Professor Yasumasa Someya, New Word Level Checker analyzes English words submitted by the user and produces vocabulary levels based on the selected word list. As of March 2021, New Word Level Checker features five research-based, trustworthy word lists: New JACET8000, SVL12000, the New General Service List, CEFR-J, and SEWK-J.


2 Word Lists

2.1 New JACET8000

The New JACET List of Basic Words (New JACET8000) is the updated version of JACET8000 (JACET, 2003), compiled by the Japan Association of College English Teachers (JACET). Based on the British National Corpus (BNC) and the Corpus of Contemporary American English (COCA), New JACET8000 serves as an educational word list for Japanese learners of English, especially university students. The list has 8,000 words, and for each 1,000 words, the level (i.e., from 1 to 8) is provided on New Word Level Checker. New JACET8000 can be downloaded from Dr. Shin Ishikawa’s website. JACET reserves the copyright to the list (JACET, 2016).

2.2 SVL12000

SVL12000 (Standard Vocabulary List 12000) was developed and published in 2001 by ALC Press, Inc. It is based on the British National Corpus (BNC). Like New JACET8000, the word list is intended to be used for educational purposes, and many ALC materials use this list. As the name shows, the list has 12,000 words and is divided into 12 levels, one for each 1,000-word band. The list is not publicly available. The copyright of SVL12000 belongs to ALC Press, Inc.

2.3 New General Service List

The New General Service List (NGSL) was released in 2013 by Dr. Charles Browne, Dr. Brent Culligan, and Joseph Phillips. NGSL is a modern update of the General Service List (West, 1953). It covers about 90 percent of general texts of English with a list of approximately 2,800 high frequency words. Dr. Browne and his colleagues continued to produce a series of word lists for learners who have mastered the first 2,800 words with NGSL. These include the New Academic Word List (NAWL), the TOEIC Service List (TSL), and the Business Service List (BSL), which have the following number of headwords in each list.

  • NGSL (General): 2,801 words (Ver. 1.01)
  • NAWL (Academic): 963 words (Ver. 1.0)
  • TSL (TOEIC): 1,259 words (Ver. 1.1)
  • BSL (Business): 1,754 words (Ver. 1.01)

Among NAWL, TSL, and BSL, some words appear in more than one list (unlike NGSL words, which only appear in NGSL). For example, the word “impact” is in all three lists (NAWL, TSL, and BSL). Also, while the word “quit” is in both TSL and BSL, the word “syndicate” only appears in BSL. For this reason, New Word Level Checker considers the overlapping of words in the three lists and produces the output using the criteria shown below. In total, New Word Level Checker has 5,621 words for word profiling.

  • Level 1: NGSL = 2,801 words
  • Level 2: In all three lists (NAWL/TSL/BSL) = 183 words
  • Level 3: In two lists (NAWL/TSL, NAWL/BSL, or TSL/BSL) = 790 words
  • Level 4: In only one list (NAWL, TSL, or BSL) = 1,847 words
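
The level criteria above amount to counting how many of the three supplementary lists contain a word. The following is a minimal sketch of that idea, not NWLC's actual code: the function name `ngsl_level` is hypothetical, and the tiny sample sets are toy data built from the examples mentioned in the text.

```python
def ngsl_level(word, ngsl, nawl, tsl, bsl):
    """Return 1-4 following the criteria above, or None if the word is in no list."""
    if word in ngsl:
        return 1
    # Count how many of the three supplementary lists contain the word.
    hits = sum(word in wordlist for wordlist in (nawl, tsl, bsl))
    # 3 lists -> Level 2, 2 lists -> Level 3, 1 list -> Level 4.
    return {3: 2, 2: 3, 1: 4}.get(hits)


# Toy sets for illustration only; the real lists are far larger.
ngsl = {"the", "study"}
nawl = {"impact"}
tsl = {"impact", "quit"}
bsl = {"impact", "quit", "syndicate"}

levels = {w: ngsl_level(w, ngsl, nawl, tsl, bsl)
          for w in ("study", "impact", "quit", "syndicate", "zyzzyva")}
```

With these toy sets, "impact" lands in Level 2, "quit" in Level 3, and "syndicate" in Level 4, matching the examples in the text.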

NGSL and other word lists are available on the NGSL website. The lists are licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

2.4 CEFR-J Wordlist

CEFR-J Wordlist was created by Dr. Yukio Tono (Tokyo University of Foreign Studies). It was based on the English textbook corpora consisting of textbooks for primary and secondary schools in China, Korea, and Taiwan. The word levels were classified according to the Common European Framework of Reference for Languages (CEFR) levels, and they were compared with the English Vocabulary Profile. All headwords, as a result, have part-of-speech information and their corresponding CEFR(-J) levels as shown below (Note: Type is counted without the part-of-speech distinction).

CEFR-J Level   Number of Headwords   Type    Level in Japan
A1             1,165                 1,068   Elementary
A2             1,416                 1,359   Junior High
B1             2,451                 2,358   Senior High
B2             2,783                 2,696   University
Total          7,815                 7,481

New Word Level Checker uses the CEFR-J Wordlist Version 1.5 (retrieved from http://www.cefr-j.org/download.html#cefrj_wordlist on April 1, 2019). The copyright of CEFR-J Wordlist belongs to Tono Laboratory at Tokyo University of Foreign Studies.

2.5 SEWK-J

SEWK-J stands for “Scale of English Word Knowledge - Japanese”. For the purposes of estimating the difficulty that vocabulary presents to Japanese learners of English, the SEWK-J list estimates the likelihood that a word is known to Japanese university students. The probability of knowledge of a word is based on a multiple regression performed by Dr. Geoff Pinchbeck (Carleton University) using vocabulary test data of the 149-item New Vocabulary Levels Test (McLean & Kramer, 2016) administered to Japanese University EFL students as the criterion (dependent) variable. The regression formula includes the following predictive variables to provide estimates for about 75,000 flemma headwords:

  1. English L2 vocabulary yes/no test data – accuracy
  2. English L2 vocabulary yes/no test data – reaction time
  3. English-Japanese loan words – identity
  4. English-Japanese loan words – frequency in Japanese
  5. Age of Acquisition
  6. Age of Exposure

A full description of the SEWK-J list will be provided in an upcoming publication.

In New Word Level Checker, Levels 1 to 10 represent the following word bands: L1 (1-500), L2 (501-1000), L3 (1001-1500), L4 (1501-2000), L5 (2001-2500), L6 (2501-3000), L7 (3001-4000), L8 (4001-5000), L9 (5001-7500), and L10 (7501-10000). “Over10” thus means the word is over 10,000 in the rank of SEWK-J.
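
The band boundaries listed above can be expressed as a lookup over the upper rank of each level. This is only an illustrative sketch: the function name `sewkj_level` and the use of `bisect` are assumptions, not a description of how NWLC is implemented.

```python
from bisect import bisect_left

# Upper rank bound of Levels 1-10, as listed in the paragraph above.
LEVEL_UPPER_BOUNDS = [500, 1000, 1500, 2000, 2500, 3000, 4000, 5000, 7500, 10000]


def sewkj_level(rank):
    """Map a 1-based SEWK-J rank to "L1".."L10", or "Over10" beyond rank 10,000."""
    if rank < 1:
        raise ValueError("rank must be a positive integer")
    # Index of the first band whose upper bound is >= rank.
    index = bisect_left(LEVEL_UPPER_BOUNDS, rank)
    return f"L{index + 1}" if index < len(LEVEL_UPPER_BOUNDS) else "Over10"
```

Note that the bands are not equally wide: Levels 1 to 6 cover 500 words each, Levels 7 and 8 cover 1,000 each, and Levels 9 and 10 cover 2,500 each.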

If “SEWK-J: Fine-grained” is selected as the word list, up to 30 bands (each band with 250 words, 7,500 words in total) in the SEWK-J list are used for profiling, and words over the 30 bands will be categorized as “Over30”. This approach gives you a detailed understanding of how many words in the text the learners may know. When “SEWK-J: Fine-grained” is chosen, New Word Level Checker outputs (a) mean logit (log odds unit) difficulty of the input text as a whole, (b) mean logit difficulty of the content words, and (c) mean logit difficulty of the function words.
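
As a rough sketch of the fine-grained option, the code below maps a rank to one of the 30 bands of 250 words and averages per-word logit values for the three groups described above. The function names, the (logit, is_content_word) input format, and the idea that each word arrives with a precomputed logit are all assumptions made for illustration; the actual SEWK-J values and NWLC's computation are not reproduced here.

```python
from statistics import mean

BAND_SIZE = 250
MAX_BAND = 30  # 30 bands x 250 words = 7,500 words


def fine_grained_band(rank):
    """Map a 1-based SEWK-J rank to a band number 1..30, or "Over30"."""
    band = (rank - 1) // BAND_SIZE + 1
    return band if band <= MAX_BAND else "Over30"


def mean_logits(words):
    """words: iterable of (logit, is_content_word) pairs (hypothetical format).

    Returns the mean logit for (a) all words, (b) content words, (c) function words.
    Raises statistics.StatisticsError if any of the three groups is empty.
    """
    words = list(words)
    content = [logit for logit, is_content in words if is_content]
    function = [logit for logit, is_content in words if not is_content]
    return {
        "all": mean(logit for logit, _ in words),
        "content": mean(content),
        "function": mean(function),
    }
```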


3 Word Counting

3.1 Proper Nouns and Numerals

In New Word Level Checker, proper nouns and numerals (numbers) are first identified using an open-source part-of-speech tagger, spaCy, and are treated as possibly “known” words because they can be assumed to be understood by learners. The possessive ’s (e.g., Todd’s dog) is also put into this category in word profiling. For the remaining words in the text, the following lemmatization (tokenization) rules are applied.
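
The filtering step can be sketched on already-tagged tokens. The example below assumes each token arrives as a (text, pos, tag) triple, as could be produced from spaCy with `(t.text, t.pos_, t.tag_)`; the function name `split_known` is hypothetical, and the sample sentence is hand-tagged so the sketch runs without a spaCy model.

```python
# Universal POS labels that spaCy exposes as token.pos_ for proper nouns and numerals.
KNOWN_POS = {"PROPN", "NUM"}
# Penn Treebank tag that spaCy's English models expose as token.tag_ for possessive 's.
POSSESSIVE_TAG = "POS"


def split_known(tagged_tokens):
    """Split (text, pos, tag) triples into presumed-known tokens and the rest."""
    known, remaining = [], []
    for text, pos, tag in tagged_tokens:
        if pos in KNOWN_POS or tag == POSSESSIVE_TAG:
            known.append(text)
        else:
            remaining.append(text)
    return known, remaining


# Hand-tagged input standing in for tagger output on "Todd's dog ate 2 bones".
tagged = [
    ("Todd", "PROPN", "NNP"),
    ("'s", "PART", "POS"),
    ("dog", "NOUN", "NN"),
    ("ate", "VERB", "VBD"),
    ("2", "NUM", "CD"),
    ("bones", "NOUN", "NNS"),
]
known, remaining = split_known(tagged)
```

Only the tokens in `remaining` would then go through lemmatization and word-list lookup.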

3.2 Lemmatization

Counting words is a tricky business. In some word counting methods, “happy”, “happily”, “happiness”, and “unhappy” can be counted as one word with the headword “happy.” Four out of the five word lists in New Word Level Checker adopt flemma counting (i.e., a base form as a headword and its inflected forms counted as one word). For example, for the headword “study,” the following word forms are included and counted as one word: study, studies, studied, and studying. The flemma (family lemma), first introduced by Pinchbeck (2017), is a recommended word counting unit in the field of applied linguistics, including teaching English as a foreign language (see McLean, 2017). Note that flemma counting does not distinguish the part of speech (POS). That is, with flemma counting, the verb “study” and the noun “study” are both counted as one headword “study”.

On the other hand, lemma counting can detect the POS differences, and CEFR-J Wordlist adopts lemma counting. That is why the verb “study” is A1 and the noun “study” is A2 in CEFR-J Wordlist.
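
The difference between the two counting units can be illustrated with two small lookup tables. This is a sketch using only the “study” example from the text; the dictionary layout and function names are assumptions, not the format of any of the actual word lists.

```python
# Flemma list: one headword covers every inflected form, regardless of POS.
FLEMMA_FORMS = {"study": ["study", "studies", "studied", "studying"]}

# Invert to a form -> headword lookup for profiling.
FORM_TO_FLEMMA = {form: head for head, forms in FLEMMA_FORMS.items() for form in forms}

# Lemma list: the key includes the POS, so verb and noun can carry different levels.
# The two levels are the CEFR-J values quoted in the text.
LEMMA_LEVELS = {("study", "VERB"): "A1", ("study", "NOUN"): "A2"}


def flemma_of(form):
    """Return the flemma headword for a word form, or None if it is not listed."""
    return FORM_TO_FLEMMA.get(form.lower())


def lemma_level(lemma, pos):
    """Return the level of a (lemma, POS) pair, or None if it is not listed."""
    return LEMMA_LEVELS.get((lemma, pos))
```

A flemma lookup needs only the word form, whereas a lemma lookup needs a POS tag as well, which is why a tagger is required for CEFR-J.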

For this reason, the “lemma” lists used in New Word Level Checker (see below) are all in fact “flemma” lists.

For New JACET8000 and SVL12000, New Word Level Checker lemmatizes words with the AntBNC Lemma List created by Dr. Laurence Anthony (Waseda University), which is based on all words in the BNC with a frequency greater than two. Modifications were manually made to match the headwords of New JACET8000 and SVL12000. For example, the words “interesting” and “interested” are listed as two headwords in both New JACET8000 and SVL12000, so they were excluded from the lemma entry “interest” (interest = interest, interests). In addition, words with British spellings in New JACET8000 and SVL12000 are included in the revised lemma list (e.g., advertise = advertise, advertised, advertises, advertising, advertize, advertizes, advertized, advertizing).
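
The “interest” adjustment can be pictured as removing from each lemma entry any form that is a headword in its own right. The text says these modifications were made manually, so the automated version below is purely an illustrative assumption, and the toy entry may not match the real AntBNC entry.

```python
def revise_lemma_list(lemma_forms, headwords):
    """Drop forms that are separate headwords from every other lemma entry."""
    revised = {}
    for lemma, forms in lemma_forms.items():
        # Keep the lemma itself, and any form that is not its own headword.
        revised[lemma] = [f for f in forms if f == lemma or f not in headwords]
    return revised


# Toy entry for illustration; the real AntBNC entry may list other forms.
lemma_forms = {"interest": ["interest", "interests", "interested", "interesting"]}
headwords = {"interest", "interested", "interesting"}
revised = revise_lemma_list(lemma_forms, headwords)
```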

For NGSL, lemmatization is much simpler because all NGSL (New General Service List) lists come with lemmatized forms. New Word Level Checker uses the lemma lists available at the NGSL website.

For CEFR-J Wordlist, New Word Level Checker utilizes spaCy to get each word’s POS and its lemma form. It should be noted that, although considerable “fine-tuning” was required and conducted to achieve the optimal matching of the POS tagging and CEFR-J Wordlist, there may still be a few cases where the result is not 100% perfect.

For SEWK-J, New Word Level Checker refers to a lemma list developed by Dr. Geoff Pinchbeck (Carleton University).

3.3 Capitalized Letters

If the headwords in New JACET8000, SVL12000, and CEFR-J Wordlist include capital letters, they are treated as they are (e.g., “I” not “i”). However, as the headwords in NGSL and SEWK-J are all in lowercase (or capitalized) letters, New Word Level Checker treats all words as lowercase for these two lists. In other words, New JACET8000, SVL12000, and CEFR-J are case-sensitive, and NGSL and SEWK-J are case-insensitive.
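
One way to picture the two behaviors is a lookup that tries an exact match first for case-sensitive lists and otherwise compares in lowercase. The function name `lookup` is hypothetical, and the lowercase fallback for case-sensitive lists (so that a sentence-initial “The” still matches “the”) is an assumption rather than documented NWLC behavior.

```python
def lookup(word, headwords, case_sensitive):
    """Return the matching headword, or None if there is no match."""
    if case_sensitive and word in headwords:
        return word  # exact match keeps capitals, e.g. "I"
    lowered = word.lower()
    return lowered if lowered in headwords else None


# Toy headword sets: one with a capitalized entry, one all lowercase.
sensitive_list = {"I", "the"}
insensitive_list = {"i", "the"}
```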

List of Words with Capitalized Letters

3.4 Contracted Forms

New Word Level Checker detects contracted forms by using spaCy and reverts them to the uncontracted forms (e.g., I’m => I am). Note that contracted forms such as “s” in “he’s” (he has or he is) and “d” in “we’d” (we had or we would) cannot be reliably distinguished from one another, so New Word Level Checker leaves “s” and “d” as they are. This has little effect on the profile because the headwords “be (or is),” “have (has),” “would,” etc. are all in one of the lowest levels. Even so, the results may need to be checked carefully.
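
The expansion rule can be sketched on tokens that have already been split at the apostrophe, which is how spaCy's English tokenizer handles forms like “I'm” (“I” + “'m”) and “can't” (“ca” + “n't”). The mapping tables and the function name below are illustrative assumptions, not NWLC's actual tables, and the sketch uses plain ASCII apostrophes.

```python
# Clitics with one conventional expansion. "'s" and "'d" are deliberately absent:
# they are ambiguous (is/has, had/would) and are passed through unchanged.
EXPANSIONS = {"'m": "am", "'re": "are", "'ve": "have", "'ll": "will", "n't": "not"}
# Stems left truncated before "n't" by the tokenizer ("can't" -> "ca" + "n't").
STEMS = {"ca": "can", "wo": "will"}


def expand_contractions(tokens):
    """Expand unambiguous clitics in a list of already-split tokens."""
    out = []
    for i, tok in enumerate(tokens):
        low = tok.lower()
        nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
        if low in EXPANSIONS:
            out.append(EXPANSIONS[low])
        elif nxt == "n't" and low in STEMS:
            out.append(STEMS[low])  # restore the full stem before "not"
        else:
            out.append(tok)
    return out
```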

SVL12000, NGSL, and CEFR-J have words with apostrophes. If those words are in the input text, New Word Level Checker treats them as they are.

List of Words with Apostrophes

3.5 Words with Periods

SVL12000 and CEFR-J have words with periods (i.e., abbreviated words). If those word lists are selected and the input text has those words (see below), New Word Level Checker treats them as they are.

List of Words with Periods

3.6 Hyphenated Words

Hyphenated words are first divided into two words (e.g., “Osaka-based” is treated as two words, “Osaka” and “based”) in all word lists except for the cases where the selected list has hyphenated words as headwords (see below).
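
A minimal sketch of this rule follows. The function name and the sample headword “well-known” are hypothetical; the actual hyphenated headwords depend on the selected list.

```python
def split_hyphenated(token, headwords):
    """Keep a hyphenated token whole if it is a headword; otherwise split on hyphens."""
    if "-" not in token or token in headwords:
        return [token]
    # Drop empty pieces that a leading or trailing hyphen would produce.
    return [part for part in token.split("-") if part]
```

For example, `split_hyphenated("Osaka-based", set())` yields the two words “Osaka” and “based”, while a token that appears in `headwords` is returned unchanged.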

List of Hyphenated Words

3.7 Compounds/Multi-word Units

If a headword in the selected word list consists of more than one word (e.g., “ice cream”, “bank account”, “mobile phone” in CEFR-J), it is counted as one word (unit), and New Word Level Checker returns the word profiling accordingly.
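
Multi-word headwords can be matched by scanning the token sequence and joining the longest run that appears in the list. The greedy longest-match strategy, the function name, and the `max_len` parameter are assumptions made for this sketch; the text does not describe how NWLC performs the matching.

```python
def merge_multiword_units(tokens, units, max_len=3):
    """Greedily join the longest run of tokens that matches a multi-word headword."""
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, down to two-word candidates.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = " ".join(tokens[i:i + n]).lower()
            if candidate in units:
                out.append(candidate)
                i += n
                break
        else:
            out.append(tokens[i])  # no unit starts here; keep the single token
            i += 1
    return out


units = {"ice cream", "bank account", "mobile phone"}
merged = merge_multiword_units(["I", "like", "ice", "cream"], units)
```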

List of Compounds

3.8 Summary

Category          NewJ8       SVL         NGSL          CEFR-J      SEWK-J
Total Words       8,000       12,000      5,621         7,975       74,810
Lemma or Flemma   Flemma      Flemma      Flemma        Lemma       Flemma
Lemma List        No          No          Yes           No          Yes
Capitalized       20          44          0             72          0
Case              Sensitive   Sensitive   Insensitive   Sensitive   Insensitive
Apostrophe        0           1           2             6           0
Period            0           4           0             8           0
Hyphen            0           0           3             62          0
Compound          0           0           1             145         0

4 Features of NWLC

4.1 Downloadable Word List

Things you can do after creating a list based on the selected word list:

  1. Search any term in the list.
  2. Sort each column.
  3. Download the whole list as a CSV file.

4.2 Auto-extracted Keywords

When you create a word list, 20 keywords of the input text are extracted automatically. NWLC uses YAKE! (Yet Another Keyword Extractor), a lightweight, unsupervised, single-document keyword extraction method based on statistical features of the text.

According to this report (a study conducted by the authors of YAKE!), it outperforms ten state-of-the-art unsupervised approaches (TF.IDF, KP-Miner, RAKE, TextRank, SingleRank, ExpandRank, TopicRank, TopicalPageRank, PositionRank and MultipartiteRank).

The 20 auto-extracted keywords in NWLC are listed according to their importance in the text (from the most relevant keywords to the least).
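
For readers who want to try something similar, the sketch below uses the `yake` Python package. The parameter values NWLC actually uses are not documented here, so `lan="en"` and `top=20` are assumptions based only on the text above, and the helper `order_by_relevance` is hypothetical. The package is imported lazily so that the helper can be used without it installed.

```python
def extract_keywords(text, top=20):
    """Return (keyword, score) pairs from YAKE!; requires `pip install yake`."""
    import yake  # imported lazily so the helper below works without the package

    extractor = yake.KeywordExtractor(lan="en", top=top)
    return extractor.extract_keywords(text)


def order_by_relevance(scored_keywords):
    """Sort (keyword, score) pairs so the most relevant keywords come first.

    In YAKE!, a lower score means a more relevant keyword.
    """
    return [kw for kw, _ in sorted(scored_keywords, key=lambda pair: pair[1])]
```

Because lower scores indicate higher relevance, sorting in ascending order of score gives the "most relevant first" ordering described above.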

4.3 Key Word In Context

After you check the words in the list and the auto-extracted keywords, you might want to see how those words (and phrases) are used in the original context. NWLC has a KWIC (Key Word In Context) feature, which produces concordance lines.

When you type in a search term (two or more words are also acceptable) and click on “Search”, the concordance lines appear with the search term in the center. Once you click on the “Search” button, two new buttons (“L” and “R”) are displayed. You can sort the concordance lines by the context to the left (“L”) or to the right (“R”) of the search term.

Observing the concordance lines is useful in finding the pattern and usage of certain words and phrases in the context (i.e., collocations and phraseology).
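
A concordancer of this kind can be sketched in a few lines. The code below is an illustration only: the function names, the `width` parameter, and the choice to sort by the single word immediately to the left or right of the search term are assumptions, not a description of NWLC's implementation.

```python
def kwic(tokens, term, width=4):
    """Return (left, keyword, right) context triples for each occurrence of term."""
    target = term.lower().split()
    n = len(target)
    lines = []
    for i in range(len(tokens) - n + 1):
        if [t.lower() for t in tokens[i:i + n]] == target:
            left = tokens[max(0, i - width):i]
            right = tokens[i + n:i + n + width]
            lines.append((left, tokens[i:i + n], right))
    return lines


def sort_lines(lines, side="L"):
    """Sort by the word immediately left ("L") or right ("R") of the keyword."""
    def key(line):
        left, _, right = line
        context = left[-1:] if side == "L" else right[:1]
        return [w.lower() for w in context]
    return sorted(lines, key=key)


tokens = "the big dog saw a small dog and an old dog barked".split()
lines = kwic(tokens, "dog", width=2)
```

Sorting on the neighboring word groups identical contexts together, which is what makes recurring collocations easy to spot.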

In addition to checking the search term in the concordance-line format, you can check the search term in the original text by clicking on “Show Original”. Note that, if your input text is very large, it may take a long time to show the text, or NWLC may not be able to process the text properly.


5 Developer

New Word Level Checker is developed and programmed by Dr. Atsushi Mizumoto, Professor of Applied Linguistics at the Faculty of Foreign Language Studies and the Graduate School of Foreign Language Education and Research, Kansai University, Osaka, Japan.

I would like to acknowledge and thank Dr. Yasu Imao, the developer of CasualConc, for his support in producing word lists and concordance lines.


6 Citing NWLC

Mizumoto, A. (2021). New Word Level Checker [Web application]. https://nwlc.pythonanywhere.com/