Algorithm for finding names in Chinese texts

Question

I am writing an application for extracting information from Chinese texts. One of the tasks is finding names (personal, geographical, or other). The algorithm does not need to find 100% of the names; 50% is enough.

In European languages (for example, English or Russian), I can detect names by their initial capital letters: if a word in the middle of a sentence begins with a capital letter, then that word is a name. This criterion is not 100% reliable and does not find all names, but it is enough for my purpose.

I know that sometimes (but not always) a name follows 叫. But I do not know how long the name is.

Could you tell me some signs (criteria, features) by which an algorithm can find names in Chinese texts?

Siyi Deng · Answer

This is a non-trivial task.
For example, in the sentence below, are 李三 and 张三 both names of people?
李三买了一张三角桌子。
A deep-learning-based algorithm is your best bet, for example:
https://zhuanlan.zhihu.com/p/61227299
https://github.com/wainshine/Chinese-Names-Corpus
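
To make the ambiguity concrete, here is a minimal sketch of a naive "surname plus one character" rule applied to that sentence. The surname set and the function name are made up for illustration; this is not the method from either link.

```python
# Tiny illustrative surname list (an assumption; a real list has hundreds of entries).
SURNAMES = {"李", "张", "王"}

def naive_name_candidates(text, given_len=1):
    """Return (offset, substring) for every surname followed by given_len characters."""
    hits = []
    for i, ch in enumerate(text):
        # Require that the given-name characters actually exist in the text.
        if ch in SURNAMES and i + given_len < len(text):
            hits.append((i, text[i:i + given_len + 1]))
    return hits

sentence = "李三买了一张三角桌子。"
candidates = naive_name_candidates(sentence)
print(candidates)  # [(0, '李三'), (5, '张三')]
```

The rule flags 张三 as well, even though 张 is a measure word here (一张 ... 桌子) and 三 belongs to 三角. Telling the two apart needs context, which is why a trained model tends to do better than a character rule.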

Lionel Rowe · Answer

Depending on your use case, you could consider an approach like the following:

Segment the text into words using some tried-and-tested machine learning algorithm (this step alone will already yield imperfect results).

Look up each word in the free and downloadable CC-Cedict dictionary.

Check if the Pinyin field for that word's CC-Cedict entry begins with a capital letter.

If yes, judge that the word is a proper noun. Else, judge that it isn't.

If you want some idea of how accurate such an approach would be, try playing around with the regex-based query here (disclaimer: shameless plug for my app; instructions available by clicking the ℹ icon).
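
The lookup steps above could be sketched roughly like this. It assumes the text has already been segmented (the `words` list stands in for a segmenter's output) and uses a few hand-written sample lines in the CC-Cedict text layout, with shortened glosses, instead of the real downloaded file. The function names are hypothetical.

```python
import re

# Sample lines in the CC-Cedict layout: Traditional Simplified [pinyin] /glosses/
# (hand-written for illustration; the real file has over 100,000 entries).
SAMPLE_CEDICT = """\
# comment lines start with '#'
美國 美国 [Mei3 guo2] /United States/USA/US/
北京 北京 [Bei3 jing1] /Beijing/
桌子 桌子 [zhuo1 zi5] /table/desk/
在 在 [zai4] /(located) at/(to be) in/
"""

ENTRY_RE = re.compile(r"^(\S+) (\S+) \[([^\]]*)\] /(.*)/$")

def load_proper_noun_flags(cedict_text):
    """Map each headword to True if any of its entries has capitalized Pinyin."""
    flags = {}
    for line in cedict_text.splitlines():
        if not line or line.startswith("#"):
            continue
        m = ENTRY_RE.match(line)
        if m is None:
            continue  # skip lines that do not fit the expected layout
        trad, simp, pinyin, _glosses = m.groups()
        is_proper = pinyin[:1].isupper()
        for word in (trad, simp):
            # A word can have several entries; one capitalized reading is enough.
            flags[word] = flags.get(word, False) or is_proper
    return flags

def find_proper_nouns(words, flags):
    """Keep the segmented words whose dictionary entry looks like a proper noun."""
    return [w for w in words if flags.get(w, False)]

flags = load_proper_noun_flags(SAMPLE_CEDICT)
words = ["他", "在", "美国", "买", "桌子"]  # stand-in for segmenter output
print(find_proper_nouns(words, flags))  # ['美国']
```

Words missing from the dictionary (most personal names, for instance) are silently treated as non-names here, which is one reason the recall of this approach is limited.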

[Edited - previous answer below. Above approach is simpler, probably more reliable, and doesn't require sending tons of network requests]

For each word (or string of isolated characters that have been judged not to form a word), use the Wikipedia API to check if a page exists for it, using a query something like this:
action=query&format=json&prop=categories&titles=...&formatversion=2
If yes, check if one or more of the categories of said page match some variation of the following regex:
/(?:人|者|地名|城市)$/

If yes, judge that the word is a proper noun. Else, judge that it isn't.
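
As a rough sketch of that older approach: the code below builds the query URL and applies the category regex to a response that has already been fetched and parsed. The endpoint (Chinese Wikipedia) is my assumption, the function names are hypothetical, and the sample responses are hand-written in the `formatversion=2` shape rather than real API output.

```python
import re
from urllib.parse import urlencode

API_ENDPOINT = "https://zh.wikipedia.org/w/api.php"  # assumed: Chinese Wikipedia
CATEGORY_RE = re.compile(r"(?:人|者|地名|城市)$")

def build_query_url(title):
    """Build the categories query for one candidate word."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "categories",
        "titles": title,
        "formatversion": "2",
    }
    return API_ENDPOINT + "?" + urlencode(params)

def looks_like_proper_noun(response):
    """True if an existing page has a category whose name matches the regex."""
    for page in response.get("query", {}).get("pages", []):
        if page.get("missing"):
            continue  # no page exists for this title
        for cat in page.get("categories", []):
            name = cat["title"].split(":", 1)[-1]  # drop the "Category:" prefix
            if CATEGORY_RE.search(name):
                return True
    return False

# Hand-written examples of parsed responses (category names are illustrative).
found = {"query": {"pages": [{"pageid": 1, "ns": 0, "title": "屈原",
                              "categories": [{"ns": 14, "title": "Category:战国诗人"}]}]}}
missing = {"query": {"pages": [{"ns": 0, "title": "桌子腿", "missing": True}]}}

print(build_query_url("屈原"))
print(looks_like_proper_noun(found), looks_like_proper_noun(missing))  # True False
```

Fetching the URL (for example with `urllib.request.urlopen`) and decoding the JSON is left out so the sketch runs without network access.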

Either approach will be a little messy and not entirely reliable, but if you're lucky you might hit at least the 50% accuracy you're hoping for.

Xuehong Zhang · Answer

This seems to be an impossible task. Take this sentence, for example:
加利福尼亚在美国。
There are two names in the sentence: the first has five characters and the second has two. Unless you have come across these names before, you would not know where they begin and end.
Names that follow 叫 can only be a tiny fraction of all the places where a name turns up.
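
Put the other way round, names you have "come across before" can be found by plain lookup. A minimal sketch, assuming a hand-made list of known names (the `GAZETTEER` set and the function name are illustrative):

```python
# Illustrative list of already-known names (an assumption, not a real resource).
GAZETTEER = {"加利福尼亚", "美国", "北京"}

def find_known_names(text, names):
    """Scan left to right, taking the longest known name at each position."""
    if not names:
        return []
    max_len = max(map(len, names))
    found = []
    i = 0
    while i < len(text):
        # Try the longest possible substring first, then shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in names:
                found.append((i, piece))
                i += n
                break
        else:
            i += 1  # nothing known starts here; move on one character
    return found

result = find_known_names("加利福尼亚在美国。", GAZETTEER)
print(result)  # [(0, '加利福尼亚'), (6, '美国')]
```

This finds both names regardless of their length, but only because they are in the list; a name that is not in the list is never found.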

Mo. · Answer

This doesn't work with all texts, but for texts that support it, it should catch 100% of the names.
You can search the text for proper name marks:

In Chinese writing, a proper name mark (Simplified Chinese: 专名号, zhuānmínghào; Traditional Chinese: 專名號) is an underline used to mark proper names, such as the names of people, places, dynasties, organizations. The related book name mark (Simplified Chinese: 书名号, shūmínghào; Traditional Chinese: 書名號) indicated by a wavy underline (﹏﹏) is used to mark the titles of publications or texts.
For example:
屈原放逐，乃賦離騒。左丘失明，厥有國語。（司馬遷《報任安書》）
Qu Yuan was exiled, and thus composed the Li Sao. Zuo Qiu (or Zuoqiu) lost his sight, hence there is the Guo Yu. (Sima Qian, Letter to Ren An)

(Underline doesn't seem to be supported, check the Wiki link for the better example.)
The problem is that:

The proper name mark is rarely used in modern Chinese publications, and the Guillemet (《 》or〈 〉) is more commonly used to indicate titles. It is occasionally used in Taiwan and Hong Kong in school textbooks. However, in scholarly editions of classical Chinese texts, especially vertically typeset texts (where they appear to the left of the text instead of underneath), use of both the proper name mark and the book name mark is common, as they help readers avoid misinterpretations of the text.
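
For texts that do carry the mark, here is a minimal sketch of pulling the names out. It assumes the source is HTML in which each proper name mark is encoded as a `<u>` element; that is only one possible encoding, so check how your texts actually mark it up. The class name and the sample markup are made up for illustration.

```python
import re
from html.parser import HTMLParser

class ProperNameMarkParser(HTMLParser):
    """Collect the text inside <u>...</u> (assumed to encode the proper name mark)."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # how many <u> elements are currently open
        self._buffer = []    # text pieces of the name being read
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "u":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "u" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.names.append("".join(self._buffer))
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)

# Titles written with guillemets can be picked up with a plain regex.
TITLE_RE = re.compile(r"《([^》]+)》")

html = "<p><u>屈原</u>放逐，乃賦離騒。<u>左丘</u>失明，厥有國語。（<u>司馬遷</u>《報任安書》）</p>"
parser = ProperNameMarkParser()
parser.feed(html)
parser.close()
print(parser.names)             # ['屈原', '左丘', '司馬遷']
print(TITLE_RE.findall(html))   # ['報任安書']
```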
