备注

点击此处下载完整示例代码

从零开始的自然语言处理：使用字符级 RNN 生成名字

创建于：2025 年 4 月 1 日 | 最后更新：2025 年 4 月 1 日 | 最后验证：2024 年 11 月 5 日

作者：Sean Robertson

本教程是三个部分系列的一部分：

这是关于“从零开始学习 NLP”的三个教程中的第二个。在第一个教程中，我们使用 RNN 将名字分类到它们的起源语言。这次我们将反过来，从语言中生成名字。

> python sample.py Russian RUS
Rovakov
Uantov
Shavakov

> python sample.py German GER
Gerren
Ereng
Rosher

> python sample.py Spanish SPA
Salla
Parer
Allan

> python sample.py Chinese CHI
Chan
Hang
Iun

我们仍然在手工制作一个包含几个线性层的简单 RNN。最大的不同是，我们不是在读取完一个名字的所有字母后预测一个类别，而是输入一个类别，一次输出一个字母。递归地预测字符以形成语言（这也可以用单词或其他更高阶的结构来完成）通常被称为“语言模型”。

推荐阅读：

我假设你已经至少安装了 PyTorch，了解 Python，并且理解张量：

https://maskerprc.github.io/ 安装说明
使用 PyTorch 进行深度学习：60 分钟快速入门，了解 PyTorch 的基本使用
通过实例学习 PyTorch，全面深入的了解
PyTorch 面向前 Torch 用户，如果您是前 Lua Torch 用户

了解 RNN 及其工作原理也将很有帮助：

《循环神经网络的不合理有效性》展示了大量真实生活中的例子
理解 LSTM 网络主要关于 LSTM，但也对循环神经网络有一般性的说明

我还建议之前的教程，从零开始的自然语言处理：使用字符级 RNN 进行命名分类

准备数据

备注

从这里下载数据并将其提取到当前目录。

请参阅上一教程以获取此过程的更多细节。简而言之，有一堆包含每行一个名称的纯文本文件 data/names/[Language].txt 。我们将行拆分为数组，将 Unicode 转换为 ASCII，最终得到一个字典 {language: [names ...]} 。

from io import open
import glob
import os
import unicodedata
import string

all_letters = string.ascii_letters + " .,;'-"
n_letters = len(all_letters) + 1 # Plus EOS marker

def findFiles(path): return glob.glob(path)

# Turn a Unicode string to plain ASCII, thanks to https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

# Read a file and split into lines
def readLines(filename):
    with open(filename, encoding='utf-8') as some_file:
        return [unicodeToAscii(line.strip()) for line in some_file]

# Build the category_lines dictionary, a list of lines per category
category_lines = {}
all_categories = []
for filename in findFiles('data/names/*.txt'):
    category = os.path.splitext(os.path.basename(filename))[0]
    all_categories.append(category)
    lines = readLines(filename)
    category_lines[category] = lines

n_categories = len(all_categories)

if n_categories == 0:
    raise RuntimeError('Data not found. Make sure that you downloaded data '
        'from https://download.pytorch.org/tutorial/data.zip and extract it to '
        'the current directory.')

print('# categories:', n_categories, all_categories)
print(unicodeToAscii("O'Néàl"))

创建网络

此网络扩展了上一教程中的 RNN，增加了一个用于类别张量的额外参数，该参数与其他参数一起连接。类别张量是一个类似于字母输入的一维向量。

我们将把输出解释为下一个字母的概率。在采样时，使用最可能的输出字母作为下一个输入字母。

我在结合隐藏层和输出层之后添加了一个第二层线性层 o2o （以增加其工作能力）。还有一个 dropout 层，它会以一定的概率（这里为 0.1）随机将输入的部分置零，通常用于模糊输入以防止过拟合。在这里，我们将其用于网络的末端，故意引入一些混乱，以增加采样的多样性。

import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size

        self.i2h = nn.Linear(n_categories + input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(n_categories + input_size + hidden_size, output_size)
        self.o2o = nn.Linear(hidden_size + output_size, output_size)
        self.dropout = nn.Dropout(0.1)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, category, input, hidden):
        input_combined = torch.cat((category, input, hidden), 1)
        hidden = self.i2h(input_combined)
        output = self.i2o(input_combined)
        output_combined = torch.cat((hidden, output), 1)
        output = self.o2o(output_combined)
        output = self.dropout(output)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

训练 ¶

准备训练 ¶

首先，是获取随机对（类别，行）的辅助函数：

import random

# Random item from a list
def randomChoice(l):
    return l[random.randint(0, len(l) - 1)]

# Get a random category and random line from that category
def randomTrainingPair():
    category = randomChoice(all_categories)
    line = randomChoice(category_lines[category])
    return category, line

对于每个时间步（即每个训练单词中的每个字母），网络的输入将是 (category, current letter, hidden state) ，输出将是 (next letter, next hidden state) 。因此，对于每个训练集，我们需要类别、一组输入字母和一组输出/目标字母。

由于我们是在每个时间步预测当前字母的下一个字母，因此字母对是来自同一行的连续字母组，例如，对于 "ABCD<EOS>" ，我们将创建（“A”，“B”），（“B”，“C”），（“C”，“D”），（“D”，“EOS”）。

类别张量是一个大小为 <1 x n_categories> 的一热张量。在训练时，我们将它喂给网络在每个时间步，这是一个设计选择，它也可以作为初始隐藏状态或某种其他策略的一部分。

# One-hot vector for category
def categoryTensor(category):
    li = all_categories.index(category)
    tensor = torch.zeros(1, n_categories)
    tensor[0][li] = 1
    return tensor

# One-hot matrix of first to last letters (not including EOS) for input
def inputTensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for li in range(len(line)):
        letter = line[li]
        tensor[li][0][all_letters.find(letter)] = 1
    return tensor

# ``LongTensor`` of second letter to end (EOS) for target
def targetTensor(line):
    letter_indexes = [all_letters.find(line[li]) for li in range(1, len(line))]
    letter_indexes.append(n_letters - 1) # EOS
    return torch.LongTensor(letter_indexes)

为了方便训练，我们将创建一个 randomTrainingExample 函数，该函数获取一个随机的（类别，行）对，并将它们转换为所需的（类别，输入，目标）张量。

# Make category, input, and target tensors from a random category, line pair
def randomTrainingExample():
    category, line = randomTrainingPair()
    category_tensor = categoryTensor(category)
    input_line_tensor = inputTensor(line)
    target_line_tensor = targetTensor(line)
    return category_tensor, input_line_tensor, target_line_tensor

网络训练 ¶

与分类不同，分类只使用最后一个输出，我们在每一步都进行预测，因此我们在每一步都计算损失。

自动微分（autograd）的魔法允许你简单地累加每一步的损失，并在最后调用反向传播（backward）。

criterion = nn.NLLLoss()

learning_rate = 0.0005

def train(category_tensor, input_line_tensor, target_line_tensor):
    target_line_tensor.unsqueeze_(-1)
    hidden = rnn.initHidden()

    rnn.zero_grad()

    loss = torch.Tensor([0]) # you can also just simply use ``loss = 0``

    for i in range(input_line_tensor.size(0)):
        output, hidden = rnn(category_tensor, input_line_tensor[i], hidden)
        l = criterion(output, target_line_tensor[i])
        loss += l

    loss.backward()

    for p in rnn.parameters():
        p.data.add_(p.grad.data, alpha=-learning_rate)

    return output, loss.item() / input_line_tensor.size(0)

为了跟踪训练所需的时间，我添加了一个 timeSince(timestamp) 函数，该函数返回一个可读性强的字符串：

import time
import math

def timeSince(since):
    now = time.time()
    s = now - since
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)

训练照常进行 - 多次调用训练，等待几分钟，每 print_every 个示例打印当前时间和损失，并将每 plot_every 个示例的平均损失存储在 all_losses 中以便稍后绘图。

rnn = RNN(n_letters, 128, n_letters)

n_iters = 100000
print_every = 5000
plot_every = 500
all_losses = []
total_loss = 0 # Reset every ``plot_every`` ``iters``

start = time.time()

for iter in range(1, n_iters + 1):
    output, loss = train(*randomTrainingExample())
    total_loss += loss

    if iter % print_every == 0:
        print('%s (%d %d%%) %.4f' % (timeSince(start), iter, iter / n_iters * 100, loss))

    if iter % plot_every == 0:
        all_losses.append(total_loss / plot_every)
        total_loss = 0

绘制损失图

从 all_losses 中绘制历史损失图，显示了网络的训练过程：

import matplotlib.pyplot as plt

plt.figure()
plt.plot(all_losses)

网络采样

为了采样，我们给网络一个字母并询问下一个字母是什么，将其作为下一个字母输入，然后重复，直到 EOS 标记。

为输入类别、起始字母和空隐藏状态创建张量
使用起始字母创建一个字符串 output_name
最多输出长度为
- 将当前字母输入到网络中
- 从最高输出获取下一个字母和下一个隐藏状态
- 如果字母是 EOS，则在这里停止
- 如果是普通字母，则添加到 output_name 并继续
返回最终名称

备注

而不是必须给出起始字母，另一种策略是在训练中包含一个“字符串开始”标记，让网络自己选择起始字母。

max_length = 20

# Sample from a category and starting letter
def sample(category, start_letter='A'):
    with torch.no_grad():  # no need to track history in sampling
        category_tensor = categoryTensor(category)
        input = inputTensor(start_letter)
        hidden = rnn.initHidden()

        output_name = start_letter

        for i in range(max_length):
            output, hidden = rnn(category_tensor, input[0], hidden)
            topv, topi = output.topk(1)
            topi = topi[0][0]
            if topi == n_letters - 1:
                break
            else:
                letter = all_letters[topi]
                output_name += letter
            input = inputTensor(letter)

        return output_name

# Get multiple samples from one category and multiple starting letters
def samples(category, start_letters='ABC'):
    for start_letter in start_letters:
        print(sample(category, start_letter))

samples('Russian', 'RUS')

samples('German', 'GER')

samples('Spanish', 'SPA')

samples('Chinese', 'CHI')

练习 §

尝试使用不同类别的数据集，例如：类别 -> 行，
- 虚构系列 -> 角色名称
- 词性 -> 词
- 国家 -> 城市
使用“句子开头”标记，以便在不需要选择起始字母的情况下进行采样
使用更大或形状更好的网络获得更好的结果
- 尝试使用 nn.LSTM 和 nn.GRU 层
- 将多个这些 RNN 组合成一个高级网络

脚本总运行时间：（0 分钟 0.000 秒）

由 Sphinx-Gallery 生成的画廊