
🔌 Vector Embeddings

Mapping discrete symbols into a continuous vector space: a core technical foundation of large AI models

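📖 Technical Overview

🎯 What Is an Embedding?

A vector embedding maps discrete objects (words, sentences, images, user IDs, and so on) into a continuous vector space. The mapping is learned so that semantically similar objects end up close to each other in that space.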

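Embeddings are one of the core techniques of modern AI and machine learning, and are widely used in:

Natural language processing - word, sentence, and document vectors
Recommender systems - user embeddings, item embeddings
Computer vision - image embeddings, feature extraction
Information retrieval - semantic search, similarity matching
Large language models - token embeddings, position embeddings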

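💡 Core Idea

The core idea behind embeddings is distributed representation:

Dimensionality reduction - converts high-dimensional, sparse one-hot encodings into low-dimensional, dense vectors
Semantic preservation - semantically similar objects lie closer together in the vector space
Computability - vectors support mathematical operations such as similarity computation and vector addition and subtraction
Transfer learning - pretrained embeddings can be reused for new tasks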
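
The first three points can be illustrated with a small NumPy sketch. The three-word vocabulary and the values in `embedding_table` are hand-picked for illustration and are not taken from any trained model; the helper names `one_hot` and `cosine` are likewise assumptions of this example.

```python
import numpy as np

# Toy vocabulary and a hand-picked 3-dimensional embedding table
# (illustrative values only, not learned from data).
vocab = ["cat", "dog", "car"]
embedding_table = np.array([
    [0.9, 0.8, 0.1],   # cat
    [0.8, 0.9, 0.2],   # dog
    [0.1, 0.2, 0.9],   # car
])

def one_hot(index: int, size: int) -> np.ndarray:
    """Return a sparse indicator vector with a single 1 at `index`."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Multiplying a one-hot vector by the table selects exactly one row,
# which is why an embedding layer is implemented as a table lookup.
cat_dense = one_hot(0, len(vocab)) @ embedding_table

# One-hot vectors of two different words are always orthogonal.
onehot_sim = cosine(one_hot(0, 3), one_hot(1, 3))

# Dense vectors can place related words closer together.
cat_dog = cosine(embedding_table[0], embedding_table[1])
cat_car = cosine(embedding_table[0], embedding_table[2])
print(onehot_sim, cat_dog, cat_car)
```

With one-hot encoding every pair of distinct words has similarity 0, so the representation carries no notion of relatedness; the dense table can encode that "cat" is nearer to "dog" than to "car".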

📜 History

1950s-1980s

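🔤 The One-Hot Encoding Era

The earliest way to represent discrete symbols: each word becomes a high-dimensional sparse vector with a single 1. It cannot express relationships between words, because any two distinct words are orthogonal, and the dimension grows with the vocabulary size.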

1986

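🧠 Distributed Representations Proposed

Geoffrey Hinton and colleagues introduced the concept of distributed representations, arguing that symbols can be represented by low-dimensional dense vectors. This laid the groundwork for embedding research.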

2003

📊 Neural Probabilistic Language Model

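Yoshua Bengio and colleagues proposed the neural probabilistic language model, which used a neural network to learn distributed representations of words. It is the prototype of modern word embeddings.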

2013

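🚀 Word2Vec Released

Tomas Mikolov and colleagues at Google introduced Word2Vec, which learns word vectors efficiently with the CBOW and Skip-gram models. It revealed linear semantic regularities such as "king - man + woman ≈ queen" and set off a wave of embedding-based NLP.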

2014

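🌐 GloVe Word Vectors

Stanford researchers proposed GloVe (Global Vectors), which combines global matrix factorization with local context-window methods and reported better results than earlier methods on word analogy and word similarity tasks.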

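2014

📝 Doc2Vec / Paragraph Vector

An extension of Word2Vec (Le and Mikolov's Paragraph Vector, implemented in Gensim as Doc2Vec) that learns document-level vector representations for tasks such as document similarity and classification.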

2017

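⚡ The Transformer Architecture

The paper "Attention Is All You Need" introduced the Transformer architecture. Because self-attention by itself ignores token order, positional encodings (sinusoidal in the original paper, learned position embeddings in many later models) became a standard component.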

2018

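🤖 BERT Released

Google introduced BERT (Bidirectional Encoder Representations from Transformers). Pretrained with a bidirectional Transformer encoder and a masked language modeling objective, it set new state-of-the-art results on many NLP tasks.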

2019-2020

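🌟 The Era of Large Models

Very large pretrained models such as GPT-2, GPT-3, and T5 emerged. Embedding (hidden) dimensions grew from a few hundred to several thousand and beyond ten thousand (12,288 in the largest GPT-3), and parameter counts reached the hundred-billion scale (175 billion for GPT-3).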

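2021-present

🔥 Multimodal Embeddings

Multimodal models such as CLIP and DALL-E embed text and images in a shared representation space, and later work extends the approach to audio. This opened up cross-modal retrieval and generation.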

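⚙️ How It Works

🎯 How Word2Vec Works

Word2Vec learns word vectors by predicting words from their context. It comes in two model architectures:

1. CBOW (Continuous Bag of Words)

Predicts the center word from its context words. For example, given "the cat ___ on the mat", the model predicts "sat".

2. Skip-gram

Predicts the context words from the center word. For example, given "sat", the model predicts the surrounding words "the", "cat", "on", "the", "mat".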
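
The difference between the two architectures is easiest to see in the training examples each one consumes. The following is a minimal sketch of how such examples might be generated from a tokenized sentence; the function names `skipgram_pairs` and `cbow_examples` are hypothetical, and real implementations add details such as subsampling and negative sampling that are omitted here.

```python
from typing import List, Tuple

def skipgram_pairs(tokens: List[str], window: int) -> List[Tuple[str, str]]:
    """Return one (center, context) pair per context word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens: List[str], window: int) -> List[Tuple[List[str], str]]:
    """Return one (context_words, center) example per position."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        examples.append((context, center))
    return examples

sentence = ["the", "cat", "sat", "on", "the", "mat"]
pairs = skipgram_pairs(sentence, window=2)
examples = cbow_examples(sentence, window=2)

# Skip-gram: the center word "sat" is paired with each neighbour separately.
print([p for p in pairs if p[0] == "sat"])
# CBOW: all neighbours of "sat" are grouped into a single example.
print(examples[2])
```

Skip-gram produces many small (center, context) pairs, while CBOW produces one example per position with the whole context as input.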


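# Word2Vec with Gensim (4.x API)
from gensim.models import Word2Vec

# Toy corpus: far too small for meaningful vectors, used only to show the API
sentences = [
    ['the', 'cat', 'sat', 'on', 'the', 'mat'],
    ['the', 'dog', 'ran', 'in', 'the', 'park'],
    ['the', 'bird', 'flew', 'over', 'the', 'tree']
]

# Train the Word2Vec model
model = Word2Vec(
    sentences=sentences,
    vector_size=100,      # embedding dimension
    window=5,             # context window size
    min_count=1,          # ignore words that occur fewer times than this
    workers=4,            # number of worker threads
    sg=1,                 # 1 = Skip-gram, 0 = CBOW
    epochs=10             # number of training epochs
)

# Look up a word vector
vector = model.wv['cat']
print(f"Vector for 'cat': {vector[:10]}...")  # show the first 10 dimensions

# Find similar words
similar = model.wv.most_similar('cat', topn=3)
print(f"Words similar to 'cat': {similar}")

# Vector arithmetic. 'king', 'woman' and 'man' do not occur in the toy corpus,
# so check the vocabulary first to avoid a KeyError.
analogy_words = ['king', 'woman', 'man']
if all(word in model.wv.key_to_index for word in analogy_words):
    result = model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
    print(f"king - man + woman ≈ {result[0][0]}")
else:
    print("Analogy skipped: train on a larger corpus that contains these words")

# Compute the similarity between two words
similarity = model.wv.similarity('cat', 'dog')
print(f"Similarity between 'cat' and 'dog': {similarity:.4f}")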

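// A simplified Word2Vec sketch with TensorFlow.js
const tf = require('@tensorflow/tfjs');

// Simplified Word2Vec example (lookup and similarity only; no real training)
class SimpleWord2Vec {
    constructor(vocabSize, embeddingDim) {
        this.vocabSize = vocabSize;
        this.embeddingDim = embeddingDim;
        
        // Initialize the embedding matrix with random values
        this.embeddings = tf.randomNormal([vocabSize, embeddingDim]);
    }
    
    // Look up a word vector; the result has shape [1, embeddingDim]
    getEmbedding(wordIndex) {
        return this.embeddings.gather([wordIndex]);
    }
    
    // Cosine similarity between two words
    similarity(index1, index2) {
        // tf.tidy disposes the intermediate tensors
        return tf.tidy(() => {
            const vec1 = this.getEmbedding(index1);
            const vec2 = this.getEmbedding(index2);
            
            // Element-wise product then sum: dot() on two [1, dim] tensors
            // would fail because their inner dimensions do not match
            const dot = vec1.mul(vec2).sum();
            const norm1 = vec1.norm();
            const norm2 = vec2.norm();
            
            return dot.div(norm1.mul(norm2)).dataSync()[0];
        });
    }
    
    // One training step (placeholder)
    trainStep(contextIndices, targetIndex, learningRate = 0.01) {
        return tf.tidy(() => {
            // The full Skip-gram or CBOW training logic belongs here.
            // To keep the example short, only the lookups are shown.
            const contextEmbeddings = this.embeddings.gather(contextIndices);
            const targetEmbedding = this.getEmbedding(targetIndex);
            
            // A real implementation computes a loss and updates the
            // embeddings by backpropagation; this returns a dummy value
            return contextEmbeddings.mean();
        });
    }
}

// Usage example
const vocabSize = 10000;
const embeddingDim = 100;
const word2vec = new SimpleWord2Vec(vocabSize, embeddingDim);

// Look up a word vector (index 42 stands in for the word "cat")
const catVector = word2vec.getEmbedding(42);
console.log('Cat vector shape:', catVector.shape);

// Compute a similarity (random vectors, so the value is meaningless)
const sim = word2vec.similarity(42, 100);
console.log('Similarity:', sim);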

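🌐 How GloVe Works

GloVe (Global Vectors for Word Representation) combines the strengths of global matrix factorization and local context-window methods.

Core idea: ratios of co-occurrence probabilities carry meaning, so word vectors are trained such that the dot product of two word vectors (plus bias terms) approximates the logarithm of the two words' co-occurrence count.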
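
The core idea can be made concrete with a small NumPy sketch of the two ingredients: a co-occurrence matrix built from a context window, and a weighted least-squares loss that compares dot products with log counts. The function names are hypothetical; the 1/distance weighting of co-occurrences and the defaults `x_max=100` and `alpha=0.75` follow the values reported in the GloVe paper. Only the loss is evaluated here, with no optimizer.

```python
import numpy as np

def cooccurrence(corpus, vocab, window=2):
    """Symmetric co-occurrence counts; a pair d positions apart contributes 1/d."""
    X = np.zeros((len(vocab), len(vocab)))
    for sent in corpus:
        ids = [vocab[w] for w in sent]
        for i, wi in enumerate(ids):
            for d in range(1, window + 1):
                if i + d < len(ids):
                    X[wi, ids[i + d]] += 1.0 / d
                    X[ids[i + d], wi] += 1.0 / d
    return X

def weight(x, x_max=100.0, alpha=0.75):
    """Weighting function f(x): damps rare pairs and caps frequent ones at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, W, W_ctx, b, b_ctx):
    """Sum over nonzero X_ij of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    i, j = np.nonzero(X)
    pred = np.sum(W[i] * W_ctx[j], axis=1) + b[i] + b_ctx[j]
    return float(np.sum(weight(X[i, j]) * (pred - np.log(X[i, j])) ** 2))

vocab = {"the": 0, "cat": 1, "sat": 2}
X = cooccurrence([["the", "cat", "sat"]], vocab, window=2)

rng = np.random.default_rng(0)
dim = 4
W, W_ctx = rng.normal(size=(3, dim)), rng.normal(size=(3, dim))
b, b_ctx = np.zeros(3), np.zeros(3)
loss = glove_loss(X, W, W_ctx, b, b_ctx)
print(X)
print(loss)
```

Training would minimize this loss with respect to `W`, `W_ctx`, `b`, and `b_ctx`; only nonzero entries of `X` contribute, which is what keeps the method tractable on large vocabularies.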

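Property	Word2Vec	GloVe
Training signal	Local context windows	Global co-occurrence matrix
Training speed	Relatively fast	Requires building the co-occurrence matrix first
Semantics captured	Local context	Global corpus statistics
Memory footprint	Lower	Higher (the co-occurrence matrix must be stored)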

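🤖 BERT and Transformer Embeddings

BERT (Bidirectional Encoder Representations from Transformers) uses a bidirectional Transformer encoder to produce context-dependent word vectors: the same word receives a different vector in each sentence it appears in.

Key innovations

Bidirectional context - attends to the context on both the left and the right
Masked LM - randomly masks some tokens and trains the model to predict them
Next Sentence Prediction - predicts whether the second sentence follows the first
Position embeddings - uses learnable position vectors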
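
The Masked LM objective can be sketched in a few lines of plain Python. This is a simplified illustration rather than BERT's actual preprocessing code: it selects each token independently with probability `mask_prob` (BERT selects 15% of the tokens in a sequence) and then applies the 80% / 10% / 10% replacement rule described in the BERT paper. The function name `mask_tokens` and the toy vocabulary are assumptions.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted, labels).

    labels[i] holds the original token at positions the model must predict
    and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)               # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                corrupted.append(tok)                # 10%: keep the original token
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels

tokens = ["the", "cat", "sat", "on", "the", "mat"]
toy_vocab = ["the", "cat", "dog", "sat", "ran", "on", "mat", "park"]
corrupted, labels = mask_tokens(tokens, toy_vocab, mask_prob=0.5, seed=1)
print(corrupted)
print(labels)
```

The loss is computed only at positions where `labels` is not `None`. Keeping or randomizing some selected tokens, instead of always inserting `[MASK]`, reduces the mismatch between pretraining and fine-tuning, where `[MASK]` never appears.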


# Getting BERT embeddings with Hugging Face Transformers
from transformers import BertTokenizer, BertModel
from torch.nn.functional import cosine_similarity
import torch

# Load the pretrained model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Prepare the texts
texts = [
    "The cat sat on the mat.",
    "The dog ran in the park."
]

# Tokenize and build the model inputs
inputs = tokenizer(texts, padding=True, truncation=True,
                   return_tensors='pt', max_length=512)

# Run BERT without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state: [batch_size, seq_len, hidden_size]
last_hidden = outputs.last_hidden_state
# pooler_output: [batch_size, hidden_size] - the [CLS] hidden state passed
# through an extra dense layer with tanh (not the raw [CLS] vector)
pooler_output = outputs.pooler_output

print(f"Last hidden state shape: {last_hidden.shape}")
print(f"Pooler output shape: {pooler_output.shape}")

# Take the raw [CLS] token embedding (often used as a sentence representation,
# although it is a weak one without task-specific fine-tuning)
cls_embeddings = last_hidden[:, 0, :]
print(f"CLS embeddings shape: {cls_embeddings.shape}")

# Compute the similarity between the two sentences
similarity = cosine_similarity(
    cls_embeddings[0].unsqueeze(0),
    cls_embeddings[1].unsqueeze(0)
)
print(f"Sentence similarity: {similarity.item():.4f}")

# Semantic search with sentence vectors
def semantic_search(query, candidates, model, tokenizer, top_k=3):
    """Rank candidates by cosine similarity to the query."""
    all_texts = [query] + candidates
    inputs = tokenizer(all_texts, padding=True, truncation=True,
                       return_tensors='pt', max_length=512)
    
    with torch.no_grad():
        outputs = model(**inputs)
    
    embeddings = outputs.pooler_output
    query_emb = embeddings[0:1]
    candidate_embs = embeddings[1:]
    
    # Cosine similarity of the query against every candidate;
    # broadcasting gives a 1-D tensor of shape [num_candidates]
    similarities = cosine_similarity(query_emb, candidate_embs)
    
    # Return the top_k results (capped at the number of candidates)
    k = min(top_k, len(candidates))
    top_indices = similarities.topk(k).indices.tolist()
    return [(candidates[i], similarities[i].item()) for i in top_indices]

# Try it out
candidates = [
    "A feline is resting on the carpet.",
    "The canine is playing outside.",
    "Birds are flying in the sky."
]

results = semantic_search("The cat sat on the mat.", candidates, model, tokenizer)
print("\nSemantic search results:")
for text, score in results:
    print(f"  {score:.4f}: {text}")

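🌟 Modern Embedding Techniques

Sentence-BERT (SBERT)

Fine-tunes BERT in a siamese network structure so that it produces high-quality sentence vectors that can be compared directly with cosine similarity. Well suited to semantic similarity tasks.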
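
SBERT's default way of turning per-token outputs into a single sentence vector is mean pooling over the non-padding tokens. The NumPy sketch below shows only that pooling step on hand-made arrays; the function name `mean_pool` is an assumption, and in practice the inputs would be a model's `last_hidden_state` and the tokenizer's attention mask.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sentence, ignoring padding positions.

    token_embeddings: [batch, seq_len, hidden]
    attention_mask:   [batch, seq_len], 1 for real tokens and 0 for padding
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Two sentences, three token positions, hidden size 2.
# The third position of the first sentence is padding and must be ignored.
token_embeddings = np.array([
    [[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]],
    [[2.0, 0.0], [4.0, 0.0], [6.0, 0.0]],
])
attention_mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])
sentence_vectors = mean_pool(token_embeddings, attention_mask)
print(sentence_vectors)
```

Without the mask, the padding vector in the first sentence would dominate the average; with it, each sentence vector depends only on real tokens.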

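Universal Sentence Encoder

A general-purpose sentence encoder from Google, available in multilingual variants, whose vectors can be used directly in downstream tasks.

CLIP (Contrastive Language-Image Pre-training)

OpenAI's multimodal model, which maps text and images into the same vector space and so enables cross-modal retrieval.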
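
The shared space is learned with a contrastive objective: within a batch of matching image-text pairs, each image should be most similar to its own caption and vice versa. The following NumPy sketch shows a symmetric loss of that kind on toy vectors. It is an illustration, not CLIP's code: the function names are hypothetical, and the temperature is fixed at 0.07 as an assumption, whereas CLIP learns it during training.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(logits: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along `axis`."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_style_loss(image_emb: np.ndarray, text_emb: np.ndarray,
                    temperature: float = 0.07) -> float:
    """Symmetric cross-entropy over the image-text similarity matrix.

    Row i of image_emb and row i of text_emb are assumed to be a matching pair.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature              # [n, n] scaled cosine similarities
    diag = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return float((loss_i2t + loss_t2i) / 2)

# Toy batch of three pairs: identical embeddings are perfectly aligned,
# while a shifted pairing matches every image with the wrong caption.
image_emb = np.eye(3)
aligned_loss = clip_style_loss(image_emb, np.eye(3))
shuffled_loss = clip_style_loss(image_emb, np.eye(3)[[1, 2, 0]])
print(aligned_loss, shuffled_loss)
```

Once trained this way, retrieval in either direction reduces to a nearest-neighbour lookup in the shared space.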

Text Embedding Ada-002

An efficient text embedding API from OpenAI that returns 1536-dimensional vectors and performs well on many benchmarks.

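📊 Current State and Trends

🎯 Mainstream Options Today

Model	Dimensions	Characteristics	Typical use
Word2Vec	100-300	Lightweight, fast	Traditional NLP tasks
GloVe	100-300	Global corpus statistics	Word similarity tasks
BERT	768-1024	Context-dependent	General NLP tasks
Sentence-BERT	384-768	Optimized for sentence vectors	Semantic search, similarity
text-embedding-ada-002	1536	Hosted API, high quality	General-purpose embedding
CLIP	512-768	Multimodal	Image-text retrieval, generation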

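🚀 Trends

Larger scale - from GPT-3's 175 billion parameters to GPT-4, whose size is undisclosed but widely reported (unconfirmed) to be on the order of a trillion parameters
Multimodal fusion - unified representations for text, images, audio, and video
Efficient inference - quantization, distillation, and pruning to reduce compute cost
Domain adaptation - specialized embeddings for verticals such as medicine, law, and finance
Interpretability - understanding the semantic structure of embedding spaces
Vector databases - dedicated storage and retrieval systems such as Milvus and Pinecone, and similarity-search libraries such as FAISS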
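
As one illustration of the efficiency trend, quantization can also be applied to stored embedding vectors, not only to model weights. The sketch below uses simple symmetric int8 scalar quantization with one scale per vector; this is an assumed scheme chosen for brevity, the function names are hypothetical, and production systems often use more elaborate methods such as product quantization.

```python
import numpy as np

def quantize_int8(vectors: np.ndarray):
    """Symmetric per-vector quantization: float32 -> int8 plus one float32 scale per row."""
    scale = np.abs(vectors).max(axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale).astype(np.float32)  # guard all-zero rows
    q = np.round(vectors / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Approximate reconstruction of the original float vectors."""
    return q.astype(np.float32) * scale

def cosine_rows(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity between two equally shaped matrices."""
    return (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))

# Random vectors stand in for real embeddings (1000 vectors, 384 dimensions).
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 384)).astype(np.float32)

q, scale = quantize_int8(emb)
restored = dequantize(q, scale)

original_bytes = emb.nbytes
quantized_bytes = q.nbytes + scale.nbytes
min_cosine = float(cosine_rows(emb, restored).min())
print(original_bytes, quantized_bytes, min_cosine)
```

Storage drops to roughly a quarter of the float32 size, while each restored vector stays almost parallel to its original, so similarity rankings are largely preserved.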

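💼 Real-World Applications

Search engines - Google and Bing use embeddings to improve semantic understanding
Recommender systems - YouTube and Netflix represent users and content with embeddings
Customer service bots - semantic matching to understand user questions
Content moderation - detecting content similar to known violations
Knowledge graphs - entity linking and relation extraction
RAG systems - retrieval-augmented generation, which combines vector retrieval with a large language model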
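
The retrieval half of a RAG system can be sketched as: embed the documents, embed the query, take the nearest documents, and place them in the prompt. In the example below `toy_embed` is a hypothetical stand-in for a real embedding model (a normalized bag-of-words vector), the prompt wording is an assumption, and the final call to a language model is left out.

```python
import numpy as np

def build_vocab(texts):
    """Map every distinct lowercase word in `texts` to an index."""
    words = sorted({w for t in texts for w in t.lower().split()})
    return {w: i for i, w in enumerate(words)}

def toy_embed(text, vocab):
    """Hypothetical stand-in for an embedding model: L2-normalized word counts."""
    vec = np.zeros(len(vocab))
    for w in text.lower().split():
        if w in vocab:
            vec[vocab[w]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def retrieve(query, docs, vocab, top_k=2):
    """Return the top_k (document, score) pairs by cosine similarity."""
    doc_matrix = np.stack([toy_embed(d, vocab) for d in docs])
    scores = doc_matrix @ toy_embed(query, vocab)
    order = np.argsort(-scores)[:top_k]
    return [(docs[i], float(scores[i])) for i in order]

def build_prompt(query, passages):
    """Assemble the retrieved passages and the question into one prompt string."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

docs = [
    "faiss is a library for vector similarity search",
    "word2vec learns word vectors from context windows",
    "netflix recommends films with user and item embeddings",
]
query = "which library does vector similarity search"

vocab = build_vocab(docs)
hits = retrieve(query, docs, vocab, top_k=1)
prompt = build_prompt(query, [text for text, _ in hits])
print(prompt)
```

A real system would swap `toy_embed` for a learned embedding model and the brute-force matrix product for an approximate nearest-neighbour index, but the data flow stays the same.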