Animateboot Development Diary

December 15, 2025

Animateboot.com Development Diary

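Background:

I like to listen to music on web-based music sites while slacking off at work.
Some companies block the NetEase Cloud Music domain outright, so it cannot be reached from a work computer. I needed a music platform other than NetEase Cloud Music.
I wanted one place that collects anime songs and links straight to them, so I don't have to copy a title and search for it myself.
In November I happened to change jobs and had more free time, so I started building it.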

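Preparation: collecting the required data and checking feasibility

1. Anime data for each season: scraped with a crawler.
2. Songs for each season's anime: a second crawler.
3. Links for each song: I asked Gemini, which provided code for YouTube, NetEase Cloud Music, and Spotify.
4. UI design: designed by Gemini. I don't know front-end development, so I built it locally with AI coding tools.

Halfway through I realized I could also build playlists, which added the following requirements.
5. Rank the anime. Crawlers scrape Chinese, Japanese, and US sites, and the ranking is computed from viewer counts.
6. Based on the ranking, keep only the songs from the top 50% of anime.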
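
The ranking and filtering in items 5 and 6 can be sketched roughly as follows. This is a minimal illustration rather than the site's actual code: the helpers `rank_by_viewers` and `top_half`, the site keys, and the sample numbers are all made up, and it assumes the viewer counts from the three sites are simply summed.

```python
from typing import Dict, List


def rank_by_viewers(viewers: Dict[str, Dict[str, int]]) -> List[str]:
    """Return anime titles sorted by total viewers across sites, highest first."""
    totals = {title: sum(per_site.values()) for title, per_site in viewers.items()}
    return sorted(totals, key=lambda title: totals[title], reverse=True)


def top_half(ranked: List[str]) -> List[str]:
    """Keep the top 50% of a ranked list (rounding up for odd lengths)."""
    keep = (len(ranked) + 1) // 2
    return ranked[:keep]


# Placeholder data: titles, site keys, and counts are invented for illustration.
sample = {
    "anime_a": {"cn": 120, "jp": 300, "us": 80},
    "anime_b": {"cn": 50, "jp": 40, "us": 10},
    "anime_c": {"cn": 500, "jp": 200, "us": 90},
    "anime_d": {"cn": 5, "jp": 8, "us": 2},
}
ranked = rank_by_viewers(sample)
selected = top_half(ranked)
```

Only the songs belonging to the titles in `selected` would then go into the playlist. A real version would probably need to normalize counts per site, since raw numbers from different sites are not directly comparable.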

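Technical notes

Semantic similarity matching

Scraping the seasonal anime list and scraping the anime songs are two separate crawlers. The two can return different names for the same anime, and an anime may appear in crawler a's output but not in crawler b's.

Taking crawler a's data as the reference, I need to find the corresponding anime in crawler b's data, which means scoring every anime in b for similarity.

The same technique is used in RAG: load an embedding model, convert the text to vectors, and compute cosine similarity.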
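
For reference, the last step (cosine similarity) is just the dot product of two vectors divided by the product of their lengths. Below is a minimal pure-Python sketch of that formula. In the actual pipeline the vectors come from the embedding model and the score is computed by `util.cos_sim`; the function name `cosine_similarity` here is only illustrative.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)
```

The score ranges from -1 to 1: vectors pointing the same way score 1 regardless of length, and unrelated (orthogonal) vectors score 0.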

To improve accuracy, I compute the similarity across several language combinations and keep the best one:

score1 = util.cos_sim(emb_a_jp, emb_b_jp).item()  # Japanese vs Japanese
score2 = util.cos_sim(emb_a_name, emb_b_name).item()  # Chinese vs Chinese
score3 = util.cos_sim(emb_a_jp, emb_b_name).item()  # Japanese vs Chinese
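
One way to combine the three scores, sketched under assumptions: take the maximum of the three for each candidate in b, then accept the best candidate only if it clears a threshold. The 0.7 cutoff is the one used in the pairwise comparison function in this post; the helper names `best_score` and `pick_match` are hypothetical, and the project's real selection logic is not shown in the source.

```python
from typing import List, Optional, Tuple

THRESHOLD = 0.7  # same cutoff as the pairwise comparison in this post


def best_score(score1: float, score2: float, score3: float) -> float:
    """Take the most favorable of the JP/JP, CN/CN, and JP/CN scores."""
    return max(score1, score2, score3)


def pick_match(candidates: List[Tuple[str, float, float, float]]) -> Optional[str]:
    """Return the name of the best candidate in b, or None if none clears THRESHOLD."""
    best_name: Optional[str] = None
    best = THRESHOLD
    for name, s1, s2, s3 in candidates:
        score = best_score(s1, s2, s3)
        if score > best:
            best_name, best = name, score
    return best_name
```

Returning None covers the case where an anime exists in crawler a's data but has no counterpart in crawler b's.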

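Gemini suggested one optimization here: encode everything into vectors up front instead of recomputing the vectors on every comparison.

For example, the version below encodes on every call, so each title in b gets encoded many times and the run is very slow.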

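# Assumption: `model` is a SentenceTransformer instance loaded elsewhere.
from sentence_transformers import util


def cmp_s1_s2(s1, s2):
    if s2 is None:
        return False
    # Both strings are re-encoded on every call, which is the bottleneck.
    embedding1 = model.encode(s1, convert_to_tensor=True)
    embedding2 = model.encode(s2, convert_to_tensor=True)
    cosine_score = util.cos_sim(embedding1, embedding2).item()
    # Treat the pair as the same anime when the score exceeds 0.7.
    return cosine_score > 0.7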

The fix is to pass in both collections and compute each vector only once.

# Parts of the code are omitted
def cmp_json_main(json_a, json_b):
    embeddings_a_name, embeddings_a_jp, embeddings_b_name, embeddings_b_jp = init(json_a, json_b)

    for i in range(len(json_a)):
        emb_a_name = embeddings_a_name[i]
        emb_a_jp = embeddings_a_jp[i]
        ...
        for j in range(len(json_b)):
            emb_b_name = embeddings_b_name[j]
            emb_b_jp = embeddings_b_jp[j]
            ...
            score1 = util.cos_sim(emb_a_jp, emb_b_jp).item()  # Japanese vs Japanese
            score2 = util.cos_sim(emb_a_name, emb_b_name).item()  # Chinese vs Chinese
            score3 = util.cos_sim(emb_a_jp, emb_b_name).item()  # Japanese vs Chinese
            ...

def init(json_a, json_b):
    print("Encoding all names in batch (this may take a few seconds)...")
    start_time = time.time()
    # Collect every name from json_a
    a_names = [item.get('name', '') for item in json_a]
    a_jp_names = [item.get('jp_name', '') for item in json_a]

    # Collect every name from json_b
    b_names = [item.get('name', '') for item in json_b]
    b_jp_names = [item.get('jp_name', '') for item in json_b]

    # Each list is encoded exactly once
    embeddings_a_name = model.encode(a_names, convert_to_tensor=True)
    embeddings_a_jp = model.encode(a_jp_names, convert_to_tensor=True)

    embeddings_b_name = model.encode(b_names, convert_to_tensor=True)
    embeddings_b_jp = model.encode(b_jp_names, convert_to_tensor=True)
    end_time = time.time()
    print(f"Encoding done, cost {end_time - start_time}s, starting comparison...")
    return embeddings_a_name, embeddings_a_jp, embeddings_b_name, embeddings_b_jp

The speedup was substantial: this step used to take roughly two minutes, and after the change it takes ten-odd seconds.
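
The difference comes down to how many texts the encoder has to process. Here is a rough sketch that only counts encoder work, using a stand-in encoder instead of the real model (`CountingEncoder`, `pairwise_encode`, and `batch_encode` are made-up names for illustration):

```python
from typing import List


class CountingEncoder:
    """Stand-in for an embedding model: counts how many texts it encodes."""

    def __init__(self) -> None:
        self.texts_encoded = 0

    def encode(self, texts: List[str]) -> List[List[float]]:
        self.texts_encoded += len(texts)
        # A fake one-dimensional "embedding" so the example stays self-contained.
        return [[float(len(t))] for t in texts]


def pairwise_encode(encoder: CountingEncoder, a: List[str], b: List[str]) -> None:
    """Encode both titles again for every (a, b) pair, like the slow version."""
    for s1 in a:
        for s2 in b:
            encoder.encode([s1])
            encoder.encode([s2])


def batch_encode(encoder: CountingEncoder, a: List[str], b: List[str]) -> None:
    """Encode each list once up front, like the optimized version."""
    encoder.encode(a)
    encoder.encode(b)


a_titles = ["a1", "a2", "a3"]
b_titles = ["b1", "b2", "b3", "b4"]

slow = CountingEncoder()
pairwise_encode(slow, a_titles, b_titles)

fast = CountingEncoder()
batch_encode(fast, a_titles, b_titles)
```

With n titles in a and m titles in b, the pairwise version encodes 2 * n * m texts while the batch version encodes n + m. For the toy lists above that is 24 versus 7, and the gap widens quickly as the lists grow.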

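Fetching the songs

This also came from asking Gemini. Only Spotify offers an official API; the others rely on open-source packages built by reverse engineering. The sample code Gemini gave ran as-is after copy and paste.

The songs that come back are not always the right ones, so manual checking is still needed.

Later I tried bringing in an LLM to improve matching accuracy. This works well for NetEase Cloud Music, whose own search engine is not very accurate. YouTube and Spotify already have strong search and rarely return the wrong result, so the LLM adds little there.

It is also slow, because retries are needed and every retry means more calls to the LLM: 140 songs take about 20 minutes. This could be optimized by cutting the number of LLM round trips, but I couldn't be bothered to debug it, and it works well enough.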

import json
import os
from typing import Any, Dict, List

import requests


def deepseek_match(original_text: str, cand: Dict[str, Any], anime_names: List[str]) -> bool:
    # The API key is read from the DEEPSEEK environment variable; without it nothing matches.
    api_key = os.environ.get('DEEPSEEK')
    if not api_key:
        return False
    model = 'deepseek-chat'
    system = 'You are a strict music matcher. Return JSON {"is_same": true|false} only.'
    cand_title = cand.get('name') or ''
    cand_artists = ', '.join(cand.get('artists') or [])
    cand_alias = cand.get('alias') or []
    # Adjacent string literals are concatenated, so each sentence ends with a space.
    prompt = (
        'Determine if the candidate song matches the intended song described by the original text. '
        'In original text, the song name is inside the parentheses, and the artist name is outside. '
        'Consider candidate song name and artists, they must both match. '
        'Some artists\' names may be converted to Roman sounds, but they are essentially the same person. '
        'Additionally, also consider candidate aliases; '
        'if alias contains anime Chinese names or Japanese names, treat as match. Return strictly JSON with key is_same.\n'
        f'Original text: {original_text}\n'
        f'Anime names (CN/JP): {json.dumps([n for n in anime_names if n], ensure_ascii=False)}\n'
        f'Candidate title: {cand_title}\n'
        f'Candidate artist: {cand_artists}\n'
        f'Candidate aliases: {json.dumps(cand_alias, ensure_ascii=False)}\n'
    )
    try:
        r = requests.post(
            'https://api.deepseek.com/v1/chat/completions',
            headers={
                'Authorization': f'Bearer {api_key}',
                'Content-Type': 'application/json'
            },
            json={
                'model': model,
                'messages': [
                    {'role': 'system', 'content': system},
                    {'role': 'user', 'content': prompt}
                ],
                'temperature': 0
            },
            timeout=30
        )
        r.raise_for_status()
        data = r.json()
        content = data.get('choices', [{}])[0].get('message', {}).get('content', '')
        try:
            parsed = json.loads(content)
            print(f"deepseek_answer: {parsed}")
            return bool(parsed.get('is_same'))
        except Exception:
            return False
    except Exception:
        return False
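
On the earlier point about cutting LLM round trips: one option, not implemented in the project, would be to cache verdicts so that a retry never asks again about a candidate it has already judged. The sketch below assumes a matcher callable with the same shape as `deepseek_match`. `CachedMatcher` and the `fake_matcher` used in the usage lines are hypothetical.

```python
import json
from typing import Any, Callable, Dict, List

Matcher = Callable[[str, Dict[str, Any], List[str]], bool]


class CachedMatcher:
    """Wrap a matcher so identical (text, candidate, anime names) queries run once."""

    def __init__(self, matcher: Matcher) -> None:
        self._matcher = matcher
        self._cache: Dict[str, bool] = {}
        self.calls = 0  # number of real matcher calls, for inspection

    def __call__(self, original_text: str, cand: Dict[str, Any], anime_names: List[str]) -> bool:
        # Serialize the inputs into a stable string key.
        key = json.dumps([original_text, cand, anime_names], sort_keys=True, ensure_ascii=False)
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._matcher(original_text, cand, anime_names)
        return self._cache[key]


def fake_matcher(original_text: str, cand: Dict[str, Any], anime_names: List[str]) -> bool:
    """Stand-in for the LLM call: matches when the candidate title appears in the text."""
    return (cand.get("name") or "") in original_text


cached = CachedMatcher(fake_matcher)
cand = {"name": "Song A", "artists": ["Artist X"], "alias": []}
first = cached("Artist X (Song A)", cand, ["anime"])
second = cached("Artist X (Song A)", cand, ["anime"])  # served from the cache
```

One caveat: `deepseek_match` returns False on network or parsing errors as well as on a genuine mismatch, so a real version would need to tell "not a match" apart from "call failed" before caching the result.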