🍊 Latent Atlas 🍉

Deduplicating Training Data Makes Language Models Better

May 31, 2026 · 1 min read

  • source
  • paper
  • deduplication
  • pretraining-data
  • memorization

Basic information

  • Title: Deduplicating Training Data Makes Language Models Better
  • Source type: paper
  • Related topic notes: Deduplication, Data Engineering, Evaluation

TODO

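  • Read the original paper and summarize how exact and near deduplication affect language model training, memorization, and generalization.
  • Fill in how duplicated data changes validation loss, benchmark contamination, and memorization in generated text.
  • Add the practical limits of deduplication: granularity, thresholds, and side effects.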
