An LLM-Based System for Automated Batch Generation of Literature Summaries

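Project Background

In day-to-day research we often have to read a large number of papers, and that involves a screening step: we want to find out quickly what a paper covers, judge whether it fits our own research direction, and decide whether it deserves a close reading. Reading the abstract can serve this purpose, but an abstract does not always cover the topic of every section, so it may not provide enough information, and an English-only abstract is not friendly to researchers who are new to the field.

Based on Alibaba's Tongyi large model (qwen-turbo-2024-11-01), I built an automated literature-processing system for this task. It reads English-language papers in bulk and outputs a summary of each paper as a PDF. The summaries are well organized and clear, with a total length of no more than two A4 pages.

Researchers can download papers on a given topic in bulk, run them through this system to get Chinese summaries, and then pick the suitable ones to read with an LLM-based paper-reading plugin (or on their own). This improves paper-reading efficiency to some extent.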

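Development Approach

PDF Text Extraction Module (extract_useful_text_from_pdf)

The first step is parsing the PDF. Because large models parse PDFs slowly and with low accuracy, I decided to parse each PDF into text myself and feed the text to the model. Parsing uses the PyPDF2 library: pages are processed one at a time, and the first and last lines of each page are dropped to remove headers and footers. Within each page the remaining lines are kept in order, so the paragraph structure that PyPDF2 extracts is preserved, and the resulting plain text gives the later steps a clean input.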
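The first-and-last-line heuristic can be sketched on plain strings, without a PDF library. The helper name `strip_header_footer` and the sample page text are hypothetical. The rule itself (drop the first and last line only when a page has more than two lines) follows the extraction function in the source code.

```python
def strip_header_footer(page_text: str) -> str:
    """Drop the first and last line of a page, assuming they hold the header and footer."""
    lines = page_text.splitlines()
    # Very short pages are kept whole so that real content is not thrown away.
    if len(lines) > 2:
        lines = lines[1:-1]
    return "\n".join(lines)


# Hypothetical page text: a running header, two body lines, and a page number.
sample_page = "Journal of Examples, Vol. 1\nDeep models are large.\nThey need data.\n12"
print(strip_header_footer(sample_page))
```

The heuristic is cheap but blunt: on a page without a header or footer it removes two lines of real content. For a summary-level task that loss is usually acceptable.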

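Text Chunking Module (split_into_chunks)

Papers are often long, and after tokenization they may exceed the model's input limit (or the context-length limit may leave too little room for the output). I therefore split each paper into chunks, counting tokens with the cl100k_base encoding, and send the chunks to the Tongyi model one at a time. Note that cl100k_base is an OpenAI encoding rather than a model, so its token counts are only an approximation for Qwen. The chunking, built on tiktoken's cl100k_base encoder, has the following properties:

Paragraph integrity: natural paragraphs are the smallest unit, so chunking avoids splitting a logical unit of meaning.
Dynamic chunking: each chunk is capped at 8000 tokens (leaving headroom for the API), and a paragraph longer than 80% of that cap is split again at sentence boundaries.
Fault tolerance: a regular expression handles different line-ending formats so that splitting stays stable.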
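The greedy packing described above can be sketched as follows. This is a minimal illustration, not the author's implementation. `pack_paragraphs` and `word_count` are hypothetical names, and `word_count` is a stand-in that counts whitespace-separated words so the example runs without downloading a tokenizer. The sketch leaves out the secondary split, so an oversized paragraph simply becomes its own chunk.

```python
import re
from typing import Callable, List


def pack_paragraphs(text: str, max_tokens: int,
                    count_tokens: Callable[[str], int]) -> List[str]:
    """Greedily pack whole paragraphs into chunks of at most max_tokens tokens."""
    # Blank lines separate paragraphs; \r? tolerates both LF and CRLF endings.
    paragraphs = [p for p in re.split(r"\r?\n\s*\r?\n", text.strip()) if p.strip()]
    chunks, current, used = [], [], 0
    for para in paragraphs:
        n = count_tokens(para)
        # Start a new chunk when this paragraph would overflow the current one.
        if current and used + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(para)
        used += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def word_count(s: str) -> int:
    # Stand-in counter (assumption): whitespace-separated words, not real tokens.
    return len(s.split())


demo = "one two three\n\nfour five\r\n\r\nsix seven eight nine"
print(pack_paragraphs(demo, max_tokens=5, count_tokens=word_count))
# ['one two three\n\nfour five', 'six seven eight nine']
```

To count real tokens, `count_tokens` could be replaced with `lambda s: len(tiktoken.get_encoding("cl100k_base").encode(s))`.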

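Summary Generation Module (summarize_document)

This module combines chunked processing with AI summarization:

Per-chunk summaries: chunks are sent to the model one request at a time, and each request returns a Markdown summary with a heading hierarchy.
Two-pass check: after the first pass the length of the combined summary is checked automatically, and a second refining pass runs if it exceeds the limit.
Error handling: when the API call for a chunk fails, the error is printed and processing continues. The printed error can then be looked up in Alibaba's official model documentation for help.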
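The control flow of this module (summarize each chunk, tolerate individual failures, refine only when the result is too long) can be sketched independently of any API. In this sketch `call_model` is a hypothetical callable that stands in for the real chat-completion request, `fake_model` is a demo stub, and the English prompts are placeholders rather than the prompts the script actually sends.

```python
from typing import Callable, List


def summarize_chunks(chunks: List[str], call_model: Callable[[str], str],
                     max_chars: int) -> str:
    """Summarize each chunk, then condense the joined result if it is too long."""
    parts = []
    for i, chunk in enumerate(chunks, start=1):
        try:
            parts.append(call_model(f"Summarize part {i}:\n{chunk}"))
        except Exception as exc:
            # Record the failure and keep going so one bad chunk does not stop the batch.
            print(f"part {i} failed: {exc}")
            parts.append(f"## Part {i}: summary failed")
    merged = "\n\n".join(parts)
    if len(merged) <= max_chars:
        return merged
    try:
        return call_model(f"Condense these summaries:\n{merged}")
    except Exception:
        # Fall back to the unrefined summaries rather than losing everything.
        return merged


def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for the real API call, used only for this demo.
    if "boom" in prompt:
        raise RuntimeError("simulated API error")
    return "# " + prompt.splitlines()[0]


print(summarize_chunks(["alpha", "boom", "gamma"], fake_model, max_chars=1000))
```

Passing the model call in as a parameter keeps the flow testable offline. In real use it would wrap the `client.chat.completions.create` request.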

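Format Conversion Module (convert_markdown_to_pdf)

Format conversion is built on pandoc. Its key features are:

Chinese typesetting: the xelatex engine is configured with the Microsoft YaHei font so that Chinese text renders correctly in the PDF.
Temporary file management: a write, convert, clean-up sequence avoids leaving stray files behind.
Path validation: the output directory is checked automatically, and an error is reported if something goes wrong.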
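The write, convert, clean-up pattern can be sketched as below. The helper names `build_pandoc_command` and `convert_with_cleanup` are hypothetical, and the `run` parameter is an assumption introduced so that the example works without pandoc installed. The pandoc options themselves (`-f`, `-o`, `--pdf-engine`, `-V CJKmainfont=...`) are the ones the script uses.

```python
import os
import tempfile
from typing import Callable, List


def build_pandoc_command(md_path: str, pdf_path: str,
                         cjk_font: str = "Microsoft YaHei") -> List[str]:
    """Return the pandoc argument list for a Markdown-to-PDF conversion via xelatex."""
    return ["pandoc", md_path, "-f", "markdown", "-o", pdf_path,
            "--pdf-engine=xelatex", "-V", f"CJKmainfont={cjk_font}"]


def convert_with_cleanup(markdown_text: str, pdf_path: str,
                         run: Callable[[List[str]], None]) -> None:
    """Write the Markdown to a temporary file, convert it, and always delete the file."""
    fd, md_path = tempfile.mkstemp(suffix=".md")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(markdown_text)
        run(build_pandoc_command(md_path, pdf_path))
    finally:
        os.remove(md_path)  # runs on success and on failure


seen = []
# seen.append stands in for the real runner and just records the command.
convert_with_cleanup("# Title", "out.pdf", run=seen.append)
print(seen[0][0], os.path.exists(seen[0][1]))
```

In real use, `run` could be `lambda cmd: subprocess.run(cmd, check=True)`. Using `tempfile.mkstemp` avoids depending on a fixed directory that may not exist on another machine.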

Source Code

import os
import subprocess
import PyPDF2
import re
import tiktoken
from openai import OpenAI

def extract_useful_text_from_pdf(pdf_path):
    """Extract text page by page, dropping the first and last line of each page."""
    all_text = []
    try:
        with open(pdf_path, "rb") as f:
            reader = PyPDF2.PdfReader(f)
            for page in reader.pages:
                text = page.extract_text()
                if text:
                    lines = text.splitlines()
                    # Treat the first and last lines as header/footer; keep short pages whole.
                    if len(lines) > 2:
                        lines = lines[1:-1]
                    page_text = "\n".join(lines)
                    all_text.append(page_text)
    except Exception as e:
        print(f"Failed to read PDF: {e}")
    return "\n".join(all_text)

def split_into_chunks(text, max_tokens=8000):
    """Split text into chunks of at most max_tokens tokens, keeping paragraphs whole where possible."""
    tokenizer = tiktoken.get_encoding("cl100k_base")

    # Paragraphs are separated by blank lines (LF or CRLF line endings).
    paragraphs = re.split(r'\n\s*\n|\r\n\s*\r\n', text.strip())

    chunks = []
    current_chunk = []
    current_token_count = 0

    for para in paragraphs:
        if not para.strip():
            continue

        para_tokens = len(tokenizer.encode(para))

        if para_tokens > max_tokens * 0.8:
            # Overlong paragraph: split again after sentence-ending punctuation.
            # English marks are included because the input papers are in English.
            sub_paras = re.split(r'(?<=[.!?;。！？；])\s+', para)
            for sub_para in sub_paras:
                sub_tokens = len(tokenizer.encode(sub_para))
                # Flush only a non-empty chunk, so no empty chunk is ever emitted.
                if current_chunk and current_token_count + sub_tokens > max_tokens:
                    chunks.append("\n\n".join(current_chunk))
                    current_chunk = [sub_para]
                    current_token_count = sub_tokens
                else:
                    current_chunk.append(sub_para)
                    current_token_count += sub_tokens
        else:
            if current_chunk and current_token_count + para_tokens > max_tokens:
                chunks.append("\n\n".join(current_chunk))
                current_chunk = [para]
                current_token_count = para_tokens
            else:
                current_chunk.append(para)
                current_token_count += para_tokens

    if current_chunk:
        chunks.append("\n\n".join(current_chunk))

    return chunks

def summarize_document(text, area, api_key):
    """Summarize each chunk with the model, then condense the result if it is too long."""
    CHUNK_MAX_TOKENS = 8000
    SUMMARY_MAX_TOKENS = 8192

    chunks = split_into_chunks(text, max_tokens=CHUNK_MAX_TOKENS)

    all_summaries = []
    client = OpenAI(
        api_key=api_key,
        base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
    )

    for i, chunk in enumerate(chunks):
        try:
            # The prompt stays in Chinese because the summary must be in Chinese. It asks for
            # Markdown source only, Chinese output, and a multi-level heading outline with a
            # short summary under each heading. Only the current chunk is appended.
            prompt = (
                f"请阅读以下文献的第{i + 1}部分，用中文返回文献框架的markdown源代码，并且每个标题下要把文章中这部分内容做一个简单的概括。 请一定注意：\n"
                "1.除了markdown源代码之外，任何内容都不要输出。\n"
                "2.输出语言必须为中文。【非常重要！】\n"
                "3.用多级的逻辑结构（一级标题、二级标题甚至三级标题）来总结原文的框架，尽可能丰满你的框架\n" + chunk
            )

            response = client.chat.completions.create(
                model="qwen-turbo-2024-11-01",
                messages=[
                    # System prompt: a literature-analysis assistant for the user's field.
                    {"role": "system",
                     "content": f"你是一个{area}领域的文献分析助手，擅长从长文档中提取结构化框架并生成技术性摘要。"},
                    {"role": "user", "content": prompt}
                ],
                temperature=0.4,
                top_p=0.7,
                max_tokens=SUMMARY_MAX_TOKENS,
                frequency_penalty=0.5
            )

            summary = response.choices[0].message.content.strip()
            all_summaries.append(summary)

        except Exception as e:
            print(f"Part {i + 1} failed: {e}")
            # Placeholder heading (in Chinese, like the rest of the summary) for the failed part.
            all_summaries.append(f"## 第{i + 1}部分摘要生成失败\n")

    final_summary = "\n\n".join(all_summaries)

    # Character count is used as a rough proxy for length; it is not a token count.
    if len(final_summary) > SUMMARY_MAX_TOKENS * 2:
        try:
            response = client.chat.completions.create(
                model="qwen-turbo-2024-11-01",
                messages=[
                    # Second pass: merge the per-chunk summaries into one Markdown outline.
                    {"role": "system", "content": "你是一个摘要精炼专家，擅长将多个章节摘要整合成连贯的完整文档摘要"},
                    {"role": "user", "content": f"请将以下分块摘要整合为完整的文献框架，你只能输出一段完整的markdown代码，其他的都不要输出：\n{final_summary}"}
                ],
                temperature=0.3,
                top_p=0.8,
                max_tokens=SUMMARY_MAX_TOKENS
            )
            return response.choices[0].message.content.strip()
        except Exception as e:
            # Fall back to the unrefined summary instead of silently swallowing the error.
            print(f"Refinement failed, using the unrefined summary: {e}")
            return final_summary

    return final_summary

def clean_markdown(md_text):
    """Strip a code fence that wraps the whole text, if there is one."""
    if md_text.startswith("```") and md_text.rstrip().endswith("```"):
        lines = md_text.splitlines()
        return "\n".join(lines[1:-1])
    return md_text

def convert_markdown_to_pdf(markdown_text, output_path):
    """Write the Markdown to a temporary file and convert it to PDF with pandoc."""
    pandoc = find_pandoc()
    if pandoc is None:
        raise FileNotFoundError("pandoc not found; install it or add it to PATH")

    # Keep the temporary file next to the output so no fixed directory is required.
    temp_md_path = os.path.splitext(output_path)[0] + ".tmp.md"

    markdown_text = clean_markdown(markdown_text)

    try:
        with open(temp_md_path, "w", encoding="utf-8") as f:
            f.write(markdown_text)
        print(f"Temporary Markdown file created: {os.path.abspath(temp_md_path)}")

        # Arguments are passed as a list without shell=True, so paths with spaces work.
        subprocess.run(
            [
                pandoc,
                temp_md_path,
                "-f", "markdown",
                "-o", output_path,
                "--pdf-engine=xelatex",
                "-V", "CJKmainfont=Microsoft YaHei"
            ],
            check=True,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
            encoding='utf-8',
        )
        print(f"PDF written to: {os.path.abspath(output_path)}")

    except subprocess.CalledProcessError as e:
        print(f"Conversion failed, details:\n{e.stderr}")
        raise
    except Exception as e:
        print(f"Operation failed: {str(e)}")
        raise
    finally:
        if os.path.exists(temp_md_path):
            os.remove(temp_md_path)


def find_pandoc():
    """Return a usable pandoc command or path, or None if pandoc cannot be found."""
    try:
        subprocess.run(["pandoc", "--version"], check=True, stdout=subprocess.PIPE)
        return "pandoc"
    except FileNotFoundError:
        # Not on PATH: try the usual Windows install locations.
        common_paths = [
            r"C:\Program Files\Pandoc\pandoc.exe",
            r"C:\Users\{}\AppData\Local\Pandoc\pandoc.exe".format(os.getenv("USERNAME"))
        ]
        for path in common_paths:
            if os.path.exists(path):
                return path
        return None

def main():
    area = input("Enter your research field: ")
    api_key = input("Enter your API key (apply at https://bailian.console.aliyun.com/?apiKey=1#/api-key ): ")
    essay_folder = input("Enter the path of the folder containing the source papers: ")
    summary_folder = input("Enter the path of the folder where summaries should be saved: ")

    os.makedirs(summary_folder, exist_ok=True)

    for filename in os.listdir(essay_folder):
        if filename.lower().endswith(".pdf"):
            pdf_path = os.path.join(essay_folder, filename)
            print(f"Processing: {pdf_path}")

            text = extract_useful_text_from_pdf(pdf_path)
            if not text.strip():
                print("No usable text extracted; skipping this file.")
                continue
            # Pass the field and API key explicitly; they are not visible inside the function otherwise.
            markdown_summary = summarize_document(text, area, api_key)

            output_pdf = os.path.join(summary_folder, f"summary_{os.path.splitext(filename)[0]}.pdf")
            convert_markdown_to_pdf(markdown_summary, output_pdf)
            print(f"Summary file generated: {output_pdf}")

if __name__ == "__main__":
    main()

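Usage

Install Python and the third-party packages PyPDF2, tiktoken, and openai. The os, re, and subprocess modules are part of the standard library and need no installation.
Install pandoc and add it to the system PATH. PDF output through the xelatex engine also requires a TeX distribution that provides XeLaTeX.
Apply for an Alibaba API key at: https://bailian.console.aliyun.com/?apiKey=1#/api-key
Create a folder containing the source papers (in PDF format) that you want to process.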
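Before the first run, the prerequisites above can be checked with a few lines of Python. This is an optional sketch, and `missing_modules` and `pandoc_available` are hypothetical helper names. It relies only on the standard-library functions `importlib.util.find_spec` and `shutil.which`.

```python
import importlib.util
import shutil
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be imported in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def pandoc_available() -> bool:
    """True when a pandoc executable can be found on PATH."""
    return shutil.which("pandoc") is not None


# Import names of the three third-party dependencies used by the script.
print("missing packages:", missing_modules(["PyPDF2", "tiktoken", "openai"]))
print("pandoc on PATH:", pandoc_available())
```

Any package reported as missing can be installed with pip under the same name.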

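Sample Output

In the Python terminal, the program output looks as follows:

In the configured summary output folder, the summary files appear as follows:

Opening one shows the content below. It is clearly layered and well structured, which helps researchers grasp a paper's content quickly.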

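Notes

This project is a small application that I built in half a day after learning a little about large-model technology. In use, formatting problems may arise when the summaries of multiple chunks are merged, leaving some Markdown source in the generated PDF. This will be fixed over time. There may also be other problems that I have not found yet.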
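One plausible cause of the leftover Markdown source (an assumption, not something confirmed here) is that each chunk's reply arrives wrapped in its own code fence, while clean_markdown only strips a fence that wraps the entire joined text. A minimal sketch of cleaning each reply before merging, using the hypothetical helper `strip_code_fence`:

```python
import re

# Matches a reply wrapped in one code fence, with an optional language tag.
_FENCE = re.compile(r"^```[^\n]*\n(.*?)\n?```\s*$", re.DOTALL)


def strip_code_fence(reply: str) -> str:
    """Remove one wrapping Markdown code fence from a model reply, if present."""
    match = _FENCE.match(reply.strip())
    return match.group(1) if match else reply.strip()


# Hypothetical replies: the first is fenced, the second is not.
replies = ["```markdown\n# Introduction\ntext\n```", "## Methods\ntext"]
merged = "\n\n".join(strip_code_fence(r) for r in replies)
print(merged)
```

Applied to each entry before the summaries are joined, this would keep fence markers out of the text handed to pandoc. Whether it resolves the issue in practice depends on what the model actually returns.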
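This document will be published on my personal blog first and later tidied up and posted to my GitHub. Readers are welcome to use this small program in their own research and to get in touch with me through the GitHub community.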