leaudit-platform-backend/docs/文档图片质量校验模块/文档图片质量校验模块第3版接口与落地清单.md

# Document Image Quality Check Module: Version 3 Interface and Implementation Checklist

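## 1. Goals and Boundaries

This module runs an "image sharpness warning check" on photographed images, scanned images, and image attachments in case-file documents.

The module's boundaries are fixed as follows:

- It is fully independent of the existing OCR extraction and main audit flow.
- Even when an image is detected as blurry, it must not block upload, OCR, or audit.
- Results are used only for warning display and record keeping, in three tiers:
  - `pass`: passed
  - `review`: possibly blurry, pending manual confirmation
  - `reject`: failed, retake required
- The first release prioritizes support for:
  - Image files
  - PDF / scanned PDF
  - Image attachments
  - PDF attachments
- For `doc/docx/wps`, the first release may degrade to "image N in the body" positioning and does not promise an exact page number.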

## 2. Suggested Code Locations

### 2.1 Backend Directories

Add the following directories and files:

- `fastapi_modules/fastapi_leaudit/domian/vo/imageQualityVo.py`
- `fastapi_modules/fastapi_leaudit/services/imageQualityService.py`
- `fastapi_modules/fastapi_leaudit/services/impl/imageQualityServiceImpl.py`
- `fastapi_modules/fastapi_leaudit/controllers/imageQualityController.py`
- `fastapi_modules/fastapi_leaudit/image_quality/tasks.py`
- `fastapi_modules/fastapi_leaudit/image_quality/runner.py`
- `fastapi_modules/fastapi_leaudit/image_quality/storage_adapter.py`
- `fastapi_modules/fastapi_leaudit/image_quality/input_resolver.py`
- `fastapi_modules/fastapi_leaudit/image_quality/extractors.py`
- `fastapi_modules/fastapi_leaudit/image_quality/detector.py`
- `fastapi_modules/fastapi_leaudit/image_quality/config_resolver.py`

### 2.2 Existing Files to Change

- `fastapi_modules/fastapi_leaudit/controllers/documentController.py`
- `fastapi_modules/fastapi_leaudit/services/documentService.py`
- `fastapi_modules/fastapi_leaudit/services/impl/documentServiceImpl.py`
- `fastapi_modules/fastapi_leaudit/domian/vo/documentVo.py`
- `fastapi_admin/config/_settings.py`
- `fastapi_admin/config/__init__.pyi`

## 3. Suggested New VOs

Following the style of `documentVo.py / auditVo.py / reviewPointVo.py` in the current repository, add a separate `imageQualityVo.py`.

### 3.1 Configuration VOs

```python
from pydantic import BaseModel, Field


class ImageQualityConfigItemVO(BaseModel):
    """Image quality check configuration item."""

    id: int = Field(..., description="Configuration ID")
    scopeType: str = Field(..., description="Scope type: global/doc_type/region/doc_type_region")
    documentTypeId: int | None = Field(None, description="Document type ID")
    region: str | None = Field(None, description="Region")
    enabled: bool = Field(..., description="Whether the check is enabled")
    warnThreshold: float | None = Field(None, description="Threshold for 'possibly blurry'")
    rejectThreshold: float | None = Field(None, description="Threshold for 'failed'")
    maxImagesPerDoc: int | None = Field(None, description="Maximum images checked per document")
    maxConcurrency: int | None = Field(None, description="Concurrency per task")
    updatedAt: str | None = Field(None, description="Last update time")


class ImageQualityConfigUpsertDTO(BaseModel):
    """Create/update request for an image quality check configuration."""

    scopeType: str = Field(..., description="Scope type")
    documentTypeId: int | None = Field(None, description="Document type ID")
    region: str | None = Field(None, description="Region")
    enabled: bool = Field(..., description="Whether the check is enabled")
    warnThreshold: float | None = Field(None, description="Threshold for 'possibly blurry'")
    rejectThreshold: float | None = Field(None, description="Threshold for 'failed'")
    maxImagesPerDoc: int | None = Field(None, description="Maximum images checked per document")
    maxConcurrency: int | None = Field(None, description="Concurrency per task")
```
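The `scopeType` values imply that several configuration rows can apply to the same document. The source does not state how conflicts are resolved, so the following is only a sketch under the assumption that the most specific scope wins (`doc_type_region`, then `doc_type`, then `region`, then `global`). `ConfigRow`, `SCOPE_PRIORITY`, and `resolve_config` are hypothetical names used for illustration, not part of the existing repository.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: a more specific scope overrides a less specific one.
SCOPE_PRIORITY = ["doc_type_region", "doc_type", "region", "global"]


@dataclass
class ConfigRow:
    """Hypothetical in-memory view of one configuration row."""

    scopeType: str
    enabled: bool
    documentTypeId: Optional[int] = None
    region: Optional[str] = None


def _matches(row: ConfigRow, document_type_id: Optional[int], region: Optional[str]) -> bool:
    """Return True when the row's scope applies to the given document."""
    type_ok = row.documentTypeId is not None and row.documentTypeId == document_type_id
    region_ok = row.region is not None and row.region == region
    if row.scopeType == "global":
        return True
    if row.scopeType == "doc_type":
        return type_ok
    if row.scopeType == "region":
        return region_ok
    if row.scopeType == "doc_type_region":
        return type_ok and region_ok
    return False


def resolve_config(rows, document_type_id, region):
    """Pick the first matching row, walking from most to least specific scope."""
    for scope in SCOPE_PRIORITY:
        for row in rows:
            if row.scopeType == scope and _matches(row, document_type_id, region):
                return row
    return None
```

If no row matches, the caller would presumably fall back to the defaults in `LeauditSettings`.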

### 3.2 Item and Summary VOs

```python
class ImageQualityItemVO(BaseModel):
    """Quality check detail for a single image."""

    itemId: int = Field(..., description="Item ID")
    runId: int = Field(..., description="Quality check run ID")
    documentId: int = Field(..., description="Document ID")
    documentFileId: int | None = Field(None, description="Document file ID")
    sourceKind: str = Field(..., description="Source kind")
    sourceFileName: str | None = Field(None, description="Source file name")
    sourcePageNum: int | None = Field(None, description="Source page number")
    imageIndexInPage: int | None = Field(None, description="Image index within the page")
    imageIndexInFile: int | None = Field(None, description="Image index within the file")
    bbox: dict | list | None = Field(None, description="Image bounding box")
    qualityStatus: str = Field(..., description="pass/review/reject")
    qualityScore: float | None = Field(None, description="Sharpness score")
    reasonCode: str | None = Field(None, description="Reason code")
    reasonText: str | None = Field(None, description="Reason description")
    cropOssUrl: str | None = Field(None, description="OSS URL of the cropped image")
    displayText: str | None = Field(None, description="Display text, e.g. 'Page 12, image 1 is blurry'")


class ImageQualitySummaryVO(BaseModel):
    """Image quality check summary for a document."""

    runId: int | None = Field(None, description="Latest run ID")
    runStatus: str | None = Field(None, description="queued/running/completed/partial_failed/failed/skipped")
    summaryStatus: str | None = Field(None, description="pass/review/reject")
    skipReason: str | None = Field(None, description="Skip reason")
    totalImages: int = Field(0, description="Total number of images")
    passCount: int = Field(0, description="Number passed")
    reviewCount: int = Field(0, description="Number pending manual confirmation")
    rejectCount: int = Field(0, description="Number requiring a retake")
    warningText: str | None = Field(None, description="Summary warning text")
    finishedAt: str | None = Field(None, description="Finish time")


class ImageQualityDetailVO(BaseModel):
    """Image quality check detail for a document."""

    summary: ImageQualitySummaryVO = Field(..., description="Quality check summary")
    items: list[ImageQualityItemVO] = Field(default_factory=list, description="List of problem images")
```

### 3.3 Batch Status and Recheck VOs

```python
class ImageQualityStatusItemVO(BaseModel):
    """Image quality check status entry for a document."""

    documentId: int = Field(..., description="Document ID")
    runId: int | None = Field(None, description="Latest run ID")
    runStatus: str | None = Field(None, description="Run status")
    summaryStatus: str | None = Field(None, description="Summary status")
    rejectCount: int = Field(0, description="Number requiring a retake")
    reviewCount: int = Field(0, description="Number pending confirmation")
    updatedAt: str | None = Field(None, description="Last update time")


class ImageQualityRecheckVO(BaseModel):
    """Response for a manual recheck."""

    runId: int = Field(..., description="New quality check run ID")
    documentId: int = Field(..., description="Document ID")
    status: str = Field(..., description="queued")
    message: str = Field("", description="Message")
```

## 4. Suggested Extensions to the Existing Document VOs

Extend `documentVo.py` directly, so the upload, list, and detail pages need fewer new endpoints and less data stitching.

### 4.1 Fields to add to `DocumentUploadVO`

```python
imageQualityEnabled: bool = Field(False, description="Whether image quality check is enabled for this document")
imageQualityRunId: int | None = Field(None, description="Image quality check run ID")
imageQualityRunStatus: str | None = Field(None, description="Image quality check run status")
imageQualitySummaryStatus: str | None = Field(None, description="Image quality check summary status")
```

### 4.2 Fields to add to `DocumentStatusItemVO`

```python
imageQualityRunId: int | None = Field(None, description="Image quality check run ID")
imageQualityRunStatus: str | None = Field(None, description="Image quality check run status")
imageQualitySummaryStatus: str | None = Field(None, description="Image quality check summary status")
imageQualityRejectCount: int = Field(0, description="Number of images requiring a retake")
imageQualityReviewCount: int = Field(0, description="Number of images pending manual confirmation")
```

### 4.3 Fields to add to `DocumentListItemVO`

```python
imageQualityRunId: int | None = Field(None, description="Image quality check run ID")
imageQualityRunStatus: str | None = Field(None, description="Image quality check run status")
imageQualitySummaryStatus: str | None = Field(None, description="Image quality check summary status")
imageQualityIssueCount: int = Field(0, description="Number of problem images")
imageQualityWarningText: str | None = Field(None, description="Image quality warning text")
```

### 4.4 Fields to add to `DocumentDetailVO`

```python
imageQualitySummary: ImageQualitySummaryVO | None = Field(None, description="Image quality summary")
```

## 5. New Service Interface Signatures

Create a new `services/imageQualityService.py` rather than diluting the core responsibility of `IDocumentService`.

```python
from abc import ABC, abstractmethod

from fastapi_modules.fastapi_leaudit.domian.vo.imageQualityVo import (
    ImageQualityConfigItemVO,
    ImageQualityConfigUpsertDTO,
    ImageQualityDetailVO,
    ImageQualityRecheckVO,
    ImageQualityStatusItemVO,
    ImageQualitySummaryVO,
)


class IImageQualityService(ABC):
    """Image quality check service interface."""

    @abstractmethod
    async def DispatchForDocument(
        self,
        DocumentId: int,
        TriggerUserId: int | None = None,
        Force: bool = False,
        Speed: str = "normal",
    ) -> ImageQualityRecheckVO | None:
        """Dispatch an image quality check task for a document."""
        ...

    @abstractmethod
    async def GetDocumentSummary(
        self,
        CurrentUserId: int,
        DocumentId: int,
    ) -> ImageQualitySummaryVO:
        """Get the image quality check summary of a document."""
        ...

    @abstractmethod
    async def GetDocumentDetail(
        self,
        CurrentUserId: int,
        DocumentId: int,
    ) -> ImageQualityDetailVO:
        """Get the image quality check detail of a document."""
        ...

    @abstractmethod
    async def GetDocumentsStatus(
        self,
        CurrentUserId: int,
        Ids: list[int],
    ) -> list[ImageQualityStatusItemVO]:
        """Get image quality check status for multiple documents."""
        ...

    @abstractmethod
    async def RecheckDocument(
        self,
        CurrentUserId: int,
        DocumentId: int,
        Speed: str = "normal",
    ) -> ImageQualityRecheckVO:
        """Manually rerun the image quality check for a document."""
        ...

    @abstractmethod
    async def ListConfigs(self) -> list[ImageQualityConfigItemVO]:
        """List image quality check configurations."""
        ...

    @abstractmethod
    async def UpsertConfig(self, Body: ImageQualityConfigUpsertDTO) -> ImageQualityConfigItemVO:
        """Create or update an image quality check configuration."""
        ...

    @abstractmethod
    async def DeleteConfig(self, Id: int) -> None:
        """Delete an image quality check configuration."""
        ...
```

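## 6. Suggested Integration Points in the Existing DocumentService

Add only "trigger points" to the existing `IDocumentService / DocumentServiceImpl`; do not move query logic into it as well.

### 6.1 `DocumentServiceImpl.Upload`

After the main file and attachments have been persisted, add: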

```python
await self.ImageQualityService.DispatchForDocument(
    DocumentId=document.Id,
    TriggerUserId=CreatedBy,
    Force=False,
    Speed=Speed,
)
```

Requirements:

- Wrap the call in try/except
- On failure, only write a warning log
- Do not affect the existing successful upload response
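The three requirements above can be sketched as a small wrapper. This is a minimal illustration, not the repository's implementation: `dispatch_image_quality_safely` and the logger name are hypothetical, and `service` is assumed to be any object exposing the `DispatchForDocument` coroutine from Section 5.

```python
import asyncio
import logging

logger = logging.getLogger("leaudit.image_quality")  # hypothetical logger name


async def dispatch_image_quality_safely(
    service,
    document_id,
    trigger_user_id=None,
    force=False,
    speed="normal",
):
    """Dispatch the check; on any error log a warning and return None."""
    try:
        return await service.DispatchForDocument(
            DocumentId=document_id,
            TriggerUserId=trigger_user_id,
            Force=force,
            Speed=speed,
        )
    except Exception as exc:
        # Only a warning is logged; the caller's upload flow continues.
        logger.warning(
            "image quality dispatch failed for document %s: %s", document_id, exc
        )
        return None
```

Catching `Exception` rather than `BaseException` leaves `asyncio.CancelledError` untouched, so request cancellation still propagates normally.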

### 6.2 `DocumentServiceImpl.AppendAttachments`

After attachments have been appended successfully, add:

```python
await self.ImageQualityService.DispatchForDocument(
    DocumentId=Id,
    TriggerUserId=CurrentUserId,
    Force=True,
    Speed="normal",
)
```

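Reasons:

- Once attachments change, the existing image quality results are stale
- The check must be rerun against the latest attachments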

## 7. Suggested Controller Routes

Create a new `controllers/imageQualityController.py`, keeping the same style as `documentController.py / auditController.py`.

### 7.1 Document-Level Routes

```python
@self.router.get("/documents/{DocumentId}/image-quality/summary", response_model=Result[ImageQualitySummaryVO])
async def GetDocumentImageQualitySummary(
    DocumentId: int,
    payload: dict[str, Any] = Depends(verify_access_token),
):
    """Get the image quality check summary of a single document."""


@self.router.get("/documents/{DocumentId}/image-quality", response_model=Result[ImageQualityDetailVO])
async def GetDocumentImageQualityDetail(
    DocumentId: int,
    payload: dict[str, Any] = Depends(verify_access_token),
):
    """Get the image quality check detail of a single document."""


@self.router.post("/documents/{DocumentId}/image-quality/recheck", response_model=Result[ImageQualityRecheckVO])
async def RecheckDocumentImageQuality(
    DocumentId: int,
    speed: str = Form("normal", description="Execution speed tier: urgent/normal"),
    payload: dict[str, Any] = Depends(verify_access_token),
):
    """Manually rerun the image quality check for a single document."""
```

### 7.2 Batch Status Route

```python
@self.router.get("/documents/image-quality/status", response_model=Result[list[ImageQualityStatusItemVO]])
async def GetDocumentsImageQualityStatus(
    ids: str = Query(..., description="Comma-separated list of document IDs"),
    payload: dict[str, Any] = Depends(verify_access_token),
):
    """Get image quality check status for multiple documents."""
```
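Because `ids` arrives as a comma-separated string, the controller has to turn it into the `list[int]` that `GetDocumentsStatus` expects. The helper below is one possible sketch; `parse_document_ids` and the `max_count` cap are assumptions introduced for illustration, not something the source specifies.

```python
def parse_document_ids(ids: str, max_count: int = 200) -> list[int]:
    """Parse '1,2,3' into [1, 2, 3], dropping blanks and duplicates."""
    result: list[int] = []
    seen: set[int] = set()
    for part in ids.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate trailing or doubled commas
        try:
            value = int(part)
        except ValueError:
            raise ValueError(f"invalid document id: {part!r}") from None
        if value <= 0:
            raise ValueError(f"document id must be positive: {value}")
        if value not in seen:
            seen.add(value)
            result.append(value)
    if len(result) > max_count:
        # Hypothetical guard against unbounded batch queries.
        raise ValueError(f"too many document ids (max {max_count})")
    return result
```

In a real controller the `ValueError` would typically be mapped to a 4xx response in whatever way the repository's other controllers already do.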

### 7.3 Configuration Management Routes

If phase one does not include an admin configuration page, keep only the service and do not expose a controller yet.

If endpoints are needed, the suggestion is:

```python
@self.router.get("/v3/image-quality/configs", response_model=Result[list[ImageQualityConfigItemVO]])
async def ListImageQualityConfigs():
    """List image quality check configurations."""


@self.router.post("/v3/image-quality/configs", response_model=Result[ImageQualityConfigItemVO])
async def UpsertImageQualityConfig(Body: ImageQualityConfigUpsertDTO):
    """Create or update an image quality check configuration."""


@self.router.delete("/v3/image-quality/configs/{ConfigId}", response_model=Result[None])
async def DeleteImageQualityConfig(ConfigId: int):
    """Delete an image quality check configuration."""
```

## 8. Suggested Celery Tasks and Function Signatures

New file: `fastapi_modules/fastapi_leaudit/image_quality/tasks.py`

```python
def resolve_image_quality_queue(speed: str = "normal") -> str:
    """Return the image quality check queue name for the given priority."""


def dispatch_image_quality_task(
    run_id: int,
    *,
    speed: str = "normal",
    trigger_user_id: int | None = None,
) -> Any:
    """Enqueue an image quality check task."""


@celery_app.task(
    bind=True,
    name="leaudit.image_quality.process_document",
    acks_late=True,
)
def image_quality_process_document_task(
    self,
    run_id: int,
    trigger_user_id: int | None = None,
) -> dict[str, Any]:
    """Celery worker entry point."""
```
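As an illustration of `resolve_image_quality_queue`, here is a minimal sketch. It assumes only two speed tiers (`urgent` and `normal`) and hard-codes the queue names that Section 11 proposes as defaults; a real implementation would presumably read them from `LeauditSettings` instead of module constants.

```python
# Assumed defaults, mirroring the queue names proposed in Section 11.
QUEUE_NORMAL = "leaudit.image_quality.normal"
QUEUE_URGENT = "leaudit.image_quality.urgent"


def resolve_image_quality_queue(speed: str = "normal") -> str:
    """Map a speed tier to a queue name; unknown tiers fall back to normal."""
    normalized = (speed or "").strip().lower()
    if normalized == "urgent":
        return QUEUE_URGENT
    return QUEUE_NORMAL
```

Falling back to the normal queue for unknown values keeps a malformed `speed` from failing the dispatch, which fits the rule that this module must never block the main flow.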

New file: `fastapi_modules/fastapi_leaudit/image_quality/runner.py`

```python
class ImageQualityRunner:
    """Image quality check executor."""

    async def Execute(
        self,
        RunId: int,
        TriggerUserId: int | None = None,
    ) -> dict[str, Any]:
        """Run one complete image quality check."""
```

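## 9. Suggested Naming for SQL Draft Files

Following the current repository convention, place them directly under:

- `scripts/创建sql/`

Split the SQL drafts into 3 files.

### 9.1 Schema SQL

File name:

- `scripts/创建sql/schema_add_image_quality_module.sql`

Scope:

- `leaudit_image_quality_configs`
- `leaudit_image_quality_runs`
- `leaudit_image_quality_items`
- Indexes
- Unique constraints
- Soft-delete columns

### 9.2 Permission SQL

File name:

- `scripts/创建sql/seed_image_quality_permissions.sql`

Scope:

- Permission to query image quality checks
- Permission to rerun image quality checks
- Permission to manage image quality configuration

### 9.3 Route / Entry SQL

File name:

- `scripts/创建sql/seed_image_quality_routes.sql`

Scope:

- New API route resources
- Menu entries, only if an admin configuration menu is needed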

## 10. Suggested Table Schema Draft

### 10.1 `leaudit_image_quality_configs`

Suggested columns:

- `id BIGSERIAL PRIMARY KEY`
- `scope_type VARCHAR(32) NOT NULL`
- `document_type_id BIGINT NULL`
- `region VARCHAR(64) NULL`
- `enabled BOOLEAN NOT NULL DEFAULT TRUE`
- `warn_threshold NUMERIC(10,4) NULL`
- `reject_threshold NUMERIC(10,4) NULL`
- `max_images_per_doc INTEGER NULL`
- `max_concurrency INTEGER NULL`
- `created_at TIMESTAMP NOT NULL DEFAULT NOW()`
- `updated_at TIMESTAMP NOT NULL DEFAULT NOW()`
- `deleted_at TIMESTAMP NULL`

Suggested indexes:

- `idx_leaudit_image_quality_configs_scope`
- `idx_leaudit_image_quality_configs_doc_type`
- `idx_leaudit_image_quality_configs_region`

### 10.2 `leaudit_image_quality_runs`

Suggested columns:

- `id BIGSERIAL PRIMARY KEY`
- `document_id BIGINT NOT NULL`
- `document_file_id BIGINT NULL`
- `status VARCHAR(32) NOT NULL DEFAULT 'queued'`
- `skip_reason VARCHAR(64) NULL`
- `summary_status VARCHAR(32) NULL`
- `total_images INTEGER NOT NULL DEFAULT 0`
- `pass_count INTEGER NOT NULL DEFAULT 0`
- `review_count INTEGER NOT NULL DEFAULT 0`
- `reject_count INTEGER NOT NULL DEFAULT 0`
- `task_id VARCHAR(255) NULL`
- `error_message TEXT NULL`
- `started_at TIMESTAMP NULL`
- `finished_at TIMESTAMP NULL`
- `created_by BIGINT NULL`
- `created_at TIMESTAMP NOT NULL DEFAULT NOW()`
- `updated_at TIMESTAMP NOT NULL DEFAULT NOW()`
- `deleted_at TIMESTAMP NULL`

Suggested indexes:

- `idx_leaudit_image_quality_runs_document_id`
- `idx_leaudit_image_quality_runs_status`
- `idx_leaudit_image_quality_runs_document_created_at`

### 10.3 `leaudit_image_quality_items`

Suggested columns:

- `id BIGSERIAL PRIMARY KEY`
- `run_id BIGINT NOT NULL`
- `document_id BIGINT NOT NULL`
- `document_file_id BIGINT NULL`
- `source_kind VARCHAR(64) NOT NULL`
- `source_file_name VARCHAR(255) NULL`
- `source_page_num INTEGER NULL`
- `image_index_in_page INTEGER NULL`
- `image_index_in_file INTEGER NULL`
- `bbox_json JSONB NULL`
- `image_key VARCHAR(255) NULL`
- `parent_image_key VARCHAR(255) NULL`
- `crop_oss_url TEXT NULL`
- `quality_status VARCHAR(32) NOT NULL`
- `quality_score NUMERIC(10,4) NULL`
- `reason_code VARCHAR(64) NULL`
- `reason_text TEXT NULL`
- `extra_json JSONB NULL`
- `created_at TIMESTAMP NOT NULL DEFAULT NOW()`
- `updated_at TIMESTAMP NOT NULL DEFAULT NOW()`
- `deleted_at TIMESTAMP NULL`

Suggested indexes:

- `idx_leaudit_image_quality_items_run_id`
- `idx_leaudit_image_quality_items_document_id`
- `idx_leaudit_image_quality_items_quality_status`
- `idx_leaudit_image_quality_items_source_page_num`

## 11. Suggested Settings

Add the following to `LeauditSettings` in `fastapi_admin/config/_settings.py`:

```python
LEAUDIT_IMAGE_QUALITY_ENABLED: bool = False
LEAUDIT_IMAGE_QUALITY_WARN_THRESHOLD: float = 0.45
LEAUDIT_IMAGE_QUALITY_REJECT_THRESHOLD: float = 0.30
LEAUDIT_IMAGE_QUALITY_MAX_IMAGES_PER_DOC: int = 80
LEAUDIT_IMAGE_QUALITY_MAX_CONCURRENCY: int = 4
LEAUDIT_IMAGE_QUALITY_QUEUE_NORMAL: str = "leaudit.image_quality.normal"
LEAUDIT_IMAGE_QUALITY_QUEUE_URGENT: str = "leaudit.image_quality.urgent"
LEAUDIT_IMAGE_QUALITY_TIMEOUT: int = 120
```

Mirror the same entries in `fastapi_admin/config/__init__.pyi`.
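The two thresholds map a sharpness score onto the three tiers from Section 1. The sketch below assumes that a higher score means a sharper image and that the boundaries are exclusive on the low side; the source does not define either detail, and `classify_quality` is a hypothetical helper name.

```python
def classify_quality(
    score: float,
    warn_threshold: float = 0.45,
    reject_threshold: float = 0.30,
) -> str:
    """Return 'reject', 'review', or 'pass' for a sharpness score."""
    if reject_threshold > warn_threshold:
        # A reject threshold above the warn threshold would leave no 'review' band.
        raise ValueError("reject_threshold must not exceed warn_threshold")
    if score < reject_threshold:
        return "reject"
    if score < warn_threshold:
        return "review"
    return "pass"
```

With the default values, a score of 0.20 is rejected, 0.40 is flagged for review, and 0.45 or above passes.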

## 12. Suggested Frontend API Naming

Add a new frontend API file:

- `legal-platform-frontend/lib/api/legacy/files/image-quality.ts`

Suggested functions:

```ts
export async function getDocumentImageQualitySummary(documentId: number)
export async function getDocumentImageQualityDetail(documentId: number)
export async function getDocumentsImageQualityStatus(ids: number[])
export async function recheckDocumentImageQuality(documentId: number, speed?: string)
```

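## 13. Implementation Order for the First Release

The suggested order is:

1. Add the SQL and backend table models
2. Add `imageQualityVo.py`
3. Add `IImageQualityService / ImageQualityServiceImpl`
4. Add `tasks.py / runner.py`
5. Hook the trigger into `DocumentServiceImpl.Upload / AppendAttachments`
6. Add `imageQualityController.py`
7. Finally, wire up the frontend upload, list, and detail pages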

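## 14. Conclusion

Given the current repository style, the safest way to land this module is:

- Create a separate `VO + Service + Controller + Celery task + SQL`
- Trigger only from `DocumentServiceImpl`, without touching the existing `AuditServiceImpl`
- Expose four kinds of endpoints: document-level summary, detail, batch status, and manual recheck
- Split the SQL into schema, permission, and route scripts so it can be rolled out in stages

This split lets the image quality module evolve on its own without tangling the existing OCR / audit main pipeline.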

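## 15. Revised Page-Positioning Strategy

This section corrects the first version's statement on page positioning for `doc/docx/wps`, so that developers do not assume real page numbers can be obtained directly from the current `python-docx` fallback logic.

### 15.1 Where page positioning comes from in the existing OCR / audit pipeline

In the current system, the audit detail page can show a field's page number and, in some cases, highlight its position. The main source is not "the original Word document's page numbers" but the following two layers:

- Layer 1: the OCR / normalized result carries its own page structure
  - `ocr_result.pages[].page_num`
  - The `bbox` of each OCR chunk
  - `pageNum / bbox / matchPosition` in `field_positions`
- Layer 2: fallback text matching on the detail page
  - Re-extract text from the PDF page by page, then match the field text back to a page number
  - This is fallback logic, not the primary positioning source

In other words, the page-number semantics that are "truly stable and reusable" in the audit module come mainly from OCR / the normalized document, not from the detail page's own back-inference logic.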

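### 15.2 The current `docx` fallback page capability is not real pagination

The current `_extract_page_texts_from_docx()` implementation in `documentServiceImpl.py` actually does the following:

- Reads paragraph and table text
- Joins it into the full document text
- Returns only `[(1, text)]`

This means:

- The detail-page fallback logic has no real pagination capability for `docx`
- It simply collapses the whole document into "page 1"
- So this logic cannot serve as the basis for `docx` image positioning

The conclusion should be fixed as:

- `docx` positioning is not entirely impossible in the current system
- But exact page numbers cannot rely on the existing `python-docx` fallback logic
- To position `docx` images accurately, the OCR / normalized page structure must be reused first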

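### 15.3 The correct approach to `docx` in the image quality module

When handling `doc/docx/wps`, the page-positioning strategy should be layered by priority:

1. Prefer the page structure and visual-object positions in the OCR result
2. Next, reuse the visual-object metadata already present in `visual_manifest`
3. If OCR can give the page of the image or its parent image, use that page number directly
4. If no stable page number is available, degrade to "image N in the body" or "image N in the attachment"

Fields recommended for use, in order of preference:

- `ocr_result.pages[].page_num`
- `ocr_result.visual_manifest`
- `page_num` of objects in `visual_manifest`
- `bbox` of objects in `visual_manifest`
- `image_key / parent_image_key`

The essence of this approach is:

- Do not reinvent Word pagination
- Reuse the page concept that the existing OCR pipeline has already normalized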

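### 15.4 Why the image quality module cannot reuse the detail page's `docx` fallback logic

That logic only fits the case of "a text field has no page number, so supply a minimally usable one". It does not fit the image quality scenario.

The image quality scenario needs to answer:

- Which image is blurry
- Which page it is on
- Its position within the page
- Ideally, its bbox or a cropped image

The current `docx` fallback logic has none of:

- Real pagination
- Image-level indexing
- Image index within a page
- Image bbox

Reusing it directly would only produce a fake position that appears to have a page number but is unusable in practice.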

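### 15.5 Tiered page-positioning strategy for the image quality module

The design should state explicitly:

- `PDF / scanned PDF / image file / image attachment / PDF attachment`
  - Exact positioning is the first priority
  - Target output: `source_page_num + image_index_in_page + bbox`
- `doc/docx/wps`
  - First priority: reuse the page number and bbox from OCR / visual_manifest
  - Second priority: degrade to `image_index_in_file`
  - The first release does not promise that every Office document can be positioned reliably to the preview page number

In other words, `docx` can be positioned, but the prerequisite is the page structure produced by OCR normalization, not the current degenerate `python-docx` logic.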

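### 15.6 Implementation corrections for the image quality module

Based on the analysis above, add the following constraints to the implementation:

- For `doc/docx/wps`, image extraction and quality judgment can run independently
- Page back-reference should be done in the layer that "backfills from OCR visual metadata after extraction"
- Calling the current `_extract_page_texts_from_docx()` to produce image page numbers is not allowed

More specifically:

- `extractors.py` extracts images and generates a unique image-level index
- `detector.py` makes the three-tier sharpness judgment
- `runner.py`, before aggregating results, first tries to link the image index to OCR / visual_manifest
- On a successful link, write `source_page_num / bbox`
- On a failed link, write `image_index_in_file`, and the frontend shows "image N in the body"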

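### 15.7 Final conclusion

In this project, the usability of "audit page positioning" for `docx` comes mainly from the OCR-normalized page structure, not from any native pagination capability in `python-docx`.

So if the document image quality module is to handle `docx` accurately as well, the right direction is:

- Follow the page-structure design of the existing OCR / audit flow
- Reuse `ocr_result.pages / visual_manifest / bbox / image_key`
- Avoid mistaking the current `docx` text fallback logic for image-level positioning

This should be treated as an explicit technical constraint during implementation. The route of "parsing Word text directly to infer page numbers" is not recommended.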

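## 16. Linking the Image Index Table to the OCR visual_manifest

This section explains the most critical part of the image quality module: how extracted-image records are bound back to OCR page numbers and bboxes, especially for `doc/docx/wps`.

### 16.1 Why a linking layer is required

The module's own `extractors.py` first extracts images from the original document and stores them in `leaudit_image_quality_items`.

Extraction by itself can only guarantee the following:

- Which document file the image comes from
- Whether it comes from the main file or an attachment
- Its ordinal position among the images in the file
- Its original width and height, bytes, and hash at extraction time

This alone is not enough for the frontend to point back reliably to:

- The page
- The position within the page
- The exact bbox

So a "linking layer between the image index table and OCR visual metadata" is needed, connecting extraction records with the page numbers, coordinates, and parent-image identifiers in OCR / visual_manifest.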

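### 16.2 Suggested additional persisted columns

In `leaudit_image_quality_items`, add or make explicit the following columns:

- `image_sha256`
  - Hash of the extracted image binary
- `image_width`
  - Width of the extracted image
- `image_height`
  - Height of the extracted image
- `source_ext`
  - Source file extension
- `ocr_page_num`
  - Page number obtained after linking to OCR
- `ocr_bbox_json`
  - Bbox obtained after linking to OCR
- `ocr_image_key`
  - `image_key` of the linked OCR visual object
- `ocr_parent_image_key`
  - `parent_image_key` of the linked OCR visual object
- `ocr_match_mode`
  - How the link was matched
- `ocr_match_score`
  - Confidence of the match

Notes:

- `source_page_num` is the source page number as the module itself understands it
- `ocr_page_num` is the page number backfilled from OCR / visual_manifest
- In the first release, the frontend prefers `ocr_page_num` and falls back to `source_page_num` when it is missing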

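### 16.3 Suggested linking priority

Implement one unified linking flow in `runner.py`, with priority from high to low:

1. Direct hit on `image_key / parent_image_key`
2. Hit on the image binary hash
3. Hit on image size plus crop similarity
4. Match by bbox / area ratio within the candidate pages
5. If everything fails, degrade to positioning by index within the file

The reasons for this order are:

- A hit on `image_key / parent_image_key` is the most stable
- A hash hit is the next most stable, provided the OCR side and the extraction side use the same image bytes
- Size and similarity can serve as a weak match
- bbox / area is better suited to local positioning once the candidate page is known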
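The priority chain can be sketched as an ordered list of exact-match checks followed by a fallback. This is a simplified illustration only: it covers the key and hash levels and skips the similarity and bbox levels, and it assumes each `visual_manifest` object is a dict exposing `image_key`, `parent_image_key`, `image_sha256`, `page_num`, and `bbox`. The real structure of `visual_manifest` is not described in the source, so those field names and the helper `locate_item` are hypothetical.

```python
# (field to compare, ocr_match_mode to record), in priority order.
EXACT_CHECKS = [
    ("image_key", "image_key_exact"),
    ("parent_image_key", "parent_image_key_exact"),
    ("image_sha256", "image_sha256_exact"),
]


def locate_item(item: dict, visual_objects: list) -> dict:
    """Return the OCR location for one extracted image, or a fallback marker."""
    for field, mode in EXACT_CHECKS:
        wanted = item.get(field)
        if not wanted:
            continue  # the extractor did not record this identifier
        for obj in visual_objects:
            if obj.get(field) == wanted:
                return {
                    "ocr_page_num": obj.get("page_num"),
                    "ocr_bbox_json": obj.get("bbox"),
                    "ocr_match_mode": mode,
                }
    # Similarity and bbox/area matching (levels 3 and 4) are omitted here.
    return {
        "ocr_page_num": None,
        "ocr_bbox_json": None,
        "ocr_match_mode": "file_index_fallback",
    }
```

Recording the match mode on every item, including the fallback, is what later makes "the page number is wrong" reports traceable to a specific level.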

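### 16.4 Recommended linking strategy by file type

#### 16.4.1 PDF / scanned PDF / PDF attachment

Recommended strategy:

- Record `source_page_num` directly at extraction time
- If OCR returns a visual object on the same page, use the in-page bbox as a secondary confirmation
- Final positioning is based on `source_page_num + image_index_in_page + bbox`

These files usually do not need to lean heavily on `visual_manifest`, because extraction already has a stable page concept.

#### 16.4.2 Image file / image attachment

Recommended strategy:

- Fix `source_page_num` at `1`
- `image_index_in_page = 1`
- If OCR outputs a visual object, use it only to supplement the bbox or `image_key`

These files are the simplest to position, since each is a single page with a single image.

#### 16.4.3 `doc/docx/wps`

Recommended strategy:

- The extraction stage guarantees only:
  - `image_index_in_file`
  - `image_sha256`
  - `image_width / image_height`
  - `source_file_name`
- Do not force a page number at extraction time
- Backfill once the OCR / visual_manifest result is available

The suggested linking priority for these files is:

1. `parent_image_key` hit
2. `image_key` hit
3. Hash hit
4. Size and content similarity hit
5. If nothing matches, show only "image N in the body"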

### 16.5 Suggested `ocr_match_mode` enum

Fix the following values for `ocr_match_mode` to make later troubleshooting easier:

- `parent_image_key_exact`
- `image_key_exact`
- `image_sha256_exact`
- `size_similarity_match`
- `bbox_overlap_match`
- `page_candidate_match`
- `file_index_fallback`
- `no_match`

If the business side later reports "the page number is wrong", the match mode shows directly which layer went wrong.

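### 16.6 Implementation suggestions for `visual_manifest` linking

Place this logic in `image_quality/runner.py` or in a separate `image_quality/locator.py`, so responsibilities stay clear:

- `extractors.py`
  - Only extracts images
  - Does not guess page numbers
- `detector.py`
  - Only judges image sharpness
- `locator.py` or `runner.py`
  - Links image index records to the OCR `visual_manifest`
  - Backfills `ocr_page_num / ocr_bbox_json / ocr_match_mode`

In other words, page positioning is not part of extraction. It is a completion step that runs "after extraction and once the OCR result is available".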

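### 16.7 Why the image index should be persisted before linking

Mixing "extraction, linking, and detection" together causes two problems:

- When the OCR result is not ready, the image quality module is forced to wait for the main flow
- When the OCR-side structure changes, the image quality module is affected along with it

The correct split is:

1. Extract images independently and persist the image index
2. Run sharpness detection independently
3. If the OCR result is ready, perform the visual_manifest linking
4. If the OCR result is not ready, show weak positioning first and complete it asynchronously later

This keeps the two pipelines independent while reusing the OCR page structure as much as possible.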

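### 16.8 Value priority for display

When the frontend shows "which page / which image", take values in this order:

1. `ocr_page_num + image_index_in_page`
2. `source_page_num + image_index_in_page`
3. `image_index_in_file`
4. Show only the source file name

Example texts:

- `Main file, page 12, image 1 is blurry`
- `Attachment "现场照片2.pdf", page 3, image 2 may be blurry`
- `Body, embedded image 5 is blurry`
- `Attachment "取证照片.jpg": image is blurry`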
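The four-level priority can be sketched as a single function that builds `displayText`. This is an illustration under assumptions: `build_display_text` is a hypothetical helper, `file_role` is assumed to take the values `"main"` or `"attachment"`, and the English wording of the output is illustrative rather than agreed product copy.

```python
def build_display_text(item: dict, verdict: str = "is blurry") -> str:
    """Build a location string, degrading from page-level to file-level."""
    name = item.get("source_file_name")
    is_attachment = item.get("file_role") == "attachment"
    if is_attachment:
        label = f'Attachment "{name}"' if name else "Attachment"
    else:
        label = "Main file"

    # Levels 1 and 2: prefer the OCR page number, then the extractor's own.
    page = item.get("ocr_page_num") or item.get("source_page_num")
    in_page = item.get("image_index_in_page")
    if page and in_page:
        return f"{label}, page {page}, image {in_page} {verdict}"

    # Level 3: only the position within the file is known.
    in_file = item.get("image_index_in_file")
    if in_file:
        where = label if is_attachment else "Body"
        return f"{where}, embedded image {in_file} {verdict}"

    # Level 4: nothing but the source file is known.
    return f"{label}: image {verdict}"
```

Keeping this logic in one place on the backend means the list page, detail page, and any exported report show the same wording for the same item.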

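### 16.9 Suggested scope limits for the first release

To control complexity, scope the first release as follows:

- `PDF / scanned PDF / image file / image attachment / PDF attachment`
  - Implement strong positioning
- `doc/docx/wps`
  - First build the OCR `visual_manifest` linking capability
  - Show the page number when linking succeeds
  - Degrade to "image N in the body" when linking fails

Do not require in the first release that every image embedded in Word/WPS be positioned 100% exactly to the preview page number. Otherwise both development cost and the risk of misjudgment rise noticeably.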

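### 16.10 Final conclusion

Whether the image quality module can handle `docx` accurately depends not on "whether images can be extracted" but on "whether, after extraction, a stable link to OCR visual metadata can be established".

So the technical route for the first release should be stated as:

- First build the image index table
- Then build the visual_manifest linking layer
- Only then handle page / coordinate display

This is far more robust than relying directly on `python-docx` or guessing Word pagination by hand, and it fits the project's existing OCR / audit system better.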

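## 17. Responsibility Split of `extractors.py` and Execution Sequence of `runner.py`

This section breaks the module's internal layers down further so developers can split tasks by file.

### 17.1 Suggested module layering

Split the image quality module into the layers below rather than piling all logic into one service or one task file.

#### 17.1.1 `input_resolver.py`

Responsibilities:

- Read the document's main file and attachments
- Return a unified list of raw file inputs
- No image extraction
- No detection
- No OCR linking

Suggested output structure: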

- `document_id`
- `document_file_id`
- `file_role`
- `file_name`
- `file_ext`
- `mime_type`
- `source_type`
- `source_path`
- `file_bytes`

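#### 17.1.2 `extractors.py`

Responsibilities:

- Extract images for each file type
- Generate image-level index records
- Record the most basic image metadata
- Not responsible for blur judgment
- Not responsible for the final page back-reference strategy

Suggested internal split:

- `extract_images_from_pdf()`
- `extract_images_from_image_file()`
- `extract_images_from_docx()`
- `extract_images_from_attachment()`
- `persist_image_index_items()`

Suggested unified output structure: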

- `documentId`
- `documentFileId`
- `sourceKind`
- `sourceFileName`
- `sourcePageNum`
- `imageIndexInPage`
- `imageIndexInFile`
- `imageSha256`
- `imageWidth`
- `imageHeight`
- `imageBytes`
- `extraJson`

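Where:

- `PDF / scanned PDF / image file / image attachment / PDF attachment` can provide `sourcePageNum` directly at this stage
- `doc/docx/wps` does not force an exact page number here and may leave it empty

#### 17.1.3 `detector.py`

Responsibilities:

- Run sharpness detection on the extracted images
- Return a three-tier result
- Output a reason code and description
- Not responsible for page positioning

Suggested unified return fields: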

- `qualityStatus`
- `qualityScore`
- `reasonCode`
- `reasonText`

Limit the reason codes to a finite set at first:

- `blur_detected`
- `low_resolution`
- `over_exposure`
- `under_exposure`
- `motion_blur`
- `text_unreadable`
- `detector_timeout`
- `detector_failed`

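#### 17.1.4 `locator.py`

Responsibilities:

- Read OCR / visual_manifest
- Link image index records to visual objects
- Backfill `ocr_page_num / ocr_bbox_json / ocr_match_mode`
- Not responsible for sharpness scoring

If phase one does not want a separate `locator.py`, the logic can live inside `runner.py` for now, but it should still be treated as a separate responsibility conceptually.

#### 17.1.5 `storage_adapter.py`

Responsibilities:

- Create a run
- Batch-insert image items
- Update image detection results
- Update linking results
- Aggregate the run's overall status

Do not put business decisions here. This layer only reads and writes.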

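### 17.2 Suggested implementation rules for `extractors.py`

#### 17.2.1 PDF extraction rules

Suggestions:

- Iterate page by page
- Number `image_index_in_page` by order of appearance within a page
- Number `image_index_in_file` across the whole file
- Record the bbox whenever it is available
- Compute `sha256` over the original image bytes

End goal:

- PDFs gain strong positioning capability already at extraction time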
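The numbering rules can be sketched independently of any PDF library. The function below assumes the per-page image bytes have already been pulled out by whatever PDF backend the project chooses, and only shows how the two indexes and the hash are assigned; `build_pdf_index_items` and the input shape are hypothetical.

```python
import hashlib


def build_pdf_index_items(pages: list) -> list:
    """Assign page, in-page, and in-file indexes to extracted PDF images.

    `pages` is a list with one entry per PDF page, in page order; each entry
    is a list of raw image bytes in order of appearance on that page.
    """
    items = []
    index_in_file = 0
    for page_num, images in enumerate(pages, start=1):
        for index_in_page, image_bytes in enumerate(images, start=1):
            index_in_file += 1  # keeps counting across pages
            items.append(
                {
                    "source_page_num": page_num,
                    "image_index_in_page": index_in_page,
                    "image_index_in_file": index_in_file,
                    "image_sha256": hashlib.sha256(image_bytes).hexdigest(),
                }
            )
    return items
```

Pages without images simply contribute no items, so `source_page_num` still reflects the real page even when earlier pages are text only.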

#### 17.2.2 Image file extraction rules

Suggestions:

- Treat a single file directly as 1 page with 1 image
- `sourcePageNum = 1`
- `imageIndexInPage = 1`
- `imageIndexInFile = 1`

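#### 17.2.3 `doc/docx/wps` extraction rules

Suggestions:

- Extract only embedded images
- Keep the in-file order stable
- Do not infer page numbers here in the first release
- Focus on preserving:
  - `image_sha256`
  - `image_width`
  - `image_height`
  - `image_index_in_file`

Reason:

- The difficulty with these files is not extraction but "how to attach extracted images back to the OCR page structure"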
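For `.docx` specifically, a file is a ZIP package and Word conventionally stores embedded media under `word/media/`, so a first pass can be sketched with the standard library alone. This sketch does not apply to legacy `.doc` or `.wps` binaries, and it orders images by archive member name, which is stable but not guaranteed to match reading order; a real implementation would likely follow the relationships referenced from the document body. Width and height are left out because they need an image decoder. `extract_docx_media` is a hypothetical name.

```python
import hashlib
import io
import zipfile


def extract_docx_media(file_bytes: bytes) -> list:
    """List embedded media of a .docx with a stable in-file index and hash."""
    items = []
    with zipfile.ZipFile(io.BytesIO(file_bytes)) as archive:
        # Sorting makes the order deterministic across runs (lexical order,
        # so "image10" sorts before "image2").
        names = sorted(
            name
            for name in archive.namelist()
            if name.startswith("word/media/") and not name.endswith("/")
        )
        for index, name in enumerate(names, start=1):
            data = archive.read(name)
            items.append(
                {
                    "image_index_in_file": index,
                    "media_name": name,
                    "image_sha256": hashlib.sha256(data).hexdigest(),
                    "image_size_bytes": len(data),
                }
            )
    return items
```

The hash recorded here is what later allows an `image_sha256_exact` link, provided the OCR side hashes the same bytes.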

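#### 17.2.4 Attachment extraction rules

Suggestions:

- Treat each attachment as an independent source file
- Do not reuse the main file's page-number space
- `source_file_name` is required
- The frontend may display:
  - `Attachment "xxx.pdf", page 2, image 1 is blurry`
  - `Attachment "xxx.jpg": image is blurry`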

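### 17.3 Suggested execution sequence for `runner.py`

The whole image quality task should run in the following order:

1. Create or confirm the `image_quality_run`
2. Update the run status to `running`
3. Call `input_resolver.py` to read the main file and attachments
4. Call `extractors.py` to extract images
5. Persist the extraction results to `leaudit_image_quality_items`
6. Check whether there are no images
7. If there are none, mark the run as `skipped` with skip reason `no_images`
8. If there are images, start concurrent detection
9. Call `detector.py` for batch detection and write back each item's quality result
10. Try to read OCR / visual_manifest
11. If it is readable, run the `locator` linking and backfill page number / bbox
12. Aggregate the `pass/review/reject` counts
13. Update the run's `summary_status`
14. Mark the run as `completed` or `partial_failed`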

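### 17.4 Recommended way to read the sequence

The execution can be understood as the following line:

1. Upload succeeds
2. The main document flow dispatches OCR / audit as usual
3. The image quality task is dispatched independently
4. The image quality task extracts and detects first
5. If OCR finishes first, the image quality task performs the visual_manifest linking along the way
6. If OCR has not finished, the image quality task returns a weak positioning result first
7. A later compensation task can upgrade weak positioning to strong positioning

This means:

- The image quality task does not need to wait for the main OCR task to finish before starting
- But it becomes more accurate once the OCR result is available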

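### 17.5 Whether a compensation task is needed

Keep an "optional compensation task" in the design, but the first release does not have to implement it.

A compensation task suits these cases:

- The image quality check finishes first while OCR is not yet complete
- The first pass obtained only `image_index_in_file`
- `ocr_page_num / bbox` is filled in after OCR completes

If implemented, the suggested name is:

- `leaudit.image_quality.relocate_by_ocr`

The first release may also skip a dedicated task and instead do lazy compensation once at detail-query time, or trigger a backfill when the main OCR flow ends.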

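### 17.6 Failure-handling principles for `runner.py`

State explicitly:

- Extraction fails:
  - Mark the current run as `failed`
  - Do not affect the main document flow
- Detection fails for a single image:
  - Record the failure reason on that item
  - The run may end as `partial_failed`
- OCR linking fails:
  - Not counted as a quality check failure
  - Counted only as a positioning downgrade
- All images are detected but none can be positioned to a page:
  - The run can still be `completed`
  - The display layer simply falls back to "image N"

In other words, quality detection failures and positioning failures must be counted separately and not merged into one category.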

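### 17.7 Recommended aggregation rules

`runner.py` should apply the following rules in the final aggregation:

- If at least 1 image is `reject`, `summary_status = reject`
- Otherwise, if at least 1 image is `review`, `summary_status = review`
- Otherwise, if all images are `pass`, `summary_status = pass`
- If the task itself never ran, `status = failed`
- If the task ran to the end but detection failed for some images, `status = partial_failed`
- If the document has no images, is plain text, or the switch is off, `status = skipped`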
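These rules can be sketched as one pure function over the per-item results. In this illustration a failed detection is represented as `None` in the input list, and a run in which every detection failed is reported as `failed`; both conventions, along with the name `summarize_run`, are assumptions not stated in the source.

```python
def summarize_run(statuses: list) -> dict:
    """Aggregate per-image results into run status and summary status.

    Each entry is 'pass', 'review', 'reject', or None for a failed detection.
    """
    counts = {"pass": 0, "review": 0, "reject": 0}
    failed = 0
    for status in statuses:
        if status in counts:
            counts[status] += 1
        else:
            failed += 1

    if not statuses:
        # No images at all: skipped rather than failed.
        return {"status": "skipped", "skip_reason": "no_images",
                "summary_status": None, "failed_count": 0, **counts}

    # Worst tier wins: reject > review > pass.
    if counts["reject"]:
        summary = "reject"
    elif counts["review"]:
        summary = "review"
    elif counts["pass"]:
        summary = "pass"
    else:
        summary = None

    if failed == len(statuses):
        run_status = "failed"  # assumption: nothing could be detected
    elif failed:
        run_status = "partial_failed"
    else:
        run_status = "completed"

    return {"status": run_status, "skip_reason": None,
            "summary_status": summary, "failed_count": failed, **counts}
```

Note that positioning failures do not appear here at all, which matches the principle that they are a display downgrade rather than a quality check failure.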

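### 17.8 Minimum deliverable for the first release

To keep the first release from becoming too heavy, set the delivery bar at this level:

- `extractors.py`
  - Support PDF, images, images embedded in docx, and attachments first
- `detector.py`
  - Provide a stable three-tier judgment first
- `runner.py`
  - Complete extraction, detection, and aggregation first
- `locator`
  - Support strong linking when the OCR `visual_manifest` exists
  - Allow degradation when it does not

As long as the module:

- Does not affect the main flow
- Can list the problem images
- Can position most PDF / image documents to the exact page
- Can at least position `docx` stably to "image N in the body"

the first release is ready to go live.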

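### 17.9 Final recommendation

When splitting the work, assign it as follows:

- Developer A: `SQL + VO + Controller + Service`
- Developer B: `input_resolver + extractors`
- Developer C: `detector + runner`
- Developer D: `locator / OCR visual_manifest linking`
- Frontend developer: `upload page + list page + detail page display`

This split minimizes conflicts in the files each person writes to and fits the current repository structure best.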