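Technical write-up: campus-recruitment resume completeness detection with PDF parsing, feature engineering, and LightGBM (approach and source code)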

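Today is Thursday, November 9, 2023, and it is sunny in Beijing. Let's look at a piece of work on resume completeness detection.
This article releases the technical solution and code for the "Campus-Recruitment Resume Information Completeness Detection Challenge" track of the iFLYTEK AI Algorithm Challenge. The solution relies on PDF parsing and feature engineering, and trains a LightGBM tree model with 10-fold cross-validation for a binary classification task.
It is shared here for reference.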

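1. Competition task

Intelligent resume screening needs large datasets, and the semi-structured nature of resumes, together with their widely varying styles, makes the problem hard. The competition provides an anonymized dataset of Chinese student resumes (PDF or DOCX) as training samples. Participants build a model on these samples to predict whether a resume meets the basic requirements for submission.

The task is as follows:

Resume completeness detection. Extract feature data for the required resume elements, build a model on the sample data, and predict whether a resume meets the basic submission requirements. There are two classes: complete (label 1) or incomplete (label 0).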

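2. Data and evaluation metric

Data: an anonymized student resume dataset (PDF or DOCX). The training set contains a little over 800 anonymized resumes. The test set is hidden, consists of real resumes, and contains a little over 100. All training files are PDFs.

Note: the dataset is split into positive and negative samples. Positive samples are complete resumes that meet the basic submission requirements; negative samples are incomplete resumes that do not.

Evaluation metric: F1 score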
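
As a reminder of what the metric measures, here is a minimal sketch of the binary F1 score computed from predictions. The source does not say whether the organizers score only the positive class or average over both classes; the sketch assumes the positive class (label 1) is scored, and the function names are illustrative.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2 * P * R / (P + R), with P = tp / (tp + fp) and R = tp / (tp + fn)."""
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_binary(y_true, y_pred, positive=1) -> float:
    """Count TP/FP/FN for the chosen positive label, then compute F1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return f1_from_counts(tp, fp, fn)


# Toy example: 3 complete resumes, 2 incomplete ones.
print(f1_binary([1, 1, 1, 0, 0], [1, 1, 0, 1, 0]))
```

Because F1 ignores true negatives, a model that labels almost everything as complete can still score well on an imbalanced set like this one, which is worth keeping in mind when reading the offline results.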

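3. Solution

3.1. Overview

The solution relies on PDF parsing and feature engineering, and trains a LightGBM tree model with 10-fold cross-validation for a binary classification task.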

3.2. PDF-to-text parsing

Several parsing tools were tried. The highest score was obtained with pymupdf.

pdfplumber
PyPDF2
pymupdf
…

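3.3. Feature engineering

The main text features are:

Page count
Length of the extracted text
Mean line length after splitting by line
Maximum line length after splitting by line
Standard deviation of line length after splitting by line
Number of distinct characters in the text
Text length minus the number of distinct characters
Number of distinct characters / (text length + 1)
Number of tokens after splitting the text on whitespace
Number of pieces after splitting the text on newlines
Count of "-"
Count of "x"
Count of "xxx"
Count of digits
Count of "@"
Count of ".com"
Count of "*"
Count of "：" (fullwidth colon)
Count of "****"
Number of phone numbers matched by a regular expression

Feature extraction code: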

import re

import fitz  # PyMuPDF
import numpy as np

# A run of non-digits followed by an 11-digit number starting with 1
# (the format of mainland China mobile numbers).
pattern = r"[\D]+(1\d{10})+(?!\d)"


def extract_feature_from_pdf(path):
    doc = fitz.open(path)
    all_content = []
    page_nums = 0
    for i in doc.pages():
        page_nums += 1
        all_content.append(i.get_text())
    text = ''.join(all_content)
    # Note: this removes every line break, so the split('\n') features below
    # see a single line (standard deviation 0, line count 1).
    text = ''.join(text.split('\n'))

    feat = [
        page_nums,
        len(text),
        np.mean([len(x) for x in text.split('\n')]),
        np.max([len(x) for x in text.split('\n')]),
        np.std([len(x) for x in text.split('\n')]),

        len(set(text)),
        len(text) - len(set(text)),
        len(set(text)) / (len(text) + 1),

        len(text.split()),
        len(text.split('\n')),
        text.count('-'),
        text.count('x'),
        text.count('xxx'),
        sum([text.count(x) for x in '0123456789']),
        text.count('@'),
        text.count('.com'),
        text.count('*'),
        text.count('：'),
        text.count('****'),
        len(re.compile(pattern).findall(text)),
        # Label from the folder name: '正样本' means "positive samples".
        1 if '正样本' in path else 0,
    ]
    return feat
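
The function above strips line breaks before computing the line-based statistics, so those features collapse to constants. This is consistent with the zero importance of `text_len_std` and `text_split_ent_len` in Section 3.6. The sketch below is not the author's code. It computes a subset of the same features on a plain string while keeping line breaks, using only the standard library. The function name, the dictionary keys, and the sample text are illustrative assumptions.

```python
import re
import statistics

# Same phone pattern as in the article: non-digits, then an 11-digit number starting with 1.
PHONE_PATTERN = re.compile(r"[\D]+(1\d{10})+(?!\d)")


def text_features(text: str) -> dict:
    """Compute a few of the article's text features, keeping line breaks intact."""
    lines = text.split("\n")
    line_lengths = [len(line) for line in lines]
    return {
        "text_len": len(text),
        "line_count": len(lines),
        "line_len_mean": statistics.mean(line_lengths),
        "line_len_max": max(line_lengths),
        # Population standard deviation, matching the default of np.std.
        "line_len_std": statistics.pstdev(line_lengths),
        "charset_size": len(set(text)),
        "digit_count": sum(ch in "0123456789" for ch in text),
        "at_count": text.count("@"),
        "star_count": text.count("*"),
        "phone_count": len(PHONE_PATTERN.findall(text)),
    }


# Made-up sample text, not taken from the competition data.
sample = "Name: Zhang San\nPhone: 13800000000\nEmail: a@b.com"
print(text_features(sample))
```

Whether restoring the line-based features would improve the score is untested here; the sketch only shows how they could carry information if line breaks were kept.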

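3.4. Training code

The model is a LightGBM tree model. The problem is treated as binary classification (configured in the code as a multiclass objective with two classes) and trained with 10-fold cross-validation.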

#!/usr/bin/env python
# _*_coding:utf-8_*_
# Author : Junhui Yu

import warnings

warnings.simplefilter('ignore')

import gc
import glob
import re

import fitz  # PyMuPDF
import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import StratifiedKFold
from tqdm import tqdm

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# A run of non-digits followed by an 11-digit number starting with 1.
pattern = r"[\D]+(1\d{10})+(?!\d)"

def extract_feature_from_pdf(path):
    doc = fitz.open(path)
    all_content = []
    page_nums = 0
    for i in doc.pages():
        page_nums += 1
        all_content.append(i.get_text())
    text = ''.join(all_content)
    # Line breaks are removed here, so the split('\n') features see one line.
    text = ''.join(text.split('\n'))

    feat = [
        page_nums,
        len(text),
        np.mean([len(x) for x in text.split('\n')]),
        np.max([len(x) for x in text.split('\n')]),
        np.std([len(x) for x in text.split('\n')]),

        len(set(text)),
        len(text) - len(set(text)),
        len(set(text)) / (len(text) + 1),

        len(text.split()),
        len(text.split('\n')),
        text.count('-'),
        text.count('x'),
        text.count('xxx'),
        sum([text.count(x) for x in '0123456789']),
        text.count('@'),
        text.count('.com'),
        text.count('*'),
        text.count('：'),
        text.count('****'),
        len(re.compile(pattern).findall(text)),
        # Label from the folder name: '正样本' means "positive samples".
        1 if '正样本' in path else 0,
    ]
    return feat

train_paths = glob.glob(
    '../xfdata/校招简历信息完整性检测训练集/*/*.pdf')

df_train = pd.DataFrame(
    columns=[
        'page_nums',
        'text_len',
        'text_len_mean',
        'text_len_max',
        'text_len_std',
        'text_set_len',
        'lentext-lenset',
        'lenset_div_lentext',
        'text_split_len',
        'text_split_ent_len',
        '-_nums',
        'x_nums',
        'xxx_nums',
        'dig_sum',
        '@_nums',
        '.com_nums',
        '*_nums',
        '：_nums',
        '****_nums',
        'phone_nums',
        'label'
    ])

# One row of features (plus the label) per training PDF.
for t_p in tqdm(train_paths):
    df_train.loc[len(df_train)] = extract_feature_from_pdf(t_p)

not_use_feats = ['label']
use_features = [col for col in df_train.columns if col not in not_use_feats]
print(len(use_features))
train = df_train[df_train['label'].notna()]

NUM_CLASSES = 2
FOLDS = 10
TARGET = 'label'

def run_lgb(df_train, use_features):
    target = TARGET
    # Out-of-fold class probabilities, one row per training sample.
    oof_pred = np.zeros((len(df_train), NUM_CLASSES))

    folds = StratifiedKFold(n_splits=FOLDS, shuffle=True, random_state=42)
    for fold, (tr_ind, val_ind) in enumerate(folds.split(df_train, df_train[target])):
        print(f'Fold {fold + 1}')
        x_train, x_val = df_train[use_features].iloc[tr_ind], df_train[use_features].iloc[val_ind]
        y_train, y_val = df_train[target].iloc[tr_ind], df_train[target].iloc[val_ind]
        train_set = lgb.Dataset(x_train, y_train)
        val_set = lgb.Dataset(x_val, y_val)

        params = {
            'learning_rate': 0.1,
            'metric': 'multiclass',
            'objective': 'multiclass',
            'num_classes': NUM_CLASSES,
            'feature_fraction': 0.75,
            'bagging_fraction': 0.75,
            'bagging_freq': 2,
            'n_jobs': -1,
            'seed': 1029,
            'max_depth': 10,
            'num_leaves': 100,
            'lambda_l1': 0.5,
            'lambda_l2': 0.8,
            'verbose': -1
        }

        # The original passed early_stopping_rounds=100 and verbose_eval=100
        # directly; LightGBM 4.x removed those arguments in favor of callbacks.
        model = lgb.train(params,
                          train_set,
                          num_boost_round=500,
                          valid_sets=[train_set, val_set],
                          callbacks=[lgb.early_stopping(stopping_rounds=100),
                                     lgb.log_evaluation(period=100)])
        oof_pred[val_ind] = model.predict(x_val)
        # Running accuracy over all rows; rows whose fold has not been
        # validated yet are still zeros and count as class 0.
        print('acc:', accuracy_score(np.argmax(oof_pred, axis=1), df_train['label']))
        del x_train, x_val, y_train, y_val, train_set, val_set
        gc.collect()

    # Returns the out-of-fold predictions and the model from the last fold.
    return oof_pred, model


oof_pred, model = run_lgb(train, use_features)
# Note: predictions are passed in the y_true position here, so precision and
# recall in the printed report are swapped relative to sklearn's convention.
print(classification_report(np.argmax(oof_pred, axis=1), df_train['label']))

model.save_model('model.txt')
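
To make the fold construction concrete, here is a simplified pure-Python sketch of stratified fold assignment. The training script itself uses scikit-learn's `StratifiedKFold`; this sketch is not that implementation, only an illustration of the idea that each fold keeps roughly the same class ratio as the full set. The function names and the toy label counts are assumptions.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=10, seed=42):
    """Assign each sample index to one of n_splits folds, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin over the folds.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


def train_val_splits(labels, n_splits=10, seed=42):
    """Yield (train_indices, val_indices); each fold is the validation set once."""
    folds = stratified_folds(labels, n_splits, seed)
    for k in range(n_splits):
        val_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, val_idx


# Toy labels with an imbalance similar in spirit to the competition data.
toy_labels = [0] * 20 + [1] * 80
for train_idx, val_idx in train_val_splits(toy_labels):
    negatives = sum(1 for i in val_idx if toy_labels[i] == 0)
    print(len(train_idx), len(val_idx), negatives)
```

Since every sample lands in exactly one validation fold, the out-of-fold predictions collected in `oof_pred` cover the whole training set, and that is what the final classification report is computed on.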

3.5. Inference code

#!/usr/bin/env python
# _*_coding:utf-8_*_
# Author : Junhui Yu

import glob
import re

import fitz  # PyMuPDF
import lightgbm as lgb
import numpy as np
import pandas as pd

# A run of non-digits followed by an 11-digit number starting with 1.
pattern = r"[\D]+(1\d{10})+(?!\d)"


def extract_feature_from_pdf(path):
    doc = fitz.open(path)
    all_content = []
    page_nums = 0
    for i in doc.pages():
        page_nums += 1
        all_content.append(i.get_text())
    text = ''.join(all_content)
    # Line breaks are removed here, exactly as in the training script.
    text = ''.join(text.split('\n'))

    feat = [
        page_nums,
        len(text),
        np.mean([len(x) for x in text.split('\n')]),
        np.max([len(x) for x in text.split('\n')]),
        np.std([len(x) for x in text.split('\n')]),

        len(set(text)),
        len(text) - len(set(text)),
        len(set(text)) / (len(text) + 1),

        len(text.split()),
        len(text.split('\n')),
        text.count('-'),
        text.count('x'),
        text.count('xxx'),
        sum([text.count(x) for x in '0123456789']),
        text.count('@'),
        text.count('.com'),
        text.count('*'),
        text.count('：'),
        text.count('****'),
        len(re.compile(pattern).findall(text)),
        # Placeholder label column; it is not used at inference time.
        1 if '正样本' in path else 0,
    ]
    return feat

df = pd.DataFrame(
    columns=[
        'page_nums',
        'text_len',
        'text_len_mean',
        'text_len_max',
        'text_len_std',
        'text_set_len',
        'lentext-lenset',
        'lenset_div_lentext',
        'text_split_len',
        'text_split_ent_len',
        '-_nums',
        'x_nums',
        'xxx_nums',
        'dig_sum',
        '@_nums',
        '.com_nums',
        '*_nums',
        '：_nums',
        '****_nums',
        'phone_nums',
        'label'
    ])

test_paths = glob.glob('/work/data/integrity-check-of-resume-test-set/*.pdf')

# One row of features per test PDF.
for t_f in test_paths:
    df.loc[len(df)] = extract_feature_from_pdf(t_f)

not_use_feats = ['label']
use_features = [col for col in df.columns if col not in not_use_feats]

model = lgb.Booster(model_file='model.txt')

# Class probabilities with shape (n_samples, 2).
y_pred = model.predict(df[use_features])

# Pick the class with the higher probability.
predict_label = np.argmax(y_pred, axis=1)

pd.DataFrame({
    'ResumeID': [x.split('/')[-1] for x in test_paths],
    'label': predict_label.astype(int)
}).to_csv('/work/output/result.csv', index=False)

3.6. Feature importance and F1 score

    feature             split  gain
16  *_nums                 96  23.080862
15  .com_nums              68  15.428008
6   lentext-lenset        126  12.632440
7   lenset_div_lentext    222  10.997545
13  dig_sum               218   7.045122
1   text_len              110   4.449556
17  ：_nums                179   4.178767
8   text_split_len        165   4.169549
10  -_nums                137   3.483447
5   text_set_len          184   3.018025
14  @_nums                 13   2.870494
11  x_nums                 94   2.141016
19  phone_nums             16   1.668496
18  ****_nums              12   1.608449
12  xxx_nums               24   1.249654
2   text_len_mean          31   1.066294
0   page_nums              31   0.803168
3   text_len_max            5   0.109109
9   text_split_ent_len      0   0.000000
4   text_len_std            0   0.000000

              precision    recall  f1-score   support

           0       0.75      0.84      0.79       105
           1       0.98      0.96      0.97       710

    accuracy                           0.94       815
   macro avg       0.86      0.90      0.88       815
weighted avg       0.95      0.94      0.94       815

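4. Further thoughts

The competition task is relatively simple. A production-grade system would need to evaluate resume completeness along more dimensions, for example the fill rate of core resume fields and importance weights assigned to those fields. Relevant techniques may include resume parsing based on named entity recognition, as the feature engineering in this article also suggests.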
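
One way to read the fill-rate and weighting idea is as a weighted share of non-empty core fields. The sketch below is only an illustration of that reading, not part of the competition solution. The field names and weights are hypothetical, and it assumes the fields have already been extracted by an upstream resume parser.

```python
# Hypothetical core fields and importance weights (assumptions for illustration).
CORE_FIELD_WEIGHTS = {
    "name": 1.0,
    "phone": 2.0,
    "email": 2.0,
    "education": 3.0,
    "experience": 2.0,
}


def completeness_score(fields: dict, weights: dict = CORE_FIELD_WEIGHTS) -> float:
    """Weighted fill rate: weight of non-empty core fields / total weight."""
    total = sum(weights.values())
    filled = sum(
        weight
        for name, weight in weights.items()
        # A field counts as filled only if it is present and not blank.
        if str(fields.get(name) or "").strip()
    )
    return filled / total


# Made-up parsed resume: email is blank and experience is missing.
parsed = {"name": "Zhang San", "phone": "13800000000", "email": "", "education": "BSc"}
print(completeness_score(parsed))
```

A score like this could serve as one more input feature or as a rule-based baseline, with the threshold for "complete" left to whoever defines the submission requirements.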

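Conclusion

This article only records the approach taken when entering the competition in August, and the code is nothing special. The task is fairly simple, the offline metrics are inflated, the training and online evaluation sets are small (patient feature engineering can push the score very high), and the technical value is limited. The whole effort was therefore a quick run of roughly an hour in total, and the prize at the end was a matter of luck.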

References

[1] Campus-Recruitment Resume Information Completeness Detection Challenge: https://challenge.xfyun.cn/topic/info?type=information-integrity&option=ssgy

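About us

Lao Liu (Liu Huanyong) is an NLP open-source enthusiast and practitioner. Homepage: https://liuhuanyong.github.io.

Lao Liu Talks NLP (老刘说NLP) regularly publishes language resources, engineering practice, and technical summaries. You are welcome to follow.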
