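How HumanEval Evaluates Code: Data Composition, Evaluation Logic, and the pass@k Metric

HumanEval (Hand-Written Evaluation Set) is the code evaluation benchmark introduced in the paper "Evaluating Large Language Models Trained on Code" (https://arxiv.org/abs/2107.03374).

I have been working on code evaluation recently, took quite a few detours, and misunderstood parts of the evaluation logic. I have reorganized my notes here for reference, especially on how to understand pass@k and how the unit tests are run.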

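I. Data composition of HumanEval

The HumanEval dataset contains only 164 samples, which is quite small. A JSON record gives the most direct picture of the format (data file: https://github.com/abacaj/code-eval/blob/main/human-eval/data/HumanEval.jsonl.gz):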

{
  "task_id": "HumanEval/0",
  "prompt": "from typing import List\n\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n    \"\"\" Check if in given list of numbers, are any two numbers closer to each other than\n    given threshold.\n    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n    False\n    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n    True\n    \"\"\"\n",
  "entry_point": "has_close_elements",
  "canonical_solution": "    for idx, elem in enumerate(numbers):\n        for idx2, elem2 in enumerate(numbers):\n            if idx != idx2:\n                distance = abs(elem - elem2)\n                if distance < threshold:\n                    return True\n\n    return False\n",
  "test": "\n\nMETADATA = {\n    'author': 'jt',\n    'dataset': 'test'\n}\n\n\ndef check(candidate):\n    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True\n    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False\n    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.95) == True\n    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.8) == False\n    assert candidate([1.0, 2.0, 3.0, 4.0, 5.0, 2.0], 0.1) == True\n    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 1.0) == True\n    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 0.5) == False\n\n"
}

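The fields are as follows:

task_id is the ID of the task; prompt is the problem statement (usually sent to the model as-is to obtain a completion); entry_point is the name of the function under test, which the harness passes to check(); canonical_solution is the reference solution; test contains the unit tests.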
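As a minimal sketch of how such records can be inspected, the snippet below reads a gzip-compressed JSONL file using only the standard library. The helper name read_jsonl_gz and the tiny demo file are assumptions made for illustration; the human_eval package provides its own read_problems for the real data.

```python
import gzip
import json
import os
import tempfile
from typing import Dict, Iterator


def read_jsonl_gz(path: str) -> Iterator[Dict]:
    """Yield one dict per non-empty line of a gzip-compressed JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


# Hypothetical one-record file with the same five fields as a HumanEval record.
demo_record = {
    "task_id": "Demo/0",
    "prompt": "def add(a, b):\n",
    "entry_point": "add",
    "canonical_solution": "    return a + b\n",
    "test": "def check(candidate):\n    assert candidate(1, 2) == 3\n",
}
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as fh:
    fh.write(json.dumps(demo_record) + "\n")

# Index the records by task_id, which is how the evaluation code looks them up.
problems = {rec["task_id"]: rec for rec in read_jsonl_gz(demo_path)}
print(sorted(problems["Demo/0"].keys()))
```

Pointed at a downloaded copy of HumanEval.jsonl.gz, the same helper should yield the 164 records keyed by task_id.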

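II. Evaluation logic of HumanEval

Each problem is sampled n times, every sample is run against the unit tests, and the pass rate is averaged. The execution logic can be seen in the source code at https://github.com/abacaj/code-eval/tree/main/human-eval.

1. Read each sample and query the model for a result

As shown in the code below, generate_one_completion is the function that asks the model for a result: it takes the prompt of each problem and returns a completion.

Because there are so few problems, a single run gives a skewed result, and the model's output varies from run to run (for example under top_p or top_k sampling). num_samples_per_task therefore controls how many completions are generated per problem (set to 200 in the code) so that a pass rate can be computed. The completion field stores each generated completion.

In total this yields 164 × 200 = 32,800 samples.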

from human_eval.data import write_jsonl, read_problems

# generate_one_completion(prompt) must be supplied by the user: it sends the
# prompt to the model and returns the completion as a string.
problems = read_problems()

num_samples_per_task = 200
samples = [
    dict(task_id=task_id, completion=generate_one_completion(problems[task_id]["prompt"]))
    for task_id in problems
    for _ in range(num_samples_per_task)
]
write_jsonl("samples.jsonl", samples)

This step also needs code post-processing, because the model tends to generate extra code after the target function, as in https://github.com/abacaj/code-eval/blob/main/core/evaluation.py:

# reference: https://github.com/declare-lab/instruct-eval/blob/main/human_eval/main.py#L35
def filter_code(completion: str) -> str:
    # The program tends to overwrite, we only take the first function
    completion = completion.lstrip("\n")
    return completion.split("\n\n")[0]

The post-processed text is used as the final code completion.
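To make the effect of this post-processing concrete, here is a small sketch that applies the same two-line logic to made-up completions. The sample strings are hypothetical and only meant to show what is kept and what is dropped.

```python
def filter_code(completion: str) -> str:
    # Same logic as the helper quoted above: drop leading newlines, then keep
    # only the text that comes before the first blank line.
    completion = completion.lstrip("\n")
    return completion.split("\n\n")[0]


# A completion that finishes the function and then keeps generating.
raw_completion = "\n    return a + b\n\n\ndef helper():\n    pass\n\nprint(add(1, 2))\n"
kept = filter_code(raw_completion)
print(repr(kept))

# A function body that itself contains a blank line gets cut at that line.
body_with_blank_line = "    x = a\n\n    return x + b\n"
truncated = filter_code(body_with_blank_line)
print(repr(truncated))
```

The second case shows a limitation of this heuristic: a correct body that contains a blank line is cut short, so the sample would fail the tests for a reason unrelated to the model's logic.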

2. Run the unit tests on the model's results

The logic here is to build a complete test program around each completion and run it as a unit test.

The code is as follows:

import multiprocessing
from typing import Dict, Optional

# create_tempdir, reliability_guard, swallow_io, time_limit and TimeoutException
# are helpers defined in the same module (execution.py in the human-eval source).


def check_correctness(problem: Dict, completion: str, timeout: float,
                      completion_id: Optional[int] = None) -> Dict:
    """Run the problem's test suite against one completion and report the result."""

    def unsafe_execute():
        with create_tempdir():
            # These system calls are needed when cleaning up tempdir.
            import os
            import shutil
            rmtree = shutil.rmtree
            rmdir = os.rmdir
            chdir = os.chdir

            # Disable functionalities that can make destructive changes to the test.
            reliability_guard()

            # Construct the check program and run it.
            print(completion)
            check_program = (
                problem["prompt"] + completion + "\n" +
                problem["test"] + "\n" +
                f"check({problem['entry_point']})"
            )
            try:
                exec_globals = {}
                with swallow_io():
                    with time_limit(timeout):
                        exec(check_program, exec_globals)
                result.append("passed")
            except TimeoutException:
                result.append("timed out")
            except BaseException as e:
                result.append(f"failed: {e}")

            # Needed for cleaning up.
            shutil.rmtree = rmtree
            os.rmdir = rmdir
            os.chdir = chdir

    # Run the untrusted code in a separate process so it can be killed on timeout.
    manager = multiprocessing.Manager()
    result = manager.list()
    p = multiprocessing.Process(target=unsafe_execute)
    p.start()
    p.join(timeout=timeout + 1)
    if p.is_alive():
        p.kill()
    if not result:
        result.append("timed out")
    return dict(
        task_id=problem["task_id"],
        passed=result[0] == "passed",
        result=result[0],
        completion_id=completion_id,
    )

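The test program is built by concatenating the problem's prompt, the model's completion, and the problem's test, separated by newlines, followed by a call to check() on the entry point.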

# Construct the check program and run it.
print(completion)
check_program = (
    problem["prompt"] + completion + "\n" +
    problem["test"] + "\n" +
    f"check({problem['entry_point']})"
)

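The unit test is then run directly with Python's built-in exec function. Given a timeout, the sample is marked passed if the program runs to completion without raising; if it exceeds the time limit it is marked timed out, and any other exception (a syntax error or a failed assertion, for example) marks it as failed.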

try:
    exec_globals = {}
    with swallow_io():
        with time_limit(timeout):
            exec(check_program, exec_globals)
    result.append("passed")
except TimeoutException:
    result.append("timed out")
except BaseException as e:
    result.append(f"failed: {e}")

After this step, every sample has a pass or fail result.
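The concatenate-then-exec idea can be tried on a toy problem. The sketch below is a simplification under stated assumptions: toy_problem and run_unsandboxed are hypothetical names, and the sandboxing, I/O swallowing, timeout and subprocess isolation of the real harness are deliberately left out, so it should only be run on code you trust.

```python
from typing import Dict

# Hypothetical toy problem using the same field names as a HumanEval record.
toy_problem: Dict[str, str] = {
    "task_id": "Toy/0",
    "prompt": "def add(a, b):\n",
    "entry_point": "add",
    "test": "def check(candidate):\n    assert candidate(1, 2) == 3\n",
}


def run_unsandboxed(problem: Dict[str, str], completion: str) -> str:
    """Build the check program and exec it. No sandbox, no timeout: toy use only."""
    check_program = (
        problem["prompt"] + completion + "\n" +
        problem["test"] + "\n" +
        f"check({problem['entry_point']})"
    )
    try:
        exec(check_program, {})
        return "passed"
    except BaseException as e:  # mirrors the broad catch in the harness
        return f"failed: {e}"


good = run_unsandboxed(toy_problem, "    return a + b\n")    # correct body
bad = run_unsandboxed(toy_problem, "    return a - b\n")     # assertion fails
broken = run_unsandboxed(toy_problem, "    return a +\n")    # syntax error
print(good, "|", bad, "|", broken)
```

A failed assertion and a syntax error both end up in the same failed bucket, which is why the pass rate measures functional correctness rather than how close the code came to being right.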

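III. A closer look at the pass@k metric for code models

Code generation models have mainly been benchmarked by matching samples against a reference solution; the match can be exact or fuzzy (as with the BLEU score).

For example:

EM (Exact Match) is the percentage of generated code whose token sequence is identical to that of the reference code.

BLEU comes from machine translation, where output closer to a professional human translation is considered better. In essence it measures how similar two sentences are: the higher the similarity, the higher the score.

CodeBLEU is a BLEU variant. On top of BLEU's n-gram matching it adds code syntax via the abstract syntax tree (AST) and code semantics via data flow.

However, match-based metrics for code have shortcomings. BLEU, for example, has trouble capturing semantic features specific to code.

For this reason, Kulal et al. (2019) evaluate functional correctness with the pass@k metric: k code samples are generated per problem, a problem counts as solved if any sample passes the unit tests, and the total fraction of solved problems is reported.

A single trial is too noisy, though, so several trials must be averaged. Estimated naively, pass@k requires repeating the experiment t times per problem, generating k samples each time, and averaging the pass rate. Estimating pass@100 with 100 repetitions would take 100 × 100 = 10,000 samples per problem, which is prohibitively expensive, while a smaller t makes the pass@k estimate less accurate (higher variance).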

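To evaluate pass@k, the paper therefore generates n ≥ k samples per task (n = 200 and k ≤ 100 in the paper), counts the number c ≤ n of correct samples that pass the unit tests, and computes the unbiased estimator:

$$\text{pass@}k := \mathbb{E}_{\text{Problems}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$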

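Here c is the number of the n generated samples that pass the tests. The larger n is, the more accurate the estimate, yet the cost is still far below t × k.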

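Suppose the model could only generate these n samples, each with equal probability, and c of them pass the tests. The probability that k samples drawn from them all fail is then the number of ways to choose k failing samples divided by the number of ways to choose any k samples:

$$\frac{\binom{n-c}{k}}{\binom{n}{k}}$$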

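By the law of large numbers, the sample mean approaches the expectation as the number of samples goes to infinity. Averaging $1 - \binom{n-c}{k} / \binom{n}{k}$ over all problems therefore gives the unbiased estimate of pass@k.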
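The unbiasedness claim can be checked with a short simulation. This is a sketch under assumptions that are not from the article: each sample passes independently with a hypothetical probability p, so the true value is 1 - (1 - p)^k, and the values of p, n, k and the number of trials are arbitrary. For contrast it also averages the naive plug-in estimate 1 - (1 - c/n)^k.

```python
import math
import random


def unbiased(n: int, c: int, k: int) -> float:
    """1 - C(n-c, k) / C(n, k); math.comb returns 0 when k > n - c."""
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def plug_in(n: int, c: int, k: int) -> float:
    """Naive alternative: estimate p by c/n, then use 1 - (1 - p)^k."""
    return 1.0 - (1.0 - c / n) ** k


rng = random.Random(0)
p, n, k, trials = 0.3, 20, 5, 20000  # hypothetical values for the demo
true_pass_at_k = 1.0 - (1.0 - p) ** k

sum_unbiased = 0.0
sum_plug_in = 0.0
for _ in range(trials):
    # Number of passing samples among n independent draws.
    c = sum(rng.random() < p for _ in range(n))
    sum_unbiased += unbiased(n, c, k)
    sum_plug_in += plug_in(n, c, k)

mean_unbiased = sum_unbiased / trials
mean_plug_in = sum_plug_in / trials
print(f"true={true_pass_at_k:.4f} unbiased={mean_unbiased:.4f} plug-in={mean_plug_in:.4f}")
```

In this setup the average of the combinatorial estimator lands close to the true value, while the plug-in version comes out noticeably lower, which is the reason for preferring the combinatorial form.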

The implementation:

import itertools
from typing import List, Union

import numpy as np


def estimate_pass_at_k(
    num_samples: Union[int, List[int], np.ndarray],
    num_correct: Union[List[int], np.ndarray],
    k: int,
) -> np.ndarray:
    """
    Estimates pass@k of each problem and returns them in an array.
    """

    def estimator(n: int, c: int, k: int) -> float:
        """
        Calculates 1 - comb(n - c, k) / comb(n, k).
        """
        # Fewer than k failing samples: every draw of k contains a pass.
        if n - c < k:
            return 1.0
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

    if isinstance(num_samples, int):
        num_samples_it = itertools.repeat(num_samples, len(num_correct))
    else:
        assert len(num_samples) == len(num_correct)
        num_samples_it = iter(num_samples)

    return np.array(
        [estimator(int(n), int(c), k) for n, c in zip(num_samples_it, num_correct)]
    )
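The estimator uses a running product instead of calling comb directly. The two agree because C(n-c, k) / C(n, k) equals the product of (1 - k/i) for i from n-c+1 to n. The sketch below checks this in pure Python; the function names and the test values are chosen only for illustration.

```python
import math


def pass_at_k_comb(n: int, c: int, k: int) -> float:
    """Direct form: 1 - C(n-c, k) / C(n, k)."""
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def pass_at_k_product(n: int, c: int, k: int) -> float:
    """Product form, written with a plain loop instead of np.prod."""
    if n - c < k:
        return 1.0
    prod = 1.0
    for i in range(n - c + 1, n + 1):
        prod *= 1.0 - k / i
    return 1.0 - prod


# Small case that can be checked by hand: n = 5 samples, c = 2 correct.
small_k1 = pass_at_k_product(5, 2, 1)
small_k2 = pass_at_k_product(5, 2, 2)

# Paper-sized case: n = 200 samples, k = 100, with a hypothetical c = 50.
large_comb = pass_at_k_comb(200, 50, 100)
large_product = pass_at_k_product(200, 50, 100)
print(small_k1, small_k2, large_comb, large_product)
```

For n = 5 and c = 2 the formula gives 1 - 3/5 = 0.4 at k = 1 and 1 - 3/10 = 0.7 at k = 2; in general pass@1 reduces to c/n. The product form stays within ordinary floating-point range even when the binomial coefficients themselves are very large.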

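https://zhuanlan.zhihu.com/p/653063532 works through the derivation of the formula for those who want more detail.

With that, the metric can be computed. For example, the official script produces: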

$ evaluate_functional_correctness data/example_samples.jsonl --problem_file=data/example_problem.jsonl
Reading samples...
6it [00:00, 3397.11it/s]
Running example suites...
100%|...| 6/6 [00:03<00:00, 1.96it/s]
Writing results to data/example_samples.jsonl_results.jsonl...
100%|...| 6/6 [00:00<00:00, 6148.50it/s]
{'pass@1': 0.4999999999999999}

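Summary

This article covered the HumanEval benchmark from three angles: the data, the evaluation logic, and how the pass@k metric is computed. I had long misunderstood pass@k as the pass rate over k predictions; only after reading the implementation did I understand it accurately.

Code evaluation is also an important part of the overall evaluation landscape; follow along if you are interested.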

References

1. https://github.com/abacaj/code-eval/blob/main/human-eval/

2. https://arxiv.org/abs/2107.03374

3. https://zhuanlan.zhihu.com/p/653063532

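About us

Lao Liu (Liu Huanyong), an NLP open-source enthusiast and practitioner. Homepage: https://liuhuanyong.github.io.

Lao Liu Talks NLP (老刘说NLP), https://zhuanlan.zhihu.com/p/653063532, regularly publishes language resources, engineering practice, technical summaries and more. Follows are welcome.

If you would like to join higher-quality sharing on knowledge graphs, event graphs, and LLM/AIGC practice, follow the official account and choose Member Community -> Join Group from the menu.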
