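2023 China Collegiate Computer Big Data Challenge: a baseline for paper subject classification | hosted by Tsinghua
Official site: https://www.heywhale.com/home/competition
The project source code is linked at the end of this article.
1. Competition overview
- Background: Since the end of 2022, large language models have been widely applied across industries, and influential applications built around academic tooling have appeared, such as ChatPDF. Separately, on March 14, 2023, Zhipu AI and Tsinghua University jointly released the open-source ChatGLM-6B model, which was downloaded and installed by more than 1 million people in under a month. The model topped the Hugging Face (HF) global large-model download chart for 12 consecutive days and had considerable impact on open-source communities in China and abroad.
To make the most of the open-source ChatGLM-6B model in driving the development of research tools, we have partnered with AMiner, the most influential academic platform in China, to launch the "ChatGLM Practice Competition · Academic Applications". The central theme is how to use ChatGLM-6B to improve academic tools. We hope the competition gives enthusiasts who want to research and build on large models a platform for practice. The competition offers 3 scenarios and 7 tracks: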
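Scenario 1: Paper reading
Track 1: Paper subject classification (Easy): classify each paper into 40 natural-science subjects based on its title and abstract. A paper may belong to a single subject or be interdisciplinary. Target accuracy is above 90%.
Track 2: Q&A research knowledge base (Medium): upload PDF papers to build a vectorized research knowledge base and answer free-form questions within it. Answers should be reasonably professional and must cite the relevant documents.
Track 3: Paper survey and comparative analysis (Medium): given the titles, abstracts, or full texts of several papers, summarize or compare their background, problems, methods, experiments, and conclusions.
Scenario 2: Submission and peer review
Track 4: Journal and conference recommendation (Medium): recommend the Top K suitable journals and conferences based on title and abstract, and give a reason for each recommendation according to its degree of match.
Track 5: Review response (Medium): fine-tune a review-response model on OpenReview data.
Scenario 3: Paper discovery
Track 6: Paper retrieval (Hard): standalone and mixed retrieval by a given concept, a given question, a given entity, and so on.
Track 7: Paper recommendation and sci-tech intelligence generation (Hard): based on a user profile (subscribed keywords plus search and browsing behavior), select one or more relevant papers from each day's newest papers, then fine-tune a large model on the paper information (title, authors, abstract, and optionally other information) to generate a sci-tech intelligence brief. The form and depth of the brief are up to the participants.
- Organization
Organizer: Zhipu AI
Co-organizer: Heywhale (和鲸科技)
Data provider: AMiner technical team
Organizational support: Hugging Face
Compute support: 揽睿星舟, AWS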
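2. Task overview: the paper subject classification track
- Task description
Classify each paper into the 40 natural-science subjects based on its title and abstract. A paper may belong to one subject or to several (interdisciplinary), and the target accuracy is above 90%.
- Data description
Dataset: titles and abstracts of 500 papers for each of the 40 subjects, plus titles and abstracts of about 1,000 interdisciplinary papers.
Test set: 500 papers, classified objectively; the evaluation metric is accuracy (Acc).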
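The task statement names accuracy as the metric but does not spell out how a multi-label prediction is scored. The sketch below assumes the strictest reading, exact match, where a paper counts as correct only if the predicted label set equals the gold label set. The function name `exact_match_accuracy` and the toy labels are illustrative, not part of the official scorer.

```python
from typing import Iterable, List


def exact_match_accuracy(gold: List[Iterable[str]], pred: List[Iterable[str]]) -> float:
    """Fraction of papers whose predicted label set equals the gold label set.

    Assumption: the official "Acc" is exact set match; the source does not say.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    # Compare as sets so that label order does not matter.
    hits = sum(1 for g, p in zip(gold, pred) if set(g) == set(p))
    return hits / len(gold)


# Toy data: two correct papers, one with an extra label, one wrong label.
gold_labels = [["社会学"], ["社会学", "石油工程"], ["化学"], ["光学"]]
pred_labels = [["社会学"], ["石油工程", "社会学"], ["化学", "光学"], ["物理学"]]
accuracy = exact_match_accuracy(gold_labels, pred_labels)
print(f"exact-match accuracy: {accuracy:.2f}")
```

Under this reading an interdisciplinary paper with one missing or one extra label scores zero, which is one reason the threshold used at prediction time matters.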
A sample of the raw dataset:
{"id":155,"title":"Modeling heterogeneous network user route and departure time responses to dynamic pricing","abstract":"The ability to realistically capture trip-makers’ responses to time-varying road charges is essential for network equilibrium assignment models typically applied to predict network flows in the presence of dynamic road (congestion) pricing. User responses to pricing are governed by individual trip-makers’ preferences, such as their value of time (VOT), and the cost they attach to late vs. early arrival relative to the destination. These behavioral characteristics vary across users. This paper presents a joint route and departure time network equilibrium assignment model explicitly considering heterogeneous users with different preferred arrival times at destinations, VOT, and values of early and late schedule delays (VOESD and VOLSD). The model is formulated as an infinite-dimensional variational inequality and solved by a column generation-based algorithmic framework that embeds: (i) an extreme non-dominated alternative-generating algorithm to obtain combinations of VOT, VOESD, and VOLSD subintervals (or breakpoints) that define multiple user classes, and the corresponding least trip cost alternative (joint departure time and path) for each user class, (ii) a traffic simulator to capture traffic flow dynamics and determine experienced travel costs; and (iii) a multi-class alternative flow updating scheme to solve the reduced multi-class simultaneous route and departure time user equilibrium problem defined by a subset of feasible alternatives. 
Application to an actual network illustrates the properties of the algorithm, and underscores the importance of capturing user heterogeneity and temporal shifts in the appraisal of dynamic pricing schemes.","subject_name":["交通运输工程"]}{"id":156,"title":"Duration-dependent effect of transient neonatal hypothyroidism on sertoli and germ cell number, and plasma and testicular interstitial fluid androgen binding protein concentration.","abstract":"The impact of transient neonatal hypothyroidism on growth and function of puberal testis during different milestones of postnatal testicular development was studied in Wister rats. Rat pups were made hypothyroid for 10, 15, 30, 40 and 60 days of postnatal age from birth by providing 0.05% (W\/V) methimazole (MMI) in the drinking water of the mother, from day 1 postpartum till weaning (25 days postpartum) and thereafter in the drinking water. Control rats were raised without MMI treatment. Sertoli cell number and its function was assessed on day 60 postpartum. Sertoli cell number increased consistently in 10, 15, 30 and 40 days transient hypothyroid rats but decreased in rats subjected to continuous hypothyroidism from birth to 60 days postpartum. Rats subjected to continuous hypothyroidism from birth showed spermatogenic arrest at puberty and had only a single layer of spermatogonia. Transient neonatal hypothyroidism for 10 (or) 15 days from birth increased spermatocytes (pachytene and zygotene), spermatids (elongated and round) whereas, that of 30 and 40 days decreases the number of germ cells. Plasma androgen binding protein (ABP) concentration decreased in puberal rats belonging to all groups, whereas the testicular interstitial fluid (TIF) concentration of ABP increased significantly in 10 and 15 days hypothyroid rats while it decreased in all other groups. These findings indicate that the mitogenic activity of Sertoli cell is increased irrespective of the duration of transient neonatal hypothyroidism. 
However, the functional activity of Sertoli cells (ABP production) in these puberal rats varies depending upon the postnatal period at which the animals were in hypothyroid state.","subject_name":["临床医学"]}
- train.json format (one JSON object per line)
{"id":0,"title":"title0","abstract":"abstract0","subject_name":["社会学"]}
{"id":1,"title":"title1","abstract":"abstract1","subject_name":["社会学","石油工程"]}
- test.json prediction file format (the organizers had not released it, so I constructed a few records for testing)
{"id":0,"title":"title0","abstract":"abstract0"}
{"id":1,"title":"title1","abstract":"abstract1"}
{"id":0,"title":"Oxidative coupling of methane in the redox cyclic mode over the catalysts on the basis of CeO2 and La2O3","abstract":"The 1% CeO 2 , 9% La 2 O 3 \/SiO 2 and 2% CeO 2 , 8% La 2 O 3 \/SiO 2 catalysts show reliable efficiency in the OCM reaction, as well as stable work in the redox cyclic mode. Selectivity to C 2 products remarkably increases if preliminary reduction of the catalyst by a small amount of hydrogen is used."}{"id":1,"title":"Tissue engineering: strategies, stem cells and scaffolds.","abstract":"Tissue engineering scaffolds are designed to influence the physical, chemical and biological environment surrounding a cell population. In this review we focus on our own work and introduce a range of strategies and materials used for tissue engineering, including the sources of cells suitable for tissue engineering: embryonic stem cells, bone marrow-derived mesenchymal stem cells and cord-derived mesenchymal stem cells. Furthermore, we emphasize the developments in custom scaffold design and manufacture, highlighting laser sintering, supercritical carbon dioxide processing, growth factor incorporation and zoning, plasma modification of scaffold surfaces, and novel multi-use temperature-sensitive injectable materials."}{"id":2,"title":"Enhancement of Forced Convection Subcooled Film Boiling Heat Transfer Using Gas Sheet Collapse by Electric Field Application","abstract":"Enhancement of forced-convection boiling heat transfer by electriceld is investigated experimentally. When a high-temperature horizontallament is immersed in water, a gas sheet is formed around and the abovelament due to liquid boiling, in the early immersion process. This gas-sheet markedly decreases the boiling cooling rate of thelament. 
Here, forced collapse of the gas sheet is attempted by imposing an electriceld to enhance the boiling cooling rate, In the experiments, a horizontal platinum wire of 0.5mm in diameter is immersed in pure water under atmospheric pressure, and a DC voltage up to 600V is applied between the wire surface and an electrode made of glass placed 10mm apart. The whole boiling curve is measured under different applied voltages and wire-falling velocities in 0.5 to 2.0m\/s range, and at subcooling of 60 K. The experimental results show that the electric field is effective in promoting the disintegration of the gas sheet. Under the tested conditions, boiling cooling rate increased two-fold for an applied electriceld of 600 V\/cm. This result shows that the use of an electriceld to break up the gas-sheet has resulted in a remarkable increase in the cooling rate at high superheats during initial cooling period, which is even greater than that used in the existing material manufacturing processes by the rapid cooling method, and therefore, this method may contribute to developing new materials."}{"id":195,"title":"Speciation of some heavy metals in bottom sediments of the Ob and Yenisei estuarine zones","abstract":"The speciation of Fe, Mn, Zn, Cu, Co, Ni, Cr, Pb, and Cd was studied in 52 samples of bottom sediments collected during Cruise 49 of the R\/V Dmitrii Mendeleev in estuaries of the Ob and Yenisei rivers in the southwestern Kara Sea. Immediately after sampling, the samples were subjected to on-board consecutive extraction to separate metal species according to their modes of occurrence in the sediments: (1) adsorbed, (2) amorphous Fe-Mn hydroxides and related metals, (3) organic + sulfide, and (4) residual, or lithogenic. The atomic absorption spectroscopy of the extracts was carried out at a stationary laboratory. 
The distribution of Fe, Zn, Cu, Co, Ni, Cr, Pb, and Cd species is characterized by the predominance of lithogenic or geochemically inert modes (70–95% of the bulk content), in which the metals are bound in terrigenous and clastic mineral particles and organic detritus. About half of the total Mn amount and 15–30% Zn and Cu is contained in geochemically mobile modes. The spatiotemporal variations in the proportions of metal species in the surface layer of sediments along the nearly meridional sections and through the vertical sections of bottom sediments cores testify that Mn and, to a lesser extent, Cu are the most sensitive to changes in the sedimentation environment. The role of their geochemically mobile species notably increases under reducing conditions."}
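3. Data conversion
The official data has to be converted into the input format the model expects. The converted result is given directly below for reference.
The 40 categories are: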
{"材料科学与工程", "临床医学", "电气工程", "数学", "化学", "地质工程", "地理学", "食品科学与工程", "医学", "生物学", "核科学与技术", "地球物理学", "水产", "药学", "交通运输工程", "体育学", "生物医学工程", "护理", "物理学", "心理学", "社会学", "神经科学", "计算机科学", "建筑学", "环境科学与工程", "机械工程", "航空航天工程", "石油工程", "免疫与微生物学", "矿业", "通信与信息科学", "光学", "历史学", "地质学", "教育学", "海洋工程", "公共管理学", "仪器科学与技术", "经济学", "音乐"}
The data is split into a training set and a test set at a ratio of 0.8:0.2.
{"id": 11810, "text_a": "Restoration of the shear capacity for RC beams with web openings using precast SHCC plates Providing web opening in the shear-span zone of RC beams results in significant reduction in the shear capacity of such beams. Thus, an efficient restoration technique has to be found out and implemented in order to compensate the developed reduction. The main target of the current paper is to introduce and validate an innovative restoration technique for the new construction making use of the Strain-Hardening Cementitious Composites (SHCC) material. Accordingly, precast thin SHCC plates having the required opening were cast and cured for about 3 weeks to eliminate the volumetric change issues, and then placed inside the formwork at both sides before casting the RC beams included web openings. The chosen thickness of the SHCC plates was 20 mm in order to be easily accommodated in the concrete cover. For the considered openings, the opening depth was kept constant to be 0.30 of the beam effective depth, while the opening length was varied considering three values; 150, 300, and 450 mm. Besides, small amount of internal reinforcement in the form of steel wire mesh was provided inside some SHCC plates in order to enhance their shear strength and ductility. Experimental results showed that the provided SHCC layers enabled the strengthened beams to exhibit distinguished performance in terms of ultimate capacity, ductility and decreased shear crack width. In addition, the gain in shear capacity due to the SHCC plates is decreased with the increase of the opening width. Finally, comparisons between the obtained experimental results and the predicted shear capacities stipulated by the ACI 318-19 and JSCE codes were performed. 
The comparisons revealed that the estimated shear capacities are in satisfactory agreement with the experimental results, however, these estimations tend to be overestimated with the increase of the opening length.", "choices": ["交通运输工程", "体育学", "机械工程", "水产", "建筑学", "公共管理学", "医学", "地质学", "地球物理学", "生物学", "临床医学", "数学", "物理学", "化学", "石油工程", "历史学", "地质工程", "音乐", "核科学与技术", "护理", "经济学", "航空航天工程", "海洋工程", "社会学", "药学", "心理学", "矿业", "材料科学与工程", "电气工程", "教育学", "神经科学", "地理学", "光学", "环境科学与工程", "计算机科学", "生物医学工程", "通信与信息科学", "免疫与微生物学", "食品科学与工程", "仪器科学与技术"], "labels": [4]}{"id": 5984, "text_a": "Caractéristiques et évaluation des symptômes de la rhinite allergique : Résultats de l’enquête CESAR Des recommandations sont publiées depuis plusieurs années pour la prise en charge de la rhinite allergique. Avec le temps, les concepts visant à définir les entités chroniques et celles qui se manifestent sur de plus courtes périodes ont évolué. Nous sommes ainsi passés du couple « perannuelle/saisonnière » à celui de « persistante/intermittente ». La sévérité des symptômes et leur répercussion sur la vie quotidienne des patients sont prises aussi en compte dans ces nouvelles recommandations. 
L’enquête observationnelle CESAR « Caractéristiques et Evaluation des Symptômes de la rhinite AlleRgique » vise à évaluer le paysage de la rhinite allergique en France sur ces nouveaux critères ainsi qu’à mieux appréhender les modalités de prise en charge des patients en médecine générale.", "choices": ["交通运输工程", "体育学", "机械工程", "水产", "建筑学", "公共管理学", "医学", "地质学", "地球物理学", "生物学", "临床医学", "数学", "物理学", "化学", "石油工程", "历史学", "地质工程", "音乐", "核科学与技术", "护理", "经济学", "航空航天工程", "海洋工程", "社会学", "药学", "心理学", "矿业", "材料科学与工程", "电气工程", "教育学", "神经科学", "地理学", "光学", "环境科学与工程", "计算机科学", "生物医学工程", "通信与信息科学", "免疫与微生物学", "食品科学与工程", "仪器科学与技术"], "labels": [10, 6]}{"id": 4707, "text_a": "Dynamic analysis of OWT foundation with large diameter monopile under transient storm loading To investigate the stability of monopile foundation under dynamic loading, a comprehensive numerical model for the analysis of offshore wind turbines (OWT) foundation under a general transient storm loading is presented in this study. The dynamic stiffness and soil deformation around the large-diameter monopile is simulated using this method. During the numerical analysis, a dynamic boundary surface model of soil is derived instead of the empirical strength degradation. Along the axis direction of the monopile, an intensive study about deformation law of the seabed soil is analysed, moreover, some parameters which may affect the OWT stability and the dynamic stiffness are discussed. 
Some conclusions can be drawn that the dynamic stiffness and the lateral displacement of the monopile foundation can be obviously improved by increasing the buried depth than the diameter, and the proposed failure mode can well describe the failure law of soil around the monopile due to the dynamic loading.", "choices": ["交通运输工程", "体育学", "机械工程", "水产", "建筑学", "公共管理学", "医学", "地质学", "地球物理学", "生物学", "临床医学", "数学", "物理学", "化学", "石油工程", "历史学", "地质工程", "音乐", "核科学与技术", "护理", "经济学", "航空航天工程", "海洋工程", "社会学", "药学", "心理学", "矿业", "材料科学与工程", "电气工程", "教育学", "神经科学", "地理学", "光学", "环境科学与工程", "计算机科学", "生物医学工程", "通信与信息科学", "免疫与微生物学", "食品科学与工程", "仪器科学与技术"], "labels": [22]}
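The conversion script itself is not shown in the article, so the following is only a minimal sketch of how the records above could be produced from the train.json layout. It assumes `text_a` is the title and abstract joined by a space (as the samples suggest) and that each entry in `labels` is the index of the subject name inside `choices`. The helper names, the shortened `CHOICES` list, and the seed are all illustrative.

```python
import json
import random
from typing import Dict, List, Tuple

# Shortened label list for illustration; the real task uses all 40 subjects.
CHOICES = ["社会学", "石油工程", "化学"]


def convert_record(record: Dict, choices: List[str]) -> Dict:
    """Map one official record to the text_a / choices / labels layout."""
    return {
        "id": record["id"],
        # Title and abstract are joined with a single space, as in the samples.
        "text_a": record["title"] + " " + record["abstract"],
        "choices": choices,
        # Each label is the index of the subject name inside `choices`.
        "labels": [choices.index(name) for name in record["subject_name"]],
    }


def split_records(records: List[Dict], train_ratio: float = 0.8,
                  seed: int = 1000) -> Tuple[List[Dict], List[Dict]]:
    """Shuffle a copy of the records and cut it into (train, dev)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def to_json_lines(records: List[Dict]) -> str:
    """Serialize records one per line, keeping Chinese labels readable."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


# Hypothetical input in the train.json layout described earlier.
raw_lines = [
    '{"id":0,"title":"title0","abstract":"abstract0","subject_name":["社会学"]}',
    '{"id":1,"title":"title1","abstract":"abstract1","subject_name":["社会学","石油工程"]}',
    '{"id":2,"title":"title2","abstract":"abstract2","subject_name":["化学"]}',
    '{"id":3,"title":"title3","abstract":"abstract3","subject_name":["石油工程"]}',
    '{"id":4,"title":"title4","abstract":"abstract4","subject_name":["化学"]}',
]
converted = [convert_record(json.loads(line), CHOICES) for line in raw_lines]
train, dev = split_records(converted)
print(to_json_lines(dev))
```

Writing `to_json_lines(train)` and `to_json_lines(dev)` to train.txt and dev.txt would give files in the shape the training command expects, provided the `choices` order is kept identical across all records.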
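4. Model training and prediction
For multi-task training, each dataset can be converted separately and the results mixed together. This covers general classification, review sentiment analysis, semantic similarity, entailment inference, multiple-choice reading comprehension, and many other "general classification" tasks.
## Code structure
├── deploy/simple_serving/ # model deployment scripts
├── utils.py               # data processing utilities
├── run_train.py           # fine-tuning script
├── run_eval.py            # evaluation script
├── label_studio.py        # data format conversion script
├── label_studio_text.md   # data annotation guide
└── README.md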
4.1 Fine-tuning
# Install the latest version of paddlenlp
!pip install --upgrade paddlenlp

# Copy the dataset into the data directory
!cp /home/aistudio/input/train.txt /home/aistudio/data
!cp /home/aistudio/input/dev.txt /home/aistudio/data

# Single-GPU launch:
!python run_train.py \
    --device gpu \
    --logging_steps 100 \
    --save_steps 100 \
    --eval_steps 100 \
    --seed 1000 \
    --model_name_or_path utc-base \
    --output_dir ./checkpoint/model_best \
    --dataset_path ./data/ \
    --max_seq_length 512 \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 32 \
    --gradient_accumulation_steps 8 \
    --num_train_epochs 20 \
    --learning_rate 1e-5 \
    --do_train \
    --do_eval \
    --do_export \
    --export_model_dir ./checkpoint/model_best \
    --overwrite_output_dir \
    --disable_tqdm True \
    --metric_for_best_model macro_f1 \
    --load_best_model_at_end True \
    --save_total_limit 1 \
    --save_plm

Because this example sets --do_eval, evaluation runs automatically once training finishes.
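Configurable parameters:
- single_label: whether to predict only one label per sample. Defaults to False, meaning multi-label classification.
- device: training device, either "cpu" or "gpu"; defaults to GPU training.
- logging_steps: number of steps between log outputs during training; default 10.
- save_steps: number of steps between model checkpoint saves during training; default 100.
- eval_steps: number of steps between evaluations during training; default 100.
- seed: global random seed; default 42.
- model_name_or_path: pretrained model used for few-shot training. Defaults to "utc-base"; options are "utc-xbase", "utc-base", "utc-medium", "utc-mini", "utc-micro", "utc-nano", "utc-pico".
- output_dir: required; directory where the trained or compressed model is saved; default None.
- dataset_path: directory containing the dataset files; default ./data/.
- train_file: file name (suffix) of the training set; default train.txt.
- dev_file: file name (suffix) of the development set; default dev.txt.
- max_seq_len: maximum text length. When the input, including the labels, exceeds this length, the input text is split automatically; the label part is never split. Default 512.
- per_device_train_batch_size: training batch size per GPU core/CPU; default 8.
- per_device_eval_batch_size: evaluation batch size per GPU core/CPU; default 8.
- num_train_epochs: number of training epochs; 100 is a reasonable choice when early stopping is used; default 10.
- learning_rate: maximum learning rate; 1e-5 is recommended for UTC; default 3e-5.
- do_train: whether to fine-tune. Pass the flag to enable fine-tuning; not set by default.
- do_eval: whether to evaluate. Pass the flag to enable evaluation; not set by default.
- do_export: whether to export. Pass the flag to export a static graph; not set by default.
- export_model_dir: path for the static graph export; default None.
- overwrite_output_dir: if True, overwrite the contents of the output directory. If output_dir points to a checkpoint directory, use it to continue training.
- disable_tqdm: whether to disable the tqdm progress bar.
- metric_for_best_model: metric used to select the best model; macro_f1 is recommended for UTC; default None.
- load_best_model_at_end: whether to load the best model when training ends; usually used together with metric_for_best_model; default False.
- save_total_limit: if set, limits the total number of checkpoints and deletes older checkpoints from the output directory; default None.
- --save_plm: save the model for inference deployment.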
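The training command combines --per_device_train_batch_size 32 with --gradient_accumulation_steps 8. Under the usual meaning of gradient accumulation, the optimizer then updates once per 32 × 8 = 256 samples on a single device. The helper below only illustrates that arithmetic; its name and the `num_devices` argument are made up for this sketch.

```python
def effective_batch_size(per_device_batch: int, accumulation_steps: int,
                         num_devices: int = 1) -> int:
    """Samples contributing to one optimizer update under gradient accumulation."""
    return per_device_batch * accumulation_steps * num_devices


# Values taken from the single-GPU training command above.
from_command = effective_batch_size(per_device_batch=32, accumulation_steps=8)

# Halving the per-device batch and doubling the accumulation steps keeps the
# product unchanged, which is a common way to fit into less GPU memory.
low_memory = effective_batch_size(per_device_batch=16, accumulation_steps=16)

print(from_command, low_memory)
```

Keeping this product fixed while lowering the per-device batch is a reasonable first thing to try if the 512-length inputs do not fit in memory at batch size 32.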
NOTE:
To resume training, set init_from_ckpt, for example init_from_ckpt=checkpoint/model_state.pdparams.
4.2 Model evaluation
Run the following command to evaluate the model:
# Evaluate the samples
!python run_eval.py \
    --model_path ./checkpoint/model_best \
    --test_path ./data/dev.txt \
    --per_device_eval_batch_size 32 \
    --max_seq_len 512 \
    --output_dir ./checkpoint_test
 99%|██████████████████████████████████████████▌| 98/99 [00:31<00:00, 3.30it/s]
[2023-06-15 22:19:30,758] [ INFO] - ***** test metrics *****
[2023-06-15 22:19:30,758] [ INFO] - test_loss = 1.8884
[2023-06-15 22:19:30,758] [ INFO] - test_macro_f1 = 0.8427
[2023-06-15 22:19:30,758] [ INFO] - test_micro_f1 = 0.9849
[2023-06-15 22:19:30,759] [ INFO] - test_runtime = 0:00:34.16
[2023-06-15 22:19:30,759] [ INFO] - test_samples_per_second = 92.189
[2023-06-15 22:19:30,759] [ INFO] - test_steps_per_second = 2.897
100%|███████████████████████████████████████████| 99/99 [00:33<00:00, 2.94it/s]
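Configurable parameters:
- model_path: path of the model directory to evaluate; it must contain the model weights file model_state.pdparams and the config file model_config.json.
- test_path: test set file used for evaluation.
- per_device_eval_batch_size: batch size; adjust it to your hardware; default 16.
- max_seq_len: maximum text length; inputs longer than this are split automatically; default 512.
- single_label: whether to predict only one label per sample. Defaults to False, meaning multi-label classification.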
4.3 Model prediction
paddlenlp.Taskflow loads the customized model. Use task_path to specify the path of the model weights; that path must contain the trained weights file model_state.pdparams.
import json

import pandas as pd
from paddlenlp import Taskflow

# Read the test file and merge title and abstract into one input string
data = []
ids = []
with open("/home/aistudio/input/test.json", "r", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line.strip())
        text = record["title"] + " " + record["abstract"]
        data.append(text)
        ids.append(record["id"])

schema = ["材料科学与工程", "临床医学", "电气工程", "数学", "化学", "地质工程", "地理学", "食品科学与工程", "医学", "生物学", "核科学与技术", "地球物理学", "水产", "药学", "交通运输工程", "体育学", "生物医学工程", "护理", "物理学", "心理学", "社会学", "神经科学", "计算机科学", "建筑学", "环境科学与工程", "机械工程", "航空航天工程", "石油工程", "免疫与微生物学", "矿业", "通信与信息科学", "光学", "历史学", "地质学", "教育学", "海洋工程", "公共管理学", "仪器科学与技术", "经济学", "音乐"]

my_cls = Taskflow("zero_shot_text_classification", model="utc-base", schema=schema, task_path="/home/aistudio/checkpoint/model_best/plm")
results = my_cls(data)

# Collect the predicted labels for each paper
labels = []
for prediction in results:
    label_list = []
    for item in prediction["predictions"]:
        label_list.append(item["label"])
    labels.append(label_list)

result = pd.DataFrame({"id": ids, "subject_name": labels})
print(result)

# Save the results
result.to_csv("result.csv", index=False)
with open("/home/aistudio/output/output.txt", "w+", encoding="UTF-8") as f:
    # "w+" creates the file if it does not exist and overwrites any existing content
    for item in results:
        print(item)
        # json.dumps escapes non-ASCII characters by default;
        # ensure_ascii=False writes the Chinese labels as-is
        line = json.dumps(item, ensure_ascii=False)
        f.write(line + "\n")
print("数据结果已导出")
[2023-06-16 14:58:56,470] [ INFO] - We are using to load "utc-base".[2023-06-16 14:58:56,472] [ INFO] - Already cached /home/aistudio/.paddlenlp/models/utc-base/utc_base_vocab.txt[2023-06-16 14:58:56,495] [ INFO] - tokenizer config file saved in /home/aistudio/.paddlenlp/models/utc-base/tokenizer_config.json[2023-06-16 14:58:56,500] [ INFO] - Special tokens file saved in /home/aistudio/.paddlenlp/models/utc-base/special_tokens_map.json[2023-06-16 14:58:56,502] [ INFO] - Assigning ["[O-MASK]"] to the additional_special_tokens key of the tokenizer id subject_name0 0 [化学]1 1 [光学]2 2 [物理学]3 195 [化学]{"predictions": [{"label": "化学", "score": 0.9999739312596861}], "text_a": "Oxidative coupling of methane in the redox cyclic mode over the catalysts on the basis of CeO2 and La2O3 The 1% CeO 2 , 9% La 2 O 3 /SiO 2 and 2% CeO 2 , 8% La 2 O 3 /SiO 2 catalysts show reliable efficiency in the OCM reaction, as well as stable work in the redox cyclic mode. Selectivity to C 2 products remarkably increases if preliminary reduction of the catalyst by a small amount of hydrogen is used."}{"predictions": [{"label": "光学", "score": 0.5380524000927461}], "text_a": "Tissue engineering: strategies, stem cells and scaffolds. Tissue engineering scaffolds are designed to influence the physical, chemical and biological environment surrounding a cell population. In this review we focus on our own work and introduce a range of strategies and materials used for tissue engineering, including the sources of cells suitable for tissue engineering: embryonic stem cells, bone marrow-derived mesenchymal stem cells and cord-derived mesenchymal stem cells. 
Furthermore, we emphasize the developments in custom scaffold design and manufacture, highlighting laser sintering, supercritical carbon dioxide processing, growth factor incorporation and zoning, plasma modification of scaffold surfaces, and novel multi-use temperature-sensitive injectable materials."}{"predictions": [{"label": "物理学", "score": 0.8062627429802265}], "text_a": "Enhancement of Forced Convection Subcooled Film Boiling Heat Transfer Using Gas Sheet Collapse by Electric Field Application Enhancement of forced-convection boiling heat transfer by electriceld is investigated experimentally. When a high-temperature horizontallament is immersed in water, a gas sheet is formed around and the abovelament due to liquid boiling, in the early immersion process. This gas-sheet markedly decreases the boiling cooling rate of thelament. Here, forced collapse of the gas sheet is attempted by imposing an electriceld to enhance the boiling cooling rate, In the experiments, a horizontal platinum wire of 0.5mm in diameter is immersed in pure water under atmospheric pressure, and a DC voltage up to 600V is applied between the wire surface and an electrode made of glass placed 10mm apart. The whole boiling curve is measured under different applied voltages and wire-falling velocities in 0.5 t{"predictions": [{"label": "化学", "score": 0.6280942049702516}], "text_a": "Speciation of some heavy metals in bottom sediments of the Ob and Yenisei estuarine zones The speciation of Fe, Mn, Zn, Cu, Co, Ni, Cr, Pb, and Cd was studied in 52 samples of bottom sediments collected during Cruise 49 of the R/V Dmitrii Mendeleev in estuaries of the Ob and Yenisei rivers in the southwestern Kara Sea. Immediately after sampling, the samples were subjected to on-board consecutive extraction to separate metal species according to their modes of occurrence in the sediments: (1) adsorbed, (2) amorphous Fe-Mn hydroxides and related metals, (3) organic + sulfide, and (4) residual, or lithogenic. 
The atomic absorption spectroscopy of the extracts was carried out at a stationary laboratory. The distribution of Fe, Zn, Cu, Co, Ni, Cr, Pb, and Cd species is characterized by the predominance of lithogenic or geochemically inert modes (70–95% of the bulk content), in which the metals are bound in terrigeno数据结果已导出
# Follows the official output format requirements
import json

import pandas as pd
from paddlenlp import Taskflow


# The backend runs in the project directory; if unsure about a path,
# use an absolute path such as "/home/mw/project/xxx"
def invoke(input_data_path):
    data = []
    ids = []
    # Read from the path passed in rather than a hard-coded file
    with open(input_data_path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line.strip())
            text = record["title"] + " " + record["abstract"]
            data.append(text)
            ids.append(record["id"])

    schema = ["材料科学与工程", "临床医学", "电气工程", "数学", "化学", "地质工程", "地理学", "食品科学与工程", "医学", "生物学", "核科学与技术", "地球物理学", "水产", "药学", "交通运输工程", "体育学", "生物医学工程", "护理", "物理学", "心理学", "社会学", "神经科学", "计算机科学", "建筑学", "环境科学与工程", "机械工程", "航空航天工程", "石油工程", "免疫与微生物学", "矿业", "通信与信息科学", "光学", "历史学", "地质学", "教育学", "海洋工程", "公共管理学", "仪器科学与技术", "经济学", "音乐"]

    my_cls = Taskflow("zero_shot_text_classification", model="utc-base", schema=schema, task_path="/home/aistudio/checkpoint/model_best/plm")
    results = my_cls(data)  # remember to adjust the pred_threshold threshold

    # Extract the label values from the predictions
    labels = []
    for prediction in results:
        label_list = []
        for item in prediction["predictions"]:
            label_list.append(item["label"])
        labels.append(label_list)

    # Build the output
    result = pd.DataFrame({"id": ids, "subject_name": labels})
    return result


input_data_path = "/home/aistudio/input/test.json"
result = invoke(input_data_path)
print(result)
[2023-06-16 14:59:12,503] [ INFO] - We are using to load "utc-base".
[2023-06-16 14:59:12,505] [ INFO] - Already cached /home/aistudio/.paddlenlp/models/utc-base/utc_base_vocab.txt
[2023-06-16 14:59:12,528] [ INFO] - tokenizer config file saved in /home/aistudio/.paddlenlp/models/utc-base/tokenizer_config.json
[2023-06-16 14:59:12,531] [ INFO] - Special tokens file saved in /home/aistudio/.paddlenlp/models/utc-base/special_tokens_map.json
[2023-06-16 14:59:12,535] [ INFO] - Assigning ["[O-MASK]"] to the additional_special_tokens key of the tokenizer
    id subject_name
0    0         [化学]
1    1         [光学]
2    2        [物理学]
3  195         [化学]
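5. Summary
Track 1: Paper subject classification (Easy): classify each paper into 40 natural-science subjects based on its title and abstract; a paper may belong to one subject or several, with accuracy above 90%. The task itself is fairly simple and took only a few hours. However, I lost a lot of time on the official image, and the submission ultimately failed. I have to complain here: the provided base image contains only TF and torch, not paddle, and the new image I built could never be published, so the submission never got off the ground. I am open-sourcing the baseline first; everyone is welcome to try it out and debug it.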
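5.1 Improvement strategies
- Preprocess the abstract, for example by summarizing it to extract the key content. The model's maximum input length is currently 512, so part of the information is lost.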
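The article does not implement this idea. As a much simpler stand-in for real summarization, the sketch below keeps the title and then whole leading sentences of the abstract until a budget is reached, so the text is cut at a sentence boundary rather than mid-sentence. The function name, the regex-based sentence splitter, and the character budget are assumptions for illustration; the model's limit is counted in tokens and also has to fit the label prompt, so a character count is only a rough proxy.

```python
import re


def shorten_abstract(title: str, abstract: str, max_chars: int = 1500) -> str:
    """Keep the title plus as many leading abstract sentences as fit in max_chars."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", abstract.strip()) if s]
    kept = []
    used = len(title)
    for sentence in sentences:
        cost = len(sentence) + 1  # +1 for the joining space
        if used + cost > max_chars:
            break
        kept.append(sentence)
        used += cost
    return " ".join([title] + kept)


# Tiny budget so the effect is visible on a toy input.
short_text = shorten_abstract("T", "Aaa bbb. Ccc ddd. Eee fff.", max_chars=20)
print(short_text)
```

A real summarizer or a keyword extractor could replace the "leading sentences" rule; the point is only that the classifier input stays within a known budget and is cut at a sentence boundary.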
Project source code (cloud): https://www.heywhale.com/mw/project/648c0e2e9de7b81463991943
For more content, follow the WeChat official account: 汀丶人工智能