Unstructured text, images, video, and similar data are a mine waiting to be tapped: in economics, management, and the social sciences, whoever can extract structured information from unstructured sources holds the data advantage in research. For many years, regular expressions have been my go-to tool for parsing documents, and I am sure it has been the same for many other technical folks and industries. Even though regular expressions are powerful and successful in some cases, they often struggle with the complexity and variability of real-world documents. Large language models such as ChatGPT, on the other hand, provide a more powerful and flexible approach to handling many types of document structures and content types....
arXiv2024 | Automating Grounded Theory Development in Qualitative Research with Large Language Models
In today's academia, qualitative research is highly valued for digging into the reasons and logic behind phenomena. However, analyzing qualitative data is often time-consuming and costly. With the arrival of large language models such as ChatGPT, this may be about to change. AcademiaOS is an open-source platform and a first attempt to automate grounded theory development in qualitative research with large language models. Using recent large language models’ language understanding, generation, and reasoning capabilities, AcademiaOS codes curated qualitative raw data such as interview transcripts and develops themes and dimensions to further develop a grounded theoretical model, affording novel insights. A user study (n=19) suggests that the system finds acceptance in the academic community and exhibits the potential to augment humans in qualitative research. AcademiaOS has been made open-source for others to build upon and adapt to their use cases....