Small Examples - Scraping a Web Page and Outputting JSON Data - 《LangChain 中文入门教程》 (LangChain Chinese Getting-Started Tutorial)

Scraping a Web Page and Outputting JSON Data

Sometimes we need to scrape web pages that have a fairly regular structure and return the information they contain as JSON.

We can do this with the LLMRequestsChain class; see the code below for details.

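To keep the example easy to follow, I format the output directly through the prompt instead of using the StructuredOutputParser from the previous example. This also shows an alternative approach to formatting output.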

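from langchain.prompts import PromptTemplate
from langchain.llms import OpenAI
from langchain.chains import LLMRequestsChain, LLMChain

# temperature=0 keeps the extraction as deterministic as possible
llm = OpenAI(model_name="gpt-3.5-turbo", temperature=0)

# {requests_result} is filled in by LLMRequestsChain with the fetched page content.
# The literal JSON braces are doubled ({{ and }}) so they are not read as template variables.
template = """Between >>> and <<< is the HTML content returned by the web page.
The page is the company profile of an A-share listed company on Sina Finance.
Please extract the requested information.
>>> {requests_result} <<<
Please return the data in the following JSON format
{{
  "company_name":"a",
  "company_english_name":"b",
  "issue_price":"c",
  "date_of_establishment":"d",
  "registered_capital":"e",
  "office_address":"f",
  "company_profile":"g"
}}
Extracted:"""

prompt = PromptTemplate(
    input_variables=["requests_result"],
    template=template
)

# The chain requests the URL first, then passes the result to the inner LLMChain
chain = LLMRequestsChain(llm_chain=LLMChain(llm=llm, prompt=prompt))

inputs = {
    "url": "https://vip.stock.finance.sina.com.cn/corp/go.php/vCI_CorpInfo/stockid/600519.phtml"
}

response = chain(inputs)
# The model's answer is stored under the 'output' key
print(response['output'])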
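
A note on the doubled braces in the template above. The sketch below uses plain Python `str.format` rather than LangChain itself, on the assumption that PromptTemplate's default f-string style formatting follows the same brace rules. The shortened template and the `<html>...</html>` value are placeholders made up for illustration.

```python
# Shortened stand-in for the template above (illustrative only).
template = """>>> {requests_result} <<<
{{
  "company_name": "a"
}}"""

# Single braces mark a variable; doubled braces collapse to one literal brace.
rendered = template.format(requests_result="<html>...</html>")
print(rendered)
```

Without the doubling, the formatter would try to treat the JSON body as a variable name and fail before the prompt ever reaches the model.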

As we can see, it does a good job of returning the result in the requested format.
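
Because the output is produced by prompting alone, it comes back as a plain string, and the model may add extra text around the JSON. Here is a minimal sketch of turning that string into a Python dict with the standard `json` module. The helper name `extract_json` and the sample string are hypothetical, and the sample uses the placeholder values from the prompt, not real data.

```python
import json


def extract_json(text: str) -> dict:
    """Parse the span from the first '{' to the last '}' in text as JSON."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start:end + 1])


# Stand-in for response['output']; placeholder values, not real data.
sample_output = 'Extracted: {"company_name": "a", "issue_price": "c"}'

data = extract_json(sample_output)
print(data["company_name"])
```

In practice you would pass `response['output']` instead of `sample_output`. If the model returns malformed JSON, `json.loads` raises `json.JSONDecodeError`, so the call is worth wrapping in a `try` block.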