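How you write a scraper depends on the type of data you want to collect, where it comes from, and the programming language you use. Below are several common ways to write one: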
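Writing a scraper in PHP
file_get_contents(): fetches the content of a remote web page.
preg_match_all(): extracts specific content from the page with a regular expression.
cut(): a user-defined function (not a PHP built-in) that extracts a substring from a string.
Example: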
```php
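<?php
// Fetch the page content
$url = "http://example.com/page";
$content = file_get_contents($url);
if ($content === false) {
    exit("Failed to fetch $url" . PHP_EOL);
}

// Extract the title, author, type, and similar fields.
// Assumption: the page has a <title> element and <div class="book-info"> blocks
// whose first two <span> children hold the author and the type; adjust the
// patterns to the real markup.
preg_match_all('/<title>(.*?)<\/title>/s', $content, $titles);
preg_match_all('/<div class="book-info">.*?<span[^>]*>([^<]+)<\/span>.*?<span[^>]*>([^<]+)<\/span>.*?<\/div>/s', $content, $bookInfo);

// Print the results; index 1 holds the first capture group of every match
foreach ($titles[1] as $i => $title) {
    echo "Title " . ($i + 1) . ": " . trim($title) . PHP_EOL;
}
foreach ($bookInfo[1] as $i => $author) {
    echo "Book " . ($i + 1) . " Author: " . trim($author) . PHP_EOL;
}
?>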
```
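The `cut()` helper in the list above is described only as a custom substring function, and the PHP example does not define it. The following is a minimal sketch of what such a helper might look like, written in Python. The signature (a start marker and an end marker) and the sample HTML are assumptions for illustration, not the original author's implementation.

```python
def cut(text: str, start: str, end: str) -> str:
    """Return the text between the first `start` marker and the next `end` marker.

    Returns an empty string when either marker is missing.
    """
    begin = text.find(start)
    if begin == -1:
        return ""
    begin += len(start)  # skip past the start marker itself
    stop = text.find(end, begin)
    if stop == -1:
        return ""
    return text[begin:stop]


# Hypothetical snippet standing in for a downloaded page
html = '<div class="book-info"><span class="book-author">Lu Xun</span></div>'
author = cut(html, '<span class="book-author">', "</span>")
print(author)
```

A helper like this is handy when the wanted value sits between two fixed strings and a full regular expression would be overkill.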
Writing a scraper in Python
requests: sends HTTP requests and retrieves the page content.
BeautifulSoup: parses the HTML content.
Example:
```python
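import requests
from bs4 import BeautifulSoup

url = "http://example.com/page"
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop on HTTP errors instead of parsing an error page
soup = BeautifulSoup(response.text, "html.parser")

# Extract the title, author, type, and similar fields
titles = soup.find_all("title")
book_titles = [title.text for title in titles]
book_info = soup.find_all("div", class_="book-info")
# Skip blocks without an author span so .text is never called on None
author_spans = [info.find("span", class_="book-author") for info in book_info]
book_authors = [span.text for span in author_spans if span is not None]

# Print the results
for i, title in enumerate(book_titles):
    print(f"Title {i + 1}: {title}")
for i, author in enumerate(book_authors):
    print(f"Book {i + 1} Author: {author}")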
```
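If installing third-party packages is not an option, a similar extraction can be approximated with the standard library alone. The sketch below swaps BeautifulSoup for `html.parser.HTMLParser` and runs on an inline string, so it can be tried without a network connection. The class name `BookInfoParser` and the sample markup are hypothetical. Unlike the example above, it collects every `<span class="book-author">`, whether or not it sits inside a `book-info` block.

```python
from html.parser import HTMLParser


class BookInfoParser(HTMLParser):
    """Collect <title> text and the text of <span class="book-author"> elements."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self.authors = []
        self._target = None  # list that should receive the next text node, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "title":
            self._target = self.titles
        elif tag == "span" and "book-author" in classes:
            self._target = self.authors

    def handle_endtag(self, tag):
        if tag in ("title", "span"):
            self._target = None

    def handle_data(self, data):
        # Ignore whitespace-only text and text outside the elements of interest
        if self._target is not None and data.strip():
            self._target.append(data.strip())


# Hypothetical sample markup standing in for a downloaded page
sample_html = """
<html><head><title>Book list</title></head>
<body>
  <div class="book-info"><span class="book-author">Author A</span></div>
  <div class="book-info"><span class="book-author">Author B</span></div>
</body></html>
"""

parser = BookInfoParser()
parser.feed(sample_html)
parser.close()
print(parser.titles)
print(parser.authors)
```

Keeping the parsing step separate from the download step also makes it easy to test the extraction logic against saved pages.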
Writing a scraper in C#
HttpClient: sends HTTP requests.
HtmlAgilityPack: parses the HTML content.
Example:
```csharp
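using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

class Program
{
    static async Task Main(string[] args)
    {
        // Fetch the page content
        var url = "http://example.com/page";
        using var httpClient = new HttpClient();
        var response = await httpClient.GetAsync(url);
        response.EnsureSuccessStatusCode(); // throw on non-2xx responses
        var content = await response.Content.ReadAsStringAsync();

        var htmlDoc = new HtmlDocument();
        htmlDoc.LoadHtml(content);

        // SelectNodes returns null when nothing matches, so fall back to an empty sequence
        IEnumerable<HtmlNode> titles =
            htmlDoc.DocumentNode.SelectNodes("//title") ?? Enumerable.Empty<HtmlNode>();
        var bookTitles = titles.Select(t => t.InnerText).ToList();

        IEnumerable<HtmlNode> bookInfo =
            htmlDoc.DocumentNode.SelectNodes("//div[@class='book-info']") ?? Enumerable.Empty<HtmlNode>();
        // Skip blocks that have no author span instead of dereferencing null
        var bookAuthors = bookInfo
            .Select(i => i.SelectSingleNode(".//span[@class='book-author']"))
            .Where(span => span != null)
            .Select(span => span.InnerText)
            .ToList();

        // Print the results
        for (int i = 0; i < bookTitles.Count; i++)
        {
            Console.WriteLine($"Title {i + 1}: {bookTitles[i]}");
        }
        for (int i = 0; i < bookAuthors.Count; i++)
        {
            Console.WriteLine($"Book {i + 1} Author: {bookAuthors[i]}");
        }
    }
}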
```
Collecting data with a shell script
curl: sends HTTP requests.
-