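Without further ado, here is a screenshot of the finished client first.
Crawling 10 pages (about 500 threads) takes roughly 10 s, and 500 pages (more than 20,000 threads) takes about 2 min. So the performance isn't particularly good, but it isn't terrible either.
With that out of the way, let's build this simple client step by step.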
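1. Create the project
Create an empty WPF project and add references to the DevExpress DLLs you need.
DevExpress can be downloaded from the official site; basically any version from 16 upward will do. The trial version works as well: in my experience it doesn't really restrict you after it expires, it only shows a pop-up during development that you can close, which is fairly generous.
Download: https://www.devexpress.com/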
2. Build the UI
This is mostly a matter of writing XAML. The DevExpress Demo Center also has plenty of samples. Here is the code.
<dx:ThemedWindow x:Class="SearchAnyWay.MainWindow"
                 xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
                 xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
                 xmlns:d="http://schemas.microsoft.com/expression/blend/2008"
                 xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
                 xmlns:dx="http://schemas.devexpress.com/winfx/2008/xaml/core"
                 xmlns:dxmvvm="http://schemas.devexpress.com/winfx/2008/xaml/mvvm"
                 xmlns:dxe="http://schemas.devexpress.com/winfx/2008/xaml/editors"
                 xmlns:dxlc="http://schemas.devexpress.com/winfx/2008/xaml/layoutcontrol"
                 xmlns:dxg="http://schemas.devexpress.com/winfx/2008/xaml/grid"
                 xmlns:local="clr-namespace:SearchAnyWay"
                 mc:Ignorable="d"
                 Title="百度贴吧搜索神器(v1.0)" Height="600" Width="800">
    <Grid>
        <dxlc:LayoutControl VerticalAlignment="Stretch" Orientation="Vertical" TextBlock.FontSize="11">
            <Label VerticalAlignment="Top" FontWeight="Bold" Content="输入您需要查找的关键字"></Label>
            <dxlc:LayoutGroup Orientation="Horizontal">
                <dxlc:LayoutItem Label="关键字(K)" AddColonToLabel="True">
                    <dxe:TextEdit EditValue="{Binding Path=Name, Mode=TwoWay, UpdateSourceTrigger=PropertyChanged, ValidatesOnDataErrors=True}">
                        <dxmvvm:Interaction.Triggers>
                            <dxmvvm:KeyToCommand KeyGesture="Enter" Command="{Binding SearchCommand}"></dxmvvm:KeyToCommand>
                        </dxmvvm:Interaction.Triggers>
                    </dxe:TextEdit>
                </dxlc:LayoutItem>
                <dxlc:LayoutItem Label="贴吧名(N)" AddColonToLabel="True">
                    <dxe:TextEdit EditValue="{Binding Path=HubName, Mode=TwoWay, UpdateSourceTrigger=PropertyChanged, ValidatesOnDataErrors=True}">
                    </dxe:TextEdit>
                </dxlc:LayoutItem>
                <dxlc:LayoutItem Label="爬取页数(P)" AddColonToLabel="True">
                    <dxe:ComboBoxEdit ItemsSource="{Binding PageRange}" SelectedItem="{Binding Page}" ShowSizeGrip="False" IsTextEditable="False">
                    </dxe:ComboBoxEdit>
                </dxlc:LayoutItem>
                <dxlc:LayoutGroup HorizontalAlignment="Right" VerticalAlignment="Center">
                    <dx:SimpleButton x:Name="btnSearch" Content="查找(S)" Width="80" Command="{Binding SearchCommand}"></dx:SimpleButton>
                </dxlc:LayoutGroup>
            </dxlc:LayoutGroup>
            <dxg:TreeListControl x:Name="treeList" Margin="0,10" ItemsSource="{Binding Source}" SelectionMode="Row" SelectedItem="{Binding SelectedRow}">
                <dxg:TreeListControl.Columns>
                    <dxg:TreeListColumn FieldName="Title" Header="标题" Width="2*"/>
                    <dxg:TreeListColumn FieldName="Brief" Width="2*" Header="详情"/>
                    <dxg:TreeListColumn Header="回复数" FieldName="CommentCount" Width="*"/>
                    <dxg:TreeListColumn Header="作者" FieldName="AuthorName" Width="*"/>
                </dxg:TreeListControl.Columns>
                <dxg:TreeListControl.View>
                    <dxg:TreeListView x:Name="view" VerticalScrollbarVisibility="Auto" AutoExpandAllNodes="True" AllowEditing="False" NavigationStyle="Row" ShowIndicator="False" TreeDerivationMode="ChildNodesSelector" ChildNodesPath="ICDItems">
                        <dxmvvm:Interaction.Triggers>
                            <dxmvvm:EventToCommand EventName="SourceUpdated" Command="{Binding Commands.ExpandAllNodes, ElementName=view}" />
                            <dxmvvm:EventToCommand EventName="RowDoubleClick" Command="{Binding SearchCommand}" CommandParameter="{Binding ElementName=treeList,Path=SelectedItem}" />
                        </dxmvvm:Interaction.Triggers>
                    </dxg:TreeListView>
                </dxg:TreeListControl.View>
            </dxg:TreeListControl>
            <dxlc:LayoutGroup VerticalAlignment="Bottom" Orientation="Horizontal">
                <Label Content="帖子总数:" HorizontalAlignment="Right"/>
                <Label Content="{Binding Source.Count, UpdateSourceTrigger=PropertyChanged}" HorizontalAlignment="Right">
                </Label>
            </dxlc:LayoutGroup>
            <dxlc:LayoutGroup VerticalAlignment="Bottom" Orientation="Horizontal">
                <dxe:CheckEdit IsChecked="{Binding IsAll}" Content="Include All" HorizontalAlignment="Left"/>
                <dx:SimpleButton Content="Copy VLPath To Clipboard" IsEnabled="{Binding CanNext}" Command="{Binding CopyVLPathCommand}" HorizontalAlignment="Left"></dx:SimpleButton>
                <dxlc:LayoutGroup HorizontalAlignment="Right">
                    <dx:SimpleButton Content="下载(D)" Width="80" IsEnabled="{Binding CanNext}" Command="{Binding NextCommand}"></dx:SimpleButton>
                    <dx:SimpleButton Content="清除(C)" Width="80" IsEnabled="{Binding CanNext}" Command="{Binding OKCommand}"></dx:SimpleButton>
                    <dx:SimpleButton Content="合作(P)" Width="80" Command="{Binding CancelCommand}"></dx:SimpleButton>
                </dxlc:LayoutGroup>
            </dxlc:LayoutGroup>
        </dxlc:LayoutControl>
        <dx:WaitIndicator DeferedVisibility="{Binding IsLoading}" />
    </Grid>
</dx:ThemedWindow>
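3. Implement the MVVM pattern
This uses the MVVM framework that ships with DevExpress, which is basically the same as a framework you would build yourself on plain WPF. If you aren't familiar with MVVM, there are plenty of articles about it on cnblogs.
(1) In the code-behind, set the theme and bind the view model.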
using System;
using System.Windows;
using System.Windows.Media.Imaging;
using DevExpress.Xpf.Core;

public partial class MainWindow
{
    public MainWindow()
    {
        InitializeComponent();

        // Set the theme and window style.
        ApplicationThemeHelper.UseLegacyDefaultTheme = true;
        ApplicationThemeHelper.ApplicationThemeName = Theme.VisualStudioCategory;
        this.WindowStyle = System.Windows.WindowStyle.SingleBorderWindow;
        this.Icon = new BitmapImage(new Uri("../../debug.png", UriKind.Relative));
        this.BorderThickness = new Thickness(0);
        this.Margin = new Thickness(0);
        this.Padding = new Thickness(0);

        // Bind the view model.
        this.DataContext = new MainViewModel();
    }
}
(2) Design the entity class for a thread.
Design it around whatever information you want to crawl.
public class ArticleModel
{
    public string Title { get; set; }
    public string Brief { get; set; }
    public int CommentCount { get; set; }
    public string AuthorName { get; set; }
}
(3) Declare the page count, the thread collection, and the other properties in the ViewModel.
// Loading flag
private bool _loading;
public bool IsLoading
{
    get { return this._loading; }
    set { SetProperty(ref _loading, value, () => IsLoading); }
}

// Tieba (forum) name
private string _hub;
public string HubName
{
    get { return this._hub; }
    set { SetProperty(ref _hub, value, () => HubName); }
}

// Number of pages to crawl
private int _page;
public int Page
{
    get { return this._page; }
    set { SetProperty(ref _page, value, () => Page); }
}

// Thread collection (the backing field is private; only the property is exposed)
private ObservableCollection<ArticleModel> _source;
public ObservableCollection<ArticleModel> Source
{
    get { return _source; }
    set { SetProperty(ref _source, value, () => Source); }
}
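For readers following along without DevExpress, here is a minimal plain-WPF sketch of roughly what a SetProperty-style helper does: store the new value and raise PropertyChanged only when the value actually changed. ObservableObject and PagingViewModel are hypothetical names for illustration, and the helper takes the property name via CallerMemberName rather than the lambda form used above, so treat it as an approximation and not as DevExpress's actual implementation.

```csharp
using System.Collections.Generic;
using System.ComponentModel;
using System.Runtime.CompilerServices;

// Hypothetical base class: a hand-rolled stand-in for a view-model base.
public abstract class ObservableObject : INotifyPropertyChanged
{
    public event PropertyChangedEventHandler PropertyChanged;

    // Assigns the field and notifies bindings only if the value changed.
    protected bool SetProperty<T>(ref T field, T value,
        [CallerMemberName] string propertyName = null)
    {
        if (EqualityComparer<T>.Default.Equals(field, value))
        {
            return false; // unchanged: no notification
        }

        field = value;
        PropertyChanged?.Invoke(this, new PropertyChangedEventArgs(propertyName));
        return true;
    }
}

// Hypothetical view model exposing just the page count.
public class PagingViewModel : ObservableObject
{
    private int _page;

    public int Page
    {
        get { return _page; }
        set { SetProperty(ref _page, value); }
    }
}
```

Assigning the same value twice raises the event only once, which keeps bound controls from refreshing needlessly.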
(4) Bind the search logic to the button's Command, bind the drop-down list, and so on.
public AsyncCommand SearchCommand { get; set; }
public IEnumerable<int> PageRange { get; private set; }

public MainViewModel()
{
    Page = 10;
    PageRange = new List<int>() { 10, 50, 100, 200, 500, 1000, 10000 };
    Source = new ObservableCollection<ArticleModel>();
    SearchCommand = new AsyncCommand(Search);
}
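4. A simple crawler implementation
We use HttpClient to send the request and get the page's HTML source, and AngleSharp to parse the HTML.
Install AngleSharp from NuGet (press Ctrl+Shift+P to install a NuGet package quickly): Install-Package AngleSharp
Basic usage: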
// Fetch the page source from the response.
string pageData = await http.GetStringAsync($"https://tieba.baidu.com/f?kw={HubName}&ie=utf-8&pn={pnIndex}");

// Parse the page source with AngleSharp.
IHtmlDocument doc = await parser.ParseDocumentAsync(pageData);
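5. Analyze Baidu Tieba
As you can see, the list-page URLs are basically identical. The one URL parameter that changes with the page is pn (Page Number), and the rule is pn = (Page - 1) * 50, where 50 is roughly the number of threads on each page. In other words, pn is the offset of the first thread on the page, not the page number itself.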
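The pn rule can be sketched as a small helper. TiebaUrls, PageToOffset and BuildListUrl are hypothetical names, the figure of 50 threads per page is the article's estimate, and escaping the forum name with Uri.EscapeDataString is an added precaution that is not in the original code.

```csharp
using System;

// Hypothetical helper illustrating the pn = (Page - 1) * 50 rule.
public static class TiebaUrls
{
    // Per the article: roughly 50 threads per list page.
    private const int ThreadsPerPage = 50;

    // page is 1-based; the result is the offset of the first thread on that page.
    public static int PageToOffset(int page)
    {
        if (page < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(page), "page is 1-based");
        }

        return (page - 1) * ThreadsPerPage;
    }

    // Builds the list-page URL in the same shape used in the article.
    public static string BuildListUrl(string hubName, int page)
    {
        string kw = Uri.EscapeDataString(hubName);
        return $"https://tieba.baidu.com/f?kw={kw}&ie=utf-8&pn={PageToOffset(page)}";
    }
}
```

For example, TiebaUrls.PageToOffset(3) gives 100. The crawler code in this article gets the same offsets by iterating a 0-based index and multiplying it by 50.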
That makes things easy: grab the node for each thread, then look up the fields we need inside it one by one.
The core crawling code is as follows.
// Namespaces needed: System.Linq, System.Net.Http, System.Threading.Tasks,
// System.Collections.ObjectModel, AngleSharp.Html.Dom, AngleSharp.Html.Parser
await Task.Run(() =>
{
    var http = new HttpClient();
    var parser = new HtmlParser();
    var result = Enumerable.Range(0, Page)
        .AsParallel()
        .AsOrdered()
        .SelectMany(page =>
        {
            return Task.Run(async () =>
            {
                // 0-based page index times 50 gives the pn offset.
                var pnIndex = page * 50;
                // Fetch the page source from the response.
                string pageData = await http.GetStringAsync($"https://tieba.baidu.com/f?kw={HubName}&ie=utf-8&pn={pnIndex}");
                // Parse the page source with AngleSharp.
                IHtmlDocument doc = await parser.ParseDocumentAsync(pageData);
                // One ArticleModel per thread node; ?. guards against missing child nodes.
                return doc.QuerySelectorAll(".t_con.cleafix").Select(tag => new ArticleModel()
                {
                    Title = tag.QuerySelector(".j_th_tit")?.TextContent?.Trim(),
                    Brief = tag.QuerySelector(".threadlist_abs.threadlist_abs_onlyline")?.TextContent?.Trim(),
                    CommentCount = Convert.ToInt32(tag.QuerySelector(".threadlist_rep_num.center_text")?.TextContent),
                    AuthorName = tag.QuerySelector(".frs-author-name.j_user_card")?.TextContent?.Trim(),
                });
            }).GetAwaiter().GetResult();
        });
    Source = new ObservableCollection<ArticleModel>(result);
});
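As an alternative to blocking inside PLINQ, the same fan-out can be written with Task.WhenAll and a SemaphoreSlim that caps the number of requests in flight. This is a different technique from the code above, shown only as a sketch: PageFetcher, FetchAllAsync, maxConcurrency and the fetchPage delegate are all hypothetical names, and fetchPage stands in for the "download and parse one page" step so the sketch does not depend on HttpClient or AngleSharp.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: fetches all pages concurrently with a concurrency cap.
public static class PageFetcher
{
    // fetchPage receives the pn offset and returns the items parsed from that page.
    public static async Task<List<T>> FetchAllAsync<T>(
        int pageCount,
        int maxConcurrency,
        Func<int, Task<IEnumerable<T>>> fetchPage)
    {
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            // Start one task per page; each waits for a free slot before fetching.
            List<Task<List<T>>> tasks = Enumerable.Range(0, pageCount)
                .Select(async page =>
                {
                    await gate.WaitAsync();
                    try
                    {
                        return (await fetchPage(page * 50)).ToList();
                    }
                    finally
                    {
                        gate.Release();
                    }
                })
                .ToList();

            // WhenAll returns results in task order, so page order is preserved.
            List<T>[] pages = await Task.WhenAll(tasks);
            return pages.SelectMany(p => p).ToList();
        }
    }
}
```

In the view model this could be awaited directly from the search method, with the returned list wrapped in a new ObservableCollection and assigned to Source. Whether it is actually faster than the PLINQ version would have to be measured.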
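One small detail: when a DOM element's class attribute contains a space, it actually holds several class names, and in the selector you must join them with '.' instead of the space. For example, if the element's class is 'ftt poot', the lookup should be tag.QuerySelector(".ftt.poot"). This one had me stuck for a long time, probably because I hadn't worked with this sort of thing much before.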
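A short sketch of the difference, assuming AngleSharp 0.10 or later (where HtmlParser.ParseDocument is available). SelectorDemo and FirstText are hypothetical names, and 'ftt' and 'poot' are the placeholder classes from the example above.

```csharp
using AngleSharp.Dom;
using AngleSharp.Html.Dom;
using AngleSharp.Html.Parser;

// Hypothetical demo of compound vs. descendant class selectors.
public static class SelectorDemo
{
    // Returns the text of the first element matching the selector, or null if none matches.
    public static string FirstText(string html, string selector)
    {
        var parser = new HtmlParser();
        IHtmlDocument doc = parser.ParseDocument(html);
        IElement match = doc.QuerySelector(selector);
        return match?.TextContent;
    }
}
```

Given the markup `<div class="ftt poot">both</div><div class="ftt"><span class="poot">nested</span></div>`, the selector ".ftt.poot" matches the first div, which carries both classes, while ".ftt .poot" (with a space) is a descendant selector and matches the span inside the second div.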
That's it, the crawling feature is done. The remaining odds and ends are yours to flesh out however you like.
Source code: https://github.com/BruceQiu1996/WPF-/tree/master