How a Website Runs Unattended [Part 2: Regular Expressions]

License: CC 4.0 BY-SA

Original link: http://www.cnblogs.com/daneas/p/4846669.html

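This article shows how to use regular expressions to extract job postings, rental listings, and similar information from the Tongliang Shichuang (铜梁视窗) website, and offers a way to download the images in a page. A worked example shows how to separate and process a page's text and image data.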

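Part 1 covered the basics of collecting website data. This part covers the key to collecting job postings, rental listings, and similar information for Tongliang Shichuang: regular expressions.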

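This example only illustrates how to extract web page data with regular expressions; readers will need to work through it themselves (though it is simple enough that there is little to work through).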

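        // Extracts the title and body text of a detail page.
        // Returns true only when both were found and look long enough to be real content.
        // Requires: using System.Text.RegularExpressions;
        private bool GetTLcridit(string page, out string title, out string detail)
        {
            title = "";
            detail = "";

            try
            {
                // Title: the text of the class="bt7" cell, up to the next tag.
                string regtitle = "<td height=\"30\" align=\"center\" class=\"bt7\">(?<title>[^<]*)";

                Match mTitle = Regex.Match(page, regtitle);

                // Body: [^`]* matches any character except a backtick, line breaks included,
                // so the capture can span several lines.
                string regdetail = "<td class=black id=fontzoom>(?<detail>[^`]*)</td>[\\s]*</tr>[\\s]*<tr align=\"right\">";

                Match mDetail = Regex.Match(page, regdetail);

                // Check Success instead of indexing into a possibly empty collection.
                if (!mTitle.Success || !mDetail.Success)
                {
                    return false;
                }

                // Read the captures by name rather than by position.
                title = mTitle.Groups["title"].Value;
                detail = mDetail.Groups["detail"].Value;

                return title.Length > 1 && detail.Length > 100;
            }
            catch
            {
                title = "";
                detail = "";

                return false;
            }
        }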
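To make the extraction step easier to try in isolation, here is a minimal sketch that runs the same title pattern against a made-up snippet of markup. The class name, the helper names `ExtractTitle` and `ExtractDetail`, and the sample HTML are assumptions for illustration and do not come from the original project. `ExtractDetail` also shows one possible alternative to the `[^`]*` trick: a lazy `.*?` with `RegexOptions.Singleline`, which still works when the body contains a backtick. Note that this swaps greedy matching for lazy matching, so it stops at the first closing sequence rather than the last.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ExtractionSketch
{
    // Hypothetical helper: returns the captured title, or null when nothing matches.
    public static string ExtractTitle(string page)
    {
        // Same title pattern as GetTLcridit.
        const string pattern = "<td height=\"30\" align=\"center\" class=\"bt7\">(?<title>[^<]*)";

        Match m = Regex.Match(page, pattern);

        // Match.Success avoids indexing into an empty MatchCollection.
        return m.Success ? m.Groups["title"].Value.Trim() : null;
    }

    // Hypothetical helper: same delimiters as GetTLcridit, but with a lazy ".*?"
    // under Singleline so that "." also matches line breaks.
    public static string ExtractDetail(string page)
    {
        const string pattern = "<td class=black id=fontzoom>(?<detail>.*?)</td>\\s*</tr>\\s*<tr align=\"right\">";

        Match m = Regex.Match(page, pattern, RegexOptions.Singleline);

        return m.Success ? m.Groups["detail"].Value : null;
    }

    public static void Main()
    {
        // Made-up sample markup, for illustration only.
        string sample =
            "<tr><td height=\"30\" align=\"center\" class=\"bt7\"> Warehouse clerk wanted </td></tr>\n" +
            "<tr><td class=black id=fontzoom>Line one\nwith a ` backtick</td>\n</tr>\n<tr align=\"right\">";

        Console.WriteLine(ExtractTitle(sample) ?? "(no title)");
        Console.WriteLine(ExtractDetail(sample) ?? "(no detail)");
        Console.WriteLine(ExtractTitle("<p>unrelated</p>") ?? "(no title)");
    }
}
```

Returning null on a failed match keeps the "not found" case explicit, so the caller can decide whether to skip the page or log it.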

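The method above only separates out the text, leaving the rest of the processing to the developer. But, as noted in Part 1 of this series, what if the body contains images?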

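No problem: here is a way to download the images as well. The code below is only a fragment, so developers will need to work through it themselves.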

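            // Fragment: paurl, PushToWeb, CheckUrl, GetCQSTL, GetLocalTextImg and the CMS_* types are not shown here.
            // Isolate the headline list block; [^`]* also matches across line breaks.
            string regul = "<div class=\"yaowen\">(?<ul>[^`]*)<ul class=\"hd ywbq\">";

            string page = PushToWeb(paurl, Encoding.Default);

            MatchCollection ms = Regex.Matches(page, regul);

            // If the block is missing, fall back to an empty string so the loop below finds no links.
            page = ms.Count > 0 ? ms[0].Groups["ul"].Value : "";

            // Collect the article links inside the block.
            string regurl = "<a href=\"(?<url>[^\"]*)\" target=\"_blank\"";

            ms = Regex.Matches(page, regurl);

            foreach (Match item in ms)
            {
                // Prepend the host to the captured href.
                string url = "http://www.cqstl.gov.cn" + item.Groups["url"].Value;

                // Skip URLs that CheckUrl rejects.
                if (!CheckUrl(url))
                {
                    continue;
                }

                string temppage = PushToWeb(url, Encoding.Default);

                string title = "";

                string detail = "";

                if (GetCQSTL(temppage, out title, out detail))
                {
                    CMS_Collection co = new CMS_Collection();

                    co.Title = title;

                    co.Detail = GetLocalTextImg(detail, title); // external image links are replaced here

                    co.NewsFrom = url;

                    co.Bits = 1;

                    co.CreatedTime = DateTime.Now;

                    co.TagId = 0;

                    CMS_CollectionBaseDAL.Create(co);
                }

            }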
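The fragment calls `GetLocalTextImg` to replace external image links, but the post does not show that method. The following is only a sketch of how such a replacement step might look, not the author's implementation. The class name, the `RewriteImages` method, the `saveImage` callback, and the sample markup are all assumptions. The callback stands in for the actual download: it receives the absolute image URL and returns the local path to write into the HTML, which keeps the sketch runnable without network access.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

public static class LocalImageSketch
{
    // Matches an <img> tag up to and including the opening quote of its src
    // attribute ("head"), then the src value itself ("src").
    private static readonly Regex ImgSrc = new Regex(
        "(?<head><img\\b[^>]*?\\bsrc\\s*=\\s*[\"'])(?<src>[^\"']+)",
        RegexOptions.IgnoreCase);

    // Hypothetical stand-in for GetLocalTextImg: rewrites every <img src> in
    // detail to whatever saveImage returns for the absolute image URL.
    public static string RewriteImages(string detail, Uri pageUrl, Func<Uri, string> saveImage)
    {
        return ImgSrc.Replace(detail, m =>
        {
            Uri absolute;

            // Resolve relative paths such as "/upload/a.jpg" against the page URL.
            if (!Uri.TryCreate(pageUrl, m.Groups["src"].Value, out absolute))
            {
                return m.Value; // leave links that cannot be parsed untouched
            }

            return m.Groups["head"].Value + saveImage(absolute);
        });
    }

    public static void Main()
    {
        // Made-up sample markup, for illustration only.
        string html = "<p>text</p><img alt=\"x\" src=\"/upload/a.jpg\" width=\"100\">";

        string result = RewriteImages(
            html,
            new Uri("http://www.cqstl.gov.cn/news/1.html"),
            // A real callback would download the image here and return where it was saved.
            u => "/local/" + Path.GetFileName(u.AbsolutePath));

        Console.WriteLine(result);
    }
}
```

Because only the `src` value is replaced, the other attributes of each `<img>` tag are left as they were. The pattern is deliberately simple: it would also match attributes such as `data-src`, so real pages may need a stricter expression or an HTML parser.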

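Visit Tongliang Shichuang to see a preview of this project. Leave me a message and you can also get the next installment of this series early.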

Reposted from: https://www.cnblogs.com/daneas/p/4846669.html