
wget's -e option

wget is truly an indispensable tool: small in size, yet powerful. Accordingly, its options are numerous and its man page is long — but one small trick is never spelled out there.
The man page has this description:

Wget can follow links in HTML and XHTML pages and create local versions of remote web sites, fully recreating the directory structure of the original site. This is sometimes referred to as “recursive downloading.” While doing that, Wget respects the Robot Exclusion Standard (/robots.txt). Wget can be instructed to convert the links in downloaded HTML files to the local files for offline viewing.
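The recursive mirroring described above looks roughly like this in practice. This is a sketch; `example.com` is a placeholder, not a site from the original post:

```shell
# Mirror a site recursively, rewriting links so the local copy
# works for offline viewing (example.com is a placeholder URL).
wget --mirror \
     --convert-links \
     --page-requisites \
     --no-parent \
     https://example.com/
```

`--mirror` implies recursion with infinite depth and timestamping; `--page-requisites` also grabs the images and CSS a page needs to render.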

So, if you want to mirror an entire site but its /robots.txt says:

User-agent: *
Disallow: /

then you're out of luck, heh.
What's more, I combed through the whole man page and found no way around it — surely you shouldn't have to hack the source over something this small…
But in fact there is such an option:

-e command
--execute command
Execute command as if it were a part of .wgetrc. A command thus invoked will be executed after the commands in .wgetrc, thus taking precedence over them. If you need to specify more than one wgetrc command, use multiple instances of -e.

With this, you can make wget ignore robots.txt — specifically, -e robots=off (the space is optional, so -erobots=off works too). Heh heh.
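Putting the two pieces together — again a sketch, with `example.com` as a placeholder:

```shell
# Mirror a site while ignoring its robots.txt for this run.
# -e injects a .wgetrc-style command; "robots = off" is the
# corresponding wgetrc setting (example.com is a placeholder).
wget --mirror --convert-links -e robots=off https://example.com/

# To make it permanent instead, put this line in ~/.wgetrc:
#   robots = off
```

Since -e commands run after .wgetrc is read, a one-off `-e robots=off` also overrides a `robots = on` set in your config file.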

4 comments

  1. Can this parameter get you the PHP source code of a site?

  2. Nice, noted.

    I remember once I needed to download some web pages,

    but because of robots.txt I just couldn't get them.
