commit 411c87eb7e8f28401f5926003a222ca9262a35b2
parent b1cd1b5ccd153c9ac9dd9577a8c7a53ef892c697
Author: Tomas Hlavaty <tom@logand.com>
Date: Tue, 1 Feb 2011 02:36:43 +0100
index.org added
Diffstat:
A | index.org | | | 345 | +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ |
1 file changed, 345 insertions(+), 0 deletions(-)
diff --git a/index.org b/index.org
@@ -0,0 +1,345 @@
+#+options: creator:nil timestamp:nil author:nil
+
+w3mail
+
+w3mail is a program for sending web pages via email while filtering
+out unwanted content.
+
+* Introduction
+
+There are many ways of browsing the Web. In many cases, I prefer
+using my email reader for managing the web pages I read.
+
+Besides removing distractions like advertisements and excessive
+navigational noise, this keeps the number of open tabs in my web
+browser down, uses less memory, improves readability and gives me
+powerful management of unread and read web pages: marking, expiry
+and deletion. It is asynchronous, and the actual reading takes
+minimal keystrokes and no aiming with the mouse at all.
+
+#+begin_quote
+To look at page I send mail to a demon which runs wget and mails the
+page back to me. It is very efficient use of my time, but it is slow
+in real time. -- [[http://lwn.net/Articles/262570/][rms]]
+#+end_quote
+
+I've used various conventional web browsers like [[http://www.mozilla.com/firefox/][Firefox]] and also some
+unconventional ones like [[http://emacs-w3m.namazu.org/][emacs-w3m]], [[http://emacs-w3m.namazu.org/info/emacs-w3m_69.html][emacs-w3m/shimbun]] and [[http://surf.suckless.org/][surf]] but
+none of them seems the right choice. There seem to be two kinds of
+web pages I look at:
+
+1. quickly disposable pages that I skim in search of a particular
+ (often brief) piece of information
+
+2. pages with "deep", valuable information
+
+For the disposable browsing, Firefox or emacs-w3m works well. w3mail
+tries to fill the gap in the second case, for browsing web pages that
+carry non-trivial information of long-term value or those that require
+more focus and time to read.
+
+* Dependencies
+
+Linux only.
+
+** Build dependencies
+
+- git
+- gcc
+- make
+
+If you are using Ubuntu, you can install these programs by running:
+
+: $ sudo apt-get install git-core gcc make
+
+** Runtime dependencies
+
+*** Required runtime dependencies
+
+- wget
+- md5sum (coreutils)
+
+If you are using Ubuntu, you can install these programs by running:
+
+: $ sudo apt-get install wget coreutils
+
+*** Optional runtime dependencies
+
+- sendmail (mailutils)
+- xmlstarlet
+- tidy
+
+If you are using Ubuntu, you can install these programs by running:
+
+: $ sudo apt-get install mailutils
+: $ sudo apt-get install xmlstarlet
+: $ sudo apt-get install tidy
+
+* Download
+
+Clone the git repository:
+
+: $ git clone http://logand.com/git/w3mail.git
+
+* Building from sources
+
+Switch to the new directory and make the w3mail executable:
+
+: $ cd w3mail
+: $ make
+
+This will build the w3mail and dirpop3d programs.
+
+* Configuration
+
+** Executable path
+
+First, it is convenient to put the w3mail program somewhere reachable
+from $PATH, e.g.
+
+- create a symlink to the w3mail executable file in your ~/bin
+ directory if the ~/bin directory is in your $PATH
+
+- or add the w3mail git directory into your $PATH.
+
+** Configuration directory
+
+Next, set up the configuration directory:
+
+: $ mkdir ~/.w3mail
+
+and put one of the following sets of lines into your ~/.w3mail/config file:
+
+- If you want to use the local pop3 daemon dirpop3d:
+
+ : cat /dev/stdin >`mktemp ~/.w3mail/inbox/username/XXXXXX`
+ : email@address
+ : email@address
+ : host.name
+
+ In this case, you will also need to create the inbox directory:
+
+ : $ mkdir ~/.w3mail/inbox
+
+ and an inbox directory for one user:
+
+ : $ mkdir ~/.w3mail/inbox/username
+
+- If you want to use sendmail from your local machine:
+
+ : sendmail -t
+ : email@address
+ : email@address
+ : host.name
+
+- If you want to use sendmail from a remote machine:
+
+ : ssh username@host.name -e none /usr/lib/sendmail -t
+ : email@address
+ : email@address
+ : host.name
+
+In the examples above, replace username, email@address and host.name
+with the correct values.
+
+** Content filters
+
+Finally, set up the filter directory:
+
+: $ mkdir ~/.w3mail/filter
+
+and put some filters there:
+
+- Edit ~/.w3mail/filter/default
+
+ : #!/bin/sh
+ : tidy -q -n -c -asxml -f /dev/null | xmlstarlet ed -O -N x="http://www.w3.org/1999/xhtml" -d "//x:script" -d "//x:object" -d "//x:form"
+
+- Edit ~/.w3mail/filter/xpath
+
+ : #!/bin/sh
+ : XPATH=`echo "$1" | tr \" \'`
+ : tidy -q -n -c -asxml -f /dev/null | xmlstarlet sel -O -N x="http://www.w3.org/1999/xhtml" -t -c "$XPATH" | xmlstarlet ed -O -N x="http://www.w3.org/1999/xhtml" -d "//x:script" -d "//x:object" -d "//x:form"
+
+- Edit ~/.w3mail/filter/bbc
+
+ : #!/bin/sh
+ : tidy -q -n -c -asxml -f /dev/null | xmlstarlet sel -O -N x="http://www.w3.org/1999/xhtml" -t -c "//x:*[@class='story-body']" -c "//x:*[@class='storybody']" | xmlstarlet ed -O -N x="http://www.w3.org/1999/xhtml" -d "//x:script" -d "//x:form" -d "//x:*[@id='page-bookmark-links-head']" -d "//x:object" -d "//x:*[@class='hidden']" -d "//x:*[@class='hyperpuff']" -d "//x:*[@class='links-list']" -d "//x:*[@class='warning']//x:p" -d "//x:*[@class='story-feature related narrow']" -d "//x:*[@class='comment-introduction']"
+
+Put any custom filters into the ~/.w3mail/filter directory.
+
+Then make the filters executable:
+
+: $ chmod +x ~/.w3mail/filter/*
+
+Also, tell w3mail when to use those filters by adding filter
+definitions to the ~/.w3mail/tidy file:
+
+: bbc http://www.bbc.co.uk/ filter bbc
+: emacswiki http://www.emacswiki.org/ xpath //*[@class="content browse"]
+
+Each line consists of the filter name (unused for now), the URL prefix
+to match and the filter type:
+
+- filter: followed by the name of the filter to look up and execute in
+ the ~/.w3mail/filter directory
+
+- xpath: followed by the XPath expression of the web page DOM element
+ holding the interesting content.
+
+If no filter is specified in the ~/.w3mail/tidy file, the default
+filter ~/.w3mail/filter/default is run.
+
+Note: to find out the XPath expression of the element I am interested
+in, I use Firebug (a Firefox plug-in), point to that element and choose
+"Copy XPath".
+
+* Invocation from shell
+
+I anticipate a few ways of using w3mail:
+
+- Send a single web page:
+
+ : $ w3mail 'http://logand.com/'
+
+- Send many web pages:
+
+ First save the URLs into a file, one URL per line. Then run:
+
+ : $ cat file | w3mail
+
+ Or, much faster, in parallel with at most 20 processes:
+
+ : $ cat file | xargs -n1 -P20 w3mail
+
+- Run w3mail in the background:
+
+ First start the server:
+
+ : $ echo >~/.w3mail/in; tail -f ~/.w3mail/in | xargs -n1 -P20 w3mail 2>>~/.w3mail/log &
+
+ Then request sending a web page by running:
+
+ : $ echo 'url' >>~/.w3mail/in
+
+ or send many web pages by:
+
+ : $ cat file >>~/.w3mail/in
+
+ Watch the log for errors:
+
+ : $ tail -f ~/.w3mail/log
+
+If you configured w3mail to save web pages into the
+~/.w3mail/inbox/username directory, you can use dirpop3d to retrieve
+those web pages as email messages. You will need to set up the
+following:
+
+1) Run dirpop3d:
+
+ : $ dirpop3d 3333 ~/.w3mail/inbox/username &
+
+2) Add the pop3 server at localhost:3333 to your email client and, when
+ asked to authenticate, enter the username and any password (the
+ password is not checked on this local pop3 server).
+
+* Using w3mail with Emacs
+
+Put the following emacs-lisp code into your ~/.emacs file:
+
+#+begin_src emacs-lisp
+(defun w3mail (url &optional new-window)
+ (interactive (browse-url-interactive-arg "URL: "))
+ (shell-command (format "w3mail '%s' &" (browse-url-encode-url url))))
+
+(defun w3m-w3mail (url)
+ (interactive (list (w3m-input-url nil nil nil nil 'feeling-lucky)))
+ (when (and (stringp url)
+ (not (interactive-p)))
+ (setq url (w3m-canonicalize-url url)))
+ (set-text-properties 0 (length url) nil url)
+ (setq url (w3m-uri-replace url))
+ (unless (or (w3m-url-local-p url)
+ (string-match "\\`about:" url))
+ (w3m-string-match-url-components url)
+ (setq url (concat
+ (w3m-url-transfer-encode-string
+ (substring url 0 (match-beginning 8))
+ (or w3m-current-coding-system
+ w3m-default-coding-system))
+ (if (match-beginning 8)
+ (concat "#" (match-string 9 url))
+ ""))))
+ (w3mail url))
+
+(global-set-key [f5] 'w3m-w3mail)
+#+end_src
+
+Pressing the f5 key will ask for the URL of the web page to be sent.
+
+It is better to run w3mail as a server, as described above; it is then
+possible to replace the
+
+: w3mail '%s' &
+
+command string in the w3mail emacs-lisp function with
+
+: echo '%s' >>~/.w3mail/in
+
+which won't block emacs at all.
+
+* Future plans
+
+** TODO fix fragile tidy
+
+Tidying (X)HTML is rather fragile at the moment and I haven't found a
+good tool for that yet.
+
+I imagine
+
+- the tidy program needs to be fixed ([[http://lists.w3.org/Archives/Public/html-tidy/2010OctDec/0022.html][unlikely]]);
+- the w3m program could be changed to dump xhtml;
+- the parser from Firefox or WebKit could be used;
+- or I need to write yet another tolerant parser.
+
+*** emacs-w3m
+
+: $ cvs -d :pserver:anonymous@cvs.namazu.org:/storage/cvsroot login
+: $ cvs -d :pserver:anonymous@cvs.namazu.org:/storage/cvsroot co emacs-w3m
+
+*** w3m
+
+How do I check out the CVS repository directly? The official
+repository doesn't work.
+
+: wget http://www.w3m.org/download/source/w3m-0.1.10-tb2.tar.gz
+: tar zxvf w3m-0.1.10-tb2.tar.gz
+
+** TODO fix fragile xmlstarlet pyx and p2x
+
+Removing namespaces from xhtml doesn't work reliably either. Probably
+a bug in xmlstarlet?
+
+** TODO handle RSS and Atom feeds better
+
+** TODO handle mime-types better
+
+For example, application/xml is quite common but doesn't work yet.
+
+** TODO avoid base64 and use text/plain
+
+This might be configurable, but text/plain email messages would be
+searchable using a simple grep command.
+
+It would be good if the plain text messages contained the links too.
+
+If I don't use base64, won't there be problems with line length in
+mail messages?
+
+* Licence
+
+[[http://www.gnu.org/licenses/][GPLv3+]]
+
+* Feedback
+
+Please send [[http://logand.com/contact.html][me]] an email.