Saturday, October 31, 2015

The Minimalists

I am sure that I have told you many times that I just love this site, The Minimalists. I am literally learning tons from each and every article. It's a way of life, a way of thinking and, for me, a way to live more: to be more alive and more aware of what I have in my life.

As a student of the arcane art of programming, I thought, why not have my own personal copy of the site for offline reading? I have been studying web scraping a lot lately, so I decided to experiment on the awesome blogs which I read often.

Fast forward a couple of hours, and I had a set of 424 text files holding the text content of the site's own blog posts on The Minimalists ;P

Actually, it was a great experience, and in the process I have learned:

> To respect the site by slowing down the spiders (robots/crawlers)

> To look up the site's robots.txt to find out what it explicitly permits crawlers to do

> How to avoid walking down the rabbit holes, something I am starting to be aware of in my nature.
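
The first two lessons above can be sketched with Python's standard library alone. This is only a rough illustration, not the code that produced the 424 files: the bot name, the two-second fallback delay and the helper names are all assumptions made up for the example, and the `fetch` argument stands in for whatever download function the crawler actually uses.

```python
import time
import urllib.robotparser

USER_AGENT = "offline-reader-bot"  # hypothetical bot name, not from the original crawler
DEFAULT_DELAY = 2.0                # assumed fallback pause (seconds) between requests


def build_rules(robots_txt_text):
    """Parse the text of a robots.txt file into a rule checker."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt_text.splitlines())
    return parser


def polite_delay(parser, user_agent=USER_AGENT, default=DEFAULT_DELAY):
    """Use the site's Crawl-delay if it declares one, else the fallback."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


def allowed_urls(parser, urls, user_agent=USER_AGENT):
    """Keep only the URLs that robots.txt permits this bot to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def crawl(parser, urls, fetch, sleep=time.sleep):
    """Fetch each permitted URL, pausing after every request."""
    delay = polite_delay(parser)
    pages = {}
    for url in allowed_urls(parser, urls):
        pages[url] = fetch(url)
        sleep(delay)  # slow the spider down out of respect for the site
    return pages
```

In real use, the robots.txt text would first be downloaded from the site (or `RobotFileParser.set_url()` and `read()` could do the fetching); parsing from a string just keeps the sketch easy to try offline.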


Things which still need to be done as far as my scraping skills are concerned:

> Actually converting and formatting the text files into a proper, Nook-readable PDF (or other formats), perhaps via LaTeX using the PyLaTeX package.

> My code still doesn't tell me whether there was an image or a video on the page, or whether the page no longer exists. A little intelligence might make for a better crawler; in any case it would be far less sophisticated than what the professionals use: think Google, Baidu, et cetera.

> My code isn't quite sound; it is still the code-and-throw-away kind. For example, I am not really using well-designed functions, which would be a step towards reusable code.

> I'd like the content to really look the way it should. Right now it's just bare-bones txt files, whereas a somewhat more sophisticated PDF makes for pleasant reading.
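
As a possible first step on the second and third items, here is a small sketch of reusable functions that report whether a page exists and whether it carries images or videos. It uses only the standard library; the function and class names are hypothetical, and counting `<iframe>` tags as videos is an assumption (embedded players often sit in iframes, but so do other things).

```python
from html.parser import HTMLParser
import urllib.error
import urllib.request


class MediaCounter(HTMLParser):
    """Count image and video-like tags while the HTML is parsed."""

    def __init__(self):
        super().__init__()
        self.images = 0
        self.videos = 0

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images += 1
        elif tag in ("video", "iframe"):  # assumption: an iframe is an embedded video
            self.videos += 1


def summarize_page(status, html):
    """Report whether the page exists and how much media it carries."""
    counter = MediaCounter()
    counter.feed(html)
    return {
        "exists": status == 200,
        "images": counter.images,
        "videos": counter.videos,
    }


def fetch(url, timeout=10):
    """Return (status code, HTML text); a missing page yields its error code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.status, response.read().decode(charset, errors="replace")
    except urllib.error.HTTPError as error:
        return error.code, ""  # e.g. 404 when the post no longer exists
```

Calling `summarize_page(*fetch(url))` for each post would give the crawler a one-line report to log next to the saved text file.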


Next up - Text Processing, LEVEL UP!

Fight-o;P
