I am sure I have told you many times that I just love this site - The Minimalists. I am learning tons from each and every article. It's a way of life, a way of thinking and, for me, a way to live more - to be more alive and more aware of what I have in my life.
As a Student of the Arcane Art of Programming, I thought: why not have my own personal version of the site for off-line reading? And since I have lately been studying Web Scraping, it was the perfect excuse to experiment on the Awesome Blogs I read so often.
Fast forward a couple of hours, and I had a set of 424 text files containing the text content of the various blog posts on The Minimalists ;P
Actually, it was a great experience, and in the process I learned:
> To respect the site by slowing down the spiders (robots/crawlers) - see the sketch after this list
> To look up robots.txt, to find out whether crawlers have explicit permission
> How to avoid walking down rabbit holes, something I am starting to become aware of in my nature
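Here is a minimal sketch of those three habits, assuming Python with the requests library; the one-second delay, the helper names and the BASE_URL are my own choices for illustration, not anything prescribed.

import time
import urllib.robotparser
from urllib.parse import urljoin, urlparse

import requests

BASE_URL = "https://www.theminimalists.com"

# Look up robots.txt first, so every fetch can check for explicit permission.
robots = urllib.robotparser.RobotFileParser(urljoin(BASE_URL, "/robots.txt"))
robots.read()

def on_site(url):
    # The guard against rabbit holes: never follow a link off the one domain.
    return urlparse(url).netloc.endswith("theminimalists.com")

def polite_get(url, delay=1.0):
    # Respect the site: honour robots.txt and pause between requests.
    if not on_site(url) or not robots.can_fetch("*", url):
        return None
    response = requests.get(url)
    time.sleep(delay)  # slow the spider down
    return response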
Things which still need to be done, as far as my Scraping skills are concerned:
> Actually converting and formatting the text files into a proper, Nook-legible PDF (or other formats), perhaps via LaTeX using the PyLaTeX package - see the PyLaTeX sketch after this list.
> My Code still doesn't inform me whether there was an Image or a Video on the page, or whether the page no longer exists; a sketch of those checks also follows the list. Minimal intelligence might make for a better crawler - in any case it'd be far less sophisticated than what the professionals use (think Google, Baidu et cetera).
> My Code isn't quite sound; it is still the code-and-throw-away kind. For example, I am not really using well-designed functions, which would be a step towards Reusable Code.
> I'd like the content to really look the way it should. Right now it's just bare-bones txt files, whereas a slightly more sophisticated PDF makes for pleasant reading.
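As for the PDF step, here is a rough sketch of what the PyLaTeX route could look like; the text_to_pdf name and the sample file are made up, and a working LaTeX compiler (e.g. pdflatex) is assumed to be installed.

from pathlib import Path

from pylatex import Document, Section

def text_to_pdf(txt_path):
    # Wrap one scraped .txt file in a LaTeX document and compile it to PDF.
    text = Path(txt_path).read_text(encoding="utf-8")
    doc = Document()
    with doc.create(Section(Path(txt_path).stem)):
        doc.append(text)  # PyLaTeX escapes plain strings for us
    doc.generate_pdf(Path(txt_path).stem, clean_tex=True)

text_to_pdf("some-post.txt")  # hypothetical file name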
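And a guess at the missing checks, using requests plus BeautifulSoup; the inspect_page name and the returned fields are my own invention.

import requests
from bs4 import BeautifulSoup

def inspect_page(url):
    # Report whether the page still exists and whether it carries media.
    response = requests.get(url)
    if response.status_code == 404:
        return {"exists": False, "images": 0, "videos": 0}
    soup = BeautifulSoup(response.text, "html.parser")
    return {
        "exists": True,
        "images": len(soup.find_all("img")),
        # Embedded videos usually arrive as <video> tags or <iframe> embeds.
        "videos": len(soup.find_all(["video", "iframe"])),
    }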
Next up - Text Processing, LEVEL UP!
Fight-o;P