[00:01] shaqfu: I'm getting the ?page now
[00:02] i only need --post-data instead of --post-data --user=blah --password=blah
[00:02] otherwise i will get ?page=2.html?user=blah.html or something
[00:43] Ah, clever
[00:51] shaqfu, if you have a gun shoot me in the brain
[00:51] or give me temporary amnesia
[00:53] i just wish during archiving there was a way to de-stress the brain somehow so you could start fresh
[00:53] i guess that is what naps are for
[00:53] but time is always of the essence so it's like *fuck*
[00:59] woo
[00:59] infocube 2.0 is now at 221%
[01:00] wow.
[03:10] Coderjoe: i thought we was doing in -bs
[03:11] *talking
[03:13] looks like starfinder is in avgeeks
[03:16] looks like a ton of nasa videos was saved by avgeeks too
[03:16] i don't need a running tally of what is there
[04:39] just found something funny
[04:40] a torrent from kat.ph was removed by the request of copyright owner
[04:42] Which?
[04:43] http://kat.ph/keri-hilson-pretty-girl-rock-2010-single-sw-t4672360.html
[12:58] hm, "q2l\#354ft.map": Invalid or incomplete multibyte or wide character". would that be a ascii ì ?
[12:58] any idea how i can find out?
[12:58] my fs are utf8 but no idea what the source was
[13:21] http://www.michaelnielsen.org/ddi/how-to-crawl-a-quarter-billion-webpages-in-40-hours/ [HN discussion: http://news.ycombinator.com/item?id=4367933]
[13:35] On December 19, 2008, BusinessWeek listed Cuil as one of the most successful U.S. startups of 2008
[13:35] , based on the amount of money they raised.
[13:36] my kat.ph-community is still going
[13:39] Schbirid: lol, cuil
[13:49] wicked, i mounted that forumplanet bz2 again and now cpu usage is no problem. i wonder what went wrong the other time
[13:49] s
[13:49] this rock
[14:01] I've encountered Common Crawl before, but the Everything-Amazon-tech-and-Cloud stuff scares me away
[14:15] Can't you just download the data and use it somewhere else?
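On the 12:58 question about the filename in the "Invalid or incomplete multibyte or wide character" error: one way to find out is to decode the raw bytes under a few candidate encodings and see which give a readable name. A minimal sketch, assuming the 354 in the message is an octal escape for one raw byte (0xEC). The helper name try_decodings and the list of candidate encodings are illustrative only, not something from the log.

```python
def try_decodings(raw, encodings=("utf-8", "latin-1", "cp1252", "cp437")):
    """Decode raw bytes under each candidate encoding.

    Returns a dict mapping encoding name to the decoded string,
    or None when the bytes are not valid in that encoding.
    """
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results


# Assumption: \354 is an octal escape, i.e. the single byte 0xEC.
raw_name = b"q2l\354ft.map"

for enc, text in try_decodings(raw_name).items():
    print(enc, "->", text if text is not None else "not valid in this encoding")
```

Under that assumption the byte is not valid UTF-8 on its own, which would explain the error on a UTF-8 filesystem. In Latin-1 and Windows-1252 it is ì, though that is not an ASCII character, since ASCII stops at 0x7F.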
[14:17] yeah, but you need an Amazon account and pay for the download etc
[14:17] I mean, sure - that's fair. But it makes me reluctant to take a look at it
[14:20] https://aws-publicdatasets.s3.amazonaws.com/?prefix=common-crawl/crawl-002
[14:21] I think you can download everything for free, no account needed.
[14:22] https://s3.amazonaws.com/aws-publicdatasets/common-crawl/parse-output/segment/1341690169105/1341826131693_45.arc.gz
[14:23] oh, cool
[15:18] almost 13000 forum posts from kat.ph/community have been downloaded
[15:32] i'm getting a lot of 404s in my kat.ph/community dump
[15:33] there is also stuff like this too that needs to be backed up: http://kat.ph/blog/TheBatman/
[15:35] i just have no idea how other than scan my newer dump with http://kat.ph/user/[[:alnum:]]* or something to get user name urls
[15:36] then change the user part to blog and start grabbing
[15:36] i also have to look at images from all urls in this dump
[16:01] blog posts like this need to be saved for them: http://kat.ph/blog/Nemesis43/post/5200/
[17:11] just updated my linux journal collection
[17:12] linux journal collection?
[17:12] you get some here: http://www.missoulapubliclibrary.org/online-resources/317-linux
[17:12] what's funny is that it's a library
[17:13] ah
[17:14] also here: www.iar.unlp.edu.ar/biblio/htdocs/artic/bajad/linuxj/linuxj.htm
[17:15] the library has some pdfs that are indexed
[17:15] so i grab those indexed ones too
[18:59] I'm picking up 'hundreds' of 5.25" floppies Monday. Will be dumping like crazy.
[19:03] arkhive: excellent
[19:04] arkhive: what sort of floppies?
[19:19] good evening, btw
[19:28] Not sure yet.
[19:28] :)
[19:28] evenin'
[19:33] hey winr4r
[19:33] :)
[19:33] been busy, godane?
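The 15:35 idea (scan the dump for http://kat.ph/user/... URLs, then swap the user part for blog) could be sketched roughly as below. This is only an illustration: the function name blog_urls_from_dump is made up, [[:alnum:]] is approximated as ASCII letters and digits, and the sample text is hypothetical (the log only shows blog URLs for these two names, not their user pages).

```python
import re

# Approximates the http://kat.ph/user/[[:alnum:]]* pattern from the log,
# assuming user names are plain ASCII letters and digits.
USER_URL = re.compile(r"http://kat\.ph/user/([A-Za-z0-9]+)")


def blog_urls_from_dump(text):
    """Collect unique user names from user URLs and build blog URLs."""
    names = sorted(set(USER_URL.findall(text)))
    return [f"http://kat.ph/blog/{name}/" for name in names]


# Hypothetical sample standing in for the contents of the forum dump.
sample = (
    '<a href="http://kat.ph/user/TheBatman/">TheBatman</a> '
    '<a href="http://kat.ph/user/Nemesis43/">Nemesis43</a> '
    '<a href="http://kat.ph/user/TheBatman/">TheBatman</a>'
)

for url in blog_urls_from_dump(sample):
    print(url)
```

The resulting list could then be handed to whatever downloader is doing the grab. Whether every user actually has a blog page is not something the log confirms, so some 404s would be expected.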
[19:34] my kat.ph/community still is
[19:34] thanks to alard i will be able to grab all images off of kat.ph/community dump
[19:35] still pulling new images from it
[19:37] so doing sort and then uniq works, not just uniq
[20:07] it's in a url loop
[20:09] i think i got most of it anyway
[20:10] i should have blocked ?p_id paths
[20:11] and blocked the 26799 post
[20:42] getting a ton of user pictures now
[20:44] there are 5000+ user pics
[20:44] from kastatic.com/i2/u/# path
[20:45] then there is kastatic.com/i2/userpics/#
[20:55] the kastatic.com image dump is very big
[20:55] and i have not got to kastatic.com/i2/userpics/
[20:55] yet
[21:05] my eyes
[21:05] a fat guy took picture of himself naked
[21:06] that is what is in the data dump
[23:47] i'm downloading 8-bit theatre
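On the 19:37 and 20:10 remarks: uniq only collapses duplicate lines that sit next to each other, which is why the list has to be sorted first, and URLs carrying a p_id query parameter could be dropped before fetching. A rough sketch of both steps, assuming a plain list of URLs. The helper names adjacent_uniq and has_p_id and the sample URLs are hypothetical, not taken from the actual dump.

```python
from itertools import groupby
from urllib.parse import urlsplit, parse_qs


def adjacent_uniq(lines):
    """Collapse runs of identical adjacent lines, like uniq does."""
    return [key for key, _ in groupby(lines)]


def has_p_id(url):
    """True when the URL's query string carries a p_id parameter."""
    query = urlsplit(url).query
    return "p_id" in parse_qs(query, keep_blank_values=True)


# Hypothetical URL list; the paths are made up for illustration.
urls = [
    "http://kat.ph/community/show/1/",
    "http://kat.ph/community/show/2/",
    "http://kat.ph/community/show/1/",
    "http://kat.ph/community/show/2/?p_id=7",
]

print(len(adjacent_uniq(urls)))          # duplicates are not adjacent, nothing removed
print(len(adjacent_uniq(sorted(urls))))  # sorting first lets the duplicate collapse

wanted = [u for u in adjacent_uniq(sorted(urls)) if not has_p_id(u)]
for url in wanted:
    print(url)
```

Filtering on the parsed query rather than on a substring avoids dropping URLs that merely contain the text p_id somewhere in the path.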