[01:39] Oops. I just realized that my irc logger has been off for the last 3 days. D'oh. http://badcheese.com/~steve/atlogs
[01:42] godane: I should have something in ~15 minutes
[01:52] godane: I think i'm done
[01:53] :-D
[01:53] code please?
[01:55] http://code.hanzoarchives.com/warc-tools/src/2a7976f9e7d7/warclinks.py
[01:55] should handle all sorts of links in warcs (only html though...)
[01:55] handles relative urls too
[01:55] I happened to have an html link extractor using the py stdlib kicking around
[01:55] and it helps I wrote a warc library :-)
[01:56] should be able to do hg clone ... (or grab a tarball)
[01:56] export PYTHONPATH=`pwd`
[01:56] python warclinks.py warc-files....
[01:56] handles gzipped, non-gzipped files
[01:57] if you have 6+ month old warc files from when wget-warc produced weird files, I can put in a fix for that, but warc2warc --wget-chunk-fix should sort it
[01:57] it doesn't keep a set of links
[01:57] it could produce a list of urls found in the links that aren't in the warc
[01:57] but you can do warcdump ... | grep WARC-Target | cut ...
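A minimal sketch of the stdlib link-extraction approach tef_ describes above (hypothetical code, not the actual warclinks.py; assumes Python 2 and that the HTML payload has already been pulled out of the warc record):

    # Hypothetical example -- not warclinks.py itself.  Pulls href/src
    # attributes out of HTML and resolves relative urls against the
    # record's base url, using only the Python 2 stdlib.
    from HTMLParser import HTMLParser
    from urlparse import urljoin

    class LinkExtractor(HTMLParser):
        def __init__(self, base_url):
            HTMLParser.__init__(self)   # old-style class, no super()
            self.base_url = base_url
            self.links = []

        def handle_starttag(self, tag, attrs):
            for name, value in attrs:
                if name in ('href', 'src') and value:
                    # urljoin turns relative urls into absolute ones
                    self.links.append(urljoin(self.base_url, value))

    def extract_links(html, base_url):
        parser = LinkExtractor(base_url)
        try:
            parser.feed(html)
        except Exception:
            pass    # the builtin parser chokes on broken markup
        return parser.links

As the conversation below shows, the builtin parser is exactly the weak point: lxml or BeautifulSoup cope far better with real-world HTML.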
[01:58] found an error
[01:58] any questions? I've only tested it a little
[01:58] ah balls
[01:58] can you pastebin it?
[01:59] http://pastebin.com/NfbFUy2Q
[01:59] hrm, it shouldn't be raising that
[01:59] oh i'm a muppet.
[01:59] hrm, you've got some lovely html there :-)
[02:00] i know
[02:00] this is the first time that grep -ohP doesn't work to grab/filter all urls
[02:01] i'm trying to use that to grab all images from sites like techcrunch and such
[02:01] I pushed a fix to skip them properly
[02:01] but I should replace it with something more reliable than python's built-in parser
[02:01] maybe I should use beautiful soup or lxml
[02:02] but it will get you *some* of the urls, maybe, I hope :-)
[02:05] ugh
[02:05] I am an idiot
[02:05] anyway, I'm gonna try and put beautiful soup in
[02:05] should handle everything
[02:06] ok
[02:06] rather than committing typos :3
[02:29] godane: pushed
[02:29] should use lxml
[02:29] well, almost pushed
[02:29] pushed *now*
[02:30] godane: ping
[02:30] hey
[02:30] i got it
[02:31] there look to be warnings of parse errors
[02:31] hrm
[02:32] you may need to install lxml, via python-lxml (apt) or easy_install lxml
[02:32] you didn't fix my problem
[02:32] the lines still break
[02:33] but this does look better and has more stuff in it now
[02:33] just fixing a bug
[02:33] well, maybe a bug
[02:34] godane: how are you running it, I get a whole slew of urls from the examples I try
[02:36] tef_: are you the tef that recently visited #hackerfurs?
[02:36] python warclinks groklaw.net-articles-2006.warc.gz > log
[02:36] *warclinks.py
[02:37] underscor: yeah, I got dragged in by mithaldu
[02:37] I heard some furries were trash-talking my code :-)
[02:37] I assume you're the same underscor there
[02:38] what's the "lines still break" thing?
[02:38] hrm
[02:38] like i said
[02:39] i'm slow :3
[02:39] this warc.gz is special
[02:39] oh so special
[02:39] i'd ask for a copy but I assume it's huge
[02:39] no, just ~15mb
[02:41] tef_: haha, yeah
[02:41] small world, innit
[02:41] I backed out cos, well, I had a clearout of irssi windows
[02:42] Aye
[02:46] tef_: you can download it here: http://archive.org/details/groklaw.net-articles-2006-20120827-mirror
[02:47] godane: fetching now
[02:50] oh *wow*
[02:51] you see what i mean now
[02:51] even doing a tr -d '\n' does nothing to it
[02:52] yeah
[02:52] that is rather amazing
[02:54] pushed a fix :3
[02:54] godane: try now
[02:54] I can try stripping fragments too, but I think sed can fix that
[02:55] lots of errors now
[02:55] hrm? I get a bunch of links out
[02:56] did python warclinks.py ~/Downloads/groklaw.net-articles-2006.warc.gz | sort | uniq
[02:56] and without newlines and such
[02:56] try repulling in case something weird happened
[02:57] File "warclinks.py", line 64, in extract_links_from_warcfh
[02:58] the error i have is in your fix
[02:58] hrm
[02:58] do you have a little bit more of that error?
[02:59] it parses on mine, what version of python are you using?
[02:59] yield link.translate(None, '\n\r\t')
[02:59] i'm using python2
[02:59] 2.6 or 2.7?
[02:59] can you paste the entire traceback
[02:59] 2.7.3
[02:59] i can't right now
[02:59] baws
[02:59] i'm on firefox proxy
[02:59] can you copy and paste the error message at least?
[03:00] rather than just the line
[03:00] which exception
[03:00] as it works on my machine (tm)
[03:02] http://secretvolcanobase.org/~tef/warc_links.txt.gz example output
[03:03] http://pastebin.com/NnaN79q1
[03:04] 2.7.3, weird
[03:04] http://docs.python.org/library/stdtypes.html#str.translate
[03:04] cos it says two arguments here
[03:05] anyway, the txt.gz file has the links you want, I hope
[03:05] hrm
[03:05] aaaaha
[03:05] for some reason on your machine it is sending in unicode
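The bug tef_ tracks down here, in miniature: in Python 2, str.translate() takes (table, deletechars), but unicode.translate() takes a single mapping, so link.translate(None, '\n\r\t') blows up as soon as lxml hands back unicode instead of bytes. A hypothetical demo (Python 2):

    # the two translate() signatures discussed above
    link = 'http://example.com/a\nb'        # byte string
    print link.translate(None, '\n\r\t')    # works: http://example.com/ab

    ulink = u'http://example.com/a\nb'      # unicode
    try:
        ulink.translate(None, '\n\r\t')     # TypeError: takes exactly 1 argument
    except TypeError:
        # the unicode form wants a mapping of ordinals to None
        print ulink.translate({ord(c): None for c in u'\n\r\t'})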
[03:08] godane: pull or try the output provided
[03:10] thank you
[03:11] fixed?
[03:11] yes
[03:11] \o/
[03:11] i think
[03:12] well that took longer than 15 minutes :3
[03:12] what an awful warc file
[04:00] looks like that warc had 700+mb of pdfs, mp3, ogg, and images from groklaw.net
[04:11] there is an error again
[04:12] tef_: ping ^
[07:48] underscor: Thanks for the warctozip update. Although the new POST things don't really work: your Nginx config apparently has a very low client_max_body_size. Perhaps you can increase that a bit? (It would be even nicer if it didn't buffer the request at all, but that seems to be impossible with Nginx.)
[07:48] Similarly, it might be useful to disable proxy_buffering if it's enabled. That can also be done from the script with an extra HTTP header in the response, if that's easier.
[09:22] thanks for the Aktuelles Software Magazine collection!
[09:36] does someone have/know a tool to completely download a reddit thread? the increments when you click "more" get tiny, so it is quite annoying to do by hand
[09:37] it's called a scripting language, and it's a very sharp tool
[09:37] ^_^
[09:38] Wonder how they do the comment collapsing, should take a look at that sometime
[09:39] same would be handy for facebook, those threads are nearly impossible to get with a browser since they can't keep up rendering thousands of comments
[09:40] Wget+Lua!
[09:40] * Schbirid runs away
[09:41] Ooh, should take a looksie at wget+lua sometime as well
[10:49] godane: ?
[13:16] tef_: hey
[13:16] i'm back
[13:16] it looks like some keys have problems with unicode
[13:16] like 0x94
[13:16] and 0x31
[13:17] hrm
[13:45] I just asked archive.org a question about scanning.
[13:45] Can we have a volunteer corps of people in the SF Bay area who come in and operate a bookscanner assigned to our group, who then scan computer historical documents?
[13:46] If they say yes, I'll start harassing people about joining up.
[13:51] godane: put in a better fix, maybe
[14:57] http://want.archive.org/
[14:57] alard: that will go through the load balancer instead of running on my dev box, if you want to update the demo app
[15:39] underscor: Please add a line under "currently only for books/things with ISBNs"
[15:39] Experimental: Do not use as a sign-off for large donations of books. Please contact info@archive.org.
[15:39] Remove the secret mode line
[15:50] i got over 8gb of groklaw.net
[15:50] :-D
[15:51] i did have to split some of the warc.gz files cause downloads stop sometimes
[15:52] it may be closer to 4gb cause i have both the mirror .tar.gz and .warc.gz
[15:56] underscor: My want-it demo app is asleep, I don't know if I will wake it up again. (I ran the human.io app on my home computer.)
[15:56] Also, the want-it api is visible on http://warctozip.archive.org/ ?
[16:15] godane: did the most recent fix, well, uh, fix it?
[16:15] i don't know
[16:15] heh
[16:16] i see the error again with my groklaw.net 2011 dump
[16:16] godane: yeah I'm not sure why your lxml is returning unicode
[16:17] i think it's mostly cause groklaw is special
[16:17] i also get some bad urls like this: http://www.groklaw.net/htt[://www.groklaw.net/pdf3/LodsysvCombay-26.pdf
[16:18] luckily all the bad urls are at the top of the list
[16:18] heh
[16:18] yeah I can't fix their broken links
[16:19] the thing is i checked for that file
[16:19] pushing a better check for unicode, for what it is worth
[16:19] either way I hope you've got more stuff than you would have had without it
[16:19] despite it being buggy and crap :-)
[16:20] it has that same broken-line problem from what i can tell
[16:22] baws
[16:23] I'm not going to have a lot of time, if any, to keep playing hunt-the-bug when I'm struggling to recreate some of the weirder errors
[16:23] sorry :/
[16:23] that's ok
[16:24] it filters out the bad urls better than before
[16:24] and i think it does fix most of the bad urls
[16:27] yay :D
[16:27] you might find google refine will be good for cleaning up large data sets like this
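A hypothetical cleanup pass along the lines of the filtering discussed above: strip stray whitespace, then reject anything that does not parse as a sane http(s) url, which catches garbage like the "htt[://" link godane pasted (helper names are made up; Python 2):

    import re
    from urlparse import urlsplit

    URL_OK = re.compile(r'^https?://[^\s\[\]]+$')

    def clean_urls(urls):
        good = set()
        for u in urls:
            u = ''.join(u.split())      # drop embedded newlines/tabs
            if not URL_OK.match(u):
                continue                # rejects e.g. .../htt[://...
            scheme, netloc = urlsplit(u)[:2]
            if scheme in ('http', 'https') and '.' in netloc:
                good.add(u)
        return sorted(good)             # like `sort | uniq`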
[16:39] I have a website that I would like to be archived, how would I do so?
[16:39] it's going down saturday sometime, i'll just leave this here: http://www.therevoltpress.org/
[16:39] did anyone do this
[16:39] godane was disconnected at the time
[16:41] it looks like the website is still up
[16:47] i will try to grab it soon
[16:48] my groklaw.net grab is very special so i don't want it to stop downloading
[16:48] godane, let me know when/where you download it when you're done please
[16:50] good news is it doesn't look like it has been updated since last year
[16:53] but their boards have been busy
[16:53] yea, it will be until it closes
[16:53] no ETA though
[16:53] want.archive.org is apparently going to shift names, so don't get comfy with it. :)
[17:12] i have to log in with a user name and password
[17:13] how do you do that with wget?
[17:14] godane: HTTP basic authentication? wget --help | grep user
[17:15] godane, you can log in with anonymous / anonymous
[17:15] btw
[17:24] i'm getting this for the cookie:
[17:24] therevoltpress.org FALSE / FALSE 1377710618 bblastactivity 0
[17:24] therevoltpress.org FALSE / FALSE 1377710618 bblastvisit 1346174618
[17:24] it's not working
[17:24] stupid me
[17:24] wrong url
[17:25] still doesn't work
[17:25] therevoltpress.org FALSE / FALSE 1377710728 bblastactivity 0
[17:25] therevoltpress.org FALSE / FALSE 1377710728 bblastvisit 1346174728
[17:28] i don't think i can mirror it
[17:35] what am i doing wrong here:
[17:35] wget "http://therevoltpress.org/boards/" --keep-session-cookies --load-cookies=cookies1.txt --content-disposition --mirror --warc-file=therevoltpress.org-20120828 --warc-cdx
[17:43] can anyone help me?
[17:43] it's driving me nuts
[17:44] cause i have no idea how to add cookies to wget the right way
[17:48] godane: do you have a cookies.txt?
[17:48] and is it properly formatted?
[17:48] yes
[17:48] it's just like the other ones
[17:48] i'm using the export cookies addon for firefox to get the cookie
[17:49] i may not know where to point it to though
[17:49] cause therevoltpress.org/boards/ is not working with wget
[17:49] even therevoltpress.org/boards/login.php doesn't work
[17:52] -U "Somethingelse." ?
[17:52] They may be blocking wget.
[17:53] that didn't work
[17:56] they're using vBulletin 3.8.0 if that helps
[17:58] this may be better for you guys to do
[17:58] i can't do much here
[17:58] and even if i could get all of it, it may be more than 10gb
[17:59] and i don't think i can get that uploaded on my internet speed
[18:08] godane: that's worked for me...
[18:08] are you faking the UA?
[18:08] I had to for one project
[18:15] yes
[18:15] show me your code please?
[18:15] and send me your cookie
[18:15] i'm getting FALSE / FALSE with my cookies for some reason
[18:19] balrog_: can you please send me the code?
[18:19] i'm dying here
[18:30] wget "http://therevoltpress.org/boards/login.php?do=login" --mirror --warc-file=therevoltpress.org-20120828 --warc-cdx -U "ArchiveTeam" --load-cookies=cookies1.txt
[18:30] that's my code
[18:30] you show me yours?
[18:31] or at least tell me the url you're using
[18:34] balrog_: where the hell are you?
[18:35] busy, stuck at work
[18:35] can you please help me?
[18:35] i don't know why this site will not download
[18:35] and i don't know how the hell to save the cookies through wget anymore
[18:36] what's in cookies1.txt before you start?
[18:36] therevoltpress.org FALSE / FALSE 1377696171 bblastvisit 1346174589
[18:36] therevoltpress.org FALSE / FALSE 1377696451 bblastactivity 0
[18:36] www.therevoltpress.org FALSE / FALSE 0 __utmc 1
[18:36] www.therevoltpress.org FALSE / FALSE 1346159957 __utmb 1.2.10.1346158150
[18:36] www.therevoltpress.org FALSE / FALSE 1361926157 __utmz 1.1346158150.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)
[18:36] www.therevoltpress.org FALSE / FALSE 1409230157 __utma 1.882311859.1346158150.1346158150.1346158150.1
[18:36] therevoltpress.org FALSE / FALSE 0 bbsessionhash a11c86836d5471bdda445db209cb2e5a
[18:36] that's all my therevoltpress.org cookies
[18:37] i have no idea why they're not working
[18:39] Patt: any ideas on how to mirror therevoltpress.org
[18:39] Patt: remember you asked for me by name
[18:40] Alard: fixed. Thanks.
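One hedged way around the cookie trouble above: do the vBulletin login from Python with cookielib, save a Netscape-format cookies.txt, and hand that to wget with --load-cookies. The form field names below are guesses for vBulletin 3.x and may need adjusting (Python 2):

    import urllib
    import urllib2
    import cookielib

    jar = cookielib.MozillaCookieJar('cookies1.txt')
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
    opener.addheaders = [('User-Agent', 'ArchiveTeam')]   # not wget's default UA

    data = urllib.urlencode({
        'do': 'login',                        # guessed vBulletin fields
        'vb_login_username': 'anonymous',
        'vb_login_password': 'anonymous',
    })
    opener.open('http://therevoltpress.org/boards/login.php?do=login', data)

    # ignore_discard keeps session cookies (like bbsessionhash) that
    # would otherwise be dropped on save
    jar.save(ignore_discard=True, ignore_expires=True)

Then run wget with --load-cookies=cookies1.txt --keep-session-cookies as above.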
[20:58] alard: When could we make the memac search public?
[21:00] Hadn't you already done that?
[21:01] I think it won't get more complete than it is now.
[21:01] The .zip download links work. It's a pity the .warc.gz download links don't work, but I think that's an issue with the archive.org tarviewer.
[21:11] Well, I'm about to give it to a press person
[21:11] So if it can be set up as ready to go for press, let's do it.
[21:13] the fixed-width font will scare muggles
[21:14] I'm all for it
[21:17] WHY MUST YOU SELL FEAR
[21:17] +1 for "muggles"
[21:18] Always amazed how that one goes by
[21:18] Treats them like cattle
[21:18] Also liked how one book basically had magic dude show up in prime minister's office going "major shit going down lol brb"
[21:18] heh
[21:19] Yeah, so, well, the search page is here: http://archive.org/download/archiveteam-mobileme-index/mobileme-20120817.html
[21:20] It may or may not need a lot of text and explanations.
[21:20] Why am I not here? Why am I here? How did you hack my account?
[21:20] hm
[21:21] how the fuzz does this work anyway
[21:21] "Email complaints@archiveteam.org to get your things removed."
[21:21] ahhh
[21:21] It's just a 400MB JSON file sorted alphabetically.
[21:21] that's tricky
[21:22] not worthwhile to split it up?
[21:22] So searching requires me to download a 400 MB file?
[21:22] No, you just download small bits of it.
[21:22] ah, cool
[21:22] Magic
[21:22] it does some sort of binary windowing thing?
[21:23] https://ia600403.us.archive.org/30/items/archiveteam-mobileme-index/
[21:23] There's an index to the large json file, with the locations of where items start.
[21:23] hot diggity damn
[21:24] Because it's sorted, you know that item X should be in bytes n-m.
[21:24] (If that's abstract enough.)
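The trick alard is describing, sketched in Python (the real search page does this in browser JavaScript with XmlHttpRequest; the index format here is illustrative, not the actual mobileme-20120817 layout):

    import urllib2

    def fetch_range(url, start, end):
        # ask the server for just these bytes of the 400MB file
        req = urllib2.Request(url)
        req.add_header('Range', 'bytes=%d-%d' % (start, end))
        return urllib2.urlopen(req).read()

    def lookup(url, index, key):
        # index: sorted list of (first_key_in_block, byte_offset) pairs
        lo, hi = 0, len(index) - 1
        while lo < hi:                   # find last block starting <= key
            mid = (lo + hi + 1) // 2
            if index[mid][0] <= key:
                lo = mid
            else:
                hi = mid - 1
        start = index[lo][1]
        end = index[lo + 1][1] - 1 if lo + 1 < len(index) else start + 2 ** 20
        block = fetch_range(url, start, end)
        return [line for line in block.splitlines() if key in line]

Because the big json file is sorted, one small ranged GET per query stands in for a server-side database.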
[21:24] hangs infinitely in opera
[21:25] Does it.
[21:25] yurp
[21:25] Any idea why?
[21:25] * chronomex shrugs
[21:25] opera's weird
[21:25] I tried it in Firefox and Chrome.
[21:25] yeah, works fine in chrome
[21:26] It's a bit tricky, so you need a modern browser. But it doesn't need a database.
[21:26] it's spiffy
[21:26] I like it
[21:27] this is the future
[21:28] It's the past. It's just a horribly slow search engine that can only search on one key.
[21:28] It's fast enough to be usable, though.
[21:28] yeah
[21:29] https://ia600403.us.archive.org/30/items/archiveteam-mobileme-index/mobileme-20120817.html#chronomex hah, I suppose I put my own name through the script at some point
[21:31] We're flooding the channel. :)
[21:33] take it to #internetarchive, you!
[21:33] or #nowwhat :D or.. -bs
[21:34] endless possibilities
[21:34] We should have a hash function where you can enter a topic and it'll tell you to go to #archiveteam-${hash}
[21:35] Let's go to #nowwhat
[21:35] or just a stab at random
[21:37] We'll just change channels after every second message. That's what real hackers do, I've heard.
[21:39] 7 layers of channels
[21:57] Installed Opera, found the problem: Opera is stupid, it doesn't do Range: headers in XmlHttpRequest, so it starts downloading the full 400MB.
[21:58] (It also opens connections to ebay, booking.com and other sites, without my asking.)
[21:59] SketchCow: Anything else you need to make the search thing ready to go to press?
[21:59] http://archive.org/download/archiveteam-mobileme-index/mobileme-20120817.html is what we go with, right?
[21:59] Yes. It's possible to put it in an iframe somewhere on archiveteam.org, if that's better.
[22:02] want.archive.org sounds great - how do you get books to IA? (is there going to be a blog post somewhere on this? or is it not public-ready yet?)
[22:02] Not public ready
[22:02] But you basically mail them books. I send mine in crates, media mail.
[22:02] 200 went out today
[22:03] archive.org wants to take it under consideration before it becomes an official API
[22:03] I'd love to unload some books, I have way too many for a single man in a city :(
[22:04] I'll do an inventory eventually
[22:05] chronomex: Check out bookmooch.com, it allows you to trade books by mail
[22:05] meh, that sounds like a lot of work
[22:06] also I have *too many* books
[22:06] I should scan the rare ones.
[22:06] I do as well - I've had to switch to ebooks because I don't really have more room for physical copies
[22:07] the space under my bed is about 80% books.
[22:07] every shelf is full of books, and nearly the entire wall is lined with piles of books
[22:09] I'd love to do the book scanning thing, but it takes a more disciplined and dedicated person than me to do that - I'd get distracted by reading parts of the pages as I flipped by, and it's a lot more tedious flipping pages and taking pictures than reading the book
[22:09] haha I'm not the only one
[22:11] I've only scanned one in toto, which is probably the most valuable book I own - http://archive.org/details/TheElectronicSwitchingSystem
[22:11] that instructable on the cardboard box bookscanner makes the whole thing look easy, but apart from the aforementioned issues, there's the post-processing of each page - which is SO much easier if your pictures are uniform in every respect
[22:12] This is why we're working on want.archive.org
[22:12] Send them to archive.org, they get scanned in
[22:12] I used to scan books on a flatbed for distributed proofreaders, you kids and your diy things
[22:14] DFJustin: gutenberg?
[22:14] yeah
[22:14] unfortunately the raw scans all ate it in an hdd crash, unless dp still has them
[22:15] :( ): :(
[22:15] the pg guys made some wicked ebook editions though http://www.gutenberg.org/files/16410/16410-h/16410-h.htm
[22:15] that's a great idea, except making space is only half the reason I'm scanning a book - the other is to have an ebook version of it (which I'm pretty sure I can't get from archive.org - books are too new)
[22:16] DFJustin: oh that's sexy
[22:16] I got 2/3 of the way through TeXifying that book too - http://gir.seattlewireless.net/~chronomex/bellsystem/morris/Morris.html
[22:17] if you tell me I can get an electronic copy of every book I mail into IA, I'd crate up a large part of my books and send them very quickly
[22:17] yeah.
[22:17] it's not legal since they want to lend out the electronic copy
[22:18] yeah :S
[22:19] the other scanning project you proposed to archive.org sounds great as well - the historical computer document one
[22:47] I made the formal proposal to archive.org about that
[23:26] I still would like a DIY bookscanner :D
[23:45] wasn't SketchCow supposed to get one of those like 6+ months ago and CHANGE COMPUTER HISTORY
[23:45] Yes
[23:45] I've been needling the guy - little response
[23:46] I've got a few "Getting that right to you (six months ago)" so I'm not going to get too het up