[00:06] *** wvdp has joined #archiveteam [00:11] *** dashcloud has quit IRC (Read error: Connection reset by peer) [00:11] *** dashcloud has joined #archiveteam [00:12] hallo, i'm looking for a warc file to test an application i'm writing. preferably about 4 mb and containing different types of records. [00:34] *** khaoohs_ has joined #archiveteam [00:35] wvdp: You could look here: http://archive.fart.website/archivebot/viewer/item/archiveteam_archivebot_go_20150830100001 [00:36] This is about a half a MB: https://archive.org/download/archiveteam_archivebot_go_20150830100001/twitter.com-inf-20150830-060528-34f3f-meta.warc.gz [00:36] that expands to 6-8MB [00:37] still a better suited sample than what I'd be able to share :P [00:40] *** khaoohs has quit IRC (Read error: Operation timed out) [00:40] archivebot produces a *lot* of WARCs... [00:40] I'm still somewhat confused that there isn't a definitive WARC-to-static-local-files transformer available... [00:41] (or if there is, that I haven't found it yet) [00:41] agreed, though since it's plaintext it should be trivial to write a parser-splitter [00:42] i found someting [00:42] https://archive.org/download/archiveteam_archivebot_go_001/2luxz0kz3lhxnymxhnwz2ky1s-20131010-090331.warc.gz [00:42] and i vastly overestimated the size of a warc record there are hundreds in here [00:44] I'm wrong, binary imagedata is stored in the warc as well [00:46] *** zenguy_pc has quit IRC (Read error: Connection reset by peer) [00:46] *** BlueMaxim has joined #archiveteam [00:52] I have tracked down the solution for our occasional wget-lua documentation compilation issue! 
[00:52] found a workaround, and a patch - the patch certainly works, the workaround presumably does also [00:53] the patch is available at https://lists.gnu.org/archive/html/bug-wget/2013-06/msg00046.html, and the workaround is to do `make install_sw` instead of `make install`, so that could also be brought into the get-wget-lua.sh script [00:53] (will make PR) [00:54] *** caber has quit IRC (Read error: Operation timed out) [00:54] hum. yipdw: chfoo: get-wget-lua is not in the README template repo, where should I point my PR at ? [00:55] er, wait... [00:55] nevermind the workaround, we'd have to apply the patch, because there's no install step [00:55] *** wp494 has quit IRC (LOUD UNNECESSARY QUIT MESSAGES) [00:55] *** caber has joined #archiveteam [01:06] *** wp494 has joined #archiveteam [01:11] *** wp494 has quit IRC (Client Quit) [01:15] I have a working wget-lua package for Nix! :D [01:16] *** kyan has joined #archiveteam [01:20] *** xk_id has quit IRC (Remote host closed the connection) [01:30] *** kyan has quit IRC (Quit: Leaving) [01:30] *** kyan has joined #archiveteam [01:37] *** xk_id has joined #archiveteam [01:43] *** kyan has quit IRC (Quit: Leaving) [01:45] joepie91: I just use a oneliner to fix the docs: sed -e "s/\(item \)\([0-9]\)/\1\.\2/" ./doc/wget.texi > ./doc/wget.texi.tmp && mv ./doc/wget.texi.tmp ./doc/wget.texi [01:46] in the get-wget-lua.sh file. I believe that fixes the syntax error [01:47] or at least it stops the complaining, plus docs don't really seem to matter for most AT usage [01:59] *** wvdp has quit IRC (Ping timeout: 240 seconds) [02:08] *** nertzy has quit IRC (Quit: This computer has gone to sleep) [02:10] *** boozehoun has joined #archiveteam [02:10] *** primus104 has joined #archiveteam [02:10] *** VADemon has joined #archiveteam [02:30] *** primus104 has quit IRC (Leaving.) 
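A minimal Python sketch of the "parser-splitter" idea from the WARC discussion above, assuming an uncompressed WARC stream (for a .warc.gz, opening the file with gzip.open first should work). Record headers are plain text, but as noted the payload can be binary image data, so the payload is read by its Content-Length instead of by scanning for a delimiter. The name iter_warc_records is hypothetical and is not part of warcat or any other tool mentioned in this log.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in a binary WARC stream.

    Hypothetical helper for illustration only; it does no validation
    beyond checking the version line.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if line in (b"\r\n", b"\n"):
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)

        # Header block: "Name: value" lines, terminated by an empty line.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # The payload may be binary, so read exactly Content-Length bytes.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


def _make_record(record_type, body):
    """Build one synthetic record, used only for the demo below."""
    head = "WARC/1.0\r\nWARC-Type: %s\r\nContent-Length: %d\r\n\r\n" % (
        record_type, len(body))
    return head.encode("utf-8") + body + b"\r\n\r\n"


sample = (_make_record("warcinfo", b"software: demo\r\n")
          + _make_record("resource", b"\x89PNG\r\n\x1a\n\x00"))
records = list(iter_warc_records(io.BytesIO(sample)))
print([headers["WARC-Type"] for headers, _ in records])
```

Writing each payload to a file named after its WARC-Target-URI header would be the next step toward a static-files extractor. That part is left out here because mapping URLs to safe local paths is its own problem.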
[02:46] *** xk_id has quit IRC (Remote host closed the connection) [02:47] *** JesseW has quit IRC (Read error: Operation timed out) [02:52] *** wp494 has joined #archiveteam [02:53] *** RichardG has quit IRC (Read error: Connection reset by peer) [02:54] *** RichardG has joined #archiveteam [03:06] *** JesseW has joined #archiveteam [03:07] *** Ravenloft has joined #archiveteam [03:12] *** VADemon has quit IRC (left4dead) [03:18] *** myself has joined #archiveteam [03:30] I found a reasonably canonical warc-to-static-files transformer, it's chfoo's warcat too, using the extract command. Now what I'd like it to do is have the wget --convert-links option supported. [03:53] You could probably whip something up in python or sed or something. [03:57] *** Ravenloft has quit IRC (Read error: Connection reset by peer) [03:58] I'd advise Python rather than sed. [03:58] Python can more reasonably parse the HTML for URLs. :) [04:02] yeah, I was just spitting something out thinking of regexes, but yeah, an html.parser would probably work better. [04:13] Noob question: Once in a while, my ISP will MitM a connection to "inform me" about upcoming network maintenance. It's well-intentioned but of course means I can't run a Warrior. Except: The banner has distinctive text in it.. I feel like there's probably a way to detect that and invalidate my current task and suspend the warrior or something, while still allowing me to contribute the other 99% of the t [04:13] ime. However, this might be a lot of work, and it might be more sensible to just not run an instance over here. Forgive me if this is an oft-recurring topic, but I'm curious what the thoughts are on this issue, if it's come up before. [04:13] *** JesseW has quit IRC (Read error: Operation timed out) [04:14] myself: My thoughts: make a lot of noise at them. 
[04:14] Aside from convincing your ISP to stop doing that, probably the best way to work around it would be a VPN [04:15] I've already opted out of their DNS-bullshit, I'll see if I can opt out of the HTTP thing too. But yeah. If I want to go the VPN route I'd just run the warrior instance on the machine that would run my VPN endpoint and avoid handling all the data twice. :P [04:16] Specifically, mention that it's a violation of the Computer Fraud and Abuse Act and advise that you are considering contacting the appropriate authorities. [04:16] (seriously, that is a damned *attack*) [04:16] Heh. If I can get chapter and verse for that, I'll gleefully go that route. [04:17] 18 USC § 1030 [04:17] It's a fucking crime. [04:19] hmm, under a(5)A? [04:19] that'd be... hazy... [04:19] No more hazy than a lot of the other cases [04:20] Also, you're not necessarily having to actually *do* this, you're having to make them afraid enough they back down. [04:20] (though honestly, I'm pretty sure a MITM attack on your traffic would count.) [04:20] Short story: Technical measures not practical (because finding text in http is hard or something), yelling at provider is the only means worth considering. Got it, thanks. [04:21] I'd consider it higher priority than technical measures, because *this is frankly offensive*. [04:23] myself: it's easier to not run an instance if you know you're being MITMed [04:23] Right, that's what I've been doing so far. I'll keep doing that. [04:23] thanks [04:24] I'm not even kidding about the yelling at the ISP thing though. You might *actually want to give them chapter and verse* and at least watch 'em squirm. 
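For what it's worth, the "distinctive text" check raised at [04:13] can be sketched in a few lines of Python. This is only an illustration under stated assumptions: is_isp_banner and the marker strings are hypothetical, nothing like this is known to exist in the warrior code, and how a flagged response would invalidate a task is not covered.

```python
def is_isp_banner(body, markers):
    """Return True if a response body contains any known banner marker.

    `body` is the raw response as bytes; `markers` is a list of phrases
    the operator has seen in the injected page (assumed, site-specific).
    """
    text = body.decode("utf-8", errors="replace").lower()
    return any(marker.lower() in text for marker in markers)


# Hypothetical marker phrases; a real list would come from a saved copy
# of the injected banner.
MARKERS = ["upcoming network maintenance"]

injected = b"<html><body>Notice: Upcoming Network Maintenance</body></html>"
genuine = b"<html><body>Index of /files</body></html>"
print(is_isp_banner(injected, MARKERS), is_isp_banner(genuine, MARKERS))
```

A substring match like this only catches banners that have already been seen, which is one reason the advice in the channel leans toward fixing the ISP rather than the client.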
[04:25] In the long term, I still have a feeling that weird-shit-detection code could add resilience to the whole mess, say a load-balancer shits itself or whatever -- a means of recognizing that the crawl just returned 50 identical-and-very-short pages might make for better results in lots of circumstances, in addition to the situation of ISP malfeasance (that some warrior owners might not realize applies to [04:25] their connections too). [04:26] but I'm aware that ideas are not the same as submitting patches, so... nevermind I guess. [04:26] True, not sure exactly what to do but it's probably worth figuring out some way to deal with hackers on your network, erm I mean, bad ISPs. [04:27] *** JesseW has joined #archiveteam [04:27] this is more suitable for #warrior [04:27] Oh right. [04:32] *** Guest100 has joined #archiveteam [04:36] *** VADemon has joined #archiveteam [04:37] *** aaaaaaaaa has quit IRC (Leaving) [04:52] *** Guest100 has quit IRC (Quit: My Mac has gone to sleep. ZZZzzz…) [05:07] *** Guest100 has joined #archiveteam [05:11] *** Guest100 has quit IRC (Client Quit) [05:15] *** Guest100 has joined #archiveteam [05:50] *** godane has quit IRC (Ping timeout: 258 seconds) [06:19] *** wp494 has quit IRC (Quit: LOUD UNNECESSARY QUIT MESSAGES) [06:20] *** Guest100 has quit IRC (My Mac has gone to sleep. ZZZzzz…) [06:31] *** godane has joined #archiveteam [06:35] *** scyther has joined #archiveteam [06:41] *** Guest100 has joined #archiveteam [06:50] *** Ungstein has joined #archiveteam [07:02] *** habi has joined #archiveteam [07:02] *** habi has left [07:10] *** wutno has joined #archiveteam [07:20] *** Guest100 has quit IRC (Quit: My Mac has gone to sleep. 
ZZZzzz…) [07:34] *** Atom__ has quit IRC (Read error: Connection reset by peer) [07:36] *** Atom__ has joined #archiveteam [07:45] *** habi has joined #archiveteam [07:45] *** habi has quit IRC (Read error: Connection reset by peer) [07:46] *** atomotic has joined #archiveteam [07:48] *** atomotic has quit IRC (Client Quit) [07:57] *** dashcloud has quit IRC (Read error: Connection reset by peer) [07:58] *** dashcloud has joined #archiveteam [08:01] *** Ravenloft has joined #archiveteam [08:05] *** primus104 has joined #archiveteam [08:05] *** khaoohs has joined #archiveteam [08:07] *** schbirid has joined #archiveteam [08:11] *** khaoohs_ has quit IRC (Ping timeout: 483 seconds) [08:54] *** scyther has quit IRC (Read error: Connection reset by peer) [08:55] *** scyther has joined #archiveteam [09:03] *** wp494 has joined #archiveteam [09:04] *** scyther has quit IRC (Read error: Connection reset by peer) [09:39] *** dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.) [09:40] *** primus104 has quit IRC (Leaving.) [09:41] *** dashcloud has joined #archiveteam [09:42] *** vitzli has joined #archiveteam [10:12] *** xk_id has joined #archiveteam [11:06] *** tglass has joined #archiveteam [12:07] *** primus104 has joined #archiveteam [12:12] *** BlueMaxim has quit IRC (Quit: Leaving) [12:31] *** Stilett0 has joined #archiveteam [12:34] *** Stiletto has quit IRC (Ping timeout: 306 seconds) [12:54] *** lytv has quit IRC (Ping timeout: 252 seconds) [12:54] *** Ungstein has quit IRC (Ping timeout: 252 seconds) [12:55] *** Ungstein has joined #archiveteam [13:17] *** Atom__ has quit IRC (Read error: Connection reset by peer) [13:26] *** Atom__ has joined #archiveteam [13:40] *** Atom__ has quit IRC (Ping timeout: 306 seconds) [13:43] *** Spacedawg has joined #archiveteam [14:40] Any projects other then url team running? 
[14:43] matthusby: don't think so [14:47] this was asked yesterday, and then somebody responded "yes", iirc [14:47] but idk what projects [14:47] :p [14:52] *** human39 has joined #archiveteam [14:57] *** human39 has quit IRC (Leaving) [15:02] *** phuzion_ is now known as phuzion [15:06] *** jspiros has quit IRC (hub.efnet.us irc.umich.edu) [15:06] *** Fusl has quit IRC (hub.efnet.us irc.umich.edu) [15:06] *** filippo__ has quit IRC (hub.efnet.us irc.umich.edu) [15:06] *** trs80 has quit IRC (hub.efnet.us irc.umich.edu) [15:08] *** primus has joined #archiveteam [15:11] *** jspiros has joined #archiveteam [15:11] *** Fusl has joined #archiveteam [15:23] *** primus104 has quit IRC (Leaving.) [15:26] *** filippo__ has joined #archiveteam [15:33] *** SimpBrain has quit IRC (Quit: Leaving) [15:34] *** rizzzz has quit IRC (Remote host closed the connection) [15:35] *** rizzzz has joined #archiveteam [15:46] Nothing major, yet [15:46] I think we went to -bs last night [15:48] *** vOYtEC has joined #archiveteam [15:50] *** vOYtEC has quit IRC (Remote host closed the connection) [15:51] *** JesseW has quit IRC (Read error: Operation timed out) [15:52] *** JesseW has joined #archiveteam [15:56] *** Guest100 has joined #archiveteam [16:00] *** trs80 has joined #archiveteam [16:03] *** tsp_ has joined #archiveteam [16:15] *** yakfish has quit IRC (Read error: Operation timed out) [16:16] *** yakfish has joined #archiveteam [16:20] *** vOYtEC has joined #archiveteam [16:23] *** Guest100 has quit IRC (Quit: My Mac has gone to sleep. ZZZzzz…) [16:25] *** JesseW has quit IRC (Read error: Operation timed out) [16:28] *** godane has quit IRC (Quit: Leaving.) [16:29] *** godane has joined #archiveteam [16:41] *** godane has quit IRC (Quit: Leaving.) 
[16:43] *** tglass has quit IRC (Quit: Page closed) [16:48] *** lytv has joined #archiveteam [17:36] *** vitzli has quit IRC (Quit: Leaving) [17:43] *** PurpleSym has joined #archiveteam [18:00] *** primus104 has joined #archiveteam [18:08] *** scyther has joined #archiveteam [18:18] *** Start has quit IRC (Quit: Disconnected.) [18:33] *** Sk1d has quit IRC (Quit: ZNC - http://znc.in) [18:34] *** Sk1d has joined #archiveteam [18:37] *** Stilett0 has quit IRC (Read error: Connection reset by peer) [18:37] *** MrPenguin has joined #archiveteam [18:37] Hello. [18:37] * myself pretends to be asleep [18:38] I don't know if that's the right channel to propose a new project, but I'd like to suggest backing up http://www.drdobbs.com/ [18:38] Crap. Are they going away too? [18:38] It appears to have a lot of useful programming-related info, but has shut down for new articles [18:39] Yeah, apparently ads can't finance it [18:39] Farewell article is on http://www.drdobbs.com/architecture-and-design/farewell-dr-dobbs/240169421 [18:40] *** xk_id_ has joined #archiveteam [18:40] They said that "The present content will remain available indefinitely", but I think it's better to back it up just in case [18:41] I'm new here but tend to agree. [18:41] Yeah, I'm just kind of browsing through the Archive Team wiki and generally approve of the effort, but can't contribute unfortunately [18:42] *** xk_id has quit IRC (Read error: Operation timed out) [18:46] My connection isn't entirely good for contributing to the mirroring... [18:50] *** Stiletto has joined #archiveteam [18:52] *** habi has joined #archiveteam [18:53] *** habi has left [18:57] *** aaaaaaaaa has joined #archiveteam [18:57] *** swebb sets mode: +o aaaaaaaaa [18:57] *** habi1 has joined #archiveteam [19:00] *** VADemon has quit IRC (Read error: Connection reset by peer) [19:02] MrPenguin: yes. that needs backing up. 
please hold [19:06] That may have been archivebotted [19:06] MrPenguin: myself: put it into archivebot: http://dashboard.at.ninjawedding.org/ [19:06] yeah, that site would be ok for archivebot [19:06] :) [19:06] if you see it getting stuck on stuff like a calendar or permalinks, please let me know [19:06] just highlight me and I'll fix it [19:07] (I didn't see anything obviously loopy from a quick glance around) [19:07] cc arkiver [19:07] yeah [19:07] website looks ok [19:08] FYI, that was last grabbed in december '14 [19:08] how large was the grab? [19:09] I'll do the math, hold on [19:11] just over 10 GB [19:13] mostly text, so probably compresses well despite the size [19:19] *** K4k has joined #archiveteam [19:21] joepie91: Thanks :) So the mirroring is running now? That's great. [19:22] MrPenguin: yeah [19:22] *** aaaaaaaa_ has joined #archiveteam [19:22] *** swebb sets mode: +o aaaaaaaa_ [19:22] * myself smacks a Staples button [19:23] *** aaaaaaaaa has quit IRC (Ping timeout: 600 seconds) [19:24] *** aaaaaaaa_ is now known as aaaaaaaaa [19:25] heh, maybe you should wait until the job is done before you call it easy [19:25] The ArchiveBot beta console looks quite good [19:25] *** SimpBrain has joined #archiveteam [19:26] aaaaaaaaa: pft. after Blip, everything's easy [19:26] :P [19:26] Wait, so the ArchiveBot also archives links on immediately "adjacent" sites too, I mean first-degree links to other domains? [19:26] MrPenguin: yep [19:26] I didn't know that:) [19:27] I thought it only does the mirroring on the destination site [19:28] You can explicitly tell it to do that too, and not do the links that aim off site. Most web scrapers have a setting like that. [19:34] *** habi1 has left [19:37] *** scyther has quit IRC (Leaving) [19:37] *** Stilett0 has joined #archiveteam [19:39] *** Stiletto has quit IRC (Read error: Operation timed out) [19:50] *** K4k has quit IRC (Ping timeout: 240 seconds) [19:55] I have a question. 
For the purposes of archiving a single site with no off-site links, which is better - wget, ArchiveBot or HTTrack? The website in question is an "Index of/"-style site with no JavaScript or anything like that. [19:56] httrack doesn't produce warc so that's out [19:57] wget or wpull should work equally well, archivebot is basically a big wrapper for wpull that you might not need if it's just a single simple grab [20:01] *** PurpleSym has quit IRC (Quit: WeeChat 1.1.1) [20:01] *** scyther has joined #archiveteam [20:03] I don't really need WARC, I just mostly want to have a mirror of this site on my local drive with least space. I'm mainly after the site contents, so integrity is not a big consideration for me in this case. [20:04] Or versioning really, the website doesn't look like it's changing often [20:12] *** godane has joined #archiveteam [20:15] https://github.com/ludios/grab-site [20:35] *** schbirid has quit IRC (Quit: Leaving) [20:41] *** atlogbot has quit IRC (Remote host closed the connection) [20:42] *** atlogbot has joined #archiveteam [20:46] *** Stilett0 is now known as Stiletto [20:53] *** aaaaaaaa_ has joined #archiveteam [20:53] *** aaaaaaaaa has quit IRC (Read error: Connection reset by peer) [20:53] *** swebb sets mode: +o aaaaaaaa_ [21:04] *** xk_id_ has quit IRC (Remote host closed the connection) [21:19] *** aaaaaaaa_ is now known as aaaaaaaaa [21:20] *** wyatt8740 has joined #archiveteam [21:20] anyone know where I can find an archive of this page? someone else grabbed the TLD and added a bullshit robots.txt to prevent people seeing the archive http://gyrolabs.com/2006/09/25/jediconcentrate-mod/ [21:21] *grabbed the domain [21:21] not the TLD [21:30] *** scyther has quit IRC (Leaving) [21:34] *** MrPenguin has quit IRC (Ping timeout: 240 seconds) [21:34] wyatt8740: you just need the ZIPs? 
[21:34] wyatt8740: if so: https://transfer.sh/%28/JCpuf/jediconcentrate-options.zip,/Gy68e/jediconcentrate-options-source.zip%29.zip [21:35] joepie91: I guess that works, but I was hoping for the webpage itself and the info on it [21:35] wyatt8740: moment [21:35] wyatt8740: there's not much content on it, fwiw [21:35] I know absolutely nothing about it and want to :\ [21:36] wyatt8740: want the blog post contents or also the comments? [21:36] saw a link to it, followed, it was dead [21:36] if the comments are useful, sure [21:36] otherwise contents are good enough I guess [21:36] wyatt8740: http://sprunge.us/QGKM [21:36] :) [21:39] wyatt8740: not much, as you can see [21:39] joepie91: thanks a lot :D [21:39] yw :P [21:48] *** aaaaaaaa_ has joined #archiveteam [21:48] *** aaaaaaaaa has quit IRC (Read error: Connection reset by peer) [21:48] *** swebb sets mode: +o aaaaaaaa_ [21:50] *** aaaaaaaa_ is now known as aaaaaaaaa [22:07] *** SimpBrain has quit IRC (Quit: Leaving) [22:14] *** nertzy has joined #archiveteam [22:47] *** xk_id has joined #archiveteam [23:02] *** nertzy has quit IRC (Quit: This computer has gone to sleep) [23:05] *** MrPenguin has joined #archiveteam [23:07] *** philpem has quit IRC (Remote host closed the connection) [23:07] *** BlueMaxim has joined #archiveteam [23:32] *** Sk1d has quit IRC (Read error: Operation timed out) [23:33] *** Sk1d has joined #archiveteam [23:46] https://www.eff.org/deeplinks/2015/08/speech-enables-speech-china-takes-aim-its-coders