#archiveteam 2015-08-31,Mon


Time Nickname Message
00:06 🔗 wvdp has joined #archiveteam
00:11 🔗 dashcloud has quit IRC (Read error: Connection reset by peer)
00:11 🔗 dashcloud has joined #archiveteam
00:12 🔗 wvdp hallo, i'm looking for a warc file to test an application i'm writing. preferably about 4 mb and containing different types of records.
00:34 🔗 khaoohs_ has joined #archiveteam
00:35 🔗 JesseW wvdp: You could look here: http://archive.fart.website/archivebot/viewer/item/archiveteam_archivebot_go_20150830100001
00:36 🔗 JesseW This is about a half a MB: https://archive.org/download/archiveteam_archivebot_go_20150830100001/twitter.com-inf-20150830-060528-34f3f-meta.warc.gz
00:36 🔗 Zandro that expands to 6-8MB
00:37 🔗 Zandro still a better suited sample than what I'd be able to share :P
00:40 🔗 khaoohs has quit IRC (Read error: Operation timed out)
00:40 🔗 JesseW archivebot produces a *lot* of WARCs...
00:40 🔗 JesseW I'm still somewhat confused that there isn't a definitive WARC-to-static-local-files transformer available...
00:41 🔗 JesseW (or if there is, that I haven't found it yet)
00:41 🔗 Zandro agreed, though since it's plaintext it should be trivial to write a parser-splitter
00:42 🔗 wvdp i found something
00:42 🔗 wvdp https://archive.org/download/archiveteam_archivebot_go_001/2luxz0kz3lhxnymxhnwz2ky1s-20131010-090331.warc.gz
00:42 🔗 wvdp and i vastly overestimated the size of a warc record; there are hundreds in here
00:44 🔗 Zandro I'm wrong, binary imagedata is stored in the warc as well
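The parser-splitter idea from the exchange above can be sketched in Python. This is a minimal illustration, not an existing tool: the name `iter_warc_records` and the `SAMPLE` data are made up here, and the sketch assumes each record has a `WARC/` version line, `Name: value` headers, a blank line, and then exactly `Content-Length` bytes of payload. Reading the payload by length rather than by line is what keeps binary image data intact.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if line in (b"\r\n", b"\n"):
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line[:40])
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # The payload may be binary, so read it by length instead of by line.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


# Hypothetical two-record sample used only for illustration.
SAMPLE = (
    b"WARC/1.0\r\nWARC-Type: response\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n"
    b"WARC/1.0\r\nWARC-Type: metadata\r\nContent-Length: 3\r\n\r\n\x00\n\x02\r\n\r\n"
)

if __name__ == "__main__":
    for headers, payload in iter_warc_records(io.BytesIO(SAMPLE)):
        print(headers.get("WARC-Type"), len(payload))
```

For a `.warc.gz` file, passing `gzip.open(path, "rb")` as the stream should work, since the gzip module reads concatenated gzip members as one stream.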
00:46 🔗 zenguy_pc has quit IRC (Read error: Connection reset by peer)
00:46 🔗 BlueMaxim has joined #archiveteam
00:52 🔗 joepie91 I have tracked down the solution for our occasional wget-lua documentation compilation issue!
00:52 🔗 joepie91 found a workaround, and a patch - the patch certainly works, the workaround presumably does also
00:53 🔗 joepie91 the patch is available at https://lists.gnu.org/archive/html/bug-wget/2013-06/msg00046.html, and the workaround is to do `make install_sw` instead of `make install`, so that could also be brought into the get-wget-lua.sh script
00:53 🔗 joepie91 (will make PR)
00:54 🔗 caber has quit IRC (Read error: Operation timed out)
00:54 🔗 joepie91 hum. yipdw: chfoo: get-wget-lua is not in the README template repo, where should I point my PR at ?
00:55 🔗 joepie91 er, wait...
00:55 🔗 joepie91 nevermind the workaround, we'd have to apply the patch, because there's no install step
00:55 🔗 wp494 has quit IRC (LOUD UNNECESSARY QUIT MESSAGES)
00:55 🔗 caber has joined #archiveteam
01:06 🔗 wp494 has joined #archiveteam
01:11 🔗 wp494 has quit IRC (Client Quit)
01:15 🔗 joepie91 I have a working wget-lua package for Nix! :D
01:16 🔗 kyan has joined #archiveteam
01:20 🔗 xk_id has quit IRC (Remote host closed the connection)
01:30 🔗 kyan has quit IRC (Quit: Leaving)
01:30 🔗 kyan has joined #archiveteam
01:37 🔗 xk_id has joined #archiveteam
01:43 🔗 kyan has quit IRC (Quit: Leaving)
01:45 🔗 aaaaaaaaa joepie91: I just use a oneliner to fix the docs: sed -e "s/\(item \)\([0-9]\)/\1\.\2/" ./doc/wget.texi > ./doc/wget.texi.tmp && mv ./doc/wget.texi.tmp ./doc/wget.texi
01:46 🔗 aaaaaaaaa in the get-wget-lua.sh file. I believe that fixes the syntax error
01:47 🔗 aaaaaaaaa or at least it stops the complaining, plus docs don't really seem to matter for most AT usage
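For anyone who would rather not shell out to sed, a rough Python equivalent of the one-liner above might look like the following. The function name `fix_item_lines` is hypothetical, and the sketch assumes the sed expression is meant to insert a dot between `item ` and a following digit, touching only the first match on each line (sed without the `g` flag).

```python
import re

# Same pattern as the sed expression: "item " followed by one digit.
ITEM_DIGIT = re.compile(r"(item )([0-9])")


def fix_item_lines(text):
    """Insert a dot between 'item ' and a digit, first match per line only."""
    return "\n".join(
        ITEM_DIGIT.sub(r"\1.\2", line, count=1) for line in text.split("\n")
    )


if __name__ == "__main__":
    print(fix_item_lines("@item 1\nplain text\n"))
```

Applying it to `./doc/wget.texi` would mean reading the file, passing its contents through `fix_item_lines`, and writing the result back.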
01:59 🔗 wvdp has quit IRC (Ping timeout: 240 seconds)
02:08 🔗 nertzy has quit IRC (Quit: This computer has gone to sleep)
02:10 🔗 boozehoun has joined #archiveteam
02:10 🔗 primus104 has joined #archiveteam
02:10 🔗 VADemon has joined #archiveteam
02:30 🔗 primus104 has quit IRC (Leaving.)
02:46 🔗 xk_id has quit IRC (Remote host closed the connection)
02:47 🔗 JesseW has quit IRC (Read error: Operation timed out)
02:52 🔗 wp494 has joined #archiveteam
02:53 🔗 RichardG has quit IRC (Read error: Connection reset by peer)
02:54 🔗 RichardG has joined #archiveteam
03:06 🔗 JesseW has joined #archiveteam
03:07 🔗 Ravenloft has joined #archiveteam
03:12 🔗 VADemon has quit IRC (left4dead)
03:18 🔗 myself has joined #archiveteam
03:30 🔗 JesseW I found a reasonably canonical warc-to-static-files transformer, it's chfoo's warcat tool, using the extract command. Now what I'd like it to do is have the wget --convert-links option supported.
03:53 🔗 aaaaaaaaa You could probably whip something up in python or sed or something.
03:57 🔗 Ravenloft has quit IRC (Read error: Connection reset by peer)
03:58 🔗 pikhq_ I'd advise Python rather than sed.
03:58 🔗 pikhq_ Python can more reasonably parse the HTML for URLs. :)
04:02 🔗 aaaaaaaaa yeah, I was just spitting something out thinking of regexes, but yeah, an html.parser would probably work better.
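As a starting point for the html.parser approach mentioned above, here is a minimal sketch that collects the URLs a link converter would need to rewrite. It is not part of warcat or wget; the class and function names are invented for illustration, and it only looks at `href` and `src` attributes, which is an assumption about what matters for a simple mirror.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs found in href and src attributes."""

    URL_ATTRS = {"href", "src"}  # assumption: other URL-bearing attributes are ignored

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in self.URL_ATTRS and value:
                # Resolve relative references against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def collect_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    page = '<a href="/about">about</a><img src="logo.png">'
    print(collect_links(page, "http://example.com/blog/post.html"))
```

A real convert-links step would still have to map each collected URL to the local file path it was extracted to and rewrite the attribute, which this sketch does not attempt.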
04:13 🔗 myself Noob question: Once in a while, my ISP will MitM a connection to "inform me" about upcoming network maintenance. It's well-intentioned but of course means I can't run a Warrior. Except: The banner has distinctive text in it.. I feel like there's probably a way to detect that and invalidate my current task and suspend the warrior or something, while still allowing me to contribute the other 99% of the time. However, this might be a lot of work, and it might be more sensible to just not run an instance over here. Forgive me if this is an oft-recurring topic, but I'm curious what the thoughts are on this issue, if it's come up before.
04:13 🔗 JesseW has quit IRC (Read error: Operation timed out)
04:14 🔗 pikhq_ myself: My thoughts: make a lot of noise at them.
04:14 🔗 MrRadar Aside from convincing your ISP to stop doing that probably the best way to work around it would be a VPN
04:15 🔗 myself I've already opted out of their DNS-bullshit, I'll see if I can opt out of the HTTP thing too. But yeah. If I want to go the VPN route I'd just run the warrior instance on the machine that would run my VPN endpoint and avoid handling all the data twice. :P
04:16 🔗 pikhq_ Specifically, mention that it's a violation of the Computer Fraud and Abuse Act and advise that you are considering contacting the appropriate authorities.
04:16 🔗 pikhq_ (seriously, that is a damned *attack*)
04:16 🔗 myself Heh. If I can get chapter and verse for that, I'll gleefully go that route.
04:17 🔗 pikhq_ 18 USC § 1030
04:17 🔗 pikhq_ It's a fucking crime.
04:19 🔗 myself hmm, under a(5)A?
04:19 🔗 myself that'd be... hazy...
04:19 🔗 pikhq_ No more hazy than a lot of the other cases
04:20 🔗 pikhq_ Also, you're not necessarily having to actually *do* this, you're having to make them afraid enough they back down.
04:20 🔗 pikhq_ (though honestly, I'm pretty sure a MITM attack on your traffic would count.)
04:20 🔗 myself Short story: Technical measures not practical (because finding text in http is hard or something), yelling at provider is the only means worth considering. Got it, thanks.
04:21 🔗 pikhq_ I'd consider it higher priority than technical measures, because *this is frankly offensive*.
04:23 🔗 yipdw myself: it's easier to not run an instance if you know you're being MITMed
04:23 🔗 myself Right, that's what I've been doing so far. I'll keep doing that.
04:23 🔗 yipdw thanks
04:24 🔗 pikhq_ I'm not even kidding about the yelling at the ISP thing though. You might *actually want to give them chapter and verse* and at least watch 'em squirm.
04:25 🔗 myself In the long term, I still have a feeling that weird-shit-detection code could add resilience to the whole mess, say a load-balancer shits itself or whatever -- a means of recognizing that the crawl just returned 50 identical-and-very-short pages might make for better results in lots of circumstances, in addition to the situation of ISP malfeasance (that some warrior owners might not realize applies to their connections too).
04:26 🔗 myself but I'm aware that ideas are not the same as submitting patches, so... nevermind I guess.
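The detection idea described above (a known banner string, or a run of identical and very short pages) could be sketched roughly as follows. Nothing like this is confirmed to exist in the Warrior code; the function name, the 2048-byte cutoff, and the default threshold of 50 repeats are all assumptions chosen to mirror the wording of the chat.

```python
import hashlib
from collections import Counter


def looks_tampered(bodies, banner_text=None, max_short_len=2048, min_repeats=50):
    """Return True if a batch of response bodies (bytes) looks injected or broken.

    Two heuristics, both assumptions for illustration:
    - any body contains the known banner text, or
    - at least min_repeats bodies are short and byte-for-byte identical.
    """
    if banner_text is not None and any(banner_text in body for body in bodies):
        return True
    counts = Counter(
        hashlib.sha1(body).hexdigest()
        for body in bodies
        if len(body) <= max_short_len
    )
    return any(n >= min_repeats for n in counts.values())


if __name__ == "__main__":
    batch = [b"<html>Scheduled maintenance notice</html>"] * 50
    print(looks_tampered(batch))
```

What to do on a positive result (discard the task, pause the instance) is a separate question that the sketch leaves open.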
04:26 🔗 pikhq_ True, not sure exactly what to do but it's probably worth figuring out some way to deal with hackers on your network, erm I mean, bad ISPs.
04:27 🔗 JesseW has joined #archiveteam
04:27 🔗 yipdw this is more suitable for #warrior
04:27 🔗 myself Oh right.
04:32 🔗 Guest100 has joined #archiveteam
04:36 🔗 VADemon has joined #archiveteam
04:37 🔗 aaaaaaaaa has quit IRC (Leaving)
04:52 🔗 Guest100 has quit IRC (Quit: My Mac has gone to sleep. ZZZzzz…)
05:07 🔗 Guest100 has joined #archiveteam
05:11 🔗 Guest100 has quit IRC (Client Quit)
05:15 🔗 Guest100 has joined #archiveteam
05:50 🔗 godane has quit IRC (Ping timeout: 258 seconds)
06:19 🔗 wp494 has quit IRC (Quit: LOUD UNNECESSARY QUIT MESSAGES)
06:20 🔗 Guest100 has quit IRC (My Mac has gone to sleep. ZZZzzz…)
06:31 🔗 godane has joined #archiveteam
06:35 🔗 scyther has joined #archiveteam
06:41 🔗 Guest100 has joined #archiveteam
06:50 🔗 Ungstein has joined #archiveteam
07:02 🔗 habi has joined #archiveteam
07:02 🔗 habi has left
07:10 🔗 wutno has joined #archiveteam
07:20 🔗 Guest100 has quit IRC (Quit: My Mac has gone to sleep. ZZZzzz…)
07:34 🔗 Atom__ has quit IRC (Read error: Connection reset by peer)
07:36 🔗 Atom__ has joined #archiveteam
07:45 🔗 habi has joined #archiveteam
07:45 🔗 habi has quit IRC (Read error: Connection reset by peer)
07:46 🔗 atomotic has joined #archiveteam
07:48 🔗 atomotic has quit IRC (Client Quit)
07:57 🔗 dashcloud has quit IRC (Read error: Connection reset by peer)
07:58 🔗 dashcloud has joined #archiveteam
08:01 🔗 Ravenloft has joined #archiveteam
08:05 🔗 primus104 has joined #archiveteam
08:05 🔗 khaoohs has joined #archiveteam
08:07 🔗 schbirid has joined #archiveteam
08:11 🔗 khaoohs_ has quit IRC (Ping timeout: 483 seconds)
08:54 🔗 scyther has quit IRC (Read error: Connection reset by peer)
08:55 🔗 scyther has joined #archiveteam
09:03 🔗 wp494 has joined #archiveteam
09:04 🔗 scyther has quit IRC (Read error: Connection reset by peer)
09:39 🔗 dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.)
09:40 🔗 primus104 has quit IRC (Leaving.)
09:41 🔗 dashcloud has joined #archiveteam
09:42 🔗 vitzli has joined #archiveteam
10:12 🔗 xk_id has joined #archiveteam
11:06 🔗 tglass has joined #archiveteam
12:07 🔗 primus104 has joined #archiveteam
12:12 🔗 BlueMaxim has quit IRC (Quit: Leaving)
12:31 🔗 Stilett0 has joined #archiveteam
12:34 🔗 Stiletto has quit IRC (Ping timeout: 306 seconds)
12:54 🔗 lytv has quit IRC (Ping timeout: 252 seconds)
12:54 🔗 Ungstein has quit IRC (Ping timeout: 252 seconds)
12:55 🔗 Ungstein has joined #archiveteam
13:17 🔗 Atom__ has quit IRC (Read error: Connection reset by peer)
13:26 🔗 Atom__ has joined #archiveteam
13:40 🔗 Atom__ has quit IRC (Ping timeout: 306 seconds)
13:43 🔗 Spacedawg has joined #archiveteam
14:40 🔗 matthusby Any projects other than url team running?
14:43 🔗 zhongfu matthusby: don't think so
14:47 🔗 joepie91 this was asked yesterday, and then somebody responded "yes", iirc
14:47 🔗 joepie91 but idk what projects
14:47 🔗 joepie91 :p
14:52 🔗 human39 has joined #archiveteam
14:57 🔗 human39 has quit IRC (Leaving)
15:02 🔗 phuzion_ is now known as phuzion
15:06 🔗 jspiros has quit IRC (hub.efnet.us irc.umich.edu)
15:06 🔗 Fusl has quit IRC (hub.efnet.us irc.umich.edu)
15:06 🔗 filippo__ has quit IRC (hub.efnet.us irc.umich.edu)
15:06 🔗 trs80 has quit IRC (hub.efnet.us irc.umich.edu)
15:08 🔗 primus has joined #archiveteam
15:11 🔗 jspiros has joined #archiveteam
15:11 🔗 Fusl has joined #archiveteam
15:23 🔗 primus104 has quit IRC (Leaving.)
15:26 🔗 filippo__ has joined #archiveteam
15:33 🔗 SimpBrain has quit IRC (Quit: Leaving)
15:34 🔗 rizzzz has quit IRC (Remote host closed the connection)
15:35 🔗 rizzzz has joined #archiveteam
15:46 🔗 SketchCow Nothing major, yet
15:46 🔗 SketchCow I think we went to -bs last night
15:48 🔗 vOYtEC has joined #archiveteam
15:50 🔗 vOYtEC has quit IRC (Remote host closed the connection)
15:51 🔗 JesseW has quit IRC (Read error: Operation timed out)
15:52 🔗 JesseW has joined #archiveteam
15:56 🔗 Guest100 has joined #archiveteam
16:00 🔗 trs80 has joined #archiveteam
16:03 🔗 tsp_ has joined #archiveteam
16:15 🔗 yakfish has quit IRC (Read error: Operation timed out)
16:16 🔗 yakfish has joined #archiveteam
16:20 🔗 vOYtEC has joined #archiveteam
16:23 🔗 Guest100 has quit IRC (Quit: My Mac has gone to sleep. ZZZzzz…)
16:25 🔗 JesseW has quit IRC (Read error: Operation timed out)
16:28 🔗 godane has quit IRC (Quit: Leaving.)
16:29 🔗 godane has joined #archiveteam
16:41 🔗 godane has quit IRC (Quit: Leaving.)
16:43 🔗 tglass has quit IRC (Quit: Page closed)
16:48 🔗 lytv has joined #archiveteam
17:36 🔗 vitzli has quit IRC (Quit: Leaving)
17:43 🔗 PurpleSym has joined #archiveteam
18:00 🔗 primus104 has joined #archiveteam
18:08 🔗 scyther has joined #archiveteam
18:18 🔗 Start has quit IRC (Quit: Disconnected.)
18:33 🔗 Sk1d has quit IRC (Quit: ZNC - http://znc.in)
18:34 🔗 Sk1d has joined #archiveteam
18:37 🔗 Stilett0 has quit IRC (Read error: Connection reset by peer)
18:37 🔗 MrPenguin has joined #archiveteam
18:37 🔗 MrPenguin Hello.
18:37 🔗 * myself pretends to be asleep
18:38 🔗 MrPenguin I don't know if that's the right channel to propose a new project, but I'd like to suggest backing up http://www.drdobbs.com/
18:38 🔗 myself Crap. Are they going away too?
18:38 🔗 MrPenguin It appears to have a lot of useful programming-related info, but has shut down for new articles
18:39 🔗 MrPenguin Yeah, apparently ads can't finance it
18:39 🔗 MrPenguin Farewell article is on http://www.drdobbs.com/architecture-and-design/farewell-dr-dobbs/240169421
18:40 🔗 xk_id_ has joined #archiveteam
18:40 🔗 MrPenguin They said that "The present content will remain available indefinitely", but I think it's better to back it up just in case
18:41 🔗 myself I'm new here but tend to agree.
18:41 🔗 MrPenguin Yeah, I'm just kind of browsing through the Archive Team wiki and generally approve of the effort, but can't contribute unfortunately
18:42 🔗 xk_id has quit IRC (Read error: Operation timed out)
18:46 🔗 MrPenguin My connection isn't entirely good for contributing to the mirroring...
18:50 🔗 Stiletto has joined #archiveteam
18:52 🔗 habi has joined #archiveteam
18:53 🔗 habi has left
18:57 🔗 aaaaaaaaa has joined #archiveteam
18:57 🔗 swebb sets mode: +o aaaaaaaaa
18:57 🔗 habi1 has joined #archiveteam
19:00 🔗 VADemon has quit IRC (Read error: Connection reset by peer)
19:02 🔗 joepie91 MrPenguin: yes. that needs backing up. please hold
19:06 🔗 aaaaaaaaa That may have been archivebotted
19:06 🔗 joepie91 MrPenguin: myself: put it into archivebot: http://dashboard.at.ninjawedding.org/
19:06 🔗 arkiver yeah, that site would be ok for archivebot
19:06 🔗 arkiver :)
19:06 🔗 joepie91 if you see it getting stuck on stuff like a calendar or permalinks, please let me know
19:06 🔗 joepie91 just highlight me and I'll fix it
19:07 🔗 joepie91 (I didn't see anything obviously loopy from a quick glance around)
19:07 🔗 joepie91 cc arkiver
19:07 🔗 arkiver yeah
19:07 🔗 arkiver website looks ok
19:08 🔗 aaaaaaaaa FYI, that was last grabbed in december '14
19:08 🔗 arkiver how large was the grab?
19:09 🔗 aaaaaaaaa I'll do the math, hold on
19:11 🔗 aaaaaaaaa just over 10 GB
19:13 🔗 aaaaaaaaa mostly text, so probably compresses well despite the size
19:19 🔗 K4k has joined #archiveteam
19:21 🔗 MrPenguin joepie91: Thanks :) So the mirroring is running now? That's great.
19:22 🔗 joepie91 MrPenguin: yeah
19:22 🔗 aaaaaaaa_ has joined #archiveteam
19:22 🔗 swebb sets mode: +o aaaaaaaa_
19:22 🔗 * myself smacks a Staples button
19:23 🔗 aaaaaaaaa has quit IRC (Ping timeout: 600 seconds)
19:24 🔗 aaaaaaaa_ is now known as aaaaaaaaa
19:25 🔗 aaaaaaaaa heh, maybe you should wait until the job is done before you call it easy
19:25 🔗 MrPenguin The ArchiveBot beta console looks quite good
19:25 🔗 SimpBrain has joined #archiveteam
19:26 🔗 joepie91 aaaaaaaaa: pft. after Blip, everything's easy
19:26 🔗 joepie91 :P
19:26 🔗 MrPenguin Wait, so the ArchiveBot also archives links on immediately "adjacent" sites too, I mean first-degree links to other domains?
19:26 🔗 joepie91 MrPenguin: yep
19:26 🔗 MrPenguin I didn't know that:)
19:27 🔗 MrPenguin I thought it only does the mirroring on the destination site
19:28 🔗 aaaaaaaaa You can explicitly tell it to do that too, and not do the links that aim off site. Most web scrapers have a setting like that.
19:34 🔗 habi1 has left
19:37 🔗 scyther has quit IRC (Leaving)
19:37 🔗 Stilett0 has joined #archiveteam
19:39 🔗 Stiletto has quit IRC (Read error: Operation timed out)
19:50 🔗 K4k has quit IRC (Ping timeout: 240 seconds)
19:55 🔗 MrPenguin I have a question. For the purposes of archiving a single site with no off-site links, which is better - wget, ArchiveBot or HTTrack? The website in question is an "Index of/"-style site with no JavaScript or anything like that.
19:56 🔗 DFJustin httrack doesn't produce warc so that's out
19:57 🔗 DFJustin wget or wpull should work equally well, archivebot is basically a big wrapper for wpull that you might not need if it's just a single simple grab
20:01 🔗 PurpleSym has quit IRC (Quit: WeeChat 1.1.1)
20:01 🔗 scyther has joined #archiveteam
20:03 🔗 MrPenguin I don't really need WARC, I just mostly want to have a mirror of this site on my local drive with least space. I'm mainly after the site contents, so integrity is not a big consideration for me in this case.
20:04 🔗 MrPenguin Or versioning really, the website doesn't look like it's changing often
20:12 🔗 godane has joined #archiveteam
20:15 🔗 Sanqui https://github.com/ludios/grab-site
20:35 🔗 schbirid has quit IRC (Quit: Leaving)
20:41 🔗 atlogbot has quit IRC (Remote host closed the connection)
20:42 🔗 atlogbot has joined #archiveteam
20:46 🔗 Stilett0 is now known as Stiletto
20:53 🔗 aaaaaaaa_ has joined #archiveteam
20:53 🔗 aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
20:53 🔗 swebb sets mode: +o aaaaaaaa_
21:04 🔗 xk_id_ has quit IRC (Remote host closed the connection)
21:19 🔗 aaaaaaaa_ is now known as aaaaaaaaa
21:20 🔗 wyatt8740 has joined #archiveteam
21:20 🔗 wyatt8740 anyone know where I can find an archive of this page? someone else grabbed the TLD and added a bullshit robots.txt to prevent people seeing the archive http://gyrolabs.com/2006/09/25/jediconcentrate-mod/
21:21 🔗 wyatt8740 *grabbed the domain
21:21 🔗 wyatt8740 not the TLD
21:30 🔗 scyther has quit IRC (Leaving)
21:34 🔗 MrPenguin has quit IRC (Ping timeout: 240 seconds)
21:34 🔗 joepie91 wyatt8740: you just need the ZIPs?
21:34 🔗 joepie91 wyatt8740: if so: https://transfer.sh/%28/JCpuf/jediconcentrate-options.zip,/Gy68e/jediconcentrate-options-source.zip%29.zip
21:35 🔗 wyatt8740 joepie91: I guess that works, but I was hoping for the webpage itself and the info on it
21:35 🔗 joepie91 wyatt8740: moment
21:35 🔗 joepie91 wyatt8740: there's not much content on it, fwiw
21:35 🔗 wyatt8740 I know absolutely nothing about it and want to :\
21:36 🔗 joepie91 wyatt8740: want the blog post contents or also the comments?
21:36 🔗 wyatt8740 saw a link to it, followed, it was dead
21:36 🔗 wyatt8740 if the comments are useful, sure
21:36 🔗 wyatt8740 otherwise contents are good enough I guess
21:36 🔗 joepie91 wyatt8740: http://sprunge.us/QGKM
21:36 🔗 joepie91 :)
21:39 🔗 joepie91 wyatt8740: not much, as you can see
21:39 🔗 wyatt8740 joepie91: thanks a lot :D
21:39 🔗 joepie91 yw :P
21:48 🔗 aaaaaaaa_ has joined #archiveteam
21:48 🔗 aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
21:48 🔗 swebb sets mode: +o aaaaaaaa_
21:50 🔗 aaaaaaaa_ is now known as aaaaaaaaa
22:07 🔗 SimpBrain has quit IRC (Quit: Leaving)
22:14 🔗 nertzy has joined #archiveteam
22:47 🔗 xk_id has joined #archiveteam
23:02 🔗 nertzy has quit IRC (Quit: This computer has gone to sleep)
23:05 🔗 MrPenguin has joined #archiveteam
23:07 🔗 philpem has quit IRC (Remote host closed the connection)
23:07 🔗 BlueMaxim has joined #archiveteam
23:32 🔗 Sk1d has quit IRC (Read error: Operation timed out)
23:33 🔗 Sk1d has joined #archiveteam
23:46 🔗 DFJustin https://www.eff.org/deeplinks/2015/08/speech-enables-speech-china-takes-aim-its-coders
