[00:10] http://blog.twitch.tv/2015/03/important-notice-about-your-twitch-account/ [00:10] yey [00:16] heh, "Important notice about your x account" -> h4ckz0r3d or closing [00:20] *** c_b has quit IRC (Ping timeout: 370 seconds) [00:21] yup [00:21] first one [00:21] but can someone archivebot it [00:21] *** nertzy has joined #archiveteam [00:21] for some reason i cna't find the channel [00:24] *** JMC_ has joined #archiveteam [00:26] *** JMC has quit IRC (Ping timeout: 370 seconds) [00:33] *** Emcy has joined #archiveteam [00:58] *** mistym has quit IRC (Remote host closed the connection) [01:01] *** JMC_ is now known as JMC [01:02] *** chfoo has quit IRC (Remote host closed the connection) [01:04] *** Selanda has quit IRC (west.us.hub irc.mzima.net) [01:04] *** BnAboyZ has quit IRC (west.us.hub irc.mzima.net) [01:04] *** cloudmons has quit IRC (west.us.hub irc.mzima.net) [01:04] *** bmcginty has quit IRC (west.us.hub irc.mzima.net) [01:04] *** acridAxid has quit IRC (west.us.hub irc.mzima.net) [01:04] *** patricko- has quit IRC (west.us.hub irc.mzima.net) [01:04] *** Peetz0r has quit IRC (west.us.hub irc.mzima.net) [01:04] *** goekesmi has quit IRC (west.us.hub irc.mzima.net) [01:04] *** edsu_ has quit IRC (west.us.hub irc.mzima.net) [01:04] *** tephra has quit IRC (west.us.hub irc.mzima.net) [01:04] *** dugo_ has quit IRC (west.us.hub irc.mzima.net) [01:04] *** lbft has quit IRC (west.us.hub irc.mzima.net) [01:04] *** Meeh has quit IRC (west.us.hub irc.mzima.net) [01:04] *** chazchaz_ has quit IRC (west.us.hub irc.mzima.net) [01:04] *** Sanqui has quit IRC (west.us.hub irc.mzima.net) [01:04] ran it through liveweb https://web.archive.org/web/20150324010409/http://blog.twitch.tv/2015/03/important-notice-about-your-twitch-account/ [01:07] *** chfoo has joined #archiveteam [01:08] *** svchfoo2 sets mode: +o chfoo [01:09] https://archive.org/details/archiveteam_madden&tab=collection [01:12] *** mistym has joined #archiveteam [01:12] *** Selanda has joined #archiveteam 
[01:12] *** BnAboyZ has joined #archiveteam [01:12] *** cloudmons has joined #archiveteam [01:12] *** bmcginty has joined #archiveteam [01:12] *** acridAxid has joined #archiveteam [01:12] *** patricko- has joined #archiveteam [01:12] *** Peetz0r has joined #archiveteam [01:12] *** goekesmi has joined #archiveteam [01:12] *** edsu_ has joined #archiveteam [01:12] *** tephra has joined #archiveteam [01:12] *** dugo_ has joined #archiveteam [01:12] *** lbft has joined #archiveteam [01:12] *** Meeh has joined #archiveteam [01:12] *** chazchaz_ has joined #archiveteam [01:12] *** Sanqui has joined #archiveteam [01:13] I threw an animated gif in to see if it would do something. [01:15] https://ia601509.us.archive.org/1/items/archiveteam_madden_20150323172542/c7e30ceecfe6da8bcd06121e5ec9a2fc.gif [01:22] *** SimpBrain has quit IRC (ny.us.hub west.us.hub) [01:22] *** rejon has quit IRC (ny.us.hub west.us.hub) [01:22] *** ohhdemgir has quit IRC (ny.us.hub west.us.hub) [01:22] *** Sue_ has quit IRC (ny.us.hub west.us.hub) [01:22] *** yipdw has quit IRC (ny.us.hub west.us.hub) [01:22] *** xmc has quit IRC (ny.us.hub west.us.hub) [01:22] *** dcmorton has quit IRC (ny.us.hub west.us.hub) [01:22] *** zenguy_pc has quit IRC (ny.us.hub west.us.hub) [01:22] *** aschmitz has quit IRC (ny.us.hub west.us.hub) [01:22] *** Nertsy has quit IRC (ny.us.hub west.us.hub) [01:22] *** Lord_Nigh has quit IRC (ny.us.hub west.us.hub) [01:22] *** marnold has quit IRC (ny.us.hub west.us.hub) [01:22] *** ersi has quit IRC (ny.us.hub west.us.hub) [01:22] *** slash` has quit IRC (ny.us.hub west.us.hub) [01:22] *** Famicoman has quit IRC (ny.us.hub west.us.hub) [01:22] *** eprillios has quit IRC (ny.us.hub west.us.hub) [01:22] *** Mayonaise has quit IRC (ny.us.hub west.us.hub) [01:22] *** Cameron_D has quit IRC (ny.us.hub west.us.hub) [01:34] *** LordNigh2 has joined #archiveteam [01:36] *** SimpBrain has joined #archiveteam [01:36] *** rejon has joined #archiveteam [01:36] *** ohhdemgir has joined 
#archiveteam [01:36] *** Sue_ has joined #archiveteam [01:36] *** yipdw has joined #archiveteam [01:36] *** xmc has joined #archiveteam [01:36] *** dcmorton has joined #archiveteam [01:36] *** zenguy_pc has joined #archiveteam [01:36] *** aschmitz has joined #archiveteam [01:36] *** Nertsy has joined #archiveteam [01:36] *** marnold has joined #archiveteam [01:36] *** ersi has joined #archiveteam [01:36] *** slash` has joined #archiveteam [01:36] *** Famicoman has joined #archiveteam [01:36] *** eprillios has joined #archiveteam [01:36] *** Mayonaise has joined #archiveteam [01:36] *** Cameron_D has joined #archiveteam [01:36] *** irc.eversible.com sets mode: +oooo xmc dcmorton ersi Cameron_D [01:36] *** swebb sets mode: +o xmc [01:36] *** swebb sets mode: +o ersi [01:37] *** LordNigh2 is now known as Lord_Nigh [01:37] *** svchfoo1 sets mode: +o Lord_Nigh [01:42] https://www-tracey.archive.org/details/archiveteam_madden_20150323172542 [01:42] hahahahaha [02:01] OK, who wants a challenge? [02:01] Challenge: Take .WARC.gz file and generate webpage image grabs from them. (Clipped to a certain size, say, 400x300) [02:06] *** Ara__ has quit IRC (Read error: Operation timed out) [02:09] *** londoncal has quit IRC (Leaving...) [02:11] *** Ara__ has joined #archiveteam [02:15] *** BnAboyZ has quit IRC () [02:19] Who wants it! [02:19] *** Ara__ has quit IRC (Read error: Operation timed out) [02:24] *** Ara__ has joined #archiveteam [02:26] *** JMC_ has joined #archiveteam [02:27] *** JMC has quit IRC (Ping timeout: 265 seconds) [02:30] sure [02:30] we could use that in archivebot anyway [02:30] *** JMC_ is now known as JMC [02:31] SketchCow: any particular toolset requirements, or is this a "do it however and we'll take it from there" thing [02:35] I would probably ultimately throw it on FOS. 
[02:35] So, simpler the better, but lots of flexibility [02:35] FOS or SIS [02:35] ok [02:35] I was thinking webarchiveplayer + {chrome, firefox} [02:36] This will kill two insane birds, one stone. [02:36] screenshot first entry in WARC [02:36] 1. We will be supporting animated GIFs that are going to come online tomorrow for image previews [02:36] one small detail is that webarchiveplayer generally achieves higher fidelity than Wayback, not sure if that's a problem [02:36] 2. Brewster has wanted, for almost a year "something interesting" with otherwise boring files. [02:37] So having archivebot and other .warc items have preview images where they now have none would be very nice. [02:38] Wow, combining that link (webarchiveplayer) with my screenshot items might be sufficient. [02:38] I could PROBABLY do this myself [02:38] let's hop to #archivebot and talk about it. [03:03] *** Ara__ has quit IRC (Read error: Operation timed out) [03:04] *** berndj has quit IRC (Excess Flood) [03:04] *** berndj has joined #archiveteam [03:08] *** Ara__ has joined #archiveteam [03:10] anyone have success using WarcQTviewer on Linux? https://github.com/odie5533/WarcQtViewer [03:19] *** Ara__ has quit IRC (Read error: Operation timed out) [03:19] *** BlueMaxim has joined #archiveteam [03:25] *** Ara__ has joined #archiveteam [03:36] *** primus105 has quit IRC (Leaving.) [03:36] yipdw: Back here - I don't need it archived. [03:37] SketchCow: I don't recall archiving anything [03:37] http://teamarchive1.fnf.archive.org/screenshot_00.jpg [03:37] oh [03:37] (Some else ran the archiver) [03:37] ah ok [03:37] that was kyan [03:37] np [03:37] kyan | !ao http://teamarchive1.fnf.archive.org/screenshot_00.jpg [03:37] Yeah [03:37] another thing i don't understand [03:37] So I popped back here. [03:38] But check THAT shit out [03:38] I'm wondering how it fares on gigantic WARCs [03:38] Going to find out VERY shortly. 
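Editor's note: the "screenshot first entry in WARC" idea at 02:36 can be sketched without any replay tool. This is a minimal sketch, not what was actually run in the channel. It assumes "first entry" means the first record whose WARC-Type is `response`, and the function names are made up for illustration.

```python
import gzip


def iter_warc_headers(stream):
    """Yield one dict of lower-cased WARC headers per record, skipping bodies."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Skip exactly Content-Length bytes so body text is never parsed as headers.
        stream.read(int(headers.get("content-length", "0")))
        yield headers


def first_response_uri(stream):
    """Return the WARC-Target-URI of the first response record, or None."""
    for headers in iter_warc_headers(stream):
        if headers.get("warc-type") == "response":
            return headers.get("warc-target-uri")
    return None


def first_response_uri_in_file(path):
    """Convenience wrapper for a .warc.gz on disk (hypothetical helper)."""
    # gzip.open reads concatenated gzip members, the usual .warc.gz layout.
    with gzip.open(path, "rb") as stream:
        return first_response_uri(stream)
```

This only reads headers, so it should stay cheap on multi-gigabyte files as long as the wanted record is near the start.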
[03:38] IIRC webarchiveplayer indexes the whole thing, because it assumes you are browsing [03:39] that may or may not be an intolerable delay [03:40] SketchCow: http://www.archpwn.org/wiki/Main_Page.html [03:40] years ago i did archlinux distros [03:40] Trying a 2.8gb one [03:41] SketchCow: webarchiveplayer *may* be able to use the generated cdxs [03:41] I haven't tried it, but it does seem to be the same format [03:42] also ikreymer can probably help out more than me [03:43] http://teamarchive1.fnf.archive.org/screenshot_00.jpg [03:43] 2.8gb warc.gz [04:00] Now it's just logic, logic logic. [04:01] I'm going to just shove screenshots into the items [04:01] Let the v2 do the work [04:01] instead of a gif [04:07] *** aaaaaaaaa has quit IRC (Leaving) [04:13] Going along well. [04:15] *** xyzzy has joined #archiveteam [04:17] *** berndj has quit IRC (Read error: Operation timed out) [04:22] *** signius has quit IRC (Read error: Operation timed out) [04:35] *** signius has joined #archiveteam [04:53] Well, I've got it going, but it's got some jankiness I don't like. [04:58] *** ikreymer has joined #archiveteam [05:03] OK, letting this one run. [05:03] hi, saw the tweet re: screenshots, and thought i'd check here :) what's the plan? i don't have too many spare cycles, but maybe can help out.. [05:03] It'll work or it won't! [05:03] ha ha' [05:03] ikreymer: HUGGY HUGGY HUGGY [05:03] So, I whipped up a thing [05:03] The thing is working [05:03] But the thing does a couple really janky tricks to work. [05:04] I bet you know ways to make it work better. [05:04] HOWEVER [05:04] It is, currently, sort of working. [05:04] BEHOLD https://archive.org/details/archiveteam_archivebot_go_20140924163956 [05:04] BEHOLD https://archive.org/details/archiveteam_archivebot_go_20140924163956/v2 I mean [05:05] nice! so you're generating a screenshot for each item? [05:06] http://teamarchive1.fnf.archive.org/DELETE-SCREENBIN/ [05:06] Trying to. 
[05:06] I'm going to let this one run for tonight, then attack the inherent problem a little later [05:06] The key is it just kind of sits off and does it. [05:07] Right now, you're going to look at it and go "HEY WAIT IT" etc [05:07] But let me let it run initially [05:07] The problem is that it has to run the webplayer, and then execute screenshots [05:08] And the webplayer can take.... a while to generate the list off a warc.gz [05:09] ikreymer: can pywb use CDXes generated from wayback? if so, we do have those available with the WARCs [05:09] yeah, the indexing can take a bit for large warcs, since its generating cdx for the whole thing.. how do you pick which page if there are multiple? [05:09] Just doing the first one for now. [05:09] phun [05:09] so if that works we can skip the indexing process [05:10] yipdw: yes, it's the same format [05:10] ah ok, cool -- I guess that's a speedup option [05:10] If I put them in the same place, will it just know? [05:10] though the step that determines which ones are 'pages' is unique to webarchiveplayer currently, but that can be seperated out, if you need it [05:11] Well, right now, it just does: [05:11] wget --no-check-certificate https://archive.org/download/${ITEM}/${clammy} [05:11] Then webarchiveplayer --headless $clammy & [05:12] yeah, in fact, i just released pywb 0.9.0 that can automatically find warc and cdx files from given a directory structure [05:14] oh cool [05:14] http://teamarchive1.fnf.archive.org/DELETE-SCREENBIN/ is working except when it doesn't. [05:14] :) [05:15] how are you picking the first entry in the WARC? 
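Editor's note: the two commands quoted at 05:11 (a wget of the item file, then `webarchiveplayer --headless` in the background) can be written as a small driver. This is a sketch under assumptions: `build_commands` and `run_once` are hypothetical names, the screenshot step is left as a placeholder because the log does not say which tool took the grabs, and the only flags used are the ones that appear in the log.

```python
import subprocess
from urllib.parse import quote


def item_file_url(item, filename):
    """Download URL for one file inside an archive.org item."""
    return "https://archive.org/download/%s/%s" % (quote(item), quote(filename))


def build_commands(item, filename):
    """Return the fetch and serve command lines as argv lists."""
    fetch = ["wget", "--no-check-certificate", item_file_url(item, filename)]
    serve = ["webarchiveplayer", "--headless", filename]
    return fetch, serve


def run_once(item, filename):
    """Fetch one WARC, serve it, and stop the player afterwards."""
    fetch, serve = build_commands(item, filename)
    subprocess.run(fetch, check=True)   # download into the current directory
    player = subprocess.Popen(serve)    # equivalent of the trailing '&'
    try:
        pass  # placeholder: wait for the player, then take the screenshot
    finally:
        player.terminate()              # the original script also killed the player
        player.wait()
```

Passing argv lists instead of a shell string avoids quoting problems with odd filenames.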
[05:15] :) [05:15] Tab down, press enter [05:15] some WARCs (at least from archivebot) don't have HTML [05:15] ah [05:16] I'm going to let this one grow [05:16] in pywb 0.9.0, there's support for the following: if you create ./collections/my_coll/archive/some.warc.gz and ./collections/my_coll/indexes/mycdx.cdx , then run 'wayback' in pywb, you can then access localhost:8080/my_coll/ -- it'll automatically use all the warc and the cdx [05:16] And see what it does. [05:17] Oh, I see. [05:17] It's working, but I included the meta ones as well [05:17] So those are blowing up [05:18] ikreymer: oh cool -- does it watch those directories for changes as well? [05:18] yeah, figuring out what are pages is a bit tricky.. probably needs some more tweaking and is somewhat subjective also [05:21] yipdw: yep, if you run with `wayback -a` it should pick up any new warcs, and automatically generate cdx. but not yet cdx, if you place already generated cdx it won't yet pick it up (but something that can be added, this was not a use case i had before) [05:21] sweet [05:22] SketchCow: i'd be curious about which warcs are blowing up.. is it due to size? [05:23] OK, so [05:23] First of all, you will find I am going for bone simple, not bulletproof, at this juncture. [05:24] Second, we have metadata sitting in there, some -meta files, that it was running [05:24] Those have shit-all in terms of pages in them [05:24] So the REGULAR stuff is working GREAT, and then it was analyzing wasteful, useless files. [05:24] If you watch http://teamarchive1.fnf.archive.org/DELETE-SCREENBIN/ - it should be adding a page every 30 seconds, and it will work. [05:25] This is good enough for.. "oh, it would be nice to have this" [05:25] I don't want to suck up anyone's time. Once yipdw pointed me to your utility, you saved me between 3,000 and 5,000 years [05:25] ikreymer: some of our warcs contain records that use URNs, e.g. 
https://archive.org/download/archiveteam_archivebot_go_20140924163956/ae-mod.info-inf-20140923-195942-az4hy-meta.warc.gz [05:25] one of the records in there is the wpull log [05:26] Oh yeah, now it's rocking. [05:26] it has WARC-Target-URI urn:X-Wpull:log [05:26] I remember pywb 0.8.x(?) didn't like those, so I ended up patching around it [05:26] that might be it [05:26] Also, I'm using very old teamarchive bot stuff now [05:27] yipdw: ah ok, yes, i think webarchiveplayer attempts to parse metadata records and probably fails on these.. i should make sure to handle them gracefully [05:27] http://teamarchive1.fnf.archive.org/DELETE-SCREENBIN/boards.4chan.org-shallow-20140923-003733-3s37l-00000.warc.gz.jpg [05:27] It's going along nicely. [05:27] or basically just ignore them, unless specified otherwise [05:27] Well, in my case, I'm shoving them down the gullet. [05:27] But I'm doing a trick here. [05:29] chfoo has a nice list of warc extensions: http://wpull.readthedocs.org/en/master/warc.html -- for a future update, i should make sure pywb doesn't choke on any of them when indexing [05:29] SketchCow: great! glad it's working well! [05:30] I think it's working and producing a useless screenshot and sketchcow is just not being programmer-precise in language choice [05:30] I would call it "works enough to start getting an idea" [05:31] I was slow to learn which of the warc.gz files in an archivebot collection return actual webpages, and which are not for that. [05:33] Now, if you go to https://archive.org/details/archiveteam_archivebot_go_20140924163956/v2 [05:33] You will see it showing screenshots, and automatically going between them [05:33] For this first run, some of the screengrabs will be poop [05:33] ohh yes [05:34] But this is 100% fire and forget non-human labor [05:34] And I promise you, Brewster will be over the moon this is getting done. [05:36] I'm going to waste so much time scrolling through these [05:36] They're very informative. 
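Editor's note: two problems came up between 05:17 and 05:27. The `-meta` WARCs contain no pages, and some records carry URN targets such as `urn:X-Wpull:log`. A minimal sketch of filtering both before screenshotting follows. The function names are hypothetical, and treating only http and https targets as browsable is an assumption rather than pywb's or webarchiveplayer's actual page-detection rule.

```python
from urllib.parse import urlsplit


def is_meta_warc(filename):
    """True for the metadata-only WARCs named like ...-meta.warc.gz."""
    return filename.endswith("-meta.warc.gz")


def is_browsable_target(uri):
    """True when a WARC-Target-URI is an ordinary web URL, not a URN."""
    return urlsplit(uri).scheme.lower() in ("http", "https")


def screenshot_candidates(filenames):
    """Keep only .warc.gz files that are worth handing to the player."""
    return [
        name for name in filenames
        if name.endswith(".warc.gz") and not is_meta_warc(name)
    ]
```

Filtering by filename is cheap and needs no download. The URI check applies later, once record headers are available.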
[05:36] actually what I should do is update to pywb 0.9.0 to make it easier to throw new WARCs into a preview thing [05:36] http://teamarchive1.fnf.archive.org/DELETE-SCREENBIN/foolishreporter.wordpress.com-inf-20140923-194457-abm98-00000.warc.gz.jpg happens too [05:37] http://teamarchive1.fnf.archive.org/DELETE-SCREENBIN/forum.z8games.com-shallow-20140923-071310-6jnal-00000.warc.gz.jpg happens then. Boom! [05:37] you may need to wait for the HTTP server to start up [05:38] So, right now I have to do a coarse wait. 30 seconds. [05:38] ah [05:38] I don't have a way to signal. [05:38] interesting, that should be enough [05:38] it is cool to see these sites ended up pretty well-preserved [05:39] you could probably use netcat to check if the server is listening [05:39] yup [05:39] echo|nc exits with 0 if listening, 1 if not [05:39] After this run through, I'll try that. [05:39] And then knock it down to 10 seconds and 10 second checks [05:40] sweet [05:41] you'll probably want to install this at some point https://help.ubuntu.com/community/RestrictedFormats/Microsoft_Fonts [05:42] yipdw: cool, let me know if you have any feedback on the new 0.9.0, should be easier to use then previous versions.. [05:42] ikreymer: will do [05:44] 01:39 <@xmc> echo|nc exits with 0 if listening, 1 if not [05:44] Not sure it's doing that. [05:44] % echo | nc localhost 22 > /dev/null ; echo $? [05:44] 0 [05:44] % echo | nc localhost 4444 > /dev/null ; echo $? [05:44] localhost [127.0.0.1] 4444 (?) : Connection refused [05:44] 1 [05:45] yipdw: oh, and if you're checking one warc at a time, you can just have a single 'index.cdx' file and replace it with the next cdx each time. 
it'll pick up changes to same cdx, just not new filenames (at the moment) [05:45] *** BlueMaxim has quit IRC (Quit: Leaving) [05:53] xmc-designed anger-rage version now running [05:54] yet another shining example of hate driven development [05:54] testily-driven development [05:55] agile-stabby [05:55] You are ALL fucking committed [05:55] Everyone is pigs [05:55] all pigs [05:55] die die die [05:56] I just watched it "do the right thing" [05:56] Very inspiring. [05:56] (It threw a garbage can through Sal's pizza" [05:57] *** unityrkjs has joined #archiveteam [05:57] Also, it waited patiently for the webserver to get its shit together. [05:57] http://teamarchive1.fnf.archive.org/DELETE-SCREENBIN/ [05:57] If you do a shift-reload, you can see it [05:57] How the creation times are sometimes within the same minute. [05:59] ten seconds? gosh, how patient. i would have set it to 1 second. [06:00] http://teamarchive1.fnf.archive.org/DELETE-SCREENBIN/c99.nl-shallow-20140924-162333-c0npj-00000.warc.gz.jpg still got caught with its pants down though [06:01] No, that's what the player returned [06:01] for the first URL [06:01] Remember, it's just going after the first URL [06:01] it looks like the page didn't load though? [06:01] It could be a simple .png or an empty index.html [06:01] bottom left corner [06:02] If it didn't load it gives an elaborate error page [06:02] lies [06:02] GRABBY: line 24: 10920 Terminated webarchiveplayer --headless $clammy (wd: /0/WARCPLAY/webbergrabber/WARCBIN) [06:02] Zzzzzz... [06:02] Zzzzzz... [06:02] See? That's it dealing with a crazy page. [06:02] Zzzzzz... [06:02] Zzzzzz... [06:02] sometimes things suck i guess [06:02] Zzzzzz... [06:02] foolishreporter.wordpress.com-inf-20140923-194457-abm98-00000.warc.gz specifically [06:03] Oh, it is angry at this one. 
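Editor's note: the `echo | nc` check at 05:39 replaced the fixed 30-second wait with polling until the player's web server accepts connections. The same idea in Python might look like the sketch below. The host, port, attempt count and delay are all assumptions, since the log does not say which port the player listened on, and the function names are hypothetical.

```python
import socket
import time


def port_is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False  # refused, timed out, or unreachable


def wait_for_port(host, port, attempts=30, delay=10.0):
    """Poll until the port accepts connections; give up after `attempts` tries."""
    for attempt in range(attempts):
        if port_is_listening(host, port):
            return True
        if attempt + 1 < attempts:
            time.sleep(delay)  # coarse wait between checks, as in the log
    return False
```

A shorter delay, like the 1 second suggested at 05:59, only changes how quickly a ready server is noticed. The attempt limit is what keeps a stuck WARC from blocking the run forever.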
[06:04] http://teamarchive1.fnf.archive.org/DELETE-SCREENBIN/foolishreporter.wordpress.com-inf-20140923-194457-abm98-00000.warc.gz.jpg there it goes [06:04] See, that's your work, xmc [06:04] Otherwise it'd have skipped [06:05] Also, I am clipping those messages out going forward. [06:05] (The bottom) [06:06] Did it again, grabbed a 2.2gb one, it took a while to deal. [06:11] OK, this is great. I will go to sleep, and tomorrow it will likely finish these, then I will aim it at a pile of these. [06:17] *** Stilett0 has joined #archiveteam [06:17] *** Stiletto has quit IRC (Read error: Connection reset by peer) [06:19] *** MMovie2 has joined #archiveteam [06:20] *** MMovie has quit IRC (Ping timeout: 306 seconds) [06:21] *** unityrkjs has quit IRC (12( www.nnscript.com 12:: NoNameScript 4.22 12:: www.esnation.com 12)) [06:21] Very happy, thanks everyone [06:22] hey v2 has cool graphs -> https://archive.org/details/archivebot&tab=about [06:27] also gonna want some foreign font packages http://teamarchive1.fnf.archive.org/DELETE-SCREENBIN/forums.nesdev.com-inf-20140920-145426-4ej3s-00000.warc.gz.jpg [06:55] *** mistym has quit IRC (Remote host closed the connection) [07:40] *** primus104 has joined #archiveteam [07:40] *** hive-mind has quit IRC (Ping timeout: 260 seconds) [07:59] *** hive-mind has joined #archiveteam [08:08] *** antomati_ has joined #archiveteam [08:09] *** primus104 has quit IRC (Leaving.) [08:10] *** antomatic has quit IRC (Read error: Operation timed out) [08:39] *** robink has joined #archiveteam [08:42] *** schbirid has joined #archiveteam [08:43] *** johtso has joined #archiveteam [08:52] *** SmileyG has joined #archiveteam [09:04] I just want to say this: fuck flash player [09:04] and fuck flash player sites [09:11] I would say yes... except that without flash, we wouldn't have albino black sheep, and without albino black ship, we wouldn't have had the 8th Avocaco. 
[09:11] *Avocado [09:11] ho'kay [09:24] *** robink has quit IRC (Remote host closed the connection) [09:27] *** primus104 has joined #archiveteam [09:33] *** robink has joined #archiveteam [09:49] question [09:49] has spftp.info.apple.com been archived [09:49] (you need a cookie to get to the files, but with it dirlistings are enabled) [09:52] does IA save flash, and if not, is ABS backed up? [10:05] wtf is abs? [10:06] [17:11] I would say yes... except that without flash, we wouldn't have albino black sheep, and without albino black ship, we wouldn't have had the 8th Avocaco. [10:06] albino black sheep [10:07] 90's youtube [10:07] flash animation site [11:00] *** Ymgve has joined #archiveteam [11:13] *** Ara_ has joined #archiveteam [11:15] *** signius has quit IRC (Remote host closed the connection) [11:17] *** Ara__ has quit IRC (Ping timeout: 369 seconds) [11:17] *** signius has joined #archiveteam [11:23] *** Start has quit IRC (ircd.shaw.ca irc.shaw.ca) [11:23] *** Jonimus has quit IRC (ircd.shaw.ca irc.shaw.ca) [11:23] *** xtr-201 has quit IRC (ircd.shaw.ca irc.shaw.ca) [11:23] *** dashcloud has quit IRC (ircd.shaw.ca irc.shaw.ca) [11:23] *** garyrh has quit IRC (ircd.shaw.ca irc.shaw.ca) [11:23] *** pikhq has quit IRC (ircd.shaw.ca irc.shaw.ca) [11:23] *** gibigiana has quit IRC (ircd.shaw.ca irc.shaw.ca) [11:23] *** SadDM has quit IRC (ircd.shaw.ca irc.shaw.ca) [11:23] *** useretail has quit IRC (ircd.shaw.ca irc.shaw.ca) [11:23] *** matthusby has quit IRC (ircd.shaw.ca irc.shaw.ca) [11:23] *** will has quit IRC (ircd.shaw.ca irc.shaw.ca) [11:27] *** gibigian1 has joined #archiveteam [11:42] *** Ara_ has quit IRC (Read error: Connection reset by peer) [11:44] *** Start has joined #archiveteam [11:44] *** Jonimus has joined #archiveteam [11:44] *** dashcloud has joined #archiveteam [11:44] *** garyrh has joined #archiveteam [11:44] *** pikhq has joined #archiveteam [11:44] *** SadDM has joined #archiveteam [11:44] *** useretail has joined #archiveteam 
[11:44] *** matthusby has joined #archiveteam [11:44] *** will has joined #archiveteam [11:44] *** irc.shaw.ca sets mode: +o SadDM [11:44] *** swebb sets mode: +o SadDM [12:14] *** Ara_ has joined #archiveteam [12:49] *** Ara_ has quit IRC (ircd.choopa.net irc.mzima.net) [12:49] *** robink has quit IRC (ircd.choopa.net irc.mzima.net) [12:49] *** xyzzy has quit IRC (ircd.choopa.net irc.mzima.net) [12:49] *** Selanda has quit IRC (ircd.choopa.net irc.mzima.net) [12:49] *** cloudmons has quit IRC (ircd.choopa.net irc.mzima.net) [12:49] *** bmcginty has quit IRC (ircd.choopa.net irc.mzima.net) [12:49] *** acridAxid has quit IRC (ircd.choopa.net irc.mzima.net) [12:49] *** patricko- has quit IRC (ircd.choopa.net irc.mzima.net) [12:49] *** Peetz0r has quit IRC (ircd.choopa.net irc.mzima.net) [12:49] *** goekesmi has quit IRC (ircd.choopa.net irc.mzima.net) [12:49] *** edsu_ has quit IRC (ircd.choopa.net irc.mzima.net) [12:49] *** tephra has quit IRC (ircd.choopa.net irc.mzima.net) [12:49] *** dugo_ has quit IRC (ircd.choopa.net irc.mzima.net) [12:49] *** lbft has quit IRC (ircd.choopa.net irc.mzima.net) [12:49] *** Meeh has quit IRC (ircd.choopa.net irc.mzima.net) [12:49] *** chazchaz_ has quit IRC (ircd.choopa.net irc.mzima.net) [12:49] *** Sanqui has quit IRC (ircd.choopa.net irc.mzima.net) [12:50] *** Ara_ has joined #archiveteam [12:50] *** robink has joined #archiveteam [12:50] *** xyzzy has joined #archiveteam [12:50] *** Selanda has joined #archiveteam [12:50] *** cloudmons has joined #archiveteam [12:50] *** bmcginty has joined #archiveteam [12:50] *** acridAxid has joined #archiveteam [12:50] *** patricko- has joined #archiveteam [12:50] *** Peetz0r has joined #archiveteam [12:50] *** goekesmi has joined #archiveteam [12:50] *** edsu_ has joined #archiveteam [12:50] *** tephra has joined #archiveteam [12:50] *** dugo_ has joined #archiveteam [12:50] *** lbft has joined #archiveteam [12:50] *** Meeh has joined #archiveteam [12:50] *** chazchaz_ has joined 
#archiveteam [12:50] *** Sanqui has joined #archiveteam [12:53] *** sankin has joined #archiveteam [12:57] *** Ara_ has quit IRC (Read error: Operation timed out) [12:58] *** Ara_ has joined #archiveteam [13:32] *** sankin has quit IRC (Leaving.) [13:33] *** primus104 has quit IRC (Leaving.) [13:36] So, update on the screenshot thing for WARC items. [13:36] It works! [13:37] However, in cases where items simply have no browsable web pages, the technique I'm using does not know that, and shows a status page. [13:39] *** SmileyG has quit IRC (Quit: Lost terminal) [13:42] *** SmileyG has joined #archiveteam [13:43] *** xtr-201 has joined #archiveteam [13:43] *** Start has quit IRC (Disconnected.) [13:44] *** Ara_ has quit IRC (Read error: Operation timed out) [13:47] *** sankin has joined #archiveteam [13:53] *** Ara_ has joined #archiveteam [13:57] Did it. It will be fixed. [13:57] (It does a wget of the page, then looks for links existing. If they don't exist, it skips screenshotting. Boom.) [14:02] oh, nice! [14:03] *** Ara_ has quit IRC (Read error: Operation timed out) [14:03] https://archive.org/details/archiveteam_archivebot_go_20140924163956 has it [14:03] It shows the broke pages too [14:03] And I just switched to .png because duh [14:03] So this one will need some cleanup [14:04] *** Ara_ has joined #archiveteam [14:04] the header is a bit too big [14:04] it takes up my whole window, it's a bit confusing [14:04] Get... a bigger..... 
[14:04] .....window [14:05] that is awesome though [14:05] love the thumbs [14:05] :( [14:07] you can see how this can be confusing: http://i.imgur.com/FNtYSpS.png [14:07] I don't know how it could be solved though [14:12] *** Ara_ has quit IRC (Read error: Operation timed out) [14:13] *** Ara_ has joined #archiveteam [14:26] *** Ara_ has quit IRC (Read error: Operation timed out) [14:28] *** Ara_ has joined #archiveteam [14:37] *** ikreymer has quit IRC (Remote host closed the connection) [14:38] *** Start has joined #archiveteam [14:45] *** Ara_ has quit IRC (Read error: Operation timed out) [14:47] *** Ara_ has joined #archiveteam [14:48] *** thechip has joined #archiveteam [14:49] *** ikreymer has joined #archiveteam [14:57] *** Start has quit IRC (Disconnected.) [14:59] OK, so there is an edge case where it blows up [14:59] But it's otherwise doing things 99% right [14:59] Good enough [14:59] *** balrog has quit IRC (Ping timeout: 260 seconds) [14:59] I'm going to run it against all the archivebotties while I'm in Sweden [15:00] *** Ara_ has quit IRC (Read error: Operation timed out) [15:01] Figure in 2-3 days it'll get through a lot. [15:01] *** Start has joined #archiveteam [15:01] Is there a way to get archive.org to archive the files on a particular site? [15:02] Go to http://archive.org/web/ [15:02] Bottom right, "Save Page Now" [15:07] how is the archive holding up with archivebot these days? [15:07] *** Ara_ has joined #archiveteam [15:07] hmmm maybe i should of asked that in # -bs [15:08] SketchCow: doesn't it ignore files? [15:09] Well, you give it the site, it gets the files on the site. [15:09] I meant keeping up etc ;) [15:10] Oh, the archivebot spice flows nicely. [15:10] SketchCow: But doesn't it ignore larger assets? [15:10] I have a thing constantly checking and uploading, though. 
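Editor's note: the fix described at 13:57 (fetch the player's listing page, look for links, skip the screenshot if there are none) can be sketched as follows. This is not the script that was actually run. It assumes the listing page marks browsable pages with ordinary `<a href>` links, and `LinkCounter` and `has_links` are hypothetical names.

```python
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Count anchor tags that carry a non-empty href attribute."""

    def __init__(self):
        super().__init__()
        self.links = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" and value for name, value in attrs):
            self.links += 1


def has_links(html_text):
    """True if the page contains at least one real link."""
    parser = LinkCounter()
    parser.feed(html_text)
    return parser.links > 0
```

A caller would fetch the listing page first and take the screenshot only when `has_links` returns True.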
[15:10] for example https://web.archive.org/web/20150315040459/http://www.awesometapes.com/elias-tebabel-1995/ [15:10] *** balrog has joined #archiveteam [15:10] *** swebb sets mode: +o balrog [15:11] johtso: we have a channel #archivebot where you can get whole sites crawled [15:12] save page now only gets one page at a time [15:12] the wayback machine does retrieve files in its normal crawls though [15:12] up to a maximum of 200mb each [15:13] oh wow [15:13] * johtso has a read through the wiki page [15:14] *** mistym has joined #archiveteam [15:14] *** mistym has quit IRC (Remote host closed the connection) [15:14] DFJustin: ah, this would be why the files are missing then.. http://traffic.libsyn.com/robots.txt [15:15] boo hiss [15:16] what's the archivebot policy on robots.txt? [15:16] complete disregard [15:16] *** Ara_ has quit IRC (Read error: Operation timed out) [15:16] lovely [15:17] the results will only be downloadable in big chunks in the archivebot collection and not browsable via wayback though [15:17] (normally archivebot feeds into the wayback machine) [15:18] definitely better than nothing! [15:19] DFJustin: they're always one-off scrapes too right? [15:20] yes [15:22] DFJustin: are subsequent scrapes in any way differential? Or does it just do the whole lot again? [15:23] whole lot [15:23] ouch [15:25] *** balrog has quit IRC (Read error: Operation timed out) [15:32] *** balrog has joined #archiveteam [15:32] *** swebb sets mode: +o balrog [15:44] *** mistym has joined #archiveteam [15:50] *** Start has quit IRC (Disconnected.) [15:56] *** Start has joined #archiveteam [16:07] *** mistym has quit IRC (Remote host closed the connection) [16:26] *** mistym has joined #archiveteam [16:31] *** Start has quit IRC (Disconnected.) [16:31] *** Start has joined #archiveteam [16:35] *** SmileyG has quit IRC (Quit: pi time) [16:37] *** Smiley has quit IRC (http://www.milkme.co.uk - You'll never understand.) 
[16:39] *** patricko- is now known as patrickod [16:43] *** patrickod is now known as patricko- [16:45] *** Start has quit IRC (Disconnected.) [16:49] *** primus104 has joined #archiveteam [16:57] *** Smiley has joined #archiveteam [16:59] *** khaoohs_ has joined #archiveteam [16:59] *** khaoohs has quit IRC (Read error: Connection reset by peer) [17:03] *** aaaaaaaaa has joined #archiveteam [17:10] *** patricko- is now known as patrickod [17:12] *** sankin has quit IRC (Leaving.) [17:18] *** primus104 has quit IRC (Leaving.) [17:21] *** sankin has joined #archiveteam [17:22] *** patrickod is now known as patricko- [17:27] *** Ara_ has joined #archiveteam [17:44] *** dserodio has quit IRC (Read error: Operation timed out) [17:52] *** antomati_ is now known as antomatic [17:52] *** svchfoo2 sets mode: +o antomatic [17:53] *** Emcy has quit IRC (Read error: Connection reset by peer) [17:55] *** Emcy has joined #archiveteam [17:55] *** dserodio has joined #archiveteam [18:09] *** scyther has joined #archiveteam [18:14] https://chrome.google.com/webstore/detail/screencastify-screen-vide/mmeijimgabbpbgpdklnllpncmdofkcpn?hl=en [18:14] Screencastify is a simple video screen capture software (aka. screencast recorder) for Chrome. It is able to record all screen activity inside a tab, including audio. Just press record and the content of your tab is recorded. [18:21] *** londoncal has joined #archiveteam [18:32] *** caber has quit IRC (Quit: Doei Doei!!!) [18:36] *** caber has joined #archiveteam [18:53] *** Start has joined #archiveteam [18:54] *** Start_ has joined #archiveteam [18:54] *** Start has quit IRC (Read error: Connection reset by peer) [18:59] *** BlueMaxim has joined #archiveteam [19:02] fuck, quakedev.com added a robots.txt. anyone have experience if IA would send me the existing old crawls? 
it was not my site [19:03] *** patricko- is now known as patrickod [19:17] *** Start has joined #archiveteam [19:17] *** Start_ has quit IRC (Read error: Connection reset by peer) [19:25] *** Start has quit IRC (Disconnected.) [19:29] *** Start has joined #archiveteam [19:38] *** patrickod is now known as patricko- [19:42] Does anyone know if there any handy command line tools for batch extracting all files in a directory, rar/zip/7zip, skipping archives that have errors, and possibly even detecting archive format for files with no extension? [19:44] *** db48x has quit IRC (Ping timeout: 258 seconds) [19:47] *** SN4T14_ has joined #archiveteam [19:53] *** SN4T14__ has quit IRC (Ping timeout: 369 seconds) [19:55] i've seen 'uncompress' scripts floating around the net [19:56] you'd want... for x in ./; do uncompress --optionToIgnoreErrors $x; done [19:59] *** SimpBrain has quit IRC (Quit: Leaving) [20:02] *** primus104 has joined #archiveteam [20:03] *** SimpBrain has joined #archiveteam [20:07] unar would work in that type of arrangement https://code.google.com/p/theunarchiver/ [20:18] hm [20:18] johtso: catarc is something that /kind of/ does something like that [20:18] but it;s not really for extracting [20:18] so much as it is for writing to stdout :P [20:19] johtso: https://pypi.python.org/pypi/catarc/1.1 [20:19] (yes, I wrote that) [20:19] ah interesting [20:19] err, https://github.com/joepie91/catarc [20:19] there [20:19] it's specifically to sidestep 7z's issues with errors [20:20] *** schbirid has quit IRC (Read error: Operation timed out) [20:20] joepie91_: could imagine it being a bit tricky to get that tool to produce directories with files for each archive [20:21] johtso: probably not too awful, can probably just replace the "print to stdout" flag with a target dir [20:21] but it'd def require some code modifications [20:21] I'm surprised there aren't more tools out there for this kind of thing.. [20:22] *** Start has quit IRC (Disconnected.) 
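Editor's note: the question at 19:42 asked for batch extraction that skips broken archives and detects the format of files with no extension. The sketch below shows the detection and skip-on-error parts using only the standard library. It identifies zip, rar and 7z files by their leading magic bytes but extracts only zip, because Python has no built-in rar or 7z support. A tool such as unar, mentioned at 20:07, would have to handle those. The function names and the `.extracted` suffix are made up for illustration.

```python
import zipfile
from pathlib import Path

# Leading bytes that identify each container format.
MAGIC = (
    (b"PK\x03\x04", "zip"),
    (b"Rar!\x1a\x07", "rar"),
    (b"7z\xbc\xaf\x27\x1c", "7z"),
)


def sniff_archive_type(path):
    """Guess the archive type from file contents, ignoring the extension."""
    with open(path, "rb") as handle:
        head = handle.read(8)
    for magic, kind in MAGIC:
        if head.startswith(magic):
            return kind
    return None


def extract_all(source_dir, dest_dir):
    """Extract every zip in source_dir; record what happened to each file."""
    results = {}
    for path in sorted(Path(source_dir).iterdir()):
        if not path.is_file():
            continue
        kind = sniff_archive_type(path)
        if kind != "zip":
            results[path.name] = "skipped (%s)" % (kind or "not an archive")
            continue
        try:
            with zipfile.ZipFile(path) as archive:
                archive.extractall(Path(dest_dir) / (path.name + ".extracted"))
            results[path.name] = "ok"
        except Exception as error:  # deliberately broad: one bad file must not stop the batch
            results[path.name] = "error: %s" % error
    return results
```

Each archive gets its own output directory, so files with the same name in different archives do not overwrite each other.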
[20:26] *** Jonimus has quit IRC (Ping timeout: 370 seconds) [20:29] *** schbirid has joined #archiveteam [20:32] johtso: I suspect most people just use bash loops [20:32] having to work with a wide collection of different archive types is an edge casews [20:32] case * [20:32] it's usually a lot of the same type, and then a bash loop would suffice [20:32] :p [20:39] *** Jonimus has joined #archiveteam [20:53] *** signius has quit IRC (Read error: Operation timed out) [20:57] *** patricko- is now known as patrickod [20:58] *** WubTheCap has quit IRC (Quit: Leaving) [20:58] *** sankin has quit IRC (Leaving.) [21:04] *** patrickod is now known as patricko- [21:05] *** signius has joined #archiveteam [21:07] *** Emcy has quit IRC (Ping timeout: 512 seconds) [21:19] *** db48x has joined #archiveteam [21:27] *** okeuday has quit IRC (Ping timeout: 265 seconds) [21:30] *** schbirid has quit IRC (Leaving) [21:31] *** habi has joined #archiveteam [21:37] *** ikreymer has quit IRC (Remote host closed the connection) [21:41] *** Emcy has joined #archiveteam [21:41] *** okeuday has joined #archiveteam [21:43] *** Ara_ has quit IRC (Ping timeout: 492 seconds) [21:52] *** Waiii has joined #archiveteam [21:53] is it just me or is imcute.yt down? [21:55] so it's not just me [21:56] doesn't look up to me [21:57] but archiveteam still says it's up :p [21:57] oh, the wiki page for urlteam? [21:58] that's always out of date because all these piddly little shorteners keep dying [21:58] it's v sad [21:58] but where will i find all the glorious webms :c [22:00] um, i think it's a 4chan archive, not a url shortener [22:00] I love these archivebot screenshots [22:00] https://ia801001.us.archive.org/23/items/archiveteam_archivebot_go_010/twitter.com-inf-20131205-135447.warc.gz.png [22:00] remember when Twitter's web UI wasn't a huge sprawling mass [22:01] that's like saying remember geocities. 
:/ [22:02] in the sense that both are preserved yes [22:07] *** ikreymer has joined #archiveteam [22:09] SketchCow: fixed https://archive.org/details/msdos_Snack_Attack_II_1982&external_js=1 [22:09] now I've just got to test games without a dosbox.conf file and make sure they're not broken :P [22:12] DOOM2 works [22:15] *** habi has quit IRC (Quit: Leaving.) [22:16] *** Waiii has quit IRC (Quit: Page closed) [22:31] *** patricko- is now known as patrickod [22:31] *** BnAboyZ has joined #archiveteam [22:32] *** patrickod is now known as patricko- [22:39] *** SimpBrain has quit IRC (Quit: Leaving) [22:45] *** Start has joined #archiveteam [22:46] db48x: can you check Fatal Distractions? https://archive.org/details/Fataldistract All the shareware on the disc supposedly can run from the CD, so that would fascinating if you could run the shareware CD online and get access to all of those games [23:09] *** scyther has quit IRC (Read error: Operation timed out) [23:10] *** scyther has joined #archiveteam [23:15] *** patricko- is now known as patrickod [23:22] *** patrickod is now known as patricko- [23:31] *** londoncal has quit IRC (Leaving...) [23:33] *** Ymgve has quit IRC () [23:40] *** JonimusP has joined #archiveteam [23:44] *** Jonimus has quit IRC (Ping timeout: 370 seconds) [23:44] *** JonimusP is now known as Jonimus [23:49] Urgh, so frustrating when a page is archived, but the interesting bits managed to slip from the crawlers grasp :( http://web.archive.org/web/20090121052458/http://www.voanews.com/english/Africa/blog/ [23:49] **** javascript [23:58] *** JMC has quit IRC (Ping timeout: 258 seconds)