#archiveteam 2015-03-24,Tue

Time Nickname Message
00:10 🔗 Smiley http://blog.twitch.tv/2015/03/important-notice-about-your-twitch-account/
00:10 🔗 Smiley yey
00:16 🔗 garyrh heh, "Important notice about your x account" -> h4ckz0r3d or closing
00:20 🔗 c_b has quit IRC (Ping timeout: 370 seconds)
00:21 🔗 Smiley yup
00:21 🔗 Smiley first one
00:21 🔗 Smiley but can someone archivebot it
00:21 🔗 nertzy has joined #archiveteam
00:21 🔗 Smiley for some reason i can't find the channel
00:24 🔗 JMC_ has joined #archiveteam
00:26 🔗 JMC has quit IRC (Ping timeout: 370 seconds)
00:33 🔗 Emcy has joined #archiveteam
00:58 🔗 mistym has quit IRC (Remote host closed the connection)
01:01 🔗 JMC_ is now known as JMC
01:02 🔗 chfoo has quit IRC (Remote host closed the connection)
01:04 🔗 Selanda has quit IRC (west.us.hub irc.mzima.net)
01:04 🔗 BnAboyZ has quit IRC (west.us.hub irc.mzima.net)
01:04 🔗 cloudmons has quit IRC (west.us.hub irc.mzima.net)
01:04 🔗 bmcginty has quit IRC (west.us.hub irc.mzima.net)
01:04 🔗 acridAxid has quit IRC (west.us.hub irc.mzima.net)
01:04 🔗 patricko- has quit IRC (west.us.hub irc.mzima.net)
01:04 🔗 Peetz0r has quit IRC (west.us.hub irc.mzima.net)
01:04 🔗 goekesmi has quit IRC (west.us.hub irc.mzima.net)
01:04 🔗 edsu_ has quit IRC (west.us.hub irc.mzima.net)
01:04 🔗 tephra has quit IRC (west.us.hub irc.mzima.net)
01:04 🔗 dugo_ has quit IRC (west.us.hub irc.mzima.net)
01:04 🔗 lbft has quit IRC (west.us.hub irc.mzima.net)
01:04 🔗 Meeh has quit IRC (west.us.hub irc.mzima.net)
01:04 🔗 chazchaz_ has quit IRC (west.us.hub irc.mzima.net)
01:04 🔗 Sanqui has quit IRC (west.us.hub irc.mzima.net)
01:04 🔗 DFJustin ran it through liveweb https://web.archive.org/web/20150324010409/http://blog.twitch.tv/2015/03/important-notice-about-your-twitch-account/
01:07 🔗 chfoo has joined #archiveteam
01:08 🔗 svchfoo2 sets mode: +o chfoo
01:09 🔗 SketchCow https://archive.org/details/archiveteam_madden&tab=collection
01:12 🔗 mistym has joined #archiveteam
01:12 🔗 Selanda has joined #archiveteam
01:12 🔗 BnAboyZ has joined #archiveteam
01:12 🔗 cloudmons has joined #archiveteam
01:12 🔗 bmcginty has joined #archiveteam
01:12 🔗 acridAxid has joined #archiveteam
01:12 🔗 patricko- has joined #archiveteam
01:12 🔗 Peetz0r has joined #archiveteam
01:12 🔗 goekesmi has joined #archiveteam
01:12 🔗 edsu_ has joined #archiveteam
01:12 🔗 tephra has joined #archiveteam
01:12 🔗 dugo_ has joined #archiveteam
01:12 🔗 lbft has joined #archiveteam
01:12 🔗 Meeh has joined #archiveteam
01:12 🔗 chazchaz_ has joined #archiveteam
01:12 🔗 Sanqui has joined #archiveteam
01:13 🔗 SketchCow I threw an animated gif in to see if it would do something.
01:15 🔗 SketchCow https://ia601509.us.archive.org/1/items/archiveteam_madden_20150323172542/c7e30ceecfe6da8bcd06121e5ec9a2fc.gif
01:22 🔗 SimpBrain has quit IRC (ny.us.hub west.us.hub)
01:22 🔗 rejon has quit IRC (ny.us.hub west.us.hub)
01:22 🔗 ohhdemgir has quit IRC (ny.us.hub west.us.hub)
01:22 🔗 Sue_ has quit IRC (ny.us.hub west.us.hub)
01:22 🔗 yipdw has quit IRC (ny.us.hub west.us.hub)
01:22 🔗 xmc has quit IRC (ny.us.hub west.us.hub)
01:22 🔗 dcmorton has quit IRC (ny.us.hub west.us.hub)
01:22 🔗 zenguy_pc has quit IRC (ny.us.hub west.us.hub)
01:22 🔗 aschmitz has quit IRC (ny.us.hub west.us.hub)
01:22 🔗 Nertsy has quit IRC (ny.us.hub west.us.hub)
01:22 🔗 Lord_Nigh has quit IRC (ny.us.hub west.us.hub)
01:22 🔗 marnold has quit IRC (ny.us.hub west.us.hub)
01:22 🔗 ersi has quit IRC (ny.us.hub west.us.hub)
01:22 🔗 slash` has quit IRC (ny.us.hub west.us.hub)
01:22 🔗 Famicoman has quit IRC (ny.us.hub west.us.hub)
01:22 🔗 eprillios has quit IRC (ny.us.hub west.us.hub)
01:22 🔗 Mayonaise has quit IRC (ny.us.hub west.us.hub)
01:22 🔗 Cameron_D has quit IRC (ny.us.hub west.us.hub)
01:34 🔗 LordNigh2 has joined #archiveteam
01:36 🔗 SimpBrain has joined #archiveteam
01:36 🔗 rejon has joined #archiveteam
01:36 🔗 ohhdemgir has joined #archiveteam
01:36 🔗 Sue_ has joined #archiveteam
01:36 🔗 yipdw has joined #archiveteam
01:36 🔗 xmc has joined #archiveteam
01:36 🔗 dcmorton has joined #archiveteam
01:36 🔗 zenguy_pc has joined #archiveteam
01:36 🔗 aschmitz has joined #archiveteam
01:36 🔗 Nertsy has joined #archiveteam
01:36 🔗 marnold has joined #archiveteam
01:36 🔗 ersi has joined #archiveteam
01:36 🔗 slash` has joined #archiveteam
01:36 🔗 Famicoman has joined #archiveteam
01:36 🔗 eprillios has joined #archiveteam
01:36 🔗 Mayonaise has joined #archiveteam
01:36 🔗 Cameron_D has joined #archiveteam
01:36 🔗 irc.eversible.com sets mode: +oooo xmc dcmorton ersi Cameron_D
01:36 🔗 swebb sets mode: +o xmc
01:36 🔗 swebb sets mode: +o ersi
01:37 🔗 LordNigh2 is now known as Lord_Nigh
01:37 🔗 svchfoo1 sets mode: +o Lord_Nigh
01:42 🔗 SketchCow https://www-tracey.archive.org/details/archiveteam_madden_20150323172542
01:42 🔗 SketchCow hahahahaha
02:01 🔗 SketchCow OK, who wants a challenge?
02:01 🔗 SketchCow Challenge: Take .WARC.gz file and generate webpage image grabs from them. (Clipped to a certain size, say, 400x300)
02:06 🔗 Ara__ has quit IRC (Read error: Operation timed out)
02:09 🔗 londoncal has quit IRC (Leaving...)
02:11 🔗 Ara__ has joined #archiveteam
02:15 🔗 BnAboyZ has quit IRC ()
02:19 🔗 SketchCow Who wants it!
02:19 🔗 Ara__ has quit IRC (Read error: Operation timed out)
02:24 🔗 Ara__ has joined #archiveteam
02:26 🔗 JMC_ has joined #archiveteam
02:27 🔗 JMC has quit IRC (Ping timeout: 265 seconds)
02:30 🔗 yipdw sure
02:30 🔗 yipdw we could use that in archivebot anyway
02:30 🔗 JMC_ is now known as JMC
02:31 🔗 yipdw SketchCow: any particular toolset requirements, or is this a "do it however and we'll take it from there" thing
02:35 🔗 SketchCow I would probably ultimately throw it on FOS.
02:35 🔗 SketchCow So, simpler the better, but lots of flexibility
02:35 🔗 SketchCow FOS or SIS
02:35 🔗 yipdw ok
02:35 🔗 yipdw I was thinking webarchiveplayer + {chrome, firefox}
02:36 🔗 SketchCow This will kill two insane birds, one stone.
02:36 🔗 yipdw screenshot first entry in WARC
02:36 🔗 SketchCow 1. We will be supporting animated GIFs that are going to come online tomorrow for image previews
02:36 🔗 yipdw one small detail is that webarchiveplayer generally achieves higher fidelity than Wayback, not sure if that's a problem
02:36 🔗 SketchCow 2. Brewster has wanted, for almost a year "something interesting" with otherwise boring files.
02:37 🔗 SketchCow So having archivebot and other .warc items have preview images where they now have none would be very nice.
02:38 🔗 SketchCow Wow, combining that link (webarchiveplayer) with my screenshot items might be sufficient.
02:38 🔗 SketchCow I could PROBABLY do this myself
02:38 🔗 SketchCow let's hop to #archivebot and talk about it.
03:03 🔗 Ara__ has quit IRC (Read error: Operation timed out)
03:04 🔗 berndj has quit IRC (Excess Flood)
03:04 🔗 berndj has joined #archiveteam
03:08 🔗 Ara__ has joined #archiveteam
03:10 🔗 dashcloud anyone have success using WarcQTviewer on Linux? https://github.com/odie5533/WarcQtViewer
03:19 🔗 Ara__ has quit IRC (Read error: Operation timed out)
03:19 🔗 BlueMaxim has joined #archiveteam
03:25 🔗 Ara__ has joined #archiveteam
03:36 🔗 primus105 has quit IRC (Leaving.)
03:36 🔗 SketchCow yipdw: Back here - I don't need it archived.
03:37 🔗 yipdw SketchCow: I don't recall archiving anything
03:37 🔗 SketchCow http://teamarchive1.fnf.archive.org/screenshot_00.jpg
03:37 🔗 yipdw oh
03:37 🔗 SketchCow (Someone else ran the archiver)
03:37 🔗 yipdw ah ok
03:37 🔗 xmc that was kyan
03:37 🔗 yipdw np
03:37 🔗 xmc kyan | !ao http://teamarchive1.fnf.archive.org/screenshot_00.jpg
03:37 🔗 SketchCow Yeah
03:37 🔗 xmc another thing i don't understand
03:37 🔗 SketchCow So I popped back here.
03:38 🔗 SketchCow But check THAT shit out
03:38 🔗 yipdw I'm wondering how it fares on gigantic WARCs
03:38 🔗 SketchCow Going to find out VERY shortly.
03:38 🔗 yipdw IIRC webarchiveplayer indexes the whole thing, because it assumes you are browsing
03:39 🔗 yipdw that may or may not be an intolerable delay
03:40 🔗 godane SketchCow: http://www.archpwn.org/wiki/Main_Page.html
03:40 🔗 godane years ago i did archlinux distros
03:40 🔗 SketchCow Trying a 2.8gb one
03:41 🔗 yipdw SketchCow: webarchiveplayer *may* be able to use the generated cdxs
03:41 🔗 yipdw I haven't tried it, but it does seem to be the same format
03:42 🔗 yipdw also ikreymer can probably help out more than me
03:43 🔗 SketchCow http://teamarchive1.fnf.archive.org/screenshot_00.jpg
03:43 🔗 SketchCow 2.8gb warc.gz
04:00 🔗 SketchCow Now it's just logic, logic logic.
04:01 🔗 SketchCow I'm going to just shove screenshots into the items
04:01 🔗 SketchCow Let the v2 do the work
04:01 🔗 SketchCow instead of a gif
04:07 🔗 aaaaaaaaa has quit IRC (Leaving)
04:13 🔗 SketchCow Going along well.
04:15 🔗 xyzzy has joined #archiveteam
04:17 🔗 berndj has quit IRC (Read error: Operation timed out)
04:22 🔗 signius has quit IRC (Read error: Operation timed out)
04:35 🔗 signius has joined #archiveteam
04:53 🔗 SketchCow Well, I've got it going, but it's got some jankiness I don't like.
04:58 🔗 ikreymer has joined #archiveteam
05:03 🔗 SketchCow OK, letting this one run.
05:03 🔗 ikreymer hi, saw the tweet re: screenshots, and thought i'd check here :) what's the plan? i don't have too many spare cycles, but maybe can help out..
05:03 🔗 SketchCow It'll work or it won't!
05:03 🔗 SketchCow ha ha
05:03 🔗 SketchCow ikreymer: HUGGY HUGGY HUGGY
05:03 🔗 SketchCow So, I whipped up a thing
05:03 🔗 SketchCow The thing is working
05:03 🔗 SketchCow But the thing does a couple really janky tricks to work.
05:04 🔗 SketchCow I bet you know ways to make it work better.
05:04 🔗 SketchCow HOWEVER
05:04 🔗 SketchCow It is, currently, sort of working.
05:04 🔗 SketchCow BEHOLD https://archive.org/details/archiveteam_archivebot_go_20140924163956
05:04 🔗 SketchCow BEHOLD https://archive.org/details/archiveteam_archivebot_go_20140924163956/v2 I mean
05:05 🔗 ikreymer nice! so you're generating a screenshot for each item?
05:06 🔗 SketchCow http://teamarchive1.fnf.archive.org/DELETE-SCREENBIN/
05:06 🔗 SketchCow Trying to.
05:06 🔗 SketchCow I'm going to let this one run for tonight, then attack the inherent problem a little later
05:06 🔗 SketchCow The key is it just kind of sits off and does it.
05:07 🔗 SketchCow Right now, you're going to look at it and go "HEY WAIT IT" etc
05:07 🔗 SketchCow But let me let it run initially
05:07 🔗 SketchCow The problem is that it has to run the webplayer, and then execute screenshots
05:08 🔗 SketchCow And the webplayer can take.... a while to generate the list off a warc.gz
05:09 🔗 yipdw ikreymer: can pywb use CDXes generated from wayback? if so, we do have those available with the WARCs
05:09 🔗 ikreymer yeah, the indexing can take a bit for large warcs, since it's generating cdx for the whole thing.. how do you pick which page if there are multiple?
05:09 🔗 SketchCow Just doing the first one for now.
05:09 🔗 xmc phun
05:09 🔗 yipdw so if that works we can skip the indexing process
05:10 🔗 ikreymer yipdw: yes, it's the same format
05:10 🔗 yipdw ah ok, cool -- I guess that's a speedup option
05:10 🔗 SketchCow If I put them in the same place, will it just know?
05:10 🔗 ikreymer though the step that determines which ones are 'pages' is unique to webarchiveplayer currently, but that can be separated out, if you need it
05:11 🔗 SketchCow Well, right now, it just does:
05:11 🔗 SketchCow wget --no-check-certificate https://archive.org/download/${ITEM}/${clammy}
05:11 🔗 SketchCow Then webarchiveplayer --headless $clammy &
05:12 🔗 ikreymer yeah, in fact, i just released pywb 0.9.0 that can automatically find warc and cdx files given a directory structure
05:14 🔗 yipdw oh cool
05:14 🔗 SketchCow http://teamarchive1.fnf.archive.org/DELETE-SCREENBIN/ is working except when it doesn't.
05:14 🔗 SketchCow :)
05:15 🔗 yipdw how are you picking the first entry in the WARC?
05:15 🔗 SketchCow :)
05:15 🔗 SketchCow Tab down, press enter
05:15 🔗 yipdw some WARCs (at least from archivebot) don't have HTML
05:15 🔗 yipdw ah
05:16 🔗 SketchCow I'm going to let this one grow
05:16 🔗 ikreymer in pywb 0.9.0, there's support for the following: if you create ./collections/my_coll/archive/some.warc.gz and ./collections/my_coll/indexes/mycdx.cdx , then run 'wayback' in pywb, you can then access localhost:8080/my_coll/<url> -- it'll automatically use all the warc and the cdx
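
The directory layout ikreymer describes could be staged with something like the sketch below. This is only an illustration: `stage_collection` and `replay_url` are hypothetical helper names, not part of pywb, and the layout and port 8080 are taken from the message above rather than checked against pywb itself.

```python
import shutil
from pathlib import Path


def stage_collection(root, coll, warc, cdx=None):
    """Copy a WARC (and optionally a CDX) into the layout from the message above.

    Resulting tree (hypothetical helper, layout as quoted in the log):
        <root>/collections/<coll>/archive/<warc name>
        <root>/collections/<coll>/indexes/<cdx name>
    """
    base = Path(root) / "collections" / coll
    archive_dir = base / "archive"
    index_dir = base / "indexes"
    archive_dir.mkdir(parents=True, exist_ok=True)
    index_dir.mkdir(parents=True, exist_ok=True)

    staged_warc = archive_dir / Path(warc).name
    shutil.copy2(warc, staged_warc)

    staged_cdx = None
    if cdx is not None:
        staged_cdx = index_dir / Path(cdx).name
        shutil.copy2(cdx, staged_cdx)
    return staged_warc, staged_cdx


def replay_url(coll, url, host="localhost", port=8080):
    """Build the replay address in the localhost:8080/<coll>/<url> form."""
    return f"http://{host}:{port}/{coll}/{url}"
```

After staging, the idea would be to run `wayback` from `<root>` and request the address returned by `replay_url`.
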
05:16 🔗 SketchCow And see what it does.
05:17 🔗 SketchCow Oh, I see.
05:17 🔗 SketchCow It's working, but I included the meta ones as well
05:17 🔗 SketchCow So those are blowing up
05:18 🔗 yipdw ikreymer: oh cool -- does it watch those directories for changes as well?
05:18 🔗 ikreymer yeah, figuring out what are pages is a bit tricky.. probably needs some more tweaking and is somewhat subjective also
05:21 🔗 ikreymer yipdw: yep, if you run with `wayback -a` it should pick up any new warcs and automatically generate cdx for them. it won't yet pick up an already generated cdx that you place there (but that's something that can be added, this was not a use case i had before)
05:21 🔗 yipdw sweet
05:22 🔗 ikreymer SketchCow: i'd be curious about which warcs are blowing up.. is it due to size?
05:23 🔗 SketchCow OK, so
05:23 🔗 SketchCow First of all, you will find I am going for bone simple, not bulletproof, at this juncture.
05:24 🔗 SketchCow Second, we have metadata sitting in there, some -meta files, that it was running
05:24 🔗 SketchCow Those have shit-all in terms of pages in them
05:24 🔗 SketchCow So the REGULAR stuff is working GREAT, and then it was analyzing wasteful, useless files.
05:24 🔗 SketchCow If you watch http://teamarchive1.fnf.archive.org/DELETE-SCREENBIN/ - it should be adding a page every 30 seconds, and it will work.
05:25 🔗 SketchCow This is good enough for.. "oh, it would be nice to have this"
05:25 🔗 SketchCow I don't want to suck up anyone's time. Once yipdw pointed me to your utility, you saved me between 3,000 and 5,000 years
05:25 🔗 yipdw ikreymer: some of our warcs contain records that use URNs, e.g. https://archive.org/download/archiveteam_archivebot_go_20140924163956/ae-mod.info-inf-20140923-195942-az4hy-meta.warc.gz
05:25 🔗 yipdw one of the records in there is the wpull log
05:26 🔗 SketchCow Oh yeah, now it's rocking.
05:26 🔗 yipdw it has WARC-Target-URI urn:X-Wpull:log
05:26 🔗 yipdw I remember pywb 0.8.x(?) didn't like those, so I ended up patching around it
05:26 🔗 yipdw that might be it
05:26 🔗 SketchCow Also, I'm using very old teamarchive bot stuff now
05:27 🔗 ikreymer yipdw: ah ok, yes, i think webarchiveplayer attempts to parse metadata records and probably fails on these.. i should make sure to handle them gracefully
05:27 🔗 SketchCow http://teamarchive1.fnf.archive.org/DELETE-SCREENBIN/boards.4chan.org-shallow-20140923-003733-3s37l-00000.warc.gz.jpg
05:27 🔗 SketchCow It's going along nicely.
05:27 🔗 ikreymer or basically just ignore them, unless specified otherwise
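
Ignoring records like the `urn:X-Wpull:log` one mentioned above could be sketched as a scheme check on the `WARC-Target-URI`. This is a guess at one way to do it, not how pywb or webarchiveplayer actually handle such records; `is_web_target` and `first_page_uri` are hypothetical names.

```python
from urllib.parse import urlsplit


def is_web_target(uri):
    """True only for http(s) target URIs; urn: records and the like are skipped."""
    return urlsplit(uri).scheme.lower() in ("http", "https")


def first_page_uri(target_uris):
    """Return the first http(s) URI from an iterable of target URIs, or None."""
    for uri in target_uris:
        if is_web_target(uri):
            return uri
    return None
```
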
05:27 🔗 SketchCow Well, in my case, I'm shoving them down the gullet.
05:27 🔗 SketchCow But I'm doing a trick here.
05:29 🔗 ikreymer chfoo has a nice list of warc extensions: http://wpull.readthedocs.org/en/master/warc.html -- for a future update, i should make sure pywb doesn't choke on any of them when indexing
05:29 🔗 ikreymer SketchCow: great! glad it's working well!
05:30 🔗 DFJustin I think it's working and producing a useless screenshot and sketchcow is just not being programmer-precise in language choice
05:30 🔗 SketchCow I would call it "works enough to start getting an idea"
05:31 🔗 SketchCow I was slow to learn which of the warc.gz files in an archivebot collection return actual webpages, and which are not for that.
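
Skipping the metadata files before they reach the player might look like the filename filter below. It assumes the `-meta.warc.gz` suffix seen in the filenames quoted in this log is a reliable marker, which is an assumption; `is_page_warc` is a hypothetical name.

```python
def is_page_warc(filename):
    """Guess from the name alone whether a file is worth screenshotting.

    Assumption: metadata-only files end in "-meta.warc.gz", as in the
    archivebot filenames quoted in this log.
    """
    name = filename.rsplit("/", 1)[-1]
    if not name.endswith(".warc.gz"):
        return False
    return not name.endswith("-meta.warc.gz")
```
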
05:33 🔗 SketchCow Now, if you go to https://archive.org/details/archiveteam_archivebot_go_20140924163956/v2
05:33 🔗 SketchCow You will see it showing screenshots, and automatically going between them
05:33 🔗 SketchCow For this first run, some of the screengrabs will be poop
05:33 🔗 DFJustin ohh yes
05:34 🔗 SketchCow But this is 100% fire and forget non-human labor
05:34 🔗 SketchCow And I promise you, Brewster will be over the moon this is getting done.
05:36 🔗 yipdw I'm going to waste so much time scrolling through these
05:36 🔗 SketchCow They're very informative.
05:36 🔗 yipdw actually what I should do is update to pywb 0.9.0 to make it easier to throw new WARCs into a preview thing
05:36 🔗 SketchCow http://teamarchive1.fnf.archive.org/DELETE-SCREENBIN/foolishreporter.wordpress.com-inf-20140923-194457-abm98-00000.warc.gz.jpg happens too
05:37 🔗 SketchCow http://teamarchive1.fnf.archive.org/DELETE-SCREENBIN/forum.z8games.com-shallow-20140923-071310-6jnal-00000.warc.gz.jpg happens then. Boom!
05:37 🔗 yipdw you may need to wait for the HTTP server to start up
05:38 🔗 SketchCow So, right now I have to do a coarse wait. 30 seconds.
05:38 🔗 yipdw ah
05:38 🔗 SketchCow I don't have a way to signal.
05:38 🔗 yipdw interesting, that should be enough
05:38 🔗 yipdw it is cool to see these sites ended up pretty well-preserved
05:39 🔗 xmc you could probably use netcat to check if the server is listening
05:39 🔗 xmc yup
05:39 🔗 xmc echo|nc exits with 0 if listening, 1 if not
05:39 🔗 SketchCow After this run through, I'll try that.
05:39 🔗 SketchCow And then knock it down to 10 seconds and 10 second checks
05:40 🔗 xmc sweet
05:41 🔗 DFJustin you'll probably want to install this at some point https://help.ubuntu.com/community/RestrictedFormats/Microsoft_Fonts
05:42 🔗 ikreymer yipdw: cool, let me know if you have any feedback on the new 0.9.0, should be easier to use than previous versions..
05:42 🔗 yipdw ikreymer: will do
05:44 🔗 SketchCow 01:39 <@xmc> echo|nc exits with 0 if listening, 1 if not
05:44 🔗 SketchCow Not sure it's doing that.
05:44 🔗 xmc % echo | nc localhost 22 > /dev/null ; echo $?
05:44 🔗 xmc 0
05:44 🔗 xmc % echo | nc localhost 4444 > /dev/null ; echo $?
05:44 🔗 xmc localhost [127.0.0.1] 4444 (?) : Connection refused
05:44 🔗 xmc 1
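
The same listening check xmc shows with `nc` can be sketched with the Python standard library, polling until the port accepts a connection instead of sleeping for a fixed 30 seconds. The function names, timeouts and polling interval are illustrative assumptions; the port the player listens on is left as a parameter.

```python
import socket
import time


def port_is_open(host, port, timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and unreachable hosts.
        return False


def wait_for_port(host, port, total_timeout=30.0, interval=1.0):
    """Poll until the port is open or total_timeout seconds have passed."""
    deadline = time.monotonic() + total_timeout
    while True:
        if port_is_open(host, port):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller would start the player in the background, call `wait_for_port`, and only take the screenshot if it returns `True`.
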
05:45 🔗 ikreymer yipdw: oh, and if you're checking one warc at a time, you can just have a single 'index.cdx' file and replace it with the next cdx each time. it'll pick up changes to same cdx, just not new filenames (at the moment)
05:45 🔗 BlueMaxim has quit IRC (Quit: Leaving)
05:53 🔗 SketchCow xmc-designed anger-rage version now running
05:54 🔗 xmc yet another shining example of hate driven development
05:54 🔗 yipdw testily-driven development
05:55 🔗 SketchCow agile-stabby
05:55 🔗 SketchCow You are ALL fucking committed
05:55 🔗 SketchCow Everyone is pigs
05:55 🔗 SketchCow all pigs
05:55 🔗 SketchCow die die die
05:56 🔗 SketchCow I just watched it "do the right thing"
05:56 🔗 SketchCow Very inspiring.
05:56 🔗 SketchCow (It threw a garbage can through Sal's pizza)
05:57 🔗 unityrkjs has joined #archiveteam
05:57 🔗 SketchCow Also, it waited patiently for the webserver to get its shit together.
05:57 🔗 SketchCow http://teamarchive1.fnf.archive.org/DELETE-SCREENBIN/
05:57 🔗 SketchCow If you do a shift-reload, you can see it
05:57 🔗 SketchCow How the creation times are sometimes within the same minute.
05:59 🔗 xmc ten seconds? gosh, how patient. i would have set it to 1 second.
06:00 🔗 DFJustin http://teamarchive1.fnf.archive.org/DELETE-SCREENBIN/c99.nl-shallow-20140924-162333-c0npj-00000.warc.gz.jpg still got caught with its pants down though
06:01 🔗 SketchCow No, that's what the player returned
06:01 🔗 SketchCow for the first URL
06:01 🔗 SketchCow Remember, it's just going after the first URL
06:01 🔗 xmc it looks like the page didn't load though?
06:01 🔗 SketchCow It could be a simple .png or an empty index.html
06:01 🔗 xmc bottom left corner
06:02 🔗 SketchCow If it didn't load it gives an elaborate error page
06:02 🔗 SketchCow lies
06:02 🔗 SketchCow GRABBY: line 24: 10920 Terminated webarchiveplayer --headless $clammy (wd: /0/WARCPLAY/webbergrabber/WARCBIN)
06:02 🔗 SketchCow Zzzzzz...
06:02 🔗 SketchCow Zzzzzz...
06:02 🔗 SketchCow See? That's it dealing with a crazy page.
06:02 🔗 SketchCow Zzzzzz...
06:02 🔗 SketchCow Zzzzzz...
06:02 🔗 xmc sometimes things suck i guess
06:02 🔗 SketchCow Zzzzzz...
06:02 🔗 SketchCow foolishreporter.wordpress.com-inf-20140923-194457-abm98-00000.warc.gz specifically
06:03 🔗 SketchCow Oh, it is angry at this one.
06:04 🔗 SketchCow http://teamarchive1.fnf.archive.org/DELETE-SCREENBIN/foolishreporter.wordpress.com-inf-20140923-194457-abm98-00000.warc.gz.jpg there it goes
06:04 🔗 SketchCow See, that's your work, xmc
06:04 🔗 SketchCow Otherwise it'd have skipped
06:05 🔗 SketchCow Also, I am clipping those messages out going forward.
06:05 🔗 SketchCow (The bottom)
06:06 🔗 SketchCow Did it again, grabbed a 2.2gb one, it took a while to deal.
06:11 🔗 SketchCow OK, this is great. I will go to sleep, and tomorrow it will likely finish these, then I will aim it at a pile of these.
06:17 🔗 Stilett0 has joined #archiveteam
06:17 🔗 Stiletto has quit IRC (Read error: Connection reset by peer)
06:19 🔗 MMovie2 has joined #archiveteam
06:20 🔗 MMovie has quit IRC (Ping timeout: 306 seconds)
06:21 🔗 unityrkjs has quit IRC (( www.nnscript.com :: NoNameScript 4.22 :: www.esnation.com ))
06:21 🔗 SketchCow Very happy, thanks everyone
06:22 🔗 yipdw hey v2 has cool graphs -> https://archive.org/details/archivebot&tab=about
06:27 🔗 DFJustin also gonna want some foreign font packages http://teamarchive1.fnf.archive.org/DELETE-SCREENBIN/forums.nesdev.com-inf-20140920-145426-4ej3s-00000.warc.gz.jpg
06:55 🔗 mistym has quit IRC (Remote host closed the connection)
07:40 🔗 primus104 has joined #archiveteam
07:40 🔗 hive-mind has quit IRC (Ping timeout: 260 seconds)
07:59 🔗 hive-mind has joined #archiveteam
08:08 🔗 antomati_ has joined #archiveteam
08:09 🔗 primus104 has quit IRC (Leaving.)
08:10 🔗 antomatic has quit IRC (Read error: Operation timed out)
08:39 🔗 robink has joined #archiveteam
08:42 🔗 schbirid has joined #archiveteam
08:43 🔗 johtso has joined #archiveteam
08:52 🔗 SmileyG has joined #archiveteam
09:04 🔗 wp494 I just want to say this: fuck flash player
09:04 🔗 wp494 and fuck flash player sites
09:11 🔗 JMC I would say yes... except that without flash, we wouldn't have albino black sheep, and without albino black ship, we wouldn't have had the 8th Avocaco.
09:11 🔗 JMC *Avocado
09:11 🔗 JMC ho'kay
09:24 🔗 robink has quit IRC (Remote host closed the connection)
09:27 🔗 primus104 has joined #archiveteam
09:33 🔗 robink has joined #archiveteam
09:49 🔗 raylee question
09:49 🔗 raylee has spftp.info.apple.com been archived
09:49 🔗 raylee (you need a cookie to get to the files, but with it dirlistings are enabled)
09:52 🔗 Ctrl-S does IA save flash, and if not, is ABS backed up?
10:05 🔗 ersi wtf is abs?
10:06 🔗 Ctrl-S [17:11] <JMC> I would say yes... except that without flash, we wouldn't have albino black sheep, and without albino black ship, we wouldn't have had the 8th Avocaco.
10:06 🔗 Ctrl-S albino black sheep
10:07 🔗 Ctrl-S 90's youtube
10:07 🔗 Ctrl-S flash animation site
11:00 🔗 Ymgve has joined #archiveteam
11:13 🔗 Ara_ has joined #archiveteam
11:15 🔗 signius has quit IRC (Remote host closed the connection)
11:17 🔗 Ara__ has quit IRC (Ping timeout: 369 seconds)
11:17 🔗 signius has joined #archiveteam
11:23 🔗 Start has quit IRC (ircd.shaw.ca irc.shaw.ca)
11:23 🔗 Jonimus has quit IRC (ircd.shaw.ca irc.shaw.ca)
11:23 🔗 xtr-201 has quit IRC (ircd.shaw.ca irc.shaw.ca)
11:23 🔗 dashcloud has quit IRC (ircd.shaw.ca irc.shaw.ca)
11:23 🔗 garyrh has quit IRC (ircd.shaw.ca irc.shaw.ca)
11:23 🔗 pikhq has quit IRC (ircd.shaw.ca irc.shaw.ca)
11:23 🔗 gibigiana has quit IRC (ircd.shaw.ca irc.shaw.ca)
11:23 🔗 SadDM has quit IRC (ircd.shaw.ca irc.shaw.ca)
11:23 🔗 useretail has quit IRC (ircd.shaw.ca irc.shaw.ca)
11:23 🔗 matthusby has quit IRC (ircd.shaw.ca irc.shaw.ca)
11:23 🔗 will has quit IRC (ircd.shaw.ca irc.shaw.ca)
11:27 🔗 gibigian1 has joined #archiveteam
11:42 🔗 Ara_ has quit IRC (Read error: Connection reset by peer)
11:44 🔗 Start has joined #archiveteam
11:44 🔗 Jonimus has joined #archiveteam
11:44 🔗 dashcloud has joined #archiveteam
11:44 🔗 garyrh has joined #archiveteam
11:44 🔗 pikhq has joined #archiveteam
11:44 🔗 SadDM has joined #archiveteam
11:44 🔗 useretail has joined #archiveteam
11:44 🔗 matthusby has joined #archiveteam
11:44 🔗 will has joined #archiveteam
11:44 🔗 irc.shaw.ca sets mode: +o SadDM
11:44 🔗 swebb sets mode: +o SadDM
12:14 🔗 Ara_ has joined #archiveteam
12:49 🔗 Ara_ has quit IRC (ircd.choopa.net irc.mzima.net)
12:49 🔗 robink has quit IRC (ircd.choopa.net irc.mzima.net)
12:49 🔗 xyzzy has quit IRC (ircd.choopa.net irc.mzima.net)
12:49 🔗 Selanda has quit IRC (ircd.choopa.net irc.mzima.net)
12:49 🔗 cloudmons has quit IRC (ircd.choopa.net irc.mzima.net)
12:49 🔗 bmcginty has quit IRC (ircd.choopa.net irc.mzima.net)
12:49 🔗 acridAxid has quit IRC (ircd.choopa.net irc.mzima.net)
12:49 🔗 patricko- has quit IRC (ircd.choopa.net irc.mzima.net)
12:49 🔗 Peetz0r has quit IRC (ircd.choopa.net irc.mzima.net)
12:49 🔗 goekesmi has quit IRC (ircd.choopa.net irc.mzima.net)
12:49 🔗 edsu_ has quit IRC (ircd.choopa.net irc.mzima.net)
12:49 🔗 tephra has quit IRC (ircd.choopa.net irc.mzima.net)
12:49 🔗 dugo_ has quit IRC (ircd.choopa.net irc.mzima.net)
12:49 🔗 lbft has quit IRC (ircd.choopa.net irc.mzima.net)
12:49 🔗 Meeh has quit IRC (ircd.choopa.net irc.mzima.net)
12:49 🔗 chazchaz_ has quit IRC (ircd.choopa.net irc.mzima.net)
12:49 🔗 Sanqui has quit IRC (ircd.choopa.net irc.mzima.net)
12:50 🔗 Ara_ has joined #archiveteam
12:50 🔗 robink has joined #archiveteam
12:50 🔗 xyzzy has joined #archiveteam
12:50 🔗 Selanda has joined #archiveteam
12:50 🔗 cloudmons has joined #archiveteam
12:50 🔗 bmcginty has joined #archiveteam
12:50 🔗 acridAxid has joined #archiveteam
12:50 🔗 patricko- has joined #archiveteam
12:50 🔗 Peetz0r has joined #archiveteam
12:50 🔗 goekesmi has joined #archiveteam
12:50 🔗 edsu_ has joined #archiveteam
12:50 🔗 tephra has joined #archiveteam
12:50 🔗 dugo_ has joined #archiveteam
12:50 🔗 lbft has joined #archiveteam
12:50 🔗 Meeh has joined #archiveteam
12:50 🔗 chazchaz_ has joined #archiveteam
12:50 🔗 Sanqui has joined #archiveteam
12:53 🔗 sankin has joined #archiveteam
12:57 🔗 Ara_ has quit IRC (Read error: Operation timed out)
12:58 🔗 Ara_ has joined #archiveteam
13:32 🔗 sankin has quit IRC (Leaving.)
13:33 🔗 primus104 has quit IRC (Leaving.)
13:36 🔗 SketchCow So, update on the screenshot thing for WARC items.
13:36 🔗 SketchCow It works!
13:37 🔗 SketchCow However, in cases where items simply have no browsable web pages, the technique I'm using does not know that, and shows a status page.
13:39 🔗 SmileyG has quit IRC (Quit: Lost terminal)
13:42 🔗 SmileyG has joined #archiveteam
13:43 🔗 xtr-201 has joined #archiveteam
13:43 🔗 Start has quit IRC (Disconnected.)
13:44 🔗 Ara_ has quit IRC (Read error: Operation timed out)
13:47 🔗 sankin has joined #archiveteam
13:53 🔗 Ara_ has joined #archiveteam
13:57 🔗 SketchCow Did it. It will be fixed.
13:57 🔗 SketchCow (It does a wget of the page, then looks for links existing. If they don't exist, it skips screenshotting. Boom.)
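
The "look for links, otherwise skip" check described above could be sketched as follows, operating on HTML that has already been fetched. This is not SketchCow's script; `count_links` and `should_screenshot` are hypothetical names, and treating "has at least one `<a href>`" as "has browsable pages" is an assumption.

```python
from html.parser import HTMLParser


class _LinkCounter(HTMLParser):
    """Counts <a> tags that carry a non-empty href attribute."""

    def __init__(self):
        super().__init__()
        self.links = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(key == "href" and value for key, value in attrs):
            self.links += 1


def count_links(html_text):
    """Return the number of <a href=...> links found in html_text."""
    parser = _LinkCounter()
    parser.feed(html_text)
    parser.close()
    return parser.links


def should_screenshot(html_text):
    """Skip screenshotting when the fetched listing page has no links at all."""
    return count_links(html_text) > 0
```
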
14:02 🔗 Sanqui oh, nice!
14:03 🔗 Ara_ has quit IRC (Read error: Operation timed out)
14:03 🔗 SketchCow https://archive.org/details/archiveteam_archivebot_go_20140924163956 has it
14:03 🔗 SketchCow It shows the broke pages too
14:03 🔗 SketchCow And I just switched to .png because duh
14:03 🔗 SketchCow So this one will need some cleanup
14:04 🔗 Ara_ has joined #archiveteam
14:04 🔗 Sanqui the header is a bit too big
14:04 🔗 Sanqui it takes up my whole window, it's a bit confusing
14:04 🔗 SketchCow Get... a bigger.....
14:04 🔗 SketchCow .....window
14:05 🔗 Sanqui that is awesome though
14:05 🔗 Sanqui love the thumbs
14:05 🔗 Sanqui :(
14:07 🔗 Sanqui you can see how this can be confusing: http://i.imgur.com/FNtYSpS.png
14:07 🔗 Sanqui I don't know how it could be solved though
14:12 🔗 Ara_ has quit IRC (Read error: Operation timed out)
14:13 🔗 Ara_ has joined #archiveteam
14:26 🔗 Ara_ has quit IRC (Read error: Operation timed out)
14:28 🔗 Ara_ has joined #archiveteam
14:37 🔗 ikreymer has quit IRC (Remote host closed the connection)
14:38 🔗 Start has joined #archiveteam
14:45 🔗 Ara_ has quit IRC (Read error: Operation timed out)
14:47 🔗 Ara_ has joined #archiveteam
14:48 🔗 thechip has joined #archiveteam
14:49 🔗 ikreymer has joined #archiveteam
14:57 🔗 Start has quit IRC (Disconnected.)
14:59 🔗 SketchCow OK, so there is an edge case where it blows up
14:59 🔗 SketchCow But it's otherwise doing things 99% right
14:59 🔗 SketchCow Good enough
14:59 🔗 balrog has quit IRC (Ping timeout: 260 seconds)
14:59 🔗 SketchCow I'm going to run it against all the archivebotties while I'm in Sweden
15:00 🔗 Ara_ has quit IRC (Read error: Operation timed out)
15:01 🔗 SketchCow Figure in 2-3 days it'll get through a lot.
15:01 🔗 Start has joined #archiveteam
15:01 🔗 johtso Is there a way to get archive.org to archive the files on a particular site?
15:02 🔗 SketchCow Go to http://archive.org/web/
15:02 🔗 SketchCow Bottom right, "Save Page Now"
15:07 🔗 SmileyG how is the archive holding up with archivebot these days?
15:07 🔗 Ara_ has joined #archiveteam
15:07 🔗 SmileyG hmmm maybe i should have asked that in # -bs
15:08 🔗 johtso SketchCow: doesn't it ignore files?
15:09 🔗 SketchCow Well, you give it the site, it gets the files on the site.
15:09 🔗 SmileyG I meant keeping up etc ;)
15:10 🔗 SketchCow Oh, the archivebot spice flows nicely.
15:10 🔗 johtso SketchCow: But doesn't it ignore larger assets?
15:10 🔗 SketchCow I have a thing constantly checking and uploading, though.
15:10 🔗 johtso for example https://web.archive.org/web/20150315040459/http://www.awesometapes.com/elias-tebabel-1995/
15:10 🔗 balrog has joined #archiveteam
15:10 🔗 swebb sets mode: +o balrog
15:11 🔗 DFJustin johtso: we have a channel #archivebot where you can get whole sites crawled
15:12 🔗 DFJustin save page now only gets one page at a time
15:12 🔗 DFJustin the wayback machine does retrieve files in its normal crawls though
15:12 🔗 DFJustin up to a maximum of 200mb each
15:13 🔗 johtso oh wow
15:13 🔗 * johtso has a read through the wiki page
15:14 🔗 mistym has joined #archiveteam
15:14 🔗 mistym has quit IRC (Remote host closed the connection)
15:14 🔗 johtso DFJustin: ah, this would be why the files are missing then.. http://traffic.libsyn.com/robots.txt
15:15 🔗 DFJustin boo hiss
15:16 🔗 johtso what's the archivebot policy on robots.txt?
15:16 🔗 DFJustin complete disregard
15:16 🔗 Ara_ has quit IRC (Read error: Operation timed out)
15:16 🔗 johtso lovely
15:17 🔗 DFJustin the results will only be downloadable in big chunks in the archivebot collection and not browsable via wayback though
15:17 🔗 DFJustin (normally archivebot feeds into the wayback machine)
15:18 🔗 johtso definitely better than nothing!
15:19 🔗 johtso DFJustin: they're always one-off scrapes too right?
15:20 🔗 DFJustin yes
15:22 🔗 johtso DFJustin: are subsequent scrapes in any way differential? Or does it just do the whole lot again?
15:23 🔗 DFJustin whole lot
15:23 🔗 johtso ouch
15:25 🔗 balrog has quit IRC (Read error: Operation timed out)
15:32 🔗 balrog has joined #archiveteam
15:32 🔗 swebb sets mode: +o balrog
15:44 🔗 mistym has joined #archiveteam
15:50 🔗 Start has quit IRC (Disconnected.)
15:56 🔗 Start has joined #archiveteam
16:07 🔗 mistym has quit IRC (Remote host closed the connection)
16:26 🔗 mistym has joined #archiveteam
16:31 🔗 Start has quit IRC (Disconnected.)
16:31 🔗 Start has joined #archiveteam
16:35 🔗 SmileyG has quit IRC (Quit: pi time)
16:37 🔗 Smiley has quit IRC (http://www.milkme.co.uk - You'll never understand.)
16:39 🔗 patricko- is now known as patrickod
16:43 🔗 patrickod is now known as patricko-
16:45 🔗 Start has quit IRC (Disconnected.)
16:49 🔗 primus104 has joined #archiveteam
16:57 🔗 Smiley has joined #archiveteam
16:59 🔗 khaoohs_ has joined #archiveteam
16:59 🔗 khaoohs has quit IRC (Read error: Connection reset by peer)
17:03 🔗 aaaaaaaaa has joined #archiveteam
17:10 🔗 patricko- is now known as patrickod
17:12 🔗 sankin has quit IRC (Leaving.)
17:18 🔗 primus104 has quit IRC (Leaving.)
17:21 🔗 sankin has joined #archiveteam
17:22 🔗 patrickod is now known as patricko-
17:27 🔗 Ara_ has joined #archiveteam
17:44 🔗 dserodio has quit IRC (Read error: Operation timed out)
17:52 🔗 antomati_ is now known as antomatic
17:52 🔗 svchfoo2 sets mode: +o antomatic
17:53 🔗 Emcy has quit IRC (Read error: Connection reset by peer)
17:55 🔗 Emcy has joined #archiveteam
17:55 🔗 dserodio has joined #archiveteam
18:09 🔗 scyther has joined #archiveteam
18:14 🔗 schbirid https://chrome.google.com/webstore/detail/screencastify-screen-vide/mmeijimgabbpbgpdklnllpncmdofkcpn?hl=en
18:14 🔗 schbirid Screencastify is a simple video screen capture software (aka. screencast recorder) for Chrome. It is able to record all screen activity inside a tab, including audio. Just press record and the content of your tab is recorded.
18:21 🔗 londoncal has joined #archiveteam
18:32 🔗 caber has quit IRC (Quit: Doei Doei!!!)
18:36 🔗 caber has joined #archiveteam
18:53 🔗 Start has joined #archiveteam
18:54 🔗 Start_ has joined #archiveteam
18:54 🔗 Start has quit IRC (Read error: Connection reset by peer)
18:59 🔗 BlueMaxim has joined #archiveteam
19:02 🔗 schbirid fuck, quakedev.com added a robots.txt. does anyone have experience with whether IA would send me the existing old crawls? it was not my site
19:03 🔗 patricko- is now known as patrickod
19:17 🔗 Start has joined #archiveteam
19:17 🔗 Start_ has quit IRC (Read error: Connection reset by peer)
19:25 🔗 Start has quit IRC (Disconnected.)
19:29 🔗 Start has joined #archiveteam
19:38 🔗 patrickod is now known as patricko-
19:42 🔗 johtso Does anyone know if there are any handy command line tools for batch extracting all archives in a directory (rar/zip/7zip), skipping archives that have errors, and possibly even detecting the archive format for files with no extension?
19:44 🔗 db48x has quit IRC (Ping timeout: 258 seconds)
19:47 🔗 SN4T14_ has joined #archiveteam
19:53 🔗 SN4T14__ has quit IRC (Ping timeout: 369 seconds)
19:55 🔗 Smiley i've seen 'uncompress' scripts floating around the net
19:56 🔗 Smiley you'd want... for x in ./*; do uncompress --optionToIgnoreErrors "$x"; done
19:59 🔗 SimpBrain has quit IRC (Quit: Leaving)
20:02 🔗 primus104 has joined #archiveteam
20:03 🔗 SimpBrain has joined #archiveteam
20:07 🔗 DFJustin unar would work in that type of arrangement https://code.google.com/p/theunarchiver/
20:18 🔗 joepie91_ hm
20:18 🔗 joepie91_ johtso: catarc is something that /kind of/ does something like that
20:18 🔗 joepie91_ but it's not really for extracting
20:18 🔗 joepie91_ so much as it is for writing to stdout :P
20:19 🔗 joepie91_ johtso: https://pypi.python.org/pypi/catarc/1.1
20:19 🔗 joepie91_ (yes, I wrote that)
20:19 🔗 johtso ah interesting
20:19 🔗 joepie91_ err, https://github.com/joepie91/catarc
20:19 🔗 joepie91_ there
20:19 🔗 joepie91_ it's specifically to sidestep 7z's issues with errors
20:20 🔗 schbirid has quit IRC (Read error: Operation timed out)
20:20 🔗 johtso joepie91_: could imagine it being a bit tricky to get that tool to produce directories with files for each archive
20:21 🔗 joepie91_ johtso: probably not too awful, can probably just replace the "print to stdout" flag with a target dir
20:21 🔗 joepie91_ but it'd def require some code modifications
20:21 🔗 johtso I'm surprised there aren't more tools out there for this kind of thing..
20:22 🔗 Start has quit IRC (Disconnected.)
20:26 🔗 Jonimus has quit IRC (Ping timeout: 370 seconds)
20:29 🔗 schbirid has joined #archiveteam
20:32 🔗 joepie91_ johtso: I suspect most people just use bash loops
20:32 🔗 joepie91_ having to work with a wide collection of different archive types is an edge case
20:32 🔗 joepie91_ it's usually a lot of the same type, and then a bash loop would suffice
20:32 🔗 joepie91_ :p
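
For readers following the batch-extraction question above, here is a minimal sketch of the idea in Python: sniff each file's leading bytes to guess the archive format (so files with no extension still work), extract what can be extracted, and skip anything that errors. It is an illustration under stated assumptions, not a tool anyone in the channel wrote. The names `sniff_format` and `extract_all` are hypothetical, only zip is extracted (via the standard library), and rar/7z files are merely identified so they can be handed to an external tool such as the `unar` mentioned above.

```python
import zipfile
import zlib
from pathlib import Path

# Leading "magic" bytes for the three formats named in the discussion.
MAGIC = {
    b"PK\x03\x04": "zip",
    b"Rar!\x1a\x07": "rar",
    b"7z\xbc\xaf\x27\x1c": "7z",
}


def sniff_format(path):
    """Return 'zip', 'rar', '7z', or None, judging only by the first bytes."""
    with open(path, "rb") as fh:
        head = fh.read(8)
    for magic, name in MAGIC.items():
        if head.startswith(magic):
            return name
    return None


def extract_all(src_dir, dest_dir):
    """Extract every readable zip in src_dir into its own subdirectory.

    Returns (extracted, skipped). `extracted` is a list of file names;
    `skipped` is a list of (file name, reason) pairs, where reason is the
    sniffed format for non-zip files, None for unrecognized files, or
    "error" for zips that failed to open or extract.
    """
    src, dest = Path(src_dir), Path(dest_dir)
    extracted, skipped = [], []
    for path in sorted(p for p in src.iterdir() if p.is_file()):
        fmt = sniff_format(path)
        if fmt != "zip":
            # Assumption: rar/7z would be passed to an external extractor;
            # this sketch only records them.
            skipped.append((path.name, fmt))
            continue
        target = dest / (path.name + ".d")  # one output directory per archive
        try:
            with zipfile.ZipFile(path) as zf:
                zf.extractall(target)
        except (zipfile.BadZipFile, zlib.error, RuntimeError, OSError):
            # Damaged or password-protected archive: note it and move on.
            skipped.append((path.name, "error"))
            continue
        extracted.append(path.name)
    return extracted, skipped
```

Calling `extract_all("downloads", "unpacked")` (both directory names are placeholders) would leave one `<name>.d` directory per good zip under `unpacked` and return the list of skipped files for a second pass with another tool. Sniffing by magic bytes is deliberately crude: an empty zip, for instance, starts with a different signature and would be reported as unrecognized.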
20:39 🔗 Jonimus has joined #archiveteam
20:53 🔗 signius has quit IRC (Read error: Operation timed out)
20:57 🔗 patricko- is now known as patrickod
20:58 🔗 WubTheCap has quit IRC (Quit: Leaving)
20:58 🔗 sankin has quit IRC (Leaving.)
21:04 🔗 patrickod is now known as patricko-
21:05 🔗 signius has joined #archiveteam
21:07 🔗 Emcy has quit IRC (Ping timeout: 512 seconds)
21:19 🔗 db48x has joined #archiveteam
21:27 🔗 okeuday has quit IRC (Ping timeout: 265 seconds)
21:30 🔗 schbirid has quit IRC (Leaving)
21:31 🔗 habi has joined #archiveteam
21:37 🔗 ikreymer has quit IRC (Remote host closed the connection)
21:41 🔗 Emcy has joined #archiveteam
21:41 🔗 okeuday has joined #archiveteam
21:43 🔗 Ara_ has quit IRC (Ping timeout: 492 seconds)
21:52 🔗 Waiii has joined #archiveteam
21:53 🔗 Waiii is it just me or is imcute.yt down?
21:55 🔗 Waiii so it's not just me
21:56 🔗 xmc doesn't look up to me
21:57 🔗 Waiii but archiveteam still says it's up :p
21:57 🔗 xmc oh, the wiki page for urlteam?
21:58 🔗 xmc that's always out of date because all these piddly little shorteners keep dying
21:58 🔗 xmc it's v sad
21:58 🔗 Waiii but where will i find all the glorious webms :c
22:00 🔗 chfoo um, i think it's a 4chan archive, not a url shortener
22:00 🔗 yipdw I love these archivebot screenshots
22:00 🔗 yipdw https://ia801001.us.archive.org/23/items/archiveteam_archivebot_go_010/twitter.com-inf-20131205-135447.warc.gz.png
22:00 🔗 yipdw remember when Twitter's web UI wasn't a huge sprawling mass
22:01 🔗 BlueMaxim that's like saying remember geocities. :/
22:02 🔗 yipdw in the sense that both are preserved yes
22:07 🔗 ikreymer has joined #archiveteam
22:09 🔗 db48x SketchCow: fixed https://archive.org/details/msdos_Snack_Attack_II_1982&external_js=1
22:09 🔗 db48x now I've just got to test games without a dosbox.conf file and make sure they're not broken :P
22:12 🔗 db48x DOOM2 works
22:15 🔗 habi has quit IRC (Quit: Leaving.)
22:16 🔗 Waiii has quit IRC (Quit: Page closed)
22:31 🔗 patricko- is now known as patrickod
22:31 🔗 BnAboyZ has joined #archiveteam
22:32 🔗 patrickod is now known as patricko-
22:39 🔗 SimpBrain has quit IRC (Quit: Leaving)
22:45 🔗 Start has joined #archiveteam
22:46 🔗 dashcloud db48x: can you check Fatal Distractions? https://archive.org/details/Fataldistract All the shareware on the disc can supposedly run from the CD, so it would be fascinating if you could run the shareware CD online and get access to all of those games
23:09 🔗 scyther has quit IRC (Read error: Operation timed out)
23:10 🔗 scyther has joined #archiveteam
23:15 🔗 patricko- is now known as patrickod
23:22 🔗 patrickod is now known as patricko-
23:31 🔗 londoncal has quit IRC (Leaving...)
23:33 🔗 Ymgve has quit IRC ()
23:40 🔗 JonimusP has joined #archiveteam
23:44 🔗 Jonimus has quit IRC (Ping timeout: 370 seconds)
23:44 🔗 JonimusP is now known as Jonimus
23:49 🔗 johtso Urgh, so frustrating when a page is archived, but the interesting bits managed to slip from the crawler's grasp :( http://web.archive.org/web/20090121052458/http://www.voanews.com/english/Africa/blog/
23:49 🔗 johtso **** javascript
23:58 🔗 JMC has quit IRC (Ping timeout: 258 seconds)
