[00:33] so, I know it's been mentioned before, but twitch.tv is and has been pruning older broadcasts: anything older than about 2 months, or longer than about 2 hours 30 minutes, is likely to be pruned unless the owner specially marks it [00:33] what I've been using to archive recent (and older) streams from the channels I follow is this site: http://www.twitchtools.com/video-download.php [00:36] just copy the link address of a broadcast, paste it there, and then hit download video. If there's more than one quality level, pick one, and then you'll get the download links (2012 and earlier items are broken into 2 hour blocks, 2013 into 30 min blocks) [00:38] also, if you get a message saying error: API limit, please reload and try again, it does go away if you reload the page and try again [00:55] The xml (rss feed) identifies the "source" video. [01:00] do I need to be logged in to see the xml/rss feed? or is it just something I can stick after the channel name? [01:13] this is probably better in #blooper.tv [01:14] The "full feature" has the xml as a relative link. [03:16] Should we be using the Hanzo warctools or the IA one for Python? Which do people use? [03:18] I use hanzo's warc-tools for reading and writing [03:42] thanks. I guess I'll stick with that [04:22] is the warc patch making its way into baseline wget at some point? [04:23] it is in trunk already [04:23] 1.14 shipped with warc writing [04:23] making its way into distributions as we speak [04:24] debian stable currently distributes 1.14 :) [04:47] http://fundanything.com/en/campaigns/penn-campaign [04:50] Why doesn't he fund it himself? [04:51] don't know [04:52] maybe just wanted to see if people wanted it first [04:52] so he put the full price of the movie on there [04:53] in his video though he said he didn't have enough money to produce it. [04:54] which seems a bit suspect. [04:55] His exact words were: "I need more than I have myself." Goal is $999,972.
[04:55] So either he's lying, or his magic show isn't that good. [04:56] like i said [04:57] he may not want to pay for it 'cause he was not sure anyone wanted it [04:57] and 2 [04:57] I agree, it's a good idea to run the campaign to see if people want it. but it's still a lie. [04:58] he may not know if he'd get any money back [04:58] also gives you guaranteed pre-orders, so there's no worry it won't make at least $1 million. [07:11] One thing about the Wayback Machine is that some images often fail to display in results; is this a fault of the archive file or settings? [07:11] there are various reasons for that, depends on the site [07:12] a lot of the earlier crawls were not very deep due to limited budget and so things like images were not exhaustively retrieved [07:12] on some sites the images are on a separate folder or server which is blocked by robots.txt [07:12] sometimes there's overactive remote-linking protection such that the wayback machine got 403 errors trying to get the images [07:14] there can be a date mismatch where the images exist in later crawls but are not available for the selected date [07:14] if you try to load the image url by itself you can find out more [07:15] Right. Thanks. Was just wondering as it's strange that even in recent crawls it can happen. Also why would remote-linking affect a crawler? [07:16] *protection [07:18] i noticed that problem with theblaze.com in august of 2012 [07:18] I'm not sure but you can see it on e.g. older fortunecity crawls [07:19] if the conditions weren't just right the server would serve up a fortunecity logo rather than the image [07:20] even for recent crawls the wayback machine is a tradeoff between grabbing absolutely everything possible on the one hand and not filling up petabytes with infinitely nested script pages and the like on the other hand [07:21] I recommend doing targeted grabs of anything you want to ensure is saved [07:21] You mean by using the 'Archive this page now' button?
[07:22] Or a custom crawl yourself? [07:23] I was wanting to start a crawl of a forum while it's still all up and wanted to be sure I could grab the images when it's saved to WARC files, as I know it's possible to submit the crawls to be considered for the Wayback Machine. [07:24] looks like rev3 is going to have a show called tekzilla bites [07:25] which is basically the same as the tekzilla daily series [07:25] either one, depends on whether you need one or a couple pages vs a whole site [07:25] to grab the images you want the --page-requisites option of wget [07:26] we have a bot in #archivebot that can do simple crawls but it doesn't work that well with forum type sites [07:29] I think it's custom forum code, about 10 years worth of mainly text based posts. Someone estimated 6GB worth. Haven't tried scraping a site using WARC before, and it looks like the main apps that save to it are wget and Heritrix. [07:37] what's the forum URL? [07:38] http://my.opera.com/community/forums/ [07:39] archivebot can probably do this [07:40] ok, submitted [07:40] Really? The whole thing including images? I thought it was only for smaller sites. [07:40] smaller's relative [07:41] it's not too uncommon to have WARCs of ten gigs or so from the bot [07:41] but yeah, images and other detectable assets are included [07:41] one thing that is not presently included is things like embedded flash videos [07:45] Awesome, thanks so much. Looking at it now on the archivebot dashboard. [07:45] Nice to see userscripts.org being crawled too. [07:46] it should work out, though with crawls there's always the possibility of things going wrong [07:46] #archivebot is the control channel if you'd like to monitor things there too [07:46] What happens when it's finished? Does someone test the links, or is that done during the archival process?
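The wget options discussed above (--page-requisites for images and other assets, plus the WARC writing that shipped in wget 1.14) can be combined into one invocation. A minimal sketch of building such a command line; the helper name, and the URL and output prefix in the usage comment, are hypothetical examples, not anything used by archivebot:

```python
def wget_warc_command(url, warc_prefix, recursive=False):
    """Build a wget command line that fetches a page plus its images,
    CSS, and scripts (--page-requisites) and records all traffic into
    a WARC file (requires wget 1.14+). Sketch only."""
    cmd = [
        "wget",
        "--page-requisites",           # grab images and other page assets
        "--adjust-extension",          # save files with sensible extensions
        f"--warc-file={warc_prefix}",  # writes warc_prefix.warc.gz
    ]
    if recursive:
        # whole-site crawl rather than a single page
        cmd += ["--recursive", "--no-parent"]
    cmd.append(url)
    return cmd

# Hypothetical usage (not executed here):
# subprocess.run(wget_warc_command(
#     "http://my.opera.com/community/forums/", "opera-forums", recursive=True))
```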
[07:47] the WARC is uploaded to a staging server, and eventually uploaded to the Internet Archive [07:48] link tests aren't done; it's assumed that the requests and responses are properly captured [07:49] or more specifically: if a 200 RETRFINISHED appears, it's assumed that the data associated with that request/response made it into a valid WARC record [07:49] same with all other status codes except for 0 [07:50] Nice. I assume the pages of the threads are also saved, even though I only see the main thread links in the scrolling history. [07:53] hehe [07:53] "smaller" in terms of archivebot, means "few enough URLs to not have the VPS run out of RAM storing all the URLs it has seen" [07:53] :) [07:56] Sum1: they should be -- what do the other pages look like? [07:56] URL-wise [07:56] Actually it must be, as most of these are only one page. They would look like this: http://my.opera.com/community/forums/topic.dml?id=770512&abc=&page=2&skip=50&show=&perscreen=50 [07:57] I can see a few with such a URL now [07:59] Sum1: if you'd like to see what archivebot grabs, check out any of the WARCs in https://ia601002.us.archive.org/9/items/archiveteam_archivebot_go_002/ [08:00] and I recommend using https://github.com/alard/warc-proxy to view them [08:00] my tests of archivebot's output were done with the same tool :P [08:00] you can also hit the URLs via the IA Wayback Machine, but I did not test with that [08:06] When it's done, is there some permalink created for the archive? [08:07] :P [08:07] eventually [08:07] WARCs used to go to a single location that had a simple URL generation scheme [08:08] that changed; now WARCs are put into a staging location and eventually uploaded to IA [08:08] I haven't yet written a program to track that [08:20] No probs :) btw I installed the WARC viewer in Firefox, but it doesn't show up in the Tools menu. Hmm.
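For anyone poking at WARCs like the ones linked above without warc-proxy or hanzo's warc-tools, the record framing itself is simple: a WARC/1.0 version line, CRLF-terminated header lines, a blank line, then Content-Length bytes of block. A minimal sketch that handles only uncompressed input; real tools also deal with gzip, validation, and malformed records, so this is illustrative only:

```python
import io

def read_warc_records(stream):
    """Yield (headers, body) for each record in an *uncompressed*
    WARC byte stream. Sketch of the framing only; use warc-tools or
    warc-proxy for real work."""
    while True:
        line = stream.readline()
        if not line:
            return                       # end of stream
        if not line.startswith(b"WARC/"):
            continue                     # skip blank lines between records
        headers = {}
        while True:
            h = stream.readline().rstrip(b"\r\n")
            if not h:
                break                    # blank line ends the header block
            key, _, value = h.partition(b":")
            headers[key.decode("ascii").strip()] = value.decode("ascii").strip()
        body = stream.read(int(headers["Content-Length"]))
        yield headers, body

# Tiny hand-made record to demonstrate the framing:
raw = (b"WARC/1.0\r\n"
       b"WARC-Type: resource\r\n"
       b"Content-Length: 5\r\n"
       b"\r\n"
       b"hello\r\n\r\n")
for hdrs, body in read_warc_records(io.BytesIO(raw)):
    print(hdrs["WARC-Type"], body)   # resource b'hello'
```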
[09:42] LOL the "Fight Spam" user is doing quite a bit of spamming on this forum :) https://archive.org/iathreads/posts-display-new.php?forum=opensource&mode=rss [11:05] When clicking on the '[History]' link of archivebot backups in progress on the dashboard nothing displays, is that normal? [12:40] grr, it's the second time maintenance has interrupted my 200+ GB upload to s3 when it's at about 80% [12:41] Surely there's a resume feature [12:48] hahahahha [12:51] * joepie93 reads that as a 'no' [13:13] the archive is on fire [13:13] hence maintenance [13:13] apologies [13:13] (all people are safe) [13:14] Fire? [13:14] WELL DAAAAAYYYYUM [13:20] But the warrior logo was supposed to depict services shutting down! [13:21] gonna conserve cell power and go for now, but know that data is safe [13:42] undersco2: wait, actual legitimate fire? [13:42] wha? [13:58] undersco2: :[ thanks for updating us [13:58] https://code.google.com/p/httrack2arc/ , do we know this? [13:58] :P [13:58] we do now [14:04] iirc someone here was working on fixing that perl yahoogroup dumper? [14:48] image of what is left of the on-fire scanning facility: https://pbs.twimg.com/media/BYZUtBdCUAAW6ey.jpg:large :/ [14:55] shit a brick [15:06] Yeah [15:08] https://pbs.twimg.com/media/BYZXbYiCIAAnZ4r.jpg:large [15:09] woah, damn [15:27] :( [15:32] OK, PROJECT [15:32] #archivefire [15:33] Need help finding all the photos, links, everything else related to the internet archive fire. [15:34] IA burned?
[15:35] [09:48:47] image of what is left of the on-fire scanning facility: https://pbs.twimg.com/media/BYZUtBdCUAAW6ey.jpg:large :/ [15:36] https://pbs.twimg.com/media/BYZXbYiCIAAnZ4r.jpg:large [15:36] https://pbs.twimg.com/media/BYZet1xCAAA6ko0.jpg:large [15:37] https://twitter.com/search?q=internet%20archive&src=typd&f=realtime [15:37] etc [15:38] good thing i didn't send in my Gutenberg Bible for scanning [15:39] apparently SF does not like halon, which would have saved the day here with no extra damage [15:40] military data center here uses it because they are not bound by stupid local laws [15:40] there are halon replacements that work well [15:40] of course even that won't help if the fire is bad enough :/ [15:41] you don't want to be a +1 in the center though.. doors auto seal and you better get to a mask [15:41] the card swiper counts # of people inside to make sure everyone can get air :) [15:42] well, they are not to be used everywhere [15:43] in the last libraries we built (in my university) it was only storage rooms with compactables iirc [15:43] true, and this place can't replace it with something else as it would require destroying a nuke proof bunker [15:46] What is the extent of the damage? [15:47] halon is banned almost globally, datacenters use FM200 or similar now, it's human safe and doesn't destroy the ozone layer when it's released [15:49] Not much stuff was lost in the "items" department. [15:49] There's a different warehouse things are stored at - this was not that. [15:49] i'm sorry, the total amount of energy needed to retool a halon facility completely negates any perceived help you might give the ozone layer [15:50] Ah, nerds and their numbers [15:50] remember, this stuff won't get released until you get a fire..
[15:50] ah, that's good news at least - I was just looking at your photos you linked on Twitter, SketchCow, and the shelves of material awaiting attention made me go 'ulp' [15:50] Cowering: FM200 is a direct replacement, replace the tanks [15:52] i'm sure our tanks are quite inaccessible. There is one little dial in the data center listing pressure, and it has been the same for 30+ years now. (lol maybe it's stuck and the tanks are empty) [15:53] they don't ever test it because even they would have to fill out shitty reams of paperwork for the REPA [15:54] where is this datacenter you are referring to located? [15:54] all this makes me want to tap a bottle of R12.. except that those are more expensive than wine now [15:54] eglin AFB, largest in the world [15:55] we have 4 datacenters, i don't know what the newest 3 use though as i don't do any work there [15:56] our cluster is 'high' on the supercomputer list, but we can't officially list it of course [15:58] that one is probably grandfathered in [15:59] but for any new datacenters FM200 or similar is superior [16:01] govt still uses it at some sites, remember, they don't follow the same safety laws the rest of the world does [16:02] we have to move one of our centers soon.. we are in a dick waving contest with groom lake to see who can do the longest runway [16:02] 10,000 feet just ain't long enough lol [16:06] loss of material was minimal-to-none [16:06] lost all ~15 $50k scribes though [16:06] along with the totaled building [16:06] all insured hopefully [16:08] Assume it is. [16:08] You can't function in the city without insurance. [16:08] Who knows what they'll value. [16:09] BiggieJon: yeah, but I'd assume that for new buildings they'll use modern equiv to halon. [16:11] http://abclocal.go.com/kgo/story?section=news/local/san_francisco&id=9315507#&cmp=twi-kgo-article-9315507 [16:12] is that a different fire?!
[16:12] !?* [16:13] bet that whole "pretty please gimme a scribe machine for my house" thing is looking a lot smarter now [16:13] They might try to take mine back [16:16] so how long until the kickstarter [16:17] Already discussed [16:17] Before end of year, I promise [16:24] I uploaded 12gb of hentai games [16:25] You know - to show respect [16:25] lol [16:43] so we are talking at least $1 million in damage [16:43] :/ [16:44] that's the 15 scanners plus building [16:46] $750K for the scanners alone [16:51] those are some expensive scanners... [16:51] now this is funny [16:52] one of my items is on the front page [16:52] https://archive.org/details/Secret_Life_Of_Machines_101 [16:53] were there microfiche scanners as well? [16:53] godane: the statistical probability of that happening is pretty high, considering your upload rate :) [16:54] i know [16:54] i was going to upload but i got a slowdown message [16:58] balrog: better question is, were there any books stored at that location which they don't have anymore? [16:59] equipment can be replaced, lost data is lost forever [17:00] [10:49:45] <@SketchCow> Not much stuff was lost in the "items" department. [17:00] [10:49:56] <@SketchCow> There's a different warehouse things are stored at - this was not that. [17:00] k, missed that part balrog ;) [17:01] now you see my idea of 1PB dvd-like disk not being so crazy [17:02] this way SketchCow can have a backup [17:03] M1das_: not all equipment.. [17:06] joepie93: well most of it, and a lot can be rebuilt i presume [17:06] data on the other hand is less likely to be rebuilt after a fire [17:07] let's see what linkedin is telling me now [17:08] Fires kill many businesses.
[17:09] the data could be in more than one place in the USA [17:09] just say [17:09] *saying [17:09] IA has multiple locations AFAIK [17:09] only thinking that 'cause the full backup is in the Middle East [17:10] IA has two locations but both are in the bay area [17:10] i only know of 2 other locations [17:10] as far as I can tell the egypt backup is only wayback and only up to 2007 http://www.bibalex.org/isis/frontend/archive/archive_web.aspx [17:10] ok [17:11] and the one in .nl? does it still exist? [17:12] it's at xs4all [17:13] afaik [17:13] I presume it still exists, but no certainty [17:14] how much data does IA now contain anyway? [17:14] 15 pb [17:14] as per the 1024 presentation [17:14] in total or per cluster? [17:15] I believe that's per [17:53] seeing pics it looks like if there was any material on those metal racks waiting to be scanned or refiled it got incinerated [17:53] unless people were unprofessional i don't think anything was left on the scanners overnight [17:53] since they're manual [17:54] can see the racks as they were on http://www.flickr.com/photos/textfiles/sets/72157627449430098/ and the twisted remains of them on the pics of the building [17:54] depends how fast it was noticed, it's a lot easier to grab a cardboard box and run than, say, a scribe workstation [17:54] was anyone there when the fire happened though? [17:55] 3:45am [17:55] those pictures are from 2011 [17:55] Two people live onsite [17:55] So yes [17:56] also other than those boxes what else in the room is readily flammable?
[17:57] The whole structure is made of wood [17:57] if there was any nitrate film down there being scanned i could definitely see that self-igniting [17:57] that stuff is insanely dangerous [17:57] er, perhaps 'moderately' is better than 'insanely' [17:58] as long as you're prepared for the film to suddenly spontaneously ignite at any time [18:01] "Kahle was in remarkably good spirits when we spoke to him around 9am this morning, and was optimistic about their plans to rebuild the office. He said they mostly lost electronic equipment including cameras and scanners, and thankfully no cultural materials were destroyed in the fire. He said he was still waiting for word from the SFFD on what exactly caused the fire. " [18:05] Nothing gets brewster down [18:05] He spends all his time being told what he's doing is impossible [18:20] SketchCow: quite a lot of respect for the guy [20:47] Nemo_bis: you could probably use HTTrack with one of my Warc proxies. Would probably be easier too. [20:48] hmm? [20:49] I don't use htttrack for archiveteam [20:49] +/- some 't's [23:51] ugh, what's the secret word for the AT wiki... T_T [23:51] yahoosucks [23:52] tyty