#archiveteam 2013-11-06,Wed


Time Nickname Message
00:33 🔗 dashcloud so, I know it's been mentioned before, but twitch.tv is and has been doing pruning of older broadcasts- items from the last 2 months, and anything over about 2 hours, 30 minutes is likely to be pruned unless the owner specially marks it
00:33 🔗 dashcloud what I've been using to archive recent (and older) streams from the channels I follow is this site: http://www.twitchtools.com/video-download.php
00:36 🔗 dashcloud just copy the link address of a broadcast, and paste it there, and then hit download video. If there's more than one quality level, pick one, and then you'll get the download links (2012 and earlier items are broken into 2 hour blocks, 2013 into 30 min blocks)
00:38 🔗 dashcloud also, if you get a message saying error: API limit, please reload and try again, it does go away if you reload the page and try again
00:55 🔗 phillipsj The xml (rss feed) identifies the "source" video.
01:00 🔗 dashcloud do I need to be logged in to see the xml/rss feed? or is it just something I can stick after the channel name?
01:13 🔗 phillipsj this is probably better in #blooper.tv
01:14 🔗 phillipsj The "full feature" has the xml as a relative link.
03:16 🔗 odie5533 Should we be using the Hanzo warctools or the IA one for Python? Which do people use?
03:18 🔗 ivan` I use hanzo's warc-tools for reading and writing
03:42 🔗 odie5533 thanks. I guess I'll stick with that
04:22 🔗 Lord_Nigh is the warc patch making its way into baseline wget at some point?
04:23 🔗 xmc it is in trunk already
04:23 🔗 ivan` 1.14 shipped with warc writing
04:23 🔗 xmc making its way into distributions as we speak
04:24 🔗 xmc debian stable currently distributes 1.14 :)
04:47 🔗 godane http://fundanything.com/en/campaigns/penn-campaign
04:50 🔗 odie5533 Why doesn't he fund it himself?
04:51 🔗 godane don't know
04:52 🔗 godane maybe just wanted to see if people wanted it first
04:52 🔗 godane so he put the full price of the movie on there
04:53 🔗 odie5533 in his video though he said he didn't have enough money to produce it.
04:54 🔗 odie5533 which seems a bit suspect.
04:55 🔗 odie5533 His exact words were: "I need more than I have myself." Goal is $999,972.
04:55 🔗 odie5533 So either he's lying, or his magic show isn't that good.
04:56 🔗 godane like i said
04:57 🔗 godane he may not want to pay for it because he was not sure anyone wanted it
04:57 🔗 godane and 2
04:57 🔗 odie5533 I agree, it's a good idea to run the campaign to see if people want it. but it's still a lie.
04:58 🔗 godane he may not know if he'll get any money back
04:58 🔗 odie5533 also gives you guaranteed pre-orders, so there's no worry it won't make at least $1 million.
07:11 🔗 Sum1 One thing about the Wayback Machine is often some images fail to display in results, is this a fault of the archive file or settings?
07:11 🔗 DFJustin there are various reasons for that, depends on the site
07:12 🔗 DFJustin a lot of the earlier crawls were not very deep due to limited budget and so things like images were not exhaustively retrieved
07:12 🔗 DFJustin on some sites the images are on a separate folder or server which is blocked by robots.txt
07:12 🔗 DFJustin sometimes there's overactive remote-linking protection such that the wayback machine got 403 errors trying to get the images
07:14 🔗 DFJustin there can be a date mismatch where the images exist in later crawls but are not available for the selected date
07:14 🔗 DFJustin if you try to load the image url by itself you can find out more
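One way to act on the suggestion of loading the image URL by itself is to ask the Wayback Machine for the capture nearest a chosen date. The following is a minimal Python sketch, assuming the common `web.archive.org/web/<timestamp>/<url>` address layout; the function name and the example image URL are hypothetical and only illustrate the idea.

```python
from datetime import datetime

# Assumed public address prefix for Wayback Machine captures.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, when: datetime) -> str:
    """Build an address asking for the capture of original_url nearest to `when`."""
    # Wayback timestamps are 14 digits: YYYYMMDDhhmmss.
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


if __name__ == "__main__":
    # Hypothetical image URL, used only as an example.
    print(wayback_url("http://example.com/images/logo.gif", datetime(2012, 8, 1)))
```

Opening the resulting address in a browser should show which of the causes listed above applies: a capture dated far from the page's own date suggests a date mismatch, while an error page suggests the image was blocked or never retrieved.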
07:15 🔗 Sum1 Right. Thanks. Was just wondering as it's strange that even in recent crawls it can happen. Also why would remote-linking affect a crawler?
07:16 🔗 Sum1 *protection
07:18 🔗 godane i noticed that problem with theblaze.com in august of 2012
07:18 🔗 DFJustin I'm not sure but you can see it on e.g. older fortunecity crawls
07:19 🔗 DFJustin if the conditions weren't just right the server would serve up a fortunecity logo rather than the image
07:20 🔗 DFJustin even for recent crawls the wayback machine is a tradeoff between grabbing absolutely everything possible on the one hand and not filling up petabytes with infinitely nested script pages and the like on the other hand
07:21 🔗 DFJustin I recommend doing targeted grabs of anything you want to ensure is saved
07:21 🔗 Sum1 You mean by using the 'Archive this page now' button?
07:22 🔗 Sum1 Or a custom crawl yourself?
07:23 🔗 Sum1 I was wanting to start a crawl of a forum while it's still all up and wanted to be sure I could grab the images when it's saved to WARC files, as I know it's possible to submit the crawls to be considered for the Wayback Machine.
07:24 🔗 godane looks like rev3 is going to have a show called tekzilla bites
07:25 🔗 godane which is basically the same as the tekzilla daily series
07:25 🔗 DFJustin either one, depends on whether you need one or a couple pages vs a whole site
07:25 🔗 DFJustin to grab the images you want the --page-requisites option of wget
07:26 🔗 DFJustin we have a bot in #archivebot that can do simple crawls but it doesn't work that well with forum type sites
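For a do-it-yourself crawl along the lines of the `--page-requisites` suggestion above, the wget invocation could be assembled roughly as follows. This is a sketch under assumptions: it needs wget 1.14 or later for `--warc-file`, the recursion and politeness options are one plausible choice rather than a recommended recipe, and the function name and WARC base name are hypothetical.

```python
import shlex
from typing import List


def build_wget_warc_command(start_url: str, warc_name: str) -> List[str]:
    """Return an argument list for a recursive wget crawl that writes a WARC."""
    return [
        "wget",
        "--recursive",               # follow links from the start page
        "--level=inf",               # no depth limit (assumption: whole forum wanted)
        "--page-requisites",         # also fetch images, CSS and other page assets
        "--no-parent",               # stay below the starting directory
        "--wait=1",                  # pause between requests to go easy on the server
        f"--warc-file={warc_name}",  # base name of the WARC output file
        start_url,
    ]


if __name__ == "__main__":
    cmd = build_wget_warc_command("http://my.opera.com/community/forums/", "opera-forums")
    # Print the command for review instead of running it.
    print(shlex.join(cmd))
```

The sketch only prints the command; passing the list to `subprocess.run` would start the crawl.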
07:29 🔗 Sum1 I think it's custom forum code, about 10 years worth of mainly text based posts. Someone estimated 6GB worth. Haven't tried scraping a site using WARC before, and it looks like the main apps that save to it are wget and Heritrix.
07:37 🔗 yipdw what's the forum URL?
07:38 🔗 Sum1 http://my.opera.com/community/forums/
07:39 🔗 yipdw archivebot can probably do this
07:40 🔗 yipdw ok, submitted
07:40 🔗 Sum1 Really? The whole thing including images? I thought it was only for smaller sites.
07:40 🔗 yipdw smaller's relative
07:41 🔗 yipdw it's not too uncommon to have WARCs of ten gigs or so from the bot
07:41 🔗 yipdw but yeah, images and other detectable assets are included
07:41 🔗 yipdw one thing that is not presently included are things like embedded flash videos
07:45 🔗 Sum1 Awesome, thanks so much. Looking at it now on the archivebot dashboard.
07:45 🔗 Sum1 Nice to see userscripts.org being crawled too.
07:46 🔗 yipdw it should work out, though with crawls there's always the possibility of things going wrong
07:46 🔗 yipdw #archivebot is the control channel if you'd like to monitor things there too
07:46 🔗 Sum1 What happens when it's finished? Does someone test the links, or is that done during the archival process?
07:47 🔗 yipdw the WARC is uploaded to a staging server, and eventually uploaded to the Internet Archive
07:48 🔗 yipdw link tests aren't done; it's assumed that the requests and responses are properly captured
07:49 🔗 yipdw or more specifically: if a 200 RETRFINISHED appears, it's assumed that the data associated with that request/response made it into a valid WARC record
07:49 🔗 yipdw same with all other status codes except for 0
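The rule described above (every status code except 0 is assumed to have produced a valid WARC record) can be written out as a small check. This is an illustrative Python sketch, not archivebot's actual code; the `(url, status)` pair format and both function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def assumed_captured(status_code: int) -> bool:
    """Per the rule above: any status other than 0 is assumed to be in the WARC."""
    return status_code != 0


def summarize(results: Iterable[Tuple[str, int]]) -> Counter:
    """Tally hypothetical (url, status) pairs by whether capture is assumed."""
    tally = Counter()
    for _url, status in results:
        key = "captured" if assumed_captured(status) else "not captured"
        tally[key] += 1
    return tally


if __name__ == "__main__":
    sample = [
        ("http://example.com/", 200),
        ("http://example.com/missing", 404),
        ("http://example.com/timeout", 0),
    ]
    print(summarize(sample))
```

Note that a 404 counts as captured under this rule: the error response itself is what gets recorded.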
07:50 🔗 Sum1 Nice. I assume the pages of the threads are also saved, even though I only see the main thread links in the scrolling history.
07:53 🔗 joepie93 hehe
07:53 🔗 joepie93 "smaller" in terms of archivebot, means "few enough URLs to not have the VPS run out of RAM storing all the URLs it has seen"
07:53 🔗 joepie93 :)
07:56 🔗 yipdw Sum1: they should be -- what do the other pages look like?
07:56 🔗 yipdw URL-wise
07:56 🔗 Sum1 Actually it must be, as most of these are only one page. They would look like this: http://my.opera.com/community/forums/topic.dml?id=770512&abc=&page=2&skip=50&show=&perscreen=50
07:57 🔗 Sum1 I can see a few with such a URL now
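The paginated thread URL quoted above can be taken apart to see which parameters change from page to page. The Python sketch below assumes that `skip` equals `(page - 1) * perscreen`, which is inferred from that single example and not confirmed anywhere in the log; the function name is hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

EXAMPLE = ("http://my.opera.com/community/forums/topic.dml"
           "?id=770512&abc=&page=2&skip=50&show=&perscreen=50")


def page_url(example_url: str, page: int) -> str:
    """Derive the URL of another page of the same thread from an example URL."""
    parts = urlsplit(example_url)
    # keep_blank_values preserves empty parameters such as "abc=" and "show=".
    params = parse_qs(parts.query, keep_blank_values=True)
    per_screen = int(params["perscreen"][0])
    params["page"] = [str(page)]
    # Assumption: skip is the number of posts on all earlier pages.
    params["skip"] = [str((page - 1) * per_screen)]
    return urlunsplit(parts._replace(query=urlencode(params, doseq=True)))


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(page_url(EXAMPLE, n))
```

A crawler normally finds these pages by following the links in the thread itself, so a helper like this is mainly useful for spot-checking that specific pages ended up in the WARC.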
07:59 🔗 yipdw Sum1: if you'd like to see what archivebot grabs, check out any of the WARCs in https://ia601002.us.archive.org/9/items/archiveteam_archivebot_go_002/
08:00 🔗 yipdw and I recommend using https://github.com/alard/warc-proxy to view them
08:00 🔗 yipdw my tests of archivebot's output were done with the same tool :P
08:00 🔗 yipdw you can also hit the URLs via the IA Wayback Machine, but I did not test with that
08:06 🔗 Sum1 When it's done is there some permalink created for the archive?
08:07 🔗 yipdw :P
08:07 🔗 yipdw eventually
08:07 🔗 yipdw WARCs used to go to a single location that had a simple URL generation scheme
08:08 🔗 yipdw that changed; now WARCs are put into a staging location and eventually uploaded to IA
08:08 🔗 yipdw I haven't yet written a program to track that
08:20 🔗 Sum1 No probs :) btw installed the WARC viewer in Firefox, but it doesn't show up in the Tools menu. Hmm.
09:42 🔗 Nemo_bis LOL the "Fight Spam" user is quite spamming this forum :) https://archive.org/iathreads/posts-display-new.php?forum=opensource&mode=rss
11:05 🔗 Sum1 When clicking on the '[History]' link of archivebot backups in progress on the dashboard nothing displays, is that normal?
12:40 🔗 Nemo_bis grr, it's the second time a maintenance interrupts my 200+ GB upload to s3 when it's at about 80 %
12:41 🔗 Sum1 Surely there's a resume feature
12:48 🔗 Nemo_bis hahahahha
12:51 🔗 * joepie93 reads that as a 'no'
13:13 🔗 undersco2 the archive is on fire
13:13 🔗 undersco2 hence maintenance
13:13 🔗 undersco2 apologies
13:13 🔗 undersco2 (all people are safe)
13:14 🔗 GLaDOS Fire?
13:14 🔗 GLaDOS WELL DAAAAAYYYYUM
13:20 🔗 Cameron_D But the warrior logo was supposed to depict services shutting down!
13:21 🔗 undersco2 gonna conserve cell power and go for now, but know that data is safe
13:42 🔗 joepie93 undersco2: wait, actual legitimate fire?
13:42 🔗 joepie93 wha?
13:58 🔗 Nemo_bis undersco2: :[ thanks for updating us
13:58 🔗 Nemo_bis https://code.google.com/p/httrack2arc/ , do we know this?
13:58 🔗 joepie93 :P
13:58 🔗 joepie93 we do now
14:04 🔗 balrog iirc someone here was working on fixing that perl yahoogroup dumper?
14:48 🔗 joepie93 image of what is left of the on-fire scanning facility: https://pbs.twimg.com/media/BYZUtBdCUAAW6ey.jpg:large :/
14:55 🔗 Baljem shit a brick
15:06 🔗 SketchCow Yeah
15:08 🔗 joepie93 https://pbs.twimg.com/media/BYZXbYiCIAAnZ4r.jpg:large
15:09 🔗 Sum1 woah, damn
15:27 🔗 balrog :(
15:32 🔗 SketchCow OK, PROJECT
15:32 🔗 SketchCow #archivefire
15:33 🔗 SketchCow Need help finding all the photos, links, everything else related to the internet archive fire.
15:34 🔗 Cowering IA burned?
15:35 🔗 balrog [09:48:47] <joepie93> image of what is left of the on-fire scanning facility: https://pbs.twimg.com/media/BYZUtBdCUAAW6ey.jpg:large :/
15:36 🔗 joepie93 <joepie93>https://pbs.twimg.com/media/BYZXbYiCIAAnZ4r.jpg:large
15:36 🔗 joepie93 https://pbs.twimg.com/media/BYZet1xCAAA6ko0.jpg:large
15:37 🔗 joepie93 https://twitter.com/search?q=internet%20archive&src=typd&f=realtime
15:37 🔗 joepie93 etc
15:38 🔗 Cowering good thing i didn't send in my Gutenberg Bible for scanning
15:39 🔗 Cowering apparently SF does not like halon, which would have saved the day here with no extra damage
15:40 🔗 Cowering military data center here uses it because they are not bound by stupid local laws
15:40 🔗 balrog there are halon replacements that work well
15:40 🔗 balrog of course even that won't help if the fire is bad enough :/
15:41 🔗 Cowering you don't want to be a +1 in the center though.. doors auto seal and you better get to a mask
15:41 🔗 Cowering the card swiper counts # of people inside to make sure everyone can get air :)
15:42 🔗 Nemo_bis well, they are not to be used everywhere
15:43 🔗 Nemo_bis in the last libraries we built (in my university) it was only storage rooms with compactables iirc
15:43 🔗 Cowering true, and this place can't replace it with something else as it would require destroying a nuke proof bunker
15:46 🔗 Sum1 What is the extent of the damage?
15:47 🔗 BiggieJon halon is banned almost globally, datacenters use FM200 or similar now, it's human safe and doesn't destroy the ozone layer when it's released
15:49 🔗 SketchCow Not much stuff was lost in the "items" department.
15:49 🔗 SketchCow There's a different warehouse things are stored at - this was not that.
15:49 🔗 Cowering i'm sorry, the total amount of energy needed to retool a halon facility completely negates any perceived help you might give ozone layer
15:50 🔗 SketchCow Ah, nerds and their numbers
15:50 🔗 Cowering remember, this stuff won't get released until you get a fire..
15:50 🔗 Baljem ah, that's good news at least - I was just looking at your photos you linked on Twitter, SketchCow, and the shelves of material awaiting attention made me go 'ulp'
15:50 🔗 BiggieJon Cowering: FM200 is a direct replacement, replace the tanks
15:52 🔗 Cowering i'm sure our tanks are quite inaccessible. There is one little dial in the data center listing pressure, and it has been the same for 30+ years now. (lol maybe it's stuck and the tanks are empty)
15:53 🔗 Cowering they don't ever test it because even they would have to fill out shitty reams of paperwork for the REPA
15:54 🔗 BiggieJon where is this datacenter you are referring to located ?
15:54 🔗 Cowering all this makes me want to tap a bottle of R12.. except that those are more expensive than wine now
15:54 🔗 Cowering eglin AFB, largest in the world
15:55 🔗 Cowering we have 4 datacenters, i don't know what the newest 3 use though as i don't do any work there
15:56 🔗 Cowering our cluster is 'high' on the supercomputer list, but we can't officially list it of course
15:58 🔗 balrog that one is probably grandfathered in
15:59 🔗 balrog but for any new datacenters FM200 or similar is superior
16:01 🔗 BiggieJon govt still uses it at some sites, remember, they don't follow the same safety laws the rest of the world does
16:02 🔗 Cowering we have to move one of our centers soon.. we are in a dick waving contest with groom lake to see who can do the longest runway
16:02 🔗 Cowering 10,000 feet just ain't long enough lol
16:06 🔗 undersco2 loss of material was minimal-to-none
16:06 🔗 undersco2 lost all ~15 $50k scribes though
16:06 🔗 undersco2 along with the totaled building
16:06 🔗 BiggieJon all insured hopefully
16:08 🔗 SketchCow Assume it is.
16:08 🔗 SketchCow You can't function in the city without insurance.
16:08 🔗 SketchCow Who knows what they'll value.
16:09 🔗 balrog BiggieJon: yeah, but I'd assume that for new buildings they'll use modern equiv to halon.
16:11 🔗 joepie93 http://abclocal.go.com/kgo/story?section=news/local/san_francisco&id=9315507#&cmp=twi-kgo-article-9315507
16:12 🔗 balrog is that a different fire1?
16:12 🔗 balrog !?*
16:13 🔗 DFJustin bet that whole "pretty please gimme a scribe machine for my house" thing is looking a lot smarter now
16:13 🔗 SketchCow They might try to take mine back
16:16 🔗 DFJustin so how long until the kickstarter
16:17 🔗 SketchCow Already discussed
16:17 🔗 SketchCow Before end of year, I promise
16:24 🔗 SketchCow I uploaded 12gb of hentai games
16:25 🔗 SketchCow You know - to show respect
16:25 🔗 joepie93 lol
16:43 🔗 godane so we are talking at least $1 million in damage
16:43 🔗 joepie93 :/
16:44 🔗 godane thats the 15 scanners plus building
16:46 🔗 godane $750K for the scanners alone
16:51 🔗 joepie93 those are some expensive scanners...
16:51 🔗 godane now this is funny
16:52 🔗 godane one of my items is on the front page
16:52 🔗 godane https://archive.org/details/Secret_Life_Of_Machines_101
16:53 🔗 balrog were there microfiche scanners as well?
16:53 🔗 joepie93 godane: the statistical probability of that happening is pretty high, considering your upload rate :)
16:54 🔗 godane i know
16:54 🔗 godane i was going to upload but i got a slowdown message
16:58 🔗 M1das_ balrog: better question is, were there any books stored at that location which they don't have anymore?
16:59 🔗 M1das_ equipment can be replaced, lost data is lost forever
17:00 🔗 balrog [10:49:45] <@SketchCow> Not much stuff was lost in the "items" department.
17:00 🔗 balrog [10:49:56] <@SketchCow> There's a different warehouse things are stored at - this was not that.
17:00 🔗 M1das_ k, missed that part balrog ;)
17:01 🔗 godane now you see my idea of 1PB dvd-like disk not being so crazy
17:02 🔗 godane this way SketchCow can have a backup
17:03 🔗 joepie93 M1das_: not all equipment..
17:06 🔗 M1das_ joepie93: well most of it, and a lot can be rebuilt i presume
17:06 🔗 M1das_ data on the other hand is less likely to be rebuilt after a fire
17:07 🔗 M1das_ lets see what linkedin is telling me now
17:08 🔗 phillipsj Fires kill many businesses.
17:09 🔗 godane the data could be in more than one place in the USA
17:09 🔗 godane just say
17:09 🔗 godane *saying
17:09 🔗 M1das_ IA has multiple locations AFAIK
17:09 🔗 godane only thinking that cause the full backup is in middle east
17:10 🔗 DFJustin IA has two locations but both are in the bay area
17:10 🔗 godane i only know of 2 other locations
17:10 🔗 DFJustin as far as I can tell the egypt backup is only wayback and only up to 2007 http://www.bibalex.org/isis/frontend/archive/archive_web.aspx
17:10 🔗 godane ok
17:11 🔗 M1das_ and the one in .nl? does it still exist?
17:12 🔗 joepie93 it's at xs4all
17:13 🔗 joepie93 afaik
17:13 🔗 joepie93 I presume it still exists, but no certainty
17:14 🔗 M1das_ how much data does IA now contain anyway?
17:14 🔗 DFJustin 15 pb
17:14 🔗 DFJustin as per the 1024 presentation
17:14 🔗 M1das_ in total or per cluster?
17:15 🔗 DFJustin I believe that's per
17:53 🔗 Lord_Nigh seeing pics it looks like if there was any material on those metal racks waiting to be scanned or refiled it got incinerated
17:53 🔗 Lord_Nigh unless people were unprofessional i don't think anything was left on the scanners overnight
17:53 🔗 Lord_Nigh since they're manual
17:54 🔗 Lord_Nigh can see the racks as they were on http://www.flickr.com/photos/textfiles/sets/72157627449430098/ and the twisted remains of them on the pics of the building
17:54 🔗 DFJustin depends how fast it was noticed, it's a lot easier to grab a cardboard box and run than, say, a scribe workstation
17:54 🔗 Lord_Nigh was anyone there when the fire happened though?
17:55 🔗 Lord_Nigh 3:45am
17:55 🔗 balrog those pictures are from 2011
17:55 🔗 SketchCow Two people live onsite
17:55 🔗 SketchCow So yes
17:56 🔗 Lord_Nigh also other than those boxes what else in the room is readily flammable?
17:57 🔗 SketchCow The whole structure is made of wood
17:57 🔗 Lord_Nigh if there was any nitrate film down there being scanned i could definitely see that self-igniting
17:57 🔗 Lord_Nigh that stuff is insanely dangerous
17:57 🔗 Lord_Nigh er, perhaps 'moderately' is better than 'insanely'
17:58 🔗 Lord_Nigh as long as you're prepared for the film to suddenly spontaneously ignite at any time
18:01 🔗 DFJustin "Kahle was in remarkably good spirits when we spoke to him around 9am this morning, and was optimistic about their plans to rebuild the office. He said they mostly lost electronic equipment including cameras and scanners, and thankfully no cultural materials were destroyed in the fire. He said he was still waiting for word from the SFFD on what exactly caused the fire. "
18:05 🔗 SketchCow Nothing gets brewster down
18:05 🔗 SketchCow He spends all his time being told what he's doing is impossible
18:20 🔗 joepie93 SketchCow: quite a lot of respect for the guy
20:47 🔗 odie5533 Nemo_bis: you could probably use HTTrack with one of my Warc proxies. Would probably be easier too.
20:48 🔗 Nemo_bis hmm?
20:49 🔗 Nemo_bis I don't use htttrack for archiveteam
20:49 🔗 Nemo_bis +/- some 't's
23:51 🔗 frigolit ugh, what's the secret word for the AT wiki... T_T
23:51 🔗 Nemo_bis yahoosucks
23:52 🔗 frigolit tyty
