[00:47] so, weird question. is it bad form for me to use wget to WARC stuff from the wayback machine?
[00:49] i'd like to archive previous employer web sites for when the domains inevitably expire and a new domain owner potentially robots.txt's me out of stuff i worked on
[00:57] No.
[00:57] Save everything however you want.
[00:57] ok
[00:57] thanks :)
[00:57] Archive Team exists to give people choices, instead of all those who would take them away.
[01:03] hmm even though wayback's robots.txt looks like it disallows it?
[01:06] what's the best way to strip multi-line blocks of ad code from HTML in a post process?
[01:06] is there a way to set a text file as the matching equivalent in a perl find/replace via cmd line?
[01:14] I'd imagine sed would come in handy here?
[01:14] not that I know how to use it, but still
[01:15] :P
[01:25] sed is great for simpler 1 line replaces
[01:25] sed doesn't support lookahead/lookbehind
[01:25] and its regex engine is limited
[01:25] though there are more advanced versions of sed available that can do more, perl is much better for dealing with more complex multi-line regexes
[01:43] Yeah, POSIX sed has extended regex support (with -E) and GNU sed has a variety of extensions, but it's not very portable. Might as well go straight to something more advanced at that point.
[01:48] (Also I really don't like how gsed makes you pass --posix to get posix-compliant behaviour... when --posix is not a valid POSIX option and will make BSD sed barf)
[01:51] to force perl to parse multi lines in a cmdline type setup I had to add -0p
[01:51] which is SO not obvious
[01:51] had to find it on stackoverflow
[01:51] Huh, that's not listed in man perl but is in perl --help.
[02:49] well, i'm late to the party, but mistym is wrong about having to care about portability with sed
[04:49] so full interviews of cbc spark for year 2007 are uploaded now
[04:55] good news everyone
[04:56] i found the brewster kahle interview on spark
[04:56] the full interview
[04:56] this is the file url: http://thumbnails.cbc.ca/maven_legacy/thumbnails/13/481/bonussparkplus_20080909_16405_uploaded.mp3
[05:24] joepie91: i managed to get multi-line replace working with sed
[05:24] I had to, since perl in cygwin likes to create .bak files for every file that it modifies
[05:26] not if you do it this way instence
[05:26] perl -pi -e 's/old/new/'
[05:26] inline file edit
[05:27] http://stackoverflow.com/questions/11074322/does-perls-i-with-no-argument-create-a-backup-file-on-cygwin
[05:29] wow subtle windows problems :(
[05:29] ye :/
[05:29] yea
[05:31] so i'm using this:
[05:31] sed -r ':a;N;$!ba; s/FIND/REPLACE/g' file.html
[05:37] can anyone get this one to stream: http://www.cbc.ca/player/Radio/Spark/ID/2235141926/
[05:46] godane: nope
[06:10] Anyone been to pumpcon
[06:14] I have.
[06:18] so funny thing about spark player
[06:18] when you click on 2008 year in full episodes
[06:18] only one episode is there
[06:19] cbc spark is somehow trying to hide or delete
[06:19] stuff
[07:28] 1 way to drive yourself crazy: forgetting to save .sh file as unix line formatting in a windows environment to execute in cygwin
[07:29] I cannot remember, does cygwin come with dos2unix?
[07:34] You know, I've never liked that price increase they put on Australians.. http://i.imgur.com/JmmjzpZ.png
[07:55] i would have to check
[07:55] there are ways to convert the linefeeds to unix from windows using sed and other apps
[07:56] though i just had to remember to use unix EOL when saving my .sh file from notepad++
[08:48] starting to upload g4 e3 2012 full episode videos
[09:30] anyone have any twitter user archiving scripts?
[09:40] Tephra: I just load Twitter's JS and use ctrl-s with Mozilla Archive Format
[09:40] I'm sure there's some archiver on github somewhere
[10:09] fantastic, the paper: "The Continued Movement for Open Access to Peer-Reviewed Literature" --- paywalled!
[13:03] instence: yay!
[14:00] SketchCow: am I right in assuming that http://archive.org/details/martinmanleylifeanddeath.com-20130816 will be ingested into the Wayback eventually?
[14:01] Or, hell, anyone here?
[14:02] hello.
[14:02] Hi.
[14:05] o/
[14:06] GLaDOS: i believe all warcs are, yes.
[14:06] All warcs that are in the right collection are
[14:06] That collection being web?
[14:06] IIRC they should be in the "Web Crawl" collection
[14:06] GLaDOS: Mediatype != collection
[14:07] Ah.
[14:07] Web Crawls > Archive Team > The Archive Team Just In Time Grabs
[14:07] ahhh
[14:07] It's in a sub-sub collection of Web Crawls. So yes, it'll be available in Wayback
[14:08] Disclaimer: I could be wrong about this, but if I recall correctly.. this should be the case
[14:08] teehee~ I've uploaded 21 videos now
[14:52] I guess this needs to be moved, then: https://archive.org/details/Uponfurtherreview.blog.comPanicgrab20130815.warc
[14:52] (tries to log in)
[14:52] (fails)
[14:52] (has forgotten password)
[14:52] (dah!)
[14:53] Magical goblins change my passwords without me knowing. bleeh. :)
[15:40] http://gratisoptehalen.nl/advertentie.php?id=241847
[15:40] hmm
[15:40] bunch of old pieces of software, some Dutch some English
[15:41] (it's free, just shipping costs)
[15:41] worth getting?
[15:41] (and archiving)
[15:41] probably :)
[15:42] really should keep more of an eye on that stuff
[15:42] that site in particular
[15:42] lots of this old stuff coming by
[15:42] joepie91, what up :o)
[15:43] ohai
[15:43] about to take out the trash
[16:20] GLaDOS: Of course it will.
[16:21] All archiveteam 'web' type objects are ingested roughly every two weeks.
[16:21] ersi is wrong. It's items set 'web'.
[16:21] That's why not everyone can set that.
[16:22] https://archive.org/details/Uponfurtherreview.blog.comPanicgrab20130815.warc now fixed.
[16:24] so i got g4 e3 2012 stuff uploaded now
[16:24] and checked in
[16:24] thanks so much, Sketchcow.
[16:27] ah, alright
[16:28] SketchCow: So the 'Web Crawls' collection-connection has nothing to do with it? Just the mediatype?
[16:35] Right.
[16:35] Well, to be MORE specific...
[16:36] The mediatype is the definer. "web". There's attempts in terms of placing items in collections to ensure that web crawls are bunched together for a pure organizational effort, but they're not affecting The Programs.
[16:36] Ah, alright.
[16:36] The Programs are crawling through the archive.org items, finding ones with a "web" mediatype, and if they're new or changed, ingesting them.
[16:37] But The Programs are NOT going to things with a "texts" mediatype, going "well, hmmm, it has a warc.gz, let's add it too".
[16:37] The fact this goes on at all to the level it does is because of me at the archive.
[16:37] It used to be something not quite done.
[16:37] This is way better than nothing :)
[16:37] Now we're doing it so much we're contributing major blocks of the internet into the wayback machine.
[16:38] I'm just really curious how things work at Internet Archive.. Unfortunately, details are really.. nonexistent for us mere mortals :)
[16:38] Yeah, again, that's because of how they've been set up for years.
[16:38] It was rather painful for them, the amount of opening I'm forcing.
[16:38] * ersi nods understandingly
[16:39] One or two devs and I did not get along over it.
[16:39] And there is still some fear about it.
[16:39] I'm sure more than just us would be *really* interested in reading about a lot of IA things. I'm sure that could be used to drive donations as well.
[16:39] agreed with ersi there
[16:40] the whole IA setup fascinates me tbh
[16:40] But! You bring in the crazy open-access insane activist and you get what you get.
[16:40] (which is why I quite enjoyed your recent post, SketchCow)
[16:40] It could also drive volunteer effort in contributing other efforts. Like coding on wayback and tools
[16:40] heh
[16:40] * ersi shrugs
[16:40] I forgot about those
[16:40] "well, that's just part of the package, guys!"
[16:40] So, I'm walking through a lot of this.
[16:41] It took months to get subscriptions working in the system.
[16:41] Just slow will to radical change.
[16:41] I signed up, by the way.
[16:41] Totally worth while
[16:41] uploaded: https://archive.org/details/KeygenMusicPack-July2013
[16:41] So give me time, I'm spending a lot of effort to fix a lot of things.
[16:41] ... in hindsight, would it have been better to upload it as a .zip instead of the original .7z, so that you can browse it?
[16:41] http://archive.org/details/software - that didn't exist like that a mere year ago.
[16:41] Of course SketchCow. I'm just.. excited about things. :)
[16:41] Now it's fuckin'..... it's the bomb
[16:42] da bomb
[16:42] Also, I'm hand-cleaning things today.
[16:43] For example, I have a small pile of Bell System Technical Journal papers to go in.
[16:43] I'm REALLY trying to murder this 11tb backlog on my archive.org machine.
[16:43] Hahah, niiice
[16:44] Just 25 more journal papers to go in, but I can't use the script, it's all by hand.
[16:44] http://archive.org/details/bstj-archives
[16:44] But see? There's 4,337 items, just sitting there.
[16:44] Yeah, I feel ya'. I'm doing the DebConf12-videos by hand as well.
[16:44] mm, found a bunch of shady links, wat do? send email?
[16:45] info@archive.org I think
[16:45] When I turn to http://archive.org/details/hackercons it will be glorious.
[16:45] Yes, give Jeff stuff.
[16:45] (info@archive.org)
[16:45] Jeff is a goddamned master.
[16:47] email sent
[16:48] guessing malware
[16:48] spammy descriptions, files that are too small to contain the listed software, and a seemingly auto-generated e-mail address for the uploader
[16:48] different for each
[16:48] despite having highly similar description and title formats
[16:48] * joepie91 puts the popped up red flags back in the bin
[17:05] SketchCow: the title on this item should be fixed: http://archive.org/details/MmprMagazine-Fall1994
[17:11] http://sebsauvage.net/paste/?bca6cef7a70dfb9c#JqGn/10j/zVeN8kyLK0I2w4OwSBGPH2ZCabTbPc6qpY=
[17:22] anyone willing to mirror my old isos and scripts?
[17:22] here is the link: http://arch-live.isawsome.net/
[17:22] i ask cause i think its too big for me to mirror
[17:24] godane: fixed.
[17:24] joepie91: Bear in mind that we do notice things like that and do cleaning runs. Many.
[17:25] SketchCow, is there an abuse only email address or is it just info@
[17:35] The media makes me want to take a drone strike out on them http://www.huffingtonpost.com/2013/08/17/michael-grunwald-julian-assange_n_3773981.html
[17:35] (╯°□°)╯︵ ǝuızɐƃɐW ǝɯı⊥
[17:42] omf_: that guy certainly managed to get himself very near the top of my "people I would never get along with" list
[17:47] uploaded: http://archive.org/details/G4.Comic-Con.2012.Live.HDTV.x264-Eclipse
[17:47] you now got the g4 specials for 2012
[17:51] joepie91: yeah, Grunwald's a pussy
[17:51] well
[17:51] actually, no
[17:52] that's a bad term
[17:52] it implies females are weak
[17:52] he just sucks
[17:52] Grunwald's just an idiot
[17:52] And he forgot that his twitter account is a professional representation.
[17:52] I promise you, Journalists say terrible things all the time.
[17:52] Just not usually into an open mike.
[17:52] SketchCow: I'm not sure he so much 'forgot' as just didn't give a shit...
[17:52] That was an open mike.
[17:52] No, again, journalists are awful, dude.
[17:53] It's just this one went awful for no reason
[17:53] oh, I know that a lot of them are, I've dealt with quite a few of them, and know a few personally...
[17:53] but yes
[17:53] most of them know how far to publicize their thoughts
[17:53] and where to stop
[17:53] Grunwald apparently did not
[17:56] Yeah, made a mistake.
[17:57] http://i.imgur.com/Ha2wD19.gif
[18:00] jetpack cat
[18:12] SketchCow: Could you create a collection for DefCon12 and these items: http://burl.se/3ce ?
[18:14] It was rather painful for them, the amount of opening I'm forcing.
[18:14] SketchCow: I meant DebConf12
[18:14] underscor: How was Defcon? :)
[18:14] I'm just imagining jason with elbow length rubber gloves and a crowbar
[18:14] Mr sound guy
[18:16] pretty fun!
[18:16] a little bit of a blur
[18:16] but I thoroughly enjoyed it
[18:16] nice ^_^ yeah, that means it was fun though
[18:16] I noticed you by name in the credits
[18:16] of the documentary
[18:16] aside from having a mental breakdown on, uh, saturday night I think
[18:16] but I needed that
[18:16] doing a lot better now
[18:17] hehee, yay ^^
[18:17] sounds.. bad. Hope it helped in the long run though
[18:17] it was truly awful
[18:17] You guys should come over for some Europe-action sometime
[18:17] I'd been spiraling into depression for the prior month
[18:18] That's when it came to a head (*cough* A drink didn't help with that *cough*)
[18:18] * omf_ sends hugs underscor's way
[18:18] But that was also the turning point to getting out of it
[18:18] naw, drinks ain't helpin' against that kind of thing
[18:18] and I am doing well now
[18:19] Crying for a few hours in the corner of an acquaintance's hotel room in a casino of which I am still unsure of the name
[18:19] can be therapeutic
[18:19] I guess
[18:19] lol
[18:19] Las Vegas, baby.
[18:20] http://i.imgur.com/0C9zB0k.jpg
[18:20] hahah
[18:20] amen!
[18:20] SketchCow: hah, I thought of you when I read that earlier
[18:24] Yeah, great emo weekend for underscor
[18:24] Leaving Las Vegas Jr.
[18:25] It was fun!
[18:25] I turned the defcon knob up to 11 though
[18:25] from like, 0.35
[18:25] Well, growing up is tough, especially when you have a patchy support system like before you met me.
[18:25] Yeah <3
[18:25] You actually did defcon knob to 6-7 before because you kept going out at night after "work".
[18:25] Which was stuuuuuuuupid
[18:25] But that's what you do at 19
[18:26] Stuuuuuuuuuuppppid
[18:26] * underscor sheepish grin
[18:26] I have audio recording of you plotting to skip out
[18:26] it is adorbs.
[18:26] hahahaha
[18:26] I remember that
[18:27] so looks like Brewster Kahle may have archived techtv
[18:27] I remember leaving the tripod on the bed
[18:27] and that was the only record of my existance
[18:27] Personally? I doubt that
[18:27] only cause he said the tv archive started in 2000
[18:27] existence*
[18:27] Oh, I see what you mean.
[18:27] It is likely we have TechTV and a bunch of other material on the TV archive.
[18:27] Doesn't mean we don't need you saving/classifying
[18:28] For my own bit, I'm still trying to knock FOS down from 11gb of data.
[18:28] i know
[18:28] :D
[18:28] 11GB? Surely you meant either TB or PB
[18:28] i at least add keywords like the people in the shows
[18:29] 11tb
[18:29] its 11TB
[18:29] Anyway, back to some REALLY tedious tasks that I have been putting off for months.
[18:30] Underscor, let me know when you're available to do work.
[18:30] Also, don't let depression get the best of you next time.
[18:30] You're going to have near misses a few more times in the next 10 years.
[18:30] good news is i may get all g4 web videos fully uploaded in 2 weeks
[18:30] Don't be that guy.
[18:32] There's plenty of people that care JFYI
[18:33] I know
[18:33] * underscor sighs
[18:33] Emotions suck
[18:33] and so does life uncertainty
[18:33] x3
[18:33] Life is just states of mind
[18:34] IMO
[18:34] http://emopotatoe.ytmnd.com/
[18:35] ooooh
[18:35] what song is that?
[18:35] I like it
[18:35] That progression is orgasmically delicious
[18:36] underscor: Simple Plan is the band
[18:36] "how can this happen to me" is the song
[18:36] sweet, thanks :D
[18:36] np :)
[18:47] Killing the BSTJ list
[18:47] Going well
[18:47] Blasting Lewis black
[18:47] the jason scott of comedians
[18:48] :D
[18:48] sounds great
[18:49] hmm, is there a Jason Scott youtube playlist I can give my mom?
[18:50] Why would you do that
[18:50] She'll call the cops
[18:50] She'll call cops that don't even handle domestic situations
[18:50] i agree
[18:50] http://www.youtube.com/watch?v=ELji4-TogMI
[18:51] SketchCow: i uploaded like 3 collections in the last 4 days
[18:52] systm, foundation, and giantbomb podcast
[18:54] SketchCow: she wanted to know what you talk about
[18:55] Hah, oh wow - yeah, wget surely doesn't handle filename encodings right
[18:55] You. Tell her I talk about you.
[18:56] (C81) (?%90%8C人??%8C) [Lv.X+ (?%9F%9A?%9C?N')] ?%83%95?%83??%82??%82??%83??%83%83?%82??%83? (?%9C??%9D??%97???%98).zip
[18:56] underscor, the closest thing I know of http://www.archiveteam.org/index.php?title=Talks
[18:56] That's a good filename
[18:56] SketchCow: :D
[18:56] http://ascii.textfiles.com/speaking have a ball
[19:03] Incredibly boring work continuing
[19:03] hurrr yeah
[19:04] trying to download stuff from the wayback machine with wget sucks
[19:04] Why not just grab the .warc and then extract from it?
[19:04] how do i grab a warc for a site off wayback?
[19:04] the site
[19:04] er
[19:05] What site is it?
[19:05] the site's down and i'm afraid it won't be coming back up, i want to keep a local copy
[19:05] it's a former employer, a site i worked on. i'd like to keep a copy
[19:05] I meant URL :)
[19:05] if my former boss had warned me i'd have WARC'd all the sites that he still had up
[19:05] http://www.pygmy.com/ is one of a few
[19:05] aight, I'll see if I can dig it out
[19:06] then explain how I got there
[19:06] http://www.washingtonpost.com/blogs/the-switch/wp/2013/08/18/heres-what-you-find-when-you-scan-the-entire-internet-in-an-hour/
[19:06] thanks <3
[19:06] i'd love to know how to download warcs from wayback
[19:08] just watch the talk where jason was with his boss
[19:08] yeah that one is solid
[19:08] that was ROFLCon I think
[19:08] yes
[19:12] argh, I know I've gotten to a Wayback Machine .WARC somehow
[19:12] yeah, i sure couldn't figure out how
[19:13] i assumed it wasn't permitted
[19:13] A lot of the data is available in the "Web Crawls" collection (public)
[19:13] hmm ok
[19:15] http://web-beta.archive.org/web/*/pygmy.com*
[19:15] That'll be a bit easier, still not WARCs though
[19:16] yeah, that is an improvement
[19:17] Arcing of Electrical Contacts in Telephone Switching Systems: Part IV - Mechanism of the Initiation of the Short Arc
[19:17] It gets better
[19:18] * ersi nods
[19:22] wow yeah, i see a lot of the crawl data now but it's not obvious how to find which crawl the site is in
[19:25] Adding Bell System Technical Journal articles as well as Manga.
[19:33] Bell System Technical Journal, 35: 1 January 1956 pp 179-202. Statistical Techniques for Reducing the Experiment Time in Reliability Studies (Sobel, Milton)
[19:46] http://www.theguardian.com/commentisfree/2013/aug/18/david-miranda-detained-uk-nsa?CMP=twt_gu
[20:17] so i got skyrim for my birthday
[20:17] it is the legendary edition
[20:18] i told the guy at gamestop that i hate the download only content cause it will be lost in 5 to 10 years
[20:19] cause it will not be on a game disc for retro gamers
[20:19] and archivers like us
[20:30] ooo got invited to torrentleach.
[20:52] root@teamarchive0:/0/PLEASUREDOME/MESS 0.149 Software List CHDs# ls
[20:52] MESS_0.149_CHD_3do_m2 MESS_0.149_CHD_cdtv MESS_0.149_CHD_megacdj MESS_0.149_CHD_pippin MESS_0.149_CHD_segacd
[20:52] Now begins the fun.
[20:52] MESS_0.149_CHD_cd32 MESS_0.149_CHD_mac_hdd MESS_0.149_CHD_neocd MESS_0.149_CHD_psx MESS_0.149_CHD_vsmile_cd
[20:52] MESS_0.149_CHD_cdi MESS_0.149_CHD_megacd MESS_0.149_CHD_pcecd MESS_0.149_CHD_saturn
[20:53] godane: remind me of that torrent site?
[20:53] squid?
[20:53] myspleen
[20:53] ty
[20:53] my mind went blank
[20:53] I have infinite upload xD
[21:04] hah, torrentleach
[21:33] i'm uploading how to beat video games
[21:33] its from 1982
[21:35] also i may get the original encode version of ancient prophecies
[21:35] 3
[22:07] looks like anon is getting into the archiving business?
[22:07] http://www.slate.com/blogs/future_tense/2013/08/18/martin_manley_s_sister_asks_yahoo_to_put_his_suicide_website_back_up.html
[22:12] but.. doesn't anon forget?
[22:15] lol
[22:15] also, some interesting info tied to that mirror
[22:19] http://i.imgur.com/bLOcshV.jpg
[22:19] http://i.imgur.com/3FENs7f.jpg
[22:28] ha, yahoo took it down.
[22:28] I'm not surprised one bit.
[22:53] the mirrors appear to be missing this page http://webcache.googleusercontent.com/search?q=cache:SY4Clycsfk0J:martinmanleylifeanddeath.com/june_11_2012+&cd=1&hl=en&ct=clnk&gl=us
[22:59] http://archive.is/1rVuQ
[23:06] scribd has changed their post-login download page to make it look like you must pay to download something
[23:06] but if you look at the very bottom there's a tiny link to go to the tit-for-tat upload page
[23:06] scribd is so scummy
[23:06] in more than one way
[23:16] http://www.hackertyper.com/
[23:16] Put that in
[23:16] Hit F11
[23:16] Type madly
[23:16] INSTANT STREET CRED
[23:29] g4tv.com-video54924: PaxTest: http://archive.org/details/g4tv.com-video54924
[23:29] got to love the test videos of g4
[23:32] it gets good about 17 mins in
[23:38] uploaded: http://archive.org/details/How_To_Beat_Home_Video_Games_-_Vol.1_The_Best_Games_Vestron_1982
[23:42] uploaded: http://archive.org/details/How_To_Beat_Home_Video_Games_-_Vol.2_The_Hot_New_Games_Vestron_1982
[23:45] uploaded: http://archive.org/details/How_To_Beat_Home_Video_Games_-_Vol.3_Arcade_Quality_for_The_Home_Vestron_1982
[23:51] http://archive.org/movies/thumbnails.php?identifier=How_To_Beat_Home_Video_Games_-_Vol.1_The_Best_Games_Vestron_1982
[23:59] uploaded: http://archive.org/details/Wendys.Grill.Skills.1989.Other.Xvid-CG