[01:07] ivan`: where did you get that github-repositories.txt file?
[01:07] the gothub subcommittee posted it on IA recently
[01:08] http://archive.org/details/archiveteam-github-repository-index-201212
[01:09] Thanks
[01:09] * phuzion considers just starting that for the hell of it, to see how much space it ends up taking.
[01:27] some repos already got renamed or deleted
[01:27] does anyone have a done-in-10-seconds way to submit a WARC to the internet archive?
[01:27] from a headless server
[01:28] I have a WARC of the BBC site of a while ago
[01:28] balrog_: yeah, they will do that
[01:28] joepie91: you can cobble together something that uses curl to POST it to the s3 api
[01:29] :p
[01:29] right, but how would I do that, seeing as I'm entirely unfamiliar with the s3 api
[01:29] okay
[01:30] you need to get tokens and I'll give you a command line
[01:30] how do i get tokens?
[01:30] phuzion: let me know if you run out, I might have 3TB of space to do some of it
[01:31] ivan`: I've started on it, I'll let you know when it fills up my drive.
[01:31] joepie91: http://archive.org/account/s3.php
[01:31] okay, got them
[01:32] phuzion: you might want to run two in parallel since half the time github will be busy counting objects
[01:32] joepie91: curl '--header' 'authorization: LOW your-magic-token' '--header' 'x-archive-meta01-collection:opensource' '--header' 'x-amz-auto-make-bucket:1' '--header' 'x-archive-meta-noindex:true' --header 'x-archive-meta-(title|date|mediatype|language|etc): Value'
[01:32] Hmm... Perhaps I can figure out how to split the list into even and odd lines...
[01:32] or with xargs or parallel
[01:32] * ivan` looks it up
[01:33] yes, `parallel' is good
[01:33] magic token == secret key?
[01:33] joepie91: hold on don't run that yet
[01:33] yes, secret key
[01:33] you actually want to run it with these as well ...
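The header soup in ivan`'s curl line above can be sketched as a small helper. Header names ("LOW" auth, x-amz-auto-make-bucket, x-archive-meta-*) are as given in the chat; the function itself and the example metadata values are hypothetical, not part of any real client library.

```python
# Sketch of the IA s3-style headers the curl command above assembles.
# Keys come from http://archive.org/account/s3.php; the helper and the
# example metadata values are assumptions for illustration only.
def ia_s3_headers(access_key, secret_key, metadata):
    """Build the header dict for an archive.org s3-API upload request."""
    headers = {
        # IA uses its own "LOW <access>:<secret>" scheme, not AWS signing
        "authorization": "LOW %s:%s" % (access_key, secret_key),
        "x-amz-auto-make-bucket": "1",  # create the item if it doesn't exist
        "x-archive-meta01-collection": "opensource",
        "x-archive-meta-noindex": "true",
    }
    # each metadata field (title, date, mediatype, ...) becomes one
    # x-archive-meta-<name> header
    for name, value in metadata.items():
        headers["x-archive-meta-%s" % name] = value
    return headers

hdrs = ia_s3_headers("ACCESS", "SECRET", {"title": "BBC.co.uk WARC"})
```

Each returned pair maps onto one `--header 'name: value'` argument of the curl invocation.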
[01:34] curl -i '-#' ${args from above} --upload-file /dev/null "http://s3.us.archive.org/"$identifier
[01:34] this will give you a progress bar and stuff
[01:35] what is the $identifier?
[01:35] * joepie91 is confused now
[01:35] okay, let me ask it differently
[01:36] if I wanted to upload a warc.gz of the BBC.co.uk site named "BBC.co.uk WARC", and the filename was at-bbc.warc.gz
[01:36] what would the full command be to run (minus secret key, ofc)
[01:36] so that I get a bit of a better grasp on the syntax :p
[01:38] ivan`: I'm trying to figure out how to split the file in half, I want to do even and odd lines, but can't quite nail the sed syntax, you any good with sed?
[01:39] Wait, hang on, I might have gotten it
[01:39] no, I was busy trying to figure out how to do the subshell thing with parallel
[01:41] so did someone warc this yet, closing tomorrow http://japan.gamespot.com/
[01:41] Yeah, got it
[01:41] sed -n "1~2 p" github-repositories.txt > github-odd.txt and then sed -n "2~2 p" github-repositories.txt > github-even.txt
[01:46] joepie91: curl -i -'#' (all the --header options from above) --upload-file at-bbc.warc.gz "http://s3.us.archive.org/BBC.co.uk-warc"
[01:48] joepie91: make sense?
[01:48] you should write a description and stuff
[01:54] right... not much to describe though
[01:54] :P
[01:55] well, write where it came from, include the wget command line, etc
[01:55] maybe why you got it
[01:55] just a few sentences
[01:56] just use \n to insert a newline?
[01:57] I don't know tbh
[01:57] * phuzion predicts that his 1tb drive will be full tonight, thanks to cloning github repos
[01:57] phuzion: that sounds like a safe bet
[01:57] meh, don't have the command I ran anymore anyway :/
[01:57] heh
[01:58] how big would you think japan.gamespot.com should be?
[01:58] desc?
[01:58] what's the name of the description header?
[01:58] ummm
[01:58] Can someone take http://git.kernel.org/index.html and get all of the git:// links out of the page for me?
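phuzion's odd/even sed split above has a direct Python equivalent, which may be clearer if you need to shard a worklist across more than two workers. The function name and the sample repo list are made up for the example; the behaviour mirrors `sed -n "1~2 p"` / `sed -n "2~2 p"`.

```python
# Python equivalent of the odd/even sed split quoted above:
# sed -n "1~2 p" file > odd  /  sed -n "2~2 p" file > even
def split_odd_even(lines):
    """Return (odd_lines, even_lines), counting from 1 the way sed does."""
    return lines[0::2], lines[1::2]

odd, even = split_odd_even(["repo1", "repo2", "repo3", "repo4", "repo5"])
```

The same slicing trick generalizes to N shards with `lines[i::n]` for worker i.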
I wanna clone all of those as well
[02:00] joepie91: read the examples at http://archive.org/help/abouts3.txt
[02:02] phuzion: http://sprunge.us/XOBV
[02:02] ignore the first line
[02:03] rest should be valid
[02:03] joepie91: you sir, are a gentleman and a scholar
[02:03] Mind if I ask the wizardry you used to obtain such a result?
[02:07] sure, 1 sec :P
[02:08] bit of a nasty method, but
[02:08] http://pastie.org/5541069
[02:08] it does the job
[02:08] curl http://whatever | python gitlink.py
[02:09] the regex is extremely lazy though, and there's no guarantee that it'll work with other stuff :P
[02:09] plus I don't think it'll match more than one git:// url per line in the html file
[02:09] which is fine for this, but may not be fine for other things
[02:10] I would do curl http://whatever | sed -e 's/[" ]/\n/g' | grep ^git://
[02:10] chronomex: that won't work if there's other stuff on the same line, right?
[02:10] wait
[02:10] I see what you're doing
[02:11] :)
[02:11] that would break here though, if you don't include ) in your regex
[02:11] ok
[02:11] there was one that would break and have a ) at the end
[02:11] well, as usual, it requires tuning
[02:11] :P
[02:11] plus you'd have to add <
[02:11] in case it's mentioned as text
[02:11] well yes
[02:11] but you see where I'm going with it
[02:11] yes :)
[02:11] I'm horrible with sed and awk so I prefer python for these kind of things :P
[02:12] or you could do grep -o 'git://[-_A-Za-z./%0-9 etc]*'
[02:12] -o is only-matching-regions
[02:31] alard: tracker is back in swapsville http://zeppelin.xrtc.net/corp.xrtc.net/shilling.corp.xrtc.net/memory.html
[02:39] so i'm starting the mirroring of japan.gamespot.com
[03:27] and japan.gamespot.com is gone
[03:27] :/ did it get backed up?
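The `grep -o` approach ivan` suggests above translates to `re.findall` in Python, which also fixes joepie91's one-match-per-line worry. This is a sketch along the lines of the discussion, not the actual gitlink.py from the pastie; the character class is the "requires tuning" kind, stopped by `"`, `)` and `<` simply because they fall outside it.

```python
import re

# re.findall equivalent of: grep -o 'git://[-_A-Za-z./%0-9]*'
# The character class is a lazy approximation, as discussed above;
# quotes, ')' and '<' terminate a match because they are not in the class.
GIT_URL = re.compile(r"git://[-_A-Za-z./%0-9]*")

def git_links(html):
    """Return every git:// URL in the page, including several per line."""
    return GIT_URL.findall(html)

links = git_links('<a href="git://git.kernel.org/pub/scm/foo.git">foo</a> '
                  '(git://example.org/bar.git)')
```

Usage would be `git_links(urllib.request.urlopen(url).read().decode())`, or piping curl output into a script that reads stdin.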
[03:28] part of it did
[03:28] not much
[03:43] i'm uploading my warc for japan.gamespot.com right now
[03:43] just don't expect much
[03:50] :[
[03:55] uploaded: http://archive.org/details/japan.gamespot.com-20121216-mirror-incomplete
[03:55] we were not fast enough
[03:56] i'm grabbing stuff like fireflyfans.net before it needs a panic download in under 2 hours
[03:59] it was already starting to redirect to japan.cnet.com based on my wget.log
[04:03] chronomex: The tracker that you talk of, is that why I can't download github stuff?
[04:03] phuzion: no idea
[07:01] * chronomex currently stuffing some ftp grabs from last week into .zips
[09:50] chronomex: you uploaded some of these a while ago didn't you? https://archive.org/details/bellsystempractices
[09:50] are http://thepiratebay.se/torrent/5946997/Bell_Systems_Technical_Journals_(Full_Site_Rip) darkened or just not on archive.org?
[09:54] Nemo_bis: that is my collection, yes.
[09:56] chronomex: do you anything about the bell system technical journals then?
[09:56] do I what anything?
[09:57] I think you a word
[09:57] chronomex: The tracker likes to swap. We have too many large projects at the moment.
[09:57] yeah, that was my understanding
[09:58] *do you know
[09:59] GitHub is done now, so that will be going.
[09:59] I know some things about the BSTJ, yes?
[10:00] chronomex: about them being uploaded on archive.org
[10:01] don't
[10:02] http://archive.org/search.php?query=bell%20system%20technical%20journal hmmm this is bad
[10:03] maybe I should upload that 50G torrent
[10:03] orrrrr not?
[10:03] upload it from the lucent site
[10:03] maybe I'll do that tomorrow
[10:04] yeahhhh
[10:18] chronomex: yes you should :)
[10:18] unless Jason already did it?
[10:25] http://japan.gamespot.com/ is gone now
[10:26] I see godane managed to grab some of it
[10:28] If http://andriasang.com/ comes back online it might be nice to grab a copy as well.
I am not sure how much longer it will stay up
[10:29] Thank you for grabbing what you did, godane.
[11:58] If you want to pick some... (Or add suggestions; my proxy and myself got sick of browsing TPB. ;-) ) http://archiveteam.org/index.php?title=Magazines_and_journals
[13:05] hiker3: looks like the gamespotjapan twitter feed is gone too
[13:05] Hi! But isn't twitter archived automatically?
[13:06] don't know
[13:06] i just know that the account doesn't exist anymore
[13:06] Were you the only one grabbing the site?
[13:11] i don't know
[13:12] in least jason scott got it
[13:12] *at least
[13:12] when was it posted that it was going to be redirected to japan.cnet.com
[13:13] I came in here 3 days ago and mentioned it I think
[13:13] so i hope jason got the warning then
[13:13] i know i was not going to get all of it
[13:14] Is there any way someone can get http://andriasang.com/ if it comes back up?
[13:14] It's been having errors for a few weeks now, and the author has moved on to other things so I am not sure how much longer it will stay up.
[14:10] so looks like fireflyfans.net bluesunroom is very big
[14:41] chronomex: http://aarnist.cryto.net:81/data/at-trancenu.warc.gz
[14:41] a seemingly complete warc of trance.nu
[14:44] uploaded: http://archive.org/details/www.engadget.com-images-2006-mirror
[14:44] joepie91: What do you think, would all sources found for gopher be worth uploading?
[14:45] I mean the UMN gopher engine
[14:48] norbert79: I have no idea what that would entail, to be perfectly honest
[14:48] that was before my time :P
[14:50] Lot of old gopher code; it would mean, like I would say: old apache2 code :)
[14:52] ah, right
[14:52] sure, why not :P
[14:54] Is there a list of websites which have shut down but have archives from AT?
[14:56] i have a gopher plugin for firefox
[14:58] sigh, people packaging PDFs in NRG packaged in multifile RARs
[15:11] Looks like sharing isn't accessible atm
[15:12] i found some usenet dumps
[15:12] on gopher://telefisk.org/
[15:12] hiker3: I think it's on the archiveteam wiki
[15:12] the archive is up to like 2011
[15:13] godane: Telefisk is still an active gopher server
[15:13] yes
[15:13] from what i can tell
[15:13] godane: You could also add olduse.net to this too
[15:14] ah
[15:14] hiker3: http://archive.org/details/archiveteam
[15:15] godane: Wanted to upload Old Gopher Sources, connection died, now I can't use that keyword anymore, but am offered OldGopherSOurces_631
[15:15] godane: What now?
[15:15] Shall I ignore this?
[15:17] i'm downloading this stuff to be on the safe side
[15:19] norbert79: it looks like https://archive.org/details/OldGopherSources was created, so you ought to be able to go in and edit it
[15:20] DFJustin: Cheers, looks like both https://archive.org/details/OldGopherSources and https://archive.org/details/OldGopherSources_693 got created and got stuck again
[15:20] afaik olduse.net comes from data that is already on IA so no point in archiving it https://archive.org/details/utzoo-wiseman-usenet-archive
[15:21] DFJustin: About these pages, can I somehow remove them?
[15:21] no
[15:21] I wish to remove the second, aw crap
[15:21] it's not public yet so no big deal
[15:21] Ok
[15:22] DFJustin: What is the right choice for compressed source files?
[15:22] I am offered movie, audio and text
[15:22] and etree
[15:22] pick text and an admin can move it later
[15:23] cheers
[15:29] Done
[16:15] looks like fireflyfans.net stores the bluesun images using the file's md5sum
[16:41] anything else that needs wget-warcing?
[16:48] joepie91: are you open also to different suggestions?
:)
[16:48] that depends on what said suggestion is :P
[16:48] I put some on http://archiveteam.org/index.php?title=Magazines_and_journals
[16:50] Nemo_bis: I can't do torrents, though
[16:50] ah
[16:50] disallowed by the host that I'm using
[16:50] because it's very IO heavy
[16:51] :P
[16:51] :[
[16:51] see https://srsvps.com/terms.html
[16:51] even if you limit to a few connections at a time? ahh
[16:51] I understand that OVH is pretty lenient
[16:51] and is popular for seedboxes
[16:51] ya, but my only OVH box that I could use for this would be my kimsufi
[16:51] yeah
[16:52] :P
[16:52] and that one isn't supposed to do anything besides function as a testing box for my vps panel
[16:52] don't want to risk suspension or similar
[16:52] aww 503 Service Unavailable
[16:52] I have one other VPS on an OVH server, but if I start torrenting on that, encyclopedia dramatica will probably slow down to a crawl, since it's a backend server >.>
[16:53] there's an interesting NATO FTP site there that you could grab though
[16:53] ohhh?
[16:53] * balrog_ has been on the lookout for NATO documents
[16:53] well, certain specific ones having to do with speech codecs
[16:53] Nemo_bis: how large is it, approx?
[16:53] I have about 50G of space left
[16:53] joepie91: dunno, some dozens of GiB perhaps
[16:53] on this vps
[16:53] hmm
[16:53] I could do it partially
[16:53] ftp.rta.nato.int/PubFullText/AGARD/ or http://thepiratebay.se/torrent/7639843/AGARD_monographs_(_AGARDographs_) 15 GiB/453
[16:54] and parent folder
[16:54] what is the easiest way to mass-download from an FTP server?
[16:54] wget
[16:54] lftp
[16:54] I'd assume warc isn't suitable for this
[16:54] http://archiveteam.org/index.php?title=FTP
[16:54] ah, nice :P
[16:55] it doesn't really matter if you're doing a one-time pull, I like lftp for updating an existing mirror
[16:56] downloading...
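For the one-time FTP pull discussed above, a wrapper around wget might look like the sketch below. The host path is the one from the chat; the function name is made up, and the exact flag set is an assumption (a reasonable recursive-mirror invocation, not a command anyone in the channel vouched for).

```python
import subprocess

# Sketch of a recursive FTP mirror via wget, per the discussion above.
# Flag choices are assumptions; lftp would be the alternative for
# incrementally updating an existing mirror.
def wget_mirror_cmd(url):
    """Build the argv for a recursive wget mirror of an FTP directory."""
    return [
        "wget",
        "--mirror",                # recursion + timestamping
        "--no-parent",             # don't ascend above the start directory
        "--no-host-directories",   # skip the hostname path component
        url,
    ]

cmd = wget_mirror_cmd("ftp://ftp.rta.nato.int/PubFullText/AGARD/")
# to actually run it: subprocess.run(cmd, check=True)
```

Building the argv separately keeps the download restartable and easy to hand to `parallel` for several directories at once.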
[16:57] 140kb/sec
[16:57] :p
[16:57] not particularly fast
[16:57] KB*
[16:57] :<
[16:57] you'd think nato could afford a decent pipe
[16:57] oh, by the way, alard, are you here?
[17:04] Nemo_bis: I've started downloading the car and motorcycle manual torrents from my home connection (on my media server)
[17:04] :P
[17:04] it'll be slow, but it's something
[17:05] it'll download at 1.1MB/sec max, and upload at like 60KB/sec max
[17:06] i found an amiga virus collection
[17:07] haha
[17:08] joepie91: ok, upload with the bulk uploader, you know how?
[17:09] Nemo_bis: no idea, and the standard uploader on the archive.org site says it doesn't work properly on unix-based systems
[17:09] honestly, archive.org needs some kind of software to do uploads
[17:09] easily
[17:09] including the whole tagging etc
[17:10] https://wiki.archive.org/twiki/bin/view/Main/IAS3BulkUploader
[17:12] whoa, I did not know that existed
[17:20] joepie91: if your upload bandwidth is so little, perhaps you chose too big a torrent :)
[17:20] but let's see
[17:21] nah, I'll just have patience :P
[17:21] that server runs 24/7 anyway
[17:21] plus I'll probably upload stuff separately
[17:26] this item needs some help: www.engadget.com-images-2007-mirror
[17:34] joepie91: i mount the ftp with curlftpfs and then use rsync
[17:34] schbiridi: you mean the archive.org FTP upload?
[17:35] according to the info page that's not recommended because of bandwidth
[17:53] nah, for mirroring FTP servers
[17:54] sorry :D
[17:54] schbiridi: I usually use lftp here, or wget
[17:57] the wonders of interoperable systems: everyone can use any flavour of software one likes most ;)
[18:00] ahh
[18:01] i find rsync the most versatile
[18:53] joepie91: Yes?
[19:06] alard: I have something that may be of use to you for future projects
[19:06] I wrote a self-extracting python script thingie
[19:06] I'm using it for the installer for my VPS panel, but it may be useful for stand-alone versions of crawlers etc as well
[19:07] https://github.com/joepie91/cvm/tree/develop/tools/pysfx
[19:07] it doesn't have its own repo yet (it will soon, though)
[19:07] example usage: https://github.com/joepie91/cvm/blob/develop/installer/build.sh
[19:07] end result is a single .py that you can run, it'll extract itself to a temp dir, and run the specified command
[19:49] blip.tv is a serious risk given "@richhickey says @skelter Blip doesn't want conference vids, tech talks etc, and gave us 2 weeks to move."
[19:49] I have all the Clojure videos, don't worry about those
[19:58] oh really?
[19:58] I thought blip.tv has the HOPE video
[20:00] I also have all of http://blip.tv/linuxconfau
[20:04] going to grab http://blip.tv/linux-journal and other things I can find on google
[20:06] (note: my upstream is terrible and I have no real backups)
[20:07] joepie91: Ah, that's something to remember. Similar to py2exe, but for Linux?
[20:08] alard: more similar to a 7zip sfx with autorun, I'd say, but for Linux :P
[20:09] it doesn't include dependencies etc, it just works with whatever tar.gz you give it
[20:09] you could theoretically pack up something entirely non-python with it
[20:09] as it will just run whatever command you specify, but with working directory set to the temp extraction directory
[20:13] what's our preferred piratepad?
[20:13] piratepad?
[20:16] joepie91: OK.
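The pysfx idea joepie91 describes above (a single .py that unpacks an embedded tar.gz to a temp dir and runs a command there) can be sketched roughly as below. This is a guess at the approach from the chat description only, not the actual code in the linked cvm repo; the stub layout and function name are invented.

```python
import base64
import textwrap

# A minimal sketch of a pysfx-style builder, as described above:
# embed a tar.gz (base64-encoded) in a stub script that extracts it
# to a temp dir and runs the packed command with cwd set there.
STUB = textwrap.dedent('''\
    import base64, io, subprocess, sys, tarfile, tempfile
    PAYLOAD = "{payload}"
    CMD = {cmd!r}
    tmp = tempfile.mkdtemp(prefix="pysfx-")
    with tarfile.open(fileobj=io.BytesIO(base64.b64decode(PAYLOAD))) as tf:
        tf.extractall(tmp)
    # run the packed command with the extraction dir as working directory
    sys.exit(subprocess.call(CMD, cwd=tmp))
    ''')

def build_sfx(tar_gz_bytes, cmd, out_path):
    """Write a single self-extracting .py wrapping tar_gz_bytes.

    cmd is the argv list to run after extraction; the payload need not
    contain Python at all, matching the "entirely non-python" point above.
    """
    payload = base64.b64encode(tar_gz_bytes).decode("ascii")
    with open(out_path, "w") as f:
        f.write(STUB.format(payload=payload, cmd=cmd))
```

Base64 keeps the payload printable so the result stays a plain-text .py; the real tool presumably handles larger archives more frugally.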
[20:24] if anyone is really interested in blip.tv I can provide a youtube-dl patch and start listing channels in piratepad
[20:25] otherwise I'll just continue sucking things down at 1M/s and hope the 2 weeks only apply to hickey
[20:35] surprised to see a lot of great content, site must have terrible googlejuice
[21:38] SketchCow: so i have the bluesunroom of fireflyfans.net
[21:38] 2.1gb warc.gz with 11000+ images
[22:11] Goodness
[22:13] I think hank briefly killed the site (or so) earlier today, maybe with this https://archive.org/~tracey/mrtg/derives.html
[22:16] SketchCow: did you grab japan.gamespot.com?
[22:16] i only grabbed 90mb
[22:18] SketchCow: I uploaded the GitHub files, see http://archive.org/details/github-downloads-201212-part-a (to -z and -0 to -9)
[22:23] So this is the after-we-fixed-the-bugs thing?
[22:24] Yes.
[22:32] Fantastic.
[22:32] I think I put this in software.
[22:36] ivan`: I'm interested in pulling down the blip.tv stuff and I've got a good pipe here - the latest released version of youtube-dl seems to work fine with blip - any special options to use?
[22:38] dashcloud: yes, you need a patch to get the Source/720p content
[22:38] sec
[22:41] just ping me with it, and I'll get to it sometime tonight
[22:43] http://archive.org/details/github-downloads-2012-12
[22:48] Don't do mediatype data, do mediatype software
[22:50] "data" is the default, I think. (I'm not even sure if non-admins can upload anything but the default type, but I haven't tried that.)
[22:50] non-admins can upload anything
[22:56] Anyway, it's all set now
[23:01] dashcloud: http://piratepad.net/R18h7lKV1N has the patch and some channels
[23:02] it's possible that the clojure channel got specifically targeted for using up too much of their bandwidth or something, but blip.tv still seems careless
[23:12] ivan`: conferences don't seem a terribly good fit for blip as it is now - conferences happen once a year, and blip is geared toward episodic-type content (weekly/biweekly/monthly shows)
[23:12] so i got an account at astraweb.com
[23:13] only got the $10/25gb credit
[23:13] holy crap - that's weird watching text suddenly show up on the page
[23:13] just to test if i can get an episode of attack of the show without missing parts
[23:20] dashcloud: heh, you're unfamiliar with etherpad?
[23:20] it's basically multiplayer notepad :P
[23:21] I've never used one before
[23:21] did you run the git-annex kickstarter?
[23:22] git-annex kickstarter?
[23:22] good news everyone
[23:22] i may be able to save more aots
[23:35] also most of my engadget dumps are uploaded
[23:35] will do a 2012 year dump sometime next year
[23:41] oh hey http://blip.tv/acquia and its videos got nuked too
[23:41] whatever it was. not that I'll have any idea now.