[00:00] there's another good wget command here: http://www.archiveteam.org/index.php?title=User:Djsmiley2k
[00:02] dashcloud: yeah, need to make sure people are downloading different things. might require the Warrior
[00:05] balrog: I saw your request earlier about an autoloader for dumping CDs- Google shopping turns up this: http://www.bizchair.com/rx100pc-rex.html $495
[00:07] DoubleJ: my download just finished- much smaller site than I thought it would be
[00:08] dashcloud: Mine's done, too. 18 MB or so?
[00:10] Is swizzle here? No, guess not.
[00:11] hey jason
[00:14] yeap
[00:15] Hey there.
[00:15] (You're not swizzle)
[00:16] so, just upload gont.com.ar.warc.gz to the community texts collection, and you'll take it from there?
[00:17] dashcloud: I'm checking the subpages to make sure we're not missing anything. He has a blog site that I'm downloading now.
[00:19] dashcloud: Yes
[00:22] here it is: https://archive.org/details/Gont.com.ar.warc
[00:29] uploaded: https://archive.org/details/ftpsites_arcade.demon.co.uk_2013.06.17
[00:32] uploaded: https://archive.org/details/arcade.demon.co.uk-20130617
[00:35] ivan`: did you start a download of http://misc.yero.org/modulez/ ?
[00:36] dashcloud: I WARCed the site; Smiley went to download some of the music linked within
[00:36] jfranusic asked earlier, and i'm wondering myself: to upload to archive.org do you just go to http://archive.org/upload/ ? is there a specialized way for AT people?
[00:36] dashcloud: we ran into a wget bug that causes a segfault
[00:37] I didn't know, so I started a download of the site
[00:38] you can use the upload page, or there's a bulk upload script floating around somewhere for larger collections
[00:39] dashcloud: thanks
[00:39] jfranusic: i'll hunt around for that script for you
[00:40] ivan`: I'm downloading the ftp://ftp.scene.org/pub/mirrors/scenesp.org/ bits- did you already get all of these?
[00:41] arrith1: cool, thanks
[00:43] jfranusic: found these: https://github.com/kngenie/ias3upload and http://askubuntu.com/questions/32763/script-to-upload-to-internet-archive-archive-org
[00:43] jfranusic: kind of lower level: http://archive.org/help/abouts3.txt
[00:44] i would think there would be a python script
[00:44] "The intended users of this script are Internet Archive users interested in uploading batches of content alongside per-item metadata in an automated fashion."
[00:45] the actual uploading part isn't what I'm worried about
[00:45] jfranusic: this might be overkill for your purposes, but maybe a starting point: http://www.archiveteam.org/index.php?title=The_WARC_Ecosystem#Archive_Team_megawarc_factory
[00:46] I'm just wondering if I can just start uploading files willy-nilly
[00:46] or if I need to follow some sort of procedure
[00:47] jfranusic: hm, good question. ping dashcloud ?
[00:49] dashcloud: no
[00:49] I've never used it, but the ias3upload script defines the metadata you need to provide
[00:50] jfranusic: we're pretty low on formality
[00:50] regular users can only upload to a couple of specific areas on the site, so someone higher up has to sort it later anyway
[00:50] ah! okay
[00:50] and, what are those areas?
[00:51] they're listed in the drop-down on the upload form; the main ones are Community Texts, Community Videos, and Community Audio
[00:52] eventually most of it goes into the Archive Team collection https://archive.org/details/archiveteam
[00:52] warcs usually get dumped in community texts to start because there's no ready-made community web collection
[00:54] we have lots of people uploading random stuff beyond just website grabs though, e.g. historically interesting videos which will just stay in community videos
[01:56] ah, cool, thanks DFJustin
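For reference, the S3-style API in the abouts3.txt doc linked above boils down to one authenticated HTTP PUT per file, with item metadata passed as headers. A rough sketch only: the keys, identifier, and filename below are placeholders (keys come from your archive.org account settings; "opensource" is the internal collection name behind Community Texts):

  # sketch only - ACCESS/SECRET, the item identifier, and the filename are placeholders
  ACCESS=YOUR_IA_ACCESS_KEY
  SECRET=YOUR_IA_SECRET_KEY
  curl --location \
       --header "authorization: LOW $ACCESS:$SECRET" \
       --header "x-amz-auto-make-bucket:1" \
       --header "x-archive-meta01-collection:opensource" \
       --header "x-archive-meta-mediatype:texts" \
       --header "x-archive-meta-title:Example Site grab" \
       --upload-file examplesite-20130617.warc.gz \
       http://s3.us.archive.org/examplesite-warc-20130617/examplesite-20130617.warc.gz

The ias3upload script mentioned above is essentially a batch wrapper around the same calls, reading the per-item metadata from a spreadsheet instead of headers.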
[02:08] I'm cleaning the software collection
[02:11] arrith, I recommend the fork https://github.com/kimmel/ias3upload - it has more documentation and a few bug fixes
[02:12] Also the readme has a nice section on the metadata fields
[02:14] SketchCow, what do you mean by clean?
[02:14] The front page's a mess.
[02:14] Not informative, a lot of mess.
[02:14] Dead projects, poor ideas
[03:01] so i found out one thing about the glennbeck highlights
[03:01] and maybe the mlb highlights too
[03:01] the last number is always an odd number
[03:02] omf_: nice, thanks
[03:02] 0, 2, 4, 6 and 8 are never used
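If anyone wants to act on that observation (highlight IDs apparently never end in an even digit), it roughly halves the enumeration. A sketch of the idea only; the base URL and ID range here are invented, since the real pattern isn't in the log:

  # BASE and the range are placeholders - substitute the real highlights URL pattern
  BASE="http://example.com/highlights"
  for i in $(seq 1 999); do
      case "$i" in
          *[02468]) continue ;;   # skip IDs ending in 0/2/4/6/8, which reportedly never exist
      esac
      echo "$BASE/$i"
  done > candidate-urls.txt
  # then feed the list to wget, e.g.:
  # wget -i candidate-urls.txt --warc-file=highlights-$(date +%Y%m%d) --warc-cdx -o wget.log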
[03:25] hi, my grab of misc.yero.org/modulez has finished without issues: I used wget -e robots=off -r -l 0 -m -p --wait 1 --warc-header "operator: Archive Team" --warc-cdx --warc-file misc-yero-org http://misc.yero.org/modulez/
[03:25] did I miss something or do something different than other people?
[03:29] I'm expecting my download of ftp://ftp.scene.org/pub/mirrors/scenesp.org/ to finish up overnight
[03:31] Hi, I just spent a few hours completely redoing the Software collection of archive.org.
[03:31] http://archive.org/details/software
[03:32] It will probably take overnight for the scripts to completely redirect everything, but now we have things described, and I'm going to begin the process of putting vintage software in a proper place (vintagesoftware) instead of scattered to the four winds, etc.
[03:32] computerbooks has a bad url
[03:33] fixed!
[03:34] also someone should mirror this site: http://www.dream17.info/
[03:34] since it has a lot of amiga PD ware
[03:36] also you guys should know the 37 maximum pc cds are still up: http://www.ebay.com/itm/Lot-of-37-Maximum-PC-CDs-First-Issue-Old-Shareware-Software-1998-2003/231001552845
[03:38] i'm getting this one: http://www.ebay.com/itm/18-Maximum-PC-and-Linux-Magazine-Demo-CD-Discs-/231002003264?pt=US_Wholesale_Software&hash=item35c8cad740
[03:38] comes with 18 demo discs and is only $10
[03:39] I can't quite afford buying stuff right now.
[03:40] i just hope this one goes like my cnn cd bid went
[04:07] mistym: oh hi
[04:08] winr4r: Hi!
[04:23] SketchCow: the link to open source software is broken, should be open_source_software
[04:23] good job cleaning all that up :D
[04:24] I'm thinking of adding a "datasets" collection
[04:24] For things like the Internet Census and Twitter downloads and all that crap.
[04:25] ideally that collection should be titled something like "Community Software" though, because most of it is not actually OSS
[04:26] These changes will happen in waves.
[04:26] much like "opensource" has been retitled "Community Texts"
[04:26] I agree, but that one I have to get clearance for
[04:27] natch
[04:28] yeah, would be nice to get stuff like 301works off the "this just in" list
[04:28] Well, that won't happen.
[04:29] This Just In is a mess, the way it's done.
[04:29] It basically adds anything with a software mediatype
[04:29] there is https://archive.org/details/data but it's rather neglected
[04:29] Wow, that's a mess
[04:29] :)
[04:30] that's another mediatype wildcard thing, it's all s3 uploads with no mediatype set and stuff
[04:31] with the occasional proper thing like https://archive.org/details/BrownCorpus
[04:32] Made datasets
[04:35] Put some things into it
[04:49] SketchCow: Damn, /software looks good!
[05:43] Any archiveteamers that live in the bay area who'd be interested in touring IA and/or hanging out sometime?
[05:43] It's kinda boring here with nobody to do things with after hours x3
[05:44] Ha
[05:44] We'll work together to find you things
[05:45] I'm sure underscor could find SOMETHING to do
[05:45] I mean, there's the internet to chat with people
[05:45] and always more work things
[05:45] BlueMax: :P
[05:46] I mean, I can wander to things, but again, kinda sucks alone
[05:46] why don't you get people on Skype or something?
[05:46] I'd be up for that
[05:46] :o cool
[05:46] add me!
[05:46] alex.buie.kwd
[05:46] You should clean your space
[05:46] I'm on the phone with my mom right now though
[05:46] SketchCow: I did. It's much better.
[05:47] That was embarrassing.
[05:47] (but entirely my fault)
[05:47] oh underscor you're so messy
[05:49] * underscor giggles innocently
[05:49] morning folks
[05:51] best hour and a half of sleep evar
[05:52] morning lame windows program :P
[05:56] how does it feel for nobody to pay for you
[05:59] hah
[06:03] by the way, if anyone likes backing up FTP sites and has a very fast connection: ftp.hp.com (godane)
[06:04] i was looking for some tru64 patch or other and it literally took me 12+ hours to generate a list of the files
[06:07] i think that one will be for someone else
[06:07] only cause i think it's too big for me
[06:24] i forgot that i have a braingames collection: http://archive.org/search.php?query=braingames%20AND%20collection%3Aopensource_movies
[06:25] underscor: heck yeah, east bay here
[06:25] and nerds 2.0.1: http://archive.org/search.php?query=subject%3A%22Nerds+2.0.1+-+A+Brief+History+of+the+Internet%22
[06:25] arrith: didn't know i could wander in, i'm so down for that
[06:26] and triumph of the nerds: http://archive.org/search.php?query=subject%3A%22Triumph+of+the+Nerds%22
[06:33] arrith1: yeah, totally
[06:33] we have open catered lunch and tours on fridays at noon
[06:33] you should come by on a friday you're free
[06:34] underscor: then like a big box store i'll hide in a corner until they turn off the lights :P
[06:34] I'll show you around and stuff, and we could hang after or something
[06:34] arrith1: I'm here 24/7! Nice try!
[06:34] underscor: more time to hang out i mean haha :)
[06:34] Lights are all off right now, though, actually. Kinda weird to be at my desk in the dark
[06:34] (I could turn them on, I just like it dim)
[06:35] underscor: and yeah that sounds awesome. would be super fun to hang
[06:35] ever since i got a dimmer for my office lights it's basically dim always
[06:36] the archive work area is just a big open room with a bunch of table clusters
[06:36] so it's either "dark" or "bright as fuck with 300 watt overheads"
[06:36] hm nice. sounds like a hackerspace almost
[06:36] haha
[06:36] It's *very* much like a hackerspace
[06:36] iirc hacker dojo is like that with the lighting
[06:36] Lots of tables, ethernet drops, couches for people to just chill at, talk, hack on stuff
[06:37] their presentation room is super dark, or you can barely see the projected image
[06:37] haha
[06:37] wow nice. just needs some soylent feeding tubes and it'd be all you need
[06:40] hahaha
[06:40] we have a coffee robot
[06:40] if that counts
[06:41] delivers to your table?
[06:41] one nice thing about sf is good food really isn't ever too far away
[06:43] arrith1: unfortunately not, although that would probably be embraced by everyone here
[06:43] It's just one of those really fancy coffee/espresso/cappuccino/everything machines
[06:43] you put a cup on one arm, and science happens
[06:43] coffee... robot?
[06:43] and you end up with a cup of caffeinated sludge
[06:43] oh!
[06:44] underscor: ahh, one of those super basic "robots"
[06:44] underscor: personal delivery of pizza and/or beer is one of the primary goals of the hercules robot project at hacker dojo, so depending on how successful they are, that work could be re-purposed
[06:45] though with hot coffee it better have some safety features..
[06:45] hahahaha
[06:45] that would be awesome
[06:45] well, a pizza to the face would probably hurt too
[06:45] also, should move to -bs
[06:45] oops
[06:46] no big, I started it XD
[06:59] common crawl says they have 5 billion URLs, but downloading everything with common_crawl_index gets me 2,412,755,840 URLs
[06:59] the last two are
[06:59] zw.org.zwrcn.www/women-voice-blog/view-topiclist/forum-1-women-discussions.html:http
[06:59] zz_seay662TT/indexb.php:0223744@gothicundine.com:http
[07:00] I have a 22GB bz2 in case anyone wants it
[07:05] http://204.12.192.194:32047/common_crawl_index_urls.bz2 wrong mimetype, don't open it in your browser, sha1sum c296782cf01fa4f4e111f58a3b02200d3a475d24
[07:11] heh https://github.com/trivio/common_crawl_index/issues/12
[08:44] could a few people run https://github.com/ArchiveTeam/greader-directory-grab please?
[08:44] getting 14 items/min, need about 30
[08:44] --concurrent 2-4 should be fine
[08:45] ping GLaDOS underscor
[10:12] winr4r: did you get anybody biting on ftp.hp.com? that could be useful - especially with VMS getting EOL'd at the end of 2016...
[10:12] ... I've found useful VMS patches on the FTP site before that otherwise would have required paying $$$ for
[10:13] ... and I have bandwidth on various machines if we could divvy up the work somehow.
[10:14] i'd be happy to help out somehow, don't think i can do the entire site but i can definitely do a portion :)
[10:19] hopefully winr4r still has the file list that took 12+ hours to generate, so we can get some idea of the scale and how best to carve it up.
[10:20] I'd take a look myself but I have customers screaming at me (thankfully not about VMS today) so I need to go and do some work :-/
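As an aside on sizing things up before carving the site into portions: a recursive listing is much cheaper than a full mirror. A possible sketch using lftp (the client that comes up below); the ftp1/pub/ path is taken from the listings discussed later, and the listings this server returns are apparently not always consistent:

  # sketch: dump a recursive file list of one subtree into a local text file
  # (lftp's find walks the tree and prints paths)
  lftp -c "open ftp://ftp.hp.com/; find ftp1/pub/" > ftp1-pub-filelist.txt
  # rough size of a subtree, if the server cooperates and your lftp build has du
  lftp -c "open ftp://ftp.hp.com/; du -hs ftp1/pub/"

Splitting at the top-level directories of such a list would be one way to divvy up the work across people, as suggested above.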
[10:31] Baljem: how big is it?
[10:32] I thought I/we had already downloaded it
[10:32] absolutely no idea, I'm afraid, but winr4r made it sound pretty big. which makes sense, given the size of HP
[10:32] oh, in that case... job done ;)
[10:33] but I can't find it
[10:33] maybe https://archive.org/details/ftp-ftp.rta.nato.int is the only one I did
[10:34] I remember playing with KIOslaves and html files in HP FTP though
[10:35] just playing with a tool for doing a recursive ls, let's see what happens
[10:35] Baljem: i don't even have the list anymore, i didn't think it'd be useful and it just occurred to me that it might not be archived, so, lesson learned
[10:37] i'd save it myself, but 1) 8mbit connection 2) 20gb monthly bandwidth cap
[10:38] aww
[10:40] (and $3 for each gigabyte over that, so you know, not sure i like it enough to spend like $1000 on it)
[10:49] ftp1/ is 241.8 GiB
[10:50] Nemo_bis: did you seriously find that out in like 20 minutes?
[10:50] sure
[10:50] because i wasn't kidding about it taking me 12+ hours
[10:51] I told you, KIOslaves
[10:51] oh!
[10:51] dude, ftp.hp.com is going to be walking funny for *weeks* after that
[10:52] nah, I doubt they care
[10:54] :)
[10:54] ftp2/ 6000 subdirs and counting
[10:55] Smiley: https://ludios.org/tmp/0001-fix-segfault-in-ftp.c-ftp_loop_internal.patch from the bug-wget list
[10:55] I find 8857453594 files
[10:56] ...
[10:56] that seems... hmm. let me double-check my perl!
[10:56] yeah, something's fucked there, there are only 1.1 million lines in the listing I generated
[10:56] ^^
[10:57] haha, 8 billion
[10:58] oh, duh, that's the size line I was adding up, not the file count
[10:58] so why is that smaller than 241.8 GiB? obviously this recursive FTP tool is not great.
[11:00] good grief, ftp1/pub/all_in_one - they kept all the ancient DEC stuff. nice.
[11:07] it was only the ftp1/ dir
[11:07] ftp2/pub/ at 10k subdirs and counting
[11:13] doing the same recursive listing a second time seems to be taking a lot longer. weird. will see what this comes out with in the end
[11:16] I sometimes get no permission to access a directory that later works, I don't know why :)
[11:17] Baljem: how much disk space do you have?
[11:19] I'm thinking nowhere near enough, unfortunately
[11:19] I need to shift some VMs around on the work cluster, but even then I think I can only free up a hundred gig or so
[11:20] and I've just realised that 8.8-billion number I quoted was in traditional 512-byte blocks. so that's 4TB... gonna need a bigger boat
[11:22] you sure?
[11:22] not in the slightest. I'm presuming this tool is doing the right thing, which may not be wise seeing as the second listing is taking so much longer
[11:23] ftp2/pub is only 10,809 subdirs, 29,421 files and 388.7 GiB here
[11:24] ah, it followed symlinks, blast
[11:27] how does one tell wget not to?
[11:36] no idea, I'm afraid, using an FTP client called lftp instead
[11:36] but the results it's giving me are wildly different from yours so I suspect I'm doing something wrong
[11:40] Baljem: Using lftp and 'rels -lR bin' I'm getting slightly different timestamps every time, and once the file sizes were wrong too
[11:40] Seems like the server's lying :-P
[11:46] you guys are awesome
[11:57] it's entirely possible HP's actually adding new stuff on an ongoing basis
[11:59] The timestamp of the root directory ('..' in bin/) had varying values, all in 2011-2012 and in random order
[11:59] If that's due to them adding stuff then that's pretty weird
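For actually grabbing a claimed chunk, the same wget-warc approach used for the arcade.demon.co.uk FTP grab earlier should work per subtree; a sketch with a placeholder path. Two caveats from the conversation above: the FTP+WARC segfault means you may need the patch from the bug-wget list, and, per the wget manual, recursive FTP retrieval does not descend into symlinked directories (the thing that inflated the lftp listing):

  # placeholder path - substitute whatever subtree you claimed
  wget -m --no-remove-listing \
       --warc-file=ftp.hp.com-ftp1-pub-$(date +%Y%m%d) \
       --warc-cdx \
       -o ftp1-pub-wget.log \
       "ftp://ftp.hp.com/ftp1/pub/"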
[12:00] SketchCow: https://archive.org/details/MiscYeroOrg.warc is up- it's just a wget-warc of the website. The music is hosted on the artist's site
[12:41] i'm trying to grab all non-ftp music files
[13:54] https://twitter.com/KimDotcom/status/347342896174866433
[13:54] "#Leaseweb has wiped ALL #Megaupload servers."
[14:28] ReLeaseWeb rather
[15:14] ivan`: Running --concurrent 4
[15:16] thanks
[15:17] going to add some non-english words there soon
[16:28] ivan`: you any good with bash?
[16:30] probably, what do you need?
[17:00] https://github.com/djsmiley2k/smileys-random-tools/blob/master/get_xanga_users fix this? XD
[17:09] * ivan` looks
[17:35] dashcloud: Thanks.
[17:44] Smiley: did you see the wget patch
[17:44] Smiley: it doesn't crash on ftp with it
[17:45] sweet
[17:45] i fixed my bash issue
[17:45] SketchCow: I've got a script running now which is collecting as many usernames as possible
[17:45] Already done a few hundred to test. works nicely
[17:45] and i've basically learnt awk... to the point of usability, doing it.
[17:53] * winr4r salutes Smiley
[17:53] takin' one for the team
[17:55] :D
[17:55] i feel EPIC
[17:58] pretty sure you can actually shoot lightning from your fingertips now man
[18:49] ^
[18:51] found something: http://www.youtube.com/user/EverySteveJobsVideo
[18:56] hahaha
[19:00] Thanks, Smiley
[21:10] no worries SketchCow
[21:27] Cool, a new burning building.
[21:35] XANGA project guys - DO IT :)
[21:35] Already on it.
[21:35] "Xanga is getting old. Archive Team investigates." That's a pretty strange description of the project.
[21:35] I think it's well beyond investigation now that you're grabbing stuff.
[21:38] i found out that a friend of glenn beck died: http://www.glennbeck.com/2013/06/19/glenn-remembers-his-good-friend-author-vince-flynn/
[21:38] i'm mirroring vinceflynn.com
[21:39] I can't change the description of the project namespace
[21:39] Smiley: *shrug* Later then.
[21:40] godane: Yeah, I actually wanted to talk about small sites.
[21:40] I'm really impressed with the work that's been done saving these enormous spires of burning documents.
[21:41] But if I wanted to go archive some smaller sites on my own, where would I start?
[21:41] (Tools, etc.)
[21:41] i use wget
[21:41] my code: wget $website --mirror --warc-file=$website-$(date +%Y%m%d) --warc-cdx -o wget.log
[21:41] namespace: you go to my wiki user page
[21:41] it has a default "wtf save this site!" code
[21:42] Yes, I'm attempting to be as awesome as Jason.
[21:42] http://archiveteam.org/index.php?title=User:Djsmiley2k
[21:42] also thanks to godane for being the source of that original command :)
[21:44] Smiley: I'd prefer a nice document to let me know how to use wget.
[21:45] I mean, there are man pages, but that's like saying that my desktop came with a manual.
[21:45] (Actually, to be fair, man pages are probably more useful than the 'manual' that comes with most computers.)
[21:46] namespace: I'll think about it, but tbh I know nothing about wget
[21:46] Smiley: Be smart, somebody else probably already wrote one.
[21:46] Instead of writing a mediocre one, try finding somebody else's great one.
[21:47] (I'm working on it right now BTW.)
[21:48] Oh, it's GNU, nevermind then, the documentation is probably excellent.
[21:48] https://www.gnu.org/software/wget/manual/wget.html
[22:02] the wget manpage is pretty good. though there might be some undocumented features, for example erobots for handling robots.txt isn't in the manpage, i'm not sure of others
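Expanding the one-liners above for anyone in namespace's position: this is the same wget-warc invocation godane and Smiley describe, with the flags spelled out. example.com is a placeholder, and the politeness settings are a per-site judgement call:

  # -m             --mirror: recursion + timestamping + infinite depth
  # -p             --page-requisites: also fetch the CSS/images/JS pages need to render
  # -e robots=off  ignore robots.txt exclusions (use judgement)
  # --wait 1       pause a second between requests to go easy on the server
  # --warc-file / --warc-cdx: write a WARC of everything fetched, plus a CDX index
  wget -m -p -e robots=off --wait 1 \
       --warc-header "operator: Archive Team" \
       --warc-file="example.com-$(date +%Y%m%d)" \
       --warc-cdx \
       -o wget.log \
       "http://example.com/"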
[22:03] would be nice to have a submission process to the ArchiveTeam Warrior for smaller sites. could run a basic wget warc command on the clients. maybe have manual review before the grab happens, but the submission process could be automated
[22:03] and i really should learn awk
[22:04] uploaded: https://archive.org/details/www.vinceflynn.com-20130619
[22:21] nice
[22:25] i'm doing another mirror of torrentfreak.com
[22:36] balrog: still looking for a CD autoloader?
[22:36] dashcloud: yeah
[22:36] did you see this one I posted yesterday? http://www.bizchair.com/rx100pc-rex.html
[22:39] yes
[22:39] they don't make it anymore
[22:54] so, they don't let you check out with it, but you can put it in your cart?
[23:45] to test going up to checkout on a site without placing the order, fakenamegenerator.com is good. it provides fake credit card numbers to enter to get past the payment step
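On the submission-process idea above: the per-site part really is just a thin wrapper around the wget command shown earlier. A purely hypothetical sketch (script name, naming scheme, and all details invented here) of what each client could run per submitted URL:

  #!/bin/bash
  # grab-site.sh <url> - hypothetical wrapper: derive a WARC name from the URL, then grab it
  set -e
  url="$1"
  # crude identifier: strip the scheme, replace awkward characters, append the date
  name="$(printf '%s' "$url" | sed -e 's#^[a-z]*://##' -e 's#[/:?&]#_#g')-$(date +%Y%m%d)"
  wget -m -p -e robots=off --wait 1 \
       --warc-header "operator: Archive Team" \
       --warc-file="$name" \
       --warc-cdx \
       -o "$name.log" \
       "$url"

Manual review could then just be someone eyeballing the submitted URL list before the wrapper runs, as suggested above.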