[00:00] I'll try to delete that folder and run get-wget-warc.sh again. How do we delete things? -_-' (In MS-DOS it's del, but when I type that it says unknown command.)
[00:01] first look at the files in the directory
[00:01] use ls -l
[00:01] marceloan: rm
[00:02] marceloan: This may be helpful: http://www.yolinux.com/TUTORIALS/unix_for_dos_users.html
[00:02] Thanks, it'll help ?)
[00:02] :)
[00:07] No problem! Welcome to the bird side. ;)
[00:08] "A Linux terminal emulates an emulation of a terminal of an old DEC VT100, and somehow that's still FAR better than the Windows command prompt."
[00:09] ERROR contacting tracker. Could not mark '4Boa' done.
[00:09] Hey guys
[00:10] Hallo
[00:11] Well, it could download the file but couldn't contact the tracker...
[00:12] Btw, since Friendster I haven't been hanging around IRC much, and I'm not sure how to get word of new projects
[00:12] Anyone know the best way to "receive the batsignal" when there's something going down?
[00:13] I've tried @archiveteam on twitter, but I dunno if I'm missing things
[00:14] IRC's all I know of. @archiveteam seems to be what's been done, not what's going on.
[00:15] Hmm, well I guess I'll just try to pop in every now and then
[00:15] It'd be great if there was a mailing list or something
[00:15] indeed
[00:15] we've got three projects going at the moment
[00:16] Splinder, Anyhub... and?
[00:16] mobileme
[00:16] Would the /topic be a good place for these?
[00:16] http://memac.heroku.com/
[00:17] Possibly. Still has the "being on IRC" problem, though.
[00:17] haha, oh great, mobileme. thanks, apple
[00:17] DoubleJ: IRC logs help, no?
[00:17] DoubleJ: i think it'd still help
[00:18] Wyatt|Wor: Apparently Qwerty0 hasn't been reading the logs, either :)
[00:19] nerp
[00:19] Qwerty0: Better than nothing, yes. But if someone's not on IRC for a while they can miss a lot. And depending on the client, /topic only shows ~75 characters.
[00:19] I didn't think that'd be a good way to quickly find out the active projects
[00:19] To be fair, logs are rather hard for humans to parse without grep
[00:20] DoubleJ: oh totally, it's not the best solution
[00:21] Maybe leverage the wiki top page too?
[00:21] A "current projects" sidebar or such.
[00:21] I could see that working. I'd be concerned that a mailing list would be someone else's job and never get used. It's easy to update the wiki page.
[00:21] but if it's the best there is, I'll probably start checking them
[00:21] yeah, and IRC chatter is usually 95% details without explaining the project itself, let alone the other active projects
[00:22] I'd find that extremely effective
[00:22] Lesson learned: if you're going to use pv, make sure pv is installed.
[00:22] But like all things, it's dependent on interest in updating it
[00:23] Something simple, like a bulleted list: $PROJECTNAME: Ask $IRCHANDLE
[00:23] (or, $PROJECTNAME: See $CHANNEL)
[00:23] Add a wiki link to the relevant page too, IMO
[00:24] Agreed.
[00:26] Of course, all I need is just a single notice that something's going on
[00:26] That'd be 98% of it for me
[00:27] I just don't have the time to always be on IRC, but I like to join the effort whenever there's a crisis
[00:28] Well, there's always something happening. Some of it's just really fast and doesn't even get mentioned on-channel. I'd say anything that requires more than a few people would be good to put up.
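For anyone following the DOS-to-Unix question at the top of this log, here is a minimal sketch of the commands being discussed; the directory name is a made-up placeholder, not necessarily what get-wget-warc.sh creates:

```bash
ls -l                   # list files with details (rough DOS equivalent: dir)
rm somefile             # delete a single file (DOS: del)
rm -r old-build-dir     # delete a directory and its contents (placeholder name)
./get-wget-warc.sh      # then re-run the build script
```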
[00:29] good point
[00:50] A Google Code mailing list for "xxxxx is going down, see this" announcements would be pretty nice
[00:54] Oh, about the dld-client, is there a graceful way of killing it, or do you just send SIGHUP?
[00:56] I wonder how hard it would be to hook signal 10 or 11 and make it exit after the current child finishes?
[00:56] I haven't actually done any signal handling in bash.
[01:07] touch STOP
[01:08] any running clients will notice and end after they finish their current job
[01:08] - Discovering urls (JSON)... ERROR (6).
[01:08] Downloading thefonzsays - Sun Nov 13 17:08:23 PST 2011
[01:08] Downloading web.me.com/thefonzsays
[01:08] alard: Getting next username from tracker... done.
[01:09] Oh. Well, okay then!
[01:09] Thanks
[01:09] you're welcome
[01:12] alard: If I run the curl command, I get a page that says "This account does not exist"
[01:12] http://web.me.com/c32040821/?webdav-method=truthget&feedfmt=json&depth=Infinity
[01:15] Oh, hmm, looks like a curl problem again
[01:15] Never mind
[01:55] Qwerty0/Wyatt/DoubleJ: http://www.archiveteam.org/index.php?title=Projects -- I added a "Projects with BASH scripts that need more people running them" section. Mind you, it's not an automatic alert or anything.
[01:56] I'm not sure what'd work better. I'll try to keep the page current, but I can't say I'll reliably be the most informed person.
[01:56] Err, keep the section current. I have no idea how to usefully keep the entire page current.
[02:00] That's an interesting one. How do people normally schedule temporary alerts with expiration?
[03:03] hrm
[03:03] I'm getting lots of errors
[03:03] Downloading 124 media files... done, with HTTP errors
[03:06] It's also happening to me
[03:06] - Downloading 4 media files... done, with HTTP errors.
[03:07] looks like it's all 404 errors
[03:07] but I didn't notice any before
[03:08] mostly from files.splinder.com
[03:11] perhaps they have been deleted
[03:12] What should we do about it
[03:12] ?
[03:14] I just got some too
[03:17] hmm.. http://files.us.splinder.com/47e09e7aa78f749b5081204479d6a5c5.png is a 404 to wget, but shows up in a web browser, or with curl
[03:18] ok, I guess they serve a 404 followed by a dummy image
[03:18] Maybe they recognize the user agent?
[03:18] no, I tried changing it
[03:20] I tried checking downforeveryoneorjustme.com, and it's down for them, too, so my guess is that it's not a response to Archive Team.
[03:22] Going to http://www.us.splinder.com/ , not terribly many of the front-page things have thumbnails.
[03:23] ...though I'm not sure that's saying much of anything.
[04:08] alard: There is a problem with dashes for sure - Downloading blog from -------mydi------------.splinder.com... done, with network errors.
[04:40] Question about the heroku stats: How is the size calculated? Is it sending the size back?
[04:40] yea, when the script reports that you've finished one, it sends along the size of the data
[04:40] I don't think it's accurate, cos I have 100 GB here
[04:41] it only counts the size of the warc files
[04:41] Aaah, I see.
[04:41] still, there's just a few logs otherwise
[04:41] I've got something like 3 GB across a few machines, from what I can tell
[04:42] Still, this live stats thing is really cool
[04:42] indeed
[04:43] Hey closure, you said you had something like a thousand threads running? How were you keeping the load down at the 300 level?
[04:43] 1000 was a few too many.. I dropped it to around 600-800 and got that load
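As a side note on the graceful-shutdown question answered above with "touch STOP": below is a minimal bash sketch of a worker loop that honors both a STOP file and a trapped signal 10 (SIGUSR1). This is not the actual dld-client code; fetch_one_job is a hypothetical stand-in for the real per-user download step.

```bash
#!/bin/bash
# Hypothetical worker loop: finish the current job, then stop if a STOP
# file exists or if SIGUSR1 (signal 10 on most Linux systems) was received.
fetch_one_job() { sleep 1; }   # placeholder for the real download step

stop_requested=0
trap 'stop_requested=1' USR1   # bash runs the trap after the current foreground command ends

while true; do
  fetch_one_job
  if [[ -e STOP || $stop_requested -eq 1 ]]; then
    echo "Stop requested; exiting after the current job."
    break
  fi
done
```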
[04:43] $( ./du-helper.sh -bsc "${userdir}/"*"-media.warc.gz" )
[04:44] Aah, so it looks like it does scale roughly linearly.
[04:49] otoh, I have 207 wgets running now and a load of 4
[04:49] Weird...
[04:49] some of them with large downloads get bogged down on the network and don't use many resources
[04:50] Amazon EC2?
[04:50] real hw
[04:50] Oh, nice
[05:08] http://vimeo.com/32001208
[05:38] wow, I didn't know about anyhub
[05:38] websites need to stop dying
[05:58] then what would we archive?
[06:04] All the things that never made it to the internet but are still in digital form.
[06:06] db48x: I have this grand vision of a future where, given a webapp that accepts user-generated content, you could plug in https://example.com/user.warc and get back 200 OK or 202 Accepted
[06:06] and either get back a WARC that was current as of your request date, or a URL to a location that you could check whilst the WARC was built
[06:06] and it'd be neat to help that out with library code
[06:07] (or 403 Forbidden, I guess, for private stuff)
[06:07] well ok, it's not that grand, but whatever
[06:07] It's not?
[06:07] heh
[06:08] well, it would probably be less of a load on websites than what we're doing now :P
[06:11] A future where people care about their data sounds pretty grand to me. :)
[06:12] yeah, or -- in this case -- a future where archiving of user data is common enough that there exists code out there to plug in to your app, tell it about archivable things and preferred formats, etc
[06:12] oh, and I guess you'd need an account-discovery protocol
[06:12] Accept: application/json; GET https://example.com/accounts or something
[06:13] nothing really groundbreaking
[06:13] hm
[06:13] I wonder how hard that'd be to integrate with a typical e.g. Rails app
[06:13] obviously, an archiver library can't auto-archive your models
[06:13] too much domain-specific knowledge there
[06:14] but the gruntwork of building the WARC, yeah, that can and should be standard
[06:14] hmmmm.
[06:14] if only we had a Rails app to try this out with
[06:14] oh hey wait, Diaspora!
[06:15] shit, that means I need to try to get it running again :(
[06:15] whoa.
[06:15] http://techcrunch.com/2011/11/13/diaspora-co-founder-ilya-zhitomirskiy-passes-away-at-21/
[06:19] wow
[06:19] 21, goddamn
[06:19] that's really terrible
[06:19] *22
[06:19] er, yeah
[06:22] I saw it earlier, but I'm still saddened to hear it again.
[08:03] Holy fuck
[08:16] ersi: what's up
[08:16] ?
[08:20] I was "Holy fuck"ing @ Ilya
[08:20] other than that, werk
[08:44] ersi: indeed
[08:44] ersi: surprising
[09:01] hrm
[09:01] the users/hour on splinder has dropped off quite a bit
[13:07] Evidently the Splinder "HTTP errors" have spread to the blogs now, too. Yet we're still getting SOME data.
[13:09] Paradoks: HTTP errors are nothing new; it's just that the script now tells you about them.
[13:10] Oh. Okay. So, data-wise, we're getting as much stuff as we were a day ago?
[13:12] Yes. What happened before was that the script just ignored any HTTP error. Some images are not found, and that's to be expected: not everyone has a profile image, for instance, and because the script generates new urls there is a chance that you'll get 404 errors.
[13:13] But it turned out that the US version sometimes returns 502 or 504 gateway errors, which isn't good. So the new version checks if wget found HTTP errors, then looks in the log to see if any of those are 502 or 504. If there are only harmless 404 errors, it continues.
[13:13] If there was a 502 or 504 error, it retries the user.
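A rough sketch of the log check just described, assuming (this is not taken from the real Splinder script) that the wget log contains lines such as "ERROR 502" or "ERROR 504" when a gateway error occurs; the log file name and exit code are illustrative only:

```bash
#!/bin/bash
# Hypothetical post-download check: ignore harmless 404s, flag 502/504.
username="${1:-someuser}"
logfile="wget-phase-3-$username.splinder.com.log"   # placeholder naming scheme

if grep -qE 'ERROR 50[24]' "$logfile"; then
  echo "Gateway errors (502/504) for $username; this user should be retried." >&2
  exit 1   # non-zero so a calling script can re-queue the user
else
  echo "Only harmless HTTP errors (e.g. 404); continuing."
fi
```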
[13:15] Cool. Thanks for the info. There was some worry (during the time you were asleep, I think) that we were getting an increasing/excessive number of 404s.
[13:51] Wow. Still downloading the blog that was up when I switched to the new scripts yesterday.
[13:52] It's still making new files, so I guess it's working. But jeez. The thing's been going for at least 18 hours now.
[13:54] Random request: Could the dashboard be changed to use a different bullet character? The old Firefox I have at work doesn't understand the one it uses now.
[15:54] SketchCow: Are you there?
[17:28] DoubleJ: I've got an EC2 instance that's been downloading splinder/Redazione for about 24 hours now
[17:29] it is, somehow, still making progress
[17:29] I guess it's because (1) the journal dates back to 2002 and (2) Splinder Italy is terribly bogged down right now
[17:48] yipdw: Yeah, it seems to be huge!
[17:56] that's what She said! Whooo!
[18:06] PepsiMax: 119 MB and counting
[18:06] also, weird, I just had git totally space out on origin/* pointers in a repo at work
[18:07] never seen that happen before
[18:08] how mature, ersi :P
[18:08] Sometimes that just burps right out of me
[18:08] I think it's what's keeping me alive, but I'm not sure!
[18:09] sudo apt-get upgrade your-life
[18:09] E: Unable to locate package your-life
[18:09] etc
[18:11] Haven't you heard it's The Small Things In Life?
[18:11] At least I'm able to enjoy myself >_>
[18:11] :-(
[18:22] dpkg: dependency problems prevent configuration of your-life
[18:22] requires the source: money
[18:29] I was going to make an alcoholism joke, but I guess that also works
[18:51] https://imgur.com/gallery/XspuW
[18:58] heh
[18:58] one of the anyhub WARCs I've got is just a bunch of BitTorrent files
[18:58] interesting way to get around the legal restrictions, I guess
[18:58] oh
[18:58] Content-Disposition: inline; filename=black-pro.exe;
[19:04] yipdw: yeah, there seem to be a lot of shady files...
[19:04] even found some DoS tools...
[19:04] I reported them to antivirus vendors, though.
[19:05] and i shred'd em
[19:05] er, what's the point of archiving if you're gonna shred them
[19:05] yipdw: I don't want loose exe files around.
[19:06] they aren't loose, they're in WARCs
[19:06] I found the rar files through the /stats page
[19:06] no.
[19:06] I don't look inside the gzips.
[19:06] http://www.anyhub.net/stats
[19:07] why would someone upload loose exes
[19:07] I don't know, but their intention is irrelevant, IMO
[19:07] I mean, from that page I can't even tell if it's actually a Windows PE file
[19:08] I think that if the intention is to archive anyhub's public files, then you might as well archive all of it
[19:08] PepsiMax: You suck at archiving if you're deleting stuff
[19:08] the security experts and lawyers etc. can pick apart the archive later
[19:09] Well
[19:09] You won't find any zero-day .exes anyway, and there will be anti-virus signatures for those lame RATs
[19:09] somewhere deep down I do agree
[19:09] these viruses would be a shame to lose.
[19:09] http://www.anyhub.net/stats
[19:09] some unknown stuff
[19:09] new stuff to submit :D
[19:09] Sorry, I was in lala land, now here. What up.
[19:10] SketchCow: PepsiMax is ranting about viruses in executables, and how he's not going to archive them
[19:10] blah blah
[19:10] PepsiMax: one thing to keep in mind is that, unless you've actually looked at the content of those files, you can't tell if it's even a virus
[19:10] So archive them without him.
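For anyone curious what is actually inside one of these WARCs without unpacking anything, the record headers can be skimmed straight out of the gzip. This is a generic sketch, not part of the project scripts, and the file name is a placeholder:

```bash
# Print captured URLs and any served filenames from a compressed WARC,
# without writing payloads to disk.
zcat anyhub-sample.warc.gz | grep -a -E '^(WARC-Target-URI|Content-Disposition):' | head -n 50
```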
[19:10] ersi: well, I have 11.7 GB of stuff ready. I'm moving them to a secure location.
[19:11] I'm just ranting about how people exploit a great file-upload service
[19:11] hurr
[19:11] PepsiMax: I mean, sure, there's a high probability that something named "LOIC.exe" is really the LOIC
[19:11] but who knows
[19:11] Calm down, boys.
[19:11] the filename isn't a criterion for deletion from an archive, IMO
[19:11] yipdw: that's why I don't open the gzips. Then I wouldn't know.
[19:11] Is that all the current news?
[19:11] yipdw: PR0DDOS.EXE
[19:12] PepsiMax: again, I don't know :P
[19:12] Might be a Disk Operating System, you don't know that.
[19:12] ProDoS v1.0 by 0v3rd0z3r aka SatansWrath
[19:12] I am NOT responsible for your actions, and what you do with this program. Education purposes only, thank you.
[19:12] Note: This program is twice as powerful then LOIC or ServerAttack, so be careful with it.
[19:12] I also think the danger is minimal, even if you gunzip the WARC and pipe it through less
[19:12] especially if you're on a UNIX system, where PEs are pretty hard to execute
[19:13] it would be a good idea to run all this archiving stuff in a sandbox, though
[19:13] I don't care about executing them. I care about owning them
[19:13] eh, well
[19:14] PepsiMax: if you could just let alard know the identifiers of the files that you shredded so that he can re-add them to the tracker, that'd be nice
[19:14] so that someone else can grab them.
[19:14] yipdw: I never shredded any warc downloads.
[19:14] sheesh, I woke up late
[19:14] then I must have misread 13:05:31 "and i shred'd em"
[19:14] Because I don't know what's inside them
[19:15] yes.
[19:15] the "pxF-pro-dos.rar"
[19:15] I'm confused
[19:15] how did you actually get that file if you never gunzipped the WARC
[19:15] I'm not sure either.
[19:15] http://www.anyhub.net/stats
[19:15] did you download it manually?
[19:16] People use anyhub for shady files. That's all.
[19:16] I know
[19:16] I'm just archiving.
[19:16] I'm just trying to figure out whether or not "and i shred'd them" applies to any of the stuff you downloaded
[19:16] yipdw: it doesn't. I didn't tamper with any warcs.
[19:17] We do not have time for that.
[19:17] ok cool
[19:17] cool, that's been sorted out
[19:18] How do I get my 18 GB to the web?
[19:18] Will we use Internet Archive?
[19:19] We'll shred them when we've downloaded it all
[19:19] Because it could be used for dangerous things :P
[19:23] reminds me of the various "shove all the biological and electronic pathogens to the Moon" plot devices in sci-fi novels
[19:23] e.g. 3001
[19:27] oh hi
[19:27] hi :)
[19:34] SketchCow?
[19:35] the moon? surely the sun
[19:39] the moon is better
[19:39] you can get stuff back
[19:45] Yes
[19:45] in case the last anthrax spores are needed to fight invaders from another dimension, presumably
[19:45] See, I live
[19:45] SketchCow: Hi!
[19:45] Perhaps it would be handy if you could set up something on batcave to rsync anyhub and splinder stuff to.
[19:46] Assuming you can handle that, of course.
[19:46] Perhaps one module where people can rsync to a subdirectory?
[19:46] fwiw, I have been sending some splinder stuff to my rsync on batcave -- and have 100 GB I plan to send there soon, as I'm getting low on disk
[19:48] There are multiple people with small to medium-sized chunks; it would be useful if we could point them somewhere.
[19:49] How big is it.
[19:49] (Just need to know)
[19:51] A guess: far less than 300 GB from 'little people'? (Not the underscors and Coderjoes who rake in terabytes.)
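For the small-to-medium uploaders being discussed, sending data to a shared rsync module with a per-user subdirectory would look roughly like this; the host name, module name, and path are purely illustrative, not the real batcave details:

```bash
# Hypothetical upload of finished Splinder data to a shared rsync module.
rsync -avz --partial --progress splinder-data/ batcave.example.org::archiveteam/splinder/$USER/
```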
[19:51] We have about 14 terabytes free on batcave at the moment.
[19:52] it'll be around a terabyte all told, I suspect
[19:52] In total, MobileMe is currently at 3586 GB; AnyHub is at 265 GB; Splinder at 120 GB.
[19:53] and splinder is 15% or so done
[19:53] Individual downloaders have tens of GBs each.
[19:54] Holy fuck, that's some.. data.
[19:54] note these are du --apparent-size numbers, and with lots of small files, I see up to 2x what your tracker sees with regular du
[19:54] Not too bad.
[19:54] You heard about the disk thing with archive.org, right.
[19:55] Slowdown of purchases until the Thailand situation clears up.
[19:55] yeah.. I hope this doesn't turn out to be like the 90's with RAM
[19:55] It's not just a standstill thing, because a number of drives die every day.
[19:55] So they're using them just to stay afloat.
[19:55] So we'll keep going, but a 200 TB block would be significant right now.
[19:56] 200TB T_T
[19:57] MobileMe is staying until June next year, so that's not urgent.
[20:01] "Downloading it:volevochiamarmipuckmaeraoccupato profile"
[20:01] wtf
[20:01] So, this Berlios thing.
[20:01] What's the opinion.
[20:10] finish the transfer and see what happens.
[20:10] I looked at their mailing list for the takeover and there were only a few posts. But I don't read German
[20:14] There's our German. :) But where's the mailing list?
[20:16] https://lists.berlios.de/pipermail/berlios-verein/2011-October/thread.html
[20:16] also last month
[20:18] The German things, as far as I've seen, come down to this: https://lists.berlios.de/pipermail/berlios-verein/2011-October/000006.html
[20:18] There will be a non-profit association that will continue Berlios.
[20:19] They're looking for volunteers to help. The association will be founded 'in November 2011'.
[20:20] OK.
[20:20] "Der bisher angefallene Hauptkostenblock waren Personalkosten"
[20:20] So what do we think is the best way to present this stuff?
[20:20] The main cost was in personnel, which a volunteer association doesn't have.
[20:20] Do I do a per-site archive?
[20:20] They're hoping to find hosting sponsors.
[20:22] SketchCow: ym for rsync?
[20:22] ym?
[20:23] you mean
[20:23] I mean when I put these into a collection on archive.org
[20:23] Because I'm going to pull them off batcave.
[20:23] ah. well, for berlios, we have a nice division into per-repo directories, which could be separate archive.org items. I don't know how hard it would be to create thousands of items, though
[20:24] easy peasy
[20:24] Yeah.
[20:24] So that's the smart way? I can do that.
[20:24] that way if a project needs their git repo they can get it without hunting through some ginormous tarball
[20:25] otoh, I have no personal problem with a ginormous tarball either. really up to you, dude
[20:25] I vote per-project item
[20:25] or maybe
[20:25] nah, yeah, a collection for berlios, 1 item per project
[20:27] there are 12 thousand projects fyi
[20:27] understood.
[20:28] How many mobileme accounts do we think there are?
[20:29] SketchCow: there are ~340,000 in the queue
[20:29] ummm, a few million?
[20:34] SketchCow: we've rsynced up untarred directories. I can try to write a script you could run that does another rsync to get recent activity and tars them up nicely for archival
[20:35] It'd probably be better to write something that converts the uploaded directories into a nice package for per-package items
[20:35] metadata extraction too
[20:35] Then I can keep going
[20:36] what kind of package and metadata format do you have in mind?
[20:37] Like, making each item a .tar.gz or whatever, and a .txt file with the name, author, and whatever else comes with the entry.
[20:37] So I can blow that information into the item.
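A rough sketch of the per-project packaging idea just described (one .tar.gz plus a .txt metadata file per item). This is not closure's actual script; the directory layout and the rsync URL format are assumptions:

```bash
#!/bin/bash
# Package each uploaded project directory as an archive.org-ready item:
# a tarball plus a small metadata text file.
mkdir -p items
for dir in berlios/*/; do
  name=$(basename "$dir")
  tar -czf "items/$name.tar.gz" -C berlios "$name"
  {
    echo "name: $name"
    echo "source: rsync://berlios.example/$name/"   # placeholder original URL
  } > "items/$name.txt"
done
```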
[20:38] absolutely. author info is a bit hard, since these are everything from git repositories to mailing list archives. But at least the name and original rsync url I can do
[20:39] I will develop it and get back to you
[20:41] Whatever we can get.
[21:06] I'm helping with Splinder (42 instances, and I didn't do it on purpose)
[21:06] there's a user who keeps failing: http://p.defau.lt/?J4RlPPKettnFG0loB0eX2Q
[21:07] defau.lt?
[21:07] looks like there's some extra dash in the URL: http://ladyvengeance.splinder.com/
[21:07] PepsiMax, yep, a pastebin
[21:08] :D
[21:08] hmm
[21:08] * chronomex sees PepsiMax's eyes light up
[21:08] Well, I don't see anything. Ask closure / alard / yipdw / ersi etc.
[21:09] Nemo_bis: what's in the wget logs?
[21:09] yipdw, where are they?
[21:10] http://-ladyvengeance-.splinder.com/ exists
[21:10] Nemo_bis: they'll be in data/-/-l/-la or some such
[21:10] yes, I've also seen the problem with directories starting with a dash. It makes the downloader loop forever
[21:10] hold on
[21:10] * yipdw will try to debug this
[21:10] something needs to use ./$dir instead of $dir
[21:10] i can't open that URL in my browser
[21:11] works here
[21:11] "unknown hostname"
[21:11] on Firefox 8
[21:11] (on wget)
[21:11] a DNS problem? :-?
[21:11] https://gist.github.com/69346f7d072a4cdd4e77
[21:12] would not be surprised if some crap dns server doesn't like dashes at the start either :)
[21:12] what DNS server are you using?
[21:12] actually, looks like chrome has a bug with it too :)
[21:12] my dns is ok, chrome shows a dns error though
[21:12] before that, though, let me try to download Crystailline's profile
[21:13] Fastweb DNS
[21:13] closure: the rational response is clearly that Chrome sucks and that you should abandon it for Opera
[21:13] assuming slashdot is any indication of logic
[21:13] do you want IPs to try and reproduce it?
[21:13] --2011-11-14 17:13:37-- http://-ladyvengeance-.splinder.com/
[21:13] Resolving -ladyvengeance-.splinder.com (-ladyvengeance-.splinder.com)... failed: Name or service not known.
[21:13] wget: unable to resolve host address `-ladyvengeance-.splinder.com'
[21:13] closure: quote it
[21:13] otherwise it'll be interpreted as an option
[21:13] it's not a quoting problem, I ran wget http://-lady...
[21:14] -ladyvengeance-.splinder.com is an alias for blog.splinder.com.
[21:14] and my dns is ok: host -- -ladyvengeance-.splinder.com
[21:14] blog.splinder.com has address 195.110.103.13
[21:14] did you run wget 'http://-lady...' or wget http://-lady...
[21:14] they're absolutely equivalent. I ran both, though :P
[21:14] got me, then
[21:14] me too, neither worked
[21:15] and DNS is working here too
[21:15] all I can say is https://gist.github.com/69346f7d072a4cdd4e77
[21:15] I'm checking if the download scripts choke on dashes
[21:17] different wget version? :-/
[21:17] but curl doesn't work either
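One generic way to narrow down a discrepancy like this (not taken from the project scripts): host uses its own DNS library, while wget and curl resolve names through the C library, so comparing the two shows whether the name is being rejected on the client side rather than by the DNS server:

```bash
# Compare the libc resolver (what wget and curl use) with host's standalone resolver.
getent hosts -- -ladyvengeance-.splinder.com   # goes through getaddrinfo/gethostbyname
host -- -ladyvengeance-.splinder.com           # bypasses libc and queries DNS directly
```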
[21:18] also, if it closes on the 24th, are we going to complete it or do we need additional downloaders? Splinder servers are already overloaded, at least during Italian working hours, though
[21:18] (according to the dashboard we'd need about 20 more days at this speed)
[21:18] Nemo_bis: possibly; here's some environment info
[21:18] https://gist.github.com/69346f7d072a4cdd4e77#file_a
[21:19] yipdw, http://p.defau.lt/?RiePg540hWv_79Wi6ieKsA
[21:20] Nemo_bis: what are the IPs of the nameservers you're using?
[21:20] wait a moment
[21:20] yipdw, 62.101.93.101 / 83.103.25.250
[21:21] I don't know if they work from "outside", though; my ISP is a bit nasty
[21:21] Try Google DNS or OpenDNS
[21:22] marceloan, to debug or in general?
[21:22] In general
[21:22] ok, what
[21:22] Resolv::ResolvError: no address for www.google.com
[21:22] ruby-1.9.2-p290 :013 > r = Resolv.new([Resolv::DNS.new(:nameserver => '62.101.93.101')]); r.getaddress('www.google.com')
[21:22] host: convert UTF-8 textname to IDN encoding: prohibited character found
[21:22] interesting, on a BSD system I get:
[21:22] your ISP is run by shitheads
[21:23] just saying.
[21:23] marceloan, I used namebench and mine is usually faster
[21:23] yipdw, yes, they're a bit strict...
[21:23] Nemo_bis: no idea, then. try a different DNS cache
[21:24] if that fixes it, run your own :P
[21:24] but are you sure that it is a DNS problem?
[21:24] I don't know
[21:24] but
[21:24] host -- -ladyvengeance-.splinder.com
[21:24] -ladyvengeance-.splinder.com is an alias for blog.splinder.com.
[21:24] blog.splinder.com has address 195.110.103.13
[21:25] Nemo_bis: https://gist.github.com/ec5f5921bc65e7af9ed9
[21:25] if you can copy-and-paste the wget logs for Crystailline, that'd help
[21:25] from that we can see just what error wget is encountering
[21:25] ok
[21:26] they'll be in data/it/-/-l/-la/-ladyvengeance-
[21:27] that said, the archive I have for that blog is really, really small
[21:27] I think it's incomplete
[21:28] oh, yeah, it definitely is
[21:28] I shouldn't be seeing "La pagina richiesta non è stata trovata o non è disponibile. Controllare che l'indirizzo della pagina sia corretto." ("The requested page was not found or is not available. Check that the page address is correct.")
[21:28] there's no such directory
[21:28] definitely
[21:28] er
[21:28] fuck.
[21:28] I'm retarded
[21:29] Nemo_bis: sorry. check it/C/Cr/Cry/Crystailline
[21:29] did it already, couldn't find it
[21:29] ah no
[21:29] * Nemo_bis facepalms
[21:30] ok, so what do you need?
[21:30] the wget logs
[21:30] all of them?
[21:30] wget-phase-3--ladyvengeance-.splinder.com.log to start with
[21:31] ehm, it just got deleted; I guess I have to stop the loop :-)
[21:31] oh, yeah
[21:33] jesus, this Redazione blog is huge
[21:33] * Nemo_bis is not fast enough
[21:34] try ps ax | grep dld-client | cut -f 1 -d ' ' | xargs kill
[21:34] or touch STOP
[21:35] let's start with http://p.defau.lt/?35BiHhyT9RU1YXT3B_fH6w
[21:36] which log is that?
[21:36] it looks like the log for lilithqueenoftheevil
[21:36] 3-lilithqueenoftheevil.splinder.com
[21:36] yep
[21:37] ok, that looks fine
[21:37] how about the one for -ladyvengeance-
[21:37] it's just wget-warc: impossibile risolvere l'indirizzo dell'host "-ladyvengeance-.splinder.com"
[21:37] can't resolve hostname
[21:38] that sounds like DNS :P
[21:38] but why does "host" work, then? :-/
[21:38] what resolver is host using?
[21:38] how can I know?
[21:39] host -v
[21:40] http://p.defau.lt/?j_5vOVBVqVUFOhJY8BhCeQ
[21:40] well, that's awesome
[21:44] yipdw, did you see that your ./wget-warc --version differs from mine in its -nls (I have +nls)?
[21:44] that might be it
[21:44] I'm not sure what NLS could do there, though
[21:45] unless it's punycode getting in the way
[21:45] * Nemo_bis has no idea what it is :-p
[21:45] er, wait
[21:46] hold on
[21:46] I'll recompile with nls
[21:47] what libraries do I need for that?
[21:48] are you asking me? because I have no idea at all, obviously ^_^
[21:52] huh, what the hell
[21:52] I just built it with NLS, and it blew up
[21:52] chronomex: I think you're on to something there
[21:53] where by "blew up" I mean "I can replicate the error"
[21:53] I feel less lonely now
[21:54] Can you guess where the VM went down?
[21:54] http://tracker.archive.org/tracker.png
[22:04] yipdw / Nemo_bis: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=626472
[22:05] A dash as the first or last character in the host name is not allowed by the RFC, apparently.
[22:05] huh
[22:05] interesting
[22:07] i knew only about the leading dash
[22:10] shockingly large quantities of software handle it anyway.
[22:11] And by handle, I mean break in interesting and arbitrary ways.
[22:12] I like this part of the report:
[22:12] "tcpdump/wireshark shows that the DNS query for foo-.tumblr.com does go out, and is answered with the CNAME and A (for proxy-tumblelogs.d1.tumblr.com), but gethostbyname just returns failure with errno set to EBADMSG."
[22:12] gethostbyname: "Fuck you, I know what's right"
[22:12] sigh
[22:32] * Nemo_bis going to bed, downloading non-dash users
[22:47] (1 167 730 + 169 544) / (169 544 / (3 days)) = 23.6624239 days
[22:47] 7 days too long
[22:48] Hey, so gang
[22:48] We've been offered SCRAPERSUNITED.COM and UNITEDSCRAPERS.COM as domains
[22:49] Is this interesting, or do we want to stick with archiveteam.org/scrapers or whatever
[22:49] alard: re the - problem.. I think that in DNS, the RFC actually requires the first character to be alphanumeric. Which is probably why some stuff breaks: it was written to the spec. And other stuff was not :)
[22:51]
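Given closure's closing point about the RFC, here is a hedged sketch of how a download script could flag such names up front instead of letting the C library resolver fail on them mid-run; the variable and the messages are illustrative, not from the actual Splinder scripts:

```bash
# Flag usernames whose hostname label starts or ends with a dash,
# since glibc's resolver rejects such names outright.
username="-ladyvengeance-"
case "$username" in
  -*|*-) echo "WARNING: $username.splinder.com violates the hostname RFC; handle it specially" >&2 ;;
  *)     echo "hostname label looks fine" ;;
esac
```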