[00:02] hey Famicoman
[08:44] http://www.obviouswinner.com/obvwin/2013/4/26/batman-dude-builds-himself-150000-secret-basement-batcave.html
[08:44] Batman Dude Builds Himself $150,000 Secret Basement Batcave
[08:45] sorry didn't know this was not #-bs
[14:25] Oh goddamnit
[14:25] People are telling me/us about sites WAY too late.
[14:26] I realize it's redundant, but I'm going to try grabbing a site, who else wants it?
[14:26] streetfiles.org
[14:28] I'm currently hopping IPs to stay nearly the only posterous downloader
[14:29] what can I do for streetfiles? I can spin up debian VMs
[14:29] on a VPS with limited HD space to retransmit, or a larger one to save the files
[14:32] I don't know how big it is.
[14:32] I have 2TB set aside
[14:33] (back after another IP flip)
[14:41] Hi
[14:41] how may I download several webpages simultaneously?
[14:41] at the moment I'm running 3 VMs :/
[14:42] or is there another way to squeeze more out of the box? ;-)
[14:47] fork()
[14:48] you could try running multiple instances of your downloader
[14:48] atm I'm running 3 separate VMs .... i can only control the first one because of the ports :/
[14:49] is it possible to simply change a config file?
[15:09] SketchCow: is there a script available for streetfiles? I don't have many spare GBs here, but I could give bandwidth
[15:11] No, no worries.
[15:11] I'll do it and another team member will do it.
[15:12] awesome :)
[15:12] SketchCow: Want to make Streetfiles a warrior thing?
[15:13] All the photo pages are just increasing numbers, we just need to find the start point
[15:15] well that was simple, it starts at 1
[15:15] http://streetfiles.org/photos/detail/1
[15:17] Yes
[15:17] alard: Yes, sure, why not.
[15:17] alard: I'm more concerned about nwnet and the related site - can that go warrior? All of them die on the 30th
[15:18] Yes it can, someone was brute-forcing it to find more URLs
[15:18] Streetfiles: I've almost finished a Lua script that makes a per-user package. Testing it now
[15:19] nwnet: Yes, what we need is a list of URLs to feed to wget, and a wget command.
[15:20] I have 2 servers which have nothing to do... if you provide me a VM I can spend as much bandwidth as you want...
[15:20] and the servers are capable of ;-)
[15:20] Someone was supposed to take a dictionary attack and someone was supposed to do a Google dictionary attack
[15:21] for x in {1..10}; do wget .... $x; done
[15:21] dashcloud was working on it
[15:22] btw is archiveteam present at OHM2013?
[15:28] No.
[15:32] Uploaded the first 100GB of my Linux ISO archive. Here is one release version: http://archive.org/details/opensuse-10.2_release
[15:35] Looks fun, though.
[15:36] OK, omf_
[15:36] First, set it as software, not texts.
[15:36] I knew I was forgetting something
[15:36] Next, let me make a collection.
[15:37] These will all be ISOs?
[15:37] some of the older ones are floppies
[15:37] and tars of code
[15:38] OK.
[15:38] How big is this again, just for trivia's sake?
[15:38] 3TB at last estimate
[15:38] I got boxes and boxes of drives
[15:39] OK, easy enough.
[15:39] Keep uploading.
[15:39] I have gotten back all the data from the dead FOS machine.
[15:42] I'm in the process of getting all the scripts and stuff I had there working, and when I do, stuff will come back, including collection making, and I can put your stuff into the collection.
[15:42] whoop.
[15:42] Yeah, we didn't have much data loss, if any.
[15:42] SketchCow: do you need notifying about all the WARCs we are uploading for the ign/gamespy grab? They are all tagged archiveteam anyway.
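A minimal sketch of the sequential-URL grab floated above (the photo detail pages count up from http://streetfiles.org/photos/detail/1, and the suggestion at 15:19/15:21 was a URL list plus a wget loop). The upper bound, filenames, and politeness settings here are placeholders, not values from the channel:

```bash
# Hypothetical sketch: build a URL list for the numbered photo detail pages
# (the 10000 upper bound is a placeholder, not the real photo count), then
# feed it to wget and record everything into a single WARC.
seq 1 10000 | sed 's|^|http://streetfiles.org/photos/detail/|' > photo-urls.txt
wget --input-file=photo-urls.txt \
     --warc-file=streetfiles-photo-pages \
     --wait=1 --timeout=30 --tries=3
```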
[15:42] No.
[15:43] No, but I will be doing massive bombing runs of cleaning up what we have.
[15:44] I found I have some freebsd and openbsd isos as well. I guess the collection title should be "Open Source Operating Systems"? or something better? I am terrible at naming things
[15:44] Yes
[15:44] I'll be approaching that.
[15:44] IT'll e nice, it's an amazing collection.
[15:45] Obviously, we have the ones that Walnut Creek did, but I consider this one to be a separate set, and one which might have some redundancies but I'd rather that they e redundant.
[15:45] (Sorry, b key sticks on this laptop)
[16:24] Now on a warrior near you: https://github.com/ArchiveTeam/streetfiles-grab
[16:25] * Baljem hops - beats watching sweet FA happen on posterous :/
[16:27] Still looking for lists of usernames, though.
[16:31] I'm really impressed by how much the warrior has evolved. New project just showed up, seamless switch. Great software.
[16:36] heh. mine appears to be dead. time to open the other laptop and poke the VM host...
[16:38] oh, no, that was my fault. I think I clicked the 'stop' button by mistake (!), as it's powered itself off. doh
[16:38] 11G simtelnet.bu.mirror.2013.04.zip
[16:38] root@teamarchive-0:/1/SIMTELNET/ftp.bu.edu/mirrors# du -sh sim*.zip
[16:46] SketchCow, I cannot change the mediatype of the items I already uploaded but all the new ones are software
[16:52] Feature request for the warrior: what about an option to participate in multiple projects at the same time?
[17:03] 2 days left, no idea how big it is, let's save some ART - http://tracker.archiveteam.org/streetfiles/
[17:11] hmm - something strange just happened on one of my streetfiles tasks - 'westberlinoldschool' I think was the name
[17:12] it had downloaded something like 3000 URLs, then wget quit with exit code 4, I saw it 'waiting for 10 seconds' and now it's gone completely :-/
[17:43] damnit, it's just done it again on a job that had downloaded > 4500 URLs
[17:43] that's a bit of a pain, it had been working on it for the best part of an hour :-/
[17:45] I think the username was 'mahatma-ganja' or some approximation thereof, if that's of any interest to anyone.
[17:47] Baljem: I'm having a look at westberlinoldschool now.
[17:49] cool; fingers crossed I don't have another one vanish (currently have one on 4740 URLs downloaded, another on 3100, 2780 and 1970 - lots of work for someone else to redo)
[17:49] Wget's timeout setting is probably too low (10 seconds)
[17:49] Is the site slow for anyone else? It is for me.
[17:50] aaaaaand the one on 4740 ('rays') just did the same thing. that had been running for even longer than the last one, gawd knows how big it was
[17:50] the download rate graph is being very variable for me - sometimes it drops to < 1kB/s, sometimes it's around 2MB/s
[17:51] The static pictures are very fast, the HTML pages are slow.
[17:51] I've dropped my concurrent items to 3 to try and back off a bit
[17:51] If you browse it, does it feel slow?
[17:51] although I have five running at the moment :-/
[17:51] one sec, I'll try
[17:52] yes, slower now than it was when the project started
[17:52] takes about ten seconds to load a page at the moment :-/
[17:52] I've reduced the number of requests given out by the tracker.
[17:53] I can't test from the same connection as my Warrior is using, though, but that sort of response does seem like server load at their end rather than bandwidth
[17:53] Yes, the server is serving pictures very quickly.
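For context on the failures above: wget's exit code 4 means "network failure", and the suspicion at 17:49 is that the 10-second timeout is too short for the slow HTML pages. A hedged illustration of the kind of adjustment being discussed; the specific values and the WARC name are guesses, not the project's actual settings:

```bash
# Illustrative only: raise the per-request timeout and add retries with a
# back-off so one slow page doesn't abort an hour-long job.
wget --timeout=60 --tries=5 --waitretry=10 --wait=1 \
     --warc-file=streetfiles-item \
     "http://streetfiles.org/photos/detail/1"
```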
[17:54] Is anyone else downloading from them? omf_? SketchCow?
[17:55] think I've just lost another job, although I don't remember which name has gone missing from the dashboard.
[17:55] yeah I got a streetfiles warrior running
[17:55] alard: me and a mate are running on it
[17:55] Yes, a warrior, but nothing else?
[17:55] 2 fast machines
[17:55] with the script
[17:56] Which script?
[17:56] on github, in archiveteam?
[17:56] Ah, okay, that's going through the tracker then.
[17:56] this one? https://github.com/ArchiveTeam/streetfiles-grab
[17:56] right
[17:56] There was talk earlier in this channel about people downloading it, before it was a warrior project.
[17:57] E.g.: SketchCow: "I'll do it and another team member will do it."
[17:58] oh, damn, I've gotta run
[17:58] yeah I believe that got superseded by using the tracker since you mentioned you had it working
[17:58] my Warrior will trundle along as usual - want me to do anything to the settings before I disappear? (back it off further perhaps?)
[17:59] Baljem: No, keep it running. The tracker can handle the backing off/scaling up bit.
[18:00] cool. currently set to three concurrent jobs (down from six earlier). just noticed another one gave up after 790 URLs ('okse1' I think) and getting very worried about the two that remain, but oh well
[18:01] shall i stop a warrior? getting rate limiting post...
[18:01] I have a vague suspicion I may have been banned or something, only getting 0.2KB/s
[18:01] Baljem: I'm working on an update.
[18:01] great, I'll check back after dinner then :)
[18:03] so is it better to change the project on the warrior?
[18:03] SilSte: You can, if you want. We'll have to figure out what works for this site.
[18:04] k
[18:04] But there's also no harm in keeping it running.
[18:05] kk
[18:18] alard: project code is out of date... will it update automatically or shall i reboot the warrior?
[18:21] SilSte: It will update automatically, within an hour. To update immediately: stop the project (not the warrior), then start again.
[18:21] kk
[18:21] worked
[18:22] alard, can I pause the pipeline?
[18:22] Pause wget?
[18:22] Ctrl+Z should work.
[18:22] no, this command: run-pipeline --concurrent 5 pipeline.py
[18:23] You can Ctrl+Z that, if you want. Why?
[18:23] I am getting the "project code out of date" messages flowing by
[18:23] and it makes it hard to keep an eye on what is going on
[18:23] No, you can't pause that, unfortunately.
[18:24] stop and upgrade it or leave it running?
[18:25] Click the stop button, let it finish and start a new one (on a different port)?
[18:25] this is the CLI script on a cloud instance
[18:27] Then do what you prefer: kill it, or wait until it finishes.
[18:37] mowk died on a recent version
[18:37] oh maybe not so recent, sorry
[18:50] i get a lot of 500 @formspring...
[19:43] Hey! For omf_ and anyone else who was interested, I've put a transcript of the Defcon 'soy sauce' archive team talk up at http://www.archiveteam.org/index.php?title=DEFCON_19_Talk_Transcript as well as a timed caption file that could be uploaded to YouTube if wanted.
[19:44] Er, that's all.
[19:44] [Not sure if the wiki was the right place; obviously do wipe it out if it's not appropriate.]
[20:00] Can someone download http://streetfiles.org/blog/ ?
[20:13] looks like archiveteam is a user name on there now
[20:15] Yes, I signed up.
[20:17] streetfiles down?
[20:19] it's going to die in 3 days
[20:19] SilSte: No, I don't think it is.
[20:20] okay... only very slow ^^
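On the blog grab requested at 20:00: a hedged one-off wget-warc sketch, not the project's script. The recursion depth, politeness settings, and output name are assumptions:

```bash
# Illustrative one-off grab of the blog into a WARC: recurse under /blog/
# only, pull page requisites (images, CSS), and go gently on the server.
wget -r -l inf --no-parent --page-requisites \
     --wait=1 --timeout=60 --tries=3 \
     --warc-file=streetfiles-blog \
     "http://streetfiles.org/blog/"
```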
[20:21] i may not be much help with this one anyways
[20:21] alard: i hope you can get it
[20:22] godane: Why can't you help?
[20:22] i don't have much hard drive space
[20:24] Ah. I thought you were always uploading. :)
[20:24] i am
[20:24] just i have too much stuff to upload right now
[20:24] i'm also starting to do a full mirror of newamerica.net
[20:24] http://developer.streetfiles.org/
[20:25] Eek! streetfiles.org - 785,152 photos, 92,319 members.
[20:27] Looks like such a good site, too.
[20:29] yeah
[20:29] 'bZ-Q(@K7ljlJRft'<)crcA
[20:29] Sorry, Keepass. :)
[20:31] mmhmmmm
[20:32] It's a pity that streetfiles.org is so slow.
[20:39] mm. we've done about, what, 10% in 4 hours? going to be tight I fear
[20:40] We haven't done 10%.
[20:41] There are 92,319 users, they're just not all in the tracker.
[20:41] ah, bugger
[20:42] I was going to qualify it with '10% of what the tracker knows about' but thought that might be overly pedantic ;)
[20:43] looking at the graph on the tracker page mine seems to be struggling recently, for some reason. perhaps it keeps finding things that have a lot of pages but not much data
[20:43] It's an important difference in this case.
[20:44] yes. I didn't realise there was quite that number of users not in the tracker :(
[20:44] would be nice to record http://69.13.218.21
[20:46] will they be added to the tracker?
[20:46] We have to find them first.
[20:48] kk
[20:53] * SmileyG looks in
[20:55] i just got the glenn beck freepac (FreedomWorks) live speech
[20:56] the torrent was almost dead
[21:05] balrog what is that
[21:05] a stream from the CoCoFest 2013
[21:05] http://www.glensideccc.com/cocofest/
[21:22] is the tracker rate limiting being especially cautious with streetfiles at the moment?
[21:23] I'd like to not kill it.
[21:23] (nods)
[21:23] The current items are groups, perhaps those are easier for them.
[21:28] Question: is metadata more important than photos?
[21:29] We don't have to download all those /photos/detail/ pages to download the large photos.
[21:29] But I think that's not a good idea.
[21:30] Does the metadata tell you anything about the photo that could help prioritise how 'important' it is? - e.g. number of views, popularity, size, etc?
[21:31] I think that downloading the photos, once you have the metadata, is not a problem.
[21:31] The bottleneck is in those web pages, I think.
[21:31] hm
[21:33] why not download more photos to put less stress on the server?
[21:34] most of those group sites are very small
[21:34] Because that would give you a large bunch of anonymous photos.
[21:34] Knowing where, when, what is probably interesting.
[21:34] 49 to do? ^^ :D
[21:35] Groups. :)
[21:35] ^^
[21:35] agreed, and without the metadata you also don't know the author
[21:35] no, that's rubbish, ignore me
[21:35] you do have a user ID
[21:35] Yes, you could derive that from the "photos by user X" page.
[21:35] heh. think it's going to run out of items before mine asks for another after 30 seconds
[21:36] gru_soldier seems to be very fat... downloading for hours...
[21:36] ^^
[21:36] looks like flaushy got about 1.6GB in a chunk a short while ago!
[21:37] i get a lot of 500 @formspring...
[21:37] Baljem: that was a long download ^^
[21:38] slf-city
[21:38] I'm getting exit code 4 on my posterous downloaders, is this expected?
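A rough sketch of the "metadata first" approach debated above (harvest the per-user listing, then the /photos/detail/ pages, so the photos keep their author, place, and date). The listing URL pattern and username below are guesses, not a documented streetfiles layout:

```bash
# Hypothetical per-user pass: fetch an assumed listing page, collect the
# /photos/detail/ links from it, then record those detail pages in a WARC.
USER="example-user"   # placeholder username
wget -q -O listing.html "http://streetfiles.org/$USER/photos"   # assumed URL layout
grep -o '/photos/detail/[0-9]*' listing.html | sort -u \
  | sed 's|^|http://streetfiles.org|' > detail-urls.txt
wget --input-file=detail-urls.txt --warc-file="$USER-detail" --wait=1 --timeout=60
```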
[21:38] you should add time @ the warrior ;-)
[21:38] or a timer :D
[21:38] posterous blocked me completely
[21:38] another big one is on this machine as well
[21:38] okay, finally we got a wiki page - http://www.archiveteam.org/index.php?title=Streetfiles - and an IRC channel, #streetsoffire
[21:44] well with 5000 to do it shouldn't take us too long hopefully
[21:45] Not all, not all.
[21:46] (Someone should run a bot that repeats this sad message every time someone says we're almost done. :)
[21:47] then gimme more work :P
[21:48] is posterous working for anyone?
[21:54] hey, don't bogart all the work ;) plenty to go round, but the site's creaky enough under this much load, by the looks of it
[22:01] Are all the active projects deadline projects on this list? http://www.archiveteam.org/index.php?title=Current_Projects
[22:04] Upcoming is done I think
[22:07] Yeah I'll remove that
[22:15] SilSte: posterous regularly bans now (like every 10 minutes it seems).
[22:16] Are posterous smart enough not to ban, say, ISP web proxy servers, etc?
[22:16] unlikely
[22:17] my root server is blocked ... so ... no :D
[22:17] stopped hours ago... still blocked.
[22:18] Any idea how they know what to ban? Is it user-agent, or amount of downloads..
[22:18] amount of requests
[22:18] Could they be looking at the tracker and banning the most recent IP to access that username, etc?
[22:18] amount, right.
[22:20] hmm. disappointing that they seem so opposed to the legitimate preservation of their users' content.
[22:22] indeed.
[22:22] :(
[22:22] Other avenues? Google cache? (Not suitable?_
[22:22] antomatic: the tracker is ours....
[22:22] )
[22:22] google bans ;)
[22:22] buh!
[22:23] Some days an archivist can't get a clean break.
[22:23] Good thing the bad press won't stop for them
[22:23] Ah, I meant the dashboard rather than the tracker. That's how I'd interfere, if I were an evil site owner.
[22:27] Loved the comment in the Defcon speech, "But Google is a library or an archive in the same way that a supermarket is a food museum."
[22:30] ok so I'm mirroring a site that contains a lot of RealMedia .ram files
[22:30] after doing wget-warc, I need to cat all the ram files together (each contains a URL to a .rm file that actually contains the media), and then what do I do?
[22:31] feed the list into wget and generate a second WARC file?
[22:33] are the .rm files coming off a normal HTTP server or are they using RTSP-type streaming?
[22:33] might be more complicated if so. If HTTP then no problem, just as you say, I reckon.
[22:36] antomatic: http
[22:36] [there are separate tools for RTSP]
[22:37] I remember how much trouble RTSP used to cause me, back in the day. :)
[22:42] Posterous isn't banning me for some reason?
[22:43] yet ;)
[22:43] It's been running all day though.
[22:43] blimey. you're our last hope then ;)
[22:44] It appears so.
[22:44] Which is a scary thought!
[22:45] Excellent luck, noah!
[22:46] I'm downloading at between 100 and 200 KB/s.
[22:47] I wonder why I'm not banned though....
[22:59] balrog: if you have to generate a second WARC you can always concatenate them later
[23:52] https://twitter.com/waxpancake/status/328158604765036546
[23:52] FYI
[23:52] LJ is deleting old blogs with fewer than 3 posts
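A sketch of the .ram/.rm two-pass workflow discussed at 22:30-22:59: each .ram file is a tiny text file holding the URL of the actual RealMedia stream, so the second pass is just another wget run into a second WARC, and the two WARCs can be joined afterwards. Directory and file names here are placeholders, not from the chat:

```bash
# Pass 2: pull the .rm URLs out of the mirrored .ram files (the site serves
# them over plain HTTP, per the discussion) and fetch them into a new WARC.
find mirror/ -name '*.ram' -exec cat {} + \
  | grep -o 'http://[^[:space:]]*\.rm' | sort -u > rm-urls.txt
wget --input-file=rm-urls.txt --warc-file=site-realmedia --timeout=60 --tries=3

# Gzipped WARCs are back-to-back gzip members, so the two passes can be
# combined later by plain concatenation, as suggested above.
cat site-pages.warc.gz site-realmedia.warc.gz > site-combined.warc.gz
```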