[00:01] the rsync problems seem to sort themselves out after a retry or two anyway
[00:01] alard: is there a reason all of the upcoming.org jobs are 6mb?
[00:01] That's strange. Is 216.245.195.218 far away from you?
[00:02] 36ms, 13 hops
[00:04] Are you downloading for other projects? That should go to the same server.
[00:04] nay.
[00:04] just one job on a spanking new instance
[00:06] 6MB could be correct.
[00:07] just wanted to be sure before returning piles of botched jobs
[00:08] That's more or less what I want to find out.
[00:08] You will get blocked by Yahoo, probably.
[00:08] The script does not yet detect that.
[00:09] I'm going to hold off firing up a pile of instances until that is implemented
[00:09] Yes please. (That's why it says "We're still testing this" in the warrior menu. :)
[00:10] giving you some samples to work with at least :)
[00:10] Yes, I hope you get blocked so we can see what that looks like.
[00:10] one
[00:10] second
[00:10] I'll get banned.
[00:11] once the tracker gives me enough jobs.
[00:11] That's the idea.
[00:12] rate limited by the tracker at the moment.
[00:13] I increased the rate limit.
[00:13] here we go.
[00:14] (I set it very low because I'm going to bed in a moment and don't want someone to start 'helping' with hundreds of instances yet.)
[00:14] * nwh nods
[00:14] I'm presuming once I get banned they'll either show a rate limiting page, or drop my requests?
[00:15] Normally Yahoo has a special error code for that (499, I think).
[00:16] But this http://waxy.org/2013/04/the_death_of_upcomingorg/ suggests "Yahoo starts serving blank responses". That would be different.
[00:16] doing --wait and --random-wait is a good thing for yahoo
[00:16] probably wouldn't hurt to detect blank responses...
[00:16] I'm really going to sleep now.
[00:17] If you want to tinker, go ahead (update the version number).
[00:17] I'll look at the results tomorrow, then we can go faster.
[00:17] I haven't been banned yet, which is unexpected
[00:17] on the yahoo messages grab I would have been banned ages ago
[00:17] Is skithund here?
[00:18] doesn't look like it
[00:18] 4MB, yours are 6MB, strange.
[00:18] some of skithund's are 4MB too
[00:19] Anyway: we might have to redo some or all of this, so don't go too crazy yet.
[00:19] Bye.
[00:19] night
[00:26] skithund has gone absolutely nuts.
[00:34] well I'll hold off until it's properly ready before making a billion instances. till then.
[01:04] this is interesting: http://archive.org/details/1001_Games_and_More_ValuSoft_1997/ there's a crack of Secret of Monkey Island included on the CD
[01:16] alard: Yes, all of my current work segments are blocked on RsyncUpload.
[01:17] (Happy to help you debug if needed; was AFK before.)
[01:21] http://waxy.org/2013/04/the_death_of_upcomingorg/ — I hope he knows about AT's efforts
[01:21] since it wouldn't hurt for this to be coordinated
[01:22] Two different people (one me) already told him to ping AT.
[01:22] (look in the comments)
[01:22] ok
[01:27] i found something that i forgot to upload
[01:27] official nintendo sticker book
[01:53] Any progress re: RsyncUpload errors? Alard?
[01:55] ussjoin: they're asleep.
[01:55] ...well, shit.
[01:55] are they continuous for you
[01:55] hi
[01:55] lionheart: evening.
[01:55] nwh: Yes. On all 6 tasks.
[01:56] ussjoin: I don't know enough about the backend to diagnose it unfortunately. hopefully one of the core team will be able to help you out with that—when they're around.
[01:57] Fair enough. Just makes me a bit sad that on my first day of trying to help out, I can't, um, help. :-)
[01:57] which job are you working on?
[01:57] It *appears* FormSpring. Are all the tasks from the same job?
[01:58] (FormSpring is the banner at the top. I'm just on AT Select.)
[01:58] mm, that'd be formspring then
[01:59] Wow, I'm impressed that the Upcoming task is up already.
[01:59] Should I just throw away my AT Choice work and tell it to work on Upcoming or Posterous instead?
[01:59] can't hurt to try the Posterous job I suppose
[02:00] the Upcoming task isn't presently finished, I was intending to try and get myself banned, but it doesn't seem that you /do/ get banned
[02:01] That sounds good...?
[02:01] well, yes
[02:01] for the Yahoo Messages job, you got banned within minutes
[02:01] (BTW: Restarted warrior, told it to work on Posterous, it's kicking ass.)
[02:02] awesome
[02:49] http://www.scribd.com/doc/136875051/-why-s-complete-printer-spool-as-one-book
[02:49] the whole thing is very worth reading, but page 9 asks what if franz kafka wrote for the 32 bit power pc? really good writing
[02:50] https://archive.org/details/136875051WhySCompletePrinterSpoolAsOneBook
[02:50] whelp.
[02:51] yeah regardless it's really good
[02:52] RedType: thanks
[04:10] So, I have a huge amount of throughput (100 Mbps down, 5 up). How can I help ArchiveTeam use more of that when I'm not? My Warrior keeps using ~0.1kBps.
[04:35] DID SOMEONE SAY YAHOO UPCOMING?
[04:48] -ENOSPC :)
[05:49] underscor: are you around?
[05:50] underscor: your worker for 'upcoming' is returning bad results
[05:51] * nwh cringes
[06:15] nwh, is it still returning bad results?
[06:15] yep
[06:15] he's been banned from the server, and just returning 0mb jobs
[06:15] ~5k of them so far
[06:15] Blocked him tracker-side.
[06:17] handy function.
[06:21] I think it rounds, so 0mb doesn't necessarily mean 0 bytes
[06:21] the jobs for that should be about 4-6MB each though
[06:22] returning thousands of negligible-sized jobs just means they pushed it too hard and got banned
[06:47] GLaDOS: 'short' is going to get banned pretty quickly too. the current scripts on github don't check for that.
[06:48] ping me when he does
[06:49] I'm...around
[06:49] I guess I'll watch for 0 byte
[06:49] or lower my threads
[06:50] S[h]O[r]T: apologies, I didn't notice you in the OPs ;)
[06:50] it seems pretty hard to get banned, or I was just lucky before
[06:50] it's cool. overlooked that the error code wasn't in the script yet, 20 threads going atm
[06:51] I survived with 15 when I was running it, I just wasn't sure if alard was done making the pack yet
[06:53] If I have a value set in wgetrc for -D, and I also set -D in the execution string, does that override or append?
[06:56] super random guess: it overrides
[07:01] I want to have a default list of CDNs, image hosts, etc.
[07:01] and then add the primary domain on top of it
[07:01] rather than having that stuff clog up my string it would be nice to store it in an external default
[07:05] there are probably going to be more image hosts than you can ever whitelist
[07:05] especially as places like http://getcloudapp.com/ allow people to have vanity domains
[07:11] yeah, but I am just trying to have a core list of image hosts and CDNs, not necessarily a complete one
[07:11] I run a post-completion link check that shows me missed files, and I will often manually patch in anything of importance
[07:12] regardless of the use, wondering if it is possible to have a default set of options in wget that is appended to rather than overwritten
[07:13] so if I set a list of domains in wgetrc, it would be cool if setting -D in the string just added to that list rather than overwriting it
[07:17] maybe I need to approach it where I use -p and set an exclude list using hosts from http://winhelp2002.mvps.org/
[07:19] re. upcoming - bad idea to go all out with 6 concurrent, I see? if so, how much is safe?
[07:23] AFAIK, we don't know what the safe limit is.
[07:24] mmk, will run 2, but if things still get hairy, I'll go to 1
[07:24] wp494: it seems very hard to get banned
[07:25] skithund managed to, but I'm not sure what they did
[07:26] probably all 6 for a period of time would have been my guess
[07:26] no, must have been 30+ or something
[07:26] seems weird that Yahoo would have inconsistent 999 triggers
[07:29] anything else I should watch for?
[07:43] wp494: just 0MB jobs
[07:44] GLaDOS: paulv is returning 0MB jobs now too
[07:45] yeah, I'm getting rsync: failed to connect to tracker.archiveteam.org: Connection refused (111)
[07:45] ah - so it's a tracker issue rather than a ban?
[07:45] oh, no
[07:45] that's for formspring, sorry
[07:45] you might want to kill your 'upcoming' jobs
[07:46] yeah, I don't know what happened for upcoming. I can debug if someone wants
[07:47] I'm assuming that yahoo is just returning blank pages to you now
[07:47] how many concurrent jobs do you have paul?
[07:48] not sure why you and skithund got that though, short and I have been a lot more persistent
[07:48] i'm just running it on the cli w/ the defaults, so I think 1
[07:48] I'm running 20, so I'm not sure what's happening there
[07:51] paulv: I'd stop running it altogether, you're just returning junk
[07:51] yeah
[07:52] I was trying to figure out why it was happening, but it goes so fast
[07:53] `curl` a page and see what it is returning
[07:54] if it's like the other yahoo jobs, it'll just be a blank page
[07:54] I was trying to find a url
[07:54] curl http://upcoming.yahoo.com/event/10204150/NSW/Sydney/Social-Media-Marketing-Course-Facebook-Twitter-and-Blogs/Centre-for-Continuing-Education-CCE-The-University-of-Sydney/
[07:56] error 999
[07:56] damn
[07:56] the question is, why are you banned and I'm not
[07:57] well, the machine I'm using is in the archive's friends and family rack, which is in archive.org IP space. I wonder if they're being mean b/c of that.
[07:58] wonder what skithund's on
[07:59] i also have no idea how fast the connection is, so it's possible that I just overwhelmed them
[07:59] oops.
[08:00] * Smiley looks in
[08:00] think it's worth making a channel for upcoming?
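
A note on the wgetrc/-D question above: wget reads wgetrc first and then lets command-line options override the matching setting, so a -D on the command line replaces the wgetrc `domains` list rather than appending to it. One workaround is to keep the base CDN/image-host list in a plain file and build the -D value at run time. A minimal sketch, assuming a hypothetical base-domains.txt and using one Upcoming event URL as the example target:

    #!/bin/bash
    # Build wget's -D/--domains value from a base host list plus the primary domain.
    # base-domains.txt (hypothetical) holds one CDN or image host per line.
    BASE_DOMAINS=$(paste -sd, base-domains.txt)
    PRIMARY="upcoming.yahoo.com"

    wget --span-hosts --page-requisites \
         --wait=1 --random-wait \
         --domains="${BASE_DOMAINS},${PRIMARY}" \
         "http://upcoming.yahoo.com/event/10204150/"

The same trick works for the -p/exclude-list approach mentioned above: expand a hosts file into --exclude-domains instead of --domains.
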
[08:01] there's not a heap of archiving to do in it
[08:01] I wouldn't mind
[08:03] you can't really make anything punny out of "upcoming" either
[08:04] "whatcomesupgoesbackdown" would be the only thing remotely good I have, but it's probably too long of a name
[08:05] you can't beat "preposterus". it just can't be done.
[09:54] I'm getting rsync timeouts constantly. Is it just me? (I've got my warrior set to upcoming.org if it matters)
[10:09] anyone else having rsync upload issues?
[10:10] I'm getting the rsync errors as well
[10:11] Yup
[10:11] cool, at least it's not just me :P
[10:14] hah, and mine just sorted itself out
[10:14] Mine too. Just now.
[10:17] same here
[10:21] complaining on IRC clearly does work ;)
[11:05] alard, guess what Yahoo's doing!
[11:05] ERROR 999
[11:31] GLaDOS: Upcoming?
[11:31] yep
[11:31] The thread continues on, and uploads an incomplete WARC
[11:33] Yes, it's not checking for that error.
[11:35] Is 999 the only sign?
[11:35] Or do they sometimes return empty pages, like the post on waxy.org suggests?
[11:36] Not entirely sure.
[11:38] So far I don't see any 999 in the warc files.
[11:38] If I wanted/needed to have a quick chat with Jason Scott, is he likely to pop up on IRC with any regularity or is email a better way?
[11:39] creature: Look for SketchCow here, but he seems to prefer email if it's about something he needs to do.
[11:41] alard: Kind of the opposite. :) Something I need to do.
[11:41] Then just lurk here. SketchCow usually makes some noise when he's active.
[11:42] Thanks.
[11:42] GLaDOS: Found the 999.
[11:43] (There's just much more data than I expected. 23GB!)
[11:50] All Upcoming downloaders: The script now checks for error 999. You'll have to update to get the latest version. (The warrior updates automatically.)
[12:06] Love that upcoming is already running. Great work!
[12:07] Running five instances, is there a better way to optimize? And is 6 "concurrent items" generally better than the default 2?
[12:10] The number of items depends on your VM's memory, and the size of the project. Upcoming doesn't use that much memory, I think.
[12:12] If you're on Linux you could try to run a standalone client, outside the warrior VM.
[12:12] That would let you run more than 6, but it's a little bit more work to set up.
[12:16] Thanks. I'm mainly on OSX so I'll just keep going then ;)
[12:23] "You'll have to update to get the latest version. (The warrior updates automatically.)" Does this mean I have to do something or not?
[12:24] bjrn: If you are running the warrior virtual machine, no. Your warrior will update within 60 minutes.
[12:24] If you have manually cloned the git repository, you'll have to do something to update.
[12:24] Ok. Great :)
[12:27] I've removed the "we're still testing" tag from the project. I think we're ready to start.
[12:27] So feel free to scale up, if you want. :)
[12:31] hey guys
[12:32] I'm also running an upcoming crawl, just to let you know
[12:32] i've been in touch with andy
[12:32] how do you split the event IDs ?
[12:32] sylvinus: Hi. Batches of 25.
[12:32] Wow, that sure made some difference ;)
[12:33] alard: starting from 0 to 10M ?
[12:33] Yes.
[12:33] any idea about your speed?
[12:34] We've only been testing so far, to see how Yahoo does the blocking.
[12:34] I'm doing the same & running around 4K urls / minute right now
[12:34] We'll now start for real, I hope.
[12:34] I've set the github IRC hook to update to this channel.
[12:34] Feel free to disable it if it gets annoying.
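
For reference, the ban symptoms discussed above are an HTTP 999 status or a (near-)blank page. The actual fix went into the project's wget Lua hook (the "Retry on error 999" commit that follows), but the same idea can be sketched as a standalone probe; the URL and the 512-byte threshold here are arbitrary example values, not what the grab script uses:

    #!/bin/bash
    # Probe a single Upcoming event page and guess whether Yahoo has blocked
    # this IP: either the infamous HTTP 999, or an almost-empty body.
    URL="http://upcoming.yahoo.com/event/10204150/"
    BODY=$(mktemp)

    CODE=$(curl -s -o "$BODY" -w '%{http_code}' "$URL")
    SIZE=$(wc -c < "$BODY" | tr -d ' ')
    rm -f "$BODY"

    if [ "$CODE" = "999" ] || [ "$SIZE" -lt 512 ]; then
        echo "looks like a ban: HTTP $CODE, $SIZE bytes; back off before retrying"
        exit 1
    fi
    echo "ok: HTTP $CODE, $SIZE bytes"
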
[12:35] sylvinus: In what form are you saving it?
[12:36] html + xml
[12:36] 2 urls per event. I've sent a dump to andy who ok'd it
[12:37] maybe I should do the IDs in reverse so that we don't overlap from the start?
[12:38] We're doing it in a random order (the tracker is unable to do anything else).
[12:38] [yahoo-upcoming-grab] none pushed 3 new commits to master: https://github.com/ArchiveTeam/yahoo-upcoming-grab/compare/d3ccd4327951...a54c402ab0b9
[12:38] yahoo-upcoming-grab/master 5b8f9d3 Alard: Now with pipeline.py and README.
[12:38] yahoo-upcoming-grab/master a54c402 Alard: Retry on error 999.
[12:38] yahoo-upcoming-grab/master b31f506 Alard: Domain is upcoming.yahoo.com.
[12:39] alard: ah, that's unfortunate :(
[13:00] ok I got the first 100k
[13:00] :)
[13:58] http://archive.org/details/archiveteam_upcoming_20130420072639
[14:01] alard: that was fast! how many events do you have in there?
[14:03] is there a wiki page for upcoming?
[14:07] sylvinus: I think a little more than 100k.
[14:08] sep332: I don't think so. Make one!
[14:09] I made one - http://www.archiveteam.org/index.php?title=Yahoo_upcoming
[14:17] alard: I'm curious to know what you guys are grabbing. I have 200k events now and it's only about 7g
[14:18] uncompressed!
[14:19] sylvinus: Images.
[14:19] We're getting the whole page, more or less.
[14:20] ha, ok
[14:24] i'm wondering what an "item" is for the Upcoming job.
[14:25] It's a batch of 25 event IDs. events-3657850 is event ..50 to ..74
[14:25] according to the founder, all IDs are auto-incrementing. so we should be able to make a complete list of items pretty trivially?
[14:25] ok interesting
[14:32] alard: are you fetching the full attendee list?
[14:49] there are holes in the ID list
[14:49] event 3 for instance
[14:49] russss: If that's just one URL, yes. https://github.com/ArchiveTeam/yahoo-upcoming-grab/blob/master/upcoming.lua#L37-L38
[14:50] ah cool
[14:50] And the good thing is that it's a GET request without any special random parameters, so with a bit of luck it'll still work in the wayback machine.
[15:54] So is yahoo really slow too?
[15:54] my warrior doesn't appear to be pulling much.
[15:54] 300B
[18:25] nwh: Just FYI, the Upcoming task is still displaying its name as Yahoo Messages. :-)
[18:29] [yahoo-upcoming-grab] alard pushed 1 new commit to master: https://github.com/ArchiveTeam/yahoo-upcoming-grab/commit/d922927a16bd4d66efc410c7dfadd83736caad41
[18:29] yahoo-upcoming-grab/master d922927 Alard: Upcoming, not Yahoo Messages.
[18:29] Thank you, GLaDOS. :-)
[20:36] Morning.
[20:42] heya
[20:49] SketchCow: Hi. I started uploading the Upcoming data. See http://archive.org/details/archiveteam_upcoming_20130420130753 and others. Want them added to a collection or do you want to do that later?
[20:56] Brilliant work on the upcoming thing, by the way
[20:56] I can add them later.
[21:01] Brilliant work all round. That Warrior thing is fantastic.
[21:02] :)
[21:02] glad you like it, it was nothing to do with me :D
[21:04] Just putting it out there. It feels nice to be able to help with just a few clicks.
[21:07] I agree, it is a really impressive piece of kit.
[21:14] * SketchCow shoving a pile of CD-ROMs into the CD-ROM collection.
[21:25] SketchCow: where's the yahoo-upcoming-grab IRC channel?
[21:25] I have to ask the boys here
[21:25] Hey, boys, what's the channel?
[21:25] Is there one?
[21:26] (I was away when this happened)
[21:27] No.
[21:27] #outgoing
[21:46] I don't understand the warrior. How does distributing a 200MB virtual machine to thousands of volunteers who then just send you the sites they download save you guys any bandwidth?
[21:46] You know how low-carb diets work? It's like that.
[21:47] [yahoo-upcoming-grab] alard pushed 1 new commit to master: https://github.com/ArchiveTeam/yahoo-upcoming-grab/commit/6e82e5315a6b3641d87710d88dafb8549d547950
[21:47] yahoo-upcoming-grab/master 6e82e53 Alard: Add logic for groups and users (next step).
[21:47] gcr: It's usually not about the pure bandwidth.
[21:47] Are you actually going to explain it out? You have more time than I do.
[21:48] Crawling takes CPU and memory. In the case of Yahoo, there's the IP blocking.
[21:48] I'm a friendly guy.
[21:48] You are, I'll give you that.
[21:48] (And I don't understand low-carb diets.)
[21:48] oh, they limit bandwidth for individual IPs?
[21:48] Also, you made the thing, I guess I'll stop everything if someone queries hard about one of my movies, too
[21:49] Oh, it's nothing like a low-carb diet, that information is extremely false.
[21:50] gcr: Are you running the warrior? If not, try it. You'll see. :)
[21:51] Ok, I'll try it
[21:52] I'm just curious if this is the most efficient way of doing a project like this
[21:52] Yes.
[21:52] Yes, it is on multiple levels, and it actually advances issues far and beyond this project.
[21:53] I have another question. On the dashboard it says "data 171GB 5MB/u" Is that total data pulled, and average per user?
[21:53] Yes.
[21:53] Average per item, which in this case is a batch of 25 events.
[21:54] We normally do one task per username, hence the "u".
[21:55] Ah. I was just going to ask about the "u".
[21:55] Thanks
[21:56] SketchCow: glad to see you putting up CD images again
[22:08] on Posterous, I've got a bunch of tasks with lots of 503 or 504 errors. Anything I should be doing differently?
[22:15] * SketchCow and DFjustin are rocking the CDs
[22:15] ussjoin: No, that's just the Posterous server.
[22:16] alard: Cool. (I know it is when it's just those HTTP errors, but not sure if/when the Warrior thing is munging errors.)
[22:25] Is it normal to bounce super fast between say 500k/s and 0k/s for several seconds at a time?
[22:25] This suggests that it makes more sense to set the default number of concurrent projects much higher than 2 at a time
[22:26] or well, at least *my* number of concurrent projects anyway
[22:27] We're not in it to flood the networks of everyone who runs it.
[22:27] that's how you get tourists
[22:34] sure, ok
[22:40] I'm suddenly having warrior troubles
[22:40] I think it has to do with me updating VBox
[22:41] http://pastebin.com/PZXVZuwH
[22:42] huh, livejournal now has a thingie where they delete accounts with less than three entries after two years of not logging in
[22:52] Question: running Warrior, I think I've chosen the max settings (6 concurrent items, 4 rsync threads). I've got a fast Net connection, though, and it's not being particularly well-utilized... is there any way for me to control that and make Warrior suck up even more?
[22:56] pjlover: you can run the scripts directly if you're on *nix, or can run a *nix VM and understand how
[22:56] however most of the time the slowness is due to the servers at the other end being slow
[22:58] Ah, ok, fair enough. I just see siliconvalleypark on the leaderboard, and he's pushing items at a pretty high rate, and wanted to understand if he was doing anything special.
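
To make the "item" bookkeeping above concrete: an item such as events-3657850 names the first ID of a 25-event batch, so it covers events 3657850 through 3657874, and the dashboard's MB-per-item figure is averaged over batches like that. A tiny illustration; the bare /event/ID/ URL form is a simplification, since the live pages carried a location/title slug after the ID (see the curl example earlier):

    # Expand an Upcoming item name into the 25 event IDs it covers.
    ITEM="events-3657850"
    START=${ITEM#events-}

    for ID in $(seq "$START" $((START + 24))); do
        echo "http://upcoming.yahoo.com/event/${ID}/"
    done
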
[22:58] I've got a Linux box I could certainly try to get it set up on, though... where are the scripts?
[22:59] Smiley: any idea on my kernel failure?
[22:59] http://pastebin.com/PZXVZuwH
[23:00] alard: anything happening on the AMI front? I've got an evening to kill, and was planning on spending it putting something a bit more generic together.
[23:00] hneio_: hmm not running the vbox modules?
[23:00] I'm running the ova under vbox
[23:02] alard: (than the yahoo messages AMI I put together, that is)
[23:11] hneio_: lsmod | grep vbox return anything?
[23:13] Smiley: Win7 :x
[23:39] hneio_: ah ok no idea :<
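
On the standalone-client question above ("where are the scripts?"): the grab code lives in the ArchiveTeam/yahoo-upcoming-grab repository and runs on the seesaw pipeline, which is what lets a Linux box go past the warrior's 6-item limit. Something along these lines should work, but the exact package name, flags, and any wget-lua build step depend on the repo's README at the time, so treat this as a sketch rather than a recipe:

    # Sketch of running the Upcoming grab outside the warrior VM.
    # Verify each step against the repository's README; names and flags may differ.
    sudo pip install seesaw
    git clone https://github.com/ArchiveTeam/yahoo-upcoming-grab
    cd yahoo-upcoming-grab
    # the repo may also require building wget-lua first; see its README
    run-pipeline pipeline.py YOURNICKNAME --concurrent 10
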