[00:59] how are we doing on webtv
[03:33] need lots more help- my download got stopped, and I'm not entirely sure how far I got because of the massive number of duplicates --mirror entails
[03:35] here's the last line in my download log: community-2.webtv.net/@HH!BC!ED!761326D9D04B/ValSpegeln/CLOSEENCOUNTERSNEWS/clipart/Education/ed00030_.gif’ saved
[03:37] is there an easy way to resume my wget-warc download without duplicating the things I've already done? (skip all the existing stuff, and just start downloading from some point on?)
[03:44] so, I do plan to restart a download, but definitely more people are needed
[03:49] wget's resume capability is worthless. It would re-check everything you already got before resuming, a huge waste of time
[03:49] and there's no way to skip it
[03:49] so, I tried to approximate where I think I got to (hard to know for sure with the bizarre URL schemes for webtv), and started the download there
[03:53] if you are generating warcs then you should be able to ls -lStr *.warc.gz and see the last warc created. Are you using the url as the filename?
[03:55] I kind of wish I had thought of doing that before I started to download again
[03:56] here is the shorter list I'm working with- I've restarted on line 3109 or so, and may have done all the previous lines. http://paste.archivingyoursh.it/quqoweyiso.avrasm
[03:57] here's the longer Bing list that's unfiltered and undeduplicated: http://paste.archivingyoursh.it/ficequtape.avrasm (12.6k lines)
[03:57] I have to get going now- good night and good luck!
[04:29] Hi, gang.
[04:29] Anyone jump on zapd?
[04:32] bsmith093: i think the urlteam tracker has been down for 2-3 months now but the project is still alive
[04:33] They deliver images via javascript so we need a javascript piece somewhere in the mix for getting most of the content
[04:34] As an example, load http://anna-heimbichner.zapd.com/cake-pops without javascript and it is essentially an empty page.
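The resume trick discussed above (restarting a `wget -i` run partway through a URL list) can be approximated in a few lines: find the last URL the wget log reported as saved, and emit only the URLs after it for a fresh run. A minimal sketch; the filenames are placeholders, not the actual lists linked above:

```python
def remaining_urls(url_list, last_saved_url):
    """Return the URLs still needing download, given the last URL the
    wget log reported as saved.  If that URL is not in the list (webtv's
    odd URL schemes make matching unreliable), return everything so
    nothing is silently skipped."""
    try:
        idx = url_list.index(last_saved_url)
    except ValueError:
        return list(url_list)
    return url_list[idx + 1:]

def write_remaining(list_path, last_saved_url, out_path="remaining.txt"):
    """Write the tail of the list for a fresh run, e.g.:
         wget --warc-file=webtv-resume -i remaining.txt"""
    with open(list_path) as f:
        urls = [line.strip() for line in f if line.strip()]
    rest = remaining_urls(urls, last_saved_url)
    with open(out_path, "w") as f:
        f.write("\n".join(rest) + "\n")
    return len(rest)
```

This skips the re-checking pass that makes wget's own `--continue` so slow, at the cost of trusting the log line to mark real progress.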
[04:34] With js on we find that page is the index for a series of posts
[04:39] This is more annoying than snapjoy
[04:42] does their iphone app load pages too or just create? maybe there's an easier way to grab via how the app does it, assuming it doesn't use javascript
[04:44] Anyone with an iphone want to try that out?
[04:48] i'm downloading the app now. what if you change your user agent to something mobile, does it load differently?
[04:58] S[h]O[r]T: just tried (android mobile UA), looks like it's just a bunch of js either way
[05:00] yeah same
[05:04] looking at grabbing the ipad traffic now
[05:05] zapd does not have an api
[05:05] I'm going to keep attacking that guy on social media, if that's OK.
[05:07] need to drag myself to school, will look at it when i get there
[05:07] i'm just tossing out an idea, something like selenium could load up the page, take a full-page screenshot and save the rendered dom page source
[05:08] SketchCow: this closes tomorrow and we only have partial grabs http://archiveteam.org/index.php?title=MSN_TV
[05:14] chfoo: I already have a cli application that can do that in parallel. It does not solve the url discovery problem though
[05:15] zapd force crashed when i tried to proxy its https so i can see it. might have to do it on my jailbroken ipad and install tcpdump
[05:15] I have a search running against the common crawl index looking for urls. It should be done in an hour or so
[05:16] i'm as far as http://zapd2-mobile-gateway.herokuapp.com
[05:18] there's also the domain zapd.co
[05:19] it looks like xxxx.zapd.co could be incremental
[05:20] they redirect to the user's full site
[05:23] there's tons of cnames for both, no good A records though
[05:41] let's take this zapd talk to #at-zapd
[05:47] Sigh, only 362 GB left on the disk for this item https://archive.org/details/wikimediacommons-201209
[05:52] DFJustin: Is there anything we can do?
[05:54] people just need to wget the shit out of the url lists I think
[06:12] omf_: need custom python code for zapd?
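The user-agent experiment tried at 04:48/04:58 can be reproduced without a phone using the stdlib. A sketch: build a request that presents a mobile UA and diff the response body against a desktop fetch. The UA string here is an example, not the one actually used in the channel:

```python
from urllib import request

# Example mobile UA string (an assumption, not the exact one tried above)
MOBILE_UA = ("Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) "
             "AppleWebKit/536.26 (KHTML, like Gecko) Mobile/10A403")

def mobile_request(url):
    """Build a request that identifies as a mobile browser; pass it to
    request.urlopen() and compare the body against a default fetch."""
    return request.Request(url, headers={"User-Agent": MOBILE_UA})
```

As reported above, zapd served the same js-only shell regardless of UA, but the technique is a cheap first check before reaching for app traffic capture.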
[06:33] hmm
[06:33] I got an idea
[06:33] DFJustin: community-*.webtv.net seems like a good place to start, yeah
[06:40] ok
[06:40] DFJustin: http://archivebot.at.ninjawedding.org:4567/
[06:40] this is gonna be interesting
[06:48] GLaDOS: FYI, dumpground.archivingyoursh.it/archive now has some real stuff on it (namely, WebTV grabs)
[06:48] GLaDOS: I'll talk with you later re: extracting them
[06:56] argh, this @Lookup shit is annoying
[08:35] yipdw_: sweet
[08:36] GLaDOS: I'll generate a URL manifest in the morning
[08:36] gotta crash atm
[08:36] Alright
[08:36] watching http://archivebot.at.ninjawedding.org:4567/ is pretty funny, though
[08:36] If you want, I can give you access to that dir on anarchive
[08:36] hmm
[08:37] I don't think I'll need it so long as dumpground has its HTTP accessibility
[08:37] It'll always be HTTP accessible.
[11:03] good news- webtv is still up- I'll leave my grab going until it finishes or it times out
[13:49] If you know anything about ZAPD not on this page http://archiveteam.org/index.php?title=Zapd please add it
[13:58] omf_: need custom Python code for this y/n?
[13:58] if yes, can probably whip up some stuff
[14:05] * brayden throws a shoe at joepie92
[14:05] No! I'll do it!
[14:05] hey D
[14:05] D: *
[14:05] also
[14:06] if serious, I'll just go play red eclipse
[14:06] after I handle this important thing
[14:06] Well I can do it but I use urllib and can't thread for shit so it might be a bit slow
[14:06] can use tornado to do it async though
[14:06] and use beautifulsoup to parse the page
[14:08] knocked out!
[14:08] oh well
[14:09] goddamn
[14:09] D: *
[14:09] after I handle this important thing
[14:09] also
[14:09] hey D
[14:09] if serious, I'll just go play red eclipse
[14:09] lol nice
[14:09] missed the rest?
[14:09] STUPID INTERNET PROVIDER
[14:09] yes
[14:09] missed everything after that
[14:09] http://brayden.id.au/images/2013-09-30_22-09-38.txt
[14:11] All content is served via javascript; with js disabled you just get an empty template.
[14:11] well.. HTML parsing might not be helpful
[14:12] oh, that reminded me to put the info about the comments on there
[14:13] wtf.. the "Read more" link on their home page 404s?
[14:15] I just shoved the rest of the site restrictions I know about on the wiki
[14:15] wow.. it just has a huge array called data that has.. everything?
[14:17] =====================================
[14:17] =====================================
[14:17] Point all your warriors at it!
[14:17] The tracker is now located at http://urlteam.terrywri.st/
[14:17] URLTeam is active again!
[14:17] The main tracker page links have been updated as well.
[14:17] omf_: do you have an example of a zapd page with lots of comments and content?
[14:17] basically a "worst case"
[14:18] I added it to the wiki
[14:18] brayden: I see...
[14:18] GLaDOS: does it work with standard warrior config?
[14:18] ah good
[14:18] It does.
[14:18] oh that site
[14:18] my eyes are dying
[14:20] :)
[14:21] Save for the comment issue, it seems actually surprisingly easy
[14:21] given that huge javascript array
[14:21] parse it as JSON and we can easily pull in the data, I reckon!
[14:21] * brayden is still going through it though
[14:21] Easier than that stupid yahoo blog thing anyway
[14:22] Is there a zapd channel?
[14:22] #zapped
[14:22] Good
[14:22] Sick of my optimistic spam? :(
[14:22] #at-zapd
[14:23] brayden: Sorry, but yes, a little :) It's great that you keep up the work though!
[14:28] If any admins are missing from http://www.archiveteam.org/index.php?title=Tracker#People please let me know or update the wiki page. Thanks.
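If the page really does embed one big JavaScript `data` array, the "parse it as JSON" idea from 14:21 can skip HTML parsing entirely: pull the array literal out of the page source with a regex and hand it to the JSON parser. A sketch that assumes the literal is JSON-compatible (double-quoted strings, no trailing commas); the real zapd markup was not confirmed in the channel and may need a more tolerant parser:

```python
import json
import re

# Matches `var data = [...] ;` in page source.  DOTALL lets the array
# span multiple lines; the non-greedy body stops at the first `];`.
DATA_RE = re.compile(r'var\s+data\s*=\s*(\[.*?\])\s*;', re.DOTALL)

def extract_data(html):
    """Return the parsed `data` array from page source, or None if no
    such assignment is found."""
    m = DATA_RE.search(html)
    if m is None:
        return None
    return json.loads(m.group(1))
```

This avoids needing beautifulsoup or a js engine for the common case, while the selenium approach above remains the fallback for pages where the array is built dynamically.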
[14:30] hmm, archivebot seems to have escaped into youtube
[14:30] brayden: perhaps you should do the code stuff
[14:30] my connection seems to be too unstable
[14:30] to actually use
[14:30] I am considering setting up a UDP VPN
[14:31] as it seems to be just TCP connections that are affected
[14:31] god damn.. hope you don't have a lot of packet loss!
[14:31] red eclipse still runs flawlessly
[14:31] brayden: I have none
[14:31] that's the strange thing
[14:31] there is no measurable network issue
[14:31] other than, you know, all my connections dropping every 2 minutes
[14:31] ADSL? Probably line issues then
[14:31] no, FttH.
[14:31] yes, fiber, really.
[14:31] lol.. rDNS says direct-adsl.
[14:31] silly ISP
[14:31] :|
[14:32] yeah
[14:32] old IP ranges
[14:32] this ISP also does ADSL
[14:32] and these are ex-ADSL IP ranges
[14:32] they just never fixed the rDNS
[14:32] (they also still have the rDNS on some ranges of an ISP they took over, over 6 years ago)
[14:33] rubbish ISP is rubbish
[14:33] * ersi grumbles
[14:33] * brayden hides
[14:48] :D
[16:54] I see nobody in #zapped
[17:00] SketchCow: i think the discussion is in #at-zapped
[17:00] SketchCow: no, #at-zapd
[17:00] sorry
[17:05] tephra: man, there were so many beautiful puns that could've been made with "zapd"
[17:06] why did we settle for "at-zapd" :(
[17:25] Yeah, what the fuck.
[17:25] It's because I wasn't here.
[17:26] I have failed everyone
[17:26] It was a busy month.
[18:11] zappideedoodaa
[21:08] Smiley: we need a zapd project in the AT tracker
[21:08] Smiley: actually, get in #crapd
[21:10] actually, any of you tracker admins get in there