#archiveteam 2013-09-30,Mon


Time Nickname Message
00:59 DFJustin how are we doing on webtv
03:33 dashcloud need lots more help- my download got stopped, and I'm not entirely sure how far I got because of the massive number of duplicates --mirror entails
03:35 dashcloud here's the last line in my download log: community-2.webtv.net/@HH!BC!ED!761326D9D04B/ValSpegeln/CLOSEENCOUNTERSNEWS/clipart/Education/ed00030_.gif’ saved
03:37 dashcloud is there an easy way to resume my wget-warc download without duplicating the things I've already done? (skip all the existing stuff, and just start downloading from some point on?)
03:44 dashcloud so, I do plan to restart a download, but definitely more people are needed
03:49 omf_ wget's resume capability is worthless. It would re-check everything you already got first before resuming, a huge time waste
03:49 omf_ and no way to skip it
03:49 dashcloud so, I tried to approximate where I think I got to (hard to know for sure with the bizarre URL schemes for webtv), and started the download there
03:53 omf_ if you are generating warcs then you should be able to ls -lStr *.warc.gz and see the last warc created. Are you using the url as the filename?
03:55 dashcloud I kind of wish I had thought of doing that before I started to download again
03:56 dashcloud here is the shorter list I'm working with- I've restarted on line 3109 or so, and may have done all the previous lines. http://paste.archivingyoursh.it/quqoweyiso.avrasm
03:57 dashcloud here's the longer Bing list that's unfiltered and unduplicated: http://paste.archivingyoursh.it/ficequtape.avrasm (12.6k lines)
03:57 dashcloud I have to get going now- good night and good luck!
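
A rough sketch of the resume dashcloud asks about, in Python: collect the paths the wget log reports as saved, then drop those URLs (and any duplicates) from the list before restarting. The function names, the "saved" line pattern, and the scheme-stripping comparison are all assumptions for illustration. wget's log wording varies by version and locale, and pages saved under a different local name (for example as index.html) would not match.

```python
import re

# Assumed log shape: "... ‘host/path’ saved [size]"; the quote characters
# differ between wget versions, so several are accepted here.
SAVED_RE = re.compile(r"[‘`']?(\S+?)[’'] saved")


def saved_paths(log_lines):
    """Return the set of local paths that the log reports as saved."""
    paths = set()
    for line in log_lines:
        match = SAVED_RE.search(line)
        if match:
            paths.add(match.group(1))
    return paths


def strip_scheme(url):
    """Drop 'http://' etc. so a URL can be compared with a saved path."""
    return re.sub(r"^[a-z][a-z0-9+.-]*://", "", url.strip(), flags=re.I)


def remaining_urls(urls, done):
    """Keep URLs not yet saved, removing duplicates but preserving order."""
    seen = set()
    out = []
    for url in urls:
        key = strip_scheme(url)
        if not key or key in done or key in seen:
            continue
        seen.add(key)
        out.append(url.strip())
    return out
```

The filtered list could then be written to a file and handed back to wget as its input list; whether that actually avoids the re-checking omf_ describes would need testing against the real grab.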
04:29 SketchCow Hi, gang.
04:29 SketchCow Anyone jump on zapd?
04:32 S[h]O[r]T bsmith093 i think the urlteam tracker has been down for 2-3 months now but the project is still alive
04:33 omf_ They deliver images via javascript so we need a javascript piece somewhere in the mix for getting most of the content
04:34 omf_ As an example load http://anna-heimbichner.zapd.com/cake-pops without javascript and it is essentially an empty page. With js on we find that page is the index for a series of posts
04:39 omf_ This is more annoying than snapjoy
04:42 S[h]O[r]T does their iphone app load pages too or just create? maybe an easier way to grab via how the app does. assuming it doesnt use javascript
04:44 omf_ Anyone with iphone want to try that out?
04:48 S[h]O[r]T im downloading the app now. what if you change your user agent to something mobile, does it load differently?
04:58 tephra S[h]O[r]T: just tried (android mobile UA), looks like its just a bunch of js either way
05:00 S[h]O[r]T yeah same
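
The user-agent check above could be repeated with something like the following Python sketch, which requests a page while presenting a mobile User-Agent. The UA string and the function names are hypothetical, chosen only for illustration; nothing here reflects how tephra or S[h]O[r]T actually tested it.

```python
import urllib.request

# Hypothetical mobile User-Agent string, for illustration only.
MOBILE_UA = (
    "Mozilla/5.0 (Linux; Android 4.3; Nexus 4) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/30.0 Mobile Safari/537.36"
)


def build_request(url, user_agent=MOBILE_UA):
    """Build a request that identifies itself with the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=MOBILE_UA, timeout=30):
    """Return the raw response body for url, fetched with that User-Agent."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Fetching the same URL once with a desktop UA and once with a mobile one, then comparing the two bodies, would show whether the server varies its markup.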
05:04 S[h]O[r]T looking at grabbing the ipad traffic now
05:05 omf_ zapd does not have an api
05:05 SketchCow I'm going to keep attacking that guy in social media, if that's OK.
05:07 tephra need to drag myself to school will look at it when i get there
05:07 chfoo i'm just tossing out an idea, something like selenium could load up the page, take a fullpage screenshot and save the rendered dom page source
05:08 DFJustin SketchCow: this closes tomorrow and we only have partial grabs http://archiveteam.org/index.php?title=MSN_TV
05:14 omf_ chfoo, I already have a cli application that can do that in parallel. It does not solve the url discovery problem though
05:15 S[h]O[r]T zapd force crashed when i try to proxy its https so i can see it. might have to do it on my jailbroken ipad and install tcpdump
05:15 omf_ I have a search running against the common crawl index looking for urls. It should be done in an hour or so
05:16 S[h]O[r]T im as far as http://zapd2-mobile-gateway.herokuapp.com
05:18 chfoo there's also the domain zapd.co
05:19 chfoo it looks like xxxx.zapd.co could be incremental
05:20 chfoo they redirect to the user's full site
05:23 S[h]O[r]T there's tons of cnames for both, no good A records tho
05:41 omf_ lets take this zapd talk to #at-zapd
05:47 Nemo_bis Sigh, only 362 GB left on the disk for this item https://archive.org/details/wikimediacommons-201209
05:52 SketchCow DFJustin: Is there anything we can do?
05:54 DFJustin people just need to wget the shit out of the url lists I think
06:12 joepie92 omf_: need custom python code for zapd?
06:33 yipdw hmm
06:33 yipdw I got an idea
06:33 yipdw DFJustin: community-*.webtv.net seems like a good place to start, yeah
06:40 yipdw ok
06:40 yipdw DFJustin: http://archivebot.at.ninjawedding.org:4567/
06:40 yipdw this is gonna be interesting
06:48 yipdw GLaDOS: FYI, dumpground.archivingyoursh.it/archive has some real stuff on it now (namely, WebTV grabs)
06:48 yipdw GLaDOS: I'll talk with you later re: extracting them
06:56 yipdw argh, this @Lookup shit is annoying
08:35 GLaDOS yipdw_: sweet
08:36 yipdw_ GLaDOS: I'll generate a URL manifest in the morning
08:36 yipdw_ gotta crash atm
08:36 GLaDOS Alright
08:36 yipdw_ watching http://archivebot.at.ninjawedding.org:4567/ is pretty funny, though
08:36 GLaDOS If you want, I can give you access to that dir on anarchive
08:36 yipdw_ hmm
08:37 yipdw_ I don't think I'll need it so long as dumpground has its HTTP accessibility
08:37 GLaDOS It'll always be HTTP accessible.
11:03 dashcloud good news- webtv is still up- I'll leave my grab going until it finishes or it times out
13:49 omf_ If you know anything about ZAPD not on this page http://archiveteam.org/index.php?title=Zapd please add it
13:58 joepie92 omf_: need custom Python code for this y/n?
13:58 joepie92 if yes, can probably whip up some stuff
14:05 * brayden throws a shoe at joepie92
14:05 brayden No! I'll do it!
14:05 joepie92 hey D
14:05 joepie92 D: *
14:05 joepie92 also
14:06 joepie92 if serious, I'll just go play red eclipse
14:06 joepie92 after I handle this important thing
14:06 brayden Well I can do it but I use urllib and can't thread for shit so it might be a bit slow
14:06 brayden can use tornado to do it async though
14:06 brayden and use beautifulsoup to parse the page
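
For the threading worry brayden raises, the standard library's `concurrent.futures` is one alternative to tornado (it is a swapped-in technique, not what brayden proposes). A minimal Python sketch, with hypothetical function names and an arbitrary default of 8 workers:

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request


def fetch(url, timeout=30):
    """Download one URL with urllib and return the response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetch_one=fetch, workers=8):
    """Fetch urls concurrently; return {url: body, or the exception raised}."""
    def safe(url):
        # Capture failures per URL so one bad page does not stop the batch.
        try:
            return fetch_one(url)
        except Exception as exc:
            return exc

    urls = list(urls)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as the input list.
        return dict(zip(urls, pool.map(safe, urls)))
```

Passing `fetch_one` as a parameter keeps the downloader replaceable, for example with a version that sets headers or sleeps between requests.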
14:08 brayden knocked out!
14:08 brayden oh well
14:09 joepie92 goddamn
14:09 joepie92 <joepie92>D: *
14:09 joepie92 <joepie92>after I handle this important thing
14:09 joepie92 <joepie92>also
14:09 joepie92 <joepie92>hey D
14:09 joepie92 <joepie92>if serious, I'll just go play red eclipse
14:09 brayden lol nice
14:09 brayden missed the rest?
14:09 joepie92 STUPID INTERNET PROVIDER
14:09 joepie92 yes
14:09 joepie92 missed everything after that
14:09 brayden http://brayden.id.au/images/2013-09-30_22-09-38.txt
14:11 brayden All content is served via javascript, with js disabled you just get an empty template.
14:11 brayden well.. HTML parsing might not be helpful
14:12 omf_ oh that reminded me to put the info about the comments on there
14:13 brayden wtf.. the "Read more" link on their home page 404s?
14:15 omf_ I just shoved the rest of the site restrictions I know about on the wiki
14:15 brayden wow.. it just has a huge array called data that has.. everything?
14:17 GLaDOS =====================================
14:17 GLaDOS =====================================
14:17 GLaDOS Point all your warriors at it!
14:17 GLaDOS The tracker is now located at http://urlteam.terrywri.st/
14:17 GLaDOS URLTeam is active again!
14:17 omf_ The main tracker page links have been updated as well.
14:17 brayden omf_, do you have an example of a zapd page with lots of comments and content?
14:17 brayden basically a "worst case"
14:18 omf_ I added it to the wiki
14:18 joepie92 brayden: I see...
14:18 joepie92 GLaDOS: does it work with standard warrior config?
14:18 brayden ah good
14:18 GLaDOS It does.
14:18 brayden oh that site
14:18 brayden my eyes are dying
14:20 joepie92 :)
14:21 brayden Save for the comment issue it seems actually surprisingly easy
14:21 brayden given that huge javascript array
14:21 brayden parse it as JSON and can easily pull in the data I reckon!
14:21 * brayden is still going through it though
14:21 brayden Easier than that stupid yahoo blog thing anyway
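
A minimal Python sketch of the "parse it as JSON" idea brayden describes, assuming the page embeds the array as something like `var data = [ ... ];` inside a script tag. Only the name `data` comes from the discussion; the exact assignment syntax, the function name, and the sample keys in the usage note are assumptions. If the array is a JavaScript literal that is not strict JSON (single quotes, trailing commas), `json` will raise an error and a different parser would be needed.

```python
import json
import re

# Assumed shape: "data = [" somewhere in the page source. The lookahead
# leaves the match ending exactly at the opening bracket.
DATA_RE = re.compile(r"\bdata\s*=\s*(?=\[)")


def extract_data_array(html):
    """Return the decoded `data` array from page source, or None if absent."""
    match = DATA_RE.search(html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever JavaScript follows the closing bracket.
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value
```

For a page containing `var data = [{"title": "cake pops"}];` this would return a one-element list of dicts, which could then be walked for image and post URLs.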
14:22 ersi Is there a zapd channel?
14:22 GLaDOS #zapped
14:22 ersi Good
14:22 brayden Sick of my optimistic spam? :(
14:22 omf_ #at-zapd
14:23 ersi brayden: Sorry, but yes a little :) It's great that you keep up the work though!
14:28 omf_ If any admins are missing from http://www.archiveteam.org/index.php?title=Tracker#People please let me know or update the wiki page. Thanks.
14:30 DFJustin hmm archivebot seems to have escaped into youtube
14:30 joepie92 brayden: perhaps you should do the code stuff
14:30 joepie92 my connection seems to be too unstable
14:30 joepie92 to actually use
14:30 joepie92 I am considering setting up a UDP VPN
14:31 joepie92 as it seems to be just TCP connections that are affected
14:31 brayden god damn.. hope you don't have a lot of packet loss!
14:31 joepie92 red eclipse still runs flawlessly
14:31 joepie92 brayden; I have none
14:31 joepie92 that's the strange thing
14:31 joepie92 there is no measurable network issue
14:31 joepie92 other than, you know, all my connections dropping every 2 minutes
14:31 brayden ADSL? Probably line issues then
14:31 joepie92 no, FttH.
14:31 joepie92 yes, fiber, really.
14:31 brayden lol.. RDNS says direct-adsl.
14:31 brayden silly ISP
14:31 joepie92 :|
14:32 joepie92 yeah
14:32 joepie92 old IP ranges
14:32 joepie92 this ISP also does ADSL
14:32 joepie92 and these are ex-ADSL IP ranges
14:32 joepie92 they just never fixed the rDNS
14:32 joepie92 (they also still have the rDNS on some ranges of an ISP they took over like 6 years ago)
14:33 joepie92 rubbish ISP is rubbish
14:33 * ersi grumbles
14:33 * brayden hides
14:48 ersi :D
16:54 SketchCow I see nobody in #zapped
17:00 tephra SketchCow: i think the discussion is in #at-zapped
17:00 tephra SketchCow: no #at-zapd
17:00 tephra sorry
17:05 joepie92 tephra: man, there were so many beautiful puns that could've been made with "zapd"
17:06 joepie92 why did we settle for "at-zapd" :(
17:25 SketchCow Yeah, what the fuck.
17:25 SketchCow It's because I wasn't here.
17:26 SketchCow I have failed everyone
17:26 SketchCow It was a busy month.
18:11 ersi zappideedoodaa
21:08 yipdw Smiley: we need a zapd project in the AT tracker
21:08 yipdw Smiley: actually, get in #crapd
21:10 yipdw actually, any of you tracker admins get in there
