Time |
Nickname |
Message |
00:59
|
DFJustin |
how are we doing on webtv |
03:33
|
dashcloud |
need lots more help- my download got stopped, and I'm not entirely sure how far I got because of the massive number of duplicates --mirror entails |
03:35
|
dashcloud |
here's the last line in my download log: community-2.webtv.net/@HH!BC!ED!761326D9D04B/ValSpegeln/CLOSEENCOUNTERSNEWS/clipart/Education/ed00030_.gif’ saved |
03:37
|
dashcloud |
is there an easy way to resume my wget-warc download without duplicating the things I've already done? ( skip all the existing stuff, and just start downloading from some point on?) |
03:44
|
dashcloud |
so, I do plan to restart a download, but definitely more people are needed |
03:49
|
omf_ |
wget's resume capability is worthless. It would re-check everything you already got first before resuming, a huge time waste |
03:49
|
omf_ |
and no way to skip it |
03:49
|
dashcloud |
so, I tried to approximate where I think I got to (hard to know for sure with the bizarre URL schemes for webtv), and started the download there |
03:53
|
omf_ |
if you are generating warcs then you should be able to ls -lStr *.warc.gz and see the last warc created. Are you using the url as the filename? |
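A rough Python equivalent of the `ls -lStr *.warc.gz` suggestion above, as a minimal sketch: it sorts the WARC files in a directory by modification time so the last entry is the most recently written one. The function names are made up for illustration, and it assumes the WARCs sit directly in one directory.

```python
from pathlib import Path


def warcs_oldest_first(directory="."):
    """Return the *.warc.gz files in `directory`, oldest modification time first."""
    files = Path(directory).glob("*.warc.gz")
    return sorted(files, key=lambda p: p.stat().st_mtime)


def last_warc(directory="."):
    """Return the most recently modified WARC, or None if there are none."""
    warcs = warcs_oldest_first(directory)
    return warcs[-1] if warcs else None
```

If the WARC filenames are derived from the URLs, the name of the file returned by `last_warc()` gives a hint about where the grab stopped.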
03:55
|
dashcloud |
I kind of wish I had thought of doing that before I started to download again |
03:56
|
dashcloud |
here is the shorter list I'm working with- I've restarted on line 3109 or so, and may have done all the previous lines. http://paste.archivingyoursh.it/quqoweyiso.avrasm |
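Restarting a URL list from an approximate line, as described above, can be sketched in a few lines of Python. This is only an illustration under assumptions: the list is a plain text file with one URL per line, and `resume_urls` is a hypothetical helper name. It also drops duplicate URLs, including later repeats of URLs that appear before the restart point.

```python
def resume_urls(lines, start_line):
    """Yield unique URLs from `lines`, beginning at 1-based `start_line`.

    URLs seen before `start_line` are remembered, so a later duplicate of
    an already-processed URL is skipped as well.
    """
    seen = set()
    for number, line in enumerate(lines, start=1):
        url = line.strip()
        if not url or url in seen:
            continue
        seen.add(url)
        if number >= start_line:
            yield url
```

The output could be written to a new file and handed to `wget -i`, so only the remaining seed URLs are fetched.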
03:57
|
dashcloud |
here's the longer Bing list that's unfiltered and unduplicated: http://paste.archivingyoursh.it/ficequtape.avrasm (12.6k lines) |
03:57
|
dashcloud |
I have to get going now- good night and good luck! |
04:29
|
SketchCow |
Hi, gang. |
04:29
|
SketchCow |
Anyone jump on zapd? |
04:32
|
S[h]O[r]T |
bsmith093 i think the urlteam tracker has been down for 2-3 months now but the project is still alive |
04:33
|
omf_ |
They deliver images via javascript so we need a javascript piece somewhere in the mix for getting most of the content |
04:34
|
omf_ |
As an example load http://anna-heimbichner.zapd.com/cake-pops without javascript and it is essentially an empty page. With js on we find that page is the index for a series of posts |
04:39
|
omf_ |
This is more annoying than snapjoy |
04:42
|
S[h]O[r]T |
does their iphone app load pages too or just create? maybe an easier way to grab via how the app does.assuming it doesnt use javascript |
04:44
|
omf_ |
Anyone with iphone want to try that out? |
04:48
|
S[h]O[r]T |
im downloading the app now. what if you change your user agent to something mobile, does it load differently? |
04:58
|
tephra |
S[h]O[r]T: just tried (android mobile UA), looks like its just a bunch of js either way |
05:00
|
S[h]O[r]T |
yeah same |
05:04
|
S[h]O[r]T |
looking at grabbing the ipad traffic now |
05:05
|
omf_ |
zapd does not have an api |
05:05
|
SketchCow |
I'm going to keep attacking that guy in social media, if that's OK. |
05:07
|
tephra |
need to drag my self to school will look at it when i get there |
05:07
|
chfoo |
i'm just tossing out an idea, something like selenium could load up the page, take a fullpage screenshot and save the rendered dom page source |
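The Selenium idea above could look roughly like the sketch below. It is not anyone's actual tool: `snapshot` is a hypothetical helper that accepts any WebDriver-like object exposing `get()`, `page_source` and `save_screenshot()`, which Selenium's Python drivers do. `save_screenshot()` usually captures only the visible viewport, and a real run would probably need to wait for the JavaScript to finish rendering before saving.

```python
from pathlib import Path


def snapshot(driver, url, out_dir, name):
    """Load `url`, then save the rendered DOM and a screenshot under `out_dir`."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    driver.get(url)
    html_path = out / f"{name}.html"
    png_path = out / f"{name}.png"
    # page_source is the DOM as the browser currently sees it, after scripts ran.
    html_path.write_text(driver.page_source, encoding="utf-8")
    driver.save_screenshot(str(png_path))
    return html_path, png_path


# Possible usage, assuming Selenium and a browser driver are installed:
#   from selenium import webdriver
#   driver = webdriver.Firefox()
#   snapshot(driver, "http://anna-heimbichner.zapd.com/cake-pops", "out", "cake-pops")
#   driver.quit()
```

As noted in the channel, this only captures pages whose URLs are already known; it does not help with URL discovery.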
05:08
|
DFJustin |
SketchCow: this closes tomorrow and we only have partial grabs http://archiveteam.org/index.php?title=MSN_TV |
05:14
|
omf_ |
chfoo, I already have a cli application that can do that in parallel. It does not solve the url discovery problem though |
05:15
|
S[h]O[r]T |
zapd force crashed when i try to proxy its https so i can see it. might have to do it on my jailbroken ipad and install tcpdump |
05:15
|
omf_ |
I have a search running against the common crawl index looking for urls. It should be done in an hour or so |
05:16
|
S[h]O[r]T |
im as far as http://zapd2-mobile-gateway.herokuapp.com |
05:18
|
chfoo |
there's also the domain zapd.co |
05:19
|
chfoo |
it looks like xxxx.zapd.co could be incremental |
05:20
|
chfoo |
they redirect to the user's full site |
05:23
|
S[h]O[r]T |
there's tons of CNAMEs for both, no good A records tho |
05:41
|
omf_ |
lets take this zapd talk to #at-zapd |
05:47
|
Nemo_bis |
Sigh, only 362 GB left on the disk for this item https://archive.org/details/wikimediacommons-201209 |
05:52
|
SketchCow |
DFJustin: Is there anything we can do? |
05:54
|
DFJustin |
people just need to wget the shit out of the url lists I think |
06:12
|
joepie92 |
omf_: need custom python code for zapd? |
06:33
|
yipdw |
hmm |
06:33
|
yipdw |
I got an idea |
06:33
|
yipdw |
DFJustin: community-*.webtv.net seems like a good place to start, yeah |
06:40
|
yipdw |
ok |
06:40
|
yipdw |
DFJustin: http://archivebot.at.ninjawedding.org:4567/ |
06:40
|
yipdw |
this is gonna be interesting |
06:48
|
yipdw |
GLaDOS: FYI, dumpground.archivingyoursh.it/archive now has some real stuff on it now (namely, WebTV grabs) |
06:48
|
yipdw |
GLaDOS: I'll talk with you later re: extracting them |
06:56
|
yipdw |
argh, this @Lookup shit is annoying |
08:35
|
GLaDOS |
yipdw_: sweet |
08:36
|
yipdw_ |
GLaDOS: I'll generate a URL manifest in the morning |
08:36
|
yipdw_ |
gotta crash atm |
08:36
|
GLaDOS |
Alright |
08:36
|
yipdw_ |
watching http://archivebot.at.ninjawedding.org:4567/ is pretty funny, though |
08:36
|
GLaDOS |
If you want, I can give you access to that dir on anarchive |
08:36
|
yipdw_ |
hmm |
08:37
|
yipdw_ |
I don't think I'll need it so long as dumpground has its HTTP accessibility |
08:37
|
GLaDOS |
It'll always be HTTP accessible. |
11:03
|
dashcloud |
good news- webtv is still up- I'll leave my grab going until it finishes or it times out |
13:49
|
omf_ |
If you know anything about ZAPD not on this page http://archiveteam.org/index.php?title=Zapd please add it |
13:58
|
joepie92 |
omf_: need custom Python code for this y/n? |
13:58
|
joepie92 |
if yes, can probably whip up some stuff |
14:05
|
* |
brayden throws a shoe at joepie92 |
14:05
|
brayden |
No! I'll do it! |
14:05
|
joepie92 |
hey D |
14:05
|
joepie92 |
D: * |
14:05
|
joepie92 |
also |
14:06
|
joepie92 |
if serious, I'll just go play red eclipse |
14:06
|
joepie92 |
after I handle this important thing |
14:06
|
brayden |
Well I can do it but I use urllib and can't thread for shit so it might be a bit slow |
14:06
|
brayden |
can use tornado to do it async though |
14:06
|
brayden |
and use beautifulsoup to parse the page |
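For the "urllib without threads is slow" concern above, one standard-library option is a thread pool. The sketch below is an assumption-laden illustration, not the code discussed in the channel: `fetch` and `fetch_all` are hypothetical names, the worker count is arbitrary, and it uses `concurrent.futures` rather than tornado.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=30):
    """Download one URL and return the response body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, fetcher=fetch, workers=8):
    """Fetch URLs concurrently; return {url: body bytes, or the Exception raised}."""
    urls = list(urls)

    def safe(url):
        # Capture failures per URL so one bad page does not abort the batch.
        try:
            return fetcher(url)
        except Exception as exc:
            return exc

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(urls, pool.map(safe, urls)))
```

The `fetcher` parameter exists so a different download function (or a stub for testing) can be swapped in without touching the pool logic.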
14:08
|
brayden |
knocked out! |
14:08
|
brayden |
oh well |
14:09
|
joepie92 |
goddamn |
14:09
|
joepie92 |
<joepie92>D: * |
14:09
|
joepie92 |
<joepie92>after I handle this important thing |
14:09
|
joepie92 |
<joepie92>also |
14:09
|
joepie92 |
<joepie92>hey D |
14:09
|
joepie92 |
<joepie92>if serious, I'll just go play red eclipse |
14:09
|
brayden |
lol nice |
14:09
|
brayden |
missed the rest? |
14:09
|
joepie92 |
STUPID INTERNET PROVIDER |
14:09
|
joepie92 |
yes |
14:09
|
joepie92 |
missed everything after that |
14:09
|
brayden |
http://brayden.id.au/images/2013-09-30_22-09-38.txt |
14:11
|
brayden |
All content is served via javascript, with js disabled you just get an empty template. |
14:11
|
brayden |
well.. HTML parsing might not be helpful |
14:12
|
omf_ |
oh that reminded me to put the info about the comments on there |
14:13
|
brayden |
wtf.. the "Read more" link on their home page 404s? |
14:15
|
omf_ |
I just shoved the rest of the site restrictions I know about on the wiki |
14:15
|
brayden |
wow.. it just has a huge array called data that has.. everything? |
14:17
|
GLaDOS |
===================================== |
14:17
|
GLaDOS |
===================================== |
14:17
|
GLaDOS |
Point all your warriors at it! |
14:17
|
GLaDOS |
The tracker is now located at http://urlteam.terrywri.st/ |
14:17
|
GLaDOS |
URLTeam is active again! |
14:17
|
omf_ |
The main tracker page links have been updated as well. |
14:17
|
brayden |
omf_, do you have an example of a zapd page with lots of comments and content? |
14:17
|
brayden |
basically a "worst case" |
14:18
|
omf_ |
I added it to the wiki |
14:18
|
joepie92 |
brayden: I see... |
14:18
|
joepie92 |
GLaDOS: does it work with standard warrior config? |
14:18
|
brayden |
ah good |
14:18
|
GLaDOS |
It does. |
14:18
|
brayden |
oh that site |
14:18
|
brayden |
my eyes are dying |
14:20
|
joepie92 |
:) |
14:21
|
brayden |
Save for the comment issue it seems actually surprisingly easy |
14:21
|
brayden |
given that huge javascript array |
14:21
|
brayden |
parse it as JSON and can easily pull in the data I reckon! |
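Pulling the big `data` array out of the page source and parsing it as JSON, as suggested above, might be sketched like this. The variable name `data` comes from the chat; everything else is an assumption, including that the array is written as an assignment (`data = [...]`) and that its literal is valid JSON. A JavaScript literal with unquoted keys or single quotes would raise `json.JSONDecodeError` here.

```python
import json
import re


def extract_data_array(html, name="data"):
    """Return the JSON value assigned to `name` in `html`, or None if absent."""
    match = re.search(r"\b%s\s*=\s*" % re.escape(name), html)
    if match is None:
        return None
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (the trailing ';' and the rest of the page).
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value
```

Using `raw_decode` avoids having to find the end of the array by hand, which would otherwise mean matching brackets across the whole script.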
14:21
|
* |
brayden is still going through it though |
14:21
|
brayden |
Easier than that stupid yahoo blog thing anyway |
14:22
|
ersi |
Is there a zapd channel? |
14:22
|
GLaDOS |
#zapped |
14:22
|
ersi |
Good |
14:22
|
brayden |
Sick of my optimistic spam? :( |
14:22
|
omf_ |
#at-zapd |
14:23
|
ersi |
brayden: Sorry, but yes a little :) It's great that you keep up the work though! |
14:28
|
omf_ |
If any admins are missing from http://www.archiveteam.org/index.php?title=Tracker#People please let me know or update the wiki page. Thanks. |
14:30
|
DFJustin |
hmm archivebot seems to have escaped into youtube |
14:30
|
joepie92 |
brayden: perhaps you should do the code stuff |
14:30
|
joepie92 |
my connection seems to be too unstable |
14:30
|
joepie92 |
to actually use |
14:30
|
joepie92 |
I am considering setting up a UDP VPN |
14:31
|
joepie92 |
as it seems to be just TCP connections that are affected |
14:31
|
brayden |
god damn.. hope you don't have a lot of packet loss! |
14:31
|
joepie92 |
red eclipse still runs flawlessly |
14:31
|
joepie92 |
brayden; I have none |
14:31
|
joepie92 |
that's the strange thing |
14:31
|
joepie92 |
there is no measurable network issue |
14:31
|
joepie92 |
other than, you know, all my connections dropping every 2 minutes |
14:31
|
brayden |
ADSL? Probably line issues then |
14:31
|
joepie92 |
no, FttH. |
14:31
|
joepie92 |
yes, fiber, really. |
14:31
|
brayden |
lol.. RDNS says direct-adsl. |
14:31
|
brayden |
silly ISP |
14:31
|
joepie92 |
:| |
14:32
|
joepie92 |
yeah |
14:32
|
joepie92 |
old IP ranges |
14:32
|
joepie92 |
this ISP also does ADSL |
14:32
|
joepie92 |
and these are ex-ADSL IP ranges |
14:32
|
joepie92 |
they just never fixed the rDNS |
14:32
|
joepie92 |
(they also still have the rDNS on some ranges of an ISP they took over like 6 years ago) |
14:33
|
joepie92 |
rubbish ISP is rubbish |
14:33
|
* |
ersi grumbles |
14:33
|
* |
brayden hides |
14:48
|
ersi |
:D |
16:54
|
SketchCow |
I see nobody in #zapped |
17:00
|
tephra |
SketchCow: i think the discussion is in #at-zapped |
17:00
|
tephra |
SketchCow: no #at-zapd |
17:00
|
tephra |
sorry |
17:05
|
joepie92 |
tephra: man, there were so many beautiful puns that could've been made with "zapd" |
17:06
|
joepie92 |
why did we settle for "at-zapd" :( |
17:25
|
SketchCow |
Yeah, what the fuck. |
17:25
|
SketchCow |
It's because I wasn't here. |
17:26
|
SketchCow |
I have failed everyone |
17:26
|
SketchCow |
It was a busy month. |
18:11
|
ersi |
zappideedoodaa |
21:08
|
yipdw |
Smiley: we need a zapd project in the AT tracker |
21:08
|
yipdw |
Smiley: actually, get in #crapd |
21:10
|
yipdw |
actually, any of you tracker admins get in there |