02:01 -- JesseW has joined #urlteam
02:02 -- svchfoo1 sets mode: +o JesseW
03:03 -- JesseW has quit IRC (Leaving.)
03:08 -- W1nterFox has joined #urlteam
03:10 -- WinterFox has quit IRC (Read error: Operation timed out)
03:12 -- Muad-Dib has quit IRC (Ping timeout: 252 seconds)
03:20 -- Start has quit IRC (Read error: Connection reset by peer)
03:20 -- Start has joined #urlteam
05:01 -- aaaaaaaaa has quit IRC (Leaving)
05:21 -- W1nterFox has quit IRC (Read error: Operation timed out)
05:26 -- W1nterFox has joined #urlteam
06:24 -- JesseW has joined #urlteam
06:25 -- svchfoo1 sets mode: +o JesseW
07:05 <JesseW> Deewiant: you generated an interesting error:
07:05 <JesseW> File "/home/deewiant/terroroftinytown-client-grab/terroroftinytown/terroroftinytown/services/isgd.py", line 42, in process_unavailable
07:05 <JesseW> raise errors.UnexpectedNoResult("Could not find processing unavailable for %s" % self.current_shortcode)
07:05 <JesseW> UnexpectedNoResult: Could not find processing unavailable for Wupip9
07:05 <JesseW> for project isgd_6 at 2015-12-09 06:49:21.339344
07:07 <Deewiant> Mm-hm
07:09 <Deewiant> <div id="main"><p>Rate limit exceeded - you must wait at least 1798 seconds before we'll service this request.</p></div>
07:09 -- Start has quit IRC (Read error: Connection reset by peer)
07:10 <JesseW> mostly just putting this here to remind me to look into it.
07:10 -- Start has joined #urlteam
07:10 <Deewiant> That one doesn't seem to match anything in that function
07:10 -- Start_ has joined #urlteam
07:10 -- Start has quit IRC (Read error: Connection reset by peer)
07:12 <Deewiant> Quite a long timeout though, that's a bit annoying
07:13 <JesseW> ah, they changed the format of the Rate limit message, apparently.
07:14 <JesseW> feel free to make PR :-)
07:14 <JesseW> er, make *a* PR
07:15 <Deewiant> Is PleaseRetry() appropriate even though the wait is around 30 minutes instead of 1?
07:20 <JesseW> yep, because it will retry from another IP.
07:20 <JesseW> IIRC
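
The change being discussed would look roughly like this (a sketch, not Deewiant's actual PR; the error classes, method signature and the way the response body reaches the method are assumptions based on the traceback above):

    # Stand-ins for the project's error classes; the real ones live in the
    # terroroftinytown errors module shown in the traceback above.
    class PleaseRetry(Exception):
        pass

    class UnexpectedNoResult(Exception):
        pass

    RATE_LIMIT_MARKER = "Rate limit exceeded - you must wait at least"

    class IsgdSketch:
        """Illustrative stand-in for the is.gd service class (not real code)."""

        current_shortcode = "Wupip9"

        def process_unavailable(self, body):
            # New-style rate-limit page: back off so the tracker can hand the
            # shortcode to a client on another IP.
            if RATE_LIMIT_MARKER in body:
                raise PleaseRetry()
            # Anything else is still an unexpected response, as in line 42 of
            # the traceback.
            raise UnexpectedNoResult(
                "Could not find processing unavailable for %s"
                % self.current_shortcode)
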
07:20 <JesseW> --------
07:20 <JesseW> Started a new project: qr.cx
07:28 <JesseW> and (after a slow start) 212 found!
07:29 <JesseW> out of a total of 2,377,573
07:30 <JesseW> (they are kind enough to list the total number of URLs on their front page (and not be accepting new ones))
07:30 <JesseW> (they're also in 301works, but who knows if they've kept it up to date, or when if ever that data will be available)
07:31 <Deewiant> Made a PR for the is.gd thing
07:34 <JesseW> cool. I can't merge them myself, but I'll look over it.
07:34 <JesseW> feel free to review my PR, too, if you'd like.
07:35 <JesseW> looks good
07:38 <Deewiant> I'm not nearly familiar enough with the codebase to be of much use reviewing; took a peek though and didn't spot any issues
07:39 <JesseW> ok, thanks
07:40 <JesseW> I'm still getting more familiar with it, myself.
08:07 -- JesseW has quit IRC (Leaving.)
09:03 -- dashcloud has quit IRC (Read error: Operation timed out)
09:07 -- dashcloud has joined #urlteam
09:08 -- svchfoo1 sets mode: +o dashcloud
09:45 -- Muad-Dib has joined #urlteam
11:30 -- Coderjoe has quit IRC (Read error: Operation timed out)
11:51 -- Coderjoe has joined #urlteam
12:56 -- W1nterFox has quit IRC (Remote host closed the connection)
14:54 -- Start_ has quit IRC (Quit: Disconnected.)
15:45 -- Start has joined #urlteam
16:10 -- Start has quit IRC (Remote host closed the connection)
16:11 -- Start has joined #urlteam
17:07 -- Start has quit IRC (Quit: Disconnected.)
17:13 -- Start has joined #urlteam
18:25 -- bzc6p has joined #urlteam
18:25 -- swebb sets mode: +o bzc6p
18:30 <bzc6p> I think we should save URL shortener redirects also in WARC and add them to the Wayback Machine.
18:32 <bzc6p> I think many more people know the WM than those who know URLTeam, and also finding out the destination from URLteam files is much more complicated than just waybacking it.
18:33 <bzc6p> This could be done just alongside the "regular" URLTeam job.
18:36 <bzc6p> What do you think?
18:38 -- Start has quit IRC (Quit: Disconnected.)
19:11 -- aaaaaaaaa has joined #urlteam
19:11 -- swebb sets mode: +o aaaaaaaaa
19:19 <ersi> It's not a bad idea, although both forms have their merits.
19:19 <ersi> Although it needs to be implemented. That's about the only bad part that I can think of.
19:52 -- Start has joined #urlteam
20:45 -- Start has quit IRC (Quit: Disconnected.)
20:50 -- Start has joined #urlteam
20:56 <xmc> i am always and forever in favor of shorteners in warc
20:56 <xmc> we could even retire the stupidformat in favor of warc/cdx!
21:09 -- JW_work has joined #urlteam
21:11 <JW_work> I think doing a WARC grab of found shorturls (and their targets) is a fine idea — but it certainly shouldn't *replace* the current logic — it would slow down the process of *finding* shorturls a lot, and produce a whole lot more completely useless data.
21:12 <ersi> well, we can debate this shit forever and it doesn't really matter
21:12 <JW_work> Many shorteners can be searched with just HEAD requests — doing GET requests to each of them would be very wasteful (and slow).
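
The HEAD-only probing JW_work describes can be illustrated in a few lines (the shortener URL and shortcode are only examples, not the project's actual scanner):

    import requests

    def probe(shortcode):
        """Check one shortcode with a single HEAD request, without following redirects."""
        resp = requests.head("http://is.gd/" + shortcode,
                             allow_redirects=False, timeout=30)
        if resp.status_code in (301, 302):
            # The long URL is already in the Location header; no GET of the target needed.
            return resp.headers.get("Location")
        return None  # 404 or similar: shortcode not in use

    print(probe("Wupip9"))
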
21:12 <ersi> but if anyone gets pumped up if I praise it and actually do it, I'll praise it :)
21:12 <JW_work> I will too.
21:12 <JW_work> as I said, it would be lovely to *add* WARC scraping of found ones.
21:14 <JW_work> As for retiring the stupidformat (aka BECON) — shrug. It would make downloading the whole corpus even more of a hassle, which I'm not particularly in favor of.
21:14 <xmc> you can have a warc record containing a HEAD
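
One way to capture such a HEAD exchange in a WARC, sketched with the warcio library (the URL, headers and redirect target are illustrative, and this is not the project's code):

    from io import BytesIO

    from warcio.statusandheaders import StatusAndHeaders
    from warcio.warcwriter import WARCWriter

    with open("isgd-sample.warc.gz", "wb") as out:
        writer = WARCWriter(out, gzip=True)

        # The HEAD request sent while probing one shortcode (values illustrative).
        req_headers = StatusAndHeaders(
            "HEAD /Wupip9 HTTP/1.1",
            [("Host", "is.gd"), ("User-Agent", "urlteam-warc-sketch")])
        request = writer.create_warc_record(
            "http://is.gd/Wupip9", "request",
            payload=BytesIO(b""), http_headers=req_headers)

        # The redirect that came back; a HEAD response has headers but no body.
        resp_headers = StatusAndHeaders(
            "301 Moved Permanently",
            [("Location", "http://example.com/target-page")],
            protocol="HTTP/1.1")
        response = writer.create_warc_record(
            "http://is.gd/Wupip9", "response",
            payload=BytesIO(b""), http_headers=resp_headers)

        # Write both records; production tooling would also link the pair
        # via WARC-Concurrent-To headers.
        writer.write_record(response)
        writer.write_record(request)
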
21:14 <phuzion> Would it be possible to hack something together that takes the existing formats, and converts it into one big-ass WARC that could be ingested into the wayback machine to fix broken links?
21:15 <xmc> JW_work: BECON?
21:15 <xmc> phuzion: yeah, but i'm idgy about fabricating warc records
21:15 <JW_work> phuzion: you'd need to know more about exactly what the wayback machine expects.
21:15 <JW_work> BEACON — https://gbv.github.io/beaconspec/beacon.html
21:15 <JW_work> forgot the A
21:16 <xmc> ah hm
21:16 <JW_work> i.e. a formalization of the stupidformat
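
For context, a BEACON dump is plain text: a few "#" meta fields followed by one pipe-separated link per line. An invented fragment in the spirit of the qr.cx project mentioned above:

    #FORMAT: BEACON
    #PREFIX: http://qr.cx/
    abc|http://example.com/some/long/page
    de3|http://example.org/another/page

Here #PREFIX expands the shortcode on the left into the full short URL; the right-hand side is the target.
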
21:17 <xmc> ughhhh
21:17 <xmc> or we could poke IA and make them ingest it :)
21:18 <xmc> i mean ... it's a standard, and i'm sure they're interested
21:19 <JW_work> regarding putting HEAD requests in WARC — that's good to know, and would mean we could keep just the HEAD requests made to find existing shorturls and stuff them into WARCs — so I guess my remaining issue is that we still only store WARCs for actually existing shorturls.
21:19 <xmc> ?
21:19 <JW_work> I don't want to make the warriors go to the trouble of converting 404s into WARCs, and sending them back.
21:20 <xmc> there are a million ways that this can be made smaller
21:20 <xmc> what are you trying to economize
21:20 <JW_work> I think we're talking past each other.
21:21 <xmc> quite possibly
21:21 <phuzion> JW_work: Personally, I feel that if we're brute forcing, we should only return valid URLs, but if we're using searches from IA or other datasets, we should WARC the 404s and return them to IA for Wayback machine ingestion.
21:22 <xmc> sounds reasonable
21:22 <JW_work> Yep, that sounds reasonable to me, too.
21:22 <JW_work> If we have some reason other than brute force to think a particular short URL exists, then yeah, storing whatever random crap we get back seems like a good idea.
21:22 <phuzion> Because if we're brute forcing, there's a very real chance that a URL that 404s was never once used. But if there's a record of it in the wayback machine or somewhere else, then it's obviously been used at least once somewhere.
21:24 <JW_work> Regarding CDX — I don't think it has a way to directly represent redirects, AFAIK...
21:24 <JW_work> and I do like having (one of) our output formats be a minimal mapping between shortcodes and target URLs.
21:25 <JW_work> if we want to *generate* that from WARCs — that would be fine (although it might be tricky to do in general)
21:25 <aaaaaaaaa> JW_work: have you ever looked at what WARCs look like uncompressed?
21:30 <JW_work> yep
21:30 <JW_work> what about them?
21:32 <aaaaaaaaa> I was just curious how informed your statements were. Some people seem to regard them as magical in various ways.
21:33 <aaaaaaaaa> Or as flat zips in others.
21:34 <JW_work> ah. Yeah, I've read the spec, poked at various ones, ran some of the tools on them.
21:38 <arkiver> I created some warc files a while ago for url shorteners
21:39 <arkiver> We don't know what the exact responses were from the server
21:39 <arkiver> So we have to fabricate data
21:39 <arkiver> And that's the reason I stopped creating the WARC files
21:41 <JW_work> The idea now is to start doing it going forward — just keep (and store) the full request data for newly found shortURLs.
21:41 <JW_work> we could also run a retroactive effort to go through still-existing shorteners and grab the full data, but that's a separate effort.
21:52 <phuzion> For sure.
21:52 <phuzion> How do we send the data to the tracker as of right now? HTTP request to some API or something? Or are the files prepped and rsync'd?
21:53 <JW_work> HTTP request: see https://github.com/ArchiveTeam/terroroftinytown/blob/master/terroroftinytown/tracker/api.py#L114
21:54 <JW_work> https://github.com/ArchiveTeam/terroroftinytown/blob/master/terroroftinytown/tracker/app.py#L51
21:55 <phuzion> Gotcha. So, we'd need an rsync target as well for sending the WARCs to.
21:55 <JW_work> https://github.com/ArchiveTeam/terroroftinytown/blob/master/terroroftinytown/tracker/model.py#L761
21:55 <JW_work> yep.
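
The split phuzion describes could look roughly like this; the endpoint path, payload fields, tracker host and rsync module below are all invented for illustration (the real handlers are in the api.py and app.py linked above):

    import subprocess
    import requests

    TRACKER = "http://tracker.example.org"  # placeholder, not the real tracker

    def report_results(claim_id, results):
        # Small JSON payload per finished work item (hypothetical endpoint and fields).
        resp = requests.post(TRACKER + "/api/done",
                             json={"claim_id": claim_id, "results": results},
                             timeout=60)
        resp.raise_for_status()

    def upload_warc(path):
        # Bulk WARC data would bypass the JSON API and go to a separate rsync target.
        subprocess.check_call(["rsync", "-av", path,
                               "rsync://archive.example.org/urlteam-warcs/"])
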
22:07 -- WinterFox has joined #urlteam
22:08 -- WinterFox has quit IRC (Read error: Connection reset by peer)
22:09 -- squadette has joined #urlteam
22:09 -- WinterFox has joined #urlteam
22:10 -- qwebirc52 has joined #urlteam
22:12 <squadette> hi. does anybody know about the status of ff.im links database? Wiki mentions @chronomex having 1M+ URLs, but they're not online.
22:13 -- qwebirc52 has quit IRC (Client Quit)
22:13 * xmc waves
22:14 <xmc> hm, i don't remember doing ff.im, let me look on my computers though
22:15 <squadette> wiki table says 1,189,782 links in dump :) maybe some of those are ours.
22:15 <xmc> no ... i ... have no memory of running ff.im
22:15 <xmc> this is very strange
22:16 <squadette> ok I see :)
22:16 -- Start has quit IRC (Quit: Disconnected.)
22:21 <squadette> thanks for the answer!
22:24 <xmc> sorry i can't help much more than that
22:24 <JW_work> squadette: they are in the last dump
22:24 <JW_work> https://archive.org/details/URLTeamTorrentRelease2013July
22:24 <JW_work> https://archive.org/download/URLTeamTorrentRelease2013July/ff.im.txt.xz
22:26 <squadette> JW_work, thanks a lot!
22:26 <JW_work> sure
22:26 <JW_work> that's what the "# in dump" refers to
22:28 <squadette> ah, ok. We're restoring our personal friendfeed archives in one of the re-implementations, mokum.place
22:28 <xmc> spiffy
22:28 <JW_work> nice
22:29 <JW_work> the ff.im work was attributed to xmc in this edit: http://www.archiveteam.org/index.php?title=URLTeam&diff=2056&oldid=1775
22:29 <JW_work> on Xmas Day, 2010 by Soult.
22:30 <xmc> soultcer, now there's a name i haven't seen in a minute
22:33 <JW_work> and the claim of 1M urls ripped from ff.im was added in this edit http://www.archiveteam.org/index.php?title=URLTeam&diff=506&oldid=489 on 27 April 2009, by http://www.archiveteam.org/index.php?title=User:Scumola
22:33 <squadette> well, wc -l reports precisely this number
22:35 <JW_work> it better — that's how I generated it. :-)
22:37 -- deathy___ has quit IRC (Read error: Connection reset by peer)
22:45 -- xero has joined #urlteam
22:49 -- squadette has quit IRC (Quit: Page closed)
23:16 -- deathy___ has joined #urlteam
23:17 -- Start has joined #urlteam
23:17 -- xero has quit IRC (Leaving)
23:19 -- aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
23:20 -- aaaaaaaaa has joined #urlteam
23:20 -- swebb sets mode: +o aaaaaaaaa
23:33 -- bwn has joined #urlteam
23:34 <arkiver> JW_work: so basically your plan is to regrab the links that don't return 404 as WARCs?
23:34 <JW_work> not my plan. :-)
23:35 <arkiver> our*
23:35 <arkiver> so is that the plan?
23:36 <JW_work> Two (separate) plans I wouldn't object to are: 1) To keep the full requests for non-404'ing shorturls and upload them as WARCs; 2) To go through the existing urlteam results and re-scrape them as WARCs.
23:36 <JW_work> I don't intend to work on either of them, but I would be happy if someone else did.