#urlteam 2015-12-09,Wed

Time Nickname Message
02:01 🔗 JesseW has joined #urlteam
02:02 🔗 svchfoo1 sets mode: +o JesseW
03:03 🔗 JesseW has quit IRC (Leaving.)
03:08 🔗 W1nterFox has joined #urlteam
03:10 🔗 WinterFox has quit IRC (Read error: Operation timed out)
03:12 🔗 Muad-Dib has quit IRC (Ping timeout: 252 seconds)
03:20 🔗 Start has quit IRC (Read error: Connection reset by peer)
03:20 🔗 Start has joined #urlteam
05:01 🔗 aaaaaaaaa has quit IRC (Leaving)
05:21 🔗 W1nterFox has quit IRC (Read error: Operation timed out)
05:26 🔗 W1nterFox has joined #urlteam
06:24 🔗 JesseW has joined #urlteam
06:25 🔗 svchfoo1 sets mode: +o JesseW
07:05 🔗 JesseW Deewiant: you generated an interesting error:
07:05 🔗 JesseW File "/home/deewiant/terroroftinytown-client-grab/terroroftinytown/terroroftinytown/services/isgd.py", line 42, in process_unavailable
07:05 🔗 JesseW raise errors.UnexpectedNoResult("Could not find processing unavailable for %s" % self.current_shortcode)
07:05 🔗 JesseW UnexpectedNoResult: Could not find processing unavailable for Wupip9
07:05 🔗 JesseW for project isgd_6 at 2015-12-09 06:49:21.339344
07:07 🔗 Deewiant Mm-hm
07:09 🔗 Deewiant <div id="main"><p>Rate limit exceeded - you must wait at least 1798 seconds before we'll service this request.</p></div>
07:09 🔗 Start has quit IRC (Read error: Connection reset by peer)
07:10 🔗 JesseW mostly just putting this here to remind me to look into it.
07:10 🔗 Start has joined #urlteam
07:10 🔗 Deewiant That one doesn't seem to match anything in that function
07:10 🔗 Start_ has joined #urlteam
07:10 🔗 Start has quit IRC (Read error: Connection reset by peer)
07:12 🔗 Deewiant Quite a long timeout though, that's a bit annoying
07:13 🔗 JesseW ah, they changed the format of the Rate limit message, apparently.
07:14 🔗 JesseW feel free to make PR :-)
07:14 🔗 JesseW er, make *a* PR
07:15 🔗 Deewiant Is PleaseRetry() appropriate even though the wait is around 30 minutes instead of 1?
07:20 🔗 JesseW yep, because it will retry from another IP.
07:20 🔗 JesseW IIRC
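A minimal sketch of the fix being discussed, assuming the check lives in process_unavailable() as in the traceback above. The exception classes below are stand-ins for the project's real errors module, and the response-handling details are assumptions, not the actual terroroftinytown code:

```python
class UnexpectedNoResult(Exception):
    """Stand-in for terroroftinytown's errors.UnexpectedNoResult."""

class PleaseRetry(Exception):
    """Stand-in for terroroftinytown's errors.PleaseRetry."""

def process_unavailable(response_text, shortcode):
    # is.gd changed the wording of its rate-limit page, so the old
    # pattern match no longer fired and the generic error was raised.
    if 'Rate limit exceeded' in response_text:
        # PleaseRetry puts the item back in the queue; another client
        # (with a different IP) can pick it up immediately, so the
        # ~30-minute per-IP wait does not stall the project.
        raise PleaseRetry('Rate limited while fetching %s' % shortcode)
    raise UnexpectedNoResult(
        'Could not find processing unavailable for %s' % shortcode)
```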
07:20 🔗 JesseW --------
07:20 🔗 JesseW Started a new project: qr.cx
07:28 🔗 JesseW and (after a slow start) 212 found!
07:29 🔗 JesseW out of a total of 2,377,573
07:30 🔗 JesseW (they are kind enough to list the total number of URLs on their front page, and they're no longer accepting new ones)
07:30 🔗 JesseW (they're also in 301works, but who knows if they've kept it up to date, or when, if ever, that data will be available)
07:31 🔗 Deewiant Made a PR for the is.gd thing
07:34 🔗 JesseW cool. I can't merge them myself, but I'll look over it.
07:34 🔗 JesseW feel free to review my PR, too, if you'd like.
07:35 🔗 JesseW looks good
07:38 🔗 Deewiant I'm not nearly familiar enough with the codebase to be of much use reviewing; took a peek though and didn't spot any issues
07:39 🔗 JesseW ok, thanks
07:40 🔗 JesseW I'm still getting more familiar with it, myself.
08:07 🔗 JesseW has quit IRC (Leaving.)
09:03 🔗 dashcloud has quit IRC (Read error: Operation timed out)
09:07 🔗 dashcloud has joined #urlteam
09:08 🔗 svchfoo1 sets mode: +o dashcloud
09:45 🔗 Muad-Dib has joined #urlteam
11:30 🔗 Coderjoe has quit IRC (Read error: Operation timed out)
11:51 🔗 Coderjoe has joined #urlteam
12:56 🔗 W1nterFox has quit IRC (Remote host closed the connection)
14:54 🔗 Start_ has quit IRC (Quit: Disconnected.)
15:45 🔗 Start has joined #urlteam
16:10 🔗 Start has quit IRC (Remote host closed the connection)
16:11 🔗 Start has joined #urlteam
17:07 🔗 Start has quit IRC (Quit: Disconnected.)
17:13 🔗 Start has joined #urlteam
18:25 🔗 bzc6p has joined #urlteam
18:25 🔗 swebb sets mode: +o bzc6p
18:30 🔗 bzc6p I think we should save URL shortener redirects also in WARC and add them to the Wayback Machine.
18:32 🔗 bzc6p I think many more people know the WM than know URLTeam, and finding the destination of a short URL in the URLTeam files is much more complicated than just looking it up in the Wayback Machine.
18:33 🔗 bzc6p This could be done just alongside the "regular" URLTeam job.
18:36 🔗 bzc6p What do you think?
18:38 🔗 Start has quit IRC (Quit: Disconnected.)
19:11 🔗 aaaaaaaaa has joined #urlteam
19:11 🔗 swebb sets mode: +o aaaaaaaaa
19:19 🔗 ersi It's not a bad idea, although both forms have their merits.
19:19 🔗 ersi It does need to be implemented, though. That's about the only bad part that I can think of.
19:52 🔗 Start has joined #urlteam
20:45 🔗 Start has quit IRC (Quit: Disconnected.)
20:50 🔗 Start has joined #urlteam
20:56 🔗 xmc i am always and forever in favor of shorteners in warc
20:56 🔗 xmc we could even retire the stupidformat in favor of warc/cdx!
21:09 🔗 JW_work has joined #urlteam
21:11 🔗 JW_work I think doing a WARC grab of found shorturls (and their targets) is a fine idea — but it certainly shouldn't *replace* the current logic — it would slow down the process of *finding* shorturls a lot, and produce a whole lot more completely useless data.
21:12 🔗 ersi well, we can debate this shit forever and it doesn't really matter
21:12 🔗 JW_work Many shorteners can be searched with just HEAD requests — doing GET requests to each of them would be very wasteful (and slow).
21:12 🔗 ersi but if anyone gets pumped up if I praise it and actually do it, I'll praise it :)
21:12 🔗 JW_work I will too.
21:12 🔗 JW_work as I said, it would be lovely to *add* WARC scraping of found ones.
21:14 🔗 JW_work As for retiring the stupidformat (aka BECON) — shrug. It would make downloading the whole corpus even more of a hassle, which I'm not particularly in favor of.
21:14 🔗 xmc you can have a warc record containing a HEAD
21:14 🔗 phuzion Would it be possible to hack something together that takes the existing formats, and converts it into one big-ass WARC that could be ingested into the wayback machine to fix broken links?
21:15 🔗 xmc JW_work: BECON?
21:15 🔗 xmc phuzion: yeah, but i'm iffy about fabricating warc records
21:15 🔗 JW_work phuzion: you'd need to know more about exactly what the wayback machine expects.
21:15 🔗 JW_work BEACON — https://gbv.github.io/beaconspec/beacon.html
21:15 🔗 JW_work forgot the A
21:16 🔗 xmc ah hm
21:16 🔗 JW_work i.e. a formalization of the stupidformat
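For reference, a BEACON file is just a header of meta fields followed by one link per line; a minimal example with made-up values (see the spec linked above for the full set of meta fields):

```
#FORMAT: BEACON
#PREFIX: http://qr.cx/

abcd|http://example.com/long-target
efgh|http://example.com/another-target
```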
21:17 🔗 xmc ughhhh
21:17 🔗 xmc or we could poke IA and make them ingest it :)
21:18 🔗 xmc i mean ... it's a standard, and i'm sure they're interested
21:19 🔗 JW_work regarding putting HEAD requests in WARC — that's good to know, and would mean we could keep just the HEAD requests made to find existing shorturls and stuff them into WARCs — so I guess my remaining issue is that we would still only store WARCs for actually existing shorturls.
21:19 🔗 xmc ?
21:19 🔗 JW_work I don't want to make the warriors go to the trouble of converting 404s into WARCs, and sending them back.
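A rough sketch of what writing such a record could look like, using the warcio library (which is not what the project used at the time; the URL, headers, and filename below are all illustrative):

```python
from io import BytesIO

from warcio.statusandheaders import StatusAndHeaders
from warcio.warcwriter import WARCWriter

with open('shortener.warc.gz', 'wb') as fh:
    writer = WARCWriter(fh, gzip=True)

    short_url = 'http://qr.cx/abcd'  # hypothetical shortcode

    # A redirect carries everything in its headers and has an empty
    # body, so even a HEAD-derived response record captures the full
    # shortcode -> target mapping. A matching 'request' record can be
    # written the same way if the request line is kept.
    http_headers = StatusAndHeaders(
        '301 Moved Permanently',
        [('Location', 'http://example.com/long-target')],
        protocol='HTTP/1.1')
    record = writer.create_warc_record(
        short_url, 'response',
        payload=BytesIO(b''),
        http_headers=http_headers)
    writer.write_record(record)
```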
21:20 🔗 xmc there are a million ways that this can be made smaller
21:20 🔗 xmc what are you trying to economize
21:20 🔗 JW_work I think we're talking past each other.
21:21 🔗 xmc quite possibly
21:21 🔗 phuzion JW_work: Personally, I feel that if we're brute forcing, we should only return valid URLs, but if we're using searches from IA or other datasets, we should WARC the 404s and return them to IA for Wayback machine ingestion.
21:22 🔗 xmc sounds reasonable
21:22 🔗 JW_work Yep, that sounds reasonable to me, too.
21:22 🔗 JW_work If we have some reason other than brute force to think a particular short URL exists, then yeah, storing whatever random crap we get back seems like a good idea.
21:22 🔗 phuzion Because if we're brute forcing, there's a very real chance that a URL that 404s was never once used. But if there's a record of it in the wayback machine or somewhere else, then it's obviously been used at least once somewhere.
21:24 🔗 JW_work Regarding CDX — I don't think it has a way to directly represent redirects, AFAIK...
21:24 🔗 JW_work and I do like having (one of) our output formats be a minimal mapping between shortcodes and target URLs.
21:25 🔗 JW_work if we want to *generate* that from WARCs — that would be fine (although it might be tricky to do in general)
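A sketch of how such a mapping could be regenerated from a WARC, again using warcio and assuming redirects were captured as response records with a Location header (assumptions for illustration, not existing project code):

```python
from warcio.archiveiterator import ArchiveIterator

with open('shortener.warc.gz', 'rb') as fh:
    for record in ArchiveIterator(fh):
        if record.rec_type != 'response':
            continue
        short_url = record.rec_headers.get_header('WARC-Target-URI')
        target = record.http_headers.get_header('Location')
        if target:  # keep only actual redirects
            print('%s|%s' % (short_url, target))
```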
21:25 🔗 aaaaaaaaa JW_work: have you ever looked at what WARCs look like uncompressed?
21:30 🔗 JW_work yep
21:30 🔗 JW_work what about them?
21:32 🔗 aaaaaaaaa I was just curious how informed your statements were. Some people seem to regard them as magical in various ways.
21:33 🔗 aaaaaaaaa Or as flat zips in others.
21:34 🔗 JW_work ah. Yeah, I've read the spec, poked at various ones, ran some of the tools on them.
21:38 🔗 arkiver I created some warc files a while ago for url shorteners
23:39 🔗 arkiver We don't know what the exact responses were from the server
21:39 🔗 arkiver So we have to fabricate data
21:39 🔗 arkiver And that's the reason I stopped creating the WARC files
21:41 🔗 JW_work The idea now is to start doing it going forward — just keep (and store) the full request data for newly found shortURLs.
21:41 🔗 JW_work we could also run a retroactive effort to go through still-existing shorteners and grab the full data, but that's a separate effort.
21:52 🔗 phuzion For sure.
21:52 🔗 phuzion How do we send the data to the tracker as of right now? HTTP request to some API or something? Or are the files prepped and rsync'd?
21:53 🔗 JW_work HTTP request: see https://github.com/ArchiveTeam/terroroftinytown/blob/master/terroroftinytown/tracker/api.py#L114
21:54 🔗 JW_work https://github.com/ArchiveTeam/terroroftinytown/blob/master/terroroftinytown/tracker/app.py#L51
21:55 🔗 phuzion Gotcha. So, we'd need an rsync target as well for sending the WARCs to.
21:55 🔗 JW_work https://github.com/ArchiveTeam/terroroftinytown/blob/master/terroroftinytown/tracker/model.py#L761
21:55 🔗 JW_work yep.
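For context, a result submission is just an HTTP POST to the tracker. A purely illustrative sketch follows; the real endpoint path and payload schema are in the api.py linked above, and every name below is hypothetical:

```python
import requests

# made-up shortcode -> destination URL results from one claimed item
results = {'abcd': 'http://example.com/long-target'}

resp = requests.post(
    'http://tracker.example.org/api/done',  # hypothetical endpoint
    json={'item_id': 12345, 'results': results})
resp.raise_for_status()
```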
22:07 🔗 WinterFox has joined #urlteam
22:08 🔗 WinterFox has quit IRC (Read error: Connection reset by peer)
22:09 🔗 squadette has joined #urlteam
22:09 🔗 WinterFox has joined #urlteam
22:10 🔗 qwebirc52 has joined #urlteam
22:12 🔗 squadette hi. does anybody know about the status of ff.im links database? Wiki mentions @chronomex having 1M+ URLs, but they're not online.
22:13 🔗 qwebirc52 has quit IRC (Client Quit)
22:13 🔗 * xmc waves
22:14 🔗 xmc hm, i don't remember doing ff.im, let me look on my computers though
22:15 🔗 squadette wiki table says 1,189,782 links in dump :) maybe some of those are ours.
22:15 🔗 xmc no ... i ... have no memory of running ff.im
22:15 🔗 xmc this is very strange
22:16 🔗 squadette ok I see :)
22:16 🔗 Start has quit IRC (Quit: Disconnected.)
22:21 🔗 squadette thanks for the answer!
22:24 🔗 xmc sorry i can't help much more than that
22:24 🔗 JW_work squadette: they are in the last dump
22:24 🔗 JW_work https://archive.org/details/URLTeamTorrentRelease2013July
22:24 🔗 JW_work https://archive.org/download/URLTeamTorrentRelease2013July/ff.im.txt.xz
22:26 🔗 squadette JW_work, thanks a lot!
22:26 🔗 JW_work sure
22:26 🔗 JW_work that's what the "# in dump" refers to
22:28 🔗 squadette ah, ok. We're restoring our personal friendfeed archives in one of the re-implementations, mokum.place
22:28 🔗 xmc spiffy
22:28 🔗 JW_work nice
22:29 🔗 JW_work the ff.im work was attributed to xmc in this edit: http://www.archiveteam.org/index.php?title=URLTeam&diff=2056&oldid=1775
22:29 🔗 JW_work on Xmas Day, 2010 by Soult.
22:30 🔗 xmc soultcer, now there's a name i haven't seen in a minute
22:33 🔗 JW_work and the claim of 1M urls ripped from ff.im was added in this edit http://www.archiveteam.org/index.php?title=URLTeam&diff=506&oldid=489 on 27 April 2009, by http://www.archiveteam.org/index.php?title=User:Scumola
22:33 🔗 squadette well, wc -l reports precisely this number
22:35 🔗 JW_work it better — that's how I generated it. :-)
22:37 🔗 deathy___ has quit IRC (Read error: Connection reset by peer)
22:45 🔗 xero has joined #urlteam
22:49 🔗 squadette has quit IRC (Quit: Page closed)
23:16 🔗 deathy___ has joined #urlteam
23:17 🔗 Start has joined #urlteam
23:17 🔗 xero has quit IRC (Leaving)
23:19 🔗 aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
23:20 🔗 aaaaaaaaa has joined #urlteam
23:20 🔗 swebb sets mode: +o aaaaaaaaa
23:33 🔗 bwn has joined #urlteam
23:34 🔗 arkiver JW_work: so basically your plan is to regrab the links that don't return 404 as WARCs?
23:34 🔗 JW_work not my plan. :-)
23:35 🔗 arkiver our*
23:35 🔗 arkiver so is that the plan?
23:36 🔗 JW_work Two (separate) plans I wouldn't object to are: 1) To keep the full requests for non-404'ing shorturls and upload them as WARCs; 2) To go through the existing urlteam results and re-scrape them as WARCs.
23:36 🔗 JW_work I don't intend to work on either of them, but I would be happy if someone else did.
