[01:00] *** JesseW has joined #urlteam
[02:40] <wumpus> If urlteam data did go into the wayback, then all the users of our upcoming 404 browser integrations can enjoy urlteam's work.
[02:50] <Frogging> it would be easy to put it into wayback, wouldn't it? isn't it all WARCs with 301 records?
[02:52] <bwn> it's in beacon link dump format
[02:53] <bwn> https://gbv.github.io/beaconspec/beacon.html
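(For reference, a BEACON link dump is just plain text: a few #-prefixed meta lines, then one source|target mapping per line. An illustrative urlteam-style dump, with made-up shortcodes and targets, might look like this:)

    #FORMAT: BEACON
    #PREFIX: http://bit.ly/
    1a2b3c|http://example.com/some/long/page
    1a2b3d|http://example.org/another/page

(Note that nothing in the format records *when* each redirect was observed or what headers came back, which is the limitation discussed below.)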
[02:53] *** VADemon has quit IRC (Quit: left4dead)
[02:57] <xmc> i still can't believe that the stupid ad-hoc format that i came up with for my initial archiveteam experiments has become an internet-draft
[02:57] <xmc> makes me pretty sad
[03:01] <wumpus> no date info for WARC purposes
[03:01] <xmc> yeah it's a terrible format and i would like to go back in time and start saving warcs
[03:06] <Frogging> oh shit it's not warcs?
[03:06] <Frogging> not that I had any actual reason to think it was, but I'm surprised somewhat
[03:07] <JesseW> it's a lot smaller than WARCs would be
[03:07] <wumpus> You could recrawl the known IDs into WARCs, it's only 4.7 billion urls
[03:08] <JesseW> which I appreciate because it means I can mirror all of it more easily
[03:08] <JesseW> Yeah, I think re-crawling the known ones would be a great project.
[03:08] <JesseW> I just haven't gotten around to doing it
[03:09] <wumpus> Well, do you want the 404 handler of the web to process these links? Or to minimize your personal disk space? You can always go warc -> beacon ;-)
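(Going warc -> beacon is indeed the mechanical direction. A minimal sketch using the warcio library; the filename and the 301/302-only filter are assumptions about how such a dump would be made:)

    # Sketch: derive a beacon-style source|target list from a WARC of
    # redirect captures. Requires warcio (pip install warcio).
    from warcio.archiveiterator import ArchiveIterator

    with open('shortener-captures.warc.gz', 'rb') as stream:  # hypothetical file
        for record in ArchiveIterator(stream):
            if record.rec_type != 'response':
                continue
            if record.http_headers.get_statuscode() not in ('301', '302'):
                continue
            source = record.rec_headers.get_header('WARC-Target-URI')
            target = record.http_headers.get_header('Location')
            print('%s|%s' % (source, target))

(The reverse direction is the hard one, precisely because the beacon dumps never kept the headers or crawl dates.)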
[03:09] <JesseW> (well, I have actively avoided working on that -- but I'm glad to cheer on others who do)
[03:10] <JesseW> that is a sensible argument for keeping the full headers from the terroroftinytown results, yes
[03:12] * JesseW is encouraged by all this discussion to look over the contributions people have made recently and see what can be added to the tracker
[03:17] <JesseW> shortdoi.org seems feasible to scrape
[03:17] <JesseW> making a project now
[03:19] <wumpus> the first urlteam to WARC project?
[03:19] <JesseW> ha -- no, just another one using the currently lossy format
[03:19] <wumpus> just kidding
[03:20] <JesseW> heh. but if you keep poking, I might make a PR (although I'd much prefer if you or someone else did so)
[03:21] <JesseW> ok, shortdoi-org started
[03:22] <JesseW> note, it actually maps from dx.doi.org/10/ not shortdoi.org, as that is just an initial redirect
[03:24] <JesseW> ok, 681 found so far
[03:24] <JesseW> it seems to have missed some, though, which I'm confused by
[03:32] <bwn> i've been working on aggregating the lists in my spare time, i was looking to do something similar to what you're discussing above (jessew: we were discussing it briefly a while back, a run through to make sure everything is in the wayback, archivebot-like thing to get them if not)
[03:33] <bwn> wb jessew, btw :)
[03:33] <JesseW> yeah, I remembered you were working on that
[03:34] <JesseW> I think someone else was, too -- maybe Frogging or vitzli (not sure if I'm misspelling their nick)...
[03:34] <Frogging> sadly I haven't been working on anything
[03:34] <bwn> i had started with luckcolor's list of dead shorteners though so we can't re-crawl them :\
[03:35] <Frogging> except $dayjob
[03:35] <JesseW> Frogging: ah, ok -- I think I got you confused with someone else
[03:35] <JesseW> $dayjob is useful and important. At a minimum, it enables you to pay for #archivebot pipelines (which is *GREATLY* appreciated)
[03:35] <Frogging> that is true :p
[03:37] <JesseW> grumble -- I screwed up the alphabet on shortdoi-org
[03:38] <bwn> jessew: i think there are some 3 digit identifiers, i was going to update the wiki.. i forgot to add a note about doing a bit more research
[03:38] <JesseW> yeah, I knew about the 3 digit identifiers -- the error I made was that there are identifiers that *start* with "a", so "a" can't be the first character in the alphabet
[03:39] <bwn> ah, cool
[03:40] <JesseW> we may have missed various other sequential ones with a similar error
[03:40] <JesseW> someone (else) should probably go through them and check
[03:41] <JesseW> spot check, I mean -- then I can fix the alphabets and we can grab them
[03:43] <JesseW> ok, shortdoi-org queue up to 40 -- it should be done in a couple of hours
[03:48] <JesseW> flip-it is *still* going, since June 6th -- 21,867,543 found
[03:53] <bwn> moar urls!
[03:54] <bwn> ah, _i_ messed up the alphabet you mean.. :) i didn't see '0', oops
[03:55] <JesseW> no, there *isn't* an '0' -- I just need something at the beginning, because the first character in the alphabet doesn't get used as an initial letter in the generated shortcodes
[03:55] <JesseW> and 'a' *is* used that way in shortdoi-org
[03:55] <bwn> ah, i gotcha now
[03:56] <JesseW> yeah, it's basically just a bug in terroroftinytown
[03:56] <JesseW> albeit one that (maybe) doesn't cause us to miss much (as it may be a similar bug in various of the shorteners we work over)
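(The bug described here is plain positional notation: terroroftinytown's sequential projects count through shortcodes in base len(alphabet), and the alphabet's first character acts as the zero digit, so it can never lead a multi-character code. A minimal illustration, not the actual terroroftinytown code:)

    def int_to_shortcode(n, alphabet):
        # Standard base-N conversion; alphabet[0] plays the role of zero.
        base = len(alphabet)
        code = alphabet[n % base]
        n //= base
        while n:
            code = alphabet[n % base] + code
            n //= base
        return code

    # With alphabet 'abcdefghijklmnopqrstuvwxyz', counting 0, 1, 2, ...
    # yields a, b, ..., z, ba, bb, ... -- no code longer than one character
    # ever starts with 'a'. If the shortener really issues codes like 'axy',
    # they get skipped; hence the unused placeholder character at position zero.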
[03:57] <bwn> ah
[04:14] *** Start_ has joined #urlteam
[04:14] *** Start has quit IRC (Ping timeout: 260 seconds)
[04:16] <JesseW> shortdoi-org is done
[05:00] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[05:17] <wumpus> That was... short.
[05:18] <wumpus> (sorry)
[05:21] <bwn> *drum hit*
[05:22] <bwn> it gets through the sequential shorteners pretty quickly from what i've seen
[05:23] <bwn> question from above: is there any way to massage the data we have for the dead shorteners into something that's usable for wayback/your 404 handler?
[05:23] <bwn> s/from/re/
[06:12] *** JesseW has joined #urlteam
[06:38] <wumpus> The main issue is that we would like accurate crawl dates in WARC files.
[06:39] <wumpus> so if we create WARC from another format, we'd like that info to be available. And, it does not exist in beacon.
[06:39] <JesseW> they are dated approximately daily (although more recent ones are more like weekly)
[06:41] <JesseW> but re-crawling all the non-dead ones is certainly a good idea, because as a nice side-effect, it would let us grab the target pages, too (which we currently don't)
[06:41] <wumpus> So, if we're going to hand out affidavits to courts, as you can imagine "approximately" is not a good thing.
[06:41] <wumpus> and indeed, it would be interesting to also have the target pages.
[06:43] <JesseW> I'd hope that the question of "what exact minute did you check that this shortcode pointed to this address" wouldn't come up very often in affidavits -- but I can see the problem if there's no way to *mark* some entries as "circa"
[06:43] <wumpus> Just as an example, we have an 80-billion-page horizontal crawl for which we have yet to figure out accurate dates. This makes me very sad.
[06:43] *** dashcloud has quit IRC (Ping timeout: 244 seconds)
[06:43] <wumpus> And no, no existing "circa" system.
[06:43] <JesseW> what about for the *very* early material (I'm thinking of the BBC website from the very early 1990s, that they gave you, and got converted)
[06:44] <wumpus> I don't know about that one, yet.
[06:44] * JesseW goes to try and dig up a link
[06:44] <wumpus> Personally I'm eager to get some super-early Stanford crawls into the wayback, so that the initial Darwin Awards webpages are properly archived.
[06:45] <JesseW> awesome
[06:46] <wumpus> (the CS department asked Wendy to leave, because it was too popular. :-) )
[06:46] <JesseW> ha
[06:46] <JesseW> http://www.forbes.com/sites/kalevleetaru/2015/11/16/how-much-of-the-internet-does-the-wayback-machine-really-archive <- interesting
[06:48] <wumpus> Uh, yeah. I encourage anyone interested in that article to try out https://web-beta.archive.org/ because it provides a lot more info than was available when that article was written.
[06:48] <JesseW> heh
[06:48] *** dashcloud has joined #urlteam
[06:49] *** svchfoo3 sets mode: +o dashcloud
[06:51] <wumpus> Among other things, you can see how important ArchiveTeam is for many sites. I never appreciated you guys properly until I built that thing.
[06:51] <wumpus> Now I'm a fan!
[06:51] <JesseW> hehheheh
[06:51] <wumpus> Albeit not a fan of non-WARC stuff.
[06:52] <JesseW> yeah, urlteam is somewhat of a red-headed stepchild in some ways
[06:58] <wumpus> SketchCow mostly has you guys moving in the right direction, I'm just trying to fill in a few Wayback-specific details.
[06:58] <JesseW> it's very welcome.
[06:59] * JesseW is not finding the bbc collection I was thinking of -- I'll let you know if it pops up
[07:01] *** dashcloud has quit IRC (Read error: Operation timed out)
[07:04] <bwn> it sounds like something that monitors urlteam exports, grabs them and generates WARCs might be worthwhile going forward?
[07:05] <JesseW> that would likely be a good workaround -- eventually, I think integrating full capture into terroroftinytown would be better
[07:06] <JesseW> but even before making something that monitors, just writing a Warrior project that goes through the existing urlteam exports and makes WARCs from them would be great
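(A sketch of what such a project could emit: synthesizing minimal 301 response records from an existing export with the warcio library. The helper, filenames, and URLs here are hypothetical, and WARC-Date can only be as accurate as the export's date stamp, i.e. "circa", which is exactly the caveat raised earlier:)

    from io import BytesIO
    from warcio.warcwriter import WARCWriter
    from warcio.statusandheaders import StatusAndHeaders

    def write_redirect(writer, short_url, target_url, export_date):
        # Build a synthetic HTTP/1.1 301 response for one known mapping.
        http_headers = StatusAndHeaders(
            '301 Moved Permanently',
            [('Location', target_url), ('Content-Length', '0')],
            protocol='HTTP/1.1')
        record = writer.create_warc_record(
            short_url, 'response',
            payload=BytesIO(b''),
            http_headers=http_headers,
            # Approximate: the export's date, not a true crawl time.
            warc_headers_dict={'WARC-Date': export_date})
        writer.write_record(record)

    with open('urlteam-export.warc.gz', 'wb') as out:  # hypothetical name
        writer = WARCWriter(out, gzip=True)
        write_redirect(writer, 'http://bit.ly/1a2b3c',
                       'http://example.com/page', '2016-06-06T00:00:00Z')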
[07:09] *** dashcloud has joined #urlteam
[07:09] *** svchfoo3 sets mode: +o dashcloud
[07:14] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[09:08] *** Fusl has quit IRC (Read error: Operation timed out)
[09:12] *** WinterFox has joined #urlteam
[11:40] <luckcolor> JesseW i would suggest splitting the work
[11:40] <luckcolor> i mean
[11:41] <luckcolor> if we are going that route i suggest that we keep the current setup
[11:41] <luckcolor> and then have some servers that receive work
[11:41] <luckcolor> and print out warcs with only the 301 records
[11:42] <luckcolor> and then in other warcs we do an archive-only-style crawl
[11:42] <luckcolor> I mean i would prefer to have the data separated: beacon, beacon warc, archived urls warc
[11:59] *** Fusl has joined #urlteam
[12:28] *** dashcloud has quit IRC (Read error: Operation timed out)
[12:32] *** dashcloud has joined #urlteam
[12:32] *** svchfoo3 sets mode: +o dashcloud
[13:41] *** dashcloud has quit IRC (Read error: Operation timed out)
[13:45] *** dashcloud has joined #urlteam
[13:46] *** svchfoo3 sets mode: +o dashcloud
[14:18] *** luckcolor has quit IRC (Remote host closed the connection)
[14:19] *** luckcolor has joined #urlteam
[14:24] *** luckcolor has quit IRC (Read error: Connection reset by peer)
[14:25] *** luckcolor has joined #urlteam
[14:33] *** luckcolor has quit IRC (Read error: Connection reset by peer)
[14:33] *** luckcolor has joined #urlteam
[14:43] *** luckcolor has quit IRC (Read error: Connection reset by peer)
[14:55] *** luckcolor has joined #urlteam
[15:21] *** WinterFox has quit IRC (Remote host closed the connection)
[15:29] *** SilSte has quit IRC (Read error: Operation timed out)
[15:30] *** JesseW has joined #urlteam
[15:55] *** JesseW has quit IRC (Read error: Operation timed out)
[16:10] <JW_work1> luckcolor: that does make sense -- but it does add load. Why not just save the full headers in the initial pass? (I agree about grabbing the targets in a separate pass)
[16:10] *** Start_ is now known as Start
[16:11] <luckcolor> well in this manner we have the "simple to manage" text file with the list of urls because that's what beacon basically is and we also then have warc
[16:11] <luckcolor> i mean if we have a script to easily extract beacon or just a url list from warc then yes we can omit the first part
[16:13] * luckcolor needs url lists for the resolve that he's making
[16:13] * luckcolor *resolver
[16:14] <JW_work1> yeah, I certainly would *produce* both beacon and WARCs
[16:14] <JW_work1> it's just a matter of whether the initial probing throws away the header info or not
[16:15] <luckcolor> no ofc not
[16:15] <JW_work1> well, it currently *does* -- that's what I was thinking we should (eventually) fix
[16:15] <luckcolor> yeah the change would be so that the crawler will generate mini warcs
[16:15] <luckcolor> that we can collect
[16:16] <luckcolor> :P
[16:16] <JW_work1> yes
[16:16] <luckcolor> mini warcs team NEW Urlteam technology
[16:19] <luckcolor> *mini warcs! NEW Urlteam technology
[16:19] <luckcolor> if we are going this rate i don't recommend using wpull
[16:20] <luckcolor> as it would be a hassle to successfully ship it to the crawlers
[16:31] <luckcolor> *rate > route
[17:09] <xmc> yes! mini warcs!
[17:09] <xmc> <N3
[17:09] <xmc> <3
[18:01] <luckcolor> I know right
[18:01] <luckcolor> So tiny and cute warcs :P
[18:02] <xmc> like cocktail sausages
[18:02] * luckcolor goes to look up what cocktail sausages are
[18:03] <luckcolor> ah i see what you mean XD
[18:06] *** VADemon has joined #urlteam
[18:07] *** SilSte has joined #urlteam
[18:33] <luckcolor> so JesseW do you approve of the tiny warc technology (idea) for urlteam crawlers? :)
[18:34] <luckcolor> actually mini warcs
[18:34] <luckcolor> because mini is better
[19:02] <JW_work1> sure, works for me
[20:58] *** JW_work has joined #urlteam
[20:59] *** JW_work1 has quit IRC (Read error: Operation timed out)
[22:17] *** SilSte has quit IRC (Read error: Operation timed out)
[22:26] *** SilSte has joined #urlteam