Time |
Nickname |
Message |
00:00
🔗
|
joepie91_ |
I certainly haven't ever seen it :P |
00:00
🔗
|
joepie91_ |
it's enough of an edge case to not care about a misnaming in that case |
00:00
🔗
|
|
serapeum has joined #archiveteam-bs |
00:00
🔗
|
joepie91_ |
if it's accurate in 99% of the cases, that's better than confusing in 100% of the cases.. :) |
00:01
🔗
|
dan_ |
fair enough, exactly :) |
00:08
🔗
|
joepie91_ |
dan_: https://gist.github.com/joepie91/09aed84c45dc44967699 |
00:09
🔗
|
joepie91_ |
a lot more consistent than the RFC |
00:09
🔗
|
joepie91_ |
:P |
00:10
🔗
|
dan_ |
aha yep, RFC had to deal with implementation-specific things tacked on over years though, so I sorta forgive it~ |
00:13
🔗
|
joepie91_ |
dan_: heh, this is 14x, it has no excuse |
00:13
🔗
|
joepie91_ |
:) |
00:14
🔗
|
|
c_b2 has joined #archiveteam-bs |
00:14
🔗
|
|
c_b has quit IRC (Ping timeout: 260 seconds) |
00:16
🔗
|
|
c_b2 is now known as c_b |
00:21
🔗
|
joepie91_ |
dan_: hm. is there an equivalent of HTTP 400/500 in IRC? |
00:21
🔗
|
joepie91_ |
"some error that I don't have an error code for" |
00:22
🔗
|
joepie91_ |
oh |
00:22
🔗
|
joepie91_ |
400 |
00:22
🔗
|
joepie91_ |
heh |
00:24
🔗
|
|
mistym_ has quit IRC (Remote host closed the connection) |
00:24
🔗
|
|
mistym has joined #archiveteam-bs |
00:25
🔗
|
|
primus has quit IRC (Read error: Operation timed out) |
00:36
🔗
|
dan_ |
https://www.alien.net.au/irc/irc2numerics.html |
00:36
🔗
|
dan_ |
all those conflicts :) |
00:37
🔗
|
joepie91_ |
yep |
00:37
🔗
|
joepie91_ |
that has been my goto numeric guide for a long time |
00:37
🔗
|
joepie91_ |
lol |
00:37
🔗
|
dan_ |
haha, (Last updated: Tue, 11 Jan 2005 22:30:30 GMT) |
00:38
🔗
|
dan_ |
gotta love irc |
00:38
🔗
|
joepie91_ |
and it's still accurate! heh |
00:49
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
00:51
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
00:52
🔗
|
|
primus104 has quit IRC (Leaving.) |
01:27
🔗
|
|
schbirid2 has quit IRC (Read error: Operation timed out) |
01:39
🔗
|
|
schbirid2 has joined #archiveteam-bs |
01:45
🔗
|
|
wp494 has quit IRC (Read error: No route to host) |
01:48
🔗
|
BlueMaxim |
what do you guys think of BetaArchive |
01:49
🔗
|
* |
kyan thinks they're jackasses, because they won't let other sites mirror their collection — a single point of failure for a valuable chunk of history, with a bureaucratic attitude |
01:51
🔗
|
chfoo |
logchfoo: off |
01:51
🔗
|
|
logchfoo has left |
01:52
🔗
|
|
logchfoo starts logging #archiveteam-bs at Sun Mar 29 01:52:11 2015 |
01:52
🔗
|
|
logchfoo has joined #archiveteam-bs |
01:52
🔗
|
chfoo |
(sorry to interrupt, i wanted to remove ops from the log bot) |
01:59
🔗
|
joepie91_ |
lol, wow: https://en.wikipedia.org/wiki/FoundationDB |
01:59
🔗
|
joepie91_ |
On March 25, 2015 it was reported that Apple has acquired the company.[6] A notice on the FoundationDB web site indicated that the company has "evolved" its mission and would no longer offer downloads of the software.[7] |
01:59
🔗
|
joepie91_ |
"ha ha fuck you now you can't download our software anymore that you've built your infra on" |
02:00
🔗
|
joepie91_ |
looks like Apple may soon be joining Yahoo in the list of douchebag-acquisition companies |
02:03
🔗
|
Rotab |
lol |
02:03
🔗
|
aaaaaaaaa |
looks like they evolved from "extend" to "extinguish" |
02:04
🔗
|
garyrh |
Gotta acquire 'em all! |
02:31
🔗
|
|
wp494 has joined #archiveteam-bs |
02:57
🔗
|
|
vitzli has joined #archiveteam-bs |
03:27
🔗
|
|
necenzura has joined #archiveteam-bs |
03:53
🔗
|
|
necenzura has quit IRC (Quit: Page closed) |
04:00
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
04:04
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
04:11
🔗
|
|
dashcloud has joined #archiveteam-bs |
04:12
🔗
|
|
aaaaaaaaa has quit IRC (Leaving) |
04:24
🔗
|
|
mistym has joined #archiveteam-bs |
04:30
🔗
|
|
vitzli has quit IRC (Quit: Leaving) |
04:31
🔗
|
|
vitzli has joined #archiveteam-bs |
04:59
🔗
|
|
Start_ has joined #archiveteam-bs |
04:59
🔗
|
|
Start has quit IRC (Read error: Connection reset by peer) |
05:06
🔗
|
|
brayden has joined #archiveteam-bs |
05:11
🔗
|
|
c_b has quit IRC (Quit: c_b) |
05:43
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
05:49
🔗
|
godane |
https://www.youtube.com/watch?v=aOOE7KrrCpE |
06:25
🔗
|
|
primus104 has joined #archiveteam-bs |
07:13
🔗
|
|
edsu has joined #archiveteam-bs |
07:20
🔗
|
|
john has joined #archiveteam-bs |
07:21
🔗
|
john |
Does wpull not support --no-clobber, despite --help listing it? |
07:24
🔗
|
yipdw |
it's implemented |
07:25
🔗
|
yipdw |
if you're writing WARCs you won't need to worry about it |
07:25
🔗
|
john |
Really? Because for me it downloads everything again. |
07:25
🔗
|
john |
And when I append --no-clobber it prints the usage and exits. |
07:25
🔗
|
john |
I built it from git master today. |
07:26
🔗
|
yipdw |
use a stable version |
07:26
🔗
|
yipdw |
master is generally good enough for use but I haven't been tracking it |
07:27
🔗
|
john |
Okay. |
07:27
🔗
|
john |
I thought it'd be one of those projects where git master is always the recommended version. |
07:27
🔗
|
yipdw |
what gave you that impression |
07:28
🔗
|
yipdw |
chfoo is generally pretty good about releases |
07:28
🔗
|
yipdw |
http://wpull.readthedocs.org/en/master/changelog.html |
07:28
🔗
|
ersi |
Who doesn't like bleeding edge? It should cut you, else it ain't good |
07:28
🔗
|
ersi |
and new |
07:29
🔗
|
yipdw |
FWIW, we don't use no-clobber anywhere in archivebot |
07:29
🔗
|
yipdw |
I don't know what options you're passing, but downloading twice is not the default behavior |
07:29
🔗
|
john |
Still doesn't work. |
07:30
🔗
|
yipdw |
the list of options we pass is as follows: https://github.com/ArchiveTeam/ArchiveBot/blob/master/pipeline/archivebot/seesaw/wpull.py#L22-L57 |
07:30
🔗
|
john |
http_proxy="127.0.0.1:4444" wpull http://echelon.i2p/ --warc-file echelon.i2p --page-requisites --recursive --level inf --warc-max-size 5000000000 --no-clobber |
07:30
🔗
|
john |
That's what I'm trying. |
07:30
🔗
|
yipdw |
clobber doesn't occur with WARC writing |
07:31
🔗
|
yipdw |
so you don't need to specify it |
07:31
🔗
|
yipdw |
the data goes right into the WARC |
07:32
🔗
|
john |
All right. |
07:32
🔗
|
john |
But it still downloads everything again. |
07:32
🔗
|
yipdw |
are you seeing duplicate HTTP requests or files along with WARC records |
07:33
🔗
|
john |
Yes. |
07:33
🔗
|
yipdw |
what the hell does that mean |
07:34
🔗
|
john |
It means, it requests files that are already in the warc archive. |
07:34
🔗
|
yipdw |
if the request comes from a redirect, that'll happen |
07:35
🔗
|
yipdw |
wpull operates on URLs, not files |
07:35
🔗
|
yipdw |
at least when doing websitse |
07:35
🔗
|
yipdw |
es |
07:36
🔗
|
john |
It's not just that, it will again fetch the robots.txt and index file too. |
07:36
🔗
|
yipdw |
post the logs |
07:36
🔗
|
john |
All right. |
07:37
🔗
|
john |
http://sprunge.us/ZebB |
07:38
🔗
|
yipdw |
that log looks normal, there's no duplicate fetches in there |
07:40
🔗
|
yipdw |
are you resuming a stopped grab? |
07:40
🔗
|
yipdw |
if so you need to record the results to a database with --database |
07:41
🔗
|
yipdw |
otherwise wpull will use an in-memory database that goes away once the process exits |
07:41
🔗
|
john |
Oh… |
07:41
🔗
|
john |
So that's what that's for. All right. |
07:41
🔗
|
yipdw |
http://wpull.readthedocs.org/en/master/usage.html#stopping-resuming |
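A minimal sketch of the resumable invocation discussed above, assuming john's flags from earlier in the log plus the `--database` option yipdw mentions. The helper name `build_wpull_command` and the database file name `echelon.i2p.db` are made up for illustration, and `--no-clobber` is left out because the data goes into the WARC.

```python
import shlex


def build_wpull_command(url, warc_name, database_path):
    """Build a wpull argument list that keeps its URL table on disk.

    All flags are taken from the conversation above; the database path is
    whatever file you choose, so that a restarted run can reuse it.
    """
    return [
        "wpull", url,
        "--warc-file", warc_name,
        "--page-requisites",
        "--recursive",
        "--level", "inf",
        # Without --database, wpull keeps its state in memory and it is
        # gone once the process exits.
        "--database", database_path,
    ]


command = build_wpull_command(
    "http://echelon.i2p/", "echelon.i2p", "echelon.i2p.db"  # hypothetical file name
)
print(shlex.join(command))
```

Re-running the same command with the same database file is what should let a stopped grab pick up where it left off, per the linked usage docs.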
08:27
🔗
|
john |
I must say, I'm very happy with the web archive's new design. ^_^ |
09:10
🔗
|
|
schbirid2 has quit IRC (Leaving) |
09:33
🔗
|
|
schbirid has joined #archiveteam-bs |
09:33
🔗
|
schbirid |
anyone know a twitter bot that one can simply feed any text corpus to for funny markov chain tweets? all i found so far are based on your own tweet archive |
09:36
🔗
|
|
vitzli has quit IRC (Quit: Leaving) |
09:36
🔗
|
schbirid |
nvm, cant find the corpus i wanted to use as text anyways :( |
11:49
🔗
|
godane |
i'm uploading Computer Power User 2014 pdfs |
12:04
🔗
|
godane |
btw i'm also uploading Archival Outlook |
12:04
🔗
|
godane |
from Society of American Archivists |
12:05
🔗
|
godane |
i'm only doing that cause there is no collection of it on IA |
12:06
🔗
|
godane |
and 2014 pdf are being put on bluetoad.org |
12:08
🔗
|
godane |
https://archive.org/details/Archival_Outlook-2004-07 |
12:27
🔗
|
|
primus104 has quit IRC (Leaving.) |
12:39
🔗
|
Smiley |
john: it was originally (afaik) blood for the blood god |
12:40
🔗
|
Smiley |
schbirid: you wanted to use the sweary one? |
12:40
🔗
|
john |
All right. |
12:42
🔗
|
schbirid |
Smiley? |
12:42
🔗
|
Smiley |
sweary corpus |
12:45
🔗
|
|
lysobit has quit IRC (Quit: quit) |
12:46
🔗
|
schbirid |
nah |
12:49
🔗
|
Smiley |
aww |
12:55
🔗
|
|
lysobit has joined #archiveteam-bs |
14:29
🔗
|
godane |
so i found something called flightglobal.com |
14:29
🔗
|
godane |
it has tons of Flight pdfs |
14:30
🔗
|
|
primus104 has joined #archiveteam-bs |
14:32
🔗
|
godane |
i may have to convert pdf pages into one pdf |
14:32
🔗
|
godane |
cause they put every page as its own pdf |
14:35
🔗
|
john |
Hmm… that's weird. |
14:35
🔗
|
john |
I thought the .com file extension was usually used for flat binary files. |
14:39
🔗
|
|
primus104 has quit IRC (Leaving.) |
14:49
🔗
|
joepie91_ |
john: a file extension is nothing but bytes |
14:49
🔗
|
joepie91_ |
it doesn't define a file |
14:49
🔗
|
joepie91_ |
it's just the name |
14:49
🔗
|
john |
I know. |
14:50
🔗
|
john |
Usually the header gives you a good idea, but even that can be deceiving. |
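As a rough illustration of checking the header instead of the extension, here is a small sketch that looks at the first bytes of a file. The function name `sniff_type` and the short signature table are assumptions for the example; it only knows a handful of well-known magic numbers, and a matching prefix is still only a hint.

```python
# Well-known leading byte signatures ("magic numbers") for a few formats.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"PK\x03\x04", "zip"),
    (b"\x1f\x8b", "gzip"),
]


def sniff_type(data):
    """Guess a format from the leading bytes, ignoring the file name entirely."""
    for prefix, name in SIGNATURES:
        if data.startswith(prefix):
            return name
    return "unknown"


print(sniff_type(b"%PDF-1.4\n..."))      # leading bytes of a PDF
print(sniff_type(b"just some text"))     # no known signature
```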
14:50
🔗
|
joepie91_ |
so it's probably an archive of a site named flightglobal.com :P |
14:55
🔗
|
john |
Oh… |
15:19
🔗
|
|
underscor has quit IRC (Ping timeout: 370 seconds) |
15:28
🔗
|
|
underscor has joined #archiveteam-bs |
15:28
🔗
|
|
swebb sets mode: +o underscor |
15:28
🔗
|
|
primus104 has joined #archiveteam-bs |
15:36
🔗
|
godane |
so i finally figured out why kbs korea culture news stopped at the end of Jan 2003 |
15:42
🔗
|
godane |
it looks like they just had high bit rate wmv between june 2002 to jan 2003 |
15:42
🔗
|
godane |
btw i'm getting something called Classic Odyssey |
15:45
🔗
|
johtso |
anyone know of any very lenient regexes for matching URLs? |
15:45
🔗
|
johtso |
ie. not requiring the protocol |
15:46
🔗
|
johtso |
maybe even using a valid tld list.. |
15:47
🔗
|
|
underscor has quit IRC (Ping timeout: 370 seconds) |
15:47
🔗
|
|
brayden has quit IRC (Ping timeout: 606 seconds) |
15:54
🔗
|
joepie91_ |
johtso: "valid TLD list" became infeasible since ICANN went overboard with gTLDs |
15:55
🔗
|
joepie91_ |
technically speaking, 'hi' is a valid URL if you want to ignore the protocol |
15:55
🔗
|
johtso |
joepie91_, https://www.publicsuffix.org/list/effective_tld_names.dat |
15:56
🔗
|
johtso |
just grab that and compile it into your regex :) |
15:57
🔗
|
joepie91_ |
yeah, no |
15:57
🔗
|
joepie91_ |
there's a number of issues with that list and you probably don't want a regex that large |
15:57
🔗
|
johtso |
joepie91_, by URL I really mean publicly accessible web address |
15:57
🔗
|
joepie91_ |
not to mention that this is NOT a complete list |
15:57
🔗
|
joepie91_ |
yes |
15:57
🔗
|
joepie91_ |
johtso: try ctrl+Fing that list for .onion |
15:57
🔗
|
joepie91_ |
publicly accessible, just on a different network |
15:57
🔗
|
joepie91_ |
not on the list |
15:57
🔗
|
johtso |
mm, okay |
15:58
🔗
|
johtso |
well, .onion wouldn't really be something I'd be looking for anyway ;) |
15:58
🔗
|
johtso |
really I'm trying to extract file locker urls, but for my first pass I want to make sure I don't miss anything |
15:58
🔗
|
joepie91_ |
extract from |
15:59
🔗
|
joepie91_ |
? |
15:59
🔗
|
johtso |
the html content of blogger posts/comments |
15:59
🔗
|
johtso |
and can't rely on the links being in html markup |
16:00
🔗
|
joepie91_ |
and why without the protocol? |
16:00
🔗
|
johtso |
just guessing that there must be *some* links out there that are missing the protocol |
16:01
🔗
|
johtso |
I'd rather not miss them |
16:05
🔗
|
joepie91_ |
just grab anything [\x21-\x7E-]+\.[\x21-\x7E-]+\/[\x21-\x7E-]+ |
16:06
🔗
|
joepie91_ |
chars<dot>chars<slash>chars |
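A minimal Python sketch of the chars<dot>chars<slash>chars idea, written with lowercase `\x` escapes as Python's `re` module expects. The names `LENIENT_URL` and `extract_candidates` are invented for the example, and the extra `-` in the character class is dropped because the 0x21-0x7E range already contains it.

```python
import re

# Printable, non-space ASCII (0x21-0x7E); the range already covers "-", "." and "/".
CHARS = r"[\x21-\x7E]+"
LENIENT_URL = re.compile(CHARS + r"\." + CHARS + "/" + CHARS)


def extract_candidates(text):
    """Return every chars<dot>chars<slash>chars run found in the text."""
    return LENIENT_URL.findall(text)


sample = "see example.com/file.zip or http://host.example/a/b. thanks"
print(extract_candidates(sample))
```

Because the class matches everything except whitespace, each hit is effectively a whole whitespace-delimited token, so the second candidate in the sample keeps its trailing period. That is deliberate for a first pass that should not miss anything; the false positives get filtered afterwards.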
16:06
🔗
|
|
Start_ is now known as Start |
16:06
🔗
|
joepie91_ |
you'll get a bunch of false positives I'm sure |
16:06
🔗
|
joepie91_ |
but that's one HEAD away |
16:06
🔗
|
johtso |
sounds like a great idea, seeing as I'm not interested in bare urls |
16:06
🔗
|
joepie91_ |
:P |
16:06
🔗
|
|
Start has quit IRC (Disconnected.) |
16:06
🔗
|
|
Start has joined #archiveteam-bs |
16:06
🔗
|
|
Start has quit IRC (Remote host closed the connection) |
16:06
🔗
|
|
Start has joined #archiveteam-bs |
16:06
🔗
|
johtso |
one HEAD away? :) |
16:07
🔗
|
Sanqui |
the problem is you might be too greedy, and grab a period at the end -> 404 |
16:07
🔗
|
Sanqui |
or you might NOT grab the period -> 404 |
16:07
🔗
|
Sanqui |
ideally, you'd get both variations, but I don't think you can do that with a regex |
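One way to get both variations is to do it after the regex rather than in it. The sketch below assumes a candidate string has already been extracted; `candidate_variants` and the `TRAILING` punctuation set are hypothetical choices, not anything from a library.

```python
# Punctuation that often follows a link in running text (an assumed, tunable set).
TRAILING = ".,;:!?)\"'"


def candidate_variants(token):
    """Return the token as found, plus a copy with trailing punctuation removed."""
    variants = [token]
    stripped = token.rstrip(TRAILING)
    if stripped and stripped != token:
        variants.append(stripped)
    return variants


print(candidate_variants("http://host.example/a/b."))
print(candidate_variants("example.com/file.zip"))
```

Each variant can then be checked separately, so neither the greedy nor the non-greedy reading has to be chosen up front.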
16:08
🔗
|
joepie91_ |
johtso: HEAD request |
16:08
🔗
|
joepie91_ |
requests headers, not body |
16:08
🔗
|
johtso |
ah right! |
16:08
🔗
|
joepie91_ |
so you get the status code |
16:08
🔗
|
joepie91_ |
if it's 200, it's probably valid |
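A sketch of that HEAD check using only the Python standard library. The helper names `with_default_scheme` and `head_status` are invented here, and prefixing `http://` to scheme-less candidates is an assumption about how to handle links that were posted without a protocol.

```python
import http.client
import urllib.error
import urllib.request


def with_default_scheme(candidate, scheme="http"):
    """Prefix a scheme when the extracted text has none (assumed default: http)."""
    if "://" in candidate:
        return candidate
    return f"{scheme}://{candidate}"


def head_status(candidate, timeout=10.0):
    """Send a HEAD request and return the status code, or None if it failed."""
    try:
        request = urllib.request.Request(with_default_scheme(candidate), method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # The server answered, just not with a success code (404, 503, ...).
        return error.code
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # DNS failure, refused connection, timeout, or a token that is not a URL.
        return None
```

Note that `urlopen` follows redirects, so the returned code is for the final location; some servers also reject HEAD outright, in which case a fallback GET may be needed.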
16:08
🔗
|
johtso |
yeah, see if they're alive |
16:08
🔗
|
joepie91_ |
Sanqui: that's postprocessing :) |
16:10
🔗
|
dashcloud |
anyone have a good macro recording program so I can record me clicking a button, pressing a different button for a screenshot, and then closing any windows opened by my first button press? |
16:11
🔗
|
joepie91_ |
OS? |
16:14
🔗
|
dashcloud |
either windows or linux |
16:14
🔗
|
dashcloud |
under linux, I'd be running the program under wine |
16:16
🔗
|
schbirid |
simplescreenrecorder maybe? |
16:16
🔗
|
schbirid |
i just use ffmpeg x11grab if i need to record something |
16:16
🔗
|
schbirid |
ffmpeg -f x11grab -s 1280x800 -r 30 -i :0.0 -qscale 0 /tmp/x11grab4.mpg |
16:16
🔗
|
schbirid |
oh ignore me |
16:16
🔗
|
schbirid |
haha |
16:17
🔗
|
joepie91_ |
dashcloud: windows, autohotkey |
16:17
🔗
|
joepie91_ |
linux, nfi. I never do GUI automation other than some wmctrl hacks to make XBMC play nice with multiple monitors |
16:17
🔗
|
joepie91_ |
:p |
16:19
🔗
|
johtso |
dashcloud, I haven't used it, but you might want to check out http://www.sikuli.org/ |
16:21
🔗
|
dashcloud |
thanks! |
16:27
🔗
|
|
brayden has joined #archiveteam-bs |
16:48
🔗
|
Start |
have we grabbed the videos from joystiq yet? |
16:49
🔗
|
Start |
it now redirects to engadget |
16:52
🔗
|
ersi |
I think godane did a lot of them |
16:52
🔗
|
Start |
ok |
16:55
🔗
|
|
underscor has joined #archiveteam-bs |
16:55
🔗
|
|
swebb sets mode: +o underscor |
17:03
🔗
|
godane |
i uploaded the tuaw videos to Jason's ftp |
17:04
🔗
|
godane |
but joystiq videos i didn't grab all yet |
17:05
🔗
|
Start |
oh |
17:05
🔗
|
Start |
how much did you get? |
17:07
🔗
|
godane |
i really don't remember how much i got |
17:07
🔗
|
godane |
but i want to say 400 to 500 videos |
17:07
🔗
|
godane |
also joystiq youtube channel still has all of the videos |
17:08
🔗
|
Start |
that's a relief |
17:09
🔗
|
joepie91_ |
Facebook is killing their XMPP API on April 30: https://developers.facebook.com/docs/chat |
17:12
🔗
|
xmc |
oh really, nice. |
17:20
🔗
|
|
mistym has joined #archiveteam-bs |
17:23
🔗
|
|
aaaaaaaaa has joined #archiveteam-bs |
17:37
🔗
|
|
schbirid has quit IRC (Leaving) |
17:37
🔗
|
|
schbirid has joined #archiveteam-bs |
18:39
🔗
|
|
dashcloud has quit IRC (Read error: Connection reset by peer) |
18:42
🔗
|
|
dashcloud has joined #archiveteam-bs |
18:44
🔗
|
|
xtr-201 has quit IRC (Read error: Connection reset by peer) |
18:49
🔗
|
|
aaaaaaaaa has quit IRC (Read error: Operation timed out) |
19:14
🔗
|
SketchCow |
https://www.youtube.com/watch?v=uPVQMZ4ikvM |
19:17
🔗
|
schbirid |
ffs, git/github privacy leaking is ridiculous |
19:18
🔗
|
schbirid |
if you have more than one account, you are bound to accidentally post with random ones every now and then |
19:20
🔗
|
|
underscor has quit IRC (Ping timeout: 370 seconds) |
19:24
🔗
|
|
underscor has joined #archiveteam-bs |
19:24
🔗
|
|
swebb sets mode: +o underscor |
19:30
🔗
|
|
BlueMaxim has quit IRC (Ping timeout: 512 seconds) |
19:31
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
19:48
🔗
|
|
SN4T14__ has joined #archiveteam-bs |
19:50
🔗
|
|
aaaaaaaaa has joined #archiveteam-bs |
19:51
🔗
|
|
lytv has quit IRC (Read error: Operation timed out) |
19:51
🔗
|
|
lytv has joined #archiveteam-bs |
19:55
🔗
|
|
SN4T14_ has quit IRC (Ping timeout: 512 seconds) |
20:18
🔗
|
|
schbirid has quit IRC (Leaving) |
20:41
🔗
|
useretail |
SketchCow: have they received their nobel prizes? |
20:42
🔗
|
SketchCow |
They should! |
20:43
🔗
|
useretail |
http://www.engineering.com/DesignerEdge/DesignerEdgeArticles/ArticleID/9848/VIDEO-Introducing-a-Fire-Extinguisher-Fuelled-by-Sound.aspx |
20:43
🔗
|
useretail |
In fact, the Defense Advanced Research Agency (DARPA) developed a system back in 2012 that utilized sound to put out flames. |
20:44
🔗
|
useretail |
However, this marks the first time engineers have created an actual extinguisher using sound. |
20:44
🔗
|
joepie91_ |
"Engineering seniors Viet Tran and Seth Robertson now hold a preliminary patent application for their potentially revolutionizing device. " |
20:44
🔗
|
joepie91_ |
well, was nice while it lasted |
20:45
🔗
|
SketchCow |
Dude, inventors patent shit |
20:46
🔗
|
useretail |
yep, patents are killing innovation |
20:46
🔗
|
|
SketchCow changes topic to: Archive Team: https://i.imgur.com/d9dPE6s.gif |
20:47
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
20:50
🔗
|
SketchCow |
I don't agree, but the latest rustling in the fuck drawer came up entry |
20:50
🔗
|
SketchCow |
empty |
20:50
🔗
|
SketchCow |
Also, roughly $2000 went out the door yesterday into bills and debt and I am not happy |
20:56
🔗
|
|
dashcloud has joined #archiveteam-bs |
21:06
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
21:23
🔗
|
|
mistym has joined #archiveteam-bs |
22:16
🔗
|
|
dashcloud has quit IRC (Read error: Operation timed out) |
22:19
🔗
|
|
dashcloud has joined #archiveteam-bs |
23:24
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
23:24
🔗
|
|
mistym has joined #archiveteam-bs |
23:24
🔗
|
|
mistym has quit IRC (Remote host closed the connection) |
23:37
🔗
|
|
mistym has joined #archiveteam-bs |
23:38
🔗
|
|
primus104 has quit IRC (Leaving.) |
23:49
🔗
|
johtso |
still getting 503 trying to upload to IA :( |
23:50
🔗
|
johtso |
hopefully they'll sort it out tomorrow |