| Time |
Nickname |
Message |
|
01:28
🔗
|
godane |
i'm at 563k items now |
|
01:52
🔗
|
|
RichardG has quit IRC (Ping timeout: 499 seconds) |
|
02:01
🔗
|
|
RichardG has joined #archiveteam-bs |
|
02:18
🔗
|
|
RichardG has quit IRC (Ping timeout: 615 seconds) |
|
02:33
🔗
|
Ravenloft |
do you guys think Kim Dotcom will be extradited to US? |
|
02:41
🔗
|
|
RichardG has joined #archiveteam-bs |
|
02:56
🔗
|
|
RichardG has quit IRC (Ping timeout: 250 seconds) |
|
03:09
🔗
|
|
RichardG has joined #archiveteam-bs |
|
03:59
🔗
|
Sketchcow |
Probably. |
|
04:08
🔗
|
godane |
Turning_Point_Presents_-_Super_Sheep_199x_VHSRip |
|
04:08
🔗
|
godane |
http://archive.org/details/Turning_Point_Presents_-_Super_Sheep_199x_VHSRip |
|
04:09
🔗
|
godane |
https://archive.org/details/NASA_-_The_First_25_Years_-_Good_Times_Home_Video_1987_VHSRip |
|
04:17
🔗
|
|
ndiddy has quit IRC (Read error: Connection reset by peer) |
|
04:25
🔗
|
|
Nertsy has joined #archiveteam-bs |
|
05:41
🔗
|
|
JetBalsa has quit IRC (Read error: Connection reset by peer) |
|
07:00
🔗
|
godane |
https://archive.org/details/The_Making_of_the_Stooges_1984_VHSRip |
|
07:18
🔗
|
|
JesseW has quit IRC (Leaving.) |
|
08:58
🔗
|
|
robink has joined #archiveteam-bs |
|
09:16
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
|
09:50
🔗
|
|
schbirid has joined #archiveteam-bs |
|
09:55
🔗
|
godane |
https://archive.org/details/Breakin_In_The_USA_1984_VHSRip |
|
10:46
🔗
|
|
VADemon has quit IRC (left4dead) |
|
14:32
🔗
|
schbirid |
https://events.ccc.de/congress/2015/wiki/Lightning:Internet_Radio_Recorder |
|
14:33
🔗
|
schbirid |
https://events.ccc.de/congress/2015/wiki/Static:Crawling |
|
15:46
🔗
|
|
marvinw is now known as ivan` |
|
15:48
🔗
|
ivan` |
do IA's massaged URLs (in their CDXes) cause problems in practice? I see that they always lowercase, which could cause problems with things like imgur, but I don't know if I've ever observed problems |
|
15:48
🔗
|
ivan` |
investigating this because I'm going to load a lot of CDXes into a database |
|
15:50
🔗
|
ivan` |
hmm, I guess if you get multiple results for a massaged URL, you can look up an exact-case match |
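A minimal sketch of the lookup ivan` describes, assuming a massaged URL is just the lowercased URL (the keys in IA's CDXes are canonicalized further than this). `massage` and `UrlIndex` are hypothetical names for illustration, not part of any IA tool.

```python
from collections import defaultdict


def massage(url: str) -> str:
    """Simplified stand-in for a canonical CDX key: lowercase only (assumption)."""
    return url.lower()


class UrlIndex:
    """Maps a massaged key to every original URL that produced it."""

    def __init__(self) -> None:
        self._by_key = defaultdict(list)

    def add(self, url: str) -> None:
        self._by_key[massage(url)].append(url)

    def lookup(self, url: str) -> list:
        """Return the exact-case match if one exists, else all candidates."""
        candidates = self._by_key.get(massage(url), [])
        if url in candidates:
            return [url]
        # No exact match: the caller has to treat the result as ambiguous.
        return list(candidates)


index = UrlIndex()
index.add("http://i.imgur.com/AbCdE.jpg")
index.add("http://i.imgur.com/abcde.jpg")
print(index.lookup("http://i.imgur.com/AbCdE.jpg"))
```

Returning every candidate when there is no exact match leaves it to the caller to decide whether to show anything at all.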
|
15:58
🔗
|
arkiver |
ivan`: we got the problem with newsgrabber figured out |
|
15:58
🔗
|
arkiver |
it was due to encoding problems |
|
15:58
🔗
|
arkiver |
in this case with the dari language |
|
16:02
🔗
|
|
schbirid has quit IRC (Quit: Leaving) |
|
16:20
🔗
|
|
schbirid has joined #archiveteam-bs |
|
16:30
🔗
|
ivan` |
arkiver: ok if it's a grab-site thing please file a bug |
|
17:04
🔗
|
ivan` |
"This module depends on the tldextract module to query the Public Suffix List. tldextract can be installed via pip" https://github.com/rajbot/surt |
|
17:05
🔗
|
ivan` |
that is worrying to say the least |
|
17:05
🔗
|
ivan` |
what happens when the list changes and SURTs don't match |
|
17:13
🔗
|
godane |
https://archive.org/details/We_Are_the_World_-_The_Story_Behind_the_Song_ATV-10_1987 |
|
17:23
🔗
|
ivan` |
oh, it implements some public suffix thing but it's behind a boolean that's always False |
|
17:24
🔗
|
HCross |
Sketchcow, can you please move the Cryengine files from godane to the IA please |
|
17:37
🔗
|
godane |
https://archive.org/details/The_Red_Nose_Express_1987_VHSRip |
|
17:41
🔗
|
|
JesseW has joined #archiveteam-bs |
|
17:57
🔗
|
arkiver |
ivan`: I'll do that |
|
17:57
🔗
|
arkiver |
I found a very strange problem |
|
17:57
🔗
|
arkiver |
~/.local/bin/grab-site http://www.eqmweekly.com.af/international/8288-???????-??-?????-?????-????? --level=0 --no-sitemaps --concurrency=5 --1 --warc-max-size=524288000 --wpull-args="--no-check-certificate --timeout=300" |
|
17:57
🔗
|
arkiver |
that works |
|
17:58
🔗
|
arkiver |
~/.local/bin/grab-site http://www.eqmweekly.com.af/technology/8287-???-???????-???-???-??-??????? --level=0 --no-sitemaps --concurrency=5 --1 --warc-max-size=524288000 --wpull-args="--no-check-certificate --timeout=300" |
|
17:58
🔗
|
arkiver |
that does not work |
|
17:59
🔗
|
|
JetBalsa has joined #archiveteam-bs |
|
18:27
🔗
|
|
JesseW has quit IRC (Leaving.) |
|
18:46
🔗
|
|
VADemon has joined #archiveteam-bs |
|
19:05
🔗
|
godane |
https://archive.org/details/1994-05-12_David_Copperfield_15_Years_of_Magic |
|
19:29
🔗
|
ohhdemgir |
midas, |
|
19:29
🔗
|
ohhdemgir |
get in #effteepee |
|
19:29
🔗
|
ohhdemgir |
then shout at me |
|
19:57
🔗
|
|
Stilett0 has joined #archiveteam-bs |
|
19:58
🔗
|
|
Stiletto has quit IRC (Read error: Operation timed out) |
|
20:11
🔗
|
|
Stilett0 has quit IRC (Read error: Operation timed out) |
|
20:20
🔗
|
yipdw |
dumping a postgresql database over inflight wifi is not the best experience |
|
20:37
🔗
|
CatButts |
hurp |
|
20:40
🔗
|
DFJustin |
ivan`: I have seen wayback return the wrong imgur image if there is a case-insensitive match |
|
20:41
🔗
|
DFJustin |
I'm not sure what happens if there are multiple matches, one of which is exact |
|
20:45
🔗
|
CatButts |
I want to make sweet sweet love |
|
20:46
🔗
|
CatButts |
to a womancat |
|
20:54
🔗
|
ohhdemgir |
yipdw, >inflight wifi is not the best experience |
|
20:54
🔗
|
yipdw |
I did indeed write that, yes |
|
21:17
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
|
21:22
🔗
|
godane |
https://archive.org/details/1989-07-26_Japan_TV |
|
21:36
🔗
|
SmileyG |
Sooooooo |
|
21:37
🔗
|
SmileyG |
at some point the FAA will put up a public list |
|
21:37
🔗
|
SmileyG |
of all registered drone owners |
|
21:37
🔗
|
SmileyG |
.... publicly searchable etc |
|
21:37
🔗
|
godane |
https://archive.org/details/Fisher-Price_Grimms_Fairy_Tales_-_The_Frog_Prince_1989_VHSRip |
|
22:02
🔗
|
|
xmc has quit IRC (Read error: Operation timed out) |
|
22:02
🔗
|
|
RichardG_ has joined #archiveteam-bs |
|
22:03
🔗
|
|
yakfish has quit IRC (Read error: Operation timed out) |
|
22:03
🔗
|
|
myself has quit IRC (Read error: Operation timed out) |
|
22:03
🔗
|
|
robink has quit IRC (Write error: Broken pipe) |
|
22:03
🔗
|
|
sep332 has quit IRC (Write error: Broken pipe) |
|
22:03
🔗
|
|
beardicus has quit IRC (Read error: Operation timed out) |
|
22:04
🔗
|
|
botpie91 has quit IRC (Read error: Operation timed out) |
|
22:06
🔗
|
|
RichardG has quit IRC (Read error: Operation timed out) |
|
22:09
🔗
|
|
Zebranky has quit IRC (Read error: Operation timed out) |
|
22:09
🔗
|
|
Zebranky has joined #archiveteam-bs |
|
22:09
🔗
|
|
JetBalsa has quit IRC (Read error: Operation timed out) |
|
22:10
🔗
|
|
JetBalsa has joined #archiveteam-bs |
|
22:10
🔗
|
|
rduser has quit IRC (Read error: Operation timed out) |
|
22:10
🔗
|
|
rduser has joined #archiveteam-bs |
|
22:10
🔗
|
godane |
https://archive.org/details/In_The_Aftermath_New_World_Entertainment_1988_VHSRip |
|
22:11
🔗
|
|
Sketchcow has quit IRC (Read error: Operation timed out) |
|
22:12
🔗
|
|
is- has quit IRC (Read error: Operation timed out) |
|
22:12
🔗
|
|
is-_ has joined #archiveteam-bs |
|
22:13
🔗
|
|
Baljem_ has quit IRC (Read error: Operation timed out) |
|
22:14
🔗
|
|
Sketchcow has joined #archiveteam-bs |
|
22:14
🔗
|
|
midas sets mode: +o Sketchcow |
|
22:14
🔗
|
|
swebb sets mode: +o Sketchcow |
|
22:14
🔗
|
|
GLaDOS sets mode: +o Sketchcow |
|
22:19
🔗
|
|
Baljem has joined #archiveteam-bs |
|
22:30
🔗
|
|
is-_ is now known as is- |
|
22:30
🔗
|
|
kyan has joined #archiveteam-bs |
|
22:35
🔗
|
|
schbirid has quit IRC (Quit: Leaving) |
|
22:40
🔗
|
|
kyan has quit IRC (Quit: This computer has gone to sleep) |
|
22:45
🔗
|
ivan` |
DFJustin: it looks like it prefers the latest snapshot instead of the exact-case match |
|
22:46
🔗
|
ivan` |
I just contaminated https://news.ycombinator.com/user?id=rms with https://news.ycombinator.com/user?id=RMS in wayback |
|
22:46
🔗
|
ivan` |
I'm probably going to have domain-specific rules for my massaged URLs and re-generate them whenever I add new rules |
|
22:47
🔗
|
ivan` |
even if you priority exact-case matches it's bad UX to tell a user you have something when it's the wrong thing |
|
22:47
🔗
|
ivan` |
prioritize |
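The domain-specific rules ivan` mentions could look roughly like the sketch below. The host set and the `massage` function are assumptions made for illustration; the only rule shown is "keep case for hosts known to be case-sensitive".

```python
from urllib.parse import urlsplit

# Hypothetical rule table: hosts whose paths and queries are case-sensitive.
CASE_SENSITIVE_HOSTS = {"imgur.com", "i.imgur.com", "news.ycombinator.com"}


def massage(url: str) -> str:
    """Build a lookup key, lowercasing path and query only where that is safe."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()  # hostnames are case-insensitive
    rest = parts.path
    if parts.query:
        rest += "?" + parts.query
    if host not in CASE_SENSITIVE_HOSTS:
        rest = rest.lower()
    return host + rest


print(massage("https://news.ycombinator.com/user?id=RMS"))
print(massage("https://news.ycombinator.com/user?id=rms"))
```

With this rule the two Hacker News user pages get distinct keys. Any keys already stored would have to be regenerated whenever the host set changes.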
|
23:05
🔗
|
ivan` |
arkiver: works for me. I assume you are quoting URLs with question marks if you are dumping them into a shell? |
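For the quoting ivan` asks about, a small sketch using Python's standard `shlex.quote`. The URL is a made-up example and `build_command` is a hypothetical helper; only the program name comes from the log.

```python
import shlex


def build_command(url: str) -> str:
    """Return a shell-safe command line; '?' would otherwise be a glob character."""
    args = ["grab-site", url]
    return " ".join(shlex.quote(arg) for arg in args)


print(build_command("http://example.com/page?id=1"))
```

Passing the arguments as a list to `subprocess.run` avoids the shell entirely, which sidesteps the quoting question.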
|
23:19
🔗
|
arkiver |
ivan`: for me only the first line works. And then I just dump the exact same line as I pasted above in the terminal |
|
23:22
🔗
|
ivan` |
arkiver: can you paste an error? |
|
23:22
🔗
|
arkiver |
sorry, they don't contain question marks |
|
23:22
🔗
|
arkiver |
wait I'll put them up somewhere else |
|
23:24
🔗
|
|
Stiletto has joined #archiveteam-bs |
|
23:25
🔗
|
arkiver |
ivan`: https://ia601500.us.archive.org/35/items/testlinesurls36943/testlines.txt |
|
23:25
🔗
|
arkiver |
you should see some kind of arabic characters |
|
23:26
🔗
|
arkiver |
only the first line works for me |
|
23:26
🔗
|
ivan` |
heh yes finally an error |
|
23:26
🔗
|
ivan` |
(I see it here) |
|
23:26
🔗
|
arkiver |
the second line gives a 'URL is not printable' error |
|
23:26
🔗
|
arkiver |
ok |
|
23:27
🔗
|
ivan` |
arkiver: I blame wpull. try encoding your input URLs? |
|
23:27
🔗
|
arkiver |
utf-8? |
|
23:27
🔗
|
ivan` |
urlencoding, that is, unicode -> utf-8 -> %XX%XX%XX for the path |
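The unicode -> utf-8 -> %XX steps can be sketched with Python's standard library. `percent_encode_path` is a hypothetical helper name and the URL in the example is made up. This is not how grab-site or wpull handle URLs internally.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def percent_encode_path(url: str) -> str:
    """Percent-encode the non-ASCII characters in the path of a URL."""
    parts = urlsplit(url)
    # quote() encodes the text as UTF-8, then escapes each byte as %XX.
    # "/" is kept so the path structure survives; "%" is kept so input
    # that is already encoded is not encoded a second time.
    encoded_path = quote(parts.path, safe="/%")
    return urlunsplit(
        (parts.scheme, parts.netloc, encoded_path, parts.query, parts.fragment)
    )


print(percent_encode_path("http://example.com/news/8287-\u062e\u0628\u0631"))
```

Only the path is touched here, as in ivan`'s description. Non-ASCII hostnames or query strings would need separate handling.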
|
23:27
🔗
|
arkiver |
yeah |
|
23:27
🔗
|
arkiver |
sorry, not very into encoding |
|
23:29
🔗
|
ivan` |
I suppose I should either fix this in grab-site or wpull |
|
23:31
🔗
|
arkiver |
seems to be working with encoding them first |
|
23:31
🔗
|
|
botpie91 has joined #archiveteam-bs |
|
23:31
🔗
|
arkiver |
I feel this is more a wpull problem |
|
23:31
🔗
|
|
yakfish has joined #archiveteam-bs |
|
23:32
🔗
|
|
robink has joined #archiveteam-bs |
|
23:33
🔗
|
|
beardicus has joined #archiveteam-bs |
|
23:34
🔗
|
|
sep332 has joined #archiveteam-bs |
|
23:36
🔗
|
|
myself has joined #archiveteam-bs |
|
23:40
🔗
|
|
xmc has joined #archiveteam-bs |
|
23:40
🔗
|
|
swebb sets mode: +o xmc |
|
23:41
🔗
|
arkiver |
I filed a bug for wpull |