Time |
Nickname |
Message |
01:28
🔗
|
godane |
i'm at 563k items now |
01:52
🔗
|
|
RichardG has quit IRC (Ping timeout: 499 seconds) |
02:01
🔗
|
|
RichardG has joined #archiveteam-bs |
02:18
🔗
|
|
RichardG has quit IRC (Ping timeout: 615 seconds) |
02:33
🔗
|
Ravenloft |
do you guys think Kim Dotcom will be extradited to US? |
02:41
🔗
|
|
RichardG has joined #archiveteam-bs |
02:56
🔗
|
|
RichardG has quit IRC (Ping timeout: 250 seconds) |
03:09
🔗
|
|
RichardG has joined #archiveteam-bs |
03:59
🔗
|
Sketchcow |
Probably. |
04:08
🔗
|
godane |
Turning_Point_Presents_-_Super_Sheep_199x_VHSRip |
04:08
🔗
|
godane |
http://archive.org/details/Turning_Point_Presents_-_Super_Sheep_199x_VHSRip |
04:09
🔗
|
godane |
https://archive.org/details/NASA_-_The_First_25_Years_-_Good_Times_Home_Video_1987_VHSRip |
04:17
🔗
|
|
ndiddy has quit IRC (Read error: Connection reset by peer) |
04:25
🔗
|
|
Nertsy has joined #archiveteam-bs |
05:41
🔗
|
|
JetBalsa has quit IRC (Read error: Connection reset by peer) |
07:00
🔗
|
godane |
https://archive.org/details/The_Making_of_the_Stooges_1984_VHSRip |
07:18
🔗
|
|
JesseW has quit IRC (Leaving.) |
08:58
🔗
|
|
robink has joined #archiveteam-bs |
09:16
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
09:50
🔗
|
|
schbirid has joined #archiveteam-bs |
09:55
🔗
|
godane |
https://archive.org/details/Breakin_In_The_USA_1984_VHSRip |
10:46
🔗
|
|
VADemon has quit IRC (left4dead) |
14:32
🔗
|
schbirid |
https://events.ccc.de/congress/2015/wiki/Lightning:Internet_Radio_Recorder |
14:33
🔗
|
schbirid |
https://events.ccc.de/congress/2015/wiki/Static:Crawling |
15:46
🔗
|
|
marvinw is now known as ivan` |
15:48
🔗
|
ivan` |
do IA's massaged URLs (in their CDXes) cause problems in practice? I see that they always lowercase, which could cause problems with things like imgur, but I don't know if I've ever observed problems |
15:48
🔗
|
ivan` |
investigating this because I'm going to load a lot of CDXes into a database |
15:50
🔗
|
ivan` |
hmm, I guess if you get multiple results for a massaged URL, you can look up an exact-case match |
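(Editor's sketch of the exact-case lookup ivan` describes above. It assumes the "massaging" is nothing more than lowercasing the whole URL, which is a simplification of IA's actual canonicalization; `massage_key`, `build_index`, `lookup`, and the imgur URLs are hypothetical names and data for illustration only.)

```python
from collections import defaultdict

def massage_key(url: str) -> str:
    """Hypothetical stand-in for the CDX canonicalization: lowercase only."""
    return url.lower()

def build_index(urls):
    """Group original URLs under their massaged (lowercased) key."""
    index = defaultdict(list)
    for url in urls:
        index[massage_key(url)].append(url)
    return index

def lookup(index, url):
    """Return (exact-case match or None, all captures sharing the key)."""
    candidates = index.get(massage_key(url), [])
    exact = url if url in candidates else None
    return exact, candidates

# Two made-up captures that collide once lowercased.
captured = ["http://i.imgur.com/AbCdE.jpg", "http://i.imgur.com/abcde.jpg"]
index = build_index(captured)
exact, candidates = lookup(index, "http://i.imgur.com/AbCdE.jpg")
```

When `exact` is `None` but `candidates` is non-empty, the caller knows it only has case-insensitive matches and can decline to serve one rather than return the wrong capture.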
15:58
🔗
|
arkiver |
ivan`: we got the problem with newsgrabber figured out |
15:58
🔗
|
arkiver |
it was due to encoding problems |
15:58
🔗
|
arkiver |
in this case with the Dari language |
16:02
🔗
|
|
schbirid has quit IRC (Quit: Leaving) |
16:20
🔗
|
|
schbirid has joined #archiveteam-bs |
16:30
🔗
|
ivan` |
arkiver: ok if it's a grab-site thing please file a bug |
17:04
🔗
|
ivan` |
"This module depends on the tldextract module to query the Public Suffix List. tldextract can be installed via pip" https://github.com/rajbot/surt |
17:05
🔗
|
ivan` |
that is worrying to say the least |
17:05
🔗
|
ivan` |
what happens when the list changes and SURTs don't match |
17:13
🔗
|
godane |
https://archive.org/details/We_Are_the_World_-_The_Story_Behind_the_Song_ATV-10_1987 |
17:23
🔗
|
ivan` |
oh, it implements some public suffix thing but it's behind a boolean that's always False |
17:24
🔗
|
HCross |
Sketchcow, can you please move the Cryengine files from godane to the IA please |
17:37
🔗
|
godane |
https://archive.org/details/The_Red_Nose_Express_1987_VHSRip |
17:41
🔗
|
|
JesseW has joined #archiveteam-bs |
17:57
🔗
|
arkiver |
ivan`: I'll do that |
17:57
🔗
|
arkiver |
I found a very strange problem |
17:57
🔗
|
arkiver |
~/.local/bin/grab-site http://www.eqmweekly.com.af/international/8288-???????-??-?????-?????-????? --level=0 --no-sitemaps --concurrency=5 --1 --warc-max-size=524288000 --wpull-args="--no-check-certificate --timeout=300" |
17:57
🔗
|
arkiver |
that works |
17:58
🔗
|
arkiver |
~/.local/bin/grab-site http://www.eqmweekly.com.af/technology/8287-???-???????-???-???-??-??????? --level=0 --no-sitemaps --concurrency=5 --1 --warc-max-size=524288000 --wpull-args="--no-check-certificate --timeout=300" |
17:58
🔗
|
arkiver |
that does not work |
17:59
🔗
|
|
JetBalsa has joined #archiveteam-bs |
18:27
🔗
|
|
JesseW has quit IRC (Leaving.) |
18:46
🔗
|
|
VADemon has joined #archiveteam-bs |
19:05
🔗
|
godane |
https://archive.org/details/1994-05-12_David_Copperfield_15_Years_of_Magic |
19:29
🔗
|
ohhdemgir |
midas, |
19:29
🔗
|
ohhdemgir |
get in #effteepee |
19:29
🔗
|
ohhdemgir |
then shout at me |
19:57
🔗
|
|
Stilett0 has joined #archiveteam-bs |
19:58
🔗
|
|
Stiletto has quit IRC (Read error: Operation timed out) |
20:11
🔗
|
|
Stilett0 has quit IRC (Read error: Operation timed out) |
20:20
🔗
|
yipdw |
dumping a postgresql database over inflight wifi is not the best experience |
20:37
🔗
|
CatButts |
hurp |
20:40
🔗
|
DFJustin |
ivan`: I have seen wayback return the wrong imgur image if there is a case-insensitive match |
20:41
🔗
|
DFJustin |
I'm not sure what happens if there are multiple matches, one of which is exact |
20:45
🔗
|
CatButts |
I want to make sweet sweet love |
20:46
🔗
|
CatButts |
to a womancat |
20:54
🔗
|
ohhdemgir |
yipdw, >inflight wifi is not the best experience |
20:54
🔗
|
yipdw |
I did indeed write that, yes |
21:17
🔗
|
|
BlueMaxim has joined #archiveteam-bs |
21:22
🔗
|
godane |
https://archive.org/details/1989-07-26_Japan_TV |
21:36
🔗
|
SmileyG |
Sooooooo |
21:37
🔗
|
SmileyG |
at some point the FAA will put up a public list |
21:37
🔗
|
SmileyG |
of all registered drone owners |
21:37
🔗
|
SmileyG |
.... publicly searchable etc |
21:37
🔗
|
godane |
https://archive.org/details/Fisher-Price_Grimms_Fairy_Tales_-_The_Frog_Prince_1989_VHSRip |
22:02
🔗
|
|
xmc has quit IRC (Read error: Operation timed out) |
22:02
🔗
|
|
RichardG_ has joined #archiveteam-bs |
22:03
🔗
|
|
yakfish has quit IRC (Read error: Operation timed out) |
22:03
🔗
|
|
myself has quit IRC (Read error: Operation timed out) |
22:03
🔗
|
|
robink has quit IRC (Write error: Broken pipe) |
22:03
🔗
|
|
sep332 has quit IRC (Write error: Broken pipe) |
22:03
🔗
|
|
beardicus has quit IRC (Read error: Operation timed out) |
22:04
🔗
|
|
botpie91 has quit IRC (Read error: Operation timed out) |
22:06
🔗
|
|
RichardG has quit IRC (Read error: Operation timed out) |
22:09
🔗
|
|
Zebranky has quit IRC (Read error: Operation timed out) |
22:09
🔗
|
|
Zebranky has joined #archiveteam-bs |
22:09
🔗
|
|
JetBalsa has quit IRC (Read error: Operation timed out) |
22:10
🔗
|
|
JetBalsa has joined #archiveteam-bs |
22:10
🔗
|
|
rduser has quit IRC (Read error: Operation timed out) |
22:10
🔗
|
|
rduser has joined #archiveteam-bs |
22:10
🔗
|
godane |
https://archive.org/details/In_The_Aftermath_New_World_Entertainment_1988_VHSRip |
22:11
🔗
|
|
Sketchcow has quit IRC (Read error: Operation timed out) |
22:12
🔗
|
|
is- has quit IRC (Read error: Operation timed out) |
22:12
🔗
|
|
is-_ has joined #archiveteam-bs |
22:13
🔗
|
|
Baljem_ has quit IRC (Read error: Operation timed out) |
22:14
🔗
|
|
Sketchcow has joined #archiveteam-bs |
22:14
🔗
|
|
midas sets mode: +o Sketchcow |
22:14
🔗
|
|
swebb sets mode: +o Sketchcow |
22:14
🔗
|
|
GLaDOS sets mode: +o Sketchcow |
22:19
🔗
|
|
Baljem has joined #archiveteam-bs |
22:30
🔗
|
|
is-_ is now known as is- |
22:30
🔗
|
|
kyan has joined #archiveteam-bs |
22:35
🔗
|
|
schbirid has quit IRC (Quit: Leaving) |
22:40
🔗
|
|
kyan has quit IRC (Quit: This computer has gone to sleep) |
22:45
🔗
|
ivan` |
DFJustin: it looks like it prefers the latest snapshot instead of the exact-case match |
22:46
🔗
|
ivan` |
I just contaminated https://news.ycombinator.com/user?id=rms with https://news.ycombinator.com/user?id=RMS in wayback |
22:46
🔗
|
ivan` |
I'm probably going to have domain-specific rules for my massaged URLs and re-generate them whenever I add new rules |
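(Editor's sketch of what domain-specific massaging rules might look like. This is not ivan`'s code: the rule table, the function name `massage_with_rules`, and the choice of hosts are assumptions made for illustration. Hosts listed as case-sensitive keep their path and query untouched; everything else is lowercased.)

```python
from urllib.parse import urlsplit

# Hypothetical rule table: hosts whose path/query must keep their case.
CASE_SENSITIVE_HOSTS = {"i.imgur.com", "news.ycombinator.com"}

def massage_with_rules(url: str) -> str:
    """Build a lookup key, lowercasing path and query unless the host is exempt."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # urlsplit already lowercases the hostname
    rest = parts.path + ("?" + parts.query if parts.query else "")
    if host not in CASE_SENSITIVE_HOSTS:
        rest = rest.lower()
    return host + rest

key_lower = massage_with_rules("https://news.ycombinator.com/user?id=rms")
key_upper = massage_with_rules("https://news.ycombinator.com/user?id=RMS")
```

With this table the two Hacker News user pages get distinct keys. Adding a host to the table changes the keys for that host, which is why the stored keys would have to be regenerated whenever the rules change.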
22:47
🔗
|
ivan` |
even if you priority exact-case matches it's bad UX to tell a user you have something when it's the wrong thing |
22:47
🔗
|
ivan` |
prioritize |
23:05
🔗
|
ivan` |
arkiver: works for me. I assume you are quoting URLs with question marks if you are dumping them into a shell? |
23:19
🔗
|
arkiver |
ivan`: for me only the first line works, and I'm just pasting the exact same lines as above into the terminal |
23:22
🔗
|
ivan` |
arkiver: can you paste an error? |
23:22
🔗
|
arkiver |
sorry, they don't contain question marks |
23:22
🔗
|
arkiver |
wait I'll put them up somewhere else |
23:24
🔗
|
|
Stiletto has joined #archiveteam-bs |
23:25
🔗
|
arkiver |
ivan`: https://ia601500.us.archive.org/35/items/testlinesurls36943/testlines.txt |
23:25
🔗
|
arkiver |
you should see some kind of Arabic characters |
23:26
🔗
|
arkiver |
only the first line works for me |
23:26
🔗
|
ivan` |
heh yes finally an error |
23:26
🔗
|
ivan` |
(I see it here) |
23:26
🔗
|
arkiver |
the second line gives a 'URL is not printable' error |
23:26
🔗
|
arkiver |
ok |
23:27
🔗
|
ivan` |
arkiver: I blame wpull. try encoding your input URLs? |
23:27
🔗
|
arkiver |
utf-8? |
23:27
🔗
|
ivan` |
urlencoding, that is, unicode -> utf-8 -> %XX%XX%XX for the path |
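(Editor's sketch of the workaround described above: take the Unicode path, encode it as UTF-8, and percent-escape the bytes before handing the URL to the crawler. It uses only Python's standard `urllib.parse`; the function name `encode_url_path` and the example.com URL are hypothetical, and non-ASCII hostnames or query strings are not handled.)

```python
from urllib.parse import quote, urlsplit, urlunsplit

def encode_url_path(url: str) -> str:
    """Percent-encode the path of a URL as UTF-8, leaving other parts alone."""
    parts = urlsplit(url)
    # "/" stays a separator; "%" is kept so an already-encoded path
    # is not encoded a second time.
    path = quote(parts.path, safe="/%")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))

# Made-up URL with a non-ASCII path segment.
encoded = encode_url_path("http://example.com/news/8288-خبر")
print(encoded)
```

Because `%` is treated as safe, running the function on its own output leaves the URL unchanged, so it can be applied to a mixed list of encoded and unencoded URLs.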
23:27
🔗
|
arkiver |
yeah |
23:27
🔗
|
arkiver |
sorry, not very into encoding |
23:29
🔗
|
ivan` |
I suppose I should either fix this in grab-site or wpull |
23:31
🔗
|
arkiver |
seems to be working with encoding them first |
23:31
🔗
|
|
botpie91 has joined #archiveteam-bs |
23:31
🔗
|
arkiver |
I feel this is more a wpull problem |
23:31
🔗
|
|
yakfish has joined #archiveteam-bs |
23:32
🔗
|
|
robink has joined #archiveteam-bs |
23:33
🔗
|
|
beardicus has joined #archiveteam-bs |
23:34
🔗
|
|
sep332 has joined #archiveteam-bs |
23:36
🔗
|
|
myself has joined #archiveteam-bs |
23:40
🔗
|
|
xmc has joined #archiveteam-bs |
23:40
🔗
|
|
swebb sets mode: +o xmc |
23:41
🔗
|
arkiver |
I filed a bug for wpull |