Time |
Nickname |
Message |
00:04
π
|
|
TC01 has quit IRC (Read error: Connection reset by peer) |
00:09
π
|
|
TC01 has joined #archiveteam-bs |
00:38
π
|
|
bitBaron has quit IRC (Quit: My computer has gone to sleep. ZZZzzz…) |
01:04
π
|
|
SilSte has joined #archiveteam-bs |
01:29
π
|
|
MRX3 has quit IRC (Quit: Leaving) |
01:36
π
|
hook54321 |
If a teacher says something and a student records it is the recording part of the Public Domain? |
01:42
π
|
|
drumstick has quit IRC (Ping timeout: 255 seconds) |
01:43
π
|
|
drumstick has joined #archiveteam-bs |
01:54
π
|
dashcloud |
no |
01:55
π
|
dashcloud |
everything now is copyrighted, unless you do something specifically to change that |
01:56
π
|
dashcloud |
that's what Creative Commons does for non-software things, and what all the open-source licenses do for software |
01:56
π
|
Somebody2 |
Copyright in recordings of extemporaneous speech generally belongs to the person who makes the recording, IIRC. |
01:57
π
|
Somebody2 |
But recording people without their consent can be illegal, depending on other factors. |
01:57
π
|
Somebody2 |
(like what state you are in, whether it was a private conversation, and others) |
01:58
π
|
dashcloud |
if you're in person, in a school environment, unless you are specifically requested not to do so, you should be able to record without issue |
01:58
π
|
Somebody2 |
And if a speech was written down before it was delivered, whoever wrote it down holds the copyright on it, and audio recordings are derivative works. |
01:59
π
|
dashcloud |
hook54321: I get the feeling that none of this really answers the question you had |
01:59
π
|
Somebody2 |
It's a *VERY* gray area just how detailed notes have to be to make a recording of a speech a derivative work. |
01:59
π
|
Somebody2 |
But yeah, I suspect you had a different question. |
02:00
π
|
hook54321 |
It wasn't really a speech, it was this teacher's rant. |
02:00
π
|
hook54321 |
https://archive.org/details/BillJohnson |
02:47
π
|
Somebody2 |
That certainly sounds extemporaneous, so copyright is likely not a concern. |
02:48
π
|
Somebody2 |
But it also seems likely to attract the attention of an irrational and angry person, so I, at least, will be staying far away. |
02:50
π
|
|
schbirid2 has quit IRC (Ping timeout: 255 seconds) |
02:57
π
|
godane |
i'm splitting the bbc america bowie tape into 2 parts |
02:58
π
|
godane |
cause one recording is from 2000 and the other is from 1975 |
03:03
π
|
|
schbirid2 has joined #archiveteam-bs |
03:10
π
|
godane |
so another unlabel tape has Cinemax recording of Excalibur |
03:10
π
|
godane |
i'm very sure thats on dvd some where |
03:14
π
|
godane |
anyways turns out there is some sort of Live event recorded after it |
03:14
π
|
godane |
called Film Independent's Spirit Awards |
03:18
π
|
godane |
this must have been the 2008 one |
03:19
π
|
godane |
SketchCow: btw its hosted by Rainn Wilson |
03:20
π
|
godane |
also your going to better bitrate the commercial tapes with this one |
03:21
π
|
godane |
i'm getting 8300k to 8700k |
03:25
π
|
|
Stilett0 has joined #archiveteam-bs |
03:29
π
|
|
pizzaiolo has quit IRC (Remote host closed the connection) |
03:53
π
|
|
zhongfu has quit IRC (Ping timeout: 260 seconds) |
04:17
π
|
|
qw3rty116 has joined #archiveteam-bs |
04:20
π
|
|
bitBaron has joined #archiveteam-bs |
04:23
π
|
|
qw3rty115 has quit IRC (Read error: Operation timed out) |
04:25
π
|
|
bitBaron has quit IRC (Quit: My computer has gone to sleep. ZZZzzz…) |
04:48
π
|
|
zhongfu has joined #archiveteam-bs |
05:11
π
|
|
Lord_Nigh has quit IRC (Read error: Operation timed out) |
05:12
π
|
hook54321 |
JAA: Do you still have the partial recording of Bryan Lunduke's 24 hour thing? |
05:14
π
|
|
Lord_Nigh has joined #archiveteam-bs |
05:42
π
|
|
balrog has quit IRC (Read error: Operation timed out) |
05:46
π
|
|
balrog has joined #archiveteam-bs |
05:46
π
|
|
swebb sets mode: +o balrog |
07:17
π
|
godane |
SketchCow: we got some good old hbo previews at the end of this tape too |
07:34
π
|
godane |
these hbo previews are from 1990/1991 since there is inside the nfl talking about super bowl 25 |
07:57
π
|
|
Stilett0 is now known as Stiletto |
08:05
π
|
godane |
one tape i'm skipping is the 'in treatment' hbo episodes |
08:06
π
|
godane |
i question its from a 2008 series and its on dvd |
09:25
π
|
godane |
1 minute of footage from this tape is missing |
09:26
π
|
|
pizzaiolo has joined #archiveteam-bs |
09:26
π
|
godane |
audio captured but video goes black and white then to black |
09:26
π
|
godane |
around 34:16 to 35:15 this happen |
09:30
π
|
|
Stiletto has quit IRC () |
09:41
π
|
|
Stilett0 has joined #archiveteam-bs |
09:50
π
|
|
nyaomi has quit IRC (Read error: Operation timed out) |
09:59
π
|
|
nyaomi has joined #archiveteam-bs |
09:59
π
|
|
drumstick has quit IRC (Ping timeout: 255 seconds) |
10:00
π
|
|
drumstick has joined #archiveteam-bs |
10:00
π
|
|
pizzaiolo has quit IRC (pizzaiolo) |
10:01
π
|
|
pizzaiolo has joined #archiveteam-bs |
10:05
π
|
|
pizzaiolo has quit IRC (Ping timeout: 246 seconds) |
10:31
π
|
godane |
i may stop the tape after the current episode only cause it is having problems |
10:32
π
|
godane |
there is frame issue with episode 4 on this tape |
10:34
π
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
10:45
π
|
JAA |
hook54321: Yes, I do. |
10:57
π
|
|
zhongfu has quit IRC (Ping timeout: 260 seconds) |
11:03
π
|
|
zhongfu has joined #archiveteam-bs |
11:12
π
|
|
zhongfu has quit IRC (Ping timeout: 260 seconds) |
11:12
π
|
|
zhongfu has joined #archiveteam-bs |
11:16
π
|
|
ScruffyB has joined #archiveteam-bs |
11:19
π
|
|
decay_ has joined #archiveteam-bs |
11:19
π
|
|
pikhq_ has joined #archiveteam-bs |
11:20
π
|
|
RKenshin has joined #archiveteam-bs |
11:20
π
|
|
tuluu has joined #archiveteam-bs |
11:22
π
|
|
SN4T14_ has joined #archiveteam-bs |
11:22
π
|
|
ppsym has joined #archiveteam-bs |
11:22
π
|
|
Hecatz- has joined #archiveteam-bs |
11:23
π
|
|
db420 has joined #archiveteam-bs |
11:23
π
|
|
db420 has quit IRC (Connection closed) |
11:23
π
|
|
LeG0ax has joined #archiveteam-bs |
11:29
π
|
|
phillipsj has quit IRC (se.hub irc.underworld.no) |
11:29
π
|
|
Aerochrom has quit IRC (se.hub irc.underworld.no) |
11:29
π
|
|
purplebot has quit IRC (se.hub irc.underworld.no) |
11:29
π
|
|
PurpleSym has quit IRC (se.hub irc.underworld.no) |
11:29
π
|
|
JensRex has quit IRC (se.hub irc.underworld.no) |
11:29
π
|
|
tuluu_ has quit IRC (se.hub irc.underworld.no) |
11:29
π
|
|
i0npulse has quit IRC (se.hub irc.underworld.no) |
11:29
π
|
|
pikhq has quit IRC (se.hub irc.underworld.no) |
11:29
π
|
|
Hecatz has quit IRC (se.hub irc.underworld.no) |
11:29
π
|
|
Kenshin has quit IRC (se.hub irc.underworld.no) |
11:29
π
|
|
Ing3b0rg has quit IRC (se.hub irc.underworld.no) |
11:29
π
|
|
dboard2 has quit IRC (se.hub irc.underworld.no) |
11:29
π
|
|
Rai-chan has quit IRC (se.hub irc.underworld.no) |
11:29
π
|
|
medowar has quit IRC (se.hub irc.underworld.no) |
11:29
π
|
|
decay has quit IRC (se.hub irc.underworld.no) |
11:29
π
|
|
SN4T14 has quit IRC (se.hub irc.underworld.no) |
11:37
π
|
|
zhongfu has quit IRC (Ping timeout: 260 seconds) |
11:37
π
|
|
zhongfu has joined #archiveteam-bs |
11:45
π
|
|
RKenshin is now known as Kenshin |
11:45
π
|
|
ppsym is now known as PurpleSym |
11:45
π
|
|
LeG0ax is now known as Ing3b0rg |
11:45
π
|
|
Hecatz- is now known as Hecatz |
11:45
π
|
|
Aerochrom has joined #archiveteam-bs |
11:56
π
|
|
pizzaiolo has joined #archiveteam-bs |
11:57
π
|
|
drumstick has quit IRC (Read error: Operation timed out) |
11:59
π
|
|
zhongfu has quit IRC (Ping timeout: 260 seconds) |
11:59
π
|
|
zhongfu has joined #archiveteam-bs |
12:24
π
|
|
dboard2 has joined #archiveteam-bs |
12:24
π
|
|
dboard2 has quit IRC (Connection closed) |
13:27
π
|
godane |
so i decided to subscribe to The New Yorker for the digital issues on their site |
13:50
π
|
|
kyounko_ has joined #archiveteam-bs |
13:50
π
|
|
kyounko_ has quit IRC (Excess Flood) |
13:50
π
|
|
alfie has quit IRC (Ping timeout: 260 seconds) |
13:50
π
|
|
r3c0d3x has quit IRC (Ping timeout: 260 seconds) |
13:50
π
|
|
Meroje has quit IRC (Ping timeout: 260 seconds) |
13:50
π
|
|
dan- has quit IRC (Ping timeout: 260 seconds) |
13:50
π
|
|
DopefishJ has joined #archiveteam-bs |
13:50
π
|
|
swebb sets mode: +o DopefishJ |
13:50
π
|
|
Meroje has joined #archiveteam-bs |
13:50
π
|
|
kyounko_ has joined #archiveteam-bs |
13:51
π
|
|
r3c0d3x has joined #archiveteam-bs |
13:51
π
|
|
dan- has joined #archiveteam-bs |
13:51
π
|
|
Hecatz has quit IRC (Ping timeout: 260 seconds) |
13:51
π
|
|
ld1 has quit IRC (Ping timeout: 260 seconds) |
13:51
π
|
|
Muad-Dib has quit IRC (Ping timeout: 260 seconds) |
13:51
π
|
|
ZexaronS- has joined #archiveteam-bs |
13:51
π
|
|
jsa has quit IRC (Quit: No Ping reply in 180 seconds.) |
13:51
π
|
|
zhongfu has quit IRC (Remote host closed the connection) |
13:51
π
|
|
ld1 has joined #archiveteam-bs |
13:52
π
|
|
jsa has joined #archiveteam-bs |
13:52
π
|
|
kyounko has quit IRC (Ping timeout: 260 seconds) |
13:52
π
|
|
DFJustin has quit IRC (Ping timeout: 260 seconds) |
13:52
π
|
|
ZexaronS has quit IRC (Ping timeout: 260 seconds) |
13:52
π
|
|
Hecatz has joined #archiveteam-bs |
13:52
π
|
|
alfie has joined #archiveteam-bs |
13:52
π
|
JAA |
What is going on with all these ping timeouts? |
13:53
π
|
|
zhongfu has joined #archiveteam-bs |
13:57
π
|
|
alfie has quit IRC (Ping timeout: 260 seconds) |
13:57
π
|
|
alembic has quit IRC (Ping timeout: 260 seconds) |
13:57
π
|
|
riking has quit IRC (Ping timeout: 260 seconds) |
13:57
π
|
|
ThisAsYou has quit IRC (Ping timeout: 260 seconds) |
13:57
π
|
|
midas has quit IRC (Ping timeout: 260 seconds) |
13:57
π
|
|
ItsYoda has quit IRC (Ping timeout: 260 seconds) |
13:57
π
|
|
JSharp has quit IRC (Ping timeout: 260 seconds) |
13:57
π
|
|
riking has joined #archiveteam-bs |
13:57
π
|
|
JSharp has joined #archiveteam-bs |
13:57
π
|
|
ThisAsYou has joined #archiveteam-bs |
13:57
π
|
|
alembic has joined #archiveteam-bs |
13:58
π
|
|
DopefishJ has quit IRC (Ping timeout: 260 seconds) |
13:58
π
|
|
DrasticAc has quit IRC (Ping timeout: 260 seconds) |
13:58
π
|
|
bitspill has quit IRC (Ping timeout: 260 seconds) |
13:58
π
|
|
robogoat has quit IRC (Ping timeout: 260 seconds) |
13:58
π
|
|
trvz has quit IRC (Ping timeout: 260 seconds) |
13:58
π
|
|
Hecatz has quit IRC (Ping timeout: 260 seconds) |
13:58
π
|
|
ld1 has quit IRC (Ping timeout: 260 seconds) |
13:58
π
|
|
dan- has quit IRC (Ping timeout: 260 seconds) |
13:58
π
|
|
pikhq_ has quit IRC (Ping timeout: 260 seconds) |
13:58
π
|
|
spacegirl has quit IRC (Ping timeout: 260 seconds) |
13:59
π
|
|
zhongfu has quit IRC (Ping timeout: 260 seconds) |
13:59
π
|
|
robogoat has joined #archiveteam-bs |
14:00
π
|
|
pikhq has joined #archiveteam-bs |
14:00
π
|
|
spacegirl has joined #archiveteam-bs |
14:02
π
|
|
DFJustin has joined #archiveteam-bs |
14:02
π
|
|
swebb sets mode: +o DFJustin |
14:02
π
|
|
zhongfu has joined #archiveteam-bs |
14:03
π
|
|
ld1 has joined #archiveteam-bs |
14:03
π
|
|
midas has joined #archiveteam-bs |
14:03
π
|
|
DrasticAc has joined #archiveteam-bs |
14:03
π
|
|
bitspill has joined #archiveteam-bs |
14:03
π
|
|
Hecatz has joined #archiveteam-bs |
14:03
π
|
|
alfie has joined #archiveteam-bs |
14:03
π
|
|
ItsYoda has joined #archiveteam-bs |
14:04
π
|
|
trvz has joined #archiveteam-bs |
14:04
π
|
|
Muad-Dib has joined #archiveteam-bs |
14:13
π
|
godane |
so looks like i have this tape from jason : https://archive.org/details/ShigeruMiyamotoGdcKeynote1999 |
14:14
π
|
godane |
we have the opening of it so it is different than that one |
14:17
π
|
|
dan- has joined #archiveteam-bs |
14:34
π
|
|
tuluu has quit IRC (Read error: Operation timed out) |
14:35
π
|
|
tuluu has joined #archiveteam-bs |
14:37
π
|
|
purplebot has joined #archiveteam-bs |
14:37
π
|
|
Rai-chan has joined #archiveteam-bs |
14:38
π
|
|
dboard2 has joined #archiveteam-bs |
14:42
π
|
|
i0npulse has joined #archiveteam-bs |
14:48
π
|
|
godane has left |
14:48
π
|
|
godane has joined #archiveteam-bs |
14:50
π
|
|
bitBaron has joined #archiveteam-bs |
14:54
π
|
|
bitBaron has quit IRC (Client Quit) |
16:22
π
|
|
Stilett0 is now known as Stiletto |
16:26
π
|
|
HCross2 has quit IRC (Ping timeout: 260 seconds) |
16:26
π
|
|
mattl has quit IRC (Ping timeout: 260 seconds) |
16:26
π
|
|
voltagex has quit IRC (Ping timeout: 260 seconds) |
16:26
π
|
|
jiphex has quit IRC (Ping timeout: 260 seconds) |
16:27
π
|
|
trvz has quit IRC (Ping timeout: 260 seconds) |
16:27
π
|
|
bitspill has quit IRC (Ping timeout: 260 seconds) |
16:27
π
|
|
DrasticAc has quit IRC (Ping timeout: 260 seconds) |
16:27
π
|
|
r3c0d3x has quit IRC (Ping timeout: 260 seconds) |
16:27
π
|
|
tklk has quit IRC (Ping timeout: 260 seconds) |
16:27
π
|
|
floogulin has quit IRC (Ping timeout: 260 seconds) |
16:27
π
|
|
DedSec has quit IRC (Ping timeout: 260 seconds) |
16:27
π
|
|
fallenoak has quit IRC (Ping timeout: 260 seconds) |
16:28
π
|
|
ThisAsYou has quit IRC (Ping timeout: 260 seconds) |
16:28
π
|
|
alembic has quit IRC (Ping timeout: 260 seconds) |
16:28
π
|
|
JSharp has quit IRC (Ping timeout: 260 seconds) |
16:28
π
|
|
riking has quit IRC (Ping timeout: 260 seconds) |
16:28
π
|
|
SN4T14_ has quit IRC (Ping timeout: 260 seconds) |
16:28
π
|
|
BartoCH has quit IRC (Ping timeout: 260 seconds) |
16:28
π
|
|
Ctrl-S___ has quit IRC (Ping timeout: 260 seconds) |
16:28
π
|
|
deathy has quit IRC (Ping timeout: 260 seconds) |
16:28
π
|
|
xarph has quit IRC (Ping timeout: 260 seconds) |
16:28
π
|
|
Muad-Dib has quit IRC (Ping timeout: 260 seconds) |
16:28
π
|
|
jsa has quit IRC (Ping timeout: 260 seconds) |
16:28
π
|
|
Meroje has quit IRC (Ping timeout: 260 seconds) |
16:28
π
|
|
victorbje has quit IRC (Ping timeout: 260 seconds) |
16:28
π
|
|
johtso has quit IRC (Ping timeout: 260 seconds) |
16:28
π
|
|
octarine has quit IRC (Ping timeout: 260 seconds) |
16:28
π
|
|
jrwr has quit IRC (Ping timeout: 260 seconds) |
16:28
π
|
JAA |
WTF |
16:29
π
|
|
BartoCH has joined #archiveteam-bs |
16:29
π
|
|
mattl has joined #archiveteam-bs |
16:29
π
|
|
deathy has joined #archiveteam-bs |
16:29
π
|
|
JSharp has joined #archiveteam-bs |
16:29
π
|
|
ThisAsYou has joined #archiveteam-bs |
16:29
π
|
|
riking has joined #archiveteam-bs |
16:29
π
|
|
jiphex has joined #archiveteam-bs |
16:29
π
|
|
voltagex has joined #archiveteam-bs |
16:29
π
|
|
alembic has joined #archiveteam-bs |
16:29
π
|
|
Ctrl-S___ has joined #archiveteam-bs |
16:29
π
|
|
HCross2 has joined #archiveteam-bs |
16:29
π
|
|
trvz has joined #archiveteam-bs |
16:29
π
|
|
octarine has joined #archiveteam-bs |
16:29
π
|
|
victorbje has joined #archiveteam-bs |
16:29
π
|
|
r3c0d3x has joined #archiveteam-bs |
16:30
π
|
|
Meroje has joined #archiveteam-bs |
16:30
π
|
|
floogulin has joined #archiveteam-bs |
16:30
π
|
|
tklk has joined #archiveteam-bs |
16:30
π
|
|
DrasticAc has joined #archiveteam-bs |
16:30
π
|
|
fallenoak has joined #archiveteam-bs |
16:30
π
|
|
DedSec has joined #archiveteam-bs |
16:30
π
|
|
bitspill has joined #archiveteam-bs |
16:30
π
|
|
johtso has joined #archiveteam-bs |
16:31
π
|
|
jsa has joined #archiveteam-bs |
16:31
π
|
|
SN4T14 has joined #archiveteam-bs |
16:34
π
|
|
Muad-Dib has joined #archiveteam-bs |
17:03
π
|
joepie91 |
https://motherboard.vice.com/en_us/article/bj7vam/why-twitter-is-the-best-social-media-platform-for-disinformation |
17:10
π
|
|
midas2 has quit IRC (Read error: Operation timed out) |
17:16
π
|
|
midas2 has joined #archiveteam-bs |
17:22
π
|
|
K4k has quit IRC (Read error: Operation timed out) |
17:24
π
|
|
K4k has joined #archiveteam-bs |
17:30
π
|
Coderjo |
hmm... not quite a user-driven site, but... |
17:30
π
|
Coderjo |
http://support.comixology.com/customer/portal/articles/2887181-pull-list-retirement-faq |
17:30
π
|
Coderjo |
it somewhat was, with the retailer portal bit, I guess |
17:33
π
|
Coderjo |
And who is surprised at Amazon killing this part of the company after acquiring it? Show of hands? |
17:45
π
|
|
xarph has joined #archiveteam-bs |
18:02
π
|
|
JensRex has joined #archiveteam-bs |
18:28
π
|
|
K4k has quit IRC (Quit: WeeChat 1.9.1) |
18:29
π
|
|
jrwr has joined #archiveteam-bs |
18:30
π
|
|
K4k has joined #archiveteam-bs |
18:37
π
|
|
K4k has quit IRC (Quit: WeeChat 1.9.1) |
18:37
π
|
|
K4k has joined #archiveteam-bs |
18:38
π
|
|
jrochkind has joined #archiveteam-bs |
18:38
π
|
jrochkind |
Hello, I am a librarian-programmer, but not professionally involved in digital archiving, and don't know much about archiveteam. BUT…. |
18:39
π
|
jrochkind |
Baltimore City Paper, Baltimore's 40-year old alternative free weekly, just published their last issue, after being bought by Tribune Media/TRONC. The website is still up, with lots and lots of content, but who knows for how long. I want to try to preserve as much as possible. |
18:39
π
|
jrochkind |
Can anyone here help? Either via archiveteam project, or advice, or whatever? |
18:42
π
|
JAA |
Thank you. I'll add it to ArchiveBot. |
18:43
π
|
JAA |
That might not exactly grab everything though. |
18:45
π
|
jrochkind |
Thanks! http://www.citypaper.com/ I will continue exploring various other approaches. Is there a place i can check to see ArchiveBot progress, or find the results of what it manages to get? Sorry, I am starting from zero knowledge about how your tools work, although I am an engineer and understand stuff. |
18:45
π
|
JAA |
Yeah, the site uses JS for quite a lot of stuff. |
18:46
π
|
JAA |
http://dashboard.at.ninjawedding.org/ |
18:46
π
|
jrochkind |
awesome, thank you. |
18:46
π
|
JAA |
It will be job at569nt11fsuk3019kimdq036 (displayed on the far right), but it might not start for a few days. |
18:46
π
|
JAA |
I'll also throw in some subdomains, e.g. http://events.citypaper.com/ |
18:47
π
|
JAA |
http://digitaledition.citypaper.com/ definitely won't work with ArchiveBot at all. |
18:47
π
|
JAA |
And even on the main site, galleries etc. all only work with JavaScript. :-| |
18:47
π
|
jrochkind |
i actually didn't even know about digitaledition.citypaper.com, ha. There's def years of content just available on HTML pages, although I don't know about the internal links, if a scraper is going to find them. |
18:50
π
|
JAA |
Yeah, I'm not quite sure either. |
18:50
π
|
jrochkind |
Here's an example page I found on google (happens to have a letter to the editor from me, is how I targeted it), which is not currently in the IA wayback machine. It's just an ordinary HTML page, but I dunno about internal links for a scraper to find it. http://www.citypaper.com/bcp-cms-1-1406281-migrated-story-cp-20121121-mail-20121121-story.html |
18:56
π
|
JAA |
I think it should discover quite a large part of the site through http://www.citypaper.com/topic/ |
18:56
π
|
JAA |
Luckily, the listings within topics are using URL-based pagination, e.g. http://www.citypaper.com/topic/politics-government/government/catherine-e.-pugh-PEPLT00007656-topic.html -> http://www.citypaper.com/topic/politics-government/government/catherine-e.-pugh-PEPLT00007656-topic.html?page=2& |
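A minimal sketch of the URL-based pagination JAA describes, for illustration only. The helper name `paginated_urls` and the `last_page` bound are assumptions; a real crawl would follow the "next page" links it finds rather than guess a page count, and the trailing `&` seen in the example URL is omitted here.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def paginated_urls(topic_url, last_page):
    """Yield the topic URL itself, then the same URL with ?page=N for N = 2..last_page.

    `last_page` is a hypothetical bound chosen by the caller; the site does not
    advertise it in the log above.
    """
    parts = urlsplit(topic_url)
    yield topic_url
    for page in range(2, last_page + 1):
        # Replace only the query string; scheme, host and path stay unchanged.
        yield urlunsplit(parts._replace(query=urlencode({"page": page})))


topic = (
    "http://www.citypaper.com/topic/politics-government/government/"
    "catherine-e.-pugh-PEPLT00007656-topic.html"
)
urls = list(paginated_urls(topic, 3))
for url in urls:
    print(url)
```

Because every page has its own URL, a plain link-following crawler can reach each listing page without running any JavaScript.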
18:59
π
|
jrochkind |
hmm. what if I get a list of every `site:citypaper.com` hit URL from google, perhaps using a google CSE I pay for. Is there anything useful I can do with that? |
19:01
π
|
JAA |
Yes, we could make use of that. But keep in mind that search engines (especially Google) have strict rate limits. Scraping it for results is only really possible for smallish websites, in my experience. |
19:01
π
|
jrochkind |
oh nice, yeah that topics index with paginated lists of topics is pretty good. |
19:01
π
|
JAA |
Specifically, they'll make you fill out captchas, so you can't really automate it. |
19:03
π
|
jrochkind |
Google has 30-40K hits for citypaper.com. If you pay google, you actually get an allowed API, no captcha, unless they've cancelled that service since I used it last. It will not be expensive to use to just get all the paginated results. (I'd pay for it). Although the allowed API actually might not let me get em all, it might stop you from paginating beyond a certain point. But I might mess with it, if a giant list of |
19:03
π
|
jrochkind |
URLs would be useful to you. If I do get a list of a few tens of K of URLs, can I share them with you somehow? |
19:03
π
|
JAA |
Ah, right. |
19:04
π
|
jrochkind |
$5 per 1000 queries, if it really lets me paginate through 30K at 10 at a time, that's only $15. |
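The arithmetic behind that estimate can be checked with a short sketch. The function name `query_cost` is made up for illustration; the inputs (30-40K hits, 10 results per query, $5 per 1000 queries) are the figures quoted in the conversation, not verified pricing.

```python
import math


def query_cost(total_hits, hits_per_query, dollars_per_1000_queries):
    """Return (number of queries, cost in dollars) to page through all hits."""
    queries = math.ceil(total_hits / hits_per_query)
    return queries, queries * dollars_per_1000_queries / 1000


# Lower estimate from the chat: 30K hits, 10 per page, $5 per 1000 queries.
queries, cost = query_cost(30_000, 10, 5)
# Upper estimate: 40K hits at the same rate.
upper_queries, upper_cost = query_cost(40_000, 10, 5)
print(queries, cost, upper_queries, upper_cost)
```

At the upper end of the 30-40K range the same calculation gives 4000 queries, so the estimate stays in the tens of dollars either way.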
19:06
π
|
JAA |
It could be useful, but if those articles are all (or almost all) discovered through /topic anyway, it's probably not worth it. |
19:07
π
|
JAA |
I need to leave for a bit. Maybe someone else has better ideas. |
19:07
π
|
jrochkind |
drat, I believe Google actually shut down that API anyway. Even though their docs still doc it, it gives me an error when I try to create one, and I vaguely remember them saying they were gonna shut it down. Ah, Google. Anyway, ok, thank you JAA! |
19:33
π
|
|
TheLovina has quit IRC (Read error: Connection reset by peer) |
19:36
π
|
jrochkind |
JAA if they come back or any other interested parties, they do have a sitemap.xml, although it seems to only have some very limited things in it, it's not really a sitemap. Don't know if your tools will use sitemap. |
19:41
π
|
jrochkind |
their robots.txt actually disallows all those topic/ pages, which seemed the most useful for scraping links. don't know what archivebot does with robots.txt |
19:52
π
|
JAA |
jrochkind: wpull (used by ArchiveBot) knows about both sitemaps and robots.txt. With the options used in ArchiveBot, it grabs both to discover additional content (i.e. ignores Disallow directives). |
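As a rough illustration of the idea JAA describes (treating robots.txt as a source of URLs to visit rather than as a list of places to avoid), here is a small sketch. It is not wpull's or ArchiveBot's actual code; the function name `discovery_hints` and the sample robots.txt text are hypothetical, written only to mirror what the log says about the site.

```python
def discovery_hints(robots_txt):
    """Collect Sitemap URLs and Disallow paths from robots.txt text.

    Both lists are returned as candidate places to look for content; nothing
    here enforces the Disallow rules.
    """
    sitemaps, disallowed = [], []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if not value:
            continue  # an empty Disallow means "nothing is disallowed"
        if field == "sitemap":
            sitemaps.append(value)
        elif field == "disallow":
            disallowed.append(value)
    return sitemaps, disallowed


# Hypothetical sample; the real file's contents are not shown in the log.
sample = """\
User-agent: *
Disallow: /topic/
Sitemap: http://www.citypaper.com/sitemap.xml
"""
sitemaps, disallowed = discovery_hints(sample)
print(sitemaps, disallowed)
```

Splitting each line on the first colon only keeps the `http://` in a Sitemap value intact.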
19:52
π
|
jrochkind |
cool. looking at it, this site might not be very scrapable, it's a pretty poorly designed site. we'll find out! |
19:53
π
|
jrochkind |
those topics are actually pretty useless. I think it's just a listing of terms from some standard vocabulary, I have yet to find one that actually leads to articles. |
19:53
π
|
jrochkind |
which may be why they are disallowed in robots.txt. |
19:54
π
|
JAA |
Yes, most of those "topics" seem useless, but some do have links to articles, e.g. the one I linked above. |
19:55
π
|
JAA |
In that case, it seems to be the author of the articles. |
19:56
π
|
jrochkind |
ah, cool. it might trip up a scraper in requesting thousands of useless links too though. |
19:57
π
|
JAA |
Yeah, but some thousands of links aren't really that problematic in the big picture. |
19:58
π
|
jrochkind |
interesting. there are some weird topic links for sure. http://www.citypaper.com/topic/education/schools/high-schools/05005003-topic.html |
19:58
π
|
jrochkind |
i wonder who they licensed that vocabulary from haha |
20:25
π
|
|
schbirid2 has quit IRC (Quit: Leaving) |
20:33
π
|
|
jschwart has joined #archiveteam-bs |
20:38
π
|
|
Mateon1 has quit IRC (Ping timeout: 250 seconds) |
20:40
π
|
|
Mateon1 has joined #archiveteam-bs |
21:16
π
|
|
tuluu has quit IRC (Remote host closed the connection) |
21:19
π
|
|
tuluu has joined #archiveteam-bs |
21:53
π
|
|
dashcloud has quit IRC (Remote host closed the connection) |
22:02
π
|
|
kyounko_ has quit IRC (Ping timeout: 255 seconds) |
22:27
π
|
|
drumstick has joined #archiveteam-bs |
22:44
π
|
|
jschwart has quit IRC (Konversation terminated!) |
23:24
π
|
|
dashcloud has joined #archiveteam-bs |
23:28
π
|
|
BlueMaxim has joined #archiveteam-bs |
23:47
π
|
|
jrochkind has quit IRC (jrochkind) |