Time |
Nickname |
Message |
00:01
🔗
|
|
DoomTay has quit IRC (Quit: Page closed) |
00:05
🔗
|
|
BlueMaxim has joined #archiveteam |
00:26
🔗
|
|
nightpool has joined #archiveteam |
00:34
🔗
|
|
tomwsmf has quit IRC (Read error: Operation timed out) |
00:47
🔗
|
|
DiscantX has joined #archiveteam |
01:06
🔗
|
Lord_Nigh |
SketchCow: it is in /0/cdrom/ |
01:14
🔗
|
|
JesseW has joined #archiveteam |
01:19
🔗
|
|
philpem has quit IRC (Ping timeout: 260 seconds) |
01:38
🔗
|
|
atrocity has joined #archiveteam |
01:45
🔗
|
|
pguth_ has quit IRC (Remote host closed the connection) |
01:45
🔗
|
|
pguth_ has joined #archiveteam |
01:45
🔗
|
|
pguth_ has quit IRC (Remote host closed the connection) |
01:45
🔗
|
|
pguth_ has joined #archiveteam |
01:46
🔗
|
|
jmad980 has quit IRC (Ping timeout: 250 seconds) |
01:48
🔗
|
|
jmad980 has joined #archiveteam |
01:54
🔗
|
|
tomwsmf has joined #archiveteam |
02:11
🔗
|
|
JesseW has quit IRC (Ping timeout: 370 seconds) |
02:25
🔗
|
|
pguth_ has quit IRC (Remote host closed the connection) |
02:25
🔗
|
|
pguth_ has joined #archiveteam |
02:48
🔗
|
|
pguth_ has quit IRC (Remote host closed the connection) |
02:48
🔗
|
|
pguth_ has joined #archiveteam |
02:48
🔗
|
|
ndiddy has quit IRC (Leaving) |
02:49
🔗
|
|
ndiddy has joined #archiveteam |
03:10
🔗
|
|
RichardG has quit IRC (Quit: Keyboard not found, press F1 to continue) |
03:11
🔗
|
|
RichardG has joined #archiveteam |
03:13
🔗
|
|
tomwsmf has quit IRC (Ping timeout: 258 seconds) |
03:13
🔗
|
|
RichardG has quit IRC (Read error: Connection timed out) |
03:14
🔗
|
|
RichardG has joined #archiveteam |
03:26
🔗
|
|
zenguy has quit IRC (Ping timeout: 370 seconds) |
03:28
🔗
|
|
robink has quit IRC (Ping timeout: 260 seconds) |
03:29
🔗
|
|
zenguy has joined #archiveteam |
03:35
🔗
|
|
zenguy has quit IRC (Read error: Operation timed out) |
03:42
🔗
|
|
ndiddy has quit IRC (Leaving) |
03:46
🔗
|
|
zenguy has joined #archiveteam |
03:57
🔗
|
|
robink has joined #archiveteam |
03:59
🔗
|
|
Aranje has quit IRC (Quit: Three sheets to the wind) |
04:28
🔗
|
|
RichardG has quit IRC (Read error: Operation timed out) |
04:28
🔗
|
|
RichardG has joined #archiveteam |
04:32
🔗
|
|
JesseW has joined #archiveteam |
04:33
🔗
|
|
zenguy has quit IRC (Read error: Operation timed out) |
04:36
🔗
|
|
Sk1d has quit IRC (Ping timeout: 250 seconds) |
04:38
🔗
|
|
zenguy has joined #archiveteam |
04:45
🔗
|
|
Sk1d has joined #archiveteam |
04:49
🔗
|
|
pguth_ has quit IRC (Remote host closed the connection) |
04:49
🔗
|
|
pguth_ has joined #archiveteam |
05:06
🔗
|
|
zenguy has quit IRC (Ping timeout: 246 seconds) |
06:03
🔗
|
|
JesseW has quit IRC (Ping timeout: 370 seconds) |
06:12
🔗
|
|
Stiletto has quit IRC () |
06:27
🔗
|
|
Atom-- has quit IRC (Ping timeout: 190 seconds) |
06:50
🔗
|
|
jmad980 has quit IRC (Remote host closed the connection) |
07:37
🔗
|
|
Stiletto has joined #archiveteam |
07:52
🔗
|
|
Discant has joined #archiveteam |
07:54
🔗
|
|
Honno has joined #archiveteam |
07:56
🔗
|
|
DiscantX has quit IRC (Read error: Operation timed out) |
08:15
🔗
|
|
schbirid has joined #archiveteam |
08:24
🔗
|
|
Rondom_ has quit IRC (Remote host closed the connection) |
08:24
🔗
|
|
Rondom has joined #archiveteam |
08:27
🔗
|
|
MMovie1 has joined #archiveteam |
08:28
🔗
|
|
MMovie has quit IRC (Read error: Operation timed out) |
08:29
🔗
|
|
DiscantX has joined #archiveteam |
08:32
🔗
|
|
Discant has quit IRC (Read error: Operation timed out) |
08:59
🔗
|
|
Discant has joined #archiveteam |
09:04
🔗
|
|
DiscantX has quit IRC (Ping timeout: 501 seconds) |
09:24
🔗
|
|
pguth_ has quit IRC (Remote host closed the connection) |
09:25
🔗
|
|
pguth_ has joined #archiveteam |
09:51
🔗
|
|
Lord_Nigh has quit IRC (Read error: Operation timed out) |
09:51
🔗
|
|
JW_work1 has quit IRC (Read error: Connection reset by peer) |
09:52
🔗
|
|
espes__ has quit IRC (Read error: Operation timed out) |
09:52
🔗
|
|
espes__ has joined #archiveteam |
09:52
🔗
|
|
JW_work has joined #archiveteam |
09:52
🔗
|
|
Lord_Nigh has joined #archiveteam |
09:58
🔗
|
|
DiscantX has joined #archiveteam |
10:00
🔗
|
|
pguth_ has quit IRC (Remote host closed the connection) |
10:00
🔗
|
|
pguth_ has joined #archiveteam |
10:01
🔗
|
|
Discant has quit IRC (Read error: Operation timed out) |
10:03
🔗
|
|
Discant has joined #archiveteam |
10:05
🔗
|
|
DiscantX has quit IRC (Ping timeout: 244 seconds) |
10:06
🔗
|
|
JW_work has quit IRC (Read error: Connection reset by peer) |
10:06
🔗
|
|
JW_work has joined #archiveteam |
10:12
🔗
|
|
Simpbrain has joined #archiveteam |
10:15
🔗
|
|
Medowar has quit IRC (Ping timeout: 244 seconds) |
10:15
🔗
|
|
Rye has quit IRC (Ping timeout: 244 seconds) |
10:16
🔗
|
|
PurpleSym has quit IRC (Ping timeout: 244 seconds) |
10:16
🔗
|
|
PotcFdk has quit IRC (Ping timeout: 506 seconds) |
10:17
🔗
|
|
toddf has quit IRC (Read error: Connection reset by peer) |
10:17
🔗
|
|
toddf has joined #archiveteam |
10:18
🔗
|
|
Medowar has joined #archiveteam |
10:18
🔗
|
|
Rye has joined #archiveteam |
10:19
🔗
|
|
PotcFdk has joined #archiveteam |
10:19
🔗
|
|
PurpleSym has joined #archiveteam |
10:28
🔗
|
|
jk[SVP] has quit IRC (zoop) |
10:36
🔗
|
|
pguth_ has quit IRC (Remote host closed the connection) |
10:36
🔗
|
|
pguth_ has joined #archiveteam |
10:37
🔗
|
|
jk[SVP] has joined #archiveteam |
10:44
🔗
|
|
pguth_ has quit IRC (Remote host closed the connection) |
10:44
🔗
|
|
pguth_ has joined #archiveteam |
11:42
🔗
|
|
Emcy has joined #archiveteam |
11:57
🔗
|
|
SadDM has quit IRC (leaving) |
11:57
🔗
|
|
SadDM has joined #archiveteam |
11:57
🔗
|
|
swebb sets mode: +o SadDM |
12:26
🔗
|
|
Meroje has quit IRC (Quit: bye!) |
12:28
🔗
|
|
Meroje has joined #archiveteam |
12:48
🔗
|
|
Meroje has quit IRC (Quit: bye!) |
12:48
🔗
|
|
Meroje has joined #archiveteam |
12:50
🔗
|
|
Meroje has quit IRC (Client Quit) |
12:51
🔗
|
|
Meroje has joined #archiveteam |
12:53
🔗
|
|
Meroje has quit IRC (Client Quit) |
12:53
🔗
|
|
Meroje has joined #archiveteam |
12:55
🔗
|
|
Meroje has quit IRC (Client Quit) |
12:55
🔗
|
|
Meroje has joined #archiveteam |
12:57
🔗
|
|
Meroje has quit IRC (Client Quit) |
12:57
🔗
|
|
Meroje has joined #archiveteam |
12:58
🔗
|
|
Meroje has quit IRC (Client Quit) |
12:59
🔗
|
|
Meroje has joined #archiveteam |
13:06
🔗
|
|
vitzli has joined #archiveteam |
13:07
🔗
|
|
tomwsmf has joined #archiveteam |
13:24
🔗
|
|
nightpool has quit IRC (Read error: Operation timed out) |
13:33
🔗
|
midas |
https://tweakers.net/geek/114249/internet-archive-zet-dertien-jaargangen-nintendo-power-magazine-online.html |
13:40
🔗
|
|
Discant has quit IRC (Read error: Operation timed out) |
13:47
🔗
|
|
Simpbrain has quit IRC (Quit: Leaving) |
14:38
🔗
|
|
nightpool has joined #archiveteam |
14:41
🔗
|
|
pguth_ has quit IRC (Remote host closed the connection) |
14:42
🔗
|
|
pguth_ has joined #archiveteam |
15:03
🔗
|
|
Simpbrain has joined #archiveteam |
15:36
🔗
|
|
BlueMaxim has quit IRC (Quit: Leaving) |
16:22
🔗
|
|
ndiddy has joined #archiveteam |
16:25
🔗
|
|
TC02 has joined #archiveteam |
16:30
🔗
|
|
DoomTay has joined #archiveteam |
16:35
🔗
|
|
anjacks0n has joined #archiveteam |
16:36
🔗
|
|
JesseW has joined #archiveteam |
16:40
🔗
|
|
Simpbrain has quit IRC (Quit: Leaving) |
16:41
🔗
|
|
pguth_ has quit IRC (Remote host closed the connection) |
16:41
🔗
|
|
pguth_ has joined #archiveteam |
16:44
🔗
|
|
Phoen1x has joined #archiveteam |
16:49
🔗
|
|
philpem has joined #archiveteam |
17:08
🔗
|
|
DoomTay has quit IRC (Quit: Page closed) |
17:11
🔗
|
|
JesseW has quit IRC (Ping timeout: 370 seconds) |
17:13
🔗
|
|
zenguy has joined #archiveteam |
17:18
🔗
|
|
mr-b has quit IRC (Read error: Operation timed out) |
17:38
🔗
|
|
DoomTay has joined #archiveteam |
17:45
🔗
|
|
JW_work has quit IRC (Quit: Leaving.) |
17:45
🔗
|
|
RichardG has quit IRC (Read error: Connection reset by peer) |
17:46
🔗
|
|
RichardG has joined #archiveteam |
17:46
🔗
|
|
zenguy has quit IRC (Read error: Operation timed out) |
17:49
🔗
|
|
RichardG_ has joined #archiveteam |
17:51
🔗
|
|
RichardG has quit IRC (Ping timeout: 244 seconds) |
17:53
🔗
|
|
RichardG has joined #archiveteam |
17:58
🔗
|
|
pguth_ has quit IRC (Remote host closed the connection) |
17:58
🔗
|
|
pguth_ has joined #archiveteam |
17:59
🔗
|
|
RichardG_ has quit IRC (Read error: Operation timed out) |
18:07
🔗
|
|
PepsiMax has joined #archiveteam |
18:12
🔗
|
|
JW_work has joined #archiveteam |
18:13
🔗
|
JW_work |
https://www.getdatajoy.com/ <- shutting down Jan 2, 2017 ; unclear what public data is available |
18:15
🔗
|
JW_work |
https://news.ycombinator.com/item?id=12216896 |
19:00
🔗
|
ErkDog |
Is there a way to maybe increase the items/hour on googlecode? |
19:03
🔗
|
|
Phoen1x has quit IRC (Read error: Operation timed out) |
19:04
🔗
|
|
Phoen1x has joined #archiveteam |
19:11
🔗
|
|
vitzli has quit IRC (Read error: Operation timed out) |
19:22
🔗
|
|
Start_ is now known as Start |
19:40
🔗
|
|
Phoen1x has quit IRC (Quit: Leaving) |
19:55
🔗
|
|
Hybrid_ has joined #archiveteam |
19:56
🔗
|
|
Hybrid_ has quit IRC (Client Quit) |
20:23
🔗
|
|
Aranje has joined #archiveteam |
20:25
🔗
|
|
Discant has joined #archiveteam |
20:34
🔗
|
|
khaoohs_ has quit IRC (Read error: Operation timed out) |
20:52
🔗
|
|
Discant has quit IRC (Ping timeout: 633 seconds) |
20:58
🔗
|
|
schbirid has quit IRC (Quit: Leaving) |
21:12
🔗
|
Medowar |
Do we have a wiki page to dump all info regarding the Turkey newspaper crackdown? |
21:19
🔗
|
|
JW_work1 has joined #archiveteam |
21:19
🔗
|
JW_work1 |
Medowar: not that I know of — you could make one |
21:26
🔗
|
|
JW_work has quit IRC (Ping timeout: 633 seconds) |
21:29
🔗
|
Medowar |
lol: https://www.youtube.com/watch?v=HvXtPk8gjYE |
21:29
🔗
|
Medowar |
Turkish News Channel Mistakes GTA Cheats for Coup Codes |
21:30
🔗
|
|
Simpbrain has joined #archiveteam |
21:31
🔗
|
Medowar |
current dump: http://archiveteam.org/index.php?title=Turkey_Media_Crackdown |
21:52
🔗
|
|
kristian_ has joined #archiveteam |
22:00
🔗
|
|
pguth_ has quit IRC (Remote host closed the connection) |
22:00
🔗
|
|
pguth_ has joined #archiveteam |
22:10
🔗
|
|
khaoohs has joined #archiveteam |
22:30
🔗
|
|
khaoohs has quit IRC (Quit: Leaving) |
22:30
🔗
|
|
khaoohs has joined #archiveteam |
22:37
🔗
|
ErkDog |
Is there a way to maybe increase the items/hour on googlecode? |
22:45
🔗
|
|
Honno has quit IRC (Read error: Operation timed out) |
23:08
🔗
|
toddf |
probably old hat around here, but yahoo! I just succeeded in getting my genealogy scraping routine to use the https://web.archive.org/save/$url api, then verify it's available via the https://archive.org/wayback/available?url=$url api (though one must transform & into %26 for this api), and then retrieve it for my own analysis via the https://web.archive.org/web/<timestamp>id_/$url api .. |
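The three-step flow toddf describes (save, check availability, fetch the unmodified copy) can be sketched roughly as below. This is only an illustration, not toddf's Perl code: the function names are made up for this note, the helpers only build URLs and parse a response body rather than performing any network requests, and fully percent-encoding the target URL is an assumption that goes further than the "& into %26" substitution mentioned in the message.

```python
import json
from typing import Optional
from urllib.parse import quote

# URL prefixes taken from the message above.
SAVE_PREFIX = "https://web.archive.org/save/"
AVAILABLE_ENDPOINT = "https://archive.org/wayback/available?url="
REPLAY_PREFIX = "https://web.archive.org/web/"


def save_url(url: str) -> str:
    """URL that asks the Wayback Machine to capture `url` (hypothetical helper)."""
    return SAVE_PREFIX + url


def availability_url(url: str) -> str:
    """URL for the availability lookup.

    Assumption: percent-encode every reserved character, which includes
    the '&' to '%26' substitution the message says is required.
    """
    return AVAILABLE_ENDPOINT + quote(url, safe="")


def closest_timestamp(response_body: str) -> Optional[str]:
    """Extract the closest snapshot timestamp from an availability response.

    Returns None when the response reports no available snapshot.
    """
    payload = json.loads(response_body)
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("timestamp")
    return None


def raw_snapshot_url(timestamp: str, url: str) -> str:
    """URL of the capture as originally served; 'id_' follows the timestamp."""
    return f"{REPLAY_PREFIX}{timestamp}id_/{url}"
```

A caller would request `save_url(u)`, poll `availability_url(u)` until `closest_timestamp` returns a value, and then download `raw_snapshot_url(ts, u)`, ideally with a polite delay between requests since no rate limit is confirmed in the discussion.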
23:11
🔗
|
xmc |
:) |
23:11
🔗
|
toddf |
archive.org can hit them up better and faster than I can for some reason, and so instead of 6 urls / min I'm now doing 46 urls / min .. reducing my time to complete my current set of urls (before I scrape and find more) from 2y to 3mo and some change |
23:11
🔗
|
toddf |
presuming they don't blacklist archive.org |
23:12
🔗
|
xmc |
oh nice i didn't know archive.org had better thruput |
23:13
🔗
|
toddf |
I got 50% syn syn syn ack delays so 42s / url and sometimes even can't connect issues at the tcp layer from my laptop directly to the target site, archive.org seems to have those issues licked, at least right now |
23:13
🔗
|
toddf |
it is not a whole lot to recode to turn this into a generic site scraper .. |
23:13
🔗
|
xmc |
very strange |
23:14
🔗
|
toddf |
archive.org claims to have unlimited api use for v1 of its api, 'for now, may have to limit in the future' |
23:14
🔗
|
toddf |
so I'll presume they don't mind me doing this in serial. if someone has a link to any limits I should impose to calling the above listed api urls please let me know, I'd rather not get blacklisted from archive.org ;-) |
23:15
🔗
|
toddf |
I only have 6.4 million urls to scrape until I rinse and repeat and find more links inside there |
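As a rough check of the estimate quoted earlier (about 2 years down to about 3 months), the arithmetic can be sketched with the figures from the chat: 6.4 million URLs at 6 versus 46 URLs per minute. The function name is made up for this note, and the calculation assumes the scraper runs continuously at a constant rate.

```python
MINUTES_PER_DAY = 60 * 24


def days_to_finish(url_count: int, urls_per_minute: float) -> float:
    """Days needed to process url_count URLs at a constant rate."""
    return url_count / urls_per_minute / MINUTES_PER_DAY


# Figures quoted in the chat.
direct_days = days_to_finish(6_400_000, 6)        # fetching the site directly
via_archive_days = days_to_finish(6_400_000, 46)  # going through archive.org

print(f"direct: {direct_days:.0f} days, via archive.org: {via_archive_days:.0f} days")
```

This gives roughly 741 days versus roughly 97 days, which is consistent with the "2y to 3mo and some change" figure.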
23:15
🔗
|
xmc |
not bad |
23:15
🔗
|
xmc |
go for it |
23:16
🔗
|
toddf |
pretty sure I won't blow my sqlite3 max db size for this, going on 8gb now and if my math is right 120gb or so is the max in my env |
23:16
🔗
|
xmc |
sqlite might be the wrong choice for that but, |
23:16
🔗
|
xmc |
if it works for you then do it |
23:16
🔗
|
toddf |
I think I'll be stuffing it into a postgresql db before it's all said and done |
23:18
🔗
|
toddf |
it was an easy first pick, and I've been trying really hard to tune the network code and scientifically experiment with self-imposed delays to see if it affected the affliction of syn syn syn ack delay and/or can't connect |
23:18
🔗
|
xmc |
_nod_ |
23:19
🔗
|
toddf |
then I stumbled upon the archive.org api to save a page and retrieve the original content unadulterated with the id_ bit after the timestamp, and .. the day is lost for anything productive except running the scraper full tilt now ;-) |
23:20
🔗
|
xmc |
i know THAT feeling |
23:21
🔗
|
toddf |
perhaps I've simply reinvented a sqlite3+perl version of the team vm thingie that is serialized but hey, learned a lot in the process |
23:21
🔗
|
toddf |
$ wc -l git/sw/genscripts/genwebscrape |
23:21
🔗
|
toddf |
3580 git/sw/genscripts/genwebscrape |
23:21
🔗
|
toddf |
lots o learning in that bit o code ;-) |
23:21
🔗
|
xmc |
o my |
23:22
🔗
|
arkiver |
ErkDog: we got complaints from google for going too fast, it'll stay at this speed |
23:22
🔗
|
toddf |
this way I don't even have to learn about warc's (I read a bit, format seems a bit overkill but anyway) .. archive.org can handle that bit for me in this scenario ;-) |
23:22
🔗
|
arkiver |
the current batch of items is the last batch, after that we are done |
23:36
🔗
|
|
nightpool has quit IRC (Ping timeout: 501 seconds) |
23:47
🔗
|
|
DoomTay has quit IRC (Quit: Page closed) |
23:47
🔗
|
|
lytv has quit IRC (Read error: Operation timed out) |
23:53
🔗
|
|
DoomTay has joined #archiveteam |