#newsgrabber 2017-09-18,Mon

*** newsbuddy has joined #newsgrabber [00:17]
<newsbuddy> Hello! I've just been (re)started. Follow my newsgrabs in #newsgrabberbot [00:17]
*** newsbuddy has quit IRC (Remote host closed the connection)
*** newsbuddy has joined #newsgrabber [00:18]
<newsbuddy> Hello! I've just been (re)started. Follow my newsgrabs in #newsgrabberbot [00:18]
*** newsbuddy has quit IRC (Remote host closed the connection)
*** newsbuddy has joined #newsgrabber [00:19]
<newsbuddy> Hello! I've just been (re)started. Follow my newsgrabs in #newsgrabberbot [00:20]
*** newsbuddy has quit IRC (Remote host closed the connection)
*** newsbuddy has joined #newsgrabber [00:21]
<newsbuddy> Hello! I've just been (re)started. Follow my newsgrabs in #newsgrabberbot [00:21]
*** newsbuddy has quit IRC (Remote host closed the connection)
*** newsbuddy has joined #newsgrabber [00:22]
<newsbuddy> Hello! I've just been (re)started. Follow my newsgrabs in #newsgrabberbot [00:22]
*** newsbuddy has quit IRC (Remote host closed the connection)
*** newsbuddy has joined #newsgrabber [00:22]
<newsbuddy> Hello! I've just been (re)started. Follow my newsgrabs in #newsgrabberbot [00:24]
<arkiver> found 3+ TB of newsbuddy WARCs from april
<arkiver> they'll be in the megaWARCs with the other stuff too [00:31]
(idle for 1h27mn)
<jrwr> arkiver: added a dedupe status bot to the bot channel :) [01:58]
<arkiver> jrwr: awesome! :D
<arkiver> as soon as we've cleared the videobot data we'll restart this
<arkiver> should be ~5 hours [02:06]
<jrwr> I'm about to also add in how many hashes have been updated/inserted into the database from the WARCs [02:07]
<arkiver> that'd be perfect
<arkiver> well
* arkiver is afk again
<arkiver> jrwr: almost 1 billion! [02:08]
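
As a sketch of what such a counter could look like (not jrwr's actual bot, whose code isn't shown in the log): iterate a WARC's records with warcio and upsert each payload digest into a database, counting how many were new. The sqlite schema and function names below are assumptions for illustration only.

    import sqlite3

    from warcio.archiveiterator import ArchiveIterator

    def count_new_hashes(warc_path: str, db_path: str = "dedupe.db") -> int:
        """Insert each response record's payload digest; return how many were new."""
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS hashes (digest TEXT PRIMARY KEY)")
        added = 0
        with open(warc_path, "rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != "response":
                    continue
                digest = record.rec_headers.get_header("WARC-Payload-Digest")
                if digest is None:
                    continue
                cur = conn.execute(
                    "INSERT OR IGNORE INTO hashes (digest) VALUES (?)", (digest,)
                )
                added += cur.rowcount  # 1 if inserted, 0 if already present
        conn.commit()
        conn.close()
        return added
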
(idle for 20mn)
<jrwr> ya [02:30]
(idle for 1h28mn)
<jrwr> arkiver: FYI, the Records Added count is the number of records updated or added since the last update posted to the channel [03:58]
(idle for 1h22mn)
*** kyan has quit IRC (Read error: Operation timed out) [05:20]
(idle for 6h31mn)
*** mls has quit IRC (Ping timeout: 250 seconds) [11:51]
*** mls has joined #newsgrabber [12:03]
(idle for 1h6mn)
*** mls has quit IRC (Ping timeout: 250 seconds) [13:09]
(idle for 35mn)
*** mls has joined #newsgrabber [13:44]
(idle for 31mn)
<arkiver> jrwr: ok
<arkiver> looks good [14:15]
*** dd0a13f37 has joined #newsgrabber [14:23]
<arkiver> the project is restarted [14:29]
<jrwr> nice
<jrwr> you can see them coming in [14:41]
<arkiver> yes
<arkiver> at a good ~250 req/min [14:43]
<dd0a13f37> arkiver, are you something like head of this subproject? [14:50]
<arkiver> I'd say me, HCross2, jrwr, and all people running discoverers and warrior are heads of this project [14:51]
<dd0a13f37> Do you agree with the gist of http://archive.fart.website/bin/irclogger_log/archiveteam-bs?date=2017-09-15,Fri&sel=190#l186 ? [14:52]
<arkiver> I don't know
<arkiver> I haven't read all the text [14:53]
<dd0a13f37> It should go to a specific quote [14:53]
<arkiver> yeah I've seen it
<arkiver> I don't know the reason for that quote though [14:54]
<dd0a13f37> oh okay
<dd0a13f37> from http://archive.fart.website/bin/irclogger_log/archiveteam-bs?date=2017-09-15,Fri&sel=179#l175 down to the quote [14:54]
<arkiver> ah
<arkiver> Some other people commented on that question
<arkiver> I think those comments are good [14:57]
<HCross2> arkiver: is there an easy way to keep an eye on how much derive capacity I've got left? I'm about to start shoveling a load more web crawls I've done myself in [14:58]
<arkiver> 'derive capacity?'
<arkiver> as in on IA for the WARCs? [14:58]
<HCross2> So we don't get told to slow down, yep [14:59]
<arkiver> I think we'll be fine
<arkiver> feel free to start more crawls
<arkiver> err, upload more crawls
<arkiver> make sure they are mediatype web [14:59]
<HCross2> Yeah I have been
<HCross2> I've been shoving more North Korean propaganda in [15:01]
<arkiver> awesome!
<arkiver> North Korea stuff is always good [15:02]
<dd0a13f37> Is their livestream still online? Could be worthwhile to capture it [15:03]
<HCross2> Yep. Especially IP-blocked stuff which is now accessible [15:03]
<dd0a13f37> It was pretty low resolution and only broadcasting 8h/d or so, so it would not be a large burden
<dd0a13f37> tell me about the IP blocks [15:04]
<HCross2> kcna.co.jp only accessible from JP IPs [15:04]
<arkiver> https://archive.org/search.php?query=kctv&and%5B%5D=mediatype%3A%22movies%22&sort=-publicdate
<HCross2> nice...
<HCross2> the KCTV MMS stream was down for a long time though :(
<HCross2> only went up again like a day ago [15:05]
<dd0a13f37> And you now have Japanese IPs, or did they lift it? [15:05]
<HCross2> I have a JP IP [15:06]
<JAA> On a related note, I'd like to reiterate that it would be nice to archive kcna.kp. Unfortunately, the website design is so terrible that it's really hard to archive it in an automated way.
<JAA> (kcna.co.jp is a mirror of the official kcna.kp, as far as I know.) [15:13]
<dd0a13f37> Is it down right now? [15:14]
<JAA> Works for me. [15:15]
<dd0a13f37> oh yes I got it to load after 1 minute or so
<dd0a13f37> They route it through China, right? [15:15]
<JAA> Yeah, it's very slow.
<JAA> Probably. [15:15]
<dd0a13f37> Probably the GFW then, taking ~1 minute to verify your IP
<dd0a13f37> since subsequent connections are fine
<dd0a13f37> They have autoincrementing IDs I think [15:16]
<JAA> All navigation happens through JavaScripted form submits... [15:17]
<dd0a13f37> Yes, but those have autoinc IDs
<dd0a13f37> Fun fact: North Korea runs on Windows XP machines that they reimage constantly since they're full of malware, but it doesn't matter since everything is airgapped [15:18]
<JAA> They do? I thought they run Red Star OS. [15:20]
<dd0a13f37> for some computers, but most of their technology is imported from China
<dd0a13f37> which uses WinXP for everything
<dd0a13f37> I think it's quite easy to scrape, just grep for '"AR[0-9]+", "", "NT[0-9]+", "I"'
<dd0a13f37> then feed that into http://kcna.kp/kcna.user.special.getArticlePage.kcmsf [15:21]
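
A minimal sketch of that extraction-plus-fetch idea, assuming the quoted pattern comes from inline JavaScript arguments on the listing pages and that getArticlePage.kcmsf takes the two codes as form fields. The field names below are placeholder guesses, not confirmed parameter names.

    import re

    import requests

    ARTICLE_URL = "http://kcna.kp/kcna.user.special.getArticlePage.kcmsf"

    # Listing pages embed article references as JavaScript arguments like
    # "AR0012345", "", "NT02", "I" -- grep the article and news-type codes out.
    ARTICLE_RE = re.compile(r'"(AR\d+)",\s*"",\s*"(NT\d+)",\s*"I"')

    def extract_article_ids(html):
        """Return (article_code, news_type_code) pairs found in a listing page."""
        return ARTICLE_RE.findall(html)

    def fetch_article(article_code, news_type_code):
        # Placeholder field names: capture a real browser request to confirm them.
        resp = requests.post(
            ARTICLE_URL,
            data={"article_code": article_code, "news_type_code": news_type_code},
            timeout=120,  # the site can take ~1 minute to respond
        )
        resp.raise_for_status()
        return resp.text
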
<JAA> I wonder how well that would work with wpull and its URL deduplication... [15:30]
<dd0a13f37> If it handles POST requests, splendid
<dd0a13f37> It gives the full page as HTML
<dd0a13f37> Might be easier to just ask them for a copy [15:32]
<JAA> Yes, wpull does support POST. Normally (using GET), it deduplicates by URL though, so if it doesn't take the POST data into account, that would be a problem.
<JAA> Good luck with that. Make sure it's WARC. ;-) [15:35]
<dd0a13f37> Even easier: http://kcna.kp/kcna.user.article.retrieveArticleListForPage.kcmsf
<dd0a13f37> gives you a list of articles
<dd0a13f37> curl 'http://kcna.kp/kcna.user.article.retrieveArticleListForPage.kcmsf' -H 'Host: kcna.kp' -H 'User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:52.0) Gecko/20100101 Firefox/52.0' -H 'Accept: */*' -H 'Accept-Language: en-US,en;q=0.5' --compressed -H 'Referer: http://kcna.kp/kcna.user.article.retrieveNewsViewInfoList.kcmsf' -H 'Content-Type: application/x-www-form-urlencoded' -H 'Cookie: JSESSIONID=XXXXXXXXXXXXXx' -H 'Connection: keep-alive' --data 'page_sta
<dd0a13f37> oh hey the only important part got truncated [15:36]
<JAA> :-D [15:38]
<dd0a13f37> 'page_start=30 kwDispTitle=f keyword= newsTypeCode= articleTypeList= photoCount=0 movieCount=0 kwDispTitle=f kwContent= fromDate= toDate='
<dd0a13f37> It gives a list of articles [15:38]
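
Joining the truncated curl command with the field list from the follow-up message, the request was presumably equivalent to the sketch below. This assumes the space-separated paste is an ordinary form-encoded body (the duplicated kwDispTitle is dropped, and the JSESSIONID placeholder is kept exactly as posted).

    import requests

    LIST_URL = "http://kcna.kp/kcna.user.article.retrieveArticleListForPage.kcmsf"

    fields = {
        "page_start": "30",
        "kwDispTitle": "f",
        "keyword": "",
        "newsTypeCode": "",
        "articleTypeList": "",
        "photoCount": "0",
        "movieCount": "0",
        "kwContent": "",
        "fromDate": "",
        "toDate": "",
    }
    resp = requests.post(
        LIST_URL,
        data=fields,
        headers={
            "Referer": "http://kcna.kp/kcna.user.article.retrieveNewsViewInfoList.kcmsf",
            "Cookie": "JSESSIONID=XXXXXXXXXXXXXx",  # placeholder, as in the original paste
        },
        timeout=120,
    )
    print(resp.text)  # a list of articles, per the message above
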
<dd0a13f37> I can't wrap my head around it, for an article ID you can issue different queries apparently, one that just gets you the text
<dd0a13f37> and there are different kinds [15:44]
<JAA> Yeah [15:46]
<dd0a13f37> I don't think it's feasible to use the existing scraper, can't you just use warcproxy and write your own?
<dd0a13f37> There is a POST request to get the entire article page, what do the other parameters do?
<dd0a13f37> One of them is language presumably [15:51]
<JAA> I haven't really looked into it in detail yet, honestly. I'm busy with other scrapes currently.
<JAA> But I'd probably write a wpull plugin which essentially implements those fn_showArticle etc. functions. [15:52]
<dd0a13f37> Can't you feed wpull a list of URLs?
<dd0a13f37> or use wproxy [15:56]
<JAA> Sure, but not if you want the POST data to be different for each URL. [15:58]
<dd0a13f37> Can't you feed it the POST data too? [15:58]
<JAA> Yeah [15:59]
<dd0a13f37> there are 100k articles in 6 languages, they might not enjoy that load [15:59]
<JAA> But you can't retrieve the same URL multiple times with different POST data just from the command line. You need a plugin for that, as far as I know. [15:59]
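
In the "write your own" direction suggested above, a minimal sketch that issues a different POST body per article against the same URL (the case plain command-line wpull can't express) while recording the traffic to a WARC. It uses warcio's capture_http context manager; the form-field names are the same guesses as in the earlier sketch, and pacing is deliberately conservative given the 100k-articles-in-6-languages load concern.

    import time

    from warcio.capture_http import capture_http
    import requests  # must be imported after capture_http so it gets patched

    ARTICLE_URL = "http://kcna.kp/kcna.user.special.getArticlePage.kcmsf"

    # (article_code, news_type_code) pairs, e.g. harvested with the regex above.
    articles = [("AR0000001", "NT02"), ("AR0000002", "NT02")]

    with capture_http("kcna.warc.gz"):
        for article_code, news_type_code in articles:
            # Same URL every time, different POST data each time; everything
            # requests sends and receives inside this block lands in the WARC.
            requests.post(
                ARTICLE_URL,
                data={"article_code": article_code, "news_type_code": news_type_code},
                timeout=120,
            )
            time.sleep(2)  # be gentle; the site sits on a very slow link
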
<dd0a13f37> The hard part is putting it up on IA though [16:01]
<JAA> Well, integrating it into the Wayback Machine. [16:02]
<dd0a13f37> You would either have to edit it to get something like /lang/ARxxxxxx.html, or do they handle POST too
<dd0a13f37> yes exactly, IA can handle arbitrary data [16:02]
<JAA> No, they don't support POST.
<JAA> It would work with local playback though. [16:02]
<dd0a13f37> Oh hey, their CMS is stateful [16:03]
<JAA> I.e. with pywb or something similar. [16:03]
<dd0a13f37> you have a SID cookie
<dd0a13f37> you change the language
<dd0a13f37> and the same request gives a different response [16:03]
<JAA> Even better, ugh. [16:04]
<dd0a13f37> Yes, this means you have to emulate their CMS :)
<dd0a13f37> if you want it playable
<dd0a13f37> and a scrape using the getArticlePage.kcmsf would not be exact either, since internally it's not used everywhere [16:04]
<dd0a13f37> Then you also have stuff like this http://kcna.kp/kcna.user.exploit.exploit.kcmsf which isn't linked anywhere [16:10]
<dd0a13f37> You can apparently register an account, and you can log in with it, but you can't do anything
<dd0a13f37> It doesn't even show up as logged in [16:19]
<dd0a13f37> There are also Rodong and the Pyongyang Times (Naenara) [16:27]
<dd0a13f37> the account isn't displayed, but it IS used for logging [16:36]
<dd0a13f37> I think the site is running on a 2mbit connection shared with others (I heard the 2mbit figure mentioned somewhere else and it was the peak speed I was getting), so there's a possibility that it's simply not possible to scrape the site [16:43]
(idle for 2h27mn)
*** dd0a13f37 has left [19:10]
(idle for 2h50mn)
*** JAA has quit IRC (Read error: Operation timed out)
*** JAA has joined #newsgrabber
*** svchfoo3 sets mode: +o JAA
*** svchfoo1 sets mode: +o JAA [22:00]
(idle for 39mn)
*** svchfoo1 has quit IRC (Remote host closed the connection)
*** svchfoo1 has joined #newsgrabber
*** Lagittaja has quit IRC (Quit: Leaving)
*** svchfoo3 sets mode: +o svchfoo1 [22:41]
