#newsgrabber 2017-11-27, Mon




***grfck has joined #newsgrabber [01:35]
........ (idle for 38mn)
grfck has quit IRC (Quit: Page closed) [02:13]
............. (idle for 1h2mn)
jrwr: arkiver: warrior-install.sh doesn't work on non-x86 CPUs :)
it exploded on my arm box I was trying it on
[03:15]
................ (idle for 1h16mn)
Nov 27 04:30:01 scw-d46a3b root: Warcio was not imported correctly.
Nov 27 04:30:01 scw-d46a3b root: Location: /data/data/projects/newsgrabber-1dde1a0/warcio/__init__.py.
that's not helping :)
[04:31]
............. (idle for 1h1mn)
https://github.com/ArchiveTeam/NewsGrabber-Warrior/blob/master/pipeline.py#L21
what is this crazy, did you mod warcio?
[05:32]
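
For context, the check around pipeline.py line 21 appears to verify which copy of warcio actually got imported before the grab starts. A hedged sketch of that kind of guard, assuming a vendored/patched warcio; the marker attribute and exact behavior here are guesses, not the repo's actual code:

    import warcio

    # Hypothetical sketch of an import sanity check like the one that logged
    # the message above; the real pipeline.py may differ. The idea: confirm
    # the (possibly patched) warcio bundled with the project is the copy that
    # was imported, and log where the import came from if not.
    if not getattr(warcio, 'PATCHED_FOR_NEWSGRABBER', False):  # marker name is an assumption
        print('Warcio was not imported correctly.')
        print('Location: {}.'.format(warcio.__file__))
        raise SystemExit(1)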
................................... (idle for 2h54mn)
anonymoos: I think I know why cdn dedup was getting stuck for me. It was going too fast and hitting the connection limit of my router or vbox NAT. Once I put in a router with more RAM and switched my VM to bridged mode instead of NAT, cdn dedup has been working smoothly for me. [08:26]
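
For anyone hitting the same wall: consumer routers track every NAT'd connection in a small state table, so a fast grabber can exhaust it long before bandwidth runs out. Besides better hardware or bridged mode, capping concurrency client-side is one mitigation; a minimal sketch (the limit of 20 and the URLs are illustrative, not NewsGrabber's actual settings):

    import concurrent.futures
    import urllib.request

    MAX_CONCURRENT = 20  # stay well under a typical consumer NAT table size

    def fetch(url):
        # One short-lived connection per URL; the pool caps how many are open at once.
        with urllib.request.urlopen(url, timeout=30) as response:
            return response.read()

    urls = ['http://example.com/'] * 100  # placeholder; real runs would use grab targets

    with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
        for body in pool.map(fetch, urls):
            pass  # dedup / process each response here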
................................................... (idle for 4h12mn)
Kaz: add your router to the list of AT victims
we'll break anything
[12:38]
HCross2: ^
Didn't we break a Scandinavian university at one point
[12:39]
.... (idle for 18mn)
Kaz: wouldn't surprise me
I remember we melted some RAID card(s) back when we were doing Twitch
[12:57]
............ (idle for 59mn)
***kyan has quit IRC (Remote host closed the connection) [13:56]
...................................... (idle for 3h8mn)
Kaz: before I go and buy this on ebay.. does anyone have an Intel S2600cp motherboard in the uk? [17:04]
................. (idle for 1h23mn)
***CoolCanuc has joined #newsgrabber [18:27]
CoolCanuc: I'd like to submit URLs / sites to newsgrabber. How can I get started? Do I run the discovery server? [18:37]
Kaz: if you've got a specific site in mind, best way to do it is submit a pull request on github with a service template
current list is here: https://github.com/ArchiveTeam/NewsGrabber-Services/tree/master/services
[18:47]
CoolCanuc: ooh good :D I was afraid I'd never get a response. Thank you VERY much!!
should I also propose a project, or no (deathwatch?)
[18:48]
Igloo: Yeah you can do that if you wish
That's not a problem
[18:50]
CoolCanuc: Thanks
Is there a generator I can use, or is it all manual by hand?
[18:50]
Kaz: all manual for those files, as they're used by the scripts to find pages to download [18:55]
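
A service file in that repo is a small hand-written Python module. From memory of the format (field semantics below are my reading, not project docs; double-check an existing service file before submitting a PR), it looks roughly like:

    # Hedged sketch of a NewsGrabber service definition, modeled on the files
    # in ArchiveTeam/NewsGrabber-Services; all values here are placeholders.
    refresh = 5  # minutes between visits to the seed URLs

    version = 20171127.01  # date-based version stamp, bumped on every edit

    urls = [  # seed pages the scripts poll for new article links
        'http://www.example.com/rss',
    ]

    regex = [  # patterns matching article URLs worth grabbing
        r'^https?://[^/]*example\.com/news/',
    ]

    videoregex = []  # patterns for video URLs, if any

    liveregex = []   # patterns for livestream URLs, if any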
CoolCanuc: ah ok
the RSS feeds for the news sites have only like 10 articles. The site won't be updated anymore. Is it okay to ignore them? They won't be changing
would linking to this be acceptable? http://www.thebarrieexaminer.com/sitemap
[18:58]
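
If the RSS feeds are too shallow, a sitemap like the one linked above is a reasonable seed list. A quick sketch of pulling URLs out of a standard sitemap, assuming that page serves sitemaps.org XML (it may be plain HTML, in which case this won't apply):

    import urllib.request
    import xml.etree.ElementTree as ET

    SITEMAP = 'http://www.thebarrieexaminer.com/sitemap'  # from the log above
    NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

    with urllib.request.urlopen(SITEMAP, timeout=30) as response:
        tree = ET.parse(response)

    # Works for both a urlset and a sitemap index: <loc> holds the URL either way.
    for loc in tree.findall('.//sm:loc', NS):
        print(loc.text)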
.... (idle for 16mn)
***blitzed has joined #newsgrabber [19:17]
......... (idle for 42mn)
kyan has joined #newsgrabber [19:59]
SmileyG: wait
if it's a dead news site
archivebot maybe better
[20:12]
.... (idle for 17mn)
***CoolCanuc has quit IRC (Ping timeout: 260 seconds) [20:29]
JensRex has joined #newsgrabber
kyan has quit IRC (Remote host closed the connection)
[20:41]
............. (idle for 1h4mn)
kyan has joined #newsgrabber [21:46]
...... (idle for 26mn)
CoolCanuc has joined #newsgrabber [22:12]
CoolCanuc: I started making a proposal and such. http://archiveteam.org/index.php?title=User:CoolCanuck/Metroland Should these urls be within one .py file, or one per unique url?
Also, can we archive from issuu.com, or is that a no-no?
sorry for all the questions. I understand the version, but how do I decide the "refresh" value?
[22:13]
JAA: NewsGrabber is more for continuously archiving active news outlets. We could try adding those sites to ArchiveBot, but I haven't looked at them at all, so I can't tell you if that would work. [22:20]
CoolCanuc: my concern is Metroland would delete their publisher account after they decide to take the website down
Wish I knew how the heck this site gets the download link. Wonder if they're using a subscription.. https://download-issuu.com/
Just successfully downloaded a file. No survey. I'm shocked
oh. all it does is get the list of images and converts to a pdf
[22:28]
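
That matches how issuu-style readers tend to work: each page is served as a plain image, so a "downloader" just fetches the pages in order and binds them into a PDF. A sketch under that assumption; the image URL pattern, document id, and page count are guesses for illustration, not issuu's documented API:

    import io
    import urllib.request
    from PIL import Image  # Pillow

    DOC_ID = '123456-example-document'  # hypothetical issuu document id
    PAGES = 10                          # page count, read off the viewer

    images = []
    for n in range(1, PAGES + 1):
        # URL pattern is an assumption based on what the viewer appears to load.
        url = 'https://image.issuu.com/{}/jpg/page_{}.jpg'.format(DOC_ID, n)
        with urllib.request.urlopen(url, timeout=30) as response:
            images.append(Image.open(io.BytesIO(response.read())).convert('RGB'))

    # Pillow can write a multi-page PDF directly from a list of images.
    images[0].save('document.pdf', save_all=True, append_images=images[1:])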
...... (idle for 27mn)
Would it not be kind of silly to add the to-be-closed sites to github, though? Does it ignore 404s? What if they delete everything instead of actually taking it offline? [23:01]
JAA: I don't know how NewsGrabber would react to that (I'm not very familiar with this project), but I think what we want is really just a one-shot archival of the entire websites, right? Hence ArchiveBot (or maybe something custom).
Even if they don't take it offline or delete it anytime soon, they most likely won't be publishing any further articles, so NewsGrabber is really just the wrong tool for the job (I think).
[23:06]
CoolCanuc: I agree [23:08]
........ (idle for 37mn)
***blitzed has quit IRC (Quit: Leaving)
rolfoid has joined #newsgrabber
[23:45]
CoolCanuc has quit IRC (Read error: Connection reset by peer)
CoolCanuc has joined #newsgrabber
[23:57]
