| Time |
Nickname |
Message |
|
00:02
🔗
|
Martini |
I think we need more noise on Twitter. RT #IATelethon. Let's send them to the YouTube live page until they fix telethon.archive.org |
|
00:12
🔗
|
Martini |
https://www.youtube.com/watch?v=UM71NPrb5iM |
|
00:27
🔗
|
JesseW |
Martini: I'm trying to post links to neat things on the archive... |
|
00:27
🔗
|
JesseW |
along with the hashtag |
|
00:35
🔗
|
DFJustin |
telethon.archive.org is fixed |
|
00:40
🔗
|
Martini |
Thanks. |
|
00:40
🔗
|
Martini |
http://telethon.archive.org/ is working again. |
|
00:55
🔗
|
|
Ghost_of_ has joined #archiveteam |
|
01:13
🔗
|
|
asdf has joined #archiveteam |
|
01:22
🔗
|
|
aaaaaaaaa has joined #archiveteam |
|
01:22
🔗
|
|
swebb sets mode: +o aaaaaaaaa |
|
02:04
🔗
|
|
parker_ has quit IRC (Remote host closed the connection) |
|
02:05
🔗
|
|
parker_ has joined #archiveteam |
|
02:19
🔗
|
|
Froggypwn has quit IRC (Ping timeout: 311 seconds) |
|
02:29
🔗
|
|
nertzy has joined #archiveteam |
|
02:38
🔗
|
|
parker_ has quit IRC (Remote host closed the connection) |
|
02:38
🔗
|
|
parker_ has joined #archiveteam |
|
02:43
🔗
|
|
parker_ has quit IRC (Remote host closed the connection) |
|
02:44
🔗
|
|
parker_ has joined #archiveteam |
|
02:46
🔗
|
|
nd1ddy has quit IRC (Read error: Connection reset by peer) |
|
02:48
🔗
|
|
parker_ has quit IRC (Remote host closed the connection) |
|
02:49
🔗
|
|
parker_ has joined #archiveteam |
|
02:59
🔗
|
|
ndiddy has joined #archiveteam |
|
03:04
🔗
|
|
asdf has quit IRC (Ping timeout: 378 seconds) |
|
03:09
🔗
|
|
Martini has quit IRC (Quit: ChatZilla 0.9.92 [Firefox 43.0.1/20151216175450]) |
|
03:15
🔗
|
|
Froggypwn has joined #archiveteam |
|
03:44
🔗
|
|
godane has quit IRC (Ping timeout: 311 seconds) |
|
03:46
🔗
|
|
godane has joined #archiveteam |
|
03:50
🔗
|
|
DDR has quit IRC (Remote host closed the connection) |
|
03:55
🔗
|
|
godane has quit IRC (Leaving.) |
|
03:55
🔗
|
|
godane has joined #archiveteam |
|
04:09
🔗
|
|
nertzy has quit IRC (Quit: This computer has gone to sleep) |
|
04:09
🔗
|
|
Ghost_of_ has quit IRC (Quit: Leaving) |
|
04:24
🔗
|
|
nertzy has joined #archiveteam |
|
04:28
🔗
|
|
aaaaaaaaa has quit IRC (Leaving) |
|
04:39
🔗
|
|
ndiddy has quit IRC (Read error: Connection reset by peer) |
|
05:56
🔗
|
|
nertzy has quit IRC (Quit: This computer has gone to sleep) |
|
06:09
🔗
|
|
nertzy has joined #archiveteam |
|
06:30
🔗
|
|
asdf has joined #archiveteam |
|
07:22
🔗
|
|
Ungstein has quit IRC (Quit: Leaving.) |
|
07:39
🔗
|
|
vitzli has joined #archiveteam |
|
08:03
🔗
|
|
BlueMaxim has quit IRC (Read error: Connection reset by peer) |
|
08:11
🔗
|
|
VADemon has quit IRC (left4dead) |
|
08:19
🔗
|
|
Boppen has quit IRC (Read error: Connection reset by peer) |
|
08:19
🔗
|
|
Boppen has joined #archiveteam |
|
08:37
🔗
|
|
nertzy has quit IRC (Quit: This computer has gone to sleep) |
|
08:37
🔗
|
|
JesseW has quit IRC (Leaving.) |
|
09:18
🔗
|
|
schbirid has joined #archiveteam |
|
09:25
🔗
|
|
asdf has quit IRC (Ping timeout: 252 seconds) |
|
14:15
🔗
|
|
Muad-Dib has joined #archiveteam |
|
14:16
🔗
|
|
WinterFox has quit IRC (Remote host closed the connection) |
|
14:41
🔗
|
|
Froggypwn has quit IRC (Ping timeout: 483 seconds) |
|
14:45
🔗
|
|
Froggypwn has joined #archiveteam |
|
15:08
🔗
|
|
signius has quit IRC (Ping timeout: 364 seconds) |
|
15:15
🔗
|
|
VADemon has joined #archiveteam |
|
15:17
🔗
|
|
Atom__ has quit IRC (Atom__) |
|
15:23
🔗
|
|
Froggypwn has quit IRC (Ping timeout: 483 seconds) |
|
15:26
🔗
|
|
Froggypwn has joined #archiveteam |
|
15:57
🔗
|
|
alberto has joined #archiveteam |
|
16:00
🔗
|
|
vitzli has quit IRC (Quit: Leaving) |
|
16:21
🔗
|
arkiver |
Me and HCross have been working for some days on a newsgrabber. |
|
16:21
🔗
|
arkiver |
The dashboard can be viewed here http://newsgrabber.harrycross.me:29000/ |
|
16:21
🔗
|
HCross |
Sites can be submitted here: https://github.com/ArchiveTeam/NewsGrabber |
|
16:30
🔗
|
arkiver |
So feel free to read the readme and make a pull request for your news websites! |
|
16:30
🔗
|
HCross |
At the moment it doesn't automagically sync to the server for archive, but ping me when you add one and I'll copy it down |
|
16:43
🔗
|
|
Ghost_of_ has joined #archiveteam |
|
16:47
🔗
|
HCross |
you can watch it underway now |
|
16:49
🔗
|
arkiver |
Basically what the system does |
|
16:49
🔗
|
arkiver |
For every newssite you want to add you have to add a small python file |
|
16:50
🔗
|
arkiver |
this file contains the URLs it will recheck with a specified interval for new URLs |
|
16:51
🔗
|
arkiver |
the file also contains some regexes to match whether the URL is a news article or a video URL |
|
16:51
🔗
|
arkiver |
if it's a videoURL it will be downloaded with youtube-dl |
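A minimal sketch of what such a per-site Python file might contain, based only on the description above. The name `videoregex` is mentioned later in this log; `refresh`, `urls`, `regex`, the `classify` helper, and the example.com addresses are assumptions for illustration, not the project's confirmed format.

```python
import re

# Hypothetical service definition. Only `videoregex` is named in the chat;
# the other field names and all example.com URLs are assumptions.
refresh = 300  # recheck interval for the seed URLs (unit assumed: seconds)
urls = ["http://www.example.com/news/latest"]   # seed URLs rechecked for new links
regex = [r"^https?://www\.example\.com/news/"]  # patterns for news article URLs
videoregex = [r"^https?://tv\.example\.com/"]   # patterns for video URLs


def classify(url):
    """Return 'video', 'article', or None for a newly discovered URL."""
    if any(re.search(pattern, url) for pattern in videoregex):
        return "video"  # per the chat, these would be handed to youtube-dl
    if any(re.search(pattern, url) for pattern in regex):
        return "article"
    return None
```

Checking the video patterns first means a URL that matches both lists is treated as a video; whether the real grabber does this is not stated in the log.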
|
17:11
🔗
|
Atluxity |
does the newsgrabber have its own channel? |
|
17:11
🔗
|
HCross |
Not yet |
|
17:12
🔗
|
Atluxity |
the news-site I am trying to submit has both rss for "top items" and "latest". Include both or just "latest"? |
|
17:13
🔗
|
arkiver |
That would be just latest |
|
17:13
🔗
|
Atluxity |
ok |
|
17:13
🔗
|
arkiver |
Just add a good refresh time so it won't miss any articles |
|
17:13
🔗
|
HCross |
The grabber has gone down for a second to update the script |
|
17:28
🔗
|
Atluxity |
this freaking site has no structure! grrrr |
|
17:29
🔗
|
Atluxity |
"latest" is small news bulletins... articles are "top items" only |
|
17:30
🔗
|
Atluxity |
nothing in the URL tells whether the page has video in it or not |
|
17:31
🔗
|
HCross |
Do most of the pages in that site have videos? |
|
17:34
🔗
|
Atluxity |
nah |
|
17:34
🔗
|
Atluxity |
that would be a stretch |
|
17:35
🔗
|
arkiver |
If there are multiple URLs it has to check for new URLs, you can add multiple |
|
17:36
🔗
|
arkiver |
Always try to add as few URLs as possible, but still get all articles |
|
17:36
🔗
|
Atluxity |
yeah, I understand |
|
17:51
🔗
|
|
JesseW has joined #archiveteam |
|
17:53
🔗
|
|
ndiddy has joined #archiveteam |
|
17:59
🔗
|
|
signius has joined #archiveteam |
|
18:03
🔗
|
|
atomotic has joined #archiveteam |
|
18:03
🔗
|
joepie91 |
arkiver: HCross: been thinking for a while about something like that, good to see it happening |
|
18:03
🔗
|
joepie91 |
:p |
|
18:04
🔗
|
arkiver |
joepie91: feel free to add as many websites as you can :) |
|
18:04
🔗
|
|
Amitari has joined #archiveteam |
|
18:04
🔗
|
|
RichardG has quit IRC (Read error: Connection reset by peer) |
|
18:05
🔗
|
Amitari |
Hey, anyone who knows wget that can help me? |
|
18:05
🔗
|
joepie91 |
arkiver: how does one test it? |
|
18:05
🔗
|
joepie91 |
also, dashboard shows nothing |
|
18:05
🔗
|
arkiver |
joepie91: it checks for new links every now and then |
|
18:05
🔗
|
arkiver |
and downloads the list of found new links every hour |
|
18:06
🔗
|
arkiver |
There's not many websites, so that's why it often doesn't show downloads |
|
18:06
🔗
|
arkiver |
joepie91: read the instructions please |
|
18:07
🔗
|
arkiver |
Instructions and looking at other items shows how everything works I think |
|
18:07
🔗
|
arkiver |
scripts will be made public later maybe |
|
18:07
🔗
|
joepie91 |
arkiver: yes, I've read the instructions. it does not answer my question :) |
|
18:08
🔗
|
joepie91 |
and eh, scripts should be public straightaway |
|
18:08
🔗
|
HCross |
joepie91, we are changing the code every half an hour at this point |
|
18:08
🔗
|
joepie91 |
(also, checks every hour? it's not uncommon for controversial articles to be removed faster than that) |
|
18:08
🔗
|
joepie91 |
HCross: ok? |
|
18:09
🔗
|
HCross |
Ye. When its more developed we are going to consider releasing |
|
18:09
🔗
|
joepie91 |
"consider releasing"? |
|
18:09
🔗
|
joepie91 |
and why does that have to wait until "when its more developed"? |
|
18:09
🔗
|
arkiver |
yeah I'll put it online |
|
18:09
🔗
|
arkiver |
I do want to keep this on one server for now though |
|
18:10
🔗
|
joepie91 |
HCross: see also https://web.archive.org/web/20150429004351/http://blog.civiccommons.org/2011/01/be-open-from-day-one |
|
18:10
🔗
|
|
RichardG has joined #archiveteam |
|
18:10
🔗
|
HCross |
So we don't get overlap. We don't want 100 people all archiving BBC news at the same time for example |
|
18:10
🔗
|
Atluxity |
I need help with a regex for the newsgrabber |
|
18:10
🔗
|
joepie91 |
HCross: that is unrelated to releasing code. |
|
18:10
🔗
|
Atluxity |
videoregex should match on subdomain "tv" |
|
18:11
🔗
|
joepie91 |
if you don't want people doing that, then put in the readme that you don't want people doing that |
|
18:11
🔗
|
joepie91 |
making the code available, in this case, is a safety mechanism so that if you get hit by a bus, somebody can pick it up |
|
18:11
🔗
|
HCross |
True |
|
18:12
🔗
|
arkiver |
3 north korean websites added! |
|
18:12
🔗
|
HCross |
When the scripts get updated. - doing that now |
|
18:12
🔗
|
joepie91 |
basically, if you want people to use it carefully, just *ask* them to do so. don't immediately resort to the option of "force" (ie. keeping the code unavailable to them) |
|
18:15
🔗
|
HCross |
True, its in very early days right now |
|
18:15
🔗
|
HCross |
godane, do we have any news on the Cryengine stuff? |
|
18:15
🔗
|
arkiver |
joepie91: yeah, we get it |
|
18:16
🔗
|
Amitari |
Anyone who can help me with wget? When I try to save a cookie before archiving a PhpBB-forum, I get the message "Remote file exists and could contain further links, |
|
18:16
🔗
|
Amitari |
but recursion is disabled -- not retrieving. |
|
18:16
🔗
|
Amitari |
" |
|
18:19
🔗
|
arkiver |
Atluxity: I'm off for some time now, can I help you later? |
|
18:20
🔗
|
HCross |
Well, the north korean websites crashed on me |
|
18:20
🔗
|
Atluxity |
arkiver: sure |
|
18:23
🔗
|
Atluxity |
https://github.com/atluxity/NewsGrabber/blob/master/services/web_nrk_no.py |
|
18:23
🔗
|
Atluxity |
they split up in so many urls :\ |
|
18:42
🔗
|
joepie91 |
HCross: arkiver: do you want example URLs for some of the BBC's older and newer formats? |
|
18:42
🔗
|
joepie91 |
some are still in use for specials |
|
18:42
🔗
|
joepie91 |
others only for historical articles |
|
18:42
🔗
|
joepie91 |
(they don't migrate - they just leave the old content where it is) |
|
18:43
🔗
|
HCross |
we have the BBC news stuff already, we are more about going after the breaking news. I dont see why not though |
|
18:43
🔗
|
joepie91 |
HCross: the BBC uses more than one format |
|
18:43
🔗
|
joepie91 |
including very fancy highly multimedial ones |
|
18:43
🔗
|
HCross |
ah. Go on then |
|
18:43
🔗
|
joepie91 |
:p |
|
18:44
🔗
|
Amitari |
Hey, could anyone here possibly help me with wget? |
|
18:45
🔗
|
joepie91 |
HCross: http://news.bbc.co.uk/2/hi/health/406713.stm, http://www.bbc.co.uk/news/resources/idt-07eeeebb-d450-4e4b-98d4-755369be7855 / http://www.bbc.com/news/special/2014/newsspec_7617/index.html, http://www.bbc.com/news/world-europe-25190119, http://www.bbc.co.uk/newsbeat/24449861, http://www.bbc.com/future/story/20131112-potato-power-to-light-the-world, http://www.bbc.co.uk/blogs/adamcurtis/posts/BUGGER, http://news.bbc.co.uk/2/hi/science/nature/ |
|
18:45
🔗
|
joepie91 |
630961.stm, http://news.bbc.co.uk/2/hi/uk_news/england/manchester/3758209.stm, http://www.bbc.co.uk/music/reviews/9gvh |
|
18:45
🔗
|
joepie91 |
err |
|
18:46
🔗
|
joepie91 |
the cut-off one is http://news.bbc.co.uk/2/hi/science/nature/630961.stm |
|
18:46
🔗
|
joepie91 |
these are all slightly different URL/content formats |
|
18:46
🔗
|
joepie91 |
for different types of content |
|
18:46
🔗
|
joepie91 |
most of these are still in use |
|
18:46
🔗
|
joepie91 |
the .stm ones are legacy, no longer in use but still referenced |
|
18:47
🔗
|
joepie91 |
the news/resources, news/special and BBC future ones are likely to have JS-loaded content |
|
18:47
🔗
|
joepie91 |
Amitari: probably best to ask in #archiveteam-bs |
|
18:47
🔗
|
Amitari |
Thanks! |
|
18:47
🔗
|
|
Amitari has left Leaving |
|
18:48
🔗
|
HCross |
joepie91, thanks. cc arkiver |
|
18:48
🔗
|
joepie91 |
HCross: arkiver: also, keep in mind that nutech is on a different domain from nu.nl, and their articles are not consistently listed on nu.nl |
|
18:48
🔗
|
joepie91 |
idem for rtlz/editienl and rtl.nl |
|
18:48
🔗
|
|
SN4T14 has quit IRC (Read error: Operation timed out) |
|
18:48
🔗
|
|
SN4T14 has joined #archiveteam |
|
18:49
🔗
|
joepie91 |
webwereld is also one worth looking into, but they also cross-post across multiple sites but not reliably |
|
18:49
🔗
|
joepie91 |
same for infoworld/pcworld |
|
18:49
🔗
|
JesseW |
urlteam tracker seems to be borked for now |
|
18:50
🔗
|
arkiver |
joepie91: https://github.com/ArchiveTeam/NewsGrabber/blob/master/services/web__bbc_com.py |
|
18:50
🔗
|
arkiver |
please have a look at those services |
|
18:51
🔗
|
arkiver |
and if you want anything added you can write a python file for it |
|
18:52
🔗
|
joepie91 |
arkiver: I don't have much time right now (or rather, until after 32C3), hence sharing the knowledge :) |
|
18:52
🔗
|
joepie91 |
plus I need some way to test things |
|
18:52
🔗
|
arkiver |
just test if the regex matches the URLs you want to extract from your seed URLs |
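One way to run the check arkiver describes, sketched offline: collect the links from a saved copy of a seed page and see which ones the patterns accept. The sample HTML and the two patterns are made up for illustration (the two matching URLs are ones posted earlier in this log), and `LinkCollector` and `matching_links` are hypothetical helpers, not part of NewsGrabber.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def matching_links(html, patterns):
    """Return the links in `html` that match at least one regex in `patterns`."""
    collector = LinkCollector()
    collector.feed(html)
    return [url for url in collector.links
            if any(re.search(pattern, url) for pattern in patterns)]


# Stand-in for a downloaded seed page.
sample_html = """
<a href="http://www.bbc.com/news/world-europe-25190119">story</a>
<a href="http://news.bbc.co.uk/2/hi/health/406713.stm">legacy story</a>
<a href="http://www.bbc.com/weather">weather</a>
"""

# Illustrative patterns only.
patterns = [
    r"^https?://www\.bbc\.com/news/",
    r"^https?://news\.bbc\.co\.uk/.+\.stm$",
]

print(matching_links(sample_html, patterns))
```

This sketch only sees absolute links; relative hrefs would need resolving against the seed URL (for example with `urllib.parse.urljoin`) before matching.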
|
18:53
🔗
|
JesseW |
arkiver: could you look at the server logs on the urlteam tracker -- it seems to be broken |
|
18:53
🔗
|
joepie91 |
regardless, no time for PRs atm |
|
19:01
🔗
|
arkiver |
Atluxity: commented |
|
19:03
🔗
|
arkiver |
JesseW: I think chfoo has to do that |
|
19:04
🔗
|
JesseW |
ah, ok |
|
19:04
🔗
|
JesseW |
xmc: do you have access? |
|
19:10
🔗
|
|
scyther has joined #archiveteam |
|
19:38
🔗
|
|
atomotic has quit IRC (Quit: Textual IRC Client: www.textualapp.com) |
|
19:38
🔗
|
|
schbirid has quit IRC (Quit: Leaving) |
|
19:50
🔗
|
|
brayden_ has quit IRC (Read error: Connection reset by peer) |
|
19:50
🔗
|
|
brayden has joined #archiveteam |
|
19:50
🔗
|
|
swebb sets mode: +o brayden |
|
19:51
🔗
|
Atluxity |
arkiver: ack |
|
20:00
🔗
|
Start |
it seems that rather than having 1 rss feed cbc has a whole bunch: http://www.cbc.ca/rss/ |
|
20:01
🔗
|
|
maseck has quit IRC (Remote host closed the connection) |
|
20:04
🔗
|
godane |
joepie91: i'm saving those bbc news urls |
|
20:05
🔗
|
godane |
example: http://news.bbc.co.uk/2/hi/630961.stm |
|
20:05
🔗
|
godane |
you can just brute force |
|
20:11
🔗
|
|
schbirid has joined #archiveteam |
|
20:19
🔗
|
|
JesseW has quit IRC (Leaving.) |
|
20:25
🔗
|
|
alberto has quit IRC (Ping timeout: 250 seconds) |
|
20:25
🔗
|
|
JesseW has joined #archiveteam |
|
20:34
🔗
|
|
Ghost_of_ has quit IRC (Quit: Leaving) |
|
20:38
🔗
|
|
JesseW has quit IRC (Leaving.) |
|
20:41
🔗
|
|
maseck has joined #archiveteam |
|
21:02
🔗
|
|
xXx_ndidd has joined #archiveteam |
|
21:08
🔗
|
|
Coderjoe has quit IRC (Read error: Connection reset by peer) |
|
21:09
🔗
|
|
ndiddy has quit IRC (Read error: Operation timed out) |
|
21:14
🔗
|
|
Coderjoe has joined #archiveteam |
|
21:33
🔗
|
|
schbirid has quit IRC (Quit: Leaving) |
|
21:50
🔗
|
|
Ghost_of_ has joined #archiveteam |
|
21:55
🔗
|
Atluxity |
arkiver: updated |
|
21:56
🔗
|
|
JesseW has joined #archiveteam |
|
22:26
🔗
|
|
JesseW has quit IRC (Leaving.) |
|
22:30
🔗
|
|
scyther has quit IRC (Read error: Connection reset by peer) |
|
22:44
🔗
|
|
closure has joined #archiveteam |
|
22:45
🔗
|
|
nertzy has joined #archiveteam |
|
23:05
🔗
|
|
err3 has joined #archiveteam |
|
23:05
🔗
|
err3 |
hello |
|
23:07
🔗
|
|
nertzy has quit IRC (Quit: This computer has gone to sleep) |
|
23:10
🔗
|
Atluxity |
GREETINGS! |
|
23:10
🔗
|
err3 |
I've got an idea for archiving project |
|
23:10
🔗
|
err3 |
just in case anyone likes it |
|
23:11
🔗
|
Atluxity |
lay it on us |
|
23:11
🔗
|
err3 |
there's some good forums where people post math problems and solutions, e.g. artofproblemsolving |
|
23:11
🔗
|
err3 |
just went to it after a long time and it had totally changed, I got a shock that maybe they removed all of the old stuff - apparently they haven't |
|
23:11
🔗
|
err3 |
but it might be good to somehow make an archive of it |
|
23:12
🔗
|
err3 |
I'm not sure if it would need some special scripting to do |
|
23:12
🔗
|
Atluxity |
got some urls? |
|
23:14
🔗
|
err3 |
https://www.artofproblemsolving.com/community is it now |
|
23:14
🔗
|
err3 |
https://web.archive.org/web/20130201150755/http://www.artofproblemsolving.com/Forum/index.php used to look like this |
|
23:15
🔗
|
err3 |
let me get a better one |
|
23:15
🔗
|
Atluxity |
wonder how big these sites are... probably not too big |
|
23:16
🔗
|
err3 |
they might not be too large, the important thing is the text (although sometimes equations get rendered into images) |
|
23:16
🔗
|
err3 |
https://web.archive.org/web/20130510031806/http://www.artofproblemsolving.com/Forum/index.php |
|
23:16
🔗
|
err3 |
thats how i remember it |
|
23:17
🔗
|
err3 |
https://web.archive.org/web/20140331091424/http://www.artofproblemsolving.com/Forum/viewforum.php?f=56 |
|
23:17
🔗
|
err3 |
i think a lot of the posts are not archived |
|
23:29
🔗
|
|
RichardG_ has joined #archiveteam |
|
23:29
🔗
|
|
RichardG has quit IRC (Read error: Connection reset by peer) |
|
23:35
🔗
|
|
Ghost_of_ has quit IRC (Quit: Leaving) |
|
23:42
🔗
|
|
WinterFox has joined #archiveteam |
|
23:44
🔗
|
HCross |
For the newsgrab, when you submit, please check the file naming. |
|
23:48
🔗
|
HCross |
its web__foo_bar_com.py |
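A small sketch for checking a file name before submitting. It assumes the convention is `web__` (two underscores) followed by the domain with dots replaced by underscores, which is inferred from the `web__foo_bar_com.py` pattern above and from `web__bbc_com.py`; both helper names are hypothetical.

```python
import re

# Assumed convention: "web__" + domain with dots turned into underscores + ".py".
NAME_RE = re.compile(r"^web__[a-z0-9]+(?:_[a-z0-9]+)+\.py$")


def looks_like_service_name(filename):
    """Return True if `filename` fits the assumed web__<domain>.py pattern."""
    return NAME_RE.match(filename) is not None


def suggested_name(domain):
    """Build a file name for `domain` under the assumed convention."""
    return "web__" + domain.lower().replace(".", "_") + ".py"
```

Under this assumption, a name with a single underscore after `web`, such as the `web_nrk_no.py` linked earlier, would be flagged.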